AdaLLM implements NVFP4-first inference on RTX 4090 with FP8 KV cache, FireRed-Image-Edit ships FP8/NVFP4 quants, and NVIDIA confirms FP4 pre-training for Nemotron3 — FP4 is moving from inference hack to first-class training format.
NVIDIA confirming FP4 pre-training for Nemotron3 (H1 2026) plus community NVFP4 inference on Ada Lovelace GPUs — next bottleneck is FP4 KV-cache accuracy loss at long contexts, currently worked around with FP8 KV (block-quantization sketch after the sources).
3 sources
- reddit [Release] AdaLLM: NVFP4-first inference on RTX 4090 (FP8... 32pts
- reddit Quantz for RedFire-Image-Edit 1.0 FP8 / NVFP4 52pts
- reddit Nemotron3 Super/Ultra: FP4 pre-training, H1 2026... 70pts
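For context on what the format trades away, here is a minimal numpy sketch of the block-scaled FP4 idea behind NVFP4, assuming 16-element blocks and the standard E2M1 value grid; real NVFP4 additionally stores the per-block scales in FP8 E4M3 under a per-tensor FP32 scale, which is omitted here.

```python
import numpy as np

# FP4 E2M1 magnitudes: 2 exponent bits, 1 mantissa bit, max representable 6.0.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocks(x, block=16):
    """Quantize a 1-D tensor to FP4 with one scale per `block` elements."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]  # block max -> 6.0
    scales = np.where(scales == 0, 1.0, scales)
    scaled = x / scales
    # Snap |value| to the nearest representable FP4 magnitude, then restore sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(scaled) * E2M1_GRID[idx], scales

def dequantize_fp4_blocks(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_fp4_blocks(w)
err = np.abs(dequantize_fp4_blocks(q, s) - w)
print(f"mean abs error: {err.mean():.4f}, max abs error: {err.max():.4f}")
```

Running it on Gaussian weights makes the trade-off concrete: the per-block scale keeps outliers representable, but every value inside a block lands on only eight magnitude levels, which is exactly the kind of rounding pressure that hurts FP4 KV caches at long contexts.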
Qwen3-TTS.cpp achieves 4x speedup over PyTorch with ~2GB memory for a 0.6B model, and KaniTTS2 (400M params) runs in 3GB VRAM with voice cloning — two independent open-source TTS implementations targeting real-time conversational use on consumer GPUs.
Qwen3-TTS.cpp hits 4x speedup via GGML and KaniTTS2 runs at 400M/3GB VRAM — next bottleneck is streaming first-token latency for conversational turn-taking, not yet benchmarked in either release (latency sketch after the sources).
3 sources
- reddit Qwen3-TTS.cpp 75pts
- reddit KaniTTS2 - open-source 400M TTS model with voice... 99pts
- reddit KaniTTS2, our text-to-speech model with frame-level... 32pts
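Since neither release reports it, here is a sketch of the two numbers that matter for conversational turn-taking: time to first audio chunk and real-time factor. `synthesize_stream` and the 24 kHz sample rate are placeholders for whichever engine is being measured.

```python
import time

SAMPLE_RATE = 24_000  # assumption; use the engine's actual output rate

def synthesize_stream(text: str):
    """Stand-in generator yielding audio chunks (lists of samples)."""
    for _ in range(10):
        time.sleep(0.02)      # pretend inference work
        yield [0.0] * 2_400   # 100 ms of audio per chunk

def benchmark(text: str):
    start = time.perf_counter()
    first_chunk_at = None
    samples = 0
    for chunk in synthesize_stream(text):
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start  # time to first audio
        samples += len(chunk)
    wall = time.perf_counter() - start
    audio_seconds = samples / SAMPLE_RATE
    print(f"time to first audio: {first_chunk_at * 1000:.0f} ms")
    print(f"RTF: {wall / audio_seconds:.2f} (< 1.0 means faster than real time)")

benchmark("Hello, how can I help you today?")
```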
Heretic 1.2 introduces Magnitude-Preserving Orthogonal Ablation for derestriction, with 70% lower VRAM usage via quantization; the release thread has 306 upvotes and reports 1000+ users in 3 months (ablation sketch after the source).
1 source
- reddit Heretic 1.2 released: 70% lower VRAM usage with... 306pts
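Heretic's exact Magnitude-Preserving Orthogonal Ablation isn't spelled out above; the sketch below shows plain directional ablation with row norms restored afterwards, on the assumption that "magnitude-preserving" refers to keeping weight magnitudes intact after the orthogonal projection.

```python
import numpy as np

def ablate_direction(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of W's output space, then restore row norms.

    Assumption-laden sketch: not Heretic's documented algorithm.
    """
    r = direction / np.linalg.norm(direction)     # unit behavior direction
    W_proj = W - np.outer(r, r) @ W               # orthogonal ablation
    old = np.linalg.norm(W, axis=1, keepdims=True)
    new = np.linalg.norm(W_proj, axis=1, keepdims=True)
    return W_proj * (old / np.maximum(new, 1e-8))  # keep each row's magnitude

# toy usage
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W2 = ablate_direction(W, r)
print(np.allclose(np.linalg.norm(W, axis=1), np.linalg.norm(W2, axis=1)))  # True
```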
A 169-upvote thread collects local vibe-coding experiences across models, while a separate finding reveals Claude Code reprocesses full prompts every request when used with local models — local coding agents work but have prompt-caching and template friction.
Claude Code's full prompt reprocessing on every request with local models (due to x-anthropic cache headers) wastes compute — next step is local inference servers implementing prompt-cache-aware session management (prefix-reuse sketch after the sources).
2 sources
- reddit local vibe coding 169pts
- reddit Claude Code with Local Models: Full Prompt Reprocessing... 77pts
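A minimal sketch of what prompt-cache-aware session management could look like on the server side, assuming the engine can reuse KV-cache state for a matching token prefix (as llama.cpp's `cache_prompt` does); the session IDs and planning API here are illustrative.

```python
class PromptCache:
    """Track the previously seen prompt per session so only the new suffix
    needs a forward pass; the engine is assumed to keep KV state for the
    reused prefix."""

    def __init__(self):
        self.sessions: dict[str, list[int]] = {}  # session id -> cached tokens

    def plan(self, session_id: str, tokens: list[int]) -> tuple[int, list[int]]:
        """Return (# reusable prefix tokens, tokens that still need computing)."""
        cached = self.sessions.get(session_id, [])
        reuse = 0
        for a, b in zip(cached, tokens):
            if a != b:
                break
            reuse += 1
        self.sessions[session_id] = tokens
        return reuse, tokens[reuse:]

cache = PromptCache()
print(cache.plan("claude-code", [1, 2, 3, 4]))        # (0, [1, 2, 3, 4]) cold start
print(cache.plan("claude-code", [1, 2, 3, 4, 5, 6]))  # (4, [5, 6]) only suffix recomputed
```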
ggerganov's PR #19375 optimizes Qwen3-Next graph for faster t/s, a JSON parser fix addresses OpenCode compatibility, and users compare the 60B distilled model to the full Qwen coder — active convergence on making Qwen3-Next usable in llama.cpp.
Qwen3-Next graph optimization PR is in progress with multiple companion fixes — remaining bottleneck is chat template incompatibilities breaking tool-calling and structured output (validation sketch below).
3 sources
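A quick way to reproduce that breakage locally is to validate the model's raw tool-call output directly; the field names below are illustrative, not OpenCode's actual schema.

```python
import json

REQUIRED_FIELDS = {"name", "arguments"}  # assumed client-side schema

def check_tool_call(raw: str) -> bool:
    """Return True if the raw model output parses and carries the expected fields."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        print(f"not valid JSON: {exc}")
        return False
    if not isinstance(call, dict) or REQUIRED_FIELDS - call.keys():
        print(f"unexpected shape or missing fields in: {call!r}")
        return False
    return True

print(check_tool_call('{"name": "read_file", "arguments": {"path": "main.py"}}'))  # True
print(check_tool_call('{"name": "read_file", "arguments": {"path": "main.py"}'))   # truncated JSON
```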
FireRed-Image-Edit 1.0 model weights released on HuggingFace with 236 upvotes and 61 comments, indicating strong community interest in open image editing models.
1 source
- reddit FireRed-Image-Edit-1.0 model weights are released 236pts
One user reports unsolved anatomical deformities in Flux 2 Klein 9B distilled img2img, while another claims to have found specific layer settings that preserve original details — community is reverse-engineering which transformer layers control fidelity vs. editability (layer-probe sketch below).
2 sources
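One hedged way to run that kind of layer probe yourself is to scale the outputs of chosen blocks with PyTorch forward hooks and compare img2img results; the module names and scaling factor below are placeholders, not settings reported in the thread.

```python
import torch

def scale_blocks(model: torch.nn.Module, block_names: list[str], factor: float):
    """Register hooks that multiply the listed modules' outputs by `factor`."""
    handles = []

    def make_hook(f):
        def hook(_module, _inputs, output):
            if isinstance(output, tuple):
                return tuple(o * f if torch.is_tensor(o) else o for o in output)
            return output * f
        return hook

    for name, module in model.named_modules():
        if name in block_names:
            handles.append(module.register_forward_hook(make_hook(factor)))
    return handles  # call h.remove() on each handle to restore the model

# toy usage on a stand-in model; real block paths depend on the Flux pipeline
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
handles = scale_blocks(model, ["1"], factor=0.5)  # dampen the second "block"
print(model(torch.ones(1, 4)))
for h in handles:
    h.remove()
```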
Users benchmark MiniMax M2.5 on dual RTX 6000 Pros, discuss 4-bit GGUF quant options for 128GB RAM + 16GB VRAM systems, and share usage experiences — community is actively figuring out optimal quant/hardware configs for this MoE model (memory sketch after the sources).
3 sources
- reddit MiniMax M2.5 Performance Testing on dual RTX 6000 Pros 15pts
- reddit MiniMax M2.5 has been very patient with my dumb ass 28pts
- reddit MiniMax M2.5 - 4-Bit GGUF Options 41pts
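The back-of-the-envelope arithmetic behind those quant/hardware questions fits in a few lines; the parameter count, bits-per-weight, and KV-cache figures below are assumptions to replace with the numbers for the quant actually being loaded.

```python
def gguf_memory_plan(params_b: float, bits: float = 4.5, vram_gb: float = 16.0,
                     kv_cache_gb: float = 4.0, overhead_gb: float = 2.0):
    """Rough split of a quantized model across GPU VRAM and system RAM."""
    weights_gb = params_b * bits / 8          # billions of params * bytes per param
    gpu_budget = max(vram_gb - kv_cache_gb - overhead_gb, 0)
    offload_frac = min(gpu_budget / weights_gb, 1.0)
    print(f"weights: {weights_gb:.0f} GB at {bits} bpw")
    print(f"fits on GPU: {offload_frac:.0%} of weights; the rest sits in system RAM")

# Example with a hypothetical 230B-parameter MoE at ~4.5 bits per weight.
gguf_memory_plan(params_b=230)
```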
A round-2 benchmark tests 21 small LLMs on tool-calling judgment (60 upvotes, 35 comments) — a systematic evaluation of which sub-30B models can reliably decide when to invoke tools (harness sketch below).
1 source
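A minimal version of that kind of judgment test is easy to self-host against any local endpoint; `ask_model` and the tool-call detector below are stubs, and the two cases are illustrative.

```python
# Each case pairs a prompt with whether a tool call *should* be made.
cases = [
    ("What's 2 + 2?", False),                            # answerable without tools
    ("What's the weather in Berlin right now?", True),   # needs a live lookup
]

def ask_model(prompt: str) -> str:
    """Stub: pretend the model emits a tool call only for the weather query."""
    return '{"tool": "get_weather"}' if "weather" in prompt else "4"

def used_tool(reply: str) -> bool:
    return '"tool"' in reply  # naive detector; real runs parse the model's format

correct = sum(used_tool(ask_model(p)) == should for p, should in cases)
print(f"tool-judgment accuracy: {correct}/{len(cases)}")
```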
Kyutai releases Hibiki-Zero, a 3B-parameter simultaneous speech-to-speech translation model trained with GRPO reinforcement learning and no word-level aligned data (advantage sketch below).
1 source
- reddit Kyutai Releases Hibiki-Zero 24pts
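The group-relative part of GRPO is compact enough to show directly: sample several outputs per input, score them, and use the within-group mean and standard deviation as the baseline instead of a learned value model. The rewards below are made up.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(group)) / std(group): GRPO's group-relative baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

group_rewards = np.array([0.2, 0.9, 0.5, 0.1])  # e.g. translation-quality scores
print(grpo_advantages(group_rewards).round(2))
```

In the full objective these advantages weight a clipped policy-ratio term with a KL penalty; training on sequence-level rewards like these is what removes the need for word-level aligned supervision.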