MiniMax-M2.5 (230B total/10B active), Ming-flash-omni-2.0 (100B/6B active), and the Puzzle MoE optimization work all target high capability with only a fraction of parameters active at inference, with M2.5 hitting 80.2% on SWE-Bench Verified.
MiniMax-M2.5 at 230B/10B active hits 80.2% on SWE-Bench Verified; the next bottleneck is efficient routing and expert load balancing at these sparsity ratios for local deployment (see the routing sketch after the sources).
5 sources
- reddit MiniMaxAI MiniMax-M2.5 has 230b parameters and 10b... 351pts
- reddit Minimax M2.5 Officially Out 503pts
- huggingface MiniMaxAI/MiniMax-M2.5
- reddit Ming-flash-omni-2.0: 100B MoE (6B active) omni-modal... 195pts
- arxiv Extending Puzzle for Mixture-of-Experts Reasoning Models...
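To make the routing and load-balancing bottleneck concrete, here is a minimal sketch of top-k expert routing with a Switch-Transformer-style balance loss; the names and shapes are illustrative assumptions, not MiniMax-M2.5's actual router.

```python
# Hypothetical sketch: top-k MoE routing with a Switch-style balance loss.
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=2):
    """hidden: (tokens, d_model); router_weight: (d_model, n_experts)."""
    logits = hidden @ router_weight                 # (tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)    # each token picks k experts

    # Balance loss: penalizes experts that receive both a large share of
    # tokens (load) and a large mean gate probability (importance).
    n_experts = router_weight.shape[1]
    top1 = F.one_hot(topk_idx[:, 0], n_experts).float()
    load = top1.mean(dim=0)          # observed fraction of tokens per expert
    importance = probs.mean(dim=0)   # mean router probability per expert
    balance_loss = n_experts * (load * importance).sum()
    return topk_idx, topk_probs, balance_loss
```

At 230B total/10B active, roughly 1 in 23 parameters runs per token, so a skewed router that concentrates traffic on a few experts erases most of the sparsity win.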
MonarchRT replaces quadratic 3D self-attention with structured Monarch matrices for real-time video diffusion transformers; independently, a practitioner reports VACE running at 20-30fps on RTX 4090/5090 for autoregressive video generation.
VACE at 20-30fps on a 4090/5090 and MonarchRT targeting the quadratic attention cost both confirm attention is the primary bottleneck for real-time video diffusion; the next constraint is maintaining temporal coherence at these speeds for longer sequences (a Monarch factorization sketch follows below).
2 sources
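For reference, a Monarch-structured matrix multiplies in O(n·sqrt(n)) instead of O(n^2) by composing two block-diagonal factors with a fixed permutation; the sketch below shows the generic factorization, not MonarchRT's exact parameterization.

```python
# Generic Monarch matrix-vector product (illustrative, assumed shapes).
import torch

def monarch_matvec(x, L, R):
    """x: (n,) with n = b*b; L, R: (b, b, b) stacks of b x b blocks.
    Two block-diagonal matmuls plus a permutation: O(n * sqrt(n)) total,
    versus O(n^2) for a dense matrix of the same size."""
    b = L.shape[0]
    x = x.view(b, b)                      # split x into b chunks of size b
    x = torch.einsum('bij,bj->bi', R, x)  # block-diagonal right factor
    x = x.t().contiguous()                # the fixed permutation (transpose)
    x = torch.einsum('bij,bj->bi', L, x)  # block-diagonal left factor
    return x.reshape(-1)
```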
Gemini 3 Deep Think (1066 HN points), Ring-1T-2.5 claiming SOTA deep thinking, and GLM-5 trained entirely on Huawei chips all released within 24 hours — a burst of competing frontier reasoning models from Google, inclusionAI, and Zhipu.
Three frontier reasoning models dropped in one day across Google, inclusionAI, and Zhipu (GLM-5 on Huawei chips) — next bottleneck is whether deep thinking models can be efficiently served given their extended token generation, which Puzzle MoE optimization already targets.
5 sources
- hn Gemini 3 Deep Think 1066pts
- hn Gemini 3 Deep Think: Advancing science, research and engineering 29pts
- reddit Ring-1T-2.5 released by inclusionAI 174pts
- reddit Unsloth just unleashed Glm 5! GGUF NOW! 297pts
- hn GLM-5 was trained entirely on Huawei chips 20pts
DeepSight provides an all-in-one LM safety toolkit, a separate paper detects jailbreaks from internal LLM representations, and Mozilla evaluates multilingual guardrails: three independent efforts to systematize LLM safety evaluation and defense (a probe sketch follows below).
3 sources
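Representation-based jailbreak detection typically reduces to a probe over hidden states; below is a minimal linear-probe sketch under that assumption, since the paper's exact method and layer choice aren't given here.

```python
# Hypothetical linear probe over pooled hidden states for jailbreak detection.
import torch
import torch.nn.functional as F

def train_probe(hidden_states, labels, epochs=200, lr=1e-2):
    """hidden_states: (n, d_model) pooled activations from a chosen layer;
    labels: (n,) float tensor, 1.0 = jailbreak attempt, 0.0 = benign."""
    d = hidden_states.shape[1]
    w = torch.zeros(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(epochs):
        loss = F.binary_cross_entropy_with_logits(hidden_states @ w + b, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach(), b.detach()  # score a new prompt via h @ w + b
```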
DeepGen 1.0 targets image generation and editing below 10B parameters to reduce training/deployment cost; FireRed-Image-Edit from Rednote claims open-source SOTA for image editing — both push multimodal editing into smaller, more deployable models.
2 sources
- paperswithcode DeepGen 1.0: A Lightweight Unified Multimodal Model for...
- reddit New SOTA(?) Open Source Image Editing Model from Rednote? 223pts
Composition-RL addresses uninformative examples in RLVR prompt sets via composable verifiable prompts; a detection method identifies RLVR training data via structural convergence; and Native Reasoning Models propose training on unverifiable data to escape RLVR's reliance on verifiable-only rewards.
Three papers independently identify RLVR's dependence on verifiable-only data as a fundamental constraint; the next bottleneck is whether reward signals from unverifiable domains can train reasoning without reward hacking (a minimal verifiable-reward sketch follows below).
3 sources
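The verifiable-only constraint is easiest to see in code: an RLVR reward is typically an exact-match check against ground truth, as in the sketch below (the boxed-answer extraction rule is an assumption). No comparable checker exists for open-ended outputs like essay quality, which is the gap these papers attack.

```python
# Minimal verifiable reward in the RLVR sense (illustrative extraction rule).
import re

def verifiable_reward(completion: str, gold: str) -> float:
    """Binary reward: 1.0 iff the last \\boxed{...} answer matches gold."""
    answers = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == gold.strip() else 0.0
```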
A blog post with 810 HN points shows 15 LLMs improved at coding by changing only the harness (not the model), while GPT-5.3-Codex-Spark launches via a Cerebras partnership for real-time coding; both highlight that evaluation scaffolding and serving infrastructure matter as much as model weights.
The 810-point HN post demonstrates coding benchmark gains from harness changes alone; the next bottleneck is standardizing evaluation harnesses so benchmark scores reflect model capability rather than scaffolding quality (a feedback-loop sketch follows below).
2 sources
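The kind of scaffolding change that moves scores without touching weights is, for example, a test-feedback retry loop. A minimal sketch follows; `generate` is a hypothetical stand-in for any LLM call and is not taken from the blog post itself.

```python
# Hypothetical test-feedback harness loop; `generate` is any text-in/text-out
# LLM call, `test_cmd` a shell command that exits 0 when all tests pass.
import subprocess

def harness_loop(generate, task_prompt, test_cmd, max_rounds=3):
    code, feedback = "", ""
    for _ in range(max_rounds):
        code = generate(task_prompt + feedback)
        with open("solution.py", "w") as f:
            f.write(code)
        result = subprocess.run(test_cmd, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code                              # all tests pass
        feedback = ("\n\nTests failed:\n" + result.stdout[-2000:]
                    + result.stderr[-2000:] + "\nFix the code.")
    return code                                      # best effort
```

Whether a benchmark permits such retries is exactly the scaffolding-versus-capability ambiguity the post exposes.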
One paper directly addresses the reference-policy mismatch in DPO that causes distribution drift; P-GenRM proposes personalized generative reward models with test-time user scaling; both target the core DPO limitation that the frozen reference policy diverges from the learned policy during training (the loss sketch below makes the mismatch concrete).
2 sources
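The standard DPO loss shows where the mismatch lives: the reference log-probs come from a policy frozen at initialization, so the log-ratios are measured against an increasingly stale anchor as training moves the learned policy. This is the textbook formulation, not the paper's proposed fix.

```python
# Standard DPO loss; the frozen reference terms are the mismatch in question.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs: (batch,) sequence log-probs under policy / frozen reference."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```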
MiniCPM-SALA hybridizes sparse and linear attention for efficient long-context modeling; a new library implements multiple linear RNN architectures with accelerated kernels — both target the memory/compute wall of full quadratic attention at long sequences.
MiniCPM-SALA and a multi-architecture linear RNN library both target sub-quadratic long-context modeling; the next bottleneck is whether hybrid approaches can match full attention quality on retrieval-heavy tasks beyond 128K tokens (a linear-attention recurrence sketch follows the sources).
2 sources
- paperswithcode MiniCPM-SALA: Hybridizing Sparse and Linear Attention...
- reddit [P] A library for linear RNNs 17pts
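The core of any linear-attention or linear-RNN layer is a fixed-size recurrent state in place of a KV cache that grows with sequence length. A generic single-step sketch follows; it is unnormalized and not MiniCPM-SALA's specific hybrid.

```python
# Generic (unnormalized) linear-attention recurrence, one token at a time.
import torch

def linear_attention_step(state, k, v, q):
    """state: (d, d) running sum of outer products k v^T; k, v, q: (d,)."""
    state = state + torch.outer(k, v)   # O(d^2) memory, constant in seq length
    out = state.t() @ q                 # read-out: sum_i (k_i . q) * v_i
    return state, out
```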
An open-source multi-agent orchestrator coordinates 20+ Claude Code agents for long-running tasks, an HN discussion covers agent orchestrators for coding, and Moltis offers a self-extending AI assistant with memory; all address single-agent failure on sustained complex work (a generic fan-out sketch follows below).
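The pattern common to these tools is decompose-then-fan-out. A generic sketch under that assumption follows; `run_agent` is a hypothetical stand-in for spawning one coding-agent session on a subtask.

```python
# Generic fan-out orchestration pattern (hypothetical run_agent callable).
from concurrent.futures import ThreadPoolExecutor

def orchestrate(run_agent, subtasks, max_workers=8):
    """Run worker agents on independent subtasks in parallel, collect results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, subtasks))
```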