2026-02-12 Signals
W65 Sparse MoE models with low active parameter counts

MiniMax-M2.5 (230B total/10B active), Ming-flash-omni-2.0 (100B/6B active), and Puzzle MoE optimization all target high capability with only a fraction of parameters active at inference, with M2.5 hitting 80.2% on SWE-Bench Verified.

Convergence 15/35 · Implementation 20/30 · Engagement 15/15 · Significance 15/20

MiniMax-M2.5 at 230B/10B active hits 80.2% on SWE-Bench Verified — next bottleneck is efficient routing and expert load balancing at these sparsity ratios for local deployment.

5 sources
W62 Real-time video generation via efficient attention

MonarchRT replaces quadratic 3D self-attention with structured Monarch matrices for real-time video diffusion transformers; independently, a practitioner reports VACE running at 20-30fps on RTX 4090/5090 for autoregressive video generation.

Convergence 15/35 · Implementation 25/30 · Engagement 8/15 · Significance 14/20

VACE at 20-30fps on 4090/5090 and MonarchRT targeting quadratic attention cost both confirm attention is the primary bottleneck for real-time video diffusion — next constraint is maintaining temporal coherence at these speeds for longer sequences.

2 sources
W61 Frontier reasoning models (Gemini 3, GLM-5, Ring-1T)

Gemini 3 Deep Think (1066 HN points), Ring-1T-2.5 claiming SOTA deep thinking, and GLM-5 trained entirely on Huawei chips all released within 24 hours — a burst of competing frontier reasoning models from Google, inclusionAI, and Zhipu.

Convergence 15/35 · Implementation 15/30 · Engagement 15/15 · Significance 16/20

Three frontier reasoning models dropped in one day across Google, inclusionAI, and Zhipu (GLM-5 on Huawei chips) — next bottleneck is whether deep thinking models can be efficiently served given their extended token generation, which Puzzle MoE optimization already targets.

5 sources
2026-02-12 Tracking
W55 LLM safety toolkits and jailbreak detection

DeepSight provides an all-in-one LM safety toolkit, a separate paper detects jailbreaks from internal LLM representations, and Mozilla evaluates multilingual guardrails — three independent efforts to systematize LLM safety evaluation and defense.

Convergence 25/35 · Implementation 20/30 · Engagement 1/15 · Significance 9/20
3 sources
W51 Lightweight unified multimodal generation and editing models

DeepGen 1.0 targets image generation and editing below 10B parameters to reduce training/deployment cost; FireRed-Image-Edit from Rednote claims open-source SOTA for image editing — both push multimodal editing into smaller, more deployable models.

Convergence 10/35 · Implementation 20/30 · Engagement 11/15 · Significance 10/20
2 sources
W50 RLVR data quality and verifiable reward composition

Composition-RL addresses uninformative examples in RLVR prompt sets via composable verifiable prompts, a detection method identifies RLVR training data via structural convergence, and Native Reasoning Models propose training on unverifiable data to escape RLVR's reliance on verifiable-only rewards.

Convergence 15/35 · Implementation 20/30 · Engagement 3/15 · Significance 12/20

Three papers independently identify RLVR's dependence on verifiable-only data as a fundamental constraint — next bottleneck is whether reward signals from unverifiable domains can train reasoning without reward hacking.

3 sources
W44 Coding harness and evaluation methodology improvements

A blog post with 810 HN points shows 15 LLMs improving at coding when only the harness is changed (not the model), while GPT-5.3-Codex-Spark launches via a Cerebras partnership for real-time coding — both highlight that evaluation scaffolding and serving infrastructure matter as much as model weights.

Convergence 15/35 · Implementation 0/30 · Engagement 15/15 · Significance 14/20

810-point HN post demonstrates coding benchmark gains from harness changes alone — next bottleneck is standardizing evaluation harnesses so benchmark scores reflect model capability rather than scaffolding quality.

2 sources
W44 DPO reference policy mismatch mitigation

One paper directly addresses the reference-policy mismatch in DPO that causes distribution drift; P-GenRM proposes personalized generative reward models with test-time user scaling — both target the core DPO limitation that the reference policy diverges from the learned policy during training.

Convergence 15/35 · Implementation 20/30 · Engagement 0/15 · Significance 9/20
2 sources
W40 Hybrid sparse and linear attention for long context

MiniCPM-SALA hybridizes sparse and linear attention for efficient long-context modeling; a new library implements multiple linear RNN architectures with accelerated kernels — both target the memory/compute wall of full quadratic attention at long sequences.

Convergence 10/35 · Implementation 20/30 · Engagement 0/15 · Significance 10/20

MiniCPM-SALA and a multi-architecture linear RNN library both target sub-quadratic long-context modeling — next bottleneck is whether hybrid approaches can match full attention quality on retrieval-heavy tasks beyond 128K tokens.

2 sources
W39 Multi-agent LLM orchestration for complex tasks

An open-source multi-agent orchestrator coordinates 20+ Claude Code agents for long-running tasks, an HN discussion covers agent orchestrators for coding, and Moltis offers a self-extending AI assistant with memory — all address single-agent failure on sustained complex work.

Convergence 7/35 · Implementation 15/30 · Engagement 8/15 · Significance 9/20
3 sources
FAQ
What is HiddenState?

A daily briefing that scrapes 9 source types across the ML ecosystem, filters out the noise, and clusters what remains by technical mechanism — not topic.

Most ML news is recycled press releases. HiddenState watches for convergence: when multiple independent sources start working on the same bottleneck, something real is happening. Everything else is noise.

The top 10 mechanisms are ranked by W-index and split into Signals (strongest evidence) and Tracking (early signals worth watching) at the largest natural score gap.

What is W-index?

A 0–100 score measuring signal strength. Higher = more evidence that something real is happening.

Component      | Max | What it measures
Convergence    | 35  | How many independent sources report this. Single source = 0 — unless it links to working code, which counts as a second data point.
Implementation | 30  | Evidence of working code. GitHub repo = 30. HuggingFace model = 20. Paper only = 0.
Engagement     | 15  | Upvotes, stars, points. Capped low so hype can't inflate the score.
Significance   | 20  | Clustering model's assessment of technical importance.

W60+ strong — W25-59 moderate — W<25 early/weak

Code beats vaporware. A shipped GitHub project with 3 sources will always outscore a hyped paper with 500 Reddit upvotes but no implementation.
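
To make the arithmetic concrete, here is a minimal sketch of how the four capped components might sum into a W-index and map onto the bands above. The component names and thresholds come from the table; the function names and data layout are assumptions, not the actual pipeline.

```python
# Minimal sketch of W-index scoring under the component caps listed above.
# Component names mirror the table; the functions themselves are illustrative.
COMPONENT_CAPS = {
    "convergence": 35,     # independent sources reporting the same mechanism
    "implementation": 30,  # working code: GitHub repo > HF model > paper only
    "engagement": 15,      # upvotes/stars/points, deliberately capped low
    "significance": 20,    # clustering model's assessment of importance
}

def w_index(scores: dict) -> int:
    """Sum capped component scores into a 0-100 W-index."""
    return sum(min(scores.get(name, 0), cap) for name, cap in COMPONENT_CAPS.items())

def band(w: int) -> str:
    """Map a W-index onto the strong / moderate / early-weak bands above."""
    if w >= 60:
        return "strong"
    if w >= 25:
        return "moderate"
    return "early/weak"

# Example: the W65 sparse-MoE cluster at the top of today's briefing.
moe = {"convergence": 15, "implementation": 20, "engagement": 15, "significance": 15}
assert w_index(moe) == 65 and band(w_index(moe)) == "strong"
```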

Who are our sources?
Source         | What we pull
arxiv          | Preprints from cs.LG, cs.CL, cs.AI, cs.CV, stat.ML — the raw research firehose
Reddit         | r/MachineLearning, r/LocalLLaMA, r/StableDiffusion, r/MLOps — practitioner signal
GitHub         | Trending ML repos with 50+ stars — implementation evidence
Hacker News    | ML-related posts with 15+ points — cross-domain attention
HuggingFace    | Trending models + watched quantizers (bartowski, MaziyarPanahi, LoneStriker)
OpenReview     | TMLR + NeurIPS workshops — peer-reviewed & bleeding-edge
Twitter        | 9 curated accounts (akhaliq, karpathy, srush, fchollet, etc.)
Papers w/ Code | Trending papers with implementations — community-vetted research
RSS Blogs      | Lilian Weng, Chip Huyen, Eugene Yan, Simon Willison, Interconnects, Latent Space, Netflix Tech + PyTorch & HF blogs

Items that appear across multiple sources score higher. Single-source items start at zero convergence.
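
Read as a scrape configuration, the list above might look roughly like the sketch below. Only the source names and thresholds come from the table; every key name and the overall structure are assumptions, not the real pipeline.

```python
# Illustrative scrape configuration; key names and structure are assumptions,
# only the source names and thresholds are taken from the table above.
SOURCES = {
    "arxiv": {"categories": ["cs.LG", "cs.CL", "cs.AI", "cs.CV", "stat.ML"]},
    "reddit": {"subreddits": ["MachineLearning", "LocalLLaMA", "StableDiffusion", "MLOps"]},
    "github": {"min_stars": 50},                     # trending ML repos only
    "hacker_news": {"min_points": 15},               # ML-related posts only
    "huggingface": {"watched_quantizers": ["bartowski", "MaziyarPanahi", "LoneStriker"]},
    "openreview": {"venues": ["TMLR", "NeurIPS workshops"]},
    "twitter": {"curated_accounts": ["akhaliq", "karpathy", "srush", "fchollet"]},
    "papers_with_code": {"trending_with_code": True},
    "rss_blogs": {"feeds": ["Lilian Weng", "Chip Huyen", "Eugene Yan", "Simon Willison"]},
}
# Items seen in several of these streams score higher on convergence;
# single-source items start at zero.
```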

Signals vs Tracking — what's the difference?

Both sections show real signals. Up to 10 mechanisms are sorted by W-index and split at the largest natural score gap — Signals are above the gap, Tracking below. The split point changes daily based on the data; tied scores always land on the same side.
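
A minimal sketch of one way this split could work, assuming a plain largest-gap rule. The function, its name, and the tie-breaking choice (cut at the highest of several equally large gaps) are illustrative, not the production logic.

```python
# Sketch of a largest-gap split over W-index scores; structure is an assumption.
def split_signals_tracking(w_scores):
    """Sort W-index scores, keep at most 10, and cut at the largest gap
    between adjacent scores; tied scores always stay on the same side."""
    ranked = sorted(w_scores, reverse=True)[:10]
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    if not gaps or max(gaps) == 0:
        return ranked, []                    # no natural gap: everything is a Signal
    cut = gaps.index(max(gaps)) + 1          # first position below the largest gap
    return ranked[:cut], ranked[cut:]        # (Signals, Tracking)

# Today's ten W-index values reproduce the split in this briefing:
signals, tracking = split_signals_tracking([65, 62, 61, 55, 51, 50, 44, 44, 40, 39])
assert signals == [65, 62, 61]               # the gap of 6 between 61 and 55
```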

Tracking does not mean bad, unimportant, or wrong. It usually means a signal has fewer independent sources so far, or lacks public code — things that can change overnight. Some of the most consequential developments start in Tracking before the rest of the ecosystem catches up.

Likewise, a high W-index does not mean research is good, correct, or worth adopting. W-index measures visibility and convergence across sources, not quality. A flawed paper that gets widely discussed will score higher than a brilliant one nobody has noticed yet.

HiddenState is a detection tool, not an endorsement. It tells you where activity is clustering — what you do with that is up to you. Nothing here should be read as a recommendation, ranking of merit, or judgement on any researcher's work.

What does noise rejection mean?

Of all items collected, only 10 make it to the final briefing. The rejection rate is the percentage that got cut.

Filtering happens in three stages:

Stage      | What gets cut
Pre-filter | Short abstracts, low-engagement posts, duplicates across sources
Clustering | Items that don't converge on a shared mechanism with other items
Ranking    | Clusters below the top 10 by W-index

A 99% rejection rate means 99 out of 100 items were noise. That's the point — most ML news doesn't matter on any given day.
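
For the arithmetic, a tiny sketch; the item count is hypothetical, and only the "10 items published" figure comes from above.

```python
# Hypothetical counts; only the default of 10 published items comes from above.
def rejection_rate(collected: int, published: int = 10) -> float:
    """Percentage of collected items cut before the final briefing."""
    return 100.0 * (collected - published) / collected

print(f"{rejection_rate(1000):.0f}% rejected")   # 1,000 in, 10 out -> 99% rejected
```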

Privacy
Data collection

None. HiddenState collects no personal data, no email addresses, no IP logs, no usage analytics, and no telemetry of any kind.

Cookies & tracking

Zero cookies. No first-party, no third-party, no session cookies, no tracking pixels.

The only client-side storage is localStorage for your theme preference (dark/light). This never leaves your browser and contains no identifying information.

External requests

Pages load zero external scripts, fonts, stylesheets, or analytics. Everything is self-contained. The only outbound link is to Ko-fi if you choose to click it.

Data sources

HiddenState monitors 9 distinct public data streams (ArXiv, GitHub, Reddit, etc.) to detect cross-platform convergence. We do not use private user data; we only analyze what the community has already published.