2026-02-12 Signals
W65 Sparse MoE models with low active parameter counts

MiniMax-M2.5 (230B total/10B active), Ming-flash-omni-2.0 (100B/6B active), and Puzzle MoE optimization all target high capability with only a fraction of parameters active at inference, with M2.5 hitting 80.2% on SWE-Bench Verified.

Convergence 15/35 · Implementation 20/30 · Engagement 15/15 · Significance 15/20

MiniMax-M2.5 at 230B/10B active hits 80.2% on SWE-Bench Verified — next bottleneck is efficient routing and expert load balancing at these sparsity ratios for local deployment.

5 sources
W62 Real-time video generation via efficient attention

MonarchRT replaces quadratic 3D self-attention with structured Monarch matrices for real-time video diffusion transformers; independently, a practitioner reports VACE running at 20-30fps on RTX 4090/5090 for autoregressive video generation.

Convergence 15/35 · Implementation 25/30 · Engagement 8/15 · Significance 14/20

VACE at 20-30fps on 4090/5090 and MonarchRT targeting quadratic attention cost both confirm attention is the primary bottleneck for real-time video diffusion — next constraint is maintaining temporal coherence at these speeds for longer sequences.

2 sources
W61 Frontier reasoning models (Gemini 3, GLM-5, Ring-1T)

Gemini 3 Deep Think (1066 HN points), Ring-1T-2.5 claiming SOTA deep thinking, and GLM-5 trained entirely on Huawei chips all released within 24 hours — a burst of competing frontier reasoning models from Google, inclusionAI, and Zhipu.

Convergence 15/35 · Implementation 15/30 · Engagement 15/15 · Significance 16/20

Three frontier reasoning models dropped in one day across Google, inclusionAI, and Zhipu (GLM-5 on Huawei chips) — next bottleneck is whether deep thinking models can be efficiently served given their extended token generation, which Puzzle MoE optimization already targets.

5 sources
2026-02-12 Tracking
W55 LLM safety toolkits and jailbreak detection

DeepSight provides an all-in-one LM safety toolkit, a separate paper detects jailbreaks from internal LLM representations, and Mozilla evaluates multilingual guardrails — three independent efforts to systematize LLM safety evaluation and defense.

Convergence 25/35 · Implementation 20/30 · Engagement 1/15 · Significance 9/20
3 sources
W51 Lightweight unified multimodal generation and editing models

DeepGen 1.0 targets image generation and editing below 10B parameters to reduce training/deployment cost; FireRed-Image-Edit from Rednote claims open-source SOTA for image editing — both push multimodal editing into smaller, more deployable models.

Convergence 10/35 · Implementation 20/30 · Engagement 11/15 · Significance 10/20
2 sources
W50 RLVR data quality and verifiable reward composition

Composition-RL addresses uninformative examples in RLVR prompt sets via composable verifiable prompts, a detection method identifies RLVR training data via structural convergence, and Native Reasoning Models propose training on unverifiable data to escape RLVR's reliance on verifiable-only rewards.

Convergence 15/35 · Implementation 20/30 · Engagement 3/15 · Significance 12/20

Three papers independently identify RLVR's dependence on verifiable-only data as a fundamental constraint — next bottleneck is whether reward signals from unverifiable domains can train reasoning without reward hacking.

3 sources
W44 Coding harness and evaluation methodology improvements

A blog post with 810 HN points shows 15 LLMs improving at coding when only the harness is changed (not the model), while GPT-5.3-Codex-Spark launches via a Cerebras partnership for real-time coding — both highlight that evaluation scaffolding and serving infrastructure matter as much as model weights.

Convergence 15/35 · Implementation 0/30 · Engagement 15/15 · Significance 14/20

810-point HN post demonstrates coding benchmark gains from harness changes alone — next bottleneck is standardizing evaluation harnesses so benchmark scores reflect model capability rather than scaffolding quality.

2 sources
W44 DPO reference policy mismatch mitigation

One paper directly addresses the reference-policy mismatch in DPO that causes distribution drift; P-GenRM proposes personalized generative reward models with test-time user scaling — both target the core DPO limitation that the reference policy diverges from the learned policy during training.

Convergence 15/35 · Implementation 20/30 · Engagement 0/15 · Significance 9/20
2 sources
W40 Hybrid sparse and linear attention for long context

MiniCPM-SALA hybridizes sparse and linear attention for efficient long-context modeling; a new library implements multiple linear RNN architectures with accelerated kernels — both target the memory/compute wall of full quadratic attention at long sequences.

Convergence 10/35 · Implementation 20/30 · Engagement 0/15 · Significance 10/20

MiniCPM-SALA and a multi-architecture linear RNN library both target sub-quadratic long-context modeling — next bottleneck is whether hybrid approaches can match full attention quality on retrieval-heavy tasks beyond 128K tokens.

2 sources
W39 Multi-agent LLM orchestration for complex tasks

An open-source multi-agent orchestrator coordinates 20+ Claude Code agents for long-running tasks, an HN discussion covers agent orchestrators for coding, and Moltis offers a self-extending AI assistant with memory — all address single-agent failure on sustained complex work.

Convergence 7/35 · Implementation 15/30 · Engagement 8/15 · Significance 9/20
3 sources
FAQ
What is HiddenState?

A daily briefing that scrapes 9 source types across the ML ecosystem, filters out the noise, and clusters what remains by technical mechanism — not topic.

Most ML news is recycled press releases. HiddenState watches for convergence: when multiple independent sources start working on the same bottleneck, something real is happening. Everything else is noise.

The top 10 mechanisms are ranked by W-index and split into Signals (strongest evidence) and Tracking (early signals worth watching) at the largest natural score gap.

What is W-index?

A 0–100 score measuring signal strength. Higher = more evidence that something real is happening.

Component      | Max | What it measures
Convergence    | 35  | How many independent sources report this. Single source = 0 — unless it links to working code, which counts as a second data point.
Implementation | 30  | Evidence of working code. GitHub repo = 30. HuggingFace model = 20. Paper only = 0.
Engagement     | 15  | Upvotes, stars, points. Capped low so hype can't inflate the score.
Significance   | 20  | Clustering model's assessment of technical importance.

W60+ strong — W25-59 moderate — W<25 early/weak

Code beats vaporware. A shipped GitHub project with 3 sources will always outscore a hyped paper with 500 Reddit upvotes but no implementation.
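
To make the arithmetic concrete, here is a minimal sketch of how the four capped components might sum into a W-index and map onto the bands above. The component names and thresholds come from the table; the function names and data layout are assumptions, not the actual pipeline.

```python
# Minimal sketch of W-index scoring under the component caps listed above.
# Component names mirror the table; the functions themselves are illustrative.
COMPONENT_CAPS = {
    "convergence": 35,     # independent sources reporting the same mechanism
    "implementation": 30,  # working code: GitHub repo > HF model > paper only
    "engagement": 15,      # upvotes/stars/points, deliberately capped low
    "significance": 20,    # clustering model's assessment of importance
}

def w_index(scores: dict) -> int:
    """Sum capped component scores into a 0-100 W-index."""
    return sum(min(scores.get(name, 0), cap) for name, cap in COMPONENT_CAPS.items())

def band(w: int) -> str:
    """Map a W-index onto the strong / moderate / early-weak bands above."""
    if w >= 60:
        return "strong"
    if w >= 25:
        return "moderate"
    return "early/weak"

# Example: the W65 sparse-MoE cluster at the top of today's briefing.
moe = {"convergence": 15, "implementation": 20, "engagement": 15, "significance": 15}
assert w_index(moe) == 65 and band(w_index(moe)) == "strong"
```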

Who are our sources?
Source         | What we pull
arxiv          | Preprints from cs.LG, cs.CL, cs.AI, cs.CV, stat.ML — the raw research firehose
Reddit         | r/MachineLearning, r/LocalLLaMA, r/StableDiffusion, r/MLOps — practitioner signal
GitHub         | Trending ML repos with 50+ stars — implementation evidence
Hacker News    | ML-related posts with 15+ points — cross-domain attention
HuggingFace    | Trending models + watched quantizers (bartowski, MaziyarPanahi, LoneStriker)
OpenReview     | TMLR + NeurIPS workshops — peer-reviewed & bleeding-edge
Twitter        | 9 curated accounts (akhaliq, karpathy, srush, fchollet, etc.)
Papers w/ Code | Trending papers with implementations — community-vetted research
RSS Blogs      | Lilian Weng, Chip Huyen, Eugene Yan, Simon Willison, Interconnects, Latent Space, Netflix Tech + PyTorch & HF blogs

Items that appear across multiple sources score higher. Single-source items start at zero convergence.
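
Read as a scrape configuration, the list above might look roughly like the sketch below. Only the source names and thresholds come from the table; every key name and the overall structure are assumptions, not the real pipeline.

```python
# Illustrative scrape configuration; key names and structure are assumptions,
# only the source names and thresholds are taken from the table above.
SOURCES = {
    "arxiv": {"categories": ["cs.LG", "cs.CL", "cs.AI", "cs.CV", "stat.ML"]},
    "reddit": {"subreddits": ["MachineLearning", "LocalLLaMA", "StableDiffusion", "MLOps"]},
    "github": {"min_stars": 50},                     # trending ML repos only
    "hacker_news": {"min_points": 15},               # ML-related posts only
    "huggingface": {"watched_quantizers": ["bartowski", "MaziyarPanahi", "LoneStriker"]},
    "openreview": {"venues": ["TMLR", "NeurIPS workshops"]},
    "twitter": {"curated_accounts": ["akhaliq", "karpathy", "srush", "fchollet"]},
    "papers_with_code": {"trending_with_code": True},
    "rss_blogs": {"feeds": ["Lilian Weng", "Chip Huyen", "Eugene Yan", "Simon Willison"]},
}
# Items seen in several of these streams score higher on convergence;
# single-source items start at zero.
```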

Signals vs Tracking — what's the difference?

Both sections show real signals. Up to 10 mechanisms are sorted by W-index and split at the largest natural score gap — Signals are above the gap, Tracking below. The split point changes daily based on the data; tied scores always land on the same side.
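
A minimal sketch of one way this split could work, assuming a plain largest-gap rule. The function, its name, and the tie-breaking choice (cut at the highest of several equally large gaps) are illustrative, not the production logic.

```python
# Sketch of a largest-gap split over W-index scores; structure is an assumption.
def split_signals_tracking(w_scores):
    """Sort W-index scores, keep at most 10, and cut at the largest gap
    between adjacent scores; tied scores always stay on the same side."""
    ranked = sorted(w_scores, reverse=True)[:10]
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    if not gaps or max(gaps) == 0:
        return ranked, []                    # no natural gap: everything is a Signal
    cut = gaps.index(max(gaps)) + 1          # first position below the largest gap
    return ranked[:cut], ranked[cut:]        # (Signals, Tracking)

# Today's ten W-index values reproduce the split in this briefing:
signals, tracking = split_signals_tracking([65, 62, 61, 55, 51, 50, 44, 44, 40, 39])
assert signals == [65, 62, 61]               # the gap of 6 between 61 and 55
```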

Tracking does not mean bad, unimportant, or wrong. It usually means a signal has fewer independent sources so far, or lacks public code — things that can change overnight. Some of the most consequential developments start in Tracking before the rest of the ecosystem catches up.

Likewise, a high W-index does not mean research is good, correct, or worth adopting. W-index measures visibility and convergence across sources, not quality. A flawed paper that gets widely discussed will score higher than a brilliant one nobody has noticed yet.

HiddenState is a detection tool, not an endorsement. It tells you where activity is clustering — what you do with that is up to you. Nothing here should be read as a recommendation, ranking of merit, or judgement on any researcher's work.

What does noise rejection mean?

Of all items collected, only 10 make it to the final briefing. The rejection rate is the percentage that got cut.

Filtering happens in three stages:

Stage      | What gets cut
Pre-filter | Short abstracts, low-engagement posts, duplicates across sources
Clustering | Items that don't converge on a shared mechanism with other items
Ranking    | Clusters below the top 10 by W-index

A 99% rejection rate means 99 out of 100 items were noise. That's the point — most ML news doesn't matter on any given day.
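
For the arithmetic, a tiny sketch; the item count is hypothetical, and only the "10 items published" figure comes from above.

```python
# Hypothetical counts; only the default of 10 published items comes from above.
def rejection_rate(collected: int, published: int = 10) -> float:
    """Percentage of collected items cut before the final briefing."""
    return 100.0 * (collected - published) / collected

print(f"{rejection_rate(1000):.0f}% rejected")   # 1,000 in, 10 out -> 99% rejected
```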

Privacy
Data collection

None. HiddenState collects no personal data, no email addresses, no IP logs, no usage analytics, and no telemetry of any kind.

Cookies & tracking

Zero cookies. No first-party, no third-party, no session cookies, no tracking pixels.

The only client-side storage is localStorage for your theme preference (dark/light). This never leaves your browser and contains no identifying information.

External requests

Pages load zero external scripts, fonts, stylesheets, or analytics. Everything is self-contained. The only outbound link is to Ko-fi if you choose to click it.

Data sources

HiddenState monitors 9 distinct public data streams (ArXiv, GitHub, Reddit, etc.) to detect cross-platform convergence. We do not use private user data; we only analyze what the community has already published.