GLM-5 scored 50 on the Intelligence Index, making it the new open-weights leader, with 66K+ downloads; it was released alongside MiniMax M2.5, both targeting long-horizon agentic engineering, and Z.ai publicly stated it is GPU-starved.
GLM-5's 50 on the Intelligence Index sets a new open-weights bar — next bottleneck is GPU supply for inference at scale, as Z.ai has openly acknowledged.
9 sources
- reddit GLM-5 scores 50 on the Intelligence Index and is the new... 638pts
- reddit GLM-5 Officially Released 773pts
- hn GLM-5: From Vibe Coding to Agentic Engineering 378pts
- hn GLM-5: Targeting complex systems engineering and... 479pts
- reddit GLM 5 Released 613pts
- huggingface zai-org/GLM-5
- reddit GLM 5.0 & MiniMax 2.5 Just Dropped, Are We Entering... 260pts
- reddit MiniMax M2.5 Released 269pts
- reddit Z.ai said they are GPU starved, openly. 1467pts
Nanbeige4.1-3B explores whether a 3B model can reason, align, and act as a general model; MiniCPM-SALA (426 likes, 2569 downloads) targets similar small-model general capability — both push the floor of useful model size.
Nanbeige4.1-3B and MiniCPM-SALA both target general capability at 3B scale — next bottleneck is whether agentic tool-use and multi-step reasoning hold up at this size.
2 sources
- reddit Nanbeige4.1-3B: A Small General Model that Reasons,... 156pts
- huggingface openbmb/MiniCPM-SALA
Multiple users report that Flux.2 Klein 9B outperforms Qwen Image Edit for editing consistency and LoRA trainability at 4 inference steps, with successful style LoRAs trained at rank 32 over 7,000 steps on Runpod.
5 sources
- reddit Who else left Qwen Image Edit for Flux 2 Klein 115pts
- reddit I continue to be impressed by Flux.2 Klein 9B's trainability 100pts
- reddit Google Street View 2077 (Klein 9b distilled edit) 136pts
- reddit DC Ancient Futurism Style 1 875pts
- reddit ZImageTurboProgressiveLockedUpscale (Works with Z Image... 87pts
A ComfyUI custom node renders pose, depth, normal, and canny batches from FBX/GLB animation files (e.g. Mixamo rigs) in an interactive 3D viewport for ControlNet conditioning.
1 source
- reddit interactive 3D Viewport node to render Pose, Depth,... 227pts
GameDevBench evaluates multimodal coding agents on game development, FeatureBench benchmarks agentic coding for complex feature development, and CodeRLM uses tree-sitter indexing to improve how LLM agents navigate codebases.
Multiple benchmarks now test agents on multi-file feature-level coding rather than single-function tasks — next bottleneck is reliable multi-step planning across large codebases.
MetaphorStar applies visual RL to image metaphor understanding, while Reinforced Curriculum Pre-Alignment uses an RL-style curriculum for domain-adaptive VLMs — both use reinforcement signals to push visual reasoning beyond supervised fine-tuning.
2 sources
A six-month follow-up on the Attempt-to-Persuade Eval shows GPT and Claude models improved at resisting harmful persuasion while Gemini regressed.
1 source
Three papers independently address VLA model brittleness in contact-rich manipulation: RISE adds a compositional world model for self-improvement, ABot-M0 uses action manifold learning across hardware, and MolmoSpaces provides a large-scale ecosystem for navigation/manipulation.
Three concurrent papers attack VLA fragility in dynamic manipulation via world models and action manifolds — next bottleneck is sim-to-real transfer fidelity for contact-rich tasks.
3 sources
MOSS-Audio-Tokenizer scales discrete audio tokenization for future audio foundation models (47 likes), while Voxtral Realtime achieves sub-second latency streaming ASR matching offline quality — both address the bottleneck of integrating audio natively into LLM architectures.
MOSS-Audio-Tokenizer targets scaling tokenizers beyond pretrained codec limitations while Voxtral hits sub-second streaming latency — next bottleneck is joint speech understanding and generation in a single LLM pass.
2 sources
- paperswithcode MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for...
- paperswithcode Voxtral Realtime