Papers5

#Speculative Decoding

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 Team, Aohan Zeng et al.Feb 17arXiv

GLM-5 is a new open-weight AI model that moves from 'vibe coding' (prompting the model to write code) to 'agentic engineering' (letting the model plan, build, test, and fix software on its own).

#GLM-5#Agentic Engineering#DeepSeek Sparse Attention

Not triaged yet

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Intermediate

Ailin Huang, Ang Li et al.Feb 11arXiv

Step 3.5 Flash is a huge but efficient AI that keeps 196 billion total parameters but only wakes up about 11 billion per token, so it thinks smart and fast.

#Sparse Mixture-of-Experts#Sliding-Window Attention#Head-wise Gated Attention

Not triaged yet

Scaling Embeddings Outperforms Scaling Experts in Language Models

Intermediate

Hong Liu, Jiaqi Zhang et al.Jan 29arXiv

The paper shows that growing the embedding part of a language model (especially with n-grams) can beat adding more MoE experts once you pass a certain sparsity 'sweet spot.'

#N-gram Embedding#Mixture-of-Experts (MoE)#Embedding Scaling

Not triaged yet

MiMo-V2-Flash Technical Report

Intermediate

Xiaomi LLM-Core Team, : et al.Jan 6arXiv

MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.

#Mixture-of-Experts#Sliding Window Attention#Global Attention

Not triaged yet

NVIDIA Nemotron 3: Efficient and Open Intelligence

Intermediate

NVIDIA, : et al.Dec 24arXiv

Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and cheaper.

#Nemotron 3#Mixture-of-Experts#LatentMoE

Not triaged yet