GLM-5: from Vibe Coding to Agentic Engineering
Key Summary
- GLM-5 is a new open-weight AI model that moves from 'vibe coding' (prompting the model to write code) to 'agentic engineering' (letting the model plan, build, test, and fix software on its own).
- It uses DeepSeek Sparse Attention (DSA) to focus compute on important tokens, cutting long-context costs roughly in half while keeping accuracy high up to 128K–200K tokens.
- A new asynchronous reinforcement learning (RL) system lets rollouts (the model acting) run separately from training, greatly speeding up learning from long, multi-step tasks.
- Special agent RL tricks—like a Multi-Task Rollout Orchestrator, Token-in-Token-out (TITO), and careful off-policy controls—make training stable even when tasks take many steps.
- GLM-5 stacks training in stages: pretraining on 28.5T tokens, mid-training to extend context to 200K, then SFT and three RL phases (Reasoning, Agentic, General), capped by on-policy cross-stage distillation to avoid forgetting.
- On public benchmarks, GLM-5 is the top open-weight model across many coding, reasoning, and agentic tests, with about 20% improvement over GLM-4.7 and an Intelligence Index v4.0 score of 50.
- It shines on real-world coding: higher scores on SWE-bench variants, strong terminal and cybersecurity tasks, and better long-horizon capabilities (e.g., running a simulated business to a top-tier profit).
- Practical engineering wins include faster, cheaper long-context inference, reliable tool use, better planning over hours-long tasks, and smooth deployment on diverse Chinese chips with mixed-precision quantization.
- A simple but powerful context-management strategy for search agents (keep-recent-k + discard-all) boosts BrowseComp to 75.9, leading among open models.
- Overall, GLM-5 shows how to build efficient, capable coding agents that can handle end-to-end software workflows—not just single code snippets.
Why This Research Matters
Real-world software work is long and messy; GLM-5 shows open models can handle it efficiently. By cutting long-context compute with DSA, teams can afford to let agents read entire repos, specs, and logs without breaking the bank. With asynchronous RL and verified environments, agents learn grounded skills—building, testing, and debugging—rather than just writing pretty code. The result is practical automation: faster bug fixes, safer refactors, and reliable multi-step tool use across hours-long sessions. Open weights and Chinese-chip optimization make it deployable in diverse settings, from startups to regulated enterprises. This is a step toward trustworthy AI teammates that can plan, act, and improve.
Detailed Explanation
01 Background & Problem Definition
The World Before: You know how a good student can solve a single math problem but might struggle to do a whole science fair project that takes days of planning? Early large language models (LLMs) were like that: great at one-shot answers and small code snippets, but not so great at long, messy, real-world projects that need planning, tools, and lots of back-and-forth. They also got very expensive when you asked them to read or remember very long documents or big codebases.
The Problem: As AI moved from chat helpers to real problem solvers, two roadblocks stood out. First, cost: reading 100,000+ tokens (like an entire repository or a long legal contract) made attention computations explode, slowing everything and draining GPUs. Second, adaptability: real software engineering isn’t a single answer—it’s a multi-step journey (plan → implement → run tests → debug → refine), and existing training pipelines weren’t built to teach models to act across many steps while staying consistent over hours.
Failed Attempts:
- Bigger-is-better only: Just scaling parameters helped a bit but made costs worse and didn’t magically teach multi-step planning.
- Dense attention everywhere: Accurate but painfully slow and pricey at 128K+ tokens; sliding-window shortcuts often forgot important details.
- Synchronous RL: When rollouts take minutes or hours (agents browsing, coding, testing), synchronous training left GPUs idle waiting for the slowest sample.
- Tool-use demos without verifiable environments: Agents looked smart in short demos but fell apart on real projects with builds, tests, and logs.
The Gap: What was missing was a combined solution: (1) an attention system that keeps long-context quality without paying full price (enter DSA), (2) an RL system that learns from many long rollouts in parallel (asynchronous, decoupled training), and (3) rigorous, verifiable environments (thousands of real SWE/terminal tasks) so the model learns grounded behaviors (building, testing, debugging) instead of just nice-sounding text.
Real Stakes:
- Software teams: Bugs cost money. A reliable coding agent that can read entire repos, make changes, run tests, and fix failures saves weeks.
- Long documents: Lawyers, researchers, and analysts need accurate retrieval and reasoning over 100K+ tokens; cost matters.
- Tools and web: Agents must browse, call tools, and manage context efficiently, or they stall and hallucinate.
- Open weights: Companies and researchers need controllable, deployable models on their own hardware (including Chinese chips), with predictable performance and cost.
Prerequisite Sandwiches (the building blocks):
- 🍞 Hook: Imagine a super helpful librarian who reads and writes for you. 🥬 The Concept (LLMs): A Large Language Model is a pattern-learning machine that predicts the next word to generate helpful text. How it works: (1) Read lots of examples, (2) learn patterns, (3) answer by continuing text smartly, (4) adjust with feedback. Why it matters: Without it, the model can’t communicate or code fluently. 🍞 Anchor: When you ask for a Python function to sort a list, the LLM writes it by predicting useful next tokens.
- 🍞 Hook: You know how when you read a page, your eyes jump to the important bits? 🥬 The Concept (Attention): Attention lets AI focus more on key words and less on fillers. How it works: (1) Look at all tokens, (2) score importance, (3) mix information from high-scoring tokens, (4) decide the next output. Why it matters: Without attention, the model drowns in unimportant details. 🍞 Anchor: In “What’s the capital of France?”, attention locks onto “capital” and “France” to say “Paris.”
- 🍞 Hook: Think of training a puppy with treats. 🥬 The Concept (Reinforcement Learning): RL teaches a model by rewarding good actions and discouraging bad ones. How it works: (1) Try, (2) get feedback, (3) adjust behavior to get more reward, (4) repeat. Why it matters: Without RL, models don’t learn to act well over many steps. 🍞 Anchor: A coding agent that gets a reward only when tests pass learns to write code that actually works.
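The "reward only when tests pass" idea in the RL anchor above can be sketched in a few lines. This is an illustrative verifiable-reward function (the name `run_tests` and the pass/fail scoring are assumptions for the example, not GLM-5's actual reward code): it executes a candidate solution together with its unit tests in a subprocess and pays out 1.0 only if every test passes.

```python
import os
import subprocess
import sys
import tempfile

def run_tests(candidate_code: str, test_code: str) -> float:
    """Toy verifiable reward: 1.0 if all tests pass, else 0.0."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        # A nonzero exit code (e.g., a failed assert) means no reward.
        result = subprocess.run([sys.executable, path], capture_output=True)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)
```

A correct `sort_list` implementation earns 1.0; one that forgets to sort earns 0.0, so the agent only learns from code that actually works.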
02 Core Idea
The “Aha!” in one sentence: Let the model become an efficient, long-context, tool-using engineer by pairing cost-saving sparse attention (DSA) with an asynchronous RL pipeline that teaches step-by-step planning, building, testing, and fixing.
Three analogies:
- Factory upgrade: Old models were a single craftsman; GLM-5 is a whole factory with conveyor belts (asynchronous rollouts), smart inspectors (verifiers/judges), and energy savers (DSA) that finishes more products faster and cheaper.
- Marathon coach: Instead of sprinting one answer, GLM-5 trains to run long races—plan the route, pace itself, refuel (tools), and keep going for hours without losing track.
- Smart spotlight: DSA is a spotlight that follows the important lines on a huge stage play (128K+ tokens) so the actor (the model) doesn’t get lost or exhausted.
Before vs After:
- Before: Dense attention at long contexts = slow and costly; short, static tasks; training tied to rollouts, so GPUs waited; agents often forgot or derailed over long horizons.
- After: DSA keeps long-context fidelity with far less compute; asynchronous RL lets exploration and learning happen in parallel; new agent RL tricks (TITO, orchestrator, clipping) keep training stable; verified environments ground the skills; result: end-to-end engineering that stands up to tests.
Why it works (math-free intuition):
- Most tokens in long inputs aren’t equally useful right now. If you can quickly point to the important ones (with a learned indexer) and compute attention only there, you keep the same brainpower while spending less energy.
- Learning to act across many steps is noisy and slow if you wait for everyone to finish together. Let actors run freely (asynchronous), stream exact token IDs (TITO) so training matches what was really done, and gently ignore samples that drift too far off-policy. You get steady, scalable learning.
- Multi-stage RL is like layering skills: reasoning first, then agent behaviors, then style and safety—finished by distillation so you don’t forget earlier skills.
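The first intuition above—score every token with a cheap indexer, then attend only to the top-k—can be sketched in plain Python. This is a toy illustration of the DSA idea under simplifying assumptions (dot-product indexer, tiny vectors), not the real fused kernel:

```python
import math

def sparse_attention(query, keys, values, k=2):
    """Toy top-k sparse attention: a cheap indexer scores all keys,
    then softmax attention runs only over the k best positions."""
    # 1) Lightweight indexer: score each key against the query.
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # 2) Keep only the top-k most relevant token positions.
    topk = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # 3) Numerically stable softmax over the selected scores only.
    selected = [scores[i] for i in topk]
    m = max(selected)
    exps = [math.exp(s - m) for s in selected]
    z = sum(exps)
    weights = [e / z for e in exps]
    # 4) Mix the values of just the selected tokens.
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, topk))
           for d in range(dim)]
    return out, topk
```

With 128K keys and a small k, step 3 and step 4 touch only k positions instead of all of them—that is where the roughly-half compute savings comes from, while the indexer in step 1 stays cheap.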
Building Blocks (Sandwich, in dependency order):
- 🍞 Hook: You once asked an AI to “code the vibe,” and it wrote a snippet that felt right—but didn’t always run. 🥬 The Concept (Vibe Coding): Using prompts to get code that looks right quickly. How it works: (1) Describe intent, (2) model drafts code, (3) you tweak by hand. Why it matters: Without it, prototyping is slow; but it stops short of real engineering. 🍞 Anchor: “Build a cozy website” yields a pretty page, but no full backend or tests.
- 🍞 Hook: Imagine a robot teammate who plans features, writes code, runs tests, and debugs. 🥬 The Concept (Agentic Engineering): Training AI to act as an autonomous software engineer. How it works: (1) Plan, (2) implement, (3) run builds/tests/tools, (4) analyze logs, (5) iterate. Why it matters: Without it, AI can’t complete real end-to-end tasks. 🍞 Anchor: Given a GitHub issue, the agent edits files, runs unit tests, and pushes a fix.
- 🍞 Hook: Reading a giant book? You skim and zoom into key pages. 🥬 The Concept (DeepSeek Sparse Attention, DSA): An attention system that picks important tokens to attend to, cutting compute without losing accuracy. How it works: (1) A fast indexer scores tokens, (2) pick top-k, (3) compute attention sparsely, (4) adapt from dense via continued pretraining. Why it matters: Without DSA, long contexts are too expensive. 🍞 Anchor: At 128K tokens, DSA keeps quality similar to dense but at about half the cost.
- 🍞 Hook: Group projects run faster if teammates work in parallel. 🥬 The Concept (Asynchronous RL): Rollouts generate experiences while training updates the model, both running independently. How it works: (1) Actors explore, (2) learner updates, (3) periodic weight sync, (4) drop too-stale samples. Why it matters: Without async, GPUs wait and learning crawls. 🍞 Anchor: Coding agents browsing, editing, and testing can run nonstop while training catches up.
- 🍞 Hook: Think of a conductor coordinating many bands at once. 🥬 The Concept (Multi-Task Rollout Orchestrator): A server that schedules heterogeneous agent tasks and standardizes their logs. How it works: (1) Each task is a microservice, (2) the orchestrator balances sampling, (3) outputs unified message lists. Why it matters: Without it, multi-task RL is chaos. 🍞 Anchor: SWE tasks, terminals, and search agents all report in one shared format.
- 🍞 Hook: If you copy homework by hand, small mistakes creep in. 🥬 The Concept (Token-in-Token-out, TITO): Training uses the exact token IDs sampled during rollout—no re-tokenization. How it works: (1) Capture token IDs + metadata live, (2) train directly on them, (3) align actions with rewards exactly. Why it matters: Without TITO, small mismatches break learning. 🍞 Anchor: Streamed multi-turn traces line up perfectly with gradients.
- 🍞 Hook: A chef perfects a dish by tasting today’s batch, not yesterday’s. 🥬 The Concept (On-Policy Cross-Stage Distillation): At the end, learn from earlier best checkpoints so new skills don’t overwrite old ones. How it works: (1) Use teacher policies from earlier stages, (2) match their outputs on their data, (3) keep gains and recover drift. Why it matters: Without it, later RL can erase reasoning or coding skills. 🍞 Anchor: After Agentic RL, distillation recovers precise math/coding steps.
- 🍞 Hook: Your phone keeps the latest messages and archives the old. 🥬 The Concept (Context Management for Search Agents): Keep recent steps, periodically reset history to prevent overload. How it works: (1) Keep-recent-k rounds, (2) discard-all beyond length T, (3) combine for stability. Why it matters: Without it, super-long contexts degrade accuracy. 🍞 Anchor: BrowseComp jumps from 62.0 to 75.9 with this hybrid strategy.
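The keep-recent-k + discard-all hybrid in the last building block is simple enough to sketch directly. This is an illustrative version (the function name, the word-count token proxy, and the parameter names `keep_k` and `max_tokens` are assumptions for the example):

```python
def manage_context(history, keep_k=3, max_tokens=20):
    """Hybrid context management sketch:
    1) keep-recent-k: retain only the last keep_k rounds;
    2) discard-all: if the retained window still exceeds max_tokens,
       reset to an empty context and let the agent start fresh."""
    # Keep-recent-k: drop everything but the most recent rounds.
    recent = history[-keep_k:]
    # Crude token count stand-in: whitespace-separated words.
    total_tokens = sum(len(turn.split()) for turn in recent)
    if total_tokens > max_tokens:
        return []  # discard-all: full reset beyond the length threshold
    return recent
```

The appeal is that both rules are cheap and compose: keep-recent-k bounds per-step growth, and discard-all caps the worst case so super-long sessions never drown the model.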
03 Methodology
At a high level: Input (huge mixed corpus + verifiable agent environments) → Base & Mid-Training (language, code, long context) → Continued pretraining with DSA → SFT with thinking modes → RL trio (Reasoning RL → Agentic RL → General RL) → On-policy distillation → Output (GLM-5: an efficient, long-context, agentic coding model).
Step-by-step (with reasons and mini-examples):
- Base Pretraining (28.5T tokens total across stages)
- What: Train on web, code, math/science. Heavy emphasis on high-quality code and reasoning.
- Why: Strong basics make later RL efficient; code + reasoning are the spine of agentic engineering.
- Example: From docstrings and tests, the model learns how Python classes and unit tests fit together.
- Mid-Training for Long Context (32K → 128K → 200K)
- What: Gradually extend the context window, upsample long documents, repo-level code, and synthetic long-range tasks.
- Why: Sudden jumps are unstable; progressive scaling keeps the model coherent across ultra-long inputs.
- Example: Concatenate multiple related files and issues from a repo so the model learns to connect distant pieces.
- Continued Pretraining with DSA
- What: Start from the long-context base model; warm up the sparse indexer, then co-train sparsely.
- Why: Preserve dense-quality while lowering compute for long sequences; no need to retrain from scratch.
- Example: On RULER-like tests up to 128K, DSA matches or nearly matches dense attention, with much lower cost.
- SFT with Interleaved/Preserved/Turn-level Thinking
- What: Supervise on curated dialogs and agent traces; introduce thinking modes.
- Interleaved: think → act → think → respond.
- Preserved: keep prior thinking blocks across turns.
- Turn-level: enable/disable reasoning per turn.
- Why: Teach the model to plan between actions, remember its earlier reasoning, and adapt depth to task difficulty.
- Example: In multi-turn coding, preserved thinking lets the agent reuse earlier debugging steps instead of re-deriving them.
- Reasoning RL (math, science, code, tool-integrated)
- What: Mixed-domain RL with stable training tricks (e.g., inference-vs-training policy separation, careful clipping) and verifiable rewards.
- Why: Encourage step-by-step correctness in domains that need truth, not style.
- Example: Reward only when a proof checks, a function passes tests, or a tool-based computation agrees with ground truth.
- Agentic RL (asynchronous, decoupled)
- What: Separate inference and training engines; orchestrate many tasks as microservices; use TITO and token-level clipping; drop stale/off-policy samples; route to preserve KV caches.
- Why: Long rollouts must not stall learning; stability requires exact token alignment and gentle off-policy control.
- Example: Hundreds of concurrent SWE/terminal rollouts stream token IDs; the learner updates continuously while actors keep exploring.
- General RL (foundational correctness, emotional intelligence, task-specific quality)
- What: Use a hybrid reward system (rules, outcome models, generative reward models) plus human-style anchors.
- Why: Avoid reward hacking; make outputs both correct and pleasant; tailor to tasks like writing, translation, or QA.
- Example: A rule catches formatting, an outcome model measures factuality, and a generative judge checks coherence and helpfulness.
- On-Policy Cross-Stage Distillation
- What: Learn from the best prior checkpoints on their own data distributions.
- Why: Prevent later stages from erasing earlier strengths (e.g., precise reasoning steps).
- Example: After Agentic RL, distillation re-aligns with earlier high-precision reasoning traces.
- Systems & Efficiency (the secret sauce behind the scenes)
- Memory: Gradient sharding (ZeRO2-style), offloading activations, sequence-chunked losses to slash peaks.
- Parallelism: Interleaved pipeline parallel, deferred gradient computation, hierarchical all-to-all.
- Speed: FP8 rollouts, MTP/speculative decoding with parameter sharing, prefill–decode disaggregation.
- Hardware: Full-stack adaptation to major Chinese chips; mixed W4A8 quantization for MoE experts; fused kernels for sparse attention.
- Why they matter: Without these, 200K contexts and multi-task async RL would be too slow or too memory-hungry.
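The Agentic RL step above combines three stability ingredients—exact token IDs from rollout (TITO), dropping too-stale samples, and token-level clipping. A minimal sketch of that bookkeeping, with illustrative names and thresholds (this is not GLM-5's actual training code), could look like this:

```python
import math

class TitoBuffer:
    """Toy token-in-token-out buffer with off-policy controls:
    rollouts stream exact sampled token IDs plus the policy version
    that produced them; the learner skips samples that are too stale
    and clips each token's importance ratio."""

    def __init__(self, max_staleness=2, clip=0.2):
        self.samples = []
        self.max_staleness = max_staleness
        self.clip = clip

    def add(self, token_ids, logprobs, reward, policy_version):
        # Store the exact token IDs from rollout -- never re-tokenize text,
        # so actions line up perfectly with rewards and gradients.
        self.samples.append((token_ids, logprobs, reward, policy_version))

    def batch(self, current_version, new_logprob_fn):
        """Yield (token_id, clipped_ratio, reward) for fresh-enough samples."""
        for token_ids, old_lps, reward, version in self.samples:
            if current_version - version > self.max_staleness:
                continue  # gently ignore rollouts that drifted too far off-policy
            for tok, old_lp in zip(token_ids, old_lps):
                ratio = math.exp(new_logprob_fn(tok) - old_lp)
                # Token-level clipping keeps any single update small.
                ratio = max(1 - self.clip, min(1 + self.clip, ratio))
                yield tok, ratio, reward
```

Because actors append and the learner consumes independently, rollout generation never blocks on training—the essence of the decoupled, asynchronous design.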
What makes it clever:
- Marrying DSA (cost-saving attention that keeps quality) with a fully asynchronous, token-faithful RL pipeline (TITO) is the key combo.
- Simple, robust context management for search agents unlocks big gains with tiny engineering cost.
- On-policy cross-stage distillation neatly stitches all the stages together so gains add up instead of canceling.
04 Experiments & Results
The Tests (what and why):
- Reasoning & General: Humanity’s Last Exam (with/without tools), AIME/HMMT/IMO-style math, GPQA-Diamond, LongBench v2 for long-context reasoning.
- Coding: SWE-bench Verified & Multilingual, Terminal-Bench 2.0 (two frameworks), CyberGym (security coding).
- Agentic: BrowseComp (+ BrowseComp-ZH), τ-Bench (multi-turn tool use), MCP-Atlas (real tools), Tool-Decathlon (long-horizon tools), Vending-Bench 2 (year-long business sim), GDPval-AA (economic value tasks).
- Real-world agentic engineering: CC-Bench-V2 (frontend, backend, long-horizon chained tasks) with builds, unit tests, and Agent-as-a-Judge.
The Competition:
- Open: GLM-4.7, DeepSeek-V3.2, Kimi K2.5.
- Proprietary: Claude Opus 4.5, Gemini 3 Pro, GPT-5.2 (xhigh).
Scoreboard with context:
- Overall: GLM-5 averages about 20% better than GLM-4.7 across ARC benchmarks and becomes the #1 open-weight model in many settings; Intelligence Index v4.0 = 50 (first open-weight to hit 50).
- Long-context reasoning: Top among opens on LongBench v2 and close to Gemini 3 Pro; DSA adaptation matches dense baselines with fewer training tokens.
- Coding: SOTA among open models on SWE-bench variants; beats or ties some proprietary models on SWE-bench Multilingual; strong, verified scores on Terminal-Bench 2.0; large gains on CyberGym vs GLM-4.7.
- Agentic search: BrowseComp rises from 62.0 to 75.9 with hybrid context management, leading among open models; consistent wins on BrowseComp-ZH.
- Tool use and workflows: Competitive with Claude Opus 4.5 on τ-Bench, MCP-Atlas, and Tool-Decathlon.
- Long-horizon planning: Vending-Bench 2 final balance $4,432 (best among opens, approaching Claude); strong CC-Bench-V2 repo exploration; better backend Pass@1 than GLM-4.7; chained tasks improved but still behind Claude.
- Human-side utility: LMArena shows GLM-5 as the #1 open model in both Text and Code arenas, with performance close to frontier proprietary systems.
Make the numbers meaningful:
- Intelligence Index 50: Like moving from a B to a solid A across breadth and reliability for an open model—first of its kind.
- BrowseComp 75.9: Not just a few points—this is the difference between occasionally getting lost vs consistently completing complex web hunts.
- Vending-Bench 2 $4,432: Imagine managing a small store for a year and ending near the top tier of all contenders.
- CC-Bench-V2 frontend BSR ~98–100%: Most projects actually build and run; CSR competitive with Claude, but ISR (complete-all-requirements) shows room to grow.
Surprising/Notable findings:
- DSA works with relatively modest continued pretraining budgets and stays stable under RL when top-k selection is deterministic.
- Simple context-management rules (keep-recent-k + discard-all) deliver large, robust gains on BrowseComp.
- TITO (token-in-token-out) and token-level clipping were critical: tiny alignment errors or off-policy drift quickly destabilized long-horizon RL without them.
- Slide generation improves not only structure but also aesthetics via multi-level rewards—showing RL can teach style with good verifiers.
05 Discussion & Limitations
Limitations (be specific):
- End-to-end completion gap: On CC-Bench-V2, GLM-5’s Instance Success Rate in frontend remains below Claude Opus 4.5—many pieces are correct (high CSR), but putting every piece together perfectly is still hard.
- Error compounding: In chained backend tasks, an early small mistake can silently break later steps; long-horizon self-correction still needs work.
- Judge dependence: Some evaluations (e.g., BrowseComp) rely on proprietary judges for consistency; community-standard, open judges would improve transparency.
- Hardware tuning: While adapted to many Chinese chips, the best kernels and schedules are still platform-specific; generalizing peak efficiency remains engineering-heavy.
- Safety and robustness: As agents gain autonomy, robust guardrails and verifiable constraints are increasingly important beyond what benchmarks measure.
Required resources:
- Training: Large-scale GPU/NPU clusters, long-context data pipelines, tool-backed environments, orchestration servers.
- Inference: Multi-node KV cache capacity, FP8/MTP support, PD disaggregation for multi-turn stability.
When not to use:
- Tiny, single-turn tasks where a small model suffices; GLM-5’s strengths shine on long, tool-rich, or code-heavy problems.
- Strictly offline, no-tools environments that forbid context resets or judge verifiers—agentic advantages diminish.
Open questions:
- Long-horizon self-correction: How to reliably detect and repair early-chain errors automatically without heavy reruns?
- DSA + RL theory: Best practices for indexer training under RL pressure, and principled ways to tune k and determinism vs speed.
- Rewarding real outcomes: More robust, low-variance, anti-hacking reward models that reflect human value across creative and engineering tasks.
- Continual agent improvement: Safely learning from real deployments without drift or privacy risks.
- Truly universal context management: Adaptive policies that decide when to keep, fold, or discard based on task signals, not fixed thresholds.
06 Conclusion & Future Work
Three-sentence summary: GLM-5 shows how to turn a strong language model into an efficient, long-context software engineer by combining DeepSeek Sparse Attention with a carefully engineered, asynchronous RL pipeline. It learns to plan, build, test, and fix code across thousands of verifiable environments while keeping costs low and stability high, then preserves all gains with on-policy cross-stage distillation. The result is the top open-weight model across many coding, reasoning, and agentic benchmarks, closing much of the gap to proprietary leaders.
Main achievement: A practical blueprint for agentic engineering: cost-saving long-context attention (DSA) + token-faithful, asynchronous RL + verified environments + final-stage distillation, yielding real end-to-end coding capability.
Future directions:
- Stronger long-horizon self-correction for chained tasks.
- More open, standardized judges and verifiers.
- Smarter, adaptive context management and tool choice policies.
- Broader domain agents (data engineering, ops, UI/UX), unified under the same orchestrator.
Why remember this: GLM-5 isn’t just better at benchmarks—it makes coding agents that actually build, test, and ship. It demonstrates that open-weight models can be efficient and agentic at the same time, pointing the way from vibe coding to reliable, production-grade AI teammates.
Practical Applications
- Automated bug triage and repair across large repositories with unit-test verification.
- End-to-end feature implementation: read issues, modify code, run tests, and submit patches.
- Repository exploration assistants that locate relevant files and code paths in huge codebases.
- Secure terminal agents for DevOps tasks (log analysis, environment setup, CI/CD debugging).
- Long-document assistants for legal, research, and policy analysis over 100K+ tokens.
- Enterprise search agents that browse internal wikis and the web with robust context management.
- Cybersecurity coding support: write and validate secure code snippets and scripts.
- Slide-generation assistants that produce well-structured, aesthetically sound HTML slides.
- Multilingual customer-support copilots with reliable tool calling and formatting compliance.
- On-prem deployments optimized for domestic NPUs/GPUs using mixed-precision quantization.