Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Key Summary
- Youtu-LLM is a small (1.96B) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
- It uses a custom STEM-friendly tokenizer and a long 128k memory window so it can keep track of very long tasks and documents.
- The model's attention architecture is Multi-Latent Attention (MLA), which is efficient and strong for reasoning on small devices.
- Training follows a "Commonsense → STEM/Code → Agent" curriculum so skills build up like school grades.
- The team built 200B tokens of agent-like "trajectory" data across math, coding, deep research, and tool use to teach planning, action, and reflection.
- A structured Agentic-CoT format (analysis → plan → action → reflection → summary) reduces messy overthinking and makes reasoning cleaner.
- On many tests, Youtu-LLM beats other models in the 2-4B parameter range and even rivals or surpasses some bigger ones on agent tasks.
- Agentic mid-training shows clear scaling: the first ~34B tokens give big gains, and improvements keep growing as more agent data is added.
- Smart engineering choices (FP16 precision, consistent sampling) made reinforcement learning more stable and prevented training-collapse issues.
- This work shows that lightweight models can be truly agentic, making fast, affordable, and capable on-device assistants realistic.
Why This Research Matters
Making small models truly agentic means powerful assistants can run on everyday devices: faster, cheaper, and more private. This enables coding copilots that can navigate big repos, research helpers that verify sources, and math tutors that reason step by step, all without cloud dependency. Hospitals and banks can deploy on-device agents to protect sensitive data while still getting strong reasoning and tool-use skills. Teachers and students gain access to capable, explainable AI that can show its work and adapt to curriculum needs. Startups can build reliable agents without massive compute budgets, speeding innovation. In short, this work democratizes strong AI behaviors by proving small can be smart, if trained the right way.
Detailed Explanation
01 Background & Problem Definition
You know how a talented coach can turn a small team into champions with the right drills, playbook, and practice games? For years, AI teams have tried to do something similar with small language models: make them think better without needing the giant compute bills of huge models. Before this work, the usual trick was to distill knowledge from a large model into a smaller one or add a simple instruction-following layer on top. That helped with polite answers and format-following, but it didn't grow real "agent brains": the skills to plan steps, use tools, keep track of long tasks, and learn from feedback. The problem was simple to say but hard to solve: can a lightweight model (under 2B parameters) develop strong, native agent skills through pre-training, not just through after-the-fact patches?

In the real world, agents need to do more than chat. They must plan multi-step research, fix code across large repositories, call tools with the right parameters, and reflect when things go wrong. Small models traditionally struggled here, especially when tasks were long or required precise, verifiable reasoning (like math or debugging). People tried a few things. Distillation copied answers but often missed the inner reasoning muscles. Instruction tuning made models follow directions but didn't truly teach planning or reflection. Some changed model architectures or squeezed memory, but without the right learning diet, the models still failed when things got complex or long. What was missing was a principled path to teach agent behaviors from the ground up, with training data that shows not just what to answer, but how to think and act step-by-step.

This paper fills that gap with Youtu-LLM, a 1.96B model built for native agentic intelligence. The team didn't just pre-train on random web text. They used a curriculum that starts with commonsense, then leans into STEM and coding, and finally focuses on agentic data: massive, carefully built "trajectories" showing analysis → plan → action → reflection → summary across math, code, deep research, and tool use. They also engineered the foundation: a STEM-friendly tokenizer, a 128k long context window, and an efficient MLA attention architecture that keeps memory small but thinking strong.

Why should you care? Because the stakes are real. If you can run a smart, agentic assistant on a laptop or an edge device, you speed up research, debugging, data analysis, and learning without cloud costs or high latency. In schools, a lightweight tutor could reason through math with students. In companies, a small on-device agent could search, verify, and summarize proprietary data safely. In everyday life, you could have an assistant that plans, acts, and reflects without needing a supercomputer. This paper shows that with the right recipe (better tokens, long memory, a strong but compact attention design, and most importantly, a staged diet of agent trajectories) even small models can act big, and do so natively, not just by imitating.
02 Core Idea
The "aha!" in one sentence: Teach a small model to be an agent by building its thinking ladder step-by-step (commonsense → STEM/code → agentic trajectories) on top of an efficient long-memory architecture and a STEM-smart tokenizer. Here are the core concepts using the Sandwich pattern:
🍞 Hook: Imagine chopping a paragraph into Lego bricks so a computer can snap ideas together cleanly. 🥬 The Concept: Tokenization is how text is split into pieces (tokens) that the model reads. How it works:
- Pre-tokenize so symbols (like digits, code, and CJK scripts) don't get glued together wrongly.
- Start with a safe base vocab, then add focused Chinese and reasoning/code tokens.
- Keep digits atomic (0-9) and add STEM/code chunks so formulas and code compress well. Why it matters: Bad tokens waste space and muddle meaning; good tokens make reasoning and coding shorter, cleaner, and faster. 🍞 Anchor: "3x^2 + 10" stays in tidy pieces so the model can compute instead of guessing.
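To make this concrete, here is a minimal sketch of a byte-level BPE tokenizer with strict pre-tokenization that keeps digits atomic, using the Hugging Face `tokenizers` library. The vocabulary size, special tokens, and tiny training corpus are placeholder assumptions for illustration, not the paper's actual settings.

```python
# Minimal sketch (not the authors' pipeline) of a byte-level BPE tokenizer with
# strict pre-tokenization: digits stay atomic and everything falls back to bytes.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())

# Pre-tokenize before BPE so digits are split one by one and other symbols are
# handled at the byte level, preventing odd merges across digits/code/CJK text.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),      # keep 0-9 atomic
    pre_tokenizers.ByteLevel(add_prefix_space=False),    # robust byte fallback
])

trainer = trainers.BpeTrainer(
    vocab_size=100_000,                                  # placeholder vocab size
    special_tokens=["<|endoftext|>"],                    # placeholder special token
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet() # seed with all 256 bytes
)

# Any iterator of strings works; in practice this would be a STEM/code-heavy corpus.
corpus = ["3x^2 + 10", "def f(x): return sin(2*x) + exp(x)"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("3x^2 + 10").tokens)
```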
🍞 Hook: You know how remembering chapter 1 helps you solve a mystery in chapter 15? 🥬 The Concept: Long-Context Support lets the model hold and use very long inputs (up to 128k tokens). How it works:
- Architect the model to accept long sequences.
- Use a memory-efficient attention (MLA) so the KV cache stays small.
- Train progressively at 8k → 32k → 128k so skills grow smoothly. Why it matters: Without long context, agents forget earlier clues and break multi-step tasks. 🍞 Anchor: The model can track a whole research session or a large codebase without losing the plot.
🍞 Hook: Picture a conductor guiding many musical lines while keeping the performance light and tight. 🥬 The Concept: Multi-Latent Attention (MLA) is an efficient attention design that compresses memories while preserving rich interactions. How it works:
- Use low-rank projections to shrink the key/value cache.
- Keep wider intermediate projections to boost expressiveness.
- Maintain dense compute for on-device speed (no MoE I/O overheads). Why it matters: Without MLA, you spend too much memory or lose accuracy; MLA gives small models big thinking power. 🍞 Anchor: Youtu-LLM outperforms similar-sized GQA models on broad tasks with less memory.
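The sketch below illustrates the low-rank key/value compression idea behind MLA in PyTorch. The dimensions are toy values, and real MLA also uses a decoupled rotary-position path that this sketch omits; treat it as an illustration of the caching trick, not the paper's exact module.

```python
# Simplified Multi-Latent Attention: keys/values are reconstructed from a small
# shared latent, so only the latent (not full K/V) needs to be cached at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, d_model=2048, n_heads=16, head_dim=128, kv_latent_dim=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        # Down-project hidden states into a small shared latent (this is what gets cached).
        self.kv_down = nn.Linear(d_model, kv_latent_dim, bias=False)
        # Up-project the latent back into per-head keys and values at attention time.
        self.k_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)                              # (b, t, kv_latent_dim)
        if latent_cache is not None:                          # extend the compressed cache
            latent = torch.cat([latent_cache, latent], dim=1)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                       # cache the latent, not K/V

# Per token the cache holds kv_latent_dim floats instead of 2 * n_heads * head_dim
# (256 vs 4096 in this toy configuration), a roughly 16x smaller KV cache.
x = torch.randn(1, 8, 2048)
attn = SimplifiedMLA()
y, cache = attn(x)
```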
🍞 Hook: Think of a treasure map that shows every stop you'll make before you arrive. 🥬 The Concept: Trajectory Data Construction creates step-by-step records (analysis → plan → action → reflection → summary). How it works:
- Generate or collect tasks (math, code, research, tools).
- Synthesize or rewrite clean trajectories with checks and verifiers.
- Keep both good and (carefully branched) failed paths to teach recovery. Why it matters: Without trajectories, the model sees only answers, not how to get there. 🍞 Anchor: A coding log shows where you looked, the bash commands you ran, test results, and how you fixed errors.
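To make the record format concrete, here is a minimal sketch of what one trajectory entry might look like as Python data structures. The field names (`phase`, `tool_call`, `verified`, and so on) and the example content are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical data structures for a single agentic trajectory record.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    phase: str                          # "analysis" | "plan" | "action" | "reflection" | "summary"
    content: str                        # the model-visible text for this phase
    tool_call: Optional[dict] = None    # e.g. {"name": "bash", "args": {...}} when acting
    tool_result: Optional[str] = None   # environment feedback, if an action was taken

@dataclass
class Trajectory:
    task: str                           # the original problem statement
    domain: str                         # "math" | "code" | "research" | "tool_use"
    steps: List[Step] = field(default_factory=list)
    verified: bool = False              # did an automatic verifier accept the outcome?

traj = Trajectory(
    task="Fix the failing unit test in utils/date_parse.py",
    domain="code",
    steps=[
        Step("analysis", "The traceback points to a timezone handling bug."),
        Step("plan", "1) Reproduce the failure 2) Patch the parser 3) Re-run tests."),
        Step("action", "Run the test suite.",
             tool_call={"name": "bash", "args": {"cmd": "pytest -x"}},
             tool_result="1 failed: test_parse_utc_offset"),
        Step("reflection", "Failure confirms the offset sign is flipped; patch and retest."),
        Step("summary", "Patched the offset sign; all tests now pass."),
    ],
    verified=True,
)
```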
🍞 Hook: In school, you don't start with calculus; you build from basics. 🥬 The Concept: Curriculum Learning teaches in stages that grow harder over time. How it works:
- Stage 1: Commonsense and web/encyclopedia.
- Stage 2: Heavier STEM and coding.
- Stage 3: Long context; Stage 4: Agentic trajectories. Why it matters: Jumping to hard tasks too soon confuses the model; a ladder builds real skill. 🍞 Anchor: Like leveling up in a video game: each world prepares you for the next boss.
🍞 Hook: Imagine learning chess from games that show the thinking behind each move. 🥬 The Concept: Agentic Mid-training adds a focused dose of trajectories so the model internalizes planning and reflection. How it works:
- Start after general skills and long-context are in place.
- Feed diverse trajectories (math/code/research/tools) with verifiers.
- Follow with stable post-training (SFT, then RL with FP16 and consistent sampling) to lock the behaviors in. Why it matters: Without this phase, small models stay good talkers but weak doers. 🍞 Anchor: After this training, the model improves greatly on SWE-Bench-Verified and GAIA.
🍞 Hook: You know how good study notes highlight the main ideas instead of rambling? 🥬 The Concept: Agentic-CoT is a cleaned-up reasoning format (analysis → plan → action → reflection → summary). How it works:
- Rewrite long, messy chains of thought into clear sections.
- Tag each step so it's easy to learn and evaluate.
- Keep the logic, cut the fluff. Why it matters: Without structure, the model learns to overthink and repeat. 🍞 Anchor: For a math problem, it writes a plan, calculates, checks, and then summarizes the answer.
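As a toy illustration of the "tag each step" idea, the sketch below checks that a rewritten trace contains the five sections in the expected order. The `<analysis>...</analysis>` tag syntax is a hypothetical stand-in; the paper does not specify its exact markup.

```python
# Toy validator for a structured Agentic-CoT trace; the tag names are hypothetical.
import re

PHASES = ["analysis", "plan", "action", "reflection", "summary"]

def split_agentic_cot(trace: str) -> dict:
    """Return {phase: text}, raising if a phase is missing or out of order."""
    sections, last_end = {}, 0
    for phase in PHASES:
        match = re.search(rf"<{phase}>(.*?)</{phase}>", trace, flags=re.DOTALL)
        if match is None or match.start() < last_end:
            raise ValueError(f"missing or out-of-order section: {phase}")
        sections[phase] = match.group(1).strip()
        last_end = match.end()
    return sections

example = """
<analysis>The question asks for the roots of 3x^2 + 10x + 3.</analysis>
<plan>Apply the quadratic formula, then verify by substitution.</plan>
<action>x = (-10 ± sqrt(100 - 36)) / 6 = (-10 ± 8) / 6, so x = -1/3 or x = -3.</action>
<reflection>Substituting x = -3 gives 27 - 30 + 3 = 0, so the roots check out.</reflection>
<summary>The roots are x = -1/3 and x = -3.</summary>
"""
print(split_agentic_cot(example)["summary"])
```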
Before vs After:
- Before: Small models often matched tone and format but failed at long, multi-step tasks.
- After: With staged training and trajectories, a 2B model can plan, act, and reflect competitively with bigger models on agent benchmarks.
Why it works (intuition):
- The tokenizer and MLA let the model "see" precise chunks (code/math) over long contexts efficiently.
- The curriculum grows difficulty at the right time, so skills compound.
- Trajectories expose the hidden steps of expert behavior; RL and verifiers lock them in.
Building blocks:
- STEM-oriented tokenizer; 128k context; MLA attention.
- Massive, filtered commonsense + STEM/code corpora.
- 200B tokens of agentic trajectories (math, code, deep research, tool use) with Agentic-CoT.
- Stable post-training (SFT + RL) with FP16 and consistent sampling.
03 Methodology
At a high level: Inputs (curated text + STEM/code + agentic trajectories) → Tokenizer + MLA architecture setup → Multi-stage pre-training (Commonsense → STEM/Code → Long Context → Agentic mid-training) → Post-training (SFT, then RL with verifiers) → A lightweight model that plans, acts, and reflects. Step-by-step details:
- Tokenizer engineering (what happens): The team trains a byte-level BPE tokenizer with strict pre-tokenization so digits, code, and CJK scripts don't get mashed together. They keep digits atomic (0-9), add focused Chinese tokens, and then add specialized math/code tokens. Why this step exists: If tokens are messy, math and code blow up in length and meaning gets fuzzy, which weakens reasoning. Example: "sin(2x)+e^x" stays compact and semantically clear, improving math compression by ~10% vs common baselines in their tests.
- Architecture: MLA for efficient long attention (what happens): The model is a 1.96B-parameter dense transformer using Multi-Latent Attention. MLA reduces KV-cache size via low-rank projections while keeping expressiveness with larger intermediate projections. Why this step exists: It enables long-context (128k) reasoning on tight memory, essential for agent logs and large repos. Example: Compared to a 1B GQA baseline, an MLA 1B model trained from scratch averaged better perplexity and benchmark scores across English/Chinese.
- Data curation and filtering (what happens): From >10T raw tokens, they deduplicate, classify by 11 domains and 10 quality criteria, score with a small classifier trained to 95%+ agreement, and decontaminate STEM/code data against benchmark test sets. Why this step exists: Noisy data can teach bad habits; leaks inflate scores unfairly. Example: 80B high-quality filtered samples beat 100B raw samples in fewer training steps.
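As an illustration of the decontamination idea (not the authors' exact pipeline), the sketch below drops a training document if it shares any long word n-gram with a benchmark test item; the paper reports 32-gram matching for its SFT cleaning, while the smaller n and strings here are placeholders for the toy demo.

```python
# Illustrative n-gram decontamination against benchmark test items.
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(test_items, n: int) -> set:
    index = set()
    for item in test_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(document: str, index: set, n: int) -> bool:
    return not ngrams(document, n).isdisjoint(index)

test_items = ["What is the integral of x^2 from 0 to 1"]   # placeholder benchmark item
index = build_index(test_items, n=8)                        # n=8 only for this toy demo
corpus = [
    "Some unrelated web page text about cooking pasta.",
    "Practice question: what is the integral of x^2 from 0 to 1 with solution",
]
clean = [doc for doc in corpus if not is_contaminated(doc, index, n=8)]
print(len(clean))   # 1: the overlapping document is removed
```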
- Multi-stage pre-training (the curriculum):
- Stage 1 (Commonsense): 8.16T tokens at 8k context, mostly web/encyclopedia; warmup then standard LR schedule.
- Stage 2 (STEM/Code): Increase STEM+code to ~60% while keeping peak LR to build technical muscles.
- Stage 3 (Long Context): Extend to 32k then 128k while decaying LR; slightly more STEM/Code to stabilize long-horizon skills.
- Stage 4 (Agentic Mid-training): Shift data to ~60% agentic trajectories and decay LR to 1e-7. Why this step exists: Skills ladder up; long-context first helps the model track multi-step agent traces. Example: Placing agent data after long-context training yielded larger gains than the reverse.
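To make the staging concrete, here is how such a schedule might be written down as configuration. Only the context lengths, the roughly 60% shares, and the final 1e-7 learning rate come from the description above; the other mixture ratios and learning rates are simplified placeholders.

```python
# Illustrative multi-stage pre-training schedule; most numbers are placeholders.
STAGES = [
    {"name": "commonsense",      "context_len": 8_192,
     "mix": {"web_encyclopedia": 0.80, "stem_code": 0.20},
     "peak_lr": 3e-4},   # warmup then standard schedule (placeholder LR)
    {"name": "stem_code",        "context_len": 8_192,
     "mix": {"web_encyclopedia": 0.40, "stem_code": 0.60},
     "peak_lr": 3e-4},   # keep peak LR while raising the STEM/code share to ~60%
    {"name": "long_context",     "context_len": 131_072,   # extended via 32k then 128k
     "mix": {"web_encyclopedia": 0.35, "stem_code": 0.65},
     "peak_lr": 1e-4},   # decaying LR (placeholder value)
    {"name": "agentic_midtrain", "context_len": 131_072,
     "mix": {"agentic_trajectories": 0.60, "stem_code": 0.25, "web_encyclopedia": 0.15},
     "peak_lr": 1e-7},   # LR decayed to 1e-7 as described above
]

for stage in STAGES:
    assert abs(sum(stage["mix"].values()) - 1.0) < 1e-9   # each mixture must sum to 1
```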
- Agentic trajectory construction (the secret sauce):
- Agentic-CoT: Rewrite messy chains into analysis → plan → action → reflection → summary; 25B tokens across domains.
- Math trajectories (20B): A planning-action-feedback agent uses 11 atomic math skills (like symbol recognition, theorem use, self-reflection) to solve and verify problems; long and ultra-long (32k-128k) traces are included.
- Code trajectories (70B): Scale tasks (SWE-gym, OpenHands, R2E-Gym), contexts (long-tail repos with static, verifiable tasks), and actions (branch at critical edit/test steps; use one-step variations; evaluate after branch) to teach exploration, editing, testing, and reflection.
- Deep Research (60B): Closed-ended multi-hop QA with diverse frameworks and perturbed search; open-ended reports via forward "think twice" research and inverse synthesis from expert documents using citation graphs; plus atomic skills like planning, summarization, reading comprehension.
- Tool-use & planning (25B): Build a tool graph, synthesize multi-turn user-assistant dialogues that follow dependencies, verify JSON/XML formats and execution feasibility, then augment negatives (e.g., missing tools) to teach recovery. Why this step exists: Exposing the full "how" trains planning and reflection, not just answers. Example: In code, reusing failed trajectories safely (single-step branches) boosts diversity without propagating errors.
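To illustrate the format-and-dependency verification step for tool-use data, here is a toy checker for a multi-turn tool-call trace. The trace format, tool names, and dependency graph are all hypothetical stand-ins, not the paper's verifier.

```python
# Toy verifier for synthesized tool-use dialogues: every tool call must parse,
# reference a known tool, and respect the tool graph's dependencies.
import json

TOOL_GRAPH = {                      # tool -> tools that must have been called first
    "search_flights": [],
    "book_flight": ["search_flights"],
    "send_confirmation": ["book_flight"],
}

def verify_trace(trace_json: str) -> bool:
    calls = json.loads(trace_json)          # e.g. [{"tool": "...", "args": {...}}, ...]
    seen = set()
    for call in calls:
        tool, args = call.get("tool"), call.get("args")
        if tool not in TOOL_GRAPH or not isinstance(args, dict):
            return False                    # unknown tool or malformed arguments
        if any(dep not in seen for dep in TOOL_GRAPH[tool]):
            return False                    # dependency violated (called too early)
        seen.add(tool)
    return True

good = '[{"tool": "search_flights", "args": {"to": "SFO"}}, {"tool": "book_flight", "args": {"id": 1}}]'
bad = '[{"tool": "book_flight", "args": {"id": 1}}]'        # books before searching
print(verify_trace(good), verify_trace(bad))                 # True False
```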
- Supervised fine-tuning (SFT):
- Data engineering: Collect broad instruction data (math, code, science, agentics, QA, role-play, creative writing, safety). Add or reconstruct clean Chain-of-Thought when needed. Clean with heuristics, teacher scoring, and 32-gram decontamination.
- Two-stage SFT: First, focus on reasoning-heavy data (math 40%, code 30%, science 20%, agentics 10%) to power up logic. Then, expand to general instructions while mixing back Stage I data to prevent forgetting. Add "think" and "non-think" control so the model can either answer briefly or expose its reasoning, switched by control tokens. Why this step exists: Align the model to follow instructions without losing reasoning muscles. Example: The model can switch to short answers for simple tasks, saving latency.
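A minimal sketch of the two-stage mixture follows. The Stage I ratios come from the text above; the Stage II split is a placeholder assumption, and the control-token mechanism is not modeled here.

```python
# Illustrative two-stage SFT data mixture; Stage II ratios are placeholders.
import random

STAGE_I_MIX = {"math": 0.40, "code": 0.30, "science": 0.20, "agentics": 0.10}
STAGE_II_MIX = {"general_instructions": 0.50, "stage_i_replay": 0.50}   # placeholder split

def sample_domain(mix: dict) -> str:
    domains, weights = zip(*mix.items())
    return random.choices(domains, weights=weights, k=1)[0]

print(sample_domain(STAGE_I_MIX), sample_domain(STAGE_II_MIX))
```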
- Reinforcement learning (RL) with verifiers and stability tweaks:
- Tasks and verifiers: Use auto-checkable math (structured answers), code execution environments and I/O inference, rubric-based complex instructions with LLM judges, and a safety reward model that favors helpful redirection over blunt refusal.
- Stability: Prefer FP16 (less drift than BF16) and consistent sampling (discard batches where rollout vs train policies drift too much) to keep on-policy learning steady. Why this step exists: RL can wobble; these guardrails keep training on track and prevent "language drift" or format breakage. Example: With FP16 + consistent sampling, math and code benchmarks avoided early plateaus and rose higher than with BF16.
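The consistent-sampling guardrail can be sketched as a simple batch filter: if the policy used for rollouts and the current training policy disagree too much on the sampled tokens, the batch is discarded. The drift metric (mean absolute log-probability gap) and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
# Illustrative "consistent sampling" filter: drop a rollout batch when the training
# policy's log-probs drift too far from the log-probs recorded at rollout time.
import torch

def keep_batch(rollout_logprobs: torch.Tensor,
               train_logprobs: torch.Tensor,
               max_drift: float = 0.1) -> bool:
    """Both tensors hold per-token log-probs of the sampled actions, shape (tokens,)."""
    drift = (train_logprobs - rollout_logprobs).abs().mean().item()
    return drift <= max_drift

# Usage inside a training loop (sketch):
rollout_lp = torch.tensor([-1.20, -0.85, -2.10])   # recorded by the rollout worker
train_lp = torch.tensor([-1.22, -0.90, -2.05])     # recomputed by the trainer
if keep_batch(rollout_lp, train_lp):
    pass  # run the policy-gradient update on this batch
else:
    pass  # discard: rollout and train policies have drifted apart
```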
Secret sauce summary:
- A STEM-friendly tokenizer + MLA attention + 128k context = efficient, long-horizon backbone.
- A spiral curriculum ending in massive, verified trajectories = native planning and reflection.
- Clean SFT + stable RL = polish without breaking the core skills.
04 Experiments & Results
What they tested and why: The team measured general knowledge (MMLU variants, MLQA), STEM (GSM8K, MGSM-Zh, MATH, BBH, GPQA-MC, HLE-MC), coding (MBPP, HumanEval, LiveCodeBench, CRUXEval, RepoBench), and long context (LongBench v2, NIAH). For "agentic-ness," they used APTBench for base models (code, deep research, math, tools) and full agent benchmarks for instruct models (GAIA, xbench, SWE-Bench-Verified, EnConda-Bench, BFCL, τ-bench). This mix shows both raw knowledge and real planning/action power.
Against whom: They compared Youtu-LLM 2B to similarly small or larger popular open models: Qwen3 1.7B & 4B, SmolLM3 3B, Gemma3 4B, and DeepSeek-R1 distill variants.
Scoreboard with context:
- General base model highlights (like getting an A when others got Bs): • GSM8K (8-shot) 77.6% (many 1.7-3B peers were in the 38-68% range); MATH (4-shot) 44.4% (strong for sub-2B); BBH 60.0% (competitive reasoning); GPQA-MC 33.3%. • Coding: MBPP+ 81.8% and HumanEval 64.6%, notable for 2B. LiveCodeBench v6 9.7% (tough dataset; still ahead of many peers). • Long context: NIAH 98.8% recall and solid LongBench v2.
- Instruct model highlights (like outscoring bigger classmates on hard quizzes): • Coding: HumanEval 95.9%, HumanEval+ 89.0%, MBPP 85.0%, MBPP+ 71.7%, excellent for a 2B model. • Instruction/Text Reasoning: IFEval 81.2%, DROP 86.7% (strong discrete reasoning). • STEM: MATH-500 93.7%; AIME'24 65.4% and AIME'25 49.8% zero-shot, impressive for the size.
- Agent benchmarks (the real "agent exams"): • Deep Research: GAIA 33.9% vs 25.5% for a 4B baseline, like jumping from a B- to a solid B+/A- on a hard open-book test. • Code agents: SWE-Bench-Verified 17.7% vs 5.7% for a 4B baseline, over 3x higher resolve rate; EnConda-Bench 21.5% (top among peers listed). • Tools: τ-bench 15.0% (up from 10.9% for a 4B baseline), BFCL V3 58.0% (competitive).
Surprising findings:
- Long-context first, then agent data works better: Training on long context before agentic trajectories gave bigger gains, likely because the model can better track multi-step histories.
- FP16 > BF16 for RL stability: Using FP16 reduced rollout-vs-train drift and lifted math/code accuracy curves beyond BF16.
- Don't throw away failures: Branching failed code trajectories at key actions (edit/test) and adding a quick evaluation step provided diversity without poisoning the data.
- Fast early returns from agent data: The first ~34B agentic tokens gave a big jump on APTBench; gains continued with more data in a clean logarithmic trend.
What it means: For many real tasks, this 2B model performs as well as or better than 3-4B peers, and even beats some bigger baselines on agent tasks. The numbers say the design choices (tokenizer, MLA, curriculum, trajectories, and stabilized RL) combine to make small models truly agentic.
05 Discussion & Limitations
Limitations:
- Still behind the largest proprietary models: While outstanding for its size, there's a gap to the very top models on the hardest agent tasks.
- Latency from "thinking": Agentic-CoT and long trajectories can slow responses; the think/non-think control helps, but efficiency on ultra-long tasks is still challenging.
- Text-only scope: Current training focuses on text; multimodal perception (images, audio, UI understanding) isn't included yet.
- Tool ecology variance: Real-world APIs differ; being great on benchmark toolsets may not perfectly transfer to every company's tools without adaptation.
- Language coverage: English and Chinese are strong; other languages are less emphasized in the tokenizer and corpora.
Required resources:
- Training: Large-scale data pipelines, verifiers, and substantial compute to process ~10T+ tokens and long contexts.
- Serving: Enough memory for 128k context windows (even with MLA efficiencies) and fast storage for tool/trajectory prompts.
When not to use:
- Ultra-specialized domains with scarce or private tools/data and zero opportunity to fine-tune.
- Strict real-time settings where any multi-step reasoning latency is unacceptable.
- Tasks demanding grounded perception beyond text (e.g., interpreting complex UIs, images, or sensor streams) until multimodal extensions arrive.
Open questions:
- Best curriculum schedule: Could different stage lengths or data mixes yield even better agent growth curves?
- Theory of agentic pre-training: Can we formalize why trajectories with reflection transfer so well to unseen agent tasks?
- Better math-agent benchmarks: We need standardized agentic math tests for instruct models, not just final-answer accuracy.
- Efficient "thinking": How to keep accuracy while shortening chains of thought (distill, prune, or reason-with-verifiers)?
- Multimodal agents: How to blend text, code, and perception with the same lightweight efficiency?
06 Conclusion & Future Work
In three sentences: Youtu-LLM shows that a 1.96B model can develop native agent skills by pairing an efficient long-memory backbone (MLA + 128k context) with a carefully staged curriculum that ends in massive, verified agent trajectories. With a STEM-smart tokenizer, structured Agentic-CoT, and stable SFT+RL, it rivals or surpasses much larger peers on many agent benchmarks in code, deep research, and STEM. The results redefine what "small" models can do, making fast, affordable, on-device agents far more practical.
Main achievement: Proving that true agentic capability can be induced in a lightweight model via principled pre-training, especially through scalable, high-quality trajectory data, and that this transfers strongly to real agent benchmarks.
Future directions: Build multimodal versions; compress reasoning without losing accuracy; expand language coverage; evolve into "world models" that simulate environments for stronger planning and tool grounding.
Why remember this: It marks a turning point where small models stop being just polite parrots and start becoming capable doers that plan, act, and reflect natively, bringing powerful AI within reach of everyday devices and workflows.
Practical Applications
- On-device coding assistant that searches, edits, tests, and patches code across large repositories.
- Research copilot that plans multi-hop searches, checks citations, and drafts balanced reports.
- STEM tutor that shows analysis → plan → action → reflection for math and science problems.
- Internal enterprise agent that summarizes and verifies long documents with privacy preserved.
- Customer support tool that plans troubleshooting steps, calls tools/APIs, and reflects on failures.
- Data-cleaning and analytics helper that writes queries (Text2SQL), validates outputs, and refines results.
- DevOps assistant that drafts runbooks, executes safe shell commands, and reports test outcomes.
- Spreadsheet/Excel agent that plans multi-step transformations and explains each change.
- Legal/finance briefing generator that compiles, cross-checks, and cites sources in long contexts.
- Education content creator that builds graded exercises with step-by-step solutions and reflections.