
Olmo 3

Beginner
Team Olmo: Allyson Ettinger et al. · 12/15/2025
arXiv · PDF

Key Summary

  • Olmo 3 is a family of fully-open AI language models (7B and 32B) where every step—from raw data to training code and checkpoints—is released.
  • The team trains in three base stages (pretraining, midtraining, long-context extension) and then post-trains into Think (shows its reasoning), Instruct (short, direct answers and function calling), and RL-Zero (pure RL from base) models.
  • A huge, carefully cleaned data pipeline (Dolma 3) plus new evaluation tools (OlmoBaseEval) guide what to learn and when—especially for math, code, QA, and long-context reading.
  • Smart data tricks like global deduplication, quality-aware upsampling, and topic balancing make the most of training tokens.
  • Long-context ability (up to ~65K tokens) is unlocked with millions of science PDFs and training changes like sliding window attention.
  • Preference tuning with Delta Learning and reinforcement learning with verifiable rewards (OlmoRL) make the Think and Instruct models stronger and more reliable.
  • Olmo 3 Base is the strongest fully-open base at its sizes, and Olmo 3.1 Think 32B is the strongest fully-open thinking model, nearing top open-weight models while using about 6× fewer tokens.
  • They release a clean RL-Zero setup (data, code, checkpoints) so researchers can study RL effects without hidden data contamination.
  • Results also reveal surprises: special chat tokens can hurt midtraining, decontamination doesn’t always boost scores, and merging model checkpoints (souping) can improve performance.

Why This Research Matters

Olmo 3 proves that we can build powerful language and reasoning models while making the entire process transparent and reusable. This boosts trust—teachers, startups, and scientists can check how the model was trained and adapt it safely. Long-context understanding means better help with long emails, reports, and PDFs many people handle every day. Stronger math and coding skills power better tutoring and development tools that explain their steps. Clean RL-Zero releases let researchers test ideas fairly and avoid misleading gains from data leakage. Overall, Olmo 3 is a practical path toward open, reliable AI that communities can learn from and improve together.

Detailed Explanation


01 Background & Problem Definition

You know how a good school report doesn’t just show your final grade, but also your homework, quizzes, and teacher notes so you can see how you learned? For a long time, many AI models only shared the final grade: just the end weights. That made it hard for everyone else to learn from the full journey.

Here’s the world before Olmo 3. Open-weight models often shared their final parameters, but not the full pathway: training data, cleaning steps, mixing choices, intermediate checkpoints, and evaluation decisions. That meant two big problems. First, it was hard to reproduce or extend the work carefully, since people couldn’t see what actually shaped the model’s skills. Second, researchers couldn’t easily trace why a model said something—especially for step-by-step reasoning—because they didn’t know what data it learned from or which training choices mattered most.

At the same time, models struggled with a few tricky needs that show up in real life: reading very long documents, solving math step by step, writing and fixing code, and following instructions tightly (including calling tools and functions cleanly). And even when teams tried to improve these, development was often slow or noisy. Tiny models don’t always show clear signals on hard benchmarks, and contamination (training on evaluation examples) can inflate scores. So people sometimes chased ghosts: changes that looked helpful on paper but didn’t actually build reliable skills.

🍞 Top Bread (Hook): Imagine a huge library that keeps getting the same book many times, with typos and repeats—finding the best pages to learn from is tough. 🥬 Filling (The Actual Concept - Data processing): What it is: Data processing is how we clean, sort, and select text so the model learns from the best material. How it works (step by step):

  1. Extract usable text from messy web pages and scanned PDFs.
  2. Remove duplicates and near-duplicates so we don’t over-count repeats.
  3. Classify topics (like science, software, health) and score quality.
  4. Mix topics with a guided recipe and upsample the highest-quality slices.
  5. Decontaminate by removing overlaps with eval benchmarks.

Why it matters: Without careful processing, models memorize junk, miss rare skills, and can look better on tests just because they saw the test in training.

🍞 Bottom Bread (Anchor): For Olmo 3’s Dolma 3, they deduplicate trillions of tokens, score quality by topic, and upsample the best bits so math and code get real lift without overfitting.
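
To make the deduplication idea concrete, here is a minimal Python sketch of near-duplicate removal using word shingles and Jaccard similarity. It is an illustration only: the shingle size, the 0.8 threshold, and the tiny corpus are assumptions, and Dolma 3’s real pipeline operates on trillions of tokens with far more sophisticated tooling.

```python
# Minimal sketch of near-duplicate removal, assuming a simple shingle-based
# Jaccard check; the real Dolma 3 pipeline is far larger and more careful.

def shingles(text: str, n: int = 5) -> set:
    """Split text into lowercase n-word shingles for near-duplicate checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to anything kept so far."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept

corpus = [
    "The cat sat on the mat and watched the rain fall outside the window today",
    "The cat sat on the mat and watched the rain fall outside the window today!",  # near-duplicate
    "Gradient descent updates parameters in the direction that reduces the loss",
]
print(len(dedup(corpus)))  # 2: the near-duplicate is dropped
```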

Researchers tried a bunch of fixes before. Some used synthetic data (AI-made problems), but didn’t show how it was built or filtered. Others tuned models with supervised finetuning (SFT) alone, which helps, but can plateau on tough reasoning. Some extended context windows, but with long, expensive training and unclear recipes. And many relied on closed or partly closed datasets, making it hard to tell if gains were due to smarter choices or lucky data leaks.

The missing piece was a fully-open model flow: a complete, transparent recipe covering every stage, every checkpoint, every dataset, and all the code. That would let the community audit, customize, and fairly compare ideas—like a shared lab notebook everyone can learn from.

🍞 Top Bread (Hook): Think about reading a giant mystery novel and remembering people, places, and clues across hundreds of pages. 🥬 Filling (The Actual Concept - Long-context reasoning): What it is: Long-context reasoning lets the model understand and use very long inputs (tens of thousands of tokens) to solve problems. How it works (step by step):

  1. Extend the model’s context window (up to ~65K tokens here).
  2. Train on real long documents (millions of science PDFs).
  3. Use attention tricks (like sliding windows) so compute stays manageable.
  4. Practice reading, recalling, and tying distant details together.

Why it matters: Without it, the model loses track of earlier parts and flubs tasks like long reports, legal docs, or multi-step plans.

🍞 Bottom Bread (Anchor): Olmo 3 can take a 100-page science PDF, keep track of definitions, and answer detailed questions that depend on page 2 and page 86.

The real-life stakes are big. Long emails, instructions, and manuals become answerable. Coding helpers that read a whole repository can finally fix bugs cleanly. Math helpers can show their work, which builds trust for tutoring. And because the full flow is open, schools, startups, and scientists can tailor models to their exact needs, responsibly and reproducibly.

🍞 Top Bread (Hook): You know how some instructions are short and snappy ("Set a timer for 10 minutes") while others need deeper planning ("Show your steps solving this fraction problem")? 🥬 Filling (The Actual Concept - Instruction following): What it is: Instruction following is when the model understands and correctly does what you ask. How it works (step by step):

  1. Train with many examples of requests and good responses.
  2. Prefer answers that match tone, format, and constraints (e.g., be brief).
  3. Reinforce patterns that users like.

Why it matters: Without it, even smart models ramble, ignore formats, or miss the point.

🍞 Bottom Bread (Anchor): Olmo 3 Instruct answers concisely and uses function calls when appropriate instead of writing long essays.

🍞 Top Bread (Hook): Pressing the "start" button on a washing machine triggers a specific program. 🥬 Filling (The Actual Concept - Function calling): What it is: Function calling lets the model call precise tools (like search or a calculator) with structured outputs. How it works (step by step):

  1. Learn schemas for each tool (names, arguments, types).
  2. When a request needs a tool, format a correct call.
  3. Read tool results and finish the user answer.

Why it matters: Without it, the model guesses instead of reliably using the right tool.

🍞 Bottom Bread (Anchor): Ask for "weather in Paris tomorrow"—the model calls the weather API with the right city/date and returns a crisp forecast.
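
Here is a hedged sketch of that function-calling loop in Python. The `get_weather` tool, its schema, and the stubbed `fake_model` are made up for illustration; they are not part of the Olmo 3 release or its actual tool-call format.

```python
# Sketch of function calling: the model (stubbed here) emits a structured call
# matching a registered tool schema, and the application executes it.
import json

TOOLS = {
    "get_weather": {
        "description": "Look up a weather forecast",
        "arguments": {"city": "string", "date": "string"},
    }
}

def fake_model(user_request: str) -> str:
    """Stand-in for the Instruct model: returns a JSON tool call as text.
    (A real model would condition on the request.)"""
    return json.dumps({"tool": "get_weather",
                       "arguments": {"city": "Paris", "date": "tomorrow"}})

def get_weather(city: str, date: str) -> str:
    return f"Forecast for {city} ({date}): 14°C, light rain."  # canned result

def run(user_request: str) -> str:
    call = json.loads(fake_model(user_request))
    assert call["tool"] in TOOLS, "model must only call registered tools"
    result = get_weather(**call["arguments"])
    return f"Here is the forecast you asked for: {result}"

print(run("What's the weather in Paris tomorrow?"))
```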

02 Core Idea

The “aha!” in one sentence: Open the entire model flow and use a targeted, staged curriculum plus smarter preference and RL training to build top-tier reasoning and instruction-following models with fewer tokens—and make the whole process reproducible.

Three analogies:

  1. Recipe book vs. just the cake: Olmo 3 releases the whole cookbook (ingredients, steps, oven times, taste tests), not just the finished cake.
  2. School ladder: Start with broad reading (pretraining), add focused tutoring (midtraining), then practice long essays (long-context), and finally coaching for showing work or being concise (post-training).
  3. Workshop with power tools: Base model is the sturdy workbench; Think adds rulers and checklists (reasoning traces), Instruct adds power tools (function calling), and RL-Zero is the training course where you only pass by proving your work.

Before vs. after:

  • Before: Open-weight models rarely showed their full data and training journey; reasoning boosts were costly and hard to verify; long-context recipes varied wildly.
  • After: Olmo 3 publishes every stage (data pools, mixes, code, checkpoints), provides compute-efficient evaluations, and demonstrates strong Think and Instruct models—with long-context—using carefully curated data and RL with verifiable rewards.

Why it works (intuition):

  • If you clean and balance data by topic and quality, you avoid wasting tokens on repeats and spam while feeding the model exactly the skills you want (math/code/QA).
  • If you measure with stable, clustered benchmarks, you can steer the model confidently—even at small scales—without chasing noise.
  • If you post-train with contrastive preferences (Delta Learning) and RL that can auto-check answers, the model learns to reason better and format outputs correctly.
  • If you open everything, the community can debug, improve, and adapt the recipe.

Building blocks (small pieces that click together):

  • Data pools and processing: trillions of cleaned, deduplicated tokens; topic/quality labels; decontamination; quality-aware upsampling.
  • Staged base training: pretraining (broad skills), midtraining (math/code/QA + seeds for instructions and thinking), long-context extension (science PDFs + synthetic long tasks) up to ~65K tokens.
  • Architecture tweaks: sliding window attention (most layers) to keep long sequences affordable, full attention at the last layer for global tie-together.
  • Evaluation design: OlmoBaseEval clusters (Math, Code, MC STEM/Non-STEM, GenQA, FIM) with proxy metrics (bits-per-byte) so small-scale experiments show signal.
  • Post-training paths:
    • Think: SFT (learn to show steps), DPO/Delta Learning (prefer better reasoning), RL with verifiable rewards (math, code, some chats) for stable gains.
    • Instruct: SFT for short, helpful answers and function calling, preference tuning for brevity, then RL to polish correctness without losing concision.
    • RL-Zero: Clean, fully-open RL starting from base models to isolate the true effect of RL (no hidden training data leaks).

Concept sandwiches continued: 🍞 Top Bread (Hook): Practicing with a coach who shows examples, then corrects you. 🥬 Filling (The Actual Concept - Supervised Finetuning, SFT): What it is: SFT teaches with example inputs and gold answers so the model copies the right patterns. How it works (step by step):

  1. Feed question → great answer pairs (including full reasoning traces for Think).
  2. Nudge the model to imitate those answers.
  3. Repeat over many domains (math, code, chat).

Why it matters: Without SFT, the model may not pick up reliable formats or multi-step habits.

🍞 Bottom Bread (Anchor): On math, SFT teaches the model to write the scratch work before the final result.
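
As a rough illustration of the SFT objective, the sketch below computes next-token cross-entropy only on the answer span, masking the prompt tokens. The random tensors and the tiny vocabulary are placeholders, not the Olmo training code.

```python
# Minimal sketch of the SFT loss: cross-entropy over answer tokens only.
import torch
import torch.nn.functional as F

vocab, seq_len = 100, 6
logits = torch.randn(1, seq_len, vocab)            # model outputs (batch=1)
targets = torch.randint(0, vocab, (1, seq_len))    # gold next tokens
loss_mask = torch.tensor([[0, 0, 0, 1, 1, 1]])     # 0 = prompt, 1 = answer

per_token = F.cross_entropy(
    logits.view(-1, vocab), targets.view(-1), reduction="none"
).view(1, seq_len)

sft_loss = (per_token * loss_mask).sum() / loss_mask.sum()
print(float(sft_loss))
```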

🍞 Top Bread (Hook): Two essays are turned in; the teacher marks which one is better and why. 🥬 Filling (The Actual Concept - Delta Learning): What it is: Preference tuning (Delta Learning) trains the model to pick the better of two responses. How it works (step by step):

  1. Collect pairs: one preferred, one less preferred answer to the same prompt.
  2. Train the model to score and produce the preferred style.
  3. Focus the differences on what matters (conciseness, accuracy, correct tool calls).

Why it matters: Without preferences, the model may be correct but wordy, or formatted wrong.

🍞 Bottom Bread (Anchor): When a user asks for a short recipe, the model learns to choose the brief, well-structured version over a rambling one.
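
The sketch below shows a standard DPO-style preference loss on a single (chosen, rejected) pair, which is one piece of this stage; the Delta Learning recipe in the paper additionally controls how those pairs are constructed. The scalar log-probabilities are invented for the example.

```python
# Hedged sketch of a DPO-style preference loss over one (chosen, rejected) pair.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Push the policy to widen its margin over the reference on the chosen answer."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The policy already prefers the chosen answer slightly more than the reference does:
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_chosen=-13.0, ref_rejected=-14.0))
```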

🍞 Top Bread (Hook): Earning points by solving puzzles that can be checked by a calculator. 🥬 Filling (The Actual Concept - Reinforcement Learning from Base, RL-Zero): What it is: RL-Zero teaches purely with rewards from auto-checkable tasks (no hidden fine-tune data), starting from the base model. How it works (step by step):

  1. The model tries answers.
  2. A program checks them (pass/fail or a score).
  3. The model updates itself to make higher-scoring choices next time.

Why it matters: Without RL-Zero, it’s hard to isolate RL’s real effect because hidden prior training can mask what changed.

🍞 Bottom Bread (Anchor): The model writes a Python function; tests verify it; passing tests earn reward; next time the model is more likely to write testable, correct code.
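
A tiny sketch of a verifiable reward for code, assuming the checker simply runs hidden tests: the candidate earns 1.0 only if every test passes. The `square` task and its tests are invented; real RLVR pipelines sandbox execution and cover more domains.

```python
# Sketch of a verifiable reward: run the model's candidate code against tests.

def reward_from_tests(candidate_src: str, tests: list[tuple[int, int]]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)            # define the candidate function
        fn = namespace["square"]
        return 1.0 if all(fn(x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0                                # crashes earn no reward

good = "def square(x):\n    return x * x\n"
bad = "def square(x):\n    return x + x\n"
tests = [(2, 4), (3, 9)]
print(reward_from_tests(good, tests), reward_from_tests(bad, tests))  # 1.0 0.0
```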

03 Methodology

At a high level: User input → Tokenize → Base training (Pretrain → Midtrain → Long-context) → Post-training (Think/Instruct/RL-Zero) → Output.

Step A: Build the base with the right data

  • What happens: The team assembles Dolma 3, a massive, cleaned data pool (web pages, code, and millions of science PDFs). They remove exact and near duplicates; classify topic and quality; and design a mixing plan that upsamples the best slices.
  • Why this step exists: Random web dumps waste tokens; repeats inflate confidence; topic imbalances starve math/code.
  • Example: In the web pool, the top 5% quality pages can be repeated more (say up to 7×) while low-quality pages are dropped—so the model sees more of the good math/code content instead of spam.
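
The sketch below illustrates quality-aware upsampling under assumed settings: documents below a quality cutoff are dropped and higher-quality ones are repeated up to a cap. The 7× cap and the cutoff values are placeholders, not the published Dolma 3 curve.

```python
# Sketch of quality-aware upsampling with an assumed policy and made-up cutoffs.

def upsample(docs: list[tuple[str, float]], max_repeats: int = 7) -> list[str]:
    """docs is a list of (text, quality in [0, 1]); returns the training stream."""
    out = []
    for text, quality in docs:
        if quality < 0.2:
            continue                               # drop likely spam
        repeats = 1 + round(quality * (max_repeats - 1))
        out.extend([text] * repeats)
    return out

pool = [("clean math derivation ...", 0.95),
        ("decent blog post ...", 0.50),
        ("spammy page ...", 0.05)]
stream = upsample(pool)
print(len(stream))  # 11: the top document is repeated 7 times, the spam page is dropped
```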

Step B: Pretraining (broad reading)

  • What happens: The model reads ~5.9T tokens (Dolma 3 Mix) to learn general language, coding syntax, facts, and styles. Architecture: decoder-only transformer with sliding window attention (SWA) in most layers and full attention at the last layer. Context is 8K tokens at this stage.
  • Why this step exists: You need a strong generalist before specializing.
  • Example: After pretraining, the model can complete paragraphs, write basic functions, and recall common facts.
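
To visualize the sliding window attention used in most layers, here is a small sketch that builds a causal sliding-window mask: each position may attend only to itself and the previous few tokens. The window of 3 is purely for display (real windows span thousands of tokens), and the final full-attention layer would simply use the plain causal mask.

```python
# Sketch of a causal sliding-window attention mask (True = attention allowed).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no attending to the future
    near = (i - j) < window                  # stay within the window
    return causal & near

print(sliding_window_mask(seq_len=6, window=3).int())
```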

Step C: Midtraining (targeted boosts)

  • What happens: The team runs a 100B-token curriculum (Dolma 3 Dolmino Mix) adding high-impact math/code datasets, QA, and a seed of instruction and thinking traces.
  • Why this step exists: SFT alone can’t make a weak base great at math/code; midtraining plants and grows those skills.
  • Example data:
    • CraneMath and MegaMath: rewritten, high-quality math web content.
    • Stack-Edu (FIM): code with infilling exercises to teach structure.
    • Meta-reasoning and program-verifiable sets: thinking steps that can be checked.
    • Tulu/Flan subsets: instruction seeds (without special chat tokens) to avoid weird output.
  • Decontamination: A two-phase n-gram scan + cluster expansion removes benchmark overlaps to prevent inflated scores.
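
Here is a simplified sketch of the first phase of n-gram decontamination: flag a training document if it shares a long word n-gram with an evaluation set. The 13-gram length is an assumption, the example is a GSM8K-style question used purely for illustration, and the paper's cluster-expansion phase is not shown.

```python
# Sketch of an n-gram decontamination scan against an evaluation benchmark.

def ngrams(text: str, n: int = 13) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, eval_index: set) -> bool:
    return bool(ngrams(train_doc) & eval_index)

eval_index = ngrams("Natalia sold clips to 48 of her friends in April and then "
                    "she sold half as many clips in May")
clean = "The mitochondria is the membrane-bound organelle that produces ATP."
leaked = ("Practice problem: Natalia sold clips to 48 of her friends in April "
          "and then she sold half as many clips in May. How many altogether?")
print(is_contaminated(clean, eval_index), is_contaminated(leaked, eval_index))  # False True
```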

Step D: Long-context extension (~65K)

  • What happens: The model extends its context window by training on very long documents from the science PDF pool and synthetic long tasks. SWA keeps compute affordable.
  • Why this step exists: Many real tasks require keeping track of tens of thousands of tokens.
  • Example: Reading a long report and answering detailed questions that refer back across sections.

Step E: Evaluation that actually guides decisions (OlmoBaseEval)

  • What happens: Tasks are clustered (Math, Code, MC STEM/Non-STEM, GenQA, FIM); noisy tasks are filtered out or weighted properly; bits-per-byte (BPB) proxy metrics are used so small models show signal.
  • Why this step exists: Small changes can look like wins due to noise; proxy metrics track real learning early.
  • Example: A new math dataset is accepted if a small proxy model shows lower BPB and the main suite shows stable gains—not just a lucky bump on one benchmark.
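
Bits-per-byte itself is easy to compute once you have token log-probabilities from a model scoring the benchmark text: convert the total negative log-likelihood to bits and divide by the UTF-8 byte length. The log-probabilities in this sketch are made up.

```python
# Sketch of the bits-per-byte (BPB) proxy metric with invented token log-probs.
import math

def bits_per_byte(token_logprobs: list[float], text: str) -> float:
    nll_bits = -sum(token_logprobs) / math.log(2)   # nats -> bits
    return nll_bits / len(text.encode("utf-8"))

text = "The derivative of x^2 is 2x."
token_logprobs = [-1.2, -0.4, -0.9, -0.3, -0.7, -1.1, -0.5, -0.2]  # fake values
print(round(bits_per_byte(token_logprobs, text), 3))
```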

Step F: Post-training paths

  • Think models (show your work):
    1. SFT on Dolci Think SFT with step-by-step traces.
    2. Preference tuning (Delta Learning/DPO) with high-quality contrasting pairs.
    3. Reinforcement learning with verifiable rewards (OlmoRL) across math, code, and some chat domains. Infrastructure optimizations keep RL stable and about 4× faster.
  • Instruct models (short, helpful, tool-using):
    1. SFT on Dolci Instruct, including plenty of function-calling schemas.
    2. Delta Learning to prefer concise, helpful responses.
    3. RL with verifiable rewards to sharpen correctness while protecting brevity.
  • RL-Zero models (clean RL from base):
    1. Start from Olmo 3 Base.
    2. Train purely with RLVR (math, code, instruction-following, general mix) from clean Dolci RL-Zero data.
    3. Release code, data, and checkpoints to let researchers fairly compare RL algorithms.
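
To tie these paths together, here is a deliberately tiny REINFORCE-style loop with a verifiable reward: sample an answer, check it programmatically, and nudge the policy toward rewarded answers. The two-answer "policy" is a toy stand-in, not the OlmoRL algorithm or infrastructure.

```python
# Schematic RL-with-verifiable-rewards loop over a toy two-answer policy.
import math, random

random.seed(0)
logits = {"4": 0.0, "5": 0.0}              # toy "policy" over two answers to 2 + 2

def sample(logits):
    answers = list(logits)
    weights = [math.exp(logits[a]) for a in answers]
    a = random.choices(answers, weights=weights, k=1)[0]
    return a, math.exp(logits[a]) / sum(weights)

def reward(answer: str) -> float:
    return 1.0 if answer == "4" else 0.0   # verifier: exact-match check

lr = 0.5
for _ in range(200):
    answer, prob = sample(logits)
    advantage = reward(answer) - 0.5                 # crude fixed baseline
    logits[answer] += lr * advantage * (1 - prob)    # nudge only the sampled logit

print(max(logits, key=logits.get))  # "4" dominates after training
```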

The secret sauce:

  • Fully open everything: data pools and mixes, training code, checkpoints each stage, evaluation code—so others can audit, extend, and improve.
  • Smarter data allocation: constrained mixing via proxy swarms (small models trained briefly) to pick topic ratios; quality-aware upsampling curves to emphasize great documents; careful decontamination.
  • Practical long-context: combine SWA with truly long documents (millions of PDFs) so the model learns to keep track over huge spans without exploding costs.
  • Preference + RL that can be checked: contrastive pairs focus style and format; verifiable tasks reward correctness, stabilizing reasoning improvements.

Concept sandwich close-out: 🍞 Top Bread (Hook): Practicing a sport with a scoreboard that instantly tells you if you scored. 🥬 Filling (The Actual Concept - RL-Zero revisited in practice): What it is: RL-Zero is research infrastructure to study RL effects cleanly, starting from base. How it works (step by step):

  1. Use only verified, decontaminated RL training sets.
  2. Score each attempt automatically.
  3. Update the model to prefer higher-scoring actions, tracking progress transparently.

Why it matters: Without a clean setup, it’s unclear if improvements come from RL or from hidden data leaks.

🍞 Bottom Bread (Anchor): Two RL setups look similar; in RL-Zero, you can prove the data didn’t include the test set, so a gain actually means better learning, not leakage.

04 Experiments & Results

The test: They measured capabilities across clustered suites—Math, Code, MCQA (split into STEM and Non-STEM), GenQA, FIM—and also checked long-context reading, chat quality, and tool-use formatting. For post-training, they tracked step-by-step reasoning gains and brevity/function-calling behavior.

The competition: Olmo 3 models are compared to leading fully-open and open-weight baselines such as Stanford Marin, Apertus, LLM360 K2-V2, OLMo 2, Qwen 2.5/3, Gemma 3, Granite 3.3, and others at similar sizes.

The scoreboard (with context):

  • Base models: Olmo 3 Base is the top fully-open base at both 7B and 32B on Math and Code composites, often with double-digit gains against other fully-open peers. It keeps pace on MCQA and GenQA and narrows gaps with strong open-weight models.
  • Think models: Olmo 3.1 Think 32B is the strongest fully-open “thinking” model reported. It approaches Qwen 3 32B Thinking on reasoning suites while using about 6× fewer training tokens, a remarkable efficiency. That’s like scoring an A while having studied far fewer pages than the classmates who scored an A+.
  • Instruct models: Olmo 3 Instruct (7B and 32B) surpasses many notable open-weight baselines on chat/function-calling style tasks and further closes the gap to top-tier open-weight models.
  • Long-context: After a relatively short long-context extension (50B tokens at 7B, 100B at 32B), Olmo 3 supports up to ~65K context and performs competitively with models known for long sequences.
  • Efficiency and stability: New RL infrastructure (OlmoRL) yields about a 4× speed-up for RL runs and sustains longer, more stable training across domains.

Surprising findings:

  • Decontamination doesn’t always behave as expected. Some benchmarks had large performance drops after removing contamination (e.g., known reading-comprehension sets), showing prior inflation; others showed little change, and one (GSM8K) even ticked up after decontamination because the contaminated training format didn’t match the tested format.
  • Special chat tokens during midtraining can harm outputs. Models began emitting those tokens at inference, tanking scores—so the team removed them until post-training.
  • Model souping (merging checkpoints) improved midtraining performance meaningfully for 32B, suggesting ensembling-like effects even before fine-tuning.
  • Domain trade-offs are real. Heavier math/code mixes boosted those areas but dented MCQA/GenQA; QA-heavy mixes didn’t help as much and also hurt math/code. The final mix balances these for strong overall results.
  • Small, high-quality meta-reasoning sets moved the needle on math/code more than expected, especially when combined with program-verifiable training and later RL.
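
On the souping observation above, checkpoint merging can be as simple as averaging parameters with the same name across checkpoints, as in the sketch below. The toy checkpoints are placeholders, and the paper's actual merging recipe may select or weight checkpoints differently.

```python
# Sketch of "souping": plain weight averaging across several checkpoints.
import torch

def soup(state_dicts: list[dict]) -> dict:
    """Average each named parameter across checkpoints."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Toy "checkpoints" with a single 2x2 weight; real use would load them with torch.load.
ckpts = [{"w": torch.ones(2, 2) * s} for s in (1.0, 2.0, 3.0)]
print(soup(ckpts)["w"])  # every entry is 2.0, the mean of the three checkpoints
```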

Costs in practical terms: About 56 days on a 1024×H100 cluster to reach Olmo 3 Think 32B, with pretraining dominating time. At a rough assumption of $2 per H100-hour, this is about $2.75M, though the paper emphasizes wall-clock time as a more honest measure of the real pipeline cost.

Bottom line: Across the full suite, Olmo 3 delivers the best fully-open bases and the strongest fully-open thinking model, while giving the community the entire flow—data, code, and checkpoints—to verify and build upon.

05 Discussion & Limitations

Limitations:

  • Compute is still significant. Although efficient designs help, training at these scales requires large GPU clusters, careful engineering, and monitoring.
  • Long-context is strong but not infinite. Extension goes to ~65K; some specialized tasks might want even longer or different memory mechanisms.
  • RL remains finicky. Post-training requires hyperparameter sweeps and infrastructure care; best-practices are improving but not yet “turnkey.”
  • Synthetic data quality varies. Rewrites and AI-generated tasks can carry biases or subtle artifacts; heavy curation helps but does not eliminate this risk.
  • English-centric focus. While sources are broad, most curation leans English/science/tech; global multilingual performance isn’t the primary focus here.
  • External model dependencies. Some synthetic data pipelines used open-weight LLMs; re-generations are permissive, but choices may influence style/outcomes.

Required resources:

  • GPUs: hundreds to a thousand+ H100-class GPUs for weeks with fast interconnects and reliable storage.
  • Data ops: high-throughput dedup, topic/quality classifiers, OCR at scale, and robust decontamination tools.
  • Eval ops: scripts to aggregate clustered metrics, proxy BPB evaluation, and decontamination-aware scoring.

When NOT to use:

  • Ultra-low-latency on-device settings where even a 7B model is too large.
  • Extremely long memory needs far beyond ~65K tokens without retrieval augmentation.
  • High-stakes domains demanding certified guarantees the paper does not claim (e.g., medical diagnosis or legal rulings without human oversight).
  • Non-English heavy deployments if you need top-tier performance across many languages out of the box.

Open questions:

  • Can RL with verifiable rewards cover more domains (beyond math/code) reliably at scale?
  • How far can constrained mixing and quality-aware upsampling go—do they generalize across more topics and languages?
  • What is the best balance between thinking traces and concise outputs for different user groups?
  • How should contamination detection evolve as benchmarks and formats change?
  • Can longer contexts (or hybrid retrieval) beat the current cost/benefit curve without harming accuracy?

06 Conclusion & Future Work

Three-sentence summary: Olmo 3 fully opens the entire lifecycle of strong language models—data, code, and checkpoints—so anyone can reproduce, audit, and improve them. By combining a smart data curriculum (pretrain → midtrain → long-context) with well-designed evaluations and post-training (SFT, Delta Learning, RL with verifiable rewards), it delivers top fully-open base models and the strongest fully-open thinking model at 32B. It also standardizes a clean RL-Zero setup so researchers can fairly test RL ideas without hidden data leaks.

Main achievement: A complete, fully-open model flow that not only reaches state-of-the-art fully-open performance but also provides the community with the artifacts and recipes needed to understand and extend every part of the system.

Future directions: Extend and diversify long-context further; broaden multilingual coverage; stabilize and generalize RLVR to more domains; refine mixing laws and decontamination for evolving benchmarks; and deepen function-calling/tool-use with richer, verifiable tasks.

Why remember this: Olmo 3 is a blueprint for transparent AI progress—showing that with clean data pipelines, thoughtful evaluations, and open artifacts, the community can build powerful, trustworthy models together and move faster than any single team working alone.

Practical Applications

  • Build a custom tutoring assistant that shows math or science reasoning steps using the Think variant.
  • Create a concise customer-support bot that follows instructions and calls internal tools/APIs with the Instruct variant.
  • Index and analyze long technical manuals or research PDFs using long-context capabilities.
  • Develop a programming copilot that writes, refactors, and tests code; verify with unit tests for higher reliability.
  • Run fair RL experiments with RL-Zero to compare algorithms on clean, decontaminated data.
  • Customize data mixes for a domain (e.g., biology or finance) using the provided data recipes and quality upsampling.
  • Audit and improve training pipelines by replaying every stage with released checkpoints and code.
  • Prototype function-calling workflows that integrate calculators, databases, and search tools via structured calls.
  • Evaluate small proxy models with OlmoBaseEval to make compute-efficient data decisions before scaling up.
  • Extend the model with your own long-context documents (e.g., company knowledge bases) for richer Q&A.
#fully-open language models#model flow#long-context reasoning#data deduplication#quality-aware upsampling#decontamination#supervised finetuning (SFT)#Delta Learning (DPO)#reinforcement learning with verifiable rewards (RLVR)#function calling#Dolma 3#OlmoBaseEval#thinking traces#sliding window attention#RL-Zero