
Recursive Language Models

Beginner
Alex L. Zhang, Tim Kraska, Omar Khattab · 12/31/2025
arXiv · PDF

Key Summary

  ‱ Recursive Language Models (RLMs) let an AI read and work with prompts that are much longer than its normal memory by treating the prompt like a big external document it can open, search, and study with code.
  ‱ Instead of stuffing the whole prompt into the model’s short context window, RLMs load the prompt into a coding workspace (a REPL) and let the model write small programs that peek at slices and call the model again on those slices.
  ‱ This approach keeps performance strong even when inputs grow to 10 million+ tokens, where normal models quickly fall apart (context rot).
  ‱ Across tough long-context tasks like OOLONG and OOLONG-Pairs, RLMs scored far higher than standard methods such as summarization agents or coding agents with retrieval.
  ‱ RLMs can also build very long answers by assembling pieces in variables, bypassing the output length limits of a single model response.
  ‱ A small fine-tuned model, RLM-Qwen3-8B, improved by 28.3% on average over its base version using only 1,000 example trajectories, showing that native RLM training is promising.
  ‱ Costs are often comparable to (or lower than) standard baselines at the median, though there can be expensive outliers due to long recursive trajectories.
  ‱ RLMs are model-agnostic: they worked with GPT-5 and Qwen3-Coder-480B, but different models make different choices about how much to sub-call and how to chunk.
  ‱ For short inputs, plain LLMs can still be faster and sometimes more accurate; RLMs shine as inputs and task complexity grow.
  ‱ Treating the prompt as an environment plus symbolic recursion is the key idea that changes how we scale AI at inference time.

Why This Research Matters

Many real problems—research across thousands of documents, scanning years of logs, or understanding huge codebases—are too big for a model’s native memory. RLMs give AI a practical way to examine massive inputs without throwing away details. This can reduce missed clues in legal, medical, and scientific work where small facts matter. It can also make AI assistants more reliable for long projects, not just short chats. As models learn to write better analysis code, they can work faster and cheaper by reading only what’s needed. Over time, training native RLMs could unlock dependable long-horizon reasoning for everyday tools.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re handed a gigantic stack of books taller than your house and asked one question that needs tiny clues from almost every single page. If you tried to hold every page in your head at once, you’d get overwhelmed fast.

đŸ„Ź Large Language Models (LLMs)

  • What it is: LLMs are computer programs that read and write text, like supercharged autocomplete.
  • How it works:
    1. They read words and patterns in your prompt.
    2. They predict the next likely word, over and over.
    3. They use a limited working memory window (their “context”) to stay on track.
  • Why it matters: Without LLMs, we wouldn’t have modern chatbots or coding helpers—but their short-term memory limits are a real bottleneck. 🍞 Anchor: When you ask a chatbot to write a poem about your dog, it uses its language skills to keep rhymes and style flowing.

đŸ„Ź Context Window

  • What it is: A context window is the model’s short-term memory—how much text it can pay attention to at once.
  • How it works:
    1. You paste in a prompt.
    2. The model reads up to a certain token limit (like 272K tokens for some models).
    3. Anything beyond that limit is cut off.
  • Why it matters: If the important clue is past the limit, the model can’t see it. 🍞 Anchor: It’s like trying to do a puzzle with only a handful of pieces on your table—you must swap pieces in and out, which is slow and error-prone.
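
To make the cut-off concrete, here is a minimal sketch of window truncation. It assumes a rough four-characters-per-token estimate; real tokenizers differ, so treat the numbers as illustrative.

```python
# Minimal sketch of how a fixed context window truncates input.
# Assumes ~4 characters per token; real tokenizers differ.

def truncate_to_window(prompt: str, max_tokens: int = 272_000, chars_per_token: int = 4) -> str:
    """Keep only the text that fits inside the model's context window."""
    max_chars = max_tokens * chars_per_token
    return prompt[:max_chars]  # anything past this point is never seen by the model

huge_prompt = "clue " * 2_000_000                      # ~10M characters of input
visible = truncate_to_window(huge_prompt)
print(f"original: {len(huge_prompt):,} chars; visible to the model: {len(visible):,} chars")
```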

đŸ„Ź Long-Context Tasks

  • What it is: Problems where the answer depends on lots of information spread across a very long input.
  • How it works:
    1. The task provides huge text (documents, logs, codebases).
    2. The solution may need details scattered almost everywhere.
    3. You often must process many or even all parts to get the correct answer.
  • Why it matters: Many real problems—legal cases, research, big code repos—are long-context. 🍞 Anchor: Answering, “Which two users fit rule X based on all their activities?” needs checking almost every entry.

đŸ„Ź Context Rot

  • What it is: Performance drops as inputs get longer—even within the context window.
  • How it works:
    1. The model reads a longer prompt.
    2. Its attention gets stretched or distracted.
    3. Accuracy falls faster on harder tasks.
  • Why it matters: Bigger windows alone don’t guarantee reliable long-horizon reasoning. 🍞 Anchor: Reading 50 pages of clues can blur together; you forget which clue appeared where.

đŸ„Ź Context Compaction (Summarization)

  • What it is: Shrinking long inputs into summaries to fit within the window.
  • How it works:
    1. When text gets too long, summarize older parts.
    2. Keep a shorter memory of what matters.
    3. Repeat as the conversation continues.
  • Why it matters: It saves space but can delete tiny details you may need later. 🍞 Anchor: Summarizing a mystery novel might skip the one-sentence clue that solves the case.
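
A minimal sketch of this compaction loop is below; llm_summarize is a hypothetical placeholder for an LLM summarization call, not an interface from the paper. The key point is that anything dropped by a summary is unrecoverable later.

```python
# Sketch of context compaction: repeatedly summarize so the running memory
# stays under a budget. `llm_summarize` is a hypothetical placeholder.

def llm_summarize(text: str) -> str:
    # Placeholder: a real system would call an LLM here.
    return text[:200] + " ...[summary]"

def compact(chunks: list[str], budget_chars: int = 2_000) -> str:
    memory = ""
    for chunk in chunks:
        memory += "\n" + chunk
        if len(memory) > budget_chars:
            memory = llm_summarize(memory)   # details dropped here are gone for good
    return memory

print(compact([f"chapter {i}: " + "plot detail " * 100 for i in range(10)])[:120])
```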

The World Before: People tried all sorts of agent tricks to stretch context: retrieval to fetch only relevant chunks, coding agents to compute or filter, and iterative summarization to compress. These help, but hit hard walls. Retrieval still must stuff results into the window. Summaries can miss crucial bits. And output length is also limited if the model must write everything directly.

The Problem: We need models to deeply process tens of millions of tokens, sometimes touching nearly every line (linear complexity) or every pair of lines (quadratic complexity). Simply growing the model’s native window is costly and still suffers context rot and output limits.

Failed Attempts:

  • Pure retrieval agents: They fetch snippets but still overfill the window.
  • Pure summarization: Loses details needed for dense reasoning.
  • Verbal sub-agents: They plan sub-calls in text, but long outputs still hit limits.

The Gap: No general, task-agnostic way to let an LLM work programmatically over an arbitrarily long prompt while preserving dense, fine-grained access and enabling very long outputs.

Real Stakes: Think of a student researching across 1,000+ articles, a lawyer scanning years of case files, or an engineer understanding a massive codebase. Missing one small connection can flip the answer from right to wrong. We need a method that scales thinking at inference time, not just memory size.

02Core Idea

🍞 Hook: You know how a librarian doesn’t try to balance a whole library on one desk? Instead, they keep the books on shelves and bring you the specific pages you need.

đŸ„Ź Inference-Time Scaling

  • What it is: Using extra smart steps at answer time (not training time) to think more or better.
  • How it works:
    1. Add tools or loops to reason step-by-step.
    2. Decompose big problems into manageable parts.
    3. Spend compute only where it matters.
  • Why it matters: You can boost capability without retraining the whole model or expanding its window. 🍞 Anchor: Like taking more time and using a calculator during a test to solve a tough problem.

đŸ„Ź Read–Eval–Print Loop (REPL)

  • What it is: A coding workspace where you can read code, run it, and see results repeatedly.
  • How it works:
    1. You write a small program.
    2. The computer runs it and shows outputs.
    3. You adjust and repeat until you’re done.
  • Why it matters: It lets the model manipulate huge text as variables without cramming it into the model’s short window. 🍞 Anchor: Like trying a math step on scratch paper, checking the result, and iterating.
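
Here is a small illustration of the REPL view: the prompt sits in an ordinary Python variable, and only tiny peeks are ever printed back. The variable name context follows the paper's description; the sample text is made up.

```python
# The prompt lives in a variable; the model only ever sees small printed peeks.
context = "log line about one user action\n" * 300_000   # stand-in for a huge prompt

print(len(context))                 # cheap metadata: how big is it?
print(context[:200])                # tiny prefix peek
print(context[500_000:500_400])     # peek at an arbitrary slice, still without
                                    # loading the whole thing into the window
```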

đŸ„Ź Recursion (Sub-calls)

  • What it is: Calling the same process on smaller pieces to solve a big problem.
  • How it works:
    1. Split the big input into chunks.
    2. Call the model on each chunk.
    3. Combine the partial answers and repeat if needed.
  • Why it matters: You can handle arbitrarily large inputs by working piece-by-piece. 🍞 Anchor: Cutting a giant pizza into slices so everyone can eat comfortably.
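
A minimal sketch of this split-call-combine pattern follows; llm_query is a stand-in for a real sub-model call, and the chunk size is arbitrary.

```python
# Sketch of recursion over chunks: split, call a sub-model on each piece, combine.

def llm_query(prompt: str) -> str:
    return f"[answer over {len(prompt)} chars]"   # placeholder for a real sub-LLM call

def answer_over_chunks(big_text: str, question: str, chunk_size: int = 50_000) -> list[str]:
    chunks = [big_text[i:i + chunk_size] for i in range(0, len(big_text), chunk_size)]
    return [llm_query(f"{question}\n\n{chunk}") for chunk in chunks]

partials = answer_over_chunks("entry\n" * 100_000, "Which entries mention rule X?")
print(len(partials), partials[0])   # partial answers can be merged by code or one more call
```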

đŸ„Ź Recursive Language Models (RLMs)

  • What it is: A way to run an LLM inside a REPL, treat the user’s prompt as a variable, and let the LLM write code that peeks at pieces, processes them, and recursively calls itself.
  • How it works:
    1. Load the entire prompt into a variable in the REPL (outside the model’s window).
    2. Give the model small metadata (like prompt length and a tiny prefix) so it knows what it’s dealing with.
    3. The model writes code to slice, search, and analyze the big prompt; it can programmatically call sub-models on slices.
    4. It builds intermediate results in variables and assembles the final answer, which can be arbitrarily long.
  • Why it matters: This bypasses both input and output length limits and reduces context rot by letting code (not the window) manage scale. 🍞 Anchor: Instead of cramming a whole encyclopedia into your head, you keep it on a shelf and write a mini research program that fetches what you need and collects the final report.

The Aha! Moment (one sentence): Don’t shove the giant prompt into the model—treat it as an external object and let the model write a program that recursively calls itself over any slices it needs.

Three Analogies:

  1. Library analogy: Keep the books on shelves (REPL variable), not on your desk (context window). Walk the stacks as needed.
  2. Kitchen analogy: Prep ingredients in batches (slices), cook separate dishes (sub-calls), then plate the full meal (final variable).
  3. Puzzle analogy: Sort pieces by color/edge (chunk/search), solve sub-sections (sub-calls), then snap sections together (assemble variables).

Before vs After:

  • Before: Agents fetched or summarized, but still hit window/output limits and lost details.
  • After: RLMs programmatically traverse huge inputs, calling themselves on-demand, and return results stored in variables, so inputs and outputs can both go far beyond the native limits.

Why It Works (intuition):

  • Moving the prompt into the environment means the model never wastes window space on raw input. Code, not tokens, does the heavy lifting.
  • Symbolic recursion lets the model scale work with input size (linear or even quadratic), where each sub-call stays within normal limits.
  • Returning a variable as the final answer breaks the output-length ceiling.

Building Blocks:

  • Prompt-as-variable: Entire input lives in the REPL.
  • Metadata drip: Only tiny summaries (length, short prefix, recent prints) go into the model’s context each turn.
  • Code-writing LLM: The model writes Python to slice/filter/search.
  • Programmatic sub-calls: A function to call the sub-LLM on slices, even inside loops.
  • Final assembly: A designated Final variable is returned as the answer, enabling arbitrarily long outputs.
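
Putting these building blocks together, a heavily simplified driver loop might look like the sketch below. Both llm_generate_code and llm_query are hypothetical stand-ins for the root model and sub-model calls, and the FINAL convention only approximates the paper's Final/FINAL_VAR signaling.

```python
# A minimal sketch of an RLM driver loop, under stated assumptions:
# `llm_generate_code` and `llm_query` are hypothetical placeholders.
import contextlib
import io

def llm_generate_code(transcript: str) -> str:
    # Placeholder: a real system prompts the root LLM with the transcript here.
    return "FINAL = llm_query('Answer using the peeks so far: ' + context[:1000])"

def llm_query(prompt: str) -> str:
    # Placeholder: a real system calls a sub-LLM on this string.
    return f"[sub-answer over {len(prompt)} chars]"

def run_rlm(context: str, question: str, max_turns: int = 20) -> str:
    env = {"context": context, "llm_query": llm_query, "FINAL": None}
    transcript = (f"Question: {question}\n"
                  f"context length: {len(context)} chars\n"
                  f"prefix: {context[:200]!r}")
    for _ in range(max_turns):
        code = llm_generate_code(transcript)            # the model proposes a code block
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, env)                             # run it in a (sandboxed!) namespace
        transcript += "\n" + buffer.getvalue()[:2_000]  # feed back only a truncated preview
        if env["FINAL"] is not None:                    # the code sets FINAL when it is done
            return env["FINAL"]                         # can be far longer than one model output
    return env["FINAL"] or "no answer"

print(run_rlm("record\n" * 200_000, "Summarize the records."))
```

A real deployment would run the generated code in a sandbox with proper safeguards around exec, but the control flow (prompt-as-variable, truncated feedback, sub-calls, and a variable returned as the answer) is the core of the idea.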

03Methodology

At a high level: Input prompt → Initialize REPL with prompt variable → Give the model metadata → Model writes code to peek/slice/search → Execute code and optional recursive sub-calls → Store intermediate variables → Repeat until Final is set → Output Final.

đŸ„Ź Python Programming (as used here)

  • What it is: A simple language to write small programs that inspect and transform text.
  • How it works:
    1. Store the prompt in variables.
    2. Write functions to split, search, and analyze.
    3. Loop over parts and save results.
  • Why it matters: It lets the model precisely control what to read and how to combine answers. 🍞 Anchor: Like making a recipe card that says “slice apples, mix sugar, bake,” so you can repeat it reliably.
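
The kind of small helpers the model tends to write look like the sketch below; the document delimiter and function names are illustrative, not prescribed by the paper.

```python
# Illustrative helpers an RLM might write: split the prompt into documents,
# then search them with a regex.
import re

def split_docs(context: str, delimiter: str = "\n===DOC===\n") -> list[str]:
    """Split the big prompt into documents (delimiter is illustrative)."""
    return context.split(delimiter)

def search_docs(docs: list[str], pattern: str) -> list[int]:
    """Return indices of documents whose text matches the regex pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [i for i, doc in enumerate(docs) if rx.search(doc)]

docs = split_docs("report on the festival\n===DOC===\nbudget memo\n===DOC===\nfestival recap")
print(search_docs(docs, r"festival"))   # -> [0, 2]
```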

đŸ„Ź Iterative Computing

  • What it is: Solving a problem step-by-step, refining after each result.
  • How it works:
    1. Try a small step (e.g., search for a keyword).
    2. Check what you found.
    3. Decide the next step based on feedback.
  • Why it matters: Prevents getting lost; you make progress even on huge inputs. 🍞 Anchor: Studying for a test by doing a few practice problems, checking answers, then adjusting your plan.

đŸ„Ź Context Management (in RLMs)

  • What it is: Keeping track of big inputs and partial answers without cramming them into the model’s short-term memory.
  • How it works:
    1. Keep the full prompt in the REPL variable.
    2. Store partial results in variables.
    3. Feed the model only tiny metadata and short printouts between turns.
  • Why it matters: Avoids overflow and context rot. 🍞 Anchor: Using folders and sticky notes to organize a research project so your desk never gets buried.
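
As a sketch, the per-turn view handed to the model can stay tiny even when the underlying prompt is enormous; the field names below are illustrative.

```python
# What the model actually reads between turns: metadata and a truncated printout,
# never the raw prompt.

def turn_view(context: str, last_stdout: str, preview_chars: int = 200) -> str:
    return (f"context: str, {len(context):,} chars\n"
            f"prefix: {context[:preview_chars]!r}\n"
            f"last stdout (truncated): {last_stdout[:preview_chars]!r}")

context = "user_42 liked post 7\n" * 400_000      # stand-in for a huge prompt
print(turn_view(context, "Found 7 matching docs; indices: [6, 82, 214]"))
```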

Step-by-step recipe:

  1. Initialize REPL with prompt-as-variable
  ‱ What happens: The system starts a Python REPL; context = full prompt (could be 10M+ tokens). Also, llm_query() is available to call a sub-LLM on any string you construct.
  ‱ Why it exists: It moves the big input outside the model’s window while keeping it accessible.
  ‱ Example: For 1,000 documents, context holds all docs; we haven’t shown any doc text to the model yet.
  2. Give minimal metadata
  ‱ What happens: The model gets tiny facts like total length and a very short prefix sample.
  ‱ Why it exists: So the model knows what it’s dealing with (e.g., size, rough format) without reading everything.
  ‱ Example: “Type: list of files; total characters: 8,300,000; first 200 chars: ‘# Report 
’.”
  3. Model writes code to peek and plan
  ‱ What happens: The model outputs a code block (e.g., with regex searches for keywords) and prints summaries.
  ‱ Why it exists: Code determines which slices to read; you don’t waste window on irrelevant parts.
  ‱ Example: Search all docs for ‘festival’ and ‘La Union’, collect matches by filename.
  4. Execute code, capture stdout, update state
  ‱ What happens: The REPL runs the code; variables hold results; a truncated preview is shared with the model next turn.
  ‱ Why it exists: Tight feedback loop without flooding context; the real data sits safely in the REPL.
  ‱ Example: stdout: “Found 7 matching docs; indices: [6, 82, 214, 
].” Variables: matches, indices.
  5. Optional recursive sub-calls (llm_query)
  ‱ What happens: The code programmatically calls the sub-LLM on slices inside loops.
  ‱ Why it exists: To do deep semantic analysis on manageable chunks.
  ‱ Example: For each matched doc, llm_query(“Extract festival winner name from this doc: <doc_text>”). Save answers.
  6. Build intermediate buffers and assemble
  ‱ What happens: The code collects partial answers into variables (lists, dicts) and combines them.
  ‱ Why it exists: Many tasks require aggregation (sum, vote, map-reduce-like pipelines).
  ‱ Example: In OOLONG-Pairs, classify each line by label, then build all user ID pairs meeting the rule.
  7. Return Final or keep iterating
  ‱ What happens: When confident, the model sets Final or FINAL_VAR(variable_name) to produce the answer.
  ‱ Why it exists: Returning a variable bypasses output-length limits.
  ‱ Example: Final contains hundreds or thousands of pairs, larger than any single model output would allow directly.
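
The sketch below compresses the seven steps into one illustrative script built around the ‘festival’ example used above; llm_query is a placeholder for the real sub-LLM call, and the document format and delimiter are assumptions.

```python
# Compressed walk-through of the recipe on a made-up 'festival' corpus.
import re

def llm_query(prompt: str) -> str:
    return "[extracted winner name]"     # placeholder for a real sub-LLM call

# Steps 1-2: the prompt lives in a variable; the model only sees cheap metadata.
docs = [(f"doc {i}: festival results from La Union, winner announced ..."
         if i % 140 == 0 else f"doc {i}: unrelated municipal report ...")
        for i in range(1_000)]
context = "\n===DOC===\n".join(docs)
print(f"total characters: {len(context):,}")

# Steps 3-4: code peeks and filters instead of reading everything.
doc_list = context.split("\n===DOC===\n")
matches = [i for i, d in enumerate(doc_list)
           if re.search(r"festival", d) and re.search(r"La Union", d)]
print(f"Found {len(matches)} matching docs; indices: {matches}")

# Steps 5-6: recursive sub-calls on just the matching slices; results live in variables.
winners = {i: llm_query(f"Extract the festival winner name from this doc:\n{doc_list[i]}")
           for i in matches}

# Step 7: assemble the answer in a variable; returning it bypasses output-length limits.
FINAL = "\n".join(f"doc {i}: {name}" for i, name in winners.items())
print(FINAL[:200])
```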

What breaks without each step:

  • No REPL? You must stuff the prompt into the window—boom, you hit limits and rot.
  • No metadata? The model is blind and can’t plan efficient peeks.
  • No code? You can’t search/filter precisely; costs and errors spike.
  • No sub-calls? Dense tasks collapse; you can’t semantically process all the needed slices.
  • No Final variable? Long outputs get cut off mid-stream.

Secret sauce:

  • Treat the prompt as data in an external environment and let the model write a symbolic program with loops and recursive calls. This swaps fragile token juggling for robust, programmable processing that scales with input size and task complexity.

04Experiments & Results

The Test: The authors evaluated how well RLMs handle very long and increasingly complex tasks.

  • S-NIAH: Single-needle search; complexity ~constant with length.
  • BrowseComp-Plus (1K docs): Multi-hop Q&A across 1,000 provided documents; real research-style reading.
  • OOLONG: Requires transforming and aggregating nearly every line; linear complexity.
  • OOLONG-Pairs: Requires aggregating over pairs; quadratic complexity—extremely dense.
  • LongBench-v2 CodeQA: Multiple-choice understanding of code repositories.

The Competition: They compared RLMs against base LLMs (GPT-5, Qwen3-Coder-480B-A35B), CodeAct agents (with/without BM25 retrieval), and a summarization agent. They also tested an ablation: RLM with REPL but no sub-calls.

The Scoreboard (with context):

  • BrowseComp-Plus (1K docs, GPT-5 family): RLM(GPT-5) scored 91.3% with about $0.99 average cost. That’s like getting an A when other popular strategies hovered around a B range (e.g., summarization agent ~70.5%). Crucially, base GPT-5 could not even fit these inputs.
  • OOLONG-Pairs (GPT-5 family): Base GPT-5 got <0.1% F1—an F—while RLM(GPT-5) hit 58.0%—a solid C+ to B− on a task where nearly everyone else failed, showing dramatic gains on dense, quadratic aggregation.
  • OOLONG (GPT-5 family): RLM lifted scores by ~28.4% over base GPT-5, turning average work into very good performance on line-by-line transformations.
  • CodeQA and other splits (Qwen family): RLM variants frequently outperformed base models. Interestingly, the no-sub-call RLM sometimes beat the full RLM on sparser tasks like CodeQA (66% vs. 56% for Qwen3-Coder), showing sub-calls matter most when semantics are very dense.
  • 10M+ token regime: RLMs maintained strong performance at scales where base models and other scaffolds degraded or broke.

Costs with context:

  • Median costs: RLMs were often comparable to, or cheaper than, baselines (e.g., cheaper than full summarization agents that ingest everything).
  • Variance: Some RLM runs were expensive outliers due to long recursive trajectories—like spending extra time double-checking in a tough exam.
  ‱ BrowseComp-Plus (1K): Linearly extrapolating the cost of naively stuffing 6–11M tokens suggests roughly $1.50–$2.75 just for ingestion with a small model; RLM(GPT-5) averaged $0.99 while achieving top accuracy.

Surprising Findings:

  • REPL without sub-calls can still beat many baselines for long but not-too-dense tasks, proving that offloading the prompt is already a big win.
  • On dense tasks like OOLONG-Pairs, sub-calls were essential, delivering 10%–59% gains over the no-sub-call ablation.
  • Different models behaved differently as RLMs: Qwen3-Coder tended to launch many sub-calls (sometimes too many), while GPT-5 was more conservative. A single extra prompt line warning Qwen not to over-use sub-calls helped.
  • Short-input tradeoff: For small contexts, base LLMs sometimes outperformed RLMs—overhead isn’t worth it when the desk is already clear.

Training a native RLM (small-scale):

  • RLM-Qwen3-8B, fine-tuned on only ~1,000 filtered trajectories from a larger model, improved by 28.3% on average across tasks and used fewer tokens thanks to better decisions. This suggests dedicated RLM training is a powerful new lever.

05Discussion & Limitations

Limitations:

  • Synchronous sub-calls: They ran calls sequentially, which can be slow. Asynchronous batching could provide big speedups.
  • Prompt brittleness: A one-size-fits-all RLM system prompt didn’t transfer perfectly between models; small tweaks mattered.
  • Coding ability required: Models that struggle with code manipulation don’t perform well as RLMs.
  • Cost variance: While median costs are good, long trajectories can become pricey outliers.
  • Small-input overhead: For short prompts, plain LLMs may be simpler and sometimes more accurate.

Required Resources:

  • A coding-capable LLM (preferably with reasoning features).
  • A safe REPL environment (sandboxed Python) that can store large strings and run code.
  • A sub-LLM API for recursive calls and a budget to accommodate variable-depth trajectories.

When NOT to Use:

  • Very short or simple tasks that fit comfortably in the context window.
  • Settings where code execution or external environments are disallowed.
  • Ultra-tight latency budgets where even occasional long trajectories are unacceptable.

Open Questions:

  • How deep should recursion go? They used only one level, but deeper trees or hybrids with neural attention could help.
  • Can we train fully native RLMs at scale so they write better code, chunk smarter, and reduce unnecessary sub-calls?
  • How to make sub-calls asynchronous, parallel, and cost-aware by design?
  • What are the best safeguards to avoid mixing thoughts and finals (e.g., more robust signaling than FINAL/FINAL_VAR tags)?
  • How to auto-tune strategies per task (chunk sizes, regex filters, batching policies) without human prompts?

06Conclusion & Future Work

Three-sentence summary: This paper introduces Recursive Language Models (RLMs), which keep the giant prompt outside the model as a variable in a coding workspace and let the model write programs that recursively call itself on slices. By doing so, RLMs bypass both input and output length limits and reduce context rot, achieving strong results even on 10M+ token problems that defeat standard approaches. A small fine-tuned native RLM (RLM-Qwen3-8B) further shows that training specifically for recursive behavior yields fast, general gains.

Main achievement: Turning long-context processing into a programmable, recursive workflow—treating the prompt as an environment—so inference-time compute, not raw window size, drives scaling.

Future directions: Train larger native RLMs, add asynchronous/parallel sub-calls, explore deeper recursion and symbolic–neural hybrids, and develop robust signaling and safety for long outputs. Also, build cost-aware planners that automatically pick chunk sizes and batching strategies.

Why remember this: It reframes the scaling story—from “make the window bigger” to “make the thinking smarter.” RLMs give models a practical way to read, reason over, and write about truly massive inputs without throwing away details.

Practical Applications

  ‱ Enterprise search: Programmatically scan millions of documents and compile precise, evidence-backed answers.
  ‱ Software engineering: Understand large repos by chunking files, querying semantics, and assembling architectural summaries.
  ‱ E-discovery and legal review: Systematically traverse case files and contracts while preserving fine-grained details.
  ‱ Scientific literature review: Aggregate findings across thousands of papers without losing critical caveats.
  ‱ Customer support analytics: Mine long chat histories and tickets to detect root causes and policy gaps.
  ‱ Compliance and auditing: Check vast logs and reports against rules, producing long, structured audit outputs.
  ‱ Healthcare records analysis: Sift longitudinal patient notes to find timeline-accurate, detail-rich summaries.
  ‱ Financial forensics: Explore transaction streams and pairwise relations (like OOLONG-Pairs) to flag anomalies.
  ‱ Education and tutoring: Build long-form study guides from textbooks and lecture notes with accurate citations.
  ‱ Data operations: Generate long outputs (e.g., complete pair lists, code maps) that exceed single-call output limits.
#Recursive Language Models · #RLM · #Long-context reasoning · #Inference-time scaling · #REPL · #Sub-LLM calls · #Context window · #Context rot · #Summarization baseline · #Code agents · #Retrieval agents · #Programmatic decomposition · #LongBench · #OOLONG · #BrowseComp-Plus
Version: 1