
RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Intermediate
Jialiang Zhu, Gongrui Zhang, Xiaolong Ma et al. · 2/2/2026
arXiv · PDF

Key Summary

  • Re-TRAC is a new way for AI search agents to learn from each try, write a clean summary of what happened, and then use that summary to do better on the next try.
  • It fixes a big weakness in the popular ReAct method, which walks in a single straight line and often forgets earlier plans or misses unexplored branches.
  • After each attempt, the agent creates a structured state with three key parts: conclusions so far, checked evidence and sources, and remaining questions and branches.
  • Using these states across attempts makes the agent explore smarter, avoid repeating work, and steadily narrow down to the right answer.
  • On tough web research tests like BrowseComp, Re-TRAC boosts accuracy by about 15–20% over ReAct with frontier models.
  • Even small models trained to understand Re-TRAC summaries set new records for their size groups (30B model at 53% on BrowseComp; 4B model at 30%).
  • Compared with methods like Best-of-N or Majority Voting, Re-TRAC reaches higher accuracy while using fewer tool calls and tokens as rounds go on.
  • The pass@K gap shows many problems are solvable if the agent simply explores more branches; Re-TRAC turns that potential into real gains by coordinating across attempts.
  • Ablation studies show two helpful tricks: instructing the agent to freely judge prior summaries prevents tunnel vision, and stronger summarizers improve small models.
  • Re-TRAC is both a test-time recipe for big models and a training recipe (via SFT) that helps smaller models act like seasoned researchers.

Why This Research Matters

Many real-life questions need careful web research, not just quick answers. Re-TRAC helps AI assistants remember what they’ve checked, what they still don’t know, and where to look next—so they waste less time and make fewer mistakes. This means students can study faster, journalists can verify facts more reliably, and teams can explore complex topics with less cost. Because it cuts down repeated work, it also saves money and energy as models grow. Smaller, cheaper models gain the most, making advanced research help more accessible. Over time, this approach can spread from web search to planning, coding, and science—anywhere many steps and sources must come together.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you and your friends are hunting for a mystery item in a giant library. If each friend searches in a straight line without talking to others about what they found, you’ll waste time repeating the same shelves and miss whole aisles.

🥬 Filling (The Actual Concept):

  • What it is: This paper tackles how AI “deep research agents” search the web over many steps without getting lost, repeating themselves, or ignoring good clues.
  • How it works (big picture): The authors show that most current agents use a straight-line plan called ReAct. That line gets very long, and the agent forgets earlier ideas, misses branches it planned to explore, and uses lots of tools and tokens. The paper introduces Re-TRAC, which makes the agent pause after each try, summarize what happened in a structured way, and then continue with a smarter plan.
  • Why it matters: Without a way to share experience across tries, agents waste compute, miss promising paths, and end up with wrong or half-baked answers.

🍞 Bottom Bread (Anchor): Think of a team of treasure hunters. After each day, they draw a clean map of places searched (and what they found), mark places they didn’t reach yet, and then plan the next day using that map. They find treasure faster and stop re-checking the same caves.

—

🍞 Hook: You know how a very talkative friend can tell interesting stories, but if the story is too long, you forget the important parts? That happens to AIs too.

🥬 The Concept (LLM):

  • What it is: A Large Language Model (LLM) is an AI that reads and writes text and can plan steps.
  • How it works: It predicts next words well, can call tools like search and browse, and can follow step-by-step reasoning.
  • Why it matters: Without a strong LLM, the agent can’t understand questions or plan multi-step searches.

🍞 Anchor: When you ask “What’s the capital of France?”, the LLM focuses on the words “capital” and “France” and says “Paris.”

—

🍞 Hook: Imagine a super librarian who can use the internet, open hundreds of pages, and connect clues.

🥬 The Concept (Deep Research Agents):

  • What it is: Deep research agents are LLM-powered systems that search the open web, read sources, and reason across many steps.
  • How it works: They plan, call tools (like search/visit), read results, and keep going for many turns.
  • Why it matters: Big, real questions need many steps and sources; a single quick answer isn’t enough.

🍞 Anchor: If you ask, “Which state’s official flower is shared with another state, and what’s the school from the lunch menu riddle?”, a deep research agent searches, reads pages, compares clues, and triangulates the answer.

—

🍞 Hook: You know how walking in only one straight hallway means you might miss rooms on the sides?

🥬 The Concept (ReAct baseline):

  • What it is: ReAct is a popular framework where the agent thinks a bit, acts (calls a tool), sees results, and writes all that in one long line of notes.
  • How it works: Reason → Act → Observe → Repeat, all in order.
  • Why it matters: This single line gets very long, so the agent forgets early plans and rarely revisits or branches well.

🍞 Anchor: It’s like following one hallway forever and forgetting the doors you meant to check earlier.

—

🍞 Hook: Imagine planning to check three playgrounds after school—but by dinner you only checked one and forgot the others.

🥬 The Concept (Incomplete Branch Exploration):

  • What it is: Agents often plan several search branches but end up exploring only some.
  • How it works: Long context pushes early plans out of focus; the agent gets stuck in a single path.
  • Why it matters: Many failures happen because good branches were never explored, not because the model couldn’t reason.

🍞 Anchor: The paper found that in up to 93% of failed searches, the agent had planned branches it never actually explored.

—

🍞 Hook: If you take multiple guesses on a quiz, your chance of getting one right goes up.

🥬 The Concept (Pass@K):

  • What it is: A metric that asks, “If the agent tries K times, did any try get the right answer?”
  • How it works: Run multiple independent attempts and see if at least one is correct.
  • Why it matters: A big gap between Pass@1 and Pass@K means better exploration could help a lot.

🍞 Anchor: The paper shows large Pass@K gaps for many strong models—proof there’s untapped potential if we manage the tries better.
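To make the metric concrete, here is a minimal sketch (not code from the paper) of how Pass@K can be computed from recorded attempts; the `attempts` structure and helper name are assumptions for illustration:

```python
def pass_at_k(attempts, k):
    """Fraction of questions where at least one of the first k tries is correct.

    attempts: one list of booleans per question, each entry marking whether
              that independent try produced the right answer.
    """
    solved = sum(1 for tries in attempts if any(tries[:k]))
    return solved / len(attempts)

# Example: 3 questions, 4 independent tries each.
results = [
    [False, False, True, False],   # solved on the 3rd try
    [False, False, False, False],  # never solved
    [True, True, False, True],     # solved on the 1st try
]
print(pass_at_k(results, 1))  # ~0.33 (Pass@1)
print(pass_at_k(results, 4))  # ~0.67 (Pass@4): the gap is the untapped potential
```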

—

🍞 Hook: Imagine friends trying the same maze separately without sharing maps. They repeat dead ends.

🥬 The Concept (Majority/Best-of-N limits):

  • What it is: Run several independent tries, then vote or pick the “best.”
  • How it works: Each attempt starts from scratch; no experience sharing.
  • Why it matters: Lots of redundancy and no way to build a global plan.

🍞 Anchor: Without a shared map, everyone keeps bumping into the same walls.

—

This paper fills the gap by letting agents summarize each attempt into a compact “state” and carry it forward. That enables reflection, branching, and efficient reuse of verified facts. The real-world stakes are big: better research assistants, faster fact-finding for students and reporters, and cheaper compute bills because the agent stops repeating itself.

02Core Idea

🍞 Hook: You know how great teams keep a scoreboard after each game so the next game plan is smarter? What if AI did that after each research try?

🥬 The Concept (Re-TRAC — Recursive Trajectory Compression):

  • What it is: After each search attempt (trajectory), the agent writes a structured state that captures conclusions, evidence with sources, and remaining uncertainties/branches, then uses it to guide the next attempt.
  • How it works:
    1. Run a normal ReAct-style attempt (think → act → observe).
    2. Compress that attempt into a clean state (what we know, what we checked, what’s left).
    3. Start the next attempt conditioned on that state, explicitly targeting missed branches and avoiding re-checking verified facts.
    4. Repeat for several rounds until solved or out of budget.
  • Why it matters: Without this, agents forget plans, waste tools, and stop too early. With it, exploration becomes progressive instead of random.

🍞 Anchor: It’s like a detective making a case file after each day: clues gathered, suspects cleared, leads not yet followed. Tomorrow’s search starts from the file, not from zero.
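In code, the loop described above might look like this minimal sketch; the helpers `run_react_attempt` and `compress_to_state` are hypothetical callables standing in for the rollout and summarization steps, not the authors' API:

```python
def re_trac(question, run_react_attempt, compress_to_state, max_rounds=8):
    """Recursive trajectory compression: each round is conditioned on the
    structured state distilled from the previous round's trajectory."""
    state = None   # round 1 starts with no prior experience
    answer = None
    for _ in range(max_rounds):
        # A normal ReAct rollout (think -> act -> observe), conditioned on the
        # state carried over from earlier rounds when one exists.
        trajectory, answer = run_react_attempt(question, prior_state=state)
        # Compress the long trace into conclusions, verified evidence,
        # and remaining uncertainties / unexplored branches.
        state = compress_to_state(question, trajectory)
        # One possible early stop: an answer exists and no open branches remain.
        if answer is not None and not state.get("uncertainties"):
            break
    return answer, state
```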

—

Three analogies for the same idea:

  1. Detective Notebook: Each day ends with a tidy note of what’s proven, what’s shaky, and which doors weren’t knocked on. Tomorrow’s plan focuses on those doors.
  2. Video Game Save Point: After exploring one dungeon path, you save a map marking treasure claimed and tunnels unvisited. Next run, you beeline to new tunnels.
  3. Recipe Iteration: You bake cookies, note which steps worked, which didn’t, and which spices you still want to test. Next batch, you skip the failed trick and try the new spice.

Before vs After:

  • Before: Each try is independent, long notes get messy, early ideas are lost, repeated tool calls happen, and the agent settles for a local best guess.
  • After: Each try updates a shared state, missed branches are surfaced, verified facts are reused, tools are called less over time, and answers improve steadily.

Why it works (intuition):

  • Memory pressure shrinks when we summarize essentials and discard distracting trace details.
  • Exploration breadth is preserved because the state explicitly lists unresolved branches.
  • Efficiency rises as verified facts prevent redundant tool calls and token usage.
  • Credit assignment becomes easier: the agent can see which steps led to progress and which didn’t, then plan accordingly.

Building blocks (each with sandwich-style clarity):

🍞 Hook: You know how you keep a scorecard to see where you are in a game? 🥬 The Concept (Structured State Representation):

  • What it is: A compact, organized summary after each attempt.
  • How it works: It has three core facets—(1) Answer & Conclusions, (2) Evidence & Source Verification, (3) Uncertainties & Exploration Trace—and, for stronger models, three audit facets—Failed Attempts, Uncompleted Proposals, Discarded Possibilities.
  • Why it matters: Without structure, the next attempt starts in a fog; with structure, it starts with a clear to-do list and trusted facts. 🍞 Anchor: A tidy case file that marks checked clues, trusted witnesses, and the streets you still need to visit.
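As a rough illustration only (the paper defines the state through a prompt template, not a fixed schema), the facets could be pictured as a small container like this:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredState:
    # Core facets
    conclusions: list[str] = field(default_factory=list)    # answer & conclusions so far
    evidence: list[dict] = field(default_factory=list)      # facts with sources and verification status
    uncertainties: list[str] = field(default_factory=list)  # open questions and unexplored branches
    # Audit facets (used with stronger models)
    failed_attempts: list[str] = field(default_factory=list)
    uncompleted_proposals: list[str] = field(default_factory=list)
    discarded_possibilities: list[str] = field(default_factory=list)

# Hypothetical contents after one round of the state-flower riddle:
state = StructuredState(
    conclusions=["No confident answer yet"],
    evidence=[{"fact": "Georgia's and Iowa's flowers were already checked",
               "source": "https://example.org/flowers", "verified": True}],
    uncertainties=["New York branch not explored", "Lunch-menu clue unresolved"],
)
```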

🍞 Hook: Imagine a coach reminding players, “Use the notes, but think for yourself.” 🥬 The Concept (Continuation Prompt with Free-Use Instruction):

  • What it is: Guidance that says, “Use the summary if it helps, but don’t be trapped by it; feel free to pivot.”
  • How it works: The next round gets the state plus instructions to question low-quality parts and to expand the search.
  • Why it matters: Prevents tunnel vision and encourages fresh branch exploration. 🍞 Anchor: A teacher’s comment: “Good outline—keep what works, toss what doesn’t, and try a new angle.”
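A hedged sketch of what such an instruction could look like in practice (illustrative wording, not the paper's exact prompt):

```python
CONTINUATION_PROMPT = """You are continuing a multi-round research task.
Below is a structured summary of your previous attempts: conclusions so far,
verified evidence with sources, and remaining uncertainties and unexplored branches.

{structured_state}

Use this summary freely as a starting point, but do not be bound by it:
question conclusions that look weakly supported, skip re-verifying facts that
already have trusted sources, and prioritize branches that were planned but
never explored. If part of the summary seems misleading, ignore it and search anew."""
```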

🍞 Hook: Small bikes with training wheels can ride like bigger bikes if taught well. 🥬 The Concept (Re-TRAC-Aware SFT for Small Models):

  • What it is: Supervised fine-tuning that trains smaller models to read and use structured states.
  • How it works: Build synthetic Q&A, collect multi-round Re-TRAC traces from a stronger model, and fine-tune small models on these traces.
  • Why it matters: Makes compact models behave like skilled researchers, reaching state-of-the-art for their size. 🍞 Anchor: A 4B model jumps from tiny scores to 30% on BrowseComp; the 30B model hits 53%.
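To make the training recipe concrete, one SFT example might be laid out roughly like this (the chat format and field names are assumptions; the paper describes the pipeline, not an exact schema):

```python
# One hypothetical training sample: the small model learns to produce the
# next-round trajectory given the question plus the structured state from
# earlier rounds, imitating traces collected from a stronger teacher model.
sft_sample = {
    "messages": [
        {"role": "system",
         "content": "You are a deep research agent with search and visit tools."},
        {"role": "user",
         "content": "Question: ...\n\nStructured state from previous rounds: ..."},
        {"role": "assistant",
         "content": "<teacher's next-round trajectory: reasoning, tool calls, final answer>"},
    ]
}
```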

Together, these pieces turn scattered, linear tries into a coordinated, learning-over-time search process.

03Methodology

At a high level: User Question → Round 1 (ReAct rollout) → Compress to Structured State → Round 2 (conditioned on state) → ... → Final Answer.

Step-by-step recipe with the Sandwich pattern for new ideas:

  1. Input and Tools 🍞 Hook: You can’t search a library without a catalog and a way to pull books from shelves. 🥬 The Concept (Search and Visit Tools):
  • What it is: Two main tools—search (find pages) and visit (open pages, extract relevant text, summarize to the goal).
  • How it works: Search returns titles, URLs, snippets; Visit fetches content (HTML/PDF), extracts text, and produces goal-focused summaries.
  • Why it matters: Without reliable tools, the agent can’t gather trustworthy evidence. 🍞 Anchor: Type a query, get promising links; open them and pull out the parts that answer your question. (A minimal sketch of these two tool interfaces appears right after this numbered list.)
  2. Round 1: A standard ReAct rollout 🍞 Hook: First you explore one hallway before drawing a map. 🥬 The Concept (Initial Trajectory):
  • What it is: The agent reasons, calls tools, reads results, and writes a final answer or partial view.
  • How it works: It chains thoughts and tool calls; this run is like the old way (linear), so it may miss branches.
  • Why it matters: We need an initial pass to collect raw experience to compress. 🍞 Anchor: The agent tries a few searches, visits some pages, and forms an initial guess.
  3. Compress the trajectory to a Structured State 🍞 Hook: After a long day of searching, you tidy your notes so tomorrow’s work starts clear. 🥬 The Concept (Trajectory Compression):
  • What it is: Turning a messy, long trace into a well-organized state.
  • How it works: Use a strict prompt to fill: (0) Current Answer, (1) Facts & Evidence with provenance, (2) Analysis tied to evidence, (3) Source inventory & verification status, (4) Uncertainties/limitations/gaps; plus audit facets for stronger models.
  • Why it matters: Compression keeps what matters and lists what to try next, so we don’t lose planned branches. 🍞 Anchor: A case file with checked facts, trusted sources, and a to-do list of unvisited leads.
  4. Start the next round conditioned on the state 🍞 Hook: Open your saved map and pick a new corridor. 🥬 The Concept (Recursive Execution):
  • What it is: Each new round begins with the previous state and a continuation prompt.
  • How it works: The agent reads trusted facts, sees unresolved branches, and plans targeted tool calls; it is also reminded to ignore weak or misleading parts of the summary.
  • Why it matters: This preserves branching diversity while focusing effort, avoiding repeated checks. 🍞 Anchor: If Georgia and Iowa were already checked for the shared state flower clue, the next round jumps to New York immediately.
  5. Stopping and Answer Selection 🍞 Hook: When your checklist is complete or time’s up, you decide your final answer. 🥬 The Concept (Round Limit and Finalization):
  • What it is: Run up to K rounds (often 8) or stop early if solved.
  • How it works: The last round’s answer is used as the final; you can also inspect if any earlier round had a correct answer (AP@K for analysis).
  • Why it matters: Puts a cap on compute while letting the agent learn across rounds. 🍞 Anchor: After 8 passes, submit the best-supported answer.
  6. Test-Time Scaling Comparisons 🍞 Hook: You can try more times, but it’s smarter if tries learn from each other. 🥬 The Concept (Test-Time Scaling Methods):
  • What it is: MV (vote), WV (confidence-weighted vote), Best-of-N (pick highest self-rated), and RT@N (Re-TRAC after N rounds).
  • How it works: MV/WV/Best-of-N are independent tries; RT@N chains tries with memory.
  • Why it matters: RT@N often wins while using fewer tools/tokens as rounds go on. 🍞 Anchor: Instead of eight strangers guessing, you have one person who learns from each previous guess.
  7. Training small models (Re-TRAC-aware SFT) 🍞 Hook: Practice runs with good feedback make a rookie play like a pro. 🥬 The Concept (SFT Pipeline):
  • What it is: Teach small models to read/use states by fine-tuning them on examples produced by a stronger model.
  • How it works: Build 33k synthetic questions via an entity-tree method; collect 4-round Re-TRAC traces from a strong model (GLM-4.7); filter to 104k high-quality samples; fine-tune small models (e.g., Qwen3-4B, Tongyi-30B) to follow the structured prompts.
  • Why it matters: Smaller models then gain most of Re-TRAC’s benefits at lower cost. 🍞 Anchor: After SFT, the 4B model jumps from almost zero to competitive scores when run with Re-TRAC.
  8. The secret sauce 🍞 Hook: The magic isn’t just more tries—it’s smarter tries. 🥬 The Concept (Cross-Trajectory Reflection):
  • What it is: A disciplined loop of summarize → plan → explore new branches → avoid repeats.
  • How it works: Structured states preserve what’s known and what’s missing; continuation prompts encourage skepticism and new angles.
  • Why it matters: This turns pass@K potential into actual gains with less compute waste. 🍞 Anchor: Each round becomes like a new expedition starting from an updated, trustworthy map.
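As noted in step 1, here is a minimal sketch of the two tool interfaces. The extraction and summarization logic below is a deliberately crude stand-in (a real agent would use proper HTML/PDF parsing and an LLM to summarize toward the goal), and the search backend is left as a placeholder rather than a specific API:

```python
import re
import requests

def search(query: str, top_k: int = 5) -> list[dict]:
    """Return candidate pages as {'title', 'url', 'snippet'} dicts.
    Placeholder: wire this to whichever web search API you have access to."""
    raise NotImplementedError("plug in a real search backend here")

def visit(url: str, goal: str, max_chars: int = 4000) -> str:
    """Fetch a page and return a goal-tagged excerpt of its text."""
    html = requests.get(url, timeout=30).text
    text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping, not real extraction
    text = re.sub(r"\s+", " ", text).strip()
    return f"[goal: {goal}] " + text[:max_chars]  # excerpt, not a true goal-focused summary
```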

Concrete mini-example (school-and-state-flower riddle): Round 1 lists states sharing flowers and checks two; compression records checked states and the unvisited ones. Round 2 targets the next state, finds matching lunch menus, and verifies the school name. Token use and tool calls drop as the search space narrows.

04Experiments & Results

The Test: The authors evaluated on five challenging, real-world-like benchmarks for web research—BrowseComp, BrowseComp-ZH, GAIA, XBench, and HLE. They measured accuracy (did the final answer match ground truth?) and also tracked tokens and tool calls to understand efficiency across rounds.

The Competition: Baselines included standard single runs, Majority Voting (MV@n), Weighted Voting (WV@n), Best-of-N (Best@n), and the traditional ReAct approach. Multiple strong open and closed models were involved, from small 4B to very large ones.

Scoreboard with context:

  • Re-TRAC with a 30B model reached 53% on BrowseComp, roughly like getting an A- when others of similar size got B’s (e.g., 43–47%). It even edged out some much larger models (e.g., 358B at 52%).
  • The 4B Re-TRAC model hit 30% on BrowseComp and set new bests among <15B models. That’s like a well-coached junior team beating all other juniors.
  • As a test-time scaling method (RT@8), Re-TRAC matched or beat MV@8, WV@8, and Best@8 across several models (e.g., o3, o4-mini, GPT-5-medium, DeepSeek-V3.2, GLM-4.7). Notably, GLM-4.7 didn’t benefit much from MV/Best-of-N but still improved a lot with Re-TRAC—showing RT is more general and forgiving of weaker self-judgment.
  • Efficiency: Unlike independent tries (which scale resource use linearly), Re-TRAC’s tool calls and tokens per round tended to drop as rounds progressed. Think of steering toward the right aisle and stopping re-reading the same pages.

Surprising findings:

  • Free-use instruction: Telling the agent to treat the summary as helpful but not binding reduced getting stuck and improved round-by-round gains. This simple nudge matters.
  • Summarizer quality: Using a stronger summarizer to create the state helped the 4B model but not the already-strong 30B model, hinting that small models especially benefit from high-quality compressions.
  • Pass@K reality check: Big Pass@K gaps confirmed that models often can solve the task—but only if guided to explore more branches. Re-TRAC turned that potential into consistent wins.

Numbers in plain language:

  • “+15–20% over ReAct” on BrowseComp with frontier models means that where ReAct scored roughly in the mid-to-high 40s, Re-TRAC pushed results into the upper 50s or low 60s, depending on the model and setting: an unmistakable step up.
  • “50% resource savings for similar or better accuracy” versus other test-time scaling methods means the agent reaches comparable or better answers with roughly half the tool calls and tokens, because it stops re-checking what’s already verified.

Takeaway: Re-TRAC didn’t just win; it won smart—higher accuracy with fewer repeated actions, showing that sharing experience across tries beats starting from scratch each time.

05Discussion & Limitations

Limitations:

  • Quality of states: If the compression misses key facts or mislabels evidence, later rounds can chase the wrong branches.
  • Summarizer dependence: Small models are sensitive to how well the state is written; weaker summarization can blunt gains.
  • Tool and environment variability: Search results, site availability, and extraction quality (HTML/PDF parsing) can affect reliability.
  • Long-horizon complexity: Extremely tangled tasks may still need more rounds or more advanced planning modules.

Required resources:

  • Tooling: Stable web search and page extraction (with a summarizer like a small LLM for page-level synthesis).
  • Compute: Enough context window to hold the state plus the active round; a few rounds (often up to 8) of inference.
  • Data (for small-model training): Synthetic question generation and multi-round trace collection from a strong teacher model.

When NOT to use:

  • Single-shot, simple Q&A where one search suffices—overhead may not pay off.
  • Domains with no reliable sources (or heavy paywalls) where evidence verification breaks down.
  • Real-time, ultra-low-latency settings where multiple rounds are too slow.

Open questions:

  • How to auto-detect and fix low-quality states (e.g., a state auditor or verifier)?
  • How to schedule exploration vs. exploitation across rounds optimally (learned policies)?
  • Can multi-agent Re-TRAC variants share states in parallel, then merge them robustly?
  • How to integrate reinforcement learning so the agent learns when to compress, what to keep, and how to pick the final answer from state histories?
  • Can we extend beyond web search to planning robots, code debugging, or scientific literature review with domain-specific evidence checkers?

06Conclusion & Future Work

3-sentence summary: Re-TRAC turns many isolated research tries into a single, learning process by compressing each attempt into a structured state and using it to guide the next. This simple loop—summarize, plan, explore new branches, avoid repeats—raises accuracy, cuts redundant tool calls and tokens, and works for both big and small models. Across tough web-research benchmarks, Re-TRAC consistently outperforms ReAct and common test-time scaling baselines.

Main achievement: Showing that cross-trajectory reflection with structured states is a powerful, general recipe for deep search agents—one that converts the large Pass@K potential into real, efficient gains.

Future directions: Build learned policies to decide what to compress and when; design stronger state auditors; integrate reinforcement learning for end-to-end optimization; explore multi-agent state merging; and adapt to other domains like code, science papers, and enterprise knowledge bases.

Why remember this: It’s a blueprint for making agents not just think longer, but think smarter between tries—like keeping a clean research notebook that actually gets used. That mindset—turning attempts into progress—will matter in any long, complex task where evidence, plans, and uncertainties evolve over time.

Practical Applications

  • Build a browsing agent that answers complex questions by running 4–8 Re-TRAC rounds instead of one long linear run.
  • Add a structured state summary step after each research attempt to capture verified facts, sources, and unexplored branches.
  • Use the continuation prompt that explicitly allows ignoring low-quality summaries to prevent the agent from getting stuck.
  • Track and cache verified facts so later rounds skip redundant tool calls and focus on new branches.
  • Train a small model with Re-TRAC-aware SFT using multi-round traces from a stronger teacher to boost on-device research.
  • Switch from Majority/Best-of-N to Re-TRAC when tool budgets are tight and you need higher accuracy per token.
  • Adopt audit facets (failed attempts, uncompleted proposals, discarded possibilities) for frontier models to surface missed leads.
  • Monitor token and tool-call curves across rounds to confirm that exploration is becoming more targeted.
  • Plug in a stronger summarizer for the compression step when using small models to improve the quality of states.
  • Extend Re-TRAC beyond web search to long-horizon tasks like literature review, code triage, or multi-step data retrieval.
#Re-TRAC #trajectory compression #deep research agents #ReAct #structured state representation #test-time scaling #Pass@K #majority voting #weighted voting #Best-of-N #tool calls #token usage #supervised fine-tuning #BrowseComp #cross-trajectory reflection