
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Intermediate
Hang Yan, Xinyu Che, Fangzhi Xu et al. Ā· 2/2/2026
arXiv Ā· PDF

Key Summary

  • This paper studies how AI agents get better while they are working, not just whether they finish the job.
  • It introduces TIDE, a simple toolkit that watches an agent’s step-by-step journey (its trajectory) and scores three things: speed of improvement, getting unstuck from mistakes, and how helpful memory really is.
  • AUV (Area Under Variation) measures how quickly success grows over time, so a fast, steady learner scores higher than a slow, late finisher even if both finally succeed.
  • LR (Loop Ratio) spots when an agent keeps doing the same failing actions, showing whether it adapts or just gets stuck in circles.
  • MI (Memory Index) checks if remembering past steps helps or hurts; surprisingly, more memory can be a burden in pure reasoning tasks.
  • Across many tasks (games, household sims, web shopping, GUIs), big models loop less, but loops still crush performance when they appear.
  • AUV reveals differences hidden by plain Success Rate; two agents with the same final score can have very different learning speeds.
  • Memory helps more in information-hunting tasks (like shopping) but can hurt in straightforward puzzles (like FrozenLake).
  • The takeaway: making better agents isn’t just about smarter thinking—it’s about smarter interacting with the world over time.
  • TIDE works with any agent and any environment, and even on old logs, making it a practical diagnostic for building stronger AI agents.

Why This Research Matters

In real apps, we want agents that don’t just finish, but finish smartly—learning from feedback, avoiding repeated mistakes, and using only the memory that helps. TIDE gives builders a simple, portable way to see where an agent wastes time, where it loops, and where memory helps or hurts. That means faster debugging for coding helpers, safer action for home robots, and better clicks for GUI agents and web shoppers. It also helps compare models fairly: not just who wins, but who learns faster and cleaner. With these insights, teams can design training, prompts, and memory strategies that boost real-world reliability, not just benchmark scores.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re learning to ride a bike. At first, you wobble a lot, but with each try, you adjust, remember what worked, and improve faster. How you get better matters just as much as whether you finally ride.

🄬 The Concept (LLM Agents): What it is: An LLM agent is an AI that reads, thinks, and takes actions step by step in a world (like solving puzzles, clicking buttons, or navigating a home). How it works: 1) It looks at the world (an observation). 2) It thinks about what to do (reasoning). 3) It acts (an action). 4) The world changes and gives feedback. 5) Repeat until done. Why it matters: Without this loop, the agent can’t learn from what happens and may keep making the same mistake. šŸž Anchor: A shopping agent searches, clicks a product, reads details, and clicks Buy. Each step depends on the last.

šŸž Hook: You know how a teacher doesn’t just care if you got the right answer, but also how you solved the problem? That’s because the process shows what you learned.

🄬 The Concept (Success Rate): What it is: Success Rate (SR) is a simple score: did the agent finish the task or not? How it works: Count successes, divide by total tasks, get a percent. Why it matters: It tells us if the agent can finish—but it hides how long it took or how messy the path was. Without more, a one-step win equals a 50-step stumble to the finish. šŸž Anchor: Two teams both win the game. One scores in 2 minutes, the other in overtime after many mistakes. SR treats them as identical.

šŸž Hook: Think of a time-lapse video of a plant growing. You don’t just want the final height; watching how it grows over days tells the real story.

🄬 The Concept (Interaction Trajectory): What it is: A trajectory is the whole story of steps an agent takes: observation → thought → action → new observation, again and again. How it works: We log each turn: what the agent saw, thought, did, and what happened next. Why it matters: Without the trajectory, we can’t tell if the agent improves quickly, gets stuck, or learns from mistakes. šŸž Anchor: A maze run recorded frame by frame shows if the runner learns or keeps bumping the same wall.

šŸž Hook: You know how during a test you sometimes figure things out as you go, using feedback from earlier questions? That’s improving at test time.

🄬 The Concept (Test-Time Improvement, TTI): What it is: TTI is an agent’s ability to get better while it is doing the task by using feedback, memory, and exploration. How it works: 1) Try something. 2) See what happens. 3) Adjust the plan. 4) Try again, smarter. Why it matters: In real worlds (apps, websites, homes), information appears as you act. Without TTI, agents miss clues, repeat errors, or waste steps. šŸž Anchor: A robot looking for a key opens a drawer, finds nothing, remembers that, and checks another place instead of re-opening the same drawer.

  1. The World Before: Agents were judged mostly by final Success Rate. This was like grading only the last page of homework. We couldn’t see if they reached success fast, whether they learned from errors, or if their memory helped or hurt.

  2. The Problem: Real tasks are messy. Agents must explore, handle surprises, and use past steps wisely. Existing metrics blur important differences: fast vs slow wins, smart fixes vs stubborn repeats, helpful memory vs noisy baggage.

  3. Failed Attempts: People tried counting the number of turns or measuring partial progress. But counting turns doesn’t tell whether those turns were useful or just loops. And mixing memory with other factors (like bigger models) made it hard to tell if memory truly helped.

  4. The Gap: We needed a way to: a) watch improvement over time, b) tell real adaptation from loops, and c) isolate the value of memory itself.

  5. Real Stakes: This affects daily life when AI agents help with coding, online shopping, using phones and computers, or organizing home tasks. If an agent can’t improve during the task, it may click the same wrong button 10 times, re-apply the same failed code patch, or keep searching past the useful results.

šŸž Anchor: It’s the difference between a helpful assistant who learns as they go and one who keeps asking the same question over and over.

02 Core Idea

šŸž Hook: Picture three health checkups for a runner: speed (how fast you improve), agility (can you pivot when you slip), and stamina (does carrying a backpack help or slow you down?).

🄬 The Concept (TIDE): What it is: TIDE is a simple, agent-agnostic diagnostic that breaks Test-Time Improvement into three parts: speed of improvement over time (AUV), getting unstuck vs looping (LR), and how useful memory is (MI). How it works: 1) Collect the full trajectory. 2) Draw a curve of success-by-turn. 3) Compute AUV from that curve. 4) Detect repeated state cycles and compute LR. 5) Re-run or re-score with and without memory to get MI. Why it matters: Without TIDE, we can confuse busy activity with progress, miss hidden loops, and assume more memory is always better—even when it hurts. šŸž Anchor: Like a coach’s dashboard that shows lap times (AUV), whether the runner keeps tripping on the same hurdle (LR), and whether that heavy water pack helps or drags them down (MI).

Multiple Analogies for the Key Insight:

  • Road Trip: AUV = how quickly miles add up; LR = making the same wrong turn again; MI = your notes—do they guide you or distract you?
  • Cooking: AUV = how fast dishes finish as you prep; LR = remaking the same burnt omelet; MI = recipe notes—helpful tips or clutter?
  • Video Game: AUV = leveling up rate; LR = respawning and repeating the same failed path; MI = map notes—do they reveal secrets or flood you with noise?

šŸž Hook: Remember racing games where you see your position over time? Finishing first is good, but pulling ahead early shows real skill.

🄬 The Concept (AUV—Area Under Variation): What it is: AUV measures how fast and steadily success grows across turns. How it works: 1) For each turn t, compute the fraction of tasks already solved. 2) Plot that over time. 3) Add up the area under that curve (trapezoids). Higher AUV = earlier and steadier wins. Why it matters: Two agents with the same final success can feel very different—AUV rewards the one that improves sooner, wasting fewer steps. šŸž Anchor: In a spelling bee, one student spells most words right from the start (high AUV), another only catches up at the end (lower AUV), even if both finish with similar totals.

šŸž Hook: You know how getting stuck in a loop feels like walking through a revolving door and ending up where you started?

🄬 The Concept (LR—Loop Ratio): What it is: LR measures how much the agent repeats the same unhelpful cycles. How it works: 1) Treat each step as moving between states. 2) Find cycles that return to the exact same state. 3) Count repeated copies of the same cycle. 4) LR = loop steps divided by all steps. Lower is better. Why it matters: A lot of turns doesn’t mean smart exploration. High LR means the agent is stubbornly retrying the same mistake. šŸž Anchor: A web agent keeps clicking the same disabled button five times—lots of clicks, no progress. That’s a loop.

šŸž Hook: Think of a backpack on a hike. Sometimes it holds a map (great!). Other times it’s stuffed with rocks (ouch!).

🄬 The Concept (MI—Memory Index): What it is: MI tells whether remembering the history helps or hurts. How it works: 1) Score performance with full working memory. 2) Score again with memory removed (just current state). 3) MI = with-memory score minus without-memory score. Positive MI helps, negative MI hurts. Why it matters: Longer context isn’t always better—extra details can distract or confuse. šŸž Anchor: In a puzzle, notes about old wrong guesses can either guide you away from traps—or waste your attention.

Before vs After:

  • Before: We mostly checked if the agent eventually succeeded (SR).
  • After: We can judge how efficiently it improved (AUV), whether it adapted or looped (LR), and if memory was useful (MI).

Why It Works (Intuition):

  • AUV values the timing of success, not just the final result. Early wins count more because they show efficient learning.
  • LR separates true adaptation (trying new paths) from fake activity (repeating errors).
  • MI isolates memory’s causal effect by comparing with/without memory, holding agent and environment constant.

Building Blocks:

  • Trajectory logging (observation, thought, action, next state).
  • Variation curve over turns for AUV.
  • State graph cycle detection for LR.
  • Memory ablation for MI (and optional small memory window tests).

šŸž Anchor: With TIDE, an agent that finishes fast, avoids getting stuck, and uses just the helpful bits of memory will shine clearly on all three dials.

03 Methodology

At a high level: Input (a set of agent trajectories) → Build a success-over-time curve → Compute AUV → Detect repeated state cycles and compute LR → Re-score with and without memory and compute MI → Output a three-part diagnosis of TTI.

šŸž Hook: Think of reviewing a game replay. You mark when you scored, when you ran in circles, and whether your notes helped you win.

🄬 The Concept (Trajectory as Data): What it is: A trajectory is the full step-by-step record—what the agent saw, thought, did, and what changed. How it works: We store turns as (observation, thought, action, next observation), plus success/failure. Why it matters: This gives us the raw material to measure improvement, loops, and memory. šŸž Anchor: A chess notation of every move lets a coach analyze strategy, blunders, and learning.
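To make the rest of the walkthrough concrete, here is a minimal sketch of how such a log could be represented in code. The field names (observation, thought, action, next_observation, solved_at_turn) are illustrative assumptions, not the paper’s exact schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One interaction step: what the agent saw, thought, did, and saw next."""
    observation: str
    thought: str
    action: str
    next_observation: str

@dataclass
class Trajectory:
    """A full episode: its turns plus the turn at which the task was first solved."""
    turns: List[Turn] = field(default_factory=list)
    solved_at_turn: Optional[int] = None  # 1-indexed turn of first success; None if never solved
```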

Step A: Compute the Success-Over-Time Curve (for AUV)

  • What happens: For each turn t, we compute Pt = the fraction of tasks already solved within t turns. This gives a step-by-step picture of how quickly success accumulates.
  • Why it exists: A final score hides timing. The curve shows pace and steadiness. Without it, two agents with the same end score look identical.
  • Example: Suppose across 100 tasks, by turn 3, 40 are solved (P3 = 0.40); by turn 6, 70 are solved (P6 = 0.70). The curve rises earlier if the agent is efficient.
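A rough sketch of this step, assuming we have recorded for each task the turn at which it was first solved (None if unsolved); the function name and layout are ours, not the paper’s.

```python
from typing import List, Optional

def success_curve(solve_turns: List[Optional[int]], t_max: int) -> List[float]:
    """P_t for t = 0..t_max: fraction of tasks already solved within t turns."""
    n_tasks = len(solve_turns)
    return [
        sum(1 for s in solve_turns if s is not None and s <= t) / n_tasks
        for t in range(t_max + 1)
    ]

# Example: 5 tasks solved at turns 2, 3, 3, and 7; one task never solved.
print(success_curve([2, 3, 3, 7, None], t_max=8))
# -> [0.0, 0.0, 0.2, 0.6, 0.6, 0.6, 0.6, 0.8, 0.8]
```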

Step B: Turn the Curve into a Single Number (AUV)

  • What happens: We add the area under the success curve (using simple trapezoids from turn t to t+1) and normalize by the max number of turns for that task. This is the Area Under Variation (AUV), between 0 and 1.
  • Why it exists: AUV rewards earlier improvement. Without it, late catch-ups look as good as early mastery.
  • Example: Two agents both end at 80% success by turn 20. Agent A hits 60% by turn 5 (high AUV). Agent B doesn’t cross 60% until turn 15 (lower AUV). Same SR, different AUV.
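Continuing the sketch, one plausible way to compute AUV from that curve with the trapezoid rule, normalizing by the horizon so the result lands in [0, 1]; the paper’s exact normalization may differ slightly.

```python
from typing import List

def auv(curve: List[float]) -> float:
    """Area under the success-over-time curve, normalized by the horizon t_max."""
    t_max = len(curve) - 1
    area = sum((curve[t] + curve[t + 1]) / 2.0 for t in range(t_max))  # trapezoids
    return area / t_max

# Same final success (0.8), different pacing: the early riser scores higher.
early = [0.0, 0.5, 0.7, 0.8, 0.8]   # most wins arrive early
late  = [0.0, 0.0, 0.1, 0.4, 0.8]   # only catches up at the end
print(auv(early), auv(late))        # ~0.60 vs ~0.23
```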

šŸž Hook: Have you ever walked in a maze and accidentally circled back to the same spot?

🄬 The Concept (State Graph and Cycles): What it is: We see the agent’s journey as moving between states; a cycle returns to the exact same state after some actions. How it works: When a state repeats, we check the path between those two visits. If the agent copies that same cycle again immediately, that’s a loop. Why it matters: Repeating the same cycle means no adaptation—just spinning wheels. šŸž Anchor: Clicking the same page button back and forth without getting closer to checkout.

Step C: Detect Loops and Compute LR

  • What happens: We scan the trajectory for cycles (start and end at the same state). We ignore nested cycles to keep things clean. If a cycle is repeated back-to-back, we count all its steps as loop steps. LR = loop steps / total steps.
  • Why it exists: Counting turns alone can mask failure. LR spots harmful repetition. Without LR, an agent that ā€œtries a lotā€ but repeats itself looks falsely active.
  • Example: In a grid world, if the agent goes Right→Left→Right→Left and ends up where it started, then does that exact sequence again, those steps inflate LR.
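A simplified sketch of loop counting over a sequence of state identifiers. It only flags a cycle when the agent immediately replays the exact same state sequence, which is one reasonable reading of the rule above, not the paper’s exact algorithm (which may use embedding-based state matching).

```python
from typing import Hashable, List

def loop_ratio(states: List[Hashable]) -> float:
    """Fraction of visited states spent inside an immediately repeated cycle (lower is better)."""
    n = len(states)
    loop_steps = 0
    i = 0
    while i < n:
        advanced = False
        # Look for the shortest cycle starting at i that is repeated back-to-back.
        for length in range(1, (n - i) // 2 + 1):
            cycle = states[i:i + length]
            if states[i + length:i + 2 * length] == cycle:
                loop_steps += 2 * length   # count the cycle and its immediate repeat
                i += 2 * length
                advanced = True
                break
        if not advanced:
            i += 1
    return loop_steps / max(n, 1)

# Right -> Left -> Right -> Left bounces between the same two states: mostly looping.
print(loop_ratio(["s0", "s1", "s0", "s1", "s0", "goal"]))  # ~0.67
```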

šŸž Hook: Think of a study notebook. A neat summary helps. A messy wall of scribbles distracts.

🄬 The Concept (Working Memory): What it is: The log of past observations and thoughts the agent can see at each turn. How it works: We can give the agent full history or just the latest state. Why it matters: Memory can guide better choices—but too much can create noise. šŸž Anchor: A short to-do list focuses you; a giant unfiltered log can overwhelm you.

Step D: Measure Memory’s Help or Harm (MI)

  • What happens: We measure performance with memory (full history available) and without memory (only current task description + immediate observation). MI = AUV_with_memory āˆ’ AUV_without_memory.
  • Why it exists: It isolates memory’s contribution. Without MI, we can’t tell if longer context helps, hurts, or does nothing.
  • Example: In a shopping task (partially observable), memory of previous pages helps (positive MI). In FrozenLake (fully observable), excess history might distract (negative MI).
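A sketch of the memory ablation and the resulting index; the context-building helper is a hypothetical illustration of ā€œfull memory vs. current state only,ā€ not the paper’s prompt format, and the numbers below are made up to mirror the direction of the findings.

```python
from typing import Dict, List

def build_context(task: str, history: List[Dict[str, str]], current_obs: str,
                  use_memory: bool) -> str:
    """Assemble the agent's input with full working memory, or with the current state only."""
    lines = [f"Task: {task}"]
    if use_memory:
        for turn in history:  # full interaction history
            lines.append(f"Saw: {turn['obs']} | Did: {turn['action']}")
    lines.append(f"Current observation: {current_obs}")
    return "\n".join(lines)

def memory_index(auv_with_memory: float, auv_without_memory: float) -> float:
    """MI > 0: memory helps; MI < 0: memory hurts; MI ~ 0: memory is neutral."""
    return auv_with_memory - auv_without_memory

print(memory_index(0.62, 0.48))   # WebShop-like task: positive MI, memory helps
print(memory_index(0.41, 0.55))   # FrozenLake-like task: negative MI, memory distracts
```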

Optional Memory Window Test

  • What happens: Keep only the N most recent turns (N = window size) to see where added history stops helping.
  • Why it exists: Finds the sweet spot before memory becomes clutter.
  • Example: Performance improves up to ~5 turns of history, then plateaus.
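A tiny sketch of the window test: truncate working memory to the N most recent turns and re-score each setting. The window sizes and history contents here are illustrative only.

```python
from typing import Dict, List

def windowed_memory(history: List[Dict[str, str]], window: int) -> List[Dict[str, str]]:
    """Keep only the `window` most recent turns of working memory."""
    return history[-window:]

history = [{"obs": f"o{i}", "action": f"a{i}"} for i in range(12)]
for n in (1, 3, 5, 10):
    trimmed = windowed_memory(history, n)
    # Re-run the agent with `trimmed` as its memory and compare AUV across window sizes.
    print(f"window={n}: {len(trimmed)} turns kept")
```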

The Secret Sauce

  • AUV turns a final grade into a growth curve score—rewarding early, steady improvement.
  • LR distinguishes busy repetition from true adaptation.
  • MI answers, ā€œIs memory a map or a weight?ā€

Putting It All Together

  • Input: Trajectories from any agent and environment.
  • Process: Compute AUV (temporal efficiency), LR (adaptation vs loops), MI (memory utility).
  • Output: A three-dial diagnosis showing if the agent wins fast, avoids getting stuck, and uses memory wisely.

šŸž Anchor: Like a car dashboard with speed (AUV), traction control warning (LR), and fuel usefulness (MI) so you can tune the driver and the route, not just celebrate reaching the destination.

04 Experiments & Results

The Test: The authors evaluated many open-source and proprietary agents across five well-known environments: BlocksWorld (rearranging blocks), FrozenLake (grid navigation), Sudoku (logic puzzle), AlfWorld (household tasks), and WebShop (online shopping). They also applied TIDE to GUI agent logs (AndroidWorld, OSWorld, WindowsAgentArena) without re-running the agents, proving TIDE can analyze existing trajectories post-hoc.

What They Measured and Why:

  • AUV: To see who improves early and steadily, not just who finishes eventually.
  • LR: To see who adapts vs who gets stuck in repeated mistakes.
  • MI: To see if memory helps (information-bound tasks) or hurts (reasoning-bound tasks without hidden info).
  • SR: For context (the traditional final-success snapshot).

The Competition: They compared many agents, including small and large models, and ā€œthinkingā€ variants (which generate more reasoning) versus ā€œnon-thinkingā€ variants (which respond without extended reasoning). Large frontier models like Gemini and DeepSeek served as strong baselines.

Scoreboard with Context:

  • AUV vs SR: In several cases, two agents had the same final SR but different AUV, meaning one learned faster. For example, in AlfWorld, two models tied on SR (~0.807), but Gemini 2.5 Pro had higher AUV than DeepSeek-V3.2, showing earlier, steadier improvement—like getting an A+ for pacing when others had a B for slow starts.
  • LR Findings: Many models showed high LR in some environments, meaning they repeated failing cycles often. Higher LR correlated with lower AUV—when you loop, you lose time and chances to adapt. Bigger models generally had lower LR, suggesting more diverse strategies and better adaptation.
  • MI Surprises: Memory didn’t always help. In FrozenLake and other reasoning-bound settings, MI was often negative—history distracted more than it helped. In WebShop (information-bound), MI tended to be positive—remembering prior pages and options improved decisions. Memory window tests showed benefits saturating around a few recent steps; beyond that, extra history added clutter.

Concrete Takeaways:

  • AUV reveals efficiency that SR hides: a quick learner beats a late bloomer even if totals match.
  • High LR is a red flag: lots of action can still be low-quality if it’s the same wrong thing.
  • Memory needs management: just adding more context isn’t a free win.

GUI Agent Diagnostics (Post-hoc TIDE):

  • Applying TIDE to OSWorld logs showed sharp drops in AUV when loops were present, especially with repeated Click actions. This suggests grounding (linking text instructions to the right UI element) is still a bottleneck. One model (Claude Sonnet) stayed relatively robust due to fewer click loops, hinting that reducing loop-prone moves can strongly protect performance.

Surprising Findings:

  • Negative MI in reasoning tasks: More memory can make the agent worse by distracting it with irrelevant history.
  • Low LR is necessary but not sufficient: avoiding loops helps, but you also need strong reasoning and exploration to push AUV high.
  • Bigger isn’t everything: Some reasoning-boosted models didn’t automatically convert inner thoughts into better real-world actions in information-bound tasks—thinking and doing must be aligned.

Overall, TIDE repeatedly uncovered hidden strengths and weaknesses that plain SR would have missed.

05 Discussion & Limitations

Limitations:

  • Coverage: AUV, LR, and MI don’t capture every nuance (like tool-use quality or planning granularity); they’re core dials, not the entire dashboard.
  • State Matching: Loop detection needs a good notion of ā€œsame state.ā€ For text and GUIs, this can be tricky (the paper uses exact match or embeddings with thresholds). Errors here can over- or under-count loops.
  • MI Assumptions: MI compares with vs without memory under the same agent and environment. If prompts or backends change, MI reflects more than just memory.
  • Horizon Choice: AUV depends on the chosen max turns (t_max). Different horizons can weight early or late improvements differently.

Required Resources:

  • Trajectories: You need the step-by-step logs (observations, actions, and success flags). No special model mods are required.
  • Light Compute: AUV, LR, and MI are simple to compute; MI adds one extra configuration/run or re-scoring pass.

When NOT to Use:

  • Single-Shot Tasks: If tasks finish in one step, temporal dynamics (AUV, LR, MI) add little.
  • Noisy or Unstable States: If the environment can’t provide reliable state comparisons, LR may be unreliable.
  • Memory-Immutable Agents: If you cannot practically change memory availability, MI will be hard to estimate.

Open Questions:

  • Better State Identity: How can we robustly detect same-state cycles for very rich, multimodal worlds?
  • Memory Management: What’s the best way to summarize or filter history automatically so MI stays positive?
  • From Thinking to Doing: How can we translate internal reasoning into better external actions in partially observable worlds?
  • Adaptive Horizons: Can AUV be extended with adaptive windows that match task complexity without manual tuning?
  • Loop-Aware Training: Can we reduce LR by adding loop penalties or diversity bonuses during training or decoding?

Bottom line: TIDE is a strong, practical start for diagnosing improvement during interaction, but there’s room to expand the toolkit and tighten the measurements in richer environments.

06 Conclusion & Future Work

Three-Sentence Summary:

  • This paper introduces TIDE, a simple diagnostic that watches how AI agents improve while they work, not only whether they finish.
  • It measures three things: AUV for fast, steady improvement; LR for getting unstuck versus looping; and MI for whether memory helps or distracts.
  • Across many environments, TIDE exposes hidden differences that Success Rate alone misses, guiding better agent design.

Main Achievement:

  • A clear, agent-agnostic, environment-agnostic framework that decomposes Test-Time Improvement into three intuitive, measurable parts, revealing bottlenecks like loops and memory burden that block progress.

Future Directions:

  • Smarter memory (summarization and filtering) to keep MI positive; loop-aware training and decoding to reduce LR; robust state-matching for complex multimodal worlds; and integrating more dials (like planning quality) for a fuller picture.

Why Remember This:

  • Because real-world agents must improve as they go. TIDE turns that fuzzy idea into three concrete dials you can compute today, helping builders make agents that learn faster, avoid getting stuck, and carry only the memory that truly helps.

Practical Applications

  • Benchmark agents with AUV to select those that learn faster, not just those that eventually succeed.
  • Add loop detectors (LR) to stop and rethink when cycles are detected, reducing wasted actions.
  • Tune memory windows to the smallest helpful size to keep MI positive and avoid distraction.
  • Use MI to decide when to add summaries or filters for long histories in partially observable tasks.
  • Set training rewards that penalize loops and reward genuine exploration to lower LR during learning.
  • Create dashboards that show SR, AUV, LR, and MI side by side for quick diagnosis after runs.
  • Compare prompts or decoding strategies by their effect on AUV/LR to pick more adaptive setups.
  • Prioritize grounding fixes in GUI agents when loops are dominated by repeated Click actions.
  • Design curricula that gradually increase partial observability and track MI to guide memory teaching.
  • Do post-hoc audits of existing logs with TIDE to find hidden bottlenecks without re-running agents.
#Test-Time Improvement #LLM agents #trajectory analysis #Area Under Variation #Loop Ratio #Memory Index #behavior adaptation #working memory #agent-environment interaction #diagnostic evaluation #temporal dynamics #POMDP #MDP #GUI agents #autonomous agents