AgentOCR: Reimagining Agent History via Optical Self-Compression
Key Summary
- AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.
- It splits the history into reusable picture pieces (segments) and caches them to avoid drawing the same thing twice.
- The agent also picks how much to shrink each image (a compression factor) to save even more tokens when details aren’t critical.
- A special reward teaches the agent to compress only when it still succeeds at the task, balancing accuracy and cost.
- Across two tough benchmarks (ALFWorld and search-based QA), AgentOCR keeps over 95% of text-agent performance while cutting tokens by more than 50%.
- Its segment optical caching speeds up rendering by over 20× and reduces peak cache memory compared to a naive approach.
- Compression works best up to about 55% token savings; pushing harder can blur details and hurt performance, especially for text-heavy tasks.
- AgentOCR is model-agnostic but was tested with Qwen2.5-VL models and GRPO reinforcement learning.
- It’s a strong fit for long, multi-step agents whose context explodes, such as web search, tool use, or embodied tasks.
- Limitations include dependence on VLMs not tailored for OCR, fixed rendering choices, and mainly text-centric histories.
Why This Research Matters
Long-horizon agents are powerful but quickly become expensive and slow because their text history explodes. AgentOCR makes that history compact by turning it into images, letting the same models think longer with fewer tokens. This saves money, reduces latency, and keeps accuracy high enough for real work. It also scales RL training by lowering the cost of long rollouts. In practical terms, you can build assistants that remember more and still run fast on limited hardware. As agents enter everyday tools and apps, such efficiency becomes the difference between a neat demo and a dependable product.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re playing a long treasure hunt game. You keep notes about every clue you found and every choice you made. After many rounds, your notebook gets so stuffed that flipping back to find what you need takes forever.
🥬 The Concept (Reinforcement Learning, RL):
- What it is: RL is a way for an AI agent to learn by trying actions and getting rewards, like a game scoring system.
- How it works:
- The agent sees the current situation (an observation).
- It chooses an action.
- The world gives a reward and a new observation.
- The agent updates its policy to pick better actions next time.
- Why it matters: Without RL, agents don’t get feedback about which choices actually help them win over many steps. 🍞 Anchor: Like practicing basketball free throws; you shoot (action), see if it scores (reward), and adjust your aim (policy) for next time.
🍞 Hook: You know how reading one text message is easy, but scrolling back through a year of messages is slow and confusing?
🥬 The Concept (Long-context processing):
- What it is: Long-context processing means handling very long histories of text so the agent remembers what happened before.
- How it works:
- Keep a growing history of observations and actions.
- Feed the whole (or key parts of) history into the model each turn.
- The model attends to relevant pieces to decide the next move.
- Why it matters: Without it, the agent forgets earlier clues, repeats mistakes, and can’t plan across many steps. 🍞 Anchor: Solving a mystery novel is hard if you forget earlier chapters; long-context lets you keep the whole story in mind.
The world before: LLM agents were getting good at multi-step tasks—controlling apps, searching the web, or navigating houses in simulators. But their memories grew like vines. Every turn added more text, which meant more tokens to process, longer delays, more memory, and hitting model context limits.
The problem: In RL training and real use, the agent must keep lots of past information. Plain-text histories are bulky. As sequences get longer (thousands or even tens of thousands of tokens), attention becomes slow and memory-hungry. Costs skyrocket, and training becomes painful.
Failed attempts: People tried prompt summarization, more efficient attention mechanisms, retrieval, and hierarchical memory. These helped trim some tokens but often lost details, added complexity, or still struggled when histories got extremely long and verbose.
The gap: There wasn’t a widely used way to keep almost all the details while massively shrinking token counts. A new angle was needed: represent the same information in a denser form that models can still read.
A new clue: Visual-language research and OCR showed that pictures (with text inside them) can pack more information per token than raw text. One image token can carry lots of characters’ worth of text.
Real stakes: If we can compress history better, agents can:
- Think longer without forgetting.
- Train faster and cheaper with RL.
- Run on smaller machines or serve more users.
- Work better in everyday tasks: shopping bots that compare many products, tutoring systems that follow months of student progress, or home robots that remember long chore lists.
In short, AgentOCR explores a simple yet powerful idea: turn the agent’s history into images, then add clever caching and agent-controlled compression so the agent keeps what matters without carrying a heavy text backpack.
02 Core Idea
Aha moment in one sentence: If an agent’s long text history is turned into images and the agent learns how much to compress those images, we can keep performance high while slashing token costs.
🍞 Hook: You know how a class poster can show a lot of info in a small space, while the same info as paragraphs takes multiple pages?
🥬 The Concept (Visual tokens):
- What it is: Visual tokens are the image patches a vision-language model uses to read an image.
- How it works:
- Render text history as an image.
- Split the image into patches (visual tokens).
- Feed those tokens to a VLM.
- Why it matters: Images can carry dense text visually, so you need fewer tokens to represent the same content. 🍞 Anchor: A single class photo can show every student at once; you don’t need a separate paragraph describing each one.
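To make the density claim concrete, here is a minimal Python sketch comparing rough token counts. It assumes about 4 characters per text token and a 28×28-pixel grid per visual token (in the ballpark of Qwen2.5-VL-style encoders); both numbers are illustrative assumptions, not figures from the paper.

```python
import math

def text_tokens(text, chars_per_token=4.0):
    """Rough text-token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)

def visual_tokens(width_px, height_px, patch_px=28):
    """Visual tokens for an image split into patch_px x patch_px patches."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

# Hypothetical history: ~8,000 characters rendered into a 1120x560 image.
print(text_tokens("x" * 8000))     # ~2000 text tokens
print(visual_tokens(1120, 560))    # 40 * 20 = 800 visual tokens
```

Under these assumptions, the same history costs less than half as many tokens as an image, which is the core intuition behind optical memory.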
🍞 Hook: Imagine squishing your drawing just a little to fit more on your corkboard, but not so much that you can’t read the labels.
🥬 The Concept (Optical self-compression):
- What it is: Convert text to an image and allow adjustable downsizing so the agent can control detail vs. cost.
- How it works:
- Turn history text into a neatly formatted image.
- Choose a compression factor (how much to shrink it).
- Use the shrunken image as history input.
- Why it matters: Without compression, image tokens can still grow too big; with it, the agent flexibly saves tokens when details aren’t critical. 🍞 Anchor: Like zooming out slightly on a map to see more streets at once, but not so far that street names blur.
🍞 Hook: Think of building a scrapbook. If you already have a sticker page saved, you don’t redraw it—you reuse it.
🥬 The Concept (Segment optical caching):
- What it is: Split history into small text segments, render each to an image once, and reuse cached images whenever the same segment appears.
- How it works:
- Break the full history into segments (like lines/blocks).
- Make a content key (hash) for each segment.
- If the segment was seen before, reuse its image; else render and cache it.
- Stack segment images to form the full history image.
- Why it matters: Without caching, you keep redrawing everything every turn, which is slow and wasteful. 🍞 Anchor: Like reusing Lego bricks you already built instead of rebuilding the same piece from scratch every time.
🍞 Hook: When you take notes, sometimes you write tiny to fit more on the page, and other times you write big to keep it readable.
🥬 The Concept (Agentic self-compression):
- What it is: The agent decides how much to compress the history image at each step and learns a policy to balance success with cost.
- How it works:
- The agent outputs a compression factor along with its action.
- The system shrinks the image accordingly for the next step.
- A special reward gives bonus points for successful episodes that used higher compression (but not if it hurts success).
- The bonus is given intermittently so the agent doesn’t over-compress.
- Why it matters: A fixed compression can be too blurry sometimes and too costly other times; adaptive control gets the best trade-off. 🍞 Anchor: Like choosing when to speak softly to save energy and when to speak loudly so everyone hears you.
Three analogies for the whole idea:
- Suitcase packing: Turn text into a folded poster (image), reuse packing cubes (cache), and decide how tightly to roll clothes (compression) depending on the trip.
- Library microfilm: Store many pages as compact film (images), reuse common inserts (cache), and adjust zoom (compression) based on how detailed you need to read.
- City map: Combine neighborhoods (segments), reuse known blocks, and zoom level (compression) depends on whether you’re driving across town or finding a house number.
Before vs after:
- Before: Agents fed giant text histories; tokens exploded; training and inference slowed down.
- After: Agents read compact history images, reuse repeated parts, and smartly tune detail level—keeping accuracy high with far fewer tokens.
Why it works (intuition, not equations):
- Images pack text densely, so token counts drop.
- Caching removes repeated rendering, so latency drops.
- Letting the agent choose compression avoids a one-size-fits-all setting; it spends tokens where details matter and saves them where they don’t.
Building blocks:
- A deterministic renderer to turn text into clean images.
- A segmenter + hash-based cache to reuse identical text blocks.
- A VLM backbone that understands images with text.
- An RL objective that adds a gentle nudge toward compression only when tasks succeed, and only some of the time, to avoid greediness.
03 Methodology
High-level recipe: Instruction and observation → Fetch full history → Split into segments → Render or reuse cached segment images → Stack into one history image → Agent picks action and compression → Environment steps → Repeat with RL updates.
Step A: Keep the history tidy as text
- What happens: The system stores each observation-action pair in a memory buffer and fetches the full text history each turn.
- Why it exists: We need a single, consistent source of truth before rendering.
- Example: After 3 turns in ALFWorld, the buffer might show [Observation: “You’re in kitchen”] [Action: “open fridge”] [Observation: “You see milk”] [Action: “take milk”] [Observation: “Holding milk”].
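A minimal sketch of what such a buffer could look like, assuming the history is kept as tagged (role, text) pairs; the class and method names are illustrative, not the paper’s API.

```python
from dataclasses import dataclass, field

@dataclass
class HistoryBuffer:
    """Single source of truth for the raw text history; rendering happens later."""
    entries: list = field(default_factory=list)  # [(role, text), ...]

    def add(self, role, text):
        self.entries.append((role, text))

    def as_lines(self):
        # One tagged line per entry; these lines later become the cacheable segments.
        return [f"[{role}] {text}" for role, text in self.entries]

buf = HistoryBuffer()
buf.add("Observation", "You're in the kitchen")
buf.add("Action", "open fridge")
print(buf.as_lines())  # two tagged lines, ready to be segmented and rendered
```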
🍞 Hook: Picture cutting a long letter into postcards so you can file them neatly.
🥬 The Concept (Segment optical caching, focused steps):
- What it is: Break the history into segments, render once per unique segment, and reuse.
- How it works:
- Split the history text into segments (e.g., lines or logical blocks).
- Make a hash key from each segment’s normalized content.
- On cache hit: grab the existing segment image; on miss: render it, store it.
- Stack images (usually vertically) to get the full history image.
- Why it matters: Without segment reuse, rendering time and memory balloon as the history grows. 🍞 Anchor: Like a sticker book where each unique sticker is printed once; you just place copies whenever the same sticker appears.
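Here is a small Python sketch of the caching idea, assuming segments are keyed by a SHA-256 hash of their normalized text and rendered with Pillow; the renderer, key function, and layout are simplifications, not the paper’s implementation.

```python
import hashlib
from PIL import Image, ImageDraw, ImageFont

def render_segment(text, width=1120, height=28):
    """Minimal one-line renderer, included only to keep the sketch self-contained."""
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((4, 4), text, fill="black", font=ImageFont.load_default())
    return img

class SegmentCache:
    """Hash-keyed cache: each unique segment is rendered once and reused afterwards."""
    def __init__(self):
        self.images = {}  # sha256 of normalized segment text -> PIL image

    def get_or_render(self, segment):
        key = hashlib.sha256(segment.strip().encode("utf-8")).hexdigest()
        if key not in self.images:           # cache miss: render once, then store
            self.images[key] = render_segment(segment)
        return self.images[key]              # cache hit: reuse the stored image

def stack_vertically(images):
    """Stack per-segment images into one tall history image."""
    canvas = Image.new("RGB", (images[0].width, sum(im.height for im in images)), "white")
    y = 0
    for im in images:
        canvas.paste(im, (0, y))
        y += im.height
    return canvas

cache = SegmentCache()
segments = ["[Observation] You're in the kitchen", "[Action] open fridge"]
history_image = stack_vertically([cache.get_or_render(s) for s in segments])
```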
Step B: Render the optical memory
- What happens: A deterministic renderer converts text (with simple colors for roles like [Observation] in blue, [Action] in red) into a clean, legible image.
- Why it exists: Consistency helps the VLM read text-in-image reliably.
- Example: The ALFWorld history becomes a tall image with monospace font, fixed width, and color-coded tags.
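A hedged sketch of such a deterministic renderer using Pillow, with the blue/red role colors mentioned above; the font, column layout, and exact color values are assumptions of this illustration.

```python
from PIL import Image, ImageDraw, ImageFont

# Illustrative role -> color scheme (observations in blue, actions in red).
ROLE_COLORS = {"Observation": (0, 0, 200), "Action": (200, 0, 0)}

def render_history(lines, width=1120, line_height=28):
    """Deterministically render tagged (role, text) lines into one tall image."""
    font = ImageFont.load_default()
    img = Image.new("RGB", (width, line_height * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, (role, text) in enumerate(lines):
        color = ROLE_COLORS.get(role, (0, 0, 0))
        draw.text((4, i * line_height + 4), f"[{role}]", fill=color, font=font)
        draw.text((130, i * line_height + 4), text, fill=(0, 0, 0), font=font)
    return img

img = render_history([("Observation", "You're in the kitchen"), ("Action", "open fridge")])
img.save("history.png")
```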
Step C: Compress the history image
- What happens: The image is downscaled by a factor chosen by the agent (compression factor ≥ 1). Higher factor = fewer visual tokens, but potentially blurrier small text.
- Why it exists: Even images can grow with long histories; controlled shrinking keeps token use in check.
- Example: If the history image is 1120×560 and the agent picks a compression factor of 1.21, the system shrinks both height and width by the square root of 1.21 (about 1.1×), reducing the visual-token count by roughly 17%.
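The downscaling step itself is simple; here is a sketch assuming the compression factor scales the token count, so each side shrinks by its square root (the resize call and interpolation choice are assumptions of this sketch).

```python
import math
from PIL import Image

def compress_history_image(img, factor):
    """Shrink width and height by sqrt(factor), so the visual-token count
    drops by roughly `factor` (assumes factor >= 1)."""
    if factor <= 1.0:
        return img
    scale = 1.0 / math.sqrt(factor)
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size)  # default interpolation; a deliberate simplification

# A 1120x560 image at factor 1.21 becomes roughly 1018x509,
# i.e. about 17% fewer visual tokens (1 - 1/1.21).
```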
🍞 Hook: Sometimes you squint less and sometimes more, depending on what you need to read.
🥬 The Concept (Agentic self-compression, focused steps):
- What it is: The agent outputs both the environment action and the desired compression factor for the next step.
- How it works:
- The policy reads the instruction plus the current history image.
- It decides the next action (e.g., search query, tool call, navigation move).
- It also outputs a compression factor inside a special tag for the system to apply before the next turn.
- Training gives a small extra reward for using more compression, but only if the episode succeeds, and only at spaced intervals (every K iterations) to avoid over-compression.
- Why it matters: A static compression level is a compromise; adaptive control makes compression context-aware. 🍞 Anchor: Like picking whether to take a high-res or low-res photo depending on whether you’re shooting a poster (need detail) or a landscape (coarser is fine).
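One way the system could read the factor back out of the model’s output, assuming a placeholder <compress> tag (the paper mentions a special tag, but its exact format isn’t given here):

```python
import re

# Placeholder tag format; the exact tag name and syntax are assumptions.
COMPRESS_TAG = re.compile(r"<compress>\s*([0-9]*\.?[0-9]+)\s*</compress>")

def parse_compression(model_output, default=1.0, lo=1.0, hi=4.0):
    """Extract the agent-chosen compression factor and clamp it to a sane range."""
    match = COMPRESS_TAG.search(model_output)
    if not match:
        return default
    return min(hi, max(lo, float(match.group(1))))

print(parse_compression("I will open the fridge. <compress>1.21</compress>"))  # 1.21
print(parse_compression("No tag in this output"))                              # 1.0 fallback
```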
Step D: Policy learning with RL (GRPO as an example)
- What happens: The agent collects rollouts, gets rewards from the environment, plus an occasional compression bonus when it succeeds, and updates its policy.
- Why it exists: RL links choices (both actions and compression) to long-term success and efficiency.
- Example: In search QA, correct answers get reward 1, wrong get 0. If the episode succeeds and the agent used some compression, it gets a small log-shaped bonus, lightly encouraging efficiency.
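A sketch of how such a compression-aware reward could be wired up, assuming a binary task reward, a log-shaped bonus, and the K=5 spacing reported in the experiments; the bonus scale and exact functional form are assumptions.

```python
import math

def shaped_reward(task_reward, compression, iteration, k=5, bonus_scale=0.1):
    """Task reward plus an occasional compression bonus: added only when the
    episode succeeds and only every k-th iteration, with a log-shaped size."""
    success = task_reward > 0
    if success and iteration % k == 0:
        return task_reward + bonus_scale * math.log(compression)
    return task_reward

print(shaped_reward(1.0, 1.21, iteration=5))   # 1.0 + 0.1 * ln(1.21), about 1.019
print(shaped_reward(0.0, 2.00, iteration=5))   # 0.0, no bonus on failure
```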
Secret sauce:
- The trio of dense visual memory + segment caching + adaptive compression. Visual tokens drastically cut token counts, caching removes repeated work and storage waste, and self-compression lets the agent spend or save tokens at the right times.
What breaks without each step:
- No rendering to images: Token counts remain huge, hitting context limits.
- No segment cache: Rendering becomes a latency bottleneck as history grows.
- No self-compression: Either too many tokens (if set low) or too blurry (if set high), with no smart middle ground.
- No compression-aware reward: The agent won’t learn when it’s safe to save tokens and when it must preserve detail.
04 Experiments & Results
The tests: Two very different long-horizon agent worlds.
- ALFWorld: Embodied tasks like pick, place, heat, and cool objects—long sequences but often less text-dense.
- Search-based QA: Multi-turn web-style queries with dense text passages—shorter horizons but very text-heavy.
The competition (baselines):
- Text (w/o RL): Plain text histories to a text model.
- OCR (w/o RL): Rendered images to a VLM, no RL.
- Text + GRPO: Strong text baseline trained with RL.
- AgentOCR: Our method—rendered images + caching + self-compression + RL.
Scoreboard with context:
- Token savings: AgentOCR reduced average tokens per step by more than 50% across tasks. Peak token usage dropped even more (up to ~81% in search QA on 7B).
- Performance: On ALFWorld, AgentOCR matched text+RL within about 1% (e.g., ~81.2% vs. 81.8% on 7B). On search QA, it preserved over 95% of text+RL performance (e.g., 40.1% vs. 41.9% on 7B).
- Caching efficiency: Segment optical caching delivered over 20Ă— rendering speedup compared to no cache, and cut peak cache memory relative to a naive append-only cache.
Make the numbers feel real:
- Think of grades: If text+RL gets an A, AgentOCR gets an A- but pays half the price in tokens. In peak moments (big history spikes), it keeps calm by chopping token spikes down by up to about 4–5×.
Surprising findings:
- Safe zone: There’s a robust zone up to about 55% token savings where performance stays above 95% of text.
- Task sensitivity: ALFWorld stayed resilient even at strong compression (still ~87% at 2.0Ă—), but search QA was more sensitive (~67%), because tiny text details blur first.
- Reward schedule matters: Giving the compression bonus every iteration made the agent compress too much (success dropped sharply). Spacing the bonus (every K=5 iterations) kept performance high while still saving tokens.
Ablations (what each piece adds):
- Without RL, optical history alone saves tokens but loses accuracy. With RL, the model learns to read images well and recovers performance.
- Self-compression without the spaced bonus became greedy (very high compression, big accuracy drop). With spacing, it balanced nicely (fewer tokens, same success).
05 Discussion & Limitations
Limitations:
- VLM dependence: The backbone VLMs weren’t tailor-made for OCR. Different architectures or visual tokenizers could change results.
- Fixed rendering style: One renderer with fixed fonts and colors was used; poor choices could reduce readability and hurt reasoning.
- Mostly text-as-image: Real worlds also have screenshots, charts, and tables that aren’t just text; compressing those requires more advanced strategies.
Required resources:
- GPU memory for VLMs (experiments used H100s), storage for per-episode segment caches, and a fast hash+render pipeline.
- A stable RL setup (e.g., GRPO) and tool-wiring so the agent can output a compression factor.
When not to use:
- If your history is tiny, plain text is simpler.
- If tasks demand reading tiny text at all times (e.g., precise citations), aggressive compression may harm accuracy.
- If your VLM struggles with text-in-image reading, you might need OCR-specialized models first.
Open questions:
- Hybrid memory: When should we keep some parts as text and others as images?
- Smarter rendering: Can layout-aware or vector-based rendering boost legibility at high compression?
- Broader backbones: How do specialized OCR/VLM models affect the trade-off curve?
- Multimodal histories: What’s the best way to compress mixed content like tables, plots, and GUI screenshots?
- Adaptive segmentation: Could the agent also learn how to segment and style its own memory for even better reuse and clarity?
06 Conclusion & Future Work
Three-sentence summary: AgentOCR turns an agent’s growing text history into compact images, then reuses repeated segments and lets the agent choose how much to compress. With RL and a gentle compression-aware reward, it preserves over 95% of text-agent performance while cutting tokens by more than half. This makes long-horizon training and inference faster, cheaper, and more scalable.
Main achievement: Showing that image-based histories with segment caching and agent-controlled compression can deliver large, reliable token savings without sacrificing task success.
Future directions: Explore OCR-specialized or layout-aware VLMs, hybrid text+image memories, compression for tables/plots/screenshots, and learned rendering/segmentation styles. Tie memory more tightly to retrieval so the agent can fetch high-res snippets only when needed.
Why remember this: It reframes a core bottleneck—long, expensive histories—into a new modality where information is denser and control is adaptive. That simple shift unlocks practical gains across many agent systems, bringing long-context reasoning closer to everyday viability.
Practical Applications
- Speed up web-search agents that must track multiple queries and results across many steps.
- Reduce serving costs for customer-support bots that keep long conversation histories.
- Enable on-device assistants with limited memory to handle longer tasks without offloading everything to the cloud.
- Accelerate RL fine-tuning of tool-using agents by shrinking rollout token footprints.
- Boost GUI automation agents that repeat similar steps by caching recurring UI messages.
- Support tutoring systems that follow a student’s progress over months while keeping latency low.
- Improve robotics/embodied agents that need long episodic memory for household tasks.
- Power research assistants that collect and reason over many documents without hitting context limits.
- Enhance meeting assistants that summarize and recall long multi-speaker transcripts efficiently.