An Information Theoretic Perspective on Agentic System Design
Key Summary
- The paper shows that many AI systems work best when a small 'compressor' model first shrinks long text into a short, info-packed summary and a bigger 'predictor' model then reasons over that summary.
- They treat the compressor like a noisy walkie-talkie and measure how much useful information actually survives the trip using mutual information (MI).
- A simple MI estimator, computed directly from model log-probabilities, predicts downstream task performance without running full end-to-end tests.
- Bigger compressors are surprisingly more efficient: they keep more important bits while using fewer tokens, so you pay less compute for better quality.
- Scaling the compressor matters far more than scaling the predictor; increasing compressor size can yield large accuracy gains, while making the predictor huge often gives only small improvements.
- On LongHealth and FinanceBench, 7–8B compressors are up to 3.1× more accurate than 1–1.5B models and up to 4.6× more concise.
- A Qwen-2.5 7B compressor carries about 5.5× more bits of information per token than a 1.5B version and adds only about 1.3% more FLOPs-per-generation.
- Information rate (bits per token) strongly correlates with accuracy and perplexity, creating a task-agnostic proxy for compression quality.
- In a Deep Research setup, small local compressors (~3B) recover ~99% of frontier-level accuracy while cutting API costs by about 74%.
- Practical rule of thumb: front-load compute into local compressors and choose models/families that maximize information density.
Why This Research Matters
Long inputs are common in real life (medical files, financial reports, legal documents, web research), and models often stumble when given everything at once. This work provides a simple way to check if a summary truly preserves the needed information, no matter the task. With better compressors, we can run more intelligence locally on laptops and phones, lowering latency, costs, and privacy risks. Teams can stop guessing which component to upgrade and instead use MI to guide design choices that consistently improve quality. The approach scales to practical systems like Deep Research, showing that small local compressors can deliver frontier-level results at a fraction of the price.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're packing for a long trip with only a small suitcase. You need to squeeze everything important into a tiny space, or the rest of your journey becomes annoying and expensive.
The Concept (Agentic LM systems):
- What it is: Many modern AI apps use teams of language models where one model shrinks big piles of text (the compressor) and another model reads that short version to answer questions (the predictor).
- How it works:
- Gather a long context (web pages, PDFs, chats).
- A smaller LM compresses it into a short, focused summary.
- A bigger LM reads that summary and produces the final answer.
- Why it matters: Without this division of labor, the big model can choke on too many tokens (context rot), get slow, or become too expensive.
Anchor: Think of a school group project: one friend skims the whole book (compressor) and gives a sharp outline; another friend (predictor) turns that outline into the final essay.
Hook: You know how you forget the start of a long story if someone talks for too long without pausing?
The Concept (Context rot):
- What it is: When models get too much input, they start losing track of important earlier details.
- How it works:
- Long inputs fill the model's attention window.
- Important bits get buried among unimportant ones.
- Answers get worse even though you gave more text.
- Why it matters: Feeding more raw text can backfire; we need smart compression to keep just the important pieces.
Anchor: It's like taking notes in class: if you write everything word-for-word, you can't find the key ideas later.
Hook: Imagine a tidy packer and a great storyteller working together.
The Concepts (Compressor vs. Predictor):
- What they are: The compressor is the tidy packer that makes a compact, info-rich summary; the predictor is the storyteller that uses that summary to answer the question.
- How it works:
- Compressor picks question-relevant nuggets and writes them briefly.
- Predictor reasons over those nuggets to produce a final answer.
- Why it matters: If the compressor keeps the right facts, the predictor shines; if it drops key facts, even a genius predictor struggles.
Anchor: If your outline misses the main plot twist, the best essay-writer still gets the story wrong.
Hook: You know how coaches want a single stat that tells them if training is working across different games?
The Concept (Task-agnostic evaluation):
- What it is: A way to judge how good a compressor is without tying it to one specific task or dataset.
- How it works:
- Look at how much information the summary preserves from the original text.
- Don't depend on a particular downstream test.
- Use a general signal that predicts many tasks.
- Why it matters: Saves time and money: no need to test every compressor with every predictor on every task.
Anchor: Like measuring a student's note-taking skill by how well their notes cover the textbook, not just one quiz.
Hook: Radios crackle; sometimes the message gets fuzzy.
The Concept (Viewing the compressor as a noisy channel):
- What it is: Treat the compressor like a walkie-talkie link from big text to short text that may lose or distort details.
- How it works:
- Original text goes in.
- The compressor sends a shorter message.
- Some information may be dropped or garbled.
- Why it matters: If we treat it like a communication channel, we can use information theory to measure how much meaning survives.
Anchor: If a scout radios, "North path safe," but static cuts off "not," the team makes a bad decision, so measuring message quality matters.
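In information-theoretic terms, "how much meaning survives the channel" is exactly what mutual information measures. The identity below is the standard textbook definition, not something introduced by this paper: the information Z carries about X is the uncertainty about X that is removed once you see Z.
```latex
% Standard definition of mutual information between the source X and the
% compression Z: the reduction in uncertainty about X once Z is known.
I(X; Z) \;=\; H(X) \;-\; H(X \mid Z)
```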
The world before: People chained models together by trial and error. They tweaked which compressor to use and which predictor to pair it with, then ran costly end-to-end tests. Every new model release meant re-trying many combinations. What broke? We didn't know how much of the original context the compressor really preserved. The predictor's success or failure mixed together two effects: (1) Was the summary informative? (2) Was the predictor good at reasoning? Without a clean yardstick for summaries themselves, teams guessed.
Failed attempts: Practitioners relied on downstream accuracy or human inspection of summaries. But accuracy confounds compression quality with predictor ability, and human checks don't scale or generalize. Some information-theoretic estimators existed, but many needed extra training or full vocabulary probabilities, which is impractical for modern inference servers.
The gap: A simple, practical, task-agnostic way to score summaries by how much information they carry from the source.
Real stakes: Better compressors mean shorter prompts, lower latency, cheaper API bills, and less privacy risk (more processing can run locally). With phones and laptops now powerful enough to host mid-size LMs, pushing compute into local compressors can cut cloud costs while maintaining accuracy.
02 Core Idea
Hook: Imagine two teammates passing secret notes in class. One writes a compressed note; the other uses it to solve the problem. If the note leaves out key facts, the solution fails, no matter how smart the solver is.
The Concept (Key insight in one sentence):
- Treat the compressor as a noisy communication channel and score it by how much mutual information (MI) its summary preserves about the original text, because MI predicts downstream performance across tasks.
Multiple analogies (3 ways):
- Suitcase analogy: A good packer fits all essentials into a carry-on; MI measures how many essentials made it in. More essentials per item (bits per token) means smarter packing.
- Walkie-talkie analogy: The sender's voice goes through static; MI measures how much of the message got through.
- Study-notes analogy: Notes condense a chapter; MI measures how much of the chapter's test-relevant content remains in the notes.
Before vs. After:
- Before: We judged by end-to-end accuracy, mixing compressor quality with predictor skill and paying for big, slow sweeps.
- After: We measure information passed from input to summary, predict downstream performance without full runs, and spend compute where it matters most: the compressor.
Hook: You know how a speedometer helps you drive better even before you reach your destination?
The Concept (Mutual information as a proxy):
- What it is: MI says how much knowing the summary tells you about the original text.
- How it works:
- For each document and its compression, estimate how likely the summary is given the document.
- Compare that to how likely the same summary is across other documents.
- The bigger the gap, the more info the summary carries about its own document.
- Why it matters: High MI means the summary is specific and informative, which strongly predicts better answers.
Anchor: If your notes only make sense for your textbook (not random ones), they're carrying lots of information about that book: high MI.
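Written out, the quantity behind this concept is the conditional mutual information I(X; Z | Q), which the paper estimates from log-probabilities. The expected log-ratio form below is the standard information-theoretic definition; the paper's estimator approximates it by sampling.
```latex
% Conditional mutual information of document X and summary Z given question Q,
% written as an expected log-likelihood ratio. p(z | x, q) is the likelihood of
% a summary given its own document; p(z | q) is the marginal likelihood of that
% same summary averaged over documents.
I(X; Z \mid Q) \;=\; \mathbb{E}_{x,\, q,\, z}\!\left[\, \log \frac{p(z \mid x, q)}{p(z \mid q)} \,\right]
```
A high ratio means the summary is far more likely under its own document than in general, which is exactly the "specific and informative" behavior described above.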
Hook: Picture paying per word you send in a text message.
The Concept (Information rate / bit efficiency):
- What it is: Bits of MI per output token, i.e., how much meaningful info each token carries.
- How it works:
- Compute MI between original text and summary.
- Divide by the number of summary tokens.
- Higher is better: fewer tokens, more meaning.
- Why it matters: It rewards dense, precise summaries and relates tightly to final task accuracy.
Anchor: If two notes both let you ace the quiz, but one is half as long, it's the more bit-efficient note.
Hook: Choosing image quality on your phone camera: smaller files mean more blur, unless your camera is smart.
The Concept (Rate-distortion trade-off):
- What it is: The balance between how much information you send (rate) and how much error you accept in the final answer (distortion).
- How it works:
- Measure rate (bits per token) of the compressor's summaries.
- Measure distortion (1 − accuracy) of the final result.
- Fit a smooth curve: more rate generally means less distortion, until a floor.
- Why it matters: It shows when making the predictor bigger barely helps, but making the compressor better still pays off.
Anchor: Past a point, adding megapixels doesn't fix a blurry picture if the photo was badly framed (poor compression).
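A compact way to write the two quantities being traded off. The decaying fit is an assumed functional form with a floor; the paper's exact parameterization may differ.
```latex
% Rate R: bits of question-relevant MI per summary token (|Z| = summary length).
% Distortion D: the error left in the final answer.
% D(R): an assumed exponential-decay fit with floor D_min and constants a, b > 0.
R = \frac{I(X; Z \mid Q)}{|Z|}, \qquad
D = 1 - \mathrm{accuracy}, \qquad
D(R) \approx D_{\min} + a\, e^{-b R}
```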
Building blocks:
- Practical MI estimator: Uses only log-probabilities available from modern inference engines; no need for full-vocabulary logit dumps or extra training.
- Conditioning on the question (Q): Summaries are judged for question-relevant information (I(X; Z | Q)).
- Token efficiency: Larger compressors output fewer tokens for the same or better information.
- Compute view: Because larger compressors emit fewer tokens, total FLOPs-per-generation can grow sublinearly with model size.
Why it works (intuition): Bigger, better compressors understand what's important and say it briefly. That increases MI and bit efficiency. The predictor can then reason effectively without wading through fluff. If the compressor fumbles key facts, even a giant predictor can't recover what never arrived.
03 Methodology
High-level recipe: Input (document X + question Q) → [Compressor creates a short, Q-relevant summary Z] → [Predictor answers Y from Z] → Output (answer).
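As code, the recipe is just two calls in sequence. This is a minimal sketch under stated assumptions, not the paper's implementation: the `LM` callables stand in for whatever local (compressor) and remote (predictor) inference you have available, and the prompts are illustrative.
```python
from typing import Callable

LM = Callable[[str], str]  # any prompt -> completion function (local or remote)

def compress(lm: LM, document: str, question: str, max_sentences: int = 6) -> str:
    """Compressor: a smaller local LM writes a short, question-focused summary Z."""
    prompt = (
        f"Question: {question}\n"
        f"From the document below, keep only what is needed to answer it, "
        f"in at most {max_sentences} sentences.\n\nDocument:\n{document}"
    )
    return lm(prompt)

def predict(lm: LM, summary: str, question: str) -> str:
    """Predictor: a larger LM answers using only the compressed context Z."""
    return lm(f"Context:\n{summary}\n\nQuestion: {question}\nAnswer:")

def answer(compressor: LM, predictor: LM, document: str, question: str) -> str:
    # The predictor never sees the long document X, only the short summary Z.
    return predict(predictor, compress(compressor, document, question), question)
```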
Step-by-step, like a lab guide:
- Prepare data and questions.
- What happens: Choose datasets with long contexts (e.g., clinical notes, finance reports, research papers, chats, web pages) and attach a question Q to each.
- Why it exists: Real tasks overflow context windows; we need long inputs to test compression.
- Example: A 120,000-token annual report plus "What was the 2022 net income?"
- Generate compressions Z with different compressors.
- What happens: Use several LM families and sizes (e.g., Llama, Qwen, Gemma at 1–14B) to create Q-focused summaries.
- Why it exists: We want to see how compressor choice affects quality, length, and compute.
- Example: A 7B model might produce a 6-sentence summary; a 1.5B model might produce 3 paragraphs but miss the key figure.
- Estimate mutual information I(X; Z | Q). Hook: Think of checking how uniquely your summary matches its original document. The Concept (MI estimator, practical version):
- What it is: A Monte Carlo score comparing "summary given its own doc" vs. "summary given other docs," conditioned on the question.
- How it works:
- For each document i, sample M summaries from its compressor.
- Score how likely each summary is under its own document vs. all N documents (using log-probs from an inference server or a small proxy model).
- Average these differences to get an MI estimate; clip tiny negative values caused by sampling noise to 0.
- Why it matters: It's directly computable on modern servers, needs no extra model training, and predicts downstream accuracy well (a code sketch of this estimator appears after this step-by-step list). Anchor: If your summary gets a very high score for its own doc and low scores for others, it's highly informative: high MI.
- Compute information rate (bit efficiency).
- What happens: Divide MI by the number of tokens in Z to get bits per token.
- Why it exists: Rewards information-dense summaries; directly tracks answer quality in practice.
- Example: If two compressors have the same MI, but one uses half the tokens, it has double the bit efficiency.
- Predict answers with different predictors.
- What happens: Feed Z to various predictors (e.g., 8B, 70B, 405B, GPT-4o) and measure accuracy or perplexity.
- Why it exists: Separates compression quality from reasoning power; shows where scaling helps most.
- Example: Beyond ~70B, bigger predictors often add only small gains if the summary quality is unchanged.
- Estimate compute cost (FLOPs-per-generation). Hook: Like counting total pushups in a workout: reps (tokens) times effort per rep (model size). The Concept (FLOPs-per-generation):
- What it is: A simple formula that approximates compute used per generated token for dense transformers.
- How it works:
- Compute per-token FLOPs scales roughly with model size.
- Multiply by the number of output tokens.
- Summarize cost across models and settings.
- Why it matters: Lets us compare accuracy gains to compute spent; shows when bigger compressors are actually cheaper overall because they emit fewer tokens. Anchor: A stronger athlete (bigger model) needs fewer reps (tokens) to finish the job; total work can be similar or even lower. (A rough formula for this appears after this step-by-step list.)
- Fit rate-distortion curves.
- What happens: Plot distortion (1 − accuracy) vs. rate (bits per token) and fit a simple decaying curve.
- Why it exists: Visualizes diminishing returns; highlights that scaling the compressor shifts the curve more than scaling the predictor.
- Example: Curves flatten out at a floor; extra bits won't fix label noise or judge imperfections.
- Deep Research system test.
- What happens: The predictor breaks a big task into sub-queries; multiple compressors parse sources in parallel; the predictor synthesizes the results into a long report.
- Why it exists: Shows that the information-theoretic rules transfer to a realistic agentic pipeline.
- Example: With local 3B–7B compressors, the system kept ~99% of frontier accuracy while shrinking API cost by ~74%.
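Below is a minimal Python sketch of the MI estimation step referenced above. It assumes a scoring function `logp(z, x, q)` that returns the log-probability of a summary given a document and question (for example, from an inference server's returned log-probs or a small proxy model), and it approximates the marginal p(z | q) by averaging the conditional likelihood over the documents in the batch. The paper's exact estimator may differ in details; treat this as one plausible Monte Carlo form.
```python
import math

def estimate_mi_bits(logp, docs, questions, summaries):
    """Monte Carlo estimate of I(X; Z | Q) in bits (a sketch, not the paper's exact estimator).

    logp(z, x, q) -> float : log-probability of summary z given document x and question q
                             (assumed scoring interface, e.g. an inference server or proxy model).
    docs[i], questions[i]  : the i-th long document and its question.
    summaries[i]           : list of summaries sampled from the compressor for document i.
    """
    n = len(docs)
    total, count = 0.0, 0
    for i in range(n):
        for z in summaries[i]:
            # Log-likelihood of the summary under its own document.
            own = logp(z, docs[i], questions[i])
            # Approximate the marginal log p(z | q) by averaging the likelihood over
            # all documents in the batch (log-mean-exp for numerical stability).
            scores = [logp(z, docs[j], questions[i]) for j in range(n)]
            m = max(scores)
            marginal = m + math.log(sum(math.exp(s - m) for s in scores) / n)
            total += own - marginal
            count += 1
    nats = total / count
    return max(nats / math.log(2), 0.0)  # convert to bits; clip sampling noise below zero

def bits_per_token(mi_bits, summary_token_counts):
    """Information rate: MI in bits divided by the average summary length in tokens."""
    return mi_bits / (sum(summary_token_counts) / len(summary_token_counts))
```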
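For the FLOPs-per-generation step, a common back-of-the-envelope approximation for dense transformers (not necessarily the paper's exact accounting) is:
```latex
% N: parameter count of the model; T_out: number of generated (output) tokens.
% Forward-pass compute per generated token is roughly 2N FLOPs for a dense model.
\mathrm{FLOPs}_{\mathrm{gen}} \;\approx\; 2\, N\, T_{\mathrm{out}}
```
Under this approximation, a 7B compressor costs roughly 4.7× more per token than a 1.5B one, so if it also emits roughly 4.6× fewer tokens, total FLOPs-per-generation barely moves, which is the sublinear scaling reported in the results.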
The secret sauce:
- A practically computable MI estimator (no full vocab dumps, no extra training) that correlates strongly with downstream performance.
- Conditioning on the question to measure task-relevant information, not fluff.
- The realization that larger compressors often produce shorter, denser summaries, making total compute scale sublinearly, so you can front-load compute locally and save cloud cost.
04 Experiments & Results
The test: Measure how compressor size, family, and token length affect (1) accuracy on QA tasks and (2) perplexity on language tasks, and see how MI and bit efficiency line up with those outcomes.
The competition: Compressors from Llama-3, Qwen-2.5, Gemma-3 at 1–14B; predictors from 8B up to 405B and GPT-4o. Baselines include feeding uncompressed data directly to GPT-4o.
Scoreboard with context:
- Accuracy jumps by scaling compressors: On LongHealth, 7–8B compressors are up to 3.1× more accurate than 1–1.5B models and even beat a GPT-4o-only baseline by about 4 percentage points. On FinanceBench, 7–8B compressors are up to 2.6× better than 1–1.5B and recover ~97% of the GPT-4o-only baseline.
- Token efficiency: Larger compressors output up to 4.6× fewer tokens while keeping or improving accuracy, like writing an A+ summary in half the words.
- Compute: Because output tokens drop as compressors get bigger, total FLOPs-per-generation can grow sublinearly. For Qwen-2.5 on LongHealth, going from 1.5B to 7B added only ~1.3% more FLOPs-per-generation.
- MI and bit efficiency: Larger compressors carry much more information from source to summary and more bits per token. One notable case: a Qwen-2.5 7B compressor conveyed about 5.5× more bits of MI per token than its 1.5B sibling.
- Rate-distortion: Information rate strongly tracks distortion (1 − accuracy). The fitted curves show that after a moderate predictor size (~8B–70B), further predictor scaling yields small gains unless the compressor improves.
- Perplexity link: On an extractive setup with FineWeb, MI correlates strongly with perplexity (r ≈ −0.84), meaning more informative summaries also make next-token predictions easier for a separate model.
- Family effects: Model family matters. Qwen-2.5 compressors often scale more compute-efficiently than Llama or Gemma. Predictors don't need to match the compressor family to perform well.
Surprising findings:
- Bigger can be cheaper: Stronger compressors write shorter notes; shorter notes save predictor compute and API cost, so larger local compressors can lower total spend.
- Predictor scaling limits: Upgrading the predictor from 70B to 405B gave only small accuracy bumps (e.g., ~12% in one setting), while upgrading the compressor from ~1B to ~7B delivered much larger gains.
- Prompt robustness: Asking compressors to write 3, 6, or 9 sentences didn't change the core scaling trend; capacity, not formatting, drives the improvements.
- Deep Research win: Local compressors as small as 3B recovered ~99% of frontier accuracy at roughly 26% of the original API cost (~74% savings).
05 Discussion & Limitations
Limitations:
- MI estimator practicality vs. purity: The Monte Carlo estimator uses log-probabilities and, at small scales (~1–3B), may need a proxy model for stability. This can introduce variance or small biases.
- Model scope: Most tests use GPT-style, non-reasoning dense LMs with single-round communication. Results may differ for chain-of-thought, multi-turn agents, or mixture-of-experts models where compute depends on which experts fire.
- Estimation noise: Finite-sample MI can dip slightly below zero; clipping fixes the artifact but reminds us that estimates have variance.
- Compute realism: FLOPs-per-generation is a clear abstraction, but device-specific optimizations and memory bandwidth also affect real latency and cost.
Required resources:
- Access to inference-time log-probabilities (or a small proxy model) to score summaries.
- Ability to sample multiple compressions per document for stable MI estimates.
- A range of compressor sizes/families and at least a modest predictor (~8B+) to see the trends.
- Basic logging to count tokens and approximate FLOPs.
When NOT to use this approach:
- Very short contexts where compression is unnecessary; the overhead isn't worth it.
- Tasks needing verbatim recall of long spans (legal citations, code diffs) where any omission is fatal; consider full-context or retrieval with exact spans.
- Highly creative generation where preserving source information is not the main goal.
Open questions:
- Better estimators: Could contrastive objectives like InfoNCE yield even more stable MI estimates at scale?
- Multi-turn channels: How do MI and rate-distortion behave over several communication rounds or with planning/decomposition traces?
- Training for rate: Can we train compressors to directly optimize the rate-distortion trade-off (maximize MI per token subject to accuracy floors)?
- Routing decisions: Can MI estimates guide when to fall back to full-context remote processing or which compressor to pick on the fly?
- MoE and quantization: How do sparse activation patterns and low-precision inference alter the compute vs. information picture?
06 Conclusion & Future Work
Three-sentence summary: This paper treats the compressor in agentic LM systems as a noisy channel and measures how much information its summaries preserve from the source using a practical mutual information estimator. Information rate (bits per token) strongly predicts downstream performance, revealing that scaling compressors yields larger gains than scaling predictors while often reducing total compute through shorter outputs. These principles transfer to a Deep Research pipeline, where small local compressors recover frontier accuracy at a fraction of the cost.
Main achievement: A simple, task-agnostic, inference-time MI estimator that turns compressor quality into a single, predictive number, unlocking principled design of multi-LM systems.
Future directions: Build better MI estimators (e.g., InfoNCE variants), extend to multi-round agent communication, train compressors with rate-distortion objectives, and design routing policies that pick the right compressor or fallback dynamically. Explore compute-aware training for sparse/MoE models and quantify effects of quantization on information rate.
Why remember this: It gives a clear north star: optimize for information density. When in doubt, front-load compute into a strong local compressor, use MI to verify that summaries carry the right bits, and expect better accuracy, lower cost, and faster systems.
Practical Applications
- Choose compressors by maximizing information rate (bits per token) before running expensive end-to-end tests.
- Front-load compute into a local compressor to reduce predictor API tokens and cut cloud costs.
- Set compression length targets (e.g., 3–9 sentences) and verify MI stays high to ensure summaries remain dense.
- Use MI as a routing signal: if MI is low, fall back to full-context or a larger compressor (see the sketch after this list).
- Budget compute with FLOPs-per-generation: prefer compressors that reduce output tokens to keep total cost sublinear.
- Audit compressors by sampling multiple summaries and tracking MI variance; pick prompts/settings with stable MI.
- When swapping model families (e.g., Llama → Qwen), recheck MI and rate-distortion curves instead of full sweeps.
- For memory systems (chat/history), compute MI of memory entries to keep only the most informative notes.
- In enterprise RAG, compress retrieved documents with a strong local compressor and pass only high-MI snippets.
- Tune prompts for conciseness but validate with MI so shorter outputs don't silently drop vital facts.
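A hedged sketch of the MI-as-routing-signal idea from the list above. The threshold, prompt format, and callables are illustrative placeholders, not values or interfaces from the paper; calibrate the threshold on a held-out set for your own task.
```python
def route(mi_bits_per_token: float, summary: str, document: str, question: str,
          predictor, fallback_predictor, min_rate: float = 0.5) -> str:
    """Route based on the estimated information rate of a compressed summary.

    min_rate (bits per token) is an illustrative threshold, not a value from the
    paper. predictor and fallback_predictor are prompt -> completion callables.
    """
    if mi_bits_per_token >= min_rate:
        # The summary looks informative enough: answer from the compressed context.
        return predictor(f"Context:\n{summary}\n\nQuestion: {question}\nAnswer:")
    # Low information rate: fall back to full-context processing
    # (or re-compress with a larger local compressor).
    return fallback_predictor(f"Context:\n{document}\n\nQuestion: {question}\nAnswer:")
```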