An Information Theoretic Perspective on Agentic System Design
Key Summary
- The paper shows that many AI systems work best when a small 'compressor' model first shrinks long text into a short, info-packed summary and a bigger 'predictor' model then reasons over that summary.
- They treat the compressor like a noisy walkie-talkie and measure how much useful information actually survives the trip using mutual information (MI).
- A simple MI estimator, computed directly from model log-probabilities, predicts downstream task performance without running full end-to-end tests.
- Bigger compressors are surprisingly more efficient: they keep more important bits while using fewer tokens, so you pay less compute for better quality.
- Scaling the compressor matters far more than scaling the predictor; increasing compressor size can yield large accuracy gains, while making the predictor huge often gives only small improvements.
- On LongHealth and FinanceBench, 7–8B compressors are up to 3.1× more accurate than 1–1.5B models and up to 4.6× more concise.
- A Qwen-2.5 7B compressor carries about 5.5× more bits of information per token than a 1.5B version and adds only about 1.3% more FLOPs-per-generation.
- Information rate (bits per token) strongly correlates with accuracy and perplexity, creating a task-agnostic proxy for compression quality.
- In a Deep Research setup, small local compressors (~3B) recover ~99% of frontier-level accuracy while cutting API costs by about 74%.
- Practical rule of thumb: front-load compute into local compressors and choose models/families that maximize information density.
Why This Research Matters
Long inputs are common in real life (medical files, financial reports, legal documents, web research), and models often stumble when given everything at once. This work provides a simple way to check if a summary truly preserves the needed information, no matter the task. With better compressors, we can run more intelligence locally on laptops and phones, lowering latency, costs, and privacy risks. Teams can stop guessing which component to upgrade and instead use MI to guide design choices that consistently improve quality. The approach scales to practical systems like Deep Research, showing that small local compressors can deliver frontier-level results at a fraction of the price.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're packing for a long trip with only a small suitcase. You need to squeeze everything important into a tiny space, or the rest of your journey becomes annoying and expensive.
The Concept (Agentic LM systems):
- What it is: Many modern AI apps use teams of language models where one model shrinks big piles of text (the compressor) and another model reads that short version to answer questions (the predictor).
- How it works:
- Gather a long context (web pages, PDFs, chats).
- A smaller LM compresses it into a short, focused summary.
- A bigger LM reads that summary and produces the final answer.
- Why it matters: Without this division of labor, the big model can choke on too many tokens (context rot), get slow, or become too expensive.
Anchor: Think of a school group project: one friend skims the whole book (compressor) and gives a sharp outline; another friend (predictor) turns that outline into the final essay.
Hook: You know how you forget the start of a long story if someone talks for too long without pausing?
The Concept (Context rot):
- What it is: When models get too much input, they start losing track of important earlier details.
- How it works:
- Long inputs fill the model's attention window.
- Important bits get buried among unimportant ones.
- Answers get worse even though you gave more text.
- Why it matters: Feeding more raw text can backfire; we need smart compression to keep just the important pieces.
Anchor: It's like taking notes in class: if you write everything word-for-word, you can't find the key ideas later.
Hook: Imagine a tidy packer and a great storyteller working together.
The Concepts (Compressor vs. Predictor):
- What they are: The compressor is the tidy packer that makes a compact, info-rich summary; the predictor is the storyteller that uses that summary to answer the question.
- How it works:
- Compressor picks question-relevant nuggets and writes them briefly.
- Predictor reasons over those nuggets to produce a final answer.
- Why it matters: If the compressor keeps the right facts, the predictor shines; if it drops key facts, even a genius predictor struggles.
Anchor: If your outline misses the main plot twist, the best essay-writer still gets the story wrong.
Hook: You know how coaches want a single stat that tells them if training is working across different games?
The Concept (Task-agnostic evaluation):
- What it is: A way to judge how good a compressor is without tying it to one specific task or dataset.
- How it works:
- Look at how much information the summary preserves from the original text.
- Don't depend on a particular downstream test.
- Use a general signal that predicts many tasks.
- Why it matters: Saves time and money: no need to test every compressor with every predictor on every task.
Anchor: Like measuring a student's note-taking skill by how well their notes cover the textbook, not just one quiz.
Hook: Radios crackle; sometimes the message gets fuzzy.
The Concept (Viewing the compressor as a noisy channel):
- What it is: Treat the compressor like a walkie-talkie link from big text to short text that may lose or distort details.
- How it works:
- Original text goes in.
- The compressor sends a shorter message.
- Some information may be dropped or garbled.
- Why it matters: If we treat it like a communication channel, we can use information theory to measure how much meaning survives.
Anchor: If a scout radios, "North path safe," but static cuts off "not," the team makes a bad decision, so measuring message quality matters.
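In information-theoretic terms, "how much meaning survives the channel" is exactly what mutual information measures. The identity below is the standard textbook definition, not something introduced by this paper: the information Z carries about X is the uncertainty about X that is removed once you see Z.
```latex
% Standard definition of mutual information between the source X and the
% compression Z: the reduction in uncertainty about X once Z is known.
I(X; Z) \;=\; H(X) \;-\; H(X \mid Z)
```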
The world before: People chained models together by trial and error. They tweaked which compressor to use and which predictor to pair it with, then ran costly end-to-end tests. Every new model release meant re-trying many combinations. What broke? We didn't know how much of the original context the compressor really preserved. The predictor's success or failure mixed together two effects: (1) Was the summary informative? (2) Was the predictor good at reasoning? Without a clean yardstick for summaries themselves, teams guessed.
Failed attempts: Practitioners relied on downstream accuracy or human inspection of summaries. But accuracy confounds compression quality with predictor ability, and human checks don't scale or generalize. Some information-theoretic estimators existed, but many needed extra training or full vocabulary probabilities, which is impractical for modern inference servers.
The gap: A simple, practical, task-agnostic way to score summaries by how much information they carry from the source.
Real stakes: Better compressors mean shorter prompts, lower latency, cheaper API bills, and less privacy risk (more processing can run locally). With phones and laptops now powerful enough to host mid-size LMs, pushing compute into local compressors can cut cloud costs while maintaining accuracy.
02 Core Idea
Hook: Imagine two teammates passing secret notes in class. One writes a compressed note; the other uses it to solve the problem. If the note leaves out key facts, the solution fails, no matter how smart the solver is.
The Concept (Key insight in one sentence):
- Treat the compressor as a noisy communication channel and score it by how much mutual information (MI) its summary preserves about the original text, because MI predicts downstream performance across tasks.
Multiple analogies (3 ways):
- Suitcase analogy: A good packer fits all essentials into a carry-on; MI measures how many essentials made it in. More essentials per item (bits per token) means smarter packing.
- Walkie-talkie analogy: The sender's voice goes through static; MI measures how much of the message got through.
- Study-notes analogy: Notes condense a chapter; MI measures how much of the chapter's test-relevant content remains in the notes.
Before vs. After:
- Before: We judged by end-to-end accuracy, mixing compressor quality with predictor skill and paying for big, slow sweeps.
- After: We measure information passed from input to summary, predict downstream performance without full runs, and spend compute where it matters most: the compressor.
Hook: You know how a speedometer helps you drive better even before you reach your destination?
The Concept (Mutual information as a proxy):
- What it is: MI says how much knowing the summary tells you about the original text.
- How it works:
- For each document and its compression, estimate how likely the summary is given the document.
- Compare that to how likely the same summary is across other documents.
- The bigger the gap, the more info the summary carries about its own document.
- Why it matters: High MI means the summary is specific and informative, which strongly predicts better answers.
Anchor: If your notes only make sense for your textbook (not random ones), they're carrying lots of information about that book: high MI.
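Written out, the quantity behind this concept is the conditional mutual information I(X; Z | Q), which the paper estimates from log-probabilities. The expected log-ratio form below is the standard information-theoretic definition; the paper's estimator approximates it by sampling.
```latex
% Conditional mutual information of document X and summary Z given question Q,
% written as an expected log-likelihood ratio. p(z | x, q) is the likelihood of
% a summary given its own document; p(z | q) is the marginal likelihood of that
% same summary averaged over documents.
I(X; Z \mid Q) \;=\; \mathbb{E}_{x,\, q,\, z}\!\left[\, \log \frac{p(z \mid x, q)}{p(z \mid q)} \,\right]
```
A high ratio means the summary is far more likely under its own document than in general, which is exactly the "specific and informative" behavior described above.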
Hook: Picture paying per word you send in a text message.
The Concept (Information rate / bit efficiency):
- What it is: Bits of MI per output token, i.e., how much meaningful info each token carries.
- How it works:
- Compute MI between original text and summary.
- Divide by the number of summary tokens.
- Higher is better: fewer tokens, more meaning.
- Why it matters: It rewards dense, precise summaries and relates tightly to final task accuracy.
Anchor: If two notes both let you ace the quiz, but one is half as long, it's the more bit-efficient note.
Hook: Choosing image quality on your phone camera: smaller files mean more blur, unless your camera is smart.
The Concept (Rate-distortion trade-off):
- What it is: The balance between how much information you send (rate) and how much error you accept in the final answer (distortion).
- How it works:
- Measure rate (bits per token) of the compressor's summaries.
- Measure distortion (1 − accuracy) of the final result.
- Fit a smooth curve: more rate generally means less distortion, until a floor.
- Why it matters: It shows when making the predictor bigger barely helps, but making the compressor better still pays off.
Anchor: Past a point, adding megapixels doesn't fix a blurry picture if the photo was badly framed (poor compression).
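A compact way to write the two quantities being traded off. The decaying fit is an assumed functional form with a floor; the paper's exact parameterization may differ.
```latex
% Rate R: bits of question-relevant MI per summary token (|Z| = summary length).
% Distortion D: the error left in the final answer.
% D(R): an assumed exponential-decay fit with floor D_min and constants a, b > 0.
R = \frac{I(X; Z \mid Q)}{|Z|}, \qquad
D = 1 - \mathrm{accuracy}, \qquad
D(R) \approx D_{\min} + a\, e^{-b R}
```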
Building blocks:
- Practical MI estimator: Uses only log-probabilities available from modern inference engines; no need for full-vocabulary logit dumps or extra training.
- Conditioning on the question (Q): Summaries are judged for question-relevant information (I(X; Z | Q)).
- Token efficiency: Larger compressors output fewer tokens for the same or better information.
- Compute view: Because larger compressors emit fewer tokens, total FLOPs-per-generation can grow sublinearly with model size.
Why it works (intuition): Bigger, better compressors understand what's important and say it briefly. That increases MI and bit efficiency. The predictor can then reason effectively without wading through fluff. If the compressor fumbles key facts, even a giant predictor can't recover what never arrived.
03 Methodology
High-level recipe: Input (document X + question Q) → [Compressor creates a short, Q-relevant summary Z] → [Predictor answers Y from Z] → Output (answer).
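As code, the recipe is just two calls in sequence. This is a minimal sketch under stated assumptions, not the paper's implementation: the `LM` callables stand in for whatever local (compressor) and remote (predictor) inference you have available, and the prompts are illustrative.
```python
from typing import Callable

LM = Callable[[str], str]  # any prompt -> completion function (local or remote)

def compress(lm: LM, document: str, question: str, max_sentences: int = 6) -> str:
    """Compressor: a smaller local LM writes a short, question-focused summary Z."""
    prompt = (
        f"Question: {question}\n"
        f"From the document below, keep only what is needed to answer it, "
        f"in at most {max_sentences} sentences.\n\nDocument:\n{document}"
    )
    return lm(prompt)

def predict(lm: LM, summary: str, question: str) -> str:
    """Predictor: a larger LM answers using only the compressed context Z."""
    return lm(f"Context:\n{summary}\n\nQuestion: {question}\nAnswer:")

def answer(compressor: LM, predictor: LM, document: str, question: str) -> str:
    # The predictor never sees the long document X, only the short summary Z.
    return predict(predictor, compress(compressor, document, question), question)
```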
Step-by-step, like a lab guide:
- Prepare data and questions.
- What happens: Choose datasets with long contexts (e.g., clinical notes, finance reports, research papers, chats, web pages) and attach a question Q to each.
- Why it exists: Real tasks overflow context windows; we need long inputs to test compression.
- Example: A 120,000-token annual report plus "What was the 2022 net income?"
- Generate compressions Z with different compressors.
- What happens: Use several LM families and sizes (e.g., Llama, Qwen, Gemma at 1–14B) to create Q-focused summaries.
- Why it exists: We want to see how compressor choice affects quality, length, and compute.
- Example: A 7B model might produce a 6-sentence summary; a 1.5B model might produce 3 paragraphs but miss the key figure.
- Estimate mutual information I(X; Z | Q). Hook: Think of checking how uniquely your summary matches its original document. The Concept (MI estimator, practical version):
- What it is: A Monte Carlo score comparing "summary given its own doc" vs. "summary given other docs," conditioned on the question.
- How it works:
- For each document i, sample M summaries from its compressor.
- Score how likely each summary is under its own document vs. all N documents (using log-probs from an inference server or a small proxy model).
- Average these differences to get an MI estimate; clip tiny negative values caused by sampling noise to 0.
- Why it matters: It's directly computable on modern servers, needs no extra model training, and predicts downstream accuracy well (a code sketch of this estimator appears after this step-by-step list). Anchor: If your summary gets a very high score for its own doc and low scores for others, it's highly informative: high MI.
- Compute information rate (bit efficiency).
- What happens: Divide MI by the number of tokens in Z to get bits per token.
- Why it exists: Rewards information-dense summaries; directly tracks answer quality in practice.
- Example: If two compressors have the same MI, but one uses half the tokens, it has double the bit efficiency.
- Predict answers with different predictors.
- What happens: Feed Z to various predictors (e.g., 8B, 70B, 405B, GPT-4o) and measure accuracy or perplexity.
- Why it exists: Separates compression quality from reasoning power; shows where scaling helps most.
- Example: Beyond ~70B, bigger predictors often add only small gains if the summary quality is unchanged.
- Estimate compute cost (FLOPs-per-generation). Hook: Like counting total pushups in a workout: reps (tokens) times effort per rep (model size). The Concept (FLOPs-per-generation):
- What it is: A simple formula that approximates compute used per generated token for dense transformers.
- How it works:
- Compute per-token FLOPs scales roughly with model size.
- Multiply by the number of output tokens.
- Summarize cost across models and settings.
- Why it matters: Lets us compare accuracy gains to compute spent; shows when bigger compressors are actually cheaper overall because they emit fewer tokens. Anchor: A stronger athlete (bigger model) needs fewer reps (tokens) to finish the job; total work can be similar or even lower. (A rough formula for this appears after this step-by-step list.)
- Fit rate-distortion curves.
- What happens: Plot distortion (1 − accuracy) vs. rate (bits per token) and fit a simple decaying curve.
- Why it exists: Visualizes diminishing returns; highlights that scaling the compressor shifts the curve more than scaling the predictor.
- Example: Curves flatten out at a floor; extra bits won't fix label noise or judge imperfections.
- Deep Research system test.
- What happens: The predictor breaks a big task into sub-queries; multiple compressors parse sources in parallel; the predictor synthesizes the results into a long report.
- Why it exists: Shows that the information-theoretic rules transfer to a realistic agentic pipeline.
- Example: With local 3B–7B compressors, the system kept ~99% of frontier accuracy while shrinking API cost by ~74%.
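Below is a minimal Python sketch of the MI estimation step referenced above. It assumes a scoring function `logp(z, x, q)` that returns the log-probability of a summary given a document and question (for example, from an inference server's returned log-probs or a small proxy model), and it approximates the marginal p(z | q) by averaging the conditional likelihood over the documents in the batch. The paper's exact estimator may differ in details; treat this as one plausible Monte Carlo form.
```python
import math

def estimate_mi_bits(logp, docs, questions, summaries):
    """Monte Carlo estimate of I(X; Z | Q) in bits (a sketch, not the paper's exact estimator).

    logp(z, x, q) -> float : log-probability of summary z given document x and question q
                             (assumed scoring interface, e.g. an inference server or proxy model).
    docs[i], questions[i]  : the i-th long document and its question.
    summaries[i]           : list of summaries sampled from the compressor for document i.
    """
    n = len(docs)
    total, count = 0.0, 0
    for i in range(n):
        for z in summaries[i]:
            # Log-likelihood of the summary under its own document.
            own = logp(z, docs[i], questions[i])
            # Approximate the marginal log p(z | q) by averaging the likelihood over
            # all documents in the batch (log-mean-exp for numerical stability).
            scores = [logp(z, docs[j], questions[i]) for j in range(n)]
            m = max(scores)
            marginal = m + math.log(sum(math.exp(s - m) for s in scores) / n)
            total += own - marginal
            count += 1
    nats = total / count
    return max(nats / math.log(2), 0.0)  # convert to bits; clip sampling noise below zero

def bits_per_token(mi_bits, summary_token_counts):
    """Information rate: MI in bits divided by the average summary length in tokens."""
    return mi_bits / (sum(summary_token_counts) / len(summary_token_counts))
```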
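For the FLOPs-per-generation step, a common back-of-the-envelope approximation for dense transformers (not necessarily the paper's exact accounting) is:
```latex
% N: parameter count of the model; T_out: number of generated (output) tokens.
% Forward-pass compute per generated token is roughly 2N FLOPs for a dense model.
\mathrm{FLOPs}_{\mathrm{gen}} \;\approx\; 2\, N\, T_{\mathrm{out}}
```
Under this approximation, a 7B compressor costs roughly 4.7× more per token than a 1.5B one, so if it also emits roughly 4.6× fewer tokens, total FLOPs-per-generation barely moves, which is the sublinear scaling reported in the results.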
The secret sauce:
- A practically computable MI estimator (no full vocab dumps, no extra training) that correlates strongly with downstream performance.
- Conditioning on the question to measure task-relevant information, not fluff.
- The realization that larger compressors often produce shorter, denser summaries, making total compute scale sublinearly, so you can front-load compute locally and save cloud cost.
04 Experiments & Results
The test: Measure how compressor size, family, and token length affect (1) accuracy on QA tasks and (2) perplexity on language tasks, and see how MI and bit efficiency line up with those outcomes.
The competition: Compressors from Llama-3, Qwen-2.5, Gemma-3 at 1–14B; predictors from 8B up to 405B and GPT-4o. Baselines include feeding uncompressed data directly to GPT-4o.
Scoreboard with context:
- Accuracy jumps by scaling compressors: On LongHealth, 7–8B compressors are up to 3.1× more accurate than 1–1.5B models and even beat a GPT-4o-only baseline by about 4 percentage points. On FinanceBench, 7–8B compressors are up to 2.6× better than 1–1.5B and recover ~97% of the GPT-4o-only baseline.
- Token efficiency: Larger compressors output up to 4.6× fewer tokens while keeping or improving accuracy, like writing an A+ summary in half the words.
- Compute: Because output tokens drop as compressors get bigger, total FLOPs-per-generation can grow sublinearly. For Qwen-2.5 on LongHealth, going from 1.5B to 7B added only ~1.3% more FLOPs-per-generation.
- MI and bit efficiency: Larger compressors carry much more information from source to summary and more bits per token. One notable case: a Qwen-2.5 7B compressor conveyed about 5.5× more bits of MI per token than its 1.5B sibling.
- Rate-distortion: Information rate strongly tracks distortion (1 − accuracy). The fitted curves show that after a moderate predictor size (~8B–70B), further predictor scaling yields small gains unless the compressor improves.
- Perplexity link: On an extractive setup with FineWeb, MI correlates strongly with perplexity (r ≈ −0.84), meaning more informative summaries also make next-token predictions easier for a separate model.
- Family effects: Model family matters. Qwen-2.5 compressors often scale more compute-efficiently than Llama or Gemma. Predictors don't need to match the compressor family to perform well.
Surprising findings:
- Bigger can be cheaper: Stronger compressors write shorter notes; shorter notes save predictor compute and API cost, so larger local compressors can lower total spend.
- Predictor scaling limits: Upgrading the predictor from 70B to 405B gave only small accuracy bumps (e.g., ~12% in one setting), while upgrading the compressor from ~1B to ~7B delivered much larger gains.
- Prompt robustness: Asking compressors to write 3, 6, or 9 sentences didn't change the core scaling trend; capacity, not formatting, drives the improvements.
- Deep Research win: Local compressors as small as 3B recovered ~99% of frontier accuracy at roughly 26% of the original API cost (~74% savings).
05 Discussion & Limitations
Limitations:
- MI estimator practicality vs. purity: The Monte Carlo estimator uses log-probabilities and, at small scales (~1–3B), may need a proxy model for stability. This can introduce variance or small biases.
- Model scope: Most tests use GPT-style, non-reasoning dense LMs with single-round communication. Results may differ for chain-of-thought, multi-turn agents, or mixture-of-experts models where compute depends on which experts fire.
- Estimation noise: Finite-sample MI can dip slightly below zero; clipping fixes the artifact but reminds us that estimates have variance.
- Compute realism: FLOPs-per-generation is a clear abstraction, but device-specific optimizations and memory bandwidth also affect real latency and cost.
Required resources:
- Access to inference-time log-probabilities (or a small proxy model) to score summaries.
- Ability to sample multiple compressions per document for stable MI estimates.
- A range of compressor sizes/families and at least a modest predictor (~8B+) to see the trends.
- Basic logging to count tokens and approximate FLOPs.
When NOT to use this approach:
- Very short contexts where compression is unnecessary; the overhead isn't worth it.
- Tasks needing verbatim recall of long spans (legal citations, code diffs) where any omission is fatal; consider full-context or retrieval with exact spans.
- Highly creative generation where preserving source information is not the main goal.
Open questions:
- Better estimators: Could contrastive objectives like InfoNCE yield even more stable MI estimates at scale?
- Multi-turn channels: How do MI and rate-distortion behave over several communication rounds or with planning/decomposition traces?
- Training for rate: Can we train compressors to directly optimize the rate-distortion trade-off (maximize MI per token subject to accuracy floors)?
- Routing decisions: Can MI estimates guide when to fall back to full-context remote processing or which compressor to pick on the fly?
- MoE and quantization: How do sparse activation patterns and low-precision inference alter the compute vs. information picture?
06 Conclusion & Future Work
Three-sentence summary: This paper treats the compressor in agentic LM systems as a noisy channel and measures how much information its summaries preserve from the source using a practical mutual information estimator. Information rate (bits per token) strongly predicts downstream performance, revealing that scaling compressors yields larger gains than scaling predictors while often reducing total compute through shorter outputs. These principles transfer to a Deep Research pipeline, where small local compressors recover frontier accuracy at a fraction of the cost.
Main achievement: A simple, task-agnostic, inference-time MI estimator that turns compressor quality into a single, predictive number, unlocking principled design of multi-LM systems.
Future directions: Build better MI estimators (e.g., InfoNCE variants), extend to multi-round agent communication, train compressors with rate-distortion objectives, and design routing policies that pick the right compressor or fallback dynamically. Explore compute-aware training for sparse/MoE models and quantify effects of quantization on information rate.
Why remember this: It gives a clear north star: optimize for information density. When in doubt, front-load compute into a strong local compressor, use MI to verify that summaries carry the right bits, and expect better accuracy, lower cost, and faster systems.
Practical Applications
- Choose compressors by maximizing information rate (bits per token) before running expensive end-to-end tests.
- Front-load compute into a local compressor to reduce predictor API tokens and cut cloud costs.
- Set compression length targets (e.g., 3–9 sentences) and verify MI stays high to ensure summaries remain dense.
- Use MI as a routing signal: if MI is low, fall back to full-context or a larger compressor (see the sketch after this list).
- Budget compute with FLOPs-per-generation: prefer compressors that reduce output tokens to keep total cost sublinear.
- Audit compressors by sampling multiple summaries and tracking MI variance; pick prompts/settings with stable MI.
- When swapping model families (e.g., Llama → Qwen), recheck MI and rate-distortion curves instead of full sweeps.
- For memory systems (chat/history), compute MI of memory entries to keep only the most informative notes.
- In enterprise RAG, compress retrieved documents with a strong local compressor and pass only high-MI snippets.
- Tune prompts for conciseness but validate with MI so shorter outputs don't silently drop vital facts.
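A hedged sketch of the MI-as-routing-signal idea from the list above. The threshold, prompt format, and callables are illustrative placeholders, not values or interfaces from the paper; calibrate the threshold on a held-out set for your own task.
```python
def route(mi_bits_per_token: float, summary: str, document: str, question: str,
          predictor, fallback_predictor, min_rate: float = 0.5) -> str:
    """Route based on the estimated information rate of a compressed summary.

    min_rate (bits per token) is an illustrative threshold, not a value from the
    paper. predictor and fallback_predictor are prompt -> completion callables.
    """
    if mi_bits_per_token >= min_rate:
        # The summary looks informative enough: answer from the compressed context.
        return predictor(f"Context:\n{summary}\n\nQuestion: {question}\nAnswer:")
    # Low information rate: fall back to full-context processing
    # (or re-compress with a larger local compressor).
    return fallback_predictor(f"Context:\n{document}\n\nQuestion: {question}\nAnswer:")
```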