TimeBill: Time-Budgeted Inference for Large Language Models
Key Summary
- TimeBill is a way to help big AI models finish their answers on time without ruining answer quality.
- It predicts how many tokens (words) an answer will have and how long the whole run will take.
- Using those predictions, it smartly chooses how much old memory (KV cache) to keep or throw away.
- Unlike fixed settings that can be too strict or too loose, TimeBill adjusts itself for each new question and time limit.
- A small helper model (an SLM) classifies the answer length into fine buckets for accurate timing.
- A timing estimator links the work the model does (FLOPs) to real time measured on the actual hardware.
- With a safe cushion (a pessimistic factor), TimeBill plans for worst-case time so deadlines aren’t missed.
- In tests, TimeBill kept completion rates high and delivered better answer quality than common baselines.
- It suits time-critical tasks such as robotics and self-driving, where late answers can be unsafe.
Why This Research Matters
TimeBill makes large language models dependable under real deadlines, which is critical for robots, vehicles, factories, and other systems where timing equals safety. By planning length and time before generating, it prevents late or empty answers that can break workflows. It preserves as much context as possible, so accuracy stays high instead of collapsing under aggressive speed hacks. The method is hardware-aware, making predictions reliable on your actual GPU or CPU. It can pair with other accelerations like quantization to stack benefits. In short, TimeBill turns unpredictable response times into on-time delivery with minimal quality trade-offs.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re in a cooking contest with a strict timer. You don’t just want tasty food—you must plate it before the buzzer. If you run long, the dish doesn’t count, even if it’s amazing.
🥬 The Concept (Time-Budgeted Inference): What it is: Time-budgeted inference means making an AI finish answering within a chosen time limit while keeping answers useful. How it works: 1) Set a time budget, 2) predict how long the AI will take, 3) adjust the AI’s settings to finish by the deadline, 4) generate the answer, 5) stop on time. Why it matters: Without it, the AI might miss the deadline and give either no answer or an incomplete one—bad for safety and reliability.
🍞 Anchor: A robot hearing “Stop!” must react now. A perfect answer that arrives late is useless.
🍞 Hook: You know how you tell a story one word at a time, thinking of the next word based on what you just said?
🥬 The Concept (Auto-Regressive Generation): What it is: Many language models create answers one token at a time, using what’s already been written. How it works: 1) Read the prompt, 2) produce the first token, 3) add the next token based on all previous tokens, 4) repeat until done. Why it matters: The final time depends on how many tokens get generated, which is unpredictable in advance.
🍞 Anchor: If you don’t know whether your story will be 2 sentences or 20, it’s hard to know when you’ll finish.
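To make the unpredictability concrete, here is a minimal Python sketch (not the paper's code). The `fake_next_token` function is a made-up stand-in for a real model's next-token choice; only the stopping behavior matters here.

```python
import random

# Toy auto-regressive loop (not a real model): each step appends one token, and
# generation stops whenever an end-of-sequence token happens to be produced, so
# the number of steps, and therefore the latency, is unknown in advance.
random.seed(7)
EOS = 0

def fake_next_token(context):
    # Stand-in for the language model's next-token choice; token 0 acts as EOS.
    return random.choice(range(50))

def generate(prompt, max_new_tokens=1024):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = fake_next_token(tokens)
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens

for _ in range(3):
    print(len(generate([1, 2, 3])))   # response length varies from run to run
```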
🍞 Hook: Think of packing for a trip: first, you lay everything out (prefill), then you zip the suitcase little by little (decoding) while deciding what fits next.
🥬 The Concept (Prefill and Decoding Phases): What it is: LLMs have a prefill phase (read the whole prompt) and a decoding phase (generate tokens step by step). How it works: 1) Prefill: process the input to prepare memory, 2) Decoding: produce tokens one by one using the stored memory, 3) stop at an end token or max length. Why it matters: Prefill cost depends on input size; decoding cost grows with how much has already been said, so timing changes as you go.
🍞 Anchor: Reading a long question takes longer; adding each new word also gets slower if you keep more context around.
🍞 Hook: Imagine keeping a notepad of everything you’ve said so far so you can look back quickly.
🥬 The Concept (KV Cache): What it is: A fast memory the model uses to remember key information from earlier tokens. How it works: 1) During prefill, build the cache from the prompt, 2) during decoding, reuse it so you don’t recompute everything, 3) grow it as you add tokens. Why it matters: A big cache speeds up thinking but makes each new step heavier; if it grows too big, time can blow up.
🍞 Anchor: Your notepad helps you write better, but if it’s too full, flipping pages for each new sentence takes longer.
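Here is a rough, illustrative sketch of one decoding step with a KV cache, assuming a single attention head and made-up sizes. It shows both why reuse helps (nothing is re-encoded) and why each step gets heavier as the cache grows (one comparison per cached entry).

```python
import numpy as np

# Toy single-head attention step over a KV cache (illustrative only, not a real
# transformer). The new token's query is compared against every cached key, so
# each decoding step gets heavier as the cache grows.
d = 64
cached_keys = np.random.randn(1000, d)   # built during prefill
cached_vals = np.random.randn(1000, d)

def decode_step(query, keys, vals):
    scores = keys @ query / np.sqrt(d)   # one comparison per cached entry
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vals                # weighted mix of cached values

query = np.random.randn(d)
print(decode_step(query, cached_keys, cached_vals).shape)   # (64,)
```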
🍞 Hook: If your backpack is too heavy, you remove items to keep moving fast.
🥬 The Concept (KV Cache Eviction Ratio): What it is: The fraction of the cache you throw away to save time and memory. How it works: 1) Decide what to keep vs. evict, 2) evict a portion after prefill, 3) decode faster but with less history, 4) balance speed and quality. Why it matters: Evict too little—miss deadlines; evict too much—worse answers.
🍞 Anchor: Tossing old class notes helps you walk faster, but toss too many and you can’t study well.
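A minimal sketch of ratio-based eviction, with made-up importance scores. Real systems (for example, SnapKV-style methods) score entries by attention importance; that scoring is abstracted away here.

```python
import numpy as np

# Minimal sketch of ratio-based KV cache eviction after prefill. The scores and
# shapes below are invented purely for illustration.
def evict_kv_cache(keys, values, scores, eviction_ratio):
    """Keep the (1 - eviction_ratio) fraction of entries with the highest scores."""
    num_keep = max(1, int(round(len(scores) * (1.0 - eviction_ratio))))
    keep_idx = np.sort(np.argsort(scores)[-num_keep:])   # preserve original order
    return keys[keep_idx], values[keep_idx]

prompt_len, head_dim = 4000, 128
keys = np.random.randn(prompt_len, head_dim)
values = np.random.randn(prompt_len, head_dim)
scores = np.random.rand(prompt_len)                      # stand-in importance scores

kept_k, kept_v = evict_kv_cache(keys, values, scores, eviction_ratio=0.4)
print(kept_k.shape)                                      # (2400, 128): 40% evicted
```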
🍞 Hook: Picture a driver who must reach a green light before it turns red.
🥬 The Concept (Hard Deadlines in Real-Time Systems): What it is: A rule that late results count as failures, no matter how good they are. How it works: 1) Set a strict time, 2) if you pass it, the job fails, 3) systems may kill the job or skip future ones to recover. Why it matters: In robots or cars, late guidance can be dangerous.
🍞 Anchor: A stop signal that comes too late is no stop at all.
Before this work, many teams used fixed cache settings or offline tricks like quantization and pruning to go faster. Those helped but didn’t adapt to each question’s time budget or the unpredictable answer length. Others tried coarse length buckets or black-box timing predictors that weren’t fine-grained or easy to trust. The big missing piece was a method that could: 1) predict response length well for the specific model, 2) translate predicted “work” into real time on the actual hardware, and 3) choose the right cache eviction ratio per request to hit the time target without trashing quality. The stakes are real: late or empty answers can break a factory line, confuse a helper robot, or harm safety in driving.
02 Core Idea
🍞 Hook: You know how good planners first estimate how long homework will take, then cut or simplify parts to finish before bedtime?
🥬 The Concept (TimeBill’s Key Insight): What it is: Predict answer length and execution time first, then dynamically set how much cache to evict so the model finishes on time with minimal quality loss. How it works: 1) A small helper model guesses how long the answer will be (in tokens), 2) a timing estimator turns that length and input size into a time forecast, 3) given the time budget, TimeBill picks the smallest eviction ratio that makes the run fit, 4) the main model starts, evicts as planned, and finishes on time. Why it matters: Without smart planning, you either miss deadlines or over-evict and hurt the answer.
🍞 Anchor: It’s like packing just enough to fit in your locker before the bell rings, keeping what matters most.
Three analogies for the same idea:
- Traffic: Estimate travel time, then choose which side streets (evictions) to take so you reach before the meeting starts.
- Cooking: Time how long each step takes, then trim steps you can live without to serve before guests arrive.
- School project: Predict how many slides you’ll present, estimate practice time, then cut only the least important slides.
Before vs. After:
- Before: Fixed eviction ratios guessed ahead of time. Sometimes too slow (miss deadlines) or too harsh (bad answers).
- After: Per-request planning uses predicted length and timing to choose just-enough eviction. Fewer misses, better quality.
🍞 Hook: Imagine a librarian who can guess the length of a book just by reading the summary.
🥬 The Concept (Response Length Predictor, RLP): What it is: A small language model trained to classify the answer length into fine buckets (instead of guessing an exact number). How it works: 1) Read the input, 2) pick a length bucket (e.g., every 32 tokens), 3) multiply bucket by size to get a predicted length, 4) cap it by a max length. Why it matters: Length drives time; fine-grained buckets make timing much more accurate than coarse guesses.
🍞 Anchor: Like sorting books into shelves labeled “very short,” “short,” “medium,” “long,” and “very long”—but with many shelves for precision.
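A tiny sketch of turning a bucket prediction into a token count. The classifier itself (the small helper model) is abstracted away; the bucket size follows the article's example, while the cap is an assumed setting.

```python
# Convert a fine-grained bucket prediction into a predicted token count.
BUCKET_SIZE = 32        # tokens per bucket (the article's example)
MAX_NEW_TOKENS = 1024   # generation cap (an assumed setting)

def bucket_to_length(predicted_bucket: int) -> int:
    """Multiply the bucket index by the bucket size, capped at the maximum length."""
    return min(predicted_bucket * BUCKET_SIZE, MAX_NEW_TOKENS)

print(bucket_to_length(10))   # bucket 10 -> 320 predicted tokens
```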
🍞 Hook: Think of a stopwatch that learns your running pace on your track, not some random field.
🥬 The Concept (Execution Time Estimator, ETE): What it is: A model that maps “how much work” into “how much time” on the exact hardware. How it works: 1) Analyze which parts of the transformer do the heavy lifting (attention and feed-forward), 2) relate their work (FLOPs) to prompt length and kept cache size, 3) fit simple curves using real measurements from your GPU/CPU, 4) predict both typical and worst-case time. Why it matters: True timing depends on both software and hardware; fitting to your machine is crucial.
🍞 Anchor: A coach times you on your own track to make a reliable race-day plan.
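A minimal sketch of such an estimator, assuming the curve shapes described above: prefill roughly quadratic in prompt length, and each decoding step roughly linear in the kept history. All coefficients are placeholders standing in for values fitted from measurements on your own hardware.

```python
# Placeholder execution-time estimate with the curve shapes described above.
A2, A1, A0 = 5e-8, 1e-4, 0.02   # prefill fit (seconds)
B1, B0 = 1e-6, 0.003            # per-decoding-step fit (seconds)

def prefill_time(prompt_len: int) -> float:
    return A2 * prompt_len**2 + A1 * prompt_len + A0

def decode_time(kept_cache: int, new_tokens: int) -> float:
    # The cache grows by one entry for every token generated.
    return sum(B1 * (kept_cache + i) + B0 for i in range(new_tokens))

def total_time(prompt_len: int, eviction_ratio: float, new_tokens: int) -> float:
    kept = int(prompt_len * (1.0 - eviction_ratio))
    return prefill_time(prompt_len) + decode_time(kept, new_tokens)

print(f"{total_time(4000, 0.4, 320):.2f} s")   # ~3.0 s with these toy coefficients
```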
🍞 Hook: When you plan for a trip, you might add extra minutes in case of traffic.
🥬 The Concept (Pessimistic Factor and WCET): What it is: A safety cushion that inflates predicted length so the worst-case execution time (WCET) won’t exceed the budget. How it works: 1) Multiply predicted length by a cautious factor, 2) recompute the time, 3) choose eviction to fit even this worst case. Why it matters: Better to be safely on time than optimistically late in safety-critical tasks.
🍞 Anchor: Leaving home 10 minutes early means you still arrive on time if the bus is slow.
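A tiny sketch of the cushion in action, using a crude placeholder timing model (`rough_time`) and an assumed factor of 2, not values from the paper.

```python
# Inflate the predicted length by a pessimistic factor, then check the inflated
# (worst-case) estimate against the budget. `rough_time` is a placeholder.
def rough_time(prompt_len: int, new_tokens: int) -> float:
    return 3e-4 * prompt_len + 0.006 * new_tokens   # placeholder, seconds

predicted_len, factor, budget_s = 320, 2, 5.0
wcet_len = predicted_len * factor                   # plan for up to 640 tokens
wcet = rough_time(4000, wcet_len)
print(wcet <= budget_s, round(wcet, 2))             # False here means: evict more cache
```

With these made-up numbers the worst case just misses a 5-second budget, which is exactly the situation where the planner raises the eviction ratio in the next step.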
Building blocks working together:
- Fine-grained RLP aligned to the target LLM via knowledge distillation so it “speaks the same length language.”
- Workload-guided ETE that ties FLOPs to time and tunes coefficients by profiling.
- A simple optimizer that picks the smallest eviction ratio that still meets the deadline.
- Parallel execution that hides predictor overhead during the model’s prefill.
The “aha” is that a little planning turns a wobbly, sometimes-late process into a dependable, on-time one, with as little sacrifice as possible.
03 Methodology
At a high level: Input prompt → Response Length Predictor (RLP) → Execution Time Estimator (ETE) → Choose eviction ratio → Run prefill and decoding with eviction → Output.
Step 1. Read the input and start prefill
- What happens: The LLM begins its prefill (reads the whole prompt to set up its memory). At the same time, on another processor, the RLP reads either the original prompt or a compressed version.
- Why this step exists: Prefill cost depends on prompt length; doing prediction in parallel hides overhead so we don’t waste time.
- Example: With a 4,000-token prompt, prefill might take about a second on your GPU. While that runs, the RLP prepares a length prediction.
🍞 Hook: You know how you jot down a shorter version of notes to study faster?
🥬 The Concept (Prompt Compression for Prediction): What it is: Shrinking the prompt for the RLP so the predictor finishes during prefill. How it works: 1) Estimate how long prefill will take, 2) pick a short prompt size the RLP can process within that time, 3) compress or summarize the prompt to that size, 4) run the RLP on the short prompt. Why it matters: If prediction finishes before prefill ends, it adds practically zero delay.
🍞 Anchor: Make a half-page study guide so you can review it before the bell rings.
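A small sketch of sizing the compressed prompt so the predictor finishes inside the prefill window. The predictor throughput and overhead values are assumptions standing in for numbers you would profile on your own hardware.

```python
# Pick the largest compressed-prompt size the predictor can read before the
# main model's prefill finishes.
def max_predictor_tokens(prefill_estimate_s: float,
                         predictor_tokens_per_s: float = 2000.0,
                         predictor_overhead_s: float = 0.05) -> int:
    """Largest prompt the length predictor can process within the prefill window."""
    usable = max(0.0, prefill_estimate_s - predictor_overhead_s)
    return int(usable * predictor_tokens_per_s)

prefill_estimate = 1.2                         # seconds, for a ~4,000-token prompt
print(max_predictor_tokens(prefill_estimate))  # compress the prompt to at most this many tokens
```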
Step 2. Predict the response length with RLP
- What happens: The RLP outputs a bucket index; multiply by bucket size to get a token-length estimate, then cap by the maximum allowed length.
- Why this step exists: Length determines how many decoding steps we’ll pay for.
- Example: Bucket size 32; RLP picks bucket 10 → 320 tokens; cap at max if needed.
Step 3. Estimate execution time with ETE (including a safety cushion)
- What happens: ETE uses simple fitted curves: time grows roughly quadratically with prompt length in the prefill phase (because attention compares many token pairs) and roughly linearly with the kept history at each decoding step. It then inflates the predicted length by a pessimistic factor to plan for the worst case.
- Why this step exists: We need to know if we’ll fit the time budget; planning for worst case keeps us from missing deadlines.
- Example: Suppose prefill is 1.2 s, decoding 3.0 s in worst case—total 4.2 s against a 5 s budget. Good to go.
🍞 Hook: Think of choosing the few notes you must keep for a test to save time but still answer well.
🥬 The Concept (Choosing the Eviction Ratio): What it is: Select the smallest fraction of cache to throw away so timing fits the budget. How it works: 1) Try to keep as much cache as possible to protect quality, 2) if timing is too tight, slightly increase eviction, 3) stop when worst-case time is at or under the budget, 4) limit eviction to a maximum to avoid big quality drops. Why it matters: This balances speed and accuracy per request.
🍞 Anchor: Keep most flashcards, toss a few duplicates to finish your study session on time.
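A minimal sketch of this per-request choice: scan upward from “keep everything” and stop at the smallest eviction ratio whose worst-case estimate fits the budget. The timing fits reuse the placeholder coefficients from the earlier estimator sketch; real values come from profiling the target hardware.

```python
# Pick the smallest eviction ratio whose worst-case time estimate fits the budget.
def prefill_time(prompt_len):
    return 5e-8 * prompt_len**2 + 1e-4 * prompt_len + 0.02

def decode_time(kept_cache, new_tokens):
    return sum(1e-6 * (kept_cache + i) + 0.003 for i in range(new_tokens))

def choose_eviction_ratio(prompt_len, predicted_len, budget_s,
                          pessimistic_factor=2, max_ratio=0.95, step=0.05):
    wcet_len = int(predicted_len * pessimistic_factor)        # safety cushion
    ratio = 0.0
    while ratio <= max_ratio:
        kept = int(prompt_len * (1.0 - ratio))
        if prefill_time(prompt_len) + decode_time(kept, wcet_len) <= budget_s:
            return round(ratio, 2)                            # smallest ratio that fits
        ratio += step
    return max_ratio                                          # cap to limit quality loss

print(choose_eviction_ratio(prompt_len=4000, predicted_len=320, budget_s=5.0))   # 0.4 with these toy numbers
```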
Step 4. Run the LLM with that eviction plan
- What happens: Prefill completes, we evict the chosen percentage of cache, then decode token-by-token using the remaining memory.
- Why this step exists: This is the actual answering phase, now guarded by a plan that matches the time budget.
- Example: After prefill, evict 40% of cache. Decoding proceeds faster, and the answer completes in time.
Step 5. Deliver the answer on time
- What happens: The model stops at the end token or max tokens and returns the response before the deadline.
- Why this step exists: Hard deadlines treat late answers as failures; finishing on time is essential.
- Example: The robot gets its instruction in 4.8 s of a 5 s budget.
🍞 Hook: Imagine timing chores not only by how long they usually take, but by the amount of work (like number of dishes) and your kitchen speed.
🥬 The Concept (Workload-Guided Timing via FLOPs + Profiling): What it is: A hybrid timing model that connects theoretical work with real hardware speed. How it works: 1) Count where compute happens most (attention and feed-forward), 2) link those costs to prompt length and cache size, 3) measure on your actual GPU/CPU to fit the curves, 4) use the fitted model to predict end-to-end time. Why it matters: It’s both interpretable and accurate, and it adapts to different machines.
🍞 Anchor: You know how many plates you can wash per minute in your own sink, not someone else’s.
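A sketch of the profiling step: `measure_prefill` is a hypothetical stand-in for timing your own serving stack (simulated here with noisy synthetic data), and the fitted curve is what the estimator would then reuse.

```python
import numpy as np

# Time prefill at several prompt lengths on the target hardware, then fit a
# quadratic curve the estimator can reuse. The "measurements" are simulated.
rng = np.random.default_rng(0)

def measure_prefill(prompt_len: int) -> float:
    true_time = 5e-8 * prompt_len**2 + 1e-4 * prompt_len + 0.02   # pretend hardware
    return true_time * rng.normal(1.0, 0.02)                      # measurement noise

lengths = np.array([512, 1024, 2048, 4096, 8192])
times = np.array([measure_prefill(n) for n in lengths])

coeffs = np.polyfit(lengths, times, deg=2)     # fitted a2, a1, a0
predict = np.poly1d(coeffs)
print(f"predicted prefill for 4,000 tokens: {predict(4000):.2f} s")
```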
The secret sauce
- Fine-grained length buckets aligned to the exact target LLM (via knowledge distillation) make the length prediction precise.
- A simple, interpretable timing model, tuned by real measurements, makes time estimates trustworthy.
- A per-request optimization picks the smallest necessary eviction, preserving quality.
- Predictor work overlaps prefill, so planning doesn’t slow you down.
Together, these choices turn uncertain, variable runs into reliable, on-time answers with minimal quality loss.
04 Experiments & Results
🍞 Hook: Think of a school race where finishing before the whistle is required, and you also want the best form.
🥬 The Concept (What They Tested): What it is: They checked if TimeBill could keep jobs on time (completion rate) while keeping answers strong (quality scores). How it works: 1) Use a long-context benchmark (LongBench), 2) set several time budgets (5–10 seconds), 3) compare TimeBill to common strategies, 4) measure completion rate and answer quality, 5) try two strict policies when time might be missed (Kill and Skip-Next). Why it matters: Real systems care about both: do you finish, and is the answer good?
🍞 Anchor: A runner must cross the line before the horn and still keep good running form.
Competitors
- Vanilla (no eviction): Often overruns—good potential quality, but many late or empty answers.
- Fixed eviction ratios (25%, 50%, 75%, 95%): Faster but may over-delete memory, hurting answers.
- AWQ (4-bit weight quantization): speeds up inference through offline weight quantization, but isn’t tuned to each request’s time budget.
Scoreboard (with context)
- TimeBill reached state-of-the-art average answer quality while keeping completion rates competitive with the most aggressive fixed eviction (95%). That’s like earning an A in quality while still turning in almost every assignment on time.
- Vanilla lagged: too many late submissions meant low completion and lower average scores (since late jobs can count as zero).
- Fixed eviction showed a hump: small evictions improved completion and slightly boosted average scores, but heavy evictions slashed quality.
- AWQ was a small boost over vanilla but didn’t match TimeBill. TimeBill is compatible with quantization, so you can combine both.
🍞 Hook: If you expect traffic, leaving earlier makes it more likely you arrive on time.
🥬 The Concept (Pessimistic Factor Findings): What it is: A knob that inflates predicted length for safety. How it works: 1) Low values (1–5) raised completion and average scores by avoiding overruns, 2) too-high values (6–8) forced excessive eviction, lowering answer quality. Why it matters: Choosing a reasonable cushion gives you on-time arrivals without over-cutting.
🍞 Anchor: Planning to arrive 10 minutes early helps; planning an hour early might make you throw away half your presentation to rush.
Other results
- The timing estimator was very accurate: about 1–2% error in phase-level estimates and tight end-to-end predictions with a safe upper bound.
- The fine-grained length predictor outperformed BERT-based coarse classifiers and direct regression, especially with many small buckets (e.g., 512).
05 Discussion & Limitations
Limitations
- Accuracy depends on the quality of the length predictor and the timing estimator; weak predictions can lead to under- or over-eviction.
- Profiling-based timing curves need to be refreshed if hardware, drivers, or kernels change significantly.
- Very unusual prompts or models might produce lengths outside training habits; the pessimistic factor helps but can still over-cut.
- The approach currently tunes only the KV eviction ratio; other knobs (like temperature, top-k, or adaptive early stopping) aren’t yet co-optimized.
Required resources
- A small helper model (SLM) and training data for length prediction aligned to your target LLM.
- Profiling runs on the actual serving hardware to fit timing curves.
- A serving stack that can evict KV cache at runtime (e.g., with SnapKV-like support) and run predictor + compression in parallel.
When not to use
- If your application does not care about timing (no deadlines), fixed high-quality settings may be simpler.
- If hardware changes very frequently or you can’t profile, estimates may drift.
- If your task is ultra-sensitive to context loss (where any eviction harms correctness), consider alternative speedups before eviction.
Open questions
- Can we jointly optimize more knobs (KV quantization, dynamic stopping, retrieval strategies) with the same time guarantee?
- Could online learning adapt the pessimistic factor per workload automatically?
- How well does this scale across multi-tenant servers with interference and queueing effects?
- Can we predict not only length but also quality impact of eviction for even smarter choices?
06 Conclusion & Future Work
Three-sentence summary
- TimeBill plans ahead: it predicts how long an answer will be and how much time the model will need, then chooses just enough KV cache eviction to finish on time.
- A fine-grained small-model length predictor and a workload-guided, profiled timing estimator make the plan accurate and trustworthy.
- In tests, TimeBill improved both on-time completion and answer quality compared to common baselines.
Main achievement
- Turning uncertain, sometimes-late LLM runs into dependable, on-time responses with minimal quality loss—by aligning length prediction, timing estimation, and adaptive eviction per request.
Future directions
- Co-optimizing more runtime knobs (quantization level, retrieval, early stopping) and learning the pessimistic factor online.
- Extending timing guarantees to multi-model pipelines and multi-tenant clusters with queueing.
- Predicting quality impact directly to guide eviction choices even better.
Why remember this
- In safety- and mission-critical settings, “on time and good enough” beats “perfect but late.” TimeBill shows that a little smart planning—predict, estimate, adapt—can make LLMs reliable partners under real deadlines.
Practical Applications
- Autonomous driving assistants that must deliver safe decisions within strict millisecond-level budgets.
- Industrial automation controllers that need timely instructions to keep assembly lines synchronized.
- Robotics task planning where late or partial responses can cause motion errors or safety hazards.
- On-device assistants with tight latency goals, balancing accuracy and battery consumption.
- Call center bots that must reply within service-level targets while keeping helpfulness high.
- Real-time game NPC narration or hints that must appear in sync with gameplay events.
- Live captioning or translation systems that must keep up with speakers without big delays.
- Emergency dispatch triage tools that must produce short, reliable summaries quickly.
- AR/VR guides that need to render text prompts within frame-time constraints.
- Edge IoT devices that run small LLMs under power and time limits, using adaptive eviction.