Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Beginner
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi et al. (4/10/2025)
arXiv

Key Summary

  • The paper introduces Dynamic Cheatsheet (DC), a simple way for language models to keep a tiny, smart notebook of useful tricks while they are being used.
  • Instead of training the model again, DC lets it remember short strategies, formulas, and code snippets that worked before and reuse them later.
  • Two versions exist: DC-Cu (curate after answering) and DC-RS (retrieve helpful past cases first, then synthesize a better cheatsheet before answering).
  • On hard math and science tasks, DC helped big models like Claude 3.5 Sonnet and GPT-4o make far fewer repeat mistakes and score much higher.
  • In the Game of 24, GPT-4o jumped from 10% to 99% after it discovered and reused a Python solver stored in memory.
  • On AIME exams, Claude 3.5 Sonnet doubled its accuracy by saving general math insights and pulling them back when needed.
  • DC avoids stuffing the whole conversation into the prompt; it keeps only short, reusable pieces, which saves space and focuses the model.
  • Smaller models benefit less because they don’t produce enough good solutions to fill the cheatsheet with reliable tips.
  • DC works with black-box APIs since it never changes the model’s internal weights, making it practical to deploy.
  • Overall, DC turns one-off answers into a growing library of know-how, making LMs feel more like students who learn from each problem.

Why This Research Matters

Dynamic Cheatsheet turns language models from forgetful test-takers into learners who carry forward what works. This reduces repeated mistakes, especially in arithmetic and structured reasoning, and speeds up future problem-solving. It’s practical because it doesn’t need retraining or internal weight changes, so it works with black-box APIs. In classrooms and tutoring tools, it can help students see consistent, improving performance across homework sets. In coding and data tasks, it captures proven snippets and checklists, cutting bugs and rework. In research and professional settings, it builds a lightweight, living reference of verified steps and rules. Overall, it makes AI more dependable over time, not just smart in single moments.

Detailed Explanation

01. Background & Problem Definition

You know how when you study for a test, you don’t want to relearn the same trick every time? You keep a tiny notebook of shortcuts so you can solve new problems faster. That’s what this paper wants language models (LMs) to do: stop solving each question in a vacuum and start carrying forward the hard-won lessons.

The world before: Large language models became very good at many tasks—explaining, summarizing, coding, and even solving multi-step math problems. But they usually acted like goldfish: every new question was a fresh start. Even if they discovered a great trick on Monday, they might not use it on Tuesday because they didn’t store it anywhere. People tried to fix this by fine-tuning models (changing their inside settings) or by attaching giant document libraries (retrieval-augmented generation). These helped with facts and style but didn’t teach the model to reuse its own problem-solving methods from one question to the next.

The problem: LMs often repeat the same errors and repeat the same rediscovery of good tricks. Imagine doing 100 puzzles and forgetting every helpful step you used on puzzle #3. That wastes time and causes silly mistakes (especially with arithmetic), and it means the model can’t build momentum across a test.

Failed attempts: 1) Full-history appending: People tried pasting the entire chat history into the prompt. That created long, messy inputs that were hard to search and often exceeded context limits, so it was slow, expensive, and noisy. 2) Retrieval of raw examples: Just pasting a few similar Q&A pairs helped a bit, but it didn’t turn scattered answers into general, reusable strategies. 3) Extra thinking strategies like chain-of-thought or tree-of-thought can help on a single question, but the benefit disappears after that question because nothing is saved.

The gap: What was missing was a light, evolving memory for strategies—like a student’s cheatsheet that grows during the test. Instead of saving everything, it should save just the reusable, bite-sized pieces: algorithms, code snippets, common formulas, and clever patterns. And it should work with black-box models where you can’t touch the inside weights.

Real stakes: This matters anywhere we want fewer mistakes and faster progress as tasks continue. Think homework helpers that stop miscounting after they learn a reliable way to check their math, coding copilots that remember the perfect snippet for a common bug, or research assistants that keep a short list of theorems, units, and rules they’ve verified are correct. For example, in the Game of 24, a model can waste time trying random arithmetic. But if it once discovers a short Python brute-force trick and stores it, every future puzzle becomes a quick win. In math exams like AIME, DC helped models remember algebra and combinatorics moves. In knowledge-heavy science questions (GPQA-Diamond, MMLU-Pro), DC let the model keep a compact set of principles and references.
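
To make the Game of 24 example concrete, here is a minimal sketch of the kind of brute-force solver a model could discover once and store in its cheatsheet. The paper does not publish the exact snippet its models wrote; this is an illustrative reconstruction that tries all orderings, operators, and parenthesizations of four numbers.

```python
from itertools import permutations, product

def solve_24(nums, target=24, eps=1e-6):
    """Brute-force search for an arithmetic expression over four numbers
    that evaluates to `target`. Returns the expression string, or None."""
    ops = "+-*/"
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            # The five distinct parenthesization shapes for four operands.
            for expr in ((f"(({a}{o1}{b}){o2}{c}){o3}{d}"),
                         (f"({a}{o1}{b}){o2}({c}{o3}{d})"),
                         (f"({a}{o1}({b}{o2}{c})){o3}{d}"),
                         (f"{a}{o1}(({b}{o2}{c}){o3}{d})"),
                         (f"{a}{o1}({b}{o2}({c}{o3}{d}))")):
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr
                except ZeroDivisionError:
                    continue
    return None
```

Once an entry like this sits in the cheatsheet, every later 24-game puzzle reduces to running it, which is why accuracy jumps so sharply.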

Bottom line: Before this work, models treated each question as a one-off. Now, with Dynamic Cheatsheet, they can roll their experience forward, cut routine errors, and improve through the test without needing retraining or human grading.

02. Core Idea

The “Aha!” in one sentence: Give the model a tiny, well-edited notebook (cheatsheet) that it updates as it goes, so it can reuse good strategies instead of starting from scratch each time.

Multiple analogies:

  1. School binder: You solve a few problems, write the useful trick on an index card, and later flip to that card to solve similar problems faster.
  2. Video game inventory: When you find a great tool, you keep it. Next level? Use that tool again—no need to hunt for it from scratch.
  3. Chef’s prep station: The chef keeps small containers of prepped ingredients. When a new order comes in, they grab the right containers instead of re-prepping everything.

Before vs. After:

  • Before: Each question is a fresh start; great ideas get lost; silly mistakes recur; long transcripts clog the context.
  • After: The model stores clear, short, reusable strategies; pulls them back when relevant; makes fewer repeat mistakes; and solves related tasks more consistently.

Why it works (intuition, no equations): Good problem-solving has patterns. If the model distills those patterns into small, general tips—like a Python snippet that tests arithmetic combinations or a checklist for physics units—then every similar future question becomes easier and more reliable. Curating the memory keeps it small and focused, avoiding the distraction of full transcripts.

Building blocks explained with the Sandwich pattern:

🍞 Hook: Imagine taking notes during a test so you don’t forget what worked. 🥬 The Concept: Dynamic Cheatsheet (DC) is a growing, external memory of short strategies, examples, and code the model can reuse. How it works: (1) The model checks its cheatsheet, (2) solves the problem, (3) updates the cheatsheet with only the best, general tips. Why it matters: Without DC, the model repeats old mistakes and rediscovers tricks over and over. 🍞 Anchor: In Game of 24, the model found a Python solver once, saved it, and then aced almost all future puzzles.

🍞 Hook: Picture learning while you’re taking the test. 🥬 The Concept: Test-Time Learning means the model improves during use, not only during training. How it works: (1) For each new question, the model can consult and update its cheatsheet; (2) useful strategies are kept; (3) bad ones are pruned. Why it matters: Without it, the model’s effort on one question never helps the next. 🍞 Anchor: Claude’s AIME accuracy jumped when it kept algebra and counting insights from earlier problems.

🍞 Hook: Think of a librarian picking only the best books for a small shelf. 🥬 The Concept: Memory Curation is choosing what to save, improving it, and removing clutter. How it works: (1) Judge if a solution is correct and general, (2) rewrite it as a short, reusable tip, (3) remove outdated or wrong items. Why it matters: Without curation, the notebook gets bloated and confusing. 🍞 Anchor: Keeping a single, polished Python template beat pasting pages of raw chat history.

🍞 Hook: When you need a recipe, you look up the right card. 🥬 The Concept: Memory Retrieval is finding the most relevant stored tips for the new question. How it works: (1) Compare the new question to past ones, (2) pull the closest matches and their distilled strategies, (3) use them to guide the solution. Why it matters: Without retrieval, the right help stays buried. 🍞 Anchor: For GPQA-Diamond science questions, pulling a compact rules list helped Claude choose better answers.

🍞 Hook: A chef cooks using both the pantry and their skills. 🥬 The Concept: Solution Generation is the model’s actual answering step, guided by helpful memory entries. How it works: (1) Read the question and relevant memory tips, (2) plan steps, possibly write code, (3) produce a final answer. Why it matters: Without good generation, tips won’t turn into correct outputs. 🍞 Anchor: With the stored 24-game solver, generation simply ran the snippet and returned a correct expression.

Two DC flavors:

🍞 Hook: Do you sometimes write notes after you finish a problem? 🥬 The Concept: DC-Cu (Cumulative) updates the cheatsheet after answering. How it works: solve now, then curate; steadily grow the cheatsheet from experience. Why it matters: It’s simple and effective when related problems arrive in sequence. 🍞 Anchor: Claude doubled AIME 2024 accuracy by curating strategies after each question.

🍞 Hook: Or do you skim your notes before starting the next problem? 🥬 The Concept: DC-RS (Retrieval & Synthesis) retrieves relevant past examples first and refines the cheatsheet before answering. How it works: (1) Find similar past Q&A pairs, (2) synthesize or improve the cheatsheet, (3) answer using the refined memory. Why it matters: You benefit from fresh insights immediately. 🍞 Anchor: GPT-4o went from 10% to 99% on Game of 24 by retrieving and reusing its stored Python solver right away.

03. Methodology

At a high level: New Question → (Optional) Retrieve similar past cases → Curate/Refine cheatsheet → Generate solution using cheatsheet → Update cheatsheet.

Step-by-step recipe with purpose and examples:

  1. Prepare the workspace (the external cheatsheet)
  • What happens: Start with an empty or small cheatsheet that will hold short, well-labeled entries: strategies, formulas, and code snippets. Each entry can track a usage count.
  • Why it exists: You need a tidy place to store and find reusable know-how. If you skip this, tips get lost in long transcripts.
  • Example: Create a section called “Reusable Code” and save a tiny Python function that checks whether four numbers can make 24.
  2. For DC-RS, retrieve before you solve (optional but powerful)
  • What happens: When a new question arrives, find similar past questions and their solutions, then pass them to a curator to refine the cheatsheet before solving.
  • Why it exists: You want to use good patterns right away. Without retrieval, you might miss the best prior examples.
  • Example: A physics question about units triggers retrieval of a memory item titled “Unit Consistency Checklist” with tips on dimensions and conversions.
  3. Curate and synthesize the cheatsheet
  • What happens: The curator rewrites, merges, or removes entries so they’re short, accurate, and general. It extracts the essence from past solutions and tosses fluff.
  • Why it exists: If you don’t curate, the cheatsheet bloats, becomes noisy, and retrieval gets worse.
  • Example: Replace three slightly different 24-game snippets with one clean, commented function; increment its usage count.
  4. Generate the solution using the updated cheatsheet
  • What happens: The solver reads the question, consults the cheatsheet, and produces an answer with clear steps. If coding helps, it writes a self-contained Python block and runs it (as allowed by the setup).
  • Why it exists: This is where tips become results. Without using the cheatsheet, you’d re-derive tricks or make the same errors.
  • Example: For 7, 7, 8, 11 in Game of 24, the solver grabs the saved function, runs it, and outputs a valid expression if found.
  5. Update the cheatsheet after solving (DC-Cu and the final step of DC-RS)
  • What happens: The curator decides what to keep, what to fix, and what to remove based on the latest attempt. It promotes general strategies and prunes fragile ones.
  • Why it exists: Learning happens here. Without this, the model won’t improve across questions.
  • Example: On an AIME geometry problem, it adds a short checklist: “Angle chasing steps,” referencing theorems and a diagram tip.
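
The recipe above can be sketched as one loop. This is not the authors' code; it is a hedged outline in which `llm` stands for any black-box completion call, `retrieve` is a similarity search over past cases, and the prompts are illustrative placeholders for the paper's curator and solver prompts.

```python
def dc_rs_step(question, cheatsheet, memory, llm, retrieve):
    """One DC-RS iteration: retrieve -> curate -> generate -> update.
    `memory` is a list of past (question, answer) pairs."""
    # 1. Retrieve similar past cases (DC-Cu skips this and step 2).
    similar = retrieve(question, memory, k=3)
    # 2. Curate: refine the cheatsheet in light of the retrieved cases.
    cheatsheet = llm(f"Cheatsheet:\n{cheatsheet}\n\nPast cases:\n{similar}\n\n"
                     "Rewrite the cheatsheet: keep only short, general, reusable tips.")
    # 3. Generate: answer with the refined cheatsheet in context.
    answer = llm(f"Cheatsheet:\n{cheatsheet}\n\nQuestion: {question}\n"
                 "Solve step by step and give a final answer.")
    # 4. Update: curate again after answering, then store the new case.
    cheatsheet = llm(f"Cheatsheet:\n{cheatsheet}\n\nNew solution:\n{answer}\n\n"
                     "Add any new general strategy; prune stale or wrong entries.")
    memory.append((question, answer))
    return answer, cheatsheet
```

Note that the model's weights never change: all learning lives in the `cheatsheet` string and the `memory` list that persist between calls.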

Data formatting and discipline:

  • Final answer formatting: The system enforces a simple, machine-readable answer block so graders can parse outputs without confusion.
  • Accuracy metrics: Different tasks are judged appropriately—flexible matching for multiple-choice or functional correctness for arithmetic puzzles.
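
A machine-readable answer block can be enforced with a few lines of parsing. The exact delimiter the paper's graders use is not specified here; the `FINAL ANSWER:` marker below is a hypothetical convention chosen only to illustrate the idea.

```python
import re

def extract_final_answer(output: str):
    """Pull the machine-readable answer from a solver's output, assuming
    (hypothetically) that it ends with a line 'FINAL ANSWER: <value>'."""
    match = re.search(r"FINAL ANSWER:\s*(.+)", output)
    return match.group(1).strip() if match else None
```

Enforcing one such format keeps grading deterministic: the grader parses the block instead of guessing which number in a long explanation was meant as the answer.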

Concrete mini-examples:

  • Game of 24: Early tries are manual. Then the model writes a small brute-force solver, stores it, and instantly reuses it for the remaining 90+ puzzles.
  • Math Equation Balancer: The model stores a code routine to place operators. Accuracy jumps close to perfect because execution beats mental arithmetic.
  • AIME: The model keeps algebra/combinatorics patterns, like a checklist for counting problems, so those ideas transfer across questions.
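
For the equation-balancer example, the stored routine could be as small as the sketch below, which places operators between fixed numbers until the expression hits the target. This is an illustrative reconstruction, not the paper's snippet, and it evaluates with standard Python operator precedence.

```python
from itertools import product

def balance_equation(numbers, target):
    """Try every placement of +, -, *, / between `numbers` until the
    resulting expression equals `target`. Returns the expression, or None."""
    for ops in product("+-*/", repeat=len(numbers) - 1):
        expr = str(numbers[0])
        for op, n in zip(ops, numbers[1:]):
            expr += op + str(n)
        try:
            if abs(eval(expr) - target) < 1e-9:
                return expr
        except ZeroDivisionError:
            continue
    return None
```

Execution beats mental arithmetic here: the snippet either finds a valid placement or proves none exists, which is why accuracy on this task approaches 100% once the routine is stored.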

The secret sauce:

  • It’s not just memory; it’s curated, tiny, and reusable. That combo makes retrieval fast and guidance clear.
  • DC-RS improves timing: retrieve similar cases and refine memory first, so the current answer benefits immediately.
  • The system stays black-box friendly: no weight updates required. You can use commercial APIs as-is and still get learning behavior.

Failure modes and safeguards:

  • If the base model is too weak, it won’t produce enough correct seeds. The cheatsheet then fills with noise. Guardrails: keep entries short, prefer code that self-verifies, and increment usage counts to prioritize proven tips.
  • If retrieval pulls the wrong examples, it can confuse the solver. Use better embeddings and consider domain tags (e.g., ‘combinatorics’, ‘kinematics’) to route to the right memory section.
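
The domain-tag safeguard can be folded directly into retrieval. Below is a minimal sketch, assuming memory entries are stored as (embedding, domain tag, entry) triples; the embedding model itself is left abstract, and in practice a vector index would replace the linear scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, memory, k=3, domain=None):
    """Return the k entries most similar to `query_vec`, optionally
    restricted to a domain tag (e.g. 'combinatorics', 'kinematics')."""
    pool = [m for m in memory if domain is None or m[1] == domain]
    ranked = sorted(pool, key=lambda m: cosine(query_vec, m[0]), reverse=True)
    return [entry for _, _, entry in ranked[:k]]
```

Filtering by tag before ranking keeps a physics question from surfacing a combinatorics tip just because the two happen to share surface vocabulary.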

04. Experiments & Results

The tests: The authors measured accuracy on tough reasoning tasks where the model can really benefit from building and reusing strategies across questions. They used functional correctness for puzzles (does the expression actually work?) and soft matching for multiple-choice (ignore tiny formatting differences).

The competition (baselines):

  • BL (Baseline): just solve each question with minimal instructions, no memory.
  • DC-∅ (Empty Memory): use structured guidance but never store anything—tests the value of storage itself.
  • FH (Full History): paste every past Q&A into the prompt—usually bloated and noisy.
  • DR (Dynamic Retrieval): fetch similar past Q&A but without curating into generalized tips.

The scoreboard with context:

  • Game of 24 (100 puzzles): GPT-4o exploded from 10% (like getting 1 out of 10 right) to 99% (almost perfect) using DC-RS, thanks to discovering and reusing a Python solver. DC-∅ only reached 19%, proving storage and reuse were the real boosters.
  • AIME 2024/2025 and AIME 2020–2024: Claude 3.5 Sonnet more than doubled accuracy on AIME 2024 (23.3% to 50.0%) and jumped big on 2025 as well. GPT-4o also improved (e.g., 20% to 40% on AIME 2024 with DC-RS). That’s like moving from guessing to having a plan.
  • Math Equation Balancer: Both Claude and GPT-4o went from about half-correct to near-perfect (around 98–100%) once they stored and reused a simple code routine.
  • GPQA-Diamond (science): Claude rose from 59.6% to 68.7% using DC-RS—roughly a 9-point jump just from test-time memory. GPT-4o’s gains were smaller, showing that retrieval precision and curation quality matter and can vary by model.
  • MMLU-Pro (Engineering/Physics): Claude saw steady gains (up to +8 points in Physics); GPT-4o sometimes dipped, likely because retrieved content wasn’t consistently helpful for it.

Surprising findings:

  • Code beats chatter: As soon as the model realized a short program could check many combinations, accuracy rocketed and stayed high.
  • Curated beats full-history: Keeping a small, polished cheatsheet outperformed dumping everything into the context, which often confused the model and wasted tokens.
  • Model size matters: Smaller models benefited less; they didn’t create enough correct seeds to fill the cheatsheet with reliable tips.

Takeaways:

  • DC turns a sequence of questions into a learning journey. When problems are structurally related (like many AIME questions or the 24-game puzzles), DC shines even brighter.
  • Retrieval timing helps: refining the cheatsheet before answering (DC-RS) can deliver immediate benefits on the current question.

05. Discussion & Limitations

Limitations:

  • Depends on base ability: If the model rarely produces correct steps, the cheatsheet fills with weak strategies and can mislead future answers.
  • Retrieval noise: Pulling poorly matched examples can confuse the solver, especially on diverse topics.
  • Long-context generation: Some models struggle to rewrite or reorganize longer memory nicely; they may refer vaguely to old content instead of properly updating it.
  • Sequential cost: Curating after every query adds overhead; benefits appear as the session continues and may not help isolated, single-shot tasks.

Required resources:

  • A black-box LM API with stable reasoning ability (Claude 3.5 Sonnet, GPT-4o worked best here).
  • An embedding model and a simple vector index for retrieval (for DC-RS).
  • Light storage for the cheatsheet (text store or small database) and, if allowed, a code execution sandbox.

When not to use:

  • One-off, unrelated questions where future reuse is unlikely.
  • Very small or weak models that can’t supply accurate seeds to learn from.
  • Strict latency budgets where the extra curation step is too costly.

Open questions:

  • How to auto-clean memory: Can we better detect and prune wrong or stale strategies without ground-truth labels?
  • Smarter retrieval: Could hierarchical or domain-aware retrieval pick more reliable tips and avoid topic-crossing confusion?
  • Memory sharing: Can a strong model’s cheatsheet safely help a smaller one, and how should entries be simplified?
  • Tool diversity: Beyond Python, which tools (symbolic math, solvers, search) give the biggest long-term lift when stored as reusable patterns?
  • Curriculum ordering: What’s the best way to order test questions so the cheatsheet ramps up quickly?

06. Conclusion & Future Work

Three-sentence summary:

  • Dynamic Cheatsheet gives language models a tiny, evolving memory that stores short, reusable strategies, code snippets, and patterns during use.
  • By retrieving and curating these tips, models make fewer repeat mistakes and solve related problems more accurately, all without changing their internal weights.
  • Across math, puzzles, and science, DC delivered large gains (like 10% to 99% on Game of 24), especially for capable base models.

Main achievement:

  • Showing that carefully curated, persistent, and compact memory at test time can transform a black-box LM from a one-off answerer into a learner that improves across a sequence of tasks.

Future directions:

  • Better retrieval and pruning to keep memory clean; domain-specific sub-memories; curriculum-style ordering; and broader tool use (symbolic math engines, domain solvers) captured as reusable entries.

Why remember this:

  • DC is a practical, training-free bridge between isolated answers and experience-driven learning. It makes LMs feel more like students who carry forward what works, turning early effort into later wins.

Practical Applications

  • Math tutoring systems that remember and reuse step-by-step checklists for algebra, geometry, and combinatorics.
  • Coding assistants that store verified bug-fix snippets and testing templates for common errors.
  • Scientific Q&A tools that keep compact unit/conversion guides and domain theorems to avoid repeated mistakes.
  • Customer support bots that save successful troubleshooting flows and reuse them for similar tickets.
  • Data cleaning assistants that remember pattern-matching scripts for recurring formatting issues.
  • Legal or policy helpers that store curated citation patterns and compliance checklists.
  • Robotics or IoT assistants that keep reusable diagnostic routines for sensors and actuators.
  • Exam-prep platforms that build a personal cheatsheet of strategies from a student’s solved problems.
  • Finance or accounting bots that retain validated reconciliation steps and formula checks.
  • Business analytics copilots that save common SQL query patterns and visualization templates.
Tags: Dynamic Cheatsheet, test-time learning, memory curation, memory retrieval, black-box language models, retrieval-augmented generation, inference-time adaptation, strategy distillation, code reuse, AIME, GPQA-Diamond, MMLU-Pro, Game of 24, curated memory, online learning