Learning to Discover at Test Time
Key Summary
- This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.
- The method, called TTT-Discover, treats the test problem like a mini-world for reinforcement learning and updates the model after every batch of attempts.
- A special entropic objective makes the model strongly favor its best ideas in each batch, which helps it climb to record-breaking solutions rather than just improving the average.
- A PUCT-based chooser smartly reuses the most promising past attempts while still exploring new directions, extending the effective horizon of search.
- Across math, GPU kernels, algorithm design, and biology, TTT-Discover set new state-of-the-art results using an open 120B model within a modest test-time compute budget.
- In math, it improved classic bounds in the Erdős minimum overlap problem and a key autocorrelation inequality using new high-piece step functions and targeted optimizers.
- In GPU kernel engineering, it discovered fused, mixed-precision kernels up to about twice as fast as prior bests on some hardware, validated by competition organizers.
- In algorithm contests (AtCoder), it produced winning-level C++ solvers, outperforming human leaders and prior AI systems in the tested challenges.
- In single-cell denoising, it beat widely used methods like MAGIC on benchmark metrics by adding learned refinements that directly optimize the score.
- The approach is reproducible, uses open models, and costs only a few hundred dollars per problem, opening a practical path to AI-assisted scientific discovery.
Why This Research Matters
This work turns test-time compute into focused learning on the exact problem you care about, enabling breakthroughs rather than averages. In practice, that means cheaper and greener AI via faster kernels, tighter mathematical bounds that guide future research, and stronger industrial solvers for scheduling and planning. In biology, better denoising extracts more value per expensive experiment, accelerating discovery. Because it runs on open models with modest budgets, labs and startups can adopt it without exclusive access to frontier systems. By aligning the learning objective with discovery's real goal (a single best solution), TTT-Discover provides a practical blueprint for AI to help humans push the edge of what's possible.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you face a brand-new puzzle, you don't just guess once and walk away, but instead try, see what worked a little, learn, and try again smarter?
The Concept (Discovery problems and state of the art):
- What it is: A discovery problem is one where we want to beat the best-known result (the "state of the art") on a specific, hard task.
- How it works:
- We start from the best result people already know.
- We make new attempts and measure how good they are.
- We only "win" if we find a result that's better than the current best.
- Why it matters: Without focusing on beating the best, a system might settle for "pretty good" and never push the frontier. Anchor: Imagine a paper airplane contest where the record flight is 30 meters. Discovery means building one plane that flies 30.1 meters, not having ten planes that average 20 meters.
The World Before: Large language models (LLMs) could write code, sketch proofs, and propose algorithms. But when a problem truly needed new ideas beyond the training data, like shaving microseconds off a GPU kernel or tightening a math bound, most systems used test-time search with a frozen model. They'd prompt the model thousands of times, store the best candidates, and craft new prompts using evolutionary heuristics. The model itself never learned from its test-time experience. It was like a student who tries many approaches but isn't allowed to internalize any lesson.
Hook: Imagine guessing passwords over and over but never remembering which guesses were closer. That's wasteful!
The Concept (Frozen-model search):
- What it is: Prior methods kept the LLM's weights fixed and improved only the prompts and candidate selection.
- How it works:
- Sample many answers from the model.
- Keep the best ones in a buffer.
- Generate new prompts from the buffer using hand-crafted heuristics.
- Why it matters: It scales with more samples, but the model never becomes smarter about this specific problem. Anchor: It's like trying 10,000 lock combinations without ever learning which digits were promising.
People tried two common strategies. First, Best-of-N: just sample N times and keep the best. Second, evolutionary search: mutate and combine previous attempts and re-prompt the model. These help, but when the target is "find one excellent answer," optimizing average performance or improving prompts alone often stalls.
Hook: You know how practice makes perfect, if you actually learn from the practice?
The Concept (Test-Time Training, TTT):
- What it is: A way to keep training the model while it is solving the test problem, so it learns from its own fresh attempts.
- How it works:
- Attempt the problem and get feedback (a reward score).
- Update the model's weights using that feedback.
- Try again with the now-smarter model.
- Why it matters: Without TTT, the model can't internalize what worked on this exact problem. Anchor: Baking your first cake and adjusting the recipe after every bite until it's delicious, during the same baking session.
The Problem: Standard reinforcement learning (RL) aims for a policy that does well on average across many states and is robust for repeated deployment. Discovery problems are different: we only care about one test instance and one best result. Optimizing the average can hide the few brilliant attempts we actually want.
Hook: Think of grading only the single best shot in basketball practice, not your average over 100 shots.
The Concept (Continuous rewards):
- What it is: A score that smoothly reflects how good a candidate is (e.g., faster runtime = higher score).
- How it works:
- Define a numeric score for each attempt.
- Use the score to compare and select improvements.
- Keep trying to push the score higher.
- Why it matters: We can tell "how much better," not just "better or worse," which guides fine-grained progress. Anchor: Timing your sprint every day gives you exact seconds to beat, not just a pass/fail.
Failed Attempts: Naive test-time RL with expected reward and no reuse often under-explores risky but high-payoff ideas, starts each try from scratch (short horizon), and treats a near-record and a non-record as almost the same unless carefully shaped. Evolutionary methods overfit to prompt design, and Best-of-N wastes learning signal in the attempts.
The Gap: A method that (1) keeps training the model on this single problem, (2) chases the best attempt instead of the average, and (3) smartly reuses strong partial solutions to extend how far each try can go.
Real Stakes: Faster GPU kernels make AI cheaper and greener. Better math bounds guide future theorems. Stronger scheduling and planning boost industry efficiency. Improved single-cell denoising can save money on experiments and speed up biology discoveries. That's why pushing from "good" to "best" on one concrete instance really matters.
02 Core Idea
Aha! Moment in one sentence: If discovery is judged by your single best result, then train the model at test time to chase and amplify its best ideas, right as they appear.
Hook: You know how in a talent show, you only care about your top routine, not the average of all your rehearsals?
The Concept (TTT-Discover):
- What it is: A test-time training method that updates the LLM while solving one problem, using an entropic objective to favor top-performing attempts and a PUCT chooser to smartly reuse promising states.
- How it works:
- Treat the problem as an RL environment with continuous rewards.
- Generate a batch of attempts (code, proofs, algorithms), score them, and store them.
- Update the model using an entropic objective that upweights the best samples.
- Pick the next starting points with a PUCT-inspired rule to extend strong directions but still explore.
- Repeat for dozens of steps; return the single best solution found.
- Why it matters: Optimizing for the best attempt (not the average) makes the system discover record-setters. Anchor: It's like a music coach who listens to your best take in each session and trains you to lean into what made it great.
Three analogies for the same idea:
- Rock climbing: Don't average all your routes; learn from the one path that got you highest and push even further next try.
- Cooking contest: Keep the notes from your tastiest batch, refine those exact choices, and occasionally try a bold spice.
- Treasure map: Start future searches near places where you already found gold, but still peek into a few new caves.
Before vs After:
- Before: Frozen models sampled a lot but couldn't internalize test-time lessons; averaging-focused objectives missed rare gems; reuse often lacked a principled balance between exploration and exploitation.
- After: The model keeps learning; the entropic objective pulls the policy toward its top performers; PUCT reuse lengthens the effective search horizon while preventing tunnel vision.
Why it works (intuition, no equations):
- The entropic objective: Instead of giving equal attention to all attempts, it gives extra weight to the best ones in a batch. Think of it as turning up the volume on your strongest ideas so the model shifts its behavior toward producing more like them.
- Adaptive temperature (β): If improvements are tiny, it sharpens focus more; if a few outliers could destabilize learning, it softens focus to stay stable.
- PUCT reuse: For choosing where to start next, it scores states by their best reachable reward (not the average), nudges starts toward higher-ranked states, and still boosts under-visited options. This grows deep improvements and keeps diversity.
Building blocks, each with a sandwich:
Hook: Ever keep a journal of your best tries so you can reuse what worked? The Concept (State reuse):
- What it is: Starting new attempts from previous good partial solutions.
- How it works: (1) Keep a buffer of past states; (2) pick a state to warm-start; (3) generate improved code/ideas from that point.
- Why it matters: It extends the effective horizonācomplex solutions often require multiple successive refinements. š Anchor: Editing a strong draft instead of writing from scratch every time.
š Hook: If you only need one bullseye, you should study your best shots most carefully. š„¬ The Concept (Entropic objective):
- What it is: A learning goal that magnifies the influence of the top samples in each batch.
- How it works: (1) Score all attempts; (2) compute weights that grow exponentially with score; (3) update the model more from higher-weight attempts; (4) adapt sharpness so training stays stable.
- Why it matters: It directly steers learning toward record-beating behavior. š Anchor: In a singing class, replay and analyze your best performance to reinforce what made it excellent.
Hook: Choosing where to look next can be as important as how you improve. The Concept (PUCT chooser):
- What it is: A rule to pick the next starting states that balances best-so-far potential with exploration.
- How it works: (1) Score each saved state using the maximum reward it led to; (2) add a bonus for high-ranked states; (3) add an exploration bonus for under-visited ones; (4) select states with the highest combined score.
- Why it matters: Avoids over-committing too soon while still pushing promising directions further. Anchor: In a science fair project, you spend more time on the design that's showing the most promise, but you still test a few wild ideas.
Putting it together: TTT-Discover forms a tight loop: generate, score, learn from the best, and reuse wisely. This loop turns test-time compute into skill on the specific problem, enabling one standout solution to emerge.
03 Methodology
High-level pipeline: Input (problem text and initial model) → Define environment (states, actions, reward, transition) → Loop over steps: [Select start (PUCT) → Generate many attempts → Evaluate reward → Update buffer → Update model (entropic objective)] → Output the single best solution found. A minimal code sketch of this loop appears below.
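The sketch below shows one way the outer loop could be written in Python. All names are illustrative assumptions for this explainer, not the paper's code: `puct_select` and `update_buffer` are placeholders whose possible bodies are sketched later in this section, `entropic_update` stands in for the entropic policy update, and the `env`/`policy` interfaces and buffer record fields are invented for the example.

```python
def ttt_discover(policy, env, seed_states, num_steps=50, rollouts_per_step=512, num_starts=8):
    # Reuse buffer H: one record per stored state (field names are assumptions).
    buffer = [{"state": s, "reward": env.reward(s), "best_child": None, "visits": 0}
              for s in seed_states]
    best = max(buffer, key=lambda e: e["reward"])

    for _ in range(num_steps):
        starts = puct_select(buffer, k=num_starts)        # PUCT chooser (sketched later)
        batch = []
        for entry in starts:
            entry["visits"] += 1
            # 512 rollouts per step in 8 groups of 64, each group sharing a start state.
            for _ in range(rollouts_per_step // num_starts):
                action = policy.generate(env.prompt(entry["state"]))   # thinking + final code
                s_next = env.transition(entry["state"], action)        # parse / execute
                r = env.reward(s_next)                                 # continuous score
                batch.append({"start": entry, "action": action, "state": s_next, "reward": r})
                entry["best_child"] = max(r, entry["best_child"] or 0.0)
                if r > best["reward"]:
                    best = {"state": s_next, "reward": r}

        buffer = update_buffer(buffer, batch)     # size-capped, lineage-aware (sketched later)
        policy = entropic_update(policy, batch)   # one entropic gradient step per batch

    return best["state"]                          # the single best solution found
```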
Step-by-step with sandwich explanations and examples:
- Define the environment for the single test problem. Hook: Imagine turning a puzzle into a little video game: each move gives you points. The Concept (Environment design):
- What it is: A formal wrapper around the problem: what a "state" looks like, what an "action" is, how actions change states, and how we score (reward).
- How it works:
- State s: a candidate solution (e.g., Triton kernel code; step function in math; C++ algorithm; denoising code).
- Action a: the modelās next proposal (thinking + code).
- Transition T: parse/execute code to get the next state s′.
- Reward R(s′): a continuous score (faster runtime gets higher reward, tighter bound gives higher reward, etc.).
- Why it matters: Clear rules let us try, score, and improve repeatedly. Anchor: In GPU kernels, state = kernel code; action = new kernel; T compiles/parses; reward = 1/runtime if it's correct. (A hypothetical wrapper is sketched just below.)
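To make the four roles concrete, here is a hypothetical environment wrapper for the GPU-kernel case. The class name, the `extract_code_block` helper, and the checker/benchmark callables are assumptions for this sketch, not the paper's interfaces.

```python
from dataclasses import dataclass
from typing import Callable

def extract_code_block(model_output: str) -> str:
    """Assumed parsing helper: a real system would pull the final code block out of the
    model's answer; this sketch simply returns the text unchanged."""
    return model_output

@dataclass
class KernelEnv:
    """Hypothetical discovery environment: state = kernel source code, reward = 1 / runtime."""
    correctness_check: Callable[[str], bool]   # verifies outputs against a reference
    benchmark: Callable[[str], float]          # returns runtime in microseconds, may raise

    def prompt(self, state_code: str) -> str:
        # The action is the model's next proposal, conditioned on the current state.
        return "Improve this Triton kernel for speed while keeping it correct:\n" + state_code

    def transition(self, state_code: str, model_output: str) -> str:
        # T: parse the model's answer into the next candidate state s'.
        return extract_code_block(model_output)

    def reward(self, candidate_code: str) -> float:
        # R(s'): continuous score; invalid or incorrect candidates score 0.
        try:
            if not self.correctness_check(candidate_code):
                return 0.0
            return 1.0 / self.benchmark(candidate_code)   # faster kernel => higher reward
        except Exception:
            return 0.0
```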
- Initialize a reuse buffer H with seed states. Hook: Before building a tower, you lay out your best base blocks. The Concept (Seeding the buffer):
- What it is: Start with empty or simple valid solutions, sometimes including known baselines.
- How it works: (1) Add trivial or random valid states (math); (2) add unoptimized reference kernels (GPU); (3) add a solid starter algorithm (AtCoder) or a baseline like MAGIC (biology).
- Why it matters: Gives safe places to start refinement and ensures we don't begin from nothing. Anchor: For AtCoder ahc039, they seeded from a strong ALE-Agent baseline; for single-cell denoising, they seeded MAGIC.
- Choose initial states to expand using PUCT. Hook: Picking the right starting ramp makes your skateboard trick more likely to land. The Concept (PUCT start-state selection):
- What it is: A score that mixes (a) how good a state's best child was, (b) a prior that prefers higher-ranked states, and (c) an exploration bonus for under-visited areas.
- How it works:
- For each state s, compute Q(s) = best reward from its descendants (or R(s) if never expanded).
- Add a rank-based prior P(s) and an exploration term decreasing with visit count n(s).
- Pick states with the highest combined score as batch starts.
- Why it matters: It deepens promising lines while preventing overfitting to just one path. Anchor: In math bounds, restart from a good step function you already improved, but still try a few fresh random starts. (One possible implementation is sketched just below.)
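One possible body for the `puct_select` placeholder used in the earlier loop sketch, following the three ingredients above (best descendant reward, rank-based prior, visit-count bonus). The constants and the exact prior form are illustrative assumptions, not the paper's values.

```python
import math

def puct_select(buffer, k=8, c_explore=1.0):
    """Pick k start states from the reuse buffer (illustrative constants).

    Each buffer record is assumed to carry:
      reward      - R(s), the state's own score
      best_child  - best reward reached from its descendants (None if never expanded)
      visits      - n(s), how many times s has been used as a start
    """
    total_visits = sum(e["visits"] for e in buffer) + 1

    # Rank-based prior: higher-reward states get a larger prior P(s).
    by_reward = sorted(range(len(buffer)), key=lambda i: buffer[i]["reward"], reverse=True)
    prior = [0.0] * len(buffer)
    for rank, i in enumerate(by_reward):
        prior[i] = 1.0 / (rank + 1)

    scores = []
    for i, e in enumerate(buffer):
        q = e["best_child"] if e["best_child"] is not None else e["reward"]   # Q(s)
        explore = c_explore * prior[i] * math.sqrt(total_visits) / (1 + e["visits"])
        scores.append(q + explore)

    top = sorted(range(len(buffer)), key=lambda i: scores[i], reverse=True)[:k]
    return [buffer[i] for i in top]
```

The key design choice mirrors the text above: Q(s) tracks the best reachable reward rather than the average, so a state that once led to a near-record keeps attracting new starts even if most of its children were mediocre.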
- Generate actions (attempts) with the current policy. Hook: Now we take many swings with our newest bat. The Concept (Batch rollouts):
- What it is: Produce many candidate solutions from the model for each chosen start.
- How it works:
- Group rollouts per chosen start to explore local variations.
- Constrain tokens so there's room to print the final code.
- Capture both thinking and final code in the buffer.
- Why it matters: Multiple shots around strong starts find sharper refinements. Anchor: For TriMul, 512 kernels per step, in 8 groups of 64, each group starting from a chosen state.
- Evaluate rewards by running/validating outputs. Hook: You can't know your time without a stopwatch. The Concept (Scoring and checks):
- What it is: Run correctness checks and compute continuous rewards.
- How it works:
- Parse/execute code (with time limits and safety checks).
- If invalid, reward = 0; else compute task-specific score (e.g., 1/runtime, tighter bound, higher contest score, lower MSE).
- Log results in the buffer.
- Why it matters: Reliable feedback keeps learning honest and stable. Anchor: In math, invalid step functions or timeouts get 0; valid ones score by certified bounds. (A small scoring sketch follows below.)
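A small sketch of what such a scoring routine might look like when candidates are runnable programs: execute under a timeout, treat any failure as reward 0, and otherwise convert the task metric into a continuous reward. The harness command and the printed-runtime convention are assumptions for this example, not the paper's evaluator.

```python
import subprocess

def score_candidate(candidate_path, timeout_s=60):
    """Run one candidate under a timeout and map the outcome to a continuous reward.

    Illustrative sketch: the evaluation harness is assumed to exit non-zero on failed
    correctness checks and to print the measured runtime (in microseconds) on success.
    """
    try:
        result = subprocess.run(
            ["python", candidate_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0                                  # a timeout counts as invalid
    if result.returncode != 0:
        return 0.0                                  # crash or failed correctness check
    try:
        runtime_us = float(result.stdout.strip())   # harness prints the runtime on success
    except ValueError:
        return 0.0                                  # unparseable output counts as invalid
    return 1.0 / runtime_us                         # faster => higher reward
```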
- Update the model using the entropic objective with adaptive sharpness. Hook: Turn up the volume on your best ideas so your next tries resemble them. The Concept (Entropic policy update):
- What it is: A policy-gradient style update that upweights top-reward attempts within each startās batch.
- How it works:
- For a chosen start, compute weights that grow with reward.
- Select β adaptively to keep the reweighting within a safe KL budget.
- Apply a KL regularizer to avoid drifting too far.
- One gradient step per batch with LoRA to keep updates efficient.
- Why it matters: The model learns to generate more like its current best samples. Anchor: If one kernel is much faster, the update leans toward code patterns that produced it. (An adaptive-sharpness sketch follows below.)
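A sketch of one way the adaptive sharpness could be chosen: scan increasing β values and keep the largest one whose induced weights stay within a KL budget of uniform weighting. The budget value, the β grid, and the helper names are illustrative assumptions, not the paper's exact recipe; the resulting weights would then scale per-sample policy-gradient terms in a single LoRA update.

```python
import numpy as np

def entropic_weights(rewards, beta):
    z = beta * (np.asarray(rewards, dtype=float) - max(rewards))
    w = np.exp(z)
    return w / w.sum()

def choose_beta(rewards, kl_budget=0.5, betas=tuple(np.geomspace(0.1, 100.0, 50))):
    """Largest beta whose reweighting stays within a KL budget of uniform weighting.

    Illustrative sketch: KL(w || uniform) measures how far the reweighted batch drifts
    from treating every attempt equally; a small budget keeps the update stable.
    """
    n = len(rewards)
    best_beta = betas[0]
    for beta in betas:                               # betas are sorted ascending
        w = entropic_weights(rewards, beta)
        kl = float(np.sum(w * np.log(np.clip(w, 1e-12, None) * n)))
        if kl <= kl_budget:
            best_beta = beta                         # sharper focus, still within budget
        else:
            break
    return best_beta

# The chosen weights multiply per-sample policy-gradient terms; a KL regularizer to the
# pre-update policy and one LoRA gradient step per batch keep the update cheap and stable.
```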
- Update the buffer and repeat for ~50 steps. Hook: Rinse and repeat; each loop makes the next loop smarter. The Concept (Continual, instance-specific learning):
- What it is: Online learning focused entirely on this one test instance.
- How it works:
- Keep only top states (size-capped) to maintain quality and diversity.
- Block full lineages within the same batch to reduce duplicates.
- Iterate: choose starts → generate → score → train.
- Why it matters: Converts test-time compute into specialized skill for this precise problem. Anchor: Over the steps, reward histograms shift upward; max rewards pass the prior SOTA. (A buffer-update sketch follows below.)
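One possible body for the `update_buffer` placeholder from the earlier loop sketch: cap the buffer size, keep the highest-reward states, and limit how many children of the same start survive, as a crude stand-in for the lineage blocking described above. The cap values are illustrative assumptions.

```python
def update_buffer(buffer, batch, max_size=100, max_children_per_start=4):
    """Merge newly scored states into the reuse buffer, keeping it small and diverse.

    Illustrative sketch: keep only the best few children of each start (a rough stand-in
    for lineage blocking), then rank everything by reward and cap the buffer size.
    """
    children = {}
    for item in batch:
        children.setdefault(id(item["start"]), []).append(item)

    new_entries = []
    for items in children.values():
        items.sort(key=lambda it: it["reward"], reverse=True)
        for it in items[:max_children_per_start]:
            new_entries.append({"state": it["state"], "reward": it["reward"],
                                "best_child": None, "visits": 0})

    merged = buffer + new_entries
    merged.sort(key=lambda e: e["reward"], reverse=True)
    return merged[:max_size]        # best-first, size-capped
```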
Concrete mini-examples:
- Math (Erdős): State = 600-piece step function; actions run Python that does FFT-accelerated descent + annealing with feasibility projection; reward = 1/upper bound; discovery = 0.380876 (new SOTA).
- Kernels (TriMul): State = Triton code; actions propose fusing LayerNorm and gating ops, switching to FP16, delegating big matmuls to cuBLAS; reward = inverse runtime; discovery = down to 1161 μs on H100.
- Algorithms (AtCoder): State = C++ solver; actions refine rectangles/beam search/annealing; reward = local test score; discovery = scores that would place 1st.
- Biology (Denoising): State = MAGIC-like code; actions add gene-adaptive transforms, SVD refinements, log-space polishing; reward = benchmark score; discovery = higher scores on PBMC/Tabula.
Secret sauce in one breath: Train at test time, favor the best attempts (entropic), start from the best partials (PUCT), and keep repeating until a record-breaker pops out.
04 Experiments & Results
The Tests and Why They Matter:
- Mathematics: Tighten classic bounds where even tiny improvements are meaningful (they rule out large families of possible proofs). Rewards are continuous via certified bounds from constructed step functions.
- GPU Kernel Engineering: Create faster kernels for real workloads; runtime is the reward. Faster kernels reduce costs across AI.
- Algorithm Design (AtCoder): Compete on realistic industry-like optimization tasks; private tests ensure generalization beyond local tuning.
- Biology (Single-cell denoising): Improve reconstruction metrics that labs already trust as proxies, helping extract more value per experiment.
The Competition (Baselines and Humans):
- Best-of-25600: Same total sampling budget as TTT-Discover but no learning and no reuse.
- Evolutionary baselines: OpenEvolve and reports from AlphaEvolve/ThetaEvolve/ShinkaEvolve.
- Top human entries: From official leaderboards or papers.
Scoreboard with Context:
- Mathematics:
  - Erdős' minimum overlap (lower is better for the upper bound): Best human 0.380927; prior best AI (AlphaEvolve) 0.380924; TTT-Discover 0.380876. That's like beating an Olympic record by more than the last record-breaker did.
  - First autocorrelation inequality (upper bound): Best human 1.50973; AlphaEvolve V2 1.50317; ThetaEvolve 1.50314; TTT-Discover 1.50286. Another new state of the art.
  - Second autocorrelation inequality (lower bound): Prior AI (AlphaEvolve V2) 0.961; TTT-Discover 0.959 (no new record here).
  - Circle packing: TTT-Discover matches the best reported results on the tested n, showing competitiveness.
- GPU Kernels (TriMul):
  - H100: Best human ≈ 1371 μs; TTT-Discover 1161 μs. That's a strong A+ while others score A's.
  - A100: Best human ≈ 4531 μs; TTT-Discover 2198 μs (about 2× faster), despite the training reward being measured on H100 only, showing good cross-hardware generalization.
  - B200/MI300X: Consistent >15% gains over the best humans in the measured setups. Expert organizers noted the method correctly fused memory-bound ops, used FP16 within tolerances, and sensibly delegated big matmuls.
- Algorithm Design (AtCoder Heuristic Contests):
  - ahc039 (geometry): Top human 566,997; TTT-Discover 567,062. A small but decisive win where prior AI lagged humans.
  - ahc058 (scheduling): TTT-Discover 848,414,228 (higher is better), topping reported human submissions and prior AI.
- Biology (Single-cell denoising):
  - PBMC: MAGIC variants ~0.42–0.64; Best-of-25600 0.62; OpenEvolve 0.70; TTT-Discover 0.71 (lower MSE 0.15, same Poisson 0.05 constraints).
  - Tabula: MAGIC ~0.40–0.64; Best-of-25600 0.65; OpenEvolve 0.71; TTT-Discover 0.73 (MSE 0.14). Gains are consistent across datasets.
Surprising Findings:
- Best-of-25600 sometimes improved math bounds a bit, showing the base model had latent capability, but TTT-Discover surpassed these results by learning.
- Training only on H100 runtimes still yielded kernels that generalized to other GPUs, suggesting the learned improvements (fusion, precision choice) captured broadly useful patterns.
- In biology, small algorithmic refinements that explicitly target the metric produced robust gains without totally rewriting the baseline, hinting that metric-aware polishing at test time is powerful.
Ablations (TriMul on H100):
- Full TTT-Discover (entropic + PUCT) beat all variants. Constant-β entropic degraded in later steps; the expected-reward objective improved only slowly; no TTT stagnated; ε-greedy reuse helped somewhat but was weaker; no reuse or naive RL barely moved the needle. This supports that both the entropic objective and PUCT reuse are key.
Compute/Practicality:
- Model: Open gpt-oss-120b via Tinker.
- Budget: 50 steps × 512 rollouts per step (25,600 rollouts in total, matching the Best-of-25600 baseline); a few hundred dollars per problem.
- Reproducibility: Open code and open model; results reviewed by domain experts where applicable.
05 Discussion & Limitations
Limitations (be specific):
- Continuous, verifiable rewards: The method currently relies on a smooth numeric score and reliable validation. Binary/sparse rewards (e.g., āproof accepted or notā) are much harder.
- Single-instance focus: It does not aim to generalize across tasks; it aims to win on one problem. That's by design, but it means you may need to re-run for each new instance.
- Stability/knobs: Although hyperparameters were mostly fixed here, entropic sharpness, KL penalties, batch sizes, and reuse settings can interact; extreme tasks may still need tuning.
- Resource ceilings: While affordable compared to pretraining, some domains (huge compiles, long proofs) can make rollouts slow or infrastructure-heavy.
- Safety/numerics: Mixed precision or aggressive fusions that pass benchmarks might still pose edge-case issues in downstream pipelines; human review remains valuable.
Required Resources:
- An LLM capable of code/Math/algorithm reasoning (here, gpt-oss-120b) and a test-time RL loop with LoRA updates.
- A reliable evaluator with correctness checks and trusted timing/metrics.
- Enough budget for tens of thousands of sampled tokens per step and batch evaluation (often a few hundred dollars per problem).
When NOT to Use:
- Problems without automatic or continuous rewards (e.g., subjective writing quality without a robust judge) unless you add a stable, low-noise reward proxy.
- Situations needing broad generalization across many future instances; then training-time RL or supervised finetuning across distributions may be better.
- Real-time/low-latency settings where multiple test-time training steps are impossible.
Open Questions:
- Sparse/binary rewards: How to extend entropic updates and reuse when only rare successes provide signal?
- Learning-to-reuse: Can we learn the reuse policy (like PUCT's components) end-to-end for even better selection?
- Evaluation leakage: How to ensure metric-targeted polishing doesn't overfit to quirks of the evaluator while retaining real-world validity?
- Multi-instance curriculum: For families of related problems, can we sequence instances to bootstrap faster discoveries?
- Safety and robustness: How to integrate proofs of correctness or formal verification more tightly, especially for numerical kernels and math outputs?
06 Conclusion & Future Work
Three-sentence summary: TTT-Discover keeps training a language model at test time on a single hard problem, using an entropic objective to chase its best attempts and PUCT to smartly reuse promising partial solutions. This instance-focused loop turns raw test-time compute into skill for that exact task, surfacing one standout solution rather than many average ones. Across math, GPU kernels, algorithm design, and biology, it set or matched state-of-the-art results using open models and modest budgets.
Main Achievement: Showing that carefully designed test-time reinforcement learning, which prioritizes maxima (the entropic objective) and long-horizon refinement (PUCT reuse), can reliably discover new state-of-the-art solutions on real scientific and engineering benchmarks.
Future Directions: Extend to sparse/binary rewards and non-verifiable domains; learn reuse strategies; combine with formal verification; develop multi-instance curricula; and explore automated metric design that better correlates with downstream impact.
Why Remember This: It reframes test-time compute from "just sample more" to "actually learn while you try," aligning the objective with discovery's true goal: one brilliant solution. By making the model a fast learner on the very problem we care about, TTT-Discover opens a practical path for AI to push frontiers in science and engineering, not someday, but during the run itself.
Practical Applications
- Optimize GPU kernels for production AI workloads by running TTT-Discover on your exact hardware and input shapes.
- Hunt for improved mathematical constructions or bounds by defining a verifiable reward and letting the system refine step functions or certificates.
- Build winning-level contest algorithms (e.g., routing, packing, scheduling) by training directly on the official generators and local validators.
- Tune domain-specific code (e.g., data processing pipelines) by setting performance/correctness as the reward and iterating at test time.
- Refine scientific analysis scripts (e.g., single-cell denoising) to the dataset at hand, with metrics that reflect your downstream goals.
- Create faster inference components by fusing operations and precision-tuning under correctness constraints learned during TTT.
- Adapt solvers to new hardware (CPU/GPU/accelerators) by timing locally and letting TTT-Discover specialize kernels and parameters.
- Develop task-specific heuristics automatically by rewarding end-to-end objective gains rather than hand-crafting rules.
- Prototype new algorithmic ideas safely by starting from strong baselines and letting PUCT-guided reuse explore and refine.
- Set up internal leaderboards and continuous evaluation so TTT-Discover can keep pushing your system's single best result over time.