CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation
Key Summary
- CoDiQ is a recipe for making hard-but-solvable math and coding questions on purpose, and it controls how hard they get while you generate them.
- Its key insight is test-time scaling: letting the model use more thinking tokens tends to raise difficulty but can lower solvability, so CoDiQ balances both with smart checks.
- Six Difficulty-Enhancement Strategies help models add real, deep challenge instead of fake trickiness.
- An iterative CoDiQ Pipeline upgrades a seed question round by round and stops the moment difficulty backslides or solvability breaks.
- Two difficulty meters, LLMs-Ranking and a Value Network, estimate how hard a question is, while a stronger verifier checks whether it is actually solvable.
- A specialized CoDiQ-Generator is trained with reinforcement learning to push difficulty higher without breaking validity.
- The team built CoDiQ-Corpus, 44K competition-grade question sequences with over 82% solvability in human checks.
- Training reasoning models on CoDiQ-Corpus improves scores on hard benchmarks like AIME and MATH.
- Difficulty scales with token use: more reasoning tokens generally mean harder questions, but CoDiQ prevents crashes in solvability.
- All code, models, and data are open-sourced to help others build stronger reasoning systems.
Why This Research Matters
CoDiQ turns the art of writing hard questions into a controllable science, letting us mass-produce trustworthy challenges that steadily train better reasoning models. This helps build AI tutors that can match a student's level today and gently raise the bar tomorrow. It strengthens coding assistants by exposing them to realistic, competition-grade tasks, improving reliability. Scientists get models trained on deep, multi-step thinking, helpful for planning experiments or debugging proofs. Because solvability is enforced, we avoid "fake hard" data that can poison model training. Open-sourcing the pipeline, generator, and corpus lets the community scale this approach to more fields.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your math teacher keeps a special folder of puzzles that get trickier every week, pushing your brain just the right amount. That folder doesn't fill itself; it takes skill to write great hard questions that you can still solve.
🥬 The Concept (Large Reasoning Models): What it is: Large Reasoning Models (LRMs) are AI systems trained to solve multi-step problems in math and code. How it works: They read a question, think in steps (like notes to themselves), and produce a final answer. Why it matters: Without tough practice problems, LRMs stop improving, just like athletes without better drills. 🍞 Anchor: When an LRM learns on Olympiad-style problems, it gets better at long, careful thinking than when it only sees simple worksheets.
🍞 Hook: You know how you can take a simple riddle and make it tougher by adding rules? AI can do that too, if it knows what makes a problem truly hard.
🥬 The Concept (Question Generation): What it is: Question generation is when a model creates new problems to solve. How it works: Start from a seed problem, then rewrite it to add constraints, change numbers, or mix ideas to raise difficulty. Why it matters: If models can make their own hard practice, they can level up faster without needing endless human-written problems. 🍞 Anchor: From "add two numbers" to "add many numbers that follow a rule and fit a time limit": same idea, harder rules.
🍞 Hook: Think of a video game that lets you set the difficulty from Easy to Nightmare. For AI problem-making, that slider used to be missing.
🥬 The Concept (Difficulty and Solvability): What it is: Difficulty is how hard a problem is; solvability is whether it can be solved with clear rules and a correct answer. How it works: We measure difficulty relatively (this one harder than that one) and check solvability with a separate, careful model. Why it matters: Hard without solvable becomes "fake hard" (confusing or impossible), and that hurts learning. 🍞 Anchor: A puzzle missing a key piece isn't hard; it's broken.
🍞 Hook: If you try to write a riddle much smarter than you understand, you might create nonsense. Models have this issue too.
🥬 The Concept (Generator Capacity Ceiling): What it is: A model usually can't generate questions much harder than it can solve consistently. How it works: As the model raises complexity, it risks contradictions, missing info, or tasks it can't verify. Why it matters: Without respecting this ceiling, you get lots of invalid problems and wasted compute. 🍞 Anchor: A third-grader trying to invent calculus problems might write something that sounds fancy but makes no sense.
🍞 Hook: Pile on too many rules and even a fair puzzle turns into a trap.
🥬 The Concept (Solvability-Complexity Trade-off): What it is: Pushing complexity makes solvability harder to keep. How it works: Each extra constraint narrows valid answers and can explode computation. Why it matters: You must add "smart difficulty" that grows thinking depth without breaking logic or feasibility. 🍞 Anchor: Making a maze bigger is fine; sealing every path by mistake makes it unsolvable.
🍞 Hook: Saying "make it harder" is like shouting "run faster!" without a stopwatch; you can't control what you can't measure.
🥬 The Concept (Difficulty Control): What it is: A way to tune how hard generated questions get, on purpose. How it works: Use relative ranking, a learning-based difficulty scorer, and strict stopping rules to guide difficulty up step by step. Why it matters: Precise control lets us build reliable curricula and steadily train better reasoners. 🍞 Anchor: Like belt levels in martial arts: clear tiers, tested, and earned in order.
The world before: Researchers used prompt tricks, agent pipelines, and heavy filtering to make hard problems. It worked but was brittle, expensive, and slow. "Hardness" was vague; models often produced "fake hard" questions that broke under scrutiny. Some teams trained special generators, but control over exact difficulty still felt fuzzy, and validity depended on lots of post-hoc filtering.
The problem: We need a scalable way to generate truly hard, competition-grade problems that remain solvable and can be dialed up or down with precision. Three blockers stood in the way: the generator capacity ceiling, the solvability-complexity trade-off, and the missing difficulty dial.
Failed attempts: Simple prompting ("make it harder") led to shallow tweaks. Pure adversarial generation often produced puzzles that tricked models but lacked clean solutions. Big agent workflows required many steps and still let broken items slip through, forcing expensive human checks.
The gap: A method to scale difficulty at inference time, while continuously verifying solvability and measuring difficulty on a smooth scale, was missing.
Real stakes: Better hard problems mean better-trained models for math and code, powering safer software tools, stronger STEM tutoring, and scientific discovery support. If we can mass-produce trustworthy, graded challenges, we can grow reasoning ability like a well-run school: stepwise, measurable, and reliable.
02 Core Idea
🍞 Hook: Picture a treadmill that automatically speeds up when your running gets steady, but also checks your heartbeat so you don't get hurt.
🥬 The Concept (Test-Time Scaling): What it is: Let the model think longer and in more steps during generation to raise question difficulty, but watch solvability so it doesn't break. How it works: Increase reasoning token budgets, apply targeted difficulty strategies, and verify each upgraded question; stop if difficulty drops or solvability fails. Why it matters: This gives us a live difficulty dial we can turn safely, round by round. 🍞 Anchor: A puzzle factory where each puzzle gets one notch harder, as long as it still has a real solution.
Aha! Moment in one sentence: If we scale a model's reasoning at test time and pair it with precise difficulty meters and a strict solvability gate, we can reliably grow hard, competition-grade questions on demand.
Three analogies:
- Staircase: Each iteration is one safe step up: harder, but with a sturdy handrail (the verifier).
- Chef's tasting: Add spice little by little, taste (difficulty meters), serve only if delicious and edible (solvable).
- Game level builder: Auto-generates levels that get trickier while ensuring each is beatable and not a dead end.
Before vs After:
- Before: "Make it harder" was guesswork; quality swung wildly, and lots of broken puzzles appeared.
- After: Difficulty grows in controlled steps, with continuous difficulty scoring and solvability checks. We get sequences of problems from easy to hard, ready for curriculum training.
Why it works (intuition):
- More thinking tokens let models weave deeper constraints, but that risks breaking validity. CoDiQ adds two guardrails: dual difficulty meters that prefer truly tougher logic (not surface tweaks) and a strong solvability verifier that rejects broken upgrades. By stopping the very moment the curve turns down (difficulty regression) or logic snaps (unsolvable), we capture clean, monotonic difficulty trajectories.
🍞 Hook: Like coaches who know the exact drills to boost challenge, not just "try harder."
🥬 The Concept (Difficulty-Enhancement Strategies): What it is: Six playbooks that add deep, algorithmic difficulty instead of cosmetic tricks. How it works: Inject challenge via Dimensionality & Constraints, Mathematical Abstraction, Inverse & Constructive, State Explosion, Theorem Disguise, and Edge Case & Rigor Engineering. Why it matters: They push models to create puzzles that require real multi-step reasoning. 🍞 Anchor: Turning "count odd sums" into "count subsequences with parity, length, and multiple mod rules", a version that needs careful DP logic.
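To make the six playbooks concrete, here is a minimal sketch of how a generator might be asked to apply one strategy to a seed question. The strategy names come from the paper; the descriptions, the prompt wording, and the `build_upgrade_prompt` helper are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: strategy names are from CoDiQ; descriptions and prompt are assumptions.
DIFFICULTY_STRATEGIES = {
    "dimensionality_constraints": "Add orthogonal constraints or extra dimensions to the task.",
    "mathematical_abstraction": "Replace concrete quantities with abstract structures or parameters.",
    "inverse_constructive": "Ask for an inverse or constructive version of the original question.",
    "state_explosion": "Add interacting rules so the solution's state space grows sharply.",
    "theorem_disguise": "Hide a known theorem inside an unfamiliar story or setting.",
    "edge_case_rigor": "Demand rigorous handling of boundary conditions and edge cases.",
}

def build_upgrade_prompt(seed_question: str, strategy: str) -> str:
    """Compose a (hypothetical) instruction asking the generator to harden a seed question."""
    return (
        "Rewrite the following problem so it is strictly harder but still well-posed and solvable.\n"
        f"Apply this strategy: {DIFFICULTY_STRATEGIES[strategy]}\n\n"
        f"Problem:\n{seed_question}\n"
    )
```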
🍞 Hook: Imagine upgrading a puzzle in rounds, each time checked by judges for fairness and difficulty.
🥬 The Concept (CoDiQ Pipeline): What it is: An iterative loop that upgrades a seed question through up to eight rounds. How it works: Propose a harder version → estimate difficulty (LLMs-Ranking + Value Network) → verify solvability → keep if both pass; else stop. Why it matters: It captures clean sequences from easy to hard with no backsliding or broken steps. 🍞 Anchor: A ladder of 3–8 rungs where each rung is harder and still climbable.
🍞 Hook: When picking the taller hill to climb, it helps to compare hills side by side and use a map of slopes.
🥬 The Concept (Difficulty Estimation): What it is: Tools to judge which question is harder. How it works: LLMs-Ranking sorts batches by perceived difficulty; a Value Network scores difficulty from early hidden states; scores are normalized to [0,1]. Why it matters: Reliable meters let us enforce monotonic difficulty and allocate more compute to harder items. 🍞 Anchor: Two judges, one ranking and one scoring, agree that Puzzle C > B > A.
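As a rough illustration of the normalization idea (the exact mapping is not spelled out here, so the linear formula below is an assumption), ranked positions can be mapped to 0-1 difficulty scores, and the Value Network's predicted success probability can be inverted into a second score:

```python
def normalize_rank_scores(ranked_ids):
    """Map a hardest-last ranking to [0, 1] scores (linear mapping is an assumption)."""
    n = len(ranked_ids)
    return {qid: i / max(n - 1, 1) for i, qid in enumerate(ranked_ids)}

def value_network_difficulty(correct_prob):
    """Low predicted correctness probability -> high difficulty score."""
    return 1.0 - correct_prob

# Example: three variants ranked easiest-to-hardest, plus assumed VN success probabilities.
dr_llm = normalize_rank_scores(["A", "B", "C"])   # {"A": 0.0, "B": 0.5, "C": 1.0}
dr_vn = {q: value_network_difficulty(p) for q, p in {"A": 0.9, "B": 0.6, "C": 0.2}.items()}
```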
🍞 Hook: Think of a referee who ensures every match can actually be played to a fair finish.
🥬 The Concept (Solvability Verification): What it is: A strong model (Qwen3-32B) double-checks that the upgraded question is well-posed and answerable. How it works: It outputs solvable/unsolvable with confidence; only high-confidence solvable items pass. Why it matters: Stops "fake hard" puzzles that are contradictory, under-specified, or computationally infeasible. 🍞 Anchor: If Round 3 creates a nearly empty solution space or an impossible runtime, the verifier rejects it.
🍞 Hook: You learn fastest when a coach rewards the exact thing you're trying to improve.
🥬 The Concept (CoDiQ-Generator via Reinforcement Learning): What it is: A tuned generator (from Qwen3-8B) trained to push difficulty upward while staying valid. How it works: RL rewards upgrades that increase difficulty and stay solvable, and penalizes regressions, repeats, and invalid items; it focuses training right at the model's breaking points. Why it matters: It raises the ceiling, so the pipeline can climb more rungs before stopping. 🍞 Anchor: After RL, an 8B model can outpace a 32B baseline in maximal solvable difficulty within the pipeline.
🍞 Hook: A great school keeps graded textbooks from Level 1 to Level 10.
🥬 The Concept (CoDiQ-Corpus): What it is: A 44K-sequence library of math/code problems that get harder step by step. How it works: Generated by the pipeline and the RL-tuned generator; verified and difficulty-scored; shows over 82% solvability in human checks. Why it matters: It's ready-made curriculum fuel that measurably boosts reasoning models. 🍞 Anchor: Models trained on CoDiQ-Corpus improved on AIME and MATH without extra human-written data.
03 Methodology
High-level recipe: Seed Question → Upgrade with Difficulty Strategies → Difficulty Estimation (LLMs-Ranking + Value Network) → Solvability Verification → Keep/Stop → Next Round → Output a difficulty-graded sequence.
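The loop below is a minimal sketch of that recipe under stated assumptions: `propose_harder`, `estimate_difficulty`, and `verify_solvable` are placeholder callables standing in for the generator, the two difficulty meters, and the stronger verifier model; the eight-round cap and the stop-on-regression / stop-on-unsolvable rules follow the description in this section, while the confidence threshold is illustrative.

```python
def codiq_pipeline(seed, propose_harder, estimate_difficulty, verify_solvable, max_rounds=8):
    """Sketch of the iterative upgrade loop: keep a round only if it is strictly harder
    and still verified solvable; stop at the first regression or unsolvable upgrade."""
    sequence = [seed]
    difficulty = estimate_difficulty(seed)
    for round_idx in range(1, max_rounds + 1):
        candidate = propose_harder(sequence[-1], round_idx)
        new_difficulty = estimate_difficulty(candidate)
        if new_difficulty <= difficulty:        # difficulty regressed -> stop
            break
        solvable, confidence = verify_solvable(candidate)
        if not solvable or confidence < 0.9:    # threshold is an illustrative assumption
            break
        sequence.append(candidate)              # accept this rung of the ladder
        difficulty = new_difficulty
    return sequence                             # monotonically harder, all verified solvable
```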
Step 1: Start with a seed and plan the upgrade
- What happens: Pick a simple, solvable problem from datasets like GSM8K or CodeAlpaca and decide which strategy (e.g., State Explosion or Mathematical Abstraction) best raises deep difficulty.
- Why this step exists: Random changes often create fake hard or broken puzzles; strategies ensure difficulty comes from real reasoning depth.
- Example: From "count subsequences with odd sum" to "count subsequences with odd sum AND even length AND sum mod 3 = 1."
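For reference, the seed problem in this example is easy: counting subsequences with an odd sum needs only a two-state parity DP. The snippet below is an illustrative baseline written for this article (not taken from the paper), so the upgrades in the next steps have something concrete to compare against.

```python
def count_odd_sum_subsequences(nums):
    """Count (possibly non-contiguous) subsequences whose element sum is odd.
    Only two DP states are needed: counts of subsequences with even vs. odd sum."""
    even, odd = 1, 0            # the empty subsequence has even sum 0
    for x in nums:
        if x % 2 == 1:          # taking an odd element flips parity
            even, odd = even + odd, odd + even
        else:                   # taking an even element keeps parity
            even, odd = even * 2, odd * 2
    return odd                  # the empty subsequence never lands in the odd count
```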
Step 2: Propose a harder variant (Round i)
- What happens: The generator uses the chosen strategies to add orthogonal constraints (e.g., parity + modulus + length rules), disguises a theorem, or demands a constructive inverse.
- Why it matters: True complexity comes from interacting constraints that change the algorithmic core (e.g., from O(n) parity to multi-state DP or CRT-based counting).
- Example: Add sum mod 5 = 2 and length mod 4 = 2; now the state grows from 2 to 120, requiring careful DP.
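To see why the state count in this example lands at 120, here is an illustrative DP (written for this article, not taken from the paper) for the fully upgraded question: subsequences whose sum is odd, sum mod 3 = 1, sum mod 5 = 2, and whose length mod 4 = 2. Tracking (sum mod 30, length mod 4) gives 30 × 4 = 120 states.

```python
def count_constrained_subsequences(nums):
    """DP over (sum mod 30, length mod 4): 30 * 4 = 120 states, versus 2 for the seed problem."""
    MOD_SUM, MOD_LEN = 30, 4                      # lcm(2, 3, 5) = 30 covers all sum constraints
    dp = [[0] * MOD_LEN for _ in range(MOD_SUM)]
    dp[0][0] = 1                                  # the empty subsequence
    for x in nums:
        new = [row[:] for row in dp]              # "skip x" keeps all existing counts
        for s in range(MOD_SUM):
            for l in range(MOD_LEN):
                if dp[s][l]:
                    new[(s + x) % MOD_SUM][(l + 1) % MOD_LEN] += dp[s][l]  # "take x"
        dp = new
    # The sum must be odd, = 1 (mod 3), and = 2 (mod 5); find that residue mod 30 (it is 7).
    target = next(s for s in range(MOD_SUM) if s % 2 == 1 and s % 3 == 1 and s % 5 == 2)
    return dp[target][2]                          # length mod 4 == 2
```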
Step 3: Estimate difficulty
- What happens: Two meters evaluate hardness: • LLMs-Ranking: an expert LLM (Doubao-Seed-1.8) sorts a batch by which feels harder under a structured rubric. • Value Network: reads early hidden states of Qwen3-8B's reasoning to predict correctness probability; low probability implies high difficulty. • Normalization: map grouped ranks to a 0–1 score to smooth granularity.
- Why this step exists: You need a stable dial to ensure each round is not easier than the last and to guide compute allocation.
- Example: A new variant gets a higher DR-LLM and lower VN probability (i.e., harder), so it passes the monotonicity check.
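A tiny sketch of the monotonicity check this example describes. The "both meters must agree" rule and the function name are assumptions drawn from the example above, not a stated specification:

```python
def passes_monotonicity(prev_dr_llm, new_dr_llm, prev_vn_prob, new_vn_prob):
    """Accept a new variant only if both meters point the same way: a higher LLM-ranked
    difficulty score AND a lower Value-Network predicted correctness probability."""
    return new_dr_llm > prev_dr_llm and new_vn_prob < prev_vn_prob
```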
Step 4: Verify solvability
- What happens: A stronger model (Qwen3-32B) inspects the question for clarity, completeness, consistency, and feasible solution complexity, returning solvable/unsolvable with confidence.
- Why this step exists: Difficulty meters can't guarantee fairness; the verifier prevents contradictions, missing info, or near-zero solution density.
- Example: If constraints push a CRT state to 2,310 × 8 states and the expected valid count is ~0 for typical n, the verifier flags it as unsolvable and stops the run.
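Here is one way such a verification call could look. Everything below is a hypothetical sketch: the prompt wording, the SOLVABLE/UNSOLVABLE output format, the confidence threshold, and the `ask_verifier` callable (which would wrap whatever serving stack hosts the verifier model) are assumptions, not the paper's actual interface.

```python
def verify_solvability(question: str, ask_verifier, conf_threshold: float = 0.9):
    """Ask a stronger model whether a generated question is well-posed and feasibly solvable.
    `ask_verifier` is a caller-supplied function: prompt string in, response string out."""
    prompt = (
        "Judge whether the following problem is well-posed: clear, complete, internally "
        "consistent, and solvable within reasonable computational limits.\n"
        "Answer on one line as 'SOLVABLE <confidence 0-1>' or 'UNSOLVABLE <confidence 0-1>'.\n\n"
        f"Problem:\n{question}\n"
    )
    reply = ask_verifier(prompt).strip().split()          # assumes the verifier follows the format
    verdict, confidence = reply[0].upper(), float(reply[1])
    return verdict == "SOLVABLE" and confidence >= conf_threshold, confidence
```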
Step 5: Enforce stopping rules
- What happens: If difficulty regresses or the verifier says unsolvable, discard that round and stop. Otherwise, keep the question and iterate (up to 8 rounds).
- Why this step exists: Guarantees strictly increasing difficulty and a fully solvable trajectory.
- Example: Rounds 1–2 pass; Round 3 fails solvability; output the sequence from Rounds 0–2.
Step 6: Train a better generator with RL (CoDiQ-Gen-8B)
- What happens: Collect breaking points (where the base model failed) and use RL with a reward that balances solvability confidence and positive difficulty delta. Optimize with GRPO in VeRL.
- Why this step exists: It teaches the generator to push right at the edge without slipping into invalidity, lifting the difficulty ceiling.
- Example: After RL, more runs reach Rounds 5–6 before failing, giving longer, harder sequences.
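The reward shape described in this step might look roughly like the sketch below. The specific penalties, weights, and threshold are illustrative assumptions; the text only says the reward balances solvability confidence with a positive difficulty delta and penalizes regressions, repeats, and invalid questions (the GRPO optimization in VeRL is not shown).

```python
def upgrade_reward(prev_difficulty, new_difficulty, solvable_conf,
                   is_duplicate, conf_threshold=0.9):
    """Illustrative RL reward for one proposed upgrade (all numbers are assumptions)."""
    if is_duplicate:
        return -1.0                          # penalize repeating an earlier question
    if solvable_conf < conf_threshold:
        return -1.0                          # penalize invalid / unsolvable upgrades
    delta = new_difficulty - prev_difficulty
    if delta <= 0:
        return -0.5                          # penalize difficulty regression
    return solvable_conf * delta             # reward solvable, strictly harder upgrades
```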
Step 7: Build CoDiQ-Corpus and use it for curriculum
- What happens: Aggregate all valid sequences into a 44K-sequence library stratified by token budgets and difficulty levels. Use staged training (L1→L2→L3) where models practice from earlier to later rounds.
- Why this step exists: Progressively structured practice is a proven learning pattern that boosts generalization on tough benchmarks.
- Example: A 4B model trained through L1–L3 surpasses a standard RL baseline on MATH-500 and AIME 2024.
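One simple way to realize such a staged curriculum is to bucket each sequence's rounds into early, middle, and late stages and train on them in order. The three-way split by round index below is an assumption for illustration; the corpus itself is stratified by token budget and difficulty level.

```python
def curriculum_stages(sequences, num_stages=3):
    """Split each difficulty-graded sequence into staged buckets (L1 -> L2 -> L3).
    `sequences` is a list of question lists, each ordered easy-to-hard."""
    stages = [[] for _ in range(num_stages)]
    for seq in sequences:
        for i, question in enumerate(seq):
            stage = min(i * num_stages // max(len(seq), 1), num_stages - 1)
            stages[stage].append(question)
    return stages  # train on stages[0] first, then stages[1], then stages[2]
```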
Secret sauce (why CoDiQ is clever):
- It doesn't just filter; it grows difficulty at inference with a live hardness dial and a safety net.
- Dual difficulty meters catch both perceived and representation-level difficulty.
- A strong verifier enforces fairness and feasibility.
- RL alignment moves the model's true edge outward, so the pipeline climbs higher.
Mini Sandwiches for key components:
- 🍞 Hook: Like two judges, one by feel and one by stats. 🥬 The Concept (LLMs-Ranking + Value Network): What it is: A pair of difficulty estimators that agree on which is harder. How it works: Rank by expert LLM; score by hidden-state predictor; combine via normalization. Why it matters: Prevents shallow or biased hardness calls. 🍞 Anchor: Both say Version C is hardest, so it gets more compute.
- 🍞 Hook: A referee ensuring every game can finish. 🥬 The Concept (Solvability Verification): What it is: Automatic well-posedness checking with confidence. How it works: A larger model inspects logic, missing info, and computational feasibility. Why it matters: Blocks fake hard traps. 🍞 Anchor: Over-constrained CRT case rejected.
- 🍞 Hook: Practice right at your limit. 🥬 The Concept (RL-Tuned Generator): What it is: A model rewarded for safe difficulty jumps. How it works: Rewards positive difficulty gain and solvability; penalizes regressions, repeats, and broken items. Why it matters: Raises the pipeline's maximum rung. 🍞 Anchor: An 8B model beats a larger baseline under pipeline rules.
04 Experiments & Results
The test: The authors built CoDiQ-Bench (200 seeds; 100 math, 100 code) to see which generators can climb to the highest solvable difficulty under controlled rounds and token budgets. They measured two difficulty scores, DR-LLM (LLM-based ranking) and DR-VN (Value Network score), and tracked solvability. They also tested whether more reasoning tokens correlate with harder questions.
The competition: Baselines included GLM-4.6 (flagship), GPT-OSS-20B, and the Qwen3 series (0.6B–32B). All ran inside the same CoDiQ Pipeline so the comparison is fair. The authors also compared "Direct Prompt" versus their "CoDiQ Prompt," and then introduced their RL-tuned CoDiQ-Gen-8B.
The scoreboard (with context):
- Prompting helps: Swapping in the CoDiQ Prompt increased reasoning token use and pushed difficulty up across most models, like giving better climbing shoes.
- CoDiQ-Gen-8B shines: Despite being smaller, the RL-tuned 8B generator achieved higher maximal solvable difficulty than the larger Qwen3-32B in pipeline terms. That's like a well-trained runner outpacing a taller athlete on a measured course.
- Token-difficulty link is strong: Pearson correlations between tokens used and difficulty were r ≈ 0.83–0.85 (p < 0.001). Think: more thinking tokens mean reliably harder questions, as long as the evaluator and model aren't saturated. (A minimal way to compute this correlation is sketched after this list.)
- Upper bound without verifier: Removing solvability checks raises the theoretical difficulty ceiling; with the CoDiQ Prompt, even smaller models synthesized extremely complex logic. But many of those instances weren't guaranteed solvable, which shows why the verifier matters.
- Budgeted compute: Under strict token budgets (8k/16k/32k), CoDiQ-Gen-8B consistently yielded higher difficulty than baseline models, meaning it spends compute more wisely.
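As referenced above, the reported token-difficulty correlation is a standard Pearson r. For readers who want to run the same kind of check on their own generation logs, here is a minimal sketch; the two arrays are made-up placeholders, and only the statistical call itself is standard:

```python
from scipy.stats import pearsonr

# Placeholder data: reasoning tokens spent per run and the resulting difficulty scores.
reasoning_tokens = [2_100, 4_800, 7_900, 12_500, 16_300, 24_000, 31_500]
difficulty_scores = [0.21, 0.34, 0.45, 0.58, 0.66, 0.79, 0.88]

r, p_value = pearsonr(reasoning_tokens, difficulty_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")  # the paper reports r around 0.83-0.85, p < 0.001
```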
Surprising findings:
- A tuned 8B can beat a larger 32B on the pipeline's objective (hard-and-solvable), proving alignment and strategies matter as much as raw size.
- The solvability vs. difficulty trade-off is real: push tokens high without guardrails and solvability can collapse, especially in smaller models. CoDiQ-Gen-8B decouples difficulty gains from this collapse via RL.
- A human study (N=200) found 82% precision on accepted items and 90% negative predictive value (NPV) on rejects; some false negatives were valid but just too hard for the verifier, which is evidence of the Verifier Paradox.
Corpus results:
- CoDiQ-Corpus difficulty beat well-known datasets (AIME, NuminaMath-1.5, LiveCodeBench, Code-Contests), with average DR-LLM ≈ 91.4% and DR-VN ≈ 82.8%.
- Training on CoDiQ-Corpus in a 3-stage curriculum improved downstream performance: a 4B model series (CoDiQ-L1/L2/L3-4B) surpassed a standard RL baseline on MATH-500 and AIME 2024 (e.g., AIME rose to 70.6%).
Takeaway: The pipeline's combination of test-time scaling, dual difficulty meters, and strict solvability checks produces harder, trustworthy problems that actually boost reasoning models when used for curriculum learning.
05 Discussion & Limitations
Limitations:
- English-only and math/code focus: Other languages and domains (like physics word problems with diagrams) need adaptation of verification and strategies.
- Verifier Paradox: A fixed-capacity verifier can reject truly valid but extremely hard items. This creates an "epistemic ceiling" on difficulty.
- Compute cost: Dual difficulty checks plus verification increase tokens and latency; tight real-time settings may struggle.
- Metric dependence: Relative difficulty relies on robust rankers and the Value Network staying calibrated; domain drift may require re-tuning.
Required resources:
- A strong base LRM (8B–32B) with long-CoT support; a larger verifier model; RL infrastructure (e.g., VeRL + GRPO); and GPU time for multi-round generation and checks.
When NOT to use:
- Ultra low-latency environments where verification overhead is unacceptable.
- Domains lacking reliable automatic verifiers (e.g., open-ended creative writing) where solvability cannot be formalized.
- Tiny models without enough reasoning tokens; they may overfit to shallow tricks or collapse solvability.
Open questions:
- Adaptive verifier scaling: Can the verifier's own compute scale with difficulty to reduce false negatives while staying efficient?
- Cross-domain generalization: How to extend to science, logic puzzles with diagrams, or multi-modal tasks?
- Absolute difficulty: Can we turn relative ranks into robust, domain-agnostic absolute scores?
- Human-in-the-loop minimality: What's the lightest-touch human validation that lifts ceilings without sacrificing scale?
- Safety and bias: How to detect and avoid hidden biases or pathological edge cases when difficulty increases in unfamiliar domains?
06 Conclusion & Future Work
Three-sentence summary: CoDiQ is a test-time scaling framework that upgrades seed questions into harder, still-solvable versions by pairing difficulty strategies with dual difficulty meters and a strict verifier. An RL-tuned generator (CoDiQ-Gen-8B) raises the ceiling of safe difficulty growth, enabling longer, cleaner difficulty sequences. The resulting 44K-item CoDiQ-Corpus measurably improves reasoning models through curriculum-style training.
Main achievement: Turning "make it harder" into a precise, scalable, and verifiable process in which difficulty climbs step by step without breaking solvability, then proving it boosts downstream reasoning.
Future directions: Scale to more domains and languages; develop adaptive, stronger verifiers to reduce the Verifier Paradox; refine absolute difficulty measures; and explore multi-modal hard-problem generation. Integrating dynamic token budgets and self-evolving curricula could further accelerate learning.
Why remember this: CoDiQ shows that difficulty itself can be engineered at inference time with safety rails and used as fuel for better thinkers. It converts a vague art, writing great hard problems, into a controllable science that builds stronger reasoning models in the wild.
Practical Applications
- Adaptive AI tutoring that assigns the next-harder-but-solvable math problem based on a learner's progress.
- Coding interview prep platforms that auto-generate verified, level-graded algorithm challenges.
- Curriculum training for new reasoning models using ready-made difficulty ladders from CoDiQ-Corpus.
- Benchmark construction where difficulty can be dialed to test models at specific levels without hand-crafting.
- Data augmentation for RLHF pipelines using verified hard questions to boost long-chain reasoning.
- Safety evaluations by generating edge-case, solvable scenarios that probe model robustness.
- Automated competition practice sets (AIME-style or code contests) refreshed regularly with guaranteed validity.
- Teacher tools to transform basic textbook problems into multi-step, competition-grade versions.
- Model debugging by pushing generators to their failure boundary and analyzing where solvability breaks.
- Progress tracking dashboards that correlate token budgets with achieved difficulty for compute planning.