Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Intermediate
Ximing Lu, David Acuna, Jaehun Jung et al. · 1/30/2026
arXiv · PDF

Key Summary

  • Golden Goose turns messy internet text into clean multiple-choice puzzles that computers can learn from and get automatic rewards for.
  • It masks a small but important stretch of reasoning in a passage and asks the model to pick the right missing piece from several look‑alike choices.
  • This trick makes unverifiable sources like textbooks, forums, and web pages usable for Reinforcement Learning with Verifiable Rewards (RLVR).
  • The authors built GooseReason-0.7M, over 700,000 such tasks across math, coding, and science, to keep training from getting stuck.
  • On strong 1.5B and 4B models that had stopped improving, adding GooseReason led to steady gains and new state-of-the-art results on 15 benchmarks.
  • A 9-option MCQ format hits a 'just right' difficulty that creates the best learning signal, avoiding too-easy elimination or too-hard open-ended guessing.
  • Golden Goose also worked in cybersecurity by mining raw web scrapes, beating a much larger 8B domain-specialized model after only 100 RL steps.
  • The approach is simple, cheap to scale, and plugs into any RL recipe because correctness is checked by matching a choice, not by running tests.
  • It highlights a data-centric path to scale reasoning: reuse abundant but unverifiable text instead of handcrafting tiny verified datasets.

Why This Research Matters

Golden Goose unlocks the huge pile of reasoning text on the internet that was previously unusable for RL because it lacked easy checkers. By turning that text into auto-graded MCQs, models can keep learning instead of stalling when verified datasets run out. This means better math help, clearer science explanations, stronger coding assistance, and smarter security reasoning—without hiring armies of human graders. It also works fast: even small models got big gains with modest compute. Because the approach is simple and format-agnostic, it can plug into existing RL training recipes and power progress across many domains. Over time, this could make helpful AI tutors and assistants more accurate and widely accessible.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you're studying for a quiz, but your teacher only has a tiny stack of answer-checked questions. You practice them so much that you stop learning new things.

🥬 The Concept: Reinforcement Learning (RL) is a way for models to learn by trying things and getting rewards when they do well. How it works:

  1. The model reads a task and produces an answer.
  2. A rule gives a reward if the answer is correct.
  3. The model updates itself to get more rewards next time.

Why it matters: Without clear rewards, the model can't tell what's right or wrong, and learning stalls.

🍞 Anchor: Like a video game character getting coins for completing levels, the model gets "coins" for correct answers and learns which moves to repeat.

🍞 Hook: You know how a math worksheet with an answer key lets you check your work quickly?

🥬 The Concept: A Verifiable Reward is a reward you can check automatically with a simple, reliable rule. How it works:

  1. Define a rule (e.g., does 2+2=4?).
  2. Compare the model's answer to ground truth.
  3. Give a 1 (correct) or 0 (incorrect) reward.

Why it matters: If checking needs a human or a giant debate, you can't scale RL easily (a minimal checker sketch follows below).

🍞 Anchor: A spelling bee judge checks if "elephant" matches the card. Quick yes/no makes judging thousands of answers easy.
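
To make the yes/no reward concrete, here is a minimal Python sketch of an exact-match checker; the normalization and function name are illustrative assumptions, not code from the paper.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> int:
    """Return 1 if the answer matches the ground truth after light normalization, else 0."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return int(normalize(model_answer) == normalize(ground_truth))


# Example: grading a simple arithmetic answer.
print(verifiable_reward(" 4 ", "4"))  # 1 (correct)
print(verifiable_reward("5", "4"))    # 0 (incorrect)
```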

🍞 Hook: Think of a teacher giving stars only when they can instantly check the answer.

🥬 The Concept: Reinforcement Learning with Verifiable Rewards (RLVR) is RL where every task has an auto-checkable answer. How it works:

  1. Pick tasks with auto-check (math with solvers, code with unit tests).
  2. Let the model try multiple times (rollouts).
  3. Reward correct ones to shape better reasoning.

Why it matters: RLVR unlocked big leaps in reasoning but depends on having many auto-checkable tasks (see the rollout sketch below).

🍞 Anchor: Coding tasks with tests are great for RLVR because passing tests instantly proves correctness.
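
Here is a rough Python sketch of that loop: sample several rollouts for a task and score each with a binary check. The sample_answer function is a hypothetical stand-in for the model's generation call, not a real API.

```python
import random


def sample_answer(task: dict) -> str:
    """Hypothetical stand-in for one model rollout: returns a candidate answer."""
    return random.choice(task["candidate_answers"])


def rlvr_rollouts(task: dict, num_rollouts: int = 8) -> list[int]:
    """Generate several attempts for one task and auto-check each one."""
    rewards = []
    for _ in range(num_rollouts):
        answer = sample_answer(task)
        rewards.append(int(answer == task["ground_truth"]))
    return rewards


task = {"prompt": "2 + 2 = ?", "candidate_answers": ["3", "4", "5"], "ground_truth": "4"}
print(rlvr_rollouts(task))  # e.g. [0, 1, 0, 1, 1, 0, 0, 1]
```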

🍞 Hook: Have you ever practiced the same flashcards so often you could answer them without thinking—and then stopped improving?

🥬 The Concept: Data Saturation is when training keeps going but learning plateaus because tasks stop providing new signals. How it works:

  1. The model repeatedly sees the same limited tasks.
  2. It either always gets them right or always wrong.
  3. No mix of success/failure means no useful feedback to improve.

Why it matters: RL needs tasks where the model sometimes wins and sometimes loses to learn what changes help.

🍞 Anchor: If you always ace the same 10 questions, you don’t learn new skills.

🍞 Hook: The internet is full of explanations, but many don’t come with an easy answer key.

🥬 The Concept: Unverifiable Internet Text is content (like forum posts or textbook paragraphs) that’s rich in reasoning but lacks a simple automatic way to check correctness. How it works:

  1. It describes multi-step logic (proofs, chemistry mechanisms, coding fixes).
  2. But there isn’t a one-line answer or runnable test.
  3. That makes it hard to use for RLVR.

Why it matters: Tons of great learning material is left unused because it can’t be auto-checked as-is.

🍞 Anchor: A long biology explanation may be super helpful, but there’s no quick yes/no checker for it.

The World Before: RLVR pushed reasoning forward by using data with automatic checkers (math verifiers, code unit tests). Researchers scaled compute and training length, but gains stalled because verified data was limited and narrow (mostly math and code). Open-ended domains (proofs, medicine, economics) were often ignored because they weren’t machine-checkable.

Failed Attempts: 1) Collecting more human-authored, verified problems is expensive and slow. 2) Handcrafted procedural environments generate infinite puzzles but cover few styles and often mimic existing math/logic patterns. 3) Open-ended fill-in-the-blank with an AI judge adds heavy overhead and breaks when models ignore instructions.

The Gap: We needed a way to turn rich but unverifiable text into tasks with simple, automatic correctness checks.

Real Stakes: Without more and fresher RLVR tasks, models hit walls sooner (especially stronger ones), hurting progress in daily-life areas like safer coding help, clearer science tutoring, and practical security guidance.

02 Core Idea

🍞 Hook: You know those stories where someone removes a key paragraph and asks you to pick the right missing part from a list?

🥬 The Concept: Golden Goose turns any reasoning-rich passage into a multiple-choice fill-in-the-middle puzzle with one correct answer and several tricky lookalikes. How it works:

  1. Take a source text with real reasoning.
  2. Have a strong LLM mask a short, contiguous block of crucial steps.
  3. Treat the removed block as the answer and generate several plausible but wrong distractors.
  4. Ask the student model to choose which option fills the [MASK].

Why it matters: Now we can auto-check answers by simple match, unlocking vast unverifiable text for RLVR.

🍞 Anchor: From a chemistry explanation, hide the reaction steps, then offer 9 similar-looking options; the correct sequence is the only one that truly fits.

Multiple analogies:

  1. Crossword clue: Hide a critical word chain; the model must pick the exact phrase that makes the paragraph make sense.
  2. Jigsaw: Remove a puzzle piece; many pieces look similar, but only one fits perfectly.
  3. Cooking recipe: Skip a step; several steps sound plausible, but only the right one keeps the dish edible.

Before vs After:

  • Before: RLVR needed clean, verified tasks (few domains), and training plateaued as data was exhausted.
  • After: Any rich passage can become a verified MCQ; training keeps finding fresh, medium-difficulty challenges and continues to improve.

Why it works (intuition):

  • Contiguous masking forces understanding of the local reasoning chain, not just a keyword.
  • Multiple plausible distractors prevent easy guessing or elimination tricks.
  • Verification is trivial (does the choice equal the ground truth?), so RL scales without heavy judges or test runners.
  • With enough diverse sources, the model sees many reasoning styles, improving generalization beyond MCQs.

Building blocks (each as a mini sandwich):

  • 🍞 Hook: Picking the right answer from options feels easier to grade than writing an essay. 🥬 The Concept: Multiple-Choice Questions (MCQ) let us check correctness by matching the chosen option. How it works: Present options; one is right; check equality. Why it matters: Enables massive, cheap auto-grading. 🍞 Anchor: Like school quizzes where the computer can grade thousands instantly.
  • 🍞 Hook: Ever skipped the middle of a sentence to see if your friend can guess it? 🥬 The Concept: Fill-in-the-Middle masks a meaningful span, not just a single word. How it works: Identify a consecutive block of key steps; replace with [MASK]. Why it matters: Forces understanding of reasoning flow, not just vocabulary. 🍞 Anchor: Hide "mix vinegar and baking soda" in a science fair write-up and see if the reader knows the fizzing step.
  • 🍞 Hook: Tricky wrong answers make you think harder. 🥬 The Concept: Distractors are wrong-but-plausible options similar in style and length to the right span. How it works: Generate lookalikes that break subtly (logic, order, conditions). Why it matters: Prevents guessing and pushes real reasoning. 🍞 Anchor: In coding, several malloc lines look fine, but only one allocates the correct shape.
  • 🍞 Hook: Some puzzles are too easy; they don’t teach you much. 🥬 The Concept: Difficulty-based Filtering removes items the student model always gets right (or always wrong). How it works: Try 16 rollouts; keep tasks with a mix of success/failure. Why it matters: Medium-difficulty tasks give the best learning signal. 🍞 Anchor: Keep math problems where you sometimes slip—those are the ones that help you improve.
  • 🍞 Hook: Big libraries help you practice many skills. 🥬 The Concept: GooseReason-0.7M is a large set (700k+) of these MCQ fill-in-the-middle tasks spanning math, code, and STEM. How it works: Mine rich sources; synthesize masked MCQs with distractors; filter by difficulty. Why it matters: Provides fresh, steady fuel for RL training after saturation. 🍞 Anchor: It’s like a giant, well-leveled workbook for a robot student.

03 Methodology

High-level pipeline: Source text → Mask a key reasoning span → Generate distractors → Build MCQ → Filter by difficulty (optional) → RLVR training with auto-check.

Step A: Find reasoning-rich passages (Input gathering)

  • What happens: Collect passages from math/CS forums (e.g., AoPS, coding sites), textbooks (MegaScience), or web scrapes (FineWeb for cybersecurity).
  • Why it exists: We need authentic reasoning steps to mask; otherwise, questions become shallow.
  • Example data: A forum answer explaining how to dynamically allocate a 2D array in C; a chemistry textbook paragraph on Cu2+ with NH3 reactions; a cybersecurity note on directory hash collisions.

🍞 Hook: It’s easier to hide a whole line in a song than a single note. 🥬 The Concept: Contiguous span masking picks a short, consecutive block of crucial reasoning. How it works:

  1. Prompt a strong LLM to identify an important multi-sentence (or multi-line code) block.
  2. Replace it with [MASK], creating S_mask.
  3. Treat the removed block t as the correct answer.

Why it matters: Forces the model to recover the chain of logic, not a random fact (a masking sketch follows below).

🍞 Anchor: In code, hide the specific malloc line that allocates each row; in chemistry, hide the precipitation and complex-formation steps.
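
A minimal sketch of the masking step, assuming the important block has already been identified (here given as fixed line indices); in the paper this selection is made by a strong LLM, so the indices below are purely illustrative.

```python
MASK_TOKEN = "[MASK]"


def mask_contiguous_span(lines: list[str], start: int, end: int) -> tuple[str, str]:
    """Replace lines[start:end] with a single [MASK] placeholder.

    Returns (masked_text, removed_block); the removed block becomes the
    ground-truth answer of the MCQ.
    """
    removed_block = "\n".join(lines[start:end])
    masked_lines = lines[:start] + [MASK_TOKEN] + lines[end:]
    return "\n".join(masked_lines), removed_block


passage = [
    "To allocate a 2D array in C, first allocate the row pointers.",
    "Then, for each row, allocate width * sizeof(int) bytes.",
    "Finally, check every allocation for NULL before use.",
]
masked, answer = mask_contiguous_span(passage, start=1, end=2)
print(masked)   # middle step replaced by [MASK]
print(answer)   # the hidden reasoning step (the correct option)
```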

Step B: Generate distractors (Option creation)

  • What happens: Ask the LLM for many incorrect-but-plausible alternatives that match style and length.
  • Why it exists: Without strong distractors, the model could guess or eliminate easily.
  • Example: For the C array, distractors might allocate the wrong element size, omit multiplication by width, or place the cast on the wrong pointer type.

🍞 Hook: Magic tricks are only impressive if the fake-outs look real. 🥬 The Concept: Distractors are carefully designed lookalikes that subtly fail. How it works:

  1. Ensure they mirror formatting, vocabulary, and tone.
  2. Vary the specific flaw (order, constants, edge cases).
  3. Provide enough options (e.g., 9) to make elimination hard.

Why it matters: Medium-hard tasks create both successes and failures, which is exactly what RL updates need (a prompt-construction sketch follows below).

🍞 Anchor: Nine slightly different chemistry step sequences; only one matches the actual reaction.
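
One way to sketch the distractor step is as a single prompt to a generator LLM. The prompt wording below is an assumption, not the paper's template; the generator's line-separated reply would then be parsed into candidate distractors.

```python
def build_distractor_prompt(masked_text: str, correct_span: str, num_distractors: int = 8) -> str:
    """Assemble a prompt asking a generator LLM for plausible-but-wrong fillers."""
    return (
        "You are given a passage with a [MASK] and the text that was removed.\n"
        f"Passage:\n{masked_text}\n\n"
        f"Removed (correct) text:\n{correct_span}\n\n"
        f"Write {num_distractors} alternative fillers that match the style and length "
        "of the removed text but are subtly wrong (wrong order, wrong constants, "
        "broken logic). Return one option per line."
    )


# The prompt is then sent to a strong generator LLM; its reply is split by line
# into the distractor list.
prompt = build_distractor_prompt(
    masked_text="Mix the dry ingredients. [MASK] Bake at 180°C for 30 minutes.",
    correct_span="Fold in the beaten eggs and melted butter until smooth.",
)
print(prompt)
```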

Step C: Build the MCQ and randomize options

  • What happens: Combine S_mask with {correct + distractors}, shuffle options, and output a structured item.
  • Why it exists: Randomization prevents memorizing positions; structure supports large-scale training.
  • Example: JSON with masked_reference_solution, removed_steps, and distractors (a structural sketch follows below).
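
A sketch of what such a structured item could look like, reusing the field names mentioned above (masked_reference_solution, removed_steps, distractors); the exact schema and shuffling code are assumptions for illustration.

```python
import json
import random


def build_mcq_item(masked_text: str, correct_span: str, distractors: list[str], seed: int = 0) -> dict:
    """Bundle the masked passage with shuffled options and record the answer position."""
    options = distractors + [correct_span]
    random.Random(seed).shuffle(options)        # randomize option order
    answer_index = options.index(correct_span)  # position of the ground truth
    return {
        "masked_reference_solution": masked_text,
        "removed_steps": correct_span,
        "distractors": distractors,
        "options": options,
        "answer_index": answer_index,
    }


item = build_mcq_item(
    masked_text="Add the acid slowly. [MASK] Record the final temperature.",
    correct_span="Stir continuously while monitoring the pH.",
    distractors=["Seal the flask immediately.", "Add all the base at once.", "Stop stirring and wait."],
)
print(json.dumps(item, indent=2))
```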

Step D: Handle noisy sources (Passage extraction)

  • What happens: For messy web scrapes, first extract or summarize a clean, educational passage before masking.
  • Why it exists: Noise can make the mask guessable without reasoning; extraction keeps quality high.
  • Example: From a long blog post, extract the paragraph explaining how collisions in directory hashing cause denial-of-service.

Step E: Difficulty-based filtering (Quality control)

  • What happens: Test each item with 16 rollouts from the student model; drop ones that are always right (too easy) or always wrong (too hard).
  • Why it exists: Keep items in the sweet spot that produce a learning signal.
  • Example: If the model gets 16/16, it’s probably guessable; remove it (the sketch below applies this rule).
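
The filtering rule can be sketched as "keep only items with mixed rollout outcomes." In the snippet below, attempt_item is a hypothetical single rollout, with a random guess standing in for the student model.

```python
import random


def attempt_item(item: dict) -> bool:
    """Hypothetical single rollout: a random guess stands in for the student model."""
    return random.randrange(len(item["options"])) == item["answer_index"]


def keep_item(item: dict, num_rollouts: int = 16) -> bool:
    """Keep items the student sometimes solves and sometimes misses."""
    correct = sum(attempt_item(item) for _ in range(num_rollouts))
    return 0 < correct < num_rollouts  # drop always-right and always-wrong items


dataset = [{"options": list("ABCDEFGHI"), "answer_index": 0} for _ in range(100)]
filtered = [it for it in dataset if keep_item(it)]
print(f"kept {len(filtered)} of {len(dataset)} items")
```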

RL training: Plug-and-play with any RLVR recipe

  • What happens: Use a stable RL algorithm (e.g., a ProRL v2 variant of GRPO). For each task, the model picks an option; reward is 1 if it matches the ground truth (a group-relative reward sketch follows after this list).
  • Why it exists: Simple, cheap verification allows massive scaling without judges or test runners.
  • Example with data: On GooseReason-Math, 9-choice MCQs spread accuracy into a medium band, maximizing effective examples.
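
Because the reward is just a binary match, it plugs straight into a group-relative update. The sketch below computes rewards for a group of sampled options and turns them into GRPO-style advantages (reward minus the group mean, divided by the group's standard deviation); it is a generic illustration, not the ProRL v2 implementation.

```python
import statistics


def mcq_rewards(chosen_options: list[int], answer_index: int) -> list[float]:
    """Binary reward per rollout: 1.0 if the chosen option matches the ground truth."""
    return [1.0 if c == answer_index else 0.0 for c in chosen_options]


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center by the group mean and scale by its std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards agree
    return [(r - mean) / std for r in rewards]


# One task, eight rollouts: the model picked these options; option 3 is correct.
choices = [3, 1, 3, 7, 3, 0, 3, 2]
rewards = mcq_rewards(choices, answer_index=3)
print(group_relative_advantages(rewards))  # positive for correct picks, negative otherwise
```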

Secret sauce (why it’s clever):

  • Converts unverifiable text into verifiable tasks via exact-match MCQ.
  • Uses contiguous masks to capture real reasoning flow.
  • Tunes difficulty with number of distractors and rollout-based filtering.
  • Requires no domain-specific test harnesses or human graders.

🍞 Hook: Sometimes writing an essay is too open-ended; a good quiz keeps you honest. 🥬 The Concept: MCQ vs Open-ended. How it works:

  1. Open-ended infill requires an LLM judge and risks the model ignoring instructions.
  2. MCQ needs only exact-match checking.
  3. With 9 options, elimination tricks fail; real reasoning is needed.

Why it matters: MCQ gives reliable, scalable signals where open-ended often collapses to zero accuracy.

🍞 Anchor: In tests, most open-ended items gave no learning signal; 9-choice MCQs produced many effective examples.

🍞 Hook: A practice set is useful only if it still challenges you. 🥬 The Concept: Effective Examples are tasks where a trained model sometimes succeeds and sometimes fails. How it works:

  1. Measure accuracy over multiple rollouts.
  2. Keep items with mixed outcomes.
  3. Use them to guide RL updates.

Why it matters: Effective examples are the fuel of continued learning.

🍞 Anchor: GooseReason added 450k+ effective examples, 13× more than a popular prior blend.

04 Experiments & Results

The test: Can Golden Goose keep improving models after they plateau on existing RLVR data, and can it do so efficiently? The authors evaluated across 15 benchmarks spanning math (AIME24/25, AMC, MATH, Minerva, Olympiad), coding (APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, LiveCodeBench), STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym).

The competition (baselines):

  • Continued RL on the same ProRL data (a strong existing RLVR blend).
  • RLVE (adaptive environments; especially strong in math).
  • Larger non-RL or differently trained models like Qwen3-30B-Instruct for reference.
  • In cybersecurity: Llama-Primus family (8B domain-specialized) vs a 4B general model trained on GooseReason-Cyber.

Scoreboard with context:

  • Data saturation recovery (Qwen-4B-Instruct after plateau):
    • With only ProRL data, performance flatlined or regressed after ~300 steps.
    • Adding GooseReason-0.7M turned −1.29% math into about +2.18%, small coding gains into +2.24%, and −1.52% STEM into +2.40%, like turning a slipping grade into steady B-to-A improvements.
  • Prolonged RL on ProRL-1.5B-v2 (already heavily trained):
    • ProRL-only continued training delivered tiny gains over another 1,100 H100 hours (like studying hard but re-reading the same notes).
    • With GooseReason-0.7M, absolute gains were roughly +2.71% math, +2.12% coding, +3.48% STEM: clear, continued progress across the board.
    • Despite training on MCQs, the model improved on non-MCQ evaluations, showing transferable reasoning skills.
  • Difficulty sweet spot (task format ablation):
    • Open-ended infill led to >83% zero-accuracy items (no learning signal) as models ignored instructions.
    • 3-option MCQs were too easy (elimination worked too well).
    • 9-option MCQs pushed most items into medium difficulty, creating many effective examples.
  • Compute-efficient scaling from scratch (Qwen-4B-Instruct, 200 RL steps):
    • Joint training with GooseReason-0.7M outperformed ProRL-only at the same step counts across math and code, giving more learning per unit compute.
  • Cybersecurity (GooseReason-Cyber, ~180k items from web scrapes):
    • After only 100 RL steps, the 4B model achieved a +4.44% average boost on three benchmarks (CTI-MCQ, CyberMetric, SecEval), beating an 8B domain-specialized SOTA that had far more domain-specific training.

Surprising findings:

  • MCQ training improved performance on non-MCQ tasks, suggesting the model learned underlying reasoning patterns, not just test-taking tricks.
  • Stronger base models saturated earlier, but GooseReason still revived them—fresh, diverse tasks matter even more as models get better.
  • STEM benefited most, likely because existing verified STEM data is much scarcer than verified math/code.

Bottom line: Golden Goose systematically turned stalled training runs into steady climbers, using simple auto-checkable MCQs built from previously unusable text.

05 Discussion & Limitations

Limitations:

  • Source quality: If the internet text is biased, outdated, or toxic, synthesized items may inherit those issues.
  • Hallucinated distractors: Poorly crafted distractors might accidentally be correct or too obviously wrong, reducing training value.
  • Format mismatch: Not all real-world tasks are naturally MCQ-shaped; some reasoning forms may lose nuance when compressed into options.
  • Overfitting to format: If overused without diversity, models might become too MCQ-savvy; mixing with other RLVR data remains important.

Required resources:

  • A strong generator LLM (the paper used GPT-5) to identify key spans and craft high-quality distractors.
  • Modest RL compute compatible with GRPO/ProRL-style training.
  • Storage and dataloading for 0.7M+ items and logging multiple rollouts for difficulty filtering.

When NOT to use:

  • Tasks needing precise numeric verification beyond text choice (e.g., exact code execution behavior) where unit tests are superior.
  • Domains where the educational passage can’t be reliably extracted (ultra-noisy scrapes without clear reasoning content).
  • Safety-critical scenarios where subtle distractor errors could train harmful misconceptions without additional human review.

Open questions:

  • How to automatically detect and fix flawed distractors (e.g., near-correct or ambiguous options) at scale?
  • Can the method adaptively pick mask lengths and positions based on student-model weaknesses for faster gains?
  • What is the best mix of MCQ vs open-ended vs programmatically verified tasks for long-term generalization?
  • How well does the approach extend to other high-stakes fields (law, medicine) with careful safety filters and expert-in-the-loop audits?

06 Conclusion & Future Work

Three-sentence summary: Golden Goose converts rich but unverifiable text into auto-checkable MCQs by masking a contiguous reasoning span and surrounding it with plausible distractors. This unlocks massive new RLVR data, leading to steady improvements even in models that had stopped getting better on existing datasets, and it transfers to diverse benchmarks beyond MCQs. The team built GooseReason-0.7M and a cybersecurity variant, achieving state-of-the-art results with simple verification and efficient compute.

Main achievement: A simple, scalable pipeline that turns abundant unverifiable internet text into high-quality RLVR tasks, reliably reviving saturated training and broadening coverage beyond math/code into STEM and cybersecurity.

Future directions:

  • Extend to other specialized domains (law, medicine) with safety filters and expert audits.
  • Automate quality assurance for distractors and ambiguity checks.
  • Personalize masking to target each model’s weak spots and maximize learning signal.
  • Blend formats (MCQ + programmatic tests + chain-of-thought) to further boost transfer.

Why remember this: It’s a clean idea—mask, distract, verify by match—that converts the world’s reasoning text into RL fuel. By fixing the data bottleneck, it keeps reasoning models improving when standard recipes stall, without expensive judges or handcrafted environments.

Practical Applications

  • Expand a model’s reasoning training by auto-synthesizing MCQ tasks from your organization’s manuals or wikis.
  • Boost a plateaued RL run by mixing in Goose-style masked MCQs from relevant textbooks or forums.
  • Create domain-specific datasets (e.g., cybersecurity, finance) by extracting educational passages from curated web scrapes.
  • Rapidly prototype new benchmarks by masking key steps in expert-written solutions and adding distractors.
  • Use difficulty-based filtering to maintain a pool of medium-hard items that reliably drive learning.
  • Evaluate instruction-following vs reasoning by comparing open-ended infill against 9-choice MCQ accuracy.
  • Pretrain small models on Goose MCQs to improve compute efficiency before heavier RL phases.
  • Diagnose model weaknesses by analyzing which masked spans (definitions, transitions, edge cases) cause most errors.
  • Improve generalization by diversifying sources (textbooks, forums, research summaries) for broader reasoning styles.
  • Safeguard training by auto-detecting and removing low-quality or ambiguous items during data synthesis.
#Reinforcement Learning with Verifiable Rewards #Golden Goose #GooseReason-0.7M #Fill-in-the-Middle #Multiple-Choice Question #Distractors #Data Saturation #ProRL GRPO #STEM reasoning #Cybersecurity LLM #Effective examples #Difficulty filtering #RL data synthesis #Reasoning generalization #Auto-verification