Scaling Open-Ended Reasoning to Predict the Future

Intermediate
Nikhil Chandak, Shashwat Goel, Ameya Prabhu et al. · 12/31/2025
arXiv · PDF

Key Summary

  • The paper teaches small language models to predict open-ended future events by turning daily news into thousands of safe, graded practice questions.
  • They build an automated pipeline that reads news, writes forward-looking questions, hides answer leaks, and checks that each question has a single clear outcome.
  • To prevent cheating with future info, the system only uses a frozen, offline news archive and retrieves articles published no later than one month before each event resolves.
  • Training uses reinforcement learning with a reward that combines being right (accuracy) and being honest about uncertainty (Brier score), which improves both accuracy and calibration.
  • Dense retrieval of five relevant news chunks gives big accuracy boosts across many model families, even when models are small.
  • Their 8B model, OpenForecaster8B, reaches accuracy and calibration competitive with much larger proprietary models on tests from May–August 2025.
  • Calibration gains transfer to other benchmarks (like QA and science tests), meaning the model better knows when it might be wrong.
  • Mixing free-form questions with some binary market questions gives the best balance across open-ended and yes/no forecasting.
  • They carefully avoid information leakage and perform held-out testing only at the end to keep evaluations fair.
  • All code, data, and models are released so others can reproduce and extend the work.

Why This Research Matters

Better forecasts help leaders choose wisely under uncertainty—when the cost of being wrong is high. This work shows we can safely train smaller, open models to make calibrated, open-ended predictions using frozen news snapshots. Honest retrieval and a reward that values both correctness and humility build trust: the model knows when it knows, and admits when it doesn’t. That makes it easier to set risk thresholds, plan contingencies, and avoid overconfident mistakes. Because calibration improvements carry over to other tasks, the benefits extend beyond forecasting. Open releases of data, code, and models invite the community to replicate, audit, and improve the system. In short, it’s a practical step toward AI that informs decisions instead of bluffing.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) Imagine you’re planning a class field trip. You have to pick a day, guess the weather, and predict if the bus will be on time. You can’t know for sure, but good guesses help you make smart choices.

🥬 Filling (The Actual Concept)

  • What it is: Forecasting is making the best possible guesses about the future using clues you have right now.
  • How it works: You gather information, weigh different possibilities, assign chances to each, and update as new clues arrive.
  • Why it matters: Without careful forecasting, important decisions (like planning trips, policies, or investments) can go wrong because we overreact to loud news or ignore hidden clues.

🍞 Bottom Bread (Anchor) A city mayor must decide whether to open emergency shelters before a storm. Good forecasting turns scattered reports into a responsible plan.

The World Before For language models, most practice questions about the future came from prediction markets (places where people bet yes/no on events). That made two things happen:

  • Lots of questions were binary (just yes or no), which lets a model get 50% right by accident—even with weak reasoning.
  • The topics were skewed (e.g., heavy on U.S. politics or crypto), so models didn’t practice on the truly diverse, surprising events the world throws at us.

And there was another problem: if you train or test using the live web, pages change and search results shuffle. That can accidentally leak the true outcome back into training, making models look smarter than they really are.

🍞 Top Bread (Hook) You know how it’s not fair to peek at the answer sheet before a test?

🥬 Filling (The Concept: Information Leakage)

  • What it is: Leakage is when future answers sneak into the model’s training or evaluation.
  • How it works: If we use live web pages that got updated after the event happened, the model might read hints or spoilers.
  • Why it matters: Then we can’t tell if the model actually reasoned about the future—or just memorized the answer.

🍞 Bottom Bread (Anchor) If a quiz promises, “Predict who’ll win the game tomorrow,” but the webpage was edited after the game ended, that’s cheating.

The Problem We want models that can make open-ended forecasts, like “Who will become the next Prime Minister of Country X?” or “Which company will be acquired by September?” These questions don’t have neat multiple-choice lists. They require imagination, exploration, and giving a probability—not just a guess. But getting enough high-quality, open-ended training data is hard and slow if you wait for humans to write and resolve everything.

🍞 Top Bread (Hook) Think of a giant newspaper library frozen in time, like a museum of yesterday’s news.

🥬 Filling (The Concept: Backtesting on Static News)

  • What it is: Using fixed monthly snapshots of news (CCNews) so dates don’t change.
  • How it works: Only articles published before the event are allowed; evaluation questions resolve after the model’s training cutoff.
  • Why it matters: This keeps training honest and lets us truly test future prediction, not answer memorization.

🍞 Bottom Bread (Anchor) If the model’s knowledge stops in April 2025, we test it on events from May–August 2025—real future for the model.
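
To make that cutoff rule concrete, here is a minimal Python sketch of the date filter described above; the field names and the 30-day helper are illustrative assumptions, not the paper's actual code.

```python
from datetime import date, timedelta

def eligible_articles(articles, resolution_date, cutoff_days=30):
    """Keep only articles published at least `cutoff_days` before the question's
    resolution date (mirroring the roughly one-month cutoff described above)."""
    cutoff = resolution_date - timedelta(days=cutoff_days)
    return [a for a in articles if a["published"] <= cutoff]

# A question resolving 2025-07-17 may only see coverage from 2025-06-17 or earlier.
articles = [
    {"title": "Parliament schedules confirmation vote", "published": date(2025, 6, 10)},
    {"title": "New PM confirmed", "published": date(2025, 7, 18)},  # excluded: after the event
]
print(eligible_articles(articles, date(2025, 7, 17)))
```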

Failed Attempts and Gaps

  • Only binary questions: too easy to game and not imaginative enough.
  • Live web retrieval: risks sneaking in spoilers.
  • Small, hand-made datasets: don’t scale to the huge variety of world events.

What Was Missing A safe, scalable way to turn daily news into lots of clean, forward-looking, open-ended questions with clear answers—plus a training method that teaches both accuracy and honest uncertainty.

Real Stakes

  • Policy makers need sound odds before setting rules.
  • Companies plan hiring and budgets based on likely futures.
  • Scientists choose research directions under uncertainty.
  • And everyday people make choices (schools, savings, health) that benefit from better, humbler, more calibrated predictions.

02Core Idea

🍞 Top Bread (Hook) You know how great coaches don’t just say “We’ll win!”—they say “We’ve got a 60% chance if we play strong defense”? That number shows confidence, not just a guess.

🥬 Filling (The Aha!)

  • One-sentence insight: If we mass-produce safe, forward-looking questions from news and reward models for both being right and being honest about uncertainty, small models can forecast surprisingly well.

Three Analogies

  1. Library Detective: The model is a detective who gets a stack of dated clippings (retrieval) and must predict what happens next, without seeing tomorrow’s paper.
  2. Weather Forecaster: It doesn’t just say “rain or shine”—it gives a percent chance, then earns more credit when it’s right and only a light penalty when it admitted it was unsure (calibration).
  3. Science Fair: Instead of a yes/no quiz, the model must hypothesize an open-ended answer (like a name or place) and say how sure it is—then it gets feedback after the event resolves.

Before vs After

  • Before: Mostly binary questions, web leakage risks, and rewards that pushed models to sound confident even when they shouldn’t.
  • After: Open-ended questions generated at scale from fixed-time news, offline retrieval to stay honest, and a reward that balances accuracy with calibrated confidence.

🍞 Top Bread (Hook) Imagine a robot that reads an article and writes a fair quiz question where the answer isn’t obvious or spoiled.

🥬 Filling (The Concept: Automated Question Generation)

  • What it is: An LLM reads a news article and proposes future-facing questions with a single, short answer and clear resolution rules.
  • How it works: One model drafts questions; another model filters them, removes leaks (like answer hints), and picks the best question per article.
  • Why it matters: This turns thousands of articles into thousands of high-quality training examples—fast and safely.

🍞 Bottom Bread (Anchor) From an article about a coming vote, the system asks: “Who will be confirmed as Prime Minister by July 17, 2025?” with the resolution source and format defined.

🍞 Top Bread (Hook) You know how a chef checks ingredients before cooking?

🥬 Filling (The Concept: Retrieval-Augmented Generation)

  • What it is: The model fetches a few relevant news snippets from a frozen archive to ground its reasoning.
  • How it works: It embeds article chunks, finds the top five most relevant pieces, and reads them before predicting.
  • Why it matters: Without fresh-but-honest context, the model might miss crucial clues available before the event.

🍞 Bottom Bread (Anchor) Predicting an election outcome is better after reading recent, pre-deadline coverage of debates and polls.

🍞 Top Bread (Hook) Picture a game where you get points for scoring goals, but also for calling your shots accurately.

🥬 Filling (The Concept: Accuracy + Brier Score Reward)

  • What it is: A combined reward that pays for correct answers and for reporting realistic probabilities.
  • How it works: If you’re right and confident, big reward; if you’re wrong but admitted low confidence, small penalty; if you’re wrong and overconfident, bigger penalty.
  • Why it matters: Without it, models can overstate confidence (bad calibration) or avoid trying on hard cases (poor exploration).

🍞 Bottom Bread (Anchor) Saying “It’s 55% likely to be Candidate A” and being right is better than shouting “100%!” and being wrong.

Building Blocks

  • OpenForesight dataset (~52K filtered questions) from 250K news articles.
  • Offline retrieval from CCNews; five chunks per question.
  • Reinforcement Learning (GRPO) to learn from outcomes.
  • Model-based answer matching to fairly judge open-ended strings.
  • Careful held-out testing (May–Aug 2025) to avoid peeking.

🍞 Top Bread (Hook) Learning to ride a bike takes feedback after each try.

🥬 Filling (The Concept: Reinforcement Learning with GRPO)

  • What it is: The model tries answers and probabilities; it gets reward after seeing the true outcome.
  • How it works: Sample several predictions, score them with the combined reward, and update the model toward better strategies.
  • Why it matters: For open-ended futures, step-by-step labels are rare; outcomes are the cleanest teacher.

🍞 Bottom Bread (Anchor) After predicting who wins a leadership vote, the model later sees the real winner and adjusts how bold to be next time.

03Methodology

High-Level Recipe Input → (A) Turn news into safe questions → (B) Retrieve pre-deadline context → (C) Model predicts answer + probability → (D) Outcome arrives → (E) Reward with Accuracy + Brier → (F) Update with RL → Output: A calibrated forecaster.

Step A: Generate and Clean Questions

  • What happens: One LLM drafts multiple forward-looking questions per article (with background, resolution source/date, expected answer format). Another LLM checks if the question is future-facing, has a unique answer, and removes leaks (like directly stating the answer). It selects the best question when several pass.
  • Why this step exists: If the answer leaks or the question is ambiguous, the model can learn shortcuts or get unfair feedback.
  • Example: Article says a parliament will vote next week. The pipeline creates: “Who will be confirmed as Prime Minister by July 17, 2025?” It sets the source of truth (parliament website), resolution date, and answer format (full name). Any text like “Yulia Svyrydenko is expected…” gets rewritten to avoid hints.
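
A rough sketch of this generate-then-filter loop is below. The prompts and the call_llm helper are placeholders I am assuming, not the paper's actual prompts or models.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your chat-completions client of choice."""
    raise NotImplementedError

DRAFT_PROMPT = """You are given a news article published on {pub_date}.
Draft a forward-looking question about an event that has not yet resolved.
Include: background, a resolution source, a resolution date, and the expected
answer format (e.g., a full name).

Article:
{article}"""

FILTER_PROMPT = """Review the candidate question below. Reply PASS only if it
(1) is future-facing relative to {pub_date}, (2) has a single clear answer,
and (3) does not state or strongly hint at that answer. Otherwise reply FAIL.

Candidate:
{question}"""

def generate_question(article_text: str, pub_date: str):
    """Draft with one model call, filter with another; drop leaky or vague questions."""
    draft = call_llm(DRAFT_PROMPT.format(pub_date=pub_date, article=article_text))
    verdict = call_llm(FILTER_PROMPT.format(pub_date=pub_date, question=draft))
    return draft if verdict.strip().upper().startswith("PASS") else None
```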

Step B: Offline Retrieval (frozen news)

  • What happens: Split each article into 512-token chunks; embed them; at prediction time, fetch the top-5 relevant chunks dated no later than one month before the event.
  • Why this step exists: It keeps the model honest (no after-the-fact spoilers) and gives it the latest allowed clues.
  • Example: Predicting a cabinet appointment? Retrieval surfaces prior reports of frontrunners and vote schedules published before the deadline.

🍞 Top Bread (Hook) Think of retrieval like asking a librarian for the five most relevant books—printed before a certain date.

🥬 Filling (The Concept: Dense Retrieval)

  • What it is: Turn text into vectors (embeddings) so we can efficiently find similar passages.
  • How it works: Each chunk becomes a point in vector space; we pick the nearest ones to the question.
  • Why it matters: Without good retrieval, the model might miss critical context and guess wildly.

🍞 Bottom Bread (Anchor) Asking, “Who will chair the summit?” pulls passages about pre-summit shortlists—not post-summit summaries.
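
A minimal dense-retrieval sketch, assuming a sentence-transformers embedding model (the paper's exact encoder and index are not specified here) and that the chunks have already passed the date cutoff:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Embed the question and all date-eligible chunks, return the k most similar."""
    q = encoder.encode([question], normalize_embeddings=True)   # shape (1, d)
    c = encoder.encode(chunks, normalize_embeddings=True)       # shape (n, d)
    scores = (c @ q.T).ravel()  # cosine similarity, since embeddings are unit-normalized
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```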

Step C: Forecast (answer + probability)

  • What happens: The model writes an open-ended short answer (often a name or place) and a probability (0–1) that it is correct.
  • Why this step exists: Open-ended questions don’t come with multiple choices; the model must choose and show how sure it is.
  • Example: “Answer: Tadej Pogačar; Probability: 0.60.”
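
Assuming the model's output follows the “Answer: …; Probability: …” pattern shown in the example (the format itself is an assumption), a minimal parser could look like this:

```python
import re

def parse_forecast(text: str):
    """Extract a short answer and a probability in [0, 1] from output such as
    'Answer: Tadej Pogacar; Probability: 0.60'."""
    answer = re.search(r"Answer:\s*(.+?)(?:;|\n|$)", text)
    prob = re.search(r"Probability:\s*([01](?:\.\d+)?)", text)
    if not (answer and prob):
        return None  # malformed output; could trigger a retry or zero reward
    return answer.group(1).strip(), min(max(float(prob.group(1)), 0.0), 1.0)

print(parse_forecast("Answer: Tadej Pogacar; Probability: 0.60"))  # ('Tadej Pogacar', 0.6)
```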

🍞 Top Bread (Hook) Like calling your shot in basketball—“bank off the glass, 60%”.

🥬 Filling (The Concept: Calibration)

  • What it is: Matching confidence to reality over many predictions.
  • How it works: If you say 70% often, then about 70% should be right.
  • Why it matters: Overconfidence misleads; underconfidence wastes good insights.

🍞 Bottom Bread (Anchor) A model that says “I’m 30% sure” and is right about 30% of the time is trustworthy.
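
Calibration can be checked by binning many predictions by stated confidence and comparing each bin against how often it was actually right; a minimal sketch:

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=5):
    """Per confidence bin, report mean stated confidence vs. observed accuracy.
    A well-calibrated forecaster has these two numbers close in every bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if mask.any():
            rows.append((f"{lo:.1f}-{hi:.1f}",
                         float(confidences[mask].mean()),
                         float(correct[mask].mean())))
    return rows

print(calibration_table([0.3, 0.35, 0.7, 0.72, 0.9], [0, 1, 1, 1, 1]))
```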

Step D: Judge the Answer

  • What happens: Another LLM checks if the model’s free-form answer matches the ground truth (semantic matching: “Geoffrey Hinton” equals “Geoffrey Everest Hinton”).
  • Why this step exists: Open-ended answers have many valid forms; exact-string checks would be unfair.
  • Example: “UK Prime Minister Rishi Sunak” vs “Rishi Sunak” should count the same.

🍞 Top Bread (Hook) You know how teachers accept both “USA” and “United States” if they mean the same thing?

🥬 Filling (The Concept: Model-Based Answer Matching)

  • What it is: An LLM judge tests meaning, not just spelling.
  • How it works: It compares the student’s answer to the official answer with alias tolerance.
  • Why it matters: Prevents marking correct ideas as wrong just because of formatting.

🍞 Bottom Bread (Anchor) “NYC” should match “New York City” when the question expects a city name.
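
One way to implement such a judge is a short prompt to another model. This sketch assumes a generic call_llm helper and is not the paper's actual judging prompt.

```python
JUDGE_PROMPT = """Question: {question}
Official answer: {gold}
Candidate answer: {pred}
Do the official and candidate answers refer to the same entity or outcome?
Treat aliases, abbreviations, and added titles as matches. Reply YES or NO."""

def answers_match(question: str, gold: str, pred: str, call_llm) -> bool:
    """Semantic match via an LLM judge rather than exact string comparison."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, gold=gold, pred=pred))
    return reply.strip().upper().startswith("YES")

# Expected behavior: "UK Prime Minister Rishi Sunak" should match "Rishi Sunak".
```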

Step E: Reward with Accuracy + Brier

  • What happens: Combine (1) a point if correct and (2) a Brier-like score that rewards truthful probabilities for open-ended answers.
  • Why this step exists: Accuracy alone pushes overconfidence; Brier alone can discourage trying on hard cases. The combo balances both.
  • Example: Being right at 0.6 gets a decent reward; being wrong at 0.99 gets a harsh penalty.

🍞 Top Bread (Hook) Like grading both your answer and how realistically you rated your odds.

🥬 Filling (The Concept: Brier Score for Free-Form)

  • What it is: A score that rises when you’re correct and calibrated, and falls when you’re wrong and overconfident.
  • How it works: Correct low-confidence answers earn small credit; wrong, high-confidence answers lose more.
  • Why it matters: Encourages honest uncertainty and discourages bluffs.

🍞 Bottom Bread (Anchor) Saying “20% chance” and being wrong is not a big deal; saying “100%” and being wrong is.
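
One plausible way to combine the two signals is a weighted mix of an accuracy term and a Brier-style calibration term; the exact functional form and weighting in the paper may differ.

```python
def forecast_reward(is_correct: bool, prob: float, acc_weight: float = 0.5) -> float:
    """Weighted mix of accuracy and a Brier-style term.
    `is_correct` is whether the judge accepted the answer; `prob` is the model's
    stated probability that its own answer is correct."""
    outcome = 1.0 if is_correct else 0.0
    accuracy_term = outcome                       # 1 point for a correct answer
    brier_term = 1.0 - (outcome - prob) ** 2      # 1 = perfectly calibrated, 0 = worst
    return acc_weight * accuracy_term + (1.0 - acc_weight) * brier_term

print(forecast_reward(True, 0.6))    # right, moderately confident -> 0.92
print(forecast_reward(False, 0.99))  # wrong, overconfident        -> ~0.01
print(forecast_reward(False, 0.20))  # wrong, but admitted doubt   -> 0.48
```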

Step F: Update with RL (GRPO)

  • What happens: The model samples several answers, gets the reward for each, and updates toward higher-reward behaviors.
  • Why this step exists: Outcome-based learning is the most direct signal for real-world forecasting.
  • Example: If guessing “Unknown” too often gives low reward, the model learns to explore plausible names with modest confidence.
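
The core of GRPO is scoring each sampled forecast relative to the other samples for the same question. Below is a minimal sketch of that advantage computation only; the full policy-gradient update is omitted.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward by the group's mean
    and spread, so the update favors forecasts that beat the model's own current
    average for this question."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# Four sampled forecasts for one question, each scored with forecast_reward(...):
print(group_relative_advantages([0.92, 0.01, 0.48, 0.70]))
```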

Secret Sauce

  • A carefully filtered, leak-free, open-ended dataset (OpenForesight) at scale.
  • Offline, date-safe retrieval that boosts accuracy without cheating.
  • A reward that teaches both correctness and humility.
  • Mixing some binary market questions preserves broad competence.

04Experiments & Results

The Test

  • What they measured: Accuracy (how often the answer is right) and calibration via a Brier-style score (how well probabilities match reality). They also checked consistency on long-term logical relations, and whether calibration gains transfer to other tasks.
  • Why it matters: A good forecaster must be both right and well-calibrated; otherwise, people can’t trust its confidence.

The Competition

  • Compared against: Much larger proprietary models (e.g., GPT OSS 120B, DeepSeek-R1, Grok-3-Mini) and strong open models (Qwen and Llama families).
  • Leveling the field: Everyone got the same offline retrieval context (top-5 pre-deadline chunks) to ensure fairness without leakage.

Scoreboard (with context)

  • Open-ended test (May–Aug 2025): The 8B model, OpenForecaster8B, achieves Brier and accuracy competitive with or better than much larger models. Think of it like a small car keeping pace with race cars by taking smarter lines, not burning more fuel.
  • FutureX external benchmark (Jul–Aug 2025): Strong accuracy and near-top Brier performance, again rivaling larger systems.
  • Retrieval helps a lot: Adding five retrieved chunks improves accuracy by roughly 9–18% across different models, then plateaus—like getting most of the benefit from the first few relevant pages.

Surprising Findings

  • Reward design matters: Training with only accuracy hurt calibration (models became overconfident). Training with only Brier made models say “Unknown” too often (avoiding hard bets). The combo reward fixed both.
  • Free-form data is essential: Models trained only on binary data didn’t improve much on open-ended tasks. Mixing some binary with free-form gave the best overall balance.
  • Calibration generalizes: After forecasting training, calibration improved on unrelated benchmarks (SimpleQA, GPQA-Diamond, MMLU-Pro). Better “self-knowledge” reduces hallucinations and supports abstaining when unsure.
  • Consistency improved: Checks on long-term logical relations (like AND/OR relationships) showed reduced violations, meaning the model’s probabilities fit together more sensibly.
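
As an illustration of that kind of check (the paper's exact consistency tests may differ), related forecasts can be tested against basic probability identities:

```python
def and_or_violations(p_a: float, p_b: float, p_and: float, p_or: float, tol: float = 1e-6):
    """Flag probability assignments that cannot all hold at once:
    P(A and B) <= min(P(A), P(B)), P(A or B) >= max(P(A), P(B)),
    and P(A) + P(B) = P(A and B) + P(A or B)."""
    issues = []
    if p_and > min(p_a, p_b) + tol:
        issues.append("P(A and B) exceeds min(P(A), P(B))")
    if p_or < max(p_a, p_b) - tol:
        issues.append("P(A or B) is below max(P(A), P(B))")
    if abs((p_a + p_b) - (p_and + p_or)) > tol:
        issues.append("P(A) + P(B) != P(A and B) + P(A or B)")
    return issues

print(and_or_violations(0.6, 0.5, 0.7, 0.4))  # an inconsistent set of forecasts
```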

Concrete Numbers (rounded and simplified)

  • Dataset: ~52K open-ended questions from ~250K articles; 5 retrieved chunks per question.
  • Retrieval gains: Accuracy bumps of about 9–18% across models; little benefit beyond five chunks.
  • Held-out testing: Only measured at the very end on 302 carefully filtered questions; also tested on FutureX.
  • Generalization: Calibration gains visible across unrelated benchmarks without retrieval.

Takeaway A small, openly released 8B forecaster became competitive with much larger models by: (1) training on a leak-safe, large, open-ended dataset, (2) reading just-enough pre-deadline context, and (3) learning a reward that values both correctness and honest uncertainty.

05Discussion & Limitations

Limitations

  • News-only bias: The training data comes from news, which can overrepresent certain topics (e.g., politics) and underrepresent slow, underreported domains (e.g., some scientific advances).
  • Late reporting: Articles sometimes describe events after they happened, risking accidental ease or leakage if dates aren’t carefully controlled.
  • Short answers: The system focuses on short, concrete outcomes (names, places), not long-form rationales that are harder to grade.
  • Single-guess Brier: The adapted score uses one best guess plus a probability. Real forecasters often consider several plausible answers; multi-guess scoring is deferred for future work.
  • Retrieval scope: Using one month cutoff is cautious but may still miss useful context published slightly earlier or create uneven difficulty across questions.

Required Resources

  • Offline news corpus (CCNews) and its embeddings indexed for retrieval.
  • An answer-matching judge model and a forecasting base model with RL training capability.
  • Compute: Outcome-based RL is more expensive than simple fine-tuning; careful batching and sampling are needed.

When Not to Use

  • When the outcome space is extremely open and grading is ambiguous (e.g., “Why did markets fall?”).
  • When live, up-to-the-minute data is required and leakage controls cannot be guaranteed.
  • When decisions depend on long-form causal explanations rather than short, resolvable outcomes.

Open Questions

  • Multi-outcome probability reports: Can models list several plausible answers with probabilities that sum to 1, and be graded fairly at scale?
  • Beyond news: How to safely incorporate academic papers, reports, and datasets without leakage?
  • Human-AI teaming: Can forecasters and models co-train, improving each other’s calibration and exploration?
  • Long-form grading: Can we design robust, automated graders for multi-paragraph forecasts?
  • Safety and governance: How should we audit forecasting systems used in high-stakes settings to prevent overreliance or misuse?

06Conclusion & Future Work

Three-Sentence Summary This work scales open-ended forecasting by converting daily news into thousands of safe, leak-free questions and training models with a reward that balances accuracy and honest uncertainty. With offline retrieval and outcome-based RL, a small 8B model reaches accuracy and calibration competitive with much larger proprietary systems. The improved calibration even carries over to other benchmarks, making the model more trustworthy in general.

Main Achievement A complete, reproducible recipe—data, retrieval, reward design, and RL—that turns static news snapshots into a powerful training loop for open-ended future prediction, released openly as the OpenForesight ecosystem.

Future Directions

  • Support multiple plausible answers with full probability distributions and fair scoring.
  • Expand beyond news to other fixed-cutoff corpora while maintaining leakage controls.
  • Develop robust grading for long-form forecasts and richer consistency checks.
  • Explore mixed human–AI forecasting teams to combine creativity, skepticism, and scale.

Why Remember This It shows that careful data curation, honest retrieval, and a reward that values humility can let smaller models forecast the future more reliably—pushing AI from confident guessing toward dependable judgment.

Practical Applications

  • Government early-warning: Aggregate calibrated forecasts about policy votes, sanctions, or leadership changes to prepare response options.
  • Corporate planning: Estimate probabilities of key supplier changes, regulations, or mergers to stress-test budgets and inventories.
  • Risk dashboards: Show decision-makers both the top prediction and its confidence to guide when to act versus wait.
  • Research prioritization: Forecast which scientific areas are likely to see breakthroughs to allocate grants more effectively.
  • Media analysis: Track evolving odds on developing stories using only pre-deadline articles to avoid spoilers or bias.
  • Investment triage: Use calibrated signals (with abstention when uncertain) to screen events for deeper human analysis.
  • Crisis logistics: Predict likely locations or timings of summit meetings or resource reallocations to pre-position aid.
  • Elections monitoring: Combine pre-deadline coverage into cautious, date-safe probability estimates of outcomes.
  • Operations playbooks: Trigger predefined actions when probabilities cross thresholds (e.g., >70% chance of a strike).
  • Model auditing: Use consistency checks and calibration curves to monitor whether deployed models stay trustworthy over time.
#open-ended forecasting#calibrated prediction#Brier score#reinforcement learning#GRPO#retrieval-augmented generation#dense retrieval#answer matching#information leakage#CCNews#backtesting#probabilistic reasoning#OpenForesight dataset#calibration transfer#offline evaluation