Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling
Key Summary
- •Falcon-H1R is a small (7B) AI model that thinks really well without needing giant computers.
- •It mixes two brain styles—Transformer and Mamba—so it can read and think through very long problems fast and cheaply.
- •The team carefully taught it with high-quality, step-by-step solutions (SFT), then fine-tuned its decisions using rewards (RL).
- •A clever test-time trick called DeepConf stops weak ideas early and keeps only strong solution paths, saving lots of tokens and time.
- •On hard math and science tests (like AIME24/25 and GPQA), Falcon-H1R matches or beats much larger models.
- •It reaches 96.7% on AIME25 under DeepConf while using about 38% fewer tokens than a popular 8B baseline.
- •Special training choices—like focusing on hard problems, using one strong teacher, and balancing token counts—boosted stability and accuracy.
- •Its hybrid design scales well to many parallel thoughts at once, which is perfect for test-time scaling.
- •The project shows that smarter training and architecture can beat just making models bigger.
- •This makes advanced reasoning more affordable for classrooms, startups, and tools that need careful step-by-step thinking.
Why This Research Matters
Falcon-H1R shows that advanced reasoning doesn’t require giant, expensive models—smart data and design can make small models shine. That means high-quality math tutoring, safer coding assistants, and science helpers can run on cheaper hardware. By pruning weak ideas early, DeepConf saves energy and time, which helps the environment and your cloud bill. Schools and small companies can now access strong step-by-step problem solving without needing massive servers. For users, this translates to clearer explanations, better first-try answers, and faster results. As test-time scaling becomes more efficient, everyday tools get smarter without hidden costs.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: You know how some kids solve a tricky puzzle by trying many ideas quickly, while others pick one careful plan and stick to it? The best solvers often mix both: explore a bit, then focus on what works.
🥬 The Concept (Language Models, the world before): A language model is a computer program that predicts the next word and can use that skill to solve problems. How it works:
- It reads your question as a sequence of tokens (word pieces).
- It guesses the next token again and again to build an answer.
- With special training, it learns to follow instructions and reason step-by-step. Why it matters: Before, we often improved reasoning by making models huge, which was expensive and slow.
🍞 Anchor: Imagine asking, “How many tiles do I need to cover my floor?” A strong language model doesn’t just spit a number; it explains the steps, checks itself, and gives the final count.
🍞 Hook: Imagine you’re studying for a math contest. You get better by seeing great worked solutions, then practicing until your steps are solid.
🥬 The Concept (Supervised Fine-Tuning, SFT): SFT is teaching a model with hand-picked examples that include the right answers and the thought process. How it works:
- Curate clean, correct, step-by-step solutions (math, code, science).
- Emphasize hard problems so the model learns real strategies.
- Train the model to imitate those high-quality reasoning traces. Why it matters: Without SFT on rich, correct steps, the model may sound fluent but miss the logic.
🍞 Anchor: If you show the model 12 different solid ways to factor a tough polynomial, it’s more likely to find a reliable path at test time.
🍞 Hook: Think of a dog learning tricks: it tries things, and you give a treat when it gets it right. Over time, it learns what works.
🥬 The Concept (Reinforcement Learning, RL): RL lets the model try answers and get rewarded when it’s correct or well-behaved. How it works:
- The model generates several solution attempts (rollouts).
- A checker verifies correctness (e.g., final boxed answer or code tests).
- The model is nudged toward the successful attempts. Why it matters: Without rewards, the model might not improve its first-try accuracy or control length.
🍞 Anchor: For code problems, the reward is “all tests pass,” so the model learns to write correct, not just plausible, programs.
🍞 Hook: When you explain your math steps out loud, it’s easier to spot mistakes.
🥬 The Concept (Chain-of-Thought, CoT): CoT is teaching the model to show its steps before the final answer. How it works:
- Encourage step-by-step reasoning in training.
- Keep the final answer clearly marked.
- Later, let the model explore multiple reasoning chains. Why it matters: Without steps, answers can look right but be wrong. Steps reveal and improve logic.
🍞 Anchor: “<think> I will factor the quadratic and check roots </think> Answer: 3” makes it clear how the number came out.
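To make this format concrete, here is a tiny Python sketch (my own illustration, not the paper's parsing code) that strips the <think> block and reads off the marked final answer:

```python
import re

def extract_final_answer(trace: str):
    """Pull the final answer out of a '<think>...</think> Answer: X' trace.

    Minimal sketch: the exact tags and answer marker used by Falcon-H1R may
    differ; these follow the toy example above.
    """
    # Drop the reasoning block so numbers inside it are not mistaken
    # for the final answer.
    visible = re.sub(r"<think>.*?</think>", "", trace, flags=re.DOTALL)
    match = re.search(r"Answer:\s*(.+)", visible)
    return match.group(1).strip() if match else None

print(extract_final_answer(
    "<think> I will factor the quadratic and check roots </think> Answer: 3"
))  # -> 3
```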
🍞 Hook: Suppose you could think in parallel—try many ideas quickly, then keep only the best.
🥬 The Concept (Test-Time Scaling, TTS): TTS uses more compute during answering to try multiple solution paths and combine them. How it works:
- Generate many reasoning chains.
- Score or vote on their answers.
- Pick the winner or a consensus. Why it matters: Without TTS, the model only gets one shot; with TTS, it can recover from a weak path by relying on stronger ones.
🍞 Anchor: For a tricky geometry problem, sampling 512 step-by-step attempts and then voting can turn a near-miss into a near-guarantee.
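As a concrete (if simplified) picture of the "vote on their answers" step, here is a minimal majority-voting sketch in Python; the paper's test-time scaling adds confidence-based pruning and optional confidence weighting on top of this basic idea.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer among sampled reasoning chains."""
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0]

# Toy example: 5 sampled chains for the same problem, 4 of them agree on "34".
sampled_answers = ["34", "34", "27", "34", "34"]
print(majority_vote(sampled_answers))  # -> 34
```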
🍞 Hook: You know how hybrid bikes can handle both city streets and dirt trails? They mix the best parts of two designs.
🥬 The Concept (Hybrid Model): A hybrid model combines different building blocks so it can be both fast and smart. How it works:
- Use Transformers for attention (great at connecting distant ideas).
- Use Mamba/State-Space parts for speedy, memory-friendly long sequences.
- Run them in parallel for high throughput. Why it matters: Without the hybrid, long, detailed reasoning becomes slow and memory-hungry.
🍞 Anchor: Falcon-H1R’s hybrid Transformer–Mamba blocks process very long solutions efficiently, like a car that’s fast on highways and nimble in alleys.
🍞 Hook: Imagine being confident only when you really should be.
🥬 The Concept (Confidence Calibration): This makes the model’s confidence match reality. How it works:
- The model assigns scores to its own partial reasoning.
- Low-confidence paths can be stopped early.
- High-confidence paths continue and get more budget. Why it matters: Without good calibration, you might waste time on weak ideas or cut off a winning path too soon.
🍞 Anchor: In DeepConf, paths with dropping confidence get pruned, which saves tokens while keeping accuracy high.
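To make "sliding-window confidence" concrete, here is a small sketch (a simplification of mine, not necessarily the paper's exact statistic) that averages token log-probabilities over a trailing window; a DeepConf-style rule can watch this signal and stop a chain when it sags.

```python
def sliding_window_confidence(token_logprobs, window=64):
    """Mean token log-probability over a trailing window, one score per position."""
    scores = []
    for i in range(len(token_logprobs)):
        chunk = token_logprobs[max(0, i - window + 1):i + 1]
        scores.append(sum(chunk) / len(chunk))
    return scores

# Toy trace: confidence sags near the end, which a pruning rule could catch.
logprobs = [-0.1] * 100 + [-2.5] * 30
conf = sliding_window_confidence(logprobs, window=16)
print(f"mid-trace window: {conf[50]:.2f}, final window: {conf[-1]:.2f}")
# Higher (closer to zero) means more confident; the final window is much lower.
```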
- The world before: Big models dominated tough reasoning, but they were pricey and slow.
- The problem: We needed high accuracy on long, step-by-step tasks without exploding costs.
- Failed attempts: Just adding more data or mixing too many teacher styles raised noise; using many weak solution paths at test time made inference expensive.
- The gap: Efficiently training a small model for long reasoning traces and making test-time scaling cheaper with smart early stopping.
- The stakes: Affordable AI tutoring, code assistants that verify themselves, and science helpers that can reason deeply—on regular hardware.
02Core Idea
🍞 Hook: Imagine a small, fast team that wins a puzzle tournament by practicing smart, following great examples, and quitting bad ideas early.
🥬 The Concept (Aha! Moment): You can get big-model reasoning quality from a small model by (1) hybrid architecture for long, fast thinking, (2) careful SFT on many strong, long solutions, (3) RL to sharpen first-try accuracy and length control, and (4) a test-time filter (DeepConf) that stops weak chains early. How it works (3 analogies):
- Chef analogy: Curate only top recipes (SFT), taste-and-tweak (RL), and toss early any dish that smells off (DeepConf); serve the best plate (voting).
- Sports analogy: Train with harder drills (SFT), get live feedback from a coach (RL), and bench players who underperform mid-game (DeepConf), keeping your fastest squad (hybrid) on the court.
- Classroom analogy: Study excellent worked examples (SFT), take quizzes with answer checks (RL), and skip strategies that immediately look shaky (DeepConf), all using notebooks that can hold very long workings (hybrid). Why it matters: Without this combo, small models either think too shallowly or become too slow and costly at test time.
🍞 Anchor: On AIME25, Falcon-H1R with DeepConf reached about 96.7% accuracy while using around 38% fewer tokens than a strong 8B baseline.
Before vs After:
- Before: Small models fell behind on long, multi-step problems; TTS helped but was expensive.
- After: A 7B hybrid model matches or beats much larger models and runs TTS efficiently by pruning weak chains.
Why it works (intuition):
- The hybrid Transformer–Mamba design keeps long contexts fast and memory-light, so the model can actually afford long CoT.
- SFT on difficulty-weighted, multi-rollout data teaches diverse, correct strategies, especially for hard problems.
- RL with verifiable rewards pushes pass@1 and tames overly long outputs.
- DeepConf uses confidence to cut wasteful branches in real time, concentrating compute on promising paths.
Building blocks (mini-sandwich tour):
🍞 Hook: Picture a library with a speedy robot librarian and a wise archivist. 🥬 The Concept (Parallel Transformer–Mamba Architecture): It combines attention (Transformer) with state-space speed (Mamba) to handle long reasoning efficiently. How it works: 1) Attention links distant steps, 2) State-space runs fast over long sequences, 3) Parallel design boosts throughput. Why it matters: Without it, long step-by-step answers are slow and memory-heavy. 🍞 Anchor: Generating 20k–48k tokens of reasoning becomes practical on modest hardware.
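For intuition only, here is a toy PyTorch sketch of the "two branches in parallel" idea: an attention branch beside a cheap gated recurrence standing in for the state-space branch. It is an illustrative schematic under my own simplifying assumptions, not the actual Falcon-H1R/Mamba block, which uses selective state-space scans and fused kernels.

```python
import torch
import torch.nn as nn

class ToyHybridBlock(nn.Module):
    """Toy parallel attention + recurrent (SSM-like) block, for illustration only."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for the state-space branch: a simple decaying linear recurrence.
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))
        self.in_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch: links distant positions.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Recurrent branch: cheap per-token state update swept over the sequence.
        u = self.in_proj(x)
        state = torch.zeros(x.size(0), x.size(-1), device=x.device)
        rec_out = []
        for t in range(x.size(1)):
            state = self.decay * state + u[:, t]
            rec_out.append(state)
        rec_out = torch.stack(rec_out, dim=1)
        # Parallel combination of the two branches, plus a residual connection.
        return self.norm(x + attn_out + rec_out)

block = ToyHybridBlock(d_model=32)
y = block(torch.randn(2, 128, 32))
print(y.shape)  # torch.Size([2, 128, 32])
```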
🍞 Hook: Think of getting tutored by the best solution writers. 🥬 The Concept (SFT with hard, verified traces): Teach by example, focusing on long, correct, and challenging solutions. How it works: 1) Verify answers, 2) Prefer hard problems, 3) Use many good rollouts per question. Why it matters: Without this, the model imitates style, not substance. 🍞 Anchor: 12 diverse correct math rollouts per prompt noticeably boost performance on tough items.
🍞 Hook: Imagine pop quizzes with instant result checks. 🥬 The Concept (RL with verifiable rewards): Reward correct finals (and formats), so the model learns what truly works. How it works: 1) Generate groups of attempts, 2) Auto-check math or run code tests, 3) Update policy toward winners. Why it matters: Without rewards, first-try accuracy and length control lag. 🍞 Anchor: Code gets reward 1.0 only if all tests pass.
🍞 Hook: Picture a gardener trimming weak branches so the tree grows stronger ones. 🥬 The Concept (DeepConf for TTS): Early-stop low-confidence chains and keep investing in strong ones. How it works: 1) Warm up to learn a threshold, 2) Monitor sliding-window confidence, 3) Stop chains that dip below. Why it matters: Without pruning, TTS wastes tokens and time. 🍞 Anchor: In DeepConf@512, Falcon-H1R uses far fewer tokens at high accuracy across AIME24/25 and more.
03Methodology
High-level recipe: Input question → (Hybrid model reads very long context) → Step-by-step generation with system prompt → (SFT-taught reasoning + RL-refined decisions) → Optional parallel chains with DeepConf → Aggregate to final answer.
Step A: Data curation and SFT (teach by great examples) 🍞 Hook: Imagine building a study pack that has only the best solutions and the right level of challenge. 🥬 The Concept (SFT on curated long CoT): Train on math, code, and science with verified answers and emphasis on hard problems. How it works:
- Verify or judge correctness (math-verify; sandbox tests for code; LLM judge for tricky science).
- Keep only clean samples; remove broken code or missing final answers.
- Use many rollouts per prompt (n=12) to expose diverse good strategies; prefer a single strong teacher to avoid conflicting styles.
- Weight by difficulty: up-weight hard items, down-weight easy ones. Why it matters: Without carefully filtered, diverse, correct traces, the model learns noisy or shallow habits. 🍞 Anchor: After SFT, AIME-style accuracy jumps substantially even before RL.
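Here is a hedged sketch of that curation loop; generate_rollouts and verify_answer are hypothetical callables standing in for a teacher model and checkers such as math-verify, a code sandbox, or an LLM judge, and the difficulty weighting is an illustrative choice rather than the paper's exact scheme.

```python
def curate_sft_examples(problems, generate_rollouts, verify_answer, n_rollouts=12):
    """Keep only verified-correct rollouts and weight prompts by difficulty.

    `generate_rollouts(question, n)` and `verify_answer(rollout, answer)` are
    hypothetical stand-ins supplied by the caller (teacher model + checker).
    """
    dataset = []
    for prob in problems:
        rollouts = generate_rollouts(prob["question"], n=n_rollouts)
        correct = [r for r in rollouts if verify_answer(r, prob["answer"])]
        if not correct:
            continue  # no clean trace to learn from, so drop the prompt
        # Empirical solve rate as a difficulty proxy: harder prompts (low solve
        # rate) get up-weighted, easy ones stay near weight 1.
        solve_rate = len(correct) / n_rollouts
        weight = 1.0 + (1.0 - solve_rate)
        dataset.extend(
            {"prompt": prob["question"], "target": r, "weight": weight}
            for r in correct
        )
    return dataset
```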
Practical training tweaks (kept simple) 🍞 Hook: Think of sharing snacks fairly among friends so no one gets too much or too little. 🥬 The Concept (Fair Token Counting across GPUs): Make every token contribute equally to the loss, even when sequences have different lengths. How it works:
- Each GPU processes different-length sequences.
- Normalize losses by the number of valid tokens globally, not per-GPU.
- This reduces gradient noise and stabilizes training. Why it matters: Without it, short examples can accidentally shout louder than long ones. 🍞 Anchor: Turning this on raised AIME25 accuracy by several points during SFT.
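A tiny self-contained illustration of why this matters (the numbers are made up and this is not the training code; in real training the global token count is shared across GPUs with an all-reduce before normalizing):

```python
# Token-level losses on two GPUs holding sequences of very different lengths.
shard_losses = [
    [0.9, 1.1],      # GPU 0: one short sequence (2 tokens)
    [0.5] * 10,      # GPU 1: one long sequence (10 tokens)
]

# Naive per-GPU normalization: each shard averages over its own tokens, so the
# 2-token shard "shouts" as loudly as the 10-token one.
per_gpu = sum(sum(s) / len(s) for s in shard_losses) / len(shard_losses)

# Fair global normalization: total loss divided by total valid tokens, so every
# token contributes equally regardless of which GPU it landed on.
global_norm = sum(sum(s) for s in shard_losses) / sum(len(s) for s in shard_losses)

print(f"per-GPU normalized:   {per_gpu:.3f}")      # 0.750 (short shard over-weighted)
print(f"globally normalized:  {global_norm:.3f}")  # 0.583 (every token counts once)
```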
🍞 Hook: Picture splitting a super-long essay into parts so multiple friends can read it at the same time. 🥬 The Concept (Long-context parallelism): Use context/sequence parallelism so very long sequences (up to 36k–48k) fit and train efficiently. How it works:
- Split sequences across devices (Ulysses-style),
- Carefully gather/scatter inside hybrid blocks,
- Use optimized kernels (RoPE, RMSNorm, XEnt) for speed and memory. Why it matters: Without this, long CoT would be too slow or run out of memory. 🍞 Anchor: Training at batch size 512 over millions of samples becomes steady and fast.
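A minimal sketch of the core splitting step, assuming contiguous chunks per device; real Ulysses-style sequence parallelism also redistributes attention heads with all-to-all communication and needs careful gather/scatter inside the hybrid blocks, all of which is omitted here.

```python
def shard_sequence(token_ids, world_size):
    """Split one long sequence into contiguous per-rank chunks."""
    per_rank = (len(token_ids) + world_size - 1) // world_size
    return [token_ids[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

# A 48k-token reasoning trace spread over 8 devices: 6,000 tokens each.
shards = shard_sequence(list(range(48_000)), world_size=8)
print([len(s) for s in shards])  # [6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000]
```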
Step B: RL with verifiable rewards (sharpen first-try accuracy) 🍞 Hook: Imagine a spelling bee where only perfectly spelled words score points. 🥬 The Concept (GRPO-style group RL): Generate multiple attempts per prompt and reward the correct ones; groups whose attempts are all right or all wrong carry no learning signal, so they are replaced via backfill. How it works:
- Sample G=16 rollouts per prompt at temperature τ≈0.85, allowing up to 48k tokens so the model can think at length.
- Reward math by correct final answer; reward code when all tests pass; format rewards ensure clean <think>…</think> Answer.
- Use backfill with a generation cache so batches always carry learning signal without wasted compute. Why it matters: Without group-based, verifiable rewards and stable batching, RL becomes noisy or stalls. 🍞 Anchor: RL nudges the model to be right on the first try more often while trimming needless verbosity.
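To show the "group" part concretely, here is a short sketch of group-relative advantages (reward minus the group mean, scaled by the group's standard deviation); it is a simplification of mine that leaves out clipping, KL control, and the backfill machinery.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center on the group mean and scale by the std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, G=16 rollouts: reward 1.0 if the checker verified the answer.
rewards = [1.0] * 5 + [0.0] * 11
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages[:6]])
# Correct rollouts get a positive advantage, incorrect ones a negative one.
# If all 16 were right (or all wrong), every advantage would be ~0, which is
# why such groups carry no signal and are backfilled.
```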
Step C: Test-time scaling with DeepConf (spend compute wisely) 🍞 Hook: Think of a tournament where weak teams are eliminated early so the best teams get more playtime. 🥬 The Concept (DeepConf@K): Generate many chains but prune low-confidence ones early based on sliding-window scores. How it works:
- Warm-up with 16 chains to set a confidence threshold.
- Continue generating up to K=512 total chains; stop any chain whose recent window falls below threshold.
- Vote among survivors (majority or confidence-weighted) and parse answers robustly via math_verify. Why it matters: Without DeepConf, TTS burns tokens on weak ideas and takes longer. 🍞 Anchor: On AIME25, Falcon-H1R + DeepConf reached ~96.7% while saving ~38% tokens vs a strong 8B baseline.
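The sketch below captures the DeepConf flavour offline: calibrate a threshold on a few warm-up chains, drop chains whose windowed confidence dips below it, then vote. It is a deliberately simplified, assumption-laden version; in real serving the pruning aborts generation as soon as confidence dips, and the paper's exact threshold rule and statistics may differ.

```python
from collections import Counter

def deepconf_vote(chains, n_warmup=16, keep_percentile=0.1):
    """DeepConf-flavoured aggregation over (answer, window_confidences) pairs.

    Simplified offline sketch: warm-up chains set a confidence threshold, later
    chains that ever dip below it are discarded, and the survivors vote.
    """
    # 1) Warm-up: use the first few chains to calibrate the threshold.
    warmup = chains[:n_warmup]
    min_confs = sorted(min(confs) for _, confs in warmup)
    threshold = min_confs[int(keep_percentile * len(min_confs))]

    # 2) Prune: later chains whose confidence dips below the threshold are
    #    dropped (online, generation would be stopped at the dip, saving tokens).
    survivors = [ans for ans, _ in warmup]
    survivors += [ans for ans, confs in chains[n_warmup:] if min(confs) >= threshold]

    # 3) Vote among the surviving chains.
    return Counter(survivors).most_common(1)[0][0]

# Toy demo: 4 warm-up chains, then 4 more; low-confidence chains answer "17".
strong = ("42", [0.8, 0.7, 0.75])
weak = ("17", [0.6, 0.3, 0.4])
chains = [strong, strong, weak, strong, weak, strong, strong, weak]
print(deepconf_vote(chains, n_warmup=4, keep_percentile=0.25))  # -> 42
```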
Step D: Inference engineering (make it fly) 🍞 Hook: Imagine airport lanes opening more booths as the crowd grows. 🥬 The Concept (Hybrid-parallel serving): Combine tensor/data parallelism tuned per model size and batch to maximize throughput. How it works:
- Choose parallelism based on batch and length,
- Use vLLM paged attention for memory efficiency,
- Exploit the hybrid architecture’s strength at long outputs and large batches. Why it matters: Without careful serving, parallel chains would bottleneck and wipe out efficiency gains. 🍞 Anchor: Benchmarks show +20% to +100% throughput over a pure-Transformer 8B baseline at long outputs and larger batches.
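For a sense of how such serving might be wired up, here is a hedged vLLM sketch; the model id is a placeholder and the parallelism, context-length, and sampling numbers are illustrative choices, not the paper's exact configuration.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tiiuae/Falcon-H1R-7B",   # placeholder model id, not a confirmed release name
    tensor_parallel_size=2,          # tune per model size and batch, as described above
    max_model_len=49152,             # room for very long chain-of-thought outputs
)

params = SamplingParams(
    n=16,                # parallel reasoning chains per prompt
    temperature=0.85,
    max_tokens=48000,
    logprobs=1,          # per-token logprobs, usable for confidence-based pruning
)

outputs = llm.generate(
    ["Solve: what is the sum of the first 100 odd numbers?"], params
)
for completion in outputs[0].outputs:
    print(completion.text[:80], "...")
```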
Secret sauce (why this method is clever)
- Train on longer, harder, verified traces so the model truly learns strategies, not slogans.
- Use RL with exact checkers to lift first-try success and curb length.
- Architect for long contexts so those strategies are usable at inference.
- Prune weak chains during TTS so extra compute buys accuracy, not waste.
04Experiments & Results
🍞 Hook: Imagine a school tournament where you not only score points but must finish with fewer attempts than others.
🥬 The Concept (What was tested): The team measured accuracy on hard reasoning tasks and how many tokens the model spent to get there. How it works:
- Standard pass@1 tests on math (AIME24/25, HMMT25, AMO-Bench, MATH500), code (LiveCodeBench v6, SciCode, TB Hard, τ-Telecom), and general (GPQA-Diamond, MMLU-Pro, HLE, IFBench).
- Test-Time Scaling with DeepConf@512 on selected math/science sets, measuring both accuracy and total generated tokens.
- Careful prompts, robust answer parsing (math_verify), and multiple runs where needed. Why it matters: Without measuring both accuracy and token cost, we can’t tell if a method is truly efficient.
🍞 Anchor: Think “A+ while using fewer pages of scratch work.”
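A minimal sketch of the bookkeeping this implies, tracking accuracy and token spend together (my own illustration with made-up numbers, not the evaluation harness):

```python
def pass_at_1_and_tokens(results):
    """Average first-try accuracy and average token spend across problems.

    `results` is a list of dicts like {"correct": bool, "tokens_generated": int},
    one per problem.
    """
    acc = sum(r["correct"] for r in results) / len(results)
    tokens = sum(r["tokens_generated"] for r in results) / len(results)
    return acc, tokens

acc, avg_tokens = pass_at_1_and_tokens([
    {"correct": True, "tokens_generated": 14_200},
    {"correct": False, "tokens_generated": 21_500},
    {"correct": True, "tokens_generated": 9_800},
])
print(f"pass@1 = {acc:.2%}, avg tokens = {avg_tokens:,.0f}")
```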
The competition: Qwen3-8B, DeepSeek-R1-0528-Qwen3-8B, Phi-4-Reasoning-Plus-14B, GPT-OSS-20B, Qwen3-32B, Nemotron-H-47B-Reasoning.
Scoreboard (contextualized):
- Math: Falcon-H1R hits 88.1% (AIME24), 83.1% (AIME25), 64.9% (HMMT25), 36.3% (AMO-Bench), 97.4% (MATH500)—matching or beating many larger models. That’s like a 7th grader tying with high-schoolers who are much taller.
- Code: 68.6% on LiveCodeBench v6—second only to GPT-OSS-20B—showing strong reasoning plus execution.
- General: Competitive on GPQA-Diamond and MMLU-Pro; strong on instruction-following (IFBench 53.4) and HLE (11.1).
DeepConf@512 (the TTS test):
- AIME24/25, GPQA-D, AMO-Bench subset: Falcon-H1R reaches top or near-top accuracy while using substantially fewer tokens than peers.
- Example: On AIME25, ~96.7% accuracy with ~38% fewer tokens than a notable 8B reasoning baseline.
- Voting methods (majority vs confidence-weighted) gave similar results—filtered sets were consistently high quality.
Surprises:
- Single-teacher traces beat multi-teacher mixes: fewer conflicting styles led to more stable learning.
- More rollouts per prompt during SFT helped most on the hardest problems, confirming that diversity teaches resilience.
- Math-first training generalizes well—even to code/science—better than code-first training generalizes back to math.
Meaning of numbers:
- “88% on AIME24” is like getting an A in a class where many strong students are getting Bs.
- “36% on AMO-Bench” sounds modest until you realize the next best is over 10 points lower—these are Olympiad-level brainteasers.
- Token savings under DeepConf mean faster answers, lower bills, and greener compute—without dropping accuracy.
05Discussion & Limitations
🍞 Hook: Even the best pocket calculator can’t replace a library—some tasks need more knowledge.
🥬 The Concept (Honest assessment): Falcon-H1R is a sharp small model for reasoning, but it isn’t magic. Limitations:
- Knowledge-heavy tasks: On GPQA-Diamond and MMLU-Pro, larger models with more memorized facts can still edge ahead.
- Domain imbalance: Emphasis on math may bias gains; science judged by an LLM can provide weaker reward signals.
- Very long answers: Although 20k–48k tokens are supported, extreme cases still cost time and GPU memory.
- Multi-teacher data: Mixing reasoning styles hurt stability and accuracy in ablations.
Required resources:
- Training used 256×H100 GPUs for SFT and RL with long contexts.
- Inference benefits from vLLM and careful parallelism; consumer-grade setups can still leverage the 7B size but will get less throughput.
When not to use:
- Pure factual recall without reasoning (a short, direct answer from a larger knowledge model may be faster).
- Domains where no verifiable reward exists (hard to run RL reliably).
- Ultra-latency-critical settings where even small TTS is too slow.
Open questions:
- How small can we go and still keep this level of reasoning—5B? 3B?
- Can we build better science rewarders than LLM judges to unlock bigger gains?
- How to fuse external tools (search, CAS, theorem provers) without hurting calibration?
- Can pruning be made even smarter—e.g., dynamic regrouping, learned stopping rules?
🍞 Anchor: Think of Falcon-H1R as a fast, well-trained chess player—you can still beat it with a giant opening book (knowledge), but for deep calculation under time pressure, it punches way above its weight.
06Conclusion & Future Work
Three-sentence summary:
- Falcon-H1R proves that a 7B hybrid model, trained on curated long reasoning traces and refined with verifiable-reward RL, can deliver state-of-the-art reasoning at a fraction of the usual cost.
- Its architecture makes long, parallel chain-of-thought feasible, and DeepConf test-time scaling cuts waste by stopping weak paths early.
- The result is high accuracy, strong token efficiency, and fast inference that rivals much larger models.
Main achievement:
- Shifting the accuracy–cost frontier for reasoning: a compact model that matches or surpasses bigger peers while enabling highly efficient test-time scaling.
Future directions:
- Push smaller hybrids further (≤5B) without losing depth, design stronger verifiable rewards for science, and learn adaptive pruning policies beyond simple thresholds.
- Explore tool-augmented reasoning (symbolic math, program analyzers) while preserving calibration and efficiency.
Why remember this:
- Falcon-H1R shows “smart beats big”: with the right data, training, and architecture, small models can think deeply, quickly, and cheaply—bringing advanced reasoning to classrooms, startups, and everyday tools.
Practical Applications
- •Personal math tutors that show steps, check answers, and adapt difficulty on regular laptops.
- •Coding copilots that run tests automatically and favor solutions that truly pass.
- •Homework helpers that explain reasoning clearly and concisely before giving the final answer.
- •Science study aids that attempt multiple solution paths and pick the most reliable one.
- •Customer support bots that think through tricky, multi-step troubleshooting with fewer mistakes.
- •Data analysis assistants that attempt several reasoning chains and report the most consistent findings.
- •Educational content generators that produce verified step-by-step worked examples.
- •On-device reasoning for robotics or IoT tasks where compute and memory are limited.
- •Enterprise decision assistants that prune low-confidence recommendations early to save time.
- •Research aides that can scale their thinking effort dynamically depending on task difficulty.