T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Intermediate
Dmitrii Stoianov, Danil Taranets, Olga Tsymboi et al. · 12/11/2025
arXiv · PDF

Key Summary

  • T-pro 2.0 is an open Russian language model that can answer quickly or think step by step, so you can pick speed or accuracy when you need it.
  • It uses a special Cyrillic-dense tokenizer that breaks Russian words into fewer pieces, making reading and writing in Russian faster and cheaper.
  • A smart guessing trick called EAGLE speculative decoding speeds up responses by about 1.8–2.3× without changing the final answer.
  • The team releases everything openly: model weights, a 500k instruction dataset (T-Wix), an original Russian math benchmark (T-Math), and EAGLE draft weights.
  • On Russian general-knowledge and dialogue tests, T-pro 2.0 competes with top models and beats many popular baselines.
  • On hard Russian reasoning tasks like ruAIME and T-Math, it performs strongly, showing real step-by-step problem-solving skill.
  • The public web demo lets you compare fast answers vs. reasoning mode side by side and watch speed telemetry in real time.
  • Training uses a careful pipeline: instructional midtraining, supervised fine-tuning, reward-model–guided selection, and DPO alignment.
  • Despite being optimized for Russian, the model stays competitive on English reasoning benchmarks.
  • This work builds a reusable, transparent ecosystem for building practical, efficient Russian LLM apps.

Why This Research Matters

Russian speakers get a faster, smarter assistant that can switch between quick answers and transparent reasoning, which builds trust in sensitive tasks like math or policy questions. Businesses can deploy snappy Russian chatbots that stay cost-efficient thanks to shorter tokenization and speculative decoding. Teachers and students gain an original Russian benchmark (T-Math) to track real progress instead of relying on translated tests that may distort difficulty. Developers can reproduce results with open weights, datasets, and a live demo that shows exactly where speedups come from. The approach proves that better tokenizers and decoders—not just bigger models—can dramatically improve user experience. It also lays a path to bring the same recipe to other under-served scripts and languages. Overall, it turns Russian LLM research into a transparent, hands-on field where anyone can build and measure real improvements.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine your school only has English math books, but your class speaks Russian. You could still learn, but it would be slower, messier, and sometimes just wrong. Now imagine the books also have super tiny or chopped-up letters—reading gets even harder.

🥬 The World Before: Large language models (LLMs) first learned to be great at English. Russian models existed, but many strong ones were closed (API only) and most open ones were smaller or adapted from English-focused models. Two big problems followed: (1) tokenizers (the tools that slice text) were built for English, so Russian words got split into too many pieces; and (2) there wasn’t a shared, open playground to test Russian reasoning—especially to compare quick answers vs. step-by-step thinking on the same prompt under the same settings. Even when models could think, the experience felt slow because every token had to be generated one by one.

🍞 Hook: Think of running a relay race where your baton (tokens) is tiny and must be passed a thousand times. That takes time! Now imagine bigger, better batons and a way to hand over several at once.

🥬 The Problem: Researchers wanted a single, open Russian model that can switch gears: (a) answer directly when you want speed, or (b) show its reasoning when you want reliability and transparency. They also needed a faster engine for Russian decoding and a fair way to measure math and reasoning in Russian. Without this, it was hard to study hybrid reasoning in Russian, hard to build apps that feel snappy, and hard to compare models honestly.

🍞 Hook: You know how a cook uses a sharp knife for vegetables and a heavy knife for bones? Using the right tool makes everything smoother.

🥬 Failed Attempts: People tried simply translating English datasets, but translations can be uneven and don’t always capture Russian style or difficulty. Others kept the English-friendly tokenizer, which made Russian text longer and slower to process. Some models pushed reasoning but kept standard slow decoding; others were fast but weak at math-like reasoning. And there were few public, reproducible demos showing how speed tricks really help users.

🍞 Hook: Imagine a puzzle where pieces from another country’s set almost fit yours—but not quite. You can force it, but it won’t be great.

🥬 The Gap: A missing trio stood out: (1) a Russian-friendly tokenizer to shrink sequences; (2) a strong, reasoning-aware training pipeline (including instruction midtraining, SFT, preferences via a reward model, and DPO alignment); and (3) a live, open demo plus benchmarks focused on Russian, especially math, to compare modes (fast vs. think) and measure speedups.

🍞 Hook: Think of a new playground with clear scoreboards, stopwatches, and fair rules. Now everyone can train, compare, and improve.

🥬 Real Stakes: This matters for real life. Shorter tokenization and speculative decoding mean lower costs and snappier chats for businesses, teachers, and students using Russian every day. Hybrid reasoning means a customer bot can reply fast to simple questions but switch to careful reasoning for tough cases. A fair Russian math benchmark (T-Math) helps schools and researchers track progress honestly. And an open demo with telemetry lets developers see exactly how speed tricks help users feel the difference.

🍞 Anchor: Picture a homework helper in Russian. For “What’s the capital of France?” it replies quickly. For “Solve this olympiad geometry problem,” it shows its steps. Both feel fast, because the words are sliced efficiently and the model guesses several tokens ahead when it can, cutting wait time without cutting quality.

02 Core Idea

🍞 Hook: You know how a bike has gears? Low gear to climb carefully, high gear to go fast on flat roads. You switch depending on the path.

🥬 Aha! Moment: Build one open Russian model that can switch between fast direct answers and slower, step-by-step reasoning—and make both modes efficient using a Cyrillic-dense tokenizer, a reasoning-focused training stack, and EAGLE speculative decoding.

  • Analogy 1 (Gears): Direct answer = high gear (speed), reasoning = low gear (control). You choose the gear based on the hill (task difficulty).
  • Analogy 2 (Chef and prep cook): The big model is the chef; a tiny helper (EAGLE draft) prepares likely ingredients (tokens) so the chef works faster without changing the final dish.
  • Analogy 3 (Library index): A better index (Cyrillic-dense tokenizer) means you find the right pages (meaning) quicker, so the whole visit (inference) is shorter.

Before vs. After:

  • Before: Russian text was over-chopped into many tokens; decoding was strictly one-token-at-a-time; reasoning and non-reasoning weren’t easy to compare live.
  • After: Fewer tokens per Russian word; a draft-and-verify decoder speeds things up ~1.8–2.3×; a public demo lets you switch modes and watch speed metrics.

Why It Works (intuition):

  • Shorter sequences mean less total work for the model.
  • Instructional midtraining + SFT + preference optimization teach the model to both follow instructions and reason clearly in Russian.
  • Speculative decoding proposes multiple likely next tokens and only keeps the ones the big model would have picked anyway—so it’s both fast and faithful.

Building Blocks (each with the Sandwich pattern):

🍞 You know how some languages have longer words that shouldn’t be chopped randomly? 🥬 Cyrillic-dense Tokenizer: It’s a Russian-aware text slicer that makes fewer, smarter pieces per word. How: replace rare non-Cyrillic tokens with useful Cyrillic chunks and ensure common Russian parts are directly representable. Why: shorter inputs/outputs → less compute → lower latency and cost. 🍞 Example: “математика” used to take many tokens; now it often takes just 1–2.

🍞 You know how practicing with instructions helps you learn tasks faster? 🥬 Instructional Midtraining: A training step on 40B tokens of instruction-like data (lots in Russian) to adapt the model to the new tokenizer and strengthen reasoning. How: diverse, deduplicated instructions with teacher-regenerated answers. Why: it teaches the model new token pieces and structured problem-solving before fine-tuning. 🍞 Example: After midtraining, the model improves on ruAIME and long-context tasks.

🍞 You know how a teacher marks which answers are better? 🥬 Reward Model (RM): A model that scores completions so we can pick better ones. How: build preference data via knockout-style tournaments and train the RM to prefer higher-quality outputs. Why: without this coach, the model might learn from messy or too-long answers. 🍞 Example: Given 8 candidate solutions, the RM helps pick the best one.

🍞 You know how a tutor gives you targeted practice, not just more pages? 🥬 Supervised Fine-Tuning (SFT) with T-Wix 500k: Curated instruction data (general + reasoning) distilled from strong teachers. How: filter, deduplicate, balance domains/difficulty, and select high-quality completions via RM. Why: focused, clean practice builds reliable skills. 🍞 Example: The model learns to do math, code, and dialogue in Russian with fewer mistakes.

🍞 You know how you can learn faster when feedback is on your own attempts? 🥬 DPO (Direct Preference Optimization): Align the model using comparisons among its own outputs. How: generate multiple on-policy completions, score with RM, and train on best-vs-worst pairs. Why: it directly fixes real failure modes. 🍞 Example: The model stops over-explaining when a short answer is better.

🍞 Imagine a speedy assistant who drafts a few next words for a master writer to approve. 🥬 EAGLE Speculative Decoding: A tiny draft head proposes tokens; the big model verifies them, keeping only those it agrees with. How: dynamic draft trees via SGLang; accept runs of tokens at once. Why: big speedups with the same final answer quality. 🍞 Example: STEM questions get ~2× faster because the next words are more predictable.

🍞 Think of a fair test made by Russian math teachers. 🥬 T-Math Benchmark: A set of 331 original Russian olympiad problems with numeric answers. How: extract, filter, human-verify, and auto-grade. Why: translated tests can hide difficulty; T-Math reveals true Russian reasoning skill. 🍞 Example: T-pro 2.0 scores 0.54 pass@1, competitive with strong open models.

Bottom Bread (Anchor): In the public demo, you type a Russian math question. The model in reasoning mode shows steps; in fast mode it’s brief. You see their speeds and token stats live, thanks to the tokenizer and EAGLE working together.

03 Methodology

High-level recipe: Input (Russian/English prompt) → Cyrillic-dense Tokenizer → Hybrid LLM (T-pro 2.0) → EAGLE speculative decoding (draft + verify) → Output (direct answer or reasoning trace), with a training pipeline behind it (midtraining → SFT → RM → DPO → EAGLE training).

Step 1: Cyrillic-dense Tokenizer

  • What happens: The prompt is split into fewer, more meaningful tokens for Russian (and similar Cyrillic languages).
  • Why it exists: If words are chopped into too many parts, sequences get long, slow, and costly. Fewer tokens = faster processing.
  • Example: On Russian Wikipedia, words tokenized into ≤2 tokens jump from 38% (Qwen3) to 60% with T-pro’s tokenizer.
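
To make Step 1 measurable, here is a minimal sketch of how you could compare tokenizer "fertility" (tokens per word) on Russian text with the Hugging Face tokenizers API. The repository ids are illustrative assumptions, not confirmed release paths.

```python
# Minimal sketch: measuring tokenizer "fertility" (tokens per word) on Russian text.
# The repository ids below are illustrative assumptions, not confirmed release paths.
from transformers import AutoTokenizer

SAMPLE = "Математика и программирование на русском языке"

def tokens_per_word(repo_id: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(repo_id)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return n_tokens / len(text.split())

for repo_id in ["Qwen/Qwen3-32B", "t-tech/T-pro-it-2.0"]:  # second path is hypothetical
    print(repo_id, round(tokens_per_word(repo_id, SAMPLE), 2))
```

A lower tokens-per-word number means shorter sequences, and therefore less compute per request.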

Step 2: Instructional Midtraining (40B tokens)

  • What happens: Starting from Qwen3-32B, the model practices on instruction-style data dominated by Russian and reasoning tasks, with regenerated teacher answers for consistency.
  • Why it exists: It teaches the model how to use the new tokenizer pieces and strengthens Russian reasoning before fine-tuning.
  • Example: ruAIME 2024 improves from 0.60 to 0.67 with instruct-only midtraining versus a pretrain+instruct mix.

Step 3: Build a Reward Model (RM)

  • What happens: Create preference data via knockout tournaments among multiple model outputs; train an RM that scores which output is better.
  • Why it exists: To select high-quality completions for SFT/DPO and reduce noisy or overly long answers.
  • Example: In Arena-Hard-RU Best-of-8, their RM achieves the strongest separation (ΔBoN), meaning it’s good at picking winners and spotting poor answers.
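
Here is a small sketch of what knockout-style preference collection for Step 3 could look like. The `judge` callable (an LLM judge or a human label) and the exact bracket rules are assumptions for illustration; the paper's protocol may differ in its details.

```python
# Sketch of knockout-style preference collection. The judge callable (an LLM judge
# or human annotator) and the exact bracket rules are assumptions for illustration.
import random
from typing import Callable, List, Tuple

def knockout_preferences(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str, str], int],  # returns 0 if the first answer wins, 1 otherwise
) -> Tuple[str, List[Tuple[str, str]]]:
    """Single-elimination bracket; every judged pair becomes a (chosen, rejected) example."""
    pairs: List[Tuple[str, str]] = []
    pool = candidates[:]
    random.shuffle(pool)
    while len(pool) > 1:
        next_round = []
        for a, b in zip(pool[::2], pool[1::2]):
            winner, loser = (a, b) if judge(prompt, a, b) == 0 else (b, a)
            pairs.append((winner, loser))   # preference pair for reward-model training
            next_round.append(winner)
        if len(pool) % 2 == 1:              # odd candidate gets a bye into the next round
            next_round.append(pool[-1])
        pool = next_round
    return pool[0], pairs                   # tournament winner + all collected pairs
```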

Step 4: Supervised Fine-Tuning (SFT) with T-Wix 500k

  • What happens: Fine-tune on a curated mix: 468k general instructions (balanced by domain and difficulty) plus ~30k reasoning traces distilled from teachers and filtered by the RM.
  • Why it exists: Clean, balanced practice helps the model follow instructions, explain steps, and stay concise.
  • Example: Prompts include math, code, science, writing, and long-context tasks up to 32k tokens; answers are selected from 8 teacher candidates by RM score.

Step 5: On-policy DPO Alignment

  • What happens: The SFT model generates 16 candidates per instruction; the RM scores them; training uses best-vs-worst pairs (100k pairs total) to push the model toward preferred behavior.
  • Why it exists: It fixes the model’s real mistakes on its own outputs, improving coherence and instruction following without expensive online RL.
  • Example: The model reduces rambling and follows constraints better after DPO.
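
The best-vs-worst pairing in Step 5 is straightforward to sketch. The `rm_score` callable below is a placeholder for the trained reward model, and the output schema follows the common `{"prompt", "chosen", "rejected"}` format that off-the-shelf DPO trainers (e.g., TRL's `DPOTrainer`) accept.

```python
# Sketch of best-vs-worst pair construction for DPO. rm_score() is a placeholder for
# the trained reward model; the output schema matches common DPO trainer datasets.
from typing import Callable, Dict, List

def build_dpo_pair(
    prompt: str,
    completions: List[str],                 # e.g. 16 on-policy samples from the SFT model
    rm_score: Callable[[str, str], float],
) -> Dict[str, str]:
    ranked = sorted(completions, key=lambda c: rm_score(prompt, c))
    return {
        "prompt": prompt,
        "chosen": ranked[-1],               # highest-scoring completion
        "rejected": ranked[0],              # lowest-scoring completion
    }
```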

Step 6: EAGLE Speculative Decoding Integration

  • What happens: A tiny draft head (1 decoder layer + FR-Spec) guesses several possible next tokens; the main 32B model verifies them fast using SGLang’s dynamic draft trees.
  • Why it exists: It keeps answers identical to standard decoding but speeds up generation by accepting multiple tokens at once when confident.
  • Example: At temperature 0.8, average speedup is ~1.85× (non-think) and ~1.83× (think). STEM domains reach ~2.0×.
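
To see why accepted runs cannot change the final answer, here is a toy, library-free sketch of draft-and-verify decoding in the greedy case. The `draft_next` and `target_next` callables are placeholders, not the paper's EAGLE head or the SGLang API, and real systems verify a whole draft tree in one batched forward pass.

```python
# Toy, library-free sketch of draft-and-verify decoding in the greedy case.
# draft_next / target_next are placeholders, not the EAGLE head or the SGLang API.
from typing import Callable, List

def speculative_step(
    context: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model, greedy next token
    target_next: Callable[[List[int]], int],  # big target model, greedy next token
    k: int = 4,
) -> List[int]:
    """Propose k draft tokens, keep the prefix the target agrees with, then let the
    target emit one token itself, so the result equals plain greedy decoding."""
    # 1) The cheap draft proposes a run of k tokens.
    drafted, ctx = [], context[:]
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2) The target verifies: accept while the draft matches its own greedy choice.
    #    (Real systems verify the whole run in ONE batched forward pass, which is
    #    where the speedup comes from; this loop is only for clarity.)
    accepted, ctx = [], context[:]
    for t in drafted:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3) The target always contributes one guaranteed token, so progress is >= 1.
    accepted.append(target_next(ctx))
    return accepted
```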

Step 7: Public Demo and Telemetry

  • What happens: A web UI lets you run T-pro 2.0 (with EAGLE) vs. a baseline (Qwen3-32B) side by side, in either fast or reasoning mode; it streams tokens and displays speed, tokens/sec, and acceptance length.
  • Why it exists: To make speedups, verbosity changes, and reasoning clarity visible and reproducible.
  • Example: You can try a preset from T-Math and see EAGLE accept runs of ~3.3 tokens on average, explaining the speedup.
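
A minimal sketch of the kind of telemetry the demo reports follows; the exact metric definitions inside the demo may differ, and the two input lists are assumptions about what gets logged per request.

```python
# Sketch of demo-style telemetry: throughput and mean acceptance length.
# token_times: wall-clock timestamps per emitted token; accept_lengths: tokens
# committed at each verification step (both are assumed logging formats).
from typing import List

def summarize_run(token_times: List[float], accept_lengths: List[int]) -> dict:
    elapsed = token_times[-1] - token_times[0] if len(token_times) > 1 else 0.0
    return {
        "tokens": len(token_times),
        "tokens_per_sec": len(token_times) / elapsed if elapsed > 0 else float("nan"),
        "mean_accept_len": sum(accept_lengths) / len(accept_lengths) if accept_lengths else 0.0,
    }

# A mean_accept_len near 3.3 means each verified draft run commits ~3 tokens at once,
# which is what produces the ~1.8-2x wall-clock speedups.
```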

The Secret Sauce:

  • Three things combine: (1) fewer tokens via a Cyrillic-dense tokenizer, (2) better brains via instruct-only midtraining + curated SFT + on-policy DPO guided by a strong RM, and (3) a turbocharger via EAGLE speculative decoding that preserves answer quality while cutting wait time.

Concrete mini walk-through:

  • Input: “Реши: 2x + 5 = 15. Сколько x?” (“Solve: 2x + 5 = 15. What is x?”)
  • Tokenizer: Splits Russian words efficiently; sequence is short.
  • Model (fast mode): Outputs “5” quickly.
  • Model (reasoning mode): “2x + 5 = 15 → 2x = 10 → x = 5. Ответ: 5.” (“Answer: 5.”)
  • EAGLE: Draft head proposes ‘x’, ‘=’, ‘5’ in a run; verifier accepts, so you see faster streaming without changing the final answer.

What breaks without each step:

  • No tokenizer adaptation: Russian becomes longer and slower, hurting latency and cost.
  • No midtraining: The model struggles to use new token pieces and fails to build robust reasoning habits.
  • No RM/DPO: The model keeps wordy or low-quality patterns.
  • No EAGLE: You wait longer for the same answer, especially on long reasoning traces.
  • No demo: It’s hard to prove, compare, or reproduce the speed and quality claims.

04 Experiments & Results

The Tests and Why:

  • General knowledge/dialogue (MERA, ruMMLU-Pro, Arena Hard Ru, WildChat Hard Ru) to check broad understanding and chat quality in Russian.
  • Reasoning (T-Math, ruAIME 2024/2025, ruMATH-500, ruGPQA, ruLCB, Vikhr Math/Physics) to measure step-by-step problem-solving in Russian.
  • Speed/efficiency (ruMT-Bench, ruAlpaca, ruCodeEval, ruMMLU-Pro by domain) to quantify speculative decoding gains.

The Competition:

  • Open baselines: Qwen3-32B, RuAdaptQwen3-32B-Instruct, Gemma 3 27B, DeepSeek-R1-Distill-Qwen-32B.
  • Larger/proprietary references: DeepSeek-V3, DeepSeek-R1, YandexGPT, GigaChat, GPT-4o/o4-mini.

Scoreboard with Context:

  • General knowledge/dialogue: T-pro 2.0 scores 0.66 on MERA and 0.697 on ruMMLU-Pro—close to top-tier numbers in its class. In Arena Hard Ru it reaches 91.1, and 72.6 on WildChat Hard Ru, outperforming all open-source systems and many proprietary ones. Think of this as scoring an A when many others got B’s or C’s.
  • Reasoning: On T-Math, T-pro 2.0 achieves 0.541 pass@1, competitive with strong open baselines; on ruAIME 2024/2025 it gets 0.704/0.646, far above several well-known models. On ruMATH-500 it hits 0.94—like getting an A on a very hard math exam.
  • English retention: Despite the Russian focus, T-pro 2.0 remains strong on English (AIME 2024: 0.765; MATH-500: 0.966), showing the tokenizer and training didn’t break cross-lingual ability.
  • Speedups: With EAGLE, average speedup is ~1.85× in non-reasoning and ~1.83× in reasoning at temperature 0.8; STEM domains reach ~1.99× on ruMMLU-Pro. Acceptance length hovers around ~3.3 tokens, meaning the model often approves short runs at once.
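
For context on the reasoning numbers above, this is roughly how numeric auto-grading and pass@1 work for a benchmark like T-Math; the paper's exact answer-extraction and matching rules may differ.

```python
# Rough sketch of numeric auto-grading and pass@1 for a benchmark like T-Math.
# The paper's exact answer-extraction and matching rules may differ.
from fractions import Fraction
from typing import List

def numeric_match(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """Compare a model's final numeric answer against the reference value."""
    try:
        return abs(float(Fraction(pred.strip())) - float(Fraction(gold.strip()))) <= tol
    except (ValueError, ZeroDivisionError):
        return False

def pass_at_1(predictions: List[str], golds: List[str]) -> float:
    """With one sampled solution per problem, pass@1 reduces to graded accuracy."""
    return sum(numeric_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
```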

Surprising/Notable Findings:

  • Instruct-only midtraining (no raw pretrain mix) improved reasoning more than a mixed pretrain+instruct blend with the same token budget—showing targeted practice beats generic continuation at this stage.
  • STEM vs. humanities: EAGLE helps more in STEM because token predictions are more regular, so the draft guesses are accepted more often.
  • Russian-first doesn’t mean English-last: The Cyrillic-dense tokenizer and Russian-heavy training did not materially hurt English reasoning—an encouraging sign for multilingual adaptability.
  • T-Math remains unsolved: Even frontier models don’t exceed ~0.75 pass@1, confirming it’s a genuinely hard, revealing Russian benchmark.

What It Means in Plain Terms:

  • The model can chat well in Russian and solve tough problems with competitive accuracy.
  • It stays fast thanks to smarter token slicing and smart guessing.
  • It’s open and reproducible enough that others can verify these claims using the same demo, data, and pipelines.

05 Discussion & Limitations

Limitations (honest assessment):

  • Tool use and agent skills: No special training for function calling or complex multi-tool workflows, so it may lag models specialized for agents.
  • Offline-only alignment: Used DPO but not online RL like PPO/GRPO, which may limit robustness on out-of-domain or tricky edge cases.
  • Long context beyond 32k unverified: although 128k is theoretically reachable via RoPE scaling, it wasn’t empirically validated here (a configuration sketch follows after this list).
  • Reproducibility caveat: Some proprietary data were used in midtraining/DPO. The team released T-Wix, T-Math, and model/EAGLE weights to help, but a perfect end-to-end rerun may be hard.
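
On the long-context point above: for Qwen-family models, extending context via RoPE scaling usually means enabling a YaRN-style `rope_scaling` entry in the Hugging Face config. A minimal sketch under those assumptions follows; the repo id is a placeholder, and the authors did not validate quality beyond 32k tokens.

```python
# Sketch only: Qwen-family models usually expose longer contexts via YaRN-style
# rope_scaling in the Hugging Face config. The repo id is a placeholder, and the
# authors did not validate quality beyond 32k tokens.
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "t-tech/T-pro-it-2.0"                      # hypothetical path

cfg = AutoConfig.from_pretrained(REPO)
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                                # 32k trained positions * 4 ≈ 128k
    "original_max_position_embeddings": 32768,
}
cfg.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(REPO, config=cfg)
```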

Required Resources:

  • Serving: One H100 GPU per endpoint in the demo (T-pro 2.0 with EAGLE; baseline without). SGLang for draft-tree decoding and telemetry.
  • Training: Multi-node H100s for midtraining/SFT/DPO and EAGLE draft training; long-context parallelism and memory-aware training tricks.

When Not to Use:

  • Heavy tool-calling or complex multi-turn agent tasks (no special optimization here).
  • Ultra-long documents (>32k tokens tested) where retrieval across 100k+ tokens is critical.
  • High-stakes, regulated domains (medical/legal) without human review and additional safety controls.
  • Languages far from Cyrillic, where tokenizer gains won’t help.

Open Questions:

  • Can EAGLE-3 and quantization bring even bigger speedups on smaller GPUs without quality loss?
  • How much would online RL (PPO/GRPO) improve robustness over DPO alone?
  • Can we validate and extend stable performance to 64k–128k tokens?
  • How to further reduce length bias in reward modeling and judging prompts?
  • What’s the best recipe to generalize this approach to other under-served scripts (e.g., Georgian, Armenian)?

06 Conclusion & Future Work

Three-sentence summary: T-pro 2.0 is an open Russian model that switches between fast direct answers and step-by-step reasoning while staying efficient. It achieves this with a Cyrillic-dense tokenizer, a reasoning-focused training pipeline (midtraining → SFT → RM-guided selection → DPO), and EAGLE speculative decoding. The team releases the model, a 500k instruction set (T-Wix), an original Russian math benchmark (T-Math), and a live demo with telemetry.

Main achievement: Showing that tokenizer design and speculative decoding, combined with targeted instruction-based training and preference optimization, can deliver both strong Russian reasoning and fast inference—without increasing model size.

Future directions: Add tool use and agent skills, test online RL for robustness, validate longer contexts (64k–128k), integrate EAGLE-3 and quantization, and extend the tokenizer+pipeline approach to more scripts/languages.

Why remember this: It’s a full, open ecosystem—model, data, benchmark, and demo—that proves speed and reasoning can coexist in a practical Russian LLM. It also reminds us that the “small parts” (tokenizers and decoders) are big levers for real-world usefulness, not just afterthoughts.

Practical Applications

  • Build a Russian homework helper that answers factual questions fast and shows steps for hard math.
  • Create a customer support bot that replies instantly for simple requests but switches to reasoning mode for tricky cases.
  • Develop an internal coding assistant for Russian-speaking teams that explains solutions clearly.
  • Deploy low-latency FAQ chat for e-government services where speed and clarity both matter.
  • Run fair Russian evaluations of new LLMs using the T-Math benchmark and the public demo.
  • Use the RM+DPO pipeline to align specialized Russian assistants (finance, education) with concise, correct styles.
  • Serve bilingual (Russian/English) chat reliably without sacrificing English reasoning quality.
  • Prototype classroom tools that reveal reasoning traces to help teachers grade thinking, not just answers.
  • Integrate EAGLE decoding in production to cut response latency and cloud costs on Russian workloads.
  • Quantize and extend the EAGLE draft model for smaller GPUs to enable edge or on-prem deployments.
#T-pro 2.0#Russian LLM#Hybrid reasoning#Cyrillic-dense tokenizer#Speculative decoding#EAGLE#Supervised fine-tuning#Reward model#Direct Preference Optimization (DPO)#Instructional midtraining#T-Wix dataset#T-Math benchmark#ruMMLU-Pro#Arena Hard Ru#Inference efficiency