TranslateGemma Technical Report
Key Summary
- TranslateGemma is a family of open machine translation models fine-tuned from Gemma 3 to translate many languages more accurately.
- It learns in two stages: first by studying many trusted translation examples (human and high-quality synthetic), then by practicing and getting rewarded for better outputs (reinforcement learning).
- A smart mix of reward models (MetricX-QE, AutoMQM, ChrF, a Naturalness judge, and a generalist judge) guides the model toward fewer errors and more natural-sounding text.
- Across 55 language pairs on the WMT24++ benchmark, TranslateGemma beats the original Gemma 3 models at all sizes.
- Smaller TranslateGemma models often match or beat larger baseline models, giving strong results with less compute.
- The model keeps its multimodal skills and even improves on image-translation tasks in the Vistra benchmark without extra multimodal training.
- Human evaluations (MQM) across 10 directions agree with the automatic scores, with especially big gains for lower-resource languages.
- Careful data curation (including MADLAD-400 sources and SMOL/GATITOS human data) and robust prompts help ensure reliability and keep general abilities intact.
Why This Research Matters
Better translation means more people can learn, work, and get help in their own language. TranslateGemma lifts quality not just for popular languages but also for lower-resource ones, making the internet fairer. It runs efficiently, so even smaller models can deliver strong results on regular hardware. Because it keeps multimodal skills, it can help with real-world tasks like understanding signs or documents from photos. The open release invites researchers and builders everywhere to adapt and improve it. In emergencies or global collaboration, clearer communication saves time and reduces costly misunderstandings.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how talking with people from different countries can be tricky if you don’t share a language? Even simple things like reading signs or understanding a message from a friend can become puzzles.
🥬 Filling (The Concept: Machine Translation, MT)
- What it is: Machine Translation is when computers turn text from one language into another.
- How it works:
- Read the source sentence.
- Figure out its meaning and structure.
- Write the same meaning in the target language.
- Why it matters: Without MT, information stays locked behind language walls, and people can’t learn, share, or work together as easily.
🍞 Bottom Bread (Anchor): When you type “Hello” and ask for Spanish, MT gives you “Hola.”
🍞 Top Bread (Hook): Imagine a super student who has read millions of books and practiced answering all kinds of questions.
🥬 Filling (The Concept: Large Language Models, LLMs)
- What it is: An LLM is a very big computer model trained to understand and generate language.
- How it works:
- Study huge amounts of text to learn patterns.
- Predict the next words that sound right and make sense.
- Adjust its guesses based on feedback.
- Why it matters: Without LLMs, translations would be more literal, miss cultural meaning, and break on tricky phrases.
🍞 Bottom Bread (Anchor): An LLM helps choose “break a leg” → “¡mucha suerte!” in Spanish, not the silly “rompe una pierna.”
🍞 Top Bread (Hook): Think of a bilingual workbook where every sentence in English has the correct French version right next to it.
🥬 Filling (The Concept: Parallel Data)
- What it is: Parallel data is pairs of matching texts in different languages.
- How it works:
- Line up a source sentence and its correct translation.
- Train the model to turn the first into the second.
- Repeat for millions of pairs.
- Why it matters: Without parallel data, the model can’t directly learn how ideas map between languages.
🍞 Bottom Bread (Anchor): “I love cats.” ↔ “J’adore les chats.”
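To make this concrete, parallel data is often stored as simple source/target records. Here is a minimal sketch in Python; the file name and field names are illustrative, not taken from the report:

```python
import json

# Illustrative parallel-data records: each entry pairs a source sentence
# with its reference translation and the two language codes.
pairs = [
    {"src_lang": "en", "tgt_lang": "fr", "source": "I love cats.", "target": "J'adore les chats."},
    {"src_lang": "en", "tgt_lang": "es", "source": "Hello", "target": "Hola"},
]

# Write as JSON Lines, a common storage format for training corpora.
with open("parallel_data.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```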
🍞 Top Bread (Hook): Suppose you don’t have enough bilingual workbooks. What if a very strong translator writes extra examples for you?
🥬 Filling (The Concept: Synthetic Data)
- What it is: Synthetic data is translation pairs created by powerful models instead of humans.
- How it works:
- Take high-quality monolingual sentences.
- Use a top model to translate them into the other language.
- Filter and keep only the best ones.
- Why it matters: Without synthetic data, low-resource languages stay under-taught and models lag behind.
🍞 Bottom Bread (Anchor): Take a Swahili news sentence, have a top model translate it to English, then keep the best version to train on.
🍞 Top Bread (Hook): Imagine a science fair where judges give careful scores to every project.
🥬 Filling (The Concept: Quality Estimation, QE)
- What it is: QE checks how good a translation is, even without a human reference.
- How it works:
- Look at the source and the translation.
- Predict a quality score.
- Prefer translations with better scores.
- Why it matters: Without QE, we’d keep noisy or wrong synthetic examples and teach the model bad habits.
🍞 Bottom Bread (Anchor): If two French versions of an English sentence exist, QE helps pick the one that’s more faithful and fluent.
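A tiny sketch of QE-based selection: given a hypothetical scoring function `qe_score(source, translation)` that returns a higher-is-better estimate (think of a rescaled MetricX-QE score), picking the best candidate is just an argmax:

```python
from typing import Callable

def pick_best(source: str, candidates: list[str],
              qe_score: Callable[[str, str], float]) -> str:
    """Return the candidate the QE scorer rates highest.

    qe_score(source, translation) is assumed to give a higher-is-better
    estimate, e.g. a rescaled MetricX-QE score (hypothetical interface).
    """
    return max(candidates, key=lambda cand: qe_score(source, cand))

# Toy usage with a stand-in scorer; a real system would call a QE model here.
candidates = ["J'aime les chats.", "J'adore les chats."]
best = pick_best("I love cats.", candidates, qe_score=lambda s, t: len(t))
```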
🍞 Top Bread (Hook): Think of a world championship for translation where teams test their best.
🥬 Filling (The Concept: WMT Benchmarks)
- What it is: WMT is a well-known set of translation tests covering many language pairs.
- How it works:
- Provide standard test sets.
- Compare systems fairly with common metrics.
- Track progress across years and languages.
- Why it matters: Without WMT, we wouldn’t know if a new model truly improves over others.
🍞 Bottom Bread (Anchor): If TranslateGemma scores higher on WMT24++, we can trust it’s genuinely better.
🍞 Top Bread (Hook): Picture a Swiss Army knife that can read, write, and even understand images.
🥬 Filling (The Concept: Multimodality)
- What it is: Multimodality means a model can handle different input types, like text and pictures.
- How it works:
- Accept an image.
- Find the text content in it.
- Translate that text into the target language.
- Why it matters: Without multimodality, you can’t translate street signs or posters from photos.
🍞 Bottom Bread (Anchor): Show a picture of a bakery sign in German; the model outputs the English translation of the sign.
Before this paper, strong open models existed, but they weren’t specialized for top-tier translation across many languages while staying efficient and multimodal. People tried bigger models or more raw data, but that can be expensive and noisy. The gap: a clear, open recipe that combines high-quality synthetic + human data with targeted reinforcement to push translation quality across the board, not just in a few languages. TranslateGemma fills that gap with a two-stage process and careful reward design, showing wins that matter for daily life: reading global news, helping families communicate, and supporting humanitarian work where accurate language bridging saves time and avoids confusion.
02 Core Idea
🍞 Top Bread (Hook): Imagine training for a spelling bee: first you study the answer key, then you do practice rounds where coaches score your performance and you improve from feedback.
🥬 Filling (The Concept: The Key Insight)
- What it is: The “aha!” is to first fine-tune on curated parallel data (human + high-quality synthetic), then use reinforcement learning with multiple reward judges to polish translations for accuracy and naturalness.
- How it works:
- Stage 1 (Supervised Fine-Tuning, SFT): Learn directly from correct translation pairs.
- Stage 2 (Reinforcement Learning, RL): Generate translations, score them with several reward models, and push the system toward better outputs.
- Keep general skills by mixing in instruction-following data.
- Why it matters: Without the two stages, you either learn rules without polish (only SFT) or polish without a good base (only RL); together, you get both.
🍞 Bottom Bread (Anchor): Like studying the answer sheet (SFT) and then doing scored practice rounds (RL) until your answers are accurate and sound native.
Multiple Analogies:
- Sports: SFT is learning the playbook; RL is scrimmaging with a coach giving instant feedback on mistakes and great moves.
- Cooking: SFT is following a trusted recipe; RL is taste-testing and adjusting seasoning until it’s perfect.
- Music: SFT is reading sheet music; RL is rehearsing with a conductor who corrects timing and tone.
🍞 Top Bread (Hook): You know how a superhero team beats a villain better than a lone hero?
🥬 Filling (The Concept: Reward Model Ensemble)
- What it is: Several specialized judges score translations from different angles (faithfulness, fluency, naturalness, general abilities).
- How it works:
- MetricX-QE checks overall translation quality without needing a reference.
- AutoMQM flags error spans and severity.
- ChrF compares overlap with a synthetic reference.
- A Naturalness autorater penalizes “non-native-sounding” text.
- A generalist judge preserves broader skills.
- Why it matters: Without a team of judges, the model might get good at one thing (like word overlap) and bad at others (like sounding natural).
🍞 Bottom Bread (Anchor): It’s like having a grammar judge, a style judge, a faithfulness judge, and a “does this sound human?” judge—then combining their advice.
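One simple way to combine a team of judges is a weighted sum of their scores. The sketch below is illustrative only: the judge functions and weights are stand-ins, not the actual configuration used for TranslateGemma:

```python
from typing import Callable, Dict

def ensemble_reward(source: str, translation: str,
                    judges: Dict[str, Callable[[str, str], float]],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-judge scores, all assumed rescaled to higher-is-better."""
    return sum(weights[name] * judge(source, translation)
               for name, judge in judges.items())

# Stand-in judges for illustration; real judges would be MetricX-QE, AutoMQM,
# ChrF against a synthetic reference, a naturalness autorater, and so on.
judges = {
    "quality":     lambda s, t: 1.0,   # placeholder for a rescaled MetricX-QE score
    "naturalness": lambda s, t: 0.8,   # placeholder for the naturalness judge
}
weights = {"quality": 0.7, "naturalness": 0.3}
score = ensemble_reward("I love cats.", "J'adore les chats.", judges, weights)
```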
🍞 Top Bread (Hook): Think of corrections directly on the exact words you said, not just a final grade.
🥬 Filling (The Concept: Token-level Advantages)
- What it is: Token-level advantages give fine-grained feedback on specific word spans, not just a single score for the whole sentence.
- How it works:
- Compute a sentence-level reward (e.g., MetricX).
- Add span-level signals (e.g., AutoMQM’s error highlights).
- Combine them so the model knows exactly which words helped or hurt.
- Why it matters: Without token-level signals, the model struggles to learn which parts to fix.
🍞 Bottom Bread (Anchor): If “bank” was mistranslated as a river bank instead of a money bank, token-level feedback points to that exact word.
Before vs After:
- Before: Gemma 3 is strong but general; translation quality varies by language and model size.
- After: TranslateGemma is specialized; it outperforms across 55 language pairs, and smaller models rival bigger baselines.
Why It Works (Intuition):
- Curated supervision builds a solid base.
- Diverse rewards prevent reward-hacking and push both accuracy and naturalness.
- Token-level signals make learning efficient by pointing exactly where to improve.
- Keeping instruction-following data avoids overfitting so the model stays helpful beyond translation.
Building Blocks:
- High-quality data pipeline (human + filtered synthetic).
- Supervised fine-tuning with safe training choices (e.g., freezing embeddings).
- Reinforcement learning with multiple reward models.
- Advantage computation that blends sentence- and token-level feedback.
- A reliable, simple translation prompt that aligns training and inference.
03 Methodology
At a high level: Input (text or image with text) → Stage 1: Supervised Fine-Tuning (learn from pairs) → Stage 2: Reinforcement Learning (polish with rewards) → Output (translated text).
Stage 1: Supervised Fine-Tuning (SFT)
🍞 Top Bread (Hook): Imagine you’re learning French by copying from a perfect answer key.
🥬 Filling (The Concept: SFT)
- What it is: The model learns from correct translation pairs (parallel data) with teacher-forced examples.
- How it works:
- Mix human-translated data (SMOL, GATITOS) and high-quality synthetic pairs.
- Include 30% generic instruction-following data to keep general skills.
- Fine-tune Gemma 3 checkpoints (27B/12B/4B) with AdaFactor, LR=1e-4, batch=64, 200k steps.
- Update all parameters except the embeddings, which stay frozen (this helps with unseen scripts and languages).
- Why it matters: Without SFT, RL would start from a wobbly base and learn slower.
🍞 Bottom Bread (Anchor): Train on “I’m hungry.” ↔ “Tengo hambre.” thousands of times until it becomes second nature.
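The reported SFT settings can be summarized in a small config object. This is just an illustrative restatement of the hyperparameters listed above, not the actual training code (the report used Kauldron-based tooling):

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    # Values as described above for TranslateGemma SFT; field names are illustrative.
    optimizer: str = "adafactor"
    learning_rate: float = 1e-4
    batch_size: int = 64
    train_steps: int = 200_000
    freeze_embeddings: bool = True           # helps generalize to unseen scripts/languages
    instruction_data_fraction: float = 0.30  # generic instruction data kept in the mix

config = SFTConfig()
```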
Data Creation and Filtering
🍞 Top Bread (Hook): Think of picking the best practice sentences so you don’t waste time on bad examples.
🥬 Filling (The Concept: Synthetic Data Pipeline)
- What it is: A careful process generates and filters synthetic translations from strong models.
- How it works:
- Start with monolingual sources from MADLAD-400; bucket sentences by length.
- Sample 1M candidate sources per language pair.
- Pre-filter sources by comparing two Gemini 2.5 Flash samples (greedy vs. temperature 1.0) using MetricX 24-QE; keep those where sampling helps most.
- For selected sources, generate 128 translations with Gemini 2.5 Flash.
- Score with MetricX 24-QE and keep the best; generate both single sentences and up-to-512-token blobs.
- Run a formatting filter (with Gemini 2.5 Flash) to remove weird outputs.
- Why it matters: Without strict filtering, noise sneaks in and hurts quality.
🍞 Bottom Bread (Anchor): From 128 candidate Spanish versions of one English paragraph, the pipeline keeps the single best, cleanly formatted one.
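The core of the pipeline is a best-of-N filter. Here is a minimal sketch where `translate`, `qe_score`, and `is_well_formatted` are hypothetical stand-ins for Gemini 2.5 Flash sampling, MetricX-24-QE scoring, and the formatting filter:

```python
from typing import Callable, Optional

def best_of_n(source: str,
              translate: Callable[[str], str],
              qe_score: Callable[[str, str], float],
              is_well_formatted: Callable[[str], bool],
              n: int = 128) -> Optional[str]:
    """Sample n candidate translations, drop badly formatted outputs, and keep
    the single candidate a reference-free QE scorer likes most (or None if
    every candidate was filtered out)."""
    candidates = [translate(source) for _ in range(n)]
    candidates = [c for c in candidates if is_well_formatted(c)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: qe_score(source, c))
```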
Human Data for Coverage
🍞 Top Bread (Hook): When maps have blank spots, you ask local experts for directions.
🥬 Filling (The Concept: Human Parallel Data for Low-Resource Languages)
- What it is: High-quality human translations (SMOL, GATITOS) expand language/script coverage.
- How it works:
- Add diverse language pairs beyond the benchmark set.
- Balance with synthetic data to avoid overfitting.
- Use in SFT (and mostly not in RL) to ground the model.
- Why it matters: Without human data, rare languages can remain weak.
🍞 Bottom Bread (Anchor): Professional translations in Marathi and Swahili make the model more reliable for those communities.
Stage 2: Reinforcement Learning (RL)
🍞 Top Bread (Hook): After studying, you do practice tests while coaches highlight mistakes and give points.
🥬 Filling (The Concept: RL for Translation)
- What it is: The model generates translations, receives scores, and updates to prefer higher-scoring outputs.
- How it works:
- Use an ensemble of reward models:
- MetricX-24-XXL-QE (reference-free quality; rescaled so higher is better).
- Gemma-AutoMQM-QE (span-level error detection; uses standard MQM weights).
- ChrF (character n-gram overlap; the only reward that compares against references, which here are synthetic).
- Naturalness Autorater (LLM-as-a-judge penalizing non-native phrasing).
- Generalist reward (maintains reasoning/instruction/multilingual breadth).
- Compute advantages:
- Start with sequence-level rewards (reward-to-go across tokens).
- Add token/span-level advantages from AutoMQM and Naturalness.
- Batch-normalize combined advantages.
- Update the policy to increase the chance of high-advantage tokens/sequences.
- Why it matters: Without multi-judge rewards and token-level guidance, the model might chase the wrong signal or not know what to fix.
🍞 Bottom Bread (Anchor): If “read” (present) vs “read” (past) is wrong in a translation, span-level feedback points to the exact word to change.
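A minimal sketch of how a sequence-level reward and span-level penalties might be blended into per-token advantages and then batch-normalized; the exact weighting in the report may differ:

```python
import numpy as np

def token_advantages(seq_reward: float, span_penalties: np.ndarray) -> np.ndarray:
    """Blend one sequence-level reward with per-token span penalties.

    seq_reward: scalar from e.g. MetricX-QE (rescaled, higher is better),
        broadcast to every token as a reward-to-go style signal.
    span_penalties: per-token penalties, e.g. derived from AutoMQM error
        spans (0 for clean tokens, negative inside flagged spans).
    This is an illustrative blend; the report's exact formulation may differ.
    """
    return seq_reward + span_penalties

def batch_normalize(advantages: np.ndarray) -> np.ndarray:
    """Normalize combined advantages (zero mean, unit variance) across the batch."""
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Toy example: a 5-token translation where tokens 2-3 fall inside a flagged error span.
adv = token_advantages(seq_reward=0.6,
                       span_penalties=np.array([0.0, 0.0, -1.0, -1.0, 0.0]))
adv = batch_normalize(adv)
```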
Prompting Alignment
🍞 Top Bread (Hook): Tests should look like your homework, so you’re not surprised.
🥬 Filling (The Concept: Standardized Translation Prompt)
- What it is: A consistent, professional-translator prompt is used both during data creation and evaluation.
- How it works:
- Specify source/target languages and their codes.
- Instruct: output only the translation, no extra text.
- Wrap inputs automatically with provided tools.
- Why it matters: Without prompt alignment, outputs vary and evaluation gets noisy.
🍞 Bottom Bread (Anchor): The model always sees “You are a professional X→Y translator…” so its behavior is stable.
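Here is an approximation of what such a prompt template looks like in code; the wording is illustrative and may differ from the exact official prompt shipped with the model:

```python
def build_translation_prompt(source_text: str,
                             src_name: str, src_code: str,
                             tgt_name: str, tgt_code: str) -> str:
    """Build a consistent translation prompt. The wording below is an
    illustrative approximation of the style described in the report,
    not the exact official prompt."""
    return (
        f"You are a professional {src_name} ({src_code}) to "
        f"{tgt_name} ({tgt_code}) translator. "
        "Output only the translation, with no extra text.\n\n"
        f"{source_text}"
    )

prompt = build_translation_prompt("I love cats.", "English", "en", "French", "fr")
```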
Multimodal Retention (Image Translation)
🍞 Top Bread (Hook): Practice your language skills on real-world photos like menus and signs.
🥬 Filling (The Concept: Zero-Extra-Training Image Translation)
- What it is: Even without extra multimodal training, the model still translates text found in images.
- How it works:
- Input the image plus a simple instruction to translate the visible text.
- The model identifies the text content and translates it.
- Evaluate on the Vistra benchmark (single-text images subset).
- Why it matters: Without keeping multimodal skills, the model would fail on photos and signs.
🍞 Bottom Bread (Anchor): A photo of a Russian shop sign is correctly translated into English on the first try.
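One plausible way to frame such a request, assuming a chat-style multimodal message format (a common Gemma 3 / Hugging Face chat-template convention, not something specified in this summary):

```python
# A chat-style multimodal request: one image plus a plain instruction to
# translate the visible text. The message schema below follows a common
# chat-template convention (an assumption, not taken from the report);
# adapt it to whatever inference stack you use.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "bakery_sign.jpg"},
            {"type": "text",
             "text": "Translate the text in this image from German (de) to "
                     "English (en). Output only the translation."},
        ],
    }
]
# Pass `messages` to the model's chat template / generate call as usual.
```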
Secret Sauce (Why this recipe is clever)
- High-quality synthetic data with QE filtering multiplies training reach without adding noise.
- Human data shores up weaker languages and scripts.
- An ensemble of complementary reward models prevents over-optimizing one metric.
- Token-level advantages speed learning and pinpoint fixes.
- Prompt alignment reduces variance between training and use-time.
04 Experiments & Results
The Tests: We measured translation quality on the WMT24++ benchmark (55 language pairs) using MetricX and COMET-22, plus human MQM evaluations on WMT25 test sets (10 directions). We also checked image translation on the Vistra benchmark (subset of 264 images with single text regions).
🍞 Top Bread (Hook): When you race, you need a fair track and a stopwatch.
🥬 Filling (The Concept: Metrics That Matter)
- What it is: MetricX and COMET-22 are trusted automatic judges; MQM is a professional human evaluation framework.
- How it works:
- Automatic: compute scores that correlate with human judgments (lower MetricX is better; higher COMET is better).
- Human: translators mark exact error spans and severity; lower MQM totals mean better translations.
- Compare against a strong baseline (Gemma 3) at the same model sizes.
- Why it matters: Without solid metrics and baselines, results don’t mean much.
🍞 Bottom Bread (Anchor): Getting a MetricX score reduction of ~24% is like moving from a B to an A on a report card.
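For intuition, here is a toy version of how MQM totals are computed from annotated errors, using commonly cited severity weights (minor = 1, major = 5); the exact weighting in the report's human evaluation may differ:

```python
# Toy MQM scoring: annotators mark error spans with a severity, each severity
# carries a weight, and the weighted counts are summed (lower total = better).
# The weights below are commonly cited MQM defaults (minor = 1, major = 5);
# the exact scheme used in the report's evaluation may differ.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_total(errors: list[str]) -> float:
    return sum(SEVERITY_WEIGHTS[severity] for severity in errors)

system_a = mqm_total(["minor", "minor"])  # 2.0 -> better translation
system_b = mqm_total(["major", "minor"])  # 6.0 -> worse translation
```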
Scoreboard Highlights (Text Translation):
- 27B: TranslateGemma lowers MetricX notably vs. Gemma 3 (e.g., about a 23–26% relative reduction reported). COMET-22 also improves.
- 12B: TranslateGemma beats the 27B Gemma 3 baseline in average quality—small but meaningful efficiency win.
- 4B: TranslateGemma approaches or surpasses Gemma 3 12B baseline levels on COMET-22, a big deal for low-compute settings.
- Per-language examples (MetricX, lower is better):
- English→German: 1.63 → 1.19
- English→Spanish: 2.54 → 1.88
- English→Hebrew: 3.90 → 2.72
- English→Swahili: 5.92 → 4.45
- English→Lithuanian: 6.01 → 4.39
- English→Estonian: 6.40 → 4.61
- English→Icelandic: 8.31 → 5.69
- Takeaway: Gains are broad, covering both high- and low-resource languages.
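These relative reductions are easy to check from the numbers above; for example, English→German falling from 1.63 to 1.19 is roughly a 27% relative reduction, in line with the ~23-26% average mentioned earlier:

```python
# MetricX scores (lower is better) copied from the per-language examples above.
pairs = {
    "en->de": (1.63, 1.19),
    "en->es": (2.54, 1.88),
    "en->is": (8.31, 5.69),
}
for name, (gemma3, translategemma) in pairs.items():
    reduction = (gemma3 - translategemma) / gemma3
    print(f"{name}: {reduction:.1%} relative MetricX reduction")
# Prints roughly: en->de 27.0%, en->es 26.0%, en->is 31.5%
```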
Image Translation (Vistra):
- TranslateGemma keeps and often improves image translation quality without extra multimodal fine-tuning, especially at 27B.
- Example (averaged over English→German/Spanish/Russian/Chinese): strong MetricX drops (better) and COMET-22 mostly improves.
Human Evaluation (MQM on 10 directions):
- TranslateGemma 27B and 12B generally beat Gemma 3 27B, confirming automatic metrics.
- Biggest improvements show up in lower-resource directions like English→Marathi and English→Swahili.
- Exceptions: German as target is roughly on par; Japanese→English shows a regression mainly due to named-entity mistranslations while other categories improved.
Surprises and Notes:
- Smaller TranslateGemma models rival or beat larger baselines—quality per compute improves.
- Improvements transfer to metrics not used as rewards (e.g., COMET-22), suggesting genuine quality gains rather than overfitting a single metric.
- Multimodal gains arrive “for free” from better text translation skills.
05 Discussion & Limitations
Limitations
- Uneven Gains: While most language pairs improve, a few (e.g., Japanese→English) may regress in specific error types like named entities.
- Reward Dependence: If reward models are biased or miss certain errors, RL might over-optimize the wrong thing (reward hacking risk).
- Synthetic Data Quality: Even after filtering, synthetic pairs can carry subtle errors that teach the model bad habits.
- Scale Effects: The hypothesis that larger models benefit more from wide language exposure isn’t fully proven here.
- Multimodal Scope: Image translation tests use a simplified subset (single text region) and no extra multimodal training; complex scenes may need more.
Required Resources
- Compute: Fine-tuning 4B/12B/27B models for 200k steps plus RL requires multi-GPU or TPU clusters.
- Data: Access to large monolingual corpora (MADLAD-400), human parallel data (SMOL/GATITOS), and strong synthetic generation (e.g., Gemini 2.5 Flash) and filtering tools (MetricX 24-QE).
- Tooling: RL frameworks supporting token-level advantages, Kauldron SFT tooling, and consistent prompting.
When NOT to Use
- Highly specialized domains with strict terminology unless you add domain adaptation.
- Settings requiring guaranteed named-entity precision (e.g., legal citations) without extra constraints or post-editing.
- Low-latency on tiny hardware if even the 4B model is too heavy—consider distillation or smaller specialized models.
Open Questions
- Can better named-entity handling in RL (e.g., entity-aware rewards) fix regressions like Japanese→English?
- How far can synthetic data + QE go for truly low-resource languages with scarce monolingual text?
- What’s the best balance of human vs. synthetic data as model size grows?
- Can multimodal-specific RL further boost image translation without hurting text tasks?
- How to automatically detect and avoid reward hacking across diverse languages and scripts?
06 Conclusion & Future Work
Three-Sentence Summary
- TranslateGemma is an open set of translation-specialized models fine-tuned from Gemma 3 using a two-stage recipe: supervised learning on curated human+synthetic data, then reinforcement learning guided by multiple reward judges.
- It achieves consistent gains across 55 language pairs, with smaller models often matching or beating larger baselines, and it retains improved image translation without extra multimodal training.
- Human evaluations confirm the improvements, especially in lower-resource languages, though some directions (e.g., Japanese→English named entities) need targeted fixes.
Main Achievement
- Turning a strong general LLM into a state-of-the-art open translator via a careful blend of high-quality data curation, ensemble rewards, and token-level feedback—delivering better quality per compute.
Future Directions
- Add entity-aware or domain-specific rewards; expand low-resource coverage; explore multimodal RL; investigate stronger safeguards against reward hacking; and refine balance between human and synthetic data.
Why Remember This
- TranslateGemma shows that combining clean supervision with rich, multi-angle feedback can lift translation quality broadly, keep efficiency high, and even improve related skills like image translation—offering a practical, open path for the community to build on.
Practical Applications
- Build a multilingual chatbot that answers customer questions in the user’s native language with fewer errors.
- Translate product manuals and safety instructions across dozens of languages while preserving terminology.
- Provide real-time helpdesk translation so support agents and users communicate smoothly worldwide.
- Localize educational content (lessons, quizzes, stories) for schools in low-resource languages.
- Translate signage and notices from photos for travelers or accessibility apps.
- Power news aggregation sites that summarize articles across languages accurately and naturally.
- Assist humanitarian organizations by translating forms and guidance quickly in crisis regions.
- Support subtitling and dubbing workflows by generating more faithful base translations for editors.
- Enable cross-border e-commerce by translating listings, reviews, and return policies clearly.
- Provide developer APIs for batch and streaming translation with consistent prompts and metrics.