
Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Beginner
Kunat Pipatanakul, Pittawat Taveekitworachai Ā· 1/26/2026
arXiv Ā· PDF

Key Summary

  • Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on small budgets.
  • The recipe has two parts: adoptability (SFT + On-Policy Distillation) and sovereign capability (small-scale RFT with a new InK-GRPO trick).
  • With mostly open English data plus a small, carefully built Thai dataset, SFT alone wasn't enough; adding OPD made the model more robust and better at following instructions.
  • Full-logits distillation beat top-K for tricky Thai code-switching, helping the model avoid brittle mistakes when mixing languages.
  • The team's InK-GRPO adds a light next-word learning stream during RL, which injects missing domain knowledge (like Thai law) while improving task rewards.
  • On Thai legal QA (NitiBench), InK-GRPO improved accuracy over standard GRPO, and in an agent setup it even beat a GPT-5 baseline under similar tool-augmented conditions.
  • Crucially, these gains came without hurting general skills across English and Thai, showing little to no catastrophic forgetting.
  • All of this ran on academic-scale hardware: roughly 2 days on 8 GPUs for adoptability and 1 day on 4 GPUs for sovereign capability.
  • The result is practical guidance and open resources so countries or domains can build transparent, high-quality local models they truly control.

Why This Research Matters

Public institutions, schools, and hospitals can now build helpful assistants in their own languages without relying on closed, expensive systems. Courts and legal aid groups can get models that understand local laws and reasoning steps, improving access to justice. Small teams can audit, control, and update their models, which is key for safety, privacy, and trust. The recipe reduces compute and data needs, making responsible AI development more inclusive worldwide. Improved code-switching robustness reflects how people actually talk, reducing frustrating mistakes. Because general skills are preserved while local strengths grow, this approach offers practical, dependable tools for daily work. Open releases (models, datasets, code) let others learn, reproduce, and adapt the method for their communities.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine your school only has a few computers and a small library, but you still want to build a super smart helper who understands your town’s language and rules. You don’t want to borrow a giant secret robot; you want your own that you truly understand and can fix.

🄬 The World Before:

  • What it is: The AI world was dominated by huge models mostly trained on English and Chinese, made by a few big labs with massive computers and complex training recipes.
  • How it works (story):
    1. Big teams collected tons of data (often not your language).
    2. They trained giant models for months on expensive GPU clusters.
    3. They used complicated post-training with huge instruction sets and advanced RL pipelines.
  • Why it matters: Smaller groups (like a university or a national lab) couldn’t copy this recipe; too costly, too closed, and not designed for local needs.

šŸž Anchor: Think of a very advanced spaceship manual written in another language—useful, but not if your team can’t read it or fix the ship.

šŸž Hook: You know how a one-size-fits-all shirt rarely fits perfectly? AI is similar—general models can miss local details like laws, culture, or code-switching slang.

🄬 The Problem:

  • What it is: There was no clear, affordable way to post-train a model so it becomes both a great general helper (adoptability) and a strong local expert (sovereign capability).
  • How it works (challenges):
    1. General instruction data skews toward high-resource languages.
    2. Complex RL and preference pipelines demand big engineering teams.
    3. Local models often know regional facts but fail at instruction following, tools, and agent behaviors.
  • Why it matters: Without a simple path, schools, hospitals, courts, and public agencies can’t safely use or own AI that truly understands their community.

šŸž Anchor: It’s like hiring a smart visitor who knows everything about the world but can’t follow your local school rules or speak your slang.

šŸž Hook: Imagine cooking a tasty meal with few ingredients and a basic stove. Could we do the same with AI—good results without fancy gear?

🄬 Failed Attempts:

  • What it is: People tried scaling data and pipelines—more instructions, more RL, more everything.
  • How it works (and where it breaks):
    1. More English-centric instructions crowd out low-resource languages.
    2. Bigger teachers and more stages create engineering overhead.
    3. Offline distillation can make models brittle when they face new situations.
  • Why it matters: Bigger wasn’t better for small teams—it was just bigger, pricier, and less practical.

šŸž Anchor: If your toolbox is tiny, you don’t fix a bike by building a car factory; you choose smarter, lighter tools.

šŸž Hook: You know how you might first learn to follow recipes and then learn your grandma’s special secret sauce? That’s the two-part plan here.

🄬 The Gap:

  • What it is: A missing minimal, openly documented post-training recipe that works under academic-scale resources and keeps control local.
  • How it works (needs):
    1. Adoptability: Make a base model a capable assistant (instructions, math, code, tools).
    2. Sovereign capability: Teach region-specific, high-stakes knowledge (like Thai law) without losing general skills.
  • Why it matters: This lets communities build models they own, understand, and can audit.

šŸž Anchor: First teach the robot to follow directions well, then teach it your town’s rules and language quirks.

šŸž Hook: Imagine a playbook that’s short, clear, and uses gears you can afford.

🄬 Why This Paper Exists:

  • What it is: Typhoon-S proposes a minimal open recipe: SFT + On-Policy Distillation for adoptability, and a small RFT stage with InK-GRPO for sovereign capability.
  • How it works:
    1. Stage 1: SFT on open English data plus a small Thai set.
    2. Stage 2: On-Policy Distillation (GKD) to fix brittleness and improve robustness.
    3. Stage 3: Small-scale RFT with InK-GRPO to inject missing domain knowledge and improve reasoning (including an agent with tools for retrieval).
  • Why it matters: It shows strong results using about two days on 8 GPUs for adoptability and one day on 4 GPUs for sovereign capability—within reach for many labs.

šŸž Anchor: Like building a strong bicycle (SFT), tuning its handling to your riding style (OPD), then adding a map holder and local trail notes (InK-GRPO + tools) so you ace the local race.

02Core Idea

šŸž Hook: You know how you can teach a friend the basics of board games fast, and later give them a quick crash course on your family’s special house rules? That’s faster than teaching everything from scratch.

🄬 The Aha! Moment (one sentence): A small, carefully designed post-training recipe—SFT + On-Policy Distillation for general skills, then tiny RFT with InK-GRPO for local expertise—can rival big pipelines without big budgets.

Multiple Analogies:

  1. Cooking: SFT is the base recipe; OPD is tasting as you cook (the teacher guides you on your own attempts); InK-GRPO is adding local spices while practicing plating under a timer.
  2. Sports: SFT is learning the rules; OPD is scrimmaging with a coach who corrects every move you make; InK-GRPO is practicing with a playbook of local opponents’ patterns added mid-drill.
  3. Music: SFT learns scales; OPD practices songs with a teacher who gives note-by-note feedback as you play; InK-GRPO adds regional rhythms while performing.

Before vs After:

  • Before: Sovereign base models had local facts but poor instruction following and tool use; or teams relied on heavy data/pipelines they couldn’t run.
  • After: With SFT+OPD, models become robust assistants. With InK-GRPO, they gain domain-specific reasoning (like Thai law) while keeping general skills.

Why It Works (intuition):

  • SFT gives the model a reliable ā€œfollow directionsā€ backbone.
  • On-Policy Distillation reduces brittleness by letting a stronger teacher grade the student on the student’s own outputs—not just on fixed answers.
  • InK-GRPO mixes task rewards with light next-word learning on in-domain text, gently injecting missing knowledge while RL sharpens reasoning and decisions.

Building Blocks (each with the Sandwich pattern):

šŸž Hook: You know how a smart librarian remembers lots of books? 🄬 Large Language Model (LLM)

  • What it is: A computer program that predicts the next word to understand and generate text.
  • How it works:
    1. Reads tons of text to learn patterns.
    2. Predicts the next token over and over.
    3. Uses those predictions to answer questions or follow instructions.
  • Why it matters: It’s the base brain we improve. šŸž Anchor: When you ask, ā€œWhat’s Thailand’s capital?ā€ it continues the pattern to say ā€œBangkok.ā€

šŸž Hook: Imagine owning your bike, tools, and manual, so you’re never stuck waiting on someone else. 🄬 Sovereign Setting & Capability

  • What it is: Keeping control of weights, data, and training so the model can do high-stakes local tasks (like legal reasoning) responsibly.
  • How it works:
    1. Use open data and simple steps.
    2. Target local languages and laws.
    3. Keep transparency so you can audit and fix things.
  • Why it matters: Critical decisions need trust, privacy, and control. šŸž Anchor: A court clerk using a locally trained assistant that understands Thai law exactly as written.

šŸž Hook: First learn to follow recipes before entering a cooking contest. 🄬 Supervised Fine-Tuning (SFT)

  • What it is: Training on input–answer pairs so the model learns to follow instructions.
  • How it works:
    1. Show a prompt and a correct response.
    2. Nudge the model toward those tokens.
    3. Repeat across many tasks (English + a bit of Thai).
  • Why it matters: Gives basic instruction-following and tool-call formatting. šŸž Anchor: ā€œWrite a polite email in Thaiā€ā€”the model copies good style from examples.

šŸž Hook: It’s easier to improve when a coach corrects you on your own moves. 🄬 On-Policy Distillation (OPD) with a Teacher

  • What it is: A strong teacher scores the student’s own outputs token-by-token and the student learns from that.
  • How it works:
    1. Student generates an answer.
    2. Teacher provides a full probability distribution over next tokens.
    3. Student nudges its probabilities to match the teacher.
  • Why it matters: Reduces brittleness and improves robustness to mistakes. šŸž Anchor: When mixing Thai and English (ā€œcode-switchingā€), the teacher helps choose the right characters.

šŸž Hook: Sometimes you need the whole menu, not just the top dishes. 🄬 Full-Logits vs Top-K Distillation

  • What it is: Full-logits use the teacher’s entire token distribution; top-K keeps only the biggest few.
  • How it works:
    1. Collect teacher probabilities.
    2. Either keep all (full) or truncate (top-K).
    3. Train student to match what you kept.
  • Why it matters: Full-logits better handle long-tail tokens in Thai code-switching. šŸž Anchor: Avoiding tiny typos in Thai-English mixes because the model saw the full ā€œmenuā€ of choices.

šŸž Hook: Rewards help you practice the right habits. 🄬 Reinforcement Fine-Tuning (RFT) with GRPO

  • What it is: Training that rewards good answers and discourages bad ones to improve reasoning.
  • How it works:
    1. Generate answers or tool-using steps.
    2. Score them with a reward (format + correctness, or correctness only for agents).
    3. Update the policy to favor higher-reward behaviors.
  • Why it matters: Targets hard skills like legal reasoning. šŸž Anchor: Solving a Thai legal question and getting a higher reward for the correct, well-structured answer.
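As a concrete illustration of the reward-driven update, the sketch below shows the group-relative advantage computation at the heart of GRPO: sample several answers to the same prompt, score them, and normalize the scores within the group. It is a simplified stand-in, not the paper's implementation.

```python
# Sketch of GRPO-style group-relative advantages (illustrative only).
# Answers sampled for the same prompt are scored, then each reward is
# normalized against the group mean and standard deviation.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [group_size] rewards for answers sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([2.0, 0.0, 1.0, 2.0])   # e.g., judge scores of 0/1/2
adv = group_relative_advantages(rewards)
# The policy update then raises the probability of tokens in high-advantage
# answers (full GRPO uses a PPO-style clipped objective on top of this).
print(adv)
```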

šŸž Hook: Add local facts while you practice, like glancing at flashcards between drills. 🄬 InK-GRPO (Injected Knowledge GRPO)

  • What it is: GRPO plus a sometimes-on next-word loss from in-domain text to inject missing knowledge.
  • How it works:
    1. Do normal RL steps (GRPO) on questions.
    2. With probability ρ, add a light CE loss on domain text.
    3. Balance with weight Ī» so RL stays in charge.
  • Why it matters: Teaches facts the base model lacks while sharpening reasoning. šŸž Anchor: Reading Thai law snippets between practice questions to answer better next time.

šŸž Hook: Tools are like magnifying glasses for tricky questions. 🄬 Agentic RFT with RAG Tools

  • What it is: Let the model search and read documents over multiple turns before answering.
  • How it works:
    1. Model decides to ā€˜search’ or ā€˜read’.
    2. Retrieves top documents, then reads one.
    3. Uses info to craft the final answer.
  • Why it matters: Boosts accuracy on hard, knowledge-heavy tasks. šŸž Anchor: The agent searches a legal corpus, reads a statute, then answers the case question correctly.

03Methodology

At a high level: Input (base model + open data) → Stage A: SFT → Stage B: On-Policy Distillation → Output 1: Instruct model → Stage C: Small RFT with InK-GRPO (optionally agentic with tools) → Output 2: Sovereign-capable specialist.

Stage A: Supervised Fine-Tuning (SFT) šŸž Hook: You know how you learn better when someone shows you examples with right answers? 🄬 The Concept

  • What it is: Teach the model to follow instructions and use tools by showing prompt–response pairs.
  • How it works:
    1. Build a mixed dataset: 200k Tulu 3 (general English), 100k Toucan Tool (tool use), ~40k Thai AutoIF (Thai alignment with clever constraint placement and occasional English constraints).
    2. Train with cross-entropy (likelihood of correct tokens).
    3. Use sequence packing to fit long contexts efficiently.
  • Why it matters: This sets a solid base of instruction following and tool formatting, including Thai. šŸž Anchor: Given ā€œSummarize this article in Thai,ā€ the model learns to produce short, correct Thai summaries.
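A minimal sketch of the cross-entropy objective used here, assuming prompt tokens are masked out with the usual -100 label convention; shapes and names are illustrative rather than the paper's training code.

```python
# Sketch of the SFT objective: cross-entropy on response tokens of a
# prompt-response pair, with prompt/system tokens masked out (label = -100).
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """
    logits: [batch, seq_len, vocab] from the causal LM
    labels: [batch, seq_len] target token ids, already shifted so position t
            predicts token t, with -100 wherever we do not want a gradient.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,   # skips masked prompt tokens
    )
```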

Secret sauce in SFT: Thai data and constraint augmentation šŸž Hook: Sometimes mixing languages in your notes helps you remember better. 🄬 The Concept

  • What it is: Randomly place constraints in system or user messages and sometimes translate them between Thai and English.
  • How it works:
    1. Take a Thai prompt; keep or translate constraints to English.
    2. Randomly move constraints into system or user message.
    3. Keep AutoIF-style, code-verifiable constraints and filter with self-eval.
  • Why it matters: Stronger Thai performance, better code-switching, and robustness to prompt structure. šŸž Anchor: A Thai math prompt with English constraints still yields a correct, well-formatted answer.
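The augmentation itself can be pictured as a small preprocessing function like the sketch below, which randomly moves the constraint between the system and user message and sometimes translates it to English. The translate_fn callable is an assumed helper (any Thai-to-English MT function), not something released with the paper.

```python
# Illustrative sketch of the constraint augmentation described above:
# randomly place the constraint in the system or the user message, and
# sometimes translate it from Thai to English to exercise code-switching.
import random

def augment_example(thai_prompt, constraint, translate_fn,
                    p_translate=0.5, p_system=0.5):
    if random.random() < p_translate:
        constraint = translate_fn(constraint)      # Thai constraint -> English
    if random.random() < p_system:
        # Constraint carried in the system message.
        return [{"role": "system", "content": constraint},
                {"role": "user", "content": thai_prompt}]
    # Otherwise the constraint is appended to the user message.
    return [{"role": "user", "content": f"{thai_prompt}\n{constraint}"}]
```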

Stage B: On-Policy Distillation (OPD) using Generalized Knowledge Distillation (GKD) šŸž Hook: Practicing your own homework and getting graded as you go beats copying from an answer key. 🄬 The Concept

  • What it is: The student writes answers; the teacher shares token-by-token guidance (full probabilities), and the student aligns to that.
  • How it works:
    1. With probability Ī»=0.25, generate on-policy outputs; otherwise, use SFT data.
    2. Query the teacher (e.g., Qwen3-30B A3B Instruct) for full logits along the sequence.
    3. Minimize forward KL (teacher→student) at token level.
  • Why it matters: Reduces brittleness, especially in Thai code-switching and open-ended tasks. šŸž Anchor: For a Thai-English mixed chat, the teacher helps the student avoid tiny script mistakes, boosting MT-Bench TH and CS scores.
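The distillation target can be written as a token-level forward KL from the teacher's full next-token distribution to the student's. The sketch below assumes both models have already been run on the same sequence (student-generated with probability Ī», otherwise from SFT data) and is illustrative only.

```python
# Sketch of the on-policy distillation loss: token-level forward KL
# KL(teacher || student), averaged over response tokens.
import torch
import torch.nn.functional as F

def forward_kl_distillation(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """
    student_logits, teacher_logits: [batch, seq_len, vocab] on the same sequence
    mask: [batch, seq_len], 1 for response tokens that count toward the loss
    """
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    kl = (t.exp() * (t - s)).sum(dim=-1)           # per-token KL over the vocab
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```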

Secret sauce in OPD: Full-logits over Top-K šŸž Hook: Seeing the whole picture helps avoid small errors. 🄬 The Concept

  • What it is: Keep the teacher’s full token distribution instead of only top-K.
  • How it works:
    1. Compute full next-token probabilities.
    2. Train student to match across all tokens, not just a few.
    3. Use efficient swapping/offloading to fit memory limits.
  • Why it matters: Much better Thai code-switching robustness (e.g., 93.4 vs 69.8). šŸž Anchor: During sampling, the model chooses the right Thai diacritics because it learned from the full menu.
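To see why truncation hurts, the sketch below builds a top-K target from a full teacher distribution: everything outside the K most likely tokens gets zero probability, so rare long-tail tokens (common in Thai code-switching) receive no training signal. The numbers are purely illustrative.

```python
# Sketch contrasting full-logits vs top-K distillation targets: top-K keeps
# only the K largest teacher probabilities and renormalizes, discarding the
# long tail that matters for rare multilingual tokens.
import torch

def topk_target(teacher_probs: torch.Tensor, k: int = 64) -> torch.Tensor:
    """teacher_probs: [vocab] full next-token distribution from the teacher."""
    vals, idx = teacher_probs.topk(k)
    truncated = torch.zeros_like(teacher_probs)
    truncated[idx] = vals / vals.sum()             # renormalize over kept tokens
    return truncated

full = torch.softmax(torch.randn(50_000), dim=-1)  # stand-in teacher distribution
approx = topk_target(full, k=64)
print((approx == 0).float().mean())  # fraction of tokens with zero training signal
```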

Engineering to fit academic hardware

  • Dynamic model swapping (load active model to GPU, offload others to RAM).
  • FSDP with CPU offloading for 8B students.
  • vLLM-backed rollouts for fast student inference.
  • Outcome: Full-logits OPD on 4ƗH100 feasible; SFT+OPD on 8ƗH100 in ~2 days.
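The swapping idea can be sketched in a few lines, assuming a simple two-model setup; real systems (FSDP CPU offload, vLLM rollout workers) are considerably more involved.

```python
# Illustrative sketch of dynamic model swapping: keep only the model that is
# currently needed (student or teacher) on the GPU, park the other in CPU RAM.
import torch

def swap_in(active, inactive, device="cuda"):
    inactive.to("cpu")            # free GPU memory first
    torch.cuda.empty_cache()
    active.to(device)             # then bring the needed model onto the GPU
    return active

# Teacher scoring phase:   teacher = swap_in(active=teacher, inactive=student)
# Student update phase:    student = swap_in(active=student, inactive=teacher)
```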

Stage C: Small-scale RFT with InK-GRPO (optionally Agentic) šŸž Hook: Practice answering tough questions while skimming local notes in between. 🄬 The Concept

  • What it is: GRPO RL plus a sometimes-on next-word loss from in-domain text to inject knowledge.
  • How it works:
    1. Collect on-policy rollouts; compute trajectory-level GRPO updates.
    2. With probability ρ (e.g., 0.6), add a small CE loss (weight Ī»~0.1) on in-domain corpus (e.g., NitiBench’s legal contexts).
    3. Use rewards: accuracy, plus a format reward when not agentic; a judge model scores answers 0/1/2, and the scores are then normalized.
  • Why it matters: Improves Thai legal reasoning while preserving general skills. šŸž Anchor: Legal agent searches Thai statutes, reads relevant sections, and outputs the correct answer more often.
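Putting the pieces together, an InK-GRPO update step might look roughly like the sketch below, where the GRPO loss and the in-domain cross-entropy are supplied as callables; this is a schematic reading of the method, not the released training code.

```python
# Sketch of an InK-GRPO update: a standard GRPO loss plus, with probability
# rho (e.g., 0.6), a lightly weighted (lam ~ 0.1) next-token cross-entropy
# term on in-domain text such as legal passages.
import random

def ink_grpo_step(policy, rollouts, domain_batch, optimizer,
                  grpo_loss_fn, lm_ce_fn, rho=0.6, lam=0.1):
    loss = grpo_loss_fn(policy, rollouts)              # task-reward RL term (GRPO)
    if random.random() < rho:                          # sometimes-on knowledge injection
        loss = loss + lam * lm_ce_fn(policy, domain_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```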

Agentic RFT (RAG tools) šŸž Hook: When you don’t know, look it up—then answer. 🄬 The Concept

  • What it is: Train the model to choose when to search and read documents before answering.
  • How it works:
    1. search: semantic retrieval returns top-3 docs (FAISS IVF-SQ8, Qwen embeddings).
    2. read: returns full document content.
    3. Optimize final-answer accuracy with GRPO; mask tool outputs from gradients.
  • Why it matters: Raises accuracy on hard, knowledge-heavy tasks like law. šŸž Anchor: The agent solves a tricky case by pulling the exact section of Thai civil code, then concludes correctly.
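The agent loop can be sketched as follows, assuming helper callables for generation, tool-call parsing, and embedding, plus a FAISS index over the document collection; it illustrates the control flow rather than the paper's code.

```python
# Illustrative agentic RAG loop: the model may call "search" (semantic
# retrieval over a FAISS index, top-3 documents) or "read" (full document
# text) for a few turns before giving a final answer.
def run_agent(question, docs, index, generate_turn, parse_tool_call, embed,
              max_turns=4):
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        turn = generate_turn(history)             # model decides the next action
        action = parse_tool_call(turn)            # None means a final answer
        if action is None:
            return turn
        if action["tool"] == "search":
            _, ids = index.search(embed(action["query"]), 3)   # FAISS top-3
            result = [docs[i]["title"] for i in ids[0]]
        else:                                     # "read"
            result = docs[action["doc_id"]]["text"]            # full document
        history += [{"role": "assistant", "content": turn},
                    {"role": "tool", "content": str(result)}]
    return generate_turn(history)                 # budget exhausted: answer now
```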

Putting it together (data and compute)

  • SFT data: 200k Tulu 3 (EN), 100k Toucan Tool (EN tools), 40k Thai AutoIF.
  • OPD data: 100k Tulu 3 subset, 20k Toucan subset, 40k Thai AutoIF.
  • RFT data: NitiBench (Thai legal), MIRAGE-Bench Thai (multilingual RAG); CE uses in-domain text.
  • Compute: ~2 days on 8ƗH100 for 8B SFT+OPD; ~1 day on 4ƗH100 for 4B RFT.

Secret Sauce Summary

  • Thai data at SFT is essential for Thai-native and code-switching alignment.
  • OPD on-policy with full logits boosts robustness on open-ended, multilingual generation.
  • InK-GRPO injects missing domain knowledge during RL without causing forgetting.
  • Agentic RFT leverages tools for real gains on hard, local tasks.

04Experiments & Results

The Test: What and Why

  • We measured general assistant skills (chat, instruction following), knowledge, math, code, and tool/agent tasks in English and Thai. We also stress-tested Thai code-switching because real users mix languages.
  • For sovereign capability, we focused on Thai legal reasoning (NitiBench) and multilingual RAG tasks (MIRAGE-Bench).

The Competition: Baselines

  • SFT-only vs SFT+OPD (adoptability).
  • Full-logits distillation vs top-K.
  • GRPO vs InK-GRPO (with pretraining-style CE vs SFT-style CE).
  • Qwen3-8B Instruct vs Typhoon-S-8B (sovereignty-adapted base) on Thai-focused suite.
  • Agentic RFT vs GPT-5 with search/agent.

Scoreboard with Context

  1. SFT alone is not enough:
  • Average: 37.45 (SFT) vs 43.94 (SFT+OPD). That’s like jumping from a C to a solid B.
  • Thai code-switching: 65.4 (SFT) → 93.4 (SFT+OPD). Huge robustness boost.
  • Tool use and open QA: SFT alone was often brittle; SFT+OPD fixed many failures (e.g., HPQA scores that were zero under SFT recovered).
  • Takeaway: OPD makes the assistant sturdier and more reliable.
  2. Full-logits vs Top-K OPD:
  • Average: 43.94 (full) vs 42.81 (top-K).
  • Thai code-switching: 93.4 (full) vs 69.8 (top-K) — full wins big on long-tail tokens.
  • Some single-answer tasks were similar or slightly better with top-K.
  • Takeaway: Full logits are worth it for multilingual robustness.
  3. Thai data matters—especially at SFT:
  • Removing Thai from SFT crushed Thai results (e.g., Thai IFE 73.35 → 57.44; CS 65.4 → 34.4).
  • OPD without Thai had smaller average drops, but Thai-native tasks still improved when Thai was included.
  • Takeaway: Include Thai at SFT for alignment; OPD refines it.
  4. Sovereignty-adapted base (ThaiLLM-8B) + recipe works:
  • On Thai-only suite, Typhoon-S-8B beat Qwen3-8B (Thai average 71.20 vs 66.66) — better Thai chat, code-switching, Thai knowledge (OTE), and Thai agentic QA.
  • On full EN+TH suite, Typhoon-S-8B remained competitive but trailed on hard English scientific knowledge and math/code.
  • Takeaway: Starting from a local base + minimal recipe yields a Thai-strong assistant.
  5. InK-GRPO improves sovereign tasks over GRPO:
  • NitiBench: 19.30% (InK) vs 15.82% (GRPO) — +3.48 points, a meaningful bump.
  • MIRAGE: 22.63% (InK) vs 20.99% (GRPO) — consistent gain.
  • Takeaway: Injecting in-domain text during RL adds missing facts while training behaviors.
  6. Pretraining-style CE > SFT-style CE for InK-GRPO:
  • NitiBench: 19.30% (PT) vs 16.89% (SFT) vs 15.82% (GRPO).
  • Takeaway: Broader in-domain language modeling encourages exploration and complements RL better than SFT-style targets.
  7. Agentic RFT + InK-GRPO shines:
  • NitiBench agentic accuracy: 78.02% (Agentic InK) vs 73.73% (Agentic GRPO).
  • Beat GPT-5 + Search (38.07%) and even GPT-5 + Agent (75.34%) under comparable tool setups.
  • Takeaway: Small, well-trained agents with tools can exceed very large general models on focused sovereign tasks.
  8. No severe catastrophic forgetting:
  • Across broad EN+TH suite, GRPO and InK-GRPO models stayed around the base average (ā‰ˆ48–50%).
  • Gains were targeted (e.g., chat, CS) without broad declines.
  • Takeaway: The pipeline preserves general skills while adding local strength.

Surprising Findings

  • Full-logits OPD mattered most for Thai code-switching—suggesting long-tail multilingual tokens need full distributions.
  • Pretraining-style CE during RL outperformed SFT-style CE for knowledge injection—counterintuitive, but likely due to better exploration.
  • A compact 4B agent with good training beat a massive GPT-5 in Thai legal QA under comparable agent setups—showing the power of focused, sovereign training.

Practical Performance Frame

  • Training budget: ~2 days on 8ƗH100 for 8B adoptability; ~1 day on 4ƗH100 for 4B sovereign capability.
  • Data: Mostly open English instruction + small high-quality Thai; domain text for CE.
  • Outcome: A minimal, open, reproducible path to sovereign LLMs with competitive results.

05Discussion & Limitations

Limitations

  • Language scope: Experiments center on Thai; while methods likely generalize, direct evidence for other languages is limited.
  • Pre/mid-training: Not explored here; scaling laws and deeper knowledge infusion at those stages remain open.
  • Judge/reward design: LLM-as-a-judge choices influence RL; although mitigations were used, different judges may yield different behaviors.
  • Hard STEM/code: The sovereignty-adapted model trailed on English scientific knowledge, math, and coding versus a strong multilingual baseline.
  • Data quality: Target-language and legal corpora quality control is vital; noisy or biased texts could inject errors.

Required Resources

  • Hardware: 8ƗH100 for ~2 days (8B SFT+OPD) and 4ƗH100 for ~1 day (4B RFT). Variants can scale down but may need longer.
  • Data: Open English instruction (e.g., Tulu 3, Toucan), curated Thai prompts/responses with constraint augmentation, and clean in-domain texts for CE.
  • Software: HF Transformers/TRL, vLLM, FAISS, and an RL framework (veRL-like) with efficient model swapping/offloading.

When NOT to Use

  • If you cannot curate reliable in-domain texts (for CE), knowledge injection may backfire.
  • If your task is purely English scientific STEM/coding and you already have a top multilingual instruct model, the Thai-focused benefits may not offset trade-offs.
  • If you demand full proprietary-scale performance without constraints, this minimal recipe won’t match months-long, thousand-GPU training.

Open Questions

  • Generalization: How well does InK-GRPO work in other low-resource languages or specialized domains (medicine, finance, public policy)?
  • Hyperparameters: What are the best ρ and Ī» schedules for CE mixing? Can adaptive schedules improve learning?
  • Knowledge provenance: How to track, audit, and update injected knowledge safely over time?
  • Scaling behavior: How do benefits change with larger backbones or multi-stage (pre/mid) training?
  • Robustness: Can we further harden code-switching and tool-use behaviors under noisy or adversarial prompts without heavy compute?

06Conclusion & Future Work

Three-Sentence Summary

  • Typhoon-S is a minimal, open post-training recipe that makes a base model both a robust general assistant (SFT + On-Policy Distillation) and a strong local expert (small RFT with InK-GRPO).
  • It achieves competitive results on Thai tasks—including legal reasoning and agentic retrieval—while preserving general skills and requiring only academic-scale compute.
  • Key choices—Thai data in SFT, full-logits OPD for robustness, and pretraining-style CE mixing during RL—drive the gains without complex pipelines.

Main Achievement

  • Demonstrating a practical, reproducible path for sovereign LLMs: with modest hardware and open datasets, institutions can train instruction-following assistants and legal agents that outperform much larger general models on localized tasks.

Future Directions

  • Extend to more languages and domains (healthcare, public services); explore adaptive CE schedules and judge designs; study pre/mid-training synergies and larger backbones; add safety/auditing tools for injected knowledge.

Why Remember This

  • Because it shows that careful design beats sheer size for many real-world needs: with the right small steps—SFT, on-policy distillation, and light RL with knowledge injection—communities can build transparent, controllable AI that truly serves local people.

Practical Applications

  • Build a Thai-speaking virtual clerk that answers legal questions with citations from local codes.
  • Create a bilingual helpdesk bot for public services that handles Thai–English code-switching smoothly.
  • Deploy a hospital triage assistant trained on local medical guidelines (extending InK-GRPO to medicine).
  • Launch a university writing tutor that follows strict formatting rules and supports Thai sources.
  • Set up a small-team RAG agent to search local archives and summarize findings for journalists.
  • Develop an SME customer-support agent that calls tools (billing, booking) reliably using Toucan-style formats.
  • Train a classroom assistant that explains math problems in Thai while preserving standard notation.
  • Construct a local governance Q&A agent that cites municipal regulations accurately.
  • Build a software assistant that mixes Thai comments with English code for developer onboarding.
  • Stand up a privacy-preserving model in government offices where weights and data must remain in-country.
#Typhoon-S Ā· #on-policy distillation Ā· #full-logits distillation Ā· #supervised fine-tuning Ā· #reinforcement fine-tuning Ā· #GRPO Ā· #InK-GRPO Ā· #sovereign LLM Ā· #Thai legal reasoning Ā· #code-switching robustness Ā· #retrieval-augmented generation Ā· #agentic RFT Ā· #open instruction data Ā· #academic-scale compute