
Solar Open Technical Report

Intermediate
Sungrae Park, Sanghoon Kim, Jungho Cho et al. Ā· 1/11/2026
arXiv Ā· PDF

Key Summary

  • Solar Open is a giant bilingual AI (102 billion parameters) that focuses on helping underserved languages like Korean catch up with English-level AI quality.
  • The team created 4.5 trillion tokens of high-quality synthetic data so the model could learn even when real Korean data was scarce.
  • They taught the model with a smart bilingual curriculum that starts broad and gets cleaner and tougher over time, across about 20 trillion training tokens.
  • A custom tokenizer with a big vocabulary (196,608 tokens) makes Korean and math/code text compress well, speeding training and saving context space.
  • The model uses a Mixture-of-Experts design so only the best small group of experts activates per token, making it efficient and strong.
  • They introduced SnapPO, a decoupled reinforcement learning framework that separates generation, reward scoring, and training for easier scaling and mixing goals.
  • Training happened in phases: pre-training, mid-training to gather diverse reasoning paths, SFT to pick successful paths, then RL to compose them well.
  • Solar Open leads on many Korean benchmarks (finance, law, medical) and stays competitive in English, especially in math and preference alignment.
  • Engineering tricks like hierarchical sharding and kernel fixes doubled throughput on B200 GPUs to 7,200 tokens/sec for fast large-scale training.
  • This recipe—synthetic data, bilingual curriculum, and scalable RL—can be reused to build strong models for other underserved languages too.

Why This Research Matters

When AI works well only in a few languages, many people miss out on help for school, work, and daily life. Solar Open shows a way to close that gap by creating high-quality data, teaching in the right order, and scaling reinforcement learning safely. This brings better medical, legal, and financial answers to Korean users while staying strong in English, too. The approach is reusable, so other languages can benefit without starting from scratch. Faster, more efficient tokenization and MoE design make training and serving more cost-effective in real deployments. Safer, culturally aware answers help protect users and build trust. In short, it’s a practical path toward fair, capable, multilingual AI.

Detailed Explanation


01 Background & Problem Definition

You know how a library can be amazing for one subject (tons of English books) but nearly empty for another (not many Korean books)? Before this work, the open AI world looked like that: English (and to some extent Chinese) had lots of high-quality data and strong open models, but most other languages did not. This imbalance meant that people speaking underserved languages couldn’t get the same helpful, accurate AI for study, work, or daily tasks.

The problem was twofold. First, data scarcity: there simply wasn’t enough good Korean text—by web byte count, Korean ranks far below English and Chinese. Second, even when you try to train a bilingual model, language details matter: tokenizers might break words poorly, cultural context might be missing, and reasoning examples may not reflect local curricula or styles. Together, these issues cause longer inputs (more tokens), slower learning, and answers that miss the point culturally or logically.

People tried a few things before. One common attempt was to just mix in whatever Korean data could be found, without carefully checking quality or balance. That often led to noisy learning, tokenizers that fell back to raw bytes (inefficient), and models that knew a bit of everything but not enough of the important things. Others tried dense models trained mostly on English, hoping knowledge would transfer. Some transfer happens, but not enough—especially for specialized domains (like Korean law or medicine) or for everyday cultural sense.

What was missing was a whole-system approach designed for underserved languages: (1) create the right kind of data at the right time (including synthetic data when real data is scarce), (2) teach in a smart order (a curriculum that balances languages, domains, and difficulty), and (3) scale up reinforcement learning so the model can combine small reasoning steps into big, correct solutions. Also, the tokenizer needed to be language-aware, so Korean text wouldn’t explode into too many tokens.

The stakes are real. Imagine: a Korean student studying for a national exam; a small clinic needing safe, accurate medical summaries in Korean; a local business wanting customer-support agents that understand Korean style and context; or anyone asking about history, culture, or sensitive topics where the tone and facts must be right. If the AI stumbles here, users waste time, get frustrated, or—worse—receive wrong or insensitive advice. Getting this right makes AI fairer and more useful to more people.

Solar Open presents a practical, connected solution. It’s a large (102B) Mixture-of-Experts model designed to be efficient and strong for both Korean and English. The team synthesized 4.5 trillion tokens of high-quality, domain- and reasoning-focused data to fill the gaps. They trained with a progressive bilingual curriculum across roughly 20 trillion tokens, increasing quality thresholds and synthetic ratios over time to sharpen skills. Then, they used a scalable RL framework (SnapPO) to push reasoning and alignment further—cleanly separating generation, reward scoring, and training so each part could grow without breaking the rest. A custom tokenizer with a large vocabulary and Korean-aware rules improved efficiency and math/code formatting. Together, these pieces aim not just to build one good model, but to show a repeatable recipe any team could use to lift up other underserved languages, too.

02 Core Idea

Aha! Moment in one sentence: If you can’t find enough great data for an underserved language, make it—teach it in the right order—and use scalable RL to turn many small reasoning steps into strong, well-aligned answers.

Three analogies for the same idea:

  1. Chef school: When fresh ingredients (real data) are scarce, you practice with high-quality mock foods (synthetic data), follow a curriculum from easy chopping to full recipes, and then do taste tests with judges (rewards) to fine-tune your cooking.
  2. Sports training: Start with drills (atomic steps in pre/mid-training), study winning plays (SFT on successful trajectories), and then scrimmage with referees (RL with rewards) to learn how to combine moves under pressure.
  3. Building a bridge: Gather materials (data synthesis), design a blueprint (bilingual curriculum by phase), and stress-test it with simulations (RL) until it holds heavy loads (complex reasoning and alignment) safely.

Before vs After:

  • Before: Models mostly trained on English; Korean saw tokenization inefficiency, scarce domain data, and weaker reasoning, plus RL that was hard to scale across multiple goals.
  • After: A Korean-aware tokenizer, large-scale synthetic data, a step-by-step bilingual curriculum, and SnapPO RL make Korean capabilities competitive while keeping English strong. RL scales more simply across reasoning, safety, and preferences.

Why it works (intuition):

  • Early stages fill the model’s brain with many small logical moves across topics (so it has the building blocks). Mid-training adds multiple solution paths for hard problems (so it sees variations). SFT picks the winners—the clean, successful paths (so it knows what "good" looks like). RL then rewards combining those steps in new, better ways (so it generalizes and composes skills). The tokenizer ensures text is efficiently represented, so the model learns faster and uses context better. MoE keeps compute focused on the best specialists for each token, making big models efficient in practice.

Building Blocks (in dependency-aware order, each with the Sandwich pattern):

šŸž Hook: You know how a puppy learns tricks faster with treats than with long lectures? 🄬 Reinforcement Learning (RL): RL is a way for AI to learn by trying answers and getting scores (rewards) for better ones.

  • How it works: 1) Generate several answers; 2) Score each with a reward function; 3) Push the AI toward higher-scoring answers; 4) Repeat. A toy sketch follows this block.
  • Why it matters: Without RL, the AI copies patterns it saw; with RL, it learns to choose and improve. šŸž Anchor: A math problem gets partial credit for correct steps; RL nudges the model toward answers with more correct steps.
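
To make the generate-score-update cycle concrete, here is a toy, self-contained sketch. The `generate`, `reward`, and `update` functions are illustrative stand-ins for this explanation, not the paper's actual training stack:

```python
import random

# Toy "policy": the probability that the model shows its work step by step.
policy = {"show_steps": 0.5}

def generate(policy):
    """1) Sample an answer; here, just whether it shows its steps."""
    return {"shows_steps": random.random() < policy["show_steps"]}

def reward(answer):
    """2) Toy reward: partial credit is higher when correct steps are shown."""
    return 1.0 if answer["shows_steps"] else 0.2

def update(policy, answers, rewards, lr=0.1):
    """3) Nudge the policy toward the behavior of the highest-scoring answer."""
    best = answers[max(range(len(rewards)), key=rewards.__getitem__)]
    target = 1.0 if best["shows_steps"] else 0.0
    policy["show_steps"] += lr * (target - policy["show_steps"])

for _ in range(50):                                   # 4) repeat
    answers = [generate(policy) for _ in range(4)]
    scores = [reward(a) for a in answers]
    update(policy, answers, scores)

print(round(policy["show_steps"], 2))                 # drifts toward 1.0: step-by-step answers win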

šŸž Hook: Imagine a hospital with many specialists—heart, lungs, brain—each stepping in only when needed. 🄬 Mixture-of-Experts (MoE): MoE is a model where many expert sub-networks exist, but only a few get used per token.

  • How it works: 1) A router scores which experts fit the token; 2) Top experts process it; 3) Their outputs combine; 4) Load balancing keeps usage fair. A routing sketch follows this block.
  • Why it matters: Without MoE, you pay full price for all experts every time; with MoE, you get big-model quality at lower active compute. šŸž Anchor: When reading a legal sentence, legal experts turn on; for code, coding experts step in.
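
Below is a minimal PyTorch sketch of the router-and-top-k pattern. The sizes (8 experts, top-2) are placeholders chosen for readability; the report describes 128 experts with top-8 routing plus a shared expert and load-balancing losses, none of which are reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy MoE layer: a router picks top_k of n_experts per token (illustrative sizes)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # 1) scores how well each expert fits a token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)   # 2) keep only the best experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # 3) run chosen experts and combine outputs
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out                                     # 4) load-balancing losses omitted here

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([16, 64]); only 2 of 8 experts run per token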

šŸž Hook: Think of a school schedule that balances math and language arts across the year. 🄬 Bilingual Curriculum: A training plan that balances Korean and English, raises quality thresholds over time, and covers key domains.

  • How it works: 1) Start broad and noisy; 2) Filter harder each phase; 3) Increase high-quality synthetic data; 4) End with specialized topics.
  • Why it matters: Without a curriculum, the model learns unevenly and wastes tokens. šŸž Anchor: Early lessons show many small logic steps; later lessons focus on Korean culture and tough reasoning.

šŸž Hook: When there aren’t enough practice worksheets, teachers make new ones that match the test style. 🄬 Synthetic Data Generation: Creating high-quality, realistic training text when real data is scarce.

  • How it works: 1) Generate from capable open models; 2) Filter for quality and topic; 3) Balance difficulty; 4) Mix with real data. A pipeline sketch follows this block.
  • Why it matters: Without synthetic data, underserved languages stay underserved. šŸž Anchor: If you need Korean medical Q&A but can’t find enough, you synthesize them and verify quality.
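
A schematic sketch of the generate-filter-mix loop follows. `teacher_generate` and `quality_score` are placeholders for the capable open models and quality classifiers mentioned above, and difficulty balancing (step 3) is omitted for brevity:

```python
import random

def teacher_generate(seed_topic: str) -> str:
    """Placeholder for sampling a synthetic Q&A from a capable open model."""
    return f"[synthetic Korean Q&A about {seed_topic}]"

def quality_score(text: str) -> float:
    """Placeholder for an educational-quality / topical classifier score in [0, 1]."""
    return random.random()

def synthesize(topics, per_topic=50, threshold=0.7):
    kept = []
    for topic in topics:                               # 1) generate from a capable model
        for _ in range(per_topic):
            sample = teacher_generate(topic)
            if quality_score(sample) >= threshold:     # 2) keep only high-quality, on-topic samples
                kept.append(sample)
    return kept

synthetic = synthesize(["Korean medical Q&A", "Korean law"])
real = ["[real Korean document]"] * 100
mixture = real + synthetic                             # 4) mix with real data
random.shuffle(mixture)
print(len(synthetic), "synthetic samples kept;", len(mixture), "training samples total")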

šŸž Hook: Picture a relay race where runners (steps) hand off the baton cleanly, each at top speed. 🄬 SnapPO: A decoupled RL framework that splits generation, reward scoring, and training into separate steps with cached results.

  • How it works: 1) Generate multiple answers and store their probabilities; 2) Compute rewards offline; 3) Train with GSPO off-policy; 4) Iterate. A decoupling sketch follows this block.
  • Why it matters: Without decoupling, changing goals or rewards means rebuilding the whole pipeline. šŸž Anchor: Add more reward types (math correctness, safety, style) without redoing generations.
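
Here is a highly simplified sketch of the decoupling idea, with generation, reward scoring, and training written as separate stages that communicate only through cached records. The file format and function names are illustrative assumptions, not SnapPO's actual interface:

```python
import json, math, random
from pathlib import Path

CACHE = Path("rollouts.jsonl")   # illustrative cache file shared between stages

def generation_stage(prompts, n_samples=4):
    """Stage 1: sample answers and cache them with their log-probs (stand-in values here)."""
    with CACHE.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            for _ in range(n_samples):
                record = {"prompt": prompt,
                          "answer": f"[sampled answer to: {prompt}]",
                          "logprob": math.log(random.uniform(0.05, 0.9))}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

def reward_stage(score_fns):
    """Stage 2: score cached answers offline; adding a new reward type needs no regeneration."""
    records = [json.loads(line) for line in CACHE.open(encoding="utf-8")]
    for r in records:
        r["reward"] = sum(fn(r["answer"]) for fn in score_fns)
    return records

def training_stage(records):
    """Stage 3: an off-policy update would consume (answer, cached log-prob, reward) here."""
    mean_reward = sum(r["reward"] for r in records) / len(records)
    return [r["reward"] - mean_reward for r in records]   # simple baseline-subtracted signal

generation_stage(["2 + 2 = ?", "ķ•œźµ­ģ˜ 수도는?"])
scored = reward_stage([lambda answer: 1.0 if "answer" in answer else 0.0])
advantages = training_stage(scored)
print(len(advantages), "cached rollouts ready for a GSPO-style update")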

šŸž Hook: Cutting veggies into neat pieces makes cooking faster and tastier. 🄬 Tokenization: Breaking text into pieces (tokens) that the AI reads; a Korean-aware tokenizer keeps pieces meaningful and compact.

  • How it works: 1) Large vocabulary; 2) Keep digits whole; 3) Preserve spaces for code; 4) Oversample Korean. An efficiency sketch follows this block.
  • Why it matters: Without a good tokenizer, Korean gets too many tokens and loses meaning. šŸž Anchor: Fewer tokens per sentence = faster training and more context in memory.
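
The sketch below shows why bytes-per-token matters. The token counts are made-up illustrations for a generic byte-fallback tokenizer versus a Korean-aware one, not measurements of the actual Solar Open tokenizer:

```python
def bytes_per_token(text: str, n_tokens: int) -> float:
    """Compression efficiency: more UTF-8 bytes covered per token is better."""
    return len(text.encode("utf-8")) / n_tokens

sentence = "ķ™˜ģžė” ź³ ķ˜ˆģ••ģ— 대핓 ģ–“ė–»ź²Œ ģƒė‹ķ•“ģ•¼ ķ• ź¹Œģš”?"   # a Korean medical-style question

# Hypothetical token counts: a byte-fallback tokenizer vs. a Korean-aware large vocabulary.
token_counts = {"byte-fallback tokenizer": 34, "Korean-aware tokenizer": 11}

for name, n in token_counts.items():
    print(f"{name}: {n} tokens, {bytes_per_token(sentence, n):.1f} bytes/token")

# Fewer tokens per sentence means fewer steps to read the same text during training
# and more room left in the context window for long reasoning chains.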

šŸž Hook: Solving puzzles needs both pieces and a plan. 🄬 Reasoning Capability: The AI’s ability to chain steps into full solutions.

  • How it works: 1) Learn atomic steps (pre/mid-training); 2) Study successful paths (SFT); 3) Practice with scores (RL); 4) Recombine steps.
  • Why it matters: Without reasoning, answers sound nice but miss the solution. šŸž Anchor: For multi-step math, the model writes a clear plan, checks steps, and reaches the right answer.

03 Methodology

At a high level: Inputs (mixed real + 4.5T synthetic tokens, bilingual) → Data Curriculum (quality filters + domain balance) → Pre-training (MoE model learns broad steps) → Mid-training (generate diverse reasoning trajectories) → SFT (curate successful solutions) → RL with SnapPO (compose, align, and scale) → Output (a bilingual 102B MoE model strong in Korean and solid in English).

Step A: Tokenizer and Data Setup

  • What happens: Build a Korean-aware, large-vocab tokenizer (196,608 tokens) with digit splitting and whitespace preservation; assemble ~19.7T tokens (about 20T) with English general, math/code, Korean general, Japanese/multilingual, and domain-specific (finance, law, medical); add 4.5T high-quality synthetic tokens. A configuration sketch follows this list.
  • Why it exists: Efficient tokens mean fewer steps to read the same sentence; synthetic data fills Korean gaps and provides reasoning-rich material.
  • Example: A Korean reasoning answer with neat math formatting becomes shorter in tokens, so longer chains of thought fit into the context window.
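
A configuration-style sketch of this setup follows. The category names and the 196,608-token vocabulary come from the text, but the mixture proportions are assumptions for illustration, not the report's actual ratios:

```python
# Assumed, illustrative pre-training mixture; the report's exact ratios are not reproduced here.
DATA_MIXTURE = {
    "english_general":        0.40,
    "math_and_code":          0.20,
    "korean_general":         0.20,
    "japanese_multilingual":  0.05,
    "domain_finance_law_med": 0.05,
    "synthetic_high_quality": 0.10,   # the synthetic share grows in later curriculum phases
}

TOKENIZER_CONFIG = {
    "vocab_size": 196_608,        # large vocabulary from the report
    "split_digits": True,         # consistent digit handling per Step A's "digit splitting"
    "preserve_whitespace": True,  # keep meaningful spaces/indentation for code
    "oversample_korean": True,    # extra Korean coverage when building the vocabulary
}

TOTAL_TOKENS = 19.7e12            # ~19.7T pre-training tokens, including 4.5T synthetic

assert abs(sum(DATA_MIXTURE.values()) - 1.0) < 1e-9
for name, share in DATA_MIXTURE.items():
    print(f"{name:<24} {share * TOTAL_TOKENS / 1e12:5.2f}T tokens")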

Step B: Progressive Bilingual Curriculum (Pre-training)

  • What happens: Train in phases: start broad and noisy (low thresholds, 10% synthetic), then raise quality filters and synthetic ratios (up to ~64%), ending with specialized high-quality Korean culture, math, and code. Filters: general-quality classifier, educational quality score, and embedding-based topic selection. A phase-schedule sketch follows this list.
  • Why it exists: Early coverage builds many small logical moves; later focus sharpens tough skills without wasting tokens on low-quality text.
  • Example: A low-quality blog post gets filtered out in later phases, but a high-scoring Korean history article is kept for cultural knowledge.
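
The sketch below captures the shape of such a phase schedule. Apart from the 10% and ~64% synthetic ratios mentioned above, the number of phases and the threshold values are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    quality_threshold: float   # minimum classifier score needed to keep a document
    synthetic_ratio: float     # share of synthetic tokens in this phase
    focus: str

# Illustrative schedule: thresholds rise and the synthetic share grows across phases.
CURRICULUM = [
    Phase("broad",       quality_threshold=0.2, synthetic_ratio=0.10, focus="wide bilingual coverage"),
    Phase("refined",     quality_threshold=0.5, synthetic_ratio=0.35, focus="higher-quality reasoning data"),
    Phase("specialized", quality_threshold=0.8, synthetic_ratio=0.64, focus="Korean culture, math, code"),
]

def keep_document(score: float, phase: Phase) -> bool:
    """Apply the current phase's quality filter to a document's classifier score."""
    return score >= phase.quality_threshold

for phase in CURRICULUM:
    print(f"{phase.name:<12} keep if score >= {phase.quality_threshold:.1f}, "
          f"{phase.synthetic_ratio:.0%} synthetic, focus: {phase.focus}")

print(keep_document(0.6, CURRICULUM[2]))   # False: a mid-quality document is dropped late in training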

Step C: Mid-training (Reasoning Trajectories)

  • What happens: Generate multiple solution paths (2–5 per query) for tough problems using capable open models; include long-context samples; mix back a slice of high-quality pre-training data to avoid forgetting. A sampling sketch follows this list.
  • Why it exists: Seeing many ways to solve the same problem teaches flexible step combinations—the ingredients for later RL.
  • Example: For a difficult geometry proof, the data shows a synthetic write-up with several distinct approaches, side by side.
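
A schematic sketch of sampling several distinct solution paths per hard query; `solve_with_teacher` and the style list are placeholders for prompting a capable open model, not the report's actual pipeline:

```python
import random

def solve_with_teacher(query: str, style: str) -> str:
    """Placeholder for prompting a capable open model to solve the query in a given style."""
    return f"[{style} solution to: {query}]"

STYLES = ["direct algebra", "case analysis", "work backwards", "draw a diagram", "coordinate proof"]

def sample_trajectories(query: str, k_min: int = 2, k_max: int = 5) -> list[str]:
    """Collect 2-5 distinct solution paths for the same hard problem."""
    k = random.randint(k_min, k_max)
    return [solve_with_teacher(query, style) for style in random.sample(STYLES, k)]

hard_query = "Prove that the three medians of a triangle meet at a single point."
for trajectory in sample_trajectories(hard_query):
    print(trajectory)

# Mid-training pairs each hard query with all of its solution paths, alongside
# long-context samples and a slice of high-quality pre-training data to limit forgetting.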

Step D: SFT (Supervised Fine-Tuning)

  • What happens: Curate successful reasoning trajectories (the best-quality full solutions), teach instruction following and the chat template (<|think|> for internal reasoning). Use a difficulty estimator: if capable models disagree a lot, the query is likely harder; choose a balanced difficulty mix. A difficulty-estimation sketch follows this list.
  • Why it exists: The model needs a clear sense of what "good reasoning" and "good formatting" look like before exploring with RL.
  • Example: Two math solutions enter; the clean, step-by-step one with a correct answer is selected for SFT; messy or incorrect ones are filtered.
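
Below is a sketch of a disagreement-based difficulty estimate in the spirit described above. The answer lists and bucket cutoffs are placeholders; the real estimator presumably compares full solutions from several capable models rather than short final answers:

```python
from collections import Counter

def estimate_difficulty(model_answers: list[str]) -> float:
    """More disagreement among capable models -> higher estimated difficulty (0 = all agree)."""
    most_common_count = Counter(model_answers).most_common(1)[0][1]
    return 1.0 - most_common_count / len(model_answers)

# Placeholder final answers from several capable models for two queries.
print(estimate_difficulty(["42", "42", "42", "42"]))   # 0.0 -> likely easy
print(estimate_difficulty(["17", "23", "17", "36"]))   # 0.5 -> disagreement, likely hard

def balanced_mix(queries_with_scores, easy_cut=0.25, hard_cut=0.75):
    """Bucket queries so the SFT set keeps a balanced spread of difficulty."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for query, score in queries_with_scores:
        key = "easy" if score <= easy_cut else "hard" if score >= hard_cut else "medium"
        buckets[key].append(query)
    return buckets

print(balanced_mix([("warm-up sum", 0.0), ("olympiad geometry", 0.9), ("word problem", 0.5)]))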

Step E: RL with SnapPO (Two Phases)

  • What happens: SnapPO decouples the loop. Generation: use vLLM to sample many answers and store log-probs. Reward: compute correctness scores for STEM, execution checks for code, multi-part scores for agents, reward models for open-ended writing, and safety checks. Training: use GSPO (a stable, memory-efficient policy optimization) to learn off-policy from cached responses. Iterate. A simplified update sketch follows this list.
  • Why it exists: Decoupling makes it easy to add domains, tweak rewards, or scale compute without entangling everything.
  • Example: To increase safety, you add new refusal/safe-completion rewards and re-train with the same cached generations; no need to regenerate all answers.
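
The sketch below shows a sequence-level, clipped off-policy update in the spirit of GSPO, consuming cached log-probs and group rewards. It is a simplification for intuition, not the paper's exact objective or SnapPO's implementation, and all numeric values are illustrative:

```python
import torch

def gspo_style_loss(new_logprobs, old_logprobs, rewards, lengths, clip_eps=0.2):
    """Simplified sequence-level clipped objective in the spirit of GSPO (not the exact paper loss).

    new_logprobs / old_logprobs: summed token log-probs per cached response, shape (G,)
    rewards: scalar reward per response in the group, shape (G,)
    lengths: response lengths in tokens, shape (G,)
    """
    # Group-normalized advantage: compare each response to its group's mean reward.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Sequence-level importance ratio, length-normalized so long answers don't dominate it.
    ratio = torch.exp((new_logprobs - old_logprobs) / lengths)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

# Toy group of 4 cached responses to one prompt.
old_lp = torch.tensor([-42.0, -55.0, -38.0, -61.0])      # stored by the generation stage
new_lp = old_lp + torch.tensor([0.5, -1.0, 1.2, -0.3])   # recomputed under the current policy
rew = torch.tensor([1.0, 0.0, 1.0, 0.2])                 # e.g. correctness + safety + style rewards
lens = torch.tensor([120.0, 180.0, 95.0, 210.0])

print(float(gspo_style_loss(new_lp, old_lp, rew, lens)))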

Secret Sauce 1: MoE done simply

  • The model uses MoE everywhere (no dense layers), a shared expert for stability, top-8-of-128 experts, and load balancing to keep experts fairly used. This delivers big-model quality with less active compute per token.

Secret Sauce 2: Engineering for speed

  • Hierarchical sharding (HSDP) scales to 60 nodes; careful dtype fixes and MoE kernel paths bump throughput to 7,200 tokens/sec on B200 GPUs; Arrow file sharding slashes data loading from hours to minutes.

Secret Sauce 3: Difficulty-aware synthesis

  • A difficulty estimator guides which queries to generate and keep, so the training set stays challenging enough to grow reasoning without overwhelming the model.

Concrete mini-walkthrough with data:

  • Input: A Korean medical Q&A seed lacks enough real examples. Synthetic pipeline generates realistic doctor–patient questions with correct, safe advice. Educational scoring and topic clustering keep the best ones. Mid-training creates multiple solution paths (diagnostic reasoning variants). SFT chooses the clearest path. RL adds rewards for factual accuracy, medical safety style, and cultural sensitivity in Korean. Output: The model answers new Korean medical questions clearly, safely, and in culturally appropriate tone.

04 Experiments & Results

The Test: The team measured general knowledge, domain expertise, math/code reasoning, instruction following, preference alignment, agent skills, and long-context understanding in both Korean and English. They cared not only about raw scores but also about whether the model was particularly strong where it promised to be (Korean domains) while staying competitive in English.

The Competition: They compared against strong open baselines roughly similar in size or ambition: gpt-oss-120b (medium/high variants) and GLM-4.5-Air, among others.

The Scoreboard with context:

  • Korean general knowledge: KMMLU ā‰ˆ 73.0 (a couple of points above a strong peer), KMMLU-Pro ā‰ˆ 64.0, CLIcK ā‰ˆ 78.9, HAE-RAE v1.1 ā‰ˆ 73.3. In plain words: solid improvements, showing the curriculum boosted culturally grounded knowledge.
  • Korean domain strength: Finance (KBankMMLU) ā‰ˆ 65.5, Law (KBL) ā‰ˆ 65.5, Medical (KorMedMCQA) ā‰ˆ 84.4. This is like getting an A in medical and strong B+/A- in finance and law, beating peers by 1–9 percentage points depending on the category.
  • Korean reasoning and following: Ko-AIME 2024/2025 ā‰ˆ 80.3/80.0 and HRM8K ā‰ˆ 87.6 show strong math chops; Ko-IFEval ā‰ˆ 87.5 shows solid instruction following; Ko-Arena Hard v2 ā‰ˆ 79.9 signals strong preference alignment.
  • English general knowledge: MMLU ā‰ˆ 88.2 and MMLU-Pro ā‰ˆ 80.4 (competitive with peers), GPQA-Diamond ā‰ˆ 68.1 (tough grad-level science).
  • English math: AIME 2024 ā‰ˆ 91.7, AIME 2025 ā‰ˆ 84.3, HMMT ā‰ˆ 73.3/80.0—this is a big leap over many baselines on several math sets, like jumping from a B to an A/A+.
  • Code: LiveCodeBench v6 ā‰ˆ 74.2—competitive but behind some variants of gpt-oss.
  • Alignment and writing: Arena Hard v2 ā‰ˆ 74.8 and Writing Bench ā‰ˆ 7.51—good preference alignment and writing.
  • Agents and long context: Tau2 scores are solid; long-context AA-LCR shows room to grow versus some baselines.

Surprising findings:

  • The bilingual curriculum plus synthetic data let Solar Open match a similar-scale baseline’s English performance with fewer tokens (about 48%–77% of their budget depending on language)—that’s like finishing a marathon faster with less fuel.
  • The tokenizer mattered more than one might expect for Korean reasoning and coding: higher bytes-per-token efficiency translated to faster training and longer usable contexts.
  • Agentic abilities got a strong start from high-quality simulation data—even before RL—showing that good synthetic trajectories can bootstrap complex multi-tool skills.

Takeaway: Solar Open leads across many Korean tasks (especially domain benchmarks like medical) while staying competitive in English knowledge and math. The results validate the end-to-end recipe: oversized, language-aware tokenizer + progressive bilingual curriculum + targeted synthetic data + decoupled RL training.

05 Discussion & Limitations

Limitations:

  • Lowest-resource languages may still be hard: even with aggressive synthesis, it’s unclear how well this recipe works when there are extremely few reference materials or cultural anchors.
  • Reward design and alignment are delicate: RL depends on good reward functions; poor or biased rewards can steer the model the wrong way.
  • Some English areas (e.g., certain code benchmarks, ultra-long-context) trail specialized baselines; further targeted training would help.
  • The approach requires careful data governance: ensuring synthetic data is accurate, safe, and license-compliant is nontrivial.

Required Resources:

  • Significant compute: roughly 480 B200 GPUs and months of training time; robust storage and fast networking.
  • Engineering depth: MoE stability, tokenizer design, scalable data pipelines, and RL infrastructure (vLLM + GSPO + decoupled caching) demand experienced teams.
  • Curation effort: building and maintaining quality filters, difficulty estimators, cultural sensitivity datasets, and domain-specialized corpora.

When NOT to Use:

  • If you have a tiny budget and only need a small model for simple tasks; a lighter bilingual model may suffice.
  • If your primary goal is code generation SOTA or ultra-long-context reasoning, you might prefer specialized models or add targeted continued training.
  • If you cannot support RL or curation pipelines, a pure SFT approach on existing open data may be more practical initially.

Open Questions:

  • Language scaling laws: How does adding a language shift performance elsewhere under fixed compute/data budgets?
  • Minimal viable synthesis: What’s the smallest, smartest synthetic recipe that still unlocks big gains for a new language?
  • Reward robustness: How to design rewards that generalize across domains without overfitting to benchmarks?
  • Tokenizer transfer: Best practices for expanding tokenizers and continually training existing models to add new languages efficiently.

06 Conclusion & Future Work

Three-sentence summary: Solar Open shows how to build a strong bilingual LLM for an underserved language by combining massive, high-quality synthetic data, a progressive bilingual curriculum, and a scalable, decoupled RL framework (SnapPO). A Korean-aware tokenizer and MoE architecture make learning efficient and context-friendly, while SFT and RL phases turn many small steps into reliable, well-aligned reasoning. The result is a model that leads on many Korean tasks and remains competitive in English.

Main achievement: A reusable, end-to-end methodology—oversized language-aware tokenization, quality-raising curricula with large-scale synthesis, and decoupled RL—that lifts an underserved language to near-frontier performance without sacrificing bilingual strength.

Future directions: Test the recipe on even lower-resource languages; refine reward models for broader, safer alignment; push long-context and code with targeted continued training; and study principled language scaling laws for fair multi-language growth. Also explore efficient ways to expand tokenizers and continually train existing models as new languages are added.

Why remember this: It’s a blueprint for language equity in AI—showing that with the right data, curriculum, and RL design, we can bring high-quality reasoning and culturally aware answers to communities that have been left behind, and do so in a scalable, repeatable way.

Practical Applications

  • Build strong LLMs for other underserved languages by reusing the synthesis–curriculum–SnapPO recipe.
  • Create high-quality synthetic domain datasets (e.g., medical, legal) when real data is scarce or sensitive.
  • Use difficulty estimators to balance training sets so models keep improving on genuinely hard problems.
  • Adopt language-aware tokenizers to improve efficiency and context length in multilingual deployments.
  • Scale RL training across multiple goals (reasoning, safety, preferences) using a decoupled SnapPO pipeline.
  • Simulate agent tool-use trajectories to bootstrap planning and error recovery without expensive real logs.
  • Target continued training for weak spots (e.g., specific math topics, code styles, long-context tasks).
  • Deploy safe-completion responses for sensitive topics (e.g., self-harm) to reduce harm while offering support.
  • Leverage MoE to get big-model quality with lower active compute costs in production.
  • Continuously refine rewards (correctness, style, cultural sensitivity) without regenerating data by caching outputs.
#Solar Open#Mixture-of-Experts#bilingual LLM#Korean LLM#synthetic data generation#curriculum learning#tokenization#reinforcement learning#SnapPO#GSPO#pretraining#SFT#preference alignment#agent simulation#underserved languages