Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Key Summary
- The paper builds special Turkish legal AI models called Mecellem by teaching them from the ground up and then giving them more law-focused lessons.
- They trained encoder models (ModernBERT, 155M and 403M parameters) from scratch on 112.7B tokens and found the best checkpoints by testing real retrieval tasks during training, not just by watching training loss.
- This downstream-aware checkpoint trick mattered: the best retrieval score showed up before the lowest training loss, which means 'lowest loss' isn’t always the best model for real work.
- With light, single-stage post-training (contrastive learning with smart negative filtering), the 155M model rivaled much larger models and reached top-tier scores on Turkish legal retrieval.
- They also took decoder models (Qwen3 1.7B and 4B) and continually pre-trained them on Turkish legal text using a careful, four-phase curriculum to avoid forgetting general language.
- This continual pre-training cut legal text perplexity by up to 43.1% (1.7B, multi-phase) and 36.2% (4B, single-phase), showing strong domain adaptation.
- A custom tokenizer and Turkish morphology filters (suffix entropy and lemma diversity) helped the models handle the agglutinative structure of Turkish better.
- Long context matters in law: shortening sequence length during post-training hurt regulation and case law retrieval a lot, showing that legal tasks need long documents.
- Overall, Mecellem shows a cost-effective path: pre-train once well, pick checkpoints by real tasks, then add small, smart post-training instead of huge multi-stage pipelines.
- These models and datasets are open-source, enabling better Turkish legal search, assistance, and RAG systems.
Why This Research Matters
Better Turkish legal AI can save hours of manual searching, reduce mistakes, and make justice information more accessible. With models that truly understand Turkish morphology and long legal texts, lawyers and citizens get more accurate answers faster. Businesses can build reliable RAG systems for compliance and contracts without the cost of massive training pipelines. Government services and legal clinics can scale assistance with trustworthy retrieval and summaries. The open-source release means universities, startups, and public institutions can adopt and improve the models. Ultimately, this approach is a reusable blueprint for other low-resource languages and formal domains beyond law.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine learning to read a super long, complex rulebook in a language where one word can pack the meaning of a whole sentence. That’s Turkish law for AI models: long, careful, and full of tiny details packed into word endings.
🥬 The World Before:
- General large language models (LLMs) are usually trained mostly on English. They do okay across many topics, but they struggle when texts are not in English and when the writing is very formal and precise—like laws, court decisions, and regulations.
- Turkish adds another twist: it’s agglutinative. Lots of meaning sticks onto a single word using suffixes. That makes splitting words and understanding grammar harder for models that learned mainly on English.
- On top of that, legal documents are long, filled with special terms, and must follow strict logic and rules. Getting them wrong can cause misunderstandings or bad advice.
🍞 Anchor example: Suppose someone searches, “What law controls data privacy for hospitals in Turkey?” A general model might miss specific law names or mix up which rule applies. A law-smart Turkish model should find and explain the correct statute quickly and clearly.
🍞 Hook: You know how a chef needs special recipes for baking vs. grilling? Likewise, AI needs special training for special domains.
🥬 The Problem:
- Existing Turkish models weren’t deeply trained on legal language. Fine-tuning alone (tiny, last-minute lessons) often wasn’t enough to capture the complex grammar and logic of legal Turkish.
- Models often used the lowest training loss to pick checkpoints. But the authors found that in Turkish legal retrieval, the best real-world performance appeared before the minimum loss—so choosing by loss alone could pick the wrong model.
🍞 Anchor example: It’s like practicing math until you memorize the answers but forget how to solve new problems. You get great practice scores but stumble on real tests.
🍞 Hook: Picture a library robot who can either write stories (decoder) or file and compare documents (encoder). Both jobs matter in law.
🥬 Failed Attempts:
- Heavy, multi-stage pipelines (like synthetic data + big supervised phases) can be great, but they are expensive and complex to reproduce.
- Simple fine-tuning on small sets didn’t give deep legal understanding and sometimes caused catastrophic forgetting—losing general language skills while learning legal ones.
🍞 Anchor example: A student who crams only for biology might forget math. A good plan keeps both.
🍞 Hook: Imagine teaching in levels: start simple, then add tougher content so the student doesn’t get overwhelmed.
🥬 The Gap:
- There wasn’t a widely shared, cost-effective recipe for Turkish legal models that:
- trains encoders from scratch with Turkish-aware tokenization,
- picks checkpoints by real legal tasks (like retrieval), and
- adapts decoders through careful continual pre-training (CPT) without forgetting.
🍞 Anchor example: A school that first teaches Turkish basics, then law words, then long legal reasoning, finally practices mixed challenges—the paper builds that exact school for AI.
🍞 Hook: Why should non-researchers care?
🥬 Real Stakes:
- Lawyers, judges, journalists, and citizens need quick, reliable legal search and summaries in Turkish.
- Companies need trustworthy RAG (Retrieval-Augmented Generation) systems for compliance and contracts.
- Government services and law clinics can save time and reduce errors with better tools.
🍞 Anchor example: When a small business asks, “Do I need consent under PDPL for this customer data use?”, a domain-smart model can retrieve the right article and explain it in plain Turkish.
02 Core Idea
🍞 Hook: You know how a coach doesn’t just pick the player who looks best in warmups but the one who actually scores in the game?
🥬 The Aha! Moment (one sentence): Don’t pick the model checkpoint by lowest training loss—pick it by how well it performs on real legal retrieval, then add efficient, targeted post-training and carefully designed continual pre-training to master Turkish law.
🍞 Anchor example: The authors kept checking how well the model retrieved Turkish legal texts during training and stopped when that score was best—even if the training loss wasn’t the lowest.
🍞 Hook #1: Imagine learning Turkish law like climbing stairs: first everyday Turkish, then legal words, then very long, tricky documents.
🥬 Analogy 1 (Curriculum):
- What it is: A step-by-step lesson plan that gets tougher over time.
- How it works: Start with general texts → move to legal articles and cases → practice very long, formal laws and decisions → final mixed review.
- Why it matters: It prevents the model from getting overwhelmed and forgetting what it already knows.
🍞 Anchor: Like piano lessons: scales first, then songs, then full recitals.
🍞 Hook #2: Think of a metal detector that helps you find the shiny parts in a beach of sand.
🥬 Analogy 2 (Checkpoint by downstream):
- What it is: Choosing model snapshots by real task scores (like retrieval) instead of just training loss.
- How it works: During pre-training, test on legal retrieval, pick the snapshot with the highest retrieval score.
- Why it matters: In Turkish legal tasks, best performance often comes before minimum loss; choosing by loss would miss the sweet spot.
🍞 Anchor: A runner’s fastest lap can happen before they feel fully warmed up—so you stop the clock when the lap is fastest, not when heart rate is lowest.
🍞 Hook #3: Picture a magnifying glass that focuses on what matters and filters out distractions.
🥬 Analogy 3 (Smart post-training):
- What it is: Contrastive learning with false-negative filtering to train embeddings.
- How it works: Pull matching query–document pairs together, push mismatches apart, while using a guide model to avoid pushing away true matches by accident.
- Why it matters: Legal texts can be very similar; filtering out false negatives keeps the model from learning the wrong lesson.
🍞 Anchor: If two contract clauses mean the same thing with different wording, don’t teach the model to treat them as far apart.
Before vs After:
- Before: Pick checkpoints by lowest loss; small fine-tunes; English-centric tokenizers; short sequence limits.
- After: Pick checkpoints by retrieval; add contrastive post-training with false-negative filtering; use a Turkish-aware tokenizer; keep long contexts; and use CPT with curriculum.
Why It Works (intuition):
- Real legal tasks prize precise matching over generic language guessing, so test on retrieval during training.
- Turkish morphology needs better tokenization and longer sequences; otherwise, suffix-rich words and long laws break understanding.
- Gradual CPT keeps general skills while adding legal depth, avoiding catastrophic forgetting.
Building Blocks (with mini sandwiches):
- 🍞 You know how filling in missing words helps you understand a story’s flow? 🥬 Masked Language Modeling (MLM): Predict hidden tokens from context; it teaches deep language structure; without it, encoders won’t learn strong bidirectional signals. 🍞 Example: “Kişisel verilerin işlenmesi için … rıza gerekir.” Predicting “açık” (explicit) improves legal phrasing sense. (A minimal masking sketch follows this list.)
- 🍞 Imagine a librarian who can read very long books and remember where important parts are. 🥬 ModernBERT: An encoder with long context, efficient attention, and RoPE; without long context, it misses links across pages. 🍞 Example: Reading a multi-page court decision end-to-end.
- 🍞 Think of study sessions that continue after school to master a subject. 🥬 Continual Pre-training (CPT): Keep training on legal Turkish so the model absorbs domain patterns; without CPT, the model stays too general. 🍞 Example: After base training, reading thousands of Turkish regulations fine-tunes legal voice.
- 🍞 Picture changing from story-writing mode to summarizing mode. 🥬 Decoder-to-Encoder Conversion: Modify a decoder to create embeddings by removing the LM head, switching to bidirectional attention, and mean pooling; without proper data and stages, quality lags behind purpose-built encoders. 🍞 Example: Turning a chatty model into a sharp search indexer.
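To make the MLM building block concrete, here is a minimal PyTorch sketch of the RoBERTa-style 80/10/10 replacement rule mentioned later in the methodology. It is an illustration, not the authors' code: it masks individual tokens for simplicity, whereas the paper applies the rule with 30% span masking.

```python
import torch

def mask_for_mlm(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                 mlm_prob: float = 0.30):
    """Token-level variant of the RoBERTa-style 80/10/10 masking rule."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select which positions take part in the MLM loss (30% in the paper's setup).
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                     # -100 = ignored by cross-entropy

    # 80% of selected positions become [MASK].
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remainder (10% overall) become a random token; the rest stay unchanged.
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return input_ids, labels

# Toy usage with made-up token IDs.
ids, labels = mask_for_mlm(torch.tensor([[11, 42, 7, 99, 3, 250]]), mask_token_id=4, vocab_size=32000)
print(ids, labels)
```

Positions that are not selected get the label -100, so only the masked positions contribute to the loss and teach the encoder.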
03 Methodology
At a high level: Input (large Turkish legal + general corpus) → Pre-train encoders from scratch (MLM; ModernBERT) with ongoing retrieval checks → Post-train embeddings (contrastive + false-negative filtering) → Continual pre-train decoders (Qwen3) with curriculum → Convert decoders to embeddings (optional) → Evaluate on Turkish MTEB + legal tasks.
Step-by-step with the Sandwich pattern for each key step:
- Data preparation and Turkish-aware cleaning
- 🍞 Hook: You know how gardeners remove weeds so plants grow better?
- 🥬 What it is: A careful pipeline that OCRs scanned texts, cleans noise, filters by quality, deduplicates, and enforces Turkish morphology richness.
How it works:
- OCR with a vision-language model (DotsOCR) for complex PDFs (tables, formulas, multi-columns).
- Safety and quality filters (FineWeb), language ID (GlotLID), URL filters.
- Turkish morphology filters: keep samples with healthy suffix entropy and lemma diversity (SE ≥ 75%, LD ≥ 50%).
- Exact and semantic deduplication (SemHash) so repeated templates don’t dominate.
Why it matters: Without this, the model absorbs junk, repeats, and poor Turkish patterns, hurting downstream legal accuracy. (The morphology filters are sketched in code after the example below.)
- 🍞 Example: Cleaning court decisions so numbered articles, footers, and repeated headers don’t confuse the model.
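The paper's exact suffix-entropy and lemma-diversity formulas are not reproduced in this summary, so the sketch below assumes one common reading: suffix entropy as the normalized Shannon entropy of a document's suffix distribution, and lemma diversity as the share of distinct lemmas, with the SE ≥ 75% and LD ≥ 50% thresholds quoted above. The morphological analyzer that produces lemmas and suffixes is left as a placeholder.

```python
import math
from collections import Counter

def suffix_entropy(suffix_counts: Counter) -> float:
    """Normalized Shannon entropy of a document's suffix distribution, in [0, 1].
    Higher values mean a richer, more varied use of Turkish suffixes."""
    total = sum(suffix_counts.values())
    if total == 0 or len(suffix_counts) < 2:
        return 0.0
    probs = [c / total for c in suffix_counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(suffix_counts))   # divide by the maximum possible entropy

def lemma_diversity(lemmas: list[str]) -> float:
    """Share of distinct lemmas among all analyzed tokens (a type/token ratio)."""
    return len(set(lemmas)) / len(lemmas) if lemmas else 0.0

def keep_document(lemmas, suffix_counts, se_min=0.75, ld_min=0.50) -> bool:
    # Thresholds follow the paper: SE >= 75%, LD >= 50%.
    return suffix_entropy(suffix_counts) >= se_min and lemma_diversity(lemmas) >= ld_min

# Toy usage with pre-analyzed (lemma, suffix) pairs from some morphological analyzer.
analyses = [("çalış", "ıyor+lar"), ("ver", "ilen"), ("karar", ""), ("hukuk", "i")]
lemmas = [lemma for lemma, _ in analyses]
suffixes = Counter(s for _, s in analyses if s)
print(keep_document(lemmas, suffixes))
```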
- Tokenizer tuned for Turkish morphology
- 🍞 Hook: Imagine chopping carrots at natural joints, not random bits.
- 🥬 What it is: A BPE tokenizer with Llama-style pretokenization tailored for Turkish. How it works: Train on Turkish legal + web data; aim for tokens that match real morphemes (roots + suffixes), not random fragments. Why it matters: Bad splits break meaning in agglutinative words; good splits boost understanding and retrieval.
- 🍞 Example: çalışıyorlar → [çalış, ıyor, lar] instead of [ça, lı, şı, yor, lar].
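A minimal sketch of training such a tokenizer with the Hugging Face tokenizers library. Byte-level pretokenization stands in for the paper's Llama-style pretokenizer, and the vocabulary size, special tokens, and corpus path are placeholders rather than the paper's settings.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE as a stand-in for the paper's Llama-style pretokenization.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                                   # placeholder, not the paper's figure
    special_tokens=["<s>", "</s>", "<pad>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# "corpus.txt" is a placeholder for the cleaned Turkish legal + web corpus.
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# A morphology-friendly vocabulary should split near real morpheme boundaries,
# e.g. çalışıyorlar -> [çalış, ıyor, lar] rather than arbitrary fragments.
print(tokenizer.encode("çalışıyorlar").tokens)
```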
- Pre-train encoders (ModernBERT) from scratch with downstream-aware checkpointing
- 🍞 Hook: Think of practicing piano but timing when your song actually sounds best.
- 🥬 What it is: Train encoders via Masked Language Modeling (30% span masking; RoBERTa-style 80/10/10) on 112.7B tokens and pick checkpoints by legal retrieval scores.
How it works:
- Train ModernBERT-base (155M) and large (403M) on long sequences with RoPE and mixed local/global attention.
- During training, repeatedly evaluate legal retrieval (nDCG@10) and record the best-scoring checkpoint, which often appears before the minimum MLM loss.
Why it matters: In Turkish legal tasks, ‘lowest loss’ ≠ ‘best retrieval.’ The right snapshot gives better real-world embeddings. (A minimal selection sketch follows the example below.)
- 🍞 Example: Version v5 had better legal retrieval than v6 even though v6 had lower MLM loss.
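The selection logic itself is simple; what changes is the signal. The sketch below uses illustrative numbers (not figures from the paper) to show how choosing by legal nDCG@10 can pick an earlier checkpoint than choosing by MLM loss would.

```python
# Downstream-aware checkpoint selection, schematically. In practice each record would come
# from periodically embedding a held-out Turkish legal retrieval set with the checkpoint
# and scoring it with nDCG@10; the numbers below are made up for illustration.
records = [
    {"ckpt": "v4", "mlm_loss": 1.92, "legal_ndcg10": 41.0},
    {"ckpt": "v5", "mlm_loss": 1.85, "legal_ndcg10": 45.3},   # best retrieval...
    {"ckpt": "v6", "mlm_loss": 1.79, "legal_ndcg10": 44.0},   # ...despite a lower loss later
]

by_loss = min(records, key=lambda r: r["mlm_loss"])           # what loss-based selection picks
by_task = max(records, key=lambda r: r["legal_ndcg10"])       # downstream-aware choice

print("lowest-loss checkpoint:  ", by_loss["ckpt"])           # v6
print("downstream-aware choice: ", by_task["ckpt"])           # v5
```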
- Post-train encoders for embeddings with contrastive learning + false-negative filtering
- 🍞 Hook: You know how you learn better when your practice questions are neither too easy nor misleading?
- 🥬 What it is: Contrastive learning on MS MARCO-TR to pull true pairs close and push different texts apart, while filtering false negatives using a guide model (BGE-M3 or EmbeddingGemma-300M).
How it works:
- Use InfoNCE/Qwen3-InfoNCE or GISTEmbed.
- Precompute guide embeddings to flag near-duplicates that shouldn’t be treated as negatives.
- Keep long sequence lengths (1,024–2,048) to match legal document length.
Why it matters: Without filtering, the model would wrongly push apart truly similar legal passages; shortening sequences hurt regulation and caselaw retrieval. (The filtered contrastive loss is sketched below.)
- 🍞 Example: Two clauses defining ‘explicit consent’ in PDPL with different wording—filtering keeps them close.
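A compact PyTorch sketch of this kind of guide-filtered contrastive objective, in the spirit of GISTEmbed. It is an illustration under simplifying assumptions (only in-batch query–document negatives are considered), not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def guided_infonce(q_emb, d_emb, guide_q, guide_d, temperature=0.05, margin=0.0):
    """In-batch InfoNCE where a frozen guide model masks likely false negatives.

    q_emb, d_emb     : (B, H)  trainable embeddings; query_i is paired with doc_i
    guide_q, guide_d : (B, Hg) embeddings of the same texts from the guide model
    doc_j (j != i) is dropped as a negative for query_i when the guide scores it
    at least as close to query_i as the true positive doc_i (minus a margin).
    """
    q_emb, d_emb = F.normalize(q_emb, dim=-1), F.normalize(d_emb, dim=-1)
    guide_q, guide_d = F.normalize(guide_q, dim=-1), F.normalize(guide_d, dim=-1)

    sim = (q_emb @ d_emb.T) / temperature           # student similarity matrix (B, B)
    guide_sim = guide_q @ guide_d.T                 # guide similarity matrix (B, B)
    pos_guide = guide_sim.diagonal().unsqueeze(1)   # guide score of each true pair (B, 1)

    false_neg = guide_sim >= (pos_guide - margin)   # suspiciously similar off-diagonal pairs
    false_neg.fill_diagonal_(False)                 # never mask the true positive
    sim = sim.masked_fill(false_neg, float("-inf")) # remove them from the softmax

    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(sim, targets)

# Toy usage with random tensors standing in for encoder and guide outputs.
loss = guided_infonce(torch.randn(8, 256), torch.randn(8, 256),
                      torch.randn(8, 64), torch.randn(8, 64))
```

The frozen guide (e.g., BGE-M3 or EmbeddingGemma-300M) only decides which in-batch negatives look like false negatives; gradients flow through the student embeddings alone.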
- Continual Pre-Training (CPT) of decoders (Qwen3-1.7B and 4B) with curriculum
- 🍞 Hook: Think of a training plan: warm-up, main workout, endurance, and cool-down.
- 🥬 What it is: Keep training pre-trained decoders on Turkish-dominant legal data in phases to gain legal fluency without forgetting general language.
How it works:
- Phase 1: short general Turkish to stabilize.
- Phase 2: legal content (decisions, articles, regulations).
- Phase 3: very long, formal laws and theses.
- Phase 4: mixed refinement.
Why it matters: Without curriculum and replay, models can forget general skills; with it, perplexity on legal text drops a lot. (A schematic phase plan is sketched below.)
- 🍞 Example: The 1.7B model’s overall legal perplexity fell by 43.1% from base to final phase.
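Schematically, the curriculum can be written down as a phase plan like the one below. The phase order and focus follow the description above; the mixture ratios, sequence lengths, and replay share are illustrative placeholders, not the paper's settings.

```python
# Illustrative continual pre-training curriculum (values are placeholders).
CPT_PHASES = [
    {"phase": 1, "focus": "short general Turkish (stabilization)",
     "mixture": {"general_tr": 0.9, "legal_tr": 0.1}, "max_seq_len": 2_048},
    {"phase": 2, "focus": "legal content: decisions, articles, regulations",
     "mixture": {"general_tr": 0.3, "legal_tr": 0.7}, "max_seq_len": 4_096},
    {"phase": 3, "focus": "very long, formal laws and theses",
     "mixture": {"general_tr": 0.2, "legal_tr": 0.8}, "max_seq_len": 8_192},
    {"phase": 4, "focus": "mixed refinement with general-domain replay",
     "mixture": {"general_tr": 0.4, "legal_tr": 0.6}, "max_seq_len": 8_192},
]

for p in CPT_PHASES:
    print(f"Phase {p['phase']}: {p['focus']}  mixture={p['mixture']}")
```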
- Decoder-to-Encoder conversion (optional)
- 🍞 Hook: Imagine switching a camera from ‘video’ to ‘photo’ mode.
- 🥬 What it is: Remove LM head, make attention bidirectional, and use mean pooling to get embeddings out of a decoder. How it works: Identity projection keeps 2,048-dim output; post-train on MS MARCO-TR. Why it matters: Useful when you must reuse a decoder, but quality may lag behind purpose-built encoders unless you use large, multi-stage training like Qwen3-Embedding.
- 🍞 Example: The 4B converted model worked but didn’t beat the 155M purpose-built encoder on legal retrieval.
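A minimal sketch of the mean-pooling half of this conversion using Hugging Face transformers. The public Qwen3 base checkpoint is a placeholder for the CPT models, and the switch to fully bidirectional attention (which requires patching the attention masks inside the model) is omitted here.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen3-1.7B"                        # public base checkpoint as a placeholder
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.pad_token or tok.eos_token  # make sure padding is defined
model = AutoModel.from_pretrained(name).eval()  # backbone only: no LM head

def embed(texts, max_length=2048):
    batch = tok(texts, padding=True, truncation=True, max_length=max_length,
                return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens only
    return F.normalize(pooled, dim=-1)

print(embed(["Açık rıza nedir?"]).shape)        # (1, hidden_size); 2,048 for the 1.7B backbone
```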
- Evaluation and metrics
- 🍞 Hook: Think of report cards that test different skills: reading, writing, group work.
- 🥬 What it is: Turkish MTEB across classification, retrieval, clustering, pair classification, and STS, plus legal retrieval tasks (Contracts, Regulation, Caselaw). How it works: Embed queries and documents, compare with cosine similarity, and score with nDCG@10; language modeling measured by perplexity (lower is better). Why it matters: Different tasks reveal different strengths; legal tasks stress long, precise matching.
- 🍞 Example: A model scoring 80+ nDCG@10 on Contracts means it finds the right clauses near the top.
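For intuition, here is a toy, self-contained version of the retrieval scoring: rank documents by cosine similarity of normalized embeddings, then score the ranking with nDCG@10. It mirrors how MTEB-style retrieval evaluation works, not the exact benchmark code.

```python
import numpy as np

def ndcg_at_10(ranked_doc_ids, relevant: dict) -> float:
    """nDCG@10 with graded relevance; `relevant` maps doc_id -> gain."""
    gains = [relevant.get(d, 0.0) for d in ranked_doc_ids[:10]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevant.values(), reverse=True)[:10]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: rank documents by cosine similarity of normalized embeddings.
q = np.array([0.1, 0.9]); q /= np.linalg.norm(q)
docs = {"d1": [0.2, 0.8], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
docs = {k: np.array(v) / np.linalg.norm(v) for k, v in docs.items()}
ranking = sorted(docs, key=lambda k: float(q @ docs[k]), reverse=True)
print(ranking, ndcg_at_10(ranking, {"d1": 1.0}))
```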
Secret Sauce:
- Downstream-aware checkpointing revealed that the best real-world retrieval can happen before the lowest pre-training loss.
- Turkish-aware tokenization + morphology filters keep linguistic integrity.
- Long sequence lengths protect legal understanding.
- Lightweight, smart post-training (GISTEmbed + guide) delivers strong gains without huge multi-stage cost.
- Curriculum CPT adds legal depth while preserving general skills.
04 Experiments & Results
🍞 Hook: Picture three kinds of tests: Can you find the right page in a giant law book (retrieval)? Can you speak legal Turkish fluently (perplexity)? Can you do lots of everyday tasks (MTEB categories)?
🥬 The Test:
- Retrieval (nDCG@10): Measures how high relevant legal documents show up—like getting the right pages at the top.
- Perplexity: Lower means the model ‘expects’ legal text better—like reading smoothly without stumbling.
- MTEB Score: Average across task types (classification, clustering, pair classification, retrieval, STS)—like an overall GPA.
🍞 Anchor: Scoring 80 in Contracts retrieval is like placing the correct clause near the top of search results most of the time.
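Perplexity is just the exponential of the average next-token cross-entropy, so it can be read straight off a causal LM's loss. A minimal sketch, using the public Qwen3 base checkpoint as a placeholder for the CPT models evaluated in the paper:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-1.7B"                        # placeholder for the evaluated checkpoints
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "Kişisel verilerin işlenmesi ilgili kişinin açık rızasına bağlıdır."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss          # mean next-token cross-entropy
print("perplexity:", math.exp(loss.item()))     # lower = the model 'expects' legal Turkish better
```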
🍞 Hook: Competing against bigger kids but using smarter training.
🥬 The Competition:
- Strong baselines: EmbeddingGemma-300M, BAAI/bge-m3, and a tuned bge-m3-stsb; classic Turkish BERT variants; ModernBERT baselines; converted decoders.
- Our models: Encoders trained from scratch (155M and 403M), plus decoders with CPT (1.7B, 4B) and converted to embeddings.
🍞 Anchor: The 155M model often keeps up with 300M–567M models—like a lighter runner with great technique.
🥬 The Scoreboard (with context):
- Encoder embeddings (post-trained):
- 155M (Mursit-Base-TR-Retrieval): MTEB 55.86, Legal 47.52, with contract retrieval ~80.40 nDCG@10.
- 403M (Mursit-Large-TR-Retrieval): MTEB 56.43, Legal 46.42.
- These are in the top tier for Turkish legal retrieval despite having fewer parameters than some SOTA models.
- Decoder CPT (language modeling):
- Qwen3-1.7B CPT (four phases): Overall legal perplexity improved by 43.1% (e.g., Tax Law from 7.29→3.98), showing strong domain adaptation.
- Qwen3-4B CPT (single phase): Overall legal perplexity improved by 36.2% (e.g., Fund Law from 8.277→4.897), showing that bigger models can adapt even without multi-phase structure.
- Checkpoint selection finding: Best retrieval often happened before reaching minimum MLM loss; later checkpoints sometimes regressed in legal retrieval.
- Long-context ablation: Reducing post-training sequence length to 256 (to ‘match’ dataset stats) hurt legal scores by up to 23.6% in regulation—proof that legal work needs long windows.
- Production efficiency: The 155M model ranked fourth overall (92.36%) behind much bigger pipelines, a strong accuracy-to-cost trade-off.
🍞 Anchor: Think of it like this—our 155M runner places right behind three elite sprinters, but uses far less energy.
Surprising Findings:
- ‘Lower training loss’ didn’t always mean better retrieval. The best legal search sometimes happened earlier, suggesting over-optimizing MLM can hurt embedding usefulness in Turkish legal tasks.
- Converted decoders (even 4B) didn’t beat the purpose-built 155M encoder on legal retrieval without huge multi-stage training; architecture and training recipe matter more than raw size.
- Turkish morphology filters worked best when applied early and globally; small post-hoc refinement helped the 1.7B a bit but slightly hurt the 4B—hinting larger models need more data diversity at the end.
05 Discussion & Limitations
🍞 Hook: Even great tools have limits, like a powerful magnifying glass that still needs good lighting.
🥬 Limitations:
- Data coverage: Although very large, some subdomains may still be underrepresented, affecting niche legal topics.
- Training resources: Pre-training encoders from scratch and CPT at scale needs serious compute, though still cheaper than multi-stage pipelines.
- Converted decoders: Without synthetic data and multi-stage supervision (like Qwen3-Embedding does), converted decoders lag behind purpose-built encoders on legal retrieval.
- Metric gaps: Perplexity and nDCG@10 show strong trends, but full legal reasoning quality (citations, compliance) still benefits from reward-model-based evaluations and human review.
Required Resources:
- For encoders: Multi-GPU training for pre-training; modest GPUs for post-training with cached guide embeddings.
- For decoders: H100-class clusters for CPT; stable storage and fast interconnects for large token throughput.
- For deployment: Vector DB (e.g., Qdrant), quantization, and indexing tuned for long embeddings and higher throughput.
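As an assumed deployment sketch (not described in the paper), indexing the embeddings in Qdrant with cosine distance might look like the following; the vector size and the embed() helper are placeholders for the post-trained encoder.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")                      # in-memory instance for the sketch
client.create_collection(
    collection_name="turkish_legal",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # size is a placeholder
)

def embed(text: str) -> list[float]:                   # placeholder for the Mecellem encoder
    return [0.0] * 768

client.upsert(
    collection_name="turkish_legal",
    points=[PointStruct(id=1, vector=embed("KVKK madde 5"), payload={"source": "KVKK"})],
)
hits = client.search(collection_name="turkish_legal", query_vector=embed("açık rıza"), limit=5)
```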
When NOT to Use:
- If you only need casual Turkish conversations or short general Q&A, generic multilingual models may suffice.
- If your system cannot handle long contexts, legal retrieval quality will drop; consider upgrading memory/context first.
- If you need top-tier cross-lingual retrieval across many languages at once, specialized multilingual SOTA embeddings might be a better fit.
Open Questions:
- How far can downstream-aware checkpointing go—can we automate it across multiple legal tasks to pick a single universal checkpoint?
- Can smaller decoder models match 4B perplexity gains with smarter curricula or orthogonal-subspace methods to further reduce forgetting?
- What is the best balance between morphology-based filtering and data diversity at large scales?
- How can we add trustworthy citation and legal reasoning checks into training, beyond retrieval and perplexity?
🍞 Anchor: Think of the next step as turning a good map into a GPS: we have the roads and the signs; now we want live guidance and safe routes with citations.
06 Conclusion & Future Work
🍞 Hook: Imagine a law student who learns Turkish perfectly, studies lots of law books, and practices finding the right article fast—this paper teaches AI to do that.
🥬 3-Sentence Summary:
- Mecellem builds Turkish legal models by pre-training encoders from scratch on a Turkish-dominant corpus, choosing checkpoints by real retrieval scores, and then applying efficient contrastive post-training.
- It also continually pre-trains decoders with a curriculum to gain legal fluency without forgetting general language, showing large perplexity reductions.
- The result is competitive, cost-effective legal retrieval and language modeling in Turkish, with open resources for the community.
Main Achievement:
- Proving that downstream-aware checkpoint selection plus light, smart post-training can rival larger, multi-stage pipelines—and that curriculum CPT brings big legal-domain gains without sacrificing general skills.
Future Directions:
- Explore automated multi-task checkpoint selection, stronger decoder-to-encoder conversions (with synthetic data and staged supervision), and richer legal reward models to guide reasoning and citations.
- Build practical Turkish legal RAG assistants that combine these embeddings with trustworthy citation and verification layers.
Why Remember This:
- Because it shows a practical recipe: respect the language (tokenizer + morphology), respect the task (choose checkpoints by retrieval), and respect memory (long context + curriculum CPT). It’s a blueprint others can copy for low-resource, complex domains—not just Turkish law.
Practical Applications
- Build a Turkish legal search engine that ranks the most relevant statutes and case law at the top.
- Create a contract review assistant that retrieves and highlights matching clauses across long documents.
- Deploy a compliance Q&A bot for PDPL that cites specific articles with precise language.
- Integrate embeddings into a RAG system to ground legal answers in retrieved sources.
- Set up a law-student study tool that finds precedent cases and summarizes long decisions.
- Use the decoder CPT model to draft legally styled Turkish text and then verify with retrieval.
- Implement a legal hotline assistant that quickly locates official gazette references.
- Automate de-duplication and clustering of large court decision archives for faster research.
- Monitor regulation changes by embedding and comparing updated texts over time.
- Localize corporate policies to Turkish legal norms by retrieving relevant national regulations.