
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Beginner
Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras · 1/6/2026
arXiv | PDF

Key Summary

  • The paper teaches language models using extra "language homework" made from the same raw text, so they learn grammar and meaning, not just next-word guessing.
  • This framework, called L2T, turns plain text into many small tasks (like fixing typos, unshuffling sentences, and filling missing pieces) to give explicit practice on structure.
  • Compared to standard next-word training, models trained with L2T score about 2.8 points higher on average on BLiMP, a grammar-and-meaning test, with gains of up to 11.3 points on tough syntax.
  • L2T speeds up learning early in training, giving models a head start as early as 5 billion tokens into training, especially in semantics.
  • On general reasoning benchmarks, L2T stays competitive; performance is mostly stable when there is plenty of unique raw text.
  • When raw text is limited and a big model is used, too much L2T can slightly hurt knowledge-heavy tasks, so balancing L2T and raw text matters.
  • The 14 tasks span characters, words, sentences, and full passages, nudging the model to notice morphology, syntax, and semantics.
  • L2T needs no external labels or helper models; it auto-builds tasks from any raw text, like making quizzes from a book.
  • Even with as little as 25% L2T mixed in, linguistic competence improves, but at least 25% raw text is needed to keep broad world knowledge strong.

Why This Research Matters

Better linguistic competence means AI tools are less likely to misunderstand instructions, mix up grammar, or misread subtle meanings. That leads to clearer emails, safer summaries, and more reliable support in education, customer service, and accessibility. Because L2T needs no external labels or helper models, any organization can upgrade pretraining by transforming its own text into structured practice. The early-learning boost helps models get smarter faster, saving compute and time. With the right balance of raw text and tasks, we keep broad knowledge while gaining stronger grammar and meaning. Over time, this approach can reduce ā€œconfident but wrongā€ outputs that come from shallow pattern matching.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re learning soccer by only watching matches on TV. You might copy some moves, but without drills (like passing practice), you’ll miss the rules that make great play possible.

🄬 The Concept (Causal Language Modeling, CLM): CLM is how many language models learn: they read tons of text and try to guess the next word. How it works:

  1. Feed the model a sentence prefix.
  2. Ask it to predict the next token.
  3. Repeat billions of times. Why it matters: This builds fluency and world knowledge, but it doesn’t directly teach grammar rules or deep structure—so the model can sound right while missing key linguistic details. šŸž Anchor: Like finishing each sentence by guessing the next word. You get good at common patterns, but you aren’t quizzed on why the sentence is correct.
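
To make the next-word objective concrete, here is a minimal PyTorch sketch of the causal language modeling loss; the tiny stand-in model and random token ids are illustrative only, not the models or data used in the paper.

```python
# Minimal sketch of causal language modeling: predict token t+1 from tokens up to t.
# The tiny stand-in model, vocabulary size, and random token ids are illustrative only.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),   # token ids -> vectors
    nn.Linear(hidden, vocab_size),      # stand-in for a full transformer stack
)

tokens = torch.randint(0, vocab_size, (1, 16))   # one sequence of 16 token ids
logits = model(tokens)                           # shape: (1, 16, vocab_size)

# Shift by one position: predictions at positions 0..14 are scored against tokens 1..15.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),      # model's guesses
    tokens[:, 1:].reshape(-1),                   # targets are simply the next tokens
)
loss.backward()                                  # one gradient step of "guess the next word"
```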

šŸž Hook: You know how a student can memorize poems but struggle to explain the grammar? That’s memorizing without understanding.

🄬 The Concept (Linguistic Competence): Linguistic competence is knowing the rules and meanings that make language work: word forms (morphology), sentence structure (syntax), and meaning in context (semantics). How it works:

  1. Notice patterns (like verb endings or subject–verb agreement).
  2. Apply structure to form correct sentences.
  3. Interpret meanings correctly even when wording changes. Why it matters: Without it, a model may repeat phrases correctly but fail on tricky grammar or long-distance dependencies. šŸž Anchor: It’s the difference between saying, ā€œHe run fastā€ and knowing it should be ā€œHe runs fast,ā€ and also understanding what ā€œbankā€ means in different contexts.

The World Before: Big models trained with CLM became impressive storytellers and fact recallers. But researchers noticed a ā€œstochastic parrotā€ effect: models often copy surface patterns without grasping underlying grammar rules. They could finish sentences but sometimes broke subtle rules, struggled with long-range connections in sentences, and tripped on precise phenomena like island constraints or filler–gap relations.

šŸž Hook: Think of a school that uses only reading aloud, never grammar worksheets or writing exercises.

🄬 The Concept (Rote Memorization): Rote learning is copying patterns without understanding the rules behind them. How it works:

  1. See many examples.
  2. Memorize frequent sequences.
  3. Repeat them later. Why it matters: You perform fine on familiar texts but falter on tricky or new structures. šŸž Anchor: Saying ā€œthe cats runsā€ because you’ve heard ā€œthe cat runsā€ a lot—mixing up the rule when the subject changes.

The Problem: Next-word prediction alone doesn’t explicitly train a model to answer questions like ā€œWhich sentence is grammatical?ā€ or ā€œWhich word is missing here and why?ā€ Those are the drills humans use when learning a language.

Failed Attempts: Some prior methods tweaked model architectures, used external tools, or required labeled data. Others staged complex curricula or used pretraining on artificial languages as a warm-up. These helped in part, but they often needed extra supervision or helper models, and they didn't cleanly show what happens if we simply reshape the learning signal from raw text itself.

The Gap: We lacked a simple, label-free way to give a decoder-style language model the same kind of structured practice humans get—across characters, words, sentences, and passages—right inside pretraining.

Real Stakes: If your assistant mixes up grammar, it may misread instructions, misunderstand questions, or answer confidently but incorrectly. In everyday life, that means worse summaries, clumsy translations, or chatbots that fail on subtle rules. In education or law/medicine support tools, tiny grammatical misunderstandings can flip meanings.

šŸž Hook: Imagine turning any book into a set of mini homeworks—fill-in-the-blank, fix the typos, unshuffle the sentences.

🄬 The Concept (Structured Input–Output Pairs): Turn raw text into pairs like [messed-up input] → [fixed output] or [question about the text] → [answer]. How it works:

  1. Take a chunk of text.
  2. Apply a transformation or ask a structured question.
  3. Use the original text as the answer key. Why it matters: This gives targeted practice on language rules without needing external labels. šŸž Anchor: Deleting spaces in a paragraph and asking the model to restore them teaches word boundaries and grammar cues.
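
A minimal sketch of this idea, using the space-restoration example above; the prompt wording ("Insert spaces:", "Answer:") is an illustrative assumption, not necessarily the paper's exact template.

```python
# Minimal sketch: build a label-free input/output pair from a chunk of raw text.
def make_space_task(chunk: str) -> tuple[str, str]:
    corrupted = chunk.replace(" ", "")                  # transformation: drop all whitespace
    prompt = f"Insert spaces: {corrupted}\n\nAnswer:"   # illustrative prompt wording
    return prompt, chunk                                # the original text is the answer key

prompt, answer = make_space_task("I like apples")
print(prompt)   # Insert spaces: Ilikeapples\n\nAnswer:
print(answer)   # I like apples
```

Because the untouched chunk is always the answer key, any amount of raw text yields an equal amount of ready-made practice, with no human labeling.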

That’s why this paper proposes L2T—a way to bake structured language-learning drills right into pretraining, so models practice the rules as well as the words.

02Core Idea

šŸž Hook: You know how coaches add drills to games so players master footwork, passing, and strategy—not just scoring?

🄬 The Concept (L2T: Language Learning Tasks): L2T is a pretraining recipe that mixes standard next-word prediction with 14 auto-made language-learning tasks built from the same raw text. How it works:

  1. Split text into chunks.
  2. Randomly pick a task (like fix typos, unshuffle words, fill the middle paragraph).
  3. Build a structured input–output pair from that chunk.
  4. Train the model on a mix of these pairs and plain raw text. Why it matters: The model stops over-relying on surface patterns and starts learning deeper structure—morphology, syntax, and semantics. šŸž Anchor: It’s like turning every chapter of a book into mini grammar drills—and mixing them with normal reading—to train a smarter reader.
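
A minimal sketch of that mix, with two toy task builders standing in for the paper's 14; the 50/50 ratio and prompt wording are illustrative assumptions, not the paper's configuration.

```python
import random

# Two toy task builders standing in for the 14 real ones (prompts are illustrative).
def shuffle_task(chunk: str) -> tuple[str, str]:
    words = chunk.split()
    random.shuffle(words)
    return f"Restore the word order: {' '.join(words)}\n\nAnswer:", chunk

def char_count_task(chunk: str) -> tuple[str, str]:
    return f"How many characters are in this text?\n{chunk}\n\nAnswer:", str(len(chunk))

def to_training_text(chunk: str, l2t_ratio: float = 0.5) -> str:
    """With probability l2t_ratio, turn the chunk into an L2T pair; otherwise keep raw text."""
    if random.random() >= l2t_ratio:
        return chunk                                          # plain reading: next-word practice
    prompt, answer = random.choice([shuffle_task, char_count_task])(chunk)
    return f"{prompt} {answer}"                               # drill: trained with the same loss
```

Whether a chunk stays raw or becomes a drill, training still uses one next-token objective, which is what keeps the recipe simple.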

The ā€œAha!ā€ in one sentence: Give the model many small, structure-focused puzzles made directly from its reading material, and it learns language rules faster and better.

Three analogies:

  • Gym analogy: Not just playing matches (raw text) but also doing drills (L2T tasks) builds balanced strength and skills.
  • Lego analogy: You don’t just see castles; you also get step-by-step rebuild tasks that reveal how pieces fit (syntax and morphology).
  • Glasses analogy: L2T puts on different ā€œlensesā€ (character, word, sentence, discourse tasks) so the model notices details it would miss with plain reading.

Before vs After:

  • Before: The model excels at fluent continuation and memorized facts but stumbles on subtle grammar tests and long-range structure.
  • After: The model still reads and recalls but now scores higher on focused linguistic tests (like BLiMP), especially tricky syntax (e.g., island effects), and it learns these skills earlier in training.

šŸž Hook: Think of three layers of language: word shapes, sentence skeletons, and meaning muscles.

🄬 The Concepts (Morphology, Syntax, Semantics):

  • Morphology is word-building (like walk → walks → walked).
  • Syntax is sentence structure (who did what to whom, and in what order).
  • Semantics is meaning in context (bank: money place vs river side). How it works: L2T tasks stress each layer—fixing typos (morphology), unshuffling words and sentences (syntax), and filling missing context (semantics and discourse). Why it matters: Balanced training strengthens all layers, so the model handles both local and long-distance dependencies. šŸž Anchor: When a model restores the right word order in a shuffled sentence, it proves it’s using structure, not just memorized phrases.

Why it works (intuition, no equations):

  • Multi-task variety disrupts lazy shortcuts: the model can’t just memorize next-word statistics when sometimes the input is scrambled, masked, or missing pieces.
  • Auto-supervision: The original text is the answer key, so tasks are plentiful and consistent.
  • Inductive biases: Repeated exposure to structural rewrites nudges the model to build internal representations that track roles, agreements, and dependencies.

Building blocks:

  • 14 tasks across four levels: character (typo, masked char, space, char count), word (masked word, random replacement, shuffle, last phrase, token type count), sentence (delete anomaly, reorder sentences), and discourse (fill middle, complete second half, generate from one-word prefix).
  • Two data setups: Disjoint (abundant unique documents) and Shared (limited documents reused for both raw and L2T), to separate effects of structure vs. data diversity.
  • Same compute budget (100B tokens) and model sizes (500M, 1B) to fairly compare with standard CLM.

šŸž Anchor: If you take one news article and make a ā€œfix the typos,ā€ ā€œfind the odd sentence,ā€ and ā€œfill the missing middleā€ puzzle from it, then mix those with normal reading, you’ll practice structure in multiple ways without needing any teacher-made labels.

03Methodology

High-level overview: Raw text → segment into chunks → apply one of 14 L2T tasks to make input–output pairs → pack and mix with normal raw-text samples → train the language model to predict tokens for both.

Step-by-step like a recipe (a code sketch of the key steps follows the list):

  1. Collect and segment text
  • What happens: Use a high-quality English corpus (FineWeb-Edu). Break documents into sentences and chunks (~512 tokens) so tasks operate on sensible units.
  • Why it exists: Tasks like reordering sentences or filling the middle need clean boundaries; messy splits confuse the model.
  • Example: A 700-token article becomes two chunks, each containing complete sentences.
  2. Choose a task and transform the chunk
  • What happens: Randomly pick 1 of 14 tasks. Produce a structured input (e.g., masked words, shuffled order, missing middle) and set the original or corrected text as the output.
  • Why it exists: Each task emphasizes a different linguistic skill (morphology, syntax, semantics, discourse). Variety prevents overfitting to surface patterns.
  • Example: Word Shuffle: ā€œThe dog quickly ran homeā€ → ā€œThe quickly dog ran homeā€. The model must restore the original order.
  3. Format into an input–output pair
  • What happens: Present the transformed text followed by a simple prompt prefix like ā€œAnswer:ā€ and then the target output.
  • Why it exists: Clear formatting tells the model when to produce the answer. Consistency makes learning easier.
  • Example: Input: ā€œInsert spaces: Ilikeapples\n\nAnswer:ā€ Output: ā€œI like applesā€.
  4. Pack and mix with raw text
  • What happens: Concatenate various L2T pairs and also include plain raw-text segments, shuffling them to fill the training sequence length.
  • Why it exists: Raw text maintains world knowledge and broad fluency; L2T injects structural practice. The mix balances knowledge and competence.
  • Example: A sequence might contain: (a) a typo-fix pair, (b) a sentence-reorder pair, and (c) a regular raw-text stretch.
  5. Train with next-token prediction on everything
  • What happens: Use standard causal language modeling loss, computed on all tokens, including instructions, inputs, and outputs.
  • Why it exists: One objective, one optimizer—simple and scalable. The diverse inputs force the model to learn to parse and rewrite.
  • Example: The model learns to output ā€œthe sensitivity analysisā€ after a ā€œLast Phraseā€ multiple-choice prompt.
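
As noted above, here is a minimal sketch of steps 1, 3, and 4: segmenting a document into chunks, formatting a pair, and packing mixed samples into fixed-length training sequences. Counting words instead of tokens and the regex sentence splitter are simplifying assumptions; the paper works with roughly 512-token chunks and a real tokenizer.

```python
import random
import re

def segment(document: str, max_words: int = 100) -> list[str]:
    """Step 1 (simplified): split into sentences, then group them into chunks.
    The paper uses ~512-token chunks; counting words here is a stand-in."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def format_pair(prompt: str, answer: str) -> str:
    """Step 3: one consistent layout so the model always knows where the answer starts."""
    return f"{prompt}\n\nAnswer: {answer}"

def pack(samples: list[str], seq_words: int = 400) -> list[str]:
    """Step 4 (simplified): shuffle L2T pairs and raw chunks, then concatenate
    them until each training sequence is roughly full."""
    random.shuffle(samples)
    sequences, current, count = [], [], 0
    for sample in samples:
        current.append(sample)
        count += len(sample.split())
        if count >= seq_words:
            sequences.append("\n\n".join(current))
            current, count = [], 0
    if current:
        sequences.append("\n\n".join(current))
    return sequences
```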

The 14 tasks in action (mini-sandwich intros; a code sketch of two of them follows the list):

  • šŸž Hook: You’ve seen texts without spaces—hard to read! 🄬 The Concept (Space): Input removes all whitespace; the model must restore it. Why it matters: teaches word boundaries and grammar cues. šŸž Anchor: ā€œIlikeaā€ → ā€œI like aā€.
  • šŸž Hook: Typos make words look wrong. 🄬 The Concept (Typo): Random characters are corrupted; the model fixes them. Why it matters: sharpens subword and morphology skills. šŸž Anchor: ā€œindiaidualsā€ → ā€œindividualsā€.
  • šŸž Hook: Mixed-up word order is like scrambled eggs. 🄬 The Concept (Shuffle): Some words get shuffled; the model restores order. Why it matters: targets syntax and local dependencies. šŸž Anchor: ā€œThe quickly dog ran homeā€ → ā€œThe dog quickly ran homeā€.
  • šŸž Hook: Story paragraphs sometimes go missing. 🄬 The Concept (Fill Middle): Given start and end passages, the model produces the missing middle. Why it matters: tests long-range coherence. šŸž Anchor: Paragraph 1 + Paragraph 3 → complete Paragraph 2.
  • And similarly for Masked Word/Char, Random Word Replacement, Last Phrase, Token Type counting, Sentence Deletion, Reordering, Half, One, and Char Count.
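
As referenced above, here is a minimal sketch of two of these transformations, Typo and Fill Middle; the corruption rate, prompt wording, and paragraph handling are assumptions for illustration, not the paper's exact settings.

```python
import random
import string

def typo_task(chunk: str, rate: float = 0.05) -> tuple[str, str]:
    """Corrupt a small fraction of letters; the model must restore the original text."""
    chars = list(chunk)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < rate:
            chars[i] = random.choice(string.ascii_lowercase)   # "individuals" -> "indiaiduals"
    return f"Fix the typos: {''.join(chars)}\n\nAnswer:", chunk

def fill_middle_task(paragraphs: list[str]) -> tuple[str, str]:
    """Hide the middle of a passage; the model must write it from the surrounding context."""
    assert len(paragraphs) >= 3, "needs a beginning, a middle, and an end"
    first, *middle, last = paragraphs
    prompt = f"Write the missing middle.\nStart: {first}\nEnd: {last}\n\nAnswer:"
    return prompt, "\n".join(middle)
```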

Two data scenarios (sandwich):

  • šŸž Hook: It’s different to practice with many books versus the same book over and over. 🄬 The Concepts (Disjoint vs Shared): Disjoint: one half of documents for raw text and the other half for L2T; lots of unique material. Shared: the same documents feed both raw text and L2T; limited unique material. Why it matters: Separates the benefit of structure (L2T) from the benefit of seeing more unique documents. šŸž Anchor: Disjoint = two separate libraries; Shared = the same library repurposed two ways.

Secret sauce:

  • Multi-granularity signals: character → word → sentence → discourse.
  • No external labels: the original text is the answer key, so it scales.
  • Early learning boost: tasks trigger faster development in the ā€œwindow of maximal development,ā€ giving an early lead that persists.
  • Simple integration: still just next-token prediction, so no complicated loss juggling or extra models.

Concrete example walkthrough (sketched in code after the list):

  • Input chunk: ā€œThe cat sat on the mat. It purred contentedly.ā€
  • Pick task: Masked Word (15% words masked): ā€œThe [MASK] sat on the [MASK]. It purred contentedly.ā€
  • Pair: Input = transformed text + ā€œAnswer:ā€, Output = ā€œThe cat sat on the mat. It purred contentedly.ā€
  • Train: The model learns to infer masked words using context, strengthening morphology/semantics.
  • Mix with raw-text spans so it still learns general fluency and facts.
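
As noted above, here is the Masked Word step of this walkthrough as a minimal sketch; the 15% rate comes from the walkthrough, while the prompt layout is an illustrative assumption.

```python
import random

def masked_word_task(chunk: str, rate: float = 0.15) -> tuple[str, str]:
    """Replace roughly 15% of the words with [MASK]; the untouched chunk is the answer."""
    words = chunk.split()
    masked = [w if random.random() >= rate else "[MASK]" for w in words]
    return f"{' '.join(masked)}\n\nAnswer:", chunk

prompt, answer = masked_word_task("The cat sat on the mat. It purred contentedly.")
# prompt might read: "The [MASK] sat on the [MASK]. It purred contentedly.\n\nAnswer:"
# answer is always the original two sentences, so no labels are ever needed.
```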

04Experiments & Results

šŸž Hook: Think of a grammar bee, where two sentences sound similar, but only one is correct. Can the model tell which is right?

🄬 The Concept (BLiMP Benchmark): BLiMP is a set of tiny paired tests covering 12 linguistic phenomena across semantics, morphology, and syntax. How it works:

  1. Show two nearly identical sentences.
  2. Only one is grammatical.
  3. A strong model gives higher probability to the correct one. Why it matters: It isolates real linguistic knowledge instead of style or memorization. šŸž Anchor: ā€œHe runā€ vs ā€œHe runs.ā€ The model should prefer ā€œHe runs.ā€
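
To show how that preference is measured in practice, here is a minimal sketch that scores a minimal pair with a causal LM via Hugging Face Transformers; "gpt2" is only a placeholder checkpoint, not one of the paper's models.

```python
# Minimal sketch: prefer whichever minimal-pair sentence gets higher total log-probability.
# "gpt2" is a placeholder checkpoint; swap in the model you actually want to evaluate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)               # loss = mean next-token negative log-likelihood
    return -out.loss.item() * (ids.shape[1] - 1)   # total log-probability of the sentence

good, bad = "He runs.", "He run."
print("model prefers:", good if sentence_log_prob(good) > sentence_log_prob(bad) else bad)
```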

Setups and competitors (sandwich):

  • šŸž Hook: Training with lots of different books vs. reusing the same book changes what you learn. 🄬 The Concepts (Disjoint and Shared baselines): Baseline models trained only on raw text (next-word) at 500M and 1B parameters; L2T models use the same token budget (100B) but mix in the 14 tasks. Disjoint uses different documents for raw vs. L2T; Shared reuses the same ones. Why it matters: We compare apples to apples and test whether structure helps beyond extra data diversity. šŸž Anchor: Disjoint = two libraries; Shared = one library used two ways.

Scoreboard with context:

  • BLiMP overall (higher is better):
    • 500M Disjoint: Raw 78.6 → L2T 80.2 (about a letter-grade nudge from B to B+).
    • 1B Disjoint: Raw 79.0 → L2T 80.8.
    • 500M Shared: Raw 78.1 → L2T 80.9.
    • 1B Shared: Raw 78.9 → L2T 81.2.
    That’s about +2.8 points on average, with some phenomena up to +11.3 points.

  • Biggest gains: Island effects (a tough syntax constraint) jump by 6.9–11.3 points depending on size/setup. Many different L2T tasks contribute, suggesting the mix of local and global drills helps the model track long-distance dependencies better.

  • Little change: Determiner–noun agreement and ellipsis were already high with raw-only training, so there wasn’t much headroom to improve.

šŸž Hook: Students often learn the most in the first part of a course.

🄬 The Concept (Window of Maximal Development): Early in training (about the first 20–30B tokens), models improve rapidly. How it works:

  1. Compare models at various training steps.
  2. Look at gains in semantics/morphology/syntax.
  3. See who pulls ahead early. Why it matters: If you get a head start early, you often keep it. šŸž Anchor: At 5B tokens, L2T already leads by roughly +3 to +6.5 points in some areas, and the gap persists.

General-task benchmarks (reading comprehension, commonsense reasoning, language modeling):

  • With Disjoint data (lots of unique raw text), L2T stays essentially as strong as Raw (tiny average deltas like āˆ’0.87 for 500M and āˆ’0.07 for 1B).
  • With Shared data (limited unique raw text), L2T results are mixed:
    • 500M: slight average improvement (+0.15).
    • 1B: small average drop (āˆ’1.38), mainly on ARC (science-style multiple choice).
    This suggests that bigger models lean more on repeated raw exposure to reinforce factual knowledge; swapping some repeats for structured tasks can slightly reduce that reinforcement.

Surprises and takeaways:

  • Structure helps most where raw statistics struggle: complex syntax like island effects.
  • Some very hard dependencies (like certain filler–gap cases) remain challenging, hinting that extra targeted discourse-level tasks could help.
  • Balance matters: In a separate ratio study, using only L2T (0% raw) harms knowledge-heavy tasks; at least 25% raw text is healthy, and even 25% L2T gives clear linguistic gains.

05Discussion & Limitations

Limitations:

  • Task scope: Many tasks focus on sentence-level constraints. While discourse tasks exist (fill middle, half, one-word prefix), the hardest long-distance, cross-sentence dependencies still need more targeted drills.
  • Scale: Experiments use 500M and 1B parameter models under a 100B-token budget. Larger models may respond differently; some may need more raw text to keep world knowledge.
  • Ratio sensitivity: Too little raw text can weaken knowledge-intensive performance, especially for larger models in limited-data settings.

Resources needed:

  • A solid corpus (e.g., FineWeb-Edu) and compute comparable to 100B-token pretraining.
  • Standard frameworks (PyTorch, Transformers) and simple preprocessing (sentence segmentation, text packing).

When not to use or when to be careful:

  • If your top priority is memorizing niche facts with minimal data diversity (e.g., a specialized knowledge base), replacing many raw-text repetitions with L2T might slightly reduce fact retention; keep a higher raw-text share.
  • If you already run a giant model with abundant unique raw text and it meets your needs, the marginal benefit may be smaller; consider L2T mainly to accelerate early training or to target specific weaknesses (e.g., syntax).

Open questions:

  • What is the optimal curriculum? For example, heavier L2T early, then taper to more raw text as size grows?
  • Which new discourse-level tasks would best close gaps in filler–gap and other long-distance phenomena?
  • How does L2T behave in multilingual settings or low-resource languages?
  • Can adaptive mixing (monitoring a live grammar score) guide task selection and ratios on the fly?

06Conclusion & Future Work

Three-sentence summary: This paper proposes L2T, a simple way to turn raw text into many small language-learning tasks and mix them with standard next-word training. The result is faster and stronger growth in linguistic competence—especially on tricky syntax—while staying competitive on general tasks when raw text is sufficiently available. L2T needs no extra labels or helper models, just the original text as its own answer key.

Main achievement: Showing that structured, label-free tasks made directly from raw text can systematically boost grammar-and-meaning competence (BLiMP +2.8 points on average; up to +11.3 on hard syntax), and do so early in training.

Future directions: Explore richer discourse tasks to tackle remaining long-distance dependencies, design adaptive curricula that shift the L2T/raw mix over time, and extend to multilingual settings. Investigate behavior at larger scales and for domain-specific corpora.

Why remember this: L2T turns any corpus into a teacher, adding grammar drills without external supervision. It’s an easy add-on to pretraining that nudges models away from parroting and toward understanding, improving what matters most: using language correctly and meaningfully.

Practical Applications

  • Pretrain internal assistants that follow complex instructions more precisely by adding L2T to existing pipelines.
  • Improve grammar and coherence in summarization systems by mixing in sentence reordering and anomaly deletion tasks.
  • Strengthen translation models’ handling of word order and agreement via word/sentence shuffle and masked-word tasks.
  • Boost code-of-conduct or policy parsing by training on tasks that highlight syntax and long-distance dependencies.
  • Enhance educational chatbots’ clarity using discourse-level tasks (fill middle, half) to maintain multi-paragraph coherence.
  • Build robust text-cleaning tools (typo fix, space insertion) that also learn deeper morphology patterns.
  • Accelerate model training efficiency by emphasizing L2T tasks during early epochs, then tapering to more raw text.
  • Customize the L2T mix to target known weaknesses (e.g., add more sentence-level tasks if syntax scores lag).
  • Deploy in low-resource settings by generating label-free practice tasks from scarce domain text.
  • Benchmark internal models with BLiMP-like minimal pairs to monitor true linguistic growth over time.
#language model pretraining #causal language modeling #linguistic competence #self-supervised learning #data transformation #multi-task learning #morphology syntax semantics #BLiMP benchmark #discourse coherence #island effects #filler–gap dependencies #instruction-free supervision #pretraining curriculum #FineWeb-Edu #Qwen2.5