
ArXiv-to-Model: A Practical Study of Scientific LM Training

Intermediate
Anuj Gupta Ā· 2/19/2026
arXiv

Key Summary

  • This paper shows, step by step, how to train a 1.36-billion-parameter science-focused language model directly from raw arXiv LaTeX files using only 2 A100 GPUs.
  • The big lesson is that careful data engineering (filtering, cleaning, deduping, tokenizing) matters as much as fancy model designs, especially with limited compute.
  • They built an 80GB core scientific corpus and, after tokenization, trained on about 52.18B scientific tokens, then added 5B tokens for gentle post-training alignment.
  • Using a dense LLaMA-style transformer with 24 layers and a 102,400-token SentencePiece vocabulary gave stable training and strong LaTeX/math familiarity.
  • Training stability greatly improved with more data: small-data (20GB) runs were noisy; full-data (200GB processed) runs converged smoothly with perplexity around 4.2.
  • Storage and I/O were real bottlenecks—sometimes more limiting than raw GPU compute—so fast data pipelines were crucial.
  • Separating pretraining (formal scientific text) from post-training (Q&A and chat formatting) kept math skills strong without losing clarity in instructions later.
  • Tokenizer choice mattered: a LLaMA-compatible tokenizer avoided embedding and ID mismatches that can silently ruin training.
  • They used curriculum learning (start with prose, then add equations) and gradient accumulation to fit big batches on limited memory.
  • The work is a practical recipe others can copy to build domain-specialized models under moderate budgets.

Why This Research Matters

Scientific papers are the language of discovery, and this work shows how to teach an AI to read them fluently without needing giant secret datasets. With a practical, transparent recipe, smaller labs and classrooms can build helpful tools for math problem solving, proof drafting, and scientific writing support. By separating formal pretraining from later instruction alignment, the model keeps its math strengths while learning to be helpful in dialogue. The focus on storage and tokenizer correctness helps future builders avoid silent failures that waste money and time. Ultimately, this makes scientific AI development more accessible, reproducible, and useful for real learners and researchers.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re building a giant LEGO city. If your blocks are mixed with rocks, old pieces, and broken parts, the city will wobble. But if you sort, clean, and choose the right pieces, your city stands tall. Training a science-savvy AI is like that.

🄬 Filling (The Actual Concept):

  • What it is: This paper is a how-to story for training a scientific language model (LM) from raw arXiv LaTeX, focusing on the engineering steps that make it work.
  • How it works: The team gathers raw LaTeX sources, cleans and filters them, turns them into tokens (tiny text pieces), and trains a dense transformer model on that data, using careful schedules and checks to keep training stable.
  • Why it matters: Without these careful steps, you waste a lot of data, your model learns bad habits, and training can become unstable or even crash.

šŸž Bottom Bread (Anchor): Think of sorting your school notes: when you remove doodles, staple pages in order, and highlight key parts, you study better. The model ā€œstudiesā€ better too when its notes (data) are tidy.

— New Concept — End-to-End Training Pipeline šŸž Hook: You know how baking a cake goes from shopping for ingredients to mixing, baking, and frosting? 🄬 The Concept: An end-to-end training pipeline is the complete recipe for going from raw data to a trained model.

  • How it works (steps): (1) Collect arXiv LaTeX, (2) validate and extract files, (3) clean and normalize text, (4) deduplicate repeats, (5) tokenize, (6) assemble mixtures with sampling weights, (7) train the transformer with a smart curriculum, (8) evaluate.
  • Why it matters: Skip or mess up any step and you get less usable data or a confused model. šŸž Anchor: If you forget to preheat the oven, your cake bakes unevenly; if you skip archive validation, your training data ā€œbakesā€ wrong.
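The eight steps above can be sketched as a chain of small functions. This is a minimal illustration of the pipeline's shape, not the paper's actual code; the function names and toy logic are assumptions.

```python
# Minimal sketch of the eight-step pipeline. Function names and logic are
# illustrative stand-ins, not the paper's codebase.

def collect(paths):
    # (1) gather raw arXiv LaTeX archives
    return [p for p in paths if p.endswith(".tar.gz")]

def validate_and_extract(archives):
    # (2) skip corrupt tarballs, pull out .tex content (stubbed here)
    return [a.replace(".tar.gz", ".tex") for a in archives]

def clean(docs):
    # (3) normalize text, strip figure clutter (stubbed here)
    return [d.strip() for d in docs]

def deduplicate(docs):
    # (4) drop exact repeats while preserving order
    return list(dict.fromkeys(docs))

def tokenize(docs):
    # (5) split into tokens; a stand-in for SentencePiece
    return [d.split() for d in docs]

def pipeline(paths):
    # (6)-(8) mixture assembly, training, and evaluation would follow;
    # this sketch stops at tokenized documents.
    return tokenize(deduplicate(clean(validate_and_extract(collect(paths)))))
```

The point of the chain is that each stage's output is the next stage's input, so a mistake early on (say, keeping corrupt archives) silently degrades everything downstream.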

The World Before: Big language models could do impressive math and science, but their exact training data and cleaning steps were often secret or unclear. People knew models were powerful, but not the nitty-gritty of making a domain model straight from messy scientific sources like LaTeX.

The Problem: arXiv sources are tricky—many files per paper, custom macros, equations everywhere, weird formatting, and inconsistent metadata. Tiny choices (like removing short docs, handling withdrawn papers, or language detection) can make you lose tons of data or keep junk that confuses training.

Failed Attempts: Teams sometimes scraped PDFs instead of LaTeX (hard to parse), used general web tokenizers that split math badly, filtered languages too early (dropping valid math-heavy English docs), or trained on very small datasets (leading to unstable training and memorization).

— New Concept — Data Engineering šŸž Hook: Imagine cleaning your room so you can find your homework fast. 🄬 The Concept: Data engineering means organizing, cleaning, and preparing data so models can learn efficiently.

  • How it works: You filter by subject and date, remove withdrawn or super-short items, validate archives, and assemble mixtures that balance depth and variety.
  • Why it matters: Messy data slows learning or teaches the model the wrong lessons. šŸž Anchor: Like labeling folders for math, science, and history, the pipeline labels and balances topics so the model can ā€œstudyā€ each one well.

— New Concept — Preprocessing šŸž Hook: You wash fruit before eating it so you don’t taste dirt. 🄬 The Concept: Preprocessing cleans and formats raw LaTeX into meaningful text while keeping equations and structure.

  • How it works: Remove figures and formatting clutter, keep theorem-proof structures, concatenate source files, and fix malformed inputs.
  • Why it matters: If you strip out structure or equations, the model can’t learn scientific reasoning. šŸž Anchor: It’s like removing orange peels but keeping the juice.

— New Concept — LaTeX Extraction šŸž Hook: Squeezing juice from oranges is easy when they’re ripe and tough when they’re not. 🄬 The Concept: LaTeX extraction pulls the useful text and math from source archives.

  • How it works: Validate tarballs, find .tex files, follow \input/\include links, and extract content while keeping math.
  • Why it matters: Broken archives or odd layouts lead to data loss and gaps in the model’s knowledge. šŸž Anchor: If some oranges are rotten, you get less juice; if some LaTeX is malformed, you get fewer tokens.

The Gap: We needed a transparent, reproducible, engineering-first guide showing what to keep, what to toss, and how to train a stable, small scientific model with limited GPUs.

Real Stakes: Better scientific LMs can help students check homework steps, help researchers draft proofs, and make scientific writing clearer. For labs without giant budgets, knowing where bottlenecks hide—like storage speed, not just GPU power—saves time and money.

— New Concept — Scaling Laws šŸž Hook: If a plant is small, it needs less water; if it’s big, it needs more. There’s a rule of thumb for that. 🄬 The Concept: Scaling laws are rules that link model size and the amount of training data you should use.

  • How it works: Roughly, for 1.36B parameters, you’d like around 20Ɨ that many tokens (ā‰ˆ27B) for compute-optimal training.
  • Why it matters: Too few tokens wastes model capacity; too many can waste compute. šŸž Anchor: If you pour a tiny cup of water on a big tree, it won’t thrive; if you flood it, you waste water. The paper chose a data-rich regime for better stability in science text.
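The arithmetic behind the rule of thumb is simple enough to check directly; the numbers below come from the paper's reported figures.

```python
# Chinchilla-style rule of thumb: ~20 training tokens per parameter.
params = 1.36e9                            # model size from the paper
compute_optimal_tokens = 20 * params       # ā‰ˆ 27.2B tokens
actual_tokens = 52.18e9                    # the paper's data-rich choice
tokens_per_param = actual_tokens / params  # ā‰ˆ 38 tokens per parameter
```

At roughly 38 tokens per parameter, the run sits well above the ā‰ˆ20Ɨ compute-optimal point, which is the "data-rich regime" credited with smoother training on symbolic text.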

02 Core Idea

šŸž Top Bread (Hook): You know how a chef can make an amazing cake, not by buying the fanciest oven, but by choosing great ingredients, measuring carefully, and following a smart recipe? This paper is that kind of chef’s guide for scientific AI.

🄬 Filling (The Actual Concept):

  • What it is: The key insight is that a careful, transparent data-and-training pipeline can reliably produce a strong, small scientific language model—even with modest hardware—without inventing a new architecture.
  • How it works: Start with raw arXiv LaTeX, filter and clean it, tokenize it in a way that’s stable for math, train a dense LLaMA-style transformer with curriculum learning, and keep the conversation-style data separate until post-training.
  • Why it matters: If you scramble the steps or mix conversational data too early, the model’s math skills suffer and training becomes unstable.

šŸž Bottom Bread (Anchor): It’s like first mastering grammar and vocabulary before practicing polite conversation; the model first learns formal math language, then learns how to answer questions.

— New Concept — Tokenization šŸž Hook: Cutting pizza into the right slices makes it easy to share; slicing it oddly makes a mess. 🄬 The Concept: Tokenization breaks text into pieces the model can understand.

  • How it works: Use a SentencePiece tokenizer (LLaMA-compatible, ~102,400 tokens) that doesn’t shatter equations into too many bits.
  • Why it matters: Over-fragmenting math symbols makes sequences longer and learning harder. šŸž Anchor: If you cut every pepperoni into crumbs, sharing takes forever. A good tokenizer keeps meaningful chunks.

— New Concept — BPE (Byte Pair Encoding) šŸž Hook: You notice the letters ā€œcā€ and ā€œhā€ often go together as ā€œch.ā€ 🄬 The Concept: BPE merges frequent character pairs into single tokens to compress text efficiently.

  • How it works: It scans text, finds frequent pairs, merges them, and repeats—ending with a vocabulary of useful chunks.
  • Why it matters: Fewer, smarter tokens mean shorter sequences and faster, more stable learning. šŸž Anchor: ā€œChemistryā€ becomes tokens like ā€œCheā€, ā€œmiā€, ā€œstryā€ instead of a jumble of single letters.
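A single BPE merge step can be shown in a few lines of pure Python. This is a toy sketch of the merge rule only, not a full tokenizer trainer.

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE step: find the most frequent adjacent pair and merge it."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        # greedily merge non-overlapping occurrences of the winning pair
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, a + b
```

Running this repeatedly over a corpus is what builds up a vocabulary of useful chunks; real tokenizers like SentencePiece add byte fallback, normalization, and frequency bookkeeping on top.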

— New Concept — Dense Transformer Architecture šŸž Hook: Imagine an orchestra where every musician plays together, not just a few. 🄬 The Concept: A dense transformer uses all its parameters for every token, keeping compute predictable and stable.

  • How it works: Stacked attention layers look at context; feed-forward layers transform representations; all parts activate per token.
  • Why it matters: It’s simpler and more stable than expert-routing models (MoE) when you have only 2 GPUs. šŸž Anchor: Instead of switching musicians on and off, the whole orchestra plays in harmony for each note.

— New Concept — Curriculum Learning šŸž Hook: In math class, you start with easy problems before tackling proofs. 🄬 The Concept: Curriculum learning introduces content from simple prose to dense equations in stages.

  • How it works: Stage 1: abstracts and intros; Stage 2: full bodies with theorems and derivations; Stage 3: a balanced mix.
  • Why it matters: Jumping straight into heavy symbols can destabilize early training. šŸž Anchor: It’s like learning to ride a bike with training wheels before going off-road.

Before vs After:

  • Before: People focused on giant models and mysterious data pipelines.
  • After: We see that transparent, careful engineering—especially tokenization, filtering, deduplication, and curriculum—lets a 1.36B dense model learn scientific text well on modest hardware.

Why It Works (Intuition):

  • Clean, structured data lowers noise.
  • A tokenizer that respects math keeps sequences compact and patterns consistent.
  • More tokens (data-rich regime) smooth out gradient noise, so the model ā€œhearsā€ clearer signals.
  • Separating pretraining (formal text) from post-training (instructions) prevents skill interference.

Building Blocks:

  • End-to-end pipeline: raw LaTeX → clean tokens → training → evaluation.
  • Tokenization/BPE/SentencePiece: stable chunks for equations.
  • Dense transformer: predictable, stable compute per token.
  • Scaling laws: enough tokens for the model’s size.
  • Curriculum learning: easier-to-harder stages.
  • Gradient accumulation: big effective batches on limited memory.

— New Concept — Gradient Accumulation šŸž Hook: If your piggy bank is small, you save coins over time until you can buy what you want. 🄬 The Concept: Gradient accumulation sums many small batches before updating the model, acting like a large batch without needing huge memory.

  • How it works: Process micro-batches, add up gradients, then apply one update.
  • Why it matters: Stabilizes training when GPU memory limits you. šŸž Anchor: It’s like taking several small steps to cover the same distance as one big stride.
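The key property of gradient accumulation is that summing micro-batch gradients reproduces the large-batch gradient exactly. A framework-free toy (a one-parameter least-squares model, invented here for illustration) makes that concrete:

```python
# Toy illustration (pure Python, no framework): accumulating gradients
# over micro-batches equals one large-batch gradient.

def grad(w, batch):
    # gradient of mean squared error 0.5*(w*x - y)^2 over a batch
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    # sum micro-batch gradients (undoing their per-batch averaging),
    # then divide by the total count: one "large batch" update
    total, n = 0.0, 0
    for mb in micro_batches:
        total += grad(w, mb) * len(mb)
        n += len(mb)
    return total / n
```

In a real training loop the optimizer step simply runs once per accumulation cycle instead of once per micro-batch, so memory holds only one micro-batch of activations at a time.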

03 Methodology

At a high level: Input (raw arXiv LaTeX + metadata) → [A. Validate & Extract] → [B. Filter & Clean] → [C. Deduplicate] → [D. Tokenize] → [E. Assemble Mixture + Weights] → [F. Train Dense Transformer with Curriculum] → Output (base scientific LM) → [G. Post-training Alignment] → Final interactive model.

Step A: Validate & Extract Sources

  • What happens: Download arXiv source archives, check integrity, unpack, locate all .tex files, follow \input/\include links.
  • Why it exists: Broken tarballs and odd project layouts can silently destroy yield.
  • Example: If an archive is corrupted, it’s skipped; this prevents feeding garbage into training.
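A minimal sketch of this validation step, using only the standard library. Following `\input`/`\include` links is omitted for brevity, and the skip-on-error policy is inferred from the paper's description:

```python
import tarfile

def extract_tex_sources(archive_path):
    """Return the text of every .tex member, skipping unreadable archives."""
    try:
        with tarfile.open(archive_path, "r:*") as tar:
            texts = []
            for member in tar.getmembers():
                if member.isfile() and member.name.endswith(".tex"):
                    f = tar.extractfile(member)
                    texts.append(f.read().decode("utf-8", errors="replace"))
            return texts
    except (tarfile.TarError, OSError):
        # corrupted or missing archive: skip it rather than feed
        # garbage downstream
        return []
```

Returning an empty list instead of raising keeps a single bad archive from halting a multi-terabyte extraction job.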

Step B: Metadata Filtering & Cleaning

  • What happens: Keep math, cs, hep-th, hep-ph, quant-ph, stat.ML, stat.TH; year ≄ 2000; remove withdrawn papers; drop bodies under 2,000 chars; apply language detection on title+abstract+cleaned body.
  • Why it exists: Keeps the focus on formal scientific writing and avoids fragments or retracted work.
  • Example: A two-page erratum is tossed; a full 20-page proof stays.
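The filters listed above translate into a short predicate. The metadata field names here are illustrative, and the language-detection step is omitted (the paper notes its timing is delicate):

```python
# Category prefixes kept by the paper's metadata filter.
KEEP_CATEGORIES = {"math", "cs", "hep-th", "hep-ph", "quant-ph",
                   "stat.ML", "stat.TH"}

def keep_paper(meta, body):
    """Apply the metadata filters; field names are illustrative."""
    in_scope = any(meta["category"].startswith(c) for c in KEEP_CATEGORIES)
    return (in_scope
            and meta["year"] >= 2000
            and not meta.get("withdrawn", False)
            and len(body) >= 2000)   # drop fragments under 2,000 chars
```

For example, a 2015 `math.AG` paper with a full body passes, while a short erratum or a pre-2000 submission is dropped.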

— New Concept — Deduplication šŸž Hook: If you photocopy the same page five times, studying all five wastes time. 🄬 The Concept: Deduplication removes exact and near-duplicate documents.

  • How it works: Exact hashing for identical content; similarity thresholds for near-duplicates (like minor version updates).
  • Why it matters: Reduces memorization and gives more variety per token. šŸž Anchor: You keep the clearest copy of notes instead of stacking the same page in your binder.
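Both halves of the idea, exact hashing and similarity thresholds, fit in a short sketch. The shingle size and threshold below are illustrative choices, not the paper's settings:

```python
import hashlib

def shingles(text, k=5):
    """Overlapping k-word windows, used for near-duplicate comparison."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def deduplicate(docs, threshold=0.7):
    """Exact dedup via hashing, near-dedup via Jaccard similarity."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h in seen_hashes:
            continue                       # exact duplicate
        s = shingles(doc)
        if any(len(s & t) / len(s | t) >= threshold for t in kept_shingles):
            continue                       # near duplicate, e.g. a v2 with minor edits
        seen_hashes.add(h)
        kept_shingles.append(s)
        kept.append(doc)
    return kept
```

At corpus scale, pairwise Jaccard comparison is too slow; production pipelines typically approximate it with MinHash/LSH, but the filtering logic is the same.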

Step C: LaTeX Normalization & Structure Preservation

  • What happens: Strip figures/formatting clutter, preserve equations and environments (theorem, proof), standardize whitespace and macros where safe.
  • Why it exists: Equations and structure carry the reasoning—don’t lose them.
  • Example: Remove \begin{figure} blocks but keep \begin{equation} intact.
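A minimal regex-based sketch of that normalization, assuming figure blocks can be matched as whole environments (real LaTeX needs more care with nesting and custom environments):

```python
import re

# Match whole figure environments, including the starred variant.
FIGURE_ENV = re.compile(r"\\begin\{figure\*?\}.*?\\end\{figure\*?\}", re.DOTALL)

def normalize_latex(source):
    """Drop figure environments, keep equations/theorems, squeeze whitespace."""
    source = FIGURE_ENV.sub("", source)
    source = re.sub(r"[ \t]+", " ", source)     # standardize spaces and tabs
    source = re.sub(r"\n{3,}", "\n\n", source)  # collapse runs of blank lines
    return source.strip()
```

Note what is *not* touched: `equation`, `theorem`, and `proof` environments pass through untouched, since they carry the reasoning the model is meant to learn.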

Step D: Tokenization (SentencePiece, LLaMA-compatible)

  • What happens: Convert cleaned text into tokens using a ~102,400 vocab. Keep stability for symbols/operators.
  • Why it exists: Over-fragmented equations blow up sequence length and hurt learning.
  • Example: ā€œ\int_0^1 x^2 dxā€ becomes a few stable tokens, not a hundred tiny pieces.

— New Concept — Tokenizer–Model Alignment (model_type and files) šŸž Hook: Using the wrong charger for your phone might fit, but it won’t charge. 🄬 The Concept: The model’s config (model_type) must match the tokenizer class and files used during training.

  • How it works: config.json chooses classes (e.g., LLaMA); AutoTokenizer loads matching files; IDs must align with embeddings.
  • Why it matters: A mismatch can silently scramble meaning—every token points to the wrong embedding row. šŸž Anchor: If every subway stop label is shifted by one, you always get off at the wrong station.
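A cheap pre-flight check catches this class of silent failure before any GPU time is spent. The config keys below follow HuggingFace conventions and are assumptions for illustration:

```python
def check_alignment(config, tokenizer_vocab_size):
    """Sanity-check tokenizer/model agreement before training starts."""
    problems = []
    if config.get("model_type") != "llama":
        problems.append("config model_type does not match the LLaMA tokenizer class")
    if config.get("vocab_size") != tokenizer_vocab_size:
        problems.append("embedding rows != tokenizer vocab size: "
                        "token IDs would index the wrong embedding rows")
    return problems
```

Failing fast here is the whole point: an off-by-anything vocabulary mismatch trains without errors but maps every token to the wrong embedding.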

Step E: Mixture Assembly with Sampling Weights

  • What happens: Upsample high-quality, formal scientific LaTeX (ā€œGoldā€ docs) and retain broader-domain math sources at lower weights. The pretraining core is cleaned arXiv (~80GB), with OpenWebMath (~50GB) helping variety; post-training adds StackExchange STEM (~10GB), MathInstruct (~100MB), and a small chat set (UltraChat ~1.2GB).
  • Why it exists: Balance precision (proofs, theorems) and breadth (informal problem statements) while avoiding overfitting.
  • Example: A rigorous theorem-laden paper appears more often than a casual blog-style math note.
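Weighted sampling of this kind can be sketched with the standard library; the source names and weights below are illustrative, not the paper's actual mixture ratios:

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents with replacement, upsampling higher-weight sources.

    sources: dict mapping source name -> list of documents
    weights: dict mapping source name -> sampling weight
    """
    rng = random.Random(seed)          # seeded for reproducible mixtures
    names = list(sources)
    picks = rng.choices(names, weights=[weights[s] for s in names], k=n)
    return [rng.choice(sources[name]) for name in picks]
```

With a 0.9/0.1 weighting, roughly nine of every ten sampled documents come from the high-quality "Gold" pool, which is how a rigorous paper "appears more often" during training.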

Step F: Training the Dense Transformer

  • Architecture: 24 layers; d_model=2048; heads=16; FFN=5504; RMSNorm; SiLU; RoPE; vocab ~102.4k; context 4096 (trained at 768 seq length); untied input/output embeddings; bfloat16.
  • Compute: 2ƗA100 80GB GPUs; ZeRO Stage 2 or FSDP; activation checkpointing; mixed precision; effective global batch 512–2048 via gradient accumulation; 5,000–8,000 GPU-hours.
  • Why it exists: Stable, predictable compute per token; fits modest hardware while retaining strong learning capacity.
  • Example: Micro-batch size per GPU 1–2; accumulate many steps to simulate a big batch.
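The ā‰ˆ1.36B figure can be sanity-checked from these hyperparameters. The count below assumes a standard two-matrix FFN and the untied input/output embeddings listed above, and ignores the small contribution from norms and biases:

```python
d_model, layers, ffn, vocab = 2048, 24, 5504, 102_400

attention = 4 * d_model * d_model   # Q, K, V, and output projections
ffn_params = 2 * d_model * ffn      # up- and down-projection (non-gated FFN assumed)
per_layer = attention + ffn_params
embeddings = 2 * vocab * d_model    # untied input and output embedding tables

total = layers * per_layer + embeddings   # ā‰ˆ 1.36e9 parameters
```

Notably, the two embedding tables alone account for about 0.42B parameters, nearly a third of the model, which is one cost of a large 102.4k vocabulary.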

— New Concept — Curriculum Learning (applied during training) šŸž Hook: You don’t start piano with Beethoven; you learn scales first. 🄬 The Concept: Train in stages—first prose (abstracts/intros), then full math bodies, then a balanced mix.

  • How it works: Stage 1 warms up language; Stage 2 introduces equations and theorem-proof patterns; Stage 3 blends both.
  • Why it matters: Avoids early instability from symbol-heavy sequences. šŸž Anchor: Training wheels off only after you can balance.
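A staged schedule like this is often implemented as a function from training progress to a data mix. The stage boundaries and weights below are invented for illustration; the paper describes the stages but not exact ratios:

```python
def curriculum_mix(step, total_steps):
    """Return the sampling mix for the current training stage.

    Boundaries and weights are illustrative, not the paper's values.
    """
    frac = step / total_steps
    if frac < 0.1:
        # Stage 1: warm up on prose (abstracts and introductions)
        return {"abstracts_intros": 1.0, "full_bodies": 0.0}
    if frac < 0.6:
        # Stage 2: introduce equation-heavy full bodies
        return {"abstracts_intros": 0.3, "full_bodies": 0.7}
    # Stage 3: balanced mix of both
    return {"abstracts_intros": 0.5, "full_bodies": 0.5}
```

The data loader queries this function each step and feeds the returned weights into its sampler, so the curriculum needs no changes to the model or optimizer.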

Step G: Optimization Details

  • What happens: AdamW with weight decay; conservative learning rate schedules; monitor gradient norms; bfloat16; checkpointing; stable data loaders to keep GPUs >95% busy.
  • Why it exists: Symbolic-heavy text is sensitive—gentle updates and good throughput keep training smooth.
  • Example: Gradient norms spike in warm-up then settle below ~1.0.

Step H: Post-Training Alignment (kept separate)

  • What happens: After pretraining on formal text, add instruction-following data (StackExchange, MathInstruct, small chat) to teach question-answering and formatting.
  • Why it exists: Mixing chat too early can dilute math precision.
  • Example: The model first learns proofs; later it learns to format step-by-step solutions for users.

Secret Sauce (Why this recipe is clever):

  • Separation of concerns: pretrain on pure science, align later.
  • Data-rich regime (~52.18B scientific tokens) reduces gradient noise.
  • LLaMA-compatible tokenizer avoids fragile re-tokenizer pitfalls.
  • Infrastructure-first design: storage/I/O throughput treated as first-class citizens.
  • Conservative schedules and curriculum tame symbolic instability.

04 Experiments & Results

The Test: The authors ran 24 training experiments to study stability, convergence, data yield, and bottlenecks. They evaluated mainly with perplexity on a held-out scientific validation set (a standard measure of how surprised the model is by the text—lower is better), plus training/validation loss curves, gradient norms, and GPU utilization.

The Competition: Instead of comparing to giant instruction-tuned models, they compared runs against each other—small-data (ā‰ˆ20GB) vs. full-data (ā‰ˆ200GB processed)—to isolate how data scale affects stability and final quality.

Scoreboard (with context):

  • Data Produced: ~52.18B tokens for scientific pretraining and ~5B tokens for post-training/alignment.
  • Small-Data Run (Run 24, 20GB): Loss went down but wobbled and plateaued high—like a student who studies only a few worksheets and keeps guessing.
  • Full-Data Runs (Run 23 and Run 20, 200GB processed): Loss decreased smoothly with a classic long-tail shape—like a student who practices many different problems and gains steady confidence.
  • Final Validation: Perplexity ā‰ˆ 4.2 (exp(1.438)), meaning the model is reasonably ā€œunsurprisedā€ by scientific text—like getting a strong B+/Aāˆ’ on reading specialized science passages while others with less data get a C+.
  • Gradient Stability: Early warm-up spikes, then stable norms <1.0; no explosions or vanishing gradients—like starting a jog fast, then settling into a steady pace.
  • Hardware Utilization: >95% GPU usage, stable power, no ECC errors, and no persistent I/O stalls in the final run—like a kitchen where ovens and mixers run continuously without waiting for ingredients.
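The headline perplexity number is just the exponential of the validation loss, which is easy to verify:

```python
import math

# Perplexity is exp(cross-entropy loss in nats per token); the paper's
# final validation loss of 1.438 gives the reported figure.
val_loss = 1.438
perplexity = math.exp(val_loss)   # ā‰ˆ 4.21
```

Intuitively, a perplexity of ā‰ˆ4.2 means the model is, on average, about as uncertain as if it were choosing uniformly among four or five plausible next tokens.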

Surprising Findings:

  • Storage/I/O as a Bottleneck: Early on, reading data fast enough was harder than computing—like a busy restaurant where waiters can’t bring ingredients to the chef quickly enough.
  • Language Filtering Sensitivity: Doing language detection too early removed valid, math-heavy English docs (symbols confused the detector) — meaning timing and context for filters matter.
  • Instruction Behavior Doesn’t Emerge: Pretraining on formal text alone doesn’t teach the model to follow instructions or chat; it must be added later.
  • Tokenizer Choice: A domain-trained tokenizer might help, but the LLaMA-compatible tokenizer was robust and safer under tight compute, avoiding ID mismatches that can silently break models.

Meaning of the Numbers:

  • 52.18B scientific tokens for a 1.36B model is a data-rich choice (ā‰ˆ38 tokens per parameter), which smoothed training and improved symbolic stability, even if not perfectly compute-optimal.
  • Sequence length trained at 768 (though the model supports 4096) improved throughput on 2 GPUs, trading off long-context practice for more stable, faster training steps.

Takeaway: More clean, well-prepared data beat fancier tricks in this setting. With a sensible tokenizer and a curriculum, the small dense model learned scientific text reliably on modest hardware.

05 Discussion & Limitations

Limitations (what it can’t do well yet):

  • Long Conversations/Long Context: Although the architecture supports 4096 tokens, training used 768-token sequences, so very long proofs or multi-page reasoning may stretch its skills.
  • Not Instruction-Tuned by Default: The base model isn’t a chat assistant until you add post-training alignment data.
  • Narrow Domain: Focused on math, theoretical physics, and statistics; it’s not a general web chatbot.
  • Evaluation Scope: Perplexity is helpful but doesn’t guarantee correct proofs or flawless reasoning; richer math benchmarks are still needed.
  • Compute and Storage: Even ā€œsmallā€ training used 5,000–8,000 GPU-hours and hefty storage with high-throughput I/O—still significant for many groups.

Required Resources:

  • Hardware: 2ƗA100 80GB GPUs (or equivalent), fast storage, and robust data loaders.
  • Software: Distributed training (ZeRO/FSDP), mixed precision, tokenizer/model configs that match exactly.
  • Data: Cleaned arXiv LaTeX plus curated math datasets; careful filtering and deduplication steps.

When NOT to Use:

  • If you need a friendly, general-purpose chatbot without extra tuning.
  • If your task demands very long-context reasoning during training (e.g., book-length documents) without adjusting sequence length.
  • If you lack fast storage or can’t validate/extract LaTeX reliably (I/O becomes a hard blocker).

Open Questions:

  • Tokenizer Science: How much better would a carefully domain-trained tokenizer be for symbolic math vs. LLaMA’s default?
  • Long-Context Training: What gains come from training at 2–4k tokens with memory-efficient tricks?
  • Better Extraction: Can we reduce LaTeX extraction failures and recover more tokens safely?
  • Optimal Mixtures: What sampling weights best balance proofs, derivations, and informal problem statements?
  • Richer Benchmarks: How do perplexity gains translate into theorem-solving, step-checked derivations, and formal proof consistency?

06 Conclusion & Future Work

3-Sentence Summary: This paper is a practical blueprint for training a 1.36B-parameter scientific language model from raw arXiv LaTeX, showing that meticulous data engineering and stable training practices matter as much as architecture. Using a LLaMA-style dense transformer, a robust tokenizer, a staged curriculum, and a data-rich corpus (~52.18B scientific tokens), the model trained smoothly on just 2 A100 GPUs and achieved strong familiarity with scientific text (perplexity ā‰ˆ 4.2). The work emphasizes that transparent pipelines and infrastructure planning—especially storage/I/O—are key for reproducible, domain-specialized LMs under moderate budgets.

Main Achievement: Turning messy, symbolic LaTeX into a clean, weighted, tokenized corpus and training a stable, small dense transformer—proving that careful engineering can rival scale in impact for domain models.

Future Directions: Explore longer-context training for proofs and derivations; compare domain-trained tokenizers vs. general ones; expand evaluations to math-proof and formal-reasoning benchmarks; refine mixture weights; and strengthen post-training alignment for instruction following without eroding math skills.

Why Remember This: It’s a clear, reproducible recipe for building strong scientific LMs when you don’t have infinite GPUs—teaching the community that the path to better models often runs through better data pipelines, not just bigger parameter counts.

Practical Applications

  • Build a specialty math/physics assistant that formats and parses LaTeX reliably.
  • Create a proof-explainer that rewrites dense derivations into clearer, step-by-step prose.
  • Develop a paper-drafting helper that suggests consistent notation and theorem-proof structure.
  • Make a LaTeX linting tool that flags malformed environments and common macro pitfalls.
  • Set up a reproducible data pipeline for academic labs to curate their own domain corpora.
  • Prototype a course tutor that first learns formal content, then gets aligned for Q&A after.
  • Benchmark tokenizers on symbolic math to choose the best one for a given dataset.
  • Use the curriculum schedule to stabilize training on small GPU clusters.
  • Apply weighted sampling to emphasize high-quality, formal sources over noisy text.
  • Run gradient-accumulation training to achieve large effective batches on limited memory.
#scientific language model#arXiv LaTeX#tokenization#SentencePiece#BPE#dense transformer#LLaMA architecture#scaling laws#curriculum learning#gradient accumulation#perplexity#data engineering#deduplication#ZeRO FSDP#RoPE positional embeddings