QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Key Summary
- QwenLong-L1.5 is a training recipe that helps AI read and reason over very long documents by improving the data it learns from, the way it is trained, and how it remembers important stuff.
- It builds tough practice problems where clues are spread across many places, so the model learns to connect facts instead of just searching for one line.
- It uses a safer kind of reinforcement learning (RL) with smart sampling and a new method called AEPO to keep training steady instead of chaotic.
- It adds a memory system so the model can handle tasks far longer than its usual reading window, even up to millions of tokens.
- Across long-context tests, it beats its own baseline by about 10 points on average and performs close to top proprietary models.
- On huge inputs (1M to 4M tokens), its memory-agent strategy clearly outperforms earlier agent baselines.
- The long-context skills also boost math, science, tool use, and long conversations, not just the special long-doc tests.
- The training recipe grows context length step by step and merges a memory expert back into a single strong model.
- The data pipeline automatically checks that questions really need the document and that answers stay reliable even when extra noise is added.
- This paper shows that better post-training alone (after pretraining) can unlock powerful long-context reasoning.
Why This Research Matters
Real work often lives in long materials: financial reports, legal cases, scientific papers, software repos, and long chats. A model that can read, remember, and reason across all that reduces errors and saves huge amounts of time. This recipe shows that careful post-training—not just bigger models—can unlock those abilities. It also scales beyond fixed windows using memory, so it can handle million-token tasks that used to be intractable. The improvements spill over into everyday skills like math, science Q&A, tool use, and long conversations. That means better tutors, analysts, coders, and assistants for people at school, at home, and at work.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how reading a whole mystery novel means remembering clues from the beginning, middle, and end—and tying them together to solve the case? A good reader doesn’t just skim; they connect the dots across the entire book.
🥬 Filling (The Actual Concept): Before this work, many AI models were like fast skimmers. They could handle short passages and answer simple questions but struggled when answers were scattered across giant collections: multi-chapter reports, many webpages, huge logs, or long chats.
- What it is: Long-context reasoning means the AI can find, track, and combine facts located far apart in a huge text to produce a correct answer.
- How it worked before (briefly): Researchers mostly extended the “reading window” during pretraining or changed the model’s attention mechanism to be cheaper, then hoped post-training would catch up.
- Why it matters: Without real long-context reasoning, models forget early clues, miss subtle connections, and give shallow answers.
🍞 Bottom Bread (Anchor): Imagine asking, “Across all 20 annual reports, which division grew every year and what policy explains the trend?” A short-context model might miss key lines from earlier years. A long-context reasoner can thread the full story.
The World Before:
- Models could answer simple retrieval questions (like “find the sentence that says X”) but had trouble with multi-hop logic—linking facts hidden across many parts of a very long document set.
- Most progress came from pretraining and architecture tricks to increase context windows, not from post-training methods that teach models to actually reason over long input.
The Problem:
- Post-training (the stage that aligns and sharpens the model’s abilities) lacked a complete recipe for long-context reasoning.
- Gaps included: (1) getting enough hard, high-quality training data where clues are spread out; (2) stabilizing RL so training doesn’t wobble or collapse on long inputs; and (3) building a memory process for inputs far beyond even large windows.
Failed Attempts:
- Simple “needle-in-a-haystack” tests: The model just needed to find a single line—not truly reason across multiple facts.
- One-hop RAG: Retrieve a small chunk and answer; good for short jumps, weak for many-step logic spread across different places.
- RL that worked for short math/code sometimes became unstable on long, messy text (biased rewards, exploding or vanishing gradients, entropy swings).
🍞 Top Bread (Hook): Imagine studying for a test with only true/false flashcards (easy), then suddenly being asked to write a research paper (hard). Your old practice doesn’t prepare you for multi-source reasoning.
🥬 Filling (The Actual Concept): The Gap this paper fills is a full post-training system dedicated to long-context reasoning.
- What it is: A unified recipe combining (a) a long-context data synthesis pipeline, (b) stabilized RL algorithms, and (c) a memory-augmented agent.
- How it works: It builds complex, multi-hop datasets; trains with group-based RL and smart sampling; and equips the model with iterative memory to go beyond its fixed window.
- Why it matters: Without all three together, models either lack the right practice, fall apart in training, or can’t handle ultra-long inputs.
🍞 Bottom Bread (Anchor): Think of training a marathon runner: you need (a) a realistic course with hills (hard data), (b) a safe training plan (stable RL), and (c) water stations and pacing strategy (memory agent). Skipping one leads to failure.
Real Stakes:
- Long business reports, legal cases, scientific literature, codebases, and long chats are standard in real life.
- Better long-context reasoning means more accurate analyses, safer decision support, fewer hallucinations, and stronger tools for learning and work.
- The model’s improvements also spill over to math, science Q&A, tools, and long conversations—because staying consistent over long stretches helps everywhere.
🍞 Top Bread (Hook): Imagine trying to remember every step in a giant LEGO instruction book with hundreds of pages.
🥬 Filling (The Actual Concept): Long-context Reinforcement Learning trains a model to follow those steps across the whole book and build correctly.
- What it is: Teaching the model, via rewards, to read long inputs, plan, and reason step by step.
- How it works: The model proposes answers; a reward checks correctness; the model learns which reasoning led to better outcomes.
- Why it matters: Without RL tailored for long inputs, the model won’t learn to combine far-apart clues.
🍞 Bottom Bread (Anchor): If a model must answer “Compare three safety policies across five yearly reports and pick the best,” long-context RL helps it learn how to track and connect those pieces over many pages without getting lost.
02 Core Idea
🍞 Top Bread (Hook): You know how a great study plan includes good practice problems, a steady routine without burnout, and a notebook to remember key ideas? Put those together and you ace the test.
🥬 Filling (The Actual Concept): The “Aha!” is a complete post-training recipe—data + stable RL + memory—that teaches long-context reasoning end-to-end.
- What it is: QwenLong-L1.5’s core idea is unifying three pillars: (1) a scalable long-context data synthesis pipeline, (2) stabilized RL for long inputs, and (3) a memory agent for ultra-long tasks.
- How it works: It generates multi-hop, globally grounded questions; trains with group-relative RL plus task-balanced sampling and AEPO to keep learning stable; and adds a memory loop to go beyond the window.
- Why it matters: Any one pillar alone isn’t enough; all three together make the model both accurate and scalable.
🍞 Bottom Bread (Anchor): Like training a detective: (1) give tricky cases with clues everywhere, (2) teach a calm investigation routine, (3) keep a case file (memory). Result: a detective who solves huge, tangled mysteries.
Multiple Analogies:
- Library analogy: Before, the model skimmed one shelf; after, it checks the whole library, notes cross-references (data), uses a calm search plan (stable RL), and keeps a summary notebook (memory).
- Cooking analogy: Before, one-pan meals; after, a feast where timing and multiple dishes must coordinate (data). The chef follows a reliable playbook (stable RL) and keeps prep lists and leftovers labeled (memory).
- Sports analogy: Before, short sprints; after, a marathon with water stations (memory), a pacing strategy (stable RL), and practice on varied terrains (data).
Before vs. After:
- Before: Models often found single facts but struggled to chain them across long contexts; training could be unstable; very long inputs were out of reach.
- After: The model handles multi-hop, globally scattered evidence, trains stably over longer sequences, and scales to million-token tasks using memory.
Why It Works (intuition, no equations):
- Data: If you only practice finding one sentence, you never learn to connect multiple clues. The pipeline builds tasks that force connections across far-apart parts of the text.
- Stabilized RL: Long inputs are noisy. Task-balanced sampling prevents one task type from skewing updates, and AEPO keeps exploration (randomness) in a Goldilocks zone—not too wild, not too timid.
- Memory: Even a large window has limits. Iterative memory lets the model compress what it has learned so far, plan next steps, and keep going chunk by chunk.
Building Blocks (Sandwich explanations):
- 🍞 You know how a teacher makes custom practice sheets that get harder and cover many topics? 🥬 Long-Context Data Synthesis Pipeline
- What it is: A programmatic way to create hard, multi-hop questions spread across long documents.
- How it works: Break documents into atomic facts and relations, build knowledge graphs and tables, compose multi-step questions, add distractors, then verify quality.
- Why it matters: Without rich, tough data, the model just learns easy retrieval, not deep reasoning. 🍞 Example: From three annual reports and two news articles, the pipeline asks: “Which product line rose every year, and which policy across divisions best explains it?”
- 🍞 Imagine giving equal time to math, reading, science, and art so you grow in all subjects. 🥬 Stabilized Reinforcement Learning (with task-balanced sampling and task-specific advantages)
- What it is: A steadier way to learn from rewards across many task types.
- How it works: Each training batch draws a balanced mix of tasks; advantages (learning signals) are normalized per task to avoid bias.
- Why it matters: Without balancing, one task can dominate, causing unstable training. 🍞 Example: A batch includes multi-hop QA, numerical reasoning, dialogues, and code-doc questions—each contributes fairly to updates.
- 🍞 When learning a new game, sometimes you explore randomly; other times, you exploit what works. 🥬 AEPO (Adaptive Entropy-Controlled Policy Optimization)
- What it is: A method that keeps exploration in a healthy range by masking harmful negative-gradient updates when the policy becomes too random, and restoring them when it becomes too predictable.
- How it works: If model entropy (randomness) gets too high, pause negative gradients from incorrect rollouts; if too low, reintroduce them to prevent collapse.
- Why it matters: Without AEPO, long-context RL can oscillate or stall. 🍞 Example: If answers start looking messy and erratic, AEPO temporarily learns only from good attempts until stability returns.
- 🍞 Picture reading a giant book one chapter at a time while keeping a summary sheet updated. 🥬 Memory-Augmented Architecture (Memory Agent)
- What it is: A loop that reads chunks, updates a memory, plans the next chunk, and finally answers.
- How it works: Split long input; at each step, update a compact memory and plan navigation; at the end, use memory plus instructions to generate the answer.
- Why it matters: Without memory, tasks bigger than the window are impossible to solve coherently. 🍞 Example: For 2 million tokens of logs, the agent summarizes each segment, tracks unresolved questions, and eventually answers the user query.
- 🍞 Think of training wheels, then a bigger bike, then a racing bike. 🥬 Multi-stage Fusion RL Training
- What it is: Gradually increasing input/output lengths; training a memory expert; then merging back into a single strong model.
- How it works: Stages 1–3 progressively grow full-context lengths; in parallel, a memory expert is trained; the models are merged; a final stage refines both skills together.
- Why it matters: Without progressive stages and merging, you either stay short or lose stability. 🍞 Example: Start at 20K tokens, move to 60K, then 120K, then add memory for 1M+ before unifying everything.
03 Methodology
High-level Recipe: Input (long docs + questions) → Data Synthesis & Verification → Progressive RL (GRPO + task-balanced sampling + task-specific advantage) → AEPO entropy control → Memory agent training → Model merging → Final full-context + memory-capable model.
Step-by-step (with Sandwich explanations for key pieces):
- 🍞 Imagine turning a messy library into a tidy map of facts and links. 🥬 Long-Context Data Synthesis Pipeline
- What it is: A system to generate multi-hop, globally grounded QA pairs at scale.
- How it works (like a recipe):
- Collect diverse long documents (code repos, papers, reports, literature, dialogues) and filter quality.
- Extract atomic facts into knowledge graphs or tables.
- Compose multi-hop and numerical queries by sampling long-range paths or running NL2SQL.
- Inflate context length with irrelevant but realistic distractors.
- Verify with two checks: grounding (remove doc; if model still answers, discard) and robustness (add noise; if answer breaks, discard).
- Why it matters: Without trustworthy, hard data, RL learns shallow shortcuts. 🍞 Anchor: Build a table across five reports; ask: “Total R&D over 2019–2023 for Division A vs. B, and which policy best explains the difference?”
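To make the path-sampling step concrete, here is a toy sketch (not the paper's pipeline): a handful of invented fact triples stand in for a knowledge graph, and a question is composed by chaining triples that share an entity, so answering it requires connecting facts from different places.

```python
import random

# Toy "knowledge graph" of (subject, relation, object) triples.
# Entities, relations, and values are invented for illustration only.
TRIPLES = [
    ("Division A", "reported_revenue_2021", "$1.2B"),
    ("Division A", "reported_revenue_2022", "$1.5B"),
    ("Division A", "governed_by_policy", "Policy X"),
    ("Policy X", "introduced_in", "the 2021 annual report"),
]

def sample_multi_hop_question(triples, hops=2, seed=0):
    """Compose a multi-hop question by chaining triples that share an entity."""
    rng = random.Random(seed)
    heads = {t[0] for t in triples}
    # Prefer starting triples whose object is itself the subject of another
    # triple, so a multi-hop chain actually exists.
    starts = [t for t in triples if t[2] in heads] or triples
    path = [rng.choice(starts)]
    for _ in range(hops - 1):
        nxt = [t for t in triples if t[0] == path[-1][2]]
        if not nxt:
            break
        path.append(rng.choice(nxt))
    question = (f"Starting from {path[0][0]}, follow "
                + " and then ".join(t[1].replace("_", " ") for t in path)
                + ". What do you reach?")
    return question, path[-1][2]  # (question, gold answer)

print(sample_multi_hop_question(TRIPLES))
```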
- 🍞 You know how exams should fairly cover all topics—not just one. 🥬 Task-balanced Sampling + Task-specific Advantage Estimation
- What it is: A way to draw equal amounts from each task type and normalize rewards per task.
- How it works:
- Pre-infer difficulty bins and sample evenly.
- In each RL batch, enforce equal counts from: multiple-choice QA, multi-hop QA, general reading comprehension, dialogue memory, and corpus-level numerical reasoning.
- Compute advantages using the reward variance of that specific task (not mixed), to reduce bias.
- Why it matters: Without it, reward noise and imbalance cause unstable updates. 🍞 Anchor: A batch might include 10 MC questions, 10 multi-hop, 10 dialogues, 10 table-calculation questions; each contributes cleanly to learning.
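The sketch below illustrates both ideas in plain Python (it is not the paper's code): batches are drawn with equal counts per task type, and rewards are normalized within each task before becoming advantages. The task labels and data layout are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical task labels; the recipe's exact taxonomy may differ.
TASKS = ["multi_choice", "multi_hop_qa", "general_rc", "dialogue_memory", "corpus_numerics"]

def task_balanced_batch(pool, per_task=8, seed=0):
    """Draw an equal number of training items from every task type."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for item in pool:                      # each item: {"task": ..., ...}
        by_task[item["task"]].append(item)
    batch = []
    for task in TASKS:
        batch.extend(rng.sample(by_task[task], per_task))
    return batch

def task_specific_advantages(rewards_by_task, eps=1e-6):
    """Normalize rewards within each task (zero mean, unit variance) so that
    no single task's reward scale or noise dominates the policy update."""
    advantages = {}
    for task, rewards in rewards_by_task.items():
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards) + eps) ** 0.5
        advantages[task] = [(r - mean) / std for r in rewards]
    return advantages
```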
- 🍞 Think of grading a group of practice essays by comparing them together. 🥬 GRPO (Group Relative Policy Optimization)
- What it is: An RL method that compares multiple candidate answers from the same prompt and scores them relative to each other.
- How it works:
- For each input, sample a group of responses.
- Compute a z-scored advantage per response using group rewards (or the task-specific variance in the improved setup above).
- Update the policy with a token-level loss so long answers don’t drown the signal.
- Why it matters: It avoids needing a separate value network and stays efficient for long contexts. 🍞 Anchor: For a long QA, the model tries 8 candidate answers; the better ones guide the update more than the worse ones, without overfitting to length.
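A minimal PyTorch sketch of the group-relative idea, for a single prompt: rewards from the sampled group are z-scored to form advantages, and the loss is averaged over tokens so longer responses do not dominate. This is an illustration, not the exact objective from the paper (the PPO-style clipping and any KL terms are omitted for brevity).

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: z-score each response's reward against the other
    responses sampled for the same prompt, so no value network is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std(unbiased=False) + eps)

def token_level_policy_loss(token_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            response_lengths: list[int]) -> torch.Tensor:
    """Every token inherits its response's advantage; averaging over all tokens
    keeps long answers from drowning out short ones."""
    per_token_adv = torch.cat([adv.expand(n) for adv, n in zip(advantages, response_lengths)])
    return -(per_token_adv * token_logprobs).sum() / per_token_adv.numel()

# Example: 4 sampled responses with rewards 1, 0, 0, 1 and various lengths.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
logprobs = torch.randn(10 + 7 + 12 + 9)          # placeholder per-token log-probs
loss = token_level_policy_loss(logprobs, adv, [10, 7, 12, 9])
```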
- 🍞 When a class gets too noisy, the teacher pauses certain activities to calm things down. 🥬 AEPO (Adaptive Entropy-Controlled Policy Optimization)
- What it is: A controller that turns down negative gradient updates when the model is overly random.
- How it works:
- Measure batch entropy (randomness).
- If above a high threshold: mask negative-advantage samples, learn from positive only → stabilizes.
- If below a low threshold: reintroduce negatives → prevents collapse and keeps exploration.
- Why it matters: Without AEPO, long-context RL often swings between chaos and stagnation. 🍞 Anchor: If answers look scattered and off-topic, AEPO quiets the chaos until focus returns; then it reopens exploration slowly.
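A rough sketch of this control logic, with made-up entropy thresholds; the real AEPO criterion and its hysteresis details may differ.

```python
import torch

class AEPOController:
    """Entropy-gated masking of negative-advantage samples (illustrative only).
    `high` and `low` are invented threshold values."""

    def __init__(self, high: float = 0.7, low: float = 0.2):
        self.high, self.low = high, low
        self.mask_negatives = False  # current mode

    def observe(self, policy_entropy: float) -> None:
        if policy_entropy > self.high:
            self.mask_negatives = True    # too random: learn from positive rollouts only
        elif policy_entropy < self.low:
            self.mask_negatives = False   # nearing collapse: restore negative gradients

    def sample_weights(self, advantages: torch.Tensor) -> torch.Tensor:
        """Per-sample weights to multiply into the policy-gradient loss."""
        if self.mask_negatives:
            return (advantages > 0).float()
        return torch.ones_like(advantages)
```

The controller is deliberately stateful: once high entropy flips it into positive-only mode, it stays there until entropy falls back below the low threshold, which is one plausible reading of "reintroduce them to prevent collapse."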
- 🍞 Imagine reading a 4-million-word saga one chapter at a time with a growing summary. 🥬 Memory Agent (Memory-Augmented Architecture)
- What it is: A loop that processes chunks, updates memory, plans next steps, and finally answers with the accumulated memory plus formatting rules.
- How it works:
- Split the user query into q_core (what to solve) and q_inst (format rules) to keep reasoning flexible.
- Chunk the long document; at step t, read chunk t and update memory m_t and plan p_t.
- After the last chunk, combine the final memory m_K with q_inst to produce the answer.
- Why it matters: It lets the model reason far beyond its fixed window without losing track. 🍞 Anchor: For millions of tokens of technical logs, the agent tracks bugs, merges related clues across chunks, and outputs a precise timeline in JSON at the end.
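Sketched as a loop below; the chunking scheme and the `update_memory` / `final_answer` callables are placeholders you would back with your own model calls, not the paper's interface.

```python
def memory_agent_answer(document: str, q_core: str, q_inst: str,
                        update_memory, final_answer, chunk_chars: int = 100_000) -> str:
    """Iterative memory loop: read chunk by chunk, keep a compact memory,
    then answer from the final memory plus the formatting instructions."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    memory = ""
    for t, chunk in enumerate(chunks, start=1):
        # Fold this chunk into the running memory (and, implicitly, the next-step plan).
        memory = update_memory(question=q_core, memory=memory,
                               chunk=chunk, step=t, total=len(chunks))
    # Only the accumulated memory and q_inst reach the final answering call,
    # so an answer is possible even though no single call saw the whole input.
    return final_answer(question=q_core, memory=memory, instructions=q_inst)
```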
- 🍞 Think of training for longer races step by step: 5K, 10K, half marathon, then marathon. 🥬 Multi-stage Fusion RL Training & Model Merging
- What it is: Progressive full-context RL (growing input/output lengths), plus a separate memory expert trained with memory-RL, then merged using SCE, then refined.
- How it works:
- Stage 1: 20K in / 12K out → activate long grounding skills.
- Stage 2: 60K in / 20K out → deepen aggregation.
- Stage 3: 120K in / 50K out → strengthen long reasoning.
- Memory-RL: train a memory expert at 128K with chunked processing.
- Merge the memory expert with the Stage-3 model using SCE (model fusion) → keep both skills.
- Stage 4: final full-context RL to polish both abilities.
- Why it matters: Sudden jumps or mixing everything at once can hurt stability; staged growth works better. 🍞 Anchor: Like first mastering essays at 2 pages, then 5, then 10; separately, learn note-taking; finally, combine both for long research papers.
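The staged schedule above can be written down as a small config plus an orchestration loop; `train_rl` and `merge_sce` are placeholders for the RL trainer and the SCE merging step, not real APIs.

```python
# Length schedule from the stages described above (tokens).
STAGES = [
    {"name": "stage1", "max_input": 20_000,  "max_output": 12_000},
    {"name": "stage2", "max_input": 60_000,  "max_output": 20_000},
    {"name": "stage3", "max_input": 120_000, "max_output": 50_000},
]
MEMORY_EXPERT = {"name": "memory_rl", "max_input": 128_000, "chunked": True}

def run_recipe(base_model, train_rl, merge_sce):
    """Grow context stage by stage, train a memory expert, merge, then refine.
    (The expert is trained sequentially here; the paper trains it in parallel.)"""
    model = base_model
    for stage in STAGES:
        model = train_rl(model, **stage)                 # full-context RL stages
    expert = train_rl(model, **MEMORY_EXPERT)            # memory-RL specialization
    merged = merge_sce(model, expert)                    # SCE model merging
    return train_rl(merged, name="stage4",               # final joint refinement
                    max_input=120_000, max_output=50_000)
```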
- 🍞 Before turning in homework, you double-check that the question really needs the sources you used and that your answer survives small changes. 🥬 Data Verification (Grounding + Robustness)
- What it is: Two checks so data truly teaches long-context reasoning.
- How it works:
- Grounding check: Remove the source docs; if the model still answers, discard (it’s just prior knowledge).
- Robustness check: Add irrelevant docs; if the answer collapses, discard (too brittle).
- Why it matters: Without this, training may teach shortcuts unrelated to long-context reasoning. 🍞 Anchor: A QA item only stays if it really needs the long document and still holds up when distractors are added.
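As a sketch of the two filters (the `ask_model` and `is_correct` helpers and the item layout are assumptions, not the paper's interface):

```python
def grounded(item, ask_model, is_correct) -> bool:
    """Grounding check: keep the item only if the model FAILS without the
    source documents; otherwise it is answerable from prior knowledge alone."""
    answer = ask_model(item["question"], context="")
    return not is_correct(answer, item["gold"])

def robust(item, ask_model, is_correct, distractor_docs: str) -> bool:
    """Robustness check: keep the item only if the answer still holds when
    irrelevant distractor documents are mixed into the context."""
    answer = ask_model(item["question"], context=item["context"] + "\n" + distractor_docs)
    return is_correct(answer, item["gold"])

def keep_item(item, ask_model, is_correct, distractor_docs: str) -> bool:
    return (grounded(item, ask_model, is_correct)
            and robust(item, ask_model, is_correct, distractor_docs))
```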
Secret Sauce Summary:
- Hard, verified multi-hop data creates the right pressure to learn.
- Stabilized RL (GRPO + task-balancing + task-specific advantages + AEPO) keeps learning on track.
- Memory agent and staged training extend capabilities beyond fixed windows without sacrificing stability.
04 Experiments & Results
The Test (What and Why):
- Benchmarks: DocMath (long numeric reasoning), LongBench-V1/QAs and V2 (multi-hop, multi-domain), Frames (Wikipedia multi-hop), MRCR (needle-in-haystack in dialogue but with structure), CorpusQA (corpus-level aggregation; very long).
- Why these: They force the model to pull together scattered clues, calculate across documents, and stay accurate across long and ultra-long inputs.
The Competition:
- Proprietary flagships: GPT-5, Gemini-2.5-Pro.
- Strong open/lightweight models: DeepSeek-R1-0528, Gemini-2.5-Flash-Thinking, Qwen3-Max-Thinking.
- Baseline: Qwen3-30B-A3B-Thinking-2507.
Scoreboard with Context:
- Overall average: QwenLong-L1.5-30B-A3B ≈ 71.82.
- That’s +9.90 over its own baseline (61.92)—like jumping from a solid B- to a strong A- across tough finals.
- Comparable to Gemini-2.5-Pro on long-context reasoning; ahead of DeepSeek-R1-0528 and Gemini-2.5-Flash-Thinking.
- Category highlights:
- MRCR: 82.99 (a huge +31.72 over baseline), showing strong disambiguation and retrieval amid long dialogue clutter.
- CorpusQA: 81.25 (near GPT-5 level), showing robust corpus-level aggregation and calculation.
- LongBench-V2: +6.16 over baseline, especially strong in medium-to-long subsets where clues are more spread out.
- Ultra-long (beyond 128K):
- MRCR 128K–512K: 34.87 vs. 16.55 baseline (memory agent mode) → more than double.
- MRCR 512K–1M: 22.53 vs. 4.24 baseline → a big leap.
- CorpusQA up to 4M tokens: 14.29 (memory mode) where full-context models cannot run; outperforms other agents.
Surprising Findings:
- The largest gains show up precisely where inputs are longest and information is most scattered—clear evidence the data+RL+memory recipe teaches true long-range reasoning, not just retrieval.
- Long-context training also boosts general domains: AIME25 (+3.65), GPQA-Diamond (+0.90), and long dialogue memory (+15.60 on LongMemEval). Skills transfer beyond the original training focus.
- Memory specialization temporarily lowers full-context performance, but model merging (SCE) restores and even improves the unified capability—so one model can do both.
Making the Numbers Meaningful:
- Think of MRCR as a messy classroom where lots of students talk at once (many distractors). Scoring ~83 vs. ~51 baseline is like hearing every key word in the chatter and still answering perfectly.
- On CorpusQA (lots of big reports), scoring ~81 vs. ~72 baseline means the model not only reads more but composes evidence into final, verified computations—like an analyst who summarizes 10 binders into one correct chart.
Ablations and Training Dynamics:
- Task-balanced sampling + task-specific advantage: smoother entropy and response lengths; +2.55 average over GRPO baseline on smaller trials.
- Negative gradient clipping: best when clipping high-entropy negatives; prevents over-penalizing exploratory steps that might later turn correct.
- AEPO: +3.29 over GRPO on 4B tests; on 30B, maintains a healthy exploration–exploitation balance across long runs.
- Progressive stages: Big jump after Stage-1 (activates long grounding), then steady gains for longer contexts; memory-RL + SCE merging is key to unify skills.
Bottom line: Across diverse tests, the model consistently improves where it should if it truly learned to reason over long inputs, and it scales to extreme lengths via memory agent training.
05 Discussion & Limitations
Limitations (honest look):
- Long outputs: While output limits were extended (up to 50K), tasks like chapter-level editing or long report drafting remain underexplored.
- Modality: The pipeline is text-only; real-world inputs often mix text, tables, figures, code, and images.
- Reward design: Current rewards blend rule checks with LLM-as-a-judge; nuanced, multi-criteria judging for open-ended tasks is still a challenge.
- Credit assignment: GRPO plus masking/clipping stabilizes training but doesn’t fully solve token-level credit assignment inside complex reasoning chains.
Required Resources:
- Large-scale compute for long-context RL and data synthesis; careful batching and memory to handle 120K+ input and up to 50K output.
- Access to judging models for verification (or robust open judges) and storage for big synthesized datasets.
When NOT to Use:
- If tasks are short and simple retrieval suffices, the full recipe may be overkill.
- If your application needs multi-modal reasoning (charts, images) today, additional modeling and data extensions are required.
- If you cannot afford the compute for staged RL and memory-agent training, lighter fine-tunes might be more practical.
Open Questions:
- Better token-level credit assignment: Can we reward the helpful steps within a mostly-wrong solution so learning is more precise and faster?
- Richer rewards: Can rubric-based LLM judges provide stable, multi-aspect scores that align with human preferences across diverse, open-ended tasks?
- Data flywheel: Can the improved long-context model generate and verify its own high-quality data at scale, closing the loop and reducing reliance on external APIs?
- Multi-modal scaling: How to extend the same recipe to text+tables+figures+code for real enterprise and research pipelines?
06 Conclusion & Future Work
Three-sentence summary:
- QwenLong-L1.5 presents a full post-training recipe—synthesized long-context data, stabilized RL (GRPO+balancing+AEPO), and a memory agent—that teaches models to reason across very long inputs.
- It delivers strong gains over its baseline, rivals leading proprietary systems on long-context tasks, and scales to million-token settings with a memory-augmented workflow.
- The learned skills generalize beyond the target domain—improving math, science, tool use, and dialogue memory.
Main Achievement:
- Unifying data, training, and memory into a single, staged post-training system that reliably upgrades long-context reasoning and memory management in an open 30B model.
Future Directions:
- Expand to long-input/long-output tasks (editing, report drafting), build a self-improving data flywheel, design richer token-level credit assignment and rubric rewards, and extend to multi-modal long-context reasoning.
Why Remember This:
- It shows that careful post-training—not just pretraining or architecture—can unlock deep, scalable long-context reasoning.
- The approach is practical: build better data, stabilize RL, and add memory to go beyond any fixed window.
- These ideas point to AI systems that can read, remember, and reason over the truly big stories we work with every day.
Practical Applications
- Financial analysis across multi-year reports to explain trends and policies with citations.
- Legal research over many filings to summarize arguments and precedents coherently.
- Scientific literature reviews that integrate findings from dozens of long papers.
- Enterprise log analysis over millions of tokens to trace root causes and timelines.
- Long software repository understanding for cross-file code queries and refactoring plans.
- Customer support memory that recalls long chat histories and prior resolutions.
- Education assistants that track learning progress and reference earlier lessons in long study plans.
- Policy comparison across large government documents to identify consistent impacts.
- Project management assistants that summarize long meeting notes and action logs.
- Tool-using agents that maintain long-term memory across many steps and tasks.