QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Key Summary
- QwenLong-L1.5 is a training recipe that helps AI read and reason over very long documents by improving the data it learns from, the way it is trained, and how it remembers important stuff.
- It builds tough practice problems where clues are spread across many places, so the model learns to connect facts instead of just searching for one line.
- It uses a safer kind of reinforcement learning (RL) with smart sampling and a new method called AEPO to keep training steady instead of chaotic.
- It adds a memory system so the model can handle tasks far longer than its usual reading window, even up to millions of tokens.
- Across long-context tests, it beats its own baseline by about 10 points on average and performs close to top proprietary models.
- On huge inputs (1M to 4M tokens), its memory-agent strategy clearly outperforms earlier agent baselines.
- The long-context skills also boost math, science, tool use, and long conversations, not just the special long-doc tests.
- The training recipe grows context length step by step and merges a memory expert back into a single strong model.
- The data pipeline automatically checks that questions really need the document and that answers stay reliable even when extra noise is added.
- This paper shows that better post-training alone (after pretraining) can unlock powerful long-context reasoning.
Why This Research Matters
Real work often lives in long materials: financial reports, legal cases, scientific papers, software repos, and long chats. A model that can read, remember, and reason across all that reduces errors and saves huge amounts of time. This recipe shows that careful post-training—not just bigger models—can unlock those abilities. It also scales beyond fixed windows using memory, so it can handle million-token tasks that used to be intractable. The improvements spill over into everyday skills like math, science Q&A, tool use, and long conversations. That means better tutors, analysts, coders, and assistants for people at school, at home, and at work.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how reading a whole mystery novel means remembering clues from the beginning, middle, and end—and tying them together to solve the case? A good reader doesn’t just skim; they connect the dots across the entire book.
🥬 Filling (The Actual Concept): Before this work, many AI models were like fast skimmers. They could handle short passages and answer simple questions but struggled when answers were scattered across giant collections: multi-chapter reports, many webpages, huge logs, or long chats.
- What it is: Long-context reasoning means the AI can find, track, and combine facts located far apart in a huge text to produce a correct answer.
- How it worked before (briefly): Researchers mostly extended the “reading window” during pretraining or changed the model’s attention mechanism to be cheaper, then hoped post-training would catch up.
- Why it matters: Without real long-context reasoning, models forget early clues, miss subtle connections, and give shallow answers.
🍞 Bottom Bread (Anchor): Imagine asking, “Across all 20 annual reports, which division grew every year and what policy explains the trend?” A short-context model might miss key lines from earlier years. A long-context reasoner can thread the full story.
The World Before:
- Models could answer simple retrieval questions (like “find the sentence that says X”) but had trouble with multi-hop logic—linking facts hidden across many parts of a very long document set.
- Most progress came from pretraining and architecture tricks to increase context windows, not from post-training methods that teach models to actually reason over long input.
The Problem:
- Post-training (the stage that aligns and sharpens the model’s abilities) lacked a complete recipe for long-context reasoning.
- Gaps included: (1) getting enough hard, high-quality training data where clues are spread out; (2) stabilizing RL so training doesn’t wobble or collapse on long inputs; and (3) building a memory process for inputs far beyond even large windows.
Failed Attempts:
- Simple “needle-in-a-haystack” tests: The model just needed to find a single line—not truly reason across multiple facts.
- One-hop RAG: Retrieve a small chunk and answer; good for short jumps, weak for many-step logic spread across different places.
- RL that worked for short math/code sometimes became unstable on long, messy text (biased rewards, exploding or vanishing gradients, entropy swings).
🍞 Top Bread (Hook): Imagine studying for a test with only true/false flashcards (easy), then suddenly being asked to write a research paper (hard). Your old practice doesn’t prepare you for multi-source reasoning.
🥬 Filling (The Actual Concept): The Gap this paper fills is a full post-training system dedicated to long-context reasoning.
- What it is: A unified recipe combining (a) a long-context data synthesis pipeline, (b) stabilized RL algorithms, and (c) a memory-augmented agent.
- How it works: It builds complex, multi-hop datasets; trains with group-based RL and smart sampling; and equips the model with iterative memory to go beyond its fixed window.
- Why it matters: Without all three together, models either lack the right practice, fall apart in training, or can’t handle ultra-long inputs.
🍞 Bottom Bread (Anchor): Think of training a marathon runner: you need (a) a realistic course with hills (hard data), (b) a safe training plan (stable RL), and (c) water stations and pacing strategy (memory agent). Skipping one leads to failure.
Real Stakes:
- Long business reports, legal cases, scientific literature, codebases, and long chats are standard in real life.
- Better long-context reasoning means more accurate analyses, safer decision support, fewer hallucinations, and stronger tools for learning and work.
- The model’s improvements also spill over to math, science Q&A, tools, and long conversations—because staying consistent over long stretches helps everywhere.
🍞 Top Bread (Hook): Imagine trying to remember every step in a giant LEGO instruction book with hundreds of pages.
🥬 Filling (The Actual Concept): Long-context Reinforcement Learning trains a model to follow those steps across the whole book and build correctly.
- What it is: Teaching the model, via rewards, to read long inputs, plan, and reason step by step.
- How it works: The model proposes answers; a reward checks correctness; the model learns which reasoning led to better outcomes.
- Why it matters: Without RL tailored for long inputs, the model won’t learn to combine far-apart clues.
🍞 Bottom Bread (Anchor): If a model must answer “Compare three safety policies across five yearly reports and pick the best,” long-context RL helps it learn how to track and connect those pieces over many pages without getting lost.
02 Core Idea
🍞 Top Bread (Hook): You know how a great study plan includes good practice problems, a steady routine without burnout, and a notebook to remember key ideas? Put those together and you ace the test.
🥬 Filling (The Actual Concept): The “Aha!” is a complete post-training recipe—data + stable RL + memory—that teaches long-context reasoning end-to-end.
- What it is: QwenLong-L1.5’s core idea is unifying three pillars: (1) a scalable long-context data synthesis pipeline, (2) stabilized RL for long inputs, and (3) a memory agent for ultra-long tasks.
- How it works: It generates multi-hop, globally grounded questions; trains with group-relative RL plus task-balanced sampling and AEPO to keep learning stable; and adds a memory loop to go beyond the window.
- Why it matters: Any one pillar alone isn’t enough; all three together make the model both accurate and scalable.
🍞 Bottom Bread (Anchor): Like training a detective: (1) give tricky cases with clues everywhere, (2) teach a calm investigation routine, (3) keep a case file (memory). Result: a detective who solves huge, tangled mysteries.
Multiple Analogies:
- Library analogy: Before, the model skimmed one shelf; after, it checks the whole library, notes cross-references (data), uses a calm search plan (stable RL), and keeps a summary notebook (memory).
- Cooking analogy: Before, one-pan meals; after, a feast where timing and multiple dishes must coordinate (data). The chef follows a reliable playbook (stable RL) and keeps prep lists and leftovers labeled (memory).
- Sports analogy: Before, short sprints; after, a marathon with water stations (memory), a pacing strategy (stable RL), and practice on varied terrains (data).
Before vs. After:
- Before: Models often found single facts but struggled to chain them across long contexts; training could be unstable; very long inputs were out of reach.
- After: The model handles multi-hop, globally scattered evidence, trains stably over longer sequences, and scales to million-token tasks using memory.
Why It Works (intuition, no equations):
- Data: If you only practice finding one sentence, you never learn to connect multiple clues. The pipeline builds tasks that force connections across far-apart parts of the text.
- Stabilized RL: Long inputs are noisy. Task-balanced sampling prevents one task type from skewing updates, and AEPO keeps exploration (randomness) in a Goldilocks zone—not too wild, not too timid.
- Memory: Even a large window has limits. Iterative memory lets the model compress what it has learned so far, plan next steps, and keep going chunk by chunk.
Building Blocks (Sandwich explanations):
- 🍞 You know how a teacher makes custom practice sheets that get harder and cover many topics? 🥬 Long-Context Data Synthesis Pipeline
- What it is: A programmatic way to create hard, multi-hop questions spread across long documents.
- How it works: Break documents into atomic facts and relations, build knowledge graphs and tables, compose multi-step questions, add distractors, then verify quality.
- Why it matters: Without rich, tough data, the model just learns easy retrieval, not deep reasoning. 🍞 Example: From three annual reports and two news articles, the pipeline asks: “Which product line rose every year, and which policy across divisions best explains it?”
- 🍞 Imagine giving equal time to math, reading, science, and art so you grow in all subjects. 🥬 Stabilized Reinforcement Learning (with task-balanced sampling and task-specific advantages)
- What it is: A steadier way to learn from rewards across many task types.
- How it works: Each training batch draws a balanced mix of tasks; advantages (learning signals) are normalized per task to avoid bias.
- Why it matters: Without balancing, one task can dominate, causing unstable training. 🍞 Example: A batch includes multi-hop QA, numerical reasoning, dialogues, and code-doc questions—each contributes fairly to updates.
- 🍞 When learning a new game, sometimes you explore randomly; other times, you exploit what works. 🥬 AEPO (Adaptive Entropy-Controlled Policy Optimization)
- What it is: A method that keeps exploration in a healthy range by masking harmful negative-gradient updates when the policy becomes too random, and restoring them when it becomes too predictable.
- How it works: If model entropy (randomness) gets too high, pause negative gradients from incorrect rollouts; if too low, reintroduce them to prevent collapse.
- Why it matters: Without AEPO, long-context RL can oscillate or stall. 🍞 Example: If answers start looking messy and erratic, AEPO temporarily learns only from good attempts until stability returns.
- 🍞 Picture reading a giant book one chapter at a time while keeping a summary sheet updated. 🥬 Memory-Augmented Architecture (Memory Agent)
- What it is: A loop that reads chunks, updates a memory, plans the next chunk, and finally answers.
- How it works: Split long input; at each step, update a compact memory and plan navigation; at the end, use memory plus instructions to generate the answer.
- Why it matters: Without memory, tasks bigger than the window are impossible to solve coherently. 🍞 Example: For 2 million tokens of logs, the agent summarizes each segment, tracks unresolved questions, and eventually answers the user query.
- 🍞 Think of training wheels, then a bigger bike, then a racing bike. 🥬 Multi-stage Fusion RL Training
- What it is: Gradually increasing input/output lengths; training a memory expert; then merging back into a single strong model.
- How it works: Stages 1–3 progressively grow full-context lengths; in parallel, a memory expert is trained; the models are merged; a final stage refines both skills together.
- Why it matters: Without progressive stages and merging, you either stay short or lose stability. 🍞 Example: Start at 20K tokens, move to 60K, then 120K, then add memory for 1M+ before unifying everything.
03 Methodology
High-level Recipe: Input (long docs + questions) → Data Synthesis & Verification → Progressive RL (GRPO + task-balanced sampling + task-specific advantage) → AEPO entropy control → Memory agent training → Model merging → Final full-context + memory-capable model.
Step-by-step (with Sandwich explanations for key pieces):
- 🍞 Imagine turning a messy library into a tidy map of facts and links. 🥬 Long-Context Data Synthesis Pipeline
- What it is: A system to generate multi-hop, globally grounded QA pairs at scale.
- How it works (like a recipe):
- Collect diverse long documents (code repos, papers, reports, literature, dialogues) and filter quality.
- Extract atomic facts into knowledge graphs or tables.
- Compose multi-hop and numerical queries by sampling long-range paths or running NL2SQL.
- Inflate context length with irrelevant but realistic distractors.
- Verify with two checks: grounding (remove doc; if model still answers, discard) and robustness (add noise; if answer breaks, discard).
- Why it matters: Without trustworthy, hard data, RL learns shallow shortcuts. 🍞 Anchor: Build a table across five reports; ask: “Total R&D over 2019–2023 for Division A vs. B, and which policy best explains the difference?”
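To make the path-sampling step concrete, here is a toy sketch (not the paper's pipeline): a handful of invented fact triples stand in for a knowledge graph, and a question is composed by chaining triples that share an entity, so answering it requires connecting facts from different places.

```python
import random

# Toy "knowledge graph" of (subject, relation, object) triples.
# Entities, relations, and values are invented for illustration only.
TRIPLES = [
    ("Division A", "reported_revenue_2021", "$1.2B"),
    ("Division A", "reported_revenue_2022", "$1.5B"),
    ("Division A", "governed_by_policy", "Policy X"),
    ("Policy X", "introduced_in", "the 2021 annual report"),
]

def sample_multi_hop_question(triples, hops=2, seed=0):
    """Compose a multi-hop question by chaining triples that share an entity."""
    rng = random.Random(seed)
    heads = {t[0] for t in triples}
    # Prefer starting triples whose object is itself the subject of another
    # triple, so a multi-hop chain actually exists.
    starts = [t for t in triples if t[2] in heads] or triples
    path = [rng.choice(starts)]
    for _ in range(hops - 1):
        nxt = [t for t in triples if t[0] == path[-1][2]]
        if not nxt:
            break
        path.append(rng.choice(nxt))
    question = (f"Starting from {path[0][0]}, follow "
                + " and then ".join(t[1].replace("_", " ") for t in path)
                + ". What do you reach?")
    return question, path[-1][2]  # (question, gold answer)

print(sample_multi_hop_question(TRIPLES))
```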
- 🍞 You know how exams should fairly cover all topics—not just one. 🥬 Task-balanced Sampling + Task-specific Advantage Estimation
- What it is: A way to draw equal amounts from each task type and normalize rewards per task.
- How it works:
- Pre-infer difficulty bins and sample evenly.
- In each RL batch, enforce equal counts from: multiple-choice QA, multi-hop QA, general reading comprehension, dialogue memory, and corpus-level numerical reasoning.
- Compute advantages using the reward variance of that specific task (not mixed), to reduce bias.
- Why it matters: Without it, reward noise and imbalance cause unstable updates. 🍞 Anchor: A batch might include 10 MC questions, 10 multi-hop, 10 dialogues, 10 table-calculation questions; each contributes cleanly to learning.
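The sketch below illustrates both ideas in plain Python (it is not the paper's code): batches are drawn with equal counts per task type, and rewards are normalized within each task before becoming advantages. The task labels and data layout are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical task labels; the recipe's exact taxonomy may differ.
TASKS = ["multi_choice", "multi_hop_qa", "general_rc", "dialogue_memory", "corpus_numerics"]

def task_balanced_batch(pool, per_task=8, seed=0):
    """Draw an equal number of training items from every task type."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for item in pool:                      # each item: {"task": ..., ...}
        by_task[item["task"]].append(item)
    batch = []
    for task in TASKS:
        batch.extend(rng.sample(by_task[task], per_task))
    return batch

def task_specific_advantages(rewards_by_task, eps=1e-6):
    """Normalize rewards within each task (zero mean, unit variance) so that
    no single task's reward scale or noise dominates the policy update."""
    advantages = {}
    for task, rewards in rewards_by_task.items():
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards) + eps) ** 0.5
        advantages[task] = [(r - mean) / std for r in rewards]
    return advantages
```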
- 🍞 Think of grading a group of practice essays by comparing them together. 🥬 GRPO (Group Relative Policy Optimization)
- What it is: An RL method that compares multiple candidate answers from the same prompt and scores them relative to each other.
- How it works:
- For each input, sample a group of responses.
- Compute a z-scored advantage per response using group rewards (or the task-specific variance in the improved setup above).
- Update the policy with a token-level loss so long answers don’t drown the signal.
- Why it matters: It avoids needing a separate value network and stays efficient for long contexts. 🍞 Anchor: For a long QA, the model tries 8 candidate answers; the better ones guide the update more than the worse ones, without overfitting to length.
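A minimal PyTorch sketch of the group-relative idea, for a single prompt: rewards from the sampled group are z-scored to form advantages, and the loss is averaged over tokens so longer responses do not dominate. This is an illustration, not the exact objective from the paper (the PPO-style clipping and any KL terms are omitted for brevity).

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: z-score each response's reward against the other
    responses sampled for the same prompt, so no value network is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std(unbiased=False) + eps)

def token_level_policy_loss(token_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            response_lengths: list[int]) -> torch.Tensor:
    """Every token inherits its response's advantage; averaging over all tokens
    keeps long answers from drowning out short ones."""
    per_token_adv = torch.cat([adv.expand(n) for adv, n in zip(advantages, response_lengths)])
    return -(per_token_adv * token_logprobs).sum() / per_token_adv.numel()

# Example: 4 sampled responses with rewards 1, 0, 0, 1 and various lengths.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
logprobs = torch.randn(10 + 7 + 12 + 9)          # placeholder per-token log-probs
loss = token_level_policy_loss(logprobs, adv, [10, 7, 12, 9])
```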
- 🍞 When a class gets too noisy, the teacher pauses certain activities to calm things down. 🥬 AEPO (Adaptive Entropy-Controlled Policy Optimization)
- What it is: A controller that turns down negative gradient updates when the model is overly random.
- How it works:
- Measure batch entropy (randomness).
- If above a high threshold: mask negative-advantage samples, learn from positive only → stabilizes.
- If below a low threshold: reintroduce negatives → prevents collapse and keeps exploration.
- Why it matters: Without AEPO, long-context RL often swings between chaos and stagnation. 🍞 Anchor: If answers look scattered and off-topic, AEPO quiets the chaos until focus returns; then it reopens exploration slowly.
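A rough sketch of this control logic, with made-up entropy thresholds; the real AEPO criterion and its hysteresis details may differ.

```python
import torch

class AEPOController:
    """Entropy-gated masking of negative-advantage samples (illustrative only).
    `high` and `low` are invented threshold values."""

    def __init__(self, high: float = 0.7, low: float = 0.2):
        self.high, self.low = high, low
        self.mask_negatives = False  # current mode

    def observe(self, policy_entropy: float) -> None:
        if policy_entropy > self.high:
            self.mask_negatives = True    # too random: learn from positive rollouts only
        elif policy_entropy < self.low:
            self.mask_negatives = False   # nearing collapse: restore negative gradients

    def sample_weights(self, advantages: torch.Tensor) -> torch.Tensor:
        """Per-sample weights to multiply into the policy-gradient loss."""
        if self.mask_negatives:
            return (advantages > 0).float()
        return torch.ones_like(advantages)
```

The controller is deliberately stateful: once high entropy flips it into positive-only mode, it stays there until entropy falls back below the low threshold, which is one plausible reading of "reintroduce them to prevent collapse."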
- 🍞 Imagine reading a 4-million-word saga one chapter at a time with a growing summary. 🥬 Memory Agent (Memory-Augmented Architecture)
- What it is: A loop that processes chunks, updates memory, plans next steps, and finally answers with the accumulated memory plus formatting rules.
- How it works:
- Split the user query into q_core (what to solve) and q_inst (format rules) to keep reasoning flexible.
- Chunk the long document; at step t, read chunk t and update memory m_t and plan p_t.
- After the last chunk, combine the final memory m_K with q_inst to produce the answer.
- Why it matters: It lets the model reason far beyond its fixed window without losing track. 🍞 Anchor: For millions of tokens of technical logs, the agent tracks bugs, merges related clues across chunks, and outputs a precise timeline in JSON at the end.
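Sketched as a loop below; the chunking scheme and the `update_memory` / `final_answer` callables are placeholders you would back with your own model calls, not the paper's interface.

```python
def memory_agent_answer(document: str, q_core: str, q_inst: str,
                        update_memory, final_answer, chunk_chars: int = 100_000) -> str:
    """Iterative memory loop: read chunk by chunk, keep a compact memory,
    then answer from the final memory plus the formatting instructions."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    memory = ""
    for t, chunk in enumerate(chunks, start=1):
        # Fold this chunk into the running memory (and, implicitly, the next-step plan).
        memory = update_memory(question=q_core, memory=memory,
                               chunk=chunk, step=t, total=len(chunks))
    # Only the accumulated memory and q_inst reach the final answering call,
    # so an answer is possible even though no single call saw the whole input.
    return final_answer(question=q_core, memory=memory, instructions=q_inst)
```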
- 🍞 Think of training for longer races step by step: 5K, 10K, half marathon, then marathon. 🥬 Multi-stage Fusion RL Training & Model Merging
- What it is: Progressive full-context RL (growing input/output lengths), plus a separate memory expert trained with memory-RL, then merged using SCE, then refined.
- How it works:
- Stage 1: 20K in / 12K out → activate long grounding skills.
- Stage 2: 60K in / 20K out → deepen aggregation.
- Stage 3: 120K in / 50K out → strengthen long reasoning.
- Memory-RL: train a memory expert at 128K with chunked processing.
- Merge the memory expert with the Stage-3 model using SCE (model fusion) → keep both skills.
- Stage 4: final full-context RL to polish both abilities.
- Why it matters: Sudden jumps or mixing everything at once can hurt stability; staged growth works better. 🍞 Anchor: Like first mastering essays at 2 pages, then 5, then 10; separately, learn note-taking; finally, combine both for long research papers.
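The staged schedule above can be written down as a small config plus an orchestration loop; `train_rl` and `merge_sce` are placeholders for the RL trainer and the SCE merging step, not real APIs.

```python
# Length schedule from the stages described above (tokens).
STAGES = [
    {"name": "stage1", "max_input": 20_000,  "max_output": 12_000},
    {"name": "stage2", "max_input": 60_000,  "max_output": 20_000},
    {"name": "stage3", "max_input": 120_000, "max_output": 50_000},
]
MEMORY_EXPERT = {"name": "memory_rl", "max_input": 128_000, "chunked": True}

def run_recipe(base_model, train_rl, merge_sce):
    """Grow context stage by stage, train a memory expert, merge, then refine.
    (The expert is trained sequentially here; the paper trains it in parallel.)"""
    model = base_model
    for stage in STAGES:
        model = train_rl(model, **stage)                 # full-context RL stages
    expert = train_rl(model, **MEMORY_EXPERT)            # memory-RL specialization
    merged = merge_sce(model, expert)                    # SCE model merging
    return train_rl(merged, name="stage4",               # final joint refinement
                    max_input=120_000, max_output=50_000)
```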
- 🍞 Before turning in homework, you double-check that the question really needs the sources you used and that your answer survives small changes. 🥬 Data Verification (Grounding + Robustness)
- What it is: Two checks so data truly teaches long-context reasoning.
- How it works:
- Grounding check: Remove the source docs; if the model still answers, discard (it’s just prior knowledge).
- Robustness check: Add irrelevant docs; if the answer collapses, discard (too brittle).
- Why it matters: Without this, training may teach shortcuts unrelated to long-context reasoning. 🍞 Anchor: A QA item only stays if it really needs the long document and still holds up when distractors are added.
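As a sketch of the two filters (the `ask_model` and `is_correct` helpers and the item layout are assumptions, not the paper's interface):

```python
def grounded(item, ask_model, is_correct) -> bool:
    """Grounding check: keep the item only if the model FAILS without the
    source documents; otherwise it is answerable from prior knowledge alone."""
    answer = ask_model(item["question"], context="")
    return not is_correct(answer, item["gold"])

def robust(item, ask_model, is_correct, distractor_docs: str) -> bool:
    """Robustness check: keep the item only if the answer still holds when
    irrelevant distractor documents are mixed into the context."""
    answer = ask_model(item["question"], context=item["context"] + "\n" + distractor_docs)
    return is_correct(answer, item["gold"])

def keep_item(item, ask_model, is_correct, distractor_docs: str) -> bool:
    return (grounded(item, ask_model, is_correct)
            and robust(item, ask_model, is_correct, distractor_docs))
```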
Secret Sauce Summary:
- Hard, verified multi-hop data creates the right pressure to learn.
- Stabilized RL (GRPO + task-balancing + task-specific advantages + AEPO) keeps learning on track.
- Memory agent and staged training extend capabilities beyond fixed windows without sacrificing stability.
04 Experiments & Results
The Test (What and Why):
- Benchmarks: DocMath (long numeric reasoning), LongBench-V1/QAs and V2 (multi-hop, multi-domain), Frames (Wikipedia multi-hop), MRCR (needle-in-haystack in dialogue but with structure), CorpusQA (corpus-level aggregation; very long).
- Why these: They force the model to pull together scattered clues, calculate across documents, and stay accurate across long and ultra-long inputs.
The Competition:
- Proprietary flagships: GPT-5, Gemini-2.5-Pro.
- Strong open/lightweight models: DeepSeek-R1-0528, Gemini-2.5-Flash-Thinking, Qwen3-Max-Thinking.
- Baseline: Qwen3-30B-A3B-Thinking-2507.
Scoreboard with Context:
- Overall average: QwenLong-L1.5-30B-A3B ≈ 71.82.
- That’s +9.90 over its own baseline (61.92)—like jumping from a solid B- to a strong A- across tough finals.
- Comparable to Gemini-2.5-Pro on long-context reasoning; ahead of DeepSeek-R1-0528 and Gemini-2.5-Flash-Thinking.
- Category highlights:
- MRCR: 82.99 (a huge +31.72 over baseline), showing strong disambiguation and retrieval amid long dialogue clutter.
- CorpusQA: 81.25 (near GPT-5 level), showing robust corpus-level aggregation and calculation.
- LongBench-V2: +6.16 over baseline, especially strong in medium-to-long subsets where clues are more spread out.
- Ultra-long (beyond 128K):
- MRCR 128K–512K: 34.87 vs. 16.55 baseline (memory agent mode) → more than double.
- MRCR 512K–1M: 22.53 vs. 4.24 baseline → a big leap.
- CorpusQA up to 4M tokens: 14.29 (memory mode) where full-context models cannot run; outperforms other agents.
Surprising Findings:
- The largest gains show up precisely where inputs are longest and information is most scattered—clear evidence the data+RL+memory recipe teaches true long-range reasoning, not just retrieval.
- Long-context training also boosts general domains: AIME25 (+3.65), GPQA-Diamond (+0.90), and long dialogue memory (+15.60 on LongMemEval). Skills transfer beyond the original training focus.
- Memory specialization temporarily lowers full-context performance, but model merging (SCE) restores and even improves the unified capability—so one model can do both.
Making the Numbers Meaningful:
- Think of MRCR as a messy classroom where lots of students talk at once (many distractors). Scoring ~83 vs. ~51 baseline is like hearing every key word in the chatter and still answering perfectly.
- On CorpusQA (lots of big reports), scoring ~81 vs. ~72 baseline means the model not only reads more but composes evidence into final, verified computations—like an analyst who summarizes 10 binders into one correct chart.
Ablations and Training Dynamics:
- Task-balanced sampling + task-specific advantage: smoother entropy and response lengths; +2.55 average over GRPO baseline on smaller trials.
- Negative gradient clipping: best when clipping high-entropy negatives; prevents over-penalizing exploratory steps that might later turn correct.
- AEPO: +3.29 over GRPO on 4B tests; on 30B, maintains a healthy exploration–exploitation balance across long runs.
- Progressive stages: Big jump after Stage-1 (activates long grounding), then steady gains for longer contexts; memory-RL + SCE merging is key to unify skills.
Bottom line: Across diverse tests, the model consistently improves where it should if it truly learned to reason over long inputs, and it scales to extreme lengths via memory agent training.
05 Discussion & Limitations
Limitations (honest look):
- Long outputs: While output limits were extended (up to 50K), tasks like chapter-level editing or long report drafting remain underexplored.
- Modality: The pipeline is text-only; real-world inputs often mix text, tables, figures, code, and images.
- Reward design: Current rewards blend rule checks with LLM-as-a-judge; nuanced, multi-criteria judging for open-ended tasks is still a challenge.
- Credit assignment: GRPO plus masking/clipping stabilizes training but doesn’t fully solve token-level credit assignment inside complex reasoning chains.
Required Resources:
- Large-scale compute for long-context RL and data synthesis; careful batching and memory to handle 120K+ input and up to 50K output.
- Access to judging models for verification (or robust open judges) and storage for big synthesized datasets.
When NOT to Use:
- If tasks are short and simple retrieval suffices, the full recipe may be overkill.
- If your application needs multi-modal reasoning (charts, images) today, additional modeling and data extensions are required.
- If you cannot afford the compute for staged RL and memory-agent training, lighter fine-tunes might be more practical.
Open Questions:
- Better token-level credit assignment: Can we reward the helpful steps within a mostly-wrong solution so learning is more precise and faster?
- Richer rewards: Can rubric-based LLM judges provide stable, multi-aspect scores that align with human preferences across diverse, open-ended tasks?
- Data flywheel: Can the improved long-context model generate and verify its own high-quality data at scale, closing the loop and reducing reliance on external APIs?
- Multi-modal scaling: How to extend the same recipe to text+tables+figures+code for real enterprise and research pipelines?
06 Conclusion & Future Work
Three-sentence summary:
- QwenLong-L1.5 presents a full post-training recipe—synthesized long-context data, stabilized RL (GRPO+balancing+AEPO), and a memory agent—that teaches models to reason across very long inputs.
- It delivers strong gains over its baseline, rivals leading proprietary systems on long-context tasks, and scales to million-token settings with a memory-augmented workflow.
- The learned skills generalize beyond the target domain—improving math, science, tool use, and dialogue memory.
Main Achievement:
- Unifying data, training, and memory into a single, staged post-training system that reliably upgrades long-context reasoning and memory management in an open 30B model.
Future Directions:
- Expand to long-input/long-output tasks (editing, report drafting), build a self-improving data flywheel, design richer token-level credit assignment and rubric rewards, and extend to multi-modal long-context reasoning.
Why Remember This:
- It shows that careful post-training—not just pretraining or architecture—can unlock deep, scalable long-context reasoning.
- The approach is practical: build better data, stabilize RL, and add memory to go beyond any fixed window.
- These ideas point to AI systems that can read, remember, and reason over the truly big stories we work with every day.
Practical Applications
- Financial analysis across multi-year reports to explain trends and policies with citations.
- Legal research over many filings to summarize arguments and precedents coherently.
- Scientific literature reviews that integrate findings from dozens of long papers.
- Enterprise log analysis over millions of tokens to trace root causes and timelines.
- Long software repository understanding for cross-file code queries and refactoring plans.
- Customer support memory that recalls long chat histories and prior resolutions.
- Education assistants that track learning progress and reference earlier lessons in long study plans.
- Policy comparison across large government documents to identify consistent impacts.
- Project management assistants that summarize long meeting notes and action logs.
- Tool-using agents that maintain long-term memory across many steps and tasks.