
Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Beginner
Zhihan Liu, Lin Guan, Yixin Nie et al. · 1/26/2026
arXiv · PDF

Key Summary

  • LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.
  • This paper shows that two environment features best protect cross-domain generalization: high state information richness (lots to read) and high planning complexity (long, careful plans).
  • Surprisingly, surface realism and text similarity matter less; the simple puzzle Sokoban helped more than the realistic ALFWorld for generalization.
  • A low-cost trick called state information augmentation (adding small, goal-irrelevant text distractions) reliably boosts out-of-domain success.
  • Turning on step-by-step reasoning during RL training is crucial to preserve generalization, even if it doesn’t always raise in-domain scores.
  • SFT warmup (mid-training) makes covered domains durable against forgetting but can hurt performance in domains not included in the warmup.
  • Across WebShop, Sokoban, ALFWorld, and SciWorld, the best generalizers were trained where states were richer and plans were longer.
  • Measuring richness by state character count and planning complexity by average trajectory length gives practical proxies for picking or building better training environments.
  • Adding small, controlled distractions improved OOD success rates by up to 42.5% in some settings without changing the actual tasks.
  • Practical guidance: choose or construct training domains with rich states and long plans, keep explicit reasoning on, and lightly randomize states.

Why This Research Matters

Real-world assistants rarely work in just one tidy place; they face new websites, tools, and rules every day. This paper shows simple, actionable ways to train agents so they keep their skills when the world changes. By choosing info-rich, long-plan training tasks, lightly adding harmless distractions, and keeping step-by-step reasoning on, we can build agents that transfer better to new jobs. That means fewer failures when software updates, site layouts shift, or workflows grow longer. Companies can save costs on re-training and deliver more reliable user experiences. For everyday users, this translates into smarter copilots that adapt instead of breaking when the task looks different.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you practice basketball only on one tiny court with perfect lighting, then you’re asked to play in giant arenas, windy playgrounds, and crowded gyms. You’ll score well on your favorite tiny court, but might struggle everywhere else.

🥬 The Concept (Reinforcement Learning):

  • What it is: Reinforcement Learning (RL) teaches an AI agent by giving it rewards when it does well, so it learns which actions lead to success.
  • How it works:
    1. The agent looks at the current situation (the state).
    2. It chooses an action (like clicking a button or moving in a grid).
    3. The environment returns a new state and a reward (success or fail).
    4. The agent repeats this many times, changing its behavior to get more rewards.
  • Why it matters: Without RL, the agent won’t learn from interactive feedback and will keep repeating mistakes in multi-step tasks. 🍞 Anchor: Like a dog learning tricks: sit gets a treat, jump gets a treat, so the dog repeats what earns treats. The AI repeats actions that bring rewards.
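To make the loop concrete, here is a minimal sketch of the observe-act-reward cycle just described, using a toy one-step text environment. The `EchoEnv` class and the `policy` callable are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of the RL interaction loop (toy environment, not the paper's code).

class EchoEnv:
    """One-step text world: reward 1.0 if the agent's action matches the goal."""
    def reset(self):
        self.goal = "open drawer"
        return f"You see a closed drawer. Goal: {self.goal}"   # 1. the state

    def step(self, action):
        reward = 1.0 if action.strip() == self.goal else 0.0   # 3. reward signal
        return "Episode finished.", reward, True               # new state, reward, done

def run_episode(env, policy):
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                  # 2. the agent chooses an action
        state, reward, done = env.step(action)  # 3. environment responds
        total += reward
    return total                                # 4. feedback used to update the policy

# Usage: run_episode(EchoEnv(), lambda s: "open drawer") returns 1.0.
```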

🍞 Hook: You know how a kid who only practices spelling short words might panic when they suddenly face a long paragraph they’ve never seen?

🥬 The Concept (Cross-Domain Generalization):

  • What it is: Doing well in new, different tasks or worlds than the ones you trained on.
  • How it works:
    1. Train in one or a few environments.
    2. Test in other environments with different tools, goals, or states.
    3. Check if skills transfer without extra training.
  • Why it matters: Without generalization, an agent becomes a narrow specialist that fails when things look different. 🍞 Anchor: If you only learned to count apples, can you still do math with oranges? That’s cross-domain generalization.
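As a minimal sketch of this train-on-one, test-on-many setup (the `make_env`, `train_rl`, and `success_rate` helpers below are hypothetical):

```python
# Sketch of a cross-domain generalization check: train in one environment,
# then measure zero-shot success in held-out ones. Helper functions are assumed.

TRAIN_DOMAIN = "sokoban"
TEST_DOMAINS = ["webshop", "alfworld", "sciworld"]

def cross_domain_eval(policy, make_env, train_rl, success_rate):
    policy = train_rl(policy, make_env(TRAIN_DOMAIN))      # 1. train in one world
    return {name: success_rate(policy, make_env(name))     # 2-3. test in unseen worlds
            for name in TEST_DOMAINS}                       #      with no extra training
```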

The world before: LLM agents were often fine-tuned on a narrow slice of tasks (like a few websites or one game) but later deployed on a wide variety of jobs (office tasks, lab simulations, shopping, puzzles). Agents that topped popular benchmarks sometimes disappointed in real workflows because their training didn’t cover the messy variety of real life.

The problem: How do we choose or build the few training environments so that the agent keeps working well in many unseen domains? We can’t train on everything: interactive simulators are expensive to build, and data flywheels take time and compute.

Failed attempts: People tried betting on realistic-looking worlds or on text similarity between training and test. But these didn’t reliably protect out-of-domain performance. Also, mid-training (SFT warmup) made some domains very sturdy, yet weakened others that weren’t included.

🍞 Hook: Think of two sports drills. One drill makes you read a busy scoreboard and follow many signals; another drill makes you plan ten moves ahead to score. Which drills make you better at strange new games?

🥬 The Concept (State Information Richness):

  • What it is: How much information the agent must read and filter from each state (like long, detailed observations).
  • How it works:
    1. Present states with many details (some relevant, some not).
    2. Force the agent to pick out the signal from the noise.
    3. Over time, it learns to ignore distractions.
  • Why it matters: Without richness, the agent can latch onto shallow shortcuts that don’t transfer. 🍞 Anchor: A detailed map trains you to read and filter better than a tiny sketch.

🥬 The Concept (Planning Complexity):

  • What it is: How hard it is to chain many actions over time to reach a goal.
  • How it works:
    1. Tasks require multiple, dependent steps.
    2. Agents must decompose, track progress, and correct mistakes.
    3. Longer, trickier paths exercise deep reasoning.
  • Why it matters: Without long plans, agents rely on quick heuristics that break in new situations. 🍞 Anchor: Solving a maze with many turns teaches better navigation than a straight hallway.

The gap: We needed a simple, testable recipe for building or choosing training environments that actually protect cross-domain skills.

Real stakes: In daily life, assistants need to navigate businesses’ websites, automate office tasks, operate lab simulations, and plan multi-step workflows. If training doesn’t preserve generalization, agents waste time, make errors, or require costly re-training. Preserving generalization means smoother automation, more reliable copilots, and better user trust across changing tasks.

02 Core Idea

🍞 Hook: Imagine training for a school decathlon by practicing in a noisy gym (lots to notice) with obstacle courses (long plans). When meet day moves to a brand-new stadium, you still perform well because you learned to filter noise and plan carefully.

🥬 The Aha Moment: Train LLM agents in environments that are information-rich and require long, careful plans—and lightly add harmless distractions—then keep step-by-step reasoning on; this preserves performance in unseen domains better than relying on realism or text similarity.

Multiple analogies:

  1. City driving: Practice in dense traffic (many signs, billboards, honking) and complicated routes (detours), and you’ll handle new cities more calmly than if you only drove on empty straight roads.
  2. Studying: Do homework with varied, slightly distracting practice (random side notes) and multi-part problems; you’ll learn to focus and plan, not just memorize patterns.
  3. Sports drills: Train with crowd noise and tricky play sequences; you build filtering skills and long-horizon strategy that travel to away games.

Before vs. After:

  • Before: People often picked training worlds for realism or surface similarity, and agents overfit to shallow cues. Mid-training made some domains tough-as-nails but made others brittle.
  • After: The key axes are state information richness and planning complexity. Add small, goal-irrelevant text as controlled distractions; keep explicit reasoning; choose tasks that need longer plans. Generalization improves even to very different domains.

Why it works (intuition):

  • Rich inputs force the agent to develop a “signal picker” that filters away noise. Without this, it latches onto shortcuts (like a specific button label) that don’t hold elsewhere.
  • Long plans exercise chain-of-thought and progress tracking. This builds robust procedural knowledge rather than one-step reactions.
  • Light randomization prevents the agent from memorizing exact layouts or phrases; it learns what matters for the goal, not the wallpaper.
  • Making the agent explain steps out loud (step-by-step reasoning) discourages silent heuristics and encourages reusable strategies.

Building blocks:

🍞 Hook: You know how adding background chatter to practice makes you better at presenting in real classrooms?

🥬 State Information Augmentation (State Randomization):

  • What it is: Insert small, goal-irrelevant text snippets into states during RL training.
  • How it works:
    1. Choose harmless distractors (ads, unrelated descriptors, unreachable objects).
    2. Insert them at a controlled volume ε and sometimes only on a fraction of rollouts.
    3. Keep everything else (goals, rewards, dynamics) unchanged.
  • Why it matters: Without it, agents overfit to narrow patterns and lose OOD performance. 🍞 Anchor: Practicing math with a TV quietly on trains you to focus on the problem, not the background.

🍞 Hook: Think of solving story problems by writing every step instead of guessing the final answer.

🥬 Step-by-Step Reasoning:

  • What it is: Make the agent articulate its chain-of-thought while acting.
  • How it works:
    1. At each step, the agent reasons in small, explicit steps.
    2. It then chooses an action using that reasoning.
    3. Repeat across the episode.
  • Why it matters: Without explicit steps, the agent can adopt brittle shortcuts that collapse in new domains. 🍞 Anchor: Following a recipe makes you a better cook across kitchens than just eyeballing and hoping.

🍞 Hook: Training wheels help you ride early, but if you only ever try one type of bike, other bikes feel weird.

🥬 SFT Warmup (Mid-Training):

  • What it is: A short phase of supervised examples before or between RL phases.
  • How it works:
    1. Show demonstrations to bootstrap baseline skills (especially in hard domains).
    2. Then run RL to refine decisions.
    3. Knowledge in covered domains becomes more durable.
  • Why it matters: Without careful coverage, uncovered domains can be forgotten more. 🍞 Anchor: Practicing on a few tracks makes you fast there, but too much focus can weaken your off-road skills.

Bottom line: High information richness + long plans + light distractions + explicit reasoning beat realism and similarity for preserving cross-domain capability.

03 Methodology

High-level flow: Input (choose a source environment) → Measure key axes (richness, planning) → Train with RL (with/without state augmentation; with/without step-by-step reasoning; with/without SFT warmup) → Evaluate in other environments → Compare generalization.

Step 1 — Environments and Models:

  • Four text-based agent worlds: WebShop (web shopping), Sokoban (grid puzzle), ALFWorld (household tasks), SciWorld (science lab).
  • Base model: Llama-3.1-8B-Instruct. Two starting policies:
    • Ckpt V1: A short RL warmup on WebShop to get basic competence.
    • Ckpt V2: Adds an SFT warmup using expert SciWorld demos plus self-generated ALFWorld and WebShop data, enabling better initial coverage.

Why this step exists: Without viable starting performance across domains, we can’t measure how training on one domain affects unseen domains. Example: Ckpt V1 had near-zero on some domains; V2’s SFT warmup raised SciWorld from near-zero to workable.

🍞 Hook: Counting words in a page tells you how dense it is to read.

🥬 Measuring State Information Richness:

  • What it is: An estimate of how much information appears in each state.
  • How it works:
    1. Roll out 128 trajectories with a max of 50 steps.
    2. Compute the average character count of states.
    3. Use this as a proxy for richness.
  • Why it matters: Without a proxy, we can’t compare environments fairly. 🍞 Anchor: Sokoban’s textual state (coordinate lists) yielded high counts; ALFWorld’s compact summaries were lower.

🍞 Hook: Longer obstacle courses mean more steps to finish.

🥬 Measuring Planning Complexity:

  • What it is: How long and tough the plans are.
  • How it works:
    1. Roll out the same 128 trajectories.
    2. Record the number of steps per episode; failed ones count as max length (50).
    3. Average these lengths to estimate planning difficulty.
  • Why it matters: Without this, we’d miss how long-horizon reasoning affects generalization. 🍞 Anchor: SciWorld and Sokoban had long average trajectories (~43–44 steps), suggesting higher planning demands.
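A minimal sketch of both proxies is shown below, assuming `rollouts` is a list of episode dicts holding the observed state strings and a success flag; that data structure is an assumption, while the 50-step cap follows the paper's setup.

```python
# Proxies for state information richness and planning complexity.
# Each rollout: {"states": [str, ...], "success": bool}

MAX_STEPS = 50  # failed episodes count as the maximum length

def state_information_richness(rollouts):
    """Average character count per observed state (richness proxy)."""
    states = [s for ep in rollouts for s in ep["states"]]
    return sum(len(s) for s in states) / len(states)

def planning_complexity(rollouts):
    """Average trajectory length in steps (planning proxy)."""
    lengths = [len(ep["states"]) if ep["success"] else MAX_STEPS for ep in rollouts]
    return sum(lengths) / len(lengths)
```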

Step 2 — RL Training Recipe (Group-based RL like GRPO):

  • For each prompt, sample a group of N full-episode trajectories with the current policy.
  • Score each episode with a scalar success metric (sparse rewards are common).
  • Compute normalized advantages within the group.
  • Update the policy to favor better-than-average trajectories while controlling drift with a KL penalty.

Why this step exists: Group-based RL stabilizes training on sparse, multi-turn tasks without per-token value learning. Example: In WebShop, 8 rollouts per prompt, 16 prompts per step, success reward 10, failure 0.
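Below is a minimal sketch of the group-normalized advantage step; the policy-gradient update and KL penalty are only noted in comments, and this is not the paper's exact implementation.

```python
# Group-based advantage computation (GRPO-style sketch).
import statistics

def group_advantages(episode_rewards, eps=1e-6):
    """episode_rewards: scalar returns for the N rollouts sampled for one prompt."""
    mean = statistics.mean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    return [(r - mean) / (std + eps) for r in episode_rewards]

# Example: WebShop-style sparse rewards, 8 rollouts per prompt (success = 10, failure = 0).
advs = group_advantages([10, 0, 0, 10, 0, 0, 0, 10])
# Positive advantages up-weight the successful trajectories in the policy update;
# a KL penalty against the reference policy keeps the update from drifting too far.
```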

🍞 Hook: Practicing with light background chatter helps focus.

🥬 State Information Augmentation (the secret sauce):

  • What it is: Inject small, goal-irrelevant text into observations during training.
  • How it works:
    1. Choose harmless distractor snippets (ads in WebShop; trivial objects in ALFWorld; unreachable locations in Sokoban).
    2. Control the volume ε (how much text) and apply to a fraction of trajectories (often 50%).
    3. Do not change transitions, actions, or rewards—only the text state.
  • Why it matters: Without it, models can memorize fragile patterns; with it, they learn to filter signals. 🍞 Anchor: Adding 5–30 short lines of irrelevancies trained agents to ignore noise and improved OOD success.
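A minimal sketch of this augmentation follows: with some probability, append a small amount of goal-irrelevant text to the observation. The distractor snippets and parameter names below are illustrative assumptions, not the paper's exact content.

```python
# State information augmentation sketch: only the observation text changes.
import random

DISTRACTORS = [
    "[Ad] Flash sale: 20% off garden gnomes today only.",
    "A decorative plant sits in the corner. It is not interactable.",
    "Location (9, 9) is walled off and cannot be reached.",
]

def augment_state(state, epsilon=5, apply_prob=0.5, rng=random):
    """Append up to `epsilon` distractor lines to `state` on ~50% of rollouts.

    Transitions, actions, and rewards are untouched; only the text state changes.
    """
    if rng.random() > apply_prob:
        return state
    noise = rng.choices(DISTRACTORS, k=epsilon)
    return state + "\n" + "\n".join(noise)
```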

Step 3 — Step-by-Step Reasoning Toggle:

  • Train with reasoning on (the agent writes its thinking) vs. off (reactive actions only).
  • Reasoning didn’t always raise in-domain scores, but it strongly preserved OOD performance.

Why this step exists: To test whether explicit reasoning builds transferable strategies rather than brittle heuristics. Example: Disabling thinking led to huge OOD drops (relative differences of over 200% versus the reasoning-on runs in some setups).
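A sketch of how the toggle might look at the prompt level; the exact template used in the paper is not shown here, so the tags and wording below are assumptions following a common convention.

```python
# Reasoning toggle sketch: same observation, with or without a "think first" instruction.

def build_prompt(observation, reasoning_on=True):
    if reasoning_on:
        instruction = ("First reason step by step inside <think>...</think>, "
                       "then output exactly one action inside <action>...</action>.")
    else:
        instruction = "Output exactly one action inside <action>...</action>."
    return f"{instruction}\n\nObservation:\n{observation}\nYour response:"
```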

Step 4 — SFT Warmup (Mid-Training):

  • Add supervised demos before RL (SciWorld experts + self-generated ALFWorld/WebShop).
  • Pros: Domains in the warmup become more robust to later RL shifts.
  • Cons: Domains not in warmup can be forgotten more.

Why this step exists: To bootstrap very hard domains (e.g., SciWorld) and to study consolidation vs. forgetting.
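A minimal sketch of the two-phase schedule, with hypothetical `sft_update` and `rl_update` helpers passed in; the point is the ordering and the breadth of the warmup datamix, not the exact training code.

```python
# SFT warmup followed by RL refinement (schedule sketch, not the paper's code).

def train_with_warmup(policy, warmup_demos, rl_envs, sft_update, rl_update,
                      sft_epochs=1, rl_steps=150):
    # Phase 1: supervised warmup on a broad datamix (e.g., expert SciWorld demos
    # plus self-generated ALFWorld/WebShop trajectories) to bootstrap hard domains.
    for _ in range(sft_epochs):
        for demo in warmup_demos:
            sft_update(policy, demo)

    # Phase 2: RL refinement. Domains covered by the warmup tend to stay durable;
    # uncovered domains are more at risk of being forgotten.
    for _ in range(rl_steps):
        rl_update(policy, rl_envs)
    return policy
```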

🍞 Hook: Ranking players by how they perform away from their home field.

🥬 OOD Ranking Score and OOD Change:

  • What it is: Two ways to compare cross-domain robustness.
  • How it works:
    1. OOD Ranking Score: For each training domain, rank success rates across other domains; sum the ranks (lower is better).
    2. OOD Change (ΔOOD): Measure relative gain in OOD performance with augmentation vs. without.
  • Why it matters: Without clear metrics, we can’t quantify generalization. 🍞 Anchor: Sokoban and SciWorld scored best ranks; augmentation gave +5% to +42% relative OOD boosts depending on setup.
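Here is a sketch of both metrics under one reasonable reading of the ranking scheme, where `results` maps each training domain to its OOD success rates (e.g., `{"sokoban": {"webshop": 0.41, ...}, ...}`); the data layout is an assumption.

```python
# OOD Ranking Score (lower is better) and relative OOD change (ΔOOD).

def ood_ranking_score(results):
    """For each OOD target, rank training domains by success there; sum ranks per trainer."""
    scores = {train: 0 for train in results}
    targets = {d for rates in results.values() for d in rates}
    for target in targets:
        contenders = [t for t in results if target in results[t]]
        ordered = sorted(contenders, key=lambda t: results[t][target], reverse=True)
        for rank, train in enumerate(ordered, start=1):   # best gets rank 1
            scores[train] += rank
    return scores

def ood_change(ood_with_aug, ood_without_aug):
    """Relative OOD gain from augmentation, e.g. 0.425 means +42.5%."""
    return (ood_with_aug - ood_without_aug) / ood_without_aug
```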

Secret sauce (why this method is clever):

  • It uses simple, cheap proxies (char counts and trajectory lengths) to pick or shape training worlds.
  • It validates causality by changing only the state richness (via harmless text) and observing OOD improvements.
  • It shows a practical three-part recipe: pick info-rich/long-plan tasks, lightly randomize states, keep explicit reasoning.

04 Experiments & Results

The tests: Train on one domain at a time (WebShop, Sokoban, ALFWorld, SciWorld) and evaluate on the other three. Use two initial policies (Ckpt V1 and V2). Save checkpoints over 150 steps and average the last four. Repeat with different seeds.

The competition: Compare training domains by their cross-domain performance. Track both in-domain gains (ΔID) and OOD success. Also test modeling choices: SFT warmup and step-by-step reasoning; and the environment-side intervention: state information augmentation.

The scoreboard (with context):

  • Which training domains best preserved generalization?

    • Top generalizers: Sokoban and SciWorld. They both have high planning complexity (avg trajectory length ~43–44) and medium-to-high state richness.
    • Middle: WebShop (rich states but lower planning complexity).
    • Bottom: ALFWorld (low state richness, medium planning complexity).
    • Analogy: That’s like earning an A in transfer tests for Sokoban/SciWorld, a B for WebShop, and a C for ALFWorld.
  • Do bigger in-domain gains automatically mean worse OOD? Not exactly.

    • Even when Sokoban was trained longer to match or exceed ΔID of others, it still generalized better. So OOD robustness wasn’t just a trade-off with improvement size; environment properties mattered more.
  • Does realism help? Surprisingly, no.

    • The abstract puzzle Sokoban transferred better to realistic SciWorld than realistic ALFWorld did. Realism and surface text similarity were not the main drivers.
  • State information augmentation (adding harmless noise) works.

    • With Ckpt V1: OOD boosts were +32.6% (train ALFWorld), +35.5% (train WebShop), +42.5% (train Sokoban).
    • With Ckpt V2: OOD boosts were +7.0% (train ALFWorld), +33.4% (train WebShop), +5.7% (train Sokoban).
    • Interpretation: That’s like raising a B- to a solid B+/A- just by adding small, controlled distractions during training.
  • SFT warmup reshapes retention and forgetting.

    • Good news: Domains included in warmup (ALFWorld, WebShop, SciWorld) suffered much smaller OOD declines after later RL. The knowledge was “cemented.”
    • Bad news: Uncovered domains (e.g., Sokoban) sometimes dropped more. Over-focusing the warmup can hurt areas you didn’t include.
    • Analogy: Practicing only piano and violin makes you great at them, but your guitar skills may fade.
  • Step-by-step reasoning preserves OOD performance.

    • Turning off reasoning often left in-domain scores similar or slightly higher, but OOD performance collapsed dramatically (relative differences of over 200% in several cases with Ckpt V2).
    • Message: Reactive policies learn fragile patterns; explicit reasoning builds reusable strategies.

Surprising findings:

  • Sokoban’s simple-looking grid puzzle beat realistic ALFWorld in protecting generalization to other domains, including SciWorld.
  • Explicit reasoning is a must for cross-domain robustness, even if it doesn’t always help in-domain scores.
  • Carefully tuned, tiny distractions (augmentation) are enough to make agents more robust without changing tasks.

05 Discussion & Limitations

Limitations:

  • Only four environments were studied; broader coverage (more domains, tools, and action spaces) is needed to confirm universality.
  • Proxies are coarse. Character count approximates richness; average trajectory length approximates planning complexity and reachability. Finer measures (e.g., mutual information, branching factors tailored to LLMs) could be better.
  • Augmentation strength ε needs tuning. Too little may not help; too much can hurt in-domain learning.

Required resources:

  • Multi-turn RL requires interactive simulators and compute (the paper used 8× A100 GPUs). Collecting trajectories and running group-based RL can be resource-intensive.
  • For very hard domains (e.g., SciWorld), SFT warmup data (expert demos) is needed to bootstrap.

When not to use:

  • If you must squeeze maximum in-domain performance on one fixed domain and never transfer, disabling reasoning and not adding noise might slightly boost ID scores (at the cost of OOD collapse).
  • If your environment’s text channel is extremely bandwidth-limited, adding any noise could crowd out necessary info; then prioritize concise, relevant states and consider other regularizers.

Open questions:

  • Can we design principled, automated augmentation that adapts ε and content to keep the right difficulty?
  • What formal measures best capture “planning complexity” for LLM agents that don’t search like classical planners?
  • How should we schedule mid-training (SFT) to consolidate knowledge broadly without erasing uncovered domains?
  • Can we characterize and quantify how explicit reasoning structures (plans, checklists, self-checks) map to transfer gains?
  • Is there a theoretical link between group-based RL objectives and robustness to distribution shifts in multi-turn text environments?

06 Conclusion & Future Work

Three-sentence summary:

  • The paper finds that training in environments with high state information richness and high planning complexity best preserves cross-domain generalization for LLM agents.
  • A simple, low-cost method—state information augmentation (small, harmless distractions)—causally improves out-of-domain success, and keeping step-by-step reasoning on is vital for transfer.
  • SFT warmup cements skills for included domains but can harm those not covered, so breadth and balance matter when the deployment domain is unknown.

Main achievement:

  • Turning messy, hard-to-define “generalization” into actionable design rules: pick or build info-rich, long-plan environments; lightly randomize states; keep explicit reasoning.

Future directions:

  • Expand to more environments and model sizes; refine proxies for richness and planning; automate augmentation; develop adaptive mid-training schedules; connect to theory for robust policy optimization.

Why remember this:

  • Because what you train on defines what you keep. Training in busy, long-horizon worlds, with tiny distractions and explicit reasoning, pays less “generalization tax” when your agent steps into the unknown.

Practical Applications

  • When building RL training sets, prefer environments with longer average trajectories and denser observations to encourage robust planning and filtering.
  • Add small, goal-irrelevant text distractors (ads, trivial object notes) to states during training to improve OOD robustness without changing tasks.
  • Keep step-by-step reasoning enabled during RL to avoid brittle, non-transferable heuristics.
  • If you use SFT warmup, include a broad datamix so uncovered domains are less likely to be forgotten after RL.
  • Monitor simple proxies: average state character counts (richness) and average trajectory lengths (planning) to pick better training domains.
  • Tune augmentation volume ε: increase gradually until you see a small ID slowdown, then back off to balance learning and robustness.
  • Apply augmentation stochastically (e.g., 50% of rollouts) to control difficulty and maintain ID performance.
  • Evaluate with OOD Ranking Score across multiple target domains to avoid overfitting to one test world.
  • For hard domains with near-zero success, bootstrap with SFT warmup using demonstrations before RL.
  • Standardize RL hyperparameters (group size, KL penalty) and logging so you can compare checkpoints across domains fairly.
Tags: #cross-domain generalization, #state information richness, #planning complexity, #state randomization, #state augmentation, #LLM agents, #reinforcement learning, #GRPO, #step-by-step reasoning, #SFT warmup, #domain randomization, #Sokoban, #ALFWorld, #SciWorld, #WebShop