Reinforcement World Model Learning for LLM-based Agents
Key Summary
- Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
- This paper introduces RWML, a way for an LLM to learn a 'world model' by comparing what it thinks will happen with what actually happens.
- RWML uses a simple reward based on meaning (embeddings and ROUGE), not exact wording, so it learns the important ideas instead of memorizing text.
- It is fully self-supervised: no expert demonstrations, no expensive judge models, and no task-success labels are required to train the world model.
- On two tough benchmarks (ALFWorld and τ Bench), RWML alone boosts success over the base model and reduces invalid or wasteful actions.
- When you add normal task-success RL after RWML, it beats training with task-success RL alone and rivals methods that need expert data.
- RWML forgets less general knowledge than standard next-state supervised fine-tuning and changes model weights more gently.
- Ablations show the simple, binarized similarity reward is more stable and harder to game than LLM-as-a-judge, and that filtering out 'too easy' cases helps.
- RWML works best with reasonably strong base models and acts like a scalable 'mid-training' step before policy RL.
- The big idea: teach the model to imagine the next state accurately first, then teach it to win the task.
Why This Research Matters
Agents that can accurately imagine what happens next take fewer wrong turns, make fewer invalid tool calls, and finish tasks faster. This saves time and resources in customer support, software automation, and home robotics. Because RWML needs no expert data, it’s cheaper and easier to scale across many domains. Its reward is based on meaning, which makes training robust and less gameable than judge models. RWML also preserves general abilities better than next-state supervised fine-tuning, helping maintain broad competence. In short, it is a practical way to build safer, smarter, and more reliable AI helpers.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re playing a new video game. You can read the manual (language), but to win, you also need to predict what will happen if you press a button (world understanding). Reading alone isn’t enough—you need a feel for the game world.
🥬 The Concept (Large Language Models, LLMs): LLMs are computer programs that learn patterns in text so they can understand and generate language. How it works: (1) Read billions of sentences; (2) Learn which words tend to follow others; (3) Use that knowledge to answer questions or write text. Why it matters: Without LLMs, today’s smart assistants wouldn’t be able to read, write, or reason with words. 🍞 Anchor: When you ask, “What’s the capital of France?”, an LLM answers “Paris.”
🥬 The Concept (Reinforcement Learning, RL): RL teaches an agent to make better choices by giving rewards for good outcomes. How it works: (1) The agent tries an action; (2) The environment responds; (3) The agent gets a reward and learns to do more of what worked. Why it matters: Without RL, an agent wouldn’t improve its decisions from experience. 🍞 Anchor: Like learning to ride a bike—small falls teach balance.
🥬 The Concept (World Modeling): A world model is an agent’s inner “imagination” of what happens next after an action. How it works: (1) Look at the current situation; (2) Predict the next situation if you act; (3) Use those predictions to plan. Why it matters: Without a world model, an agent acts blindly and wastes steps. 🍞 Anchor: Before throwing a ball, you picture its arc so you can aim.
🥬 The Concept (Next-Token Prediction): This is training a model to guess the next word in a sentence. How it works: (1) Read text; (2) Hide the next word; (3) Guess it and update the model. Why it matters: It builds strong language skills, but it doesn't necessarily teach how actions change the world. 🍞 Anchor: Finishing the sentence “Peanut butter and …” with “jelly.”
🥬 The Concept (Group Relative Policy Optimization, GRPO): GRPO is a way to train agents with groups of attempts and a reward signal, while keeping them close to a safe reference behavior. How it works: (1) Generate several answers; (2) Score them; (3) Push the model toward higher-scoring answers; (4) Add a gentle “stay close” rule (KL) so it doesn’t go wild. Why it matters: Without GRPO-like training, agents can drift into bad habits or unstable behavior. 🍞 Anchor: A coach who praises the best drills each round but still keeps the team’s style disciplined.
Now the story. The World Before: LLMs were awesome at text: answering questions, writing code, and explaining ideas. But when these models became agents—talking to users, using tools, clicking buttons, or moving in virtual homes—they often acted like great readers who didn’t know how pressing a button changes the world. They could chat well but struggled to predict action consequences: “If I open this drawer, will I see a knife?” or “If I call this tool, what response will I get?”
The Problem: Agentic tasks are long and dynamic. The environment changes after each action. To do well, an agent needs to imagine the next state (the world model) and choose actions that move toward success. But standard LLM pretraining (next-token prediction) teaches matching words, not modeling how actions change states. This mismatch led to agents that over-explained, guessed, or issued invalid tool calls.
Failed Attempts: (1) Supervised next-state prediction (SFT): Train on triplets (state, action, next state) to predict next-state tokens exactly. This cared too much about exact wording (“token-level fidelity”) and could collapse or overfit phrasing instead of learning meaning. (2) LLM-as-a-judge rewards: Ask a big model to score answers. This was noisy, hackable, and expensive. (3) Pure policy RL with task-success rewards: Worked, but rewards were sparse and needed careful, expert design per environment.
The Gap: What was missing was a way to learn a good world model signal that is (a) self-supervised from interaction data, (b) focused on meaning rather than exact words, (c) robust to reward hacking, and (d) scalable without expert labels.
🥬 The Concept (Sim-to-Real Gap Rewards): These rewards measure how close the model’s imagined next state is to the real next state observed, focusing on meaning, not exact words. How it works: (1) The agent predicts the next state; (2) We compare it to the real next state using a semantic similarity (embeddings/ROUGE); (3) If close enough, give a reward (often binarized to 1 or 0). Why it matters: Without this, the agent might memorize wording or trick a judge, not actually learn what happens. 🍞 Anchor: Predicting “You reach the counter. You see utensils.” is judged similar to “You arrive at the countertop and see a spoon and a knife,” even if wording differs.
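For readers who like code, here is a tiny Python sketch of this kind of reward, assuming some text-embedding function is available; the 0.8 threshold and the helper names are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def sim_to_real_reward(predicted_state: str, real_state: str, embed, threshold: float = 0.8) -> float:
    """Binarized sim-to-real gap reward (sketch): 1 if the predicted next state is
    close enough in meaning to the real next state, else 0.

    `embed` is any text -> vector model; the 0.8 threshold is a placeholder,
    not the paper's tuned value."""
    a, b = embed(predicted_state), embed(real_state)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 if cosine >= threshold else 0.0

# e.g. sim_to_real_reward("You reach the counter. You see utensils.",
#                         "You arrive at the countertop and see a spoon and a knife.",
#                         embed)  → 1.0 if the embeddings are close enough in meaning
```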
Real Stakes: In your daily life, customer support bots need to predict what a tool will return before they use it. Software or phone assistants must anticipate what a click or swipe will do. Home robots need to guess where objects likely are. If agents can’t “imagine” the next state, they waste time, make invalid moves, and frustrate users. A reliable, scalable way to teach imagination unlocks better planning, fewer errors, and smoother help for people.
🥬 The Concept (Reinforcement World Model Learning, RWML): RWML is a training method that teaches an LLM to imagine next states accurately using sim-to-real gap rewards, before doing task-success RL. How it works: (1) Collect your own experiences (rollouts); (2) For each (state, action), predict the next state with brief reasoning; (3) Score the prediction by semantic similarity to the real next state; (4) Use GRPO to improve. Why it matters: Without RWML, agents rely on brittle wording matches or sparse end-of-task rewards and learn slower or less reliably. 🍞 Anchor: Like a pilot learning a flight simulator that matches the real plane, then training for actual missions.
By introducing RWML, this paper shows a path where agents first learn to imagine accurately (world modeling) and then learn to win (policy RL). That sequence fits how people learn too: practice imagining what happens next, then use that skill to plan and succeed.
02 Core Idea
🍞 Hook: You know how chess players imagine future moves before they decide what to play? If their imagination is accurate, their planning gets much better.
🥬 The Concept (RWML’s Aha!): Teach the model to imagine the next state accurately using a meaning-based reward, then teach it to win the task. That single switch—from exact words to semantic alignment—makes world learning scalable and robust. 🍞 Anchor: If the model says “arrive at countertop; see knife and spoon,” and the world says “you reach counter; see spoon and knife,” we count that as a match and give a reward.
Multiple Analogies:
- Weather Forecaster: First learn to predict tomorrow’s weather pattern (cloudy, windy) even if you don’t say the exact same words as the official report; then use that skill to decide whether to plan a picnic. RWML is the weather-learning part; policy RL is picnic planning.
- Video Game Minimap: The agent trains a mental minimap that updates correctly when it moves or acts. If the minimap matches the real map, it earns reward; later it uses the minimap to plan a speedrun.
- Lab Notebook: A scientist predicts what tomorrow’s experiment will show; if the trend matches (even with different phrasing), the scientist gains confidence. Then they use that confidence to design the next experiment.
Before vs After:
- Before: Next-state SFT cared about wording, not meaning; LLM-as-a-judge could be tricked; task-success RL needed carefully designed rewards and was sparse.
- After: RWML gives a dense, self-supervised, meaning-focused signal. The agent learns real dynamics first, then policy RL turns that imagination into higher success rates.
Why It Works (intuition, no equations):
- Meaning over Wording: Comparing in an embedding space (and with ROUGE for structured tool outputs) says, “Did you capture the important ideas?” not “Did you copy the exact words?”
- Binarized Reward (mostly 1 or 0): This reduces noisy gradients and makes it harder to hack the reward by adding fluff words or patterns.
- On-Policy RL (GRPO): Training on the model’s own rollouts helps it learn from what it actually does, preserving knowledge better than off-policy SFT and preventing collapse to trivial wording tricks.
- Hard-Case Focus: Subsampling “too easy” examples reserves training time for the challenging, knowledge-building transitions.
Building Blocks: 🥬 The Concept (Embedding Space): An embedding turns text into a vector that captures meaning. How it works: (1) Feed text to an embedding model; (2) Get a vector where similar meanings are close; (3) Compare vectors to judge similarity. Why it matters: Without embeddings, we’d over-focus on exact words instead of meaning. 🍞 Anchor: “countertop” and “kitchen counter” land near each other in embedding space.
🥬 The Concept (Cosine Similarity): A simple way to measure how close two meaning-vectors are. How it works: (1) Compute the angle between two vectors; (2) Small angle = high similarity; (3) Turn it into a distance. Why it matters: Without this, we can’t score how semantically close two texts are. 🍞 Anchor: Two arrows pointing almost the same direction = very similar meanings.
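A tiny numeric example, with made-up 4-dimensional vectors standing in for real text embeddings:

```python
import numpy as np

# Toy 4-dimensional vectors standing in for real text embeddings.
predicted = np.array([0.8, 0.1, 0.3, 0.5])
observed  = np.array([0.7, 0.2, 0.3, 0.6])

cosine = np.dot(predicted, observed) / (np.linalg.norm(predicted) * np.linalg.norm(observed))
print(round(float(cosine), 3))  # ≈ 0.985: the two "meanings" point in nearly the same direction
```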
🥬 The Concept (Self-Supervised Learning here): Learning from the world’s feedback without expert labels. How it works: (1) Roll out actions; (2) Observe real next states; (3) Use them directly to grade your imagined next states. Why it matters: Without self-supervision, we’d be stuck collecting costly expert data. 🍞 Anchor: Learning to throw by seeing where the ball lands, not needing a coach every time.
🥬 The Concept (Policy RL vs. World Model Learning): Policy RL learns to choose actions to maximize end rewards; RWML learns to imagine next states accurately from intermediate signals. How it works: RWML first, then policy RL. Why it matters: Without separating these phases, learning can be slower and less stable. 🍞 Anchor: First learn the science facts (world model), then ace the lab practical (policy).
🥬 The Concept (Ablation Studies): Tests where you remove or swap one piece to see its impact. How it works: (1) Replace embedding reward with LLM-as-a-judge; (2) Remove easy-sample filtering; (3) Remove RWML. Why it matters: Without ablations, we wouldn’t know which parts truly help. 🍞 Anchor: Taking a piece out of a LEGO model to see if the tower still stands.
In short, the key idea is simple but powerful: reward the model for getting the next state right in meaning, not exact words. Do this first, then optimize for task success. That ordering—and that kind of reward—gives agents a reliable imagination to plan with.
03 Methodology
At a high level: Input (triplets of past state, action, real next state) → Predict next state with brief reasoning → Compare meaning with real next state (embedding or ROUGE) → Give a simple reward (often binary) → Update the model with GRPO → Output is a better world-modeling agent.
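As a rough sketch (the hook functions below are illustrative stand-ins, not the paper's actual code), one RWML update over a batch of self-play transitions could look like this:

```python
def rwml_update(model, batch, sample_prediction, semantic_reward, grpo_update, group_size=5):
    """One RWML update over a batch of rollout transitions (sketch).

    Each batch element is a (history, action, real_next_state) triplet collected
    by self-play; the three hook functions are illustrative stand-ins, not the
    paper's actual API."""
    for history, action, real_next_state in batch:
        # Sample several candidate next-state predictions from the model itself.
        candidates = [sample_prediction(model, history, action) for _ in range(group_size)]
        # Score each candidate against the real next state on meaning, not wording.
        rewards = [semantic_reward(cand, real_next_state) for cand in candidates]
        # Group-relative policy update (GRPO), kept near a reference policy via KL.
        grpo_update(model, candidates, rewards)
```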
Step-by-step details like a recipe:
- Collect Interaction Data (Self-Play Rollouts)
- What happens: Use the target model itself to play through tasks in the environment. For each turn t, store ⟨history s_{≤t}, action a_t, real next state s_{t+1}⟩. Do several rollouts per task to capture variety.
- Why this exists: You need pairs of “what I did” and “what actually happened” to learn cause-and-effect.
- Example: In ALFWorld, the model tries “go to countertop 1,” and we record the new observation listing objects on the countertop.
- Subsample “Too Easy” Examples
- What happens: Train a lightweight SFT filter model on a small slice of the data and use it to flag triplets that are consistently easy to predict. Keep only a small fraction of those; keep most medium and hard ones.
- Why this exists: Spending time on trivially predictable transitions teaches little. Focusing on harder ones builds deeper world knowledge.
- Example: If “open already-open door” almost always yields “door remains open,” we include it less often than “look inside cabinet 3” whose contents vary.
- Prompt the Model to Reason and Predict the Next State
- What happens: Given s_{≤t} and a_t, the model first writes a short think-through, then outputs <next_state>...</next_state> describing the predicted observation. No teacher label is provided; it’s pure prediction.
- Why this exists: The think-through stabilizes predictions and encourages structured cause-and-effect, not just parroting text styles.
- Example: “I moved to the counter, so I should see utensils; the task isn’t done yet.” Then outputs a likely list of objects.
- Score the Prediction with Sim-to-Real Gap Rewards
- What happens: Compare the predicted next state with the real next state using meaning-based measures:
- For natural language observations: use embeddings and cosine similarity; if similarity is above a threshold, reward = 1, else 0.
- For structured tool outputs (JSON-like): compute a ROUGE-based score and round it to coarse steps for stability.
- Why this exists: Environments can describe the same reality in many wordings. Semantic scoring rewards the right idea, not the exact words.
- Example: “arrive at countertop; see spoon and knife” vs. “you reach counter; see knife and spoon” → reward 1.
- Update the Model with GRPO
- What happens: Generate multiple candidate predictions, score them, compute group-relative advantages, and update the policy. Keep the policy near a reference (KL regularization) to avoid drifting too far.
- Why this exists: Group-based scoring stabilizes training; KL regularization prevents reward-chasing behaviors that break language quality.
- Example: If 5 candidates earn [1,1,0,1,0], the ones scoring 1 get pushed up, but the model stays near its reference style (a code sketch of this group-relative scoring appears after this recipe).
- Repeat Over Batches for 1–2 Epochs
- What happens: Iterate this process over the filtered dataset. The model gradually internalizes how actions change states.
- Why this exists: Learning cause-and-effect takes repetition across many varied situations.
- Example: Over time, the model learns knives are often on countertops, and invalid tool calls rarely work.
- (Optional) Follow with Policy RL for Task Success
- What happens: After RWML (world imagination), run standard RL with end-of-episode task rewards to teach winning strategies.
- Why this exists: Accurate imagination boosts planning; success rewards teach choosing the best action sequences.
- Example: After learning likely object locations, the agent plans fewer detours to complete tasks faster.
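To tie a few of these steps together, here is an illustrative sketch of the easy-case filtering, the <next_state> prompt, and the group-relative scoring from the GRPO example above; the prompt wording, the 10% keep rate, and all helper names are assumptions, not the paper's implementation.

```python
import random
import numpy as np

# Illustrative prompt for the next-state prediction step; only the
# <next_state> tag format follows the description above, the wording is assumed.
WORLD_MODEL_PROMPT = """Interaction history:
{history}

Proposed action:
{action}

Think briefly about the consequences, then write the predicted observation
inside <next_state>...</next_state> tags."""

def filter_easy(triplets, is_easy, keep_easy_fraction=0.1, seed=0):
    """Keep most medium/hard transitions, but only a small fraction of the ones
    a small filter model already predicts reliably (is_easy is a stand-in)."""
    rng = random.Random(seed)
    return [t for t in triplets
            if not is_easy(t) or rng.random() < keep_easy_fraction]

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style scoring: each candidate is judged relative to its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# The [1, 1, 0, 1, 0] example from the GRPO step above: rewarded candidates get
# positive advantages, the others negative. A real update adds the KL term too.
print(group_relative_advantages([1, 1, 0, 1, 0]))  # ≈ [0.82, 0.82, -1.22, 0.82, -1.22]
```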
Concrete Data Examples:
- ALFWorld: s_{≤t} is a sequence of text observations; a_t is an action like “go to sidetable 1”; s_{t+1} is the next room description. Reward uses embeddings with a threshold tuned for meaning matches.
- τ Bench: s_{≤t} includes dialogue and tool-call history; a_t is a message or a tool call; s_{t+1} is a user reply or tool response (often JSON-shaped). Reward uses embeddings for user replies and ROUGE-rounded scores for tool responses to capture structure.
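For the structured tool responses, the sketch below shows how a ROUGE-style overlap score could be rounded to coarse steps; the minimal token-level ROUGE-L and the 0.25 rounding step are placeholders, since the exact variant and granularity used in the paper may differ.

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """Minimal token-level ROUGE-L F1 via longest common subsequence."""
    p, r = prediction.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def rounded_tool_reward(predicted_json: str, observed_json: str, step: float = 0.25) -> float:
    """Round the overlap score to coarse steps for a more stable reward signal
    (the 0.25 step size is a placeholder, not the paper's setting)."""
    return round(rouge_l_f1(predicted_json, observed_json) / step) * step

print(rounded_tool_reward('{"order_id": "A1", "status": "shipped"}',
                          '{"order_id": "A1", "status": "delivered"}'))
# → 0.75: mostly the right structure and fields, one wrong value
```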
The Secret Sauce (what makes it clever):
- Meaning First: Rewards care about semantics, not identical wording. This avoids overfitting to phrasing.
- Simple, Binarized Rewards: Crisp pass/fail signals are robust and harder to game than LLM judges.
- Self-Supervision at Scale: No expert labels needed—just rollouts and the world’s own feedback.
- Hard-Example Focus: Filtering “too easy” cases speeds up learning of real causal knowledge.
- Gentle Parameter Changes: Empirically, RWML updates fewer parameters than SFT, preserving general knowledge and playing nicely with later policy RL.
Putting it all together: RWML is a mid-training step that upgrades the model’s imagination. With better imagination, the same model, after a policy RL pass, becomes a stronger, more reliable agent.
04 Experiments & Results
The Test: The authors measure task success rates in two long-horizon, text-based agent worlds where predicting consequences matters a lot.
- ALFWorld: A text world of household tasks (find, pick up, move items). Good agents predict where objects likely are and what actions reveal.
- τ Bench: Customer-support style tasks with tools. Good agents predict user replies and tool outputs so they choose valid, helpful steps.
The Competition (Baselines):
- Task-success RL only (Policy RL) and Reinforced Finetuning (RFT).
- World model learning by supervised next-state prediction (WM SFT).
- Expert/strong-LLM data methods: Imitation Learning, Implicit World Modeling (IWM), and Self-Reflection (SR).
- Prompting-only references (ReAct with various models including GPT-4.1 and GPT-5).
The Scoreboard (with context):
- RWML alone (no experts, no task rewards) beats the base model by big margins: +19.6 points on ALFWorld, +7.9 on τ Bench. That’s like going from a C to a solid B/B+ just by practicing imagination.
- RWML + Policy RL vs. Policy RL alone: RWML + RL is +6.9 points better on ALFWorld and +5.7 on τ Bench. That’s like an A- vs. a B when everyone tried just studying with practice tests.
- Against expert-data methods: On ALFWorld, RWML + RL even outperforms some methods that rely on expert rollouts or strong LLMs. On τ Bench, it’s highly competitive despite using no expert data.
Surprising/Notable Findings:
- Fewer Invalid and Wasteful Actions: After RWML, invalid actions drop sharply (ALFWorld invalid/inefficient actions fall from ~59% to ~39%; τ Bench invalid tool calls from ~25% to ~9%). The agent gets more careful because it can imagine failures before acting.
- Less Forgetting vs. WM SFT: RWML-trained models retain general skills (like math or coding benchmarks) better than next-state SFT. That matches the intuition that on-policy RL preserves prior knowledge more.
- Reward Robustness: LLM-as-a-judge was sometimes hackable and noisy; the simple embedding/ROUGE-based rewards were more reliable.
- Parameter Change Profiles: RWML changes fewer parameters across layers than SFT, and stays compatible with later Policy RL (no big disruptive shifts). This likely helps stability.
- Base Model Strength Matters: On very hard tasks (τ Bench), stronger base models learn and transfer world knowledge better with RWML. Smaller models improve less.
Takeaway: Teaching the model to imagine next states with meaning-based rewards provides a strong, scalable foundation. Then, adding task-success RL builds on that to reach or beat methods that need costly expert data.
05 Discussion & Limitations
Limitations:
- Transfer for Small Models: On very tough domains, weaker base models (e.g., 7B scale) didn’t transfer world knowledge to decision-making as strongly as larger ones.
- Reward Design Still Matters: While embedding/ROUGE rewards were robust, thresholds and rounding choices affect stability and strictness. Poor settings can under- or over-reward.
- Environment Coverage: The model only learns what it experiences. If rollouts miss important situations, the world model may be incomplete.
- Partial Structure Handling: For highly structured outputs, ROUGE is a proxy; exact schema compliance and key presence can still be tricky.
Required Resources:
- Compute for Rollouts and GRPO training (the paper used B200 GPUs), plus an embedding model for similarity scoring.
- Access to the environments and simulators (ALFWorld, τ Bench; possibly a user simulator for τ Bench during training).
When NOT to Use:
- If you have abundant, high-quality expert demonstrations and perfect reward shaping already in place, RWML’s added complexity might not pay off as much.
- If your environment is extremely small or deterministic, simpler SFT might suffice.
- If tool outputs require exact byte-level matching (e.g., strict APIs with zero tolerance), meaning-based rewards may need complements like schema validators.
Open Questions:
- Can we design even better semantic rewards that check factual completeness and schema correctness simultaneously?
- How can we boost transfer for smaller models—curricula, better sampling, or distilled teachers?
- What’s the best way to mix RWML and policy RL—alternate epochs, joint objectives, or adaptive schedules?
- Can RWML extend to multimodal agents (vision + language) and real robots while keeping safety and cost manageable?
- Can we formally link the gentle parameter-change profile of RWML to reduced forgetting and improved downstream RL performance?
06 Conclusion & Future Work
3-Sentence Summary: This paper proposes RWML, a self-supervised way to teach LLM agents to imagine the next state accurately by rewarding semantic agreement between predicted and real next states. After this mid-training step, standard policy RL adds task-success optimization, producing agents that act more reliably and efficiently. Across ALFWorld and τ Bench, RWML boosts base models without experts and, when combined with RL, matches or beats approaches that need expert data.
Main Achievement: Turning world-model learning into a scalable, meaning-first, self-supervised RL problem that cleanly complements policy RL.
Future Directions: Improve semantic rewards (fact coverage, schema checks), enhance transfer for smaller models, explore multimodal and real-world settings, and study joint or adaptive schedules that blend RWML with policy RL. Better analysis tools could also connect parameter-change patterns to stability and generalization.
Why Remember This: RWML shows that teaching an agent to imagine well—using simple, robust, semantics-based rewards—can unlock better planning and decision-making later. It’s a practical recipe: imagine first, win next. This perspective may become a standard mid-training step for building safer, smarter, and more capable agents.
Practical Applications
- Customer support agents that predict likely tool responses before calling them, reducing invalid or wasted tool usage.
- Household assistants that plan efficient search paths by anticipating where items are likely located.
- Developer copilots that foresee compiler or API errors and suggest fixes before running code.
- IT troubleshooting bots that predict device states (e.g., airplane mode, SIM lock) and sequence checks accordingly.
- Data entry and RPA agents that anticipate form validations and required fields to avoid submission failures.
- Educational tutors that predict student misunderstandings and choose the next hint more effectively.
- Healthcare triage assistants that anticipate follow-up questions and required forms while staying within policies.
- Finance helpdesk bots that foresee the schema of account records and pre-validate tool queries.
- Web navigation agents that predict page structure changes after clicks and minimize dead-ends.
- Game-playing agents that develop accurate internal simulators to plan multi-step strategies.