From Word to World: Can Large Language Models be Implicit Text-based World Models?
Key Summary
- This paper asks whether large language models (LLMs) can act as "world models" that predict what happens next in text-based environments, not just the next word in a sentence.
- The team treats world modeling as next-state prediction: given a history and a new action, the model predicts the next observation and whether the task is done.
- They build a three-part test: (1) fidelity and consistency (are predictions correct and stable over many steps?), (2) scalability and robustness (do more data and bigger models help, and do they generalize?), and (3) agent utility (do agents actually perform better when using the world model?).
- Across five environments (ALFWorld, SciWorld, TextWorld, WebShop, StableToolBench), supervised fine-tuning on interaction trajectories lifts one-step accuracy to about 99% in structured worlds and yields strong F1 in tool use.
- Long rollouts stay consistent in structured settings (consistency ratio around 0.9-1.0) but drift in open-ended web tasks unless anchored with real observations.
- Performance scales predictably: structured worlds saturate with ~20K trajectories and small models, while open-ended worlds need more data and larger models.
- World models help agents avoid irreversible mistakes by verifying actions before execution (notably improving WebShop success), and can generate synthetic trajectories competitive with real ones.
- Warm-starting agents with world-model-style training stabilizes early reinforcement learning and improves final task success.
- Robustness depends on behavioral coverage (training on varied agent behaviors) and diverse environment exposure; expert-only data is not enough.
- Bottom line: with the right data, training, and usage, LLMs can be reliable text-world simulators that make agents safer and more efficient.
Why This Research Matters
Reliable, rewindable practice worlds make agents safer and cheaper to train because risky or irreversible actions can be simulated first. By scaling with data and model size, this approach offers a practical path to improving performance without endlessly expanding costly real environments. Synthetic trajectories reduce reliance on expensive data collection, helping smaller labs or teams build strong agents. Warm-starting reinforcement learning with world-model-style training speeds up learning and reduces instability. Anchoring shows that small doses of real input can greatly improve realism in open-ended domains. Together, these ideas pave the way for dependable AI assistants for web tasks, education, science labs, and more. Ultimately, it shifts LLMs from storytellers to simulators that help agents act wisely.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're playing a choose-your-own-adventure game. Each choice changes what happens next, and the fun part is seeing the world respond to you.
🥬 The Concept (World before this paper): For years, AI agents have learned by trying things in real or simulated environments (games, websites, or science labs) one step at a time. This is called agentic learning. As agents got better, they needed more and more experience to keep improving, but real environments are hard to scale, slow, and sometimes unforgiving (one wrong click ends a shopping episode!). How it worked: (1) Agents act in an environment. (2) They observe what happens. (3) They improve from that experience. Why it matters: Without easy, safe, and plentiful practice spaces, agents hit a wall: they can't gather enough experience to learn reliably and quickly.
🍞 Anchor: Think of trying to master piano but only getting 5 minutes per week on a real piano. Progress is slow and frustrating.
🍞 Hook: You know how a good storyteller can guess what might happen next in a story because they understand the world the story lives in?
🥬 The Concept (Large Language Models): LLMs have learned a lot about the world by reading tons of text. They're great at guessing the next word, but could they also guess the next event in an interactive world, like what a website would show after a search or what a lab sim would say after heating a beaker? How it works: (1) LLMs read the history. (2) They predict the next output. (3) They adapt with examples. Why it matters: If LLMs can predict next world states (not just words), we could turn them into low-cost, rewindable practice worlds for agents.
🍞 Anchor: It's like using a super storyteller as a flight simulator for pilots-in-training.
🍞 Hook: Imagine practicing tricky moves in a video game with a level editor where you can rehearse risky jumps without losing your progress.
🥬 The Concept (The Problem): Past attempts often made LLMs write plausible text, but that didn't guarantee the text matched what a real environment would do. They were missing three pillars: (1) fidelity over many steps, (2) predictable scaling, and (3) proven usefulness to agents. How it works (what was missing): (1) Models needed to keep an internal, consistent state. (2) We needed to know how data size and model size change results. (3) We had to show agents perform better using the model. Why it matters: Without these, simulated practice can teach the wrong lessons, and agents won't reliably improve.
🍞 Anchor: It's like practicing basketball on a court where the hoop height changes randomly: you might get good at the wrong game.
🍞 Hook: Think about learning in a text-only game (no pictures), where everything (what you see and what you do) is written in sentences.
🥬 The Concept (Text-based worlds as a testbed): The authors focus on text-only environments, a sweet spot where language models can naturally read and write the entire world interaction. How it works: (1) Use five representative worlds: ALFWorld, SciWorld, TextWorld (structured), and WebShop, StableToolBench (open-ended). (2) Treat each step as: history + new action → next state + success flag. (3) Train LLMs with many such steps. Why it matters: This setting turns next-token prediction into next-state prediction under a fixed protocol, letting us test if LLMs can be real simulators.
🍞 Anchor: It's like using chat messages to play a science lab game: you type "heat water," the game replies with results, and you check if an LLM can be that game engine accurately.
🍞 Hook: Picture trying to study whether a calculator is good: you need to check if it's correct, how it improves with practice, and whether it helps students pass tests.
🥬 The Concept (Three-level gap this paper fills): The authors propose a three-level evaluation: (1) Fidelity and Consistency (one-step and long-horizon correctness), (2) Scalability and Robustness (data/model size effects, OOD generalization), (3) Agent Utility (does it help agents avoid mistakes and learn faster?). How it works: (1) Measure exact-match and rollout transferability. (2) Vary data/model scales and environment difficulty. (3) Use the model to verify risky actions, synthesize data, and warm-start RL. Why it matters: This nails both scientific rigor and practical usefulness, showing not just that the model is smart, but that it makes agents better.
🍞 Anchor: After testing, they find strong, scalable gains in structured settings and practical boosts in safety and training efficiency.
🍞 Hook: Imagine you need variety in gym practice (dribbling, shooting, defense) to be ready for any game-day surprise.
🥬 The Concept (Behavioral Coverage): Training only on perfect, expert behavior leaves the model surprised by messy real actions. Including diverse, imperfect trajectories teaches the model to handle many situations. How it works: (1) Mix data from different agents (strong and weaker). (2) Train the world model on this mixture. (3) Test with agents it didn't see. Why it matters: Without diverse behavior, the model fails when agents act differently during deployment.
🍞 Anchor: Like practicing soccer drills with teammates of all skill levels so you're ready for any pass, not just perfect ones.
02 Core Idea
🍞 Hook: You know how, when reading a comic, you can often guess the next panel because you understand the story's rules?
🥬 The Concept (Aha! Moment): Treat an LLM as a world model by training it to predict the next state and success flag in a fixed, multi-turn interaction, turning next-word prediction into next-world-step prediction. How it works: (1) Format each interaction as: observations and actions so far, plus a new action. (2) The LLM predicts the next observation and whether the task finished. (3) Repeat to "roll out" imagined futures. (4) Fine-tune with real trajectories to align the model's dynamics with the environment's rules. Why it matters: Without aligning on next-state transitions, the model may write nice-sounding text that doesn't match what truly happens, leading agents astray.
🍞 Anchor: In ALFWorld, when you type "open fridge," a good world model replies with what's inside the fridge, not a random kitchen story.
Multiple Analogies:
- GPS for Actions: The agent is a driver, the world model is the GPS simulator that estimates, "If you turn left now, you'll end up on Maple Street in 2 minutes."
- Science Lab Sandbox: Before mixing chemicals, you simulate what will happen to avoid dangerous reactions.
- Video Game Practice Mode: Try a risky combo in a practice arena; if it works in sim, do it for real.
Before vs. After:
- Before: LLMs wrote plausible text but often drifted over many steps and weren't proven to help agents.
- After: With dynamics-aligned fine-tuning, LLMs keep coherent latent state over long rollouts, scale with data/model size, and measurably improve agents via verification, synthetic data, and RL warm-starts.
Why It Works (intuition):
- LLMs already encode vast world knowledge. Supervised fine-tuning on interaction trajectories reshapes that knowledge to respect environment dynamics step-by-step.
- The dialogue history acts as memory, letting the model maintain a latent state across turns.
- Providing initialization context in structured worlds reduces uncertainty, anchoring predictions.
- Mixing diverse behaviors teaches robustness to the many ways agents might act.
Building Blocks (each as a sandwich):
🍞 Hook: Imagine a recipe card that lists ingredients (history) and the next step you're about to take. 🥬 The Concept (Multi-Turn Interaction Protocol): A fixed format where each turn has: observation, agent's thought and action, then the environment's next observation and success flag. How it works: (1) Collect dialogues of real agent-environment turns. (2) Train the LLM to map history + action → next observation + reward. (3) Use the same protocol during rollouts. Why it matters: Without a consistent protocol, the model can't learn stable cause-and-effect patterns. 🍞 Anchor: "You are at the fridge. Action: open fridge. Next: The fridge is open; you see milk and eggs."
🍞 Hook: Think of practicing scales before playing a song. 🥬 The Concept (Dynamics-Aligned Fine-Tuning): Supervised training on many real transitions, so the model respects environment rules. How it works: (1) Gather thousands of trajectories. (2) Fine-tune to minimize next-state prediction error. (3) Validate on held-out environments. Why it matters: Prompting alone underfits long-tail behaviors; fine-tuning teaches the real rules. 🍞 Anchor: Accuracy jumps to ~99% in structured worlds after SFT.
🍞 Hook: It's easier to guess the next chess move if you see the whole board. 🥬 The Concept (Initialization Context): Structured environments provide fuller initial states to the world model than the agent sees, helping it track hidden variables. How it works: (1) Feed room layouts or lab inventories to the model at start. (2) Let the model update this latent state over time. Why it matters: Without fuller context, predictions drift faster in partially observable settings. 🍞 Anchor: In ALFWorld, knowing what's inside closed cabinets helps predict what "open cabinet" reveals.
🍞 Hook: Training for a marathon needs more miles and stronger muscles as the course gets tougher. 🥬 The Concept (Scalability & Robustness): Performance improves with more data and bigger models, especially in open-ended worlds. How it works: (1) Scale data from 1K to 160K transitions. (2) Scale models from 0.5B to 7B parameters. (3) Mix environments and agent behaviors. Why it matters: Without enough data/capacity, open-ended worlds (like the web) remain too diverse to simulate faithfully. 🍞 Anchor: Structured tasks saturate at ~20K samples; WebShop keeps improving up to ~70K+.
🍞 Hook: Before buying online, you might read reviews to avoid a bad purchase. 🥬 The Concept (Agent Utility): Use the world model for action verification, synthetic data generation, and warm-starting RL. How it works: (1) Simulate risky steps before executing. (2) Generate extra trajectories for SFT. (3) Pre-train agents on dynamics to stabilize RL. Why it matters: These tools turn accuracy into real gains: safer agents and faster learning. 🍞 Anchor: In WebShop, verifying checkout increases success across many agents.
03 Methodology
At a high level: Input (dialogue history + proposed action) → Step A: Format and condition the LLM → Step B: Predict next observation and success flag → Step C: Roll out multiple steps if needed → Output (imagined trajectory that agents can use).
Step A: Dialogue Packaging
- What happens: We bundle the history of observations (S), agent thoughts (T), and actions (A), then append the new action to create the model's input.
- Why this exists: A consistent protocol helps the model learn cause and effect reliably; without it, predictions become style-driven rather than rule-driven.
- Example: History says "You are at the fridge (closed)." New action: "open fridge." The input ends with that action, as in the packaging sketch below.
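To make Step A concrete, here is a minimal Python sketch of how such packaging might look. The world model plays the environment's side of a chat, and the prompt wording, role assignments, and the `package_dialogue` helper are illustrative assumptions rather than the paper's exact template.

```python
# Minimal sketch of Step A, assuming a chat-style interface: the world model
# plays the "environment" role and the agent's thoughts/actions arrive as user
# turns. Prompt wording and field layout are illustrative, not the paper's.
SYSTEM_PROMPT = (
    "You are a text-based world model. Given the interaction history and the "
    "agent's next action, reply with the next observation and whether the task "
    "is finished."
)

def package_dialogue(initial_observation, turns, new_action):
    """turns: list of (thought, action, observation) tuples seen so far.
    Returns a chat-style message list ending with the action to simulate."""
    messages = [{
        "role": "system",
        "content": f"{SYSTEM_PROMPT}\n\nInitial observation: {initial_observation}",
    }]
    for thought, action, observation in turns:
        messages.append({"role": "user",
                         "content": f"Thought: {thought}\nAction: {action}"})
        messages.append({"role": "assistant",
                         "content": f"Observation: {observation}"})
    # The final user turn carries the new action whose outcome we want predicted.
    messages.append({"role": "user", "content": f"Action: {new_action}"})
    return messages

# Example: one prior turn, then ask the model to simulate "open fridge 1".
history = [("The fridge might hold what I need.", "go to fridge 1",
            "You arrive at fridge 1. The fridge is closed.")]
for message in package_dialogue("You are in a kitchen.", history, "open fridge 1"):
    print(message["role"], "|", message["content"])
```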
Step B: Next-State Prediction
- What happens: The LLM predicts the next observation (S′) and a binary success/termination flag (R′). This is the core of next-state modeling.
- Why this exists: Agents need to know what the world will look like after their action; without a correct next-state, plans fail.
- Example (ALFWorld): From "open fridge," the model outputs "The fridge is now open; you see milk, apple." R′ = 0 (not done yet). A small parsing sketch follows.
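The model's reply then has to be split back into the next observation S′ and the flag R′. A tiny parser might look like the following; the "Observation: ... / Done: yes|no" output convention is an assumption of this sketch, not the paper's specified format.

```python
import re

def parse_world_model_output(text):
    """Split a completion of the form 'Observation: ...  Done: yes|no' into
    (next_observation, done). The output format is assumed for this sketch;
    any unambiguous convention would do."""
    match = re.search(r"Observation:\s*(.*?)\s*Done:\s*(yes|no)\s*$",
                      text, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unexpected world-model output: {text!r}")
    next_observation, done_flag = match.group(1), match.group(2).lower()
    return next_observation, done_flag == "yes"

print(parse_world_model_output(
    "Observation: The fridge is now open; you see milk and an apple.\nDone: no"))
# ('The fridge is now open; you see milk and an apple.', False)
```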
Step C: Long-Horizon Rollout
- What happens: We repeat Steps A-B as the agent continues acting, chaining predictions into a trajectory. We can roll out entirely in the model or interleave with real observations (anchoring) to reduce drift.
- Why this exists: Many tasks require multi-step planning. Without stable rollouts, small errors snowball across steps.
- Example (TextWorld): "unlock wooden door with key" → "open wooden door" → "go east"; the kitchen description updates consistently. A rollout sketch follows.
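Putting Steps A-C together, a rollout loop could look roughly like this. The `agent_step`, `simulate_step`, and `real_step` callables are hypothetical stand-ins (the world-model call would be built from the packaging and parsing helpers sketched above), and the modulo-based anchoring schedule is just one simple way to mix in real observations.

```python
def rollout(agent_step, simulate_step, initial_observation, max_steps=30,
            real_step=None, anchor_every=None):
    """Chain single-step predictions into a trajectory (Step C).

    agent_step(observation, history) -> (thought, action)         # acting agent
    simulate_step(history, action)   -> (next_observation, done)  # world model
    real_step(action)                -> (next_observation, done)  # real env, optional

    If real_step and anchor_every are given, every anchor_every-th observation
    is taken from the real environment instead of the world model ("anchoring").
    """
    observation, history, done = initial_observation, [], False
    for step in range(1, max_steps + 1):
        thought, action = agent_step(observation, history)
        use_real = real_step is not None and anchor_every and step % anchor_every == 0
        observation, done = (real_step(action) if use_real
                             else simulate_step(history, action))
        history.append((thought, action, observation))
        if done:
            break
    return history, done

# Toy demo with stub callables standing in for a real agent and world model.
demo_agent = lambda obs, hist: ("try the door", "open wooden door")
demo_world = lambda hist, act: ("The wooden door is open.", len(hist) >= 2)
trajectory, finished = rollout(demo_agent, demo_world, "You are in a hallway.")
print(finished, trajectory[-1])
```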
Step D: Training the World Model (Dynamics-Aligned SFT)
- What happens: Supervised fine-tuning on large collections of real trajectories, minimizing next-state errors.
- Why this exists: Few-shot prompting helps but plateaus in open-ended tasks; SFT teaches rare and structured transitions at scale.
- Example: Training raises next-state EM accuracy to ~99% in ALFWorld/SciWorld and boosts StableToolBench F1 to ~49%. A data-construction sketch follows.
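As a rough illustration of how logged trajectories become training data for dynamics-aligned SFT, the sketch below expands one trajectory into per-step (prompt, target) records. The record schema and field names are assumptions for illustration; any format accepted by a standard SFT trainer would work.

```python
import json

def trajectory_to_sft_records(initial_observation, turns):
    """Expand one logged trajectory into per-step SFT examples.

    turns: list of (thought, action, observation, done) tuples.
    Each record pairs the history-plus-action "prompt" with the ground-truth
    next observation and termination flag as the "target"."""
    records, history = [], []
    for thought, action, observation, done in turns:
        prompt = {"initial_observation": initial_observation,
                  "history": list(history),   # prior (thought, action, observation) turns
                  "action": action}
        target = f"Observation: {observation}\nDone: {'yes' if done else 'no'}"
        records.append({"prompt": prompt, "target": target})
        history.append((thought, action, observation))
    return records

# One short ALFWorld-style trajectory becomes two training examples.
turns = [("check the fridge", "open fridge 1",
          "The fridge 1 is open; you see a mug.", False),
         ("grab it", "take mug from fridge 1",
          "You pick up the mug. Task complete.", True)]
for record in trajectory_to_sft_records("You are in a kitchen.", turns):
    print(json.dumps(record))
```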
Step E: Initialization Context (Structured Worlds)
- What happens: For ALFWorld and SciWorld, the model receives fuller initial state (e.g., room contents, lab inventories) than the agent sees.
- Why this exists: These are partially observable worlds (POMDPs). Without hidden-state hints, the model struggles to predict outcomes of interactions with unseen objects.
- Example: Predicting what "open cabinet 3" reveals requires knowing what was inside at the start (see the sketch below).
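One simple way to supply such initialization context is to prepend the privileged facts to the world model's system prompt, as in this small sketch; the dictionary layout of the hidden state is an illustrative assumption.

```python
def with_initialization_context(system_prompt, hidden_state):
    """Prepend privileged initial-state facts (e.g., contents of closed
    containers) to the world model's system prompt."""
    facts = "\n".join(f"- {location}: {', '.join(items)}"
                      for location, items in hidden_state.items())
    return (f"{system_prompt}\n\n"
            f"Full initial state (not visible to the agent):\n{facts}")

print(with_initialization_context(
    "You are a text-based world model for ALFWorld.",
    {"cabinet 3": ["mug"], "fridge 1": ["milk", "eggs"]}))
```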
Step F: Scaling and Robustness Strategies
- What happens: We study data scaling (1K → 160K), model scaling (0.5B → 7B), cross-environment mixed training, and mixed-agent behavior coverage.
- Why this exists: Open-ended domains (WebShop, StableToolBench) demand both breadth (data) and depth (capacity). Mixed training shares transferable dynamics across tasks.
- Example: Mix3/4/5 (combining environments) accelerates learning in TextWorld/WebShop; mixed-agent training improves OOD stability for weaker agents.
Step G: Agent Utility Tools
- Pre-Execution Verification
- What happens: Before an irreversible action (e.g., checkout in WebShop), simulate in the world model. If predicted success, execute; else, revise plan.
- Why this exists: Prevents costly, unrecoverable mistakes. Without it, a single error can end the episode.
- Example: With a modest budget (2-10 checks), agents like GPT-4o and Claude increase task success; a minimal verification loop is sketched below.
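A minimal version of this verification loop might look like the following sketch, where `simulate_outcome` wraps a world-model rollout and `execute` touches the real environment; both are hypothetical stand-ins, and the fallback when no candidate passes is a design choice, not the paper's procedure.

```python
def verify_before_execute(candidate_actions, simulate_outcome, execute, budget=5):
    """Pre-execution verification for irreversible steps (e.g., checkout).

    candidate_actions: ordered list of actions the agent is considering.
    simulate_outcome(action) -> bool, predicted success from the world model.
    execute(action)          -> real outcome, called at most once.
    Tries up to `budget` candidates in simulation and executes the first one
    the world model predicts will succeed; otherwise falls back to the top guess."""
    for action in candidate_actions[:budget]:
        if simulate_outcome(action):
            return execute(action)
    # No candidate passed verification within budget: execute the best guess
    # anyway (or, alternatively, ask the agent to revise its plan).
    return execute(candidate_actions[0])

# Toy demo: the world model predicts only the second candidate succeeds.
result = verify_before_execute(
    ["click buy now", "apply coupon then buy"],
    simulate_outcome=lambda a: a == "apply coupon then buy",
    execute=lambda a: f"executed: {a}",
    budget=3)
print(result)  # executed: apply coupon then buy
```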
- Synthetic Trajectory Generation
- What happens: Use the world model to create successful practice episodes for SFT when real interaction is expensive.
- Why this exists: Reduces the experience bottleneck; without synthetic data, training can stall.
- Example: In SciWorld and WebShop, 1K synthetic trajectories match 1K real, and mixing both performs best (see the generation sketch below).
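A bare-bones generator for such synthetic data could look like this sketch, where `run_in_world_model` is a hypothetical wrapper around the rollout loop shown earlier and success filtering is the simple quality gate assumed here.

```python
def generate_synthetic_trajectories(tasks, run_in_world_model, n_per_task=10,
                                    keep_only_successful=True):
    """Roll an agent inside the world model and keep (by default) only the
    trajectories the model judges successful, yielding cheap SFT data."""
    dataset = []
    for task in tasks:
        for _ in range(n_per_task):
            trajectory, success = run_in_world_model(task)
            if success or not keep_only_successful:
                dataset.append({"task": task, "trajectory": trajectory})
    return dataset

# Toy demo: a stub rollout that always "succeeds" in two steps.
demo = generate_synthetic_trajectories(
    ["buy a wireless mouse under $20"],
    run_in_world_model=lambda task: (
        [("search for it", "search[wireless mouse]", "Results: ..."),
         ("buy the cheapest", "click[buy now]", "Order placed.")], True),
    n_per_task=2)
print(len(demo))  # 2
```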
- Early Experience (Warm-Starting RL)
- What happens: First train agents to predict next environment responses (like the world model task), then do normal SFT and RL.
- Why this exists: Early dynamics exposure gives better priors, stabilizing exploration and improving final success.
- Example: In ALFWorld and SciWorld, warm-started agents learn faster and finish higher; the staged recipe is sketched below.
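The staging itself is the main idea, as in this sketch; `sft_step` and `rl_step` are hypothetical training callables standing in for whatever SFT and RL tooling is used.

```python
def train_agent_with_early_experience(base_model, world_transitions,
                                      expert_trajectories, rl_env,
                                      sft_step, rl_step):
    """Three-stage recipe: (1) world-model-style SFT on next-state prediction
    ("early experience"), (2) ordinary behavior SFT on expert trajectories,
    (3) RL in the target environment. The staging order is the point."""
    model = sft_step(base_model, world_transitions)   # warm start on dynamics
    model = sft_step(model, expert_trajectories)      # standard agent SFT
    model = rl_step(model, rl_env)                    # reinforcement learning
    return model

# Toy demo with placeholder strings and trivial training callables.
final = train_agent_with_early_experience(
    base_model="base LLM (placeholder)",
    world_transitions=["(history, action) -> next observation", "..."],
    expert_trajectories=["expert trajectory", "..."],
    rl_env="ALFWorld (placeholder)",
    sft_step=lambda model, data: f"{model} + SFT on {len(data)} examples",
    rl_step=lambda model, env: f"{model} + RL in {env}",
)
print(final)
```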
The Secret Sauce:
- Dynamics-aligned SFT on multi-turn trajectories (not just prompts) teaches precise environment rules.
- Initialization context for structured worlds improves latent-state tracking.
- Anchoring rollouts with partial real observations in open-ended worlds cures drift.
- Mixed-agent and mixed-environment data broaden coverage and transfer useful patterns across tasks.
Mini Data Walkthroughs:
- ALFWorld: Input: "You're at cabinet 3 (closed). Action: open cabinet 3." Output: "Cabinet 3 is open; you see a mug." Next, "take mug" updates the inventory, not the room.
- WebShop: Input: "Search: 'wireless mouse under $20'"; the output lists products. "Add item #3 to cart" updates the cart; verifying "checkout" predicts success or failure.
- SciWorld: "Heat beaker on stove, then use thermometer" outputs the temperature change and success if the threshold is met. A single stored transition might look like the sketch below.
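For reference, a single transition from walkthroughs like these might be stored as follows; the field names, nesting, and action syntax are illustrative assumptions, not the paper's data schema.

```python
# One WebShop-style transition as it might be stored for training or evaluation.
transition = {
    "environment": "WebShop",
    "history": [
        {"action": "search[wireless mouse under $20]",
         "observation": "Results: [1] Wireless mouse $14.99 [2] ..."},
    ],
    "action": "click[buy now]",
    "next_observation": "Your order has been placed.",
    "done": True,   # binary success/termination flag
}
print(transition["action"], "->", transition["next_observation"])
```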
04 Experiments & Results
The Test (What they measured and why):
- Next-State Prediction Fidelity: Does the model predict the exact next observation and termination flag correctly?
- Rollout Consistency: Do sequences of predicted actions still work when replayed in the real environment? Measured by Real (true success), WM (success inside the world model), W2R (success when WM-generated actions are executed in the real environment), and the Consistency Ratio (CR = W2R / Real), computed as in the sketch after this list.
- Scalability & Robustness: How performance changes with more data, larger models, and out-of-distribution (OOD) shifts.
- Agent Utility: Do agents actually get better with verification, synthetic data, and early experience?
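The consistency numbers reduce to simple success rates plus the ratio CR = W2R / Real; a small helper that computes them from per-task outcomes could look like the following (the per-task boolean inputs are an assumption of this sketch).

```python
def success_rate(flags):
    """Fraction of True entries in a list of per-task success booleans."""
    return sum(flags) / len(flags) if flags else 0.0

def consistency_metrics(real_success, wm_success, w2r_success):
    """Real = success rate in the real environment, WM = success rate inside
    the world model, W2R = success rate when world-model rollout actions are
    replayed in the real environment, CR = W2R / Real."""
    real, wm, w2r = map(success_rate, (real_success, wm_success, w2r_success))
    return {"Real": real, "WM": wm, "W2R": w2r,
            "CR": w2r / real if real > 0 else 0.0}

print(consistency_metrics([True, True, False, True],
                          [True, True, True, True],
                          [True, False, False, True]))
# {'Real': 0.75, 'WM': 1.0, 'W2R': 0.5, 'CR': 0.666...}
```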
The Competition (Baselines and Comparisons):
- Prompted API LLMs (e.g., GPT-4o, GPT-4.1, GPT-5, Gemini-2.5-flash, Claude-sonnet-4.5) in zero-shot and few-shot.
- Supervised fine-tuned open-source backbones (Qwen2.5-7B, Llama-3.1-8B), plus a size sweep (0.5B to 7B).
The Scoreboard (with context):
- One-Step Fidelity (Table 1)
- Prompting works somewhat in structured worlds: few-shot Claude improves on SciWorld into the 70% range but stalls in open-ended WebShop (mid-50s).
- Supervised fine-tuning shines: Qwen2.5-7B and Llama-3.1-8B reach ~99% EM on ALFWorld and ~98% on SciWorld; StableToolBench hits ~49% F1 (open-ended tool outputs). That's like acing nearly every quiz question in structured classes and getting solid partial credit in a free-form essay exam.
- Long-Horizon Consistency (Table 2)
- Structured environments (ALFWorld, SciWorld, TextWorld) show strong CR around 0.9-1.0. That's like practicing a play and then performing it on stage almost exactly as rehearsed.
- WebShop shows lower CR (often under 0.8) due to search diversity and open-world variability. However, anchoring rollouts with real initial observations boosts CR dramatically (e.g., GPT-4o jumps near 1.0), proving that partial grounding cures simulation drift.
- Scaling Laws (Figures 2 and 3)
- Data: Structured worlds saturate around ~20K trajectories; open-ended worlds keep improving up to ~70K+ (StableToolBench not saturated at 160K). Translation: simple games need fewer practice rounds, complex worlds need many.
- Model size: Small models (~1.5B) handle structured dynamics; open-ended tasks benefit from larger capacity (steady gains up to 7B). Bigger brains help with messy, varied realities.
- Generalization (Figures 4 and 5)
- OOD in ALFWorld: Success stays close to real even when room layouts change or unseen room types appear, evidence that the model learns dynamics rather than memorized maps.
- Cross-Environment Mixed Training: Jointly training across 3-5 environments accelerates learning and improves accuracy, especially in TextWorld/WebShop. Exception: StableToolBench prefers its own schema-heavy data.
- Behavioral Coverage (Table 3)
- Training only on expert (GPT-4o) trajectories hurts OOD performance. Mixing trajectories from multiple agents raises consistency ratios for weaker agents (e.g., GPT-4o-mini's CR jumps from 0.49 to 0.81). Variety trains resilience.
- Agent Utility
- Safety via Verification (Table 4): In WebShop, pre-execution checks boost success for all agents, with medium budgets (2-10) offering the best trade-off. It's like proofreading before you click "Submit."
- Synthetic Data vs Real (Figure 6): 1K synthetic trajectories can match 1K real; mixing both yields the most stable gains. That's a budget-friendly way to grow experience.
- Early Experience (Figure 7): Warm-starting agents with world-model-style SFT stabilizes early RL and improves final success in ALFWorld and SciWorld.
Surprising Findings:
- Anchoring a small piece of reality in open-ended tasks almost eliminates rollout drift; a little truth goes a long way.
- Mixed-environment training transfers helpful dynamics between very different tasks, suggesting shared structural knowledge (like causality and procedure) is reusable.
- Synthetic trajectories are competitive with real ones, reducing reliance on expensive interactions.
05 Discussion & Limitations
Limitations:
- Open-Ended Drift: In diverse settings like WebShop, the world model can diverge from reality over long rollouts. This is mitigated by anchoring with real observations, but not fully solved.
- Partial Observability (POMDP): Agents see less than the full state; while initialization context helps in structured worlds, many domains cannot expose hidden variables, making predictions harder.
- Reward Simplicity: The binary success flag is coarse. Nuanced progress signals may be needed for richer tasks.
- Data and Compute Demands: Open-ended robustness needs lots of diverse trajectories and adequate model capacity, which can be resource-intensive.
- Behavioral Coverage Gaps: Relying on expert-only data hurts generalization to weaker or differently styled agents.
Required Resources:
- High-quality interaction logs (tens of thousands for structured tasks; up to hundreds of thousands for open-ended).
- Sufficient model capacity (≥1.5B parameters for structured tasks, larger for open-ended ones), plus GPU resources for SFT.
- Environment interfaces compatible with the multi-turn protocol and, when available, initialization context.
When NOT to Use:
- High-stakes, real-time domains where even simulated mispredictions are unacceptable and anchoring isnât possible.
- Extremely open-ended, rapidly changing environments without stable schemas or accessible ground-truth anchors.
- Situations lacking diverse training trajectories; expect brittleness and low OOD robustness.
Open Questions:
- How far can anchoring and retrieval push long-horizon fidelity in open-ended worlds while keeping costs low?
- Can we learn compact latent states explicitly (not just implicitly) to improve stability and interpretability?
- What curricula best trade off synthetic and real data to minimize cost but maximize agent learning?
- How to design richer reward signals and better success criteria that reflect nuanced progress?
- Can a single multimodal world model unify text, vision, and action to generalize beyond text-only domains?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper shows that with the right training, LLMs can act as text-based world models by predicting next states under a fixed interaction protocol.
- These world models scale predictably with data and model size, maintain long-horizon consistency in structured domains, and tangibly help agents through verification, synthetic data, and RL warm-starts.
- However, benefits depend on environment complexity and behavioral coverage; open-ended worlds require anchoring and broader training data.
Main Achievement:
- Reframing next-token prediction as next-state prediction and demonstrating, across five environments, a rigorous three-level framework (fidelity/consistency, scalability/robustness, and agent utility) that validates LLMs as practical world simulators for agent learning.
Future Directions:
- Develop lightweight anchoring and retrieval to curb drift in open-ended domains.
- Expand to multimodal and embodied settings with explicit latent-state tracking.
- Explore optimal mixtures of real and synthetic data for cost-effective training.
- Design richer success signals and safety checks for complex tasks.
Why Remember This:
- It bridges words and worlds: the same engines that predict text can, with alignment, simulate interactive environments that help agents learn safely and efficiently. This unifying view opens a path to scalable, rewindable practice for decision-making systems across many domains.
Practical Applications
- Add a pre-execution verification step to web agents so they simulate checkout and only buy when the world model predicts success.
- Generate synthetic training trajectories to augment scarce real data for agent SFT in structured environments.
- Warm-start RL agents by first training them to predict next states and termination flags in their target environment.
- Anchor simulated rollouts with occasional real observations in open-ended tasks to reduce drift and improve transfer.
- Train a single mixed world model across multiple environments to deploy one simulator that generalizes broadly.
- Collect trajectories from diverse agents (not just experts) to increase behavioral coverage and robustness.
- Use initialization context (when available) to improve hidden-state tracking in partially observable settings.
- Continuously scale data and model size based on environment complexity, prioritizing larger models and more data for open-ended domains.
- Leverage world models to test high-stakes or irreversible actions in robotics or tools before real execution.
- Adopt consistency metrics (Real, WM, W2R, CR) in your agent evaluation pipeline to detect simulation-to-reality gaps early.