Aligning Agentic World Models via Knowledgeable Experience Learning
Key Summary
- WorldMind teaches AI agents to learn the rules of the real world while they act, instead of cramming everything into fixed model weights.
- It builds a World Knowledge Repository (a rulebook) from two kinds of experience: Process Experience (what went physically wrong) and Goal Experience (what worked well).
- The agent runs a Predict-Act-Verify loop so prediction errors become clues for writing new, helpful rules.
- Process Experience stops "physical hallucinations," like trying to slice without holding a knife, by adding safety rules learned from errors.
- Goal Experience captures winning playbooks from successful runs so future plans aim straight at the goal.
- A gating trick only lets the agent simulate future states when objects are grounded (seen or known), saving time and reducing mistakes.
- Across EB-ALFRED and EB-Habitat, WorldMind raises the strict Success Rate and completes many more correct sub-steps than strong baselines.
- The learned rulebook transfers across different models and environments, showing that it encodes general physical and procedural knowledge.
- WorldMind is training-free at test time: it updates external knowledge, not the model's parameters, which makes it practical and flexible.
- This approach turns failures into fuel, aligning an agent's inner world model with reality like a careful, curious scientist.
Why This Research Matters
This work turns every failure into a useful, shareable rule, which makes AI helpers safer and more reliable in your home, on your computer, and in the real world. Because rules are stored outside the model, different agents can swap knowledge instantly instead of paying for expensive retraining. That means faster improvement in robots that clean, cook, or fetch, and in web agents that book travel or fill forms. The method respects real-world physics, cutting down on silly or risky actions. It also explains itself through human-readable rules, which helps developers and users trust and debug the system. Over time, these agents build a living, community rulebook of common sense.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to bake cookies. A recipe tells you the steps, but your hands and eyes tell you what actually happens in the oven. If the cookies burn, that failure teaches you a rule: don't leave them in too long.
The Concept (Agentic World Models):
- What it is: An agentic world model is an AI that not only follows instructions but also predicts how the world will change when it acts, like a little scientist in its head.
- How it works: 1) It imagines the next state of the world, 2) chooses an action, 3) checks what really happened, and 4) updates its beliefs.
- Why it matters: Without this inner prediction machine, the agent is just guessing and will repeat silly or unsafe actions. Anchor: A home robot planning to "pick up the apple" imagines moving closer, reaching out, and holding the apple before it tries, so it doesn't keep grabbing at thin air.
Hook: You know how your brain sometimes thinks you can jump further than you actually can, and you only learn your limit when you try?
The Concept (Physical Hallucinations):
- What it is: Physical hallucinations are plans that sound smart in words but break real-world rules (like trying to slice without a knife).
- How it works: The AI proposes a step, ignores a physical constraint (e.g., hands are full), and fails when acting.
- Why it matters: These errors waste time, create dead-ends, and can cause damage in real settings. Anchor: Saying "Open the fridge" when you're far away or the door is already open sounds fine but fails in reality.
Hook: Think of guessing the next scene in a movie and checking if you were right when it plays.
The Concept (Predictive Coding):
- What it is: Predictive coding is the idea that a smart system constantly predicts what will happen, then uses the difference from reality (prediction error) to learn.
- How it works: 1) Predict next state, 2) observe true state, 3) measure error, 4) adjust beliefs/rules to reduce future errors.
- Why it matters: Errors become learning signals instead of just failures. Anchor: Expecting a cabinet to be empty, then seeing a bowl inside, teaches "This cabinet can store bowls; check it next time." (A minimal numeric illustration follows.)
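In its simplest textbook form (a standard illustration, not this paper's formulation, since WorldMind stores verbal rules rather than numeric beliefs), predictive coding nudges an internal estimate toward whatever the prediction error reveals:

```latex
\text{error}_t = s_t^{\text{observed}} - \hat{s}_t,
\qquad
\text{belief}_{t+1} = \text{belief}_t + \eta \cdot \text{error}_t
```

Here the hatted term is the predicted state, the observed term is the state actually seen, and eta is a small learning rate. WorldMind applies the same predict-compare-update idea, but the "update" is writing a human-readable rule instead of adjusting a number.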
Hook: Imagine keeping a notebook of cooking mistakes and victories so you don't repeat burns and you reuse winning tips.
The Concept (World Knowledge Repository, WKR):
- What it is: WKR is an external rulebook where the agent stores what it learned about how the world really works.
- How it works: After each attempt, the agent writes down new constraints (don'ts) and shortcuts (do's) as simple, readable rules.
- Why it matters: Rules live outside the model's fixed memory, so they can grow, be shared, and be used instantly without retraining. Anchor: "You can't pick up an object if your hands are full" and "Check both sides of the counter" get saved and reused.
Hook: Before this paper, many tried to pour all world rules into the model's weights, like trying to memorize every cookie recipe forever.
The Problem:
- What it is: Static fine-tuning makes models rigid; they can't adapt to new, weird, or changing environments without costly retraining.
- How it works: Weights are updated offline and stay the same during use, so new surprises cause old mistakes.
- Why it matters: Real homes and websites change; an agent must learn on the fly. Anchor: A robot trained in one kitchen fails in another if the trash can is inside a cupboard instead of beside the sink.
Hook: What was missing was a way to learn from each step while working, like keeping a live diary.
The Gap Filled by This Paper:
- What it is: A training-free method to align an agent's inner predictions to real physics and good procedures by writing external rules from experience.
- How it works: The agent runs a Predict-Act-Verify loop, turns errors into Process Experience and successes into Goal Experience, and saves both in the WKR.
- Why it matters: The agent becomes flexible, safer, and faster at solving tasks without retraining. Anchor: The first time "Pick up bowl" fails because its hands are full, the agent forever remembers to "put down first" before picking up.
Hook: Why should you care? Because we want helpful robots and software assistants that don't break things or get stuck.
Real Stakes:
- What it is: Safer home robots, steadier web agents, and smoother multi-step helpers.
- How it works: Turn each failure into a new rule; turn each success into a reusable strategy.
- Why it matters: Fewer dumb mistakes, more reliable help in daily life. Anchor: A cooking robot stops repeating "turn on stove" when it's already on and learns to check both stove knobs before placing a pot.
02 Core Idea
Hook: Picture a careful explorer who sketches the map as they go, marking dead-ends in red and shortcuts in green.
The Aha Moment:
- What it is: Let the agent learn a live rulebook from its own experiences: errors make red rules (don'ts), successes make green guides (do's), so its inner world model stays aligned with reality without retraining.
- How it works: 1) Predict-Act-Verify each step, 2) turn prediction errors into Process Experience (physical constraints), 3) turn wins into Goal Experience (procedural heuristics), 4) store both in a World Knowledge Repository and retrieve them next time.
- Why it matters: Plans become both possible (physically valid) and purposeful (goal-directed). Anchor: After failing to "open a cabinet" from far away, the agent writes "Move close before open," then nails it next time.
Hook: You know how you keep a travel journal of places to avoid and places to revisit?
Multiple Analogies:
- Analogy 1 (Sports Coach): Missed shots (errors) teach form limits ("bend knees first"), while highlight reels (successes) become drills. The WKR is the playbook.
- Analogy 2 (Cooking): Burnt batches add safety timers (Process Experience); perfect batches become go-to recipes (Goal Experience). The WKR is the recipe box.
- Analogy 3 (GPS with Traffic): Road closures (errors) add red no-go zones; fast routes (successes) become green preferred paths. The WKR is the live traffic layer. Anchor: The agent avoids bumping into closed doors (red rule) and takes the shortest cabinet-check order (green guide).
Hook: Before vs. After is like memorizing every test answer vs. learning how to study new questions.
Before vs. After:
- What it is: Before, models stuffed physical rules into fixed weights; after, rules live outside and grow with use.
- How it works: Before: retrain to adapt; After: write/read rules at inference time.
- Why it matters: Adaptation becomes cheap, fast, and shareable across agents. Anchor: A web agent that times out on a stubborn login learns "don't retry the same button thrice; refresh first" and shares it with a newer agent.
Hook: Why does this work so well?
Why It Works (Intuition):
- What it is: Prediction error is gold. It points exactly to the boundary your inner world model got wrong.
- How it works: When the predicted next state does not match the real next state, the agent extracts the missing causal rule and stores it.
- Why it matters: The gap closes where it counts, making future simulations more trustworthy. Anchor: The agent predicts "I will hold the knife," but reality says its hands are full; rule learned: "Put down before picking up."
Hook: Let's break the big idea into snack-sized blocks.
Building Blocks:
- World Knowledge Repository (WKR): the external rulebook.
- Process Experience: rules from errors that enforce physics.
- Goal Experience: strategies from successes that speed up tasks.
- Predict-Act-Verify loop: turns experience into rules.
- Constrained Simulation (gating): only imagine futures when objects are seen or known.
- WK-MDP: a decision setup where knowledge W steers both actions and predictions.
- Why they matter together: Physics rules prevent bad moves; goal rules guide good ones; gating saves time; the loop keeps learning. Anchor: With these pieces, the agent opens the right fridge door, in the right order, from the right distance, then remembers that pattern for next time. (A data-structure sketch of these pieces follows this list.)
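To make the blocks above concrete, here is a minimal Python sketch of how the repository's two experience types could be represented. All class and field names are illustrative assumptions; the paper stores rules as readable text, and its exact data format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessExperience:
    """A constraint ("don't") distilled from a prediction error."""
    rule: str           # e.g. "You can't pick up an object if your hands are full."
    failed_action: str  # the action that exposed the missing physical constraint
    context: str        # abstract state in which the failure happened

@dataclass
class GoalExperience:
    """A procedural heuristic ("do") distilled from a successful run."""
    goal: str                                           # e.g. "Place the apple in the bowl."
    strategy: list[str] = field(default_factory=list)   # high-level winning steps

@dataclass
class WorldKnowledgeRepository:
    """External, human-readable rulebook that grows at test time."""
    process: list[ProcessExperience] = field(default_factory=list)
    goal: list[GoalExperience] = field(default_factory=list)
```

Because the rules are plain text, they can be inspected, edited, and shared between agents without touching model weights.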
03 Methodology
Hook: Think of a tidy science lab notebook: you predict, you test, you record a rule, and you use it next time.
High-Level Recipe:
- What it is: Input → Retrieve rules → Predict and choose action → Act → Verify → Write new rules → Repeat.
- How it works: 1) See the scene and goal, 2) pull matching rules from the WKR, 3) simulate next state (only if grounded), 4) take action, 5) compare predicted vs. real, 6) add Process or Goal Experience, 7) plan the next step.
- Why it matters: Each cycle shrinks the gap between imagination and reality. Anchor: Goal: "Put the apple in the bowl." The agent predicts grabbing the apple, acts, checks success, and adds any new rule it learned.
Hook: Before we dive deeper, meet the learning loop that powers everything.
Predict-Act-Verify Loop:
- What it is: A tight cycle where the agent imagines, tries, and checks.
- How it works: 1) Predict future state and action, 2) act in the environment, 3) verify real state, 4) generate a learning signal from any mismatch.
- Why it matters: No wasted failures: every miss becomes a rule. Anchor: The agent predicts "cabinet will be open," acts with "open cabinet," verifies "still closed" because it was too far away, and learns "must be close to open." (A code sketch of the loop follows.)
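Here is a minimal sketch of that loop in Python. The `env`, `planner`, `verifier`, `reflector`, and `wkr` objects are hypothetical stand-ins for whatever the real system uses; this shows the shape of the cycle, not the paper's implementation.

```python
def predict_act_verify(env, planner, verifier, reflector, wkr, goal, max_steps=30):
    """One episode of the Predict-Act-Verify loop (illustrative sketch only)."""
    obs = env.reset(goal)
    trajectory = []
    for _ in range(max_steps):
        rules = wkr.retrieve(goal, obs)                     # pull relevant do's and don'ts
        predicted, action = planner.plan(obs, goal, rules)  # imagine, then choose
        obs = env.step(action)                              # act in the real environment
        real = planner.abstract(obs)                        # summarize what actually happened
        mismatch = verifier.compare(predicted, real)
        if mismatch:                                        # prediction error -> new constraint
            wkr.add_process(reflector.explain(trajectory, action, mismatch))
        trajectory.append((action, real))
        if env.task_done():                                 # success -> reusable strategy
            wkr.add_goal(reflector.summarize(goal, trajectory))
            break
    return trajectory
```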
Hook: You know how you title a note with only the important parts so it's easy to search later?
State Abstraction:
- What it is: Turning raw, messy details into clean, high-level facts (e.g., hand empty, door closed, object visible).
- How it works: The agent converts the real next state into a short semantic summary before comparing with its prediction.
- Why it matters: Learning focuses on causal rules, not noisy pixels. Anchor: Instead of storing a whole image, it saves "Fridge: closed; Distance: far; Hand: full." (A small sketch of this step follows.)
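A tiny sketch of what such an abstraction step might look like. The observation field names are assumptions for illustration, not the benchmark's actual schema.

```python
def abstract_state(observation: dict) -> dict:
    """Reduce a raw observation to a few high-level, comparable facts."""
    return {
        "hand": "full" if observation.get("held_object") else "empty",
        "visible": sorted(observation.get("visible_objects", [])),
        "open_containers": sorted(observation.get("open_containers", [])),
        "near_target": observation.get("distance_to_target", float("inf")) < 1.5,
    }
```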
Hook: Imagine a referee checking if the predicted play actually happened.
Judgment (Verifier):
- What it is: A check that compares predicted abstract state with the real abstract state.
- How it works: If there is a semantic mismatch, it flags a physical hallucination.
- Why it matters: Only true, important errors trigger learning. Anchor: The agent predicted "Holding Knife," but the real state says "Hand empty": flag it and learn why. (A comparison sketch follows.)
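A minimal way to implement that comparison over abstract states, as a sketch; the actual verifier may be an LLM-based judge rather than an exact-match check.

```python
def verify(predicted: dict, real: dict) -> list[str]:
    """Return the facts where prediction and reality disagree.

    A non-empty result flags a possible physical hallucination.
    """
    mismatches = []
    for key, expected in predicted.items():
        if key in real and real[key] != expected:
            mismatches.append(f"{key}: expected {expected!r}, got {real[key]!r}")
    return mismatches

# Predicted a held knife, but the hand is still empty:
print(verify({"hand": "full"}, {"hand": "empty"}))
# ["hand: expected 'full', got 'empty'"]
```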
Hook: Think of writing a sticky note after a mistake so you don't do it again.
Self-Reflexion:
- What it is: A reflection step that turns a flagged error plus its context into a simple, reusable rule.
- How it works: It reads the recent history, the action, and the mismatch, then synthesizes a verbal causal rule.
- Why it matters: The agent explains its own mistake to its future self. Anchor: The rule "If an object is not visible, use 'find' or 'navigate' first" gets written to the WKR. (A prompt-style sketch follows.)
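One way this reflection step could be phrased as a prompt to the backbone model. The prompt text and the `llm` callable are illustrative assumptions, not the paper's exact prompt.

```python
REFLECTION_PROMPT = """You are an embodied agent reviewing a failed step.
Recent history: {history}
Action taken: {action}
Predicted state: {predicted}
Actual state: {actual}
Write ONE short, general rule that would have prevented this mistake."""

def reflect(llm, history, action, predicted, actual) -> str:
    """Turn a flagged mismatch into a reusable, human-readable rule.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    prompt = REFLECTION_PROMPT.format(
        history=history, action=action, predicted=predicted, actual=actual
    )
    return llm(prompt).strip()
```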
Hook: Collecting red lights is not enough; we also want green arrows that point us the right way.
Goal Experience:
- What it is: Heuristics distilled from successful trajectories: procedural playbooks that worked.
- How it works: After success, the agent summarizes high-level steps that led to the goal and saves them.
- Why it matters: Future planning starts closer to the winning route. Anchor: "Check both sides of the counter, then the drawer" becomes a reusable search pattern. (A distillation sketch follows.)
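A small sketch of how a successful trajectory could be compressed into such a playbook. The prompt wording and function name are assumptions; only the idea (summarize the winning steps and store them) comes from the paper.

```python
def distill_goal_experience(llm, goal: str, trajectory: list) -> dict:
    """After a success, compress the trajectory into a short, reusable strategy."""
    actions = [action for action, _state in trajectory]
    prompt = (
        f"The agent just completed this goal: {goal}\n"
        f"Actions taken, in order: {actions}\n"
        "Summarize the 3-5 high-level steps that made this succeed, as a reusable plan."
    )
    steps = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return {"goal": goal, "strategy": steps}
```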
Hook: Where do all these sticky notes live so you can find them later?
World Knowledge Repository (WKR):
- What it is: The shared library of Process (constraints) and Goal (heuristics) rules.
- How it works: It stores rules as readable text, retrieves relevant ones by semantic similarity to the current goal and scene, and feeds them into planning.
- Why it matters: External, editable, and shareable knowledge beats rigid, hidden weights. Anchor: Searching for "put fruit in bowl" pulls rules about clearing hands, checking visibility, and the best order of locations. (A retrieval sketch follows.)
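A toy version of that retrieval step. It scores rules with a bag-of-words cosine similarity so the sketch stays self-contained; a real system would more likely use dense text embeddings, and the function names are assumptions.

```python
import math
from collections import Counter

def _bow(text: str) -> Counter:
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_rules(rules: list[str], goal: str, scene: str, top_k: int = 5) -> list[str]:
    """Return the stored rules most relevant to the current goal and scene."""
    query = _bow(goal + " " + scene)
    return sorted(rules, key=lambda r: _cosine(_bow(r), query), reverse=True)[:top_k]
```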
Hook: Only daydream when it's useful; don't imagine wild guesses.
Constrained Simulation (Gating):
- What it is: The agent only simulates future states when target objects are grounded (seen or remembered precisely).
- How it works: If ungrounded, it acts (e.g., explore) but skips prediction writing to avoid fantasy states.
- Why it matters: Cuts hallucinations and speeds inference. Anchor: If the apple isn't visible, the agent plans to navigate first and writes "Exploration phase: target not visible, prediction skipped." (A gating sketch follows.)
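A minimal sketch of the gate. The `visible` and `known_locations` fields are stand-ins for whatever grounding signal the agent actually has.

```python
def gated_prediction(planner, obs, action, targets) -> str:
    """Simulate a future state only when every target object is grounded."""
    grounded = all(t in obs["visible"] or t in obs["known_locations"] for t in targets)
    if not grounded:
        # Still act (e.g., explore), but do not write an imagined state.
        return "Exploration phase: target not visible, prediction skipped."
    return planner.simulate(obs, action)  # normal grounded prediction
```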
Hook: Finally, how do we put knowledge directly into decision-making?
WK-MDP (World Knowledge-Augmented MDP):
- What it is: A planning setup where the policy picks both the action and the predicted next state while being guided by WKR rules.
- How it works: The objective is "maximize success while minimizing the prediction-reality mismatch," so plans are both right and real.
- Why it matters: It bakes alignment into the objective. Anchor: The agent prefers a sequence that reaches the goal and also respects rules like "hands must be free to pick up." (One possible formalization follows.)
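One plausible way to write that objective down, assuming standard MDP notation; this is an illustrative formalization consistent with the description above, not necessarily the paper's exact equation. The policy, conditioned on the knowledge repository W, outputs both an action and a predicted next state; it is rewarded for task progress and penalized for the gap between prediction and reality.

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{(a_t,\,\hat{s}_{t+1}) \,\sim\, \pi(\,\cdot \mid s_t,\, g,\, W)}
\left[ \sum_{t} r(s_t, a_t) \;-\; \lambda\, d\!\left(\hat{s}_{t+1},\, s_{t+1}\right) \right]
```

Here r rewards progress toward the goal g, d measures the mismatch between the predicted next state and the real one, W is the World Knowledge Repository guiding both choices, and the weight lambda trades off "purposeful" against "possible".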
Example Walkthrough (with concrete, data-like steps):
- Goal: "Place the apple in the bowl on CounterTop_2."
- Retrieve rules: "If target not visible, navigate first." "Hands must be empty before picking up a new object." "Be close to open/close."
- Step 1 Predict & Gate: Apple not visible → predicted_state: "Exploration phase: target not visible, prediction skipped." Action: find → CounterTop_2.
- Step 2 Verify: Apple spotted.
- Step 3 Predict: "After pick, hand will hold Apple_1; Apple_1 no longer on CounterTop_2." Act: Pick Apple_1. Verify: success, or learn a rule if it failed (e.g., hand was full → add constraint).
- Step 4 Predict: "After place, hand empty; Apple_1 in Bowl_1." Act: Place Apple_1 into Bowl_1. Verify: success → distill a Goal Experience sequence.
Secret Sauce:
- Turn every mismatch into a crisp, human-readable rule (Process Experience), and every success into a compact strategy (Goal Experience), then retrieve just-in-time and simulate only when grounded. That trio keeps plans real, fast, and focused.
04 Experiments & Results
Hook: Think of a school tournament where teams must follow the rules and also finish the course correctly.
The Test:
- What it is: The agent had to perform household-like tasks in EB-ALFRED and EB-Habitat (benchmarks with subsets like Base, Common Sense, Complex, Visual, and Spatial).
- How it works: It followed multi-step instructions (e.g., find, open, pick, place) with feedback from the environment.
- Why it matters: These tests check both final success and whether the steps along the way were valid. Anchor: "Put the sponge in the sink" involves moving, checking visibility, picking, and placing; every step can pass or fail.
Hook: Scoreboards use clear grades so we know who did best and why.
The Metrics (Success Rate, Goal-Conditioned Success):
- What it is: Success Rate (SR) is like getting 100% only if you finish the whole task; Goal-Conditioned Success (GC) gives partial credit for correct sub-steps.
- How it works: SR is strict (all-or-nothing). GC rewards progress through valid intermediate goals.
- Why it matters: SR shows end-to-end reliability; GC shows procedural correctness even when time runs out. Anchor: If you correctly find the bowl, open the cabinet, and pick up the sponge but run out of time before placing it, SR = 0 but GC > 0. (A toy calculation follows.)
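A toy calculation of how the two scores behave on the anchor example above. The exact goal-condition definitions come from the benchmarks; this only illustrates the all-or-nothing vs. partial-credit distinction.

```python
def success_rate(subgoals_done: list[bool]) -> float:
    """Strict SR: full credit only if every sub-goal was achieved."""
    return 1.0 if subgoals_done and all(subgoals_done) else 0.0

def goal_conditioned_success(subgoals_done: list[bool]) -> float:
    """GC: partial credit as the fraction of sub-goals achieved."""
    return sum(subgoals_done) / len(subgoals_done) if subgoals_done else 0.0

# Found the bowl, opened the cabinet, picked the sponge, but never placed it:
episode = [True, True, True, False]
print(success_rate(episode))              # 0.0  (all-or-nothing)
print(goal_conditioned_success(episode))  # 0.75 (partial credit)
```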
Hook: Who did WorldMind go up against?
The Competition:
- What it is: Strong baselines like ReAct, Best-of-N, SimuRA, ReasoningBank, Synapse, and AWM.
- How it works: Each baseline mixes reasoning and acting differently; few explicitly turn failures into rules at test time.
- Why it matters: Beating them shows the value of experiential alignment. Anchor: ReAct reasons step by step but doesn't build a growing rulebook of do's and don'ts from its own mistakes.
Hook: Results time: did the new playbook pay off?
The Scoreboard with Context:
- What it is: WorldMind raised strict task completion and improved process correctness across datasets and backbones.
- How it works: On EB-ALFRED with GPT-3.5-turbo, SR rose from about 44.4% (ReAct) to 48.0% (WorldMind), and GC jumped to about 63.0%. On EB-Habitat with GPT-4.1-mini, SR reached about 50.8%, beating ReAct by roughly 9.2 points, with GC around 57.2%.
- Why it matters: That's like going from a solid B to an A- on finals, while also getting most steps right on practice drills. Anchor: Even when WorldMind didn't finish a task, it often completed more correct sub-steps, showing cleaner, safer behavior.
Hook: Anything surprising?
Surprising Findings:
- What it is: Cross-model experience transfer worked: swapping the WKR between GPT-3.5-turbo and GPT-4.1-mini still improved performance.
- How it works: External rules captured general physics and procedures, not model quirks.
- Why it matters: Teams can share a rulebook; new agents benefit immediately. Anchor: One agent's rule "don't repeat the same invalid click; refresh first" helped another agent on a different backbone.
Hook: How did errors change when the agent got smarter?
Error Redistribution:
- What it is: Invalid Actions (breaking physics) dropped a lot; Timeouts sometimes rose because the agent avoided fatal moves and explored longer; Wrong Terminations shrank when Goal Experience guided when to stop.
- How it works: Process Experience filtered unsafe steps; Goal Experience prevented quitting too soon.
- Why it matters: Fewer crashes and better stamina make sturdier agents. Anchor: In Habitat, invalid actions fell (e.g., from 105 to 67 in one setting), while the agent used its steps more wisely instead of getting disqualified early.
Hook: Does it generalize beyond houses?
Cross-Environment Results:
- What it is: In a hybrid Embodied Web Agent setting, the agent had to switch between browsing and acting physically.
- How it works: Completion jumped strongly (e.g., roughly doubling), with fewer repeated actions and step errors.
- Why it matters: The same rulebook style helps in both digital and physical worlds. Anchor: "Don't click the same broken button thrice; try refreshing first" is just like "Don't pull a closed drawer from far away; step closer first."
05 Discussion & Limitations
Hook: Every strong tool comes with proper care instructions.
Limitations:
- What it is: Where WorldMind can struggle.
- How it works: If the vision system mis-sees objects (e.g., calls a mug a bowl), Process Experience can't fully fix that; the base perception must be decent. Also, while we know rules shape behavior, there is not yet a full mathematical account of how explicit rules shift the model's internal decision boundaries. Finally, real-time multi-agent sharing (live syncing, conflict resolution) is not solved yet.
- Why it matters: Knowing these edges focuses future improvements. Anchor: If the agent can't see the knife in a cluttered drawer due to perception errors, even perfect pick-up rules won't help.
Hook: What do you need to run this well?
Required Resources:
- What it is: Ingredients for success.
- How it works: A capable LLM/VLM backbone, memory for the WKR, retrieval to find relevant rules fast, and compute for Predict-Act-Verify at inference.
- Why it matters: Each part supports reliable, low-latency learning on the fly. Anchor: Think of the WKR as a library (storage) plus a good librarian (retrieval), so the right rule arrives at the right moment.
Hook: Are there places where you might skip this method?
When NOT to Use:
- What it is: Mismatched scenarios.
- How it works: If the world is static, tiny, or fully known, heavy experiential alignment adds overhead. If perception is very poor, fix sensing first. If you can't log interactions (e.g., under strict privacy rules), you can't easily build Process/Goal Experience.
- Why it matters: Use the right tool for the job. Anchor: Assembling a two-step toy from a perfect manual doesn't need a growing rulebook.
Hook: What mysteries remain?
Open Questions:
- What it is: Next puzzles to solve.
- How it works: Can we mathematically track how rules bend the model's decision surface? How can multi-agent WKRs be synchronized in real time without contradictions? How can rules be auto-cleaned or compressed at scale? Can we learn when to gate prediction even more cleverly?
- Why it matters: Cracking these will scale safer, smarter agents everywhere. Anchor: Imagine a team of kitchen robots sharing one evolving cookbook without stepping on each other's toes: how do they agree on the best version fast?
06 Conclusion & Future Work
Hook: Think of an explorer who turns every stumble into a better map.
Three-Sentence Summary:
- What it is: WorldMind aligns an agent's inner world model with real-world physics and good procedures by turning mistakes into constraint rules (Process Experience) and successes into strategy rules (Goal Experience).
- How it works: Through a Predict-Act-Verify loop and a World Knowledge Repository, the agent updates knowledge at test time, with no retraining, and then plans with constrained, grounded simulation.
- Why it matters: This reduces physical hallucinations, boosts task success, and allows knowledge to transfer across models and environments. Anchor: A robot that once tried to open cabinets from across the room now steps closer first, and it teaches that rule to its robot friends.
Main Achievement:
- What it is: A practical, training-free pathway to world-aligned planning that externalizes knowledge for reuse and sharing.
- How it works: Encode errors and wins as readable rules that guide both predictions and actions.
- Why it matters: It's a simple, powerful switch from static memorization to live, explainable learning. Anchor: Like swapping from a fixed cookbook to a living recipe box that grows every time you cook.
Future Directions:
- What it is: Where to go next.
- How it works: Stronger perception coupling, theory of how rules reshape internal models, real-time multi-agent WKR sync, rule compression and conflict resolution.
- Why it matters: These make agents more capable, cooperative, and efficient. Anchor: A fleet of home robots sharing and refining one safe, smart playbook.
Why Remember This:
- What it is: The lasting idea.
- How it works: Failure isn't the end; it's the teacher. Write it down, reuse it, and share it.
- Why it matters: This mindset turns any capable model into a careful learner that respects the real world. Anchor: Next time an agent stumbles, expect a new, better rule to appear in its growing book of common sense.
Practical Applications
- Home robots that learn safe object-handling rules (e.g., clear hands before pick-up) during daily chores.
- Warehouse pick-and-place systems that turn misgrabs into new handling constraints to reduce damage.
- Office assistants that avoid repeated bad clicks or form submissions by adding recovery rules like refresh-first.
- Cooking assistants that learn reliable search orders for utensils (left counter → right counter → drawer).
- Elderly care robots that reduce risky actions (e.g., moving heavy items without a proper grip) by learning constraints.
- Education bots that capture successful study sequences (goal heuristics) to guide future problem solving.
- Customer service agents that record and reuse successful troubleshooting playbooks across teams.
- AR navigation aids that learn building-specific procedures (e.g., which doors auto-lock) and share them with new users.
- Industrial inspection drones that convert near-miss events into safety no-go rules to prevent accidents.
- Web automation agents that log cross-website error-handling strategies (e.g., "2 failed logins → reset flow").