World Craft: Agentic Framework to Create Visualizable Worlds via Text

Jianwen Sun; Yukang Feng; Kaining Ying; Chuanhao Li; Zizhen Li; Fanrui Zhang; Jiaxin Ai; Yifan Chang; Yu Dai; Yifei Huang; Kaipeng Zhang

Jianwen Sun, Yukang Feng, Kaining Ying et al., 1/14/2026

arXiv PDF

Key Summary

•World Craft lets anyone turn a short text description into a playable, visual game world without coding.
•It solves two big problems at once: complicated, fragmented game tools and the fuzzy way people describe spaces with words.
•The framework has two parts: World Scaffold (the builder that assembles scenes from structured data) and World Guild (a team of AI agents that figure out what you want, lay things out, fix mistakes, and make matching art).
•A key idea is to split the job into two steps: first understand the story and structure (Z), then place exact objects with coordinates (G).
•A special error-correction dataset, built by reversing perfect scenes and adding controlled mistakes, teaches the system how to spot and fix layout errors.
•In tests, World Craft beat strong general LLMs and commercial code agents on spatial logic, style consistency, and how well the scene matches the text.
•The Critic agent’s iterative checks catch problems like blocked doors and floating objects, improving results round by round.
•An asset library plus reference-guided synthesis keeps the art style unified, so scenes look like one world instead of a collage.
•Human studies showed strong agreement with the automatic metrics, meaning the scores reflect what people actually prefer.
•Today it shines at single indoor scenes; future work aims at big outdoor towns and deeper physics-heavy interactions.

Why This Research Matters

World Craft lowers the barrier to world-building so teachers, students, writers, and indie creators can bring ideas to life without coding. Better, more customizable worlds help researchers study social behavior in safer, simulated spaces. Games and simulations made faster mean more time for storytelling and learning. Consistent art and solid spatial logic make scenes feel polished, not patchwork. The error-correction approach shows a path for AI that not only creates but also fixes itself. As it scales to larger environments, it could speed up virtual city planning, training simulations, and interactive education.

Detailed Explanation

01 Background & Problem Definition

🍞 Top Bread (Hook) You know how you can describe your dream bedroom in words—cozy bed by the window, bookshelf on the left, and a lamp for late-night reading—but turning that into a real room layout still takes measuring, planning, and building?

🥬 Filling (The Actual Concept)

What it is: Before this paper, creating a playable, visual world from words was like asking a carpenter to build a house from a poem—beautiful, but vague.
How it works (what the world looked like before):
1. People used game engines like Unity or Godot with many separate tools and complex steps.
2. Most "AI Town" simulations reused fixed maps or simple grids, so users couldn’t easily customize worlds.
3. General LLMs struggled to convert fuzzy text ("a cozy cafe with a counter and two round tables") into exact geometry (sizes, coordinates, walkable paths).
4. End-to-end text-to-3D work focused mainly on visuals, not on the strict logic needed for agents to move, interact, and not bump into walls.
Why it matters: Without a bridge from words to precise layouts, regular people can’t make their own worlds; research on agent behavior gets stuck using boring, unchangeable environments; and creative ideas stay trapped in text.

🍞 Bottom Bread (Anchor) Imagine your teacher says, "Make a library with a quiet reading area, a cafe corner, and a sunny garden out back." You can picture it, but the computer needs exact coordinates for shelves, doors, and tables. That gap is the problem this paper tackles.

🍞 Top Bread (Hook) Imagine giving a friend directions like "meet me near the big tree" versus telling a robot "go to (x=15, y=9)."

🥬 Filling (The Actual Concept)

What it is: Human language and spatial instructions are different kinds of directions—one fuzzy and story-like, the other exact and math-like.
How it works:
1. Humans describe scenes with vibes, zones, and relationships ("cozy nook by the window").
2. Computers need numbers: object sizes, positions, and which tiles you can walk on.
3. Jumping straight from fuzzy vibes to exact numbers causes mistakes like floating chairs or blocked doors.
Why it matters: If we don’t translate carefully, AI builds worlds that look okay in text but break when you try to play.

🍞 Bottom Bread (Anchor) Telling a game engine "make it cozy" won’t work; telling it "put a 2x2 table at (6,12) and leave a 2-tile walkway" will.

🍞 Top Bread (Hook) You know how photos sometimes fail to capture how a beautiful sunset really felt?

🥬 Filling (The Actual Concept: Semantic Gap)

What it is: The semantic gap is the distance between how we say things and how machines must do things.
How it works:
1. People: "an inviting cafe" (rich meaning, but vague).
2. Machines: need doors clear, chairs aligned, and paths unblocked.
3. Direct mapping often creates "physical hallucinations"—objects overlap, doors are blocked, or rooms don’t connect.
Why it matters: Without closing this gap, worlds won’t be playable or logical.

🍞 Bottom Bread (Anchor) If you say "put a fountain in the garden," an AI might place it right in front of the only door. That’s the semantic gap in action.

🍞 Top Bread (Hook) Imagine a LEGO instruction booklet that tells you exactly which bricks go where.

🥬 Filling (The Actual Concept: World Scaffold)

What it is: World Scaffold is a standard building kit that turns structured scene data into a working, visual game world.
How it works:
1. It expects a neat package G = (M, A, L, P): metadata, assets, layout, and properties.
2. It compiles floors, walls, and object coordinates into a scene you can load and play.
3. It adds navigation and interaction so agents can walk and act.
Why it matters: Without a standard scaffold, AI agents would have to poke many confusing engine tools; with it, one clean format builds the entire scene.

🍞 Bottom Bread (Anchor) Give World Scaffold a list like "table at (6,12), door at (12,20), bookshelf here"—and it builds the room automatically.
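
To make the G = (M, A, L, P) package concrete, here is a minimal Python sketch. The summary above names the four parts but not their fields, so `SceneSpec`, `Placement`, `validate`, and every field name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """One placed object: top-left tile plus footprint in tiles (assumed convention)."""
    asset_id: str
    x: int
    y: int
    w: int = 1
    h: int = 1


@dataclass
class SceneSpec:
    """G = (M, A, L, P); the contents of each part are illustrative assumptions."""
    metadata: dict    # M: e.g. scene name and grid size
    assets: dict      # A: asset_id -> short text description
    layout: list      # L: list of Placement
    properties: dict  # P: asset_id -> flags such as {"walkable": bool}


def validate(spec):
    """Return a list of problems: unknown assets or out-of-bounds placements."""
    problems = []
    grid_w, grid_h = spec.metadata["grid"]
    for p in spec.layout:
        if p.asset_id not in spec.assets:
            problems.append(f"unknown asset: {p.asset_id}")
        if p.x < 0 or p.y < 0 or p.x + p.w > grid_w or p.y + p.h > grid_h:
            problems.append(f"out of bounds: {p.asset_id}")
    return problems


spec = SceneSpec(
    metadata={"name": "study", "grid": (35, 28)},
    assets={"table": "long wooden table", "door": "glass door"},
    layout=[Placement("table", 6, 12, 2, 2), Placement("door", 12, 20)],
    properties={"table": {"walkable": False}, "door": {"walkable": True}},
)
print(validate(spec))  # prints []
```

The point of a single structured package is that a cheap check like `validate` can reject a malformed scene before anything is assembled.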

🍞 Top Bread (Hook) Think of a school project where a writer, a planner, an editor, and an illustrator each do their part.

🥬 Filling (The Actual Concept: World Guild)

What it is: World Guild is a team of AI agents with different jobs who work together to make the world.
How it works:
1. Enricher: turns your story into a clear, coordinate-free plan (Z).
2. Manager: converts Z into exact coordinates and properties (G).
3. Critic: checks for logic/physics mistakes and asks for fixes.
4. Artist: makes matching visual tiles using a style reference library.
Why it matters: One giant step is hard; four smaller steps are reliable.

🍞 Bottom Bread (Anchor) You say "luxury underground bathhouse"; the Enricher lays out zones, the Manager places pools and doors, the Critic unblocks paths, and the Artist makes matching tiles.

🍞 Top Bread (Hook) Imagine choreographing a dance where each dancer knows their moves and timing.

🥬 Filling (The Actual Concept: Multi-Agent Framework)

What it is: Multiple AI agents coordinate, each specializing in a part of the build.
How it works:
1. Pass the baton: Enricher → Manager → Critic → (back to Manager if fixes needed) → Artist.
2. Each agent has clear rules and tools.
3. Iteration continues until checks pass.
Why it matters: Specialization reduces errors and makes complex tasks manageable.

🍞 Bottom Bread (Anchor) It’s like a relay race: the plan goes around the loop until the Critic gives a thumbs-up.
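
The baton-passing loop can be sketched in a few lines of Python. In the real system each agent is an LLM; the four functions below are toy stand-ins (all names and data shapes are assumptions) that only show the control flow: draft, critique, fix, repeat, then hand off to the Artist.

```python
DOOR = (12, 20)  # tile that must stay clear in this toy scene


def enricher(instruction):
    """I -> Z: toy plan with no coordinates."""
    return {"zones": ["reading"], "source": instruction}


def manager(plan, scene=None, fixes=None):
    """Z -> G on the first call; applies the Critic's moves on later calls."""
    if scene is None:
        return {"table": DOOR}  # deliberately bad first draft: table on the door tile
    scene = dict(scene)
    for name, (dx, dy) in fixes:
        x, y = scene[name]
        scene[name] = (x + dx, y + dy)
    return scene


def critic(scene):
    """Return (object, offset) fixes for anything sitting on the door tile."""
    return [(name, (2, 0)) for name, pos in scene.items() if pos == DOOR]


def artist(scene):
    """Attach a placeholder tile name to every object."""
    return {name: {"pos": pos, "tile": f"{name}.png"} for name, pos in scene.items()}


def run_pipeline(instruction, max_rounds=3):
    plan = enricher(instruction)
    scene = manager(plan)
    for _ in range(max_rounds):
        fixes = critic(scene)
        if not fixes:  # Critic gives a thumbs-up
            break
        scene = manager(plan, scene, fixes)
    return artist(scene)


result = run_pipeline("a study")
print(result)  # {'table': {'pos': (14, 20), 'tile': 'table.png'}}
```

The `max_rounds` cap mirrors the idea of a bounded number of critique rounds, so a stubborn scene cannot loop forever.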

🍞 Top Bread (Hook) Think of arranging furniture so people can walk without bumping into stuff.

🥬 Filling (The Actual Concept: Layout Generation)

What it is: The step that places every object, size, and orientation where it belongs.
How it works:
1. Read the zone plan (Z).
2. Decide grid size, pick assets, and place coordinates.
3. Ensure connections and collision-free paths.
Why it matters: Without correct layout, the world is unplayable.

🍞 Bottom Bread (Anchor) If the table blocks the only door, the scene looks fine in a picture but fails in a game.

🍞 Top Bread (Hook) Have you ever taken a perfect LEGO model and then removed a few bricks to learn how to fix it?

🥬 Filling (The Actual Concept: Reverse Synthesis)

What it is: Start from perfect scenes, turn them back into descriptions, then add controlled mistakes to teach models how to fix things.
How it works:
1. Build gold-standard layouts with rules and checks.
2. Convert them into text-like plans (no coordinates).
3. Intentionally break them (collisions, wrong rooms, swapped objects) and store the right fixes.
Why it matters: Models learn not just to build, but to notice and repair errors.

🍞 Bottom Bread (Anchor) It’s like practicing math by studying worked answers, then solving the same problem after someone scrambles it.

🍞 Top Bread (Hook) Imagine getting a homework sheet with mistakes marked and hints on how to correct them.

🥬 Filling (The Actual Concept: Error-Correction Dataset)

What it is: A big set of broken-and-fixed scene examples that trains the AI to self-correct.
How it works:
1. For each gold scene, create versions with 2–15 issues.
2. Pair each broken version with instructions on how to fix it.
3. Train the model to apply the right edits to reach the gold result.
Why it matters: Without seeing mistakes and fixes, the AI can’t improve during critique rounds.

🍞 Bottom Bread (Anchor) If a door is blocked by a bookshelf, the dataset includes the exact instruction: "Move bookshelf right by 2."
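
As a rough illustration of how a broken-and-fixed pair might be produced, the Python sketch below nudges a few objects in a gold layout and records the inverse move as the fix. It covers only one error type (displacement); the error types listed earlier also include collisions, wrong rooms, and swapped objects. The function names and the fix format are assumptions, not the paper's.

```python
import random


def corrupt(gold, n_errors, rng):
    """Shift n_errors distinct objects by 2 tiles and record the inverse moves as fixes."""
    broken = dict(gold)
    fixes = []
    for name in rng.sample(sorted(gold), n_errors):
        dx, dy = rng.choice([(-2, 0), (2, 0), (0, -2), (0, 2)])
        x, y = broken[name]
        broken[name] = (x + dx, y + dy)
        fixes.append({"op": "move", "object": name, "dx": -dx, "dy": -dy})
    return broken, fixes


def apply_fixes(layout, fixes):
    """Apply recorded move instructions to a layout and return the result."""
    fixed = dict(layout)
    for fix in fixes:
        x, y = fixed[fix["object"]]
        fixed[fix["object"]] = (x + fix["dx"], y + fix["dy"])
    return fixed


gold = {"bookshelf": (1, 1), "table": (6, 12), "door": (12, 20)}
broken, fixes = corrupt(gold, 2, random.Random(0))

# Applying the stored fixes to the broken scene recovers the gold layout.
assert apply_fixes(broken, fixes) == gold
```

Because the mistakes are injected on purpose, the correct repair is known for free, which is what makes this kind of data cheap to generate at scale.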

02 Core Idea

🍞 Top Bread (Hook) Imagine building a city from a story: first you sketch neighborhoods (schools here, parks there), then you draw exact streets and addresses.

🥬 Filling (The Actual Concept)

The "Aha!" in one sentence: Split text-to-world into two linked problems—understand the story as a zone plan (Z), then compute exact, playable geometry (G)—and let a team of specialized AI agents iterate until it’s right.

Multiple analogies:

Kitchen analogy: The Enricher is the head chef writing the menu (Z), the Manager is the line cook plating dishes precisely (G), the Critic is the food taster sending plates back if undercooked, and the Artist is the pastry chef making everything look consistent and delicious.
Map analogy: Z is the subway map (which stations connect); G is the train timetable and exact track coordinates; the Critic checks for closed tunnels, and the Artist paints consistent station signs.
School play analogy: The script (Z) describes scenes and entrances, the stage manager (Manager) marks tape on the floor, the director (Critic) fixes blocking, and the set designer (Artist) keeps the style matching.

Before vs After:

Before: One big leap from fuzzy stories to precise worlds caused blocked paths, floating items, mismatched styles, and lots of manual fixing.
After: A staged pipeline—and training on broken-to-fixed examples—turns vague ideas into reliable, runnable scenes with fewer errors and unified art.

Why it works (intuition):

Decoupling reduces cognitive load: answering “what goes where in general?” (Z) is easier than “give me every exact number” (G).
Iterative critique adds safety nets; each round catches more mistakes, like proofreading drafts.
Reference-guided art makes styles align, avoiding the “collage” look.
Reverse synthesis gives the model muscle memory for editing, not just generating.

Building blocks:

World Scaffold: the standard format that any LLM can target to get a working scene.
World Guild: four agents (Enricher, Manager, Critic, Artist) with clear roles and handoffs.
Intermediate Z: a coordinate-free, commonsense zone plan bridging text and geometry.
Error-correction data: gold layouts, controlled breaks, and fix instructions.
Two-stage training: stage 1 aligns language to Z; stage 2 turns Z (and fix notes) into G.

🍞 Bottom Bread (Anchor) Tell World Craft: “A luxury underground bathhouse with hexagonal copper pools and hidden panels.” It first plans zones and connections (pools, corridors, niches), then places exact pool tiles and doors, the Critic unblocks any paths, and the Artist makes matching copper textures—ending in a playable, coherent scene.

03 Methodology

High-level recipe: Input text → Enricher (I → Z) → Manager (Z → initial G) → Critic loop (checks → fixes) → Artist (assets) → World Scaffold assembles a playable scene.

Step 1: Enricher (Semantic normalization)

What happens: Converts the user’s natural-language instruction I into a coordinate-free layout description Z: zones, adjacency, main entrances, and rough distributions.
Why it exists: If we skip Z, the Manager must guess too many numbers from fuzzy words and will make geometry mistakes.
Example: Input: “A cafe-library: quiet reading left, cafe tables right, kitchen in back; main glass door to a garden.” Output Z: {zones: reading(left), cafe(right), kitchen(back-right), garden(back); connections: door between cafe and garden; main corridor down center}.
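
One way to picture Z is as a small record with zones and connections but no coordinates. The Python sketch below encodes the cafe-library example; the class name `ZonePlan` and its fields are hypothetical, since the summary does not give the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ZonePlan:
    """Coordinate-free plan Z; field names are illustrative assumptions."""
    zones: dict                                      # zone name -> coarse position hint
    connections: list = field(default_factory=list)  # (zone_a, zone_b, via)

    def undefined_zones(self):
        """Zones named in a connection but never declared."""
        named = {zone for a, b, _ in self.connections for zone in (a, b)}
        return sorted(named - set(self.zones))


plan = ZonePlan(
    zones={"reading": "left", "cafe": "right", "kitchen": "back-right", "garden": "back"},
    connections=[("cafe", "garden", "glass door")],
)
print(plan.undefined_zones())  # prints []
```

Even without numbers, a plan like this can be sanity-checked (every connection must point at a declared zone) before the Manager starts placing anything.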

Step 2: Manager (Constrained layout generation)

What happens: Reads Z and outputs initial G = (M, A, L, P): picks grid size, instantiates assets (shelves, doors, tables), and places coordinates/orientations. It designs L (layout) so paths connect and properties P define what’s walkable.
Why it exists: Someone must translate “reading on the left” into exact rectangles, object footprints, and door positions.
Example data: G with floor rectangles (wood, tile, grass), wall lines, and objects like {"bookshelf_tall" at (1,1)}, consistent with the paper’s sample.
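
To show what "translating a hint into a rectangle" means, here is a deliberately simple Python sketch. The real Manager is a trained model; this fixed lookup is only a stand-in, and both the half-grid split and the convention that y = 0 is the back of the room are assumptions.

```python
def zone_rect(hint, grid_w, grid_h):
    """Map a coarse hint to a rectangle (x, y, w, h) on the tile grid."""
    half_w, half_h = grid_w // 2, grid_h // 2
    rects = {
        "left": (0, 0, half_w, grid_h),
        "right": (half_w, 0, grid_w - half_w, grid_h),
        "back": (0, 0, grid_w, half_h),            # assumes y = 0 is the back wall
        "front": (0, half_h, grid_w, grid_h - half_h),
    }
    return rects[hint]


print(zone_rect("left", 35, 28))   # (0, 0, 17, 28)
print(zone_rect("right", 35, 28))  # (17, 0, 18, 28)
```

Note that the two halves tile the grid exactly even when the width is odd, which is the kind of detail that is easy to get wrong when jumping straight from words to numbers.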

Step 3: Critic (Iterative critique and refinement)

What happens: Runs rule-based checks (collisions, connectivity) and semantic checks, then issues concrete fixes C (e.g., “Move bookshelf 2 right”). The Manager applies C and re-outputs G. Repeat up to T rounds.
Why it exists: First drafts often have small defects; the loop catches and corrects them.
Example: If the door is blocked, the Critic says “Shift table from (12,20) to (14,20),” improving the Collision-Free Rate (CFR) and Room Connectivity Score (RCS).
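
A rule-based collision check of the kind described in this step could look like the Python sketch below. It treats every object as an axis-aligned rectangle `(x, y, w, h)` and emits one text instruction per overlapping pair. The function names and the "always move the second object right" policy are assumptions made to keep the example short.

```python
from itertools import combinations


def overlaps(a, b):
    """Axis-aligned rectangles (x, y, w, h) overlap if they intersect on both axes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def critique(objects):
    """Return one fix instruction per colliding pair, moving the second object right."""
    fixes = []
    for (_, rect_a), (name_b, rect_b) in combinations(sorted(objects.items()), 2):
        if overlaps(rect_a, rect_b):
            shift = rect_a[0] + rect_a[2] - rect_b[0]  # tiles needed to clear a's right edge
            fixes.append(f"Move {name_b} right by {shift}")
    return fixes


objects = {"door": (12, 20, 1, 1), "table": (12, 20, 2, 1), "shelf": (1, 1, 1, 3)}
print(critique(objects))  # ['Move table right by 1']
```

A fuller Critic would also check connectivity and semantics, and would pick the direction of the move rather than always pushing right.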

Step 4: Artist (Reference-guided asset synthesis)

What happens: For each asset definition A, the Artist retrieves a style-matching reference from a 5.5k+ tile library, then generates or selects consistent tiles so everything looks like one world.
Why it exists: Without style anchoring, objects look like a random collage (mixed resolutions/colors).
Example: "hexagonal copper pool" retrieves a copper-toned tile reference; chairs, counters, and walls all share the same pixel-art look.
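
The summary does not say how the style reference is retrieved. One common approach, assumed here purely for illustration, is nearest-neighbor search over embedding vectors using cosine similarity. The tile names and 3-dimensional vectors below are made up; a real system would use a learned encoder and a much larger library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, library):
    """Return the name of the library tile whose embedding is closest to the query."""
    return max(library, key=lambda name: cosine(query_vec, library[name]))


# Toy embeddings, invented for this example.
library = {
    "copper_floor": [0.9, 0.1, 0.0],
    "grass_patch": [0.0, 0.2, 0.9],
    "oak_table": [0.3, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # stand-in for an embedding of "hexagonal copper pool"
print(retrieve(query, library))  # copper_floor
```

Whatever the retrieval method, the retrieved tile acts as a style anchor so that newly generated assets match the rest of the scene.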

Step 5: World Scaffold (Automatic assembly)

What happens: Compiles L and P with the generated tiles into a playable scene with navigation meshes and interaction tags.
Why it exists: A standard target makes it easy for any LLM agent to build scenes without wrangling low-level engine APIs.
Example: The scaffold exports a running scene where agents can pathfind around tables, open doors, and move to the garden.
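
To see what "agents can pathfind" requires of the assembled scene, here is a small Python sketch that builds a walkability grid from blocking rectangles and tests reachability with breadth-first search. It is a tile-grid simplification, not the navigation mesh the scaffold actually produces, and the function names are assumptions.

```python
from collections import deque


def build_walkable(width, height, blockers):
    """Grid of booleans; tiles covered by a blocker rectangle (x, y, w, h) are not walkable."""
    grid = [[True] * width for _ in range(height)]
    for x, y, w, h in blockers:
        for yy in range(y, y + h):
            for xx in range(x, x + w):
                grid[yy][xx] = False
    return grid


def reachable(grid, start, goal):
    """Breadth-first search over 4-connected walkable tiles."""
    height, width = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False


# A wall across row 2 of a 5x5 room, with and without a one-tile gap at x = 2.
open_room = build_walkable(5, 5, [(0, 2, 2, 1), (3, 2, 2, 1)])
sealed_room = build_walkable(5, 5, [(0, 2, 5, 1)])
print(reachable(open_room, (0, 0), (4, 4)))    # True
print(reachable(sealed_room, (0, 0), (4, 4)))  # False
```

The sealed room is exactly the kind of scene that looks fine as a picture but fails as a game.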

Secret sauce elements

Intermediate Z bridge: Eases the text→geometry leap.
Critic loop: Adds a self-correction cycle that measurably improves CFR/RCS and reduces OPS.
Asset library: Cuts VGG style loss and raises Visual Harmony (VH), making VLM-based semantic scores more stable.
Reverse synthesis data: Teaches the Manager to perform targeted edits, not just regenerate.

Data construction (to teach spatial common sense)

Scenario Initialization: Start from diverse seeds (reality, literature, film, games) and augment styles (e.g., cyberpunk, primitive) to get broad coverage.
Scene Design (Gold layouts): Use procedural room generation + LLM functional labeling + “12-zone grid” orientation helper + a “Physical Placer” to avoid collisions + a Teacher Model and light human review for edge cases.
Data Annotation: Reverse-engineer gold G into text Z; apply a “Chaos Monkey” to create G_error with 2–15 issues and record fixes C. Build two datasets: D_A for (Z→G_gold) and (G_error, C→G_gold), and D_B for (I→Z) at different instruction lengths.

Training strategy

Stage 1 (Semantic Alignment): Train Enricher on D_B to map any instruction density (short/medium/long) to a clean Z.
Stage 2 (Spatial Refinement): Train Manager on D_A to generate G from Z and to apply edit instructions C for corrections.
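
The three kinds of training pairs described above (I→Z for Stage 1, and Z→G_gold plus (G_error, C)→G_gold for Stage 2) can be pictured as simple input/target records. The JSON serialization and the helper names in this Python sketch are assumptions; the summary does not specify the actual prompt format.

```python
import json


def alignment_example(instruction, plan):
    """D_B record: instruction I -> zone plan Z."""
    return {"input": instruction, "target": json.dumps(plan, sort_keys=True)}


def generation_example(plan, gold_scene):
    """D_A record, first kind: Z -> G_gold."""
    return {"input": json.dumps({"Z": plan}, sort_keys=True),
            "target": json.dumps(gold_scene, sort_keys=True)}


def correction_example(broken_scene, fixes, gold_scene):
    """D_A record, second kind: (G_error, C) -> G_gold."""
    return {"input": json.dumps({"G_error": broken_scene, "C": fixes}, sort_keys=True),
            "target": json.dumps(gold_scene, sort_keys=True)}


# Tiny hand-written values, for illustration only.
plan = {"zones": {"reading": "left"}}
gold = {"bookshelf": [1, 1]}
broken = {"bookshelf": [3, 1]}
fixes = ["Move bookshelf left by 2"]

d_b = [alignment_example("A small reading room.", plan)]
d_a = [generation_example(plan, gold), correction_example(broken, fixes, gold)]
print(len(d_a), len(d_b))  # 2 1
```

Both kinds of D_A record share the same target, the gold scene, which is how one model learns to generate from scratch and to repair a flawed draft.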

Concrete example walkthrough

Input: “A study with tall bookshelves, a long table on a rug in the middle, two small cafe tables on the right, a kitchen in the back-right, and a glass door to a small garden.”
Enricher: Z defines zones and connections.
Manager: picks grid 35x28, places shelves, rug+table, cafe tables, kitchen objects, glass door, and garden fountain.
Critic: finds if any chair blocks a path; suggests moving it.
Manager: applies fix; paths now clear.
Artist: retrieves pixel-art references; tiles match.
Scaffold: outputs a working scene; agents can walk from the reading zone to the garden without collisions.

04 Experiments & Results

The test: What was measured and why

Goal: Can World Craft turn text into playable scenes that make sense physically, look consistent, and match the narrative?
Metrics:
- Layout rationality: Collision-Free Rate (CFR), Room Connectivity Score (RCS), Object Placement Score (OPS; lower is better).
- Element richness: Component Existence Rate (CER), Object Volume Density (OVD), Property Consistency (PAC; counts violations, so lower is better).
- Intent alignment: Visual-Semantic Alignment using CLIP (VSA-C) and a VLM (VSA-V).
Dataset: 300 held-out instructions across short/medium/long, from 100 seeds never seen in training.

The competition: Who it was compared against

General LLMs: Qwen3-235B (open SOTA), Gemini-3-Pro (closed SOTA), and a Qwen3-32B base.
Code agents: Cursor and Antigravity (with human-in-the-loop debugging allowed).

The scoreboard (with context)

The full World Craft model (two-stage training, Critic loop, and correction data) scored:
- CFR 0.94 (like getting nearly all hallways and doors clear when others still bump into furniture),
- RCS 0.88 (rooms actually connect the way humans expect),
- OPS 3.03 (low errors in object placement),
- CER 0.99 and OVD ~7.13 (scenes are well-populated without chaos),
- PAC 3.64 (fewer size/physics inconsistencies),
- VSA-C 28.07 and VSA-V 6.80 (strong text–image alignment).
Compared to strong baselines, that’s like moving from a B/B+ in spatial sense to an A/A+.

Human evaluation and metric validation

Five experienced players made pairwise choices on 150 samples. Automated metrics correlated strongly with human preferences (Pearson |r| of roughly 0.9 or higher), and the raters agreed with one another reasonably well (Fleiss’ kappa around 0.6). This means the numbers match what people actually like.
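
For readers unfamiliar with Pearson correlation, the Python sketch below computes it from scratch. The scores are made-up numbers for illustration only and are not results from the paper. The absolute value |r| is presumably reported because lower-is-better metrics such as OPS correlate negatively with preference.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented example data: one automatic score and one human win rate per system.
higher_is_better = [0.2, 0.4, 0.6, 0.8]
lower_is_better = [9.0, 7.0, 5.0, 3.0]
human_win_rate = [0.1, 0.3, 0.5, 0.7]

print(round(pearson(higher_is_better, human_win_rate), 3))  # 1.0
print(round(pearson(lower_is_better, human_win_rate), 3))   # -1.0
```

Both toy metrics track human preference perfectly; only the sign differs, which is why comparing |r| across metrics is convenient.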

Stability across instruction lengths

Many general models swing in quality when inputs are very short or very long. World Craft stayed stable because Stage 1 normalizes instructions into Z, making Stage 2’s job easier.

Critique rounds matter

Models trained only on standard data improved little over multiple Critic rounds. Models trained on correction data improved steadily, especially on spatial metrics, which supports the value of learning edits.

Asset library ablation

Without the library, VGG style distance shot up and Visual Harmony (VH) fell; VSA-V also dipped slightly, suggesting inconsistent art hurts even AI judges. With the library, style is cohesive and scores improve.

Code agent comparison

Even with up to 60 minutes of human debugging, Cursor and Antigravity took longer to reach runnable scenes and lost to World Craft in human and VLM preferences. World Craft's one-shot pipeline built scenes in minutes and won most head-to-head comparisons.

05 Discussion & Limitations

Limitations (what it can’t do yet)

Scale: It mainly targets single indoor scenes. Full outdoor towns with roads, terrain, and many buildings need multi-level planning and will require extensions.
Physics depth: Current interactions cover navigation and basic logic; fluid dynamics, destruction, and evolving worlds are not yet in scope.

Required resources

Models: One LLM for the Enricher (e.g., 8B) and a larger one for the Manager (e.g., 32B); a strong Critic helps (the authors used GPT-5.1); a VLM/CLIP stack is used for evaluation.
Data: The asset library (5.5k+ tiles) and the reverse-synthesis correction datasets.
Compute: Multi-GPU training for fine-tuning; runtime is modest for generation.

When not to use

If you need photorealistic, physics-heavy AAA scenes with advanced fluid/rigid-body simulation.
If your world is mostly outdoor or needs multi-block city planning, which the framework does not yet support.
If strict art direction demands a custom, hand-crafted style not represented in the asset library.

Open questions

How to scale from rooms to neighborhoods and cities with hierarchical planners?
How to encode long-horizon narrative constraints (e.g., quests) into layout decisions?
How to integrate richer physics and time-evolving environments without breaking the simple scaffold?
How to involve users interactively for light edits while keeping the automatic pipeline robust?
How to balance speed, style control, and spatial optimality for different creators?

06 Conclusion & Future Work

Three-sentence summary

World Craft turns natural language into playable, coherent worlds by splitting understanding (Z) from execution (G) and coordinating a team of specialized AI agents.
A standardized World Scaffold, a multi-agent World Guild, and a reverse-synthesis correction dataset together deliver strong spatial logic, consistent art, and close instruction alignment.
Experiments and user studies show big gains over strong LLMs and code agents, with fast, one-shot generation and reliable quality.

Main achievement

Demonstrating that a decoupled, critique-driven, and data-augmented pipeline can reliably bridge the semantic gap between fuzzy stories and exact, executable scene layouts for non-experts.

Future directions

Scale to outdoor, multi-building towns with hierarchical planning; expand interaction physics; grow the asset library and style controls; explore interactive editing while preserving guarantees.

Why remember this

It shows that the right recipe—clear intermediate plans, standard targets, iterative checks, and learning from fixes—can turn everyday words into worlds people can see, navigate, and study. That’s a blueprint for democratizing world-building and accelerating agent research.

Practical Applications

•Classroom projects: Students write a description and instantly explore a playable scene for storytelling or history lessons.
•Game jams and prototyping: Indie teams draft multiple level ideas from text and iterate faster with the Critic’s help.
•UX research: Quickly create test environments to study agent behavior, navigation, and social interactions.
•Storyboarding: Authors and filmmakers visualize settings from script snippets to refine mood and layout.
•Training simulations: Build practice spaces (e.g., offices, clinics) for role-playing scenarios without heavy engine work.
•Design previsualization: Interior layouts generated from briefs help discuss flows and accessibility early.
•Edutainment: Kids design museums or science labs from text and learn spatial reasoning by fixing Critic-flagged issues.
•HCI demos: Rapidly craft interactive worlds to showcase multi-agent systems and tool use.
•Content pipelines: Level designers use Z→G generation for blockouts, then hand-polish specifics.
•Research benchmarks: Use the evaluation metrics and datasets to test spatial reasoning and self-correction in LLMs.
