SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Nicholas Pfaff; Thomas Cohn; Sergey Zakharov; Rick Cory; Russ Tedrake

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov et al., 2/9/2026

arXiv

Key Summary

• SceneSmith is a smart team of AI helpers that turns a short text like 'a cozy study with books and a desk' into a full 3D home scene you can drop right into a robot simulator.
• It builds scenes like a builder stacking blocks: first the rooms, then big furniture, then wall and ceiling items, and finally lots of small objects (the clutter we see in real homes).
• Every object it makes comes with physics (mass, friction, collision shapes), so things don’t float or pass through each other and can be picked up by robots.
• A designer AI proposes changes, a critic AI checks them, and an orchestrator AI decides when to accept or redo. This back-and-forth creates realistic and stable scenes.
• For static objects, SceneSmith generates new 3D assets from text; for furniture with moving parts, like cabinets with doors, it retrieves trusted articulated models.
• It produces 3–6 times more objects than other methods, yet keeps collisions under 2% and keeps 96% of objects stable after physics checks, like going from a tidy showroom to a real, lively home.
• In tests with 205 people, SceneSmith won about 92% of the time for realism and 91% for following the prompt compared to other systems.
• Robots can be tested end-to-end: a task in words becomes scenes, a policy runs in those scenes, and an evaluator agent checks if the job was done.
• The result is a big step toward safely training and checking home robots before they ever step into your living room.
• Even with strong results, it still needs good compute, better articulated generation, and careful prompts to shine.

Why This Research Matters

SceneSmith helps robots practice in virtual homes that finally look and act like real ones, so they make fewer mistakes around people and pets. It makes training faster and cheaper by creating many varied, cluttered scenes from just a sentence. Designers, teachers, and game makers can quickly get rich environments without manually modeling every item. Researchers can test the same robot skill across hundreds of realistic layouts, not just a few tidy rooms. Because objects come with physics, the worlds aren’t just pretty—they behave properly when pushed or picked up. This speeds up progress toward helpful home robots that can set tables, tidy shelves, and fetch items safely. In short, it turns words into working worlds.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine you’re setting up a practice obstacle course for a friend who’s learning to ride a bike. If the course is just a flat empty field, your friend won’t be ready for real streets with curbs, cars, and people. Robots have the same problem when we train them in simple virtual rooms.

🥬 The Concept (Robotics Simulation): It’s a safe video-game-like world where robots practice before trying tasks in real homes. How it works: 1) Build a virtual place, 2) Put in objects, 3) Let the robot try things, 4) See what works and fix problems. Why it matters: Without good practice worlds, robots do fine in ‘easy mode’ but struggle in real, messy homes. 🍞 Anchor: Practicing ‘pick up the red cup’ in a bare room won’t prepare a robot for finding that cup under a pile of dishes in a real kitchen.

The world before: Many simulators gave us rooms with just a few items: a couch here, a table there, maybe a plant. That looked neat for pictures but didn’t feel like actual homes, which have lots of stuff—books, utensils, papers, toys, and more—and furniture that opens, like drawers and cabinets. Robots trained in these simple scenes often froze or failed when meeting real clutter.

🍞 Hook: You know how a diorama with only two toy pieces looks fake? Real shelves hold many different things, not just one.

🥬 The Concept (Text-to-3D Synthesis): Turning written descriptions into new 3D objects. How it works: 1) Read the words (‘a blue ceramic bowl’), 2) Make a reference image, 3) Turn that image into a 3D shape, 4) Clean it up and size it. Why it matters: If you only use a fixed library, you repeat the same few items; text-to-3D keeps scenes fresh and varied. 🍞 Anchor: Type ‘short red ottoman’ and get a new 3D ottoman that wasn’t in the library yesterday.

Problem: Past scene makers usually chose objects from a limited catalog and focused on where to place big furniture. They often skipped small objects (the clutter robots must handle), ignored physical properties (like mass or friction), and didn’t support articulated things (drawers/doors). That meant the scenes looked okay but weren’t great for testing manipulation.

🍞 Hook: Guessing a backpack’s weight by looking at it isn’t perfect, but you can tell if it’s likely light (empty) or heavy (full of books).

🥬 The Concept (Physical Property Estimation): Estimating the mass, friction, and collision shape that a physics engine needs. How it works: 1) Show the AI multiple views, 2) Predict material and a reasonable mass range, 3) Compute collision shapes so objects can bump properly, 4) Set orientation so ‘on the floor’ means really on the floor. Why it matters: Without physics, objects float, sink through tables, or can’t be grasped correctly by robots. 🍞 Anchor: A ceramic mug slides less than a glossy plastic toy; friction settings capture that.
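As a rough illustration of step 2, mass can be approximated as density times volume once a material has been guessed. The sketch below is not SceneSmith's code: the `MATERIALS` table, its numbers, and the `estimate_properties` helper are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical material table: (density in kg/m^3, friction coefficient).
# The numbers are rough illustrative values, not taken from SceneSmith.
MATERIALS = {
    "ceramic": (2400.0, 0.6),
    "plastic": (950.0, 0.3),
    "wood": (700.0, 0.5),
}


@dataclass
class PhysicalProperties:
    mass_kg: float
    friction: float


def estimate_properties(material: str, solid_volume_m3: float) -> PhysicalProperties:
    """Estimate mass as density * volume and look up a friction coefficient."""
    if solid_volume_m3 <= 0:
        raise ValueError("volume must be positive")
    density, friction = MATERIALS[material]
    return PhysicalProperties(mass_kg=density * solid_volume_m3, friction=friction)


# Two objects with the same amount of material (about 150 cm^3).
mug = estimate_properties("ceramic", 0.00015)
toy = estimate_properties("plastic", 0.00015)
print(mug, toy)
```

With these placeholder values the ceramic mug comes out heavier and with higher friction than the plastic toy, which matches the intuition in the anchor above.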

Failed attempts:

Procedural rules: Fast, but too simple (homes felt same-y).
Learned layout models: Good at ‘where to put a sofa’, bad at ‘30 bowls on shelves’ or fitting real clutter.
LLM/VLM planners: Great at words but often placed items overlapping or floating.
Asset-only focus: Nice objects, no full scenes.
Scene-only focus: Decent layouts, but reused assets and missing physics.

🍞 Hook: Building a Lego city works best when you do it in layers: baseplates first, then buildings, then people and props.

🥬 The Concept (Hierarchical Scene Construction): Build big-to-small in stages (rooms → furniture → wall/ceiling items → small objects). How it works: 1) Make the floor plan, 2) Add large items, 3) Decorate walls and ceiling, 4) Fill surfaces with small stuff. Why it matters: Jumping straight to ‘tiny details’ without a solid layout creates chaos and collisions. 🍞 Anchor: First place the dining table and chairs; only then add plates, forks, and a flower vase on top.

The gap: We needed one system that both creates new, varied, physically valid objects and arranges them in dense, realistic rooms—all from a single natural-language prompt.

Real stakes:

Safer homes: Better testing reduces robot mistakes near people and pets.
Faster progress: Teams can train on tougher, more realistic cases before touching a real house.
Better products: Vacuum, delivery, and helper robots improve quicker.
Educational and VR uses: Architects, designers, and game makers get rich scenes instantly.

🍞 Hook: A neat trick for creating full, lively rooms is making many small objects, not just one or two.

🥬 The Concept (Object Density Generation): Matching how ‘full’ a real room is. How it works: 1) Detect support surfaces (shelves, tables), 2) Decide what belongs there, 3) Place many items with spacing and stacks, 4) Re-check physics. Why it matters: Without enough small items, robots won’t learn skills needed for messy homes. 🍞 Anchor: A pottery shop prompt becomes shelves with 30 bowls and 30 cups, not just a single display piece.
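Step 3 (placing many items with spacing) can be pictured as filling a surface with a regular grid. This is a deliberately simple sketch, assuming square item footprints and a rectangular shelf; the `grid_positions` helper is hypothetical, and the real system places items with agents rather than a fixed grid.

```python
def grid_positions(width: float, depth: float, item_size: float, gap: float):
    """Return (x, y) centers for square items laid out on a width x depth surface.

    Items are item_size wide and separated by gap, measured from the
    surface's corner at (0, 0).
    """
    pitch = item_size + gap
    # n items need n * item_size + (n - 1) * gap of space along one axis.
    nx = int((width + gap) // pitch)
    ny = int((depth + gap) // pitch)
    return [
        (item_size / 2 + i * pitch, item_size / 2 + j * pitch)
        for j in range(ny)
        for i in range(nx)
    ]


# A 1.2 m x 0.3 m shelf filled with 10 cm bowls, 2 cm apart.
spots = grid_positions(width=1.2, depth=0.3, item_size=0.10, gap=0.02)
print(len(spots))  # 20 bowls fit on this one shelf
```

Two such shelves would already hold more than the 30 bowls requested in the pottery shop example, which shows why dense scenes need every surface to be used.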

Finally, to make all this planning ‘human-smart,’ SceneSmith uses multiple agents that reason, check, and improve as they go. That bridges the last mile between text prompts and robot-ready worlds.

02 Core Idea

🍞 Hook: You know how group projects work best when one person proposes ideas, one person double-checks them, and one person keeps everyone on track?

🥬 The Concept (Agentic Framework): A team of AI helpers—Designer, Critic, and Orchestrator—take turns making, checking, and approving scene edits. How it works: 1) Designer proposes room layout or object placements, 2) Critic scores realism, physics, and prompt match, 3) Orchestrator accepts, asks for fixes, or rolls back to a safer checkpoint. Why it matters: Without checks and balances, scenes get unrealistic or physically wrong. 🍞 Anchor: The Designer adds chairs; the Critic notices they don’t face the table; the Orchestrator asks for a fix.
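The propose, score, and accept-or-roll-back loop can be sketched in a few lines. In SceneSmith the Designer and Critic are AI agents; here they are stood in for by plain functions, and every name (`refine`, `toy_designer`, `toy_critic`, `WANTED`) is a hypothetical placeholder for illustration only.

```python
from typing import Callable


def refine(scene: list,
           designer: Callable[[list], list],
           critic: Callable[[list], float],
           max_rounds: int = 5,
           accept_score: float = 0.9) -> list:
    """Orchestrator loop: keep the best-scoring checkpoint, drop worse proposals."""
    best, best_score = list(scene), critic(scene)
    for _ in range(max_rounds):
        if best_score >= accept_score:
            break  # good enough: accept the current checkpoint
        candidate = designer(list(best))  # propose an edit on a copy
        score = critic(candidate)
        if score > best_score:  # improvement becomes the new checkpoint
            best, best_score = candidate, score
        # otherwise the proposal is discarded and the next round
        # starts again from the last good checkpoint (a rollback)
    return best


WANTED = ["desk", "chair", "lamp"]  # toy stand-in for "what the prompt asks for"


def toy_designer(scene: list) -> list:
    """Add the first requested item that is still missing."""
    missing = [item for item in WANTED if item not in scene]
    return scene + missing[:1]


def toy_critic(scene: list) -> float:
    """Score = fraction of requested items that are present."""
    return sum(item in scene for item in WANTED) / len(WANTED)


final_scene = refine([], toy_designer, toy_critic)
print(final_scene)  # ['desk', 'chair', 'lamp']
```

Keeping a checkpoint means a bad proposal never makes the scene worse; it only costs one round.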

The ‘Aha!’ in one sentence: Let a small team of specialized AIs build scenes step-by-step, while tightly attaching physics-aware assets, so the final world works immediately in a robot simulator.

Three analogies:

Movie set: Stage crew makes rooms (Designer), safety officer checks hazards (Critic), director chooses takes (Orchestrator).
Cooking show: Chef cooks (Designer), food critic tastes (Critic), producer keeps time and retakes (Orchestrator).
Lego city: Builder places blocks (Designer), friend checks fit and stability (Critic), parent decides when it’s done (Orchestrator).

Before vs. After:

Before: Sparse rooms, repeated assets, little clutter, and physics issues.
After: Dense, varied objects created on demand, realistic placements, low collisions, and high stability—ready for robots.

🍞 Hook: Imagine asking for ‘a round rug and three pillows’ and getting exactly those, in matching style, not random leftovers.

🥬 The Concept (Asset Routing): A smart chooser that decides whether to generate a new 3D object, fetch an articulated model, or use a thin decorative covering. How it works: 1) Read the request, 2) If static item, do text-to-3D; if it moves (like a cabinet door), retrieve a known articulated asset; if it’s a flat decor (rug/poster), use a thin covering, 3) Validate each asset’s quality and semantics. Why it matters: The wrong kind of asset breaks physics or looks wrong (e.g., a door that can’t open). 🍞 Anchor: ‘Fruit bowl’ becomes a bowl plus individual fruits, not one fused lump.
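The three-way choice in step 2 can be pictured as a small routing function. SceneSmith relies on AI models to make this decision; the keyword sets below are a hypothetical stand-in so that the example runs on its own, and `AssetKind` and `route_asset` are names invented for this sketch.

```python
from enum import Enum


class AssetKind(Enum):
    GENERATED = "text_to_3d"
    ARTICULATED = "retrieved_articulated"
    THIN_COVERING = "thin_covering"


# Hypothetical keyword lists, used only to keep the sketch self-contained.
ARTICULATED_WORDS = {"cabinet", "drawer", "fridge", "door", "microwave"}
THIN_WORDS = {"rug", "poster", "carpet", "painting"}


def route_asset(description: str) -> AssetKind:
    """Pick how an object should be obtained, based on its description."""
    words = set(description.lower().split())
    if words & THIN_WORDS:
        return AssetKind.THIN_COVERING  # flat decor: thin surface, no heavy physics
    if words & ARTICULATED_WORDS:
        return AssetKind.ARTICULATED    # moving parts: retrieve a model with joints
    return AssetKind.GENERATED          # everything else: generate from text


for request in ["round wool rug", "kitchen cabinet with doors", "blue ceramic bowl"]:
    print(request, "->", route_asset(request).value)
```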

🍞 Hook: When you write a report, you zoom in on each section. Scenes need that too.

🥬 The Concept (Hierarchical Prompt Refinement): Breaking the big prompt into smaller sub-prompts for each room and surface. How it works: 1) From ‘community center,’ make room-level prompts (gym, office, storage), 2) From room prompts, make surface prompts (this shelf vs. that table), 3) Keep style and constraints consistent. Why it matters: Without local focus, placements become random and miss details. 🍞 Anchor: ‘Dining room set for 12’ spawns exact place-setting instructions for each seat.
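One way to picture this refinement is as a tree of prompts in which each child inherits the style of its parent. The `PromptNode` class below is a hypothetical data structure for illustration; in the real system the sub-prompts are written by the agents, not typed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class PromptNode:
    name: str
    prompt: str
    style: str
    children: list = field(default_factory=list)

    def add_child(self, name: str, prompt: str) -> "PromptNode":
        """Create a sub-prompt that inherits this node's style."""
        child = PromptNode(name, prompt, self.style)
        self.children.append(child)
        return child

    def flatten(self, depth: int = 0) -> list:
        """Return the tree as indented text lines, parents before children."""
        lines = [f"{'  ' * depth}{self.name}: {self.prompt} [{self.style}]"]
        for child in self.children:
            lines.extend(child.flatten(depth + 1))
        return lines


scene = PromptNode("scene", "a community center", style="warm, modern")
gym = scene.add_child("gym", "a small gym with mats and weights")
shelf = gym.add_child("gym shelf", "folded towels and water bottles")
office = scene.add_child("office", "an office with two desks")

print("\n".join(scene.flatten()))
```

Because the style string is copied down the tree, the shelf-level prompt still knows the scene is meant to look warm and modern.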

🍞 Hook: Shelves don’t live alone—the left and right shelf should make sense together.

🥬 The Concept (Joint Surface Population): Populate related surfaces together so they coordinate. How it works: 1) Group surfaces (all bookshelf levels), 2) Place items considering the set (e.g., books on one, plants on another), 3) Check spacing and style harmony. Why it matters: Random per-shelf choices feel messy; joint planning looks curated and real. 🍞 Anchor: ‘Books on one shelf, plants on the other’ actually shows up that way.

Why it works (intuition, no equations):

Making decisions in stages avoids getting lost in details too early.
A friendly tug-of-war (Designer vs. Critic) steers toward both creativity and correctness.
Smart asset choices mean items match the task (movable doors need articulated models).
Built-in physics and settling steps clean up tiny mistakes so the final scene “just works.”

Building blocks:

Room layout with adjacency logic and doors/windows.
Furniture placement with tools for snapping and facing.
Wall and ceiling fixtures with coordinate frames.
Small-object population with stacks, piles, and container filling.
Physics prep: collision shapes, masses, friction, gravity settling.
Export to multiple simulators so robots can act right away.

03 Methodology

At a high level: Text prompt → Layout (rooms, doors, windows) → Furniture placement → Wall and ceiling fixtures → Small-object (manipuland) population → Physics cleanup → Export to simulators.
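The flow above can be sketched as an ordered list of stages that each take the scene and hand it on. The stage names and the `run_pipeline` helper are hypothetical, and the stage functions here are placeholders that pass the scene through unchanged.

```python
# Stage names follow the high-level flow described above.
STAGES = [
    "layout",
    "furniture",
    "wall_ceiling",
    "manipulands",
    "physics_cleanup",
    "export",
]


def run_pipeline(prompt: str, stage_fns: dict) -> dict:
    """Run every stage in order; each stage receives and returns the scene."""
    scene = {"prompt": prompt, "log": []}
    for stage in STAGES:
        scene = stage_fns[stage](scene)
        scene["log"].append(stage)  # record what has been done so far
    return scene


# Placeholder stages that do nothing, just to show the control flow.
placeholder_stages = {stage: (lambda scene: scene) for stage in STAGES}
result = run_pipeline("a cozy study with books and a desk", placeholder_stages)
print(result["log"])
```

The fixed order is the point: small objects are only placed once the surfaces they sit on already exist.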

Step 1: Read the prompt and draft the house layout.

What happens: The Designer proposes room shapes, sizes, and connections (e.g., hallway to bedroom) and adds doors/windows; the Critic checks flow and lighting; the Orchestrator keeps the best version.
Why it exists: A good floor plan prevents later object overlaps and makes traffic paths realistic.
Example: ‘A pottery store with shelves along the walls’ becomes a rectangular front room, door at the front, and windows on exterior walls.

🍞 Hook: Placing big pieces first makes the rest easy, like setting the table before the tiny decorations.

🥬 The Concept (Furniture Placement Tools): Helpers that check facing, snap objects together, and verify reachability. How it works: 1) Place sofas, tables, cabinets, 2) Use ‘check facing’ so chairs point to tables or TVs, 3) Use ‘snap’ so items align and small gaps vanish, 4) Check reachability so a robot can pass. Why it matters: Without these, chairs might face walls and robots get blocked. 🍞 Anchor: Two guest chairs end up neatly facing the desk, with space to walk around.
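A ‘check facing’ helper can be as simple as comparing the direction an object points with the direction to its target. The sketch below works in 2D (top-down view) and assumes an object's forward direction is given by a yaw angle; the function names and the 15-degree tolerance are hypothetical choices, not SceneSmith's actual tool.

```python
import math


def facing_error_deg(position, yaw_rad, target):
    """Angle in degrees between an object's forward direction and its target."""
    to_target = math.atan2(target[1] - position[1], target[0] - position[0])
    # Wrap the difference into [-pi, pi) so 350 degrees counts as 10 degrees.
    diff = (to_target - yaw_rad + math.pi) % (2 * math.pi) - math.pi
    return abs(math.degrees(diff))


def is_facing(position, yaw_rad, target, tolerance_deg=15.0):
    """True if the object points at the target within the given tolerance."""
    return facing_error_deg(position, yaw_rad, target) <= tolerance_deg


desk = (1.0, 0.0)
print(is_facing((0.0, 0.0), 0.0, desk))      # chair pointing at the desk
print(is_facing((0.0, 0.0), math.pi, desk))  # chair pointing away from it
```

A check like this gives the Critic something concrete to report, such as "chair 2 is 180 degrees off".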

Step 2: Add wall and ceiling items.

What happens: The Designer adds shelves, posters, clocks, lights, and fans using wall/ceiling coordinate frames; Critic checks heights and openings; Orchestrator approves or asks for fixes.
Why it exists: These elements are common in real rooms and affect both realism and lighting.
Example: ‘A Minecraft-themed gaming room’ gets a matching poster plus a ceiling light centered over the seating area.
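A wall coordinate frame lets the Designer say "1.5 m along this wall, 1.6 m up" instead of working out world coordinates. The conversion below is a minimal sketch under the assumption that a wall is a straight segment between two floor points; `wall_to_world` is a hypothetical helper, not part of SceneSmith's published interface.

```python
import math


def wall_to_world(wall_start, wall_end, along_m, height_m):
    """Convert (distance along the wall, height) into a world (x, y, z) point."""
    dx = wall_end[0] - wall_start[0]
    dy = wall_end[1] - wall_start[1]
    length = math.hypot(dx, dy)
    if not 0.0 <= along_m <= length:
        raise ValueError("point lies beyond the ends of the wall")
    # Walk along the wall's unit direction, then lift to the given height.
    x = wall_start[0] + along_m * dx / length
    y = wall_start[1] + along_m * dy / length
    return (x, y, height_m)


# A clock placed 1.5 m along a 4 m wall, 1.6 m above the floor.
clock = wall_to_world((0.0, 0.0), (4.0, 0.0), along_m=1.5, height_m=1.6)
print(clock)
```

The range check is a small example of what the Critic looks for: an item placed past the end of a wall would float in mid-air or land in another room.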

🍞 Hook: Real rooms feel real because of all the little things: books, bowls, pens, plants.

🥬 The Concept (Manipuland Placement): Filling support surfaces (tables, shelves) with small, movable items. How it works: 1) Detect surfaces (like shelf levels), 2) Decide what belongs there from the prompt and context, 3) Place many items with spacing, stacks, piles, and container fills, 4) Validate physics. Why it matters: Robots must learn to see and handle clutter; without manipulands, scenes are too easy. 🍞 Anchor: The pottery store shelves really get ‘at least 30 cups and 30 bowls,’ not just three.

Step 3: Get the right assets in the right way.

Static items: Text-to-3D pipeline makes new meshes on demand (image generation → segmentation → 3D reconstruction), then scales, orients, and adds collision shapes and physics.
Articulated items (drawers/doors): Retrieved from a library with correct joints, then given physics properties.
Thin coverings: Rugs and posters as lightweight surfaces with materials (fast and pretty, without heavy physics).
Why it exists: The scene needs both variety and correctness; generation for variety, retrieval for moving parts.

🍞 Hook: If you and a friend bump into each other, you both stop; 3D objects need that same ‘bump-sense.’

🥬 The Concept (Collision Geometry): Invisible simple shapes that help the physics engine detect and resolve contact. How it works: 1) Split the mesh into convex parts, 2) Use those for fast and stable collision checks, 3) Keep visual mesh separate for looks. Why it matters: Without this, objects pass through each other or wobble strangely. 🍞 Anchor: A stack of plates rests neatly because their collision shapes fit and prevent interpenetration.

Step 4: Physics cleanup and settling.

What happens: After placement, a solver nudges overlapping items apart, then runs a gravity simulation so wobbly objects settle into stable positions.
Why it exists: Even careful placement can leave tiny overlaps or unstable stacks; this makes the scene truly ready-to-use.
Example: A cup barely intersecting a plate is separated by a millimeter and then gently settles under gravity.
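The "nudge apart" part of this step can be illustrated with two overlapping boxes seen from above. This is a simplified stand-in, assuming axis-aligned rectangles and a 1 mm safety margin; the `Box`, `penetration`, and `separate` names are hypothetical, and gravity settling is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x
    cy: float  # center y
    hx: float  # half-width along x
    hy: float  # half-width along y


def penetration(a: Box, b: Box):
    """Overlap depth along x and y; a value <= 0 means no overlap on that axis."""
    px = (a.hx + b.hx) - abs(a.cx - b.cx)
    py = (a.hy + b.hy) - abs(a.cy - b.cy)
    return px, py


def separate(a: Box, b: Box, margin: float = 0.001) -> None:
    """Push two overlapping boxes apart along the axis of least overlap."""
    px, py = penetration(a, b)
    if px <= 0 or py <= 0:
        return  # the boxes do not overlap
    if px <= py:
        shift = (px + margin) / 2  # each box moves half the distance
        sign = 1.0 if a.cx >= b.cx else -1.0
        a.cx += sign * shift
        b.cx -= sign * shift
    else:
        shift = (py + margin) / 2
        sign = 1.0 if a.cy >= b.cy else -1.0
        a.cy += sign * shift
        b.cy -= sign * shift


cup = Box(0.00, 0.0, 0.05, 0.05)
plate = Box(0.09, 0.0, 0.05, 0.05)  # overlaps the cup by 1 cm along x
separate(cup, plate)
print(cup.cx, plate.cx)
```

Moving along the axis of least overlap keeps the correction as small as possible, so the arrangement the Designer chose is disturbed as little as it can be.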

Step 5: Export and evaluate robot policies.

What happens: Scenes export to engines like Drake, MuJoCo, or Isaac Sim. A robot policy runs (e.g., pick-and-place), and an evaluator agent checks success by looking at object states and rendered views.
Why it exists: This closes the loop from natural language to robot testing and automatic grading.
Example: ‘Find a fruit and place it on the table’ runs across many generated kitchens; the evaluator confirms the fruit ended up on the tabletop.
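The object-state half of such a check can be written as a simple geometric test. The sketch below only covers that half, assuming axis-aligned table bounds and a 1 cm height tolerance; `is_on_surface` is a hypothetical helper, and the real evaluator is an agent that also inspects rendered views.

```python
def is_on_surface(obj_center, obj_half_height,
                  surface_min_xy, surface_max_xy, surface_top_z,
                  z_tol=0.01):
    """True if the object rests on top of the surface, within its footprint."""
    x, y, z = obj_center
    inside = (surface_min_xy[0] <= x <= surface_max_xy[0]
              and surface_min_xy[1] <= y <= surface_max_xy[1])
    # The object's bottom should sit at the height of the tabletop.
    resting = abs((z - obj_half_height) - surface_top_z) <= z_tol
    return inside and resting


# A 1.2 m x 0.8 m table whose top is 0.75 m above the floor.
table = dict(surface_min_xy=(0.0, 0.0), surface_max_xy=(1.2, 0.8), surface_top_z=0.75)

apple_on_table = is_on_surface((0.5, 0.3, 0.79), 0.04, **table)
apple_on_floor = is_on_surface((0.5, 0.3, 0.04), 0.04, **table)
print(apple_on_table, apple_on_floor)  # True False
```

A fixed rule like this is repeatable but narrow; an agent-based evaluator trades some of that repeatability for the ability to judge open-ended task descriptions.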

The secret sauce:

Trio teamwork (Designer–Critic–Orchestrator) prevents drift and catches mistakes.
Hierarchical prompts keep global style while enabling precise local control.
Smart asset routing balances novelty (new items) and correctness (movable joints).
Physics-first preparation (collision shapes, mass, friction) and settling ensure scenes behave like the real world.

04 Experiments & Results

The test: The authors fed 210 prompts into SceneSmith, ranging from everyday rooms to special shops and full houses. They measured how many objects appear, how well the scenes match the prompt, how realistic they look to humans, and how well they pass physics checks (few collisions, high stability). They also compared against competing systems and ran ablations, turning off parts of SceneSmith to see which ones mattered most.

The competition: There were six strong baselines: HSM and Holodeck (popular scene generators), SceneWeaver, I-Design, and two versions of LayoutVLM. These are respected methods that focus on layouts, constraints, or agent-like refinement, but usually not the full ‘tiny clutter + physics’ package.

The scoreboard with context:

Realism (human preference): About 92% of the time, people picked SceneSmith’s rooms as more realistic. That’s like getting an A+ when others get B’s.
Prompt faithfulness: About 91% of the time, people said SceneSmith followed the request better.
Object count: Around 71 objects per room on average vs. 11–23 for others—like turning a nearly empty shelf into a real store display.
Collision rate: About 1.2% for SceneSmith vs. 3–29% for others—fewer ‘ghost overlaps.’
Stability: About 96% of objects stayed put after physics settling vs. 8–61% for others—huge gap in robot-readiness.
House-level: SceneSmith kept its edge, with more objects and better physics than Holodeck.
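To make the scoreboard numbers concrete, here is how such rates could be computed. The definitions are assumptions made for this sketch (collision rate as the share of objects touching at least one other object, stability as the share of objects that move less than 1 cm during settling); the paper's exact definitions may differ. Only the object counts in the last lines come from the figures quoted above.

```python
def collision_rate(num_objects: int, colliding_pairs) -> float:
    """Share of objects involved in at least one collision (assumed definition)."""
    involved = {index for pair in colliding_pairs for index in pair}
    return len(involved) / num_objects


def stability_rate(displacements_m, threshold_m: float = 0.01) -> float:
    """Share of objects that moved no more than threshold_m while settling."""
    stable = sum(d <= threshold_m for d in displacements_m)
    return stable / len(displacements_m)


# Toy data: 100 objects with one colliding pair; 4 objects, one of which fell.
print(collision_rate(100, [(3, 7)]))              # 0.02
print(stability_rate([0.0, 0.002, 0.5, 0.001]))   # 0.75

# Density ratio implied by the object counts quoted above.
scenesmith_objects = 71
baseline_low, baseline_high = 11, 23
ratio_vs_densest = scenesmith_objects / baseline_high   # about 3.1
ratio_vs_sparsest = scenesmith_objects / baseline_low   # about 6.5
print(round(ratio_vs_densest, 1), round(ratio_vs_sparsest, 1))
```

The last two numbers show where the "3–6 times more objects" summary comes from: 71 objects against baselines that average between 11 and 23.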

Surprising findings:

Navigation metrics (free space) were lower for SceneSmith—and that’s expected. When you add 3–6× more stuff, there’s less empty floor. For robot training in clutter, that’s actually the point.
Turning off asset validation or visual observations made results clearly worse. Generating new assets (instead of using a fixed set) really helped too.
Removing the Critic saved money but reduced object density and flexibility; it might be okay for some use cases.

Examples people liked:

A pottery store truly packed with bowls and cups on shelves.
A dining room table ‘set for 12’ with the right number of plates and utensils.
Themed rooms (like Minecraft-style decor) where details match the vibe.
Houses with sensible room connections (e.g., hallways, bathrooms attached to bedrooms) that feel like real floor plans.

Bottom line: SceneSmith isn’t just about pretty pictures—it’s about scenes that act right, with objects robots can bump, grasp, and move. Winning human preferences while nailing physics is what makes it stand out.

05 Discussion & Limitations

Limitations:

Compute demand: Text-to-3D and physics steps need solid GPUs and time, especially for whole houses with hundreds of items.
Articulated generation: Current text-to-3D struggles with reliable moving parts; retrieval is used instead.
Prompt quality: Vague or contradictory prompts can produce mismatches and require retries.
Non-determinism: The evaluator agent and VLM-based checks can disagree in rare edge cases (though agreement was very high).
Navigation trade-off: Dense scenes reduce free space, which is helpful for manipulation but tougher for mobile navigation.

Required resources:

A modern GPU setup for parallel text-to-3D jobs.
Access to a high-quality articulated asset library.
A physics engine (e.g., Drake), collision tools, and storage for many assets.
VLMs (vision-language models) for routing, validation, and property estimation.

When not to use:

If you need exact replicas of a specific real apartment (use 3D scanning instead).
If you must generate on a low-power device instantly.
If you need guaranteed, rule-based grading (agent grading is flexible but not strictly deterministic).
For tasks where maximum empty space is required (SceneSmith aims for realistic clutter).

Open questions:

Can future text-to-3D produce high-quality articulated objects reliably?
How to further tighten physical accuracy (materials, friction) with minimal overhead?
Can we learn styles and object distributions directly from large-scale real scans while keeping privacy?
How to mix robot-in-the-loop feedback so scenes adapt to policy weaknesses automatically?
How to make evaluator agents more consistent without losing open-ended flexibility?

06 Conclusion & Future Work

Three-sentence summary: SceneSmith turns a short text into a full, robot-ready indoor world by combining a trio of AI agents with physics-aware asset generation. It builds scenes layer by layer—rooms, big furniture, walls/ceilings, then many small objects—and cleans them up with collision fixes and gravity settling. The result is dense, realistic, low-collision scenes that people prefer and robots can use immediately.

Main achievement: Strong, end-to-end integration of agent teamwork, hierarchical prompting, on-demand asset creation, and physics preparation, producing 3–6× denser scenes with far fewer collisions and much higher stability than prior work.

Future directions: Make articulated text-to-3D reliable, speed up generation with smarter caching and parallelism, improve evaluator determinism, and add robot-in-the-loop scene adaptation. Also, expand style control and real-data grounding for even richer, more diverse homes and shops.

Why remember this: It’s a practical bridge from words to robot worlds—finally giving robots messy, realistic practice grounds that look and act like our homes, not empty showrooms. That can make home robots safer, smarter, and more helpful, faster.

Practical Applications

• Train and stress-test home robots on realistic clutter before real-world trials.
• Generate diverse benchmark scenes for fair comparison of robot policies.
• Rapidly prototype interior layouts for education, design, or research.
• Create rich virtual environments for VR/AR experiences and games.
• Teach robotics concepts (grasping, navigation) with hands-on simulated labs.
• Automatically evaluate pick-and-place policies across many kitchens and offices.
• Plan service robot routes in shops with true shelf density and narrow aisles.
• Simulate warehouse backrooms and storage rooms with varied shelving and boxes.
• Test safety cases (blocked doors, tight spaces) in controlled, repeatable scenes.
• Create themed environments (e.g., pottery store, community center) from a single paragraph.
