SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
Key Summary
- SurgWorld teaches surgical robots using videos plus text, then guesses the missing robot moves so we can train good policies without collecting tons of real robot-action data.
- The team curated SATA, a 2,447-clip dataset (300k+ frames) with expert text that precisely describes tool–tissue actions like needle grasping and knotting.
- They fine-tuned a powerful video world model (Cosmos-Predict2.5) on SATA to generate realistic, text-aligned surgical videos that follow instructions.
- An inverse dynamics model (IDM) turns those synthetic videos into pseudo-kinematics (the robot’s best-guess movements), creating paired video–action data at scale.
- On a real surgical robot, policies trained on real data plus SurgWorld synthetic data beat policies trained only on real demos at needle pickup and handover.
- Video quality improved strongly: SurgWorld cut FVD from 175.4 (zero-shot) to 106.5 and scored highest on expert clinical realism ratings.
- Few-shot adaptation worked: with just 5 real trajectories, SurgWorld reached a 73.2% expert-judged task success rate, better than models without SATA pretraining.
- Adding more synthetic data reduced policy action errors (MSE) across positions, rotations, and gripper angles, showing consistent gains.
- This approach offers a safe, scalable path to surgical autonomy by unlocking the value in abundant unlabeled surgical videos.
- The method bridges a key gap: it connects text-aligned surgical video generation to robot policy training via pseudo-kinematics.
Why This Research Matters
SurgWorld makes it possible to train capable surgical robot policies without collecting huge, hard-to-get paired datasets from operating rooms. Hospitals and research labs can safely generate realistic, instruction-following videos and estimate the matching actions, greatly expanding training material. This could speed up progress on reliable surgical assistance for tasks like suturing, knotting, or needle handovers, improving consistency and reducing fatigue for clinicians. By relying less on scarce in-vivo data, development can proceed faster while respecting privacy and safety. Over time, this approach could help standardize core surgical skills in robots, leading to more predictable outcomes. It also shows a recipe for other high-stakes domains—use world models plus inverse dynamics to unlock unlabeled video troves.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine learning to tie your shoes by only watching videos but never seeing a close-up of the hands or feeling the laces. You’d learn something, but important details would be missing.
🥬 The Concept (Physical AI): What it is: Physical AI means teaching machines to see, think, and act correctly in the real world, not just answer questions on a screen. How it works:
- Watch the world through sensors like cameras.
- Understand what’s happening (tools, tissue, motions).
- Choose actions that safely change the world (move, grasp, pull). Why it matters: Without Physical AI, robots can’t do careful hands-on tasks like surgery because they don’t connect seeing with doing.
🍞 Anchor: A surgical robot must see a needle, plan a gentle grasp, and move precisely—all parts of Physical AI in action.
The World Before: Robots in homes and factories started getting really good because they used big “Vision–Language–Action (VLA)” models trained on lots of paired data: images or videos (what they saw), text (what to do), and actions (what they did). But in surgery, we mostly have videos—tons of them online—without the matching action data (the robot’s exact positions, rotations, and gripper angles). That’s like having nature documentaries without the field notes.
🍞 Hook: You know how following a recipe is easier when you have both the cooking video and the written steps?
🥬 The Concept (Vision–Language–Action models): What it is: A VLA model learns to map what it sees (vision) and reads (language) to what it should do (action). How it works:
- Read the instruction (e.g., “pass the needle to the right hand”).
- Look at the scene to find tools and tissue.
- Output the next moves for the robot. Why it matters: Without VLA, a robot can’t follow high-level instructions in new situations.
🍞 Anchor: When told “pick up the needle,” a VLA focuses on the needle and plans the grip and lift.
The Problem: In surgery, collecting paired video + precise robot kinematics is hard, costly, and regulated. Imitation learning (copying expert motions) needs lots of these pairs, and when the dataset is small, errors snowball (covariate shift). Simulators help, but soft tissue is tricky and visuals often don’t match real endoscopy, so trained policies may not transfer.
🍞 Hook: Imagine trying to practice piano on a cardboard keyboard. It helps a little, but it’s not the same as real keys.
🥬 The Concept (Paired video–kinematics data): What it is: Videos aligned with exact robot movements at each frame. How it works:
- Record endoscope video.
- Record robot tip positions, orientations, and gripper angle per frame.
- Align them in time. Why it matters: Without pairing, the robot only sees “what happened,” not “how to do it.”
🍞 Anchor: Knowing the frame where the needle enters tissue plus the 3D tip path teaches the robot the correct puncture angle and depth.
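To make “paired” concrete, here is a minimal sketch of one time-aligned video–kinematics record. It assumes a 20-dimensional action layout (per arm: 3D tip position, 6D rotation, gripper angle) matching the description later in this article; the field names and index order are illustrative, not the paper’s actual format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PairedFrame:
    """One time-aligned sample: what the endoscope saw and what the robot did."""
    frame_idx: int       # index into the endoscope video
    timestamp_s: float   # capture time in seconds
    image_path: str      # path to the RGB frame on disk (hypothetical layout)
    action: np.ndarray   # 20-D kinematics for this timestep

def split_action(action: np.ndarray) -> dict:
    """Unpack a 20-D action: per arm, 3-D tip position + 6-D rotation + gripper angle."""
    assert action.shape == (20,)
    return {
        "left_pos":  action[0:3],   "left_rot6d":  action[3:9],   "left_grip":  action[9],
        "right_pos": action[10:13], "right_rot6d": action[13:19], "right_grip": action[19],
    }

# Example: a single paired sample with a placeholder action vector.
sample = PairedFrame(frame_idx=0, timestamp_s=0.0,
                     image_path="clip_0001/frame_0000.png",
                     action=np.zeros(20))
print(split_action(sample.action)["left_pos"])
```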
Failed Attempts: Pure simulators miss surgical realism; using only small real datasets leads to brittle policies; and generic video generators don’t understand surgical tools and anatomy well enough to teach robots safely.
The Gap: We needed a way to transform the ocean of unlabeled surgical videos into training fuel that includes actions—even when no actions were recorded.
Real Stakes: Better training means steadier surgical assistance, fewer errors, reduced surgeon fatigue, and more consistent outcomes. If robots can learn core skills like needle passing from safely generated examples, hospitals can scale training without risking patients or overloading operating rooms.
02 Core Idea
🍞 Hook: Think of learning a dance by watching a high-quality animation and also getting step-by-step foot positions. You could practice perfectly even without a human partner.
🥬 The Concept (The “Aha!” Moment): What it is: Use a surgical world model to generate realistic, text-guided surgical videos, then use an inverse dynamics model to guess the missing robot actions, creating paired data to train strong policies. How it works (big picture):
- Build SATA, a dataset of surgical clips with expert action text.
- Fine-tune a powerful video world model on SATA so it makes photorealistic, instruction-following surgery videos.
- Train an inverse dynamics model (IDM) to infer actions from pairs of frames.
- Generate many synthetic videos from text prompts, and label them with pseudo-kinematics using IDM.
- Train a VLA policy on both real demos and these synthetic pairs. Why it matters: Without this, we stay stuck with tiny paired datasets; with it, we scale safely using abundant unlabeled videos.
🍞 Anchor: Prompt “left tool passes needle to right tool,” generate a rollout, label each frame’s 20D action, and train a policy that now nails handovers more reliably.
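As a rough mental model, the whole loop can be written as a few lines of code. The sketch below is a toy, runnable outline with placeholder classes standing in for the real models; none of the names are the authors’ API.

```python
"""Toy sketch of the SurgWorld data flow. Every class here is a stand-in:
real training of the world model, IDM, and VLA policy is far more involved;
this only shows how the pieces hand data to each other."""
import numpy as np

class WorldModel:
    def rollout(self, first_frame, text, num_frames=16):
        # Placeholder: a real model would generate future frames guided by `text`.
        return np.stack([first_frame] * num_frames)

class InverseDynamicsModel:
    def label(self, video):
        # Placeholder: a real IDM infers the 20-D action per step from frame pairs.
        return np.zeros((len(video) - 1, 20))

class VLAPolicy:
    def fit(self, paired_data):
        print(f"training policy on {len(paired_data)} paired trajectories")

def surgworld_pipeline(real_demos, prompts, init_frames):
    world_model = WorldModel()                 # step 2: SurgWorld, finetuned on SATA
    idm = InverseDynamicsModel()               # step 3: trained on real paired demos
    synthetic_pairs = []
    for frame in init_frames:                  # step 4: generate + pseudo-label
        for prompt in prompts:
            video = world_model.rollout(frame, prompt)
            actions = idm.label(video)
            synthetic_pairs.append((video, prompt, actions))
    policy = VLAPolicy()                       # step 5: train on real + synthetic
    policy.fit(real_demos + synthetic_pairs)
    return policy

surgworld_pipeline(real_demos=[], prompts=["pass the needle to the right tool"],
                   init_frames=[np.zeros((224, 224, 3))])
```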
Three Analogies:
- Movie Director: SATA is the script, SurgWorld is the animation studio, IDM is the choreographer who turns frames into precise moves, and the VLA is the actor who learns the role.
- Cooking Class: Text is the recipe, SurgWorld shows the cooking video, IDM lists exact spoon and pan positions, and the policy becomes a chef who can cook without the teacher.
- Map and Driving: Text is directions, SurgWorld shows street views, IDM recovers steering and speed, and the policy becomes a safe driver.
Before vs. After:
- Before: Needed lots of paired video+action data; surgical policies overfit and struggled.
- After: Generate text-aligned videos and infer actions at scale; policies become more data-efficient and generalizable.
Why It Works (intuition):
- Text grounding forces SurgWorld to codify “who does what where,” keeping videos semantically correct.
- IDM bridges vision and control by reconstructing the missing action timeline from visual change.
- Mixing real demos with synthetic pairs anchors learning to reality while expanding coverage.
Building Blocks in Bite-Sized Pieces:
🍞 Hook: You know how a librarian writes detailed summaries so students quickly find the right book? 🥬 SATA Dataset: What it is: 2,447 expert-labeled clips (300k+ frames) across four key suturing actions with precise text about tools, tissue, and spatial relations. How it works: Collect from YouTube/public sets → segment by actions → add fine-grained, expert text. Why it matters: Without sharp text labels, videos can’t be reliably controlled by prompts. 🍞 Anchor: “Left needle driver punctures right side of dorsal venous complex” reliably guides the generator to the correct motion.
🍞 Hook: Imagine a VR simulator that actually obeys your story prompt. 🥬 SurgWorld (diffusion-based world model): What it is: A fine-tuned version of Cosmos-Predict2.5 that makes realistic surgical videos from a starting frame and a text prompt. How it works: Encode the first frame → use a transformer to predict future latent frames with flow matching → decode to video, guided by text. Why it matters: Without realistic, controllable videos, you can’t synthesize useful training examples. 🍞 Anchor: Same initial frame + “two-time handover” yields a clean left→right→left sequence.
🍞 Hook: Picture seeing two photos in a flipbook and figuring out all the in-between moves. 🥬 Inverse Dynamics Model (IDM): What it is: A model that infers the robot’s action sequence from visual changes between two frames. How it works: Input frame i and frame i+T → predict the 16-step action path (positions, rotations, gripper) that explains the change. Why it matters: Without IDM, synthetic videos stay unlabeled and can’t train action policies. 🍞 Anchor: From “needle moved 2 cm and rotated,” IDM outputs the 20D actions that could have caused it.
🍞 Hook: Like training wheels on a bike—safe repeats build skill fast. 🥬 VLA Policy: What it is: A model (e.g., GR00T N1.5) that maps current image + text + state to the next 16 actions. How it works: Pretrain on broad data → finetune with real demos + SurgWorld/IDM synthetic pairs. Why it matters: Without the final policy, the robot can’t act in the real OR. 🍞 Anchor: The learned policy executes a smoother, closer-to-expert needle pickup and handover.
03 Methodology
At a high level: Input (SATA videos + a few real robot demos) → [Train SurgWorld world model] → [Train IDM for actions] → [Generate synthetic videos + pseudo-kinematics] → [Train VLA policy on real + synthetic] → Output: a better surgical policy.
Step 1: Curate SATA (Surgical Action–Text Alignment)
- What happens: Gather 2,447 clips (300k+ frames) from YouTube and public datasets, each labeled as needle grasping, puncture, suture pulling, or knotting, with expert-written text describing tools, anatomy, and interactions.
- Why this step: Without precise text, the world model can’t follow prompts; without consistent action categories, it can’t learn motion primitives.
- Example data: “Bottom fenestrated bipolar forceps holds suture while right needle driver punctures the right and midline of the patient’s dorsal venous complex.”
🍞 Hook: Like labeling sports clips with plays (pass, dribble, shoot) and notes on who passes to whom. 🥬 SATA: What it is: A text-aligned video library built for Physical AI. How it works: Aggregate clips → action-segment → expert text with spatial and interaction details. Why it matters: Prompts become reliable controls for video generation. 🍞 Anchor: Prompt “three-time handover” leads to the exact left→right→left→right rhythm.
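To picture what one SATA entry might look like on disk, here is a hedged sketch; the field names are made up for illustration, while the four action categories and the example caption come straight from the step above.

```python
# One SATA-style annotation entry as it might be stored on disk.
# Field names are illustrative; only the action categories and the example
# caption come from the article.
import json

sata_entry = {
    "clip_id": "sata_000123",
    "source": "public surgical video",      # e.g., YouTube or an open dataset
    "num_frames": 128,
    "action_category": "puncture",          # one of: needle grasping, puncture,
                                            # suture pulling, knotting
    "expert_text": ("Bottom fenestrated bipolar forceps holds suture while "
                    "right needle driver punctures the right and midline of "
                    "the patient's dorsal venous complex."),
}
print(json.dumps(sata_entry, indent=2))
```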
Step 2: Train and Finetune SurgWorld
- What happens: Start from Cosmos-Predict2.5 (a diffusion-based video world model). Insert LoRA adapters (parameter-efficient finetuning). Train with Flow Matching so the model predicts a velocity in latent space, improving stability and quality. Condition on an initial frame and the text prompt to roll out future frames.
- Why this step: We need a generator that creates realistic, instruction-following surgical sequences with correct tool–tissue behavior.
- Example: Given a clean endoscopic view with no visible tools and the prompt “left needle driver passes needle to right needle driver,” the model generates a plausible pickup and pass sequence.
🍞 Hook: Think of adding training wheels (LoRA) to a big bike (Cosmos) so you can quickly ride on a new road (surgery) without rebuilding the bike. 🥬 SurgWorld: What it is: Cosmos-Predict2.5 tuned on SATA to make surgical videos from text and a starting frame. How it works: Encode starter frame → transformer predicts latent future frames via flow matching → decode to realistic video aligned with text. Why it matters: Quality and control are crucial; bad or off-prompt videos won’t teach good policies. 🍞 Anchor: “Needle puncture” prompts show the correct entry angle and tissue response.
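The flow-matching objective mentioned above can be sketched in a few lines: sample a point on a straight path between noise and the clean latent video, and train the network to predict the constant velocity of that path. This is a simplified sketch assuming a linear interpolation path; Cosmos-Predict2.5’s actual conditioning interface and the LoRA plumbing are abstracted into a generic `model` call.

```python
# Minimal flow-matching training step on latent video tensors (PyTorch).
# Simplified sketch: the real SurgWorld model is a diffusion transformer with
# LoRA adapters conditioned on a first frame and text; here `model` is any
# network that predicts a velocity field.
import torch

def flow_matching_step(model, latents, text_emb, first_frame_emb):
    """Train the model to predict the velocity that moves noise toward data."""
    noise = torch.randn_like(latents)                        # x0 ~ N(0, I)
    t = torch.rand(latents.shape[0], device=latents.device)  # one timestep per sample
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))            # broadcastable shape
    x_t = (1.0 - t_) * noise + t_ * latents                  # linear interpolation path
    target_velocity = latents - noise                        # d x_t / d t for this path
    pred_velocity = model(x_t, t, text_emb, first_frame_emb)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```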
Step 3: Train the IDM (Inverse Dynamics Model)
- What happens: Train an IDM that sees two frames T=16 apart and predicts the 16 in-between robot actions in a 20D vector per step: left/right tip positions (x,y,z), orientations (6D each), and gripper openings.
- Why this step: Videos alone don’t include actions; IDM recovers the missing kinematics.
- Example: For a left tool tip moving 1.2 cm toward the needle while rotating, the IDM outputs the trajectory of 3D positions, smooth orientation changes, and gripper angle to achieve the grasp.
🍞 Hook: Show a magician two snapshots—IDM reveals the hidden moves in between. 🥬 IDM: What it is: A model that fills in the action timeline between two frames. How it works: Input frame i and i+16 → output 16 steps of 20D actions that would transform i into i+16. Why it matters: It turns synthetic videos into paired video–action data for policy learning. 🍞 Anchor: Watching the thread straighten, IDM estimates the pulling path and gripper motion.
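Below is an illustrative PyTorch sketch of an IDM with this interface; the paper’s actual backbone is not reproduced here, only the input/output contract (two frames in, a 16-step chunk of 20-D actions out).

```python
# Illustrative inverse dynamics model (not the paper's exact architecture):
# encode the "before" and "after" frames, then regress the 16 intermediate
# 20-D actions that would explain the visual change.
import torch
import torch.nn as nn

class SimpleIDM(nn.Module):
    def __init__(self, horizon: int = 16, action_dim: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(   # tiny CNN stand-in for a real backbone
            nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * 64, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, frame_i, frame_i_plus_T):
        feats = torch.cat([self.encoder(frame_i), self.encoder(frame_i_plus_T)], dim=-1)
        return self.head(feats).view(-1, self.horizon, self.action_dim)

idm = SimpleIDM()
actions = idm(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print(actions.shape)  # torch.Size([1, 16, 20])
```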
Step 4: Generate Synthetic Rollouts and Pseudo-Kinematics
- What happens: Use SurgWorld to roll out many videos per prompt (e.g., 56 initial frames × 10 random seeds = 560 rollouts). Use the IDM to label each generated video with pseudo-kinematics.
- Why this step: Scale matters; more diverse, text-controlled examples improve generalization.
- Example: From the same initial frame, generate one-, two-, and three-time handovers, each with matching action labels.
🍞 Hook: Like practicing the same piano piece at different tempos to master control. 🥬 Synthetic Paired Data: What it is: SurgWorld videos labeled by IDM actions. How it works: Prompt → generate video → IDM infers per-frame 20D actions. Why it matters: Greatly expands training data beyond scarce real demos. 🍞 Anchor: Hundreds of needle passes teach the policy subtle timing for safe, smooth handovers.
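Here is a hedged sketch of this generation-and-labelling loop, with illustrative names: the world model is rolled out from each initial frame under several random seeds, and the IDM is slid over frame pairs 16 steps apart to recover one action chunk per pair.

```python
# Sketch of pseudo-labelling generated rollouts (variable names are illustrative).
import numpy as np

HORIZON = 16  # frames between the two IDM inputs, and actions predicted per chunk

def pseudo_label(video_frames, idm):
    """Return stacked pseudo-kinematics for one generated video."""
    chunks = []
    for i in range(0, len(video_frames) - HORIZON, HORIZON):
        chunk = idm(video_frames[i], video_frames[i + HORIZON])  # (16, 20) actions
        chunks.append(chunk)
    return np.concatenate(chunks, axis=0)

def generate_synthetic_pairs(world_model, idm, init_frames, prompt, seeds=range(10)):
    pairs = []
    for frame in init_frames:      # e.g., 56 initial frames
        for seed in seeds:         # e.g., 10 random seeds -> 560 rollouts
            video = world_model(frame, prompt, seed=seed)
            pairs.append({"video": video, "prompt": prompt,
                          "actions": pseudo_label(video, idm)})
    return pairs
```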
Step 5: Train the VLA Policy (e.g., GR00T N1.5)
- What happens: Start from a strong pretrained VLA. Mix real demos (5/10/20) with synthetic pairs (56 or 560). Train to predict the next 16 actions from current image + text + robot state.
- Why this step: The final goal is a controllable, real-world surgical policy.
- Example: The policy’s left-arm trajectory tracks the ground truth closely when trained with 10× synthetic data, lowering mean squared error across position, rotation, and jaw.
🍞 Hook: A coach who blends real scrimmages with high-quality drills to build game-day skill. 🥬 VLA Policy: What it is: The robot’s brain that turns what it sees and reads into safe actions. How it works: Finetune on real+synth data to output 16-step actions. Why it matters: This is what runs on the robot. 🍞 Anchor: On test episodes, the policy achieves smoother, more accurate needle pickup and handover.
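Schematically, the mixing step is just building one dataset from both sources and regressing 16-step action chunks. The sketch below uses a generic PyTorch loop and does not reproduce GR00T N1.5’s real training interface.

```python
# Generic sketch of mixing real demos with synthetic pairs for policy finetuning.
# `policy` is any model mapping (image, text, robot state) -> a 16 x 20 action chunk.
import random
import torch

def train_policy(policy, optimizer, real_demos, synthetic_pairs, steps=1000):
    dataset = list(real_demos) + list(synthetic_pairs)  # e.g., 5 real + 560 synthetic
    for _ in range(steps):
        sample = random.choice(dataset)
        pred = policy(sample["image"], sample["text"], sample["state"])  # (16, 20)
        loss = torch.nn.functional.mse_loss(pred, sample["actions"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```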
Secret Sauce:
- Fine-grained text in SATA gives precise prompt control.
- Flow-matched, LoRA-adapted SurgWorld yields realistic, stable, and semantically aligned videos.
- IDM bridges vision to action, unlocking synthetic paired data at scale.
- Mixing real and synthetic grounds the model in reality while expanding diversity.
04 Experiments & Results
The Test: The authors evaluated (1) video generation quality and text alignment on the SATA dataset, (2) few-shot finetuning on five real trajectories for needle pickup/hand-over, and (3) downstream policy accuracy on 40 held-out episodes using real-only vs. real+synthetic training.
🍞 Hook: Like judging a cooking class by how good the dishes look, how closely they match the recipe, and how well students cook on their own afterwards.
🥬 Video Quality and Alignment: What it is: Compare three variants—Zero-shot (no surgical finetuning), Action-category (coarse prompts per action), and SurgWorld (fine-grained SATA text). How it works: Measure FVD (lower is better for realism) and VBench metrics: Dynamic Degree (DD), Image Quality (IQ), Overall Consistency (OC). Also run a human expert study rating text alignment, tool consistency, and anatomical realism on a 1–3 scale. Why it matters: If videos aren’t realistic and on-prompt, they won’t teach good robot behavior.
🍞 Anchor: SurgWorld scored FVD 106.5 vs. 175.4 (zero-shot) and achieved the best DD and OC, like moving from a shaky C to a solid A in video realism and coherence.
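For readers wondering what FVD actually computes: it is the Fréchet distance between feature distributions of real and generated clips. Here is a minimal sketch assuming the per-clip features have already been extracted by a pretrained video encoder (the original FVD formulation uses I3D features), which is omitted here.

```python
# Fréchet distance over pre-extracted clip features (one row per video).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_mean):   # numerical noise can add tiny imaginary parts
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean))

# Lower is better: identical distributions give a distance near zero.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(200, 16)), rng.normal(size=(200, 16))))
```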
Results:
- Table 1: FVD ↓ (106.5 vs. 175.4), DD ↑ (62.4 vs. 26.9), IQ ↑ (49.3 vs. 48.7), OC ↑ (21.5 vs. 18.0). SurgWorld clearly outperforms baselines.
- Qualitative: Given the same starting frame, prompts like one/two/three-time handovers or puncture lead to distinct, correct sequences. Notably, two- and three-time handovers are novel compositions not explicitly seen during training.
- Human Experts: SurgWorld earned the highest ratings across text-video alignment, tool consistency, and anatomy realism; zero-shot and action-category variants showed tool hallucinations and weaker temporal coherence.
Few-Shot Adaptation on Real Trajectories:
- Setup: Finetune with only 5 real trajectories and test on 56 held-out initial frames (from out-of-domain episodes). Compare Zero-shot, Finetuned-Orig, and SurgWorld (SATA-pretrained then finetuned).
- Outcome: SurgWorld: 73.2% success rate (experts judged task completion) and the lowest FVD (207.1) among finetuned models, beating Finetuned-Orig (51.8% SR). SATA pretraining improves stability and clinical plausibility.
- Meaning: Like getting an A- when the generic model gets a C, using only a handful of real examples.
Policy Learning with Synthetic Data:
- Data: 60 demos total; last 40 are held-out tests. Train policies with 5/10/20 real demos. Add synthetic sets: 56 (1×) or 560 (10×) SurgWorld rollouts labeled by IDM.
- Metric: Mean Squared Error (MSE) of 20D action predictions (positions, rotations, gripper) vs. ground truth; a per-component breakdown is sketched after this list.
- Findings: Adding synthetic data consistently lowers MSE across all components. With 10× synthetic, the left-arm trajectory tracks ground truth much better (as in Fig. 7). The trend holds across different finetuning steps, multi-view variants, and even other VLA models (π0.5), showing broad utility.
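Here is a minimal sketch of that per-component MSE breakdown, assuming the same illustrative 20-D index layout used earlier (positions, 6-D rotations, and gripper angles per arm); the paper’s exact ordering may differ.

```python
# Per-component action MSE, split over the illustrative 20-D layout used above
# (indices are an assumption, not the paper's documented ordering).
import numpy as np

COMPONENTS = {
    "position": [0, 1, 2, 10, 11, 12],                    # left + right tip xyz
    "rotation": list(range(3, 9)) + list(range(13, 19)),  # left + right 6-D rotation
    "gripper":  [9, 19],                                  # left + right jaw angle
}

def per_component_mse(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred/target: (N, 16, 20) predicted vs. ground-truth action chunks."""
    err = (pred - target) ** 2
    return {name: float(err[..., idx].mean()) for name, idx in COMPONENTS.items()}

# Toy usage with random chunks just to show the shapes involved.
rng = np.random.default_rng(0)
print(per_component_mse(rng.normal(size=(40, 16, 20)), rng.normal(size=(40, 16, 20))))
```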
Surprising Findings:
- Novel Prompt Composition: Despite training on single handovers, SurgWorld performs coherent two- and three-time handovers from text alone, indicating strong compositional generalization.
- Single-View Synthetic Helps Multi-View Policies: Even though generated videos are single-view, they still improve multi-view robot policy training—suggesting the value lies in motion patterns, not just viewpoints.
Bottom Line with Context:
- SurgWorld improves video realism (FVD and VBench), earns higher clinical realism scores, adapts well with only five real trajectories, and produces synthetic data that reliably boosts downstream policy accuracy—akin to turning a small tutoring session into a full library of practice problems that actually match the exam.
05 Discussion & Limitations
Limitations:
- Embodiment Finetuning: SurgWorld and the IDM must be finetuned for each new surgical robot and setup, requiring some real data and engineering time.
- Pseudo-Kinematic Noise: IDM labels are estimates, not ground truth; residual errors could nudge policies toward slightly off trajectories if not mixed with real demos.
- Dataset Scope: SATA, while large and detailed, doesn’t cover all procedures or rare edge cases (e.g., unusual anatomies, instruments, or complications).
- Physics Gaps: Even realistic videos can’t fully capture soft-tissue physics (e.g., deformation, bleeding dynamics) or haptic cues that might matter for advanced tasks.
Required Resources:
- Compute for world-model finetuning (diffusion transformer with LoRA), IDM training, and policy finetuning.
- Expert time for high-quality text annotations (if expanding SATA to new tasks).
- A small but representative set of real trajectories for each robot embodiment and camera setup.
When NOT to Use:
- Tasks Demanding Haptics: If force feedback is essential (e.g., delicate membrane peeling), video-only learning may be insufficient.
- Zero Real Data: Without any real samples for the target robot/camera, transfer might be unreliable.
- Out-of-Scope Prompts: Prompts describing tools or anatomy outside the trained distribution may cause hallucinations or unsafe plans.
Open Questions:
- Stronger Action Grounding: Can we integrate action-conditioned generation directly (joint video+action world models) to reduce reliance on IDM?
- Better Tissue Physics: How to blend learned video models with soft-body simulators or differentiable physics for truer dynamics?
- Safety Guarantees: How to layer formal safety constraints or verification atop VLA policies trained with synthetic labels?
- Data Expansion: What’s the minimal real data needed per robot to unlock robust generalization? How far can we push multi-view from single-view synthesis?
- General Surgical Autonomy: Can this pipeline scale from suturing primitives to multi-step procedures (e.g., cholecystectomy) with reliable step-level autonomy?
06 Conclusion & Future Work
Three-Sentence Summary: SurgWorld turns unlabeled surgical videos into useful training fuel by generating realistic, text-aligned rollouts and inferring the missing actions with an inverse dynamics model. Mixing these synthetic pairs with a small number of real demonstrations trains stronger VLA policies that perform better on real surgical robots. This offers a scalable, safer path to surgical autonomy without requiring massive, hard-to-collect paired datasets.
Main Achievement: The first integrated framework that connects surgical world modeling, fine-grained text grounding, and pseudo-kinematics generation to directly improve real robot policies.
Future Directions: Expand SATA to more procedures and edge cases; improve IDM precision or incorporate joint video–action world models; explore multi-view synthesis and soft-tissue physics; and add safety layers for clinical deployment. Investigate end-to-end training loops where policies influence which synthetic data to generate next (active data generation).
Why Remember This: SurgWorld shows how to unlock the vast ocean of unlabeled surgical videos by pairing them—synthetically and at scale—with actions. That shift turns data scarcity into data abundance, speeding progress toward consistent, teachable, and safer surgical skills for robots that assist real clinicians.
Practical Applications
- Pretrain surgical assistants on needle pickup and handover using synthetic, text-guided rollouts.
- Augment small hospital datasets with SurgWorld videos plus IDM pseudo-kinematics to boost policy accuracy.
- Practice rare or risky scenarios (e.g., tricky puncture angles) synthetically before attempting them on phantoms or simulators.
- Rapidly adapt policies to new tools or camera setups by finetuning SurgWorld and IDM with a few trajectories.
- Create multi-step training curricula (one-, two-, three-time handovers) from prompts to build robust sequencing skills.
- Benchmark and stress-test policies by generating controlled variations in lighting, tissue motion, or tool approach paths.
- Support multi-view policy training even when synthetic data is single-view by focusing on motion coverage.
- Prototype instruction-following behaviors by iterating on prompts without expensive new data collection.
- Generate educational clips that pair visual behavior with estimated actions for trainee surgeons and engineers.
- Use pseudo-labeled data to compare different VLA architectures (e.g., GR00T vs. π0.5) fairly and cheaply.