Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Key Summary
- Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.
- It uses reinforcement learning (RL) so the model can practice many possible edits and get rewarded only when the object’s geometry matches the instruction.
- Instead of needing thousands of expensive before/after examples, it learns from input-only images plus a smart, object-focused reward that checks displacement, angle, and size.
- A special GRPO method creates multiple “what-if” edit paths (rollouts) from the same image so the model can compare and improve faster.
- An “early-exit” trick skips the unhelpful parts of the diffusion process, cutting training time while keeping edit quality high.
- Talk2Move beats strong baselines in moving and resizing objects and wins human preferences by a large margin, while matching or beating rotation precision on average error.
- It preserves the background much better than many general-purpose editors, which often redraw the whole scene.
- The team built a data pipeline that auto-generates scenes, instructions, and a small set of supervised pairs to kick-start learning.
- Results hold on both synthetic and real images, showing better spatial accuracy and scene coherence.
- This approach can extend to other generative models for more controllable, verifiable image edits.
Why This Research Matters
This work makes image editing feel like giving directions to a helpful friend: say what you want, and the exact object moves, turns, or resizes while the rest stays put. It lowers the barrier for creators, teachers, and everyday users who don’t want to learn masks, handles, or 3D software. Because it learns from rewards instead of expensive paired data, it’s far more scalable and practical. The early-exit trick makes training faster and greener, which is important for responsible AI. Better geometric control means more reliable layouts for design, advertising, e-commerce, AR/VR, and robotics perception. The approach is general and could guide other generators to be more controllable and verifiable. Over time, this can turn AI image tools from “artsy but vague” into professional, precise assistants.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how you might tell a friend, “Slide the mug to the left,” and they know to move just the mug, not the whole table? We want computers to understand and do that inside pictures.
🥬 The Concept (Text-guided scene editing): It means asking an image editor to change only certain parts of a picture (like one object) using natural language.
- How it works: 1) You give an image and a text instruction. 2) The model figures out which object you meant. 3) It changes that object’s position, angle, or size while trying to keep everything else the same. 4) It returns the edited image.
- Why it matters: Without good text-guided editing, you’d have to hand-draw masks, drag points, or use 3D tools. That’s hard for most people.
🍞 Anchor: “Move the mug to the left, from on the table to on the laptop” makes the mug slide to the laptop without repainting the whole room.
🍞 Hook: Imagine rearranging furniture in your bedroom. You can push the chair, turn the lamp, or pick a bigger rug—those are basic moves.
🥬 The Concept (Geometric transformations): These are the simple, physical changes to an object: translation (move), rotation (turn), and resizing (make bigger/smaller).
- How it works: 1) Identify the object. 2) Choose the type of change (move/turn/scale). 3) Apply the change by the right amount and direction. 4) Keep the rest of the scene intact.
- Why it matters: If the editor can’t do these cleanly, objects drift, duplicate, or the background gets messed up.
🍞 Anchor: Turning a toy car 90 degrees and sliding it in front of a toy house is exactly rotation plus translation.
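To make the three moves concrete, here is a minimal sketch (not the paper's code) that applies translation, rotation, and resizing to an object's bounding-box corners with plain 2D affine math; the box and numbers are made up for illustration.

```python
# A toy sketch (not the paper's code): translation, rotation, and resizing of
# an object's bounding-box corners via a 2D affine transform.
import numpy as np

def transform_corners(corners, dx=0.0, dy=0.0, angle_deg=0.0, scale=1.0):
    """Rotate/scale the box about its own center, then translate it."""
    center = corners.mean(axis=0)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (corners - center) @ (scale * rot).T + center + np.array([dx, dy])

# A unit-square "toy car": turn it 90 degrees and slide it two units left,
# exactly the rotation-plus-translation from the anchor example.
box = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
print(transform_corners(box, dx=-2.0, angle_deg=90.0))
```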
🍞 Hook: Think of teaching a robot a trick by showing perfect before/after examples again and again. That’s tedious and expensive.
🥬 The Concept (Paired supervision scarcity): It means we don’t have many “before image + exact after image” pairs for each instruction (like “rotate 90°”).
- How it works: Datasets rarely include precise, object-level edits; making them by hand or video is costly.
- Why it matters: Without pairs, simple training that matches pixels (MSE) can’t teach precise geometry changes.
🍞 Anchor: If you only have the original photo of a lamp and not the edited version turned 90°, it’s hard for a model to learn this exact move by copying.
🍞 Hook: When you erase noise from a messy picture to reveal a clean one, you’re doing something like “diffusion denoising.”
🥬 The Concept (Diffusion image editing backbones): These models start from noisy images and step-by-step remove noise to get an image that matches your text.
- How it works: 1) Begin with noise. 2) Follow a sequence of steps that predict and subtract noise. 3) End with a sharp image.
- Why it matters: Diffusion models are powerful and controllable, but following fine-grained spatial instructions purely from text can be tricky.
🍞 Anchor: Asking for “a red mug on a laptop” works great, but “move that exact mug left onto the laptop” is much harder.
🍞 Hook: Training by only counting pixel differences is like grading a drawing by how many dots match, not whether the chair actually moved to the left.
🥬 The Concept (Pixel-level loss limits): Losses like MSE measure per-pixel similarity and can miss whether the right object moved, rotated, or resized.
- How it works: 1) Compare pixels. 2) Penalize differences. 3) Average the penalty.
- Why it matters: The model may smear or repaint the scene to reduce pixel error instead of moving the intended object precisely.
🍞 Anchor: The chair might look similar overall, but it never actually shifted to the left—yet the pixel score might still look okay.
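A tiny numeric illustration of this failure mode, using toy arrays rather than anything from the paper: the "repainted" scene scores a lower MSE than the correct edit even though the chair never moved, while a mask-based center check measures exactly what the instruction asked for.

```python
# Toy arrays, not data from the paper: MSE prefers the scene that was merely
# repainted (dimmed) over the scene where the chair actually moved left,
# while a mask-center check captures the shift the instruction asked for.
import numpy as np

H, W = 64, 64
scene = np.zeros((H, W)); scene[20:30, 40:50] = 1.0   # the "chair"
moved = np.roll(scene, -15, axis=1)                   # chair shifted 15 px left
repainted = scene * 0.82                              # chair untouched, scene dimmed

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def center_col(img):
    return float(np.argwhere(img > 0.5).mean(axis=0)[1])

print("MSE    moved:", round(mse(scene, moved), 4),
      "  repainted:", round(mse(scene, repainted), 4))
print("shift  moved:", center_col(scene) - center_col(moved),
      "  repainted:", center_col(scene) - center_col(repainted))
```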
🍞 Hook: What if the model could try many edits, get told which try was closer to the instruction, and keep getting better—like practicing a sport?
🥬 The Concept (Reinforcement Learning, RL): A way for a model to learn by trying actions and receiving rewards for good results.
- How it works: 1) The model makes an edit. 2) A reward judges how well it followed the instruction. 3) The model updates itself to do better next time. 4) Repeat.
- Why it matters: RL doesn’t need exact before/after pairs; it only needs a good “judge” of success.
🍞 Anchor: If the reward says “you moved the mug left by the right amount,” the model learns to repeat that successful move.
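Here is a toy version of that practice loop, assuming a one-dimensional "edit" (how far to slide the mug) so everything fits in a few lines; it illustrates the feedback idea only, not the paper's algorithm.

```python
# A toy version of try -> reward -> improve (not the paper's algorithm):
# the editor's only "action" is how far to slide the mug along x.
import numpy as np

rng = np.random.default_rng(0)
target_shift = -0.30     # "move the mug left by 0.3 of the image width"
policy_shift = 0.0       # the editor's current tendency

for _ in range(100):
    attempts = policy_shift + 0.05 * rng.standard_normal(8)  # try 8 edits
    rewards = -np.abs(attempts - target_shift)               # closer = better
    best = attempts[np.argmax(rewards)]                      # the judged winner
    policy_shift += 0.5 * (best - policy_shift)              # practice that move

print(round(policy_shift, 2))  # should settle near the instructed -0.30
```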
- The world before: People used dragging handles, 3D lifts, or big unified models. Dragging needs human clicks and skill, 3D pipelines can be slow and finicky, and unified models can follow text but may redraw backgrounds or miss precise geometry.
- The problem: How do we move, rotate, and resize named objects using just text, without massive paired data and without destroying the scene?
- Failed attempts: Pixel matching (MSE) ignores geometry, pure text conditioning often blurs spatial precision, and datasets of exact geometric edits are scarce.
- The gap: We need a way to practice edits without pairs, get feedback about the geometry itself, and spend compute only on the most useful parts of the editing process.
- Real stakes: Everyday users want simple text edits (“make the plant 1.5× bigger”), creators want layout control without masks, and robots or AR apps need precise placements. Talk2Move fills this gap with RL over group rollouts, object-centric rewards, and a time-saving “early-exit” that focuses learning on the steps that matter most.
02 Core Idea
🍞 Hook: Imagine coaching a soccer team by letting them try several plays at once, then rewarding the play that best reached the goal and practicing those moves more.
🥬 The Concept (Key insight): Let the image editor explore multiple edit attempts per instruction and reward it only when the specific object’s movement, rotation, or size matches the text—then update the editor to favor those successful moves.
- How it works: 1) Generate diverse edit paths (rollouts). 2) Score each path with object-centric, spatial rewards. 3) Update the model toward the best-performing paths using GRPO. 4) Skip unhelpful steps to save time (early-exit).
- Why it matters: This removes heavy dependence on paired data and directly teaches geometry, not just pixels.
🍞 Anchor: For “Rotate the traffic light 90° CCW,” the model tries a few rotations; the one closest to 90° gets a higher reward; the model shifts its behavior toward that success.
Three analogies for the same idea:
- Map app analogy: 🍞 Hook: You try several routes from home to school. 🥬 Core: GRPO lets the model try multiple “routes” (rollouts) to the edited image; the travel time (reward) decides the winner. 🍞 Anchor: The route that reaches “mug on laptop” fastest is chosen, and next time similar routes are favored.
- Cooking analogy: 🍞 Hook: You tweak a recipe’s ingredient ratios a few ways and taste each version. 🥬 Core: The spatial reward is your taste test, scoring which version nailed “1.5× bigger cake layer.” 🍞 Anchor: The chef (model) adopts the winning ratio for future cakes.
- Classroom analogy: 🍞 Hook: Students try different problem-solving steps; the teacher marks the best approach. 🥬 Core: The model attempts several denoise steps; the reward is the teacher’s grade on geometric accuracy; GRPO updates the “class notes.” 🍞 Anchor: Over time, the class reliably gets the right “angle” and “distance.”
Before vs After:
- Before: Editors followed style words better than exact spatial commands; backgrounds often got repainted; training needed costly pairs.
- After: The model practices geometry directly via rewards, keeps the scene stable, and needs fewer paired examples thanks to RL exploration.
🍞 Hook: You know how focusing on the most important parts of a lesson saves time and boosts learning?
🥬 The Concept (Early-exit/active step sampling): A trick to skip diffusion steps that add little learning signal and focus on steps where edits most affect geometry.
- How it works: 1) Measure which denoising step’s outcomes vary most in reward. 2) Pick that as the “exit step.” 3) Use shortcuts to jump past later steps during training.
- Why it matters: It can halve the compute cost while keeping or even improving edit quality.
🍞 Anchor: If step 4 is where layout is decided for translation, practice mostly there instead of wasting time perfecting textures later.
🍞 Hook: Judging a robot’s fetch by asking, “Did it grab the right toy and bring it the right distance?” is better than, “Do the camera pixels look similar?”
🥬 The Concept (Spatially grounded rewards): Scores that check the moved object’s displacement, orientation, and size—separately from the rest of the scene.
- How it works: 1) Find the object with segmentation. 2) Measure center shift, rotation angle, or size ratio. 3) Compare to the instruction’s target. 4) Give a higher reward to better matches.
- Why it matters: It teaches geometry precisely and keeps the background preserved.
🍞 Anchor: “Move the chair left to be beside the dresser” earns reward for leftward shift and being near the dresser, not for repainting the wall.
Building blocks (in simple pieces):
- 🍞 Hook: Like playing many quick scrimmages. 🥬 GRPO rollouts: Try multiple noisy denoise paths; update toward higher-reward ones; clip updates to stay stable. 🍞 Anchor: The team practices the winning play.
- 🍞 Hook: Like highlighting key steps in math homework. 🥬 Off-policy step evaluation: Probe which diffusion step has the biggest reward spread; that’s where learning bites. 🍞 Anchor: Focus drills on that step.
- 🍞 Hook: Like using a ruler to check distance, a protractor for angle, and a scale for size. 🥬 Specialist reward tools: Segmentation for finding the object, depth/orientation models for 3D hints, and simple math to score accuracy. 🍞 Anchor: Precise tools make precise coaching possible.
Why it works (intuition): By exploring multiple edit attempts per prompt, the model doesn’t have to guess one brittle path. The reward looks at object-level geometry—not pixels—so it optimizes the thing we actually care about. Early-exit keeps practice time where it’s most useful. Together, they teach reliable, interpretable edits that match text instructions with less data and less scene damage.
03 Methodology
At a high level: Input image + text → (1) SFT cold start to learn basic moves → (2) RL with flow-based GRPO creates multiple edit attempts → (3) Spatial rewards score geometry → (4) Early-exit focuses training on key steps → Output edited image.
- SFT cold start with LoRA 🍞 Hook: Before riding a bike on hills, you first learn balance on flat ground. 🥬 The Concept (SFT cold start with LoRA): A quick fine-tune so the model learns basic object-aware edits before RL.
- How it works: 1) Insert small LoRA adapters into attention/normalization/linear layers of the editor backbone (QwenImageEdit). 2) Train briefly on a tiny set of paired examples for translation, rotation, and resizing. 3) Freeze big parts (text encoder, VAE, ViT) to keep it lightweight. 4) Save this as the starting checkpoint.
- Why it matters: It stabilizes RL by giving the model a sense of where objects are and how edits roughly work. 🍞 Anchor: After a few lessons, the model can already nudge a mug or scale a tree a bit, even if not perfect.
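The sketch below shows the LoRA mechanics on a single linear layer, assuming plain PyTorch and toy sizes; the real recipe attaches adapters like this across attention, normalization, and linear layers of the QwenImageEdit backbone and trains only the adapter weights during the cold start.

```python
# A minimal LoRA adapter on one linear layer (plain PyTorch, toy sizes); the
# backbone weight stays frozen and only the low-rank A/B matrices train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the backbone weight
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a low-rank correction learned during the SFT cold start.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)            # same shape as the base layer
```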
- Tiny paired data creation 🍞 Hook: If you can’t film a full sports season, shoot a few drills. 🥬 The Concept (Synthetic supervision for kick-start): Create limited, high-quality pairs for each transform.
- How it works: 1) Translation/rotation pairs: generate short videos from a reference image to simulate realistic moves; filter for accuracy and consistency; keep ~800 translation and 43 rotation pairs. 2) Resizing pairs: use editing models to scale objects; filter to ~110 pairs.
- Why it matters: Even a small, curated set teaches strong priors without massive labeling. 🍞 Anchor: A short clip of “chair turning 90°” is enough to learn “this is what 90° looks like.”
- RL with flow-based GRPO and rollouts 🍞 Hook: Think of trying five different swings at the ball and keeping the one that goes straight. 🥬 The Concept (Flow-based GRPO rollouts): During denoising, inject small noise at steps to create multiple candidate edit paths.
- How it works: 1) Start from the SFT checkpoint. 2) For each input, generate G variations by slightly perturbing steps in the flow-based (ODE-like) sampler. 3) Each variation yields an edited image. 4) Compare them with rewards; update the policy toward higher-reward transitions using a clipped objective (PPO-style).
- Why it matters: The model explores different spatial outcomes and learns what truly matches the text. 🍞 Anchor: For “make the music stand 1.5× bigger,” the best-sized attempt wins and teaches the model to scale precisely.
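A hedged sketch of the rollout step, with a stand-in velocity function instead of the real editor: perturbing each step of a flow/ODE-style sampler with a little noise yields G distinct candidate edits from the same input, which the rewards can then rank.

```python
# A stand-in rollout generator: one noisy pass of a flow/ODE-style sampler.
# velocity_fn is a toy drift, not the real editing model.
import numpy as np

def rollout(x0, velocity_fn, steps=10, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Deterministic flow step plus a small perturbation -> exploration.
        x = x + dt * velocity_fn(x, t) \
              + noise_scale * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

velocity_fn = lambda x, t: -x                      # toy drift
latent = np.random.default_rng(42).standard_normal(16)
G = 4
candidates = [rollout(latent, velocity_fn, seed=g) for g in range(G)]
# Each candidate decodes to a different edited image; the rewards rank them.
```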
- Off-policy step evaluation (finding the exit step) 🍞 Hook: If one part of your routine practice gives the biggest improvements, you spend more time there. 🥬 The Concept (Extrinsic uncertainty via reward variance): A way to measure which denoising step most affects success.
- How it works: 1) On a few images, perturb individual steps and record the reward distribution for early exit points. 2) The step with the highest reward variance is the most informative. 3) Choose it as the exit step (K).
- Why it matters: It tells us where geometry decisions are made (e.g., early steps for layout, late steps for fine rotation), so we can focus compute there. 🍞 Anchor: For translation/resizing, step 4 (of 10) often matters most; for rotation, later steps (like step 10) matter more.
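In code, the probe can be as simple as the sketch below: collect rewards from rollouts perturbed at each candidate step and pick the step whose rewards spread out the most. The numbers are illustrative, arranged to echo the translation case where step 4 matters most.

```python
# Pick the exit step K as the step whose perturbed-rollout rewards vary most.
import numpy as np

def pick_exit_step(reward_samples):
    """reward_samples[k] = rewards from rollouts perturbed at denoising step k."""
    variances = [float(np.var(r)) for r in reward_samples]
    return int(np.argmax(variances)), variances

# Illustrative numbers echoing the translation case: step 4 is most informative.
reward_samples = [
    [0.50, 0.52, 0.51],   # step 1: perturbing here barely changes the outcome
    [0.48, 0.55, 0.60],
    [0.30, 0.70, 0.55],
    [0.10, 0.90, 0.45],   # step 4: outcomes spread widely -> layout decided here
    [0.40, 0.52, 0.47],
]
K, variances = pick_exit_step(reward_samples)
print("exit step:", K + 1)    # 1-indexed, so this prints 4
```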
- Active step sampling (early-exit shortcuts) 🍞 Hook: If the main idea is solved by step 4, don’t spend time polishing steps 5–10 during training. 🥬 The Concept (Early-exit/shortcuts): Stop sampling after the exit step and jump to the final prediction.
- How it works: 1) During training, sample only up to step K. 2) Use shortcuts to predict the final image from K. 3) Compute rewards there. 4) Update the policy using those results.
- Why it matters: Cuts training time (up to about 2× in practice) while preserving strong learning signals. 🍞 Anchor: It’s like leaving class after the key lecture and skipping the end-of-period announcements.
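One way such a shortcut can look for a flow-style sampler is a single-jump extrapolation from the state at step K, as sketched below; the paper's exact shortcut may differ, and velocity_fn is again a stand-in.

```python
# One common single-jump shortcut for flow-style samplers: extrapolate from
# the step-K state straight to the final prediction with one velocity call.
# The paper's exact shortcut may differ; velocity_fn is a stand-in.
import numpy as np

def early_exit_prediction(x_k, t_k, velocity_fn):
    """Estimate the final latent/image from the state at exit step K."""
    return x_k + (1.0 - t_k) * velocity_fn(x_k, t_k)

velocity_fn = lambda x, t: -x
x_k, t_k = np.ones(16), 0.4            # state after 4 of 10 steps (t = 0.4)
x_final_hat = early_exit_prediction(x_k, t_k, velocity_fn)
# Rewards are computed on x_final_hat, so steps 5-10 are skipped in training.
```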
- Spatially grounded rewards (object-first scoring) 🍞 Hook: Use measuring tools, not just eyeballing it. 🥬 The Concept (Object-centric scoring): Evaluate geometry by localizing the object and measuring the exact change.
- How it works: 1) Use text-driven segmentation to get masks/boxes for the target object in original and edited images. 2) Translation: measure center shift and direction; depth models add forward/backward cues. 3) Rotation: use an orientation estimator to align axis/direction/angle. 4) Resizing: compare size ratio between before/after boxes. 5) Normalize scores to a unified reward.
- Why it matters: Rewards teach exactly what the instruction asks (distance, angle, scale), keeping background intact. 🍞 Anchor: “Move the can from right holder to left holder” is scored by leftward distance and correct final location—not by how shiny the dashboard looks.
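The reward terms can be written as small functions of quantities that the segmenter and orientation estimator provide, as in the sketch below; the exponential shaping and the sigma constants are illustrative choices, not the paper's exact formulas.

```python
# Object-centric reward terms from masks/boxes; the exponential shaping and
# sigma values are illustrative choices, not the paper's exact formulas.
import numpy as np

def translation_reward(center_before, center_after, target_vec, sigma=0.2):
    error = np.linalg.norm((center_after - center_before) - target_vec)
    return float(np.exp(-error / sigma))            # 1.0 = perfect displacement

def rotation_reward(angle_before, angle_after, target_deg, sigma=20.0):
    error = abs(((angle_after - angle_before) - target_deg + 180) % 360 - 180)
    return float(np.exp(-error / sigma))

def resize_reward(box_before, box_after, target_scale, sigma=0.1):
    side = lambda b: np.sqrt((b[2] - b[0]) * (b[3] - b[1]))   # sqrt of box area
    error = abs(side(box_after) / side(box_before) - target_scale)
    return float(np.exp(-error / sigma))

# "Move the mug left by 0.3", "rotate 90 degrees", "make it 1.5x bigger":
print(translation_reward(np.array([0.6, 0.5]), np.array([0.3, 0.5]),
                         np.array([-0.3, 0.0])))
print(rotation_reward(0.0, 85.0, 90.0))
print(resize_reward((0.2, 0.2, 0.4, 0.4), (0.15, 0.15, 0.45, 0.45), 1.5))
```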
- Data generation for RL without costly pairs 🍞 Hook: You can practice many sentences about the same scene, like “move left,” “move right,” “make bigger.” 🥬 The Concept (Prompt diversity for exploration): Build lots of input-only samples by generating scenes and auto-making instructions.
- How it works: 1) Use an LLM to write scene descriptions with multiple objects. 2) Use text-to-image to synthesize images. 3) Use a VLM to create templated edit instructions. 4) Train RL using these input-only samples plus spatial rewards.
- Why it matters: Dramatically reduces the need for expensive paired edits, enabling scalable training. 🍞 Anchor: One living-room image can support many instructions like “move the vase left” or “rotate the chair 90°.”
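A toy sketch of the templated-instruction step, assuming the upstream pieces (LLM scene writer, text-to-image model, VLM object lister) have already produced an image and its object names; the templates and value ranges here are invented examples.

```python
# Toy templated-instruction generator; templates and value ranges are invented.
import random

TEMPLATES = {
    "translate": "Move the {obj} to the {direction}.",
    "rotate":    "Rotate the {obj} {angle} degrees {sense}.",
    "resize":    "Make the {obj} {scale}x bigger.",
}

def make_instruction(obj, rng):
    kind = rng.choice(list(TEMPLATES))
    slots = {
        "translate": {"direction": rng.choice(["left", "right"])},
        "rotate":    {"angle": rng.choice([45, 90, 180]),
                      "sense": rng.choice(["clockwise", "counterclockwise"])},
        "resize":    {"scale": rng.choice([0.5, 1.5, 2.0])},
    }[kind]
    return kind, TEMPLATES[kind].format(obj=obj, **slots)

rng = random.Random(0)
for obj in ["vase", "chair", "lamp"]:        # one scene, many instructions
    print(make_instruction(obj, rng))
```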
Putting it all together, the training loop looks like this: input image + instruction → generate G rollout candidates by perturbing denoise steps → compute spatial rewards from object metrics → compute group-relative advantages (higher reward = better) → update policy with clipping for stability → (periodically) run off-policy step evaluation to keep the exit step sharp → repeat. The result is an editor that understands instructions like a careful stagehand: it moves the prop you asked for, by the amount and direction you said, and leaves the set untouched.
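The core update inside that loop can be summarized in a few lines, assuming the rollout rewards and the policy's per-rollout probability ratios are already available; the names and constants below are illustrative, not the authors' implementation.

```python
# One GRPO-style update in miniature: group-relative advantages plus a
# PPO-style clipped surrogate. Names and constants are illustrative.
import numpy as np

def grpo_loss(rewards, ratios, clip_eps=0.2):
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    # Group-relative advantage: each rollout is compared to its own group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * adv
    # The surrogate is maximized, so the loss is its negative mean.
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Four rollouts for one instruction; the best-matching edit earned reward 0.9.
print(grpo_loss(rewards=[0.1, 0.4, 0.9, 0.3], ratios=[1.0, 1.1, 0.95, 1.02]))
```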
04 Experiments & Results
🍞 Hook: To see if a new move works in basketball, you test it in scrimmages against good players and keep score.
🥬 The Concept (The tests): We measured whether the editor really moved, turned, or resized the correct object as asked, how far it moved, how close the angle was, and how much the background stayed the same.
- How it works: 1) Translation: Translation Distance (how far the object center moved) and Accuracy (passed all checks: right direction, same object identity, no duplicate left behind, and background preserved). 2) Rotation: Rotation Error (average angle mismatch) and Accuracy (within ±20°). 3) Resizing: Scale Error (difference from target scale) and Accuracy (within ±10%). 4) Background: CLIP similarity and L1 distance.
- Why it matters: Raw percentages are clearer when you know what they mean for actual object control and scene quality.
🍞 Anchor: “Rotate 90° CCW” counts as accurate if the edit is between 70° and 110° CCW, and smaller angle errors are better.
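The pass/fail checks translate directly into small predicates like the ones below, using the thresholds quoted above (±20° for rotation, ±10% for resizing); the full benchmark also verifies direction, object identity, duplicates, and background preservation.

```python
# The pass/fail checks as small predicates, using the thresholds quoted above;
# the ±10% resize rule is read here as relative to the target scale.
def rotation_accurate(measured_deg, target_deg, tol=20.0):
    error = abs(((measured_deg - target_deg) + 180) % 360 - 180)
    return error <= tol

def resize_accurate(measured_scale, target_scale, tol=0.10):
    return abs(measured_scale - target_scale) <= tol * target_scale

print(rotation_accurate(78, 90))   # True: within the 70-110 degree CCW window
print(resize_accurate(1.42, 1.5))  # True: within ±10% of the 1.5x target
```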
Datasets and competitors:
- Synthetic benchmark: 3,200 training samples over 800 images (with many text variations), plus a 100-sample test set per transform type.
- Real images: A curated set from OpenImages-V6.
- Baselines: GPT-Image-1 (a strong general editor), Flux-Kontext, Bagel, and QwenImageEdit (our backbone).
Scoreboard with context (synthetic benchmark):
- Translation: Talk2Move hits 76.67% accuracy, like scoring an A when others sit in the B–C range; it also shows the largest correct movement (Translation Distance 0.667). Human judges pick it 57.5% of the time, far above the others.
- Rotation: Talk2Move achieves the lowest mean Rotation Error (0.2861). GPT-Image-1 achieves a slightly higher accuracy under the ±20° pass/fail rule (32.33% vs our 29.55%), but our lower average error means our angles are typically closer to the target, and human judges prefer Talk2Move 68.75% of the time.
- Resizing: Talk2Move reaches 49.17% accuracy, crushing others (e.g., 15.28% for a strong unified baseline). Human preference is also strongly in our favor (63.89%).
Real images (OpenImages-V6):
- Translation: 53.85% accuracy, higher than baselines; movement magnitude is also strongest.
- Rotation: 31.25% accuracy and competitive error.
- Resizing: Accuracy comparable to the best baselines in this tougher real-world set.
Background preservation:
- Talk2Move maintains low L1 differences (lower is better) across tasks and keeps CLIP similarity high, indicating good scene identity. GPT-Image-1 often repaints scenes (lower CLIP similarity), matching the visual observation that it changes lighting/style more.
Ablations (what changed what):
- SFT vs RL: SFT (small paired set) gives a helpful jumpstart. Adding RL on top further boosts movement distances and success rates. With only one-tenth the data, SFT alone underperforms, while RL + spatial rewards still reaches competitive results—showing strong data efficiency.
- Early-exit efficiency: For translation, choosing exit at step 4 cuts iteration time by 49% vs full sampling and 14% vs sliding window, yet yields the best accuracy and distance. Reward curves also converge faster with shortcuts.
- Reward models: A VLM-only reward looks optimistic but unstable. Spatially grounded rewards (segmentation + orientation/depth + rules) yield lower rotation error and higher accuracy—more reliable coaching.
Surprises:
- Rotation needs later steps (like step 10) more than translation/resizing—suggesting angle fine-tuning lives later in the denoise process, while layout/size lock in earlier.
- Skipping steps not only saves time but can improve learning by avoiding uninformative or even confusing stages.
- Human raters strongly prefer Talk2Move even when a raw precision metric is close—likely because scene coherence and object identity preservation look better to the eye.
Bottom line: Talk2Move consistently improves object-level control, especially for moving and resizing, keeps backgrounds intact, and learns efficiently thanks to rollouts, spatial rewards, and early-exit.
05 Discussion & Limitations
🍞 Hook: Even a great tool has places where it shines and places where it struggles.
🥬 The Concept (Limitations): What Talk2Move can’t do yet, and when to be careful.
- How it works: 1) Dependence on object detection/segmentation: if the object is tiny, occluded, or ambiguous, rewards can be noisy. 2) Rotation along 3D axes in 2D images is hard; orientation estimates can be tricky for symmetric objects. 3) Long, chained edits (many steps in one instruction) may still stress the model. 4) Unusual instructions outside the templates (e.g., “tilt 37° around a diagonal axis while half-scaling”) may not be well handled. 5) Training still needs GPUs, though early-exit helps.
- Why it matters: Knowing boundaries keeps expectations realistic and guides the next improvements.
🍞 Anchor: If a “man” is partly hidden behind a wall, both finding him and measuring his rotation become less reliable.
Resources required:
- A diffusion editor backbone (here, QwenImageEdit) with LoRA fine-tuning capacity.
- A modest set of curated pairs for SFT (hundreds, not thousands).
- RL training with multiple rollouts per sample; 16 H200 GPUs were used in the paper (about 160 GPU hours per subtask). Smaller setups can work with fewer rollouts and careful scheduling.
When not to use:
- If you only need style changes (color, lighting) and don’t care about precise geometry, simpler editors may suffice.
- If you need exact 3D-consistent rotations for CAD-like precision, a 3D-aware pipeline may be better.
- If your scene has extreme occlusions or microscopic objects, detection and rewards may be too noisy.
Open questions:
- Can we make rotation as robust as translation/resizing under stricter metrics while keeping scene coherence?
- How far can we push data efficiency—fewer rollouts, smarter rewards, or semi-supervised signals?
- Can we unify multiple sequential edits in one go (e.g., “move then rotate then scale”) with consistent rewards?
- How well does this generalize to videos, 3D assets, or AR scenes with temporal consistency needs?
- Can rewards learn from human preference directly while staying geometry-aware and interpretable?
Overall, Talk2Move sets a strong foundation for text-to-geometry editing with clear next steps in robustness, efficiency, and generalization.
06 Conclusion & Future Work
Three-sentence summary: Talk2Move teaches an image editor to move, rotate, and resize exactly the object you name using reinforcement learning with group rollouts and object-centric rewards. It saves time and data by focusing training on the most informative diffusion steps (early-exit) and by judging success with precise geometric measurements instead of pixels. The result is accurate edits and preserved backgrounds on both synthetic and real scenes, often preferred by humans over strong baselines.
Main achievement: Turning text-guided geometric object editing into an RL problem with GRPO rollouts and spatially grounded rewards—plus an early-exit mechanism that doubles training efficiency—so the model learns geometry directly and keeps scenes coherent.
Future directions: Improve rotation under strict pass/fail rules while keeping low average error; make rewards even more robust to occlusion and symmetry; extend to video/3D; support multi-step composite instructions; and blend geometry-aware rewards with human preference signals. Broader use across GANs or autoregressive editors could unlock standardized, verifiable control in many generators.
Why remember this: It’s a clean recipe—practice many edit attempts, measure geometry precisely, and train where learning bites the most. That combination moves text-based editing from “pretty pictures” toward “precise, controllable tools,” which is exactly what creators, designers, and interactive apps need.
Practical Applications
- Interface-free product placement: Move a product next to a logo or onto a shelf using only text instructions.
- E-commerce photo fixes: Align, rotate, and resize items to standard views without re-shooting images.
- Storyboarding and layout design: Quickly arrange props and characters in scenes by typing directions.
- Education content creation: Teachers adjust diagrams (move arrow, rotate shape, scale label) using simple text.
- AR/VR scene prep: Precisely position virtual objects in real backdrops for demos and prototypes.
- Robotics perception simulation: Create controlled variations (pose, size, position) of objects in scenes for training.
- Marketing and social media: Keep the background intact while repositioning items to match brand guidelines.
- UI/UX mockups: Nudge icons or components to exact spots while preserving the rest of the screen.
- Scientific figure editing: Rotate/resize annotated elements without distorting surrounding data.
- Game asset previews: Arrange 2D scene composites with text to test levels or cutscenes quickly.