MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Beginner
Yixin Wan, Lei Ke, Wenhao Yu et al. · 12/11/2025
arXiv · PDF

Key Summary

  • This paper creates MotionEdit, a high-quality dataset that teaches AI to change how people and objects move in a picture without breaking their looks or the scene.
  • It also introduces MotionNFT, a training method that rewards models for matching the correct motion using optical flow, like comparing how pixels should move before and after an edit.
  • Previous image editing systems were great at changing colors and styles but weak at changing poses, actions, and interactions in a believable way.
  • MotionEdit is built from real video frames, so the 'before' and 'after' images show natural, physically sensible motion changes.
  • A new Motion Alignment Score (MAS) checks if the model's edit moves the right parts in the right direction and by the right amount.
  • Across strong baselines (like FLUX.1 Kontext and Qwen-Image-Edit), MotionNFT boosts overall edit quality and motion faithfulness without hurting general editing skills.
  • The dataset covers six motion types, such as pose, orientation, locomotion, and subject–object interactions, making evaluation more complete.
  • Human studies and MLLM-based metrics both show that MotionNFT’s improvements are meaningful and preferred.
  • The method is lightweight to deploy and adds no extra cost at inference time.
  • Hard cases remain: multi-subject coordination and identity preservation in crowded scenes are still challenging.

Why This Research Matters

Motion-focused editing turns AI from a color-and-style painter into a director of actions and interactions. That unlocks faster animation, more accurate video generation, and better training tools where bodies move realistically. Creators can fix a pose or stage a scene without redrawing everything, saving time and keeping characters on-model. In education and healthcare simulations, correct motion (not just pretty images) can improve learning and safety. For AR/VR and games, believable pose changes deepen immersion. Overall, this work puts motion understanding at the center of visual AI, which is essential for dynamic media.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how in comics, the same hero must look like themselves while striking very different poses from panel to panel? If the pose changes but the face or costume suddenly changes too, it feels wrong.

🥬 The Concept (Image Editing, the starting point): What it is: Image editing is when a computer changes a picture based on a written instruction. How it works: 1) Read the instruction, 2) Decide what needs to change in the picture, 3) Create a new picture that follows the instruction, 4) Keep the rest of the picture the same. Why it matters: Without careful editing, the result can look fake—wrong colors, missing objects, or a pose that makes no sense.

🍞 Anchor: For example, “Make the sky pink” is a simple edit; “Make the runner bend down to tie his shoe” is much harder because the shape of the body must change believably.

The World Before: Over the last few years, AI got surprisingly good at following text instructions to tweak pictures. These systems could turn a blue shirt red, add a hat, or shift styles from ‘photo’ to ‘cartoon.’ But they mainly learned ‘static’ edits—appearance-only changes. They weren’t trained well on action or pose changes that you’d expect in animation or sports scenes—things like turning a head, lifting a hand, or two people facing each other. When asked for motion edits, many models either did nothing (editing inertia), changed the wrong thing (motion misalignment), or broke the subject’s identity (distorted hands or faces).

🍞 Hook: Imagine a dance teacher asking a student to turn left, but the student keeps changing costumes instead. That’s what many image editors did: they kept changing appearance, not motion.

🥬 The Concept (Motion-Centric Image Editing): What it is: Motion-centric image editing means changing actions, poses, or interactions in a single image while keeping identity, background, and style consistent. How it works: 1) Read a motion instruction (e.g., “raise the right arm”), 2) Understand which parts must move, by how much, and in what direction, 3) Redraw the image so the motion looks natural and physically possible, 4) Keep everything else the same. Why it matters: Without motion awareness, the model may only repaint colors or add artifacts instead of actually changing pose or action.

🍞 Anchor: If you ask, “Make the boy and girl face each other,” a good motion edit rotates their bodies and heads toward each other—without changing their clothes or the room.

The Problem: High-quality motion edit supervision was missing. Existing datasets focused on static edits. Even the ones that included motion often had low-quality ‘ground truth’—for example, the final image didn’t actually show the requested action, or it had artifacts and identity changes. When you train on messy answers, you learn messy habits.

Failed Attempts: People tried training with mixed datasets, or with general reward signals from multimodal language models (MLLMs). While those rewards helped keep images sharp and on-topic, they didn’t teach “how things should move.” The models learned to describe the edit more than to perform the right geometric change.

🍞 Hook: Think of grading a dance routine only by costume neatness and smiling—not by whether the steps matched the music. That’s how motion edits were being judged.

🥬 The Concept (Optical Flow): What it is: Optical flow is a way to measure how each pixel would move from one image to another. How it works: 1) Take two images, 2) For every pixel, estimate its shift (direction and distance), 3) Get a flow field showing motion across the image. Why it matters: Without flow, a system can’t easily tell if a hand actually moved up or if only the sleeve color changed.

🍞 Anchor: It’s like tracking where each confetti piece falls between two photos of a parade to see the wind’s direction and strength.
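
To make the flow idea concrete, here is a minimal sketch of estimating dense optical flow between two frames with OpenCV's classical Farneback method. The paper uses a stronger learned estimator (UniMatch), but the output is the same kind of per-pixel (dx, dy) field; the parameter values below are generic defaults, not settings from the paper.

```python
import cv2
import numpy as np

def dense_flow(img_before: np.ndarray, img_after: np.ndarray) -> np.ndarray:
    """Estimate per-pixel motion from img_before to img_after.

    Returns an (H, W, 2) array where flow[y, x] = (dx, dy) in pixels.
    """
    gray_a = cv2.cvtColor(img_before, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_after, cv2.COLOR_BGR2GRAY)
    # Classical Farneback flow stands in here for the learned estimator
    # (UniMatch) used in the paper; both produce a dense flow field.
    return cv2.calcOpticalFlowFarneback(
        gray_a, gray_b, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Direction and strength of motion at each pixel of a returned flow field:
# magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```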

The Gap This Paper Fills: The authors created MotionEdit, a targeted dataset for motion edits built from real video frames so the motion is natural and physically reasonable. Then, they built MotionNFT, a training method that uses optical flow to reward edits that move the right parts the right way.

Real Stakes: Why should anyone care? Motion editing powers animation, frame-controlled video generation, sports analysis, robotics visualization, and AR/VR content. If a game character’s arm doesn’t move as instructed, or a trainer app can’t show a proper squat change, users lose trust. In creative workflows, precise motion edits reduce reshoots, save time, and unlock new storytelling. Better motion edits also help safety—e.g., ensuring a medical simulation shows the correct procedure motion, not just a pretty picture.

🍞 Anchor: Imagine an animation tool where you say, “Turn the dancer’s head left and raise the right arm slightly,” and it does exactly that—no extra sparkles, no face glitches, just the right motion. That’s the promise of MotionEdit and MotionNFT.

02Core Idea

🍞 Hook: Picture teaching a robot to tidy a room. If you only praise it for making the room “look good,” it might just hide the mess under the rug. But if you reward it for moving each item to the right place, it learns real cleaning.

🥬 The Concept (The Aha!): What it is: Treat each motion edit like a tiny before–after video, use optical flow to measure the intended movement, and reward the model when its edit matches that movement (direction and magnitude). How it works: 1) Use pairs of images (original and correct target) from videos, 2) Compute ground-truth optical flow between them, 3) Compute optical flow between the original and the model’s edited image, 4) Score how well these flows match, 5) Train the model with negative-aware fine-tuning so good motion gets pulled closer and bad motion gets pushed away. Why it matters: Without motion-aware rewards, the model keeps doing safe appearance tweaks and avoids the real pose change you asked for.

🍞 Anchor: If you say, “Lower the lion’s head to look down,” the model gets a high score only if the pixels around the head move downward by the right amount and direction, while the rest stays consistent.

Three Analogies:

  1. Map and Compass: The ground-truth flow is a compass showing where parts should go; the model’s flow is its guess; the closer they match, the better the navigation.
  2. Dance Choreography: The target pose is a dance move; optical flow checks if each limb hits the beat and angle.
  3. Connect-the-Dots: Each pixel dot should travel from A to B; the score favors edits that connect most dots correctly.

🍞 Anchor: For “Make two friends face each other,” the heads and torsos should rotate toward center; optical flow will show curved motions near shoulders and necks—if those curves match, the edit passes.

Before vs After:

  • Before: Models excelled at color/style tweaks but hesitated or fumbled on motion, sometimes distorting anatomy or changing the wrong limb.
  • After: MotionNFT guides training with a flow-based reward, nudging models to produce precise, believable motion while preserving identity and background.

🍞 Anchor: Previously, asking “Make the robot wave with its left arm” often raised the right arm or just added shine; after MotionNFT, the left arm lifts in the correct direction, with the robot still looking like itself.

Why It Works (Intuition):

  • Motion as Geometry: Actions are structured pixel movements. Optical flow directly captures geometry (direction + magnitude), so it’s a natural ruler for motion edits.
  • Balanced Signals: Pair the motion ruler with an MLLM semantic judge (for fidelity/preservation/coherence); one checks “Did it move right?” the other checks “Does it still look good and on-topic?”
  • Negative-Aware Learning: By pulling toward good examples and pushing away from bad ones, the model learns faster and avoids lazy, near-static edits.

🍞 Anchor: It’s like grading homework with two stamps: a “moved correctly” stamp and a “neat and accurate” stamp. The student improves at both form and function.

Building Blocks:

  • 🍞 Hook: Imagine tracing arrows over two frames of a cartoon to show how each spot moved.
  • 🥬 Optical Flow (what/how/why): What it is: pixel motion arrows. How: compute flow from original→edited and original→ground truth. Why: compares intended vs produced motion. Anchor: If a foot should step forward, its arrow should point forward by the right length.
  • 🍞 Hook: Think of a cookbook of action recipes.
  • 🥬 MotionEdit Dataset: What it is: video-derived pairs with natural motion and matching prompts. How: segment videos, filter with an MLLM for setting consistency and significant motion, rewrite clean instructions. Why: gives reliable supervision with real, coherent motion. Anchor: “Turn to face the camera” pairs show the same background but a head/torso rotation.
  • 🍞 Hook: Like coaching that praises correct moves and flags mistakes.
  • 🥬 MotionNFT: What it is: motion-guided negative-aware fine-tuning. How: compute a motion reward from flow magnitude match, direction match, and a penalty for under-moving; combine it with the MLLM reward; use NFT to pull the model toward the high-reward velocity and push it away from the low-reward one. Why: prevents edits that don’t really move and fixes wrong directions. Anchor: The car falling off a cliff finally moves down and forward, instead of just getting recolored.

🍞 Anchor: Together, these parts turn motion editing from a color-change problem into a geometry-and-identity problem—measured, trained, and improved with the right rulers and coaches.

03Methodology

At a high level: Input (original image + motion instruction + ground-truth target) → [Compute ground-truth optical flow] → [Generate candidate edits and compute predicted flow] → [Motion reward + MLLM reward] → [Negative-aware fine-tuning update] → Output (a model that performs accurate motion edits).

Step 1: Build MotionEdit, a motion-focused dataset.

  • What happens: The authors collect videos from modern text-to-video sources because they are high-resolution and visually stable. They cut each video into short chunks and take the first and last frame of a chunk to capture a meaningful motion change while keeping the scene consistent. An MLLM judge filters pairs to ensure: (1) setting consistency (same background/view), (2) significant motion or interaction change, and (3) subject integrity (no artifacts). Then a rewrite step turns motion summaries into clean, human-style edit prompts (e.g., “Make the large white rabbit run forward dynamically”). A rough sketch of the chunk-and-pair step appears after this list.
  • Why this step exists: Training needs trustworthy ‘before–after’ pairs that show real, physically plausible motion and keep identity consistent. Without this, models learn from bad examples and produce bad edits.
  • Example: Original = woman holding a cup; Target = woman sipping. Prompt = “Make the woman sip from her coffee cup, looking down.” The background stays the same; the head tilts and the cup meets the mouth.
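
As a rough illustration of the chunk-and-pair step referenced above, the sketch below slices a video into fixed-length chunks and pairs each chunk's first and last frame. The MLLM filtering and prompt rewriting are only mentioned in comments, and chunk_seconds is an assumed parameter, not the paper's setting.

```python
import cv2

def first_last_frame_pairs(video_path: str, chunk_seconds: float = 2.0):
    """Yield (first_frame, last_frame) pairs from fixed-length chunks of a video.

    Sketches only the pairing idea; the paper's pipeline additionally filters
    pairs with an MLLM judge (setting consistency, significant motion, subject
    integrity) and rewrites instructions, which is not shown here.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    chunk_len = max(int(fps * chunk_seconds), 2)

    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        if len(frames) == chunk_len:
            yield frames[0], frames[-1]   # 'before' and 'after' of one motion change
            frames = []
    cap.release()
```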

🍞 Hook: Like tracing arrows between two frames of a flipbook. 🥬 Optical Flow Ground Truth: What it is: a pixel-wise motion map from original to target. How: run a strong flow estimator (e.g., UniMatch) on the original and ground-truth images, then normalize to compare across sizes. Why: this becomes the ‘answer key’ for how parts should move. 🍞 Anchor: If a hand rises by 20 pixels diagonally up-right, the flow arrow for those pixels points up-right with length ~20.
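
One plausible way to normalize a flow field so it can be compared across image sizes is to express displacements as fractions of width and height; the paper's exact normalization may differ.

```python
import numpy as np

def normalize_flow(flow: np.ndarray) -> np.ndarray:
    """Convert pixel displacements into resolution-independent fractions.

    A 20-pixel shift in a 200-pixel-wide image becomes 0.1, the same relative
    motion as a 40-pixel shift in a 400-pixel-wide image.
    """
    h, w = flow.shape[:2]
    out = flow.astype(np.float32).copy()
    out[..., 0] /= w   # horizontal displacement as a fraction of width
    out[..., 1] /= h   # vertical displacement as a fraction of height
    return out
```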

Step 2: Generate candidate edits and compute predicted flow.

  • What happens: The current model (e.g., FLUX.1 Kontext or Qwen-Image-Edit) produces multiple edited images for each prompt. For each candidate, compute optical flow between the original and the candidate.
  • Why this step exists: To compare the model’s motion against the true motion and assign a motion score.
  • Example: Instruction: “Turn the character to face away from the camera.” Some candidates may only change lighting or make tiny head shifts; better ones rotate head/torso more strongly. The predicted flow will reveal who actually moved correctly.

Step 3: Score motion with three terms.

  • Motion magnitude consistency: Compare how much things moved (length of arrows). If the target shows a big arm lift but the edit barely moves it, score goes down.
  • Motion direction consistency: Compare which way things moved (arrow angles), focusing more on regions that truly moved in the ground truth. If the edit rotates left when the target rotates right, score goes down.
  • Movement regularization: Gently penalize near-zero motion compared to the target, preventing lazy edits that change little.
  • Why these steps exist: Together they ensure the edit moves the right parts, in the right direction, by the right amount. A sketch of all three terms follows this list.
  • Example with data: Suppose ground truth average motion magnitude is 0.12 (normalized), but the model’s is 0.01—then the regularizer fires, pushing the model to move more next time.
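
The sketch below illustrates the three scoring terms on normalized flow fields. The equal weighting and exact formulas are assumptions for illustration; the paper's reward uses its own weighting and region emphasis.

```python
import numpy as np

def motion_reward(pred_flow: np.ndarray, gt_flow: np.ndarray, eps: float = 1e-6) -> float:
    """Illustrative three-term motion reward on normalized (H, W, 2) flow fields.

    Weights and formulas are assumptions, not the paper's exact definitions.
    """
    pred_mag = np.linalg.norm(pred_flow, axis=-1)
    gt_mag = np.linalg.norm(gt_flow, axis=-1)

    # 1) Magnitude consistency: did things move about as much as they should?
    mag_term = 1.0 - np.mean(np.abs(pred_mag - gt_mag)) / (np.mean(gt_mag) + eps)

    # 2) Direction consistency: cosine similarity of flow vectors, weighted
    #    toward pixels that genuinely move in the ground truth.
    cos = np.sum(pred_flow * gt_flow, axis=-1) / (pred_mag * gt_mag + eps)
    weights = gt_mag / (np.sum(gt_mag) + eps)
    dir_term = float(np.sum(weights * cos))

    # 3) Movement regularizer: penalize near-static edits when the target moves.
    under_move = max(0.0, float(np.mean(gt_mag) - np.mean(pred_mag)))
    reg_term = -under_move / (np.mean(gt_mag) + eps)

    # Equal weighting of the three terms is an assumption for illustration.
    return float((mag_term + dir_term + reg_term) / 3.0)
```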

🍞 Hook: Think of two teachers: one checks geometry (did it move right?), the other checks presentation (does it look good?). 🥬 MLLM Generative Reward: What it is: a semantic/quality judge measuring fidelity (follow the instruction), preservation (keep identity/background), and coherence (looks natural), plus an overall score. How: Ask a strong vision-language model to rate the edited image given the instruction and original. Why: motion must be correct and the image must still look good. 🍞 Anchor: If a penguin’s beak turns black during a pose change, preservation drops even if the motion is correct.

Step 4: Combine rewards and update with Negative-aware Fine-Tuning (NFT).

  • What happens: Mix the motion reward (flow-based) and the MLLM reward, typically 50/50. Use DiffusionNFT-style training: form an implicit positive direction that pulls the model toward high-reward behavior and an implicit negative direction that pushes away from low-reward behavior. Group-wise normalization stabilizes rewards across prompts. A rough sketch of this reward mixing follows this list.
  • Why this step exists: The model needs a clear signal about which outputs to imitate and which to avoid, balancing motion correctness and visual quality.
  • Example: If a candidate gets high motion alignment but low preservation (distorted face), the MLLM reward keeps it from being over-valued; if it looks pretty but didn’t move, the motion reward prevents complacency.
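
A conceptual sketch of the reward mixing and group-wise normalization described above. The split into "positives" and "negatives" only stands in for the DiffusionNFT update, which actually operates on the model's predicted velocities; the function name, 50/50 default, and zero threshold are assumptions for illustration.

```python
import numpy as np

def combine_and_split(motion_rewards, mllm_rewards, mix: float = 0.5, eps: float = 1e-8):
    """Mix motion and MLLM rewards, normalize within one prompt's candidate
    group, and mark candidates as implicit positives (pull toward) or
    negatives (push away). Conceptual sketch only."""
    r = mix * np.asarray(motion_rewards, dtype=float) \
        + (1.0 - mix) * np.asarray(mllm_rewards, dtype=float)
    r = (r - r.mean()) / (r.std() + eps)      # group-wise normalization
    positives = np.where(r > 0)[0]            # above-average candidates
    negatives = np.where(r <= 0)[0]           # below-average candidates
    return r, positives, negatives

# Usage: r, pos, neg = combine_and_split([0.7, 0.2, 0.5], [0.8, 0.9, 0.4])
```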

Step 5: Inference (using the improved model).

  • What happens: At test time, the model takes a new image and instruction and produces an edit. There is no extra cost added by MotionNFT; the training already taught the model to move correctly.
  • Why this step exists: To demonstrate practical benefit—edits should now execute the requested motion while preserving identity and scene.
  • Example: “Make the girl place her right hand on her hip.” The output shows a clear right-arm bend to the hip, with hairstyle, outfit, and background unchanged.

Secret Sauce: The clever part is converting motion edits into a measurable geometric task using optical flow, then blending that ruler with an MLLM quality judge inside a negative-aware fine-tuning loop. This avoids two extremes: beautiful but static edits, and correctly moving but messy edits. The training steers models to be both motion-faithful and visually reliable.

What breaks without each step:

  • No dataset from videos: Ground truths are noisy; models learn wrong motions.
  • No flow reward: Models revert to safe appearance tweaks instead of real motion.
  • No direction term: Arms might move but in the wrong direction.
  • No magnitude term: Edits under-move or over-stretch.
  • No regularizer: Edits become almost static.
  • No MLLM reward: Motion may be right, but identities and details drift.
  • No negative-aware update: Learning is slower and less stable, with more bad habits sticking around.

04Experiments & Results

The Test: The authors evaluate on MotionEdit-Bench (the held-out portion of MotionEdit), asking models to perform motion-centric edits and measuring results with both an MLLM-based generative judge (Fidelity, Preservation, Coherence, and Overall) and a deterministic Motion Alignment Score (MAS) from optical flow. They also compute pairwise win rates—how often one model’s output is preferred over another’s.

🍞 Hook: Imagine a talent show with two judges: one checks the choreography steps (optical flow), the other checks stage presence and costume (MLLM). You also see which dancer the audience prefers in head-to-head battles (win rate).

🥬 The Concept (Motion Alignment Score, MAS): What it is: A 0–100 score telling how closely the model’s edit motion matches the ground-truth motion, blending magnitude and direction, with a hard zero for near-static failures. How it works: 1) Compute flow alignment error, 2) Normalize, 3) Convert to a higher-is-better score. Why it matters: It directly measures motion faithfulness beyond just looking good.

🍞 Anchor: If the instruction says “move the car off the cliff angled downward,” MAS is high only when the car’s pixels shift down and outward by the right amount—not when the car just gets shinier.
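
Below is a hedged sketch of an MAS-style computation: turn a flow-alignment error into a higher-is-better 0–100 score, with a hard zero when the edit stays near-static while the target clearly moves. The static threshold and error normalization here are assumptions, not the paper's exact constants.

```python
import numpy as np

def motion_alignment_score(pred_flow: np.ndarray, gt_flow: np.ndarray,
                           static_thresh: float = 0.01, eps: float = 1e-6) -> float:
    """Illustrative MAS-style score in [0, 100]; constants are assumptions."""
    pred_mag = np.linalg.norm(pred_flow, axis=-1)
    gt_mag = np.linalg.norm(gt_flow, axis=-1)

    # Hard zero for near-static failures: the target moves but the edit barely does.
    if pred_mag.mean() < static_thresh and gt_mag.mean() >= static_thresh:
        return 0.0

    # Alignment error: average endpoint error relative to how much the target moves.
    err = np.linalg.norm(pred_flow - gt_flow, axis=-1).mean() / (gt_mag.mean() + eps)

    # Map the error to a higher-is-better 0-100 score.
    return float(100.0 * np.clip(1.0 - err, 0.0, 1.0))
```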

The Competition: They compare common open-source editors (Instruct-P2P, MagicBrush, AnyEdit, UltraEdit, Step1X-Edit, BAGEL, UniWorld-V1) and two strong modern editors (FLUX.1 Kontext and Qwen-Image-Edit). Then they apply MotionNFT to FLUX.1 Kontext and Qwen-Image-Edit to see if it boosts performance.

The Scoreboard (with context):

  • FLUX.1 Kontext Overall goes from 3.84 to 4.25 with MotionNFT, a +10.68% relative gain—think moving from a solid B to an A−.
  • Its MAS rises from 53.73 to 55.45, showing more accurate motion alignment—not just prettier pictures.
  • Pairwise win rate jumps from about 58% to about 65%, meaning audiences pick the MotionNFT version far more often.
  • Qwen-Image-Edit also improves: Overall 4.65 → 4.72 and MAS 56.46 → 57.23, with the highest win rate among tested methods.
  • Compared to diffusion editors focused on appearance (e.g., MagicBrush, AnyEdit), MotionNFT-backed models do much better on motion tasks.

Surprising Findings:

  • MLLM-only rewards can improve look-and-feel but sometimes stall or even regress on motion alignment mid-training. Adding flow-based motion rewards with MotionNFT prevents this and steadily improves MAS.
  • MotionEdit data shows much larger motion changes (about 0.19 average normalized magnitude) compared to prior datasets (~0.03–0.07), proving it’s truly motion-centric and more challenging.
  • Despite specializing in motion, MotionNFT training does not harm general editing. On a broad benchmark (ImgEdit-Bench), it often slightly boosts scores, showing it’s a good ‘add-on skill’ rather than a trade-off.

Qualitative Highlights:

  • Editing inertia vanishes: models actually rotate heads, bend torsos, and reposition objects as instructed.
  • Motion misalignment drops: the correct limb moves in the correct direction (e.g., raising the left arm—not the right).
  • Identity is preserved better than many baselines: fewer face or hand distortions when poses change.

Human Alignment: A small human study shows good agreement among annotators and solid alignment between human preferences and the MLLM-based generative metric, supporting the validity of the evaluation protocol.

Bottom line: MotionNFT delivers consistent, measurable improvements in both motion faithfulness (MAS) and overall perceived quality and preference, particularly on a dataset designed to reveal motion weaknesses.

05Discussion & Limitations

Limitations:

  • Multi-subject coordination remains tough. When an edit targets one subject among several (e.g., “turn the standing crew member to face left while the seated member remains unchanged”), models still confuse who should move or how to layer depth correctly.
  • Identity preservation under big motions can slip in crowded or textured scenes—small parts like fingers or accessories sometimes warp.
  • Complex 3D spatial reasoning (occlusions, depth ordering, reorientation with foreshortening) is still challenging with only 2D supervision.
  • Flow estimation itself can be imperfect in low-texture or reflective regions; if ground-truth flow is ambiguous, rewards can be noisy.

Required Resources:

  • A capable image editing backbone (e.g., FLUX.1 Kontext or Qwen-Image-Edit).
  • GPUs for post-training with Negative-aware Fine-Tuning; the paper uses memory-saving tricks like FSDP and gradient checkpointing.
  • A lightweight flow model (e.g., UniMatch) on training nodes, plus an MLLM scorer (e.g., a served vision-language model) for semantic rewards.
  • The MotionEdit dataset for training/evaluation.

When NOT to Use:

  • Single-shot appearance-only edits where motion is irrelevant—flow-based rewards don’t add value and only consume extra compute during training.
  • Highly speculative physics or extreme out-of-distribution motions (e.g., “invert the person inside-out”); flow guidance can’t teach impossible geometry.
  • Real-time on-device training scenarios with no compute budget for reward models; MotionNFT is a training-time method, not an inference-time trick.

Open Questions:

  • How to extend from single-frame edits to multi-frame consistency so that edits are temporally stable across a sequence?
  • Can we add light 3D cues (depth, surface normals, skeleton priors) to make hard rotations and self-occlusions more reliable?
  • Could human-in-the-loop feedback refine motion rewards for edge cases where optical flow is uncertain?
  • How to disentangle subject identity from pose even more robustly, especially for hands, faces, and small props?
  • Can we scale MotionEdit further with automated validation to cover fine-grained categories like micro-expressions and finger articulations?

06Conclusion & Future Work

Three-Sentence Summary: This paper reframes image editing to focus on motion—changing actions, poses, and interactions while preserving identity and scene. It introduces MotionEdit, a video-derived dataset that provides natural, high-quality motion supervision, and MotionNFT, a training framework that uses optical flow rewards to teach models to move the right parts in the right way. Together, they raise both motion faithfulness and overall perceived quality on strong editing backbones.

Main Achievement: Turning motion edits into a measurable, trainable target via optical flow—and blending that geometric ruler with an MLLM quality judge inside negative-aware fine-tuning—so models learn to execute precise, believable motion edits instead of just repainting appearance.

Future Directions: Add temporal consistency across frames, integrate lightweight 3D priors, improve flow in tricky regions, and expand the dataset to harder multi-subject, high-occlusion scenes. Human-in-the-loop or physics-informed rewards may help with edge cases like complex contact, balance, and force cues.

Why Remember This: MotionEdit and MotionNFT show that motion can be scored and taught—not just described—transforming image editing from color-matching to action-understanding. This unlocks more faithful animation, controllable video generation, and realistic storyboarding where characters do exactly what you ask, and still look like themselves.

Practical Applications

  • Animation cleanup: Quickly fix character poses (e.g., hand-on-hip, head turn) while keeping style and identity.
  • Storyboard editing: Reposition actors to face each other or adjust body language for clearer storytelling.
  • Frame-controlled video synthesis: Edit key frames to guide motion between frames more reliably.
  • Sports replay tools: Adjust a player’s pose to demonstrate correct form while preserving their identity and uniform.
  • AR try-on and photo apps: Change user poses naturally (e.g., turn, lift arm) without warping faces or outfits.
  • Robotics training visuals: Show correct manipulation poses (grasping, reaching) in instructional images.
  • Medical education: Depict precise hand and instrument movements for procedures without changing patient identity.
  • Marketing visuals: Reposition products and hands to better showcase features while keeping the scene consistent.
  • Game modding: Tweak NPC poses and interactions in stills used for cutscenes or promos without re-rendering everything.
  • Safety manuals: Demonstrate correct body mechanics (lift, bend, reach) in a single image clearly and believably.
#motion-centric image editing #optical flow #MotionEdit dataset #MotionNFT #negative-aware fine-tuning #flow matching models #image editing benchmark #MLLM-based evaluation #motion alignment score #pose editing #subject–object interaction #orientation change #locomotion #diffusion post-training #video-derived supervision