DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Key Summary
- DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving.
- Its key trick is to think and do at the same time (Continuous Inference) so the robot never waits between action chunks.
- It also only executes the freshest plan (Latent-aware Action Streaming), throwing away actions that became stale during thinking.
- A compact 0.4B-parameter design with a convolutional vision encoder keeps vision tokens small and speeds up decisions.
- The new DOM benchmark supplies 200K simulated and 2K real episodes of moving-object tasks across 2.8K scenes and 206 objects.
- In simulation, DynamicVLA's average success rate is 47.06%, with big wins in closed-loop reactivity (60.5%) and motion generalization (65.0%).
- Real-world tests on two different robot arms show much higher success than prior VLA baselines on dynamic tasks.
- Ablations show both Continuous Inference and LAAS are necessary and complementary; the 360M language backbone is the best latency-ability trade-off.
- The method runs at about 88 Hz on a single RTX A6000 and needs only 1.8 GB of GPU memory at inference.
- Limitations include short- to medium-horizon focus, rigid-body assumptions, and reduced robustness under extreme disturbances.
Why This Research Matters
Real life moves: bottles roll, toys slide, and tools bounce, so robots must respond in real time rather than act on a plan made from the last snapshot. DynamicVLA shows that a latency-aware loop, one that thinks while doing and executes only the freshest actions, can make robots far more reliable around motion. This unlocks safer home assistance, steadier handoffs in hospitals, and quicker adaptation in warehouses and factories. The DOM benchmark gives the community a common, motion-rich yardstick so everyone can build and compare better dynamic skills. By focusing on timing as much as on perception, the field gets a roadmap for practical, real-world robots. As robots share spaces with people and pets, reacting on time is just as important as reasoning well.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how catching a rolling ball is harder than picking up a ball that's not moving? If you blink or hesitate, the ball has already rolled somewhere else.
🥬 Filling (The Actual Concept):
- What it is: Dynamic Object Manipulation is when a robot must handle things that move while it is looking, thinking, and acting.
- How it works: The robot continuously sees the scene, understands a language instruction (like "put the orange in the white tray"), plans a short sequence of arm moves, and executes them, all while the object keeps moving.
- Why it matters: If the robot is slow or waits too long, its plan becomes outdated and it will miss, drop, or misplace the object.
🍞 Bottom Bread (Anchor): Imagine a robot asked to grab a rolling tangerine and place it in a tray. If it plans for where the tangerine was 0.2 seconds ago, it will pinch air.
🍞 Top Bread (Hook): Picture a chef looking at the ingredients while reading the recipe and moving their hands, all at once. If they had to stop cooking each time they read a line, dinner would be late.
🥬 Filling (The Actual Concept):
- What it is: Vision-Language-Action (VLA) models are robot brains that connect seeing (vision), understanding instructions (language), and moving (action).
- How it works: The camera images and the instruction become tokens the model processes to pick the next arm motions; it updates this loop as new images arrive.
- Why it matters: VLAs make robots flexible: they can follow many different natural-language tasks without reprogramming.
🍞 Bottom Bread (Anchor): If you say, "Place the green apple in the red circle," a VLA uses the pictures to find the apple and the red circle and then moves the arm to do it.
🍞 Top Bread (Hook): Think about sending a text that arrives a bit late: by the time your friend reads it, your plan may no longer fit.
🥬 Filling (The Actual Concept):
- What it is: The Perception-Execution (P.E.) gap is the time difference between when the robot sees the world and when it starts executing the plan it made from that view.
- How it works: While the model thinks (inference), the object keeps moving; by the time the actions are ready, the world has changed, so the earliest actions are already stale.
- Why it matters: Stale actions cause misses, bumps, and failures in dynamic scenes.
🍞 Bottom Bread (Anchor): Planning to close the gripper where the ball used to be is like trying to catch a ball's shadow.
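A quick back-of-the-envelope sketch of how wide the P.E. gap can get (the speed and latency numbers below are assumptions for illustration, not measurements from the paper):

```python
# Illustrative only: how far does a rolling object travel while the robot is still "thinking"?
# The speed and latency values are assumed for this example, not taken from the paper.
object_speed = 0.5        # m/s, a can rolling across a table
inference_latency = 0.15  # s, one forward pass of a large VLA
chunk_duration = 0.80     # s, executing a whole action chunk before re-planning

drift_during_inference = object_speed * inference_latency
drift_with_interchunk_wait = object_speed * (inference_latency + chunk_duration)

print(f"drift during inference alone:        {drift_during_inference * 100:.1f} cm")        # 7.5 cm
print(f"drift if the old chunk runs to completion: {drift_with_interchunk_wait * 100:.1f} cm")  # 47.5 cm
```

Even the smaller drift is on the order of a gripper opening, which is why plans made from a stale view so often pinch air.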
The world before: Robots were mostly tested on still-life scenes: pick a block, place a block. Big VLA models could be slow but still succeed because nothing moved while the model was thinking. Some dynamic attempts used special-purpose trackers and hand-coded rules that worked only on predictable conveyor belts or games with "big target margins" (like batting a ball anywhere on a table).
The problem: Real-life objects roll, slide, bounce, and change direction after collisions. Robots need fast perception, quick updates, and plans that stay in sync with reality. Traditional VLAs cause inter-chunk waiting: they finish executing a whole chunk of actions before thinking again. That delay widens the P.E. gap and makes the robot late to respond.
Failed attempts:
- Handcrafted pipelines: brittle and tied to specific motions (e.g., straight conveyor lines).
- Giant VLAs: strong reasoning but too slow for fast-moving targets.
- Real-time demos with easy targets: impressive speed, but tasks didn't require precise 6-DoF control or tight timing.
The gap this paper fills: It provides a VLA that is both capable and quick, and a way to overlap thinking with doing. It also enforces that the robot only executes the freshest actions so that perception and execution stay matched in time. Finally, it introduces a large dynamic dataset and benchmark so everyone can measure progress fairly.
Real stakes:
- Homes: grabbing that rolling bottle before it falls; cleaning up toys that are still moving; safely handing items to people or pets.
- Warehouses and factories: picking from moving lines, adapting to slips and bumps, and placing precisely into moving bins.
- Hospitals: steady handoffs, catching sliding tools, and adapting quickly to unexpected motion for safety.
- Everyday reliability: fewer misses, fewer drops, and more trust in robots that move among moving things.
02 Core Idea
🍞 Top Bread (Hook): Imagine riding a bike: you look ahead, steer now, and keep adjusting as the road changes. You don't stop the bike to think, then go; you think while moving.
🥬 Filling (The Actual Concept):
- What it is: The key insight is to overlap thinking and doing, and to always execute the newest plan so actions match the latest world state.
- How it works: The model continuously runs short inference cycles while current actions are being executed (Continuous Inference). When fresh actions arrive, it discards stale steps and keeps only the parts that align with the now-changed scene (Latent-aware Action Streaming).
- Why it matters: This shrinks delay, keeps control tight, and lets the robot adapt quickly to motion changes.
🍞 Bottom Bread (Anchor): While grasping a rolling can, the robot keeps rethinking the next small moves and throws away any moves made for where the can used to be.
Three analogies:
- Traffic navigation: your map app keeps updating as you drive; it doesn't wait for you to finish an entire old route before recalculating.
- Assembly line: the next part starts getting inspected before the last one leaves the station, and workers follow the latest checklist.
- Live radio: the DJ speaks while watching breaking news; if news changes, the script updates mid-sentence.
Before vs After:
- Before: VLAs planned a full chunk, executed all of it, then planned again, like holding your breath between sentences. Actions were often late.
- After: DynamicVLA plans continuously and executes only the newest valid steps, like talking and breathing at the same time, so reactions stay fresh.
Why it works (intuition, not equations):
- Small, fast brain: a compact 0.4B model with a convolutional vision encoder avoids token explosion across frames and reduces thinking time.
- Overlap loop: by pipelining inference and execution, the robot never sits idle between chunks.
- Freshness rule: by discarding outdated actions and prioritizing newer ones, the plan stays time-aligned with reality.
Building blocks (each as a Sandwich):
🍞 Top Bread (Hook): You know how a tiny scooter is zippy in traffic while a bus is slow to turn?
🥬 Filling (The Actual Concept):
- What it is: Compact Parameterization (0.4B) means the model is small enough to be fast but smart enough for the job.
- How it works: It uses a 360M-language backbone (first 16 layers) plus a slim action expert; fewer tokens and layers mean less compute per decision.
- Why it matters: Speed is survival in dynamic manipulation; a slightly smaller brain that decides on time beats a giant brain that's always late.
🍞 Bottom Bread (Anchor): A 0.4B model running at ~88 Hz can adjust to a can rolling at 0.5 m/s before it escapes.
🍞 Top Bread (Hook): Think of a camera that summarizes the important parts of a scene instead of sending every pixel.
🥬 Filling (The Actual Concept):
- What it is: A Convolutional Vision Encoder (FastViT) compresses images efficiently while keeping spatial structure.
- How it works: It turns multi-frame images into a small set of meaningful tokens without quadratic growth.
- Why it matters: Fewer, better tokens speed up the model and reduce lag.
🍞 Bottom Bread (Anchor): Instead of drowning in pixels from three cameras, the robot quickly gets a compact "map" it can reason with every 1/88th of a second.
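A rough token-budget sketch of why this helps (the resolutions, strides, and camera/frame counts below are assumptions for illustration, not DynamicVLA's exact configuration):

```python
# Illustrative comparison of vision-token counts; sizes, strides, and view counts are assumed.
def vision_tokens(image_size: int, downsample: int, cameras: int, frames: int) -> int:
    """Number of vision tokens handed to the language backbone per decision."""
    tokens_per_image = (image_size // downsample) ** 2
    return tokens_per_image * cameras * frames

# A ViT-style encoder with 16x16 patches vs. a conv encoder with a larger effective stride.
vit_tokens = vision_tokens(image_size=224, downsample=16, cameras=3, frames=2)    # 1176
conv_tokens = vision_tokens(image_size=224, downsample=32, cameras=3, frames=2)   # 294

print(vit_tokens, conv_tokens)  # roughly 4x fewer tokens to attend over on every cycle
```

Since attention cost grows quickly with sequence length, trimming the token count this way buys back latency on every control cycle.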
🍞 Top Bread (Hook): Imagine writing a rough sketch and then refining it step-by-step into a clean drawing.
🥬 Filling (The Actual Concept):
- What it is: A Diffusion-style Action Expert (flow matching) generates a short action chunk that's refined from noise into a precise motion.
- How it works: Start with noisy actions and iteratively denoise them toward the target moves, guided by the latest visualālanguage features.
- Why it matters: This produces smooth, flexible short-horizon action sequences that are easy to update often.
🍞 Bottom Bread (Anchor): To catch a rolling ball, the expert proposes 20 tiny precise steps that can be refreshed before all 20 are used.
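A minimal sketch of the flow-matching idea behind such an action expert (the toy network, chunk length, conditioning size, and number of refinement steps are assumptions for illustration, not DynamicVLA's actual action expert):

```python
# Toy flow-matching-style action decoding (PyTorch); all sizes and the network are placeholders.
import torch
import torch.nn as nn

CHUNK_LEN, ACTION_DIM, COND_DIM = 20, 7, 256   # 20-step chunk of 7-DoF actions (assumed sizes)

class ToyVelocityField(nn.Module):
    """Predicts the velocity that carries noisy actions toward clean ones."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK_LEN * ACTION_DIM + COND_DIM + 1, 512),
            nn.ReLU(),
            nn.Linear(512, CHUNK_LEN * ACTION_DIM),
        )

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond, t], dim=-1)
        return self.net(x).view(-1, CHUNK_LEN, ACTION_DIM)

@torch.no_grad()
def decode_action_chunk(model, cond, num_steps=10):
    """Euler-integrate from Gaussian noise to an action chunk, guided by fused features."""
    batch = cond.shape[0]
    actions = torch.randn(batch, CHUNK_LEN, ACTION_DIM)   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((batch, 1), i * dt)
        velocity = model(actions, cond, t)                # predicted flow at this point
        actions = actions + dt * velocity                 # one small refinement step
    return actions                                        # a short, smooth chunk to execute

fused_features = torch.randn(1, COND_DIM)                 # stand-in for fused vision-language tokens
chunk = decode_action_chunk(ToyVelocityField(), fused_features)
print(chunk.shape)                                        # torch.Size([1, 20, 7])
```

The practical point is that the chunk is short and cheap to regenerate, so a fresh one can be produced many times per second.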
🍞 Top Bread (Hook): Don't wait to finish your entire homework to ask the next question; ask while you're working so you never get stuck.
🥬 Filling (The Actual Concept):
- What it is: Continuous Inference overlaps thinking with doing.
- How it works: As soon as one inference finishes, the next one begins, even while earlier actions are still executing, so there's no dead time between chunks.
- Why it matters: No inter-chunk waiting means quicker reactions and smoother control.
🍞 Bottom Bread (Anchor): The robot keeps moving toward the can while it's computing the next tiny adjustments.
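A minimal threading sketch of this overlap (the model and robot calls are hypothetical stubs; a real system would run the two loops against GPU inference and the arm controller):

```python
# Minimal sketch of Continuous Inference: planning never blocks execution.
# `run_model` and `send_to_robot` are hypothetical stand-ins, not the paper's code.
import threading
import time

latest_chunk = None               # most recently published action chunk
chunk_lock = threading.Lock()

def run_model(observation):
    """Pretend VLA forward pass that returns a short chunk of actions."""
    time.sleep(0.05)              # assume ~50 ms of inference
    return [f"action_{time.monotonic():.3f}_{i}" for i in range(20)]

def send_to_robot(action):
    """Pretend low-level command stream; one action per ~10 ms control tick."""
    time.sleep(0.01)

def inference_loop(get_observation, stop):
    """Start the next inference the moment the previous one finishes: no inter-chunk wait."""
    global latest_chunk
    while not stop.is_set():
        chunk = run_model(get_observation())
        with chunk_lock:
            latest_chunk = chunk  # publish the freshest plan

def execution_loop(stop):
    """Stream actions at the control rate, always reading from the newest published chunk."""
    current, step = None, 0
    while not stop.is_set():
        with chunk_lock:
            chunk = latest_chunk
        if chunk is None:
            time.sleep(0.001)
            continue
        if chunk is not current:  # a fresher chunk arrived while we were executing
            current, step = chunk, 0
        if step < len(current):
            send_to_robot(current[step])
            step += 1

stop = threading.Event()
threading.Thread(target=inference_loop, args=(lambda: "latest frames", stop), daemon=True).start()
threading.Thread(target=execution_loop, args=(stop,), daemon=True).start()
time.sleep(0.3)
stop.set()
```

Which step of the fresher chunk to resume from is exactly what Latent-aware Action Streaming decides; a sketch of that rule follows the next building block.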
🍞 Top Bread (Hook): If new directions arrive, follow those, not the old ones.
🥬 Filling (The Actual Concept):
- What it is: Latent-aware Action Streaming (LAAS) executes only the newest valid actions and overwrites stale ones.
- How it works: It discards any steps scheduled for times already passed and prefers the latest overlapping steps from the newest chunk.
- Why it matters: This keeps actions time-aligned with the moving world and reduces misses.
🍞 Bottom Bread (Anchor): If a can bounces off a box and turns right, the robot immediately switches to the latest right-turn-grab plan and ignores the old straight-line plan.
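A minimal sketch of that freshness rule, under assumed bookkeeping (each action is stamped with the global control step it was planned for); this is illustrative, not the paper's implementation:

```python
# Minimal, illustrative sketch of LAAS-style merging; the step-stamping and bookkeeping
# are assumptions, not the paper's implementation.
def merge_chunks(old_chunk, old_start, new_chunk, new_start, current_step):
    """Build the execution buffer: drop steps already in the past, and wherever the old
    and new chunks overlap, let the newest chunk overwrite the older one."""
    buffer = {}
    for i, action in enumerate(old_chunk):
        step = old_start + i
        if step >= current_step:          # anything scheduled for the past is stale
            buffer[step] = action
    for i, action in enumerate(new_chunk):
        step = new_start + i
        if step >= current_step:
            buffer[step] = action         # freshest chunk wins on overlapping steps
    return [buffer[s] for s in sorted(buffer)]

# Toy example: chunk A was planned to start at global step 100, chunk B at step 105,
# and the controller has already reached step 107 by the time B arrives.
chunk_a = [f"A{i}" for i in range(10)]    # covers steps 100..109
chunk_b = [f"B{i}" for i in range(10)]    # covers steps 105..114
print(merge_chunks(chunk_a, 100, chunk_b, 105, current_step=107))
# ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9']: stale A-steps are gone,
# and B overwrites every step the two chunks share.
```

Run together with the Continuous Inference loop above, this is what keeps the arm acting on predictions that still match where the object actually is.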
03 Methodology
At a high level: Inputs (recent images, instruction, robot state) → Fast vision-language encoding → Action Expert proposes a short action chunk → Continuous Inference keeps new chunks coming → LAAS executes only the freshest steps → Output: smooth, on-time motions.
Step-by-step (each with what/why/example):
- Multiview, short-history perception
- What happens: The robot reads a brief temporal window of images (e.g., frames at t-2 and t), a language instruction, and its own arm pose. FastViT compresses images into a small set of spatially faithful tokens; the instruction and robot state become tokens too.
- Why this exists: Two spaced frames reveal motion (like speed/direction) with almost no latency cost; small token sets keep inference quick.
- Example: Two snapshots of a rolling tangerine tell the model which way and how fast it's moving (a small velocity-from-two-frames sketch appears after this step-by-step list).
- Fusion and lightweight language reasoning
- What happens: Visual, language, and state tokens are concatenated and processed by a compact language backbone (SmolLM2-360M using 16 layers). Linear layers align token dimensions so modules talk fluently.
- Why this exists: A lean backbone is fast enough for dynamic control yet smart enough to ground the instruction in the current scene.
- Example: The tokens encode "grasp the rolling tangerine and put it in the white tray," linking the tangerine and the tray's positions.
- Diffusion-style Action Expert (flow matching)
- What happens: Starting from noise, the expert denoises into a short action chunk (e.g., 20 small steps including gripper state) guided by the fused tokens.
- Why this exists: Short chunks are easy to refresh often; diffusion-style denoising makes smooth, precise micro-motions.
- Example: It outputs 20 tiny end-effector waypoints to intercept the rolling tangerine cleanly and then move toward the tray.
- Continuous Inference (pipelined planning)
- What happens: As soon as one chunk is computed, the next inference starts, even while some actions from the previous chunk are still executing. There's no pause between chunks.
- Why this exists: Eliminating inter-chunk waiting keeps the robot responsive when objects move unpredictably.
- Example: While steps 1-5 of chunk A are being executed, the robot is already computing chunk B.
- Latent-aware Action Streaming (freshness-first execution)
- What happens: When chunk B arrives, any actions from chunk A that refer to time steps already passed are dropped as outdated. If A and B overlap at the same future step, B's actions overwrite A's.
- Why this exists: Objects keep moving during inference; LAAS ensures the robot acts on up-to-date predictions.
- Example: If the tangerine bounces and curves, chunk B's curved intercept replaces chunk A's straight-line intercept.
- Training recipe (3 stages)
- Pre-training: Align vision and language using 150M image-text pairs so the model can talk and see well together.
- Mid-training: Train the full VLA on the DOM dataset (200K dynamic sim episodes) to learn motion-aware manipulation.
- Post-training: Fine-tune on 2K real episodes for embodiment and sensing specifics (Franka and PiPER).
- Why this exists: Start with general multimodal skills, then specialize for dynamic manipulation, then adapt to the real sensors/arms.
- Example: After pre-training on generic images, the model learns grasp-and-place on moving fruits and bottles, then adapts to the real cameras.
- The DOM benchmark and auto data pipelines
- Simulation: 206 objects across 2.8K scenes (25 FPS, multi-view), physics gives ground-truth 6D pose and velocity; a shared 4-stage state machine controls approach, grasp & lift, approach target & place, and reset (a minimal sketch of this state machine appears right after this list).
- Real world: Dual third-person RGB cameras + wrist camera; a real-time "simulator" estimates 6D pose/velocity via tracking and triangulation; the same state machine runs without teleoperation.
- Why this exists: Large, standardized, motion-rich data is missing; the same interface in sim and real reduces sim-to-real friction.
- Example: Rolling cans, bouncing balls, and curved trajectories are captured at scale without needing a human to teleop every attempt.
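As referenced above, here is a minimal sketch of the kind of shared 4-stage collection state machine (stage names follow the list above; the thresholds, observation keys, and controller calls are assumed stubs, not the paper's code):

```python
# Illustrative 4-stage collection loop: approach -> grasp & lift -> approach target & place -> reset.
# Observation keys, thresholds, and controller methods are assumptions for this sketch.
from enum import Enum, auto

class Stage(Enum):
    APPROACH = auto()
    GRASP_AND_LIFT = auto()
    APPROACH_TARGET_AND_PLACE = auto()
    RESET = auto()

def step_state_machine(stage, obs, controller):
    """Advance one control tick: act for the current stage and decide transitions
    from the estimated 6D object state (same interface in sim and real)."""
    if stage is Stage.APPROACH:
        controller.move_toward(obs["object_pose"], obs["object_velocity"])
        if obs["gripper_to_object_dist"] < 0.02:      # assumed 2 cm threshold
            return Stage.GRASP_AND_LIFT
    elif stage is Stage.GRASP_AND_LIFT:
        controller.close_gripper_and_lift()
        if obs["object_grasped"]:
            return Stage.APPROACH_TARGET_AND_PLACE
    elif stage is Stage.APPROACH_TARGET_AND_PLACE:
        controller.move_toward(obs["target_pose"], None)
        if obs["object_in_target"]:
            return Stage.RESET
    elif stage is Stage.RESET:
        controller.go_home()
        return Stage.APPROACH                         # ready for the next episode
    return stage
```

Because the same loop consumes the same estimated 6D object state in simulation and in the real pipeline, episodes can be generated at scale without a human teleoperating every attempt.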
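And, tying back to the first step of the list, a tiny sketch of why two spaced frames carry a motion cue (positions, frame rate, gap, and remaining latency are assumed numbers; the network never computes this explicitly, but the information is recoverable from its inputs):

```python
# Why frames at t-2 and t expose motion: all numbers below are assumed for illustration.
import numpy as np

FPS, FRAME_GAP = 25, 2                          # assume 25 FPS capture, frames two apart
pos_t_minus_2 = np.array([0.40, 0.10, 0.02])    # object position a moment ago (meters)
pos_t = np.array([0.44, 0.10, 0.02])            # object position now

dt = FRAME_GAP / FPS                            # 0.08 s between the two observed frames
velocity = (pos_t - pos_t_minus_2) / dt         # ~0.5 m/s along x

# Extrapolate where the object will be when the next chunk starts executing
# (assume ~30 ms of remaining inference latency):
predicted = pos_t + velocity * 0.03
print(velocity, predicted)
```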
Secret sauce (what makes it clever):
- Latency-aware end-to-end design: small, structured vision tokens; short but expressive action chunks; executing only the freshest steps.
- Overlap and overwrite: planning never blocks execution, and execution never follows stale plans.
- Practical data engine: the same state-machine and 6D state interface across sim and real enable massive dynamic data without handholding.
04 Experiments & Results
The test: The DOM benchmark stresses three abilities under motion: Interaction (react quickly and stay coordinated), Perception (see/ground appearance, spatial relations, and motion cues), and Generalization (handle new objects, new scenes, and new motion patterns). Metrics include Success Rate (SR), Path Length, and Task Time.
The competition: Baselines include Diffusion Policy, OpenVLA-OFT, π0 and π0.5, SmolVLA, GR00T-N1.5, VLA-Adapter-Pro, and VLASH. All are fine-tuned on DOM under the same protocol.
The scoreboard (contextualized):
- Simulation average: DynamicVLA reaches 47.06% SR overall, with a Path Length of 2.50 m and Task Time of 8.53 s, like earning a solid B+ to A- while others score in the D-to-C range on truly moving targets.
- Interaction: Closed-loop Reactivity 60.5%, Dynamic Adaptation 38.5%, Long-horizon Sequencing 40.5%. These are big jumps over the next-best models, showing faster and more stable responses when motion changes or continues for a while.
- Perception: Visual Understanding 51.5%, Spatial Reasoning 48.0%, Motion Perception 33.5%. Performance drops as tasks demand tighter spatial-temporal grounding, but DynamicVLA still leads.
- Generalization: Visual Generalization 59.5%, Motion Generalization 65.0%, Disturbance Robustness 26.5%. DynamicVLA transfers better to new appearances and motion regimes; extreme perturbations remain hard for all.
Real-world highlights:
- On two different robot arms (Franka and PiPER), DynamicVLA consistently beats π0.5, SmolVLA, and VLASH across interaction, perception, and generalization tasks. On perception-heavy tasks that require motion-aware grounding (e.g., "place the faster-rolling can inside the frisbee"), DynamicVLA's advantage is especially clear.
Surprising findings and ablations:
- Bigger isn't always better: A 360M backbone (first 16 layers) is the sweet spot; smaller (135M) lacks capacity, and larger (1.7B) is too slow under real-time constraints.
- Vision tokens matter: FastViT (convolutional encoder) outperforms transformer vision encoders by cutting token counts and latency while preserving structure.
- Temporal gap choice: Using two frames spaced apart (t-2 and t) works better than adjacent frames; single frames lose motion cues.
- CI and LAAS are complementary: Removing either hurts; removing both hurts a lot. Together they shift the system from late and jerky to timely and smooth.
- Portable execution tricks: Adding CI+LAAS to other VLAs improves them too, but the benefit shrinks when the backbone itself is very slow (latency is the limiter).
What these numbers mean in plain words: The robot not only reacts faster, it stays in sync longer, and it handles new looks and new motion styles better than prior systems. When objects speed up, bounce, or curve, DynamicVLA more often still finishes the job.
05 Discussion & Limitations
Limitations (be specific):
- Short- to medium-horizon focus: The method shines on quick, reactive interaction but doesn't yet plan many minutes ahead while things keep moving in complex sequences.
- Rigid-body assumption: Training and estimation pipelines assume solid objects. Deformables (cloth, cables) and fluids introduce motion that's hard to estimate and simulate at scale.
- Tough perturbations: Very strong or chaotic disturbances (e.g., sudden large pushes, rough surfaces) still degrade performance.
- Perception capacity trade-off: Making the model small helps latency but can limit deep multimodal reasoning in perception-heavy scenes.
Required resources:
- Training: ~32 A100 GPUs for about two weeks across stages; large dynamic datasets (200K sim + 2K real episodes).
- Inference: Modest; about 1.8 GB of GPU memory and ~88 Hz on a single RTX A6000.
When NOT to use this:
- Tasks dominated by complex long-horizon symbolic reasoning where motion is slow and latency is not the bottleneck.
- Environments filled with deformable objects, liquids, or heavy occlusions that break the 6D rigid-state assumption.
- Settings demanding ultra-precise force control without tactile feedback integration.
Open questions:
- Can we keep the latency gains while scaling perception and language capacity, e.g., via sparse attention, token dropping, or early exiting for vision-language?
- How to extend CI+LAAS to long-horizon plans with memory and subgoal decomposition while keeping real-time guarantees?
- Can we upgrade the real-world state interface beyond RGB triangulation (e.g., event cameras, radar, or learned motion fields) without adding heavy latency?
- How to bring non-rigid dynamics into data and models without exploding complexity or lag?
- Can we unify dynamic tasks across mobile bases, dual arms, and handovers under one continuous inference stream?
06 Conclusion & Future Work
Three-sentence summary: DynamicVLA is a fast, compact VLA that overlaps thinking with doing and executes only the freshest actions, closing the perception-execution gap for moving objects. Its convolutional vision encoder and diffusion-style action expert deliver smooth, frequent micro-plans, while Continuous Inference and LAAS keep execution in sync with reality. The new DOM benchmark supplies large-scale dynamic data in sim and real, enabling broad, fair evaluation and strong results across robots.
Main achievement: Showing that latency-aware design (overlap plus freshness) transforms dynamic manipulation, delivering big gains in reactivity, perception grounding under motion, and generalization.
Future directions: Grow perception capacity without growing latency (smarter tokenization, early exits), extend to longer horizons with memory and planning, and broaden beyond rigid bodies to deformables and fluids, all while keeping the CI+LAAS timing discipline. Improve robustness under strong disturbances and explore richer, low-latency sensing.
Why remember this: It reframes success in dynamic robotics as a timing problem first. Design the loop so thinking and doing never drift apart, and your robot will finally catch the moving world instead of chasing its past.
Practical Applications
- Pick-and-place from moving conveyor belts with quick adaptation to slips and bumps.
- Catching and handing off rolling items (e.g., a nurse-bot receiving and placing cylindrical tools).
- Tidying homes by intercepting toys that are still moving and placing them into bins.
- Warehouse sorting into moving trays or areas marked on the floor while carts roll by.
- Robotic assistance in kitchens: grabbing sliding bottles or cans and placing them safely.
- Mobile manipulation (on a base) that adapts in real time to moving people and pets.
- Quality control: tracking and grasping parts that drift on surfaces before precise placement.
- Sports training robots that retrieve and return bouncing balls with accurate timing.
- Assembly lines where parts arrive with variable speeds and occasional collisions.
- Agricultural packing where produce rolls on surfaces and must be gently intercepted.