A Pragmatic VLA Foundation Model

Intermediate
Wei Wu, Fan Lu, Yunnan Wang et al. · 1/26/2026
arXiv · PDF

Key Summary

  • LingBot‑VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.
  • It was trained on about 20,000 hours of real robot practice from 9 different dual‑arm robot setups, which made it more general and reliable.
  • The model’s special design (a Mixture‑of‑Transformers plus Flow Matching) keeps language/vision smarts strong while making actions smooth and precise.
  • Adding depth (3D) understanding noticeably boosts success on tricky, spatial tasks like stacking, inserting, and aligning.
  • In real‑robot tests across 100 tasks on 3 platforms, LingBot‑VLA with depth beat strong baselines in both Success Rate and Progress Score.
  • The team built a very fast training system (up to 261 samples per second per GPU and near‑linear scaling), cutting costs and time.
  • Performance kept improving as training data grew from 3,000 to 20,000 hours, with no sign of slowing down yet.
  • With only 80 demos per task, LingBot‑VLA already surpassed a top baseline trained on 130 demos, showing high data efficiency.
  • The project releases code, models, and benchmarks to help others build and test better real‑world robot skills.
  • These advances matter for home help, warehouses, hospitals, and labs where robots must understand instructions and safely manipulate objects.

Why This Research Matters

Robots that understand language and act safely can help at home, in hospitals, in factories, and in labs. LingBot‑VLA shows that feeding real‑world experience into a well‑designed model keeps improving performance, making truly helpful robots more realistic. The model’s strong 3D sense and smooth actions are vital for precision tasks like stacking, inserting, or transferring delicate items. Faster training means more teams can try big ideas without huge costs, speeding up progress for everyone. Fair, large‑scale evaluation raises the standard for what “good” looks like in real‑world robotics. By open‑sourcing code, models, and data, the project invites the community to build safer, smarter robot assistants. This all points toward robots that are more reliable partners in everyday life.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how you can tell a friend, “Please put the red cup on the plate,” and they look, think, and do it? Robots want to do that too—see, understand, and act from your words.

🥬 The Concept — Vision‑Language‑Action (VLA) Foundation Model:

  • What it is: A VLA model is a robot brain that reads instructions, sees the world, and decides the actions to take.
  • How it works:
    1. Look: Use cameras to see objects, colors, and places.
    2. Read: Understand your instruction in natural language.
    3. Plan: Connect what it sees with what you asked.
    4. Act: Move arms and grippers to finish the task.
  • Why it matters: Without a VLA, robots either can’t follow language or can’t link what they see to how they should move. 🍞 Anchor: Say, “Toast two slices of bread and put them on the plate.” A VLA guides the robot to find bread, use the toaster, and place the toast carefully on the plate.

The world before: Robots were usually trained one task at a time (like learning only how to open a door). If you changed the task or the robot, you often had to train again. Many tests happened in simulations that look neat but miss real‑world messiness—glare, shadows, weird object positions, and tiny differences in hardware. So, a method could look great in sim and then stumble on a real table with random clutter. Also, code to train giant, multimodal robot models was not yet fast or smooth, making it hard to try bigger datasets.

The problem: People didn’t really know how robot performance scales when you feed the model lots more real‑robot data. Does it keep getting better? Does it slow down? And if you want to run huge experiments, can your training code keep up without gobbling tons of time and money?

Failed attempts:

  • Small, single‑robot datasets led to skills that didn’t transfer well.
  • Heavy reliance on simulation couldn’t capture the real world’s tiny bumps and surprises.
  • Some models mixed vision, language, and actions in ways that tangled signals, so the robot forgot either language smarts or action smoothness.
  • Training code often became the bottleneck—too slow, too memory‑hungry, or too hard to scale across many GPUs.

🍞 Hook: Imagine practicing piano only on a perfect digital keyboard. You sound great there, but on a real piano with stiff keys and room echoes, you might struggle.

🥬 The Concept — Real‑world Data Scaling:

  • What it is: Improving robot skill by training on more and more hours of real‑world robot experience—across many robots and places.
  • How it works:
    1. Collect teleoperated demonstrations from 9 dual‑arm robots in many settings.
    2. Clean and label the videos and instructions, clip into useful segments, and remove static frames.
    3. Pretrain the VLA on 3,000 → 20,000 hours and measure how success changes.
  • Why it matters: Without lots of varied, real data, models overfit to a few scenes or motions and fail when tables, lighting, or object positions change. 🍞 Anchor: Like biking on grass, gravel, and pavement; practicing on many surfaces makes you steady everywhere. The model’s success kept rising with more hours—and didn’t flatten out by 20,000 hours.
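To make the “remove static frames” step above concrete, here is a minimal sketch of one plausible heuristic. The paper does not spell out its exact rule, so the frame‑difference test and the `threshold` value are assumptions for illustration only.

```python
import numpy as np

def trim_static_frames(frames: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Drop near-static frames at the edges of a demonstration clip.

    Assumed heuristic (not the paper's exact rule): keep the span between the
    first and last frame whose mean absolute pixel change from the previous
    frame exceeds `threshold`.  frames: (T, H, W, C) uint8 video clip.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    moving = np.flatnonzero(diffs > threshold)
    if moving.size == 0:
        return frames[:1]                      # clip never moves; keep a single frame
    start, end = moving[0], moving[-1] + 2     # diff index i compares frames i and i+1
    return frames[start:end]
```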

The gap this paper fills: It shows, with careful measurements on real robots, that more diverse real‑world data keeps helping. It also brings an architecture that keeps vision‑language smarts strong while producing smooth actions, and a super‑efficient codebase so big training runs are actually doable. Finally, it uses a tough, diverse benchmark (GM‑100) to fairly compare methods across 3 platforms—no cherry‑picking.

Real stakes: This matters for home helpers (“Put the dishes in the rack”), hospitals (“Hand me the blue bandage roll”), warehouses (“Stack the boxes by size”), and labs (“Arrange test tubes by label”). To be trustworthy in busy, changing spaces, robots need language understanding, sharp 3D sense, smooth control, and tons of real practice—exactly what this work targets.

02Core Idea

The “Aha!” moment in one sentence: If you mix lots of real‑robot practice with a design that keeps language/vision brains clear and makes actions smooth—and train it efficiently—you get a single model that generalizes across many tasks and robots.

Explained three ways:

  1. Sports team analogy: Use a superstar for seeing/reading the play (the vision‑language model), a specialist for moving (the action expert), and a smart walkie‑talkie link so they coordinate perfectly during every play.
  2. Cooking analogy: One chef recognizes ingredients and reads recipes; another chef executes the fine knife work and timing; and a head chef keeps them in sync, so dishes come out tasty and on time.
  3. Music analogy: One musician reads the sheet music (language) and describes the song (vision), another plays the instrument (actions), and a conductor keeps tempo so the performance is smooth.

🍞 Hook: Think of a two‑lane highway where drivers talk over radios so they never crash or get in each other’s way.

🥬 The Concept — Mixture‑of‑Transformers (MoT) Architecture:

  • What it is: Two transformer “roads”—one for vision‑language, one for actions—with a shared attention link so they learn together without stepping on each other’s toes.
  • How it works:
    1. Encode multi‑view images and the instruction with a strong vision‑language model (VLM).
    2. Feed robot state and action chunks to an “action expert.”
    3. Use shared self‑attention so the action expert hears the VLM’s guidance at every layer.
    4. Apply a causal mask so actions can’t peek into the future.
  • Why it matters: Without MoT, signals can get tangled—either the robot forgets language/vision smarts or its motions get jittery. 🍞 Anchor: Like a coach (VLM) and athlete (action expert) wearing synced earpieces. The coach points out the target; the athlete moves confidently.
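To see what “shared attention with a causal mask” could look like in code, here is a minimal sketch of the attention mask for one shared layer. The token layout (observation tokens first, action tokens after) and the exact masking pattern are illustrative assumptions, not the released implementation.

```python
import torch

def build_mot_attention_mask(n_obs: int, n_act: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for one shared self-attention layer.

    Assumed layout: observation (image + text + state) tokens come first,
    action tokens follow.  Observation tokens attend only to each other,
    while each action token attends to every observation token and to action
    tokens at or before its own position (so it cannot peek into the future).
    """
    n = n_obs + n_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_obs, :n_obs] = True                               # obs <-> obs (bidirectional)
    mask[n_obs:, :n_obs] = True                               # actions hear the VLM's guidance
    mask[n_obs:, n_obs:] = torch.tril(torch.ones(n_act, n_act)).bool()  # causal over actions
    return mask

# Example: 6 observation tokens, 4 action tokens.
print(build_mot_attention_mask(6, 4).int())
```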

🍞 Hook: Picture turning a messy scribble into a perfect line by nudging it a bit smoother at every step.

🥬 The Concept — Flow Matching for Action Generation:

  • What it is: A way to train the model to transform noisy, rough actions into the right, smooth actions over time.
  • How it works:
    1. Start with a noisy action sequence (like a shaky drawing).
    2. Blend it with the true action sequence.
    3. Learn the “push” (a velocity) that moves the noisy version toward the true one.
    4. Repeat so the model produces continuous, fluid motions (here, in chunks of 50 time steps).
  • Why it matters: Without this, the robot might move in jerky, stop‑start ways and miss precise targets. 🍞 Anchor: Pouring juice into a narrow glass smoothly, not in sputters—Flow Matching teaches that smoothness.
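Here is a minimal sketch of the training loss and a simple Euler sampler for this idea, assuming a model with the signature `model(noisy_actions, t, obs_emb) -> velocity`. The interface, the linear noise blend, and the 10 integration steps are illustrative assumptions, not the paper’s exact recipe.

```python
import torch

def flow_matching_loss(model, obs_emb, true_actions):
    """One training step: learn the velocity that pushes noise toward the true chunk.

    true_actions: (batch, horizon, action_dim), e.g. (B, 50, 14).
    """
    noise = torch.randn_like(true_actions)              # shaky starting point
    t = torch.rand(true_actions.shape[0], 1, 1)         # random blend time in [0, 1]
    noisy = (1.0 - t) * noise + t * true_actions        # partly denoised actions
    target_velocity = true_actions - noise              # the "push" toward the truth
    pred_velocity = model(noisy, t.squeeze(-1).squeeze(-1), obs_emb)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)

@torch.no_grad()
def sample_actions(model, obs_emb, horizon=50, action_dim=14, steps=10):
    """Euler integration from pure noise to a smooth 50-step action chunk."""
    x = torch.randn(1, horizon, action_dim)
    for i in range(steps):
        t = torch.full((1,), i / steps)
        x = x + model(x, t, obs_emb) / steps
    return x

# Usage with a stand-in model (the real model would be the MoT action expert):
dummy = lambda x, t, obs: torch.zeros_like(x)
print(flow_matching_loss(dummy, None, torch.randn(4, 50, 14)))
```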

🍞 Hook: Try touching your finger to your nose with one eye closed—it’s harder to judge distance.

🥬 The Concept — Depth Perception Integration:

  • What it is: Teaching the model 3D sense by aligning its vision features with depth features distilled from a depth expert (LingBot‑Depth).
  • How it works:
    1. Add learnable “queries” to the VLM that focus on spatial details in each camera view.
    2. Align those queries with depth tokens from a depth model using a small projection and a matching loss.
    3. Infuse the VLA with reliable distance/shape cues.
  • Why it matters: Without depth, tasks like inserting, stacking, or precise alignment often fail. 🍞 Anchor: It’s like putting your 3D glasses back on so you can line up a key with a lock.
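A minimal sketch of what the depth‑alignment piece could look like in code. The dimensions, the number of queries, the single linear projection, and the cosine matching loss are illustrative guesses: the paper only states that learnable queries are aligned with the depth expert’s tokens through a light projection and a matching loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAlignmentHead(nn.Module):
    """Aligns the VLM's learnable spatial queries with tokens from a frozen depth expert."""

    def __init__(self, num_queries: int = 64, vlm_dim: int = 2048, depth_dim: int = 1024):
        super().__init__()
        # Learnable queries appended to the VLM input elsewhere; the VLM's outputs
        # at those positions come back here as `query_outputs`.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        self.project = nn.Linear(vlm_dim, depth_dim)

    def input_queries(self, batch_size: int) -> torch.Tensor:
        return self.queries.unsqueeze(0).expand(batch_size, -1, -1)

    def alignment_loss(self, query_outputs: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        """query_outputs: (B, Q, vlm_dim); depth_tokens: (B, Q, depth_dim) from the frozen depth model."""
        projected = self.project(query_outputs)
        # 1 - cosine similarity: pulls the projected queries toward the depth features.
        return 1.0 - F.cosine_similarity(projected, depth_tokens.detach(), dim=-1).mean()
```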

🍞 Hook: Moving a big couch is faster with friends—and even faster if each friend knows exactly which corner to carry.

🥬 The Concept — Distributed Training Optimization:

  • What it is: Tricks to train huge models fast across many GPUs while saving memory and bandwidth.
  • How it works:
    1. Use FSDP sharding to split parameters and optimizer states.
    2. Make special shard groups for the action expert to cut communication costs.
    3. Store/communicate in bfloat16, reduce in float32 for stability.
    4. Speed up sparse attention with FlexAttention and fuse operators via compilation.
  • Why it matters: Without this, big VLA models take too long and cost too much to train. 🍞 Anchor: Like baking 1,000 cookies with 8 ovens and pre‑measured dough—way faster than one oven and mixing by hand.
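For the sharding and mixed‑precision part of this recipe, here is a minimal sketch using PyTorch FSDP, assuming a standard `torchrun`‑initialized process group: parameters and communication in bfloat16, gradient reductions in float32. The dedicated shard groups for the action expert, FlexAttention, and the exact compilation setup from the paper are not shown.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# bf16 storage and communication, fp32 gradient reductions for stability.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

def wrap_for_training(model: torch.nn.Module):
    """Shard parameters and optimizer state across GPUs.

    Assumes torch.distributed is already initialized (e.g. via torchrun).
    """
    sharded = FSDP(model, mixed_precision=mp_policy, use_orig_params=True)
    return torch.compile(sharded)   # fuse elementwise ops where possible
```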

Before vs. After:

  • Before: Separate, fragile skills; sim‑heavy training that didn’t always transfer; slow code making big experiments rare.
  • After: One pragmatic model that scales with real data, keeps language/vision smarts, produces smooth actions, and can be trained efficiently—ready for real tables, tools, and tasks.

Why it works (intuition): The VLM supplies rich semantics (“that’s the blue cup; ‘put in’ means place inside”), the action expert shapes continuous control, the shared attention keeps them in lockstep, depth fixes 3D geometry, and efficient training lets you actually scale—all combining into reliable generalization.

Building blocks:

  • Large, diverse real‑world data (20,000 hours, 9 robots)
  • MoT for clean cross‑modal coordination
  • Flow Matching for smooth, continuous actions
  • Depth distillation for 3D precision
  • High‑throughput training to make scaling practical

03Methodology

High‑level recipe: Input (multi‑view images + language instruction + robot state) → VLM encodes observations → Action Expert predicts the next chunk of actions (using Flow Matching) → Robot executes smoothly.

Step‑by‑step:

  1. Data intake and preparation
  • What happens: Multi‑camera videos from 9 dual‑arm robots are cleaned, clipped into atomic actions, and paired with precise task and sub‑task instructions (using a strong annotator VLM plus human refinement). Static frames at clip edges are removed.
  • Why this exists: Without clean, labeled, diverse data, the model memorizes odd details and fails in new layouts.
  • Example: For “Toast two slices of bread, add lettuce and sauce,” they mark sub‑steps like “Take bread from toaster” and write short, clear instructions for each.
  2. Observation encoding with a VLM
  • What happens: Three synchronized camera views and the language instruction are tokenized and sent through a pretrained vision‑language model (Qwen2.5‑VL). Robot proprioception (state) is also tokenized.
  • Why this exists: The VLM provides strong world and language understanding so the controller doesn’t start from scratch.
  • Example: The VLM identifies “toaster,” “bread,” and understands “take out” vs. “put in.”
  3. Action Expert with Mixture‑of‑Transformers (MoT)
  • What happens: There are two transformer pathways—one mainly for observations (images+text) and one for actions—and they share a self‑attention mechanism layer by layer. A careful causal mask blocks future action tokens from leaking into the present.
  • Why this exists: To preserve semantic strength from the VLM while letting the action pathway specialize in control without interference.
  • Example: While the VLM focuses on “open the toaster door,” the action pathway plans a smooth reach‑grasp‑pull.
  4. Smooth control via Flow Matching
  • What happens: The model learns to turn a noisy action sequence into the ground‑truth sequence by predicting the velocity that pushes noise toward truth, over a short horizon (chunks of 50 steps).
  • Why this exists: Continuous robot control needs smooth trajectories, not jumpy, one‑step guesses.
  • Example: Instead of jerking the toast out, the robot pulls gently and steadily.
  5. Add 3D sense through depth distillation
  • What happens: Learnable queries inside the VLM align with depth tokens from a depth model (LingBot‑Depth) via a light projection and a matching loss. This injects geometry into the observation stream.
  • Why this exists: Many tabletop tasks demand millimeter‑level alignment that 2D cues alone can’t provide.
  • Example: When inserting a straw into a narrow bottle opening, depth helps center the tip.
  6. Training efficiency tricks
  • What happens: Use FSDP sharding with special shard groups for action modules, mixed precision (bfloat16 storage/comm; float32 reductions), FlexAttention for sparse attention, and operator fusion via compilation. The pipeline reaches about 261 samples/sec/GPU on 8 GPUs and scales nearly linearly to large clusters.
  • Why this exists: Without speedups, large‑scale pretraining would be too slow/expensive to explore data scaling.
  • Example: Training runs that used to take weeks can now fit into days, making ablations and scaling studies feasible.
  7. Post‑training (fine‑tuning) on target tasks
  • What happens: On the GM‑100 benchmark, each of 100 tasks gets 130 high‑quality demonstrations per platform. All models are fine‑tuned with the same settings (e.g., batch size 256, 20 epochs) for fair comparisons.
  • Why this exists: To adapt the general model to the exact objects, layouts, and protocols of the benchmark while keeping evaluation fair.
  • Example: “Stack Bowls” demonstrations cover different bowl sizes and starting positions.
  8. Inference and deployment
  • What happens: At test time, the robot receives the instruction and live camera feeds, then autoregressively predicts action chunks and executes them. Safety criteria stop runs with repeated failures or risky contacts.
  • Why this exists: Chunked, smooth actions are safer and more stable on real hardware.
  • Example: If grasping fails three times in a row, the trial ends to avoid collisions.
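Step 8 boils down to a receding‑horizon control loop with safety stops. The sketch below illustrates it; the `FakeRobot` stub, the `policy` callable, and the method names are hypothetical stand‑ins so the loop can run end to end, not the released API.

```python
class FakeRobot:
    """Tiny stand-in so the loop runs end to end; a real deployment talks to hardware."""
    def __init__(self, steps_to_finish: int = 120):
        self.elapsed = 0
        self.steps_to_finish = steps_to_finish

    def get_observation(self):
        return {"images": None, "state": None}   # live camera feeds + proprioception

    def apply(self, action):
        self.elapsed += 1

    def detected_risky_contact(self) -> bool:
        return False

    def last_grasp_failed(self) -> bool:
        return False

    def task_complete(self) -> bool:
        return self.elapsed >= self.steps_to_finish


def run_trial(policy, robot, instruction, max_steps=600, max_failures=3):
    """Execute chunked actions until success, timeout, or a safety stop."""
    failures = steps = 0
    while steps < max_steps:
        obs = robot.get_observation()
        chunk = policy(obs, instruction)         # next 50-step action chunk
        for action in chunk:
            robot.apply(action)
            steps += 1
        if robot.detected_risky_contact():
            return "aborted: risky contact"
        if robot.last_grasp_failed():
            failures += 1
            if failures >= max_failures:         # e.g. three failed grasps in a row
                return "aborted: repeated grasp failures"
        else:
            failures = 0                         # a clean chunk resets the streak
        if robot.task_complete():
            return "success"
    return "timeout"


# Usage with stand-ins: a "policy" that always outputs a 50-step chunk of zeros.
print(run_trial(lambda obs, instr: [0.0] * 50, FakeRobot(), "stack the bowls"))
```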

The secret sauce:

  • Keep language/vision smarts intact (strong VLM with shared attention),
  • Make actions fluid (Flow Matching on chunks),
  • See in 3D (depth distillation), and
  • Train fast enough to scale (FSDP, FlexAttention, compilation). Together, these make one model that learns broadly and executes precisely, without blowing the compute budget.

04Experiments & Results

🍞 Hook: A fair race needs the same track, the same rules, and a clear scoreboard.

🥬 The Concept — Robot Manipulation Benchmarking:

  • What it is: Careful, apples‑to‑apples testing of different robot brains on the same tasks, robots, and conditions.
  • How it works:
    1. Use GM‑100: 100 diverse tabletop tasks (e.g., stack bowls, fold towels, sieve particles, put scarf on doll).
    2. Three platforms (AgileX, Agibot G1, Galaxea R1Pro), each with wrist and head cameras.
    3. For each task, collect 150 demos, keep the best 130, and fine‑tune all models identically.
    4. Run 15 test trials per task‑robot pair with randomized object poses; record everything.
  • Why it matters: Without strict rules, results can be cherry‑picked or unfair. 🍞 Anchor: It’s like a school sports day: everyone runs the same 100 events, on the same field, with the same timers.

Metrics that matter:

  • Success Rate (SR): Finished all steps in time (like getting a full score on the event).
  • Progress Score (PS): Partial credit based on how many sub‑steps you completed (like points for clearing earlier hurdles, even if you miss the last one).
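As a quick illustration of the scoreboard that follows, here is a minimal sketch of how these two metrics could be computed from trial records. The paper’s exact sub‑step weighting for the Progress Score may differ, so treat this as the idea rather than the official formula.

```python
from typing import List

def success_rate(trials: List[dict]) -> float:
    """Percent of trials where every sub-step finished within the time limit."""
    return 100.0 * sum(t["completed_steps"] == t["total_steps"] for t in trials) / len(trials)

def progress_score(trials: List[dict]) -> float:
    """Average partial credit: completed sub-steps over total sub-steps per trial, in percent."""
    return 100.0 * sum(t["completed_steps"] / t["total_steps"] for t in trials) / len(trials)

# Example: 15 test trials of a task with 4 sub-steps.
trials = [{"completed_steps": c, "total_steps": 4}
          for c in (4, 4, 3, 2, 4, 1, 0, 4, 3, 4, 2, 4, 0, 3, 4)]
print(success_rate(trials), progress_score(trials))
```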

Who’s in the race:

  • Baselines: WALL‑OSS, GR00T N1.6, and π0.5 (a strong open VLA).
  • Ours: LingBot‑VLA without depth and with depth.

Scoreboard highlights (real robots):

  • Average over 3 platforms, 100 tasks:
    • π0.5: 13.02% SR, 27.65% PS (a solid baseline).
    • LingBot‑VLA (w/o depth): 15.74% SR, 33.69% PS (clearly better).
    • LingBot‑VLA (with depth): 17.30% SR, 35.41% PS (best overall).
  • Context: Going from 13.02% to 17.30% SR is like moving from a C+ to a solid B on a very tough exam, while also lifting partial credit notably (27.65% → 35.41% PS).
  • Platform‑wise samples:
    • AgileX: Ours with depth hit 40.36% SR, beating others by a wide margin.
    • Agibot G1: Ours with depth reached a best‑in‑class 30.47% PS; its SR was close to our w/o‑depth variant.
    • Galaxea R1Pro: Ours with depth led with 20.98% SR and 35.40% PS.

Simulation check (RoboTwin 2.0, 50 tasks):

  • Clean scenes: π0.5 at 82.74% SR; ours w/o depth 86.50%; ours with depth 88.56%.
  • Randomized scenes: π0.5 at 76.76% SR; ours w/o depth 85.34%; ours with depth 86.68%.
  • Context: In tough, randomized sims, a ~10‑point lift is like turning many near‑misses into confident successes.

Scaling law study:

  • Pretraining hours 3,000 → 20,000: both SR and PS climbed steadily with no sign of flattening at 20k hours. Each platform’s curve matched the overall trend, suggesting a robust, general rule: more diverse real robot data keeps paying off.

Data efficiency:

  • On 8 representative tasks (Agibot G1), with only 80 demos per task, LingBot‑VLA already beat π0.5 trained on the full 130 demos—then widened the gap as more demos were added. That means you can get better performance with fewer new examples.

Training throughput and scaling:

  • The new codebase hits about 261 samples/sec/GPU on 8 GPUs and stays fast as you add more GPUs (near‑linear scaling).
  • Compared to strong open codebases (StarVLA, Dexbotic, OpenPI), LingBot‑VLA’s code trains 1.5–2.8× faster depending on the base VLM, turning long waits into practical timelines.

Surprises and insights:

  • Depth made a big difference on spatially delicate tasks (insertion, stacking), confirming that 3D cues are essential.
  • A baseline (GR00T N1.6) did best on a platform heavily represented in its pretraining, reminding us that pretraining distribution matters a lot—coverage pays.
  • The test set’s atomic actions were very diverse—about half of the top test actions didn’t appear among the 100 most common training actions—yet the model still generalized, pointing to strong transfer.

05Discussion & Limitations

Limitations:

  • The gains depend on real‑world data quality and diversity; rare tools or exotic objects not seen in the 20k hours may still trip the model.
  • Results were on dual‑arm, tabletop robots; single‑arm mobile manipulation and outdoor settings aren’t yet covered.
  • Even with depth, extreme precision tasks (e.g., threading a needle) may need extra sensing or specialized policies.
  • The model still makes mistakes under hard lighting, heavy clutter, or occlusions, especially if cameras are miscalibrated.

Required resources:

  • Multi‑GPU training (benefits grow with 8+ GPUs), fast storage for data I/O, and synchronized multi‑view cameras.
  • Access to teleoperation or expert demonstrations for post‑training on new task suites.
  • Safety‑aware robot setups (collision checks, stop criteria) for real‑world evaluation.

When not to use:

  • If you can solve a very narrow, repetitive task with a tiny, classical controller, that may be cheaper and simpler.
  • If your environment is far from the training distribution (e.g., underwater, outdoors in rain, or with highly deformable materials), expect degraded performance without adaptation.
  • If you cannot provide multi‑view perception or depth cues in geometry‑heavy tasks, consider adding sensors first.

Open questions:

  • How far do scaling laws go—do we still see gains at 50k or 100k hours, and what kinds of data diversity matter most?
  • What is the best way to mix single‑arm, mobile, and bimanual data in one foundation model?
  • Can self‑supervised or on‑policy data collection reduce reliance on teleoperation while staying safe?
  • How to make depth and other 3D signals even tighter partners with language and vision (e.g., unified 3D tokens)?
  • Can we design evaluation suites that better capture safety, reliability under distribution shift, and task compositionality?

06Conclusion & Future Work

Three‑sentence summary:

  • LingBot‑VLA is a pragmatic, real‑world Vision‑Language‑Action foundation model trained on 20,000 hours from 9 dual‑arm robots, designed to generalize across many tasks and platforms.
  • A Mixture‑of‑Transformers keeps language/vision understanding strong while an action expert with Flow Matching produces smooth, precise control, and depth distillation adds 3D smarts.
  • With a highly optimized training stack, the model outperforms strong baselines on a 100‑task real‑robot benchmark and scales efficiently, with performance still rising as data grows.

Main achievement:

  • Proving, with careful large‑scale real‑robot evidence, that performance keeps improving with more diverse real data, while delivering an architecture and codebase that make this scaling truly practical.

Future directions:

  • Broaden to single‑arm and mobile manipulators, expand environments beyond tabletops, and explore even larger, more varied datasets.
  • Deepen 3D grounding with richer depth/geometry, and tighten the bridge between high‑level language reasoning and low‑level control.
  • Increase data efficiency via self‑improvement, active learning, and safer on‑robot exploration.

Why remember this:

  • It’s a concrete step toward helpful, reliable robots that understand what we say and handle the physical world with care. By mixing big real data, the right model design, and fast training, the paper shows how to turn lab smarts into practical, everyday robot skills.

Practical Applications

  • Voice‑guided kitchen helpers that safely prepare ingredients and organize cookware.
  • Hospital supply runners that fetch, sort, and deliver labeled items on request.
  • Warehouse pick‑and‑place robots that adapt to new boxes and layouts without full retraining.
  • Lab assistants that arrange tubes, cap/uncap containers, and set up equipment from natural language steps.
  • Home tidying robots that fold towels, sort toys, and load bins while avoiding clutter.
  • Retail stockers that place products on shelves by size, color, or barcode instructions.
  • Assembly‑line bots that insert, align, and fasten parts with depth‑aware precision.
  • Elder‑care support that brings specific objects (“the blue sweater”) or helps with simple tasks.
  • Education kits where students teach robots new tasks using demonstrations and simple instructions.
  • Field technicians’ helpers that hold, hand over, and organize tools during repairs.
#Vision‑Language‑Action #foundation model #Flow Matching #Mixture‑of‑Transformers #depth distillation #3D perception #distributed training #FSDP #FlexAttention #robot manipulation #benchmarking #success rate #progress score #generalization #real‑world scaling