A Pragmatic VLA Foundation Model
Key Summary
- LingBot-VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.
- It was trained on about 20,000 hours of real robot practice from 9 different dual-arm robot setups, which made it more general and reliable.
- The model's special design (a Mixture-of-Transformers plus Flow Matching) keeps language/vision smarts strong while making actions smooth and precise.
- Adding depth (3D) understanding noticeably boosts success on tricky, spatial tasks like stacking, inserting, and aligning.
- In real-robot tests across 100 tasks on 3 platforms, LingBot-VLA with depth beat strong baselines in both Success Rate and Progress Score.
- The team built a very fast training system (up to 261 samples per second per GPU and near-linear scaling), cutting costs and time.
- Performance kept improving as training data grew from 3,000 to 20,000 hours, with no sign of slowing down yet.
- With only 80 demos per task, LingBot-VLA already surpassed a top baseline trained on 130 demos, showing high data efficiency.
- The project releases code, models, and benchmarks to help others build and test better real-world robot skills.
- These advances matter for home help, warehouses, hospitals, and labs where robots must understand instructions and safely manipulate objects.
Why This Research Matters
Robots that understand language and act safely can help at home, in hospitals, in factories, and in labs. LingBot-VLA shows that feeding real-world experience into a well-designed model keeps improving performance, making truly helpful robots more realistic. The model's strong 3D sense and smooth actions are vital for precision tasks like stacking, inserting, or transferring delicate items. Faster training means more teams can try big ideas without huge costs, speeding up progress for everyone. Fair, large-scale evaluation raises the standard for what "good" looks like in real-world robotics. By open-sourcing code, models, and data, the project invites the community to build safer, smarter robot assistants. This all points toward robots that are more reliable partners in everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how you can tell a friend, "Please put the red cup on the plate," and they look, think, and do it? Robots want to do that too: see, understand, and act from your words.
The Concept: Vision-Language-Action (VLA) Foundation Model
- What it is: A VLA model is a robot brain that reads instructions, sees the world, and decides the actions to take.
- How it works:
- Look: Use cameras to see objects, colors, and places.
- Read: Understand your instruction in natural language.
- Plan: Connect what it sees with what you asked.
- Act: Move arms and grippers to finish the task.
- Why it matters: Without a VLA, robots either can't follow language or can't link what they see to how they should move. Anchor: Say, "Toast two slices of bread and put them on the plate." A VLA guides the robot to find bread, use the toaster, and place the toast carefully on the plate.
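To make the Look-Read-Plan-Act loop concrete, here is a minimal sketch of what a VLA policy interface could look like in Python. The names (Observation, VLAPolicy, act) are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of a VLA policy interface (names are illustrative, not the paper's API).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    images: dict[str, np.ndarray]   # camera name -> HxWx3 RGB frame
    instruction: str                # the natural-language command
    proprio: np.ndarray             # joint positions / gripper state

class VLAPolicy:
    """Look + Read -> Plan -> Act: map one observation to a short chunk of actions."""

    def act(self, obs: Observation, horizon: int = 50) -> np.ndarray:
        # A trained model would encode obs.images and obs.instruction, fuse them
        # with obs.proprio, and decode `horizon` continuous action vectors.
        raise NotImplementedError
```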
The world before: Robots were usually trained one task at a time (like learning only how to open a door). If you changed the task or the robot, you often had to train again. Many tests happened in simulations that look neat but miss real-world messiness: glare, shadows, weird object positions, and tiny differences in hardware. So a method could look great in sim and then stumble on a real table with random clutter. Also, code to train giant, multimodal robot models was not yet fast or smooth, making it hard to try bigger datasets.
The problem: People didn't really know how robot performance scales when you feed the model lots more real-robot data. Does it keep getting better? Does it slow down? And if you want to run huge experiments, can your training code keep up without gobbling tons of time and money?
Failed attempts:
- Small, single-robot datasets led to skills that didn't transfer well.
- Heavy reliance on simulation couldn't capture the real world's tiny bumps and surprises.
- Some models mixed vision, language, and actions in ways that tangled signals, so the robot forgot either language smarts or action smoothness.
- Training code often became the bottleneck: too slow, too memory-hungry, or too hard to scale across many GPUs.
Hook: Imagine practicing piano only on a perfect digital keyboard. You sound great there, but on a real piano with stiff keys and room echoes, you might struggle.
The Concept: Real-world Data Scaling
- What it is: Improving robot skill by training on more and more hours of real-world robot experience, across many robots and places.
- How it works:
- Collect teleoperated demonstrations from 9 dual-arm robots in many settings.
- Clean and label the videos and instructions, clip into useful segments, and remove static frames.
- Pretrain the VLA on 3,000 to 20,000 hours and measure how success changes.
- Why it matters: Without lots of varied, real data, models overfit to a few scenes or motions and fail when tables, lighting, or object positions change. Anchor: Like biking on grass, gravel, and pavement; practicing on many surfaces makes you steady everywhere. The model's success kept rising with more hours and didn't flatten out by 20,000 hours.
The gap this paper fills: It shows, with careful measurements on real robots, that more diverse real-world data keeps helping. It also brings an architecture that keeps vision-language smarts strong while producing smooth actions, and a super-efficient codebase so big training runs are actually doable. Finally, it uses a tough, diverse benchmark (GM-100) to fairly compare methods across 3 platforms, with no cherry-picking.
Real stakes: This matters for home helpers ("Put the dishes in the rack"), hospitals ("Hand me the blue bandage roll"), warehouses ("Stack the boxes by size"), and labs ("Arrange test tubes by label"). To be trustworthy in busy, changing spaces, robots need language understanding, sharp 3D sense, smooth control, and tons of real practice, which is exactly what this work targets.
02 Core Idea
The "Aha!" moment in one sentence: If you mix lots of real-robot practice with a design that keeps language/vision brains clear and makes actions smooth, and train it efficiently, you get a single model that generalizes across many tasks and robots.
Explained three ways:
- Sports team analogy: Use a superstar for seeing/reading the play (the vision-language model), a specialist for moving (the action expert), and a smart walkie-talkie link so they coordinate perfectly during every play.
- Cooking analogy: One chef recognizes ingredients and reads recipes; another chef executes the fine knife work and timing; and a head chef keeps them in sync, so dishes come out tasty and on time.
- Music analogy: One musician reads the sheet music (language) and describes the song (vision), another plays the instrument (actions), and a conductor keeps tempo so the performance is smooth.
Hook: Think of a two-lane highway where drivers talk over radios so they never crash or get in each other's way.
The Concept: Mixture-of-Transformers (MoT) Architecture
- What it is: Two transformer "roads," one for vision-language and one for actions, with a shared attention link so they learn together without stepping on each other's toes.
- How it works:
- Encode multi-view images and the instruction with a strong vision-language model (VLM).
- Feed robot state and action chunks to an "action expert."
- Use shared self-attention so the action expert hears the VLM's guidance at every layer.
- Apply a causal mask so actions can't peek into the future.
- Why it matters: Without MoT, signals can get tangled: either the robot forgets language/vision smarts or its motions get jittery. Anchor: Like a coach (VLM) and athlete (action expert) wearing synced earpieces. The coach points out the target; the athlete moves confidently.
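A rough sketch of the shared-attention mask described above is shown below. The block layout, and the choice that observation tokens do not look back at action tokens, are common conventions assumed for illustration rather than the paper's exact recipe.

```python
# Sketch of a shared-attention mask for a MoT step (layout assumed, not the paper's exact one).
import torch

def mot_attention_mask(num_obs_tokens: int, num_action_tokens: int) -> torch.Tensor:
    """Returns a boolean mask where True means attention is allowed."""
    n = num_obs_tokens + num_action_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Observation block: image/text/state tokens attend freely to each other.
    mask[:num_obs_tokens, :num_obs_tokens] = True

    # Action block: every action token sees all observation tokens...
    mask[num_obs_tokens:, :num_obs_tokens] = True
    # ...but only action tokens up to its own position (causal, so no peeking ahead).
    causal = torch.ones(num_action_tokens, num_action_tokens).tril().bool()
    mask[num_obs_tokens:, num_obs_tokens:] = causal
    return mask

# Example: 300 observation tokens plus a 50-step action chunk -> a (350, 350) mask.
mask = mot_attention_mask(300, 50)
```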
Hook: Picture turning a messy scribble into a perfect line by nudging it a bit smoother at every step.
The Concept: Flow Matching for Action Generation
- What it is: A way to train the model to transform noisy, rough actions into the right, smooth actions over time.
- How it works:
- Start with a noisy action sequence (like a shaky drawing).
- Blend it with the true action sequence.
- Learn the "push" (a velocity) that moves the noisy version toward the true one.
- Repeat so the model produces continuous, fluid motions (here, in chunks of 50 time steps).
- Why it matters: Without this, the robot might move in jerky, stop-start ways and miss precise targets. Anchor: Pouring juice into a narrow glass smoothly, not in sputters; Flow Matching teaches that smoothness.
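The training signal behind this can be written in a few lines. The sketch below assumes a standard straight-line (linear-interpolation) flow-matching objective; the paper's exact time sampling and conditioning interface may differ.

```python
# Minimal sketch of one flow-matching training step for action chunks (conventions assumed).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, obs_features, true_actions):
    """true_actions: (batch, 50, action_dim) ground-truth action chunk."""
    noise = torch.randn_like(true_actions)                    # the "shaky drawing"
    t = torch.rand(true_actions.shape[0], 1, 1,
                   device=true_actions.device)                # random blend time in [0, 1]

    # Blend the noise toward the true actions along a straight path.
    blended = (1.0 - t) * noise + t * true_actions

    # The "push" (velocity) that moves the noisy version toward the truth.
    target_velocity = true_actions - noise

    # The model predicts that push, conditioned on what it sees and reads.
    predicted_velocity = model(obs_features, blended, t)
    return F.mse_loss(predicted_velocity, target_velocity)
```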
Hook: Try touching your finger to your nose with one eye closed; it's harder to judge distance.
The Concept: Depth Perception Integration
- What it is: Teaching the model 3D sense by aligning its vision features with depth features distilled from a depth expert (LingBot-Depth).
- How it works:
- Add learnable "queries" to the VLM that focus on spatial details in each camera view.
- Align those queries with depth tokens from a depth model using a small projection and a matching loss.
- Infuse the VLA with reliable distance/shape cues.
- Why it matters: Without depth, tasks like inserting, stacking, or precise alignment often fail. Anchor: It's like putting your 3D glasses back on so you can line up a key with a lock.
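Here is a minimal sketch of the query-to-depth alignment idea. The shapes, module names, and the cosine-similarity matching loss are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of depth-feature distillation: align learnable VLM queries with depth-expert tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAlignment(nn.Module):
    def __init__(self, vlm_dim: int, depth_dim: int, num_queries: int):
        super().__init__()
        # Learnable spatial queries appended to the VLM's token stream for each camera view.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Light projection from VLM feature space into the depth expert's feature space.
        self.proj = nn.Linear(vlm_dim, depth_dim)

    def matching_loss(self, query_outputs: torch.Tensor,
                      depth_tokens: torch.Tensor) -> torch.Tensor:
        """query_outputs: (B, num_queries, vlm_dim), the VLM's outputs at the query slots.
        depth_tokens:   (B, num_queries, depth_dim), from the frozen depth expert."""
        projected = self.proj(query_outputs)
        # Pull the projected queries toward the depth features; the expert gives no gradient.
        return 1.0 - F.cosine_similarity(projected, depth_tokens.detach(), dim=-1).mean()
```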
Hook: Moving a big couch is faster with friends, and even faster if each friend knows exactly which corner to carry.
The Concept: Distributed Training Optimization
- What it is: Tricks to train huge models fast across many GPUs while saving memory and bandwidth.
- How it works:
- Use FSDP sharding to split parameters and optimizer states.
- Make special shard groups for the action expert to cut communication costs.
- Store/communicate in bfloat16, reduce in float32 for stability.
- Speed up sparse attention with FlexAttention and fuse operators via compilation.
- Why it matters: Without this, big VLA models take too long and cost too much to train. Anchor: Like baking 1,000 cookies with 8 ovens and pre-measured dough: way faster than one oven and mixing by hand.
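The mixed-precision recipe above maps onto standard PyTorch FSDP settings. The sketch below shows that configuration; the wrapping policy and the dedicated shard group for the action expert are assumptions about how it could be wired up.

```python
# Sketch of mixed-precision FSDP wrapping (standard PyTorch APIs; wrapping details assumed).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# bfloat16 for parameter storage/communication, float32 for gradient reductions (stability).
mixed_precision = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

def wrap_for_training(model, process_group=None):
    # A smaller process group could be passed for the action expert so its relatively
    # small parameters are sharded over fewer ranks, cutting collective-communication cost.
    wrapped = FSDP(model, mixed_precision=mixed_precision, process_group=process_group)
    return torch.compile(wrapped)   # operator fusion for extra throughput

# usage (after torch.distributed has been initialized):
# vla = wrap_for_training(vla_model)
```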
Before vs. After:
- Before: Separate, fragile skills; sim-heavy training that didn't always transfer; slow code making big experiments rare.
- After: One pragmatic model that scales with real data, keeps language/vision smarts, produces smooth actions, and can be trained efficiently, ready for real tables, tools, and tasks.
Why it works (intuition): The VLM supplies rich semantics ("that's the blue cup"; "put in" means place inside), the action expert shapes continuous control, the shared attention keeps them in lockstep, depth fixes 3D geometry, and efficient training lets you actually scale, all combining into reliable generalization.
Building blocks:
- Large, diverse real-world data (20,000 hours, 9 robots)
- MoT for clean cross-modal coordination
- Flow Matching for smooth, continuous actions
- Depth distillation for 3D precision
- High-throughput training to make scaling practical
03 Methodology
High-level recipe: Input (multi-view images + language instruction + robot state) → VLM encodes observations → Action Expert predicts the next chunk of actions (using Flow Matching) → Robot executes smoothly.
Step-by-step:
- Data intake and preparation
- What happens: Multi-camera videos from 9 dual-arm robots are cleaned, clipped into atomic actions, and paired with precise task and sub-task instructions (using a strong annotator VLM plus human refinement). Static frames at clip edges are removed.
- Why this exists: Without clean, labeled, diverse data, the model memorizes odd details and fails in new layouts.
- Example: For "Toast two slices of bread, add lettuce and sauce," they mark sub-steps like "Take bread from toaster" and write short, clear instructions for each.
- Observation encoding with a VLM
- What happens: Three synchronized camera views and the language instruction are tokenized and sent through a pretrained vision-language model (Qwen2.5-VL). Robot proprioception (state) is also tokenized.
- Why this exists: The VLM provides strong world and language understanding so the controller doesn't start from scratch.
- Example: The VLM identifies "toaster" and "bread," and understands "take out" vs. "put in."
- Action Expert with Mixture-of-Transformers (MoT)
- What happens: There are two transformer pathways, one mainly for observations (images + text) and one for actions, and they share a self-attention mechanism layer by layer. A careful causal mask blocks future action tokens from leaking into the present.
- Why this exists: To preserve semantic strength from the VLM while letting the action pathway specialize in control without interference.
- Example: While the VLM focuses on "open the toaster door," the action pathway plans a smooth reach-grasp-pull.
- Smooth control via Flow Matching
- What happens: The model learns to turn a noisy action sequence into the ground-truth sequence by predicting the velocity that pushes noise toward truth, over a short horizon (chunks of 50 steps).
- Why this exists: Continuous robot control needs smooth trajectories, not jumpy, one-step guesses.
- Example: Instead of jerking the toast out, the robot pulls gently and steadily.
- Add 3D sense through depth distillation
- What happens: Learnable queries inside the VLM align with depth tokens from a depth model (LingBot-Depth) via a light projection and a matching loss. This injects geometry into the observation stream.
- Why this exists: Many tabletop tasks demand millimeter-level alignment that 2D cues alone can't provide.
- Example: When inserting a straw into a narrow bottle opening, depth helps center the tip.
- Training efficiency tricks
- What happens: Use FSDP sharding with special shard groups for action modules, mixed precision (bfloat16 storage/comm; float32 reductions), FlexAttention for sparse attention, and operator fusion via compilation. The pipeline reaches about 261 samples/sec/GPU on 8 GPUs and scales nearly linearly to large clusters.
- Why this exists: Without speedups, large-scale pretraining would be too slow and expensive to explore data scaling.
- Example: Training runs that used to take weeks can now fit into days, making ablations and scaling studies feasible.
- Post-training (fine-tuning) on target tasks
- What happens: On the GM-100 benchmark, each of 100 tasks gets 130 high-quality demonstrations per platform. All models are fine-tuned with the same settings (e.g., batch size 256, 20 epochs) for fair comparisons.
- Why this exists: To adapt the general model to the exact objects, layouts, and protocols of the benchmark while keeping evaluation fair.
- Example: "Stack Bowls" demonstrations cover different bowl sizes and starting positions.
- Inference and deployment
- What happens: At test time, the robot receives the instruction and live camera feeds, then autoregressively predicts action chunks and executes them. Safety criteria stop runs with repeated failures or risky contacts.
- Why this exists: Chunked, smooth actions are safer and more stable on real hardware.
- Example: If grasping fails three times in a row, the trial ends to avoid collisions.
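Putting the deployment step together, a chunked control loop with simple safety stops might look like the sketch below. The helper names (robot.observe, robot.execute, policy.act, robot.unsafe_contact) and the exact stop conditions are illustrative assumptions, not the paper's protocol.

```python
# Sketch of chunked closed-loop control with simple safety stops (helpers are hypothetical).
MAX_CONSECUTIVE_FAILURES = 3   # e.g., stop after three failed grasps in a row

def run_episode(policy, robot, instruction, max_chunks=40):
    failures = 0
    for _ in range(max_chunks):
        obs = robot.observe()                          # live multi-view images + proprioception
        action_chunk = policy.act(obs, instruction)    # next chunk (e.g., 50 steps)
        ok = robot.execute(action_chunk)               # False on a failed grasp or risky contact
        failures = 0 if ok else failures + 1
        if failures >= MAX_CONSECUTIVE_FAILURES or robot.unsafe_contact():
            return "stopped_for_safety"
        if robot.task_complete():
            return "success"
    return "timeout"
```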
The secret sauce:
- Keep language/vision smarts intact (strong VLM with shared attention),
- Make actions fluid (Flow Matching on chunks),
- See in 3D (depth distillation), and
- Train fast enough to scale (FSDP, FlexAttention, compilation). Together, these make one model that learns broadly and executes precisely, without blowing the compute budget.
04 Experiments & Results
Hook: A fair race needs the same track, the same rules, and a clear scoreboard.
The Concept: Robot Manipulation Benchmarking
- What it is: Careful, apples-to-apples testing of different robot brains on the same tasks, robots, and conditions.
- How it works:
- Use GM-100: 100 diverse tabletop tasks (e.g., stack bowls, fold towels, sieve particles, put scarf on doll).
- Three platforms (AgileX, Agibot G1, Galaxea R1Pro), each with wrist and head cameras.
- For each task, collect 150 demos, keep the best 130, and fine-tune all models identically.
- Run 15 test trials per task-robot pair with randomized object poses; record everything.
- Why it matters: Without strict rules, results can be cherry-picked or unfair. Anchor: It's like a school sports day: everyone runs the same 100 events, on the same field, with the same timers.
Metrics that matter:
- Success Rate (SR): Finished all steps in time (like getting a full score on the event).
- Progress Score (PS): Partial credit based on how many sub-steps you completed (like points for clearing earlier hurdles, even if you miss the last one).
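As a rough illustration, both metrics can be computed from per-trial logs of completed sub-steps. The Progress Score weighting below (a simple average of completion fractions) is an assumption, not necessarily the benchmark's exact formula.

```python
# Sketch of Success Rate and Progress Score from trial logs (PS weighting is an assumption).
def success_rate(trials):
    """trials: list of dicts like {"completed_steps": 3, "total_steps": 4}."""
    full = sum(t["completed_steps"] == t["total_steps"] for t in trials)
    return 100.0 * full / len(trials)

def progress_score(trials):
    """Partial credit: average fraction of sub-steps completed."""
    return 100.0 * sum(t["completed_steps"] / t["total_steps"] for t in trials) / len(trials)

# Example: 15 trials of one task; 5 full successes, 10 trials that finished half the steps.
trials = [{"completed_steps": 4, "total_steps": 4}] * 5 + \
         [{"completed_steps": 2, "total_steps": 4}] * 10
print(success_rate(trials), progress_score(trials))   # ~33.3 SR and ~66.7 PS
```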
Who's in the race:
- Baselines: WALL-OSS, GR00T N1.6, and π0.5 (a strong open VLA).
- Ours: LingBot-VLA without depth and with depth.
Scoreboard highlights (real robots):
- Average over 3 platforms, 100 tasks:
- π0.5: 13.02% SR, 27.65% PS (a solid baseline).
- LingBot-VLA (w/o depth): 15.74% SR, 33.69% PS (clearly better).
- LingBot-VLA (with depth): 17.30% SR, 35.41% PS (best overall).
- Context: Going from 13.02% to 17.30% SR is like moving from a C+ to a solid B on a very tough exam, while also lifting partial credit notably (27.65% → 35.41% PS).
- Platform-wise samples:
- AgileX: Ours with depth hit 40.36% SR, beating others by a wide margin.
- Agibot G1: Ours with depth reached a best-in-class 30.47% PS; SR was close to our without-depth variant.
- Galaxea R1Pro: Ours with depth led with 20.98% SR and 35.40% PS.
Simulation check (RoboTwin 2.0, 50 tasks):
- Clean scenes: π0.5 at 82.74% SR; ours w/o depth 86.50%; ours with depth 88.56%.
- Randomized scenes: π0.5 at 76.76% SR; ours w/o depth 85.34%; ours with depth 86.68%.
- Context: In tough, randomized sims, a ~10-point lift is like turning many near-misses into confident successes.
Scaling law study:
- Pretraining hours 3,000 to 20,000: both SR and PS climbed steadily with no sign of flattening at 20k hours. Each platform's curve matched the overall trend, suggesting a robust, general rule: more diverse real robot data keeps paying off.
Data efficiency:
- On 8 representative tasks (Agibot G1), with only 80 demos per task, LingBot-VLA already beat π0.5 trained on the full 130 demos, then widened the gap as more demos were added. That means you can get better performance with fewer new examples.
Training throughput and scaling:
- The new codebase hits about 261 samples/sec/GPU on 8 GPUs and stays fast as you add more GPUs (near-linear scaling).
- Compared to strong open codebases (StarVLA, Dexbotic, OpenPI), LingBot-VLA's code trains 1.5× to 2.8× faster depending on the base VLM, turning long waits into practical timelines.
Surprises and insights:
- Depth made a big difference on spatially delicate tasks (insertion, stacking), confirming that 3D cues are essential.
- A baseline (GR00T N1.6) did best on a platform heavily represented in its pretraining, reminding us that pretraining distribution matters a lot; coverage pays.
- The test set's atomic actions were very diverse (about half of the top test actions didn't appear among the 100 most common training actions), yet the model still generalized, pointing to strong transfer.
05 Discussion & Limitations
Limitations:
- The gains depend on real-world data quality and diversity; rare tools or exotic objects not seen in the 20k hours may still trip the model.
- Results were on dual-arm, tabletop robots; single-arm mobile manipulation and outdoor settings aren't yet covered.
- Even with depth, extreme precision tasks (e.g., threading a needle) may need extra sensing or specialized policies.
- The model still makes mistakes under hard lighting, heavy clutter, or occlusions, especially if cameras are miscalibrated.
Required resources:
- Multi-GPU training (benefits grow with 8+ GPUs), fast storage for data I/O, and synchronized multi-view cameras.
- Access to teleoperation or expert demonstrations for post-training on new task suites.
- Safety-aware robot setups (collision checks, stop criteria) for real-world evaluation.
When not to use:
- If you can solve a very narrow, repetitive task with a tiny, classical controller, that may be cheaper and simpler.
- If your environment is far from the training distribution (e.g., underwater, outdoors in rain, or with highly deformable materials), expect degraded performance without adaptation.
- If you cannot provide multi-view perception or depth cues in geometry-heavy tasks, consider adding sensors first.
Open questions:
- How far do scaling laws go? Do we still see gains at 50k or 100k hours, and what kinds of data diversity matter most?
- What is the best way to mix single-arm, mobile, and bimanual data in one foundation model?
- Can self-supervised or on-policy data collection reduce reliance on teleoperation while staying safe?
- How to make depth and other 3D signals even tighter partners with language and vision (e.g., unified 3D tokens)?
- Can we design evaluation suites that better capture safety, reliability under distribution shift, and task compositionality?
06 Conclusion & Future Work
Three-sentence summary:
- LingBot-VLA is a pragmatic, real-world Vision-Language-Action foundation model trained on 20,000 hours from 9 dual-arm robots, designed to generalize across many tasks and platforms.
- A Mixture-of-Transformers keeps language/vision understanding strong while an action expert with Flow Matching produces smooth, precise control, and depth distillation adds 3D smarts.
- With a highly optimized training stack, the model outperforms strong baselines on a 100-task real-robot benchmark and scales efficiently, with performance still rising as data grows.
Main achievement:
- Proving, with careful large-scale real-robot evidence, that performance keeps improving with more diverse real data, while delivering an architecture and codebase that make this scaling truly practical.
Future directions:
- Broaden to single-arm and mobile manipulators, expand environments beyond tabletops, and explore even larger, more varied datasets.
- Deepen 3D grounding with richer depth/geometry, and tighten the bridge between high-level language reasoning and low-level control.
- Increase data efficiency via self-improvement, active learning, and safer on-robot exploration.
Why remember this:
- It's a concrete step toward helpful, reliable robots that understand what we say and handle the physical world with care. By mixing big real data, the right model design, and fast training, the paper shows how to turn lab smarts into practical, everyday robot skills.
Practical Applications
- Voice-guided kitchen helpers that safely prepare ingredients and organize cookware.
- Hospital supply runners that fetch, sort, and deliver labeled items on request.
- Warehouse pick-and-place robots that adapt to new boxes and layouts without full retraining.
- Lab assistants that arrange tubes, cap/uncap containers, and set up equipment from natural language steps.
- Home tidying robots that fold towels, sort toys, and load bins while avoiding clutter.
- Retail stockers that place products on shelves by size, color, or barcode instructions.
- Assembly-line bots that insert, align, and fasten parts with depth-aware precision.
- Elder-care support that brings specific objects ("the blue sweater") or helps with simple tasks.
- Education kits where students teach robots new tasks using demonstrations and simple instructions.
- Field technicians' helpers that hold, hand over, and organize tools during repairs.