Action100M: A Large-scale Video Action Dataset
Key Summary
- Action100M is a gigantic video dataset with about 100 million labeled action moments built automatically from 1.2 million instructional videos.
- The pipeline first slices each video into small, meaningful moments using video embeddings, then writes multi-level captions, and finally has a reasoning model clean them up into structured action labels.
- This creates a Tree-of-Captions for every video segment, which reduces mistakes by combining close-up and big-picture views.
- A Self-Refine process makes the labels more accurate by reviewing and fixing them in multiple rounds.
- Models trained on Action100M, like VL-JEPA, get better as you give them more data and beat strong baselines in zero-shot action recognition and text-to-video retrieval.
- The dataset covers everyday physical tasks (like cooking or fixing things) and includes brief and detailed action descriptions plus captions.
- Semantic resampling helps balance common and rare actions so the model learns more fairly from the long tail.
- The approach turns heavy video understanding into mostly text processing, making it cheaper than running huge video models end-to-end.
- This work sets a new foundation for open-vocabulary action understanding, helpful for robots, AR assistants, search, and planning.
- All annotations are produced automatically at massive scale, showing a practical path to grow action-centric data without manual labeling.
Why This Research Matters
Action100M teaches AI to understand everyday actions the way people do—step by step—so assistants and robots can actually help with real tasks. It enables safer, smarter guidance in kitchens, garages, and classrooms by recognizing what a person is doing and what comes next. Search and recommendation get better because queries like “replace a faucet washer” can find precise, relevant moments, not just whole videos. Accessibility tools can explain procedures more clearly, aiding learners and users with diverse needs. The dataset’s open-vocabulary nature means it adapts to new tools, recipes, and skills as they appear in the world. By scaling action labels without manual annotation, it lowers the barrier to building practical, action-aware AI. Ultimately, this pushes AI from naming objects to understanding and supporting how we do things.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine trying to learn every move in every sport, every cooking step in every recipe, and every fix for every gadget—just by watching videos. You’d need a lot of examples and clear explanations to truly get good.
🥬 Filling (The Actual Concept):
- What it is: This paper builds Action100M, a huge video action dataset that teaches AI to recognize what people are doing in videos, across many topics, without sticking to a fixed list of actions.
- How it works: The team collects millions of how-to videos, splits them into small moments, writes detailed captions at different levels, and uses a smart model to combine those captions into clean action labels. Then they train a powerful model (VL-JEPA) on these labels and test it on many benchmarks.
- Why it matters: Before this, action datasets were too small, too narrow, or too noisy. Without good, broad, and clean data, AIs struggle to notice and name actions—especially rare or very specific ones.
🍞 Bottom Bread (Anchor): Think of learning to cook by pausing YouTube every few seconds: “slice tomato,” “heat pan,” “add oil,” “flip omelet.” Now imagine those steps are neatly written down and checked for you. That’s what Action100M does for AI.
— New Concept 1 — 🍞 Hook: You know how you can understand a new playground game by watching, even if nobody tells you the rules? You pick up the idea from what people do. 🥬 The Concept: Open-vocabulary action recognition.
- What it is: Teaching AI to recognize actions described by any words—not just a short, fixed list.
- How it works: The model learns a shared “language” between video and text so when you say any action (like “whisk batter” or “tighten screw”), it can match that phrase to the right video moment.
- Why it matters: With a fixed list, the AI can’t handle new or rare actions. Open-vocabulary lets it generalize to the real world, where people do countless different things. 🍞 Anchor: If you ask for “peel a mango” but the dataset only knows “cut fruit,” the AI still needs to find the “peel a mango” moments. Open-vocabulary makes that possible.
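To make the “How it works” step above concrete, here is a minimal Python sketch of open-vocabulary matching in a shared embedding space. The `encode_text` and `encode_clip` functions are hypothetical placeholders that return random unit vectors in place of the model’s real encoders; only the matching logic is the point.

```python
# A minimal sketch of open-vocabulary matching in a shared embedding space.
# encode_text / encode_clip are placeholders, NOT the paper's actual encoders.
import numpy as np

def encode_text(phrase: str) -> np.ndarray:
    """Placeholder: a deterministic random unit vector per phrase."""
    rng = np.random.default_rng(abs(hash(phrase)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def encode_clip(clip_id: str) -> np.ndarray:
    """Placeholder: a deterministic random unit vector per clip."""
    rng = np.random.default_rng(abs(hash(clip_id)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def best_matching_clip(query: str, clip_ids: list[str]) -> str:
    """Rank clips by cosine similarity to an arbitrary action phrase."""
    q = encode_text(query)
    clips = np.stack([encode_clip(c) for c in clip_ids])
    scores = clips @ q                       # cosine similarity (unit vectors)
    return clip_ids[int(np.argmax(scores))]

print(best_matching_clip("peel a mango", ["clip_001", "clip_002", "clip_003"]))
```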
The World Before:
- AI got good at spotting objects (like “cat” or “cup”) from huge image-text datasets. But understanding actions (like “stir,” “fold,” “assemble”) lagged because videos are longer, messier, and expensive to label.
- Popular action datasets focused on specific domains (cooking, assembly) and were relatively small (thousands to under a million action clips), often hand-labeled—great quality but limited diversity and scale.
- Big video–text datasets existed, but their captions were usually pulled from speech or metadata, which talk about topics rather than step-by-step actions, so they didn’t teach fine-grained motion well.
The Problem:
- We needed a dataset that’s (1) huge, (2) richly action-centered, (3) open-vocabulary, (4) temporally precise (knows when an action starts/ends), and (5) affordable to build.
Failed Attempts:
- Manual labels: Accurate but too slow and costly to scale across millions of videos.
- ASR-only captions: Scalable but often off-topic and not aligned with the exact action on screen.
- A single caption per segment: Misses either the bigger story or the tiny details, and makes it hard to connect a close-up frame with the overall goal.
The Gap:
- No dataset combined internet scale with action-accurate, multi-level, and temporally localized annotations—without a massive human labeling budget.
— New Concept 2 — 🍞 Hook: When you watch a magic trick, you pay attention to tiny finger moves and also the whole routine. Both views matter. 🥬 The Concept: Hierarchical temporal segmentation.
- What it is: Splitting a video into a tree of moments—from small motions up to whole steps.
- How it works: First turn frames into features; then group neighboring frames that look similar over time; keep merging into bigger segments while keeping them contiguous and consistent.
- Why it matters: Without a multi-scale breakdown, you either miss tiny actions (like “press button”) or lose the big picture (like “make coffee”). 🍞 Anchor: A cooking video becomes a tree: little leaves like “crack egg,” branches like “make batter,” and the trunk “bake a cake.”
— New Concept 3 — 🍞 Hook: If you’re summarizing a movie, you might write notes for each scene and also a summary for the whole film. 🥬 The Concept: Tree-of-Captions.
- What it is: For every video segment in the tree, write both frame-level and segment-level captions and store them in a hierarchy.
- How it works: Use a vision model to caption key frames (close-up details) and another to caption longer segments (what happens over time), then organize these captions from leaves to root.
- Why it matters: Single captions miss either the fine details or the overall context; multiple levels let the AI cross-check and be specific. 🍞 Anchor: For “making almond butter,” a leaf caption might say “almonds spinning in blender,” and a higher-level caption says “blend almonds into butter,” and the root says “roast, blend, and jar almond butter.”
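Here is a minimal sketch of how a Tree-of-Captions node could be represented as a data structure; the class and field names are illustrative, not the dataset’s actual schema.

```python
# Illustrative Tree-of-Captions node (field names are assumptions, not the
# dataset's real schema): leaves hold frame captions, higher nodes hold
# segment captions, and caption evidence can be gathered from the subtree.
from dataclasses import dataclass, field

@dataclass
class CaptionNode:
    start: float                       # segment start time (seconds)
    end: float                         # segment end time (seconds)
    frame_caption: str = ""            # close-up detail for leaf nodes
    segment_caption: str = ""          # temporal summary for higher nodes
    children: list["CaptionNode"] = field(default_factory=list)

    def all_captions(self) -> list[str]:
        """Collect caption evidence from this node and its whole subtree."""
        out = [c for c in (self.frame_caption, self.segment_caption) if c]
        for child in self.children:
            out.extend(child.all_captions())
        return out

root = CaptionNode(0, 300, segment_caption="Roast, blend, and jar almond butter.",
                   children=[
                       CaptionNode(0, 120, frame_caption="Almonds on a baking tray."),
                       CaptionNode(120, 300, frame_caption="Almonds spinning in a blender."),
                   ])
print(root.all_captions())
```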
— New Concept 4 — 🍞 Hook: When you do math homework, checking your steps helps catch mistakes. 🥬 The Concept: Self-Refine mechanism.
- What it is: A model reviews its own draft labels multiple times and fixes errors.
- How it works: Round 1: write a best-guess summary. Round 2+: revisit the text plus all captions, correct inconsistencies, remove unsupported claims, improve clarity.
- Why it matters: First drafts can hallucinate; iterative review makes labels more faithful to the video. 🍞 Anchor: The model first writes “she adds sugar,” but later notices no sugar in any caption evidence and corrects it to “she adds almonds only.”
Real Stakes:
- Better action understanding helps assistants guide people in kitchens, workshops, and classrooms; helps robots perform multi-step tasks; improves video search and retrieval; and supports planning models that predict what happens next. It makes technology more practically helpful in everyday life.
02 Core Idea
🍞 Top Bread (Hook): Imagine turning a long how-to video into a neat outline: little steps, bigger steps, and an overall summary—each one double-checked—so anyone can learn the moves quickly.
🥬 Filling (The Actual Concept):
- What it is: The key insight is to convert videos into a text-first, multi-level representation (Tree-of-Captions), then use a reasoning model with Self-Refine to turn that evidence into clean, structured action labels—at massive scale.
- How it works: 1) Break videos into a hierarchy of segments; 2) Caption frames and segments; 3) Aggregate all the text with an LLM that reasons across levels and refines itself; 4) Train an action model (VL-JEPA) on the resulting open-vocabulary labels.
- Why it matters: This pipeline scales like web data but keeps action labels accurate and detailed enough to teach real-world motions.
🍞 Bottom Bread (Anchor): A single recipe video becomes dozens of labeled moments like “preheat oven,” “whisk eggs,” and “pour batter,” each with short and detailed descriptions, all consistent with the big picture.
Three Analogies:
- Librarian: Instead of reading every book cover-to-cover, the librarian uses chapter summaries and page notes to build a reliable catalog of what’s inside each book.
- Puzzle: Close-up pieces (frame captions) join with bigger chunks (segment captions); a careful solver (LLM) checks every fit so the final picture (action labels) is correct.
- Cooking: Prep (segmentation), ingredients (captions), recipe review (Self-Refine), and plating (structured labels). The result is a meal (dataset) you can serve to many models.
Before vs After:
- Before: Datasets were either small and precise (manual labels) or big and noisy (ASR captions). Models often missed fine actions or overfit to common ones.
- After: Action100M delivers huge, hierarchical, and action-centered supervision, enabling models to recognize a far wider range of actions zero-shot and to align video with text more reliably.
Why It Works (Intuition):
- Hierarchies match how actions actually happen: tiny motions roll up into steps, which roll up into tasks. This structure helps both precision (catch small moves) and coherence (see the plan).
- Multi-level captions provide cross-checks; disagreement at one level can be resolved by majority evidence across levels, reducing hallucinations.
- Text-first aggregation moves heavy reasoning from pixels (expensive) to words (cheaper), so you can scale without exploding compute.
- Open-vocabulary labels let the model map many phrases to visual patterns, enabling generalization to unseen actions.
Building Blocks (with Sandwich intros to each new concept):
— New Concept 5 — 🍞 Hook: Think of a sports coach who watches practices and learns patterns of motion without needing play-by-play commentary. 🥬 The Concept: V-JEPA 2 (video encoder used for segmentation features).
- What it is: A self-supervised video model that turns frames into rich visual features capturing motion and appearance.
- How it works: It processes short windows of frames and outputs tokens representing what’s happening; these features are averaged and stitched across the video.
- Why it matters: Good features make it easier to cluster similar moments together into meaningful segments. 🍞 Anchor: V-JEPA 2 notices “hand moves toward switch” looks like “turning on light” across many clips, helping the pipeline group such frames.
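Below is a rough sketch of that sliding-window plumbing: encode overlapping windows of frames, average spatial tokens into per-frame features, and average the overlaps when stitching. The `encode_window` function is a random-output placeholder for V-JEPA 2, and the window, stride, and feature sizes are illustrative.

```python
# Sketch of per-frame feature extraction by sliding a fixed window over a video.
# encode_window is a placeholder for V-JEPA 2; sizes here are illustrative.
import numpy as np

WINDOW, STRIDE, DIM = 64, 32, 16

def encode_window(frames: np.ndarray) -> np.ndarray:
    """Placeholder encoder: (T, spatial_tokens, D) random tokens."""
    return np.random.randn(frames.shape[0], 8, DIM)

def per_frame_features(video: np.ndarray) -> np.ndarray:
    n = video.shape[0]
    feats, counts = np.zeros((n, DIM)), np.zeros((n, 1))
    starts = list(range(0, max(n - WINDOW, 0) + 1, STRIDE))
    if starts[-1] != max(n - WINDOW, 0):
        starts.append(max(n - WINDOW, 0))             # cover the tail of the video
    for s in starts:
        tokens = encode_window(video[s:s + WINDOW])   # encode one window
        feats[s:s + WINDOW] += tokens.mean(axis=1)    # average spatial tokens
        counts[s:s + WINDOW] += 1
    return feats / counts                             # average overlapping windows

video = np.zeros((200, 3, 224, 224))                  # dummy frame stack
print(per_frame_features(video).shape)                # (200, 16)
```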
— New Concept 6 — 🍞 Hook: You know how a giant library card catalog helps you find any book fast? 🥬 The Concept: Action100M (the dataset itself).
- What it is: A massive, open-vocabulary, action-centric video dataset with around 100M labeled segments from 1.2M how-to videos.
- How it works: Videos are segmented hierarchically, captioned at multiple levels, and then distilled into structured fields (brief/detailed action, actor, brief/detailed caption) via LLM aggregation.
- Why it matters: Size plus structure gives models the variety and clarity needed to learn actions broadly and accurately. 🍞 Anchor: From a single “almond butter” video, you get many labeled segments like “roast almonds” or “press almonds with tamper,” all linked in a tree.
— New Concept 7 — 🍞 Hook: When you write an essay, revising it a few times makes it clearer and more correct. 🥬 The Concept: Self-Refine (used during LLM aggregation).
- What it is: Multi-round self-review that improves the structured labels.
- How it works: Draft → analyze → fix inconsistencies → finalize; it uses evidence across the Tree-of-Captions and metadata.
- Why it matters: Reduces errors from any single caption model and keeps labels faithful to what’s visible. 🍞 Anchor: If one caption says “sugar” but five others don’t, the LLM removes “sugar” in the final action description.
— New Concept 8 — 🍞 Hook: If a class has too many examples of “add water” but only a few of “tighten bolt,” the class won’t learn bolts well. 🥬 The Concept: Semantic resampling.
- What it is: A way to balance training data by grouping similar action texts and sampling them more evenly.
- How it works: Embed action phrases, deduplicate exact repeats, cluster semantically, then sample uniformly across clusters.
- Why it matters: Prevents the model from overfitting to common actions and helps it learn rare but important ones. 🍞 Anchor: Instead of seeing “stir soup” a million times and “splice wire” hardly ever, the model sees a healthier mix.
— New Concept 9 — 🍞 Hook: Imagine a translator who speaks both “video” and “text” so that any phrase can match the right clip. 🥬 The Concept: VL-JEPA (the model trained on Action100M).
- What it is: A vision–language model that aligns video embeddings with text embeddings so you can do open-vocabulary recognition and retrieval.
- How it works: It uses a video encoder (from V-JEPA 2), a text encoder, and a contrastive objective (InfoNCE) to pull matching video–text pairs together.
- Why it matters: Once aligned, you can type any action and find matching clips, even for actions the model never saw labeled by humans. 🍞 Anchor: Type “assemble a shelf” and VL-JEPA ranks videos where a person fits boards and screws them together at the top.
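For the contrastive objective mentioned above, here is a minimal InfoNCE-style sketch in PyTorch. The real training recipe involves staged schedules and much larger batches; only the pull-together, push-apart logic is shown, and the temperature value is an assumption.

```python
# Minimal InfoNCE-style contrastive loss sketch (temperature is an assumption).
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """video_emb, text_emb: (batch, dim) embeddings of matching video-text pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                    # all pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matching pairs sit on the diagonal; penalize both video->text and text->video.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

print(info_nce(torch.randn(8, 512), torch.randn(8, 512)).item())
```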
The result is a practical, scalable recipe: hierarchical segments + multi-level captions + careful text aggregation + balanced sampling = stronger, broader, open-vocabulary action understanding.
03 Methodology
High-level Recipe: Input video → Stage 1: Hierarchical temporal segmentation → Stage 2: Multi-level captioning (frames + segments) → Stage 3: LLM aggregation with Self-Refine → Output: structured, open-vocabulary action annotations → Train VL-JEPA.
Stage 1: Temporal Segmentation (What/Why/Example)
- What happens: The pipeline samples video frames (1 out of every 4), processes overlapping windows of 64 frames with V-JEPA 2 (ViT-g-384), averages spatial tokens into per-frame features, and stitches overlapping windows into a consistent feature timeline. Then it runs hierarchical agglomerative clustering with a temporal-neighbor constraint and Ward linkage to merge only adjacent segments while minimizing within-segment variance. Segments under 0.5 seconds are discarded.
- Why this step exists: It discovers action moments at multiple time scales—tiny motions (e.g., “press button”) and longer steps (e.g., “blend until smooth”). Without it, we would either miss fine-grained motions or lose big-step structure.
- Example: In the almond butter video, the timeline separates “spread almonds on tray,” “roast,” “cool,” “blend,” and “pour into jar,” with smaller leaves like “insert tamper,” all connected in a tree.
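A rough sketch of the temporally constrained clustering idea, using scikit-learn’s agglomerative clustering with Ward linkage and a neighbors-only connectivity matrix. Random features stand in for the V-JEPA 2 per-frame features, and a single flat cut stands in for the multi-level hierarchy the pipeline actually builds.

```python
# Sketch: only temporally adjacent frames may merge (connectivity matrix),
# and Ward linkage keeps within-segment variance low. Random features stand
# in for V-JEPA 2 features; one flat cut stands in for the full hierarchy.
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

n_frames, dim = 120, 32
features = np.random.randn(n_frames, dim)

# Each frame is connected only to its temporal neighbors.
connectivity = diags([1, 1, 1], offsets=[-1, 0, 1], shape=(n_frames, n_frames))

labels = AgglomerativeClustering(
    n_clusters=8, linkage="ward", connectivity=connectivity
).fit_predict(features)

# Contiguous runs of the same label are the discovered segments.
cuts = [0] + [i for i in range(1, n_frames) if labels[i] != labels[i - 1]] + [n_frames]
segments = list(zip(cuts[:-1], cuts[1:]))
print(segments)
```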
— Sandwich Recap: Hierarchical Temporal Segmentation — 🍞 Hook: Slicing a big cake into bite-size pieces so everyone gets the right portion. 🥬 Concept: Break a video into a tree of contiguous segments from tiny to large by clustering frame features over time.
- How: Features → cluster neighbors → merge to reduce variance → produce a hierarchy.
- Why: Multi-scale clarity of actions. 🍞 Anchor: “Crack egg” (leaf) sits inside “make batter” (branch) inside “bake cake” (trunk).
Stage 2: Caption Generation (What/Why/Example)
- What happens: For each leaf node, extract its midpoint frame and caption it with Llama-3.2-Vision-11B (“Describe this image in detail.”, up to 1024 tokens). For higher-level nodes, sample 32 frames across the segment at 320px and caption with Perception-LM-3B (“Describe this video in detail.”, up to 1024 tokens). Store all captions in a Tree-of-Captions, aligned from leaf to root.
- Why this step exists: Frame captions give precise object/pose details; segment captions describe temporal evolution. Together they provide evidence at multiple scales for later reasoning.
- Example: Frame caption: “Roasted almonds on a parchment-lined tray.” Segment caption: “The presenter roasts almonds, lets them cool, and prepares them for blending.”
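A sketch of how the captioning pass could walk the segmentation tree, assuming the `CaptionNode` class from the Tree-of-Captions sketch earlier. The `load_frames`, `caption_image`, and `caption_video` functions are placeholder stubs, not real library calls; only the prompts and frame counts mirror the description above.

```python
# Sketch of Stage 2, assuming the CaptionNode class defined earlier.
# load_frames / caption_image / caption_video are placeholder stubs.
def load_frames(video_path, start, end, num_frames=1, short_side=320):
    """Placeholder: decode `num_frames` frames between start and end seconds."""
    return [f"frame@{start + i * (end - start) / max(num_frames, 1):.1f}s"
            for i in range(num_frames)]

def caption_image(frame, prompt):
    """Placeholder for the image captioner (a Llama-3.2-Vision wrapper here)."""
    return f"[image caption of {frame}]"

def caption_video(frames, prompt):
    """Placeholder for the video captioner (a Perception-LM wrapper here)."""
    return f"[video caption over {len(frames)} frames]"

def populate_captions(node, video_path):
    if not node.children:                         # leaf: caption the midpoint frame
        mid = (node.start + node.end) / 2
        frame = load_frames(video_path, mid, mid, num_frames=1)[0]
        node.frame_caption = caption_image(frame, "Describe this image in detail.")
    else:                                         # higher node: caption the whole span
        frames = load_frames(video_path, node.start, node.end, num_frames=32)
        node.segment_caption = caption_video(frames, "Describe this video in detail.")
        for child in node.children:
            populate_captions(child, video_path)

# Usage with the CaptionNode tree from the earlier sketch:
# populate_captions(root, "almond_butter.mp4")
```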
— Sandwich Recap: Tree-of-Captions — 🍞 Hook: Scene notes plus chapter summaries make a better book report. 🥬 Concept: Multi-level captions for each node in the segmentation tree.
- How: Leaf frames → image captions; longer segments → video captions; organize hierarchically.
- Why: Cross-level evidence reduces mistakes and preserves context. 🍞 Anchor: Leaves mention “tamper” and “Vitamix,” while the parent summarizes “blend almonds into butter.”
Stage 3: LLM Aggregation with Self-Refine (What/Why/Example)
- What happens: For each segment (≥4s), the pipeline provides the node’s captions, its children’s captions, root-level context, and available metadata (title, description, ASR) to a reasoning LLM (GPT-OSS-120B). The LLM outputs five fields: brief action, detailed action, actor, brief caption, detailed caption. It performs three rounds of Self-Refine: initial draft → analysis and correction → final clean JSON.
- Why this step exists: Different captioners can disagree or hallucinate. Aggregation with iterative self-checking makes the final labels accurate and consistent with visible evidence.
- Example: Final labels for a mid-level segment might be: brief action “blend almonds,” detailed action “Blend roasted almonds on high speed while pressing down with a tamper until creamy,” actor “a woman home cook,” brief caption “She blends cooled roasted almonds,” detailed caption “After cooling the roasted almonds, she transfers them to a high-speed blender, uses a tamper to push them down, and keeps blending until a thick butter forms.”
— Sandwich Recap: Self-Refine — 🍞 Hook: Proofreading your own essay to catch mistakes. 🥬 Concept: Multi-round self-review to fix errors and enforce evidence.
- How: Draft → compare with multi-level captions and metadata → correct → finalize.
- Why: Reduces hallucinations and improves faithfulness. 🍞 Anchor: Removing “adds sugar” because no evidence supports it in any caption or frame.
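Here is a minimal sketch of the aggregation-plus-Self-Refine loop. The `call_llm` function is a placeholder for the reasoning model, and the prompts are illustrative rather than the paper’s actual ones; only the draft-then-review structure and the five output fields follow the description above.

```python
# Sketch of Stage 3: draft structured labels from Tree-of-Captions evidence,
# then self-review for a few rounds. call_llm and the prompts are placeholders.
import json

FIELDS = ["brief_action", "detailed_action", "actor", "brief_caption", "detailed_caption"]

def call_llm(prompt: str) -> str:
    """Placeholder: return the reasoning model's JSON answer."""
    return json.dumps({f: "" for f in FIELDS})

def aggregate_segment(node_captions, child_captions, root_context, metadata, rounds=3):
    evidence = json.dumps({
        "segment_captions": node_captions,
        "child_captions": child_captions,
        "root_context": root_context,
        "metadata": metadata,               # title, description, ASR if available
    }, indent=2)
    draft = call_llm(f"Using only this evidence, produce JSON with fields {FIELDS}:\n{evidence}")
    for _ in range(rounds - 1):             # self-review: drop unsupported claims
        draft = call_llm(
            "Review the draft against the evidence. Remove anything the captions do not "
            f"support, fix inconsistencies, and return clean JSON.\nDraft:\n{draft}\nEvidence:\n{evidence}"
        )
    return json.loads(draft)

labels = aggregate_segment(["She blends roasted almonds."], ["Almonds spin in a blender."],
                           "Making almond butter at home.", {"title": "Almond Butter Recipe"})
print(sorted(labels))
```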
Secret Sauce: Why this pipeline is clever
- Text-first reasoning: Pushes heavy logic from pixels to words, which is cheaper and easier to scale.
- Cross-level evidence: Leaf and parent captions cross-check each other to reduce hallucinations.
- Hierarchical structure: Mirrors real action structure, improving both precision and coherence.
- Open-vocabulary labels: Enables generalization to new actions.
— Sandwich Recap: Semantic Resampling — 🍞 Hook: If you always practice easy songs, you won’t learn the tricky parts. 🥬 Concept: Balance training data by grouping similar action texts and sampling evenly across groups.
- How: Embed action phrases, deduplicate, cluster (k-means), then sample uniformly across clusters.
- Why: Prevents overfitting to frequent actions and boosts learning on rare ones. 🍞 Anchor: See “wire splicing” more often even though it’s rarer than “stir soup.”
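A small sketch of semantic resampling under simple assumptions: TF-IDF stands in for the real text embedder, and the cluster and sample counts are toy values.

```python
# Sketch of semantic resampling: dedupe exact repeats, cluster action phrases,
# then sample uniformly across clusters. TF-IDF is a stand-in embedder.
import random
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def resample(actions, n_clusters=3, per_cluster=2, seed=0):
    unique = sorted(set(actions))                           # deduplicate exact repeats
    embeddings = TfidfVectorizer().fit_transform(unique)    # placeholder embeddings
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    clusters = defaultdict(list)
    for phrase, label in zip(unique, labels):
        clusters[label].append(phrase)
    rng = random.Random(seed)
    # Uniform over clusters: rare action families get as many slots as common ones.
    return [rng.choice(members) for members in clusters.values() for _ in range(per_cluster)]

actions = ["stir soup"] * 5 + ["stir the soup", "splice wire", "tighten bolt", "crack egg"]
print(resample(actions))
```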
Training VL-JEPA (What/Why/Example)
- What happens: Three stages. Stage 1: image-only pretraining (single frame) with large image–text data (DataComp-1B, YFCC-100M with recaptions). Stage 2: video pretraining with 8 frames on Action100M (mixing brief/detailed actions and captions), continuing from Stage 1. Stage 3: longer 32-frame inputs and unfreezing the V-JEPA 2 encoder (lower LR, gradient accumulation) to better capture motion.
- Why this step exists: Stage 1 builds strong visual–text alignment; Stage 2 injects action and motion knowledge; Stage 3 deepens temporal understanding.
- Example: The model learns that “pour batter” looks different from “stir batter,” and learns to map both phrases to the right moments.
— Sandwich Recap: VL-JEPA — 🍞 Hook: A bilingual friend who understands both videos and words. 🥬 Concept: A video–text model that aligns visual and textual embeddings for open-vocabulary tasks.
- How: Video encoder (from V-JEPA 2) + text encoder + contrastive learning (InfoNCE) to pull matching pairs together.
- Why: Lets you type any action phrase and find relevant video segments or classify actions zero-shot. 🍞 Anchor: Query “tighten a bolt with a wrench” and retrieve clips showing that motion even if the exact wording wasn’t in training labels.
04 Experiments & Results
The Test: What they measured and why
- Zero-shot action recognition (Top-1 accuracy) on eight diverse benchmarks: Something-Something v2, EPIC-KITCHENS-100, EgoExo4D Keysteps, Kinetics-400, COIN (both step recognition and whole-task recognition), and CrossTask (both step and task). This checks if the model can recognize actions it wasn’t explicitly trained to choose from.
- Zero-shot text-to-video retrieval (Recall@1) on eight datasets: MSR-VTT, ActivityNet, DiDeMo, MSVD, YouCook2, PVD-Bench, Dream-1K, VDC-1K. This checks if a text query can find the right video quickly.
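For reference, here is how the two headline metrics can be computed from a video-text similarity matrix produced by any aligned encoder; this is a generic sketch, not the paper’s evaluation code.

```python
# Generic zero-shot metrics over a similarity matrix (rows: videos, cols: texts).
import numpy as np

def top1_accuracy(sim: np.ndarray, true_label_idx: np.ndarray) -> float:
    """Zero-shot recognition: each video picks the highest-scoring class-name prompt."""
    return float((sim.argmax(axis=1) == true_label_idx).mean())

def recall_at_1(sim: np.ndarray) -> float:
    """Text-to-video retrieval: query i counts as correct if its paired video ranks first."""
    best_video = sim.argmax(axis=0)            # best video for each text query
    return float((best_video == np.arange(sim.shape[1])).mean())

cls_sim = np.random.randn(200, 50)             # 200 videos scored against 50 class prompts
ret_sim = np.random.randn(100, 100)            # 100 videos scored against their paired captions
print(top1_accuracy(cls_sim, np.random.randint(0, 50, size=200)))
print(recall_at_1(ret_sim))
```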
The Competition: Strong baselines
- CLIP (various sizes), SigLIP2, and Perception Encoder—large, powerful vision–language encoders trained on tens of billions of image–text pairs.
Scoreboard with Context
- Despite seeing far fewer total samples (around 3B vs 13–86B for some baselines), VL-JEPA trained on Action100M shows higher average zero-shot action recognition across the eight datasets, with especially strong gains on motion-focused sets like Something-Something v2, EPIC-KITCHENS-100, EgoExo4D Keysteps, and step recognition for COIN/CrossTask. Think of this as getting an A when others with more study hours get a B.
- For text-to-video retrieval, VL-JEPA achieves higher average Recall@1 than CLIP, SigLIP2, and Perception Encoder across the eight benchmarks, reflecting strong video–text alignment learned from the detailed captions in Action100M. That’s like being first to buzz in with the right answer more often than the top quiz teams.
Scaling Behavior
- More Action100M training consistently improves zero-shot action recognition. Curves rise smoothly as effective data increases, showing steady returns rather than plateauing early.
- Stage transitions matter: a big jump from Stage 1 (image-only) to Stage 2 (8-frame video) shows that motion data is essential for action understanding; Stage 3 (32 frames, unfrozen encoder) adds another boost by capturing longer temporal context.
Ablations: What parts help most?
- LLM-aggregated brief actions outperform direct pseudo-labeling from a single captioning model (PLM-3B), validating the Tree-of-Captions + Self-Refine approach.
- Detailed captions from Action100M generally beat alternative LLM-caption datasets (e.g., PLM-Video-Auto) on most benchmarks, and Action100M is far larger.
- Adding egocentric atomic-action data especially boosts performance on egocentric datasets (EK-100, EgoExo4D), highlighting the benefit of domain-specific supplements.
Surprising/Notable Findings
- Even with lower input resolution and fewer total training samples than some baselines, VL-JEPA wins on average, suggesting that action-focused, hierarchical, and refined supervision provides quality that quantity alone can’t match.
- Semantic resampling improves average zero-shot accuracy when using fewer clusters (coarser groupings), indicating that balancing broad action families beats over-fragmentation when curating smaller training subsets.
Takeaway
- The pipeline’s multi-level, refined, and massive-scale labels translate into meaningfully better zero-shot performance in both recognition and retrieval, especially for actions that hinge on motion and procedure.
05 Discussion & Limitations
Limitations
- Domain skew: Source videos are instructional, so social, abstract, or non-procedural actions are underrepresented.
- Long-tail still long: Even with semantic resampling, real-world action frequencies remain imbalanced; very rare actions may still be hard.
- Automated annotations: Quality depends on the upstream captioners and the reasoning LLM; some hallucinations or vague labels remain.
- Temporal granularity: Segments under 4 seconds are dropped at aggregation; micro-actions shorter than that may be missed in final labels.
- Bias and copyright: Internet-mined data can encode cultural or regional biases; video availability and ASR coverage vary.
Required Resources
- Compute: Building the dataset used ~1.6M GPU hours across V100/H100/H200; training VL-JEPA also requires substantial GPUs and memory, especially at 32-frame inputs.
- Storage: The annotations plus structure take ~205 GB; storing source videos and features requires far more.
- Tooling: Access to V-JEPA 2, Llama-3.2-Vision, Perception-LM-3B, a strong reasoning LLM (GPT-OSS-120B), and clustering infrastructure.
When NOT to Use
- Emotion-only or social-interaction analyses without physical manipulation, since the data is action/procedure-centric.
- Ultra-long conversational or narrative videos where speech content, not visible action, carries meaning.
- Privacy-sensitive settings without appropriate filtering, as internet videos may include sensitive contexts.
Open Questions
- Multimodal fusion: How much do audio, force, or 3D cues boost action understanding when added to this pipeline?
- Better balancing: Can we design principled sampling schedules that adapt during training to close long-tail gaps faster?
- Quality metrics: How to automatically score label faithfulness at scale without human audits?
- Multilingual expansion: Can we align actions across languages to broaden global coverage?
- Planning and anticipation: How well do models trained on Action100M predict the next action or plan multi-step tasks in robots and AR assistants?
06 Conclusion & Future Work
Three-Sentence Summary
- Action100M converts millions of how-to videos into a massive, hierarchical, text-first dataset of about 100 million action segments with structured labels.
- A pipeline of temporal segmentation, multi-level captioning, and LLM aggregation with Self-Refine produces open-vocabulary action annotations at scale.
- Training VL-JEPA on Action100M yields consistent scaling gains and strong zero-shot performance in action recognition and text-to-video retrieval, especially for motion- and step-centric tasks.
Main Achievement
- Proving that a hierarchical, text-first, self-refined pipeline can produce high-quality, open-vocabulary action supervision at internet scale—and that such supervision beats larger-but-noisier alternatives.
Future Directions
- Extend to action anticipation and long-horizon planning; integrate audio/force/3D; improve semantic resampling; expand to multilingual captions; explore robot learning from the same labels.
Why Remember This
- It shows a practical path to grow action understanding: structure the time, write the story at multiple levels, let a careful reader reconcile it, and then teach a model to speak both video and language. This recipe makes action learning broader, cheaper, and more faithful to how people actually do things.
Practical Applications
- AR step-by-step helpers that watch your hands and guide you through repairs, cooking, or crafts.
- Home robots that recognize tasks like tidying, loading dishwashers, or sorting laundry and execute plans safely.
- Smarter video search that jumps to the exact step you need (e.g., “remove bike wheel” at minute 3:12).
- Coaching tools for sports or music practice that spot specific motions and suggest corrections.
- Industrial training systems that track procedural compliance on assembly lines and flag missed steps.
- Education platforms that auto-generate lesson snippets aligned with curriculum verbs (e.g., “measure,” “mix,” “test”).
- Kitchen assistants that recognize stages of a recipe and adjust timers or prompts accordingly.
- Quality control in manufacturing via action verification (e.g., “tighten to torque” vs. “hand-tighten only”).
- Video editing tools that auto-tag and assemble how-to highlights from long recordings.
- Robotics research pipelines that learn multi-step policies from open-vocabulary action labels.