Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
Key Summary
- This paper teaches AI to learn how-to steps from demonstrations in the moment, the way people do.
- It introduces Demo-driven Video In-Context Learning, where a model watches a demo (text or video) and answers next-step questions about a new, related video.
- A new benchmark, Demo-ICL-Bench, has 1,200 carefully built questions from instructional YouTube videos to test this skill.
- The model, Demo-ICL, uses two training stages: video-supervised fine-tuning to see details in videos, then information-assisted preference learning to get better at using demos.
- The method works much better than many strong baselines, especially when demos are provided.
- Text demonstrations help more than video ones right now, showing that learning from moving pictures is still hard for AI.
- Picking the right demonstration from a pool is tough, but it is key for real-life use where perfect demos are not handed to you.
- Adding helpful hints like timestamps and short text guidance during training makes preference learning cleaner and improves answers.
- Results suggest a path toward robots and assistants that can quickly pick up new procedures by watching short demos.
Why This Research Matters
Real people learn new tasks fast by watching short demos; building AI that can do the same unlocks helpful assistants at home, school, and work. With demo-driven learning, a model can adapt to your specific way of doing things, not just a generic method it memorized long ago. That means better safety and precision for tasks like cooking, DIY, and lab procedures. In the long run, robots could learn new household or factory tasks in minutes, just from a few examples. Teachers and tutors could provide a quick demonstration and have AI guide students through the next steps, personalized to each class. Customer support and maintenance could improve, as AI watches your setup and suggests the correct next action for your exact device. This research is a stepping stone toward flexible, human-like learning from the world around us.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you are learning to make pancakes. If you have a short video showing each step, you can follow along and cook even if you never did it before. That is how people learn fast in the real world.
🥬 Filling (The Actual Concept): Before this work, most video AIs answered questions with either facts stored in their heads or what they could see in a single video clip. They were not good at watching a few examples and then adapting to a new but related video on the fly. How it works in the field so far:
- Big multimodal models connect text and visuals and can answer many questions.
- Video benchmarks often ask about objects, scenes, or basic actions in one target video.
- Some methods retrieve info or show chain-of-thought, but they mostly treat context like a reference book instead of a lesson to learn from. Why it matters: Without the ability to learn from demonstrations, assistants and robots cannot pick up new tasks quickly. They would need to be retrained for everything or rely only on memorized knowledge.
🍞 Bottom Bread (Anchor): Think of a robot in a kitchen. If you show it a short demo of your unique way to make Mexican rice, it should learn your version and finish the next steps in your new video, not just guess from generic rice knowledge.
— 🍞 Top Bread (Hook): You know how a helpful friend can look at pictures and read instructions at the same time when building LEGO? That is like an AI that understands many kinds of inputs.
🥬 Multimodal Large Language Models (MLLMs): What it is: MLLMs are AI systems that understand and connect text, images, audio, and video. How it works:
- A visual encoder turns frames into features.
- A language model reasons over those features and text.
- The model produces an answer in natural language. Why it matters: If an AI cannot connect visuals with text, it misses the meaning of steps shown in videos.
🍞 Bottom Bread (Anchor): When asked what happens after pouring batter, an MLLM can look at the pancake video frames and the text instructions to reply: cook one side until golden, then flip.
— 🍞 Top Bread (Hook): Imagine a coach pauses game footage to teach players where to move next.
🥬 Video-supervised fine-tuning: What it is: A training step that uses lots of labeled or curated video examples to sharpen a model’s sense of time and detail. How it works:
- Feed many short and long videos with questions and answers.
- Train the model to align what it sees with the right words.
- Emphasize temporal clues so it knows before, now, and next. Why it matters: Without this, the model cannot reliably spot fine-grained steps needed for procedures.
🍞 Bottom Bread (Anchor): After fine-tuning, the model can tell the difference between heating oil and frying onions, which look similar at a glance.
—
The world before: Video AIs were strong at spotting objects or recalling fixed facts. But they struggled to learn a brand-new recipe style or tool usage just by watching a quick demo.
The problem: We need AI that can learn and adapt from a few in-context demonstrations and then apply that knowledge to a new target video.
Failed attempts: Retrieval-augmented methods and chain-of-thought help with grounding and reasoning, but they do not truly learn procedures from the given demo. Zero-shot models often ignore the provided demo and rely on prior memory instead.
The gap: No benchmark directly tested demo-driven learning from text and video, and training pipelines were not tailored to teach models to use demonstrations step by step.
Real stakes: Faster onboarding for household robots, safer DIY help, personalized tutoring, and workers who can upskill quickly by watching short guides.
02 Core Idea
🍞 Top Bread (Hook): Imagine you watch a short clip showing how to tie a special knot. Right after, someone hands you a different rope and asks you to continue. If you learned from the demo, you can do the next move.
🥬 The Aha Moment: What it is: Let the model learn procedures from an in-context demonstration (text or video) and then answer next-step questions about a new, related video. How it works (high level):
- Provide a demonstration (text steps or a similar video).
- Show a target video up to a certain point.
- Ask what comes next or a related procedural question.
- Train the model in two stages so it actually uses the demo. Why it matters: Without learning from the demo, the model falls back on generic memory and often gets steps wrong for the specific task version at hand.
🍞 Bottom Bread (Anchor): In Mexican rice, after heating oil, some versions add tomato purée next. The model must follow the in-context demo’s version, not a random recipe from memory.
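The setup above can be pictured as a query-assembly step: demonstration in, partial target video in, next-step question out. The sketch below is purely illustrative; the `build_icl_query` helper and its field names are assumptions, not the paper's actual interface.

```python
# Minimal sketch of assembling a demo-driven in-context query.
# All names here are illustrative, not the paper's actual API.

def build_icl_query(demo_steps, target_frames, cutoff):
    """Combine a text demo with a partially watched target video.

    demo_steps: ordered step descriptions from the demonstration.
    target_frames: frame descriptions of the target video.
    cutoff: index up to which the target video has been watched.
    """
    return {
        "demonstration": [f"Step {i+1}: {s}" for i, s in enumerate(demo_steps)],
        "target_so_far": target_frames[:cutoff],
        "question": "Based on the demonstration, what is the next step?",
    }

demo = ["heat oil in a pan", "add tomato puree", "stir in the rice"]
query = build_icl_query(demo, ["pan on stove", "oil shimmering"], cutoff=2)
```

The key point the structure makes explicit: the model must ground its answer in the `demonstration` field, not in memorized generic recipes.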
— Multiple analogies:
- Recipe analogy: The demo is a recipe card; the target video is your cooking in progress. The model uses the card to decide the next step.
- Sports playbook: The demo is a practice drill; the game clip is live play. The model matches the drill to predict the next move.
- LEGO guide: The demo is the page you just saw; the new model is partly built. The AI figures out which brick goes next.
Before vs After:
- Before: Models answered with stored knowledge or surface clues, often ignoring the provided demonstration.
- After: The model actively learns from the given demo and transfers that knowledge to a new but related video.
Why it works (intuition):
- Stage 1 teaches careful seeing across time so the model recognizes fine-grained steps.
- Stage 2 uses preference learning with helpful hints to reward answers that truly use the demonstration, not guesses.
- Together, they nudge the model to align the target video’s current step with the demo’s plan and predict the correct next action.
Building blocks (each with a mini Sandwich):
🍞 Demo-driven Video In-Context Learning: You know how you copy a teacher’s example to solve a new problem?
- What it is: A setting where the model learns a procedure from demos and then answers questions about a new video.
- How it works: Give a demo; show the partial target video; ask what comes next; grade whether it used the demo.
- Why it matters: Real-life tasks vary, so learning from the provided context is crucial.
- Anchor: After watching a demo of a specific lawn-laying method, the model advises joining grass pieces next in a related target video.
🍞 Demo-ICL-Bench: Imagine a fair obstacle course made to test how well you learn from examples.
- What it is: A benchmark of 1,200 tasks from instructional videos that require demo-driven learning.
- How it works: Curated demos (text or video), target videos, and questions that force using the demo.
- Why it matters: Without a tough, clear test, we cannot tell if models truly learn from context.
- Anchor: Human accuracy is far above models, and strong baselines still struggle, proving the task is hard and meaningful.
🍞 Information-assisted Direct Preference Optimization (DPO): Think of judges giving medals not just for a good jump, but for following the routine.
- What it is: A preference-learning method where better answers (that use the demo) are preferred, and assistive hints make judging cleaner.
- How it works: Generate multiple answers; add hints like timestamps or brief guidance; a reward model prefers the answer that aligns with the demo; train the model to produce preferred answers.
- Why it matters: Without hints, the model and judge can be noisy, slowing learning.
- Anchor: Adding timestamps to text-demo tasks improved how well the model matched steps and chose the correct next action.
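The preference objective underneath this stage can be written compactly. The function below is the standard DPO loss in simplified scalar form, not the paper's exact information-assisted variant; log-probabilities are taken as given numbers rather than computed from a model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on one preference pair (simplified sketch).

    pi_*: log-probabilities of the chosen/rejected answers under the
    model being trained; ref_*: the same under a frozen reference model.
    The loss is low when the trained model favors the demo-following
    (chosen) answer more strongly than the reference model does.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# With no preference margin, the loss sits at log(2) ~ 0.693.
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen answer's likelihood relative to the rejected one lowers it.
loss_better = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

Assistive hints (timestamps, brief guidance) enter upstream of this loss: they make the chosen/rejected labels less noisy, so the gradient pushes in a more consistent direction.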
🍞 Text-demo, Video-demo, and Demonstration Selection: Like learning from a recipe card, from a cooking video, or first finding the right video to learn from.
- What it is: Three settings that test learning from text, from video, and from choosing the right video.
- How it works: Provide the appropriate context; ask the next-step question; check both retrieval and reasoning.
- Why it matters: Real life often starts with a search to find a good demo, then learning, then doing.
- Anchor: Picking the right Mexican rice demo among rice, pasta, and noodle videos leads to a correct next-step answer in the target clip.
03 Methodology
At a high level: Input (demonstrations plus a target video) → Understand the demo’s steps → Align the target video’s current point to the demo → Predict the next action → Train the model to prefer answers that follow the demo.
Step-by-step recipe with Sandwich explanations embedded:
- Building the learning playground (Demo-ICL-Bench) 🍞 Hook: Imagine a science fair with carefully designed challenges that test a real skill, not just memorization. 🥬 What it is: A benchmark of 1,200 tasks from instructional YouTube videos that require using in-context demos. How it works:
- Collect videos with subtitles and timestamps.
- Summarize subtitles into clean step lists for text demos.
- Pair similar videos for video demos and assemble distractors for selection tasks. Why it matters: The tasks are crafted so you must use the demonstration; guessing or generic memory fails often. 🍞 Anchor: From a pancake video’s steps, a question asks what happens after pouring batter; the correct next step is cooking for about 2 minutes before flipping.
- Stage 1 training: Video-supervised fine-tuning 🍞 Hook: Like drills in sports, you practice core moves so you can execute precisely during the game. 🥬 What it is: Teach the model detailed video understanding and general in-context reasoning using large, diverse video and image-text data. How it works:
- Train on mixed visual-language datasets and instructional video sets.
- Explicitly include samples that mimic demo-driven tasks.
- Emphasize temporal cues and align visual slices with step text. Why it matters: Without this foundation, the model cannot spot subtle step boundaries or maintain the order of actions. 🍞 Anchor: After Stage 1, the model more reliably distinguishes similar actions such as heating oil vs. frying onions.
- Stage 2 training: Information-assisted DPO (the secret sauce) 🍞 Hook: When judges in a contest also show you what they look for, you improve faster. 🥬 What it is: A preference-learning stage where the model learns to choose answers that actually use the demo, helped by small hints. How it works:
- For text-demo tasks, provide timestamps to anchor steps.
- For video-demo tasks, pair the demo with brief text guidance.
- Generate multiple answers; a reward model prefers the one best aligned with the demo; train the model to reproduce preferred answers. Why it matters: Without assistive info, preference data is noisy, and the model often ignores the demo. 🍞 Anchor: With timestamps, the model correctly maps the target clip to Step 2 and predicts Step 3: add tomato purée.
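Construction of one information-assisted preference example might look like the sketch below. This is an assumed data format, not the paper's actual pipeline; the point is that timestamps anchor each demo step so the reward model can judge which candidate answer actually follows the demonstration.

```python
def build_preference_pair(demo_steps, timestamps, chosen, rejected):
    """Assemble one information-assisted preference example (illustrative).

    Timestamps are the assistive signal: they tie each demo step to a
    moment in the video, making the chosen/rejected judgment cleaner.
    """
    hinted = [f"[{t}] {s}" for t, s in zip(timestamps, demo_steps)]
    return {"prompt": hinted, "chosen": chosen, "rejected": rejected}

pair = build_preference_pair(
    demo_steps=["heat oil", "add tomato puree", "stir in rice"],
    timestamps=["0:10", "0:45", "1:30"],
    chosen="Next, add the tomato puree, as in the demonstration.",
    rejected="Next, add soy sauce.",  # plausible-sounding but ignores the demo
)
```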
- The three task pipelines a. Text-demo In-Context Learning 🍞 Hook: Like following a neat recipe card while watching your own cooking. 🥬 What it is: Use summarized step-by-step text as the demonstration. How it works:
- Read the text steps.
- Watch the target video up to now.
- Align the current step and output the next step. Why it matters: Teaches the model to use structured instructions from the context. 🍞 Anchor: After seeing heating oil in the target, and reading the recipe’s steps, the model says the next step is to add tomato purée.
b. Video-demo In-Context Learning 🍞 Hook: Like learning by watching someone else cook on video, then continuing your own dish. 🥬 What it is: Use a similar procedure video as the demonstration. How it works:
- Watch the demo video and target partial video.
- Match their step timelines.
- Predict the next action in the target. Why it matters: Visual transfer is harder than text; success shows deeper understanding and alignment across videos. 🍞 Anchor: After seeing the demo’s Step 2 in action, the model concludes the target’s next action is frying onions and garlic.
c. Demonstration Selection 🍞 Hook: When you search online, you must first pick the right tutorial before you can learn from it. 🥬 What it is: Choose the best demo from a candidate pool, then use it to answer the question. How it works:
- Compare candidate videos to the target task.
- Select the most relevant demo.
- Align and answer the next-step question. Why it matters: In real life, perfect demos are not handed to you; you must find them. 🍞 Anchor: From a pool with Mexican rice, fried rice, and pasta, the model picks Mexican rice and then answers the next-step question correctly.
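The selection step above can be sketched as nearest-neighbor matching over embeddings. This is a simplification under stated assumptions: real systems would embed videos with the model's visual encoder, whereas here embeddings are hand-written toy vectors.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_demo(target_emb, candidates):
    """Pick the candidate demo whose embedding best matches the target task.

    candidates: dict mapping demo name -> embedding vector.
    """
    return max(candidates, key=lambda name: cosine(target_emb, candidates[name]))

# Toy pool: embeddings are illustrative, not from a real encoder.
pool = {
    "mexican_rice": [0.9, 0.1, 0.0],
    "fried_rice":   [0.6, 0.5, 0.1],
    "pasta":        [0.1, 0.2, 0.9],
}
best = select_demo([0.95, 0.05, 0.0], pool)  # target clip: a Mexican rice video
```

Note that picking the right demo is only half the task; as the paper observes, the model must then also attend to the right moments inside it.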
- What breaks without each step
- Without Stage 1, the model misses subtle temporal cues and confuses similar actions.
- Without assistive signals in Stage 2, preference learning is noisy and the model ignores the demo.
- Without selection ability, real-world applicability drops because the model cannot find helpful demos from a pool.
- Secret sauce
- Carefully built benchmark forces true demo use.
- Two-stage training locks in temporal perception first, then rewards demo-following behavior.
- Assistive hints during DPO clean up supervision and boost transfer from demo to target.
04 Experiments & Results
The test: Demo-ICL-Bench challenges models to learn from text or video demos and to select the right demo from a pool before answering next-step questions on a target video. This isolates whether a model truly uses the provided demonstration.
The competition: Strong proprietary models (Gemini 2.5 Pro, GPT-4o) and leading open-source video MLLMs (Qwen2.5-VL, LLaVA-Video, InternVL-3, Ola, VideoChat-R1, Video-R1) were evaluated. Human performance was also measured to show the headroom.
The scoreboard with context:
- Text-demo ICL: Many models improve when text demos are given. Demo-ICL reaches about 43.4% accuracy. Think of this as a solid B when several classmates are still at C levels without demos. Notably, larger models can gain more from demos, showing scale helps with in-context learning.
- Video-demo ICL: Much tougher. Several models barely improve or even get worse with video demos, showing how hard cross-video temporal alignment is. Demo-ICL still benefits, reaching about 32.0%, which is like moving from a struggling grade to a respectable pass on a very hard test.
- Demonstration Selection: Models must pick the right demo from distractors and then answer. Selection accuracy and final QA accuracy are both challenging, and there is a large gap from human performance. Demo-ICL improves but this remains a frontier problem.
Surprising findings:
- Text demos help more than video demos right now. Reading a clean recipe is easier for today’s models than learning from a moving example.
- Adding more frames, giving exact reference clips, or using subtitles and captions as hints can boost video-demo performance, which suggests that models still need help capturing fine temporal structure.
- Even when models pick the correct demo from the pool, they may still fail to focus on the key moments inside it, which explains why information-assisted DPO is so helpful.
Other benchmarks: On general video understanding suites such as VideoMME, MVBench, LongVideoBench, MLVU, and the knowledge acquisition set VideoMMMU, Demo-ICL is competitive with similarly sized models and even surpasses some larger ones on knowledge-focused tasks. This shows the training strategy adds in-context know-how without sacrificing general video skills.
05 Discussion & Limitations
Limitations:
- No special new architecture was introduced; improvements come from the training recipe. While simple and compatible, a purpose-built architecture might push results further.
- Video-demo learning is still much harder than text-demo learning. Temporal alignment and cross-video step matching remain open challenges.
- The method relies on reasonably structured instructional content. Messy, informal videos may reduce performance.
- Demonstration selection from large pools is difficult; models can retrieve the right video yet still miss the crucial segments within it.
Required resources:
- Substantial curated data from instructional sources.
- GPUs for two-stage training and preference optimization; long-context handling for multiple videos and frames.
- LLM or reward models to score candidate answers during preference learning.
When not to use:
- Tasks that do not involve procedures or next-step reasoning.
- Scenarios with extremely noisy, unlabeled footage where neither text nor visual demos can be aligned.
- Real-time robotics with strict latency if long demos must be processed without efficient retrieval.
Open questions:
- How to build architectures that naturally align and compare timelines across videos.
- How to fuse multiple demo types at once, such as text plus video plus diagrams.
- How to scale demonstration selection to huge libraries efficiently and reliably.
- How to learn from sparse or imperfect demos the way people do, including handling variations, mistakes, and personal styles.
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Demo-driven Video In-Context Learning, where models learn procedures from demonstrations and apply them to new target videos. It builds a new benchmark, Demo-ICL-Bench, and proposes Demo-ICL, a two-stage trained model using video-supervised fine-tuning and information-assisted preference learning. Results show clear gains over strong baselines, especially when demonstrations are provided, with text demos currently helping more than video demos.
Main achievement: Turning demonstrations from passive context into active teachers, backed by a rigorous benchmark and a practical training pipeline that rewards demo-following behavior.
Future directions: Design architectures for robust cross-video temporal alignment, integrate multiple demo sources simultaneously, and scale demonstration selection to vast libraries. Explore better ways to summarize and highlight key moments in demos so models focus attention automatically.
Why remember this: It moves video AI closer to how people learn—by watching a few examples and then doing the next step correctly. That shift opens the door to assistants and robots that can pick up new procedures quickly and safely in everyday life.
Practical Applications
- Kitchen assistants that watch your preferred recipe demo and guide you through your exact next step.
- DIY and home repair helpers that learn from a short tutorial and adapt instructions to your tools and materials.
- Workplace training bots that watch a procedure video and coach new employees on the next action safely.
- Robotic manipulation that learns task variations from a handful of video demonstrations without retraining.
- Personalized tutoring that uses a solved example and then guides students through the next similar problem.
- Medical or lab workflow support that follows validated procedure demos to ensure correct step order.
- Customer support that selects the right tutorial video from a library and leads you through device-specific fixes.
- Sports practice analysis that learns from a drill demo and suggests your next move in real play.
- Manufacturing QA that compares a demo of the correct assembly with a live feed and flags the next correction.
- AR overlays that show the next step in context after watching a short how-to demo.