Streaming Video Instruction Tuning
Key Summary
- Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak, all while a video is still playing.
- It unifies decision-making and answering inside one model using three special states: Silence, Standby, and Response.
- A new 465K-sample dataset (Streamo-Instruct-465K) teaches the model many streaming tasks at once, like narration, event grounding, and time-sensitive Q&A.
- Training uses a clever weighted loss (focal loss + frequency balancing) so the model learns rare but important moments to reply.
- Compared to previous online systems, Streamo is much better at timing and accuracy on the OVO-Bench benchmark (up to +13.83% average gain).
- Even though it's built for streaming, Streamo also stays strong on regular offline video benchmarks.
- A new test suite, Streamo-Bench, checks if models can follow diverse instructions in real time, not just answer multiple-choice questions.
- Streamo can be trained at 1 fps and still work even better at 2 fps without retraining, showing strong generalization.
- This work moves video AIs from "watch the whole clip and answer once" to "respond at the right moment as life unfolds".
- It opens doors for safer driving aids, better accessibility narrations, faster sports highlights, and smarter home assistants.
Why This Research Matters
Real life doesn't pause, and Streamo is built for that: it speaks up at the right moment while the world is still moving. This can make roads safer with timely alerts, support people who are blind with live scene narration, and help parents or workers notice important changes without staring at screens. Sports fans can get instant highlights, and cooks can be told when a step is really done. Because Streamo also stays strong on offline tasks, teams don't have to choose between live performance and general understanding. By uniting timing and content in one brain, this work nudges AI closer to truly helpful, real-time assistants for everyday life.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): Imagine you're watching a parade live. You don't know what float is coming next, but you still want your friend to describe exciting moments right when they happen, not five minutes later.
Filling (The Actual Concept):
- What it is: Continuous Video Streams are live, never-ending videos where frames arrive one after another, like a live parade.
- How it works:
- New frames arrive every moment.
- The AI only sees the past and the now; it never sees the future.
- The AI must decide, in real time, whether to talk now or wait for more.
- Why it matters: Without handling continuous streams, an AI would miss key moments or speak too late, which breaks real-time help.
Bottom Bread (Anchor): A traffic camera sees a light turn green. A good streaming AI should announce "Green!" right then, not after the whole clip ends.
Before this paper, many video AIs were great at offline understanding: they watched the entire, pre-recorded video and then answered questions or wrote captions. That's like reading the whole book before giving a summary. It works for movie summaries, but not for live events. In real life, you can't pause the world. You need timely updates.
This leads to a bigger need:
Top Bread (Hook): You know how a good camp counselor can lead games, tell stories, and answer questions, all while activities are happening?
Filling (The Actual Concept):
- What it is: Multi-task Instruction Following means the AI can understand different kinds of tasks (like narrate, find an event, answer a changing question) from instructions and do them in the right way, right now.
- How it works:
- Read the instruction (e.g., "Tell me when the dog starts running").
- Watch the video as it streams.
- Match what you see to what was asked.
- Choose the right moment and the right type of answer.
- Why it matters: Without multi-task following, the AI might always give the same kind of answer (like a caption) even when the instruction needs timing, localization, or updates over time.
Bottom Bread (Anchor): If you say, "Update me when the cookie tray is full," the AI should wait, watch cookies being added, and only respond when the tray is actually full.
The problem researchers faced: Offline models wait for the whole video, so they can't decide the best moment to speak while the video is playing. Some teams tried adding a separate controller (a small extra model) to decide when to answer and then call the big model to generate the words. But this split-brain approach had issues:
- The small controller often wasn't smart enough to understand complex, time-based instructions.
- A big controller slowed everything down.
- Splitting decision-making from speaking meant the AI couldn't smoothly adapt as the scene changed.
So what was missing? A single brain that could both decide when to speak and what to say, trained on the right kind of streaming data. Existing datasets mixed labels from many sources with different rules, which confused timing and task style. Models needed consistent examples that teach: when to be silent, when to wait, and when to respond, across many tasks.
Why should anyone care? Think of:
- Safer roads: A dashcam helper that warns right when a pedestrian steps out.
- Accessibility: Real-time narration for people who are blind, describing the world as it happens.
- Sports: On-the-fly highlights, not a summary after the game.
- Home help: An assistant that tells you the instant the toast pops.
- Robotics: A robot that reacts at the right moment, not after re-watching a scene.
This paper builds that missing brain and the training library it needs, so video AIs can finally act like attentive narrators, grounded searchers, and quick responders, live.
02 Core Idea
Top Bread (Hook): Imagine a lifeguard who not only watches the pool but also decides the exact moment to blow the whistle and then immediately explains why.
Filling (The Actual Concept):
- What it is: Streamo is a real-time video model that puts the "when to speak" decision inside the same model that does the "what to say," trained end-to-end on a large, unified streaming dataset.
- How it works:
- Stream frames and an instruction into the model.
- At every moment, predict one of three states: <Silence>, <Standby>, or <Response>.
- When <Response> triggers, immediately generate the answer or caption.
- Keep going as new frames arrive.
- Why it matters: Without unifying decision and response, you either speak too late, speak too often, or miss the moment entirely.
Bottom Bread (Anchor): "Tell me when the popcorn starts popping." Streamo stays silent at first, says <Standby> when kernels swell, and replies exactly when popping starts.
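To make that loop concrete, here is a minimal Python sketch of a Streamo-style per-second cycle. Everything in it (the toy model, the chunk strings, the method names) is an illustrative assumption, not the paper's actual code or API; it only shows the shape of the decide-then-speak loop.

```python
# Minimal sketch of a Streamo-style streaming loop (illustrative only; the toy
# model and helper names below are assumptions, not the authors' implementation).

SILENCE, STANDBY, RESPONSE = "<Silence>", "<Standby>", "<Response>"

class ToyStreamingModel:
    """Stand-in for a streaming VLM that picks one of the three state tokens."""
    def predict_state_token(self, context):
        # A real model scores <Silence>/<Standby>/<Response> from frames + instruction;
        # this toy version just reacts to the most recent chunk's content.
        latest = context[-1]
        if "pop" in latest:
            return RESPONSE
        if "swell" in latest:
            return STANDBY
        return SILENCE

    def generate_answer(self, context):
        return "The popcorn just started popping."

def stream_assistant(chunks, instruction, model):
    context = [f"User: {instruction}"]
    for second, chunk in enumerate(chunks):                   # one chunk per second
        context.append(f"<{second}s-{second + 1}s> {chunk}")  # time-tagged turn
        state = model.predict_state_token(context)            # when to speak
        if state == RESPONSE:
            yield second, model.generate_answer(context)      # what to say, same model
        # on <Silence> or <Standby>: keep watching, emit nothing

chunks = ["quiet kitchen", "kernels heating", "kernels swell", "first kernels pop"]
for t, text in stream_assistant(chunks, "Tell me when the popcorn starts popping.", ToyStreamingModel()):
    print(f"[{t}s] {text}")   # prints once, exactly at the popping moment
```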
Now let's unpack the building blocks with three separate analogies and key concepts.
- Multiple analogies for the same idea:
- Sports announcer: Watches the play (Silence), sees a pass forming (Standby), calls the goal at the exact moment (Response), then explains.
- Chef assistant: Preps quietly (Silence), notices the pan heating (Standby), says "Flip now!" precisely when the pancake is ready (Response).
- Traffic guide: Waits at red (Silence), watches cars slow (Standby), announces "Green, go!" exactly when the light changes (Response).
- Top Bread (Hook): You know how a single, well-coordinated team performs better than two teams that barely talk?
Filling (The Actual Concept):
- What it is: An End-to-End Training Framework is a from-input-to-output process where every part is trained together so the whole system cooperates.
- How it works:
- Convert videos into time-tagged, multi-turn dialogues (e.g., <0s-1s>, <1s-2s>).
- Interleave frames, instructions, and special state tokens in one sequence.
- Train the model to predict state tokens and, when needed, the full answer.
- Optimize everything jointly; no separate controller needed.
- Why it matters: Without end-to-end training, the "when" and the "what" don't learn to coordinate, causing late, early, or missing responses.
Bottom Bread (Anchor): Like rehearsing an orchestra together so the drummer and violinist sync perfectly on the big finish.
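As a rough illustration of that interleaving, the sketch below flattens one annotated clip into a single supervised sequence. The template, tags, and field layout are assumptions for intuition only, not the paper's released data format.

```python
# Sketch: flatten one annotated clip into an interleaved, time-tagged training
# sequence (template and tags are illustrative assumptions, not the real format).

SILENCE, STANDBY, RESPONSE = "<Silence>", "<Standby>", "<Response>"

def build_training_sequence(instruction, per_second_labels):
    """per_second_labels: list of (state, answer_or_None), one entry per second."""
    parts = [f"User: {instruction}"]
    for sec, (state, answer) in enumerate(per_second_labels):
        parts.append(f"<{sec}s-{sec + 1}s>[FRAME TOKENS]")  # visual tokens for this chunk
        parts.append(state)                                 # supervised state token
        if state == RESPONSE and answer:
            parts.append(answer)                            # supervised answer text
    return " ".join(parts)

labels = [(SILENCE, None), (SILENCE, None), (STANDBY, None),
          (RESPONSE, "The light just turned green.")]
print(build_training_sequence("Tell me when the light turns green.", labels))
```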
- Top Bread (Hook): Imagine a library where every recipe shows not just the steps, but exactly when to check the oven.
Filling (The Actual Concept):
- What it is: The Streamo-Instruct-465K Dataset is a large, unified training set with clear time boundaries and instructions across many streaming tasks.
- How it works:
- Re-annotate videos with consistent, time-stamped events and answers.
- Include tasks like real-time narration, action/event captioning, temporal grounding, and time-sensitive QA.
- Attach response timing labels so the model learns when to be silent, wait, or reply.
- Why it matters: Without consistent, time-aware labels, the model can't learn precise response timing or follow different tasks properly.
Bottom Bread (Anchor): The dataset tells the AI, "During 34s-50s, this event happens; answer when it ends," just like a cooking timer.
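For intuition, one record in such a dataset might look roughly like the dictionary below. The field names and values are hypothetical, chosen only to show how an instruction, a time boundary, and a response moment could be tied together.

```python
# Hypothetical shape of one time-aware training record (field names are
# illustrative, not the dataset's actual schema).
sample = {
    "video": "cooking_clip_0413.mp4",
    "task": "event_captioning",
    "instruction": "Describe each event right after it finishes.",
    "events": [
        {
            "start_s": 34,
            "end_s": 50,
            "caption": "The chef whisks the eggs until they turn foamy.",
            "respond_at_s": 50,   # answer when the event ends
        },
    ],
    # Seconds before 34s would be supervised as <Silence>, 34s-49s as <Standby>,
    # and second 50 as <Response> followed by the caption.
}
print(sample["events"][0]["respond_at_s"])  # -> 50
```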
- Top Bread (Hook): Picture a teacher who pays extra attention to the trickiest homework problems.
Filling (The Actual Concept):
- What it is: Focal Loss is a training trick that focuses learning on hard or rare cases; in this paper, the rare but crucial moments to say <Response>.
- How it works:
- Down-weight easy predictions (like constant <Silence>).
- Up-weight tough or rare predictions (<Standby>/<Response>).
- Combine with frequency balancing so all special states are learned well.
- Why it matters: Without this, the model might stay silent too often and miss the moment to answer.
Bottom Bread (Anchor): When the traffic light finally turns green (a rare, brief change), the model must learn to notice and announce it.
Before vs After:
- Before: Offline video AIs answered once after watching everything; streaming add-ons were clunky and slow.
- After: A single, tightly-coupled model decides and answers on the fly, across many task types, with strong timing.
Why it works (intuition): Timing and content are two sides of the same coin. Training them together, on consistent time-stamped tasks, lets the model learn patterns like "this is relevant, keep watching" and "now it's complete, answer." That tight coupling, plus focused training on rare response moments, makes Streamo both attentive and timely.
03 Methodology
At a high level: Input (live frames + instruction) → Multi-turn dialogue with time tags → Predict state (<Silence>/<Standby>/<Response>) each turn → If <Response>, generate answer → Continue streaming.
Step-by-step recipe:
- Streamed input becomes a dialogue
- What happens: A long video is split into 1-second chunks labeled like <0s-1s>, <1s-2s>, and so on. These chunks are interleaved with the user's instruction (e.g., "Tell me when the light turns green").
- Why it exists: It turns a live stream into learnable, time-stamped âturns,â so the model can practice deciding at each moment.
- Example: <3s-4s> shows the light still red â model should output <Silence>.
- Predict the response state every second
- What happens: The model predicts one token from three options: <Silence> (keep watching; nothing to say yet), <Standby> (it's relevant and in progress; wait for completion), or <Response> (it's complete or answerable; speak now).
- Why it exists: Real-time systems must know not just what to say, but when.
- Example: "Notify me when the toast pops." While it's heating: <Silence>. When it starts rising: <Standby>. When it fully pops: <Response> "The toast just popped."
- One-pass generation when ready
- What happens: As soon as <Response> is predicted, the model immediately generates the answer, narration snippet, or grounded time window.
- Why it exists: This avoids extra calls to another module, which is faster and better timed.
- Example: "Localize: 'Add vodka and then squeeze lemon.'" The model stays quiet through prep, switches to <Standby> during pouring or squeezing, and replies right after both actions finish with their time span.
- Multi-task instruction tuning with unified labels
- What happens: Training uses Streamo-Instruct-465K with standardized timing and task formats: real-time narration (a second-by-second story of what changes now), action captioning (highlight step-by-step actions), event captioning (describe events at their endpoints), event grounding (given a caption, report the time window when it occurs), and time-sensitive QA (one question whose answer changes over time; update when it changes).
- Why it exists: Teaches the model not just to see, but to follow many instruction styles on a live stream.
- Example data: For "What is the man holding now?", the label shows exact seconds when he switches from glass to shaker to lemon, so the model updates its answer at the right moments.
- Handling class imbalance with focused learning
- What happens: Most seconds are <Silence>. The training uses focal loss and frequency balancing to emphasize rare <Response> and mid-rare <Standby> tokens.
- Why it exists: Otherwise, the model would learn to stay silent almost all the time.
- Example: In a 60-second clip with one short event at 48-52s, training increases attention on those few key seconds.
- Keep the vision encoder fixed, tune the language brain
- What happens: During training, the visual backbone is frozen; the connector and LLM layers learn the new streaming behavior.
- Why it exists: This keeps training efficient and leverages strong existing vision features while aligning them to real-time decision-making.
- Example: Two different base models (3B and 7B) can both adopt this streaming brain with the same recipe.
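In a typical PyTorch-style setup, that freeze-the-eyes, tune-the-brain recipe looks roughly like the snippet below; `vision_encoder`, `connector`, and `language_model` are generic attribute names standing in for whatever modules the chosen backbone exposes, not the paper's code, and the learning rate is just a placeholder.

```python
import torch

def prepare_for_streaming_tuning(model, lr=1e-5):
    """Freeze the visual backbone, tune the connector and LLM (sketch only)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False           # keep pretrained visual features fixed
    for p in model.connector.parameters():
        p.requires_grad = True            # align visual features to the LLM
    for p in model.language_model.parameters():
        p.requires_grad = True            # learn state tokens + streaming answers
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```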
- Evaluate across online and offline tasks
- What happens: Test on streaming benchmarks (like OVO-Bench and Streamo-Bench) and on classic offline tasks (MVBench, TempCompass, VideoMME, LongVideoBench).
- Why it exists: We want a real-time assistant that still understands videos well, even when used offline.
- Example: The same model can narrate live cooking and also answer questions about a pre-recorded sports clip.
The Secret Sauce:
- End-to-end unification: Decision (when) and generation (what) are trained together.
- Three-state controller baked in: <Silence>, <Standby>, <Response> are predicted as normal tokens.
- Time-tagged dialogues: The model naturally learns alignment between frames, instructions, and outputs.
- Better loss for rare moments: Focal loss + frequency weights sharpen timing.
- A big, consistent dataset: Streamo-Instruct-465K standardizes how timing and tasks are labeled, so the model learns cleanly and generalizes.
Top Bread (Hook): Think of a bicycle where the brakes and the pedals are tuned together so you stop exactly when you should.
Filling (The Actual Concept):
- What it is: Frame-Level Response State Prediction means the model decides, at each time step, whether to be silent, wait, or answer.
- How it works:
- Read the current 1-second chunk and the instruction.
- Predict one of three tokens (<Silence>/<Standby>/<Response>).
- If <Response>, generate the output right away.
- Why it matters: Without precise, per-frame decisions, you get awkward timing, either too noisy or too late.
Bottom Bread (Anchor): "Tell me the exact time the runner crosses the finish line." The model outputs <Response> exactly at the crossing moment and gives the time.
Top Bread (Hook): Imagine a teacher who spends extra time helping you on the exact problems you miss most.
Filling (The Actual Concept):
- What it is: Focal Loss focuses training on hard or rare cases so the model learns the tricky moments.
- How it works:
- Detect which predictions were too easy and reduce their weight.
- Boost the weight of challenging predictions, like rare <Response> times.
- Combine with frequency-based weights so all three states are balanced.
- Why it matters: Without it, the model would under-practice the very moments that matter most in streaming.
Bottom Bread (Anchor): When only one second in a minute matters, focal loss makes sure the model pays attention during that second.
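A minimal sketch of focal loss with inverse-frequency class weights, applied only to the three state-token positions, could look like the code below. The weighting formula, the gamma value, and the example counts are assumptions for illustration, not the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def balanced_focal_loss(state_logits, state_targets, class_counts, gamma=2.0):
    """Focal loss + inverse-frequency weights over <Silence>/<Standby>/<Response>.

    state_logits:  (N, 3) scores at the N positions where a state token is predicted
    state_targets: (N,) integer labels in {0: Silence, 1: Standby, 2: Response}
    class_counts:  (3,) how often each state occurs in the training data
    Illustrative sketch; the paper's exact weighting scheme may differ.
    """
    freq_w = class_counts.sum() / (3 * class_counts.float())   # rare states weigh more
    log_p = F.log_softmax(state_logits, dim=-1)
    log_pt = log_p.gather(1, state_targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    focal = (1.0 - pt) ** gamma                                 # down-weight easy cases
    return (-freq_w[state_targets] * focal * log_pt).mean()

# Example: mostly-silent data, so the lone <Response> position dominates the loss.
logits = torch.randn(3, 3)
targets = torch.tensor([0, 0, 2])                 # two <Silence>, one <Response>
counts = torch.tensor([10_000, 800, 200])         # hypothetical state frequencies
print(balanced_focal_loss(logits, targets, counts))
```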
04 Experiments & Results
The Test: The team evaluated two big things: how well Streamo behaves live, and whether it still understands videos in general.
- Online/Streaming Benchmarks: OVO-Bench tests real-time perception, backward/forward tracing, and active responding across 12 subtasks.
- Instruction-Following in Streaming: Streamo-Bench checks if the model can follow mixed instructions (grounding, narration, dense captions, time-sensitive QA) on the same videos.
- Offline Benchmarks: MVBench and TempCompass (short videos), VideoMME and LongVideoBench (long videos) measure classic video reasoning.
The Competition: Streamo was compared to open-source offline giants (like Qwen2-VL, LLaVA-Video, InternVL) and online systems (like Flash-VStream, VideoLLM-Online, Dispider, StreamingVLM). It was also trained with different datasets to see how data choices matter (ET-Instruct-164K vs Streamo-Instruct-465K).
The Scoreboard (with context):
- OVO-Bench: Streamo-7B beat the previous leading online model (Dispider-7B) by about +13.83% on average, like moving from a B to a solid A.
- Frame rate surprise: A model trained at 1 fps performed even better at 2 fps without retraining (about +4.66% improvement), showing robust generalization to faster inputs.
- Dataset advantage: Training with Streamo-Instruct-465K consistently outperformed ET-Instruct-164K, improving forward tasks by +7.1% and overall performance by +11.79% in key settings.
- Offline strength retained: Despite being built for streaming, Streamo-7B also improved over its own offline base on standard offline benchmarks (e.g., around +3.3% average over Qwen2.5-VL-7B), and surpassed StreamingVLM across tests. That's like becoming a better sprinter without losing your long-distance stamina.
Streamo-Bench findings:
- Many online models struggled when tasks weren't multiple-choice or when prompts asked for open-ended grounding and time-updating answers.
- Streamo handled the mixed tasks much better, showing strong instruction comprehension and timing in one go.
Ablations (what made the difference):
- Loss design: Using plain cross-entropy (no reweighting) led to too many <Silence> predictions (poor responsiveness).
- Fixed class weights helped somewhat, but still missed token-level difficulty.
- Focal loss + frequency balancing gave the best response-timing accuracy across backbones (e.g., big jumps in REC/SSR/CRR on OVO-Bench's forward-active tasks).
Surprising findings:
- Adding more offline-only supervision (like LLaVA-Video) to ET-Instruct sometimes improved perception but hurt streaming response behavior, highlighting a trade-off. In contrast, Streamo-Instruct-465K lifted both streaming and offline performance.
- The simple, end-to-end pipeline worked across different base models (3B, 7B, and others), suggesting it's a generally useful recipe, not a one-off trick.
05 Discussion & Limitations
Limitations:
- Unbounded streams are hard: As videos keep going, memory and latency grow. Without special long-sequence tricks, costs can become too high for very long sessions.
- Latency on tiny devices: Although Streamo is efficient for what it does, extremely tight real-time constraints on low-power hardware may still be challenging.
- Visibility of the future: The model canât see future frames by design; tasks that require foresight (predicting what hasnât happened yet) remain difficult.
- Data coverage: Even at 465K samples, the world is huge. Rare activities or tricky camera angles may still trip the model.
Required Resources:
- A capable base vision-language model (e.g., 3Bâ7B) and GPU(s) for training.
- The Streamo-Instruct-465K dataset and the training pipeline that interleaves frames, states, and instructions.
- Enough VRAM/CPU bandwidth to keep up with the desired frame rate (1-2 fps in the paper's setup, with generalization to 2 fps at test time).
When NOT to Use:
- Ultra-low-latency edge devices with strict power limits.
- Tasks that depend on knowing the future (e.g., "Tell me what will happen in 10 seconds").
- Extremely noisy or unstable streams where basic perception consistently fails (e.g., almost all frames blurred or occluded).
Open Questions:
- How to scale to much longer contexts without ballooning memory (e.g., better KV-cache management, sliding-window attention, adaptive frame compression)?
- How to push beyond 2 fps evaluations while keeping or improving timing accuracy?
- How to learn even finer-grained response timing (sub-second precision) without over-speaking?
- How to broaden multi-task coverage (e.g., richer tool use, audio cues, or robotic actions) while preserving crisp response control?
06 Conclusion & Future Work
Three-sentence summary:
- This paper introduces Streamo, a real-time video assistant that unifies "when to speak" and "what to say" using three internal states trained end-to-end.
- A large, consistent dataset (Streamo-Instruct-465K) teaches many streaming tasks with precise timing, while a focused loss design helps the model learn rare but critical response moments.
- Streamo sets new bars on streaming benchmarks and keeps or improves performance on classic offline tasks, moving video AI closer to true live interaction.
Main achievement:
- Seamlessly embedding frame-level decision-making (<Silence>/<Standby>/<Response>) into generation, powered by a unified streaming dataset and focal loss, to deliver accurate, timely, and multi-task responses in one pass.
Future directions:
- Add long-context efficiency tools (KV-cache management, sliding-window attention, token pruning) to support truly unbounded streams.
- Expand to richer modalities (audio cues), faster frame rates, and broader multi-step tool use.
- Explore sub-second timing and adaptive frame sampling for even sharper responsiveness.
Why remember this:
- Streamo shows that timing and content belong together in streaming AI. By training them as one, with the right data and losses, the model learns to notice, wait, and speak at just the right moment, bringing us closer to helpful, real-time assistants in everyday life.
Practical Applications
- Accessibility narration: Describe scenes in real time for visually impaired users.
- Driver assistance: Alert exactly when lights change, pedestrians cross, or hazards appear.
- Sports moments: Call goals, fouls, or key plays right when they happen.
- Smart kitchens: Announce precise moments to flip, stir, or remove food.
- Home monitoring: Notify when a package arrives or a door is left open.
- Industrial safety: Warn when a worker enters a restricted zone or a machine state changes.
- Education: Live lab or workshop narration that marks key steps as students perform them.
- Drone piloting: Call out targets or events at the right time during missions.
- Retail checkout: Detect when an item is scanned or a bagging step completes.
- Telepresence/robotics: Trigger timely actions when a visual condition becomes true.