Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
Key Summary
- Proact-VL is a video-talking AI that knows not only what to say but also when to say it, like a great sports commentator.
- It watches video streams one tiny slice at a time and decides every second whether to speak or stay quiet using a special decision token called FLAG.
- When it does speak, it keeps the message short and clear so the conversation feels natural and low-latency.
- A new 561-hour Live Gaming Dataset and two benchmarks help train and fairly test this real-time behavior in solo commentary, co-commentary, and player guidance.
- Two training helpers—transition-aware classification and stability regularization—teach the model to switch between talking and silence smoothly and at human-like rates.
- A sliding-window cache with reverse-RoPE lets the system run for very long videos without forgetting the recent past.
- Across many games, Proact-VL beats prior methods on both text quality and timing, with higher F1 and lower TimeDiff (better alignment with human timing).
- It stays competitive with strong commercial models while running in real time, making it practical for live streaming use.
- Threshold tuning gives users control over how chatty the AI is, balancing coverage (F1) against concise consistency (CC).
Why This Research Matters
Real-time companions need more than smarts—they need timing and manners. Proact-VL shows how to build an AI that talks briefly and at the right moments, making live content more engaging and less distracting. This matters for streaming, where low latency and social coordination are essential, but it also extends to education, accessibility, and assistive tools. A teacher’s aide could nudge a learner with just-in-time hints, while an accessibility assistant could describe key on-screen events without overwhelming the listener. By separating the decision of when to talk from what to say, this approach opens the door to respectful, human-like AI behavior in fast, messy, real-world settings.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how your favorite game streamer talks at just the right moments—getting excited during big plays and staying quiet during boring parts? That’s what makes watching them feel so fun and human.
🥬 Filling (The Actual Concept): The world before this paper had Video AIs that could describe videos or answer questions, but most were better at 'offline' chores—like reading a whole clip then responding. They struggled with live streams where timing, brevity, and pacing matter. Many systems either spoke too much (annoying!) or stayed too quiet (lonely!).
How it worked before (step by step):
- Break a long video into chunks and process them in order.
- Either try to answer fast (real-time models) or be 'smart' about when to talk (proactive models).
- Real-time models were quick but talked too often; proactive models chose good moments but then dumped long replies, causing delays.
Why it matters: A live AI companion must balance three things all at once: low latency (be quick), good timing (talk at the right moments), and good content (short, useful, human-like). Without that balance, the experience feels robotic or distracting.
🍞 Bottom Bread (Anchor): Imagine watching a live esports final. If the AI shouts play-by-play during loading screens or stays silent during the game-winning move, the vibe is ruined. We need an AI that senses moments, speaks briefly, and keeps up.
🍞 Top Bread (Hook): Think of chatting with a friend during a movie night. You both whisper during quiet scenes and cheer at big reveals—naturally coordinating without planning it.
🥬 Filling: Researchers faced three stubborn problems: (1) How to keep responses fast while the video never stops. (2) How to let the AI decide 'now is a good time to speak' without being told. (3) How to control message length and frequency so speech fits a live rhythm.
Failed attempts:
- Always-on talking: low delay but too much chatter.
- Smart trigger + long replies: better moments but too slow and chunky.
- Ignoring social context (like co-commentators): caused interruptions and overlap.
The gap: We lacked a single system that could be fast, proactive, and controlled—like a considerate friend who times their comments, keeps them short, and fits into group conversations.
🍞 Bottom Bread (Anchor): Picture a duo shoutcast for League of Legends. One AI must not interrupt the other, speak only when helpful, and keep lines short enough to not miss the next play. That’s the bar.
🍞 Top Bread (Hook): Imagine sorting a never-ending photo slideshow, one picture every second, and deciding each time: Should I say something now?
🥬 Filling: This paper builds a framework—Proact-VL—that watches video one-second at a time, makes a tiny 'speak or stay silent' decision each second, and if needed, says a short, clear line. It also remembers recent history and can keep going for hours.
What was missing: A simple but reliable 'talk trigger' that works in real-time, plus training that teaches when to switch between talking and silence smoothly, and long-video memory that doesn’t break timing.
🍞 Bottom Bread (Anchor): In Minecraft, when a player nears lava, the AI chimes in with a quick tip ('Pour water to make obsidian'), then goes quiet while the player acts—no rambling, no lag.
02 Core Idea
🍞 Top Bread (Hook): You know how a good coach claps, gives a short tip, then lets you try—without giving a whole speech each time? That rhythm keeps you in the zone.
🥬 Filling (The Actual Concept): The 'aha!' is: Decide first if it’s worth talking this second, then—only if yes—generate a short, real-time line. Proact-VL turns long, clunky replies into tiny, well-timed nudges.
Multiple analogies:
- Traffic light: Every second, a tiny controller decides 'green to speak' or 'red to stay silent.' If green, the AI gives a one-sentence update; if red, it waits.
- Sportscaster duo: One keeps eyes on the play; the other jumps in for a few words only when a moment pops, then backs off.
- Texting during a live event: You send quick, relevant texts at key moments, not essays that arrive too late.
Before vs After:
- Before: Real-time models talked a lot but lacked manners; proactive models chose moments but then delivered slow, paragraph-sized chunks.
- After: Proact-VL speaks briefly at the right moments, keeping latency low and the flow natural—even across long streams and with multiple voices.
Why it works (intuition, not equations):
- Make 'when to talk' a tiny, fast decision using a special decision token (FLAG) whose hidden state feeds a small gating head. This separates timing from language generation.
- Teach switching between talk/silence with extra care. Since 'keep talking' or 'keep quiet' is common but 'switch states' is rare, the loss pays more attention to those switch moments (transition-aware weights), making timing crisp.
- Keep the speaking rate steady and human-like (stability regularization) so the AI doesn’t jitter or babble.
- Use a sliding memory (KV cache with reverse-RoPE) so the AI can stream for hours while still remembering recent context.
Building Blocks:
- Chunk-wise input schema (video + optional user query + background/history) every second.
- Lightweight response mechanism (FLAG + gating MLP + threshold τ) for talk-or-silence decisions.
- Multi-tier training loss: language quality + response timing with transition emphasis + stability regularization to match human-like rates.
- Infinite inference: two-cache sliding window plus reverse-RoPE to keep positions and memory consistent.
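Put together, those building blocks form one simple loop that runs every second. Here is a minimal Python sketch of that loop, not the paper's actual code: `speak_score`, `generate_line`, and the chunk fields are hypothetical stand-ins, and the FLAG gate is stubbed with a precomputed salience score.

```python
# A minimal sketch of the per-second chunk loop. All names here are
# illustrative, not the paper's API; the gate is stubbed by a toy scorer.

SILENCE = "..."   # silence placeholder emitted when the gate stays closed
TAU = 0.4         # speaking threshold (tau)

def speak_score(chunk, history):
    # Stand-in for the FLAG gate (really an MLP + sigmoid over the
    # <|FLAG|> token's hidden state).
    return chunk.get("salience", 0.0)

def generate_line(chunk, history):
    # Stand-in for short, clip-level generation (~one second of speech).
    return f"Quick note on {chunk['event']}!"

def stream_step(chunk, history):
    """One 1-second tick: decide first, then (maybe) speak, then update memory."""
    line = generate_line(chunk, history) if speak_score(chunk, history) >= TAU else SILENCE
    history.append(line)  # becomes part of the next tick's background context
    return line

history = []
out1 = stream_step({"event": "boss stagger", "salience": 0.9}, history)  # speaks
out2 = stream_step({"event": "menu idle", "salience": 0.1}, history)     # silent
```

The key design point the sketch preserves: the speak/silence decision happens before, and independently of, text generation.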
🍞 Bottom Bread (Anchor): Watching a Street Fighter 6 match, Proact-VL stays quiet during footsies but instantly says, 'Whiff punish! Big corner carry!' the moment it happens—then lets the action breathe.
🍞 Top Bread (Hook): Imagine laying out a sandwich assembly line: bread, filling, topping—repeat each second—so every bite is fresh.
🥬 Filling: Key concept sandwiches:
- Proactive VideoLLM:
- What: A video-language model that initiates short, timely comments on its own.
- How: Read the current second’s video and context, score 'should I speak?', and if yes, emit a brief line.
- Why: Without proactivity, the AI waits for prompts and feels passive.
- Chunk-wise Processing:
- What: Break the stream into one-second bites.
- How: At each tick, feed video+query+history, update memory, and produce at most one short utterance.
- Why: Without chunks, replies become long and laggy.
- Lightweight Response Mechanism (FLAG):
- What: A special token whose hidden state drives a small 'speak or silence' gate.
- How: Compute a score, compare to threshold τ; if above, generate; else, output silence.
- Why: Without this, the AI can’t control timing cleanly.
- Live Gaming Benchmark:
- What: A 561-hour dataset and tests across solo, co-commentary, and guidance.
- How: Curated multi-game videos with cleaned transcripts and timing labels.
- Why: Without good data and fair tests, you can’t train or trust a live AI companion.
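Of these concepts, the FLAG gate is the most mechanical, so here is a toy Python version of it: a tiny one-hidden-layer MLP plus sigmoid over the FLAG hidden state, compared against τ. The dimensions and weights are made up for illustration; the real head is trained jointly with the model.

```python
import math

# Toy FLAG gate: MLP + sigmoid over the <|FLAG|> token's hidden state,
# then a threshold comparison. Sizes and weights are illustrative only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def flag_gate(hidden, w1, w2, tau=0.5):
    """Return (speak?, probability) for one 1-second tick.

    hidden: FLAG hidden state (list of floats)
    w1: rows projecting hidden -> intermediate (with ReLU)
    w2: weights projecting intermediate -> a single logit
    """
    h = [max(0.0, sum(w * x for w, x in zip(row, hidden))) for row in w1]
    p = sigmoid(sum(w * v for w, v in zip(w2, h)))
    return p >= tau, p

# A salient moment (positive hidden state) opens the gate; a quiet one
# (negative state, zeroed by the ReLU) keeps it closed at tau = 0.6.
speak_hi, p_hi = flag_gate([1.0, 2.0], [[1, 0], [0, 1]], [1, 1], tau=0.6)
speak_lo, p_lo = flag_gate([-1.0, -2.0], [[1, 0], [0, 1]], [1, 1], tau=0.6)
```

Because the decision is a single threshold comparison, timing stays controllable no matter how the text decoder is later sampled.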
🍞 Bottom Bread (Anchor): In co-commentary for League of Legends, Proact-VL waits for the pick to lock before confirming 'They lock Vi—strong engage with Wukong,' showing timing and social awareness.
03 Methodology
At a high level: Video stream → (every 1s) [Gather video frame chunk + query + history] → Decide (speak or silence) → If speak: generate a short line; If silence: output '...' → Update memory → Repeat forever.
🍞 Top Bread (Hook): Think of a metronome ticking once per second. Each tick asks, 'Say something now, or wait?'
🥬 Filling (Steps, why each exists, and examples):
- Chunk-Wise Input Schema
- What happens: The video is sliced into 1-second chunks. At time t, the model gets three things: Vt (current visuals), Qt (optional user question), and Bt (background history like last-second commentary). Everything is serialized in a ChatML-style format.
- Why it exists: Short, regular bites keep latency low and make it easy to keep pace with the stream.
- Example: In Minecraft at t=12s, Vt shows a player near lava, Qt is 'How do I mine safely?', Bt recalls last tip ('place torch before mining').
- Persona and System Prompting
- What happens: The system prompt encodes role (e.g., 'LoL commentator'), persona (tone and vocabulary), and task (solo/co-commentary/guidance). This keeps style, tone, and goals consistent.
- Why it exists: Without persona, style drifts or becomes generic; without task context, timing rules get fuzzy.
- Example: 'You are a hype-driven commentator who keeps reactions short during fights.'
- Proactive Response Mechanism (the FLAG gate)
- What happens: After reading the user message, the model reaches a special <|FLAG|> token. Its hidden state goes through a tiny MLP + sigmoid to get a 'speak probability.' If the score ≥ τ, the assistant speaks briefly; otherwise, it outputs a silence placeholder.
- Why it exists: Decoupling 'when' from 'what' makes timing controllable and stable, regardless of decoding temperature.
- Example: At t=20s in a Street Fighter round, the score pops above τ right as a counter-hit lands—one short line fires; next second it falls below τ—silence.
- Short, Clip-Level Generation
- What happens: If triggered, the model generates a brief utterance aligned to that second (about one-second’s speech), then stops. If a thought needs more than one second, it continues naturally across consecutive triggers.
- Why it exists: Long replies create lag and miss the next moment; short bursts keep pace with the action.
- Example: 'Big crit!' then next second 'Boss staggered—go for the head!'
- Multi-Tier Training Objective
- What happens: Two complementary losses are used:
a) Language Modeling Loss (L_main): Teaches what to say—clear, accurate, and coherent text.
b) Response Loss (L_resp): Teaches when to speak.
- Transition-Smoothed Classification: Pays extra attention when switching between talk↔silence (rare but crucial steps).
- Stability Regularization: Smooths the speak-probability over time and matches human-like speaking rates.
- Why it exists: Without transition focus, the model misses key moments; without stability, it jitters, over-triggers, or goes mute.
- Example: In co-commentary, the training highlights the moment the pick locks (switch from silence→speak) and discourages chatter during quiet lobby time.
- Infinite Inference with Sliding KV Cache + Reverse-RoPE
- What happens: The model keeps two caches: a persistent system cache and a dynamic streaming cache. When the context is full, it evicts the oldest part of the streaming cache but re-bases positions via reverse-RoPE to keep positional meaning consistent.
- Why it exists: This lets the AI stream for hours, remember recent context, and avoid long-context drift.
- Example: During a 2-hour esports stream, the AI stays stable and remembers the last exchanges without carrying the whole conversation.
- Structured Input Formatting (ChatML)
- What happens: Each tick has a consistent order: history → video embeddings → optional query → FLAG. This ordering helps the model reason over what just happened, what it sees now, and what the user wants—before making the speak decision.
- Why it exists: Consistent structure supports stable 'decide then generate' behavior.
- Example: 'History: co-caster just explained jungle matchup; Video: champion lock-in; Query: none; Now decide.'
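The multi-tier objective above can be sketched roughly as follows. The exact weighting scheme, coefficients, and rate target here are assumptions for illustration, not the paper's formula: a per-second binary cross-entropy whose weight is boosted at talk↔silence transitions, plus a stability term pulling the average speak probability toward a human-like rate.

```python
import math

# Rough sketch of the response loss: transition-weighted BCE over the
# per-second speak probabilities, plus a rate-stability penalty.
# trans_w, rate_target, and lam are made-up illustrative coefficients.

def response_loss(probs, labels, trans_w=3.0, rate_target=0.2, lam=0.5):
    eps = 1e-7
    bce = 0.0
    for t, (p, y) in enumerate(zip(probs, labels)):
        # pay extra attention at the rare talk<->silence switch points
        w = trans_w if t > 0 and labels[t] != labels[t - 1] else 1.0
        bce += -w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    bce /= len(probs)
    # stability: keep the mean speaking rate near a human-like target
    rate = sum(probs) / len(probs)
    return bce + lam * (rate - rate_target) ** 2

# Well-timed probabilities score low; mistimed ones score high.
well_timed = response_loss([0.05, 0.05, 0.95, 0.05], [0, 0, 1, 0])
mistimed = response_loss([0.9, 0.9, 0.1, 0.9], [0, 0, 1, 0])
```

The transition weight is what makes the rare switch moments dominate the gradient, matching the intuition that "keep talking" and "keep quiet" are easy while "switch now" is the hard, crucial call.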
The Secret Sauce:
- The tiny but decisive FLAG gate: a minimal, fast head that makes clear talk/silence choices, independent from text generation.
- Transition-aware weighting + stability regularization: makes talk/silence switching crisp and human-like while avoiding jitter.
- Reverse-RoPE re-basing: keeps very long streams coherent without re-encoding everything.
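The reverse-RoPE re-basing trick can be illustrated with a toy 2-D rotation, a simplification of real RoPE (which rotates key vectors per frequency band across full tensors): after evicting the oldest cache entries, the surviving keys are rotated back by the number of dropped positions, so they look as if they had been encoded at the start of the window.

```python
import math

# Toy reverse-RoPE on a sliding cache: each "key" is one 2-D vector
# rotated by its position. Real RoPE acts per frequency band; this is
# a deliberately simplified illustration.

def rope(vec, pos, theta=0.1):
    a = pos * theta
    x, y = vec
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def rebase(cache, dropped, theta=0.1):
    """Rotate surviving keys back by `dropped` positions (reverse-RoPE)."""
    return [rope(k, -dropped, theta) for k in cache]

# Encode keys at positions 0..4, evict the first 2, then re-base:
# the survivors should match keys freshly encoded at positions 0..2.
base = (1.0, 0.0)
cache = [rope(base, p) for p in range(5)]
rebased = rebase(cache[2:], dropped=2)
fresh = [rope(base, p) for p in range(3)]
```

This is why the stream can run for hours: nothing old ever needs re-encoding, yet positional meaning inside the window stays consistent.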
🍞 Bottom Bread (Anchor): In a live Baldur’s Gate 3 stream, Proact-VL stays quiet as the player navigates menus, then gives a tight tip when a rare item appears, and stays stable and helpful through an hour-long session without drifting or lagging.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine a big game day where all commentators race to be the most helpful, most on-time voice—short, smart, and steady.
🥬 Filling (The Test): The team built the Live Gaming Dataset (561 hours; 12 games) and two test suites:
- Live Gaming Benchmark (clip-level): Solo commentary, co-commentary, and guidance.
- Live Gaming Benchmark-Streaming (long videos): Tests hour-scale stability.
They evaluated two sides of performance:
- Text Quality: CC (win rate vs a strong model), LiveU (second-by-second usefulness), FinalQ (quality after concatenating outputs).
- Timing Quality: TimeDiff (how close to the right moment), PAUC (coverage quality over time), and F1 (balance of triggering vs over-silence on the full timeline).
The Competition:
- Commercial: GPT-4o, Gemini 2.5 Pro.
- Proactive baselines: VideoLLM-online, MMDuet, Livestar.
- Real-time baselines: LiveCC-7B-Base, LiveCC-7B-Instruct, StreamingVLM.
The Scoreboard (with context):
- Solo Commentary (text): Proact-VL CC 53.62 and FinalQ 5.48—like getting an A when others hover around B levels; it stays competitive even with strong commercial systems.
- Co-Commentary (text): Overall LiveU ~5.15 and FinalQ ~3.59, showing robust multi-speaker coordination; again, top-tier among open approaches.
- Overall Timing: F1 64.87 with TimeDiff ~1.71 and PAUC ~18.10—this is like arriving just as the play begins, not too early, not too late, and doing it consistently across the whole match.
- Common and General Commentary (Ego4D and Black Myth: Wukong): Proact-VL tops text quality and shows strong timing, indicating good generalization beyond the training games.
- Long-Form Stability: Over 30–120 minutes, text quality stays steady; timing metrics dip slightly then stabilize—no runaway chatter or long silences.
Surprising Findings:
- Threshold tuning (τ) gives a clear dial: raising τ reduces F1 (fewer triggers) but often boosts CC (more precise, consistent lines). Mid-range τ (0.3–0.5) balanced coverage vs concision best.
- Both response losses matter: removing stability regularization or the transition-aware term hurts F1 and raises TimeDiff sharply, proving the switching-and-stability design is essential.
- Efficiency: With predictable per-token times and short generation windows (~0.3s budget), the system can handle 10–15 FPS feeds in practice on the tested setup.
🍞 Bottom Bread (Anchor): In a real stream, Proact-VL behaves like a seasoned caster: it hits the big moments, keeps comments bite-sized, cooperates with co-commentators, and maintains that rhythm from the opening to the final scoreboard.
05 Discussion & Limitations
🍞 Top Bread (Hook): Think of a smart walkie-talkie buddy: great most of the time, but sometimes it mishears tiny details on a noisy channel.
🥬 Filling (Honest Assessment): Limitations:
- Fine-grained grounding: On cluttered UIs or tiny HUD numbers, Proact-VL can misread details (e.g., gold leads), leading to believable but incorrect lines.
- Temporal fidelity: Current setup often samples sparse frames (e.g., 2 FPS), which can miss blink-and-you-miss-it events in fast games.
- Entity drift: Games change often (new characters/items). Without retrieval or robust visual ID, the model may rely on outdated memory.
- Style overfit risk: Personas help consistency but can over-constrain tone if not tuned.
Required Resources:
- A capable VideoLLM backbone (e.g., LiveCC/Qwen family), GPU memory for streaming KV caches, and curated training data with second-level labels.
- Some engineering for low-latency video ingest and the cache/eviction mechanism.
When NOT to Use:
- Ultra-fast, high-FPS esports where every frame counts and you lack the compute to sample more densely.
- High-stakes analytics needing pixel-accurate OCR and precise math on tiny HUDs (e.g., formal refereeing).
- Domains with rapidly changing vocab where you cannot provide updated knowledge.
Open Questions:
- How to couple robust, lightweight OCR and numeric reasoning into the loop without breaking latency.
- How to scale to 60–120 FPS and 1080p+ while keeping the 1-second cadence and low latency.
- How to incorporate retrieval (live wikis, patches, item databases) safely and stably in real time.
- How to better coordinate in multi-agent settings (e.g., interrupt handling, turn-taking learning).
🍞 Bottom Bread (Anchor): Picture the model calling 'Huge lead!' when the HUD actually shows a tiny edge—this flags the need for sharper on-screen reading before speaking.
06 Conclusion & Future Work
🍞 Top Bread (Hook): Imagine a co-pilot who whispers the right tip at the right second, then lets you fly—calm, brief, and on beat.
🥬 Filling (Takeaway): 3-Sentence Summary: Proact-VL is a proactive, real-time VideoLLM that decides when to speak every second and delivers short, timely comments. It pairs a lightweight FLAG-based trigger with training that focuses on smooth switching and stable speaking rates, plus a streaming cache that supports hours-long sessions. Across diverse games and settings, it improves both text quality and timing alignment versus prior methods.
Main Achievement: Turning 'when to talk' into a tiny, reliable, and controllable decision—separate from 'what to say'—so the AI feels human-like in live settings.
Future Directions: Add stronger on-screen reading (OCR) and tiny-number reasoning; scale to higher FPS and resolution under tight latency; plug in fresh knowledge via retrieval; and learn richer multi-agent turn-taking.
Why Remember This: It shows how small, well-placed decisions (per-second triggers) plus stability-minded training can transform a chatty model into a considerate live companion—useful far beyond games, from classrooms to assistive tech.
Practical Applications
- Esports co-commentary that times short reactions to big plays without stepping on co-casters.
- Streamer sidekick that adds quick facts or jokes at opportune moments without constant chatter.
- In-game guidance for players (e.g., Minecraft safety tips near lava) with one-sentence, real-time coaching.
- Classroom video aides that offer short hints during science demos right when key steps happen.
- Accessible descriptions of important on-screen events for viewers with low vision, timed to the action.
- Customer support during live product demos: brief, relevant pointers instead of long scripts.
- Sports training feedback (e.g., form cues in workout videos) delivered in quick, timely bursts.
- Remote teamwork assistants that summarize changes during a live dashboard review without flooding chat.
- Live news or weather explainers that chime in only when a significant update appears on-screen.
- Safety monitors that provide concise alerts during factory or lab video feeds when thresholds are crossed.