Rethinking Chain-of-Thought Reasoning for Videos
Key Summary
- The paper shows that video AIs do not need long, human-like chains of thought to reason well.
- Short, clear reasoning plus fewer, smarter visual tokens can match or beat long Chain-of-Thought (CoT) approaches.
- They skip costly supervised training on human CoT notes and instead use reinforcement learning (GRPO) to align the model to be concise and correct.
- Token compression removes duplicate or unhelpful visual tokens so the model sees what matters and runs faster.
- With GRPO, concise reasoning gets much better and becomes more stable even when tokens are compressed.
- The method speeds up inference a lot by shrinking both the input tokens (prefill) and the output text (decode).
- Across many video benchmarks, it gains 1–11 percentage points while using less compute than CoT models.
- The results suggest we should rethink the idea that long, step-by-step explanations are always needed for video reasoning.
- This approach lowers cost, reduces carbon footprint, and makes video AI more practical on devices like robots or phones.
Why This Research Matters
Video AI powers tools for learning, safety, and assistance, but it is often too slow and expensive because videos are big and long. This paper shows a way to be fast and accurate without needing mountains of human-written thoughts. That lowers costs for schools, startups, and researchers, and reduces the carbon footprint of running large models. With token compression, models can look at more frames and catch key moments in long videos. Concise reasoning still explains the answer but keeps things speedy, which is crucial for robots or phones. Overall, it makes advanced video reasoning far more practical in the real world.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how some people solve a math problem by writing a whole page of steps, while others do a few neat lines and still get it right? The long version looks impressive, but it isn’t always better.
🥬 The Concept (Chain-of-Thought Reasoning): Chain-of-Thought (CoT) is when an AI writes long, step-by-step thoughts before giving an answer. How it works: 1) read the question, 2) describe many small steps in words, 3) finally give the answer. Why it matters: these steps can help—but in video tasks they can become very long and very slow, and sometimes add fluff like “Hmm, let’s think,” which doesn’t help.
🍞 Anchor: Imagine watching a 2-minute video and writing 2 pages of notes before answering a simple question like “What color is the ball?” It’s overkill.
The World Before: Text-based AIs learned that writing out thinking steps could help them solve tricky problems. This idea spread to models that handle pictures and videos, called multimodal large language models (MLLMs). For videos, researchers tried to make models talk through their thoughts like humans. At the same time, videos feed in thousands of visual tokens—tiny chunks of visual information—so models see a lot but also have to process a lot.
🍞 Hook: Imagine trying to find a friend at a huge party. If you look at every single person and explain out loud why each one isn’t your friend, you’ll take forever.
🥬 The Concept (Visual Tokens): Visual tokens are little pieces of the video that the model reads, like puzzle pieces. How it works: the video is split into frames, frames into patches, patches into tokens; the model processes them in layers. Why it matters: too many tokens make the model slow and costly.
🍞 Anchor: It’s like reading a comic book panel by panel, but someone printed each panel into 100 puzzle pieces first.
The Problem: Two big slow-downs happen: (1) Prefill cost—the model processes all the input tokens to set up its memory; (2) Decode cost—the model then generates lots of words for its reasoning and answer. In videos, both sides explode: loads of tokens in, loads of words out.
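To make these two costs concrete, here is a rough back-of-envelope sketch in Python. The frame counts, tokens per frame, and per-token timings are made-up illustrative numbers, not measurements from the paper; the point is only that prefill scales with input tokens and decode scales with output tokens.

```python
# Rough back-of-envelope sketch of why video inference is slow.
# All numbers below are made-up illustrative values, not paper measurements.

frames = 64                     # frames sampled from the video
tokens_per_frame = 196          # visual tokens (patches) per frame, hypothetical
visual_tokens = frames * tokens_per_frame      # prefill input size
reasoning_tokens_cot = 500      # long Chain-of-Thought output, hypothetical
reasoning_tokens_concise = 60   # concise think + answer, hypothetical

# Hypothetical per-token costs; decode is pricier per token because each
# output token needs its own forward pass.
prefill_ms_per_token = 0.05
decode_ms_per_token = 5.0

def total_ms(input_tokens: int, output_tokens: int) -> float:
    """Toy latency model: prefill cost + decode cost."""
    return input_tokens * prefill_ms_per_token + output_tokens * decode_ms_per_token

print("Long CoT, no compression:   ",
      total_ms(visual_tokens, reasoning_tokens_cot), "ms")
print("Concise + 50% token pruning:",
      total_ms(visual_tokens // 2, reasoning_tokens_concise), "ms")
```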
Failed Attempts:
- Using CoT everywhere often gave only small accuracy gains while being very slow and expensive to train (it needs human-written CoT steps and long reinforcement learning).
- Asking the model to be concise (short explanations) without special training actually made it worse than just answering directly.
- Trying token compression "as a plugin" at test time helped speed but hurt accuracy more when the model was asked to reason out loud concisely.
🍞 Hook: Imagine shortening your speech for class but no one taught you how to keep the important parts—it might sound choppy or leave out key points.
🥬 The Concept (Concise Reasoning): Concise reasoning is short, clear thinking: say just what helps answer the question. How it works: 1) state what the video shows that matters, 2) connect it to the question, 3) give the answer. Why it matters: it speeds up decoding and can still be correct—but only if the model is trained to do it well.
🍞 Anchor: “The woman tapes the switch so the lights stay on; answer B.” Clear and quick.
The Gap: We needed a way to (1) train models to be good at concise reasoning, and (2) make token compression play nicely with that reasoning, so the model can run fast and still be accurate, without collecting expensive human CoT data.
Real Stakes: Faster, lighter video reasoning matters for real life:
- Home robots reacting quickly
- Classroom tools that run on laptops
- Video customer support that scales
- Lower bills on cloud servers
- Greener AI with a lower carbon footprint
If we can do more with fewer tokens and fewer words, we help more people use smart video tools anywhere.
02 Core Idea
🍞 Hook: Picture a detective who scans a room, notes only the clues that matter, and states a tight conclusion. No rambling monologue—just the essentials.
🥬 The Concept (Key Insight): The big idea is: Short, well-trained reasoning plus fewer, smarter visual tokens is enough for strong video reasoning—no long, human-like chains needed.
How it works (recipe): 1) Compress the video tokens so the model sees the important parts without the clutter. 2) Train the model with reinforcement learning to produce brief, useful reasoning, then the answer. 3) At test time, run with compressed tokens and concise reasoning for speed and accuracy. Why it matters: This cuts both sides of the cost—fewer input tokens and fewer output words—while keeping or improving accuracy.
🍞 Anchor: Like packing a carry-on bag with only essentials and a packing list. You get to your trip faster and still have everything you need.
Three Analogies:
- Librarian: Instead of reading every book, the librarian checks the index and key pages (token compression), then writes a short summary (concise reasoning).
- Basketball: Don’t dribble forever. Make a few smart passes (compressed tokens) and take the open shot (concise reasoning + answer).
- Cooking: Mise en place removes extra steps (compression), then the chef follows a short recipe (concise reasoning) to plate the dish (final answer).
Before vs After:
- Before: Long CoT, many tokens, expensive training with human-labeled CoT, slow inference.
- After: No CoT annotations, direct RL alignment, compressed tokens, brief reasoning, faster inference, competitive or better accuracy.
Why It Works (intuition):
- Overthinking hurts: Extra words like “Hmm” or “Let’s think” don’t add facts; they add time and sometimes push thinking off track.
- Seeing smarter, not more: Many video tokens repeat similar visual facts; merging/pruning keeps the meaning while shrinking cost.
- Aligning behavior: Reinforcement learning (GRPO) rewards good form and correct answers, teaching the model to be short and right.
- Robustness: Training with compression makes the model steady even when input tokens change.
Building Blocks (introduced with mini-sandwiches):
🍞 Hook: You know how a teacher gives stars when answers are correct and neat? 🥬 The Concept (Reinforcement Learning): The model tries answers and gets rewards for correct and well-formatted responses. Step-by-step: sample several answers, score them, prefer the better ones, and keep the model close to its original knowledge. Why it matters: It shapes behavior (be concise and correct) without collecting long human thoughts. 🍞 Anchor: The model learns that “short + right” earns a gold star.
🍞 Hook: Imagine a science fair where teams compare projects to pick the best. 🥬 The Concept (GRPO): GRPO compares a group of sampled responses to one another and boosts the better ones. How it works: produce multiple candidates, normalize their scores, and update the model toward the best within the group while staying near the reference model. Why it matters: It’s simpler and cheaper than training a separate critic model. 🍞 Anchor: Think of a mini-tournament where the winning answers teach the team how to do better next time.
🍞 Hook: Ever condense a long video into a highlight reel? 🥬 The Concept (Visual Token Compression): Combine similar visual patches and drop unhelpful ones so the model sees the highlights. How it works: token merging (group lookalikes together) and token pruning (remove low-importance ones), kept compatible with fast attention. Why it matters: Less prefill time, more frames can fit, better coverage of long videos. 🍞 Anchor: Instead of every second of a game, you watch the key plays and still know who won and why.
🍞 Hook: When answering a trivia question, you don’t read a whole essay. 🥬 The Concept (Concise Reasoning Mode): The model writes a short think section that names the key video evidence and the logic link, then gives the answer. Why it matters: It is explainable enough while staying fast. 🍞 Anchor: “She tapes the switch to keep lights on → B.”
03 Methodology
At a high level: Video + Question → (A) Compress visual tokens (prefill saver) → (B) Reinforcement Learning with GRPO to train concise reasoning → (C) Inference with concise think-then-answer → Output.
Step A: Compress Visual Tokens (Prefill Efficiency)
- What happens: The video is split into frames and patches (tokens). We merge tokens that are visually similar and prune tokens that look unimportant, physically removing them in select layers to save memory and time. We keep FlashAttention for most layers and only disable it in a few pruning layers to remain fast overall.
- Why this step exists: Without compression, prefill is too slow and memory-hungry, especially for long videos. Compression lets us feed more frames so the model doesn’t miss key moments.
- Example: Suppose a 10-second clip has many nearly identical hallway frames. Merging them still shows “a hallway with a light switch,” while dropping tiny irrelevant patches (like plain wall) keeps the meaning.
🍞 Hook: If you have too many similar photos from a trip, you keep the best ones. 🥬 The Concept (Token Merging): Group and merge tokens that carry similar visual info. How it works: compute similarity, merge groups, pass fewer tokens forward. Why it matters: The big picture remains, but compute shrinks. 🍞 Anchor: Ten photos of the same beach at noon become one great photo.
🍞 Hook: When cleaning your desk, you recycle scraps you don’t need. 🥬 The Concept (Token Pruning): Drop tokens that add little value, and do it physically in the model to really save memory and time. How it works: a few layers pick tokens to keep; others run as usual to stay fast. Why it matters: Real wall-clock speedups appear when tokens are truly removed. 🍞 Anchor: Instead of just hiding clutter under a paper, you actually throw it out.
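To give a feel for what merging and pruning look like in code, here is a minimal PyTorch sketch. It is not the paper's exact compression method: the cosine-similarity threshold, the keep ratio, and the use of token norms as an importance proxy are all simplifying assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedy toy merge: average each token into an earlier kept token when
    their cosine similarity exceeds `threshold`. tokens: (num_tokens, dim)."""
    normed = F.normalize(tokens, dim=-1)
    kept, counts = [], []
    for i in range(tokens.size(0)):
        merged = False
        for j in range(len(kept)):
            if torch.dot(normed[i], F.normalize(kept[j], dim=-1)) > threshold:
                # Running average keeps the merged token representative.
                kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1)
                counts[j] += 1
                merged = True
                break
        if not merged:
            kept.append(tokens[i].clone())
            counts.append(1)
    return torch.stack(kept)

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k tokens by a toy importance score (L2 norm here);
    a real method would use attention-based importance instead."""
    k = max(1, int(tokens.size(0) * keep_ratio))
    importance = tokens.norm(dim=-1)
    keep_idx = importance.topk(k).indices.sort().values  # preserve token order
    return tokens[keep_idx]

# Toy usage: 8 frames x 16 patches x 64-dim features.
visual_tokens = torch.randn(8 * 16, 64)
compressed = prune_tokens(merge_similar_tokens(visual_tokens))
print(visual_tokens.shape, "->", compressed.shape)
```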
Step B: Train Concise Reasoning with GRPO (Behavior Alignment)
- What happens: We skip supervised fine-tuning on human-written CoT. Instead, we directly run reinforcement learning on the pre-trained model. For each video-question pair, the model samples several responses (each with a short think section and an answer). A reward checks two things: (1) format (did it produce a short think section followed by the answer tags?), and (2) accuracy (was the answer correct?). GRPO then prefers the better samples within each group and keeps the model close to its original knowledge via a small regularization.
- Why this step exists: Simply prompting for short thinking made the model worse. It needed training signals to learn how to be short and right. GRPO provides that signal without needing human CoT notes.
- Example with data: On MLVU (a long video benchmark), concise reasoning before training lagged behind direct answers. After GRPO, concise reasoning jumped by over 10 points, beating previous baselines while staying fast.
🍞 Hook: A coach gives points for hitting the target and for clean form. 🥬 The Concept (Reinforcement Learning Rewards): The model gets a higher score for correct answers and neat, concise formatting. How it works: sample answers, score each, boost the better ones. Why it matters: The model learns that being brief and correct is best. 🍞 Anchor: You learn to shoot baskets straight and quickly because that wins the drill.
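Here is a hedged sketch of what a format-plus-accuracy reward could look like. The tag names (<think>, <answer>), the word limit, and the weights are illustrative assumptions; the paper's actual reward may differ in detail.

```python
import re

def reward(response: str, ground_truth: str, max_think_words: int = 80) -> float:
    """Toy reward: a format bonus for a short <think>...</think><answer>...</answer>
    response, plus an accuracy bonus for the correct answer.
    Tag names, word limit, and weights are illustrative assumptions."""
    score = 0.0
    match = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                      response, flags=re.DOTALL)
    if match:
        think, answer = match.group(1).strip(), match.group(2).strip()
        if len(think.split()) <= max_think_words:   # reward brevity
            score += 0.5
        if answer.upper() == ground_truth.upper():  # reward correctness
            score += 1.0
    return score

print(reward("<think>She tapes the switch so the lights stay on.</think>"
             "<answer>B</answer>", ground_truth="B"))   # 1.5
```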
🍞 Hook: How do you pick the best idea among several sketches? 🥬 The Concept (Group Comparison in GRPO): The model compares a small set of its own answers and learns from the top ones. How it works: normalize scores within the group so relative differences are clear; update toward the winners; limit drift from the base model. Why it matters: Simple, stable, and cheaper than training a separate critic. 🍞 Anchor: Like choosing the best of five drafts and using it to guide your next drafts.
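And a minimal sketch of the group-relative part of GRPO: score a group of sampled responses, normalize the rewards within the group, and treat the normalized values as advantages. The full algorithm also uses a clipped policy update and a KL penalty toward the reference model, both omitted here.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each sampled response's reward
    against the group's mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers for one video-question pair (toy rewards).
rewards = [1.5, 0.5, 1.5, 0.0]
print(group_advantages(rewards))
# Positive advantages push the model toward those samples;
# the KL term (not shown) keeps it close to the reference model.
```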
Step C: Inference in Concise Mode (Decode Efficiency)
- What happens: At test time, we keep compression on. The model writes a quick think section summing up the critical evidence, then outputs the final answer. The think section is short by design (no "Hmm" or "Wait"), which cuts decode time a lot.
- Why this step exists: Long CoT decoding is the biggest speed sink. Short, useful think text saves time and still explains the answer.
- Example: “She tapes the switch so lights stay on → B.” That short think captures cause and effect without rambling.
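One way to elicit this mode is a prompt that asks for a brief rationale inside tags, plus a small parser that pulls out the final option. The prompt wording and tag names below are assumptions for illustration, not the paper's exact instruction text.

```python
import re
from typing import Optional

# Hypothetical concise-mode instruction; the paper's exact wording may differ.
CONCISE_PROMPT = (
    "Answer the question about the video. First give a brief rationale inside "
    "<think>...</think> (one or two sentences, no filler words), then give only "
    "the option letter inside <answer>...</answer>."
)

def extract_answer(model_output: str) -> Optional[str]:
    """Pull the final option letter out of the <answer> tag, if present."""
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", model_output)
    return match.group(1) if match else None

# Toy model output in concise mode.
output = "<think>She tapes the switch so the lights stay on.</think><answer>B</answer>"
print(CONCISE_PROMPT)
print("Predicted:", extract_answer(output))
```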
Putting It All Together (What makes it clever):
- Dual savings: We cut input size (prefill) and output length (decode).
- No CoT labels: We remove a costly training step and still gain accuracy.
- Robust with compression: Training with compression makes concise reasoning steady when input tokens change.
- More frames, better coverage: With fewer tokens per frame, we can include more frames, which is crucial for long videos.
End-to-End Recipe:
- Input video and question.
- Extract visual tokens per frame.
- Merge similar tokens and prune low-value ones in select layers (keep FlashAttention elsewhere).
- During training: sample multiple concise think+answer outputs; reward correctness and formatting; use GRPO to move the model toward the better samples, staying close to the pre-trained model.
- During inference: apply the same compression; generate a brief think then the answer.
- Output: short explanation and final answer, fast and accurate.
04 Experiments & Results
The Test: The authors measured two things: (1) How accurate the models are on many video benchmarks covering general, long, and complex videos; (2) How fast they are, especially the time spent in prefill (reading tokens) and decode (writing text). They also checked how token compression affects performance and how reinforcement learning (GRPO) changes concise reasoning.
The Competition:
- Strong pre-trained base (Qwen2.5-VL) using either direct answers or concise reasoning.
- A CoT-trained model (Video-R1) that uses long, human-like thoughts and heavy post-training (CoT annotations + supervised fine-tuning + RL).
- The proposed approach: skip CoT annotations and supervised fine-tuning; train only with GRPO; use token compression; decode with concise reasoning.
The Scoreboard (with context):
- CoT’s cost vs gain: CoT training took over 30 hours on multiple high-end GPUs and produced very long decoding (around ten times slower than direct answers). Yet the accuracy gains over the pre-trained base were modest, and on some benchmarks even negative. That’s like studying all night and writing a very long test essay, only to score a tiny bit higher, or sometimes lower, than a student who wrote a short, clear answer.
- Concise reasoning without training: Prompting the base model to be concise made it less accurate than the direct-answer mode, even though its decoding was shorter. In other words, the idea was good, but the model hadn’t been taught how to do it correctly.
- GRPO fixes concise reasoning: After GRPO training, concise reasoning improved by roughly 1–11 percentage points across benchmarks like VideoMME, MVBench, and MLVU, closing the gap and often beating the CoT model. That’s like turning short answers from B- quality into A quality by practicing with a good scoring system.
- Token compression with concise reasoning: Applied naively at test time, compression hurt concise reasoning more than other modes. But after GRPO training, this drop shrank a lot, and because tokens were smaller, the model could include many more frames. This improved long-video performance substantially.
- Overall win: The final model—compressed tokens, concise reasoning, GRPO training—achieved the best or competitive accuracy on all tested benchmarks while being much faster at inference and avoiding expensive CoT annotations and supervised fine-tuning.
Surprising Findings:
- Long, human-like thoughts didn’t help much and sometimes hurt, even though they felt smart and careful.
- Concise reasoning worked poorly without special training, but worked very well after GRPO taught the model what to keep and what to drop.
- Token compression plus GRPO is a power combo: models can see more frames with less compute, which helps long or complex video questions that hinge on catching key moments.
Concrete Examples:
- In a question about a woman taping a switch, the CoT model wrote a very long monologue and guessed a fanciful reason. The concise model stated the key observation and picked the practical answer (keep lights on).
- For an image mentioning the global marketing industry’s size, the CoT model reasoned vaguely and chose the largest number; the concise model read the exact value and answered directly.
- For comparing old vs new consoles, the concise model recognized competition as the most realistic relationship, while the CoT model over-explained and picked a less fitting answer.
Bottom line: With the right training and compression, short-and-smart beats long-and-chatty in video reasoning.
05 Discussion & Limitations
Limitations:
- Benchmark scope: The claims are backed by current video benchmarks. Tasks demanding full mathematical proofs, safety-case justifications, or legal reasoning may still benefit from longer, auditable reasoning.
- Not a new algorithmic family: The work is an empirical recipe rather than a brand-new theory. It shows that an alternative path (short reasoning + compression + GRPO) works surprisingly well, but doesn’t prove it must always win.
- Compression choices: Token pruning/merging settings and which layers to prune can affect results. Poorly tuned compression may drop rare but important details.
- Explainability depth: Concise reasoning is explainable, but less detailed than full CoT. In safety-critical cases, stakeholders may require more verbose audit trails.
Required Resources:
- A solid pre-trained video-capable MLLM.
- GPU resources for RL fine-tuning (fewer than full CoT pipelines but still non-trivial).
- Implementation of token compression compatible with most layers using fast attention, and a few pruning layers without it.
When Not to Use:
- Tasks where step-by-step traceability is legally or ethically required (e.g., medical diagnoses in regulated contexts).
- Problems where rare visual clues are extremely subtle and compression might hide them (e.g., microscopic details).
- Settings where lengthy reasoning is itself the product (e.g., tutoring that demands elaborate explanations for learning).
Open Questions:
- Can we auto-tune token compression per video so the model keeps rare but critical frames better?
- How short is too short? What is the sweet spot between brevity and reliability?
- Can this approach generalize to other modalities (audio, sensors) or tasks like video planning and control?
- How do we design rewards that balance correctness, brevity, and fairness, especially in multi-choice versus open-ended answers?
- Can concise reasoning be paired with selective deep dives—think-on-demand only when confidence is low?
06 Conclusion & Future Work
Three-Sentence Summary: This paper argues that long, human-like Chain-of-Thought is not necessary for strong video reasoning. Instead, concise reasoning combined with visual token compression and GRPO training yields faster, cheaper, and often more accurate results. It removes the need for human CoT labels and heavy supervised fine-tuning while improving performance across many benchmarks.
Main Achievement: A practical framework that shrinks both input (with token compression) and output (with concise reasoning), aligned by GRPO, to deliver efficient and competitive video reasoning without costly CoT annotations.
Future Directions: Explore adaptive compression per video, better reward designs that balance brevity and robustness, think-on-demand systems that escalate from concise to detailed only when needed, and generalize the approach to other long-context modalities like audio and sensor streams.
Why Remember This: It flips a common assumption—more words aren’t always better. In video AI, small, sharp, and well-trained can beat long and chatty, saving time, money, and energy while keeping or boosting accuracy.
Practical Applications
- On-device video assistants that summarize or answer questions without cloud compute.
- Robotics that needs quick, reliable video understanding to act safely in real time.
- Education tools that analyze classroom videos and give short, clear explanations.
- Customer support that reviews product videos and provides fast, accurate answers.
- Sports analytics that highlights key plays and explains decisions briefly.
- Home security that scans long footage efficiently and flags important events.
- Healthcare triage tools that quickly summarize non-diagnostic video (e.g., rehab form checks).
- Content moderation that reviews more frames with less compute while keeping accuracy.
- Video search engines that find moments faster by processing compressed tokens.
- AR/VR systems that understand scenes on the fly with concise reasoning outputs.