Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
Key Summary
- This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.
- Comics keep the order of events (time), show cause-and-effect, and include helpful text bubbles, while being much cheaper than making videos.
- The authors propose two ways to use comics: generate a comic that already contains the solution, or use the comic as extra context for a vision-language model to solve the problem.
- On tough benchmarks like MathVista, the comic method outperforms video-based reasoning while staying efficient.
- For long documents (DocVQA), comics help highlight and organize key information, reaching near-perfect accuracy.
- Detective-style comic prompts guide models to reason better than plain documentary-style pictures, showing narrative style matters.
- Accuracy improves as you add panels up to around 4-6, then levels off, meaning a few panels capture most of the useful reasoning.
- Scrambling or deleting panels hurts performance, proving that the panel order (time structure) carries important logic.
- Adding speech bubbles and narration removes confusion and boosts accuracy, because pictures plus short text are clearer than pictures alone.
- Compared to videos, comics reduce media generation cost by about 86.6% for typical tasks while keeping temporal logic.
Why This Research Matters
Comics give AI a practical way to "show its work" visually, so people can understand how answers were formed. They keep the order of steps without the heavy cost of full videos, making them efficient for real applications like education, document reading, and help desks. For long documents, comics act like a guided map that points to the most important spots. For cultural questions, short narratives with pictures and tiny text bubbles clarify context and reduce misunderstandings. Because comics transfer well across different AI models, teams can reuse the same visual reasoning assets. The approach encourages safer, clearer AI by making chains of thought easier to inspect. It opens the door to more accessible tools that mix pictures and words in a student-friendly way.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a friend explains a math problem by writing out each step, so you can follow their thinking? That's easier than just showing the final answer.
The Concept (Chain-of-Thought, CoT): CoT is a way for AI to show step-by-step thinking before giving an answer. How it works:
- Break the big question into small steps,
- Solve each step in order,
- Use the steps to reach the final answer. Why it matters: Without CoT, AI can jump to wrong answers because it skips the in-between thinking. Anchor: When asked "How long to read 120 pages if 8 pages take 20 minutes?", CoT helps the AI write: 120/8 = 15 chunks; 15 × 20 = 300 minutes; 300/60 = 5 hours.
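To make the anchor example fully concrete, here is a minimal sketch in plain Python (not from the paper) that writes out the same chain of thought as explicit intermediate steps instead of jumping straight to the answer:

```python
# CoT-style worked example: 8 pages take 20 minutes; how long for 120 pages?
pages_total = 120
pages_per_chunk = 8
minutes_per_chunk = 20

chunks = pages_total / pages_per_chunk    # step 1: 120 / 8 = 15 chunks
minutes = chunks * minutes_per_chunk      # step 2: 15 * 20 = 300 minutes
hours = minutes / 60                      # step 3: 300 / 60 = 5 hours

print(f"{chunks:.0f} chunks -> {minutes:.0f} minutes -> {hours:.0f} hours")
```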
Hook: Imagine using a map instead of just directions; you see the whole path at once. The Concept (Thinking with Images, TWI): TWI lets AI use pictures to assist reasoning, not just words. How it works:
- Make or read an image related to the problem,
- Point out useful parts (like arrows or highlights),
- Use those visual clues to think better. Why it matters: Without images, the AI misses visual facts (like shapes or layouts). Anchor: For "Which triangle is right-angled?", a diagram with a marked square corner helps the AI pick the right one.
Hook: Watching a short video of dominos falling shows cause-and-effect better than a single photo. The Concept (Thinking with Video, TWV): TWV uses motion and time to show how things change step by step. How it works:
- Generate or read a short video,
- Track changes over time (before, during, after),
- Use the sequence to reason about actions and results. Why it matters: Without time, it's hard to explain processes, like how one step leads to another. Anchor: To show "pour water then measure", a video makes the order obvious.
Hook: Stories have a beginning, middle, and end. If you mix the pages, the story gets confusing. The Concept (Temporal Structure): Temporal structure is the order of events in time. How it works:
- Mark step 1, then step 2, then step 3,
- Keep cause before effect,
- Connect each step to the next. Why it matters: Without order, you can't tell what caused what. Anchor: In a recipe, you must crack eggs before cooking them; shuffling steps ruins breakfast.
Hook: Picture books are easier to understand when pictures and words go together. The Concept (Visual Storytelling): Visual storytelling uses pictures (and sometimes short text) to tell a clear, connected story. How it works:
- Choose key scenes,
- Draw them as panels,
- Use captions or speech bubbles for crucial facts. Why it matters: Without a story, pictures can feel random and confusing. Anchor: A comic about planting a seed (plant → water → sprout) teaches the process better than one photo.
Hook: Sometimes we need to use both eyes and ears, like watching a science video while the teacher explains. The Concept (Multimodal Reasoning): Multimodal reasoning mixes words, pictures, and sometimes sound to think better. How it works:
- Read the question (text),
- Look at pictures/panels (vision),
- Combine them for a final answer. Why it matters: If you only use one sense, you may miss important clues. Anchor: Solving a geometry word problem with a labeled diagram plus notes is easier than text alone.
Hook: Comics are like tiny movies made of still pictures, each panel telling "what happens next." The Concept (Thinking with Comics, TwC): TwC uses comic panels (with or without short text) as the AI's step-by-step reasoning space. How it works:
- Turn the problem into a short comic of 4-6 panels,
- Each panel captures a key step or cause-and-effect link,
- The last panel shows or leads to the answer. Why it matters: Unlike one image, comics keep time; unlike videos, comics skip redundant frames and save cost. Anchor: To solve a word problem about speed, panels can show distance, time, calculation, then the final result.
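To show what a TwC reasoning trace could look like as data, here is a hypothetical panel plan (the structure is illustrative, not the paper's format) for the kind of speed word problem mentioned in the anchor:

```python
# Hypothetical panel plan for "A car travels 150 km in 2.5 hours; what is its speed?"
# Each panel is one reasoning step; the last panel states the answer.
panel_plan = [
    {"panel": 1, "scene": "car on a highway",        "caption": "Distance = 150 km"},
    {"panel": 2, "scene": "clock beside the road",   "caption": "Time = 2.5 hours"},
    {"panel": 3, "scene": "character at whiteboard", "caption": "Speed = 150 / 2.5"},
    {"panel": 4, "scene": "character holds up sign", "caption": "Answer: 60 km/h"},
]

for p in panel_plan:
    print(f"Panel {p['panel']}: {p['scene']} -- {p['caption']}")
```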
The world before: LLMs got much better at reasoning with Chain-of-Thought (text steps). Then multimodal models added images and even videos for visual logic. But single images miss temporal structure; videos add time but are expensive and repetitive. The problem: We need a format that preserves time and cause-and-effect without video's heavy cost. Failed attempts: Single pictures with arrows or multiple separate images still felt disjointed, and videos often carried redundant frames and high compute costs. The gap: A middle ground that balances temporal logic, clarity, and efficiency, keeping both visual cues and minimal redundancy. Real stakes: From reading long documents to solving math with diagrams to understanding cultural scenes, we want AI that can explain its steps clearly, cheaply, and in order, just like a teacher's whiteboard comic.
02 Core Idea
Hook: Imagine solving a puzzle by laying out a comic strip where each panel shows the next clue; it's tidy, ordered, and quick to scan.
The Concept (TwC's Key Insight): The big idea is to use comics (short, ordered panels with optional speech bubbles) as the AI's step-by-step visual notebook. How it works:
- Generate a multi-panel comic that mirrors the reasoning path,
- Either read the answer from the final panel (Path I) or feed the comic to a vision-language model to reason jointly (Path II),
- Use narrative style and panel count to match task difficulty. Why it matters: It captures time and causality like a video but with the low redundancy and clarity of images, improving accuracy and efficiency. Anchor: A four-panel strip can show "interpret the question → set up numbers → compute → present result," reducing confusion.
The "Aha!" in one sentence: Treat a comic strip as a compact timeline of reasoning steps: enough to preserve logic over time, without the bulk of video.
Three analogies:
- Flipbook vs. movie: A flipbook shows the key frames you need; a full movie repeats lots of similar frames. Comics are like a well-chosen flipbook.
- Teacher's chalkboard: Each box on the board is a step, in order, with short notes; exactly how panels and bubbles work.
- Recipe cards: Each card (panel) is one stage: gather ingredients, mix, cook, serve; clear and compact.
Before vs. after:
- Before: Single images lacked time; videos had time but were heavy and redundant.
- After: Comics deliver time and cause-and-effect in 4-6 informative panels, guiding the model to reason in order.
Why it works (intuition, no equations):
- Selecting key moments: Comics pick the most informative snapshots, dropping near-duplicate frames.
- Anchoring with text: Short bubbles remove ambiguity a picture alone might have.
- Order matters: Panels lock in sequence, so causes come before effects.
- Style as a prompt: A detective-style comic nudges the model to follow clues methodically, improving logic.
Building blocks (mini concepts with sandwiches):
- Hook: Ever notice how a magnifying glass makes small print readable? The Concept (Textual Anchoring): Tiny bits of text (speech bubbles, captions) clarify what a picture means. How it works: 1) Add short labels to key parts; 2) Tie numbers and units to visuals; 3) Mark conclusions in the final panel. Why it matters: Pictures can be vague; words make them precise. Anchor: Labeling "radius = 3 cm" on a drawn circle stops unit confusion.
- Hook: When a story is too short, it's unclear; too long, it drags. The Concept (Panel Scaling): Choose the right number of panels (often 4-6) to balance detail vs. brevity. How it works: 1) Start with 4 panels; 2) Add more if steps feel cramped; 3) Stop when extra panels repeat info (see the heuristic sketch after this list). Why it matters: Too few misses steps; too many waste compute. Anchor: A geometry proof may need 6 panels, but a quick arithmetic word problem might need only 2-4.
- Hook: Different coaches motivate in different ways; a detective coach asks for clues. The Concept (Role-Playing Narrative): The comic's style (detective, slice-of-life) acts like a "visual system prompt." How it works: 1) Pick a narrative style that fits the task; 2) Keep consistent characters; 3) Let the style steer the reasoning. Why it matters: Style primes the model's thinking path. Anchor: A detective theme boosts step-by-step logic on math puzzles.
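The panel-scaling heuristic referenced above could be as simple as the following sketch; the thresholds are assumptions that mirror the reported 4-6 sweet spot and the 2-4 panels suggested for quick arithmetic, not the authors' actual procedure:

```python
def choose_panel_count(estimated_steps: int) -> int:
    """Hypothetical heuristic: map an estimated number of reasoning steps
    to a panel count, following the reported 4-6 panel sweet spot."""
    if estimated_steps <= 2:
        return max(2, estimated_steps)        # quick arithmetic: 2 panels can be enough
    return min(max(estimated_steps, 4), 6)    # otherwise stay inside the 4-6 range

print(choose_panel_count(1))   # 2 (trivial problem)
print(choose_panel_count(5))   # 5 (multi-step word problem)
print(choose_panel_count(9))   # 6 (cap: extra panels mostly repeat information)
```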
The result: A structured visual story that consistently improves reasoning across tasks while staying efficient.
03 Methodology
High-level pipeline: Input question → Generate a comic (multi-panel visual reasoning) → Either extract the answer (Path I) or feed comic + question to a VLM (Path II) → Output final answer.
Path I: End-to-End Visualized Reasoning (the comic is the reasoning)
- What happens:
- Take the question (text or visual-text).
- Use an image generator to produce a multi-panel comic. Each panel is one reasoning step.
- The last panel explicitly shows the answer (a number, word, or highlighted choice).
- Read the answer from that final panel using a simple answer reader.
- Why this step exists: • It externalizes the chain of thought as a visual timeline. Without it, the model's reasoning stays hidden and can be inconsistent.
- Example with data: • GSM8K: "Joy reads 8 pages in 20 minutes; how long for 120 pages?" Panel 1: Reads 8 pages/20 min; Panel 2: 15 chunks; Panel 3: 300 minutes; Panel 4: 5 hours (final bubble: 5 hours!).
- What breaks without it: • If the comic doesn't end with an explicit answer panel, extraction becomes unreliable.
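A minimal sketch of Path I, using hypothetical `generate_comic` and `read_final_panel` helpers (the paper does not prescribe a specific API): the comic itself is the reasoning trace, and only its final panel is parsed for the answer.

```python
from dataclasses import dataclass, field

@dataclass
class Comic:
    panels: list = field(default_factory=list)  # ordered panel images; last one states the answer

def generate_comic(question: str, n_panels: int = 4, style: str = "detective") -> Comic:
    """Hypothetical wrapper around an image generator that returns an
    n-panel comic whose final panel explicitly shows the answer."""
    raise NotImplementedError("call your image-generation backend here")

def read_final_panel(comic: Comic) -> str:
    """Hypothetical answer reader: lightweight OCR/VQA over the last panel
    that returns only the final value (a number, word, or choice)."""
    raise NotImplementedError("call your answer-extraction model here")

def solve_path_one(question: str) -> str:
    comic = generate_comic(question)   # the panels are the visual chain of thought
    return read_final_panel(comic)     # Path I: the answer lives in the final panel
```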
Path II: Comic as Conditioning Context for a Vision-Language Model (the comic guides the solver)
- What happens:
- Generate the comic from the question, just like in Path I.
- Feed both the original question and the comic to a VLM (e.g., Gemini 3 Pro).
- The VLM reads panels, bubbles, and order, then outputs a textual answer with optional steps.
- Why this step exists: • Some tasks benefit from both the comic's structure and the VLM's powerful text reasoning. Without the comic, the VLM may miss spatial/temporal cues.
- Example with data: • DocVQA: "What time is the 'coffee break'?" Panels highlight the schedule table cell; the VLM reads and answers precisely.
- What breaks without it: • Long documents get overwhelming; the comic acts like a visual outline, reducing search and confusion.
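Path II can be sketched in the same style, reusing the `generate_comic` helper from the Path I sketch; `vlm_answer` is a stand-in for whatever multimodal chat API is available, not a specific library call.

```python
def vlm_answer(question: str, images: list) -> str:
    """Hypothetical call to a vision-language model that accepts the question
    text plus the comic panels and returns a textual answer."""
    raise NotImplementedError("call your VLM endpoint here")

def solve_path_two(question: str) -> str:
    comic = generate_comic(question)   # same comic generation as in Path I
    prompt = (
        "Use the comic panels as an ordered reasoning guide. "
        "Read the panels and speech bubbles in sequence, then answer:\n" + question
    )
    return vlm_answer(prompt, comic.panels)   # Path II: the comic conditions the solver
```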
Important sub-steps and safeguards:
- Structured prompt design for generation: • Ask for exactly N panels (often 4-6), with consistent characters and labeled math/regions. • Encourage a detective or slice-of-life style depending on the task (logic vs. culture).
- Answer extraction protocol: • Path I: Use an answer reader that only returns the final value; verify on a human-checked subset. • Path II: Parse the VLM's final text; use exact-match rules when appropriate. (An illustrative sketch follows below.)
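The sketch below shows one way the structured generation prompt and the exact-match parsing could look; the wording and the "Answer: <value>" convention are assumptions, not the paper's exact prompt.

```python
import re

def comic_prompt(question: str, n_panels: int = 4, style: str = "detective") -> str:
    """Illustrative generation prompt: fixed panel count, consistent cast,
    labeled quantities, and an explicit answer in the final panel."""
    return (
        f"Draw a {n_panels}-panel {style}-style comic that solves: {question}\n"
        "Keep the same characters in every panel, label all numbers and units, "
        f"and make panel {n_panels} state the result in a speech bubble "
        "formatted as 'Answer: <value>'."
    )

def parse_final_answer(solver_text: str) -> str | None:
    """Exact-match style parsing of the solver's output (Path II)."""
    match = re.search(r"Answer:\s*(.+)", solver_text)
    return match.group(1).strip() if match else None

print(parse_final_answer("Panel 4 reads: Answer: 5 hours"))   # -> 5 hours
```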
Concrete mini-features (with sandwiches):
- Hook: A sticky note on a page helps you find the right spot fast. The Concept (Region Highlighting): Panels can zoom in or mark the exact area to read (tables, boxes, angles). How it works: 1) Draw arrows; 2) Circle key cells; 3) Add a short label. Why it matters: Without highlights, the model wastes time searching. Anchor: In DocVQA, circling "Coffee Break: 10:30-10:45" makes extraction trivial.
- Hook: A good storyboard shows the key scenes, not every frame. The Concept (Keyframe Selection): Comics choose the most informative steps instead of every tiny change. How it works: 1) Keep big state changes; 2) Drop near-duplicates; 3) Ensure cause precedes effect. Why it matters: Without selection, you get video-like redundancy and high cost. Anchor: For a pouring-then-measuring task, show "empty cup → pouring → meniscus at 200 ml → answer."
- Hook: A librarian files books in order so you can find them later. The Concept (Temporal Ordering): Keep panels in the correct sequence. How it works: 1) Number panels; 2) Ensure each panel references the previous; 3) Make the last panel a clear conclusion. Why it matters: If order is shuffled, logic breaks and accuracy drops. Anchor: For multi-step algebra, panel 2 must use results from panel 1; shuffling ruins the derivation.
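A toy version of keyframe selection and temporal ordering (purely illustrative, not the paper's algorithm): keep only states that change meaningfully, preserve their order, and number the survivors as panels.

```python
def select_keyframes(states: list[str]) -> list[tuple[int, str]]:
    """Drop consecutive near-duplicate states, keep the original order,
    and number the remaining states as comic panels."""
    keyframes = []
    previous = None
    for state in states:
        if state != previous:      # crude stand-in for a "big state change" test
            keyframes.append(state)
        previous = state
    return list(enumerate(keyframes, start=1))

states = ["empty cup", "pouring", "pouring", "pouring",
          "meniscus at 200 ml", "answer: 200 ml"]
for number, state in select_keyframes(states):
    print(f"Panel {number}: {state}")
# Four panels survive: empty cup, pouring, meniscus at 200 ml, answer (duplicates dropped)
```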
Secret sauce (what's clever here):
- Use narrative style as a visual system prompt: detective style pushes careful, clue-by-clue reasoning.
- Integrate tiny text with visuals: bubbles eliminate ambiguity while keeping context grounded.
- Stop at the sweet spot of 4-6 panels: enough to show time and cause, not enough to become a video.
- Treat the comic as a global plan: generating all panels together preserves cross-panel consistency better than incremental images.
04 Experiments & Results
The test: The authors measured accuracy on math and logic (MATH500, GSM8K), visual math (MathVista), document QA (DocVQA), and cultural knowledge (CulturalBench). They compared:
- Text-only models (e.g., GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.5),
- Reasoning LLMs with text CoT (DeepSeek-R1, Qwen3-235B-A22B),
- Thinking with Images (prompted or trained),
- Thinking with Video (Sora 2),
- Thinking with Comics (both paths, with/without embedded text).
The competition: Strong baselines cover both proprietary and open models across modalities, ensuring a fair race.
The scoreboard (with context):
- MathVista: TwC hits about 85.8%; that's like scoring an A when video approaches are around the B range.
- DocVQA: TwC reaches about 99.4%, almost perfect, like finding every important line in a long book with a good set of bookmarks.
- CulturalBench: TwC is strong on both easy and hard sets, especially when text bubbles are allowed (textual anchoring).
- Pure-text math (GSM8K/MATH500): TwC remains competitive with top models while delivering visual interpretability.
Surprising findings:
- Narrative style matters a lot: Detective-style comics improved accuracy by roughly 28.5 points over a plain documentary look on average across two benchmarks, like switching from a general pep talk to a targeted coach.
- Panel scaling shows a sweet spot: Accuracy climbs until about 4-6 panels, then plateaus; extra panels add little but cost more.
- Order is real logic: Shuffling or deleting panels hurts performance noticeably, proving that the model relies on temporal sequence (a minimal ablation sketch follows after this list).
- Text bubbles help: Adding small pieces of text inside panels boosts accuracy (e.g., +13.2 points on MathVista), reducing picture-only confusion.
- Model-agnostic gains: Feeding the same comic to different VLMs yields robust improvements, suggesting comics are a portable reasoning aid.
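The order ablation mentioned above can be pictured with a small, hypothetical harness; `evaluate` stands in for whichever accuracy script is used, so this is a sketch of the experimental idea rather than the authors' code.

```python
import random

def evaluate(panels: list, question: str) -> bool:
    """Hypothetical: feed panels + question to the solver and check the answer."""
    raise NotImplementedError("plug in Path I or Path II here")

def order_ablation(panels: list, question: str, seed: int = 0) -> dict:
    """Compare the intact comic against shuffled and truncated variants.
    A drop for the variants suggests the model relies on temporal order."""
    shuffled = panels[:]
    random.Random(seed).shuffle(shuffled)
    return {
        "intact":    evaluate(panels, question),
        "shuffled":  evaluate(shuffled, question),
        "truncated": evaluate(panels[:-1], question),  # delete the final panel
    }
```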
Efficiency vs. video: Under standard pricing, generating a comic costs about $0.134, roughly an 86.6% reduction compared with a typical 10-second video. The break-even point is about 1.34 seconds of video; beyond that, comics stay cheaper while still preserving temporal logic.
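The break-even arithmetic can be reconstructed from the reported figures; the per-second video price below is an assumption chosen only to be consistent with those figures, not a quoted rate.

```python
video_price_per_second = 0.10   # USD/s: assumed, consistent with the reported numbers
comic_cost = 0.134              # USD per generated comic (figure from the comparison above)
video_seconds = 10              # typical clip length in the comparison

video_cost = video_price_per_second * video_seconds        # 1.00
reduction = 1 - comic_cost / video_cost                    # 0.866 -> ~86.6% cheaper
break_even_seconds = comic_cost / video_price_per_second   # 1.34 s of video

print(f"video ${video_cost:.2f} vs. comic ${comic_cost:.3f}: "
      f"{reduction:.1%} savings, break-even at {break_even_seconds:.2f} s")
```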
Bottom line: Comics consistently improve multi-step and temporal reasoning with far less redundancy than video, and they structure long-context understanding into clear, scannable panels.
05 Discussion & Limitations
Limitations (be specific):
- Dependent on the image generator's skill: If the generator can't produce clean, multi-panel layouts with clear text, the reasoning trail weakens.
- Style-task fit required: The best narrative style (e.g., detective vs. slice-of-life) varies by task; a mismatch can reduce gains.
- Very fine-grained motion: Tasks needing frame-by-frame physics may still favor short videos over summarized panels.
- Answer placement: Path I needs the final panel to show the answer cleanly; messy rendering can confuse extraction.
Required resources:
- A capable image generator that supports multi-panel comics and embedded text.
- A vision-language model (for Path II) that can read both pictures and text.
- Simple answer-extraction or parsing scripts.
When NOT to use:
- Tasks with extremely subtle temporal cues (e.g., tiny biomechanical changes) where compression into panels loses needed detail.
- Purely textual logic where visuals add no value.
- Ultra-low-resource settings where even image generation is too costly or slow.
Open questions:
- Automatic panel planning: How to choose the optimal number and content of panels per task instance.
- Faithfulness: Ensuring every panel truthfully reflects the original question without hallucinated details.
- Fairness and culture: Narrative styles can carry cultural assumptions; how to make them inclusive and robust across regions.
- Evaluation: Developing standardized metrics for visual chain-of-thought quality beyond final accuracy.
- Editing loops: Can the model critique and fix its own comic for even better results?
06 Conclusion & Future Work
Three-sentence summary: This paper proposes Thinking with Comics, where AI turns problems into short, ordered comic panels that capture time and logic efficiently. By using comics either as the full reasoning path (Path I) or as context for a VLM (Path II), the method boosts accuracy on multi-step and long-context tasks while cutting generation cost compared to videos. Experiments show gains from narrative style, panel count sweet spots, and text bubbles that anchor meaning.
Main achievement: Establishing comics as a compact, structured, and highly effective intermediate visual representation that preserves temporal structure with far less redundancy than video.
Future directions:
- Smarter panel planning and automatic style selection per task,
- Stronger faithfulness checks and cultural robustness,
- Better evaluation tools for visual chains of thought,
- Interactive editing to refine panels before final answers.
Why remember this: Comics are a sweet spot between images and videos: clear like a storyboard, ordered like a timeline, and cheap to generate, making them a practical new medium for multimodal reasoning.
Practical Applications
- Tutoring systems that turn math word problems into four-panel explanations students can follow.
- Document assistants that convert long PDFs into comics highlighting the exact lines with answers.
- Customer support that storyboards device-setup troubleshooting steps to reduce errors.
- Cultural learning apps that use slice-of-life comics to teach etiquette and daily practices.
- Medical intake kiosks (non-diagnostic) that visually guide patients through form completion and instructions.
- STEM classrooms where teachers generate panel-by-panel lab procedures or geometry proofs.
- Accessibility tools that pair pictures with short text bubbles to explain complex charts or schedules.
- Enterprise analytics summaries that convert dashboards into panel sequences explaining trend → cause → action.
- Coding helpers that visualize algorithm steps (input → transform → output) in panel form.
- Compliance training that uses detective-style panels to show correct vs. incorrect procedures and why.