
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Intermediate
Jintao Tong, Shilin Yan, Hongwei Xue et al. Ā· 2/5/2026
arXiv Ā· PDF

Key Summary

  • SwimBird is a multimodal AI that can switch how it thinks: only in text, only in vision (with hidden picture-like thoughts), or a mix of both.
  • It uses a hybrid autoregressive design that predicts the next word for text thoughts and the next embedding for visual thoughts in the same timeline.
  • Special tags teach the model when to enter or exit visual thinking, so it chooses the best mode based on the question instead of following a fixed script.
  • A new dataset, SwimBird-SFT-92K, was curated to cover three modes (text-only, vision-only, interleaved), so the model learns when each mode helps.
  • SwimBird dynamically adjusts how many visual-thought tokens to generate, giving hard visual tasks more thinking space and easy ones less.
  • It sets new state-of-the-art results on fine-grained visual benchmarks like V* Bench (85.5), HR-Bench 4K (79.0), and HR-Bench 8K (74.9).
  • On general VQA and multimodal reasoning, it also improves over strong baselines (e.g., 71.2 on MMStar and 73.1 on RealWorldQA).
  • Ablations show that too many visual tokens or too-strong visual-loss weights can actually hurt, so balance matters.
  • By matching the right thinking style to each question, SwimBird keeps strong logic for text problems while getting much better at tough visual ones.

Why This Research Matters

SwimBird brings the common-sense idea of ā€œuse the right kind of thinkingā€ into AI, so it performs better on both reading and seeing tasks. This helps in real situations like reading tiny labels on packages, counting steps on a map, or verifying answers on diagram-based homework. It can make digital assistants more helpful for people who rely on screen readers or need clearer visual guidance. Robots and drones can make safer choices by looking carefully when it matters and reasoning in words when planning. Finally, by avoiding unnecessary steps, SwimBird can be faster and cheaper on easy tasks while still powering through tough, high-resolution problems.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how sometimes you solve a riddle by talking it through, and other times you just need to stare at a picture to spot a tiny clue? Different problems need different kinds of thinking.

🄬 Filling (The Actual Concept): Before SwimBird, most Multimodal Large Language Models (MLLMs) tried to solve everything using mostly text-based Chain-of-Thought (CoT)—that’s like explaining every step with words, even when the real work is looking closely at the image.

  • What it is: Textual CoT is step-by-step reasoning written in words; it boosted logic-heavy tasks like math word problems and planning.
  • How it worked: 1) Read the question and image; 2) Explain the steps in sentences; 3) Give the final answer.
  • Why it mattered: It made models better at logical and numerical problems, but it struggled when tiny visual details or spatial layouts were crucial.

šŸž Bottom Bread (Anchor): Imagine reading a maze’s solution out loud without tracing it—your words won’t help you actually see the right path.

šŸž Top Bread (Hook): Picture a treasure hunt: sometimes you must keep a mental picture, not a paragraph, to track where the shiny coin is.

🄬 Filling: Researchers added latent visual reasoning: hidden, picture-like thoughts the model keeps inside instead of trying to describe every visual detail in text.

  • What it is: Continuous visual embeddings (hidden states) that act like mental images, updated step by step.
  • How it worked: 1) Encode the image; 2) Generate the next visual embedding; 3) Repeat to refine visual understanding; 4) Answer.
  • Why it mattered: It helps on vision-dense tasks (like tiny-object search, reading small text, or folding 3D nets), but many models always used these visual steps—even when not needed.

šŸž Bottom Bread (Anchor): It’s like putting on magnifying goggles for everything, including reading a big billboard—you see fine, but it’s slower and sometimes distracting.

The problem: Fixed patterns. Many systems pick one routine—only text, only visual, or an interleaved dance that always alternates in the same way. This leads to a mismatch: forcing visual thoughts for simple text questions can confuse logic; forcing text-only for visual puzzles misses critical details; and fixed interleaving can add extra, useless steps.

Failed attempts: 1) Text-only CoT—great logic, weak fine-grained vision. 2) Vision-only—better perception, but worse at symbolic reasoning. 3) Interleaved but fixed schedule—sometimes inserts unnecessary mode switches. 4) Fixed visual-thought length—too short for hard images, wasteful for easy ones.

The gap: We need a model that can choose how to think—text, vision, or both—based on the question, and can adjust how many visual-thought steps to take.

Real stakes: In daily life, this means reading tiny serial numbers, navigating cluttered shelves, grading diagram-based math, or helping a robot pick the right screw. Using the wrong kind of thinking can make the AI slow, brittle, or simply wrong. A switchable, balanced thinker matters whenever pictures and words collide.

02Core Idea

šŸž Top Bread (Hook): Imagine a Swiss Army knife brain that flips out the right tool—words or pictures—at the perfect moment.

🄬 Filling (The Actual Concept): The key idea in one sentence: Teach one model to predict words for text thoughts and embeddings for visual thoughts in the same sequence, and let it switch modes on its own.

  • How it works: 1) Unify two predictions—next word (text) and next embedding (visual); 2) Use special tags to enter/exit visual thinking; 3) Train on a dataset that contains text-only, vision-only, and interleaved examples; 4) Let the model choose and adapt the number of visual-thought steps at inference.
  • Why it matters: The model naturally matches its thinking style to the question, keeping strong logic on text problems and strong perception on vision-dense ones.

šŸž Bottom Bread (Anchor): It’s like solving a jigsaw: you silently scan and fit pieces (visual thoughts), then say, ā€œCorner piece goes hereā€ (text thought) when needed—switching as the puzzle demands.

šŸž Top Bread (Hook): You know how a storyteller can tell the next sentence, and an artist can sketch the next stroke? What if one person did both, choosing which to do next?

🄬 The Concept—Hybrid Autoregressive Model:

  • What it is: A single timeline where the model either predicts the next word (for language) or the next embedding (for visual thinking).
  • How it works: 1) Read the prompt and image; 2) Decide whether to write text or imagine a visual embedding; 3) Produce the next item; 4) Repeat, switching modes using tags when helpful; 5) Output the answer.
  • Why it matters: Without this shared timeline, switching modes would be clunky, and one mode might dominate unfairly.

šŸž Anchor: Like a comic book where some panels are pictures and some are captions—the story flows because both sit on the same page.

šŸž Top Bread (Hook): Sometimes you need quiet picture-thinking before you can put ideas into words.

🄬 The Concept—Continuous Hidden States as Visual Thoughts:

  • What it is: Internal, picture-like vectors the model updates step by step to hold visual evidence.
  • How it works: 1) Encode the scene; 2) Predict the next visual embedding that refines focus (e.g., zooming on digits); 3) Continue until enough detail is gathered; 4) Switch to text to decide and explain.
  • Why it matters: Without these, the model must describe every tiny visual clue in words, which is slow and lossy.

šŸž Anchor: Think of mentally zooming into a photo to read a street sign before you say the address out loud.

šŸž Top Bread (Hook): A good thinker knows when to talk and when to look.

🄬 The Concept—Mode Switching with Special Delimiters:

  • What it is: Tags like <|latent_start|> and <|latent_end|> that tell the model when to think visually or textually.
  • How it works: 1) The model generates a tag to enter visual mode; 2) Produces visual embeddings; 3) Emits a tag to exit; 4) Continues in text if needed.
  • Why it matters: Without clear switches, the model mixes modes randomly, causing confusion and waste.

šŸž Anchor: It’s like flipping a switch between a microscope (visual) and a notebook (text) while doing a science experiment.

šŸž Top Bread (Hook): Big pictures need more time; tiny doodles don’t.

🄬 The Concept—Dynamic Latent Token Budget:

  • What it is: The model varies how many visual-thought tokens it uses based on image resolution and difficulty.
  • How it works: 1) Allow a range for visual tokens; 2) For detailed images, generate more visual steps; 3) For simple ones, stop early; 4) Avoid fixed pooling that throws away detail.
  • Why it matters: Fixed budgets either miss details or waste compute; dynamic budgets fit the task.

šŸž Anchor: It’s like taking more camera shots of a complex scene but just one quick snap of something simple.

šŸž Top Bread (Hook): You cook different meals with different recipes—so your cookbook should label which recipe fits which ingredients.

🄬 The Concept—Reasoning-Mode Curation (Dataset Design):

  • What it is: A process to label training examples as text-only, vision-only, or interleaved so the model learns when to use each.
  • How it works: 1) Collect interleaved CoT data with intermediate images; 2) Filter out easy cases solvable without hints; 3) Use pass@8 checks with and without visual hints to decide if an item is vision-only or interleaved; 4) Add 50K text-only CoT samples for balance.
  • Why it matters: Without well-labeled examples, the model won’t learn mode selection.

šŸž Anchor: Like organizing tools into drawers—labels help you grab the right tool quickly.

Before vs After: Before, models stuck to one script; after, SwimBird flexibly switches. Why it works: Matching the thinking style to the information bottleneck (language or vision) reduces errors and saves effort. Building blocks: hybrid timeline, visual thoughts, delimiter tags, dynamic token budget, and curated multi-mode data.

03Methodology

High-level overview: Input (image + question) → Mode selection via tags → Autoregressive generation (text tokens or visual embeddings) with dynamic visual budget → Final answer.

Step A: Unified Timeline for Two Kinds of ā€œNextā€

  • What happens: The model continues a single sequence where the next item can be a word (text token) or a visual thought (embedding).
  • Why this step exists: Keeping both kinds of thoughts on one track makes switching smooth and lets the model plan its reasoning holistically.
  • Example: For reading a phone number on a boat, the model writes <|latent_start|>, produces several visual embeddings to zoom and stabilize the digits, then <|latent_end|> and writes ā€œOption B.ā€

Step B: Mode Switching with Tags

  • What happens: The model decides to enter visual mode by emitting <|latent_start|>, and decides to exit by emitting <|latent_end|>. Between these tags, it produces continuous embeddings instead of words.
  • Why it exists: Clear on/off signals prevent muddling text and visual steps, and enable accurate training targets.
  • Example: Maze solving—no need to speak until after tracing; the model stays in visual mode, then exits to say ā€œ4 steps.ā€

Step C: Dynamic Latent Token Budget

  • What happens: The number of visual embeddings is not fixed. The vision encoder emits a variable number of tokens within its resolution bounds, and at inference the model keeps producing visual embeddings until it decides to stop (a rough sketch follows this step).
  • Why it exists: Hard, high-resolution inputs deserve more visual thinking; easy ones shouldn’t pay that cost.
  • Example: A tiny label in a 4K image might need 24 visual steps; a big, clear digit might need just 4.
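
A rough sketch of this idea is below. The scaling constants and the `model.next_visual_embedding` / `model.extend_context` helpers are illustrative assumptions; the only number taken from the paper is that a cap near 32 visual-thought tokens worked best in the ablation:

```python
# Sketch: a resolution-aware cap on visual-thought tokens, combined with a
# learned stop signal so easy images exit early.

def latent_budget(width: int, height: int,
                  min_tokens: int = 4, max_tokens: int = 32) -> int:
    """More pixels allow more visual-thought steps, clamped to [min, max]."""
    megapixels = (width * height) / 1_000_000
    return max(min_tokens, min(int(min_tokens + 4 * megapixels), max_tokens))

def run_latent_span(model, state, width: int, height: int):
    cap = latent_budget(width, height)
    thoughts = []
    for _ in range(cap):
        embed, want_stop = model.next_visual_embedding(state)  # hypothetical head
        thoughts.append(embed)
        state = model.extend_context(state, embed)
        if want_stop:  # the model itself decides it has seen enough
            break
    return thoughts, state
```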

Step D: Training Objectives (Intuition Only)

  • What happens: Text spans learn via a standard likelihood loss (make the next word likely). Visual spans learn via a reconstruction-style loss (make the next visual embedding match a target embedding of an intermediate thinking image).
  • Why it exists: Each mode gets the right kind of supervision—text for language accuracy, embeddings for visual faithfulness.
  • Example: Teaching spelling by reading (text) and teaching drawing by comparing sketches (visual), each with its own scorecard.
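
Conceptually, the combined objective might look like the sketch below, assuming a simple mean-squared reconstruction loss on latent positions (the paper's exact visual loss may differ). The weight of about 0.2 is the value the ablation later identifies as a good balance:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(text_logits, text_targets, text_mask,
                pred_embeds, target_embeds, latent_mask,
                visual_weight: float = 0.2):
    """Cross-entropy on text positions plus reconstruction loss on latent positions.

    text_logits:   [B, T, V]  next-token logits
    text_targets:  [B, T]     gold token ids (ignored where text_mask == 0)
    pred_embeds:   [B, T, D]  predicted next embeddings
    target_embeds: [B, T, D]  embeddings of the intermediate "thinking" images
    """
    # Text spans: standard next-token likelihood.
    ce = F.cross_entropy(
        text_logits.flatten(0, 1), text_targets.flatten(), reduction="none")
    text_loss = (ce * text_mask.flatten()).sum() / text_mask.sum().clamp(min=1)

    # Visual spans: make the predicted embedding match the target embedding.
    mse = ((pred_embeds - target_embeds) ** 2).mean(dim=-1)
    visual_loss = (mse * latent_mask).sum() / latent_mask.sum().clamp(min=1)

    return text_loss + visual_weight * visual_loss
```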

Step E: Reasoning-Mode Curation (SwimBird-SFT-92K)

  • What happens: Build a balanced training set that teaches when to use each mode.
  • Why it exists: Models copy patterns they see. If training only shows visual steps, the model will overuse them; if only text steps, it will ignore vision.
  • Example with real data:
    • Stage 1: Gather interleaved CoT with intermediate images from ThinkMorph, Zebra-CoT, MathCanvas; filter out items solvable without hints.
    • Stage 2: Use pass@8 checks: if visual hints raise accuracy a lot (≄75%), label as vision-only; if hints help but not enough, label as interleaved.
    • Stage 3: Add 50K text-only CoT from OpenMMReasoner. Total ā‰ˆ92K examples across three modes.
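
A sketch of how the pass@8 labeling rule could be implemented is shown below. Here `solve` is a hypothetical function that runs the base model once and returns True on a correct answer, and the exact decision boundaries beyond the reported 75% figure are assumptions:

```python
# Sketch of the reasoning-mode labeling step using pass@8-style checks.

def pass_rate(solve, item, with_hints: bool, attempts: int = 8) -> float:
    wins = sum(solve(item, with_hints=with_hints) for _ in range(attempts))
    return wins / attempts

def label_mode(solve, item) -> str:
    plain = pass_rate(solve, item, with_hints=False)
    hinted = pass_rate(solve, item, with_hints=True)

    if plain >= 0.75:
        return "drop"          # already solvable without hints: filtered out
    if hinted >= 0.75:
        return "vision-only"   # visual hints alone make it reliably solvable
    if hinted > plain:
        return "interleaved"   # hints help, but text reasoning is still needed
    return "drop"
```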

Step F: Prompting for Mode Control

  • What happens: A system message explains the mode tags and tells the model to choose text-only, vision-only, or interleaved as needed.
  • Why it exists: Clear instructions reduce confusion and nudge the model to use modes purposefully.
  • Example: ā€œUse <reason>...</reason> for logic; use <|latent_start|>...<|latent_end|> for visual thinking; put the final answer in <answer>...</answer>.ā€
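
As a concrete illustration, such a system message could be assembled as below; the wording paraphrases the example above rather than reproducing the paper's exact prompt:

```python
SYSTEM_PROMPT = (
    "You can reason in three ways. Use <reason>...</reason> for step-by-step "
    "textual logic. Use <|latent_start|>...<|latent_end|> to think visually in "
    "latent space when fine visual detail or spatial search is needed. You may "
    "mix both. Put the final answer inside <answer>...</answer>."
)

def build_messages(question: str):
    # Chat-style message list in the common role/content format.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```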

Step G: Inference Flow (Putting it All Together)

  • What happens: 1) Read the question and image; 2) Start in text mode by default; 3) If text cues suggest small details or spatial search, emit <|latent_start|> to enter visual mode; 4) Produce as many visual embeddings as needed; 5) Emit <|latent_end|>; 6) Conclude in text and output <answer>.
  • Why it exists: This recipe ensures the model can self-direct attention and compute where it matters most.
  • Example: Net folding puzzle—visual-only mode to simulate folding; arithmetic word problem—text-only mode for symbolic steps; photo OCR with options—interleaved: visual to read, text to compare.

The Secret Sauce:

  • One timeline that blends words and visual thoughts without friction.
  • Clear mode tags that the model learns to place on its own.
  • Dynamic visual budgets that scale with image difficulty.
  • A carefully curated dataset that teaches not just how to think, but when to pick which way to think.

Sandwich Spotlights (Concise Recaps):

  • šŸž/🄬/šŸž Hybrid Autoregressive Model: One sequence, two prediction types; switch as needed; otherwise, modes collide or one dominates. Anchor: a graphic novel mixing pictures and captions in one story flow.
  • šŸž/🄬/šŸž Visual Thoughts: Internal image-like vectors updated stepwise; without them, tiny clues get lost in text. Anchor: mentally zooming to read a street sign.
  • šŸž/🄬/šŸž Dynamic Latent Budget: Variable visual steps; fixed budgets either miss details or waste time. Anchor: taking more photos of complex scenes than simple ones.
  • šŸž/🄬/šŸž Mode Tags: Start/stop signals for visual thinking; without tags, switching is messy. Anchor: flipping between microscope and notebook.

04Experiments & Results

The Test: The authors measure how well SwimBird handles fine-grained, high-resolution visual understanding and general multimodal reasoning. These include benchmarks where tiny details matter (V* Bench, HR-Bench 4K/8K, MME-RealWorld) and where logic and reading matter (MMStar, RealWorldQA, WeMath, DynaMath, MathVerse_MINI).

The Competition: SwimBird is compared to strong text-first models (like Qwen2.5/3-VL, LLaVA-OneVision), latent-visual models (Monet, LVR, SkiLa), and agentic tool-using systems (Pixel Reasoner, DeepEyes, Thyme). Some closed models like GPT-4o are also referenced.

The Scoreboard (with context):

  • Fine-grained perception:
    • V* Bench: 85.5 (like scoring higher than previous A students in a class of tough visual quizzes).
    • HR-Bench 4K: 79.0 and HR-Bench 8K: 74.9, beating Qwen3-VL-8B-Instruct (76.5/71.3) and surpassing many agentic systems. This is like reading small print on billboards from farther away than your peers.
  • General VQA and reasoning:
    • MMStar: 71.2 (better than Qwen3-VL-8B-Instruct at 64.7).
    • RealWorldQA: 73.1 (on par or better than strong baselines).
    • Multimodal reasoning: WeMath 49.5, DynaMath 67.2, MathVerse_MINI 65.8—showing the model didn’t sacrifice logic to gain vision.

Surprising/Notable Findings:

  • Mode Distribution Matches Task Needs: On math/logic-heavy sets (DynaMath, MathVerse_MINI), SwimBird mostly stays in text-only mode, avoiding pointless visual steps. On vision-dense sets (V* Bench, HR-Bench), it frequently uses vision-only or interleaved reasoning. That’s evidence of true mode adaptivity.
  • Dynamic Budget Matters: Setting the maximum visual-thought tokens too low (e.g., 16) limits perception; raising to 32 helps a lot; pushing higher (64, 128) can hurt—like overthinking with too many mental snapshots.
  • Balance of Losses: A moderate weight for the visual-thought reconstruction loss (about 0.2) yields the best overall balance. Too little and visual thinking is weak; too much and the model leans too hard on reconstruction, hurting general reasoning.

What Changed vs Fixed-Pattern Baselines:

  • Text-first baselines: Great at logic, weaker on tiny details. SwimBird narrows or flips this gap while keeping logic strong by not adding visual steps when they’re unnecessary.
  • Visual-latent baselines: Better perception but sometimes stumble on text logic because they always invoke visual thoughts. SwimBird avoids this by choosing text-only when that’s enough.
  • Agentic baselines: Use explicit tools (cropping, search). SwimBird achieves similar or better perception without external tool pipelines, thanks to its learned visual-thought space and switching behavior.

Takeaway: Matching the thinking style to the bottleneck (language vs. vision) consistently improves accuracy across very different benchmarks, which is rare and valuable.

05Discussion & Limitations

Limitations:

  • Mode Choice Errors: The model can sometimes pick the wrong mode (e.g., doing visual steps for a purely textual task), adding latency or noise.
  • Training Complexity: Two supervision types (text and embeddings) with balancing weights make tuning trickier.
  • Data Dependence: The quality of reasoning-mode curation (e.g., pass@8 labeling) affects performance; biases in source datasets can skew when modes are used.
  • Compute and Memory: While dynamic tokens save compute on easy cases, vision-dense problems can still be expensive.
  • Frozen Vision Encoder: Keeping the encoder frozen simplifies training but may cap potential on niche visual domains.

Required Resources:

  • A strong base MLLM (Qwen3-VL-8B here), high-memory GPUs (e.g., A100 80G), and curated multi-mode SFT data (ā‰ˆ92K samples). Stable training benefits from careful learning-rate schedules and loss weighting.

When NOT to Use:

  • Pure Text Tasks with No Images: A text-only LLM is simpler, cheaper, and fast.
  • Real-Time Edge Scenarios with Very Tight Latency: If dynamic visual steps risk timeouts, a fixed, tiny visual budget or a specialized model may be better.
  • Domains Requiring Specialized Tools (e.g., CAD, medical imaging with domain tools): A tool-augmented system might outperform a pure latent approach.

Open Questions:

  • Automatic Budget Control: Can the model learn an even smarter stopping rule for visual thoughts that’s provably optimal for latency/accuracy trade-offs?
  • Better Visual Targets: Are there richer supervision signals (e.g., contrastive or perceptual losses) that produce more semantically aligned visual thoughts?
  • Safety and Reliability: How to detect and recover when the model chooses the wrong mode midstream?
  • Continual Learning: Can mode preferences adapt over time as new tasks and domains appear without catastrophic forgetting?
  • Tool Integration: How to blend switchable internal thoughts with external tools (crop/zoom/OCR) for the best of both worlds?

06Conclusion & Future Work

Three-Sentence Summary: SwimBird introduces a switchable reasoning MLLM that unifies next-word (text) and next-embedding (visual) predictions in one sequence, then learns to choose text-only, vision-only, or interleaved modes based on each question. A curated 92K-sample dataset and dynamic visual-token budgeting teach the model not just how to think, but when and how much to think visually. The result is state-of-the-art fine-grained perception with strong text logic preserved.

Main Achievement: Establishing a practical, hybrid autoregressive framework plus data curation that reliably elicits mode switching and adaptive visual computation, turning rigid pipelines into flexible, query-matched reasoning.

Future Directions: Combine switchable thoughts with lightweight tools (zoom/crop/OCR), refine visual-thought supervision with perceptual or contrastive objectives, learn smarter stopping rules for visual spans, and extend to videos and 3D. Investigate safety signals for detecting wrong-mode choices and enable continual learning of mode preferences.

Why Remember This: SwimBird shows that picking the right kind of thinking at the right time—words, pictures, or both—can lift accuracy across very different tasks. It’s a simple, powerful idea: don’t force one reasoning style on every question. Instead, let the model switch gears and size its visual thinking to match the challenge.

Practical Applications

  • Retail shelf auditing: Read tiny price tags and match them to product names with interleaved visual-text reasoning.
  • Document QA: Verify values on invoices or forms by zooming in visually, then cross-checking with text logic.
  • STEM tutoring: Solve diagram-based math problems by inspecting figures visually and explaining steps in words.
  • Quality control: Spot surface defects or misprinted serial numbers using vision-only reasoning when needed.
  • Navigation and maps: Trace shortest paths or count steps on grids/mazes using visual thoughts, then state the result.
  • Accessibility tools: Read small on-image text aloud and summarize content for low-vision users.
  • Robotics: Pick-and-place tasks that require precise visual grounding before planning motions in text-like steps.
  • Drone search & rescue: Identify targets in high-resolution images, switching to text to coordinate actions.
  • AR assistants: Highlight relevant regions in a scene (visual) and provide instructions (text) for repairs or setup.
  • Scientific analysis: Examine charts/plots visually to extract values, then reason about trends and hypotheses in text.
Tags: SwimBird, switchable reasoning, hybrid autoregressive, visual thoughts, latent visual reasoning, multimodal LLM, dynamic latent token budget, interleaved vision-text, next-embedding prediction, Chain-of-Thought, Qwen3-VL, high-resolution perception, pass@8 curation, mode delimiters, SwimBird-SFT-92K