
Kimi K2.5: Visual Agentic Intelligence

Beginner
Kimi Team, Tongtong Bai, Yifan Bai et al. · 2/2/2026
arXiv · PDF

Key Summary

  • Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
  • It trains text and vision together from the start, so the two skills boost each other instead of fighting for attention.
  • A clever step called zero‑vision SFT uses only text training to wake up visual skills that were formed during joint pretraining.
  • Reinforcement learning is run on both text and vision tasks, and surprisingly, vision practice also makes text answers better.
  • Agent Swarm lets one main AI split a hard job into many smaller jobs and run them at the same time using sub‑agents.
  • This parallel teamwork cuts waiting time by up to about 4.5× compared to doing steps one after another.
  • A special vision encoder, MoonViT‑3D, handles images and videos in one shared way and compresses time so the model can watch much longer videos.
  • On tough tests in math, coding, vision, video, and web‑browsing agents, Kimi K2.5 is state‑of‑the‑art or highly competitive with top proprietary systems.
  • The model checkpoint is released so researchers and developers can build real apps with general agentic intelligence.
  • The big idea: train language and vision together, then teach the model to organize a helpful swarm of parallel agents.

Why This Research Matters

Kimi K2.5 shows a practical path to truly helpful digital assistants that can read, see, and act. By training text and vision together and learning to split work across parallel agents, it finishes big, messy tasks faster and more accurately. This helps students, researchers, engineers, and analysts who juggle long documents, screenshots, charts, and videos. It also lowers the barrier to building reliable browsing and coding agents that cite sources and verify results. In businesses, faster multimodal agents mean quicker insights, better decision‑making, and reduced costs. Because K2.5 is open‑source after post‑training, the community can adapt it to real‑world needs and keep improving the ecosystem.

Detailed Explanation


01 Background & Problem Definition

You know how a school project gets easier when you can read the instructions, look at pictures, and then split the work with friends? Early AI models were good readers (text) but not great at seeing (images and video), and they mostly worked alone. Before models like Kimi K2.5, many systems treated vision as a last‑minute add‑on—like taping a picture to an essay at the end. That made them strong at language or strong at vision, but rarely both, and certainly not great at planning and acting across tools.

The world before: Large language models could answer questions, write code, and explain things. Visual models could label images or read charts. But mixing them was bumpy. A common recipe was to train the language brain first, then bolt on vision late with a big chunk of image tokens. That often caused a clash: when vision arrived late and loud, text quality dipped, then slowly recovered—if it did at all. And even when models could reason step by step, they usually called tools in a straight line: step 1, then step 2, then step 3. As tasks grew bigger (say, researching 50 websites, reading dozens of PDFs, writing code, and verifying results), time grew linearly and context got messy.

The problem: 1) How do we make text and vision help each other instead of competing? 2) How do we make agentic work (planning, tool use, browsing, coding) faster and more scalable when tasks are broad and deep? 3) How do we make learning from visual tasks improve text abilities (and vice versa) instead of causing trade‑offs?

Failed attempts: Late vision fusion with high vision ratios tried to cram in visual knowledge at the end. It often hurt text, learned brittle visual skills, and needed lots of hand‑made visual training paths (like scripted tool chains) that didn’t generalize. Sequential agents tried to be smarter by thinking more, but more steps just meant longer waits, and context overflow forced reactive tricks like trimming history, which sometimes cut away needed clues.

The gap: We needed a native multimodal foundation—where text and vision grow up together from the beginning—and an agent framework that learns when and how to split work and run in parallel. We also needed a training path that could turn text‑only practice into visual competence, so we wouldn’t depend on scarce, hand‑curated vision data.

Real stakes: This matters for daily life and work. Picture a homework helper that reads your textbook, understands the diagram, and explains the steps. Imagine a coding assistant that reads screenshots, inspects logs, and writes patches. Consider a research agent that searches many sources at once, reads long documents, and summarizes the truth with citations—all much faster. For video, think of analyzing long lectures or security footage without missing key moments. In business, faster and better multimodal agents mean lower costs, better decisions, and tools that feel genuinely helpful.

02 Core Idea

Aha! Train language and vision together from the start, then teach the model to organize a swarm of parallel agents so big, mixed tasks finish faster and more accurately.

Three analogies:

  1. Team sport: Instead of first training only the striker (text) and later adding the goalie (vision), you train the whole team together so they pass smoothly. Then, during a match, you coordinate multiple players to cover the field at once.
  2. Cooking: You simmer pasta and sauce together so flavors blend (joint text‑vision training). When it’s dinner time, you have several cooks each handling a part of the meal at the same time (Agent Swarm), so food arrives hot and fast.
  3. Detective work: You study letters and photos together so clues connect naturally (joint training). Then you send different detectives in parallel—one to the library, one to the archives, one to the crime scene—and gather their reports for the solution (Agent Swarm + orchestration).

Before vs. after:

  • Before: Vision added late; text dips when vision arrives; agents execute tools in a slow line; visual learning sometimes hurts text.
  • After: Vision fused early; text and vision strengthen each other; agents split tasks and run in parallel; training on vision can even boost text benchmarks.

Why it works (intuition, not equations):

  • Early, balanced exposure lets the model build one shared space where words and pixels line up cleanly. No late shocks, fewer conflicts.
  • Text‑only fine‑tuning can still wake up visual skills because the model already linked text and visual features during pretraining; you teach it procedures and tool habits in text, and it generalizes them to images.
  • Outcome‑based visual RL rewards correct visual behaviors. Those calibrations (like careful counting, structured extraction) carry over to similar text tasks.
  • Parallel agents reduce waiting on the slowest, longest chain. The orchestrator learns to spawn the right helpers and only gathers the useful results, keeping global context tidy.

Building blocks (explained with the Sandwich pattern):

🍞 Hook: You know how your brain uses eyes and words together when studying a diagram in science class? 🥬 The Concept: Multimodal Learning (Multimodal ML) means teaching AI to understand more than one kind of input—like text, images, and video—at the same time.

  • How it works: 1) Feed mixed data (words + pixels), 2) Build shared representations so the same idea connects across text and visuals, 3) Practice tasks that need both.
  • Why it matters: Without it, the model can read or see—but struggles to connect the two. 🍞 Anchor: Reading a question about a chart and then looking at the chart to answer it.

🍞 Hook: Imagine learning by trying things and getting points when you do well. 🥬 The Concept: Reinforcement Learning (RL) lets AI learn by taking actions and getting rewards for good outcomes.

  • How it works: 1) Try an action, 2) Get feedback (reward), 3) Adjust to do better next time.
  • Why it matters: Without RL, the model may talk nicely but won’t improve at tool use and multi‑step tasks. 🍞 Anchor: A browsing agent that gets points for correct, well‑cited answers.

🍞 Hook: Your brain is a web of connections that light up when you learn. 🥬 The Concept: Neural Networks are computing layers that learn patterns by tuning their connections.

  • How it works: 1) Pass data through layers, 2) Compare output to the goal, 3) Nudge weights to reduce mistakes.
  • Why it matters: Without neural nets, modern AI wouldn’t learn rich patterns. 🍞 Anchor: Recognizing the digit “8” in different handwritings.

🍞 Hook: Think of a teacher grading your homework with answer keys. 🥬 The Concept: Supervised Learning is training on examples with the right answers.

  • How it works: 1) Input + correct output, 2) Model guesses, 3) Compare and correct.
  • Why it matters: Without it, the model doesn’t get clear guidance early on. 🍞 Anchor: Learning to caption an image from many image–caption pairs.

🍞 Hook: Big projects get easier when classmates each take a part. 🥬 The Concept: Multi‑agent Systems are groups of AIs that work together.

  • How it works: 1) Split a task, 2) Assign roles, 3) Share results, 4) Combine into a final answer.
  • Why it matters: Without teamwork, long tasks become too slow or messy. 🍞 Anchor: One agent searches papers, another analyzes data, a third writes the summary.

🍞 Hook: Adjusting the brightness on a photo app. 🥬 The Concept: Image Processing changes and measures pixels to understand pictures.

  • How it works: 1) Filter or segment pixels, 2) Detect shapes/lines, 3) Count or locate objects.
  • Why it matters: Without it, models can miss fine visual details. 🍞 Anchor: Counting blue slices in a pie chart by selecting blue pixels.
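
To make the pie‑chart anchor concrete, here is a minimal NumPy sketch of that kind of pixel selection; the tiny image and the color thresholds are invented for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical toy "chart": a 4x4 RGB grid where some pixels are blue, the rest red.
chart = np.array([
    [[200, 30, 40], [30, 40, 220], [30, 40, 220], [200, 30, 40]],
    [[30, 40, 220], [30, 40, 220], [200, 30, 40], [200, 30, 40]],
    [[200, 30, 40], [200, 30, 40], [30, 40, 220], [30, 40, 220]],
    [[30, 40, 220], [200, 30, 40], [200, 30, 40], [200, 30, 40]],
], dtype=np.uint8)

r = chart[..., 0].astype(int)
g = chart[..., 1].astype(int)
b = chart[..., 2].astype(int)

# Select "blue-dominant" pixels (arbitrary illustrative thresholds).
blue_mask = (b > 150) & (b > r + 50) & (b > g + 50)

blue_fraction = blue_mask.mean()  # share of the chart covered by blue pixels
print(f"Blue pixels: {blue_mask.sum()} / {blue_mask.size} ({blue_fraction:.0%})")
```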

🍞 Hook: Reading a picture in small tiles like a mosaic. 🥬 The Concept: Vision Transformers split images into patches and learn relationships between them.

  • How it works: 1) Turn image into patches, 2) Embed and attend to relationships, 3) Predict labels or text.
  • Why it matters: Without patch attention, models struggle with flexible resolutions and complex scenes. 🍞 Anchor: Finding where the cat is by relating nearby patches.
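
A tiny sketch of the patch step, assuming a plain NumPy image: it illustrates the generic Vision Transformer idea of cutting an image into flattened patches, not Kimi's actual encoder.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened (patch*patch*C) vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.random.rand(32, 32, 3)      # a dummy 32x32 RGB image
tokens = patchify(img, patch=8)      # 16 patches, each a 192-dim vector
print(tokens.shape)                  # (16, 192)
```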

🍞 Hook: Studying math and science in the same semester helps you connect ideas. 🥬 The Concept: Multimodal Pre‑training means training on text and images/videos together early on.

  • How it works: 1) Mix data types at a steady ratio, 2) Learn shared features, 3) Keep going long enough for balance.
  • Why it matters: Late add‑ons cause clashes and dips in quality. 🍞 Anchor: Mixing 10% vision and 90% text from the start for smoother learning.

🍞 Hook: Stirring sauce into pasta while cooking, not after plating. 🥬 The Concept: Joint Optimization of Text and Vision improves both at the same time.

  • How it works: 1) Early fusion, lower vision ratio, 2) Train steadily, 3) Avoid big late shifts.
  • Why it matters: Without it, one skill steals focus and the other suffers. 🍞 Anchor: Better results than dumping 50% vision late in training.

🍞 Hook: Learning lab safety steps in a booklet before entering the lab. 🥬 The Concept: Zero‑Vision SFT uses only text fine‑tuning to activate visual tool use learned during pretraining.

  • How it works: 1) Teach procedures in text (like coding steps), 2) Because text and vision are aligned, the skills transfer, 3) Avoid brittle hand‑made visual scripts.
  • Why it matters: Without it, you need lots of costly visual SFT and still risk worse generalization. 🍞 Anchor: Text‑only practice leads to good performance on OCR and counting once images appear.

🍞 Hook: Solving a mystery by reading notes and also inspecting photos. 🥬 The Concept: Joint Text‑Vision RL improves decisions using both text and images.

  • How it works: 1) Give tasks with verifiable outcomes, 2) Reward correct multimodal reasoning, 3) Share gains across modalities.
  • Why it matters: Without it, improvements in one mode may not help the other. 🍞 Anchor: Visual RL improved MMLU‑Pro and GPQA text scores.

🍞 Hook: Watching a movie frame by frame and summarizing each scene. 🥬 The Concept: MoonViT‑3D is a vision encoder that handles images and videos in a shared way with light temporal compression.

  • How it works: 1) Pack patches from up to 4 frames, 2) Share weights between image and video, 3) Pool over time to go 4× longer.
  • Why it matters: Without shared handling, video would need separate bulky modules. 🍞 Anchor: Handling over 2,000 frames for long‑video understanding.
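
Here is a simplified sketch of the "pool over time" idea, assuming per‑frame patch embeddings are already computed; it only shows why averaging groups of 4 frames lets roughly 4× longer videos fit in the same budget, not how MoonViT‑3D is actually implemented.

```python
import numpy as np

def temporal_pool(frame_tokens: np.ndarray, window: int = 4) -> np.ndarray:
    """Average token embeddings over groups of `window` consecutive frames.

    frame_tokens: (num_frames, num_tokens, dim) per-frame patch embeddings.
    Returns: (num_frames // window, num_tokens, dim), i.e. 4x fewer time steps.
    """
    t, n, d = frame_tokens.shape
    t = (t // window) * window                       # drop any ragged tail
    grouped = frame_tokens[:t].reshape(t // window, window, n, d)
    return grouped.mean(axis=1)

video = np.random.rand(2048, 16, 8)                  # 2,048 frames, toy token/dim sizes
pooled = temporal_pool(video)                        # -> (512, 16, 8): 4x compression in time
print(pooled.shape)
```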

🍞 Hook: A team captain who knows when to call in more players. 🥬 The Concept: Agent Swarm is a trained orchestrator that creates specialized sub‑agents and runs them in parallel.

  • How it works: 1) Decide when to split tasks, 2) Spawn sub‑agents, 3) Run them concurrently, 4) Collect just the useful outputs.
  • Why it matters: Without it, long tasks take too long and overflow context. 🍞 Anchor: Parallel web research that finishes in a fraction of the time.
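
A minimal sketch of the split, spawn, run‑concurrently, and collect loop using Python threads; `sub_agent` here is a hypothetical stand‑in for a frozen sub‑agent, not the system's real agents or API.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    """Stand-in for a frozen sub-agent: in reality this would browse, read, or run code."""
    return f"summary of: {task}"

def orchestrator(big_task: str, sources: list[str]) -> str:
    # 1) Decide to split: one sub-task per source (a real orchestrator learns this choice).
    sub_tasks = [f"{big_task} using {s}" for s in sources]
    # 2) Run sub-agents concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=len(sub_tasks)) as pool:
        results = list(pool.map(sub_agent, sub_tasks))
    # 3) Collect only the useful outputs into the global context.
    return "\n".join(results)

print(orchestrator("compare laptop prices", ["site_a", "site_b", "site_c"]))
```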

🍞 Hook: A teacher grades the group project, not each tiny step, while the group learns how to divide work. 🥬 The Concept: Parallel Agent Reinforcement Learning (PARL) trains only the orchestrator while sub‑agents are frozen.

  • How it works: 1) Freeze sub‑agents as tools, 2) Reward good parallel plans, 3) Prevent reward‑hacks like spawning useless agents.
  • Why it matters: Without it, training is unstable and it’s unclear who deserves credit. 🍞 Anchor: Smoother learning of when/what to parallelize and big latency wins.

03 Methodology

At a high level: Mixed text+images/videos → Early joint pretraining (MoonViT‑3D + language model) → Zero‑vision SFT (text‑only procedures) → Outcome‑based vision RL → Joint text‑vision RL → Agent Swarm orchestration (PARL) → Fast, accurate multimodal agents.

Step A: Native multimodal pretraining with early fusion

  • What happens: Starting from a near‑final Kimi K2 checkpoint, the model trains on about 15T mixed tokens with a steady, moderate vision ratio (e.g., 10–20% vision, 80–90% text) instead of dumping lots of vision late. The MoonViT‑3D encoder packs image/video patches and shares weights for images and videos. Temporal pooling compresses 4 frames into one chunk so the model can handle videos 4× longer in the same context.
  • Why it exists: Late heavy vision shakes the language space and causes a dip; early moderate fusion builds one stable shared space and reduces conflicts.
  • Example: With a constant 10%:90% vision:text ratio from the start, the model maintains steadier text scores while steadily climbing in vision tasks.
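
As a rough sketch of the steady‑ratio idea (the sampler and the 10% figure are illustrative assumptions, not the paper's data pipeline): draw each training sample from the vision pool with a fixed probability so vision arrives evenly rather than in a late burst.

```python
import random

def sample_batch(text_docs, vision_docs, batch_size=32, vision_ratio=0.10, seed=0):
    """Draw a batch whose expected share of vision samples is vision_ratio."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = vision_docs if rng.random() < vision_ratio else text_docs
        batch.append(rng.choice(pool))
    return batch

texts = [f"text_{i}" for i in range(1000)]
visions = [f"image_{i}" for i in range(100)]
batch = sample_batch(texts, visions)
print(sum(x.startswith("image") for x in batch), "vision samples out of", len(batch))
```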

Step B: Zero‑vision SFT (text‑only fine‑tuning)

  • What happens: The model is fine‑tuned with text‑only instruction data that teaches general agent behaviors (like how to plan, cite sources, or write Python to analyze data). Because text and vision were aligned during pretraining, these skills generalize to images when they appear.
  • Why it exists: High‑quality text data is abundant; curated vision SFT data is scarce and can overfit. Text‑only SFT activates visual tool use more robustly than small hand‑crafted visual scripts.
  • Example: The model learns to write Python code to binarize, count, and segment—then uses the same code on an actual image to count apples or read a pie chart.
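
The kind of helper code Step B describes might look like the sketch below, which binarizes a toy image and counts separate blobs with SciPy; the image and threshold are hypothetical, and this is an illustration of the pattern rather than code from the paper.

```python
import numpy as np
from scipy import ndimage

def count_objects(gray: np.ndarray, threshold: float = 0.5) -> int:
    """Binarize a grayscale image and count connected foreground blobs."""
    binary = gray > threshold                 # 1) binarize
    _, num_objects = ndimage.label(binary)    # 2) group touching pixels into blobs
    return num_objects                        # 3) count them

# Toy "image": two separate bright blobs on a dark background.
img = np.zeros((10, 10))
img[1:3, 1:3] = 1.0
img[6:9, 6:9] = 1.0
print(count_objects(img))  # -> 2
```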

Step C: Outcome‑based visual RL

  • What happens: The model practices vision‑needed tasks where answers can be checked—like grounding (point/box/polygon), counting, OCR, charts, and vision‑critical STEM. Correct outputs get rewards. Good traces can be reused for further fine‑tuning.
  • Why it exists: After text‑only SFT, the model sometimes ignores images. Outcome‑based RL forces it to pay attention when visuals matter.
  • Example: On OCRBench, rewards reflect edit distance; for segmentation, rewards depend on overlap (IoU) between predicted and true shapes.
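
Two generic scoring functions make Step C's verifiable rewards concrete: a normalized edit‑distance reward for OCR‑style answers and an intersection‑over‑union (IoU) score for boxes. These are standard formulations written as a sketch, not the exact reward code used in training.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def ocr_reward(pred: str, truth: str) -> float:
    """1.0 for a perfect transcription, decreasing as edit distance grows."""
    denom = max(len(pred), len(truth), 1)
    return 1.0 - edit_distance(pred, truth) / denom

def box_iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(ocr_reward("kimi k2.5", "kimi k25"), box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```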

Step D: Joint multimodal RL across abilities

  • What happens: RL isn’t split by modality but by abilities (knowledge, reasoning, coding, agentic). Both pure‑text and multimodal tasks train the same policy; a Generative Reward Model (GRM) provides nuanced feedback (helpfulness, relevance, instruction following) where exact answers aren’t verifiable.
  • Why it exists: Training by abilities lets wins transfer across modalities (e.g., structured extraction learned from visuals improves similar text tasks).
  • Example: After vision RL, text scores on MMLU‑Pro and GPQA go up, showing cross‑modal generalization.

Step E: Agent Swarm with PARL

  • What happens: One orchestrator model learns to decide: Should I parallelize? How many sub‑agents? What tasks do they get? Sub‑agents are frozen policies initialized from earlier checkpoints, treated as tools. Rewards include: (1) final task success, (2) a bonus that prevents collapsing back to single‑agent mode, and (3) a finish‑rate term that prevents spawning useless agents. Over training, the orchestrator learns efficient parallel plans.
  • Why it exists: Sequential agents are slow and run out of context. Parallel orchestration reduces wall‑clock time and keeps local reasoning separate in sub‑agent memories.
  • Example: On WideSearch, Agent Swarm cuts time by about 3×–4.5× to hit the same accuracy target.
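
The three reward ingredients described in this step (final success, a bonus that keeps the policy from collapsing back to a single agent, and a finish‑rate term against useless sub‑agents) can be sketched as one function; the weights and exact forms below are assumptions for illustration only.

```python
def orchestrator_reward(task_success: bool,
                        num_sub_agents: int,
                        num_finished: int,
                        used_parallelism: bool,
                        w_parallel: float = 0.1,
                        w_finish: float = 0.1) -> float:
    """Toy PARL-style reward: success + parallelism bonus + finish-rate term."""
    reward = 1.0 if task_success else 0.0
    # Bonus that discourages collapsing back to a single sequential agent.
    if used_parallelism and num_sub_agents > 1:
        reward += w_parallel
    # Finish-rate term that discourages spawning sub-agents that never complete.
    if num_sub_agents > 0:
        reward += w_finish * (num_finished / num_sub_agents)
    return reward

print(orchestrator_reward(True, num_sub_agents=5, num_finished=5, used_parallelism=True))
```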

Step F: Critical steps as a speedometer

  • What happens: Instead of counting total steps, the system measures, for each stage, the main‑agent step plus the longest sub‑agent branch, and sums these across stages. This mirrors real latency when things run in parallel.
  • Why it exists: If you spawn many sub‑tasks but one super long branch dominates, you didn’t really speed up. This metric nudges the orchestrator to balance workloads.
  • Example: Splitting 20 sources among 5 sub‑agents (4 each) is better than one sub‑agent doing all 20 while others idle.
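
Step F's metric can be written as a small helper, and the 20‑sources example falls out of it directly; the step counts below are hypothetical and only illustrate the calculation.

```python
def critical_steps(stages: list[dict]) -> int:
    """Sum, over stages, of the main-agent steps plus the longest sub-agent branch."""
    total = 0
    for stage in stages:
        longest_branch = max(stage.get("sub_agent_steps", [0]), default=0)
        total += stage["main_steps"] + longest_branch
    return total

# Balanced plan: 5 sub-agents handle 4 sources each (assume ~1 step per source).
balanced = [{"main_steps": 2, "sub_agent_steps": [4, 4, 4, 4, 4]}]
# Unbalanced plan: one sub-agent handles all 20 sources while the others idle.
unbalanced = [{"main_steps": 2, "sub_agent_steps": [20, 0, 0, 0, 0]}]
print(critical_steps(balanced), critical_steps(unbalanced))  # 6 vs 22
```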

Step G: Proactive context management by design

  • What happens: Sub‑agents keep their own small contexts; only final, relevant outputs are returned to the orchestrator. That means less clutter in the global conversation and fewer token overflows.
  • Why it exists: Truncation methods like “discard all” lose structure. Swarm keeps structure by sharding context among sub‑agents and reassembling summaries.
  • Example: In BrowseComp, swarm outperforms discard‑all in both speed and accuracy.
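
A toy sketch of the context‑sharding idea: each sub‑agent keeps its own transcript, and only a short final summary is returned to the orchestrator's global context. The class and fields are invented for illustration, not the system's real data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    name: str
    transcript: list = field(default_factory=list)  # private, potentially very long context

    def work(self, task: str) -> str:
        self.transcript.append(f"working on {task} ...")       # detailed steps stay local
        self.transcript.append("read page, extracted facts")   # (stand-in for real tool calls)
        return f"{self.name}: 2-line summary of {task}"        # only this goes back up

global_context = []                                  # the orchestrator's shared history
for i, task in enumerate(["source A", "source B", "source C"]):
    agent = SubAgent(name=f"sub{i}")
    global_context.append(agent.work(task))          # summaries only, no raw transcripts

print("\n".join(global_context))
```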

Secret sauce

  • Early, lower‑ratio vision fusion avoids late‑stage shocks and yields better overall learning under fixed budgets.
  • Zero‑vision SFT turns plentiful text data into general multimodal procedures.
  • Outcome‑based visual RL calibrates attention to images, and those calibrations help text tasks too.
  • PARL trains the orchestrator only, solving credit assignment and stabilizing learning for parallel plans.
  • MoonViT‑3D’s shared image/video space plus temporal pooling unlocks very long video understanding without special video branches.

04 Experiments & Results

The tests: The team measured Kimi K2.5 on many fronts: hard math and science reasoning (AIME 2025, HMMT, IMO‑AnswerBench, GPQA‑Diamond, MMLU‑Pro), long‑context reading (LongBench v2), coding and software engineering (SWE‑Bench series, LiveCodeBench), multimodal image/video understanding (MMMU‑Pro, OCRBench, MathVision, LongVideoBench, LVBench), agentic web research (BrowseComp, WideSearch, DeepSearchQA, FinSearchComp, Seal‑0), and computer use (OSWorld‑Verified, WebArena).

The competition: K2.5 was compared with top proprietary systems (GPT‑5.2 with extra reasoning, Claude Opus 4.5 with extended thinking, Gemini 3 Pro) and strong open‑source baselines (DeepSeek‑V3.2 for text, Qwen3‑VL‑235B‑A22B for vision).

Scoreboard with context:

  • Math and reasoning: On AIME 2025, K2.5 scored 96.1%—like getting an A+ next to classmates with A or A−. It was also outstanding on HMMT 2025 (95.4%) and IMO‑AnswerBench (81.8%). On GPQA‑Diamond and MMLU‑Pro, it reached 87.6% and 87.1% respectively—top‑tier scientific and general knowledge.
  • Long‑context: 61.0% on LongBench v2, competitive with leading models.
  • Coding: 76.8% on SWE‑Bench Verified and 85.0% on LiveCodeBench v6, showing robust, up‑to‑date coding skill. It also performed strongly across multilingual SWE‑Bench, TerminalBench 2.0, SciCode, and more.
  • Image understanding: 78.5% on MMMU‑Pro and 92.3% on OCRBench—strong at visual reasoning and reading text in images. It also excelled at MathVision (84.2%) and MathVista (mini) (90.1%).
  • Video understanding: State of the art on long video tests—75.9% on LVBench and 79.8% on LongVideoBench—demonstrating it can handle thousands of frames. It also achieved 86.6% on VideoMMMU and 80.4% on MMVU.
  • Agentic research: On BrowseComp, baseline K2.5 got 60.6% and rose to 74.9% with a simple context trick. With Agent Swarm, it jumped to 78.4%, topping even GPT‑5.2 Pro in the reported setting. On WideSearch, it improved from 72.7% to 79.0% with swarm, beating Claude Opus 4.5 (76.2%).
  • Computer use: 63.3% on OSWorld‑Verified using only GUI actions, ahead of many open approaches and close to the best proprietary system reported.

Surprising findings:

  • Vision RL helped text tasks. After outcome‑based visual RL, text‑only benchmarks improved (e.g., MMLU‑Pro and GPQA‑Diamond both rose to about 86–87%). This suggests better calibration and structured extraction skills learned from vision applied back to text.
  • Parallelism reduced time a lot. In WideSearch, Agent Swarm cut wall‑clock time by about 3×–4.5× to reach the same Item‑F1, and the time stayed flatter as the target score got higher—exactly what you want from real parallelism. That’s like a study group finishing homework hours earlier by splitting chapters.
  • Early fusion with moderate vision ratio beat late heavy vision. Given the same total token budget, sprinkling in vision early worked better than dumping a big chunk late. Text curves didn’t “dip and recover” badly; they stayed healthier.

What this means practically: K2.5 doesn’t just score well; it behaves like a capable, coordinated team. It reads, sees, plans, and acts—faster. For end users, that’s smoother browsing agents, quicker research, more reliable OCR and chart reading, stronger coding help, and long‑video understanding that doesn’t choke on length.

05 Discussion & Limitations

Limitations:

  • Compute and memory: Training and running a trillion‑parameter MoE with multimodal encoders and long contexts needs serious hardware. Agent swarms multiply concurrent inference, which can raise costs without good scheduling.
  • Data dependence: While zero‑vision SFT reduces the need for curated visual scripts, overall performance still depends on high‑quality, diverse multimodal pretraining data and careful filtering.
  • Orchestration complexity: The orchestrator must learn good parallel plans; poor plans can spawn too many agents or unbalanced branches that don’t truly speed things up.
  • Black‑box tools and web variability: Agent benchmarks can be noisy because search results change and sites differ. Careful averaging helps, but variance remains.

Required resources:

  • GPUs with strong interconnects, fast storage, and an efficient training stack (to handle long contexts and MoE routing).
  • A tool sandbox for safe code execution, browsing, and search—plus logging for RL rewards and rollouts.
  • Monitoring to prevent reward hacking (e.g., spawning agents that do nothing but collect a parallelism bonus).

When not to use:

  • Tiny, one‑shot Q&A that fits in short context and doesn’t need tools—simpler models are cheaper and fast enough.
  • Real‑time edge devices with strict latency and memory limits—full multimodal swarms may be too heavy.
  • Highly sensitive domains if tools or browsing can access untrusted content—sandboxing and guardrails are essential first.

Open questions:

  • How far does cross‑modal transfer go? We saw vision RL help text; can text RL help even more advanced visual tasks, like precise 3D reasoning?
  • What is the best curriculum for parallelism? Which tasks and rewards teach the orchestrator the most efficient decomposition strategies fastest?
  • Can we automatically size sub‑agents per task and hardware budget, like elastic scaling in the cloud?
  • How do we guarantee faithfulness and reduce hallucinations as swarms grow, especially when aggregating many sub‑agent outputs?
  • Can the same principles extend to audio, 3D, and sensor data while keeping training stable and efficient?

06 Conclusion & Future Work

In three sentences: Kimi K2.5 shows that the best way to build a helpful, general agent is to grow language and vision together from the beginning, then teach the model to organize a swarm of parallel helpers. Zero‑vision SFT plus outcome‑based visual RL unlocks visual skills without brittle hand‑made scripts, and joint RL shares gains across modalities—including boosts to text tasks. Agent Swarm turns linear tool use into parallel orchestration, cutting latency by up to about 4.5× while improving accuracy on broad, real‑world research tasks.

Main achievement: A unified, open multimodal agent that jointly optimizes text and vision and learns to coordinate parallel sub‑agents, delivering state‑of‑the‑art performance across coding, vision, video, and agentic benchmarks.

Future directions: Expand to more modalities (audio, 3D), refine parallel curricula, make orchestrators elastic to hardware budgets, strengthen faithfulness checks with better reward models and verifiers, and keep improving long‑video and long‑document understanding. Exploring richer context‑sharding and aggregation could further scale effective context length without heavy truncation.

Why remember this: The paper flips two old assumptions—vision shouldn’t be bolted on late, and agents shouldn’t be stuck in a single line of actions. Training text and vision together plus teaching parallel orchestration produces an AI that reads, sees, plans, and acts like a fast, well‑coached team.

Practical Applications

  • Homework helper that reads textbooks, understands diagrams, and explains steps with citations.
  • Coding assistant that inspects logs/screenshots and writes, tests, and patches code across a repository.
  • Research agent that searches many sources in parallel, filters for trustworthy information, and summarizes with references.
  • Document and chart analyzer that performs OCR, extracts tables, and explains trends from complex reports.
  • Video analyst that finds key moments in long lectures, tutorials, or security footage and produces concise summaries.
  • Customer support triage that reads screenshots of errors and proposes step‑by‑step fixes.
  • Financial or scientific review that cross‑checks facts across multiple PDFs and datasets simultaneously.
  • Product intelligence that scans web pages, manuals, and images to build feature comparisons quickly.
  • Legal/contract assistant that parses scanned documents, highlights clauses, and compares versions.
  • Education content creator that turns mixed notes, images, and videos into structured lessons or study guides.
#multimodal learning #vision-language models #joint optimization #reinforcement learning #agent swarm #parallel agents #MoonViT-3D #zero-vision SFT #video understanding #OCR #tool use #orchestration #long-context #Generative Reward Model #parallelization