Kimi K2.5: Visual Agentic Intelligence
Key Summary
- Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
- It trains text and vision together from the start, so the two skills boost each other instead of fighting for attention.
- A clever step called zero-vision SFT uses only text training to wake up visual skills that were formed during joint pretraining.
- Reinforcement learning is run on both text and vision tasks, and surprisingly, vision practice also makes text answers better.
- Agent Swarm lets one main AI split a hard job into many smaller jobs and run them at the same time using sub-agents.
- This parallel teamwork cuts waiting time by up to about 4.5× compared to doing steps one after another.
- A special vision encoder, MoonViT-3D, handles images and videos in one shared way and compresses time so the model can watch much longer videos.
- On tough tests in math, coding, vision, video, and web-browsing agents, Kimi K2.5 is state-of-the-art or highly competitive with top proprietary systems.
- The model checkpoint is released so researchers and developers can build real apps with general agentic intelligence.
- The big idea: train language and vision together, then teach the model to organize a helpful swarm of parallel agents.
Why This Research Matters
Kimi K2.5 shows a practical path to truly helpful digital assistants that can read, see, and act. By training text and vision together and learning to split work across parallel agents, it finishes big, messy tasks faster and more accurately. This helps students, researchers, engineers, and analysts who juggle long documents, screenshots, charts, and videos. It also lowers the barrier to building reliable browsing and coding agents that cite sources and verify results. In businesses, faster multimodal agents mean quicker insights, better decision-making, and reduced costs. Because K2.5 is open-source after post-training, the community can adapt it to real-world needs and keep improving the ecosystem.
Detailed Explanation
01 Background & Problem Definition
You know how a school project gets easier when you can read the instructions, look at pictures, and then split the work with friends? Early AI models were good readers (text) but not great at seeing (images and video), and they mostly worked alone. Before models like Kimi K2.5, many systems treated vision as a last-minute add-on, like taping a picture to an essay at the end. That made them strong at language or strong at vision, but rarely both, and certainly not great at planning and acting across tools.
The world before: Large language models could answer questions, write code, and explain things. Visual models could label images or read charts. But mixing them was bumpy. A common recipe was to train the language brain first, then bolt on vision late with a big chunk of image tokens. That often caused a clash: when vision arrived late and loud, text quality dipped, then slowly recovered, if it recovered at all. And even when models could reason step by step, they usually called tools in a straight line: step 1, then step 2, then step 3. As tasks grew bigger (say, researching 50 websites, reading dozens of PDFs, writing code, and verifying results), time grew linearly and context got messy.
The problem: 1) How do we make text and vision help each other instead of competing? 2) How do we make agentic work (planning, tool use, browsing, coding) faster and more scalable when tasks are broad and deep? 3) How do we make learning from visual tasks improve text abilities (and vice versa) instead of causing trade-offs?
Failed attempts: Late vision fusion with high vision ratios tried to cram in visual knowledge at the end. It often hurt text, learned brittle visual skills, and needed lots of hand-made visual training paths (like scripted tool chains) that didn't generalize. Sequential agents tried to be smarter by thinking more, but more steps just meant longer waits, and context overflow forced reactive tricks like trimming history, which sometimes cut away needed clues.
The gap: We needed a native multimodal foundation, where text and vision grow up together from the beginning, and an agent framework that learns when and how to split work and run in parallel. We also needed a training path that could turn text-only practice into visual competence, so we wouldn't depend on scarce, hand-curated vision data.
Real stakes: This matters for daily life and work. Picture a homework helper that reads your textbook, understands the diagram, and explains the steps. Imagine a coding assistant that reads screenshots, inspects logs, and writes patches. Consider a research agent that searches many sources at once, reads long documents, and summarizes the truth with citations, all much faster. For video, think of analyzing long lectures or security footage without missing key moments. In business, faster and better multimodal agents mean lower costs, better decisions, and tools that feel genuinely helpful.
02 Core Idea
Aha! Train language and vision together from the start, then teach the model to organize a swarm of parallel agents so big, mixed tasks finish faster and more accurately.
Three analogies:
- Team sport: Instead of first training only the striker (text) and later adding the goalie (vision), you train the whole team together so they pass smoothly. Then, during a match, you coordinate multiple players to cover the field at once.
- Cooking: You simmer pasta and sauce together so flavors blend (joint text-vision training). When it's dinner time, you have several cooks each handling a part of the meal at the same time (Agent Swarm), so food arrives hot and fast.
- Detective work: You study letters and photos together so clues connect naturally (joint training). Then you send different detectives in parallel, one to the library, one to the archives, one to the crime scene, and gather their reports for the solution (Agent Swarm + orchestration).
Before vs. after:
- Before: Vision added late; text dips when vision arrives; agents execute tools in a slow line; visual learning sometimes hurts text.
- After: Vision fused early; text and vision strengthen each other; agents split tasks and run in parallel; training on vision can even boost text benchmarks.
Why it works (intuition, not equations):
- Early, balanced exposure lets the model build one shared space where words and pixels line up cleanly. No late shocks, fewer conflicts.
- Text-only fine-tuning can still wake up visual skills because the model already linked text and visual features during pretraining; you teach it procedures and tool habits in text, and it generalizes them to images.
- Outcome-based visual RL rewards correct visual behaviors. Those calibrations (like careful counting, structured extraction) carry over to similar text tasks.
- Parallel agents reduce waiting on the slowest, longest chain. The orchestrator learns to spawn the right helpers and only gathers the useful results, keeping global context tidy.
Building blocks (explained with the Sandwich pattern):
Hook: You know how your brain uses eyes and words together when studying a diagram in science class? The Concept: Multimodal Learning (Multimodal ML) means teaching AI to understand more than one kind of input, like text, images, and video, at the same time.
- How it works: 1) Feed mixed data (words + pixels), 2) Build shared representations so the same idea connects across text and visuals, 3) Practice tasks that need both.
- Why it matters: Without it, the model can read or see, but struggles to connect the two. Anchor: Reading a question about a chart and then looking at the chart to answer it.
Hook: Imagine learning by trying things and getting points when you do well. The Concept: Reinforcement Learning (RL) lets AI learn by taking actions and getting rewards for good outcomes.
- How it works: 1) Try an action, 2) Get feedback (reward), 3) Adjust to do better next time.
- Why it matters: Without RL, the model may talk nicely but won't improve at tool use and multi-step tasks. Anchor: A browsing agent that gets points for correct, well-cited answers.
Hook: Your brain is a web of connections that light up when you learn. The Concept: Neural Networks are computing layers that learn patterns by tuning their connections.
- How it works: 1) Pass data through layers, 2) Compare output to the goal, 3) Nudge weights to reduce mistakes.
- Why it matters: Without neural nets, modern AI wouldn't learn rich patterns. Anchor: Recognizing the digit "8" in different handwritings.
Hook: Think of a teacher grading your homework with answer keys. The Concept: Supervised Learning is training on examples with the right answers.
- How it works: 1) Input + correct output, 2) Model guesses, 3) Compare and correct.
- Why it matters: Without it, the model doesn't get clear guidance early on. Anchor: Learning to caption an image from many image-caption pairs.
Hook: Big projects get easier when classmates each take a part. The Concept: Multi-agent Systems are groups of AIs that work together.
- How it works: 1) Split a task, 2) Assign roles, 3) Share results, 4) Combine into a final answer.
- Why it matters: Without teamwork, long tasks become too slow or messy. Anchor: One agent searches papers, another analyzes data, a third writes the summary.
Hook: Adjusting the brightness on a photo app. The Concept: Image Processing changes and measures pixels to understand pictures.
- How it works: 1) Filter or segment pixels, 2) Detect shapes/lines, 3) Count or locate objects.
- Why it matters: Without it, models can miss fine visual details. Anchor: Counting blue slices in a pie chart by selecting blue pixels.
Hook: Reading a picture in small tiles like a mosaic. The Concept: Vision Transformers split images into patches and learn relationships between them.
- How it works: 1) Turn image into patches, 2) Embed and attend to relationships, 3) Predict labels or text.
- Why it matters: Without patch attention, models struggle with flexible resolutions and complex scenes. Anchor: Finding where the cat is by relating nearby patches (a toy patchify sketch follows below).
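To make the "small tiles" idea concrete, here is a minimal, hypothetical sketch (plain NumPy, not the paper's code) of how an image can be split into non-overlapping patches and flattened into one token per patch before attention is applied; the 224-pixel image size and 16-pixel patch size are illustrative assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and
    flatten each patch into a single vector (one 'token' per patch)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # Reshape into a grid of patches, then flatten each patch.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    tokens = grid.reshape(-1, patch * patch * c)  # (num_patches, patch_dim)
    return tokens

# A fake 224x224 RGB image becomes 196 patch tokens of length 768.
img = np.random.rand(224, 224, 3).astype(np.float32)
print(patchify(img).shape)  # (196, 768)
```

In a real Vision Transformer each flattened patch would then be linearly projected and fed through attention layers; the sketch stops at the tokenization step.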
Hook: Studying math and science in the same semester helps you connect ideas. The Concept: Multimodal Pre-training means training on text and images/videos together early on.
- How it works: 1) Mix data types at a steady ratio, 2) Learn shared features, 3) Keep going long enough for balance.
- Why it matters: Late add-ons cause clashes and dips in quality. Anchor: Mixing 10% vision and 90% text from the start for smoother learning.
Hook: Stirring sauce into pasta while cooking, not after plating. The Concept: Joint Optimization of Text and Vision improves both at the same time.
- How it works: 1) Early fusion, lower vision ratio, 2) Train steadily, 3) Avoid big late shifts.
- Why it matters: Without it, one skill steals focus and the other suffers. Anchor: Better results than dumping 50% vision late in training.
Hook: Learning lab safety steps in a booklet before entering the lab. The Concept: Zero-Vision SFT uses only text fine-tuning to activate visual tool use learned during pretraining.
- How it works: 1) Teach procedures in text (like coding steps), 2) Because text and vision are aligned, the skills transfer, 3) Avoid brittle hand-made visual scripts.
- Why it matters: Without it, you need lots of costly visual SFT and still risk worse generalization. Anchor: Text-only practice leads to good performance on OCR and counting once images appear.
Hook: Solving a mystery by reading notes and also inspecting photos. The Concept: Joint Text-Vision RL improves decisions using both text and images.
- How it works: 1) Give tasks with verifiable outcomes, 2) Reward correct multimodal reasoning, 3) Share gains across modalities.
- Why it matters: Without it, improvements in one mode may not help the other. Anchor: Visual RL improved MMLU-Pro and GPQA text scores.
Hook: Watching a movie frame by frame and summarizing each scene. The Concept: MoonViT-3D is a vision encoder that handles images and videos in a shared way with light temporal compression.
- How it works: 1) Pack patches from up to 4 frames, 2) Share weights between image and video, 3) Pool over time to go 4× longer.
- Why it matters: Without shared handling, video would need separate bulky modules. Anchor: Handling over 2,000 frames for long-video understanding.
Hook: A team captain who knows when to call in more players. The Concept: Agent Swarm is a trained orchestrator that creates specialized sub-agents and runs them in parallel.
- How it works: 1) Decide when to split tasks, 2) Spawn sub-agents, 3) Run them concurrently, 4) Collect just the useful outputs.
- Why it matters: Without it, long tasks take too long and overflow context. Anchor: Parallel web research that finishes in a fraction of the time.
Hook: A teacher grades the group project, not each tiny step, while the group learns how to divide work. The Concept: Parallel Agent Reinforcement Learning (PARL) trains only the orchestrator while sub-agents are frozen.
- How it works: 1) Freeze sub-agents as tools, 2) Reward good parallel plans, 3) Prevent reward hacks like spawning useless agents.
- Why it matters: Without it, training is unstable and it's unclear who deserves credit. Anchor: Smoother learning of when/what to parallelize and big latency wins.
03 Methodology
At a high level: Mixed text + images/videos → Early joint pretraining (MoonViT-3D + language model) → Zero-vision SFT (text-only procedures) → Outcome-based vision RL → Joint text-vision RL → Agent Swarm orchestration (PARL) → Fast, accurate multimodal agents.
Step A: Native multimodal pretraining with early fusion
- What happens: Starting from a near-final Kimi K2 checkpoint, the model trains on about 15T mixed tokens with a steady, moderate vision ratio (e.g., 10-20% vision, 80-90% text) instead of dumping lots of vision late. The MoonViT-3D encoder packs image/video patches and shares weights for images and videos. Temporal pooling compresses 4 frames into one chunk so the model can handle videos 4× longer in the same context (a toy pooling sketch follows this step).
- Why it exists: Late heavy vision shakes the language space and causes a dip; early moderate fusion builds one stable shared space and reduces conflicts.
- Example: With a constant 10%:90% vision:text ratio from the start, the model maintains steadier text scores while steadily climbing in vision tasks.
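The temporal compression idea can be illustrated with a small sketch. The code below is an assumption for illustration, not MoonViT-3D's actual implementation: it simply mean-pools patch embeddings over chunks of 4 consecutive frames, so a video of T frames occupies roughly T/4 temporal positions in the context.

```python
import numpy as np

def temporal_pool(frame_tokens: np.ndarray, chunk: int = 4) -> np.ndarray:
    """frame_tokens: (T, P, D) = frames x patches-per-frame x embedding dim.
    Average every `chunk` consecutive frames so the sequence handed to the
    language model is about `chunk` times shorter along the time axis."""
    t, p, d = frame_tokens.shape
    pad = (-t) % chunk  # pad by repeating the last frame so T divides evenly
    if pad:
        frame_tokens = np.concatenate(
            [frame_tokens, np.repeat(frame_tokens[-1:], pad, axis=0)], axis=0)
    grouped = frame_tokens.reshape(-1, chunk, p, d)
    return grouped.mean(axis=1)  # (ceil(T/chunk), P, D)

video = np.random.rand(2048, 64, 32).astype(np.float32)  # 2048 frames of toy embeddings
print(temporal_pool(video).shape)  # (512, 64, 32): 4x fewer temporal positions
```

The design intuition is the same as in the step above: spend the fixed context budget on more moments of the video rather than on redundant, near-duplicate frames.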
Step B: Zero-vision SFT (text-only fine-tuning)
- What happens: The model is fine-tuned with text-only instruction data that teaches general agent behaviors (like how to plan, cite sources, or write Python to analyze data). Because text and vision were aligned during pretraining, these skills generalize to images when they appear.
- Why it exists: High-quality text data is abundant; curated vision SFT data is scarce and can overfit. Text-only SFT activates visual tool use more robustly than small hand-crafted visual scripts.
- Example: The model learns to write Python code to binarize, count, and segment, then uses the same code on an actual image to count apples or read a pie chart (a toy version appears after this step).
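To give a feel for the kind of tool code this step teaches the model to write, here is a small, hypothetical example: thresholding ("binarizing") an image on a crude color rule and counting the connected regions. It uses NumPy and SciPy; the threshold values and the blue-region task are made up for illustration.

```python
import numpy as np
from scipy import ndimage

def count_blue_regions(image: np.ndarray) -> int:
    """image: (H, W, 3) RGB array with values in [0, 255].
    Binarize on a simple 'blue' rule, then count connected blobs."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    mask = (b > 150) & (r < 100) & (g < 100)    # illustrative threshold
    _, num_regions = ndimage.label(mask)        # connected-component count
    return num_regions

# Toy image: black background with two separate blue squares.
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[10:30, 10:30] = [0, 0, 255]
img[60:80, 60:80] = [0, 0, 255]
print(count_blue_regions(img))  # 2
```

The point is not the specific thresholds but the pattern: procedures learned entirely from text (write code, inspect the result, adjust) transfer to images once they arrive.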
Step C: Outcome-based visual RL
- What happens: The model practices vision-dependent tasks where answers can be checked, like grounding (point/box/polygon), counting, OCR, charts, and vision-critical STEM. Correct outputs get rewards. Good traces can be reused for further fine-tuning.
- Why it exists: After text-only SFT, the model sometimes ignores images. Outcome-based RL forces it to pay attention when visuals matter.
- Example: On OCRBench, rewards reflect edit distance; for segmentation, rewards depend on overlap (IoU) between predicted and true shapes (both reward shapes are sketched after this step).
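The exact reward formulas are not given here, so the sketch below only illustrates the shape of such verifiable rewards: a normalized edit-distance score for OCR and an intersection-over-union score for predicted boxes. The function names, scaling, and the box-based (rather than polygon-based) IoU are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via a rolling-array DP."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def ocr_reward(pred: str, target: str) -> float:
    """1.0 for an exact transcription, decaying toward 0 with more edits."""
    denom = max(len(pred), len(target), 1)
    return 1.0 - edit_distance(pred, target) / denom

def iou_reward(box_a, box_b) -> float:
    """Boxes as (x1, y1, x2, y2); reward = intersection over union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(ocr_reward("hello world", "hello w0rld"))    # ~0.91: one wrong character
print(iou_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.14: weak overlap
```

Because both scores are computed mechanically from the answer, they can be checked at scale, which is what makes this stage "outcome-based."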
Step D: Joint multimodal RL across abilities
- What happens: RL isn't split by modality but by abilities (knowledge, reasoning, coding, agentic). Both pure-text and multimodal tasks train the same policy; a Generative Reward Model (GRM) provides nuanced feedback (helpfulness, relevance, instruction following) where exact answers aren't verifiable.
- Why it exists: Training by abilities lets wins transfer across modalities (e.g., structured extraction learned from visuals improves similar text tasks).
- Example: After vision RL, text scores on MMLU-Pro and GPQA go up, showing cross-modal generalization.
Step E: Agent Swarm with PARL
- What happens: One orchestrator model learns to decide: Should I parallelize? How many sub-agents? What tasks do they get? Sub-agents are frozen policies initialized from earlier checkpoints, treated as tools. Rewards include: (1) final task success, (2) a bonus that prevents collapsing back to single-agent mode, and (3) a finish-rate term that prevents spawning useless agents. Over training, the orchestrator learns efficient parallel plans (a toy reward sketch follows this step).
- Why it exists: Sequential agents are slow and run out of context. Parallel orchestration reduces wall-clock time and keeps local reasoning separate in sub-agent memories.
- Example: On WideSearch, Agent Swarm cuts time by about 3× to 4.5× to hit the same accuracy target.
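The paper's exact reward weights are not reproduced here, so the following is a hedged sketch of how the three reward terms listed above might be combined when scoring one orchestrator rollout; the class, field names, and coefficients are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class SwarmRollout:
    task_solved: bool   # did the final answer pass verification?
    num_subagents: int  # sub-agents the orchestrator spawned
    num_finished: int   # sub-agents that returned a usable result

def orchestrator_reward(r: SwarmRollout,
                        parallel_bonus: float = 0.1,
                        finish_weight: float = 0.1) -> float:
    """Toy reward: (1) final task success, (2) a small bonus for actually
    using parallelism (so the policy doesn't collapse to single-agent mode),
    (3) a finish-rate term that penalizes spawning useless sub-agents."""
    reward = 1.0 if r.task_solved else 0.0
    if r.num_subagents > 1:
        reward += parallel_bonus
    if r.num_subagents > 0:
        reward += finish_weight * (r.num_finished / r.num_subagents)
    return reward

print(orchestrator_reward(SwarmRollout(True, 5, 5)))   # 1.2: solved, fully parallel
print(orchestrator_reward(SwarmRollout(True, 8, 2)))   # 1.125: solved, wasteful spawning
print(orchestrator_reward(SwarmRollout(False, 1, 1)))  # 0.1: unsolved, no parallelism
```

Only the orchestrator is updated against this signal; the sub-agents stay frozen, which is how PARL sidesteps the credit-assignment problem described earlier.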
Step F: Critical steps as a speedometer
- What happens: Instead of counting total steps, the system measures the sum over stages of the main-agent step plus the longest sub-agent branch. This mirrors real latency when things run in parallel (a small sketch follows this step).
- Why it exists: If you spawn many sub-tasks but one super long branch dominates, you didn't really speed up. This metric nudges the orchestrator to balance workloads.
- Example: Splitting 20 sources among 5 sub-agents (4 each) is better than one sub-agent doing all 20 while others idle.
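Here is a small sketch of that latency-style metric. The stage representation is an assumption: each stage records the main agent's steps plus the step counts of the sub-agent branches it launched in parallel.

```python
def critical_steps(stages: list[dict]) -> int:
    """For each stage, count the main-agent steps plus the LONGEST parallel
    sub-agent branch, then sum over stages. This mirrors wall-clock latency
    when branches genuinely run concurrently."""
    total = 0
    for stage in stages:
        branches = stage.get("subagent_steps", [])
        total += stage["main_steps"] + (max(branches) if branches else 0)
    return total

# 20 sources split evenly across 5 sub-agents (4 steps each)...
balanced = [{"main_steps": 2, "subagent_steps": [4, 4, 4, 4, 4]}]
# ...versus one sub-agent doing all 20 while the others idle.
unbalanced = [{"main_steps": 2, "subagent_steps": [20, 0, 0, 0, 0]}]
print(critical_steps(balanced))    # 6
print(critical_steps(unbalanced))  # 22
```

The two example plans use the same total number of sub-agent steps, but only the balanced one actually reduces the critical path, which is exactly the behavior the metric rewards.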
Step G: Proactive context management by design
- What happens: Sub-agents keep their own small contexts; only final, relevant outputs are returned to the orchestrator. That means less clutter in the global conversation and fewer token overflows (a minimal sketch follows this step).
- Why it exists: Truncation methods like "discard all" lose structure. Swarm keeps structure by sharding context among sub-agents and reassembling summaries.
- Example: In BrowseComp, swarm outperforms discard-all in both speed and accuracy.
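A minimal sketch of the sharded-context idea, with every name hypothetical: each sub-agent accumulates its own local history, but only its final summary string is appended to the orchestrator's global context.

```python
def run_subagent(subtask: str) -> dict:
    """Stand-in for a frozen sub-agent: it may take many internal steps,
    but keeps them in its own local history."""
    local_history = [f"searching for: {subtask}",
                     "reading source 1...",
                     "reading source 2..."]
    return {"history": local_history,  # stays inside the sub-agent
            "summary": f"Findings for '{subtask}' (2 sources checked)."}

def orchestrate(subtasks: list[str]) -> list[str]:
    """Only each sub-agent's final summary enters the global context,
    so the orchestrator's own conversation stays short and structured."""
    global_context = []
    for task in subtasks:  # in the real system these would run concurrently
        result = run_subagent(task)
        global_context.append(result["summary"])  # local history is discarded
    return global_context

print(orchestrate(["pricing pages", "release notes", "benchmark reports"]))
```

The contrast with "discard all" is that nothing needed for the final answer is thrown away; the long intermediate reasoning simply never enters the shared context in the first place.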
Secret sauce
- Early, lower-ratio vision fusion avoids late-stage shocks and yields better overall learning under fixed budgets.
- Zero-vision SFT turns plentiful text data into general multimodal procedures.
- Outcome-based visual RL calibrates attention to images, and those calibrations help text tasks too.
- PARL trains the orchestrator only, solving credit assignment and stabilizing learning for parallel plans.
- MoonViT-3D's shared image/video space plus temporal pooling unlocks very long video understanding without special video branches.
04 Experiments & Results
The tests: The team measured Kimi K2.5 on many fronts: hard math and science reasoning (AIME 2025, HMMT, IMO-AnswerBench, GPQA-Diamond, MMLU-Pro), long-context reading (LongBench v2), coding and software engineering (SWE-Bench series, LiveCodeBench), multimodal image/video understanding (MMMU-Pro, OCRBench, MathVision, LongVideoBench, LVBench), agentic web research (BrowseComp, WideSearch, DeepSearchQA, FinSearchComp, Seal-0), and computer use (OSWorld-Verified, WebArena).
The competition: K2.5 was compared with top proprietary systems (GPT-5.2 with extra reasoning, Claude Opus 4.5 with extended thinking, Gemini 3 Pro) and strong open-source baselines (DeepSeek-V3.2 for text, Qwen3-VL-235B-A22B for vision).
Scoreboard with context:
- Math and reasoning: On AIME 2025, K2.5 scored 96.1%, like getting an A+ next to classmates with an A or A-. It was also outstanding on HMMT 2025 (95.4%) and IMO-AnswerBench (81.8%). On GPQA-Diamond and MMLU-Pro, it reached 87.6% and 87.1% respectively: top-tier scientific and general knowledge.
- Long-context: 61.0% on LongBench v2, competitive with leading models.
- Coding: 76.8% on SWE-Bench Verified and 85.0% on LiveCodeBench v6, showing robust, up-to-date coding skill. It also performed strongly across multilingual SWE-Bench, TerminalBench 2.0, SciCode, and more.
- Image understanding: 78.5% on MMMU-Pro and 92.3% on OCRBench, strong at visual reasoning and reading text in images. It also excelled at MathVision (84.2%) and MathVista (mini) (90.1%).
- Video understanding: State of the art on long video tests, with 75.9% on LVBench and 79.8% on LongVideoBench, demonstrating it can handle thousands of frames. It also achieved 86.6% on VideoMMMU and 80.4% on MMVU.
- Agentic research: On BrowseComp, baseline K2.5 got 60.6% and rose to 74.9% with a simple context trick. With Agent Swarm, it jumped to 78.4%, topping even GPT-5.2 Pro in the reported setting. On WideSearch, it improved from 72.7% to 79.0% with swarm, beating Claude Opus 4.5 (76.2%).
- Computer use: 63.3% on OSWorld-Verified using only GUI actions, ahead of many open approaches and close to the best proprietary system reported.
Surprising findings:
- Vision RL helped text tasks. After outcome-based visual RL, text-only benchmarks improved (e.g., MMLU-Pro and GPQA-Diamond both rose to about 86-87%). This suggests better calibration and structured extraction skills learned from vision applied back to text.
- Parallelism reduced time a lot. In WideSearch, Agent Swarm cut wall-clock time by about 3× to 4.5× to reach the same Item-F1, and the time stayed flatter as the target score got higher, exactly what you want from real parallelism. That's like a study group finishing homework hours earlier by splitting chapters.
- Early fusion with a moderate vision ratio beat late heavy vision. Given the same total token budget, sprinkling in vision early worked better than dumping a big chunk late. Text curves didn't "dip and recover" badly; they stayed healthier.
What this means practically: K2.5 doesn't just score well; it behaves like a capable, coordinated team. It reads, sees, plans, and acts faster. For end users, that's smoother browsing agents, quicker research, more reliable OCR and chart reading, stronger coding help, and long-video understanding that doesn't choke on length.
05 Discussion & Limitations
Limitations:
- Compute and memory: Training and running a trillion-parameter MoE with multimodal encoders and long contexts needs serious hardware. Agent swarms multiply concurrent inference, which can raise costs without good scheduling.
- Data dependence: While zero-vision SFT reduces the need for curated visual scripts, overall performance still depends on high-quality, diverse multimodal pretraining data and careful filtering.
- Orchestration complexity: The orchestrator must learn good parallel plans; poor plans can spawn too many agents or unbalanced branches that don't truly speed things up.
- Black-box tools and web variability: Agent benchmarks can be noisy because search results change and sites differ. Careful averaging helps, but variance remains.
Required resources:
- GPUs with strong interconnects, fast storage, and an efficient training stack (to handle long contexts and MoE routing).
- A tool sandbox for safe code execution, browsing, and search, plus logging for RL rewards and rollouts.
- Monitoring to prevent reward hacking (e.g., spawning agents that do nothing but collect a parallelism bonus).
When not to use:
- Tiny, one-shot Q&A that fits in a short context and doesn't need tools; simpler models are cheaper and fast enough.
- Real-time edge devices with strict latency and memory limits; full multimodal swarms may be too heavy.
- Highly sensitive domains if tools or browsing can access untrusted content; sandboxing and guardrails are essential first.
Open questions:
- How far does cross-modal transfer go? We saw vision RL help text; can text RL help even more advanced visual tasks, like precise 3D reasoning?
- What is the best curriculum for parallelism? Which tasks and rewards teach the orchestrator the most efficient decomposition strategies fastest?
- Can we automatically size sub-agents per task and hardware budget, like elastic scaling in the cloud?
- How do we guarantee faithfulness and reduce hallucinations as swarms grow, especially when aggregating many sub-agent outputs?
- Can the same principles extend to audio, 3D, and sensor data while keeping training stable and efficient?
06 Conclusion & Future Work
In three sentences: Kimi K2.5 shows that the best way to build a helpful, general agent is to grow language and vision together from the beginning, then teach the model to organize a swarm of parallel helpers. Zero-vision SFT plus outcome-based visual RL unlocks visual skills without brittle hand-made scripts, and joint RL shares gains across modalities, including boosts to text tasks. Agent Swarm turns linear tool use into parallel orchestration, cutting latency by up to about 4.5× while improving accuracy on broad, real-world research tasks.
Main achievement: A unified, open multimodal agent that jointly optimizes text and vision and learns to coordinate parallel sub-agents, delivering state-of-the-art performance across coding, vision, video, and agentic benchmarks.
Future directions: Expand to more modalities (audio, 3D), refine parallel curricula, make orchestrators elastic to hardware budgets, strengthen faithfulness checks with better reward models and verifiers, and keep improving long-video and long-document understanding. Exploring richer context-sharding and aggregation could further scale effective context length without heavy truncation.
Why remember this: The paper flips two old assumptions: vision shouldn't be bolted on late, and agents shouldn't be stuck in a single line of actions. Training text and vision together plus teaching parallel orchestration produces an AI that reads, sees, plans, and acts like a fast, well-coached team.
Practical Applications
- Homework helper that reads textbooks, understands diagrams, and explains steps with citations.
- Coding assistant that inspects logs/screenshots and writes, tests, and patches code across a repository.
- Research agent that searches many sources in parallel, filters for trustworthy information, and summarizes with references.
- Document and chart analyzer that performs OCR, extracts tables, and explains trends from complex reports.
- Video analyst that finds key moments in long lectures, tutorials, or security footage and produces concise summaries.
- Customer support triage that reads screenshots of errors and proposes step-by-step fixes.
- Financial or scientific review that cross-checks facts across multiple PDFs and datasets simultaneously.
- Product intelligence that scans web pages, manuals, and images to build feature comparisons quickly.
- Legal/contract assistant that parses scanned documents, highlights clauses, and compares versions.
- Education content creator that turns mixed notes, images, and videos into structured lessons or study guides.