COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Key Summary
- COOPER is a single AI model that both "looks better" (perceives depth and object boundaries) and "thinks smarter" (reasons step by step) to answer spatial questions about images.
- It teaches itself to create extra helpful pictures, like depth maps (how far things are) and segmentation maps (which pixels belong to which object), and then uses them while reasoning.
- The model decides on the fly when to draw these helper pictures and when to keep reasoning with text; this flexible back-and-forth is called interleaved multimodal chain-of-thought.
- To make non-RGB helpers fit its image generator, COOPER cleverly converts depth and segmentation labels into RGB-like pseudo-images and learns them with the same training recipe.
- After supervised fine-tuning on examples with step-by-step, image-and-text reasoning, COOPER is further polished with reinforcement learning using a special CPR reward.
- Across spatial benchmarks, COOPER improves by an average of 6.91% over its base model (BAGEL) while also gaining 4.47% on general multimodal tests.
- Even a version trained only to generate helper maps (no reasoning training) gets 7.92% better at distance and size estimation, showing that making these helpers internalizes 3D know-how.
- Compared with doing just perception help or just text reasoning, COOPER's interleaved reasoning reaches a higher ceiling on both spatial and general tasks.
- COOPER adapts its tools: it tends to use depth for relative-distance tasks, segmentation for counting and locating, and sticks to text-only steps when visuals add little.
- Limits remain: it focuses on single images, relies on the BAGEL backbone, and currently optimizes mainly text in RL (not the generated images), leaving video and richer 3D inputs as future work.
Why This Research Matters
Spatial intelligence underpins safety and usefulness in everyday AI: a home robot needs to know which cup is closer, AR apps must place virtual objects at the right depth, and cars must judge distances precisely. COOPER shows that one model can both create the visual aids it needs and reason with them, rather than depending on separate, brittle toolchains. This tight coupling makes decisions more robust and explanations more transparent, since the model shows its work with helper images. The approach also boosts general skills, not just spatial ones, suggesting a broader recipe for multimodal reasoning. As systems scale to video and 3D inputs, this cooperative see-and-think loop could become a standard blueprint for reliable AI in the physical world.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're playing hide-and-seek in a photo. To find who's closest, you don't just look at colors; you also judge how far things are and which shapes belong to which person. That mix of seeing and thinking is how we really understand space.
🥬 Filling (The Actual Concept): Before this work, most Multimodal Large Language Models (MLLMs) were great talkers about pictures but not great 3D thinkers. They learned mostly from 2D image-text pairs, so they often missed true depth, shapes, and object boundaries that matter for spatial questions like "Which object is closer?" or "How many players are on the left?"
- What it is: Visual spatial reasoning is an AI's ability to notice object properties (size, distance, boundaries) and their spatial relationships, then use that to answer questions.
- How it worked before: People improved either the "seeing" part (by adding helper signals like depth and segmentation from separate tools) or the "thinking" part (by training models to reason in text), but rarely both together.
- Why it matters: Without strong perception, reasoning guesses can be wrong. Without strong reasoning, even perfect perception can't solve complex tasks.
🍞 Bottom Bread (Anchor): If you ask, "Is the train closer than the building?", a model that just reads 2D pixels might get confused by perspective. A model that also creates and uses a depth map can see closeness more reliably.
New Concept 1
🍞 Hook: You know how a Swiss Army knife has many tools in one handle?
🥬 MLLMs: What it is: A Multimodal Large Language Model (MLLM) is a single model that can understand and generate both images and text.
- How it works:
- Turn images into tokens (like words for pictures).
- Mix text and image tokens in one brain (the transformer).
- Predict the next token for text; generate image pixels with an image head.
- Answer questions by combining what it sees and what it reads.
- Why it matters: It lets the model look and talk about the same thing, enabling richer answers than text-only.
🍞 Anchor: When you ask, "What color is the ball next to the dog?", the model looks (image tokens) and explains (text tokens) in one go.
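To ground the token-mixing steps above, here is a minimal, hedged PyTorch sketch of a toy MLLM. The layer sizes, patch handling, and the tiny two-layer encoder are illustrative assumptions, not the actual BAGEL/COOPER architecture (which is far larger and pairs a causal language core with a separate image-generation head).

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Toy illustration of one transformer that mixes image and text tokens."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)       # flattened image patches -> "visual words"
        self.text_embed = nn.Embedding(vocab_size, dim)       # text token ids -> embeddings
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # the shared "brain"
        self.lm_head = nn.Linear(dim, vocab_size)             # predicts the next text token

    def forward(self, image_patches, text_ids):
        img_tok = self.patch_embed(image_patches)             # (B, num_patches, dim)
        txt_tok = self.text_embed(text_ids)                   # (B, num_text, dim)
        seq = torch.cat([img_tok, txt_tok], dim=1)            # one shared sequence of both modalities
        hidden = self.backbone(seq)
        return self.lm_head(hidden[:, -text_ids.size(1):])    # next-token logits for the text positions

model = ToyMLLM()
patches = torch.randn(1, 16, 3 * 16 * 16)                    # 16 flattened 16x16 RGB patches
question = torch.randint(0, 1000, (1, 8))                    # 8 toy text tokens
logits = model(patches, question)                             # (1, 8, 1000): "look" and "talk" in one pass
```

A real unified MLLM adds an image-generation head on top of this shared sequence so it can also draw, which is exactly what COOPER's Stage 1 exploits later on.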
The problem: Even strong MLLMs still struggled with 3D awareness. Two separate fix-it paths emerged:
- Perception enhancement: Feed extra modalities (depth maps, segmentation masks) from external tools so the model's vision gets crisper.
- Reasoning enhancement: Teach better step-by-step thinking with textual chain-of-thought or reinforcement learning on spatial Q&A.
But doing only one of these made the other a bottleneck.
New Concept 2
🍞 Hook: Imagine cooking with spices you can make yourself whenever you need.
🥬 Auxiliary Modality Generation: What it is: The model learns to generate its own helpful add-on pictures (like depth and segmentation) instead of relying on external tools.
- How it works:
- Convert depth and segmentation labels into RGB-like pseudo-images.
- Train the modelās image generator to produce these helpers on command.
- During a question, the model can choose to create a helper image and use it while reasoning.
- Why it matters: If the model can make its own helpers, it can flexibly see better right when reasoning needs it.
🍞 Anchor: When asked, "Which bottle is closer?", the model can create a depth map and then answer with confidence.
New Concept 3
🍞 Hook: Think of a student who decides when to use a calculator and when to do mental math.
🥬 Adaptive Reasoning: What it is: The model learns to decide, step by step, whether to generate a helper image or continue reasoning in text.
- How it works:
- Start reasoning in text.
- If needed, generate a depth/segmentation helper.
- Read that helper and continue reasoning.
- Stop when the answer is clear.
- Why it matters: Fixed pipelines are rigid; adaptive steps match the task's needs.
🍞 Anchor: For counting players on the left, it generates a segmentation map; for comparing distances, it generates depth.
The gap this paper fills: It unifies powerful perception and careful reasoning inside one model that learns when to use which tool. That changes the game from "two separate improvements" to "a single, cooperative system."
Real stakes: Better spatial intelligence helps robots put items in the right spot, assists AR apps in placing virtual objects realistically, and makes self-driving cars and home assistants safer and smarter.
New Concept 4
🍞 Hook: You know how our eyes judge how far a soccer ball is before we kick?
🥬 Depth Estimation: What it is: Predicting how far each pixel is from the camera.
- How it works:
- Look at the image.
- Infer a depth value per pixel (near vs. far).
- Output a depth map where colors represent distance.
- Why it matters: Without depth, "closer or farther" questions become guesswork.
🍞 Anchor: To answer "Which is nearer, the mug or the vase?", a depth map makes the nearer one pop out.
New Concept 5
🍞 Hook: Like cutting a cake into slices to serve people.
🥬 Segmentation: What it is: Labeling which pixels belong to which object.
- How it works:
- Find object boundaries.
- Assign each object a distinct color/label.
- Output a map showing clear object regions.
- Why it matters: Counting, locating, and comparing sizes are all easier with clean boundaries.
🍞 Anchor: To answer "How many players are left of #4?", segmentation separates players so counting is reliable.
02 Core Idea
🍞 Top Bread (Hook): Imagine a detective who can draw their own helpful sketches (like outlines and distance lines) while thinking through a case, and who decides exactly when each sketch would help.
🥬 Filling (The Actual Concept): The paper's key insight in one sentence: Teach one unified model to both generate perception helpers (depth/segmentation) and adaptively weave those helpers into its step-by-step reasoning.
Three analogies:
- Chef with spices-on-demand: The chef (model) cooks (answers) better when it grinds its own spices (depth/segmentation) right when the recipe needs flavor.
- Student with tools: The student decides when to use a calculator (depth) or a highlighter (segmentation) versus just thinking, making problem-solving smoother.
- Builder's toolkit: The builder picks a tape measure (depth) or a stencil (segmentation) mid-job, instead of carrying pre-cut parts that might not fit.
Before vs. After:
- Before: Models either saw better using fixed, external helpers or thought better with pure text steps, but not both together in a flexible way.
- After: COOPER internally creates helpers and interleaves them with reasoning, choosing the right tool at the right moment, boosting both accuracy and reliability.
Why it works (intuition without equations):
- Internal practice builds intuition: When the model learns to generate depth/segmentation, it rehearses 3D geometry and object boundaries, embedding that knowledge into its representations.
- Ask-when-useful: Adaptive reasoning prevents overusing visuals (which can waste time) and underusing them (which can cause mistakes). The model weighs the benefit, then acts.
- Feedback that teaches balance: A special CPR reward praises correct answers, clean reasoning format, and sensible tool use, nudging the model to couple seeing and thinking cooperatively.
Building blocks (each explained with a sandwich):
New Concept 6
🍞 Hook: Like one backpack that holds both a camera and a notepad.
🥬 Unified MLLM Backbone (BAGEL): What it is: A base model that understands and generates images and text in one system, with dedicated parts for each.
- How it works:
- A visual encoder turns images into tokens for understanding.
- A language core mixes image and text tokens.
- An image generator produces images from a learned visual latent space.
- Why it matters: It's the stage where both seeing and drawing can happen together.
🍞 Anchor: When prompted, the model can write an explanation or generate an image (like a depth-map lookalike) without leaving its own toolbox.
New Concept 7
🍞 Hook: Think of tracing paper that lets you convert different drawings into the same format.
🥬 RGB Pseudo-Images for Helpers: What it is: Turning depth and segmentation labels into RGB-like images so the existing image generator can learn them.
- How it works:
- Segmentation: assign distinctive colors to instances to make an RGB label image.
- Depth: rescale depth values into an RGB range the generator expects.
- Train with the normal image-generation loss.
- Why it matters: No new machinery is needed; the model learns helpers in its native image space.
🍞 Anchor: The model can be asked, "<depth-estimation>...</depth-estimation>", and it will generate a depth-looking image it understands.
New Concept 8
🍞 Hook: Like learning from worked examples before practicing freely.
🥬 Supervised Fine-Tuning (SFT) for Interleaved Steps: What it is: Show the model examples where text thinking and helper images alternate.
- How it works:
- Curate questions and build step-by-step solutions that include when to generate helpers.
- Train the model to produce those text steps and final answers.
- Do not force the exact pixels of helpers during SFT to avoid noise.
- Why it matters: Gives the model a starting habit of when and how to interleave.
🍞 Anchor: For a sports image, the example shows: think → generate segmentation → think → answer.
New Concept 9
🍞 Hook: Like getting points in a game for both solving puzzles and using tools wisely.
🥬 Reinforcement Learning (RL) with CPR Reward: What it is: Practice-time feedback that scores answers, clean format, and sensible helper use.
- How it works:
- Sample multiple responses per question.
- Score each: correct answer (yes/no), proper interleaved format (yes/no), and tool-use balance (guided by past data and a threshold).
- Nudge the model towards higher-scoring behaviors.
- Why it matters: Reinforces the cooperative dance between seeing and thinking.
🍞 Anchor: The model learns that for distance questions, using depth helps; for pure geometry text puzzles, extra visuals aren't needed.
The result: A single model that flexibly "draws to think," making depth/segmentation maps when useful, reading them, and finishing with a confident answer.
03 Methodology
At a high level: Input image and question → (Stage 1) Learn to generate helper modalities (depth and segmentation) as RGB pseudo-images → (Stage 2) Learn to adaptively interleave text reasoning and helper generation with SFT → Refine with RL (CPR reward) → Output final answer (and optional helper images).
Stage 1: Auxiliary Modality Generation (make the model its own helper-tool)
New Concept 10
🍞 Hook: You know how you can repaint any sketch into the same color palette so one printer can print them all?
🥬 RGB Mapping for Helpers: What it is: A method to express depth and segmentation as RGB-like images so the generator can learn them with its usual training.
- How it works step by step:
- Segmentation to RGB: Assign an RGB color to each instance/class → get a colorful label image.
- Depth to RGB: Replicate the depth channel into three channels and rescale values into the generator's expected range (so it's like an image).
- Train with the same image-generation loss the model already uses (Rectified Flow in latent space with a VAE).
- Why it matters: No extra decoders or new losses; the model's existing image generator now learns to "draw" depth/segmentation on demand.
🍞 Anchor: Prompt: "<segmentation> Segment the objects..." → the model outputs a segmentation-like RGB image it can also read back into its reasoning.
Training details (friendly intuition): The model's image head doesn't know it's making a "special" image; it just learns how these helper images should look when asked with tags like <depth-estimation>...</depth-estimation> or <segmentation>...</segmentation>. Across datasets (e.g., Hypersim/Virtual KITTI for depth; ADE20K for segmentation), it practices producing accurate helper images.
What breaks without this step: If the model cannot generate helpers itself, it must rely on external tools, losing the tight loop between drawing and thinking inside one brain.
Example with real data: Given a living room photo, the model can generate (1) a depth map where nearer furniture is "warmer," and (2) a segmentation map where the couch, table, and lamp are separate colors.
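As a concrete illustration of the RGB mapping above, here is a small, hedged NumPy sketch. The min-max normalization and the random color palette are assumptions for illustration; the paper only specifies that depth is rescaled into the generator's RGB range and that instances get distinctive colors.

```python
import numpy as np

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Rescale an (H, W) depth map into a 3-channel uint8 pseudo-image."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize to [0, 1]
    d = (d * 255).astype(np.uint8)
    return np.stack([d, d, d], axis=-1)                             # replicate into R, G, B

def segmentation_to_rgb(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Map an (H, W) integer label map to a distinct color per instance/class."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(int(labels.max()) + 1, 3), dtype=np.uint8)
    return palette[labels]                                          # (H, W, 3) colorful label image

# Toy living-room-style example: a 4x4 depth map and a two-object segmentation.
depth = np.linspace(0.5, 10.0, 16).reshape(4, 4)                    # near couch ... far wall
seg = np.array([[0, 0, 1, 1]] * 4)                                  # object 0 vs. object 1
depth_rgb, seg_rgb = depth_to_rgb(depth), segmentation_to_rgb(seg)
# Both outputs now look like ordinary RGB images, so the existing image
# generator can learn to produce them with its usual generation loss.
```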
New Concept 11
🍞 Hook: Think of drawing-and-thinking as a conversation with yourself.
🥬 Interleaved Multimodal Chain-of-Thought: What it is: Reasoning that goes text → (optional helper image) → text → (optional helper) → answer.
- How it works:
- The model starts with a <think> step.
- If visuals can help, it issues <depth-estimation> or <segmentation>.
- It reads the helper, continues thinking, and finally answers.
- Why it matters: Some questions need a picture to think clearly; some don't. Interleaving adapts to the question.
🍞 Anchor: For "Which player is left of #4?", it generates segmentation to distinguish players cleanly before counting.
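A rough sketch of how such an interleaved loop could look at inference time is below. The tag names follow the paper's description, but the control flow and the helper methods (generate_text, generate_helper_image) are hypothetical stand-ins for the unified model's actual text and image heads.

```python
HELPER_TAGS = ("<depth-estimation>", "<segmentation>")

def interleaved_answer(model, image, question, max_steps=6):
    """Alternate text reasoning with on-demand helper images until an answer appears."""
    context = [image, f"<think> {question}"]
    for _ in range(max_steps):
        step = model.generate_text(context)              # next text reasoning segment
        context.append(step)
        tag = next((t for t in HELPER_TAGS if t in step), None)
        if tag is not None:
            # The model asked for a helper: draw it, then read it back into the context.
            helper_image = model.generate_helper_image(context, modality=tag)
            context.append(helper_image)
        if "<answer>" in step:                            # the model decided it is done
            return step.split("<answer>")[-1].strip()
    return None                                           # gave up within the step budget
```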
Stage 2a: Supervised Fine-Tuning (SFT) on curated interleaved data
- Data curation: Start from spatial QA sets (e.g., SAT VQA) and general QA (TACO). For each question, test the base model twice: with and without helpers, sampling multiple attempts. Keep samples where helpers sometimes change accuracy (not trivially always right or always wrong). Build balanced sets: positive-gain (helpers help), negative-gain (helpers hurt), and boundary (no big change); a small sketch of this bucketing follows the example below. Use GPT-4o to draft clean interleaved solutions, while using the Stage-1 model to actually generate helper images so the steps match the model's abilities. Keep only examples with correct final answers.
- SFT recipe: Train the model to reproduce the text reasoning and final answers (don't force exact pixels of helpers at this stage to avoid noise).
- Why SFT first: It teaches the basic rhythm of interleaving and when helpers tend to be useful.
- What breaks without SFT: RL alone may wander; SFT provides stable habits and format.
Example: Question: "Which is closer to the red phone: the cabinet (green) or the monitor (blue)?" SFT teaches: think → depth → think using the depth colors → answer.
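Here is a hedged sketch of the "visual gain" bucketing mentioned in the data-curation step; the accuracy-difference rule and the 0.25 threshold are illustrative assumptions, since the paper only describes positive-gain, negative-gain, and boundary groups.

```python
def visual_gain_bucket(acc_with_helpers: float, acc_without: float, tau: float = 0.25) -> str:
    """Classify one question by how much helper generation changes the base model's accuracy."""
    gain = acc_with_helpers - acc_without
    if gain > tau:
        return "positive-gain"   # helpers clearly help on this question
    if gain < -tau:
        return "negative-gain"   # helpers clearly hurt
    return "boundary"            # little difference either way

# Example: 8 sampled attempts with helpers vs. 8 without, for one question.
print(visual_gain_bucket(acc_with_helpers=6 / 8, acc_without=2 / 8))  # -> "positive-gain"
```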
Stage 2b: Reinforcement Learning with GRPO and CPR reward
New Concept 12
🍞 Hook: Like a game where you try many moves, score each, and learn the smartest strategy relative to your past tries.
🥬 GRPO (Group Relative Policy Optimization): What it is: An RL method that samples several answers per question, scores them, and improves the policy based on which answers are better than the group average.
- How it works:
- For each question, sample N responses.
- Score them with a reward function.
- Update the model to make above-average answers more likely, controlling step size so learning stays stable.
- Why it matters: It's robust and efficient for on-policy learning.
🍞 Anchor: If, among 8 tries, the ones using depth led to correct answers and neat formatting, the model is nudged toward that pattern.
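The core of GRPO can be sketched in a few lines: score each sampled response, then measure it against its own group. This is a simplified sketch under stated assumptions; the real update also uses clipped importance ratios and a KL term, which are omitted here.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Turn a group of rewards into 'better/worse than the group average' signals."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)   # above-average answers get positive advantage

# Example: 8 sampled answers to one question, scored by the reward function.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Responses with positive advantage (the correct, well-formatted ones) are made
# more likely; negative-advantage responses are suppressed during the policy update.
```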
New Concept 13
🍞 Hook: Think of a scorecard that rewards winning, neat work, and smart tool use, not just the final result.
🥬 CPR Reward (Cooperative Perception-Reasoning): What it is: A three-part reward that balances correctness, clean interleaving format, and sensible visual-tool usage.
- How it works:
- Answer reward: 1 if correct, else 0.
- Format reward: 1 if the output follows the interleaved think/generate/answer pattern, else 0.
- Exploration-guided tool-use reward: Based on precomputed "visual-gain" labels and a threshold, lightly reward or penalize using helpers too much or too little.
- Why it matters: It prevents overusing helpers (slowing or distracting) and underusing them (missing needed cues), guiding a practical balance.
🍞 Anchor: On tasks where helpers historically help (positive-gain), using them modestly gains a bonus; on tasks where they hurt (negative-gain), overusing them gets a small penalty.
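Putting the three parts together, a hedged sketch of a CPR-style reward might look like this; the equal weighting and the 0.2 tool-use bonus are assumptions for illustration, not the paper's exact coefficients.

```python
def cpr_reward(is_correct: bool, is_well_formatted: bool,
               used_helper: bool, gain_bucket: str, bonus: float = 0.2) -> float:
    """Score one response on correctness, interleaved format, and sensible tool use."""
    answer_r = 1.0 if is_correct else 0.0
    format_r = 1.0 if is_well_formatted else 0.0

    # Exploration-guided tool-use term: nudge helper usage toward what
    # historically worked for this kind of question.
    if gain_bucket == "positive-gain":
        tool_r = bonus if used_helper else -bonus
    elif gain_bucket == "negative-gain":
        tool_r = -bonus if used_helper else bonus
    else:                                   # "boundary": stay neutral
        tool_r = 0.0

    return answer_r + format_r + tool_r

# Correct, well-formatted answer that used depth on a positive-gain distance task.
print(cpr_reward(True, True, used_helper=True, gain_bucket="positive-gain"))  # -> 2.2
```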
Secret sauce: The combination of internal helper generation, interleaved SFT habits, and CPR-guided RL practice lets the model choose for itself when to "draw to think," making spatial reasoning both stronger and more general.
04 Experiments & Results
The Test: Researchers measured how well COOPER understands space (distance, size, relationships) and how it holds up on general multimodal benchmarks. They chose three spatial benchmarks (SIBench single-image subset, Q-SpatialBench for distance/size estimation, and MMVP) plus two general ones (MMBench v1.1 and MM-Vet).
The Competition: COOPER was compared to strong baselines, including unified models (Janus-Pro, Liquid, and the base BAGEL) and leading MLLMs (InternVL3.5, Qwen3VL, GPT-4o, GPT-5). Two special ablations show the value of the two halves: BAGEL-PE (only perception enhancement: always generate helpers) and BAGEL-RE (only reasoning enhancement: text-only RL on spatial Q&A).
The Scoreboard (with context):
- Spatial reasoning overall: COOPER improves by an average of 6.91% over the base BAGEL. Think of it as moving from a solid B to a clear A- on challenging spatial tests.
- Q-SpatialBench (distance/size): COOPER matches or beats much larger open-source models and approaches proprietary models; even the "Stage 1 only" variant (helper generation but no reasoning RL) gains 7.92%. That's like acing the "how-far/how-big" quizzes after practicing drawing depth maps.
- General multimodal: Despite focusing on spatial skills, COOPER still averages a 4.47% boost on general tests, equivalent to raising your overall GPA, not just your math grade.
Surprising Findings:
- Helper generation alone already teaches useful 3D sense: Training the model to produce depth/segmentation internalizes geometric and boundary knowledge, improving distance/size estimation even without extra reasoning training.
- Interleaved beats either extreme: COOPER outperforms BAGEL-PE (which overuses helpers) on spatial tasks and BAGEL-RE (which avoids helpers) on general tasks, showing that flexible switching is more powerful than one-size-fits-all.
- Smart tool choice by task: Quantitatively, COOPER tends to use depth for relative-distance tasks and segmentation for situational counting and locating; it often stays text-only for pure geometric puzzles where visuals add little.
Qualitative examples:
- Relative Distance: COOPER generates a depth map, reads the warm/cool colors to judge near/far, and picks the closer object correctly.
- Situational QA: It segments the players so it can count how many are on the left of a specific jersey number.
- Failure case: Sometimes it picks the right tool (depth) but still misreads distances, reminding us that helper quality and interpretation both matter.
Helper-quality checks:
- Segmentation (qualitative): COOPER often shows crisp boundaries and clear colors compared to ground-truth labels.
- Depth (NYUv2, out-of-domain): COOPER's depth maps are visually sharp; quantitative metrics are on par with a specialized depth model (Marigold), showing that unified training didn't ruin its visual precision.
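For readers curious what "quantitative metrics" means for depth, the sketch below computes two standard monocular-depth scores (absolute relative error and the δ<1.25 accuracy) of the kind commonly reported on NYUv2; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Absolute relative error (lower is better) and delta<1.25 accuracy (higher is better)."""
    mask = gt > eps                                   # ignore invalid ground-truth pixels
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))             # fraction of "close enough" pixels
    return abs_rel, delta1

pred = np.array([[1.0, 2.1], [3.2, 4.0]])             # toy predicted depths (meters)
gt = np.array([[1.1, 2.0], [3.0, 4.4]])               # toy ground-truth depths
print(depth_metrics(pred, gt))
```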
Bottom line: The numbers and pictures agree: letting the model "draw to think" adaptively delivers consistent gains in spatial intelligence while keeping or even improving general skills.
05 Discussion & Limitations
Limitations:
- Single-image focus: COOPER is trained and tested on single images; long videos with moving cameras and objects are still out of scope.
- Backbone constraints: Built on BAGEL, COOPER inherits its architectural and inference-speed limits (e.g., compatibility with fast-serving stacks).
- Narrow helper set: Only depth and segmentation are used; other helpful modalities (point clouds, surface normals, optical flow) aren't yet integrated.
- RL reward mainly text-side: The CPR reward shapes text reasoning and tool scheduling, but doesn't directly optimize the pixels of generated helper images during RL.
Required Resources:
- Data: Depth (e.g., Hypersim, Virtual KITTI) and segmentation (ADE20K) for Stage 1; curated interleaved CoT data for Stage 2.
- Compute: Multi-GPU training (e.g., 8×H800) for SFT and RL with multiple candidate samples per question.
When NOT to Use:
- Long-horizon video tasks needing real-time performance; COOPER isn't tuned for speed or temporal memory across many frames.
- Pixel-perfect downstream tasks where specialized perception models with task-specific losses are mandatory (e.g., surgical-grade segmentation).
- Scenarios where helper generation consistently harms performance (negative-gain tasks) and strict latency budgets forbid exploration.
Open Questions:
- Joint text+image RL: Can we extend CPR to also reward the visual quality/utility of helper images directly (beyond text correctness and format)?
- Richer modalities: What's the best way to add 3D point clouds, normals, or multi-view cues into the interleaved chain-of-thought?
- Efficient unification: Can we compress or redesign the backbone for faster inference while keeping strong interleaved reasoning?
- Robust scheduling: How can the model better detect when helpers will mislead and abstain earlier, especially in adversarial or out-of-domain images?
- Video scale-up: How to maintain coherent helper use over time (e.g., consistent depth across frames) and reason about motion and occlusion?
06 Conclusion & Future Work
Three-sentence summary: COOPER is a single model that learns to generate its own helper visuals (depth/segmentation) and to interleave them with text reasoning, deciding when to see more and when to think more. This cooperative interplay lifts spatial intelligence by 6.91% on average over its base while also improving general multimodal ability. Even training only the helper generation internalizes 3D knowledge, boosting distance/size tasks by 7.92%.
Main achievement: Showing that unifying perception and reasoning inside one model, and letting the model adaptively "draw to think," beats improving either side alone, raising both accuracy and flexibility.
Future directions: Add joint text-and-image RL rewards so helper images are optimized directly; broaden helper modalities (e.g., point clouds, normals); and design faster, video-ready unified backbones for long-horizon spatial reasoning.
Why remember this: COOPER turns a model into its own lab partner, able to make the visual notes it needs exactly when it needs them, and to use those notes to think better about space. That simple idea (generate what helps, then reason with it) can guide the next wave of robust, trustworthy multimodal AI.
Practical Applications
- Home robots placing items accurately on shelves by generating depth maps before planning a grasp.
- AR apps that auto-generate depth to anchor virtual furniture realistically in a room.
- Warehouse drones estimating distances and sizes to avoid collisions and organize packages.
- Assistive tools for the visually impaired that segment obstacles and announce relative positions.
- Sports analytics that segment players to count formations and estimate spacing in real time.
- E-commerce visualization that estimates object dimensions from photos for better fit previews.
- Education tools that show step-by-step visual reasoning (depth/segments) when teaching geometry from images.
- Construction site monitoring that segments machinery and estimates distances for safety checks.
- Museum guides that localize and compare artworks' positions for spatial tours.
- Traffic monitoring systems that judge vehicle gaps using depth-like cues for safety analysis.