COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
Key Summary
- COOPER is a single AI model that both "looks better" (perceives depth and object boundaries) and "thinks smarter" (reasons step by step) to answer spatial questions about images.
- It teaches itself to create extra helpful pictures, like depth maps (how far things are) and segmentation maps (which pixels belong to which object), and then uses them while reasoning.
- The model decides on the fly when to draw these helper pictures and when to keep reasoning with text; this flexible back-and-forth is called interleaved multimodal chain-of-thought.
- To make non-RGB helpers fit its image generator, COOPER cleverly converts depth and segmentation labels into RGB-like pseudo-images and learns them with the same training recipe.
- After supervised fine-tuning on examples with step-by-step, image-and-text reasoning, COOPER is further polished with reinforcement learning using a special CPR reward.
- Across spatial benchmarks, COOPER improves by an average of 6.91% over its base model (BAGEL) while also gaining 4.47% on general multimodal tests.
- Even a version trained only to generate helper maps (no reasoning training) gets 7.92% better at distance and size estimation, showing that making these helpers internalizes 3D know-how.
- Compared with doing just perception help or just text reasoning, COOPER's interleaved reasoning reaches a higher ceiling on both spatial and general tasks.
- COOPER adapts its tools: it tends to use depth for relative-distance tasks, segmentation for counting and locating, and sticks to text-only steps when visuals add little.
- Limits remain: it focuses on single images, relies on the BAGEL backbone, and currently optimizes mainly text in RL (not the generated images), leaving video and richer 3D inputs as future work.
Why This Research Matters
Spatial intelligence underpins safety and usefulness in everyday AI: a home robot needs to know which cup is closer, AR apps must place virtual objects at the right depth, and cars must judge distances precisely. COOPER shows that one model can both create the visual aids it needs and reason with them, rather than depending on separate, brittle toolchains. This tight coupling makes decisions more robust and explanations more transparent, since the model shows its work with helper images. The approach also boosts general skills, not just spatial ones, suggesting a broader recipe for multimodal reasoning. As systems scale to video and 3D inputs, this cooperative see-and-think loop could become a standard blueprint for reliable AI in the physical world.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're playing hide-and-seek in a photo. To find who's closest, you don't just look at colors; you also judge how far things are and which shapes belong to which person. That mix of seeing and thinking is how we really understand space.
🥬 Filling (The Actual Concept): Before this work, most Multimodal Large Language Models (MLLMs) were great talkers about pictures but not great 3D thinkers. They learned mostly from 2D image-text pairs, so they often missed true depth, shapes, and object boundaries that matter for spatial questions like "Which object is closer?" or "How many players are on the left?"
- What it is: Visual spatial reasoning is an AI's ability to notice object properties (size, distance, boundaries) and their spatial relationships, then use that to answer questions.
- How it worked before: People improved either the "seeing" part (by adding helper signals like depth and segmentation from separate tools) or the "thinking" part (by training models to reason in text), but rarely both together.
- Why it matters: Without strong perception, reasoning guesses can be wrong. Without strong reasoning, even perfect perception can't solve complex tasks.
🍞 Bottom Bread (Anchor): If you ask, "Is the train closer than the building?", a model that just reads 2D pixels might get confused by perspective. A model that also creates and uses a depth map can see closeness more reliably.
New Concept 1
🍞 Hook: You know how a Swiss Army knife has many tools in one handle?
🥬 MLLMs: What it is: A Multimodal Large Language Model (MLLM) is a single model that can understand and generate both images and text.
- How it works:
- Turn images into tokens (like words for pictures).
- Mix text and image tokens in one brain (the transformer).
- Predict the next token for text; generate image pixels with an image head.
- Answer questions by combining what it sees and what it reads.
- Why it matters: It lets the model look and talk about the same thing, enabling richer answers than text-only.
🍞 Anchor: When you ask, "What color is the ball next to the dog?", the model looks (image tokens) and explains (text tokens) in one go.
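To ground the token-mixing steps above, here is a minimal, hedged PyTorch sketch of a toy MLLM. The layer sizes, patch handling, and the tiny two-layer encoder are illustrative assumptions, not the actual BAGEL/COOPER architecture (which is far larger and pairs a causal language core with a separate image-generation head).

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Toy illustration of one transformer that mixes image and text tokens."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)       # flattened image patches -> "visual words"
        self.text_embed = nn.Embedding(vocab_size, dim)       # text token ids -> embeddings
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # the shared "brain"
        self.lm_head = nn.Linear(dim, vocab_size)             # predicts the next text token

    def forward(self, image_patches, text_ids):
        img_tok = self.patch_embed(image_patches)             # (B, num_patches, dim)
        txt_tok = self.text_embed(text_ids)                   # (B, num_text, dim)
        seq = torch.cat([img_tok, txt_tok], dim=1)            # one shared sequence of both modalities
        hidden = self.backbone(seq)
        return self.lm_head(hidden[:, -text_ids.size(1):])    # next-token logits for the text positions

model = ToyMLLM()
patches = torch.randn(1, 16, 3 * 16 * 16)                    # 16 flattened 16x16 RGB patches
question = torch.randint(0, 1000, (1, 8))                    # 8 toy text tokens
logits = model(patches, question)                             # (1, 8, 1000): "look" and "talk" in one pass
```

A real unified MLLM adds an image-generation head on top of this shared sequence so it can also draw, which is exactly what COOPER's Stage 1 exploits later on.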
The problem: Even strong MLLMs still struggled with 3D awareness. Two separate fix-it paths emerged:
- Perception enhancement: Feed extra modalities (depth maps, segmentation masks) from external tools so the model's vision gets crisper.
- Reasoning enhancement: Teach better step-by-step thinking with textual chain-of-thought or reinforcement learning on spatial Q&A.
But doing only one of these made the other a bottleneck.
New Concept 2
🍞 Hook: Imagine cooking with spices you can make yourself whenever you need.
🥬 Auxiliary Modality Generation: What it is: The model learns to generate its own helpful add-on pictures (like depth and segmentation) instead of relying on external tools.
- How it works:
- Convert depth and segmentation labels into RGB-like pseudo-images.
- Train the modelās image generator to produce these helpers on command.
- During a question, the model can choose to create a helper image and use it while reasoning.
- Why it matters: If the model can make its own helpers, it can flexibly see better right when reasoning needs it.
🍞 Anchor: When asked, "Which bottle is closer?", the model can create a depth map and then answer with confidence.
New Concept 3
🍞 Hook: Think of a student who decides when to use a calculator and when to do mental math.
🥬 Adaptive Reasoning: What it is: The model learns to decide, step by step, whether to generate a helper image or continue reasoning in text.
- How it works:
- Start reasoning in text.
- If needed, generate a depth/segmentation helper.
- Read that helper and continue reasoning.
- Stop when the answer is clear.
- Why it matters: Fixed pipelines are rigid; adaptive steps match the task's needs.
🍞 Anchor: For counting players on the left, it generates a segmentation map; for comparing distances, it generates depth.
The gap this paper fills: It unifies powerful perception and careful reasoning inside one model that learns when to use which tool. That changes the game from "two separate improvements" to "a single, cooperative system."
Real stakes: Better spatial intelligence helps robots put items in the right spot, assists AR apps in placing virtual objects realistically, and makes self-driving cars and home assistants safer and smarter.
New Concept 4
🍞 Hook: You know how our eyes judge how far a soccer ball is before we kick?
🥬 Depth Estimation: What it is: Predicting how far each pixel is from the camera.
- How it works:
- Look at the image.
- Infer a depth value per pixel (near vs. far).
- Output a depth map where colors represent distance.
- Why it matters: Without depth, "closer or farther" questions become guesswork.
🍞 Anchor: To answer "Which is nearer, the mug or the vase?", a depth map makes the nearer one pop out.
New Concept 5
🍞 Hook: Like cutting a cake into slices to serve people.
🥬 Segmentation: What it is: Labeling which pixels belong to which object.
- How it works:
- Find object boundaries.
- Assign each object a distinct color/label.
- Output a map showing clear object regions.
- Why it matters: Counting, locating, and comparing sizes are all easier with clean boundaries.
🍞 Anchor: To answer "How many players are left of #4?", segmentation separates players so counting is reliable.
02 Core Idea
🍞 Top Bread (Hook): Imagine a detective who can draw their own helpful sketches (like outlines and distance lines) while thinking through a case, and who decides exactly when each sketch would help.
🥬 Filling (The Actual Concept): The paper's key insight in one sentence: Teach one unified model to both generate perception helpers (depth/segmentation) and adaptively weave those helpers into its step-by-step reasoning.
Three analogies:
- Chef with spices-on-demand: The chef (model) cooks (answers) better when it grinds its own spices (depth/segmentation) right when the recipe needs flavor.
- Student with tools: The student decides when to use a calculator (depth) or a highlighter (segmentation) versus just thinking, making problem-solving smoother.
- Builder's toolkit: The builder picks a tape measure (depth) or a stencil (segmentation) mid-job, instead of carrying pre-cut parts that might not fit.
Before vs. After:
- Before: Models either saw better using fixed, external helpers or thought better with pure text steps, but not both together in a flexible way.
- After: COOPER internally creates helpers and interleaves them with reasoning, choosing the right tool at the right moment, boosting both accuracy and reliability.
Why it works (intuition without equations):
- Internal practice builds intuition: When the model learns to generate depth/segmentation, it rehearses 3D geometry and object boundaries, embedding that knowledge into its representations.
- Ask-when-useful: Adaptive reasoning prevents overusing visuals (which can waste time) and underusing them (which can cause mistakes). The model weighs the benefit, then acts.
- Feedback that teaches balance: A special CPR reward praises correct answers, clean reasoning format, and sensible tool use, nudging the model to couple seeing and thinking cooperatively.
Building blocks (each explained with a sandwich):
New Concept 6
🍞 Hook: Like one backpack that holds both a camera and a notepad.
🥬 Unified MLLM Backbone (BAGEL): What it is: A base model that understands and generates images and text in one system, with dedicated parts for each.
- How it works:
- A visual encoder turns images into tokens for understanding.
- A language core mixes image and text tokens.
- An image generator produces images from a learned visual latent space.
- Why it matters: It's the stage where both seeing and drawing can happen together.
🍞 Anchor: When prompted, the model can write an explanation or generate an image (like a depth-map lookalike) without leaving its own toolbox.
New Concept 7
🍞 Hook: Think of tracing paper that lets you convert different drawings into the same format.
🥬 RGB Pseudo-Images for Helpers: What it is: Turning depth and segmentation labels into RGB-like images so the existing image generator can learn them.
- How it works:
- Segmentation: assign distinctive colors to instances to make an RGB label image.
- Depth: rescale depth values into an RGB range the generator expects.
- Train with the normal image-generation loss.
- Why it matters: No new machinery is needed; the model learns helpers in its native image space.
🍞 Anchor: The model can be asked, "<depth-estimation>...</depth-estimation>", and it will generate a depth-looking image it understands.
New Concept 8
🍞 Hook: Like learning from worked examples before practicing freely.
🥬 Supervised Fine-Tuning (SFT) for Interleaved Steps: What it is: Show the model examples where text thinking and helper images alternate.
- How it works:
- Curate questions and build step-by-step solutions that include when to generate helpers.
- Train the model to produce those text steps and final answers.
- Do not force the exact pixels of helpers during SFT to avoid noise.
- Why it matters: Gives the model a starting habit of when and how to interleave.
🍞 Anchor: For a sports image, the example shows: think → generate segmentation → think → answer.
New Concept 9
🍞 Hook: Like getting points in a game for both solving puzzles and using tools wisely.
🥬 Reinforcement Learning (RL) with CPR Reward: What it is: Practice-time feedback that scores answers, clean format, and sensible helper use.
- How it works:
- Sample multiple responses per question.
- Score each: correct answer (yes/no), proper interleaved format (yes/no), and tool-use balance (guided by past data and a threshold).
- Nudge the model towards higher-scoring behaviors.
- Why it matters: Reinforces the cooperative dance between seeing and thinking.
🍞 Anchor: The model learns that for distance questions, using depth helps; for pure geometry text puzzles, extra visuals aren't needed.
The result: A single model that flexibly "draws to think," making depth/segmentation maps when useful, reading them, and finishing with a confident answer.
03 Methodology
At a high level: Input image and question → (Stage 1) Learn to generate helper modalities (depth and segmentation) as RGB pseudo-images → (Stage 2) Learn to adaptively interleave text reasoning and helper generation with SFT → Refine with RL (CPR reward) → Output final answer (and optional helper images).
Stage 1: Auxiliary Modality Generation (make the model its own helper-tool)
New Concept 10
🍞 Hook: You know how you can repaint any sketch into the same color palette so one printer can print them all?
🥬 RGB Mapping for Helpers: What it is: A method to express depth and segmentation as RGB-like images so the generator can learn them with its usual training.
- How it works step by step:
- Segmentation to RGB: Assign an RGB color to each instance/class → get a colorful label image.
- Depth to RGB: Replicate the depth channel into three channels and rescale values into the generator's expected range (so it's like an image).
- Train with the same image-generation loss the model already uses (Rectified Flow in latent space with a VAE).
- Why it matters: No extra decoders or new losses; the model's existing image generator now learns to "draw" depth/segmentation on demand.
🍞 Anchor: Prompt: "<segmentation> Segment the objects..." → the model outputs a segmentation-like RGB image it can also read back into its reasoning.
Training details (friendly intuition): The model's image head doesn't know it's making a "special" image; it just learns how these helper images should look when asked with tags like <depth-estimation>...</depth-estimation> or <segmentation>...</segmentation>. Across datasets (e.g., Hypersim/Virtual KITTI for depth; ADE20K for segmentation), it practices producing accurate helper images.
What breaks without this step: If the model cannot generate helpers itself, it must rely on external tools, losing the tight loop between drawing and thinking inside one brain.
Example with real data: Given a living room photo, the model can generate (1) a depth map where nearer furniture is "warmer," and (2) a segmentation map where the couch, table, and lamp are separate colors.
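As a concrete illustration of the RGB mapping above, here is a small, hedged NumPy sketch. The min-max normalization and the random color palette are assumptions for illustration; the paper only specifies that depth is rescaled into the generator's RGB range and that instances get distinctive colors.

```python
import numpy as np

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Rescale an (H, W) depth map into a 3-channel uint8 pseudo-image."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize to [0, 1]
    d = (d * 255).astype(np.uint8)
    return np.stack([d, d, d], axis=-1)                             # replicate into R, G, B

def segmentation_to_rgb(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Map an (H, W) integer label map to a distinct color per instance/class."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(int(labels.max()) + 1, 3), dtype=np.uint8)
    return palette[labels]                                          # (H, W, 3) colorful label image

# Toy living-room-style example: a 4x4 depth map and a two-object segmentation.
depth = np.linspace(0.5, 10.0, 16).reshape(4, 4)                    # near couch ... far wall
seg = np.array([[0, 0, 1, 1]] * 4)                                  # object 0 vs. object 1
depth_rgb, seg_rgb = depth_to_rgb(depth), segmentation_to_rgb(seg)
# Both outputs now look like ordinary RGB images, so the existing image
# generator can learn to produce them with its usual generation loss.
```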
New Concept 11
🍞 Hook: Think of drawing-and-thinking as a conversation with yourself.
🥬 Interleaved Multimodal Chain-of-Thought: What it is: Reasoning that goes text → (optional helper image) → text → (optional helper) → answer.
- How it works:
- The model starts with a <think> step.
- If visuals can help, it issues <depth-estimation> or <segmentation>.
- It reads the helper, continues thinking, and finally answers.
- Why it matters: Some questions need a picture to think clearly; some don't. Interleaving adapts to the question.
🍞 Anchor: For "Which player is left of #4?", it generates segmentation to distinguish players cleanly before counting.
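A rough sketch of how such an interleaved loop could look at inference time is below. The tag names follow the paper's description, but the control flow and the helper methods (generate_text, generate_helper_image) are hypothetical stand-ins for the unified model's actual text and image heads.

```python
HELPER_TAGS = ("<depth-estimation>", "<segmentation>")

def interleaved_answer(model, image, question, max_steps=6):
    """Alternate text reasoning with on-demand helper images until an answer appears."""
    context = [image, f"<think> {question}"]
    for _ in range(max_steps):
        step = model.generate_text(context)              # next text reasoning segment
        context.append(step)
        tag = next((t for t in HELPER_TAGS if t in step), None)
        if tag is not None:
            # The model asked for a helper: draw it, then read it back into the context.
            helper_image = model.generate_helper_image(context, modality=tag)
            context.append(helper_image)
        if "<answer>" in step:                            # the model decided it is done
            return step.split("<answer>")[-1].strip()
    return None                                           # gave up within the step budget
```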
Stage 2a: Supervised Fine-Tuning (SFT) on curated interleaved data
- Data curation: Start from spatial QA sets (e.g., SAT VQA) and general QA (TACO). For each question, test the base model twice: with and without helpers, sampling multiple attempts. Keep samples where helpers sometimes change accuracy (not trivially always right or always wrong). Build balanced sets: positive-gain (helpers help), negative-gain (helpers hurt), and boundary (no big change); a small sketch of this bucketing follows the example below. Use GPT-4o to draft clean interleaved solutions, while using the Stage-1 model to actually generate helper images so the steps match the model's abilities. Keep only examples with correct final answers.
- SFT recipe: Train the model to reproduce the text reasoning and final answers (don't force exact pixels of helpers at this stage to avoid noise).
- Why SFT first: It teaches the basic rhythm of interleaving and when helpers tend to be useful.
- What breaks without SFT: RL alone may wander; SFT provides stable habits and format.
Example: Question: "Which is closer to the red phone: the cabinet (green) or the monitor (blue)?" SFT teaches: think → depth → think using the depth colors → answer.
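Here is a hedged sketch of the "visual gain" bucketing mentioned in the data-curation step; the accuracy-difference rule and the 0.25 threshold are illustrative assumptions, since the paper only describes positive-gain, negative-gain, and boundary groups.

```python
def visual_gain_bucket(acc_with_helpers: float, acc_without: float, tau: float = 0.25) -> str:
    """Classify one question by how much helper generation changes the base model's accuracy."""
    gain = acc_with_helpers - acc_without
    if gain > tau:
        return "positive-gain"   # helpers clearly help on this question
    if gain < -tau:
        return "negative-gain"   # helpers clearly hurt
    return "boundary"            # little difference either way

# Example: 8 sampled attempts with helpers vs. 8 without, for one question.
print(visual_gain_bucket(acc_with_helpers=6 / 8, acc_without=2 / 8))  # -> "positive-gain"
```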
Stage 2b: Reinforcement Learning with GRPO and CPR reward
New Concept 12
🍞 Hook: Like a game where you try many moves, score each, and learn the smartest strategy relative to your past tries.
🥬 GRPO (Group Relative Policy Optimization): What it is: An RL method that samples several answers per question, scores them, and improves the policy based on which answers are better than the group average.
- How it works:
- For each question, sample N responses.
- Score them with a reward function.
- Update the model to make above-average answers more likely, controlling step size so learning stays stable.
- Why it matters: It's robust and efficient for on-policy learning.
🍞 Anchor: If, among 8 tries, the ones using depth led to correct answers and neat formatting, the model is nudged toward that pattern.
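The core of GRPO can be sketched in a few lines: score each sampled response, then measure it against its own group. This is a simplified sketch under stated assumptions; the real update also uses clipped importance ratios and a KL term, which are omitted here.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Turn a group of rewards into 'better/worse than the group average' signals."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)   # above-average answers get positive advantage

# Example: 8 sampled answers to one question, scored by the reward function.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Responses with positive advantage (the correct, well-formatted ones) are made
# more likely; negative-advantage responses are suppressed during the policy update.
```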
New Concept 13
🍞 Hook: Think of a scorecard that rewards winning, neat work, and smart tool use, not just the final result.
🥬 CPR Reward (Cooperative Perception-Reasoning): What it is: A three-part reward that balances correctness, clean interleaving format, and sensible visual-tool usage.
- How it works:
- Answer reward: 1 if correct, else 0.
- Format reward: 1 if the output follows the interleaved think/generate/answer pattern, else 0.
- Exploration-guided tool-use reward: Based on precomputed "visual-gain" labels and a threshold, lightly reward or penalize using helpers too much or too little.
- Why it matters: It prevents overusing helpers (slowing or distracting) and underusing them (missing needed cues), guiding a practical balance.
🍞 Anchor: On tasks where helpers historically help (positive-gain), using them modestly gains a bonus; on tasks where they hurt (negative-gain), overusing them gets a small penalty.
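Putting the three parts together, a hedged sketch of a CPR-style reward might look like this; the equal weighting and the 0.2 tool-use bonus are assumptions for illustration, not the paper's exact coefficients.

```python
def cpr_reward(is_correct: bool, is_well_formatted: bool,
               used_helper: bool, gain_bucket: str, bonus: float = 0.2) -> float:
    """Score one response on correctness, interleaved format, and sensible tool use."""
    answer_r = 1.0 if is_correct else 0.0
    format_r = 1.0 if is_well_formatted else 0.0

    # Exploration-guided tool-use term: nudge helper usage toward what
    # historically worked for this kind of question.
    if gain_bucket == "positive-gain":
        tool_r = bonus if used_helper else -bonus
    elif gain_bucket == "negative-gain":
        tool_r = -bonus if used_helper else bonus
    else:                                   # "boundary": stay neutral
        tool_r = 0.0

    return answer_r + format_r + tool_r

# Correct, well-formatted answer that used depth on a positive-gain distance task.
print(cpr_reward(True, True, used_helper=True, gain_bucket="positive-gain"))  # -> 2.2
```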
Secret sauce: The combination of internal helper generation, interleaved SFT habits, and CPR-guided RL practice lets the model choose for itself when to "draw to think," making spatial reasoning both stronger and more general.
04 Experiments & Results
The Test: Researchers measured how well COOPER understands space (distance, size, relationships) and how it holds up on general multimodal benchmarks. They chose three spatial benchmarks (SIBench single-image subset, Q-SpatialBench for distance/size estimation, and MMVP) plus two general ones (MMBench v1.1 and MM-Vet).
The Competition: COOPER was compared to strong baselines, including unified models (Janus-Pro, Liquid, and the base BAGEL) and leading MLLMs (InternVL3.5, Qwen3VL, GPT-4o, GPT-5). Two special ablations show the value of the two halves: BAGEL-PE (only perception enhancement: always generate helpers) and BAGEL-RE (only reasoning enhancement: text-only RL on spatial Q&A).
The Scoreboard (with context):
- Spatial reasoning overall: COOPER improves by an average of 6.91% over the base BAGEL. Think of it as moving from a solid B to a clear A- on challenging spatial tests.
- Q-SpatialBench (distance/size): COOPER matches or beats much larger open-source models and approaches proprietary models; even the "Stage 1 only" variant (helper generation but no reasoning RL) gains 7.92%. That's like acing the "how-far/how-big" quizzes after practicing drawing depth maps.
- General multimodal: Despite focusing on spatial skills, COOPER still averages a 4.47% boost on general tests, equivalent to raising your overall GPA, not just your math grade.
Surprising Findings:
- Helper generation alone already teaches useful 3D sense: Training the model to produce depth/segmentation internalizes geometric and boundary knowledge, improving distance/size estimation even without extra reasoning training.
- Interleaved beats either extreme: COOPER outperforms BAGEL-PE (which overuses helpers) on spatial tasks and BAGEL-RE (which avoids helpers) on general tasks, showing that flexible switching is more powerful than one-size-fits-all.
- Smart tool choice by task: Quantitatively, COOPER tends to use depth for relative-distance tasks and segmentation for situational counting and locating; it often stays text-only for pure geometric puzzles where visuals add little.
Qualitative examples:
- Relative Distance: COOPER generates a depth map, reads the warm/cool colors to judge near/far, and picks the closer object correctly.
- Situational QA: It segments the players so it can count how many are on the left of a specific jersey number.
- Failure case: Sometimes it picks the right tool (depth) but still misreads distances, reminding us that helper quality and interpretation both matter.
Helper-quality checks:
- Segmentation (qualitative): COOPER often shows crisp boundaries and clear colors compared to ground-truth labels.
- Depth (NYUv2, out-of-domain): COOPER's depth maps are visually sharp; quantitative metrics are on par with a specialized depth model (Marigold), showing that unified training didn't ruin its visual precision.
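For readers curious what "quantitative metrics" means for depth, the sketch below computes two standard monocular-depth scores (absolute relative error and the δ<1.25 accuracy) of the kind commonly reported on NYUv2; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Absolute relative error (lower is better) and delta<1.25 accuracy (higher is better)."""
    mask = gt > eps                                   # ignore invalid ground-truth pixels
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))             # fraction of "close enough" pixels
    return abs_rel, delta1

pred = np.array([[1.0, 2.1], [3.2, 4.0]])             # toy predicted depths (meters)
gt = np.array([[1.1, 2.0], [3.0, 4.4]])               # toy ground-truth depths
print(depth_metrics(pred, gt))
```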
Bottom line: The numbers and pictures agree: letting the model "draw to think" adaptively delivers consistent gains in spatial intelligence while keeping or even improving general skills.
05 Discussion & Limitations
Limitations:
- Single-image focus: COOPER is trained and tested on single images; long videos with moving cameras and objects are still out of scope.
- Backbone constraints: Built on BAGEL, COOPER inherits its architectural and inference-speed limits (e.g., compatibility with fast-serving stacks).
- Narrow helper set: Only depth and segmentation are used; other helpful modalities (point clouds, surface normals, optical flow) aren't yet integrated.
- RL reward mainly text-side: The CPR reward shapes text reasoning and tool scheduling, but doesn't directly optimize the pixels of generated helper images during RL.
Required Resources:
- Data: Depth (e.g., Hypersim, Virtual KITTI) and segmentation (ADE20K) for Stage 1; curated interleaved CoT data for Stage 2.
- Compute: Multi-GPU training (e.g., 8×H800) for SFT and RL with multiple candidate samples per question.
When NOT to Use:
- Long-horizon video tasks needing real-time performance; COOPER isn't tuned for speed or temporal memory across many frames.
- Pixel-perfect downstream tasks where specialized perception models with task-specific losses are mandatory (e.g., surgical-grade segmentation).
- Scenarios where helper generation consistently harms performance (negative-gain tasks) and strict latency budgets forbid exploration.
Open Questions:
- Joint text+image RL: Can we extend CPR to also reward the visual quality/utility of helper images directly (beyond text correctness and format)?
- Richer modalities: What's the best way to add 3D point clouds, normals, or multi-view cues into the interleaved chain-of-thought?
- Efficient unification: Can we compress or redesign the backbone for faster inference while keeping strong interleaved reasoning?
- Robust scheduling: How can the model better detect when helpers will mislead and abstain earlier, especially in adversarial or out-of-domain images?
- Video scale-up: How to maintain coherent helper use over time (e.g., consistent depth across frames) and reason about motion and occlusion?
06 Conclusion & Future Work
Three-sentence summary: COOPER is a single model that learns to generate its own helper visuals (depth/segmentation) and to interleave them with text reasoning, deciding when to see more and when to think more. This cooperative interplay lifts spatial intelligence by 6.91% on average over its base while also improving general multimodal ability. Even training only the helper generation internalizes 3D knowledge, boosting distance/size tasks by 7.92%.
Main achievement: Showing that unifying perception and reasoning inside one model, and letting the model adaptively "draw to think," beats improving either side alone, raising both accuracy and flexibility.
Future directions: Add joint text-and-image RL rewards so helper images are optimized directly; broaden helper modalities (e.g., point clouds, normals); and design faster, video-ready unified backbones for long-horizon spatial reasoning.
Why remember this: COOPER turns a model into its own lab partner, able to make the visual notes it needs exactly when it needs them, and to use those notes to think better about space. That simple idea (generate what helps, then reason with it) can guide the next wave of robust, trustworthy multimodal AI.
Practical Applications
- Home robots placing items accurately on shelves by generating depth maps before planning a grasp.
- AR apps that auto-generate depth to anchor virtual furniture realistically in a room.
- Warehouse drones estimating distances and sizes to avoid collisions and organize packages.
- Assistive tools for the visually impaired that segment obstacles and announce relative positions.
- Sports analytics that segment players to count formations and estimate spacing in real time.
- E-commerce visualization that estimates object dimensions from photos for better fit previews.
- Education tools that show step-by-step visual reasoning (depth/segments) when teaching geometry from images.
- Construction site monitoring that segments machinery and estimates distances for safety checks.
- Museum guides that localize and compare artworks' positions for spatial tours.
- Traffic monitoring systems that judge vehicle gaps using depth-like cues for safety analysis.