
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Intermediate
Yong Xien Chng, Tao Hu, Wenwen Tong et al. · 12/30/2025
arXiv · PDF

Key Summary

  • SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
  • It learns this behavior with reinforcement learning and a new training trick called BN-GSPO that keeps learning steady and fair across many different problems.
  • Before training with RL, a small 'cold-start' teaching set (about 3,000 high-quality examples) shows the model how to follow the rules and call tools correctly.
  • A new test set called HR-MMSearch uses 4K images and tough, knowledge-heavy questions to check if models can both zoom in on tiny details and look up facts on the web.
  • SenseNova-MARS-32B sets new open-source records on search tasks like MMSearch (74.3) and HR-MMSearch (54.4), even beating big proprietary models in those tests.
  • It’s also excellent at high-resolution visual understanding, scoring 94.2 on V* Bench and 90.2 on HR-Bench 4K.
  • The model learns to pick the right tool at the right time—search when it needs facts, crop when it needs tiny visual details, or both for complex tasks.
  • BN-GSPO normalizes reward signals within groups and across batches, making RL updates stable for long, multi-step tool-using trajectories.
  • Ablation studies show BN-GSPO outperforms GRPO and GSPO, and mixed training data (search + perception) prevents overspecialization.
  • Despite strong results, the model can still be confused by noisy search results or form weak search queries when visual clues aren’t grounded well.

Why This Research Matters

Many real-world questions need both sharp eyes and fresh facts—like identifying a new logo in a sports photo and checking its history online. SenseNova-MARS shows how to train one agent to handle both by thinking step-by-step and calling the right tool at the right time. This can make information assistants more reliable for journalists, researchers, and everyday users who deal with changing news and tiny visual clues. The benchmark with 4K images raises the standard, helping the whole field measure real-world readiness better. Stable RL training (BN-GSPO) makes it practical to scale these systems without chaotic learning. As more tools get added, this approach could power assistants that investigate, verify, and explain with strong evidence.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re solving a mystery with a giant picture and the internet. Sometimes you need to zoom in on a tiny logo in the picture, and other times you need to look up what that logo means online. Real problem-solving mixes both.

🥬 Filling (The Actual Concept): What it is: SenseNova-MARS is a vision-language agent that plans, thinks out loud, and calls tools (text search, image search, image crop) in the middle of its reasoning to answer tough questions about images. How it works: 1) It reads the question and sees the image. 2) It decides if it should crop to zoom into a tiny region, search with the image, or search with text. 3) It repeats: think → tool → read results, until ready to answer. 4) It learns this behavior from examples and from trial-and-reward training. Why it matters: Without this, models either guess from memory or look up too much irrelevant stuff, missing small visual clues or mixing facts.

🍞 Bottom Bread (Anchor): Example: A photo of a race car driver: the question asks the gap between a company’s founding year (from a tiny logo) and the driver’s birth year. The agent crops to read the logo (Castore), searches for the brand’s founding year (2015), image-searches to confirm the driver (Max Verstappen), searches his birth year (1997), and computes 18.

🍞 Top Bread (Hook): You know how a good helper doesn’t just answer instantly—they plan, use tools, and adjust as they learn more? That’s called being “agentic.”

🥬 Agentic Reasoning (Concept): What it is: Agentic reasoning means the model makes plans and decisions over multiple steps, choosing actions like a person would. How it works: 1) Think about the goal. 2) Choose one action (like a specific tool). 3) Observe the result. 4) Update the plan and repeat until done. Why it matters: Without agency, the model dumps one-shot guesses and can’t handle multi-hop, tool-heavy problems.

🍞 Anchor: When asked, “Which company is on the small badge, and when was it founded?”, an agentic model first zooms in, then searches, rather than guessing.

🍞 Top Bread (Hook): Imagine building a LEGO set with both the picture on the box (images) and the instruction booklet (text). You need both to finish it right.

🥬 Multimodal Integration (Concept): What it is: Combining different kinds of information—like images and text—so the model understands better. How it works: 1) Read the picture to spot details. 2) Read text or the web to get missing facts. 3) Fuse both streams while reasoning. Why it matters: Without it, you either miss tiny visual clues or lack real-world facts.

🍞 Anchor: To answer “What’s the color of the cap?” the model crops the image. To answer “When was this company founded?” it searches text.

The world before: Vision-language models could describe pictures and answer easy questions. But for realistic tasks—like recognizing a tiny team logo and then checking real-time facts online—older systems were stuck. They followed text-only chain-of-thought, called tools in a rigid way, or overused retrieval. They didn’t interleave tool use with ongoing thinking.

The problem: Real images are high-resolution with small, important bits (like stickers, badges, timestamps). Real questions are knowledge-intensive and time-sensitive (e.g., sports rosters, product releases). Models must zoom in precisely and also fetch fresh facts. Doing both smoothly was missing.

Failed attempts: • Retrieval-Augmented Generation (RAG) uses fixed workflows and often fetches too much or the wrong stuff. • Search-only agents can’t read tiny image regions. • Crop-only agents can’t fetch facts. • Pure RL without a warm start often fails to learn deep, multi-turn tool use.

The gap: A single agent that can plan and interleave three tools—text search, image search, and image crop—driven by a stable reinforcement-learning approach.

🍞 Top Bread (Hook): Think of a robot learning tricks by getting treats when it succeeds.

🥬 Reinforcement Learning (Concept): What it is: A way for a model to learn from trial and error using rewards. How it works: 1) Try a multi-step plan. 2) Get a reward if the final answer and format are correct. 3) Adjust the strategy to earn more rewards next time. Why it matters: Without RL, the model won’t reliably discover when and how to call tools over many steps.

🍞 Anchor: If the model’s final answer matches the ground truth and it followed the one-tool-per-turn protocol, it gets positive reward; if not, it adjusts.

Real stakes: • News photos, product launches, and sports events change often; answers must be grounded in current web info. • High-res images hide critical details; without cropping, answers go wrong. • In everyday life, from identifying products to checking safety signs, you need both sharp eyes and up-to-date facts.

🍞 Top Bread (Hook): Picture a referee making sure every player’s score is compared fairly, even if some games are longer or harder.

🥬 BN-GSPO (Concept): What it is: BN-GSPO is an RL training method that normalizes learning signals within groups and across the batch so updates are stable for long, varied tool-use sequences. How it works: 1) Sample several responses per question (a group). 2) Standardize rewards inside each group. 3) Normalize again across the minibatch. 4) Use a clipped objective with a small KL penalty to prevent wild updates. Why it matters: Without this, uneven rewards across different prompts and trajectory lengths can make training unstable and biased.

🍞 Anchor: Two questions—one needs 3 steps, another needs 9. BN-GSPO keeps their learning signals on comparable scales, so neither dominates training.
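
The group-then-batch normalization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions (array layout, epsilon), not the paper's released implementation:

```python
import numpy as np

def bn_gspo_advantages(rewards_per_prompt, eps=1e-6):
    """Two-stage advantage normalization sketched from the description above:
    standardize rewards within each prompt's group, then normalize again
    across the whole minibatch. Not the paper's exact implementation."""
    # Stage 1: within-group standardization (compare each response to its peers).
    group_advs = []
    for rewards in rewards_per_prompt:
        r = np.asarray(rewards, dtype=np.float64)
        group_advs.append((r - r.mean()) / (r.std() + eps))
    # Stage 2: batch-level normalization so no prompt's reward scale dominates.
    flat = np.concatenate(group_advs)
    mu, sigma = flat.mean(), flat.std() + eps
    return [(a - mu) / sigma for a in group_advs]

# Example: one mostly-solved prompt and one mostly-failed prompt in a batch.
advantages = bn_gspo_advantages([[1.5, 1.0, 1.5, 0.5], [0.0, 0.5, 0.0, 0.0]])
```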

02 Core Idea

🍞 Top Bread (Hook): Imagine teaching a scout to explore a huge map. Sometimes they zoom in with a magnifying glass; sometimes they ask a local expert. The trick is knowing when to do which—and to keep exploring until the destination is certain.

🥬 The Aha! (One sentence): Train a single multimodal agent, via stable reinforcement learning, to fluidly interleave image cropping with image and text search while reasoning step-by-step.

Multiple Analogies:

  1. Chef analogy: The model is a chef who tastes (crops to inspect details), reads cookbooks (text search), and looks at dish photos (image search) to perfect the recipe (final answer).
  2. Detective analogy: The model zooms in on fingerprints (crop), checks mugshots (image search), and reads case files (text search) before naming the culprit.
  3. Student analogy: The model looks closely at a diagram (crop), reverse-looks up the picture in a reference book (image search), and reads articles (text search) to ace a quiz.

Before vs. After:

  • Before: Models either stuck to text-only reasoning, used tools in isolation, or failed to coordinate small visual clues with live facts. They over-fetched, missed tiny details, or stopped too early.
  • After: SenseNova-MARS plans, thinks, and chooses among three tools at each step. It crops when details matter, searches when facts are missing, and combines both for complex, high-res problems.

Why it works (intuition, no equations):

  • Agentic planning breaks tasks into steps so the model can decide which tool reduces the most uncertainty right now.
  • RL rewards full trajectories (tool calls + final answer), teaching the model efficient sequences (not just correct endings).
  • BN-GSPO keeps the learning signal fair even when some questions are long, some short, and some noisy.
  • A small cold-start dataset provides a “how to behave” template so RL explores smartly, not randomly.

Building Blocks (explained with sandwiches when first introduced):

🍞 Top Bread (Hook): Imagine joining a team sport—first you learn the rules, then you practice in real games.

🥬 Cold-Start SFT (Concept): What it is: A short, supervised lesson that teaches the model tool etiquette and protocol before RL. How it works: 1) Curate ~3k high-quality, tool-using examples. 2) Train the model to follow the one-tool-per-turn format and produce valid JSON-like calls. 3) Ensure it knows when to output a final answer. Why it matters: Without SFT, RL wastes time bumbling through invalid tool calls and shallow plans.

🍞 Anchor: The model practices a few dozen examples of crop → search → answer so it knows the rhythm before trying its own ideas.
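
To make the tool etiquette concrete, here is what one cold-start training example might look like under the one-tool-per-turn protocol. The tag names and tool-call JSON layout are illustrative assumptions, not the paper's exact schema; the facts are from the worked example above:

```python
# One hypothetical cold-start SFT example in the one-tool-per-turn style.
sft_example = {
    "images": ["race_photo.jpg"],
    "question": ("How many years passed between the jersey sponsor's founding "
                 "year and the driver's birth year?"),
    "turns": [
        {"assistant": ("<think>The sponsor logo is tiny; zoom in first.</think>"
                       '<tool_call>{"name": "image_crop", "arguments": '
                       '{"image_index": 0, "bbox": [0.42, 0.18, 0.58, 0.30]}}'
                       "</tool_call>")},
        {"tool": "Crop added as image index 1; the logo reads 'Castore'."},
        {"assistant": ("<think>Need the brand's founding year.</think>"
                       '<tool_call>{"name": "text_search", "arguments": '
                       '{"query": "Castore founding year"}}</tool_call>')},
        {"tool": "Summary: Castore is a British sportswear brand founded in 2015."},
        {"assistant": ("<think>The driver is Max Verstappen, born 1997; "
                       "2015 - 1997 = 18.</think><answer>18</answer>")},
    ],
}
```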

🍞 Top Bread (Hook): Think of three gadgets in a backpack: a magnifier, a photo-matcher, and a library card.

🥬 Tool Trio (Concept): What it is: The agent can call text search, image search, or image crop each turn—exactly one. How it works: 1) Text search queries the web and returns summarized pages. 2) Image search finds visually similar images and captions. 3) Image crop zooms into a chosen region for tiny details. Why it matters: Without all three, many questions stay unsolved: search-only can’t see tiny clues; crop-only can’t fetch facts.

🍞 Anchor: To answer “How many motors are in this car model?” the agent image-searches to identify the model, then text-searches for specs.
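
The three tools can be pictured as three small interfaces the agent may call, exactly one per turn. The signatures and return shapes below are assumptions for illustration; the real system wraps a web search API, a reverse-image index, and an image library:

```python
from typing import List, Tuple

def text_search(query: str, top_k: int = 5) -> str:
    """Query the web and return a short summary of the top pages."""
    return f"[summary of top-{top_k} results for: {query!r}]"  # placeholder

def image_search(image_index: int) -> List[str]:
    """Reverse-lookup a given image and return titles/captions of visually similar hits."""
    return [f"[caption of an image similar to image {image_index}]"]  # placeholder

def image_crop(image_index: int, bbox: Tuple[float, float, float, float]) -> int:
    """Zoom into a normalized [x1, y1, x2, y2] region; return the new image's index."""
    return image_index + 1  # placeholder: the crop is registered as a new image
```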

🍞 Top Bread (Hook): Picture a science fair with extra-hard challenges—tiny details and up-to-date facts.

🥬 HR-MMSearch (Concept): What it is: A new benchmark of 4K images and hard, search-driven questions to test fine detail + web facts together. How it works: 1) Collect recent (2025) high-res images to avoid training leakage. 2) Write questions that require cropping and searching. 3) Score pass@1 with an LLM judge for open answers. Why it matters: Without such a test, agents might look great on easy or low-res tasks but fail on real-world, mixed challenges.

🍞 Anchor: Questions like “What’s written on that small banner, and in what year was that sponsor founded?” require both zooming and searching.

In short, SenseNova-MARS’s core idea is a carefully staged training recipe—first learn the rules (SFT), then master multi-step tool use with stable RL (BN-GSPO)—paired with a realistic, challenging benchmark (HR-MMSearch) that forces true multimodal, tool-integrated reasoning.

03 Methodology

High-level overview: Input (image + question) → Plan/Think → Choose exactly one action (text search, image search, or image crop) → Observe results (summaries, thumbnails, or crop) → Repeat until confident → Output final answer.
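
That flow can be sketched as a small control loop. This is a minimal sketch of the think → act → observe cycle described above, with the turn cap mentioned later (T=10); `model.generate` and the `tools` entries are assumed interfaces, not the released implementation:

```python
def run_agent(model, tools, image, question, max_turns=10):
    """Minimal think -> act -> observe loop sketching the flow above."""
    history = [{"role": "user", "images": [image], "text": question}]
    for _ in range(max_turns):                 # turn limit (T=10 in the paper)
        turn = model.generate(history)         # emits <think> plus one action
        history.append({"role": "assistant", "text": turn.text})
        if turn.answer is not None:            # <answer>...</answer> ends the episode
            return turn.answer
        # Exactly one tool call per non-final turn: text search, image search, or crop.
        result = tools[turn.tool_name](**turn.tool_args)
        history.append({"role": "tool", "text": str(result)})
    return None  # hit the turn limit without a final answer
```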

Step-by-step (like a recipe):

  1. Observation and Planning
  • What happens: The agent reads the full history: the original image/question, any previous thoughts, tool calls, and tool outputs. It then writes a short internal plan (<think>...</think>) about what to do next.
  • Why it exists: Planning avoids guesswork and helps choose the most useful next action. Without it, the agent wastes steps or misses crucial details.
  • Example: The agent notices a tiny logo on a jersey; it decides to crop first because the logo text is unreadable at full size.
  2. One Action per Turn
  • What happens: The agent must choose exactly one of: (a) text search (send a query), (b) image search (reverse lookup), or (c) image crop (zoom into [x1,y1,x2,y2] on an image index).
  • Why it exists: Enforcing one tool per turn keeps the protocol clean and prevents messy, overlapping calls. Without it, trajectories are hard to learn from and evaluate.
  • Example: After cropping reveals “Castore,” the next turn is a text search: “Castore founding year.”
  3. Structured Tool Responses
  • What happens: Each tool returns compact, consistent outputs. Text search: top pages summarized by a separate summarizer model to control length/noise. Image search: pre-fetched titles/thumbnails for speed and cost. Crop: a new image index for the zoomed-in patch.
  • Why it exists: Consistent, short outputs prevent context overflow and keep the agent focused. Without summarization, the agent can drown in irrelevant text.
  • Example: The text search returns a 5-sentence summary stating that Castore was founded in 2015.
  4. Final Answer Turn
  • What happens: When the agent feels confident, it writes a concluding thought and outputs <answer>...</answer>.
  • Why it exists: Separating thinking from answering avoids half-baked conclusions. Without a clean final step, reward scoring and user trust suffer.
  • Example: “Founding year 2015; driver born 1997; difference 18.”

Training Pipeline (Two Stages):

A) Cold-Start Supervised Fine-Tuning (SFT)

  • What happens: Use ~3,315 curated examples (from FVQA, a pixel-reasoning corpus, and expert cases) with verified tool-using trajectories. Train the model to: 1) follow the strict one-tool-per-turn format, 2) output valid JSON-like tool calls, and 3) end with a final answer tag.
  • Why it exists: RL can’t start from scratch for multi-step tool use—too many invalid actions. SFT teaches the rules so RL can explore productively.
  • Example data: A case that starts with a crop to read a tiny brand, then image-search to identify a stadium, then text-search for founding dates.

B) Reinforcement Learning with BN-GSPO

  • What happens: The model generates multiple full solutions per prompt (groups). Each solution is scored with: (1) an accuracy reward (LLM-as-judge checks semantic match), and (2) a format reward (correct tool protocol). Advantages are normalized twice: within the group and across the batch. A clipped objective with a small KL penalty keeps updates safe.
  • Why it exists: Different prompts yield variable lengths and reward scales; naive updates can be unstable. Double normalization makes learning steady for long, multi-turn rollouts.
  • Example: A simple question’s three-step trajectories shouldn’t overpower a complex nine-step one just because they’re shorter and their rewards louder.

The Secret Sauce (BN-GSPO):

  • Length-normalized importance ratio prevents long sequences from dominating.
  • Group standardization compares responses to peers answering the same prompt.
  • Batch normalization evens out differences across prompts in the minibatch.
  • Clipping and KL regularization prevent overly large, drifting updates.
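
Putting those ingredients together, the per-trajectory update term might look roughly like the sketch below. The paper's exact BN-GSPO loss and coefficients may differ; everything here, including the KL estimator, is illustrative:

```python
import numpy as np

def clipped_sequence_loss(logp_new, logp_old, logp_ref, advantage,
                          clip_eps=0.2, kl_coef=0.01):
    """Clipped, length-normalized sequence objective with a small KL penalty.

    logp_*: per-token log-probabilities of one trajectory under the current,
    rollout (old), and reference policies. advantage: its normalized advantage.
    """
    # Length-normalized (geometric-mean) importance ratio, so long rollouts
    # don't dominate the update simply by having more tokens.
    ratio = float(np.exp(np.mean(np.asarray(logp_new) - np.asarray(logp_old))))
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Rough KL estimate toward the reference model to keep updates from drifting.
    kl_penalty = kl_coef * float(np.mean(np.asarray(logp_new) - np.asarray(logp_ref)))
    return policy_loss + kl_penalty
```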

Rewards (shaping good behavior):

  • Answer reward: 1.0 if the final answer semantically matches the ground truth (LLM-judge), else 0.0.
  • Format reward: 0.5 if protocol is followed (think + exactly one tool per non-final turn; think + answer at the end; tags and JSON schema correct), else 0.0.
  • Why it matters: Without format reward, models might get answers right with messy tool use; without accuracy reward, they might follow rules but be wrong.
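
In code, combining the two terms is straightforward. The weights below are taken from the text; the judge call and format check are stand-in placeholders:

```python
def trajectory_reward(final_answer, ground_truth, protocol_ok, judge):
    """Combine the answer and format rewards described above.
    judge(answer, truth) -> bool stands in for the LLM-as-judge;
    protocol_ok is True when every turn followed the one-tool-per-turn format."""
    answer_reward = 1.0 if judge(final_answer, ground_truth) else 0.0
    format_reward = 0.5 if protocol_ok else 0.0
    return answer_reward + format_reward

# Toy usage with an exact-match stand-in for the judge:
reward = trajectory_reward("18", "18", protocol_ok=True,
                           judge=lambda a, t: a.strip() == t.strip())  # -> 1.5
```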

Tool Implementations (practical details):

  • Text search: Uses a web API at inference; during RL training, local Wikipedia retrieval plus consistent summarization keeps costs low and formats matched. Top-5 results are summarized individually, then jointly, to keep context short and relevant.
  • Image search: Reverse image titles/thumbnails pre-cached for training to reduce latency.
  • Image crop: Takes normalized coordinates and an image index, producing a new cropped image reference for the next turn.
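
The crop tool is the simplest of the three to picture. A minimal Pillow-based sketch under the normalized-coordinate convention described above (resizing and validation details of the real system are not specified here):

```python
from PIL import Image

def crop_tool(images, image_index, bbox):
    """Zoom into a normalized [x1, y1, x2, y2] region of images[image_index],
    register the crop as a new image, and return its index."""
    img = images[image_index]
    w, h = img.size
    x1, y1, x2, y2 = bbox
    crop = img.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))
    images.append(crop)
    return len(images) - 1  # index the agent can reference in later turns

# Usage sketch:
# images = [Image.open("stadium_4k.jpg")]
# new_idx = crop_tool(images, 0, (0.40, 0.60, 0.55, 0.72))
```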

Limits and safety valves:

  • Turn limit T=10 encourages efficient plans.
  • Token limits per turn and per trajectory prevent runaways.
  • One-tool-per-turn schema keeps logs clean for analysis and reward checking.

End-to-end example (with actual data flow):

  • Input: A 4K photo of a stadium gate with a small red sponsor sign and the question about the years between sponsor founding and stadium construction.
  • Turn 1: <think>Need sponsor name</think> → crop ([x1,y1,x2,y2]) → returns zoom with “Rakuten.”
  • Turn 2: <think>Need founding year</think> → text search “Rakuten founding year” → summary includes 1997.
  • Turn 3: <think>Identify stadium and start year</think> → image search of the full photo → summaries suggest Japan National Stadium, construction began 2016.
  • Final: <think>Compute 2016-1997</think> → <answer>19</answer>.

Why the method is clever:

  • It teaches not just answers, but how to acquire the right evidence at the right time.
  • BN-GSPO makes multi-tool RL stable across mixed difficulty.
  • The cold-start set is tiny yet powerful: it unlocks deep RL exploration quickly.
  • Consistent tool outputs (summaries, thumbnails, crops) keep the agent’s context clean and focused.

04 Experiments & Results

The Test (what and why):

  • Agentic search: Can the model fetch up-to-date facts and connect them to visual clues? Benchmarks include MMSearch, FVQA, InfoSeek, SimpleVQA, LiveVQA, MAT-Search, and the new HR-MMSearch with 4K images and tough, search-driven questions.
  • Fine-grained visual reasoning: Can the model spot tiny details and reason spatially in high-res scenes? Benchmarks include V* Bench, HR-Bench 4K, HR-Bench 8K, and MME-RealWorld.

The Competition (baselines):

  • Proprietary: GPT-4o/5/5.2, Gemini-2.5/3-Pro/Flash.
  • Open-source agentic: MMSearch-R1, DeepMMSearch-R1, DeepEyesV2, Visual-ARFT.
  • Open-source bases (direct answer and with tools): Qwen families and others.

Metrics (making numbers meaningful):

  • Agentic search uses Pass@1 via an LLM judge: how often is the first final answer correct?
  • Visual reasoning uses Exact Match (many are multiple-choice); some report Avg@8 to smooth variance.

Scoreboard (with context):

  • SenseNova-MARS-32B in agentic search achieves an average 69.74 across seven datasets—like scoring an A when many strong models get a B+.
  • MMSearch: 74.3 (tied for SOTA), showing excellent ability to connect visuals with fresh web facts.
  • HR-MMSearch: 54.4, beating Gemini-3-Pro and GPT-5.2 by around 6 points—impressive given 4K detail demands plus web lookup.
  • FVQA: 72.6 Pass@1 (agentic)—very strong multimodal factual Q&A.
  • Fine-grained vision: V* Bench 94.2 and HR-Bench 4K 90.2—top-tier, showing the model can truly “think with images.”

Smaller model strength:

  • SenseNova-MARS-8B averages 64.20 on search-oriented evaluations, outperforming similarly sized open-source agents and even surpassing some proprietary systems (e.g., Gemini-3-Flash) on average.

Surprising findings:

  • Hybrid data matters: Training only on fine-grained perception boosted V* Bench but hurt search tasks; mixing search + perception data produced balanced, superior results, especially on HR-MMSearch.
  • Tool-use adaptation: On MMSearch, the agent leaned on text/image search; on V* Bench, it used almost only crop; on HR-MMSearch, it used a balanced mix—evidence of real tool selection, not habits.
  • Efficiency gains with RL: Average tool calls dropped from ~4 to ~2 as RL progressed, indicating smarter, not just longer, trajectories.

Ablation highlights:

  • BN-GSPO vs GRPO/GSPO (pure RL, no SFT): BN-GSPO delivered the most stable, across-the-board gains. The extra batch normalization step reduced reward-scale variance and made training robust for mixed-length, multi-turn sequences.
  • Data distribution: The best performance came from the hybrid RL set (search + perception), confirming that broad practice teaches when to crop vs. when to search.

What to take from the numbers:

  • SenseNova-MARS isn’t just memorizing; it’s acquiring fresh info when needed and zooming when tiny details matter.
  • State-of-the-art scores across both search and vision benches show the method’s generality, not a niche trick.

05 Discussion & Limitations

Limitations (be specific):

  • Retrieval noise: Web snippets can be misleading (e.g., mixing birthplace with company HQ). The agent occasionally follows a wrong thread to a plausible but incorrect answer.
  • Query grounding: Sometimes the agent fails to turn a tiny visual clue (like a specific office name) into a sharp search query, retrieving global stats instead of local facts.
  • Hard dependencies: Pages needing heavy JavaScript rendering weren’t used in training; this can miss some web content.
  • Judge dependency: Accuracy uses an LLM-as-judge; while practical, it may occasionally mis-score edge cases.

Required resources:

  • A capable VLM backbone (e.g., Qwen-VL family).
  • RL infrastructure with group sampling, batch-size ~128, and efficient caching for image search.
  • A summarizer model to condense web pages.
  • Budget for live search at inference time (or careful caching) and enough GPU memory for long-context rollouts.

When NOT to use:

  • Fully offline or no-internet settings where web search is prohibited.
  • Time-critical, millisecond-latency scenarios—multi-turn tool use may be too slow.
  • Domains needing exact legal/medical citations without human review; retrieval noise can still slip in.

Open questions:

  • How to make search queries perfectly grounded in visual text and entities every time?
  • Can we build a retrieval noise filter that flags contradictions before finalizing an answer?
  • How to extend to more tools (e.g., OCR, calculator, table parser) without destabilizing RL?
  • Can we reduce reliance on LLM judges with cheaper, robust reward models?
  • How to safely support JavaScript-heavy pages at training time without exploding costs?

06 Conclusion & Future Work

Three-sentence summary: SenseNova-MARS is a multimodal agent that learns, via a stable RL method (BN-GSPO), to interleave image cropping with image and text search while reasoning step-by-step. A small cold-start dataset teaches correct tool etiquette before RL, and a new benchmark (HR-MMSearch) checks whether the agent can handle 4K details and live facts together. The model achieves state-of-the-art open-source results on both search and high-resolution vision tasks.

Main achievement: Unifying fine-grained visual analysis and web-grounded knowledge retrieval inside one agent, then making the RL training stable enough to scale and outperform strong baselines.

Future directions: Add more tools (OCR, table readers, calculators), strengthen visual-to-query grounding, handle JS-heavy pages, reduce reliance on LLM judges, and explore multi-agent collaboration. Also, push efficiency—fewer steps, lower latency—without losing accuracy.

Why remember this: It shows how to teach models not just to answer, but to investigate—zoom in when details matter, look up facts when memory isn’t enough, and stitch everything together with careful, stable learning.

Practical Applications

  • News verification: Identify small visual cues in breaking photos and confirm facts with live web search.
  • E-commerce: Zoom into product labels or serials and fetch authentic specs, release dates, or safety recalls.
  • Education: Help students analyze diagrams by cropping and then reading supporting articles for context.
  • Sports analytics: Read tiny sponsor logos or jersey patches and connect them to team or brand histories.
  • Travel assistance: Recognize landmarks from photos and retrieve updated visiting rules or events.
  • Compliance auditing: Inspect fine-print on packaging and cross-check regulations online.
  • Customer support: Parse screenshots by zooming into error codes and searching official docs for fixes.
  • Research curation: Extract figure details from papers and search for related studies or datasets.
  • Cultural heritage: Read inscriptions on artifacts and link them to museum or archival records online.
#multimodal agent#vision-language model#reinforcement learning#BN-GSPO#tool use#image crop#text search#image search#HR-MMSearch#high-resolution perception#agentic reasoning#MMSearch#V* Bench#HR-Bench#LLM-as-a-judge
Version: 1