Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
Key Summary
- Mind-Brush turns image generation from a one-step 'read the prompt and draw' into a multi-step 'think, research, and create' process.
- It first figures out what the model does not know, then searches the web for trusted text and image clues, and finally reasons step by step before drawing.
- This agentic workflow bridges hidden user intentions and real-world facts, reducing mistakes and hallucinations in the final picture.
- A new benchmark, Mind-Bench, tests whether models can handle fresh news, long-tail topics, and tricky reasoning, not just simple prompts.
- Mind-Brush lifts an open-source baseline (Qwen-Image) from 0.02 to 0.31 accuracy on Mind-Bench, a huge jump like going from almost zero to solid progress.
- It also beats or matches strong systems on other tests like WISE (world knowledge) and RISE (reasoning), showing broad gains.
- The key idea is to combine active search (to stay up to date) with logical reasoning (to solve hidden constraints) before drawing.
- A 'Concept Review' step condenses all the evidence into a clean Master Prompt and visual references that the generator can follow.
- Mind-Brush is training-free: it orchestrates existing models and tools, so it can improve many generators without retraining them.
Why This Research Matters
Mind-Brush changes how AI draws by acting more like a careful student: it plans, researches, and thinks before creating. This reduces wrong details in images about real events, science facts, or math problems, which is vital for education, media, and decision-making. Teachers and students can get visuals that both look good and tell the truth, even for brand-new topics. Journalists and designers can illustrate up-to-the-minute stories with fewer mistakes. Companies can produce brand-accurate, fact-checked visuals on demand. Overall, it's a move from "pretty guesses" to "beautiful and correct", and that's a big step for trustworthy AI.
Detailed Explanation
01 Background & Problem Definition
You know how when you're making a school poster, you don't just start drawing? First you understand the topic, then you look up facts and pictures, and finally you plan what to put where. Classic image generators usually skip those middle steps: they jump straight from your words to pixels.
Before this work, most text-to-image models acted like super-speedy but very literal artists. If you asked for "a poster about the latest eclipse in Iceland," they might draw a pretty eclipse, but miss that it happened in 2026, that it was visible in Reykjavik, or what the sky looked like that day. They mainly relied on what they memorized during training, which can be out of date or too general. That means they often failed when the prompt required fresh facts (like recent news), rare or long-tail details (like a niche object or new character), or multi-step reasoning (like interpreting a map or solving a geometry setup before drawing).
The problem researchers faced was this: prompts often hide extra steps. People may not write every detail, but they expect the picture to match both their intent and the world's truth. Without checking facts or doing reasoning, models can produce confident-looking but wrong images. It's like turning in a science project poster without checking your sources: nice colors, wrong content.
What did people try before? A first wave of agentic tools expanded or polished the prompt ("prompt optimization"), adding descriptive flourishes. That helps clarity, but if the missing piece is a fact that the model never learned, or a logic puzzle it needs to solve, fancier wording doesn't fix it. Others tried simple image retrieval to show examples, but often treated those as loose hints, not something to verify and reason about. On the flip side, reasoning-only methods stayed inside the model's head: with no external search, they couldn't update themselves about the real world.
The missing ingredient was a unified, dynamic workflow: one that could (1) detect what's missing, (2) fetch the right evidence, (3) reason through it, and then (4) generate a picture that respects both the user's goal and the facts. In short: not just "draw from memory," but "think, research, and then create."
Why this matters in daily life: Imagine a teacher asking for a poster about last week's volcanic eruption, including the correct location, ash color, and safety signs. Or a student wanting a diagram that solves a geometry problem step by step, then visualizes the result. Or a fan requesting a brand-new character's official costume details. A model that can't search or reason will guess and often get it wrong; a model that can do both can produce something useful, trustworthy, and tailored.
Here are the key new ideas introduced in this paper, explained with the Sandwich pattern when each concept first appears:
Top Bread (Hook): Imagine you're planning a science fair project. You don't just doodle; you figure out what you need to know, research it, think it through, and then build your display. The Concept: Mind-Brush Framework is a system that turns image generation into a "think-research-create" pipeline. How it works:
- It analyzes the user's request to find missing information.
- It searches for trusted text and image evidence.
- It reasons step by step to resolve hidden constraints.
- It condenses everything into a clear instruction and references for drawing. Why it matters: Without a plan that includes search and reasoning, the model can draw something pretty but wrong or irrelevant. Bottom Bread (Anchor): Asking for "a poster of the 2026 Iceland eclipse in Reykjavik at sunset" leads Mind-Brush to check the date, place, sky color, and sample visuals before generating an accurate poster.
We'll detail more building blocks in the next sections, but the story is simple: instead of a static text-to-pixels jump, Mind-Brush uses a smart, adaptive roadmap that mirrors how humans create with care.
02 Core Idea
Top Bread (Hook): You know how detectives make a plan, gather clues, and then piece them together before solving the case? They don't just guess from the first sentence they hear. The Concept: The aha! moment is that image generation should actively search for facts and then reason through them, before drawing, so the final picture matches both intent and reality. How it works (at a glance):
- Detect missing pieces in the prompt (cognitive gaps).
- Retrieve up-to-date text and image evidence.
- Do step-by-step reasoning to resolve hidden constraints.
- Synthesize a Master Prompt and visual references. Why it matters: Without searching and reasoning, the model will "hallucinate" or overlook crucial details. Bottom Bread (Anchor): "Draw the champion of this year's marathon holding their country's flag at the finish line" requires checking who won this year and what their flag looks like; Mind-Brush does that before drawing.
Three analogies to lock it in:
- Librarian analogy: Before writing a report, you check the newest books and articles. Mind-Brush does that for images.
- Recipe analogy: Chefs gather the right ingredients and measure them before cooking. Mind-Brush collects facts and constraints before rendering.
- Map analogy: Before hiking, you study the route and conditions. Mind-Brush studies the logical path to a correct image.
Before vs. After:
- Before: Models took prompts literally and drew from memory. They were fast but often wrong on new events or logic-heavy tasks.
- After: Mind-Brush inspects the prompt, looks up facts, reasons through puzzles, and then draws with confidence and correctness.
Why it works (intuition): Prompts often hide two challenges: missing knowledge and hidden reasoning. Search fills in the knowledge gaps; reasoning resolves the logic gaps. By chaining them, the generator receives a grounded, precise instruction rather than a vague guess.
Now the key building blocks, each with the Sandwich pattern and in dependency order:
Top Bread (Hook): Imagine your brain asking, "What do I still need to know before I can do this right?" The Concept: Cognitive Gap Detection is the step where the system identifies exactly what information is missing to satisfy the request. How it works:
- Break the prompt into 5W1H (What, When, Where, Who, Why, How).
- Spot pieces that are unclear, out-of-date, or require calculation.
- Turn each into a small, answerable question. Why it matters: Without finding the gaps, the system won't know what to look up or reason about. Bottom Bread (Anchor): "Make a poster of yesterday's weather in Tokyo at noon." The gap is: what was the weather at that time? The system flags that and searches.
Top Bread (Hook): Think of a student who goes to the library and image archives to collect trustworthy sources. The Concept: Active Search is when the system queries the web to pull in recent text facts and matching reference images. How it works:
- Turn gaps into smart search keywords.
- Retrieve short text snippets and a few relevant images.
- Calibrate the prompt and visual references using what was found. Why it matters: Without search, the model can't update itself about real-world changes or rare details. Bottom Bread (Anchor): For "the 2026 Reykjavik eclipse," Active Search confirms date, location, sky color, and example photos before drawing.
Top Bread (Hook): When you solve a math word problem, you reason step by step instead of guessing. The Concept: Logical Reasoning is the model's step-by-step thinking that turns clues and rules into clear conclusions. How it works:
- Read the prompt and any input images (like a map or diagram).
- Combine them with searched evidence.
- Deduce positions, counts, sizes, or math results. Why it matters: Without reasoning, the system can't satisfy hidden constraints like geometry steps or spatial layouts. Bottom Bread (Anchor): "Draw the triangle after rotating it 90° around point O." The system reasons through the transformation before drawing.
Top Bread (Hook): When you show your work in math, you list the steps. The Concept: Chain-of-Thought (CoT) is writing down the reasoning steps so the system doesn't skip logic. How it works:
- Break hard problems into bite-sized steps.
- Check each step against the evidence.
- Produce a clear conclusion that guides the drawing. Why it matters: Without CoT, mistakes hide inside the model's head. Bottom Bread (Anchor): For a map route, CoT lists "start here, turn left, cross the bridge," then renders the scene accordingly.
Top Bread (Hook): After researching, you make a clean outline before making your poster. The Concept: The Master Prompt is a cleaned-up, structured instruction that merges your goal with verified facts and references. How it works:
- Filter noisy evidence.
- Keep only verified facts and logic conclusions.
- Write a precise prompt and attach visual references. Why it matters: Without a clean summary, the generator might follow messy or conflicting inputs. Bottom Bread (Anchor): "Poster: Reykjavik harbor at sunset during the 2026 eclipse; include correct skyline, lighting, and safe-viewing signs."
Top Bread (Hook): Finally, an artist uses your outline and example pictures to paint the final image. The Concept: The Unified Image Generation Agent is the component that actually draws, guided by the Master Prompt and visual cues. How it works:
- Choose generation or editing mode.
- Condition on the Master Prompt and reference images.
- Render the final picture with high fidelity. Why it matters: Without a guided generator, all the smart planning wouldn't turn into a correct image. Bottom Bread (Anchor): Given the verified eclipse details and reference skyline, the agent renders an accurate, beautiful poster.
03 Methodology
At a high level: Input (text and optional image) → Intent Analysis (find gaps) → Plan (search and/or reason) → Evidence Collection (texts and images) → Reasoning (step-by-step CoT) → Concept Review (Master Prompt) → Guided Generation (final image).
Step-by-step, like a recipe:
- Intent Analysis with 5W1H
- What happens: The system parses the user's request into What, When, Where, Who, Why, and How to surface hidden requirements.
- Why it exists: Without structuring the prompt, it's hard to see what's missing (e.g., dates, identities, spatial details).
- Example: "Make a poster of yesterday's Tokyo noon weather." "When"=yesterday noon; "Where"=Tokyo; "What"=weather visuals; gap=actual conditions.
Top Bread (Hook): Like making a checklist before a trip so you don't forget anything. The Concept: 5W1H Parsing is turning a prompt into six clear questions. How it works:
- Split the prompt into 5W1H.
- Mark parts that need verification.
- Turn them into search/logic tasks. Why it matters: Without a checklist, important details slip through. Bottom Bread (Anchor): The system flags "yesterday noon weather" as something to verify.
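To make this concrete, here is a minimal, hypothetical Python sketch of the gap-flagging idea. It is not the paper's implementation: Mind-Brush would ask a multimodal model to fill the 5W1H slots, while this toy version uses a simple keyword heuristic, and names like Gap and detect_gaps are invented for illustration.

```python
# Minimal sketch of 5W1H gap flagging, for illustration only.
# A real system would ask a multimodal LLM to fill these slots; here a toy
# keyword heuristic stands in, and all names below are hypothetical.
from dataclasses import dataclass, field

TIME_SENSITIVE = ("yesterday", "today", "latest", "this year", "last week")

@dataclass
class Gap:
    slot: str       # which 5W1H slot is uncertain, e.g. "When"
    question: str   # a small, answerable question to search or reason about

@dataclass
class IntentAnalysis:
    prompt: str
    gaps: list = field(default_factory=list)

def detect_gaps(prompt: str) -> IntentAnalysis:
    """Flag parts of the prompt that need verification before drawing."""
    analysis = IntentAnalysis(prompt=prompt)
    for phrase in TIME_SENSITIVE:
        if phrase in prompt.lower():
            analysis.gaps.append(Gap(
                slot="When",
                question=f"What are the verified facts behind '{phrase}' in: {prompt}?",
            ))
    return analysis

print(detect_gaps("Make a poster of yesterday's Tokyo noon weather.").gaps)
```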
- Cognitive Gap Detection and Planning
- What happens: The system identifies whether gaps need search (facts) or reasoning (logic), then creates a plan to run one or both.
- Why it exists: Some tasks need only facts; others need only logic; many need both. Planning saves time and improves accuracy.
- Example: "Draw the winner of this year's marathon crossing the line." Plan = search for winner identity and flag image; then generate.
- Active Search for External Knowledge
- What happens: Convert gaps into query keywords, retrieve short text snippets and a handful of image references, then update the prompt and visual plan.
- Why it exists: Prompts often depend on recent or rare knowledge not stored inside the model.
- Example: "Special event poster: 2026 eclipse in Reykjavik at sunset." Search confirms date, place, skyline, and typical sky colors at that time.
Top Bread (Hook): Think of checking a weather app and photo gallery before painting a scene. The Concept: Cognition Search Agent performs targeted text and image retrieval. How it works:
- Generate smart text/image queries.
- Pull concise articles and reference photos.
- Calibrate the prompt and references with verified facts. Why it matters: Without it, the model guesses about the world and makes factual errors. Bottom Bread (Anchor): It finds Reykjavik harbor reference shots and real eclipse timing, then updates the plan.
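A rough sketch of how this search step could be wired up is shown below. The web_search and image_search helpers are offline stubs standing in for whatever real retrieval APIs are available; the example only illustrates the shape of the step, not the authors' actual tooling.

```python
# Hypothetical sketch of the active-search step. The two search helpers are
# offline stubs; in practice they would call real text/image search APIs.
def web_search(query, max_results=3):
    return [f"[text snippet {i + 1} for: {query}]" for i in range(max_results)]

def image_search(query, max_images=2):
    return [f"[reference image {i + 1} for: {query}]" for i in range(max_images)]

def gather_evidence(questions):
    """Turn gap questions into queries and collect short text and image evidence."""
    evidence = {"texts": [], "images": []}
    for question in questions:
        evidence["texts"] += web_search(question)
        evidence["images"] += image_search(question)
    return evidence

print(gather_evidence(["When and where is the 2026 Reykjavik eclipse visible?"]))
```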
- Logical Reasoning with Chain-of-Thought
- What happens: The system reasons through math, spatial layouts, or rules derived from the prompt and evidence, outputting clear conclusions.
- Why it exists: Many visual tasks involve implied logic (e.g., rotations, counts, relative positions).
- Example: "Render the polygon after reflecting across line AB." The agent computes the transformation steps before drawing.
Top Bread (Hook): Like showing your steps on a math quiz so you don't skip crucial logic. The Concept: Knowledge Reasoning Agent executes step-by-step CoT to reach firm conclusions. How it works:
- Take in prompt, inputs, and searched evidence.
- Break the task into small reasoning steps.
- Output explicit constraints (e.g., positions, counts) for generation. Why it matters: Without explicit logic, the final image may look nice but be logically wrong. Bottom Bread (Anchor): It calculates the rotated triangle vertices, then feeds those constraints to the generator.
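The geometry example can be made fully concrete. The short snippet below (ours, not the paper's) rotates a triangle 90° counterclockwise around a point O and emits the new vertex positions as explicit constraints that a generator could be asked to respect.

```python
# Worked example: rotate a triangle 90° counterclockwise around point O and
# turn the result into explicit layout constraints for the generator.
def rotate_90_ccw(point, center):
    px, py = point
    cx, cy = center
    dx, dy = px - cx, py - cy
    return (cx - dy, cy + dx)   # (x, y) -> (-y, x), measured from the center

triangle = [(2, 1), (4, 1), (3, 3)]
center_o = (0, 0)
rotated = [rotate_90_ccw(v, center_o) for v in triangle]
constraints = [f"vertex {chr(65 + i)} at {v}" for i, v in enumerate(rotated)]
print(constraints)   # ['vertex A at (-1, 2)', 'vertex B at (-1, 4)', 'vertex C at (-3, 3)']
```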
- Concept Review and Master Prompt Synthesis
- What happens: A review module filters noisy evidence, keeps only verified facts and logic, and writes a clean Master Prompt with references.
- Why it exists: Raw search results can be messy; the generator needs a crisp, conflict-free instruction.
- Example: It merges "Reykjavik harbor," "sunset lighting," "eclipse timing," and the user's style request into a single, tidy prompt.
Top Bread (Hook): Before you draw your poster, you write a neat outline from your notes. The Concept: Concept Review Agent consolidates everything into a Master Prompt plus references. How it works:
- Remove duplicates and noise.
- Keep verified facts and CoT conclusions.
- Produce a structured prompt with visual cues. Why it matters: Without consolidation, the generator could follow the wrong detail. Bottom Bread (Anchor): It outputs: "Poster with Reykjavik skyline, eclipse at verified time, warm sunset tones, safety sign included," plus reference photos.
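As a rough illustration of this consolidation step (again a sketch under our own assumptions, not the authors' code), the function below merges the user's goal, verified facts, and reasoning conclusions into one deduplicated Master Prompt with attached references.

```python
# Illustrative Concept Review sketch: filter duplicates, keep verified facts
# and reasoning conclusions, and write one structured Master Prompt.
def build_master_prompt(user_goal, verified_facts, conclusions, references):
    seen, kept = set(), []
    for item in list(verified_facts) + list(conclusions):
        key = item.strip().lower()
        if key and key not in seen:   # drop empty strings and repeats
            seen.add(key)
            kept.append(item.strip())
    return {
        "master_prompt": f"{user_goal}. Constraints: " + "; ".join(kept),
        "references": list(references),
    }

out = build_master_prompt(
    "Poster of the 2026 Reykjavik eclipse at sunset",
    verified_facts=["Reykjavik harbor skyline", "warm sunset lighting", "Reykjavik harbor skyline"],
    conclusions=["include a safe-viewing sign"],
    references=["reykjavik_harbor_reference.jpg"],
)
print(out["master_prompt"])
```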
- Guided Image Generation
- What happens: The Unified Image Generation Agent uses the Master Prompt and references to either create from scratch or edit an input image.
- Why it exists: The final stage translates all the thinking and research into pixels with high fidelity.
- Example: Starting from a blank canvas (T2I) or refining an existing photo (I2I edit) to match constraints.
Top Bread (Hook): An artist uses your outline and example pictures to paint exactly what you asked for. The Concept: Unified Image Generation Agent is the draw-and-edit engine guided by facts and logic. How it works:
- Choose generation or editing mode.
- Condition on Master Prompt and references.
- Render the final, faithful image. Why it matters: Without grounding, the final result can drift from the plan. Bottom Bread (Anchor): It produces the Reykjavik eclipse poster with the correct skyline and time-of-day lighting.
The secret sauce: Adaptive routing. The system doesn't always do every step; it chooses search, reasoning, or both based on the detected gaps. That saves time and focuses effort where it's needed most.
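Pulling the pieces together, here is a hedged end-to-end sketch of that adaptive routing. Every helper in it (needs_search, needs_reasoning, the agent functions, and generate) is a stand-in stub invented for this example; the point is only the control flow: run search and/or reasoning when the detected gaps call for it, then review and generate.

```python
# End-to-end sketch of adaptive routing (all helpers are illustrative stubs).
def needs_search(prompt):
    return any(w in prompt.lower() for w in ("latest", "today", "yesterday", "2026"))

def needs_reasoning(prompt):
    return any(w in prompt.lower() for w in ("rotate", "reflect", "route", "solve"))

def search_agent(prompt):
    return {"facts": [f"verified facts for: {prompt}"], "refs": ["reference.jpg"]}

def reasoning_agent(prompt, facts):
    return [f"step-by-step conclusions for: {prompt}"]

def review_agent(prompt, facts, conclusions, refs):
    return {"master_prompt": prompt + " | " + "; ".join(facts + conclusions), "refs": refs}

def generate(master, input_image=None):
    mode = "edit" if input_image else "text-to-image"   # choose generation vs. editing
    return f"<{mode} render of: {master['master_prompt']}>"

def think_research_create(prompt, input_image=None):
    facts, refs, conclusions = [], [], []
    if needs_search(prompt):          # knowledge gap -> fetch external evidence
        found = search_agent(prompt)
        facts, refs = found["facts"], found["refs"]
    if needs_reasoning(prompt):       # logic gap -> explicit reasoning
        conclusions = reasoning_agent(prompt, facts)
    master = review_agent(prompt, facts, conclusions, refs)
    return generate(master, input_image)

print(think_research_create("Poster of the 2026 Reykjavik eclipse at sunset"))
```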
04 Experiments & Results
What did they test? The authors built Mind-Bench, a 500-sample benchmark that stresses two skills: Knowledge-Driven tasks (like breaking news, weather at a time/place, specific characters, and world facts) and Reasoning-Driven tasks (like daily-life logic, map/geo understanding, math, science/logic, and poem imagery). Instead of vague scores, they used a strict checklist: a picture only passes if it satisfies every required detail.
Top Bread (Hook): Think of a teacher grading a project with a checklist: every box must be ticked to get full credit. The Concept: Checklist-based Strict Accuracy (CSA) is a pass/fail scoring method that requires meeting all listed facts or constraints. How it works:
- Each test item has a verified sub-checklist.
- A judge model compares the image against each item.
- Only if all items pass does the image count as correct. Why it matters: Without strict checks, models can get partial credit for pretty but inaccurate images. Bottom Bread (Anchor): If the checklist says "the skyline must be Reykjavik, lighting is sunset, and an eclipse is visible," missing any one fails the sample.
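To show how strict this scoring is, here is a tiny sketch of the final tally; the judging itself (done by a multimodal model in the paper) is outside this snippet, which only shows that a sample counts if and only if every checklist item passes.

```python
# Checklist-based strict accuracy: a sample is correct only if ALL of its
# checklist items pass; the score is the fraction of fully correct samples.
def strict_accuracy(per_sample_checks):
    passed = sum(all(checks) for checks in per_sample_checks)
    return passed / len(per_sample_checks)

# Three samples; only the first satisfies every required detail.
print(strict_accuracy([
    [True, True, True],    # skyline, lighting, and eclipse all correct -> pass
    [True, False, True],   # one missed detail -> the whole sample fails
    [False, False, True],
]))  # 0.333...
```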
Competition and context: They compared Mind-Brush (using open-source generators and a strong reasoning backbone) against both open-source and proprietary systems, including big names like GPT-Image, Nano Banana Pro, and FLUX variants. This shows not just raw drawing power, but the benefit of "think-research-create."
Scoreboard highlights (with context):
- On Mind-Bench, Mind-Brush lifts an open-source baseline (Qwen-Image) from 0.02 to 0.31 accuracy. That's like jumping from almost zero to a solid base, proof that search+reasoning unlocks big gains.
- On WISE (world knowledge), Mind-Brush reaches an overall WiScore of 0.78 and outperforms strong open-source baselines; it narrows the gap with top proprietary models.
- On RISE (reasoning), Mind-Brush achieves strong Instruction Reasoning scores (e.g., 61.5), surpassing several competitive systems and approaching proprietary results.
Surprising findings:
- Synergy matters: Search alone helps with fresh facts, reasoning alone helps with logic, but doing both provides the largest and most stable improvements.
- Old models get new life: Wrapping a weaker generator in the Mind-Brush workflow can double performance on strict tasks. The agentic "brain" lifts the "hand that draws."
- Adaptive planning pays off: Not every prompt needs every tool. Detecting what's missing and routing to the right tools saves compute and reduces noise.
In short, strict tests that demand both truth and logic show Mind-Brush's core promise: better alignment with intent, knowledge, and reasoning before the first pixel is drawn.
05 Discussion & Limitations
Limitations:
- Dependency on external evidence: If search results are wrong, incomplete, or biased, the final image can inherit those issues.
- Latency and cost: Searching, reading, and reasoning add time and tokens compared to one-shot generation.
- Tool availability: Internet or API outages reduce capability; some domains may lack quality sources.
- Judge and metric sensitivity: Automatic evaluators (like MLLMs) can still mis-score tricky edge cases.
- Visual grounding gaps: Even with perfect facts, a generator might struggle with exact small details (e.g., fine logos) without higher-resolution control.
Required resources:
- A capable multimodal reasoning model to run the agents.
- Access to web search (text and images) with modest result caps (e.g., a few text snippets, a handful of images).
- A solid image generator or editor (open-source or proprietary) to execute the final render.
- GPU resources for inference if running locally.
When not to use:
- Purely imaginative prompts with no factual constraints (e.g., surreal fantasy) may not benefit from search overhead.
- Offline or privacy-restricted settings where external search is disallowed.
- Ultra-low-latency applications where the added planning time is unacceptable.
Open questions:
- How to guarantee source reliability and reduce bias when grounding images in real-world data?
- Can we speed up search/reason loops without losing accuracy (e.g., better caching, query planning)?
- How to extend beyond images to video or 3D while keeping reasoning precise?
- Can we design better automatic judges for strict, multi-constraint image evaluation?
- How can the framework learn which tools to use from end-to-end feedback without retraining the whole stack?
06 Conclusion & Future Work
In three sentences: Mind-Brush turns image generation into a "think-research-create" process by detecting what's missing, searching for evidence, and reasoning step by step before drawing. This unified agentic workflow grounds images in up-to-date facts and clear logic, drastically improving accuracy on challenging tasks. A new benchmark, Mind-Bench, shows large gains over baselines and competitive performance with strong systems on knowledge and reasoning.
Main achievement: Unifying active search and explicit reasoning within an adaptable, training-free pipeline that boosts many generators without retraining.
Future directions: Faster, more reliable search; stronger reasoning backbones; richer visual control (e.g., fine-grained editing); broader modalities (video/3D); and improved evaluators that can check detailed visual claims. Also, integrating safety filters and provenance tracking to make evidence-aware images trustworthy at scale.
Why remember this: It marks a shift from "paint from memory" to "plan with facts and logic," bringing AI artwork closer to how careful humans create, by understanding, verifying, and then crafting.
Practical Applications
- Generate accurate classroom posters about recent scientific events (e.g., eclipses, launches) with verified facts and visuals.
- Create math diagrams that first solve the problem step by step, then render the final geometric construction.
- Design news illustrations that reflect the latest information (winners, locations, flags, dates) with trusted references.
- Produce travel flyers that match real weather, landmarks, and time-of-day lighting for specific dates and places.
- Build geography visuals from maps (routes, landmarks, relative positions) after reasoning through spatial constraints.
- Make character or product visuals that match official styles and artifacts by retrieving up-to-date references.
- Assist teachers in crafting science lab diagrams that respect physical states and sequences (e.g., phase changes).
- Auto-generate company slides with brand-correct colors, logos, and recent product details verified by search.
- Create storybook scenes that follow cultural or historical facts, preventing common visual mistakes.
- Produce safety infographics that combine current conditions (e.g., weather alerts) with correct symbols and layouts.