
Figure It Out: Improve the Frontier of Reasoning with Executable Visual States

Intermediate
Meiqi Chen, Fandong Meng, Jie Zhou · 12/30/2025
arXiv · PDF

Key Summary

  • FIGR is a new way for AI to ‘think by drawing,’ using code to build clean, editable diagrams while it reasons.
  • Instead of only writing text steps, the model can decide to generate small programs that draw figures, check them, and revise them to stay geometrically consistent.
  • This drawing-inside-thinking is trained end-to-end with reinforcement learning, so the AI learns when drawing helps and when it’s unnecessary.
  • An adaptive reward system nudges the model to draw only when it improves accuracy, cutting down on useless pictures.
  • Across seven tough math benchmarks, FIGR beats strong text-only baselines and the original vision-language model, with big gains on AIME 2025 (+13.12 points) and BeyondAIME (+11.00 points).
  • Executable visual states act like a trusty sketchpad: precise, controllable, and re-runnable, unlike noisy, unconstrained generated images.
  • FIGR doesn’t need a special supervised pretraining stage for diagrams; it learns the behavior purely through reinforcement learning.
  • Ablations show that removing visual feedback or the adaptive reward weakens performance and stability, proving those pieces matter.

Why This Research Matters

Many real-world tasks depend on structures that must fit together exactly, not just sound good in words. FIGR gives AI a reliable sketchpad—precise, editable diagrams—that it can use while thinking, so mistakes show up as visible mismatches instead of hidden algebra slips. This means better math tutoring systems that can draw and check student diagrams, more trustworthy design assistants that validate geometry constraints, and smarter scientific reasoning tools that can see global patterns. The adaptive reward keeps the model practical, reserving drawing for when it truly helps. Over time, this approach could spread to planning, robotics, and software design, where clean intermediate representations are the difference between ‘seems fine’ and ‘is correct.’

Detailed Explanation


01 Background & Problem Definition

You know how students draw triangles, circles, and arrows when a math problem gets tricky? Those little sketches help keep all the angles and lengths straight in your head. Computers face the same struggle: when a problem hides important spatial rules (like where lines meet or which shape touches which), words alone can be slippery.

Before this work, most advanced AIs solved problems by writing out steps in text, called chain-of-thought. That was good for arithmetic or logic, but not great for geometry or anything where the whole picture must fit together. If a subtle geometric rule got missed early on, later algebra could go off the rails, and the mistake might not get caught. People tried two big ideas to fix this:

  • Unconstrained image generation during reasoning: models would imagine pictures as they thought. Cool idea, but the images were often fuzzy or a bit wrong, and tiny spatial mistakes could snowball into wrong answers.
  • Tool-augmented visual steps: models used fixed tools (like zooming or cropping a given image). This had control, but it was stuck with the images it was given and a tiny toolbox—no freedom to draw exactly the diagram the problem needed.

That left a gap. What if the model could build exact, rule-following drawings as it thought—draw, check, fix, and try again—just like a student adjusting a sketch until all the lines meet where they should? And what if those drawings were executable, meaning they come from code that can be re-run, modified, and verified?

Here’s the real-world stake: getting structure right matters far beyond geometry class. Planning a robot’s path, checking a bridge design, or understanding scientific diagrams all require shapes and relations that truly fit together. Words might describe them, but a clean, precise, revisable drawing can reveal mistakes instantly. Without that, we risk brittle reasoning that looks smart in text but breaks under real constraints.

So the problem researchers faced was: how do we let an AI think with drawings in a way that is precise, fixable, and trustworthy—and teach it to know when drawing helps at all? Prior attempts either drew without control (pictures looked plausible but weren’t guaranteed to be correct) or used helpful but limited tools that couldn’t create the exact diagrams needed.

This paper fills the gap with FIGR: a system that, during its step-by-step thinking, can choose to write small programs that construct exact figures. Those programs run in a sandbox, produce a clean diagram and textual feedback, and then the AI uses that feedback to continue reasoning. An adaptive reward gently teaches the AI to draw only when it actually helps finish the problem correctly, instead of drawing just for show.

Why should you care? Because whether you’re a student solving a tough geometry question, a scientist sketching a complex system, or an engineer validating a layout, the difference between ‘looks okay’ and ‘is actually correct’ is huge. FIGR’s big idea—making the model’s sketchpad executable and inside the reasoning loop—pushes AI toward answers that hold together globally, not just line by line.

02 Core Idea

🍞 Top Bread (Hook): Imagine building a LEGO set by checking the manual after every few pieces, making sure each step matches the picture. If something looks off, you fix it right away instead of waiting until the end.

🥬 The Concept (One-sentence insight): FIGR lets an AI ‘think by drawing’ with code, so it can build precise, revisable diagrams inside its reasoning process and use them to keep the whole solution consistent.

How it works (short):

  1. The AI reasons in turns.
  2. At any turn, it can choose to write code that draws a figure.
  3. The code runs, producing a clean diagram and textual feedback.
  4. The AI reviews the figure and feedback, then continues reasoning.
  5. A special reward system encourages drawing only when it actually helps get the right answer.

Why it matters: Without this, the AI may miss global structure (like a circle not truly tangent to a line) and get an algebraically neat but geometrically wrong result.

🍞 Bottom Bread (Anchor): Solving a geometry puzzle, the AI draws a circle that must touch a line and pass through a point. If the circle is even a tiny bit off, the figure exposes it immediately, and the AI corrects its steps.

Multiple analogies (same idea, three ways):

  • Sketchpad analogy: Like a student sketching, erasing, and redrawing to confirm angles and tangencies before committing to a final answer.
  • Map-and-route analogy: A GPS that recalculates the route when you miss a turn; the executable figure is the map, and each reasoning step adjusts the route.
  • Recipe-and-oven analogy: You don’t just write the recipe; you actually bake small test batches (run code to draw/measure), taste (read feedback), and then tweak the recipe (reason further).

Before vs After:

  • Before: The AI kept everything in its head (text only) or drew pretty but unreliable pictures it couldn’t truly check or edit precisely.
  • After: The AI makes exact, controllable figures from code, checks them, and improves its plan—so global constraints hold together.

Why it works (intuition): If you can turn “I think the circle touches the line” into “I drew a circle using exact math operations and verified the touch,” you replace guesswork with testable structure. Executable drawings act like measured blueprints rather than sketches from memory.
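
To make that intuition concrete, here is a tiny, self-contained sketch (ours, not code from the paper) of how a tangency claim becomes a checkable number once the drawing is executable. The helper name and the specific circle/line values are illustrative assumptions.

```python
# Illustrative sketch: "the circle touches the line" becomes a measurable gap.
# This is not FIGR's actual sandbox code; names and values are made up.
import math

def tangency_gap(center, radius, line_point, line_dir):
    """Gap between a circle and a line: ~0 means the circle is tangent to it."""
    px, py = center[0] - line_point[0], center[1] - line_point[1]
    dx, dy = line_dir
    # Perpendicular distance from the circle's center to the line.
    dist_center_to_line = abs(px * dy - py * dx) / math.hypot(dx, dy)
    return dist_center_to_line - radius

# Circle centered at (2, 1) with radius 1, line = the x-axis.
gap = tangency_gap(center=(2.0, 1.0), radius=1.0, line_point=(0.0, 0.0), line_dir=(1.0, 0.0))
print(f"tangency gap = {gap:.6f}")  # 0.000000 -> the circle really touches the line
```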

Building Blocks (with sandwich explanations):

  1. 🍞 Top Bread (Hook): You know how coaches reward good plays so a team learns what to repeat?
    🥬 Reinforcement Learning (what): A way for the AI to learn by trying actions and getting rewards based on how well things turned out.
    How it works: (1) Try a strategy. (2) Get a score (reward). (3) Prefer strategies that scored higher next time.
    Why it matters: Without RL, the AI wouldn’t reliably learn when drawing helps or hurts.
    🍞 Bottom Bread (Anchor): If drawing a figure led to the right geometry answer, that path gets a higher reward and is chosen more often later.

  2. 🍞 Top Bread (Hook): Imagine your math notebook where every drawing is neat, labeled, and easy to change.
    🥬 Executable Visual States (what): Diagrams generated from code that can be precisely recreated and edited.
    How it works: (1) The AI writes code to place points/lines/circles. (2) A sandbox runs the code and renders the figure. (3) The AI reads back measurements or errors.
    Why it matters: Without executability, drawings can be fuzzy or inconsistent, making it hard to trust them.
    🍞 Bottom Bread (Anchor): The AI codes “draw a circle through A and E tangent to CD,” runs it, and checks if tangency truly holds.

  3. 🍞 Top Bread (Hook): Teachers don’t grade every breath you take—only the final project—but you still learn which study habits paid off.
    🥬 End-to-End Reinforcement Learning (what): Training the whole reasoning-and-drawing loop by rewarding complete solution attempts.
    How it works: (1) Generate a full solution path (text + code). (2) Execute code and finish with an answer. (3) Give one overall reward. (4) Prefer paths with better rewards.
    Why it matters: Without end-to-end feedback, the AI can’t connect “I chose to draw here” with “that led to the correct final answer.”
    🍞 Bottom Bread (Anchor): The model gets a high score only if the answer is right and the drawing executed successfully when it helped.

  4. 🍞 Top Bread (Hook): Sometimes a sketch helps a lot; other times it’s a distraction.
    🥬 Adaptive Reward Mechanism (what): A system that boosts rewards for using drawings when they help, and dials them down when they don’t.
    How it works: (1) Predict if the problem benefits from a diagram. (2) If the AI used drawings and got the answer right, add a bigger bonus when diagrams were suitable; a smaller bonus otherwise. (3) No bonus if the answer is wrong or the code fails.
    Why it matters: Without this, the AI might draw too much (wasting time) or too little (missing structure).
    🍞 Bottom Bread (Anchor): On a geometry puzzle, correct use of a clean, executed diagram earns extra points; on simple algebra, drawing gives little extra.

  5. 🍞 Top Bread (Hook): Solving a mystery takes several clues gathered over time, not just one guess.
    🥬 Multi-turn Reasoning (what): The AI thinks in rounds, using the last round’s figure and feedback to guide the next step.
    How it works: (1) Read the question. (2) Reason a bit or draw. (3) Observe feedback. (4) Refine and repeat until done.
    Why it matters: Without turns, the AI can’t correct itself mid-solution.
    🍞 Bottom Bread (Anchor): The model draws, notices the circle misses tangency, adjusts the center, redraws, and proceeds.

  6. 🍞 Top Bread (Hook): A good puzzle solution makes all pieces click together.
    🥬 Geometric Consistency (what): Keeping all shapes and relations correct together across the whole figure.
    How it works: (1) Encode constraints (like lengths/angles/tangency) in code. (2) Render and check. (3) Revise until all constraints hold.
    Why it matters: Without global consistency, a solution that looks fine locally can be wrong overall.
    🍞 Bottom Bread (Anchor): Even if an equation looked okay, the rendered diagram shows two lines aren’t parallel; the AI fixes it.

  7. 🍞 Top Bread (Hook): Detectives form temporary guesses and update them with new clues.
    🥬 Intermediate Hypotheses (what): Provisional ideas the AI writes down and tests through drawings and feedback.
    How it works: (1) Propose a structure. (2) Execute to visualize. (3) Keep or revise based on what the figure shows.
    Why it matters: Without testing guesses, small mistakes can hide until the end.
    🍞 Bottom Bread (Anchor): “Maybe the center is here.” Draw, check distances, then adjust until it fits all conditions.

03 Methodology

At a high level: Input (a math question) → Multi-turn policy chooses text or code → If code, run it to render a figure and get feedback → Update context → Repeat → Output final answer.

Step-by-step (like a recipe):

  1. Start the conversation state. What happens: The model reads the problem and any earlier notes. Why it exists: Without a shared state, the AI can’t remember what it tried. Example: “In parallelogram ABCD… find the circle radius.”

  2. Choose what to do next (text or code). What happens: The policy decides to reason in words or generate code to draw/measure. Why it exists: Not every step needs a figure; sometimes text is enough. Example: The AI decides to draw: “Let’s place points and construct the circle.”

  3. If code is chosen, execute it in a sandbox. What happens: The code draws a precise diagram and returns visual plus textual feedback (like coordinates, distances, success/failure). Why it exists: Execution turns guesses into checkable facts. Example: The interpreter returns “figure rendered; tangency test failed.”

  4. Update the context with feedback. What happens: The model appends the figure and messages to its running notes. Why it exists: Without updating, later steps won’t benefit from what was learned. Example: The AI now knows the circle missed tangency by a small gap.

  5. Continue multi-turn reasoning. What happens: The model writes new text or code based on what changed. Why it exists: Iteration is how errors shrink and consistency grows. Example: “Shift the center right by 0.5 units and redraw.”

  6. Stop when done and present the answer. What happens: The model emits an end token and formats the final result. Why it exists: To deliver a clean, checkable answer. Example: “The radius is 10 − 4√3.”
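
To keep the recipe concrete, here is a minimal, hypothetical sketch of the interleaved reason–draw–observe loop described in steps 1–6. The `policy` and `sandbox` objects and the tag format are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of FIGR-style multi-turn reasoning (not the paper's code).
def solve(question, policy, sandbox, max_turns=8):
    context = [question]
    for _ in range(max_turns):
        action = policy.generate(context)            # free-form text, possibly containing code
        context.append(action)
        if "<code>" in action:                       # the model chose to draw/measure
            code = action.split("<code>")[1].split("</code>")[0]
            figure, feedback = sandbox.run(code)     # rendered diagram + textual messages
            context.extend([figure, feedback])       # feedback conditions the next turn
        if "<answer>" in action:                     # the model decided it is done
            return action.split("<answer>")[1].split("</answer>")[0]
    return None                                      # ran out of turns without an answer
```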

The training loop (how FIGR learns the recipe):

  • Rollout groups (Group Relative Policy Optimization, GRPO). The model samples several full solution attempts per question. Each attempt includes all text and code actions and their feedback. After all attempts finish, each gets a single reward score.
  • Rewards combine three parts: correctness (R_acc), output format (R_fmt), and visual usefulness (R_vis). The visual bonus applies only if the answer is correct and the diagram code executed; its size depends on whether the problem was predicted as suitable for diagrams (bigger bonus) or not (smaller bonus). This ties drawing to real success, not just activity (a small sketch of this combination follows this list).
  • Relative rewards. Instead of absolute scores alone, the model compares attempts to one another for the same question and learns to favor the better ones. This stabilizes learning when feedback comes only at the end.
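
A hedged sketch of how those three reward parts might combine for one full attempt is below; the bonus sizes (0.2 and 0.05) and the format weight are illustrative placeholders, not the paper's values.

```python
# Illustrative trajectory-level reward: correctness + format + adaptive visual bonus.
# The numeric weights here are placeholders, not values from the paper.
def trajectory_reward(correct, well_formatted, used_code, code_ok, diagram_suitable):
    r_acc = 1.0 if correct else 0.0
    r_fmt = 0.1 if well_formatted else 0.0
    r_vis = 0.0
    if correct and used_code and code_ok:            # bonus only when drawing actually paid off
        r_vis = 0.2 if diagram_suitable else 0.05    # larger bonus on diagram-friendly problems
    return r_acc + r_fmt + r_vis

print(trajectory_reward(correct=True, well_formatted=True,
                        used_code=True, code_ok=True, diagram_suitable=True))  # 1.3
```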

What breaks without each piece:

  • No executable diagrams: Visuals can be sloppy, so little trust; errors hide.
  • No multi-turn: The AI can’t fix mid-course; early slips persist.
  • No execution feedback: You can’t validate global structure; wrong-but-plausible paths survive.
  • No adaptive reward: The model might draw too often (slow, noisy) or too rarely (missed structure).
  • No trajectory-level RL: The AI won’t connect “I drew here” with “that led to the correct answer,” so it won’t learn the right pattern.

Concrete walkthrough (using the squares example from the paper):

  • Input: Two equal-area squares arranged so each center is a vertex of the other; find union/intersection area ratio.
  • Step A: Assume T=1, place one square axis-aligned at (0,0) and the other rotated 45° at (1/2,1/2).
  • Step B: Generate code to construct both polygons and compute exact intersection and union areas.
  • Step C: Read back: intersection = T/4, union = 7T/4, ratio = 7.
  • Step D: Try another T (e.g., T=4) to confirm constancy.
  • Output: Box the final answer 7.
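
If you want to replay Step B yourself, the short script below (our illustration, not the paper's code) follows Step A's placement and checks the claimed areas with exact polygon operations; using the shapely library is an assumption about tooling.

```python
# Illustrative re-run of the two-squares walkthrough with T = 1.
from shapely.geometry import Polygon
from shapely.affinity import rotate

# Square A: axis-aligned, area 1, centered at the origin.
A = Polygon([(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)])

# Square B: same area, rotated 45 degrees, centered on A's vertex (1/2, 1/2).
B = rotate(Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]), 45, origin=(0.5, 0.5))

inter = A.intersection(B).area      # ~0.25 -> T/4
union = A.union(B).area             # ~1.75 -> 7T/4
print(inter, union, union / inter)  # ratio -> 7.0 (up to floating-point error)
```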

The secret sauce:

  • Executable visual states close the loop between “I think” and “I can verify this exact structure.”
  • Adaptive rewards teach the model judgment: draw when it helps, skip when it doesn’t.
  • GRPO focuses on full-solution quality, letting the model discover stable drawing habits that boost accuracy.

Extra sandwich for GRPO: 🍞 Top Bread (Hook): Think of a study group where everyone tries a solution; you learn by seeing which attempt worked best.
🥬 GRPO (what): A way to train by comparing several complete solution attempts for the same question and preferring the better ones.
How it works: (1) Generate multiple attempts. (2) Score each. (3) Increase the chance of attempts that score above the group average.
Why it matters: Without comparison, learning from delayed rewards can be noisy and unstable.
🍞 Bottom Bread (Anchor): If the attempt that used a precise diagram got the right answer while others didn’t, the model shifts toward that behavior next time.
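
For readers who want the mechanics, the snippet below shows the standard group-relative advantage used in GRPO-style training: each attempt is scored against its own group's average. This is a generic sketch, not code from the paper.

```python
# Generic GRPO-style group-relative advantage (standard formulation, illustrative).
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    r = np.asarray(rewards, dtype=float)     # rewards for several attempts at one question
    return (r - r.mean()) / (r.std() + eps)  # above-average attempts get positive advantage

print(group_relative_advantages([1.3, 0.1, 1.0, 0.0]))  # best attempt gets the largest boost
```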

04 Experiments & Results

The test: The team measured how often the model answered correctly (pass@1) across seven tough math benchmarks. Why this matters: these sets (like AIME and BeyondAIME) include many problems where global structure and geometry bite—perfect for testing whether drawing inside the loop really helps.

The competition: FIGR was compared against (a) strong text-only models (with and without ‘thinking’ steps), (b) large vision-language models that don’t use executable drawings, (c) unified multimodal models that generate images during reasoning (but without strict constraints), (d) tool-augmented systems that operate on given images, and (e) a text-only RL baseline trained the same way but without diagrams.

The scoreboard (with context):

  • Average across seven benchmarks: FIGR hits 74.11% pass@1. That’s like raising your class average from a solid C+ to a strong B across many exams.
  • Compared to its own base model (Qwen3-VL-32B-Instruct), FIGR gains +6.77 points on average.
  • Big wins show up where structure matters most: AIME 2025 improves by +13.12 points, and BeyondAIME by +11.00 points—like jumping from a B- to an A on especially hard tests.
  • FIGR also edges past strong text-only ‘Thinking’ models and even a larger LVLM that lacks executable visual construction. That suggests the secret isn’t just size—it’s the quality of the intermediate states.

Surprising findings and ablations (what the team poked at):

  • Prompt engineering alone (teaching a style without training) helped a bit in places but was unstable; without learning, the model didn’t reliably decide when to draw.
  • Supervised fine-tuning on the math set actually hurt some scores, likely overfitting to seen patterns instead of learning robust strategies.
  • Text-only RL improved performance steadily, but adding executable visuals pushed it further, especially on geometry-heavy sets.
  • Feeding in passively generated images (from Bagel or Qwen-Image) didn’t help much. Pictures that aren’t executable and revisable don’t anchor global constraints, so errors sneak through.
  • Removing the Adaptive Reward Mechanism caused a ‘boom-and-bust’ pattern: early overuse of code then a collapse when the model learned drawing wasn’t consistently rewarded. With adaptive rewards on, code usage stayed healthy and purposeful.
  • Removing visual feedback (keeping only textual execution messages) also degraded performance: you need the actual figure to see global shape and catch structural slips.

Case study flavor: On a function-analysis problem, the text-only approaches fussed with boundary cases and could miss the big picture. FIGR plotted clean curves, saw where zeros live, and locked onto the right interval—showing that visual state can spotlight global truths that text alone might juggle less reliably.

Takeaway: Executable, controllable drawings inside the reasoning loop are not just a nice extra; they are a robust source of feedback that lifts accuracy, especially when the problem’s heart is spatial or structural.

05 Discussion & Limitations

Limitations:

  • Extra compute: Drawing figures and running code in multiple turns takes time and resources. For simple problems, that overhead may not pay off.
  • Best on structure-heavy tasks: When a problem is easy algebra or pure symbolic manipulation, diagrams add little and might distract.
  • Training complexity: End-to-end RL with delayed rewards is trickier to tune than straightforward supervised training.

Required resources:

  • A capable vision-language base model and a safe, sandboxed code runner.
  • Enough compute for multi-turn rollouts (several attempts per question) and for rendering figures.
  • A dataset with verifiable answers (so rewards are meaningful).

When NOT to use it:

  • Quick arithmetic or direct factual recall.
  • Problems where a diagram can’t represent the key constraints any better than equations already do.
  • Latency-sensitive settings where every millisecond counts and diagrams won’t change the outcome.

Open questions:

  • Smarter ‘when to draw’ prediction: Can we learn the suitability signal jointly instead of relying on an auxiliary classifier?
  • Faster loops: Can we compress diagrams or cache partial constructions to cut cost without losing precision?
  • Beyond math: How well do executable visual states transfer to robotics planning, scientific simulations, or programming tasks with dependency graphs?
  • Richer execution feedback: Besides pictures and text, can we integrate constraint solvers or formal checkers to guarantee properties (like exact tangency) even more robustly?

06 Conclusion & Future Work

Three-sentence summary: FIGR teaches AI to think by drawing, generating executable, editable diagrams inside its reasoning steps. An adaptive reward helps the model learn when drawing truly improves the final answer, leading to big gains on structure-heavy benchmarks. By closing the loop between symbolic thoughts and precise visuals, FIGR boosts both accuracy and reliability.

Main achievement: Turning diagrams into executable, controllable intermediate states—then training end-to-end so the model learns to use them purposefully.

Future directions: Make the reasoning–drawing loop faster and lighter, learn the ‘draw or not’ decision jointly, and extend the idea to fields like robotics planning, scientific modeling, and software design diagrams. Adding stronger constraint checking (e.g., geometric solvers) could further lock in global correctness.

Why remember this: FIGR shows that the path to better reasoning isn’t just longer text chains or bigger models—it’s better intermediate states. When the model’s sketchpad is precise and revisable, global truths become visible, and mistakes get caught early.

Practical Applications

  • Interactive math tutoring that draws exact figures, checks constraints (like tangency), and guides students to correct, consistent constructions.
  • Geometry problem solvers for competitions (AIME/AMC) that validate global structure before finalizing answers.
  • STEM education tools that teach ‘draw to think’ by showing how precise diagrams reveal hidden mistakes.
  • Design and CAD helpers that construct and verify layout constraints (parallelism, contact, clearances) during ideation.
  • Robotics/path-planning assistants that render simplified, executable maps to check reachability and collisions while planning.
  • Scientific analysis notebooks that interleave reasoning with executable plots/diagrams to confirm global patterns.
  • Data storytelling dashboards that auto-generate explainable, consistent visuals as part of analytic reasoning.
  • Code review aids for algorithms that convert abstract steps into executable flow diagrams to catch logic gaps.
  • Physics problem assistants that render force diagrams or trajectories to ensure all constraints are simultaneously satisfied.
  • Curriculum tools that diagnose when drawing will help a student and nudge them to sketch (or not) accordingly.
#executable visual states #diagrammatic reasoning #reinforcement learning for reasoning #adaptive reward mechanism #multi-turn reasoning #end-to-end RL #geometric consistency #visual chain-of-thought #code-as-diagram #GRPO #AIME benchmark #BeyondAIME #DeepMath-103K #vision-language model #reasoning with execution