Thinking with Drafting: Optical Decompression via Logical Reconstruction
Key Summary
- The paper fixes a common problem in AI: models can read pictures and text well, but they often mess up the logic behind them.
- It treats reasoning like "optical decompression": turning a messy picture or word problem back into clean, step-by-step logic.
- The key idea, called Thinking with Drafting (TwD), makes the AI write a tiny, special code (a DSL) that draws an exact diagram it can check.
- Instead of guessing answers, the model drafts its thoughts into code, renders a precise diagram, and uses that to verify and correct itself.
- A new benchmark, VisAlg, tests whether models can rebuild the hidden logical structure of visual algebra problems.
- TwD uses simple building blocks (lines, aligners, braces) plus a virtual grid to separate logic from pixels, so drawings become mathematically exact.
- On the VisAlg test, an 8B model trained with TwD beats strong proprietary systems, showing structure-first reasoning really helps.
- This creates a closed loop: parse → draft → render → verify → answer, where the drawing is not art but a proof you can check.
- The method reduces hallucinations, prevents answer leakage, and keeps relations like equality, parts, and transfers consistent.
- It matters for everyday tasks like school math, forms, tables, and charts: anywhere clear structure wins over pretty pictures.
Why This Research Matters
We often rely on diagrams, forms, and charts to make real decisions, so getting the structure right matters more than making a picture look nice. TwD forces AI systems to show their work in a small, checkable language, which makes mistakes easier to spot and fix. This helps students learn math with trustworthy diagrams, helps offices process documents without mixing up totals and parts, and helps scientists keep relationships exact in visualizations. By turning drawings into proofs, we reduce hallucinations and hidden errors. The approach is efficient, too: a compact model can outperform much larger ones when it drafts and verifies. Over time, this could become the standard way AI handles any problem where structure and precision beat surface appearance.
Detailed Explanation
01 Background & Problem Definition
You know how you can read a page from a book and copy every word perfectly, but still not understand the story? That's a bit like today's AI. Modern systems can read the words and numbers from images (like worksheets or receipts) very well, and they can even draw helpful pictures while thinking. But when the problem needs exact logic, like precise equalities, ratios, or what belongs to whom, they often slip.
🍞 Top Bread (Hook): Imagine copying every sticker on a puzzle box but not knowing how the pieces fit together. 🥬 The Concept (OCR): What it is: Optical Character Recognition (OCR) reads text from images very accurately. How it works: 1) It looks at an image, 2) spots characters and numbers, 3) outputs the exact text. Why it matters: Without OCR, the AI wouldn't even know the words or numbers to start with. 🍞 Bottom Bread (Anchor): Scanning a homework sheet and getting "Bunny has 3, Mom gives 6" exactly right.
But copying text isn't enough. The number "12" might mean a total, a leftover, or a difference, depending on where it belongs. AI needs the map of relations, not just the labels on the map.
🍞 Top Bread (Hook): You know how cooking recipes list steps in order so you don't mix things up? 🥬 The Concept (Chain-of-Thought, CoT): What it is: CoT is writing down reasoning steps in plain language. How it works: 1) Break the problem into little steps, 2) write each step, 3) get the answer. Why it matters: Without steps, the AI jumps to guesses. But plain language can still be fuzzy about shapes, alignments, and exact equalities. 🍞 Bottom Bread (Anchor): "First add 6 to bunny, then compare with mom" sounds clear, but it doesn't force exact lengths or positions.
Recently, some models began "thinking with images," drawing sketches during reasoning. That's helpful, but pixels can look right and still be wrong mathematically. A drawn line might look a bit longer, but not be exactly twice as long.
This creates a precision paradox: inputs are read with high fidelity, and outputs look plausible, yet the hidden logical structure (who equals whom, what's 3 times what, what was transferred) is shaky.
What was missing? A strict middle layer that tells the model: "Don't just talk about your thinking; build it in a tiny, checkable language that draws an exact diagram." That way, every quantity and relation snaps to rules like Lego bricks.
🍞 Top Bread (Hook): Imagine building with Lego where each brick only fits in places that make sense, so your castle can't collapse. 🥬 The Concept (DSL): What it is: A Domain-Specific Language is a small, special language for one job (here: visual algebra). How it works: 1) Define objects as bars, 2) align shared points, 3) add braces for parts and totals, 4) render a precise diagram, 5) verify it. Why it matters: Without a DSL, drawings can be pretty but imprecise; with a DSL, they must follow the rules. 🍞 Bottom Bread (Anchor): `HL "Bunny" 3 6` means bunny's bar has a 3-length part plus a 6-length add-on, exactly.
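Because the DSL is tiny, its statements can be parsed mechanically. Here is a minimal sketch of reading one HL statement into structure; only the `HL "Bunny" 3 6` surface form comes from the article, and the Python representation is an assumption for illustration:

```python
import re

def parse_hl(line):
    """Parse an HL statement like 'HL "Bunny" 3 6' into (name, segment lengths).

    Positive lengths are solid (existing) segments; negative lengths are
    dashed (removed/imagined) segments, per the status-aware convention.
    """
    m = re.match(r'HL\s+"([^"]+)"((?:\s+-?\d+)+)\s*$', line)
    if m is None:
        raise ValueError(f"not a valid HL statement: {line!r}")
    name = m.group(1)
    segments = [int(tok) for tok in m.group(2).split()]
    return name, segments

# Bunny's bar: a 3-length part plus a 6-length add-on.
name, segs = parse_hl('HL "Bunny" 3 6')
```

A parser like this is what turns loose prose into objects a renderer and verifier can act on deterministically.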
The paper's big move is to treat reasoning as optical decompression: if OCR compresses the page into tokens, reasoning should decompress those tokens back into an explicit, executable structure. The authors build a closed loop (parse → draft code → render → check → answer) so the diagram becomes a proof, not just a sketch. They also create VisAlg, a test that checks if models can truly recover the hidden structure behind word problems. The stakes are real: school math, forms, spreadsheets, and scientific charts all depend on exact relations, not just readable text or pretty pictures.
02 Core Idea
Aha! Moment in one sentence: Don't just say your reasoning; draft it in a tiny language that forces exact diagrams, so the picture itself can prove your logic.
🍞 Top Bread (Hook): You know how architects use blueprints, not just speeches, to make buildings safe? 🥬 The Concept (Thinking with Drafting, TwD): What it is: TwD makes the model write a small program (DSL) that draws a provable diagram before answering. How it works: 1) Parse the problem, 2) draft DSL code capturing objects and relations, 3) render an exact diagram, 4) verify for conflicts, 5) refine or answer. Why it matters: Without drafting, models can hallucinate; with drafting, logic is anchored in checkable structure. 🍞 Bottom Bread (Anchor): For the bunny-and-mom carrots, TwD builds two bars, adds a +6 transfer, and enforces "mom after = 3× bunny after" with alignment lines, then solves.
Explain the same idea three ways:
- Blueprint analogy: A building plan (DSL) must obey strict rules; if a beam is out of place, the inspector (verifier) catches it before construction (final answer).
- Lego analogy: Pieces (entities, relations) click only at allowed studs (boundaries). If it doesn't click, it doesn't ship.
- Recipe analogy: Ingredients (numbers) are measured into bowls (bars/segments). Braces and aligners are the measuring spoons. No spooning = no baking.
Before vs After:
- Before: Models read well and talk well, but logical shapes float: ratios, equalities, and transfers might be off by a few pixels or a vague phrase.
- After: Models must express every relation as DSL code, which renders to an exact diagram. If it doesn't align, it's wrong, and they fix it before answering.
Why it works (intuition, no equations):
- Language is flexible (good for fluency) but too fuzzy for geometry. Pixels are pretty (good for showing) but too slippery for exact math. A tiny program is both expressive and strict. It forces the model to pick explicit pieces (where is the shared boundary, what's the unit, what got moved out) and then locks them together so the renderer can check.
Building blocks (each is introduced with sandwich-style clarity):
🍞 Hook: Think of a bookshelf where each shelf is a person's total and each book is a part. 🥬 DSL Entity Primitive (HL): What it is: HL draws a horizontal bar with subsegments for parts. How it works: 1) Name the bar (e.g., "Bunny"), 2) list segment lengths (positive = solid, negative = dashed for removed or pretend bits), 3) place it in a row. Why it matters: Without parts, you can't see totals or changes. 🍞 Anchor: `HL "Bunny" 3 6` means Bunny has 3 now and gets +6.
🍞 Hook: You know how rulers help line up the same edge on two objects? 🥬 DSL Relational Primitive (VL): What it is: VL is a vertical alignment line marking a shared boundary across rows. How it works: 1) Choose an x-position, 2) connect the rows that share it, 3) enforce equality or comparison anchors. Why it matters: Without VLs, "equal after" isn't pinned to any exact place. 🍞 Anchor: Marking where Mom's remaining equals 3× Bunny's after amount.
🍞 Hook: Like curly braces in writing that group words together. 🥬 DSL Aggregation Primitives (HB/VB): What it is: HB (horizontal brace) groups parts within one bar; VB (vertical brace) groups across multiple bars (totals or comparisons). How it works: 1) Pick span start and end, 2) add a label, 3) place above or below. Why it matters: Without braces, part-whole and multi-object sums stay hidden. 🍞 Anchor: HB "Total ?" over Bunny's whole bar; VB "Peach + Pear total" across two rows.
🍞 Hook: Grids in graph paper keep drawings neat. 🥬 Virtual Grid System: What it is: A discrete logic grid that places objects by rows and boundaries by order, not by raw pixels. How it works: 1) Assign each bar a row, 2) place boundaries by logical steps, 3) renderer turns it into exact coordinates. Why it matters: Without it, the model must guess pixel positions and drifts off. 🍞 Anchor: "Row 1: Bunny; Row 2: Mom; shared boundary at x=9."
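The grid idea can be sketched in a few lines: the model commits only to a row index and a list of logical segment lengths, and a deterministic mapping produces exact coordinates. The unit size and row height below are illustrative assumptions, not values from the paper:

```python
UNIT = 20   # pixels per logical unit (illustrative assumption)
ROW_H = 40  # vertical spacing per row (illustrative assumption)

def grid_to_coords(row, segments):
    """Turn (row index, logical segment lengths) into exact pixel geometry.

    Returns the bar's y-position and the x-coordinates of every segment
    boundary. Dashed (negative) segments still occupy grid space.
    """
    y = row * ROW_H
    x = 0
    boundaries = [x]
    for length in segments:
        x += abs(length) * UNIT
        boundaries.append(x)
    return y, boundaries

# Row 0: Bunny with parts 3 and 6 -> boundaries at exact positions.
y, xs = grid_to_coords(0, [3, 6])
```

Because the model only chooses rows and logical lengths, it never has to guess continuous pixel positions; alignment falls out of the mapping.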
🍞 Hook: Printers repeat the same file into the same picture every time. 🥬 Deterministic Rendering: What it is: A strict engine that converts valid DSL into a canonical diagram with macros for common patterns. How it works: 1) Check code, 2) draw exactly, 3) auto-add helpful marks for comparisons/ratios. Why it matters: Without determinism, verification can't be trusted. 🍞 Anchor: The same DSL always renders to the same, aligned bar model.
🍞 Hook: Teachers say "show your work" so they can spot mistakes. 🥬 Optical Decompression: What it is: Rebuilding hidden logical structure from text and image into explicit code that can be drawn and checked. How it works: 1) Parse entities and relations, 2) write DSL, 3) render and verify, 4) fix and answer. Why it matters: Without decompression, the logic stays trapped inside words and pixels. 🍞 Anchor: From "3 more than" in text to an offset segment and alignment line in the diagram.
Finally, TwD treats the draft as the thinking engine: by forcing structure early, errors show up visually and can be corrected before the answer is chosen.
03 Methodology
At a high level: Input (image + question) → Parse to DSL (draft code) → Render (exact diagram) → Verify & Refine (closed loop) → Final answer.
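The pipeline above can be sketched as a closed loop with a retry budget. Every function here (`parse`, `render`, `verify`, `refine`, `answer`) is a hypothetical stand-in for the paper's components, wired together only to show the control flow:

```python
def thinking_with_drafting(problem, parse, render, verify, refine, answer,
                           max_rounds=3):
    """Parse -> render -> verify -> refine loop; answer once the draft passes.

    All callables are caller-supplied stand-ins; max_rounds bounds the
    number of refinement attempts before answering best-effort.
    """
    dsl = parse(problem)
    for _ in range(max_rounds):
        diagram = render(dsl)
        ok, issues = verify(dsl, diagram)
        if ok:
            return answer(dsl, diagram)   # draft verified: answer on top of it
        dsl = refine(dsl, issues)         # fix flagged issues and re-render
    return answer(dsl, render(dsl))       # budget exhausted: best effort

# Toy demo: verification fails on draft 0, passes after one refinement.
result = thinking_with_drafting(
    "toy problem",
    parse=lambda p: 0,
    render=lambda d: ("diagram", d),
    verify=lambda d, img: (d >= 1, ["misaligned brace"]),
    refine=lambda d, issues: d + 1,
    answer=lambda d, img: d,
)
```

The key design choice the sketch captures: answering is gated on verification, so failed checks loop back into the draft instead of leaking into the final answer.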
Step 1: Optical Decompression via Logical Parsing
- What happens: The model reads the problem image and text, identifies objects (people, boxes, sets), quantities (numbers, ratios), and relations (equal, more than, transfer), then drafts a first version of the DSL.
- Why it exists: If the model jumps straight to an answer, it can hide logical mistakes. Drafting exposes its beliefs as code.
- Example: "Bunny has 3; Mom gives 6; Mom after is 3× Bunny after" becomes `HL "Bunny" 3 6`, `HL "Mom" 33 -6`, plus a VL that anchors "3× after".
🍞 Hook: Like coloring solid blocks for what you have, and dotted outlines for pretend or removed parts. 🥬 Status-Aware Segments: What it is: Positive numbers draw solid (existing) segments; negative numbers draw dashed (removed/imagined) segments. How it works: 1) Use solid for current amounts, 2) dashed for what was taken or temporarily added, 3) keep them separate in order. Why it matters: Without this, before/after and transfers get confused. 🍞 Anchor: `HL "Mom" 9 9 9 -6` means Mom had 27, then 6 got given away (dashed).
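The solid/dashed convention is easy to make explicit in code. A minimal sketch, assuming segment lists like those in the HL examples (the function name is hypothetical):

```python
def segment_styles(segments):
    """Map each segment length to its drawing status.

    Positive lengths render solid (existing amounts); negative lengths
    render dashed (removed or imagined amounts). The magnitude is kept
    for layout.
    """
    return [("solid" if s > 0 else "dashed", abs(s)) for s in segments]

# HL "Mom" 9 9 9 -6: Mom had 27, then 6 got given away (dashed).
styles = segment_styles([9, 9, 9, -6])
```

Keeping status separate from magnitude lets the renderer draw before/after states without ever confusing what exists with what was moved.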
Step 2: Drafting and Rendering on a Virtual Grid
- What happens: The DSL is rendered by a deterministic engine onto a virtual grid that uses rows and boundary order instead of raw pixel guesses. Common templates (like comparisons) automatically add clean braces and aligners.
- Why it exists: It separates logic from drawing, so even a small model can place things perfectly without wrestling with continuous coordinates.
- Example data: If a comparison says "A is 12 more than B," a macro adds a dashed offset on A and a VL pinning the shared start.
Step 3: Verification Loop (Closed-Loop Reasoning)
- What happens: The system (or an LLM-based judge) checks syntax validity, visual completeness, and logical consistency: Do braces end exactly at boundaries? Are all givens labeled in visible text? Do transfers subtract then add? Any answer leakage in labels? If a check fails, the model revises the DSL and re-renders.
- Why it exists: This is the safety net. It turns the diagram into a proof that either passes or points to the exact place that's wrong.
- Example: If a VB (vertical brace) tries to group a single object, the checker fails it and forces a fix.
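A verifier rule like the one just described is a small predicate. As an illustrative sketch (the function name and message are assumptions, not the paper's implementation), the single-object VB check might look like:

```python
def check_vb_span(start_row, end_row):
    """A VB (vertical brace) must group at least two bars, i.e. span
    at least two distinct rows. Returns (passed, message)."""
    if end_row - start_row < 1:
        return False, "VB groups a single object; widen the span or use HB"
    return True, "ok"

ok, msg = check_vb_span(2, 2)   # illegal: brace spans one row only
```

Each failed check names the exact offending construct, which is what lets the model revise the DSL in a targeted way instead of redrafting blindly.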
Step 4: DSL-Conditioned Answering
- What happens: Once the diagram is verified, the model reads its own draft to compute values (totals, differences, reversals) and outputs the final answer.
- Why it exists: Now the math happens on top of a guaranteed-correct structure, reducing hallucinations.
- Example: With Bunny_now = 3+6 and Mom_now = 33-6 and "Mom_now = 3× Bunny_now", the answer drops out cleanly.
Concrete, small data walk-through (change & revert):
- Input: "Bunny pulled 3 carrots. Mom says: If I give you 6, I'll have 3 times what you have. How many did Mom pull?"
- Parse: `HL "Bunny" 3 6`; `HL "Mom" 33 -6`; VL pins "3× after".
- Render: Two aligned bars; a +6 on Bunny; a -6 (dashed) on Mom; braces marking totals.
- Verify: All labels present (3, give 6, 3 times), no leakage of final counts, braces hit boundaries, transfer is paired -6/+6.
- Answer: Compute after-states and solve. The diagram enforces the math.
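The arithmetic the verified diagram enforces can be checked directly; each line below mirrors one relation pinned by the bars and the VL:

```python
bunny_before, given = 3, 6              # "Bunny pulled 3"; "If I give you 6"
bunny_after = bunny_before + given      # Bunny's bar after the +6 transfer
mom_after = 3 * bunny_after             # VL anchors: Mom after = 3 x Bunny after
mom_before = mom_after + given          # undo the -6 (dashed) transfer

# The diagram's paired -6/+6 transfer and the 3x alignment force Mom's
# original count to 33, matching HL "Mom" 33 -6 in the draft.
```

Because every quantity is read off a verified structure, the final computation is bookkeeping rather than guessing.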
The Secret Sauce:
- The draft is the reasoning engine: By forcing code first, the model must commit to exact boundaries and relations, so weak, fuzzy language can't slip through.
- Virtual grid decoupling: The model never guesses pixels; it only picks logical order and groupings. The renderer guarantees neat geometry.
- Deterministic macros: Common patterns (sum & split, more/less by t, ratios via equal units) render the same, canonical way, which stabilizes training and checking.
- Visible-text-only policy: Givens and unknowns must appear as labels; numbers alone don't count. This stops hidden info and answer leakage.
- Pairing rules for transfers: Every "give t" is a -t/+t pair across rows. If you only see one side, the verifier flags it.
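The pairing rule can be sketched as a heuristic check: every transfer-out (negative segment) of size t must have a matching +t segment in some other bar. This is an assumption about how such a rule might be coded, not the paper's exact verifier:

```python
def check_transfer_pairs(bars):
    """Heuristic pairing check over bars: dict of name -> segment lengths.

    Every negative segment (a transfer out) of size t must be matched by
    a +t segment in some other bar; an unpaired side fails verification.
    """
    for name, segs in bars.items():
        for s in segs:
            if s < 0:
                paired = any(t == -s
                             for other, osegs in bars.items() if other != name
                             for t in osegs)
                if not paired:
                    return False, f"unpaired transfer {s} in {name}"
    return True, "ok"

# Mom's -6 is matched by Bunny's +6, so the draft passes.
ok, msg = check_transfer_pairs({"Bunny": [3, 6], "Mom": [33, -6]})
```

A real verifier would also check ordering and labels, but even this simple predicate catches one-sided transfers before they poison the answer.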
Putting it together like a recipe:
- Input → Identify entities and relations → Write HL/VL/HB/VB code with status-aware segments → Render on the grid → Check alignment, completeness, compliance → Fix if needed → Read the verified diagram → Compute the answer. The loop makes the picture a self-checking proof.
04 Experiments & Results
The Test: The authors build VisAlg, a benchmark focused on whether models can recover the exact logical topology behind bar-model word problems. Each item includes a problem image and a ground-truth DSL. The test checks both the code (does your DSL match?) and the image (does your rendering align?), plus a judge's semantic checks.
🍞 Top Bread (Hook): Like grading both a student's steps and their final diagram, not just the final number. 🥬 The Concept (VisAlg): What it is: A structured test for logic-aware visual algebra. How it works: 1) Collect bar-model problems, 2) generate and refine DSL drafts, 3) filter with a strict verifier calibrated to human experts, 4) evaluate models on code similarity, image similarity, and semantic checks. Why it matters: Without a test that cares about structure, models can pass with pretty but wrong pictures. 🍞 Bottom Bread (Anchor): If your brace endpoints float between segments, VisAlg dings you, even if your final number is right.
What they measured and why:
- Code similarity (BLEU/ROUGE-L/chrF): Does your DSL resemble the ground truth? chrF is the main code metric because it handles mixed symbols and numbers well.
- Image similarity (LPIPS/SSIM/PSNR): Does your rendered diagram match structurally? SSIM is the main image metric since it's sensitive to shapes and edges.
- LLM-as-judge scores: Alignment, information coverage, numerical consistency, semantic compliance, and answer leakage (0-1 each; average is used). This catches deeper semantic mistakes.
- Main composite score: Average of chrF, SSIM, and the LLM judge score.
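As described, the main score is a plain average of the three components. A one-line sketch (assuming all three are already on a common 0-100 scale, which is an assumption about the benchmark's scaling):

```python
def composite_score(chrf, ssim, judge):
    """VisAlg main score as described: the unweighted mean of the chrF
    code-similarity score, the SSIM image-similarity score, and the
    LLM-judge score, all assumed to share one scale."""
    return (chrf + ssim + judge) / 3

score = composite_score(90, 80, 70)
```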
The Competition:
- Proprietary models: GPT-5.1, GPT-4o, Claude-4, Gemini-3-Pro, Gemini-2.5-Pro.
- Open-weight baselines: InternVL3-8B, InternVL2.5-8B, Intern-S1-mini, Mimo-VL-7B-RL, Qwen3-VL-8B.
- TwD model: Starts from Qwen3-VL-8B and is supervised on VisAlg.
The Scoreboard (with context):
- TwD hits an overall 82.63. That's like scoring an A when other strong students get high Bs: Gemini-3-Pro at 79.96 and Gemini-2.5-Pro at 74.12.
- Open baselines mostly stay below 55, showing they struggle to produce valid, consistent DSL and clean diagrams without structure-first training.
- Biggest TwD gains are in structural fidelity: better code alignment and diagram alignment, strong info coverage, and minimal answer leakage. The remaining headroom is in numerical consistency against top proprietary models.
Schema-wise performance:
- Five types: proportional distribution, rate & percentage, change & revert, sum & split, difference analysis.
- TwD's advantages are most visible in structure-heavy cases (proportional distribution, difference analysis) where exact units and boundaries matter most.
- Performance is steady across types, suggesting the DSL-and-verifier loop generalizes.
Surprising findings:
- Human vs LLM-judge agreement is very high (r ≈ 0.96), so the automated verifier is a reliable stand-in for experts.
- On advanced set problems with multiple overlaps (A∩B∩C, etc.), some frontier models compute totals correctly but draw illegal intersections (topological hallucinations). TwD preserves legal overlaps by decomposing intersections explicitly as geometric pieces, keeping semantics and geometry aligned.
Bottom line: Treating drafting as thinking, then verifying the draft, yields better, more trustworthy reasoning than text-only steps or pixel-only drawings.
05 Discussion & Limitations
Limitations:
- The DSL is tuned for bar-model visual algebra (linear bars, shared boundaries, braces). It doesn't yet cover every scientific diagram (e.g., complex graphs, physics free-body diagrams, circuit schematics). Extending the grammar and macros will be needed.
- The system relies on a strict visible-text-only policy and rule checks; domains with ambiguous labels or free-form sketches may require looser but still safe rules.
- While TwD reduces hallucination, it can still draft a syntactically valid but semantically wrong structure if the parsing step misreads the problem.
Required resources:
- A multimodal model that can parse images and text, trained with examples of DSL drafting.
- The deterministic renderer and the verifier (LLM-based or rule-based) to enforce alignment, completeness, and compliance.
- Moderate compute for fine-tuning and evaluating on VisAlg (the paper used an 8-GPU node; the method itself is parameter-efficient relative to frontier models).
When NOT to use it:
- Tasks where geometry is not the right backbone (e.g., free-form creative art) or where relations cannot be cleanly represented by bars, braces, or aligners.
- Real-time scenarios with ultra-low latency requirements where rendering and verification loops would be too slow.
- Problems where the input has no stable, shared boundaries (e.g., vague stories with no numeric anchors) may not benefit from the DSL structure.
Open questions:
- How to expand the DSL family to cover curves, angles, forces, or network flows while keeping the language small and checkable.
- How to combine TwD with program-of-thought for algebraic solving and formal proof systems, bridging diagrams and equations more tightly.
- How to train models to choose the right visual grammar automatically (bar model vs. Venn vs. timeline) based on the problem type.
- How to teach partial-credit verification (diagnose what's fixable automatically) to reduce human oversight even further.
- How to incorporate uncertainty: can the draft carry confidence tags and trigger targeted re-checks when confidence is low?
06 Conclusion & Future Work
In three sentences: This paper turns visual reasoning into optical decompression, rebuilding hidden logic from images and text into a tiny, executable diagram language. Its method, Thinking with Drafting (TwD), forces the model to "show its work" by drafting code, rendering a precise diagram, and verifying it before answering. On the VisAlg benchmark, this structure-first approach lets a compact 8B model outperform strong proprietary systems on logic-aware visual algebra.
Main achievement: Proving that a minimalist DSL plus a closed drafting-rendering-verification loop is a powerful cognitive scaffold that upgrades fuzzy reasoning into checkable structure.
Future directions: Broaden the DSL to cover more diagram types (sets, graphs, physics, circuits), connect drafts to symbolic solvers and formal verifiers, and build automatic "grammar selection" so the model picks the best visual language for each problem.
Why remember this: TwD shows that the path to trustworthy multimodal AI isn't just better reading or prettier pictures; it's making models commit to explicit, verifiable structures. When the drawing becomes a proof, answers become safer, clearer, and easier to teach and check.
Practical Applications
- Math tutoring that generates bar-model diagrams with exact alignments and no answer leakage.
- Automated grading of student diagrams, checking bracket endpoints, labels, and transfers for validity.
- Form and invoice parsing that reconstructs totals, parts, and discounts as verifiable structures before summing.
- Data dashboard generation where proportions and comparisons are enforced by alignment rules, not guesswork.
- Interactive problem solvers that let users edit the DSL draft and instantly re-verify the logic.
- Accessible learning tools that convert word problems into clean bar models, highlighting unknowns explicitly.
- Set and probability problems rendered as legal Venn-style partitions with verified overlaps.
- Compliance checks in reports to ensure no diagrams embed final answers in labels (preventing leakage).
- Curriculum design that teaches students to move from text to structured diagrams using a simple DSL.
- Agent systems that pick a visual grammar (bars, sets) and verify structure before calling a calculator.