Thinking with Drafting: Optical Decompression via Logical Reconstruction
Key Summary
- The paper fixes a common problem in AI: models can read pictures and text well, but they often mess up the logic behind them.
- It treats reasoning like "optical decompression": turning a messy picture or word problem back into clean, step-by-step logic.
- The key idea, called Thinking with Drafting (TwD), makes the AI write a tiny, special code (a DSL) that draws an exact diagram it can check.
- Instead of guessing answers, the model drafts its thoughts into code, renders a precise diagram, and uses that to verify and correct itself.
- A new benchmark, VisAlg, tests whether models can rebuild the hidden logical structure of visual algebra problems.
- TwD uses simple building blocks (lines, aligners, braces) plus a virtual grid to separate logic from pixels, so drawings become mathematically exact.
- On the VisAlg test, an 8B model trained with TwD beats strong proprietary systems, showing structure-first reasoning really helps.
- This creates a closed loop: parse → draft → render → verify → answer, where the drawing is not art but a proof you can check.
- The method reduces hallucinations, prevents answer leakage, and keeps relations like equality, parts, and transfers consistent.
- It matters for everyday tasks like school math, forms, tables, and charts: anywhere clear structure wins over pretty pictures.
Why This Research Matters
We often rely on diagrams, forms, and charts to make real decisions, so getting the structure right matters more than making a picture look nice. TwD forces AI systems to show their work in a small, checkable language, which makes mistakes easier to spot and fix. This helps students learn math with trustworthy diagrams, helps offices process documents without mixing up totals and parts, and helps scientists keep relationships exact in visualizations. By turning drawings into proofs, we reduce hallucinations and hidden errors. The approach is efficient, too: a compact model can outperform much larger ones when it drafts and verifies. Over time, this could become the standard way AI handles any problem where structure and precision beat surface appearance.
Detailed Explanation
01 Background & Problem Definition
You know how you can read a page from a book and copy every word perfectly, but still not understand the story? That's a bit like today's AI. Modern systems can read the words and numbers from images (like worksheets or receipts) very well, and they can even draw helpful pictures while thinking. But when the problem needs exact logic, like precise equalities, ratios, or what belongs to whom, they often slip.
🍞 Top Bread (Hook): Imagine copying every sticker on a puzzle box but not knowing how the pieces fit together. 🥬 The Concept (OCR): What it is: Optical Character Recognition (OCR) reads text from images very accurately. How it works: 1) It looks at an image, 2) spots characters and numbers, 3) outputs the exact text. Why it matters: Without OCR, the AI wouldn't even know the words or numbers to start with. 🍞 Bottom Bread (Anchor): Scanning a homework sheet and getting "Bunny has 3, Mom gives 6" exactly right.
But copying text isn't enough. The number "12" might mean a total, a leftover, or a difference, depending on where it belongs. AI needs the map of relations, not just the labels on the map.
🍞 Top Bread (Hook): You know how cooking recipes list steps in order so you don't mix things up? 🥬 The Concept (Chain-of-Thought, CoT): What it is: CoT is writing down reasoning steps in plain language. How it works: 1) Break the problem into little steps, 2) write each step, 3) get the answer. Why it matters: Without steps, the AI jumps to guesses. But plain language can still be fuzzy about shapes, alignments, and exact equalities. 🍞 Bottom Bread (Anchor): "First add 6 to bunny, then compare with mom" sounds clear, but it doesn't force exact lengths or positions.
Recently, some models began "thinking with images," drawing sketches during reasoning. That's helpful, but pixels can look right and still be wrong mathematically. A drawn line might look a bit longer, but not be exactly twice as long.
This creates a precision paradox: inputs are read with high fidelity, and outputs look plausible, yet the hidden logical structure (who equals whom, what's 3 times what, what was transferred) is shaky.
What was missing? A strict middle layer that tells the model: "Don't just talk about your thinking; build it in a tiny, checkable language that draws an exact diagram." That way, every quantity and relation snaps to rules like Lego bricks.
🍞 Top Bread (Hook): Imagine building with Lego where each brick only fits in places that make sense, so your castle can't collapse. 🥬 The Concept (DSL): What it is: A Domain-Specific Language is a small, special language for one job (here: visual algebra). How it works: 1) Define objects as bars, 2) align shared points, 3) add braces for parts and totals, 4) render a precise diagram, 5) verify it. Why it matters: Without a DSL, drawings can be pretty but imprecise; with a DSL, they must follow the rules. 🍞 Bottom Bread (Anchor): `HL "Bunny" 3 6` means bunny's bar has a 3-length part plus a 6-length add-on, exactly.
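Because the DSL is tiny, its statements can be parsed mechanically. Here is a minimal sketch of reading one HL statement into structure; only the `HL "Bunny" 3 6` surface form comes from the article, and the Python representation is an assumption for illustration:

```python
import re

def parse_hl(line):
    """Parse an HL statement like 'HL "Bunny" 3 6' into (name, segment lengths).

    Positive lengths are solid (existing) segments; negative lengths are
    dashed (removed/imagined) segments, per the status-aware convention.
    """
    m = re.match(r'HL\s+"([^"]+)"((?:\s+-?\d+)+)\s*$', line)
    if m is None:
        raise ValueError(f"not a valid HL statement: {line!r}")
    name = m.group(1)
    segments = [int(tok) for tok in m.group(2).split()]
    return name, segments

# Bunny's bar: a 3-length part plus a 6-length add-on.
name, segs = parse_hl('HL "Bunny" 3 6')
```

A parser like this is what turns loose prose into objects a renderer and verifier can act on deterministically.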
The paper's big move is to treat reasoning as optical decompression: if OCR compresses the page into tokens, reasoning should decompress those tokens back into an explicit, executable structure. The authors build a closed loop (parse → draft code → render → check → answer) so the diagram becomes a proof, not just a sketch. They also create VisAlg, a test that checks if models can truly recover the hidden structure behind word problems. The stakes are real: school math, forms, spreadsheets, and scientific charts all depend on exact relations, not just readable text or pretty pictures.
02 Core Idea
Aha! Moment in one sentence: Don't just say your reasoning; draft it in a tiny language that forces exact diagrams, so the picture itself can prove your logic.
🍞 Top Bread (Hook): You know how architects use blueprints, not just speeches, to make buildings safe? 🥬 The Concept (Thinking with Drafting, TwD): What it is: TwD makes the model write a small program (DSL) that draws a provable diagram before answering. How it works: 1) Parse the problem, 2) draft DSL code capturing objects and relations, 3) render an exact diagram, 4) verify for conflicts, 5) refine or answer. Why it matters: Without drafting, models can hallucinate; with drafting, logic is anchored in checkable structure. 🍞 Bottom Bread (Anchor): For the bunny-and-mom carrots, TwD builds two bars, adds a +6 transfer, and enforces "mom after = 3× bunny after" with alignment lines, then solves.
Explain the same idea three ways:
- Blueprint analogy: A building plan (DSL) must obey strict rules; if a beam is out of place, the inspector (verifier) catches it before construction (final answer).
- Lego analogy: Pieces (entities, relations) click only at allowed studs (boundaries). If it doesn't click, it doesn't ship.
- Recipe analogy: Ingredients (numbers) are measured into bowls (bars/segments). Braces and aligners are the measuring spoons. No spooning = no baking.
Before vs After:
- Before: Models read well and talk well, but logical shapes float: ratios, equalities, and transfers might be off by a few pixels or a vague phrase.
- After: Models must express every relation as DSL code, which renders to an exact diagram. If it doesn't align, it's wrong, and they fix it before answering.
Why it works (intuition, no equations):
- Language is flexible (good for fluency) but too fuzzy for geometry. Pixels are pretty (good for showing) but too slippery for exact math. A tiny program is both expressive and strict. It forces the model to pick explicit pieces (where is the shared boundary, what's the unit, what got moved out) and then locks them together so the renderer can check.
Building blocks (each is introduced with sandwich-style clarity):
🍞 Hook: Think of a bookshelf where each shelf is a person's total and each book is a part. 🥬 DSL Entity Primitive (HL): What it is: HL draws a horizontal bar with subsegments for parts. How it works: 1) Name the bar (e.g., "Bunny"), 2) list segment lengths (positive = solid, negative = dashed for removed or pretend bits), 3) place it in a row. Why it matters: Without parts, you can't see totals or changes. 🍞 Anchor: `HL "Bunny" 3 6` means Bunny has 3 now and gets +6.
🍞 Hook: You know how rulers help line up the same edge on two objects? 🥬 DSL Relational Primitive (VL): What it is: VL is a vertical alignment line marking a shared boundary across rows. How it works: 1) Choose an x-position, 2) connect the rows that share it, 3) enforce equality or comparison anchors. Why it matters: Without VLs, "equal after" isn't pinned to any exact place. 🍞 Anchor: Marking where Mom's remaining equals 3× Bunny's after amount.
🍞 Hook: Like curly braces in writing that group words together. 🥬 DSL Aggregation Primitives (HB/VB): What it is: HB (horizontal brace) groups parts within one bar; VB (vertical brace) groups across multiple bars (totals or comparisons). How it works: 1) Pick span start and end, 2) add a label, 3) place above or below. Why it matters: Without braces, part-whole and multi-object sums stay hidden. 🍞 Anchor: HB "Total ?" over Bunny's whole bar; VB "Peach + Pear total" across two rows.
🍞 Hook: Grids in graph paper keep drawings neat. 🥬 Virtual Grid System: What it is: A discrete logic grid that places objects by rows and boundaries by order, not by raw pixels. How it works: 1) Assign each bar a row, 2) place boundaries by logical steps, 3) renderer turns it into exact coordinates. Why it matters: Without it, the model must guess pixel positions and drifts off. 🍞 Anchor: "Row 1: Bunny; Row 2: Mom; shared boundary at x=9."
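The grid idea can be sketched in a few lines: the model commits only to a row index and a list of logical segment lengths, and a deterministic mapping produces exact coordinates. The unit size and row height below are illustrative assumptions, not values from the paper:

```python
UNIT = 20   # pixels per logical unit (illustrative assumption)
ROW_H = 40  # vertical spacing per row (illustrative assumption)

def grid_to_coords(row, segments):
    """Turn (row index, logical segment lengths) into exact pixel geometry.

    Returns the bar's y-position and the x-coordinates of every segment
    boundary. Dashed (negative) segments still occupy grid space.
    """
    y = row * ROW_H
    x = 0
    boundaries = [x]
    for length in segments:
        x += abs(length) * UNIT
        boundaries.append(x)
    return y, boundaries

# Row 0: Bunny with parts 3 and 6 -> boundaries at exact positions.
y, xs = grid_to_coords(0, [3, 6])
```

Because the model only chooses rows and logical lengths, it never has to guess continuous pixel positions; alignment falls out of the mapping.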
🍞 Hook: Printers repeat the same file into the same picture every time. 🥬 Deterministic Rendering: What it is: A strict engine that converts valid DSL into a canonical diagram with macros for common patterns. How it works: 1) Check code, 2) draw exactly, 3) auto-add helpful marks for comparisons/ratios. Why it matters: Without determinism, verification can't be trusted. 🍞 Anchor: The same DSL always renders to the same, aligned bar model.
🍞 Hook: Teachers say "show your work" so they can spot mistakes. 🥬 Optical Decompression: What it is: Rebuilding hidden logical structure from text and image into explicit code that can be drawn and checked. How it works: 1) Parse entities and relations, 2) write DSL, 3) render and verify, 4) fix and answer. Why it matters: Without decompression, the logic stays trapped inside words and pixels. 🍞 Anchor: From "3 more than" in text to an offset segment and alignment line in the diagram.
Finally, TwD treats the draft as the thinking engine: by forcing structure early, errors show up visually and can be corrected before the answer is chosen.
03 Methodology
At a high level: Input (image + question) → Parse to DSL (draft code) → Render (exact diagram) → Verify & Refine (closed loop) → Final answer.
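The pipeline above can be sketched as a closed loop with a retry budget. Every function here (`parse`, `render`, `verify`, `refine`, `answer`) is a hypothetical stand-in for the paper's components, wired together only to show the control flow:

```python
def thinking_with_drafting(problem, parse, render, verify, refine, answer,
                           max_rounds=3):
    """Parse -> render -> verify -> refine loop; answer once the draft passes.

    All callables are caller-supplied stand-ins; max_rounds bounds the
    number of refinement attempts before answering best-effort.
    """
    dsl = parse(problem)
    for _ in range(max_rounds):
        diagram = render(dsl)
        ok, issues = verify(dsl, diagram)
        if ok:
            return answer(dsl, diagram)   # draft verified: answer on top of it
        dsl = refine(dsl, issues)         # fix flagged issues and re-render
    return answer(dsl, render(dsl))       # budget exhausted: best effort

# Toy demo: verification fails on draft 0, passes after one refinement.
result = thinking_with_drafting(
    "toy problem",
    parse=lambda p: 0,
    render=lambda d: ("diagram", d),
    verify=lambda d, img: (d >= 1, ["misaligned brace"]),
    refine=lambda d, issues: d + 1,
    answer=lambda d, img: d,
)
```

The key design choice the sketch captures: answering is gated on verification, so failed checks loop back into the draft instead of leaking into the final answer.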
Step 1: Optical Decompression via Logical Parsing
- What happens: The model reads the problem image and text, identifies objects (people, boxes, sets), quantities (numbers, ratios), and relations (equal, more than, transfer), then drafts a first version of the DSL.
- Why it exists: If the model jumps straight to an answer, it can hide logical mistakes. Drafting exposes its beliefs as code.
- Example: "Bunny has 3; Mom gives 6; Mom after is 3× Bunny after" becomes `HL "Bunny" 3 6`, `HL "Mom" 33 -6`, plus a VL that anchors "3× after".
🍞 Hook: Like coloring solid blocks for what you have, and dotted outlines for pretend or removed parts. 🥬 Status-Aware Segments: What it is: Positive numbers draw solid (existing) segments; negative numbers draw dashed (removed/imagined) segments. How it works: 1) Use solid for current amounts, 2) dashed for what was taken or temporarily added, 3) keep them separate in order. Why it matters: Without this, before/after and transfers get confused. 🍞 Anchor: `HL "Mom" 9 9 9 -6` means Mom had 27, then 6 got given away (dashed).
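The solid/dashed convention is easy to make explicit in code. A minimal sketch, assuming segment lists like those in the HL examples (the function name is hypothetical):

```python
def segment_styles(segments):
    """Map each segment length to its drawing status.

    Positive lengths render solid (existing amounts); negative lengths
    render dashed (removed or imagined amounts). The magnitude is kept
    for layout.
    """
    return [("solid" if s > 0 else "dashed", abs(s)) for s in segments]

# HL "Mom" 9 9 9 -6: Mom had 27, then 6 got given away (dashed).
styles = segment_styles([9, 9, 9, -6])
```

Keeping status separate from magnitude lets the renderer draw before/after states without ever confusing what exists with what was moved.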
Step 2: Drafting and Rendering on a Virtual Grid
- What happens: The DSL is rendered by a deterministic engine onto a virtual grid that uses rows and boundary order instead of raw pixel guesses. Common templates (like comparisons) automatically add clean braces and aligners.
- Why it exists: It separates logic from drawing, so even a small model can place things perfectly without wrestling with continuous coordinates.
- Example data: If a comparison says "A is 12 more than B," a macro adds a dashed offset on A and a VL pinning the shared start.
Step 3: Verification Loop (Closed-Loop Reasoning)
- What happens: The system (or an LLM-based judge) checks syntax validity, visual completeness, and logical consistency: Do braces end exactly at boundaries? Are all givens labeled in visible text? Do transfers subtract then add? Any answer leakage in labels? If a check fails, the model revises the DSL and re-renders.
- Why it exists: This is the safety net. It turns the diagram into a proof that either passes or points to the exact place that's wrong.
- Example: If a VB (vertical brace) tries to group a single object, the checker fails it and forces a fix.
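A verifier rule like the one just described is a small predicate. As an illustrative sketch (the function name and message are assumptions, not the paper's implementation), the single-object VB check might look like:

```python
def check_vb_span(start_row, end_row):
    """A VB (vertical brace) must group at least two bars, i.e. span
    at least two distinct rows. Returns (passed, message)."""
    if end_row - start_row < 1:
        return False, "VB groups a single object; widen the span or use HB"
    return True, "ok"

ok, msg = check_vb_span(2, 2)   # illegal: brace spans one row only
```

Each failed check names the exact offending construct, which is what lets the model revise the DSL in a targeted way instead of redrafting blindly.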
Step 4: DSL-Conditioned Answering
- What happens: Once the diagram is verified, the model reads its own draft to compute values (totals, differences, reversals) and outputs the final answer.
- Why it exists: Now the math happens on top of a guaranteed-correct structure, reducing hallucinations.
- Example: With Bunny_now = 3+6 and Mom_now = 33-6 and "Mom_now = 3× Bunny_now", the answer drops out cleanly.
Concrete, small data walk-through (change & revert):
- Input: "Bunny pulled 3 carrots. Mom says: If I give you 6, I'll have 3 times what you have. How many did Mom pull?"
- Parse: `HL "Bunny" 3 6`; `HL "Mom" 33 -6`; VL pins "3× after".
- Render: Two aligned bars; a +6 on Bunny; a -6 (dashed) on Mom; braces marking totals.
- Verify: All labels present (3, give 6, 3 times), no leakage of final counts, braces hit boundaries, transfer is paired -6/+6.
- Answer: Compute after-states and solve. The diagram enforces the math.
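The arithmetic the verified diagram enforces can be checked directly; each line below mirrors one relation pinned by the bars and the VL:

```python
bunny_before, given = 3, 6              # "Bunny pulled 3"; "If I give you 6"
bunny_after = bunny_before + given      # Bunny's bar after the +6 transfer
mom_after = 3 * bunny_after             # VL anchors: Mom after = 3 x Bunny after
mom_before = mom_after + given          # undo the -6 (dashed) transfer

# The diagram's paired -6/+6 transfer and the 3x alignment force Mom's
# original count to 33, matching HL "Mom" 33 -6 in the draft.
```

Because every quantity is read off a verified structure, the final computation is bookkeeping rather than guessing.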
The Secret Sauce:
- The draft is the reasoning engine: By forcing code first, the model must commit to exact boundaries and relations, so weak, fuzzy language can't slip through.
- Virtual grid decoupling: The model never guesses pixels; it only picks logical order and groupings. The renderer guarantees neat geometry.
- Deterministic macros: Common patterns (sum & split, more/less by t, ratios via equal units) render the same, canonical way, which stabilizes training and checking.
- Visible-text-only policy: Givens and unknowns must appear as labels; numbers alone don't count. This stops hidden info and answer leakage.
- Pairing rules for transfers: Every "give t" is a -t/+t pair across rows. If you only see one side, the verifier flags it.
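The pairing rule can be sketched as a heuristic check: every transfer-out (negative segment) of size t must have a matching +t segment in some other bar. This is an assumption about how such a rule might be coded, not the paper's exact verifier:

```python
def check_transfer_pairs(bars):
    """Heuristic pairing check over bars: dict of name -> segment lengths.

    Every negative segment (a transfer out) of size t must be matched by
    a +t segment in some other bar; an unpaired side fails verification.
    """
    for name, segs in bars.items():
        for s in segs:
            if s < 0:
                paired = any(t == -s
                             for other, osegs in bars.items() if other != name
                             for t in osegs)
                if not paired:
                    return False, f"unpaired transfer {s} in {name}"
    return True, "ok"

# Mom's -6 is matched by Bunny's +6, so the draft passes.
ok, msg = check_transfer_pairs({"Bunny": [3, 6], "Mom": [33, -6]})
```

A real verifier would also check ordering and labels, but even this simple predicate catches one-sided transfers before they poison the answer.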
Putting it together like a recipe:
- Input → Identify entities and relations → Write HL/VL/HB/VB code with status-aware segments → Render on the grid → Check alignment, completeness, compliance → Fix if needed → Read the verified diagram → Compute the answer. The loop makes the picture a self-checking proof.
04 Experiments & Results
The Test: The authors build VisAlg, a benchmark focused on whether models can recover the exact logical topology behind bar-model word problems. Each item includes a problem image and a ground-truth DSL. The test checks both the code (does your DSL match?) and the image (does your rendering align?), plus a judge's semantic checks.
🍞 Top Bread (Hook): Like grading both a student's steps and their final diagram, not just the final number. 🥬 The Concept (VisAlg): What it is: A structured test for logic-aware visual algebra. How it works: 1) Collect bar-model problems, 2) generate and refine DSL drafts, 3) filter with a strict verifier calibrated to human experts, 4) evaluate models on code similarity, image similarity, and semantic checks. Why it matters: Without a test that cares about structure, models can pass with pretty but wrong pictures. 🍞 Bottom Bread (Anchor): If your brace endpoints float between segments, VisAlg dings you, even if your final number is right.
What they measured and why:
- Code similarity (BLEU/ROUGE-L/chrF): Does your DSL resemble the ground truth? chrF is the main code metric because it handles mixed symbols and numbers well.
- Image similarity (LPIPS/SSIM/PSNR): Does your rendered diagram match structurally? SSIM is the main image metric since it's sensitive to shapes and edges.
- LLM-as-judge scores: Alignment, information coverage, numerical consistency, semantic compliance, and answer leakage (0-1 each; average is used). This catches deeper semantic mistakes.
- Main composite score: Average of chrF, SSIM, and the LLM judge score.
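As described, the main score is a plain average of the three components. A one-line sketch (assuming all three are already on a common 0-100 scale, which is an assumption about the benchmark's scaling):

```python
def composite_score(chrf, ssim, judge):
    """VisAlg main score as described: the unweighted mean of the chrF
    code-similarity score, the SSIM image-similarity score, and the
    LLM-judge score, all assumed to share one scale."""
    return (chrf + ssim + judge) / 3

score = composite_score(90, 80, 70)
```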
The Competition:
- Proprietary models: GPT-5.1, GPT-4o, Claude-4, Gemini-3-Pro, Gemini-2.5-Pro.
- Open-weight baselines: InternVL3-8B, InternVL2.5-8B, Intern-S1-mini, Mimo-VL-7B-RL, Qwen3-VL-8B.
- TwD model: Starts from Qwen3-VL-8B and is supervised on VisAlg.
The Scoreboard (with context):
- TwD hits an overall 82.63. That's like scoring an A when other strong students get high Bs: Gemini-3-Pro at 79.96 and Gemini-2.5-Pro at 74.12.
- Open baselines mostly stay below 55, showing they struggle to produce valid, consistent DSL and clean diagrams without structure-first training.
- Biggest TwD gains are in structural fidelity: better code alignment and diagram alignment, strong info coverage, and minimal answer leakage. The remaining headroom is in numerical consistency against top proprietary models.
Schema-wise performance:
- Five types: proportional distribution, rate & percentage, change & revert, sum & split, difference analysis.
- TwD's advantages are most visible in structure-heavy cases (proportional distribution, difference analysis) where exact units and boundaries matter most.
- Performance is steady across types, suggesting the DSL-and-verifier loop generalizes.
Surprising findings:
- Human vs LLM-judge agreement is very high (r ≈ 0.96), so the automated verifier is a reliable stand-in for experts.
- On advanced set problems with multiple overlaps (A∩B∩C, etc.), some frontier models compute totals correctly but draw illegal intersections (topological hallucinations). TwD preserves legal overlaps by decomposing intersections explicitly as geometric pieces, keeping semantics and geometry aligned.
Bottom line: Treating drafting as thinking, then verifying the draft, yields better, more trustworthy reasoning than text-only steps or pixel-only drawings.
05 Discussion & Limitations
Limitations:
- The DSL is tuned for bar-model visual algebra (linear bars, shared boundaries, braces). It doesn't yet cover every scientific diagram (e.g., complex graphs, physics free-body diagrams, circuit schematics). Extending the grammar and macros will be needed.
- The system relies on a strict visible-text-only policy and rule checks; domains with ambiguous labels or free-form sketches may require looser but still safe rules.
- While TwD reduces hallucination, it can still draft a syntactically valid but semantically wrong structure if the parsing step misreads the problem.
Required resources:
- A multimodal model that can parse images and text, trained with examples of DSL drafting.
- The deterministic renderer and the verifier (LLM-based or rule-based) to enforce alignment, completeness, and compliance.
- Moderate compute for fine-tuning and evaluating on VisAlg (the paper used an 8-GPU node; the method itself is parameter-efficient relative to frontier models).
When NOT to use it:
- Tasks where geometry is not the right backbone (e.g., free-form creative art) or where relations cannot be cleanly represented by bars, braces, or aligners.
- Real-time scenarios with ultra-low latency requirements where rendering and verification loops would be too slow.
- Problems where the input has no stable, shared boundaries (e.g., vague stories with no numeric anchors) may not benefit from the DSL structure.
Open questions:
- How to expand the DSL family to cover curves, angles, forces, or network flows while keeping the language small and checkable.
- How to combine TwD with program-of-thought for algebraic solving and formal proof systems, bridging diagrams and equations more tightly.
- How to train models to choose the right visual grammar automatically (bar model vs. Venn vs. timeline) based on the problem type.
- How to teach partial-credit verification (diagnose what's fixable automatically) to reduce human oversight even further.
- How to incorporate uncertainty: can the draft carry confidence tags and trigger targeted re-checks when confidence is low?
06 Conclusion & Future Work
In three sentences: This paper turns visual reasoning into optical decompression, rebuilding hidden logic from images and text into a tiny, executable diagram language. Its method, Thinking with Drafting (TwD), forces the model to "show its work" by drafting code, rendering a precise diagram, and verifying it before answering. On the VisAlg benchmark, this structure-first approach lets a compact 8B model outperform strong proprietary systems on logic-aware visual algebra.
Main achievement: Proving that a minimalist DSL plus a closed drafting-rendering-verification loop is a powerful cognitive scaffold that upgrades fuzzy reasoning into checkable structure.
Future directions: Broaden the DSL to cover more diagram types (sets, graphs, physics, circuits), connect drafts to symbolic solvers and formal verifiers, and build automatic "grammar selection" so the model picks the best visual language for each problem.
Why remember this: TwD shows that the path to trustworthy multimodal AI isn't just better reading or prettier pictures; it's making models commit to explicit, verifiable structures. When the drawing becomes a proof, answers become safer, clearer, and easier to teach and check.
Practical Applications
- Math tutoring that generates bar-model diagrams with exact alignments and no answer leakage.
- Automated grading of student diagrams, checking bracket endpoints, labels, and transfers for validity.
- Form and invoice parsing that reconstructs totals, parts, and discounts as verifiable structures before summing.
- Data dashboard generation where proportions and comparisons are enforced by alignment rules, not guesswork.
- Interactive problem solvers that let users edit the DSL draft and instantly re-verify the logic.
- Accessible learning tools that convert word problems into clean bar models, highlighting unknowns explicitly.
- Set and probability problems rendered as legal Venn-style partitions with verified overlaps.
- Compliance checks in reports to ensure no diagrams embed final answers in labels (preventing leakage).
- Curriculum design that teaches students to move from text to structured diagrams using a simple DSL.
- Agent systems that pick a visual grammar (bars, sets) and verify structure before calling a calculator.