
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

Intermediate
Minjun Zhu, Zhen Lin, Yixuan Weng et al. · 2/3/2026
arXiv · PDF

Key Summary

  • AutoFigure is an AI system that reads long scientific texts and then thinks, plans, and draws clear, good-looking figures—like a careful student who makes a neat, accurate poster from a long chapter.
  • It follows a think-then-draw recipe called Reasoned Rendering: first build a clean blueprint (structure), then make it pretty (aesthetics), then fix tiny text issues (legibility).
  • A new dataset, FigureBench, with 3,300 text–figure pairs, tests how well systems turn long texts into accurate and attractive scientific illustrations.
  • AutoFigure uses a critique-and-refine loop where an AI “designer” proposes layouts and an AI “critic” gives feedback until the figure is balanced, readable, and faithful to the text.
  • To avoid blurry labels, AutoFigure erases all rendered text, reads it with OCR, checks against the blueprint, and overlays crisp vector text back on top.
  • Across blogs, surveys, textbooks, and research papers, AutoFigure scores highest overall and wins most blind comparisons against baselines like text-to-image, code-only SVG/HTML, and other agents.
  • In a human study with first authors judging figures for their own papers, 66.7% said they would publish AutoFigure’s result—second only to the original human-made figures.
  • Ablations show more refinement iterations help, structured formats like SVG beat PPT, and the rendering stage noticeably boosts visual design without hurting accuracy.
  • Limits remain: tiny typos can slip through in dense figures, and very complex, bespoke research diagrams are harder to get perfect without domain-specific rules.
  • The work matters because it saves researchers days of design time, helps students learn faster with clearer visuals, and removes a key bottleneck for future AI scientists.

Why This Research Matters

Clear figures speed up understanding for everyone—from busy reviewers to curious students—so science moves faster. AutoFigure reduces days of design work to minutes, freeing researchers to focus on ideas instead of arrows and fonts. Teachers and students get cleaner, more accurate visuals that make tough topics feel approachable. Journals and conferences can raise the bar for clarity without raising the burden on authors. And as AI systems start discovering new results on their own, tools like AutoFigure give them a trustworthy visual language to share what they find.

Detailed Explanation


01 Background & Problem Definition

You know how a great poster can make a complicated topic suddenly click? Before this work, making that kind of poster—or any publication-ready scientific figure—usually took people days. Researchers had to deeply understand long technical papers, pick out what’s essential, and then design a figure that is both correct and attractive. Computers weren’t much help for long texts. Text-to-image models made pretty pictures, but they often messed up structure or labels. Code-based drawing tools made neat shapes, but the results looked stiff and less professional. Multi-agent poster/slide systems mostly rearranged existing figures rather than creating new ones from scratch.

🍞 Hook: Imagine you must read a whole chapter (10,000+ words) and then draw one perfect figure that explains the main idea. That’s hard! 🥬 The Concept (Semantic Parsing): It means turning long text into a map of key ideas and how they connect.

  • How it works: 1) Read; 2) Find important entities and steps; 3) Link them (who connects to what and how); 4) Save as a clear blueprint.
  • Why it matters: Without this, the AI guesses what to draw and gets lost in details. 🍞 Anchor: From a paper about training AI with human feedback, semantic parsing finds the three stages (SFT → RM → PPO) and the roles (labelers, reward model), so the figure shows the real workflow instead of random boxes.
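
To make the blueprint idea concrete, here is a minimal Python sketch of what a parsed layout could look like. The Node/Edge/Blueprint names and the hand-built RLHF example are illustrative assumptions, not the paper’s actual schema.

```python
# Minimal sketch of a symbolic blueprint (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str              # e.g., "sft"
    label: str           # e.g., "Supervised Fine-Tuning (SFT)"
    kind: str = "stage"  # stage, role, artifact, ...

@dataclass
class Edge:
    src: str
    dst: str
    label: str = ""      # e.g., "preference data"

@dataclass
class Blueprint:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

# Hand-built example for the RLHF anchor above; in AutoFigure an LLM
# would extract this structure from the full paper text.
rlhf = Blueprint(
    nodes=[Node("sft", "SFT"), Node("rm", "Reward Model"), Node("ppo", "PPO")],
    edges=[Edge("sft", "rm", "preference data"), Edge("rm", "ppo", "reward signal")],
)
for e in rlhf.edges:
    print(f"{e.src} -> {e.dst} ({e.label})")
```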

🍞 Hook: You know how a baker first builds the cake layers before decorating with icing and fruits? 🥬 The Concept (Aesthetic Rendering): It’s the stage where the system turns the plain layout into a polished, beautiful figure.

  • How it works: 1) Take the blueprint; 2) Apply a visual style; 3) Pick colors/icons; 4) Render a high-quality image.
  • Why it matters: Without this, figures are accurate but dull or hard to read. 🍞 Anchor: A pipeline with steps and arrows becomes a clean infographic with aligned boxes, soft colors, and clear icons.

🍞 Hook: Before you play a new game, you need a practice field. 🥬 The Concept (FigureBench): A big benchmark of 3,300 long-text–to-figure pairs for training and testing.

  • How it works: 1) Collect diverse sources (papers, surveys, blogs, textbooks); 2) Curate high-quality pairs; 3) Hold out 300 tough test cases; 4) Score models on design, clarity, and accuracy.
  • Why it matters: Without a fair testbed, we can’t tell if new methods truly improve. 🍞 Anchor: A model that only looks nice but gets steps wrong scores lower on FigureBench than one that’s both neat and correct.

🍞 Hook: When you build a treehouse, you draw the plan first, then you decorate. 🥬 The Concept (Reasoned Rendering): Think first, then draw—separate structure planning from final styling.

  • How it works: 1) Parse text into a symbolic layout; 2) Iteratively fix problems; 3) Render with style; 4) Repair text.
  • Why it matters: Mixing all at once causes either pretty-but-wrong or correct-but-ugly results. 🍞 Anchor: The system first locks the positions of nodes and arrows, then picks a color scheme, then ensures labels are sharp.

🍞 Hook: Have you ever explained a picture to a friend while pointing at parts? 🥬 The Concept (Vision-Language Model, VLM): A model that understands both words and images together.

  • How it works: 1) Read the text; 2) Look at the figure; 3) Judge alignment and clarity; 4) Score and give feedback.
  • Why it matters: Without vision+language understanding, automated judging or guided improvement is unreliable. 🍞 Anchor: A VLM can tell if the “Reward Model” box is missing even when the caption mentions it.

🍞 Hook: Think of turning in a draft, getting notes, and revising until it shines. 🥬 The Concept (Critique-and-Refine Loop): An AI “designer” proposes layouts; an AI “critic” checks alignment, spacing, and overlaps, then the designer revises.

  • How it works: 1) Score current layout; 2) Critic points out issues; 3) Designer fixes; 4) Keep best version; 5) Repeat a few times.
  • Why it matters: Without iteration, early mistakes and crowding persist. 🍞 Anchor: If text boxes overlap arrows, the critic flags it; the designer nudges them apart and realigns columns.

🍞 Hook: When marker letters look smudged, you erase and rewrite them neatly. 🥬 The Concept (Erase-and-Correct Strategy): Remove blurry text from the image, read what it was, correct it, then reprint crisp vector text at the right spots.

  • How it works: 1) Erase all text pixels; 2) OCR extracts strings and boxes; 3) Verify against the blueprint; 4) Overlay sharp text.
  • Why it matters: Without this, small typos and blur hurt readability. 🍞 Anchor: A fuzzy, half-legible “gravity” label is read back, verified against the blueprint, and re-rendered as crisp vector text.

The problem researchers faced was twofold: understanding super long, technical text and drawing figures that are both accurate and pleasant to look at. Past attempts stumbled: end-to-end image generators lost structure; code-only approaches lost polish; rearrangers didn’t create new visuals. The missing piece was a plan-first system plus a fair test to prove progress. The real stakes are big: saving days of researcher time, helping students learn faster, improving public science communication, and enabling future AI scientists to show their discoveries clearly.

02 Core Idea

🍞 Hook: You know how movie makers sketch storyboards before filming? That one step makes everything else smoother. 🥬 The Concept (Reasoned Rendering): AutoFigure’s key insight is to separate thinking from drawing—first nail the blueprint, then beautify, then perfect the text.

  • How it works: 1) Parse long text into a symbolic layout (nodes, arrows, labels, style hints); 2) Iterate designer–critic revisions; 3) Render with a guided style; 4) Erase blurry text and overlay crisp vector labels.
  • Why it matters: Without this sequence, you either get pretty-but-wrong or correct-but-ugly. 🍞 Anchor: For the InstructGPT diagram, AutoFigure locks the three stages (SFT, RM, PPO), fixes spacing and overlaps, renders with a clean palette, and ensures labels are 100% legible.

The “Aha!” in one sentence: Think like an architect, build like a designer, and letter like a typesetter.

Three different analogies:

  • Blueprints then house: First design the rooms and hallways, then pick paint and furniture, then mount crisp room signs.
  • Recipe then plating: First plan ingredients and steps; then plate the dish nicely; finally add labels on the menu.
  • Lego then stickers: First snap blocks into a stable structure; then add decorative pieces; last, place neat name stickers.

Before vs After:

  • Before: Systems tried to do everything at once, leading to mistakes—hallucinated steps, cramped arrows, or fuzzy text.
  • After: AutoFigure locks structure early, improves it through critique, and only then invests in aesthetics and precise text, reducing errors while improving looks.

Why it works (intuition, not equations):

  • Cognitive load is split. The LLM reasons about structure (what goes where) without worrying about pixel-perfect style. The renderer focuses on beauty without changing meaning. The final text pass ensures human-grade legibility.
  • Iteration catches local issues (overlaps, misalignments) that a single pass misses. Like editing drafts, a few quality loops boost clarity a lot.
  • Constraining the renderer with a layout image and style description keeps the output faithful to the plan.

Building blocks (each with a purpose):

  • Semantic Parsing: extracts entities, steps, and relations from long text and turns them into a symbolic graph (e.g., SVG/HTML). Purpose: makes hidden structure explicit.
  • Critique-and-Refine Loop: an AI critic checks alignment, balance, and overlaps; an AI designer revises. Purpose: improves clarity, avoids crowding.
  • Aesthetic Rendering: translates the validated blueprint into a polished image guided by a style prompt and the layout reference. Purpose: professional appearance without changing meaning.
  • Erase-and-Correct: removes blurry text, uses OCR + verification to ensure exact wording, then overlays crisp vector text. Purpose: perfect legibility and fewer typos.
  • FigureBench: a long-text benchmark with expert-labeled pairs to evaluate structure, clarity, and design. Purpose: a fair, challenging testbed to drive real progress.

Put together, AutoFigure makes complex science easier to see: it decodes the story, arranges it sensibly, dresses it well, and prints the words clearly. That combination is why the figures are both accurate and publication-ready.

03 Methodology

At a high level: Long Text → Concept Extraction → Layout Critique-and-Refine → Style-Guided Rendering → Text Erase-and-Correct → Final Figure.
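
Before walking through the steps, here is that pipeline as a code skeleton. Every function below is a placeholder for an LLM/VLM or renderer call; the names are ours, not AutoFigure’s API.

```python
# Pipeline skeleton; every function is a stand-in for a model call.
def extract_concepts(text: str) -> dict:
    # Step 1: parse long text into nodes, edges, labels, and style hints.
    return {"nodes": [], "edges": [], "style": "academic"}

def critique_and_refine(blueprint: dict, iters: int = 5) -> dict:
    # Step 2: designer proposes fixes, critic scores; keep the best layout.
    return blueprint

def render_with_style(blueprint: dict) -> bytes:
    # Step 3: layout image + style prompt -> polished raster figure.
    return b""

def erase_and_correct_text(figure: bytes, blueprint: dict) -> bytes:
    # Step 4: erase rendered text, OCR it, verify, overlay crisp vector text.
    return figure

def autofigure(long_text: str) -> bytes:
    bp = critique_and_refine(extract_concepts(long_text))
    return erase_and_correct_text(render_with_style(bp), bp)
```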

Step 1: Concept Extraction and Symbolic Layout

  • What happens: The system reads the long document and distills a method summary plus a list of entities (nodes), relations (arrows), and labels. It produces a machine-readable blueprint in SVG/HTML and a style descriptor based on source type (Paper, Survey, Blog, Textbook). The SVG is also rasterized into a plain layout image.
  • Why it exists: If you don’t capture the structure first, rendering can drift—missing steps, wrong connections, or messy topology.
  • Example: From an RLHF paper, extract three big stages (SFT, Reward Modeling, PPO), include roles (Labelers, Reward Model), connect arrows left-to-right, and assign a clean academic style.
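
A toy serializer shows why SVG works well as the intermediate: positions, shapes, and labels are all explicit and machine-checkable. The schema below is illustrative, not the paper’s actual format.

```python
# Toy SVG serializer for a left-to-right pipeline of stages.
def pipeline_to_svg(stages: list[str]) -> str:
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="640" height="160">']
    for i, label in enumerate(stages):
        x = 20 + i * 210
        parts.append(f'  <rect x="{x}" y="50" width="180" height="60" '
                     'fill="white" stroke="black"/>')
        parts.append(f'  <text x="{x + 90}" y="85" text-anchor="middle">{label}</text>')
        if i < len(stages) - 1:  # connector to the next stage
            parts.append(f'  <line x1="{x + 180}" y1="80" x2="{x + 210}" y2="80" '
                         'stroke="black"/>')
    parts.append('</svg>')
    return "\n".join(parts)

print(pipeline_to_svg(["SFT", "Reward Modeling", "PPO"]))
```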

Step 2: Critique-and-Refine Loop (AI Designer × AI Critic)

  • What happens: Start with the best-known layout and a score. The critic reads the layout and flags problems (misalignment, overlaps, poor balance). The designer proposes a new candidate that fixes issues. If the candidate scores higher, it becomes the new best. Repeat for a few iterations or until no gains.
  • Why it exists: A single pass often leaves preventable defects (crowding, tangled arrows). Iteration steadily reduces these.
  • Example: The critic notices two labels collide with an arrow and the stage columns drift. The designer enlarges margins, straightens columns, and routes arrows more clearly.
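
The loop itself is simple enough to sketch directly. `score`, `critique`, and `revise` stand in for the VLM/LLM calls and are assumptions, not the paper’s interface:

```python
# Designer-critic refinement: keep the best-scoring layout seen so far.
def refine(layout, score, critique, revise, max_iters=5):
    best, best_score = layout, score(layout)
    for _ in range(max_iters):
        issues = critique(best)           # e.g., ["labels overlap arrow 3"]
        if not issues:
            break                         # nothing left to fix
        candidate = revise(best, issues)  # designer proposes a revision
        s = score(candidate)
        if s > best_score:                # only keep improvements
            best, best_score = candidate, s
    return best
```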

Step 3: Style-Guided Aesthetic Rendering

  • What happens: The system turns the final layout + style descriptor into a detailed prompt and feeds it, along with the layout reference image (which pins positions), into a multimodal generator. It outputs a polished figure that adheres to the plan and reflects the chosen style.
  • Why it exists: Separating style work from structure locks in accuracy while achieving professional polish.
  • Example: The blueprint of boxes and arrows becomes a refined infographic with consistent typography, icons, and a pleasant color palette.
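
Here is a sketch of how the rendering prompt might be assembled from the summary and a style descriptor. The style table and the commented-out `generate_image` call are hypothetical, not the paper’s interface:

```python
# Assembling a render prompt from a method summary and a style descriptor.
STYLES = {
    "Paper":    "clean academic palette, sans-serif labels, generous whitespace",
    "Textbook": "friendly colors, simple icons, large readable labels",
    "Blog":     "modern infographic look, soft gradients, consistent icon set",
}

def build_render_prompt(summary: str, source_type: str) -> str:
    return (f"Redraw the attached layout image as a polished figure. "
            f"Keep every box, arrow, and label in its current position. "
            f"Content: {summary}. Style: {STYLES[source_type]}.")

prompt = build_render_prompt("three-stage RLHF pipeline (SFT -> RM -> PPO)", "Paper")
# figure = generate_image(prompt, reference=layout_png)  # hypothetical generator call
```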

Step 4: Ensuring Textual Accuracy (Erase-and-Correct)

  • What happens: 1) Erase all text pixels from the rendered image to get a clean background. 2) Use OCR to read detected strings and their bounding boxes. 3) Compare OCR text to the ground-truth labels from the SVG and correct any errors. 4) Reinsert all labels as crisp vector text at the same coordinates.
  • Why it exists: Renderers can blur small fonts or drop characters; this pass guarantees legible, exact wording.
  • Example: “Policy Optimization” partially blurs to “Policy Optimizati”; the verifier restores the missing “on,” and the overlay prints it sharply.
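
The verification step (3) is essentially fuzzy matching against known-good labels. Here is a self-contained sketch using Python’s standard `difflib`; the OCR and erasing stages are assumed to have already produced the `(string, bbox)` pairs:

```python
# Correct OCR readings against the blueprint's ground-truth labels.
import difflib

def correct_ocr_labels(ocr_results, blueprint_labels, cutoff=0.6):
    """ocr_results: list of (string, bbox) pairs read back from the render."""
    corrected = []
    for text, bbox in ocr_results:
        match = difflib.get_close_matches(text, blueprint_labels, n=1, cutoff=cutoff)
        corrected.append((match[0] if match else text, bbox))
    return corrected

labels = ["Requirements", "Design", "Implementation", "Testing", "Deployment"]
ocr = [("Requirments", (20, 40, 160, 70)), ("Tesing", (20, 300, 160, 330))]
print(correct_ocr_labels(ocr, labels))
# [('Requirements', (20, 40, 160, 70)), ('Testing', (20, 300, 160, 330))]
```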

Step 5: Output and Style Control

  • What happens: Users can keep the default academic style or swap styles (e.g., minimalist vs comic) without touching the structure.
  • Why it exists: Many venues have style preferences; decoupling structure from style makes adaptation easy.
  • Example: The same RLHF layout can be rendered in a sober gray-blue scheme for a journal or a friendlier pastel palette for a blog.

Secret sauce—what’s clever here:

  • Decoupling reasoning from rendering avoids trade-offs that plagued older methods.
  • The designer–critic loop is lightweight but powerful—small iterative fixes compound into big clarity gains.
  • The erase-and-correct pass treats text as special, ensuring human-grade legibility that typical image generators miss.
  • Structured intermediate formats (SVG/HTML) are expressive, verifiable, and easy to refine—superior to piecemeal PPT insertions.

Concrete walk-through (mini example: Waterfall model from a textbook):

  • Input: A few paragraphs describing Requirements → Design → Implementation → Testing → Deployment, plus key traits like “strictly sequential.”
  • Step 1: Extract 5 phase nodes, arrows top-to-bottom, side panels for Key Characteristics and Historical Context; serialize as SVG.
  • Step 2: Critique flags cramped spacing between Design and Implementation; revise with larger gaps and clearer arrowheads.
  • Step 3: Render with a minimalist academic style.
  • Step 4: OCR spots “Requirments”; verifier corrects to “Requirements”; overlay perfect text.
  • Output: A clean, accurate teaching diagram ready for print.

Efficiency notes: Typical run uses a few refinement iterations. Rendering is guided by the layout to preserve structure. OCR and verification are fast and robust for most fonts. The whole pipeline can run with commercial APIs or strong open-source VLMs locally, trading cost for speed/privacy.

04 Experiments & Results

The Test (what and why): The goal is to see if models can turn long texts into figures that are (1) visually well-designed, (2) communicatively clear, and (3) faithful to content. Using FigureBench’s 3,300 pairs (300 for testing), scoring follows a VLM-as-a-judge protocol with referenced scoring and blind pairwise comparisons. Human experts (first authors) also evaluate figures for their own papers to measure practical, publication-level utility.
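
The pairwise half of that protocol can be sketched as a small harness; `vlm_judge` is a placeholder for the actual judge-model call, and shuffling hides which system produced which figure:

```python
# Blind pairwise comparison harness; vlm_judge is a placeholder model call.
import random

def blind_win_rate(cases, vlm_judge):
    wins = 0
    for source_text, system_fig, baseline_fig in cases:
        pair = [("system", system_fig), ("baseline", baseline_fig)]
        random.shuffle(pair)                 # hide which figure is which
        choice = vlm_judge(source_text, pair[0][1], pair[1][1])  # returns 0 or 1
        wins += pair[choice][0] == "system"
    return wins / len(cases)
```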

The Competition (baselines):

  • End-to-end text-to-image (e.g., GPT-Image): Great looks, often wrong content.
  • Text-to-code (HTML/SVG): Solid structure, weaker polish/aesthetics.
  • Multi-agent diagramming (Diagram Agent, AutoPresent): Arrange content but struggle to invent new schematics from long text.
  • TikZ code generation (TikZero/+): Exact geometry focus but overwhelmed by complex, long-context layouts.

The Scoreboard (with context):

  • AutoFigure achieves top Overall scores across categories: Blog ~7.60, Survey ~6.99, Textbook ~8.00, Paper ~7.03. Think of 7–8 as “A-/A” when others are mostly “B” or lower.
  • Blind Win-Rates: Blog 75%, Survey 78%, Textbook 97.5%, Paper 53%. In textbooks, AutoFigure almost always wins; papers are tougher because layouts are bespoke and dense.
  • Human study with first authors: AutoFigure ranks second only to the original human figures and earns 66.7% “I’d publish this” intent—strong real-world validation.
  • Ablations: Rendering boosts visual design without harming accuracy (e.g., GPT-5 backbone jumps in Overall after rendering). More refine iterations steadily improve scores (0 → 5 loops raises Overall notably). SVG/HTML as structured intermediates beat PPT insertions.
  • Open-source backbones: Qwen3-VL-235B drives AutoFigure to an Overall ~7.08—beating several commercial backbones and ranking just behind GPT-5, showing cost-effective local deployment is viable.
  • Text refinement module: Overall rises modestly, but aesthetics and professional polish improve meaningfully—exactly what makes a figure feel “publication-ready.”
  • Cost/time: Cloud API run ~17.5 minutes at ~$0.20 per figure; local open-source on strong GPUs ~9.3 minutes and near-zero marginal cost.

Surprising findings:

  • Structured intermediate formats (SVG/HTML) matter a lot: they enable one-shot coherent figures, while PPT’s incremental approach drifts.
  • Open-source VLMs are strong enough to rival commercial options in this pipeline.
  • Even small improvements in polish (fonts, spacing) can flip human preferences—publication intent jumped for AutoFigure despite similar content completeness.
  • VLM-as-a-judge correlates well with humans (Pearson r≈0.66), supporting reliable automated evaluation.
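
That agreement check is an ordinary Pearson correlation between judge and human scores, easy to reproduce on your own ratings. The scores below are made-up placeholders, not the paper’s data:

```python
# Pearson correlation between automated and human scores (pure stdlib).
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

vlm_scores   = [7.6, 6.9, 8.0, 7.0, 5.5]  # placeholder ratings
human_scores = [7.2, 6.5, 8.3, 6.8, 5.9]
print(round(pearson(vlm_scores, human_scores), 2))
```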

Case intuition:

  • InstructGPT figure: End-to-end images look fine but miss roles/steps; code-only looks sterile. AutoFigure preserves all stages, keeps labels clean, and uses spacing/icons for instant comprehension.
  • Waterfall model: AutoFigure transforms a basic flow into a clear teaching panel with added context boxes, while keeping perfect labels via text correction.

05 Discussion & Limitations

Limitations (honest view):

  • Tiny text errors can still appear in very dense layouts or tiny fonts; the erase-and-correct pass reduces but may not eliminate rare character slips.
  • Paper figures are bespoke and multi-layered; the system may simplify nuanced relations to keep readability, or keep everything and risk clutter.
  • Domain-specific conventions (e.g., biology pathways) aren’t fully encoded, so uncommon symbols/standards might be missed without domain add-ons.
  • Evaluation via VLM-as-a-judge, though correlated with humans, isn’t a perfect substitute for expert review.

Required resources:

  • A capable long-context LLM/VLM for parsing and critique, a graphics-friendly renderer, OCR, and moderate GPU compute for fast iterations. Local open-source setups need strong GPUs; cloud APIs trade cost for convenience.

When not to use:

  • Data-accurate charts (exact numbers/axes) where specialized plotting is required.
  • Ultra-dense posters with tiny fonts where vector-native workflows or manual typesetting are mandatory.
  • Sensitive domains where any mislabeling is risky without human verification.
  • Tasks demanding interactivity or animation (this work is static-first).

Open questions:

  • How to enforce domain constraints (terminology, allowed connections) before rendering?
  • Can retrieval and knowledge bases improve faithfulness for specialized fields?
  • Better automatic metrics for structure/topology beyond VLM judgments.
  • Stronger constrained text rendering that removes residual OCR alignment issues.
  • Extensions to interactive/animated figures while keeping the reasoned-then-rendered discipline.

06 Conclusion & Future Work

In three sentences: AutoFigure introduces Reasoned Rendering—a think-then-draw pipeline that first builds a correct symbolic layout, then renders it beautifully, and finally perfects the text. Powered by the new FigureBench dataset and a designer–critic refinement loop, it consistently beats baselines on accuracy, clarity, and aesthetics, with many figures judged publication-ready by human authors. This work removes a major bottleneck in science communication and equips future AI systems to show their ideas clearly.

Main achievement: Demonstrating that decoupling reasoning from rendering, plus iterative critique and text repair, produces publication-grade scientific figures from long, complex texts.

Future directions: Add domain verifiers and retrieval for specialized fields; strengthen constrained text overlays; evolve from static diagrams to interactive/animated ones; and tailor style guides to discipline-specific conventions.

Why remember this: It teaches a general lesson for AI creativity—think first, then draw. That simple shift turns messy generations into clear explanations, saving researchers time, helping students learn, and giving AI scientists a visual voice.

Practical Applications

  • Generate a publication-ready method diagram for your paper draft directly from the Methods section.
  • Turn a long survey into a clean taxonomy figure that groups concepts and shows relationships.
  • Convert a technical blog post into an explanatory infographic with consistent style and icons.
  • Redraw an old, cluttered figure by feeding its text description to get a cleaner, modern version.
  • Switch styles (e.g., minimalist for journals, playful for outreach) without changing the structure.
  • Batch-produce lecture slides’ key diagrams from textbook chapters to save teaching prep time.
  • Localize labels: keep the same layout but output crisp vector text in another language.
  • Preflight check for figures: run critique-and-refine to catch overlaps and balance issues before submission.
  • Create internal process maps (e.g., ML pipelines) for team documentation with accurate steps and roles.
  • Rapidly prototype multiple layout options for the same content and pick the highest-scoring one.
#AutoFigure #FigureBench #Reasoned Rendering #scientific illustration generation #semantic parsing #SVG layout planning #critique-and-refine loop #OCR erase-and-correct #VLM-as-a-judge #long-context understanding #text-to-diagram #aesthetic rendering #publication-ready figures #diagram generation #AI scientist visuals
Version: 1