PaperBanana: Automating Academic Illustration for AI Scientists
Key Summary
- PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
- It works in stages: find good examples, plan the content, choose the style, draw the image, and then improve it with a self-critique loop.
- A new benchmark, PaperBananaBench, built from NeurIPS 2025 papers, tests how faithful, concise, readable, and aesthetic the figures are.
- Across 292 cases, PaperBanana beats strong baselines, with big gains in conciseness and readability and steady gains in faithfulness and aesthetics.
- For statistical plots, PaperBanana switches to code generation (like Matplotlib) to ensure numerical accuracy, while still polishing style.
- A stylist agent automatically learns an academic style guide from many examples, raising visual quality without heavy manual rules.
- An ablation study shows each agent matters; the critic loop especially helps recover faithfulness lost during style polishing.
- The evaluation uses a VLM-as-a-Judge setup that correlates well with human ratings, making testing faster and more consistent.
- Limitations include raster (not easily editable vector) outputs, reduced style diversity, and occasional fine-grained connection errors.
- This work lowers the time and skill barrier for making professional research figures, speeding up how scientists share ideas.
Why This Research Matters
Clear, accurate figures help scientists explain and verify ideas faster, which speeds up discovery. PaperBanana reduces the time and special skills needed to make professional diagrams, so more researchers—especially students and small labs—can communicate at a high level. Better readability and faithfulness mean fewer misunderstandings and stronger peer review. Automated, code-based plots protect against numerical mistakes while still looking publication-ready. A consistent style guide raises overall quality across papers, making science easier to learn and teach. In short, this turns a slow, frustrating bottleneck into a reliable, fast step in the research workflow.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a great comic book uses both words and pictures so you instantly get what’s going on? Research papers are like that too—good figures make ideas click.
🥬 The Concept (Automated Academic Illustration): Automated academic illustration means using AI to turn research ideas and captions into exactly the kinds of diagrams and plots journals expect. How it works (big picture):
- Read the method text and caption.
- Plan what the figure should show.
- Choose a clean academic style.
- Draw it.
- Check and fix mistakes.
Why it matters: Without it, researchers spend hours wrestling with tools, and figures can end up messy or misleading. 🍞 Anchor: Imagine you wrote a new AI method. Instead of spending a weekend drawing boxes and arrows, you feed your text to an AI that returns a neat NeurIPS-style diagram.
The World Before: Large Language Models (LLMs) became strong at reading, writing, and even coding. But making publication-ready diagrams stayed slow and fussy. People tried two paths: (1) code-based drawing (like TikZ, PPTX, or SVG), which is precise but limited for rich, custom visuals; (2) image generation models, which can make pretty images but often miss small details, misplace arrows, or render text too poorly for academic standards.
🍞 Hook: Imagine building a LEGO city. Instructions (code-based) are precise but can’t easily express custom, cool shapes; freehand drawing (image models) looks creative but risks missing the blueprint.
🥬 The Concept (Methodology Diagrams): Methodology diagrams are those big overview pictures that explain how a new AI method works—modules, arrows, losses, inputs/outputs. How it works: They must convert technical steps into boxes, connections, and labels. Why it matters: If they’re wrong or unclear, readers misunderstand the science. 🍞 Anchor: A diffusion model paper’s figure must show where noise goes, how features connect, and what gets trained; a wrong arrow can flip the story.
The Problem: Researchers needed a way to keep the precision of code-like structure while getting the expressiveness and modern look of top-tier papers—without spending endless hours. Plus, there wasn’t a dedicated, rigorous benchmark tailored to methodology diagrams to measure real progress.
Failed Attempts:
- Pure code-based diagrams hit expressiveness limits (special icons, soft pastels, nuanced layouts).
- Pure image generation looked nice but often hallucinated text or mixed up connections.
- Few-shot prompting helped a little but lacked stable, consistent academic styling and structure.
🍞 Hook: You know how having a good recipe plus photos of the final dish makes cooking easier?
🥬 The Concept (VLMs): Vision-Language Models (VLMs) are AIs that understand both text and images and can reason about them together. How it works: They read text, look at images, compare, and explain. Why it matters: They can plan figures, judge quality, and give targeted feedback to improve images. 🍞 Anchor: A VLM can look at your diagram and your caption and tell you, “This arrow should go to the decoder, not the encoder.”
The Gap: We needed a multi-step AI “crew” that: (1) looks up useful example diagrams, (2) plans content carefully, (3) applies a learned academic style, (4) draws images, and (5) self-critiques to fix issues—plus a benchmark to test it fairly.
Real Stakes: Better, faster figures mean clearer science, fewer misread results, and more time for experiments. For students and labs without design support, this levels the playing field. It also reduces burnout from the most fiddly part of paper writing.
🍞 Hook: Think of a school science fair. Clear posters win attention and trust.
🥬 The Concept (Evaluation Dimensions): Four keys define a great figure—Faithfulness (matches the method), Conciseness (no clutter), Readability (easy to follow), and Aesthetics (professional look). How it works: A judge compares the AI figure to the human reference on these four. Why it matters: Without these checks, you might get pretty but wrong, or correct but unreadable. 🍞 Anchor: A bar chart with perfect numbers but tiny, blurry labels fails readability; a sleek diagram with a wrong arrow fails faithfulness.
02 Core Idea
🍞 Hook: Imagine a movie crew—researcher is the screenwriter, but you still need casting, set design, filming, and editing to make a blockbuster.
🥬 The Concept (PaperBanana): PaperBanana is a team of five AI agents (Retriever, Planner, Stylist, Visualizer, Critic) that turn method text and a caption into a polished, accurate academic figure using references, a learned style guide, and an iterative self-critique loop. How it works:
- Retriever finds good example figures.
- Planner writes a precise, step-by-step blueprint of the target figure.
- Stylist applies an academic style guide learned from many papers.
- Visualizer draws the figure.
- Critic checks it against the text and asks for fixes; repeat ~3 times.
Why it matters: Each specialist reduces a common failure—missing structure, messy style, drawing errors, or unfaithful details. 🍞 Anchor: Input: “Overview of our framework.” Output: a clean left-to-right flow with labeled modules, correct arrows, readable text, and modern colors—ready for a NeurIPS submission.
The Aha Moment in one sentence: Split the hard problem of academic figure-making into specialized AI roles and let references and self-critique guide the whole pipeline.
Three analogies:
- Kitchen: Retriever is the shopper (finds ingredients), Planner is the recipe writer, Stylist is the plating chef, Visualizer is the cook, Critic is the taste-tester; repeat until delicious.
- Orchestra: Retriever picks scores, Planner arranges parts, Stylist sets the tone, Visualizer plays, Critic conducts rehearsals; repeat to reach harmony.
- Lego build: Retriever finds similar sets, Planner writes clear build steps, Stylist picks the color theme, Visualizer assembles, Critic checks the structure; repeat till sturdy and stylish.
Before vs After:
- Before: One big prompt to an image model—often pretty but off-target, or correct but ugly, with no consistent fix loop.
- After: Structured pipeline with references, a learned academic style, and repeated check-and-fix cycles; better faithfulness, clarity, and aesthetics.
🍞 Hook: Ever followed a maze faster by looking at a solved example first?
🥬 The Concept (Reference-Driven Planning): Using example figures to guide structure and style helps the AI learn layouts and visual norms quickly. How it works: Retrieve similar diagram types and study their composition; write a plan that mirrors good practices. Why it matters: Without examples, plans become verbose and scattered, and diagrams drift from academic norms. 🍞 Anchor: Want a pipeline diagram? Seeing past pipelines helps choose rounded boxes, elbow arrows, light zone backgrounds, and big left-to-right flow.
🍞 Hook: Think of a school’s dress code—it keeps everyone looking neat without picking the same outfit.
🥬 The Concept (Automatic Style Guide): The Stylist scans many papers to synthesize rules on colors, shapes, arrows, and fonts that match modern academic taste. How it works: Summarize common palettes, container styles, arrow types, and typography from a large reference set; apply them consistently. Why it matters: Without a guide, figures mix clashing palettes, odd fonts, and confusing arrows, hurting readability. 🍞 Anchor: Soft pastel zones, rounded rectangles for processes, dashed lines for auxiliary flows, sans-serif labels plus italic serif math: instant “NeurIPS look.”
🍞 Hook: You know how checking homework catches small mistakes before a test?
🥬 The Concept (Iterative Self-Critique): The Critic inspects the draft image against the original text and caption, then revises the description so the Visualizer can redraw. How it works: Generate → Review → Edit description → Regenerate, for ~3 rounds. Why it matters: One-shot images lock in mistakes; iteration fixes miswired arrows, missing labels, or overcrowding. 🍞 Anchor: First draft misses a skip-connection; the Critic adds, “draw dashed skip from Encoder to Decoder,” and the fix appears next round.
🍞 Hook: When exact numbers matter, you use a calculator, not a paintbrush.
🥬 The Concept (Code-Based Plots): For statistical plots, PaperBanana converts the plan into Matplotlib code so values are exact. How it works: Visualizer writes Python plotting code; render; Critic checks and requests code tweaks; repeat. Why it matters: Pure image models can make pretty but numerically wrong charts (hallucinated bars, duplicated categories). 🍞 Anchor: A heatmap’s cell values must match the data table; code ensures that every number lands exactly where it should.
03 Methodology
At a high level: Input (method text + caption) → Retriever → Planner → Stylist → Visualizer ↔ Critic (3 rounds) → Final figure.
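To make that control flow concrete before the step-by-step recipe, here is a minimal Python sketch. Every helper is a hypothetical stub (the paper's actual prompts and model APIs are not shown); only the orchestration pattern is the point.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the five agents; real versions would call a VLM
# or image model. This sketch only illustrates the control flow.

@dataclass
class Feedback:
    ok: bool
    notes: str = ""

def retrieve(source, caption, refs, n=10):
    """Retriever: pick N reference triplets (stub: just take the first N)."""
    return refs[:n]

def make_plan(source, caption, examples):
    """Planner: write an unambiguous figure description P (stub)."""
    return f"Plan for: {caption}"

def apply_style(plan, guide):
    """Stylist: rewrite P into the optimized description P* (stub)."""
    return f"{plan} | style: {guide}"

def render(plan):
    """Visualizer: draw image I_t from the description (stub)."""
    return f"<image of: {plan}>"

def critique(image, source, caption):
    """Critic: compare I_t with S and C, return targeted fixes (stub)."""
    return Feedback(ok=True)

def paperbanana(source, caption, refs, guide="learned academic style", T=3):
    plan = apply_style(make_plan(source, caption, retrieve(source, caption, refs)), guide)
    image = None
    for _ in range(T):                        # Visualizer <-> Critic loop
        image = render(plan)
        fb = critique(image, source, caption)
        if fb.ok:
            break
        plan = f"{plan} | fix: {fb.notes}"    # Critic refines P_{t+1}
    return image

print(paperbanana("S: modules A->B->C with a loss", "Our training pipeline", refs=[]))
```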
Step-by-step recipe with why and examples:
- Inputs
- What: Source context S (method text) + communicative intent C (caption like “Overview of our framework”).
- Why: The figure must match both content and focus; otherwise, you might draw the wrong part of the method.
- Example: S describes modules A→B→C with a loss; C says “Our training pipeline.”
🍞 Hook: Finding the right outfit in your closet goes faster if you first look at photos of similar outfits. 🥬 Concept (Retriever Agent): Picks N example triplets (text, caption, figure) from a curated reference set. How: A VLM ranks candidates by diagram type and structure, prioritizing visual layout over topic. Why: Examples anchor planning and style; without them the plan gets verbose and style drifts. 🍞 Anchor: For a pipeline, it selects past pipelines with zones, rounded modules, and elbow arrows.
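To illustrate the "visual layout over topic" priority, here is a toy scoring function. The real Retriever uses VLM reasoning rather than keyword overlap, and the fields and weights below are purely assumed.

```python
# Toy stand-in for the Retriever's ranking: diagram type/layout match is
# weighted far above topic overlap. Fields and weights are assumptions.

def retrieval_score(candidate, query, w_layout=0.8, w_topic=0.2):
    layout_match = float(candidate["diagram_type"] == query["diagram_type"])
    shared = set(candidate["keywords"]) & set(query["keywords"])
    topic_match = len(shared) / max(len(query["keywords"]), 1)
    return w_layout * layout_match + w_topic * topic_match

refs = [
    {"diagram_type": "pipeline", "keywords": ["diffusion", "vision"]},
    {"diagram_type": "bar_chart", "keywords": ["training", "pipeline"]},
]
query = {"diagram_type": "pipeline", "keywords": ["training", "pipeline"]}
best = max(refs, key=lambda r: retrieval_score(r, query))
print(best["diagram_type"])  # "pipeline": layout wins despite weaker topic overlap
```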
- Planner
- What: Writes a detailed, unambiguous description P of the target figure using S, C, and the retrieved examples E.
- Why: The Visualizer needs precise instructions (modules, labels, arrow directions, line styles) to avoid guesswork.
- Example: “Left-to-right; three rounded boxes: Encoder, Fusion, Decoder; dashed skip from Encoder to Decoder; red Loss box.”
🍞 Hook: Blueprints make buildings safe. 🥬 Concept (Planner Agent): Turns messy text into a step-by-step visual blueprint. How: In-context learning from retrieved examples to mirror proven layouts and label conventions. Why: Without a blueprint, images become inconsistent, cluttered, or miss key parts. 🍞 Anchor: The plan says “dashed lines = auxiliary flow,” preventing style confusion later.
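One hypothetical way to picture such a blueprint: encode the plan's modules, connections, and line semantics as structured data. The paper's Planner emits a textual description; this structured form is only illustrative.

```python
# Illustrative machine-checkable blueprint (an assumption, not the paper's
# actual plan format): nodes, edges, and line semantics as data.

blueprint = {
    "layout": "left-to-right",
    "nodes": [
        {"id": "enc", "label": "Encoder", "shape": "rounded"},
        {"id": "fus", "label": "Fusion", "shape": "rounded"},
        {"id": "dec", "label": "Decoder", "shape": "rounded"},
        {"id": "loss", "label": "Loss", "shape": "rounded", "color": "red"},
    ],
    "edges": [
        {"from": "enc", "to": "fus", "style": "solid"},
        {"from": "fus", "to": "dec", "style": "solid"},
        {"from": "enc", "to": "dec", "style": "dashed"},  # dashed = auxiliary flow (skip)
        {"from": "dec", "to": "loss", "style": "solid"},
    ],
}
```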
- Stylist
- What: Applies a learned academic style guide G to produce P* (optimized description).
- Why: Consistency in colors, arrows, and fonts boosts readability and professionalism.
- Example: Switch to soft pastels for zones, sans-serif labels, italic serif for math, elbow connectors for networks.
🍞 Hook: School uniforms keep everyone neat, while still allowing small personal touches. 🥬 Concept (Stylist Agent): Learns and applies a “NeurIPS look” guideline from many references. How: Summarizes common palettes, shapes, line semantics, and typography, then rewrites P accordingly. Why: Without it, figures look dated or mismatched. 🍞 Anchor: It turns “gray boxes with black borders” into “rounded modules with pastel zone backgrounds and clear arrows.”
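A sketch of how a distilled style guide might be represented and applied to a structured plan like the blueprint above; the rule names and values are assumptions, not the paper's learned guide.

```python
# Illustrative style guide as data. The Stylist learns such rules
# automatically from references; the values here are assumed.

STYLE_GUIDE = {
    "zone_palette": ["#FDEBD0", "#D6EAF8", "#D5F5E3"],  # soft pastels
    "process_shape": "rounded",
    "auxiliary_edge_style": "dashed",
    "label_font": "sans-serif",
    "math_font": "italic serif",
}

def apply_style_guide(plan: dict, guide: dict) -> dict:
    """Rewrite a structured plan so every element conforms to the guide."""
    for node in plan.get("nodes", []):
        node.setdefault("shape", guide["process_shape"])
        node.setdefault("font", guide["label_font"])
    for i, zone in enumerate(plan.get("zones", [])):
        zone["background"] = guide["zone_palette"][i % len(guide["zone_palette"])]
    return plan
```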
- Visualizer + Critic (loop for T=3)
- What: Visualizer turns P_t into an image I_t; Critic compares I_t with S and C, then refines P_{t+1}.
- Why: Iteration catches subtle errors and improves clarity.
- Example: Round 1 misses an arrow; Critic adds instruction; Round 2 fixes it; Round 3 adjusts spacing for readability.
🍞 Hook: Draft, proofread, rewrite—better every pass. 🥬 Concept (Visualizer Agent): Draws the picture from the optimized description. How: Uses an image generator for diagrams; for plots, writes Matplotlib code. Why: Without a faithful renderer, even perfect plans fail. 🍞 Anchor: It renders a legible 21:9 pipeline diagram at high resolution.
🍞 Hook: A coach replays the game to point out exactly what to fix next practice. 🥬 Concept (Critic Agent): Checks images against S and C to find misalignments and glitches, then rewrites instructions. How: VLM reasoning locates wrong arrows, missing labels, or clutter. Why: Without critique, first-draft mistakes persist. 🍞 Anchor: “Add dashed skip from Encoder to Decoder; increase font size; reduce overlapping lines.”
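To show what machine-actionable critique could look like, here is a hypothetical structured-feedback round; the JSON schema and the sample response are assumptions, not the paper's actual format.

```python
import json

# Hypothetical critique round: the Critic compares the draft I_t with the
# method text S and caption C, then emits targeted edits for P_{t+1}.

critique_prompt = (
    "Compare the draft figure against the method text and caption. "
    'Return concrete edits as JSON: [{"op": ..., "detail": ...}].'
)

sample_vlm_response = """[
  {"op": "add_edge",    "detail": "dashed skip from Encoder to Decoder"},
  {"op": "resize_font", "detail": "increase all labels to at least 14pt"},
  {"op": "declutter",   "detail": "reduce overlapping lines near Fusion"}
]"""

for edit in json.loads(sample_vlm_response):
    print(f'{edit["op"]}: {edit["detail"]}')  # folded back into the description
```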
Extension to statistical plots:
- Visualizer emits Python code (Matplotlib) for numerical precision.
- Critic inspects the rendered plot and requests code edits (legends, ticks, colors, hatches, annotations).
- Result: Plots that are both accurate and styled per academic norms.
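As a concrete example of this code path, here is the kind of Matplotlib snippet the Visualizer might emit for a simple comparison chart; the numbers and styling choices are illustrative only, not results from the paper.

```python
import matplotlib.pyplot as plt

# Sketch of the code-based Visualizer path: values come straight from data,
# styling follows academic norms. All numbers below are made up.

methods = ["Vanilla", "Few-shot", "Ours"]
scores = [62.3, 68.1, 79.4]  # hypothetical overall scores

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(methods, scores, color=["#AED6F1", "#A9DFBF", "#F5B7B1"],
              edgecolor="black", linewidth=0.8)
ax.bar_label(bars, fmt="%.1f")                   # exact values on each bar
ax.set_ylabel("Overall score")
ax.set_ylim(0, 100)
ax.spines[["top", "right"]].set_visible(False)   # cleaner academic look
fig.tight_layout()
fig.savefig("overall_scores.png", dpi=300)
```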
The secret sauce:
- Reference-driven planning ensures structural correctness and genre-appropriate layouts.
- An auto-learned style guide locks in modern, consistent aesthetics.
- The Visualizer–Critic loop raises faithfulness and clarity over iterations.
- A hybrid renderer (image for expressive diagrams, code for precise plots) prevents numerical hallucinations.
04 Experiments & Results
🍞 Hook: You know how science fairs use rubrics to judge posters so everyone is graded fairly?
🥬 The Concept (PaperBananaBench): A benchmark of 292 methodology diagram cases (with 292 paired references) from NeurIPS 2025, used to test how well systems generate publication-ready figures. How it works: Compare AI-generated diagrams to the human originals using a VLM judge on four dimensions—Faithfulness, Conciseness, Readability, Aesthetics—with a hierarchical rule favoring truth and clarity first. Why it matters: Without a strong, domain-matched test, progress is guesswork. 🍞 Anchor: Given a vision paper’s method section and caption, the judge asks which diagram better matches and communicates the content.
The Test:
- Inputs: Method text and caption; system outputs a diagram.
- Judge: A VLM compares AI vs. human on the four dimensions. Primary dimensions (Faithfulness + Readability) decide the outcome; secondary dimensions (Conciseness + Aesthetics) break ties.
- Reliability: Correlates well with other models and with humans (tau in the ~0.41–0.60 range, depending on dimension), indicating consistent, valid judgments.
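In code, that hierarchical rule might look like the following sketch, assuming per-dimension numeric scores on a shared scale (the field names and scale are assumptions).

```python
# Hierarchical comparison as described: primary dimensions decide; secondary
# dimensions only break primary ties. Scores and scale are assumed.

PRIMARY = ("faithfulness", "readability")
SECONDARY = ("conciseness", "aesthetics")

def judge(a: dict, b: dict) -> str:
    """Return 'A', 'B', or 'tie' given per-dimension scores for two figures."""
    for dims in (PRIMARY, SECONDARY):
        sa, sb = sum(a[d] for d in dims), sum(b[d] for d in dims)
        if sa != sb:
            return "A" if sa > sb else "B"
    return "tie"

a = {"faithfulness": 4, "readability": 4, "conciseness": 3, "aesthetics": 5}
b = {"faithfulness": 4, "readability": 4, "conciseness": 4, "aesthetics": 3}
print(judge(a, b))  # primary tied (8 vs 8), secondary decides (8 vs 7) -> "A"
```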
The Competition:
- Vanilla: Directly prompt the image model.
- Few-shot: Add 10 example triplets to the prompt.
- Paper2Any: An agentic diagram tool focusing on high-level ideas rather than fine-grained method flows.
- Backbones: Gemini-3-Pro as VLM; Nano-Banana-Pro and GPT-Image-1.5 for image generation.
The Scoreboard (contextualized):
- PaperBanana (with Nano-Banana-Pro) vs. Vanilla Nano-Banana-Pro:
  - Faithfulness: +2.8% (like moving from B to B+ on accuracy).
  - Conciseness: +37.2% (from wordy to crisp—huge cleanup).
  - Readability: +12.9% (clearer layouts and labels).
  - Aesthetics: +6.6% (more professional look).
  - Overall: +17.0% (a solid overall lift).
- GPT-Image-1.5 struggled with instruction-following and text rendering, leading to weak scores.
- Paper2Any underperformed here because this benchmark demands precise method flows, not just big-picture figures.
Surprising findings:
- Random retrieval nearly matched semantic retrieval: having any good academic patterns beats none at all.
- The Stylist boosts aesthetics and conciseness but can drop faithfulness if unchecked; the Critic loop recovers this.
- Category-wise, Agent & Reasoning scored highest overall; dense Vision & Perception figures were harder.
Ablations (what each agent contributes):
- Removing iterations hurts all metrics; 3 iterations strike the best balance between looks and truth.
- No retriever: plans get verbose; readability and aesthetics drop.
- Random retriever: still good—structure/style priors matter greatly.
- Without Stylist: less polished; with Stylist but no Critic: prettier but risk of missed details; with Critic: both pretty and right.
Statistical plots:
- PaperBanana (code-based Visualizer) beats vanilla VLM prompting on faithfulness and still lifts readability and aesthetics, for a +4.1% overall gain.
- Image generation makes pretty plots but can duplicate categories or misplace values in dense charts; code keeps numbers exact.
05 Discussion & Limitations
Limitations:
- Raster outputs: Hard to edit like vector graphics; 4K helps quality but not editability.
- Style diversity: A unified guide boosts consistency but can reduce creative variety.
- Fine-grained faithfulness: Subtle connection errors (arrow endpoints, directions) can slip past the critic.
- Evaluation: VLM-as-a-Judge is efficient but still imperfect for fine structural checks and subjective aesthetics.
Required resources:
- Strong VLM and image/code generation backbones, a curated reference set, and compute for multi-round refinement.
- For plots, a Python environment for Matplotlib rendering.
When not to use:
- When you must deliver fully editable vector diagrams (e.g., Illustrator/SVG hand-tuning is essential).
- Extremely dense, number-critical plots if you cannot use code-based rendering.
- Highly unusual, artistic figure styles that deviate from academic norms.
Open questions:
- Can we generate fully editable vector outputs end-to-end via a GUI agent?
- How to keep high style quality while expanding aesthetic diversity per user preference?
- Can critics detect and fix tiny structural mistakes reliably (better visual perception and graph reasoning)?
- Can future evaluation include structure-aware graph metrics and learned aesthetic rewards for tighter human alignment?
- How best to scale test-time candidate generation and VLM-based selection to cover diverse tastes while saving compute?
06 Conclusion & Future Work
Three-sentence summary: PaperBanana turns method text and a caption into a polished academic figure by coordinating five specialized AI agents that retrieve examples, plan content, apply an academic style, render, and then self-critique over several rounds. A new benchmark, PaperBananaBench, shows strong improvements over baselines in faithfulness, conciseness, readability, and aesthetics, and the approach extends cleanly to code-based statistical plots. Together, these advances reduce the time and expertise needed to create publication-ready illustrations.
Main achievement: Showing that a reference-driven, multi-agent, iterative pipeline can reliably produce publication-quality methodology diagrams and accurate, well-styled statistical plots.
Future directions: End-to-end vector graphic generation via GUI agents; richer, preference-aware style control; stronger critics for fine-grained structural faithfulness; structure- and reward-based evaluation; generate-and-select test-time scaling for diverse outputs.
Why remember this: Clear figures speed up science. PaperBanana demonstrates that “figure-making” is not a single-shot prompt but a coordinated craft—retrieve, plan, style, render, critique—making complex ideas easier to see, check, and share.
Practical Applications
- Turn a method section and caption into a ready-to-submit overview figure in minutes.
- Generate accurate, styled bar/line/scatter/heatmap plots from tabular data via Matplotlib code.
- Refine an existing human-made figure by applying the learned academic style guide for cleaner visuals.
- Produce multiple candidate figures at test time and pick the best with a VLM-based preference check.
- Create consistent figures across a whole paper (shared colors, fonts, arrow styles) automatically.
- Convert verbose, messy diagram drafts into concise, readable versions through the critic loop.
- Rapidly prototype diagram layouts by retrieving and adapting structures from similar references.
- Use the benchmark rubric to self-check figures for faithfulness, readability, conciseness, and aesthetics.
- Teach newcomers good diagramming norms by showing retrieved references plus the synthesized style guide.
- Automate conference camera-ready touch-ups (font sizes, color contrast, legend placement) before submission.