Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Key Summary
- This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.
- Instead of only checking if the code runs, the model also looks at the finished picture and learns from it.
- The authors use reinforcement learning (RL) with a special method called GRPO that compares several candidate answers for each question and keeps improving.
- They design a multi-objective reward that scores three things at once: the text answer, the code's executability and intent, and the chart's readability and correctness.
- Their system, RL-Text2Vis, raises code execution success from 78% to 97% compared to the same model without RL.
- It achieves about a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark.
- Supervised fine-tuning helps a little, but it cannot optimize what only shows up after you actually render the chart; RL with post-execution feedback can.
- The method generalizes well to other datasets like VIS-Eval, NVBench, and PandasPlotBench without extra training.
- GRPO works well here because it avoids training a critic model and uses group-based comparisons and a KL safety tether to a reference model.
- The approach is open-source friendly and practical for organizations that need privacy and cost control.
Why This Research Matters
Clear, correct charts help people make better decisions in business, education, journalism, and healthcare. This work shows how to train models to value what humans care about most: the final, interpretable picture and the correct takeaway. Because the method is open-source friendly, privacy-focused teams can deploy strong systems without relying on closed models. By combining text, code, and vision into one learning signal, the approach reduces broken code, misleading visuals, and wasted time. Its strong generalization suggests that once trained this way, models can handle new tables and questions with confidence. The framework also offers an auditable path to improving quality over time by adjusting reward weights or evaluators.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how when you ask a friend to make a graph from a spreadsheet, you don't just care that their code runs; you care that the graph actually answers your question and is easy to read. If the bars or lines tell the wrong story, the whole point is lost.
Filling (The Actual Concept: Text-to-Visualization, introduced here)
- What it is: Text-to-Visualization (Text2Vis) is about turning a natural-language question about a table into (1) a short, direct answer and (2) runnable code that draws the right chart.
- How it works (recipe):
- Read the user's question and the table.
- Figure out what needs to be computed (filters, groups, metrics).
- Produce the final answer text and code that makes a chart (e.g., Matplotlib or similar).
- Run the code to render the chart so people can see the result.
- Why it matters: If we only generate code without checking the final picture, we risk charts that are technically valid but misleading or hard to read, and users might make wrong decisions.
Bottom Bread (Anchor): Imagine asking, "Which year had the biggest increase in renewable energy share?" You want the chart to compare year-over-year changes in share (not raw quantity) and the text to say the correct year. A wrong chart type or wrong data column ruins the answer even if the code runs.
The World Before: Early systems used rules or templates to produce visualization specs. These were safe but stiff: they could draw simple charts, but struggled with diverse, messy, and realistic questions. Newer large language models (LLMs) brought flexibility: they can write chart code and answers from natural language. But this freedom came with new problems.
The Problem: Even top closed-source LLMs often generate runnable code that draws the wrong thing (misaligned with the question), or the chart looks cluttered and unreadable. Open-source models, which many teams prefer for cost and privacy, more often fail to produce executable code at all. The killer issue: the qualities we truly care about, semantic correctness of the chart and its readability, only show up after we render the picture. Traditional training (supervised fine-tuning) tries to mimic example outputs token by token and cannot directly "feel" the final chart's goodness.
Failed Attempts: Prompt engineering can nudge a model but doesn't change what it has learned. Supervised fine-tuning (SFT) can make code more executable by aligning syntax and libraries but still can't optimize visual clarity or semantic alignment, because there's no training signal from the rendered chart. Single-metric RL for code (e.g., just "did it run?") helps executability but can't ensure the chart answers the right question or looks clean.
The Gap: What was missing is a way to train using the finished chart (post-execution, multimodal feedback) so the model learns not only to write code but to deliver a correct and readable visualization aligned with the question.
Real Stakes: In real life, a confusing or misaligned chart can sway business strategy, public policy, or news stories in the wrong direction. Analysts, teachers, journalists, and clinicians need charts that faithfully reflect the data and the question. Organizations also need private, affordable, and auditable systems, so open-source solutions that can match or beat proprietary ones are especially valuable.
02 Core Idea
Top Bread (Hook): Imagine baking cookies. It's not enough that the oven turns on (code runs). You also care that the cookies look right (visual clarity), taste right (semantic alignment), and match the recipe (answer correctness). You'd only know all this after you pull the tray out and look and taste.
The Concept in One Sentence (Aha!): Train the model with reinforcement learning that scores the final text, the code's behavior, and the rendered chart, so it learns to produce correct, clear, and question-aligned visualizations.
Multiple Analogies:
- Report Card Analogy: Instead of grading only "Did you submit on time?" (code runs), we grade "Is the essay accurate?", "Is it clear?", and "Is it on topic?" (text correctness, clarity, alignment) and then teach based on that.
- Cooking Analogy: Don't just check if the stove lights; check that the dish tastes like the recipe, is plated nicely, and uses the right ingredients.
- Sports Analogy: A team isn't judged only by scoring points; defense, teamwork, and fair play all count. The model should be rewarded for several objectives, not just one.
Before vs After:
- Before: Models often shipped charts that compiled but didn't answer the question or were messy. Training only looked at tokens or pass/fail execution.
- After: The model sees the final picture, gets multi-part feedback (text, code, vision), and learns to optimize everything that users actually see and need.
Why It Works (Intuition): The qualities we care about (a correct answer, executable and intent-matching code, and a readable, faithful chart) are visible only after running the code and looking at the image. By converting that post-execution judgment into a multi-objective reward, the model receives clear signals about what to improve. Group Relative Policy Optimization (GRPO) makes this efficient by comparing several candidate outputs per prompt, pushing the policy toward the best ones without training a separate critic.
Building Blocks (with Sandwich explanations of new concepts):
Hook: You know how a dog learns tricks by getting treats when it does the right thing? Reinforcement Learning (RL)
- What it is: RL is a way for a model to learn by trying actions and receiving rewards that say how good the result was.
- How it works:
- The model proposes an answer and code.
- We execute the code and examine the chart and text.
- We give a reward based on text correctness, code validity, and visual quality.
- The model updates itself to make higher-reward choices next time.
- Why it matters: Without RL, the model can't directly learn from how the final chart looks and whether it truly answers the question. Anchor: Ask, "Which year had the biggest increase in renewable share?" The model tries several outputs; it's rewarded most when the chart shows year-over-year share changes clearly and the text names the correct year.
Hook: Imagine a triathlon where you earn points for swimming, cycling, and running, not just one event. Multi-Objective Reward
- What it is: A scoring system that combines text correctness, code executability/intent, and visualization readability/correctness into one overall score.
- How it works:
- Format check: Did the model return valid JSON with 'answer' and 'code'?
- Text score: Does the short answer match the ground truth (with some tolerance)?
- Code score: Does the code run and align with the query's intent?
- Visual score: Is the chart readable and faithful to the data and question?
- Combine these scores with tuned weights so no single part dominates unfairly.
- Why it matters: If we only reward executability, we get runnable but misleading charts; if we only reward text, we may get good words but broken code or bad visuals. Anchor: A bar chart with clear labels and the right metric (share, not raw count) plus the correct year in the answer gets a high combined score.
Hook: Think of a classroom where several students attempt the same problem, and we learn most from comparing their answers. Group Relative Policy Optimization (GRPO)
- What it is: An RL method that samples multiple outputs for the same prompt, ranks them by reward, and nudges the model toward better ones, without training a separate critic network.
- How it works:
- Generate a group of candidate outputs (e.g., 8) for one question.
- Score each using the multi-objective reward after executing the code and inspecting the chart.
- Compute advantages by comparing each output's score to the group average.
- Update the policy with stable, clipped steps and keep it near a reference model via a KL safety tether.
- Why it matters: It's efficient, stable, and well-suited for long, structured outputs like code plus text. Anchor: For the renewables question, if outputs #2 and #5 are the most readable and correct, GRPO increases the chance the model produces outputs more like #2 and #5 next time.
03 Methodology
High-Level Overview (like a recipe): Input (question + table) → Policy generates 'answer' + 'code' in JSON → Stage 1: Format reward → Execute code to render chart → Stage 2: Compute text, code, and visualization rewards → Combine into one score → GRPO update using several candidate outputs per prompt → Improved policy.
Step-by-Step Details:
- Input Understanding
- What happens: The model receives a natural-language query and a tabular dataset.
- Why this step exists: The model must map words to data operations (filtering, grouping, aggregation) before it can answer or plot correctly.
- Example: Table columns: Year, Renewables_Share (%). Question: "Which year had the greatest increase in renewables share?"
- Structured Output Generation
- What happens: The policy model produces a JSON with 'answer' (short text) and 'code' (Python plotting code ending with plt.show()).
- Why this step exists: A fixed schema makes it easy to check structure, execute safely, and evaluate consistently.
- Example: answer: '2020'; code: computes year-over-year differences of Renewables_Share and draws a bar chart highlighting the max.
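To make the schema concrete, here is an illustrative output for the running renewables example. The 'answer' and 'code' fields follow the format described above; the specific values and the embedded plotting snippet are invented for illustration:

```json
{
  "answer": "2020",
  "code": "import matplotlib.pyplot as plt\nyears = [2018, 2019, 2020, 2021]\nincreases = [1.2, 0.8, 2.5, 0.8]\nplt.bar(years, increases)\nplt.xlabel('Year')\nplt.ylabel('Year-over-year increase in renewables share (%)')\nplt.title('Largest increase: 2020')\nplt.show()"
}
```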
- Stage 1: Format Reward (Gatekeeper)
- What happens: We verify the JSON structure and the presence of both fields. If missing or malformed, reward = 0 and we skip deeper checks.
- Why it exists: Prevents the model from drifting into free-form text or broken formats that can't be executed or judged.
- Example: If the model forgets the 'code' field, the sample gets zero reward.
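A minimal sketch of such a gatekeeper check, assuming the raw model output is a JSON string with 'answer' and 'code' fields (the exact validation used in the paper may differ):

```python
import json

def format_reward(raw_output: str) -> float:
    """Stage 1 gate: 1.0 only if the output parses as JSON and contains
    non-empty string fields 'answer' and 'code'; otherwise 0.0 and the
    deeper text/code/visual checks are skipped."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    answer, code = parsed.get("answer"), parsed.get("code")
    if isinstance(answer, str) and answer.strip() and isinstance(code, str) and code.strip():
        return 1.0
    return 0.0
```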
- Execute Code in a Sandbox
- What happens: The generated code runs in a restricted, safe environment; if it executes, we capture the rendered chart image.
- Why it exists: Many crucial qualities (label clarity, correct metric, chart type) only appear after rendering.
- Example: If code uses the wrong column or crashes on a shape mismatch, this becomes visible here.
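The sketch below shows one way to run generated code headlessly and capture the rendered figure. It assumes the table is exposed to the generated code as a pandas DataFrame named df; plain exec() is only a stand-in here and is not a hardened sandbox (the paper restricts the execution environment, and production use would need process isolation, timeouts, and import controls):

```python
from typing import Optional
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: plt.show() becomes a no-op, figures render off-screen
import matplotlib.pyplot as plt

def try_render(code: str, df) -> Optional[bytes]:
    """Execute model-generated plotting code and return the chart as PNG bytes,
    or None if execution fails or nothing was drawn."""
    plt.close("all")
    namespace = {"df": df, "plt": plt}  # assumption: generated code reads the table via `df`
    try:
        exec(code, namespace)  # run the generated code (illustration only, not a real sandbox)
    except Exception:
        return None            # execution failure: no chart to judge
    if not plt.get_fignums():  # code ran but produced no figure
        return None
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format="png")
    return buf.getvalue()
```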
- Stage 2: Composite Reward (Three Parts)
- Textual Correctness (R_text)
- What happens: An LLM judge compares the model's short answer to the ground-truth answer with tolerance for near matches.
- Why it exists: Users need a quick, accurate summary alongside the chart.
- Example: If the true year is 2020, but the model says "Year 2020," it matches; if it says 2019, it fails.
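In the paper this comparison is made by an LLM judge; purely as an illustration of "match with tolerance," a simplified deterministic stand-in (our own, not the paper's) could look like:

```python
import re

def lenient_text_match(predicted: str, ground_truth: str) -> float:
    """Normalize both answers and accept exact or containment matches,
    so 'Year 2020' matches '2020' while '2019' does not."""
    def normalize(text: str) -> str:
        text = text.lower().strip()
        text = re.sub(r"[^a-z0-9.%\s-]", " ", text)  # drop punctuation
        return re.sub(r"\s+", " ", text).strip()
    pred, truth = normalize(predicted), normalize(ground_truth)
    if not pred or not truth:
        return 0.0
    return 1.0 if pred == truth or truth in pred or pred in truth else 0.0
```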
- Code Reward (R_code)
- What happens: Two checks: executability (binary) and intent match (does the code compute the right thing?).
- Why it exists: Runnable code is necessary but not sufficient; it must implement the right analysis.
- Example: If the question asks for share but the code plots raw counts, intent match fails.
- Visualization Quality (R_vis)
- What happens: A vision-language model (VLM) scores the chart's readability (labels, layout) and correctness (faithfulness to the question and data), then averages them.
- Why it exists: Clean, interpretable visuals are critical for trust and usability.
- Example: Clear axis labels, suitable chart type (e.g., bars for differences), and non-cluttered legends raise the score.
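A sketch of how the two visual judgments could be turned into R_vis. The rubric wording and the 1-5-to-[0, 1] normalization are our assumptions; what the paper specifies is that a VLM scores readability and correctness and the two are averaged:

```python
VLM_RUBRIC = (
    "You are shown a chart that was generated to answer a question about a table.\n"
    "Rate two aspects from 1 to 5 and reply as JSON {\"readability\": r, \"correctness\": c}:\n"
    "- readability: labeled axes, legible fonts, uncluttered layout\n"
    "- correctness: the chart faithfully answers the question given the data\n"
)

def vis_reward(readability: int, correctness: int) -> float:
    """Average the two rubric scores after mapping each from 1-5 to [0, 1]."""
    def to_unit(score: int) -> float:
        return (score - 1) / 4.0
    return (to_unit(readability) + to_unit(correctness)) / 2.0

# e.g., readability 5 and correctness 4 -> (1.0 + 0.75) / 2 = 0.875
```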
- Weighted Combination
- What happens: Final reward R = 0.50*R_text + 0.25*R_code + 0.25*R_vis (weights chosen via a small grid search).
- Why it exists: Balances answer accuracy with code and visual quality so no single part dominates.
- Example: A sample with a perfect answer but messy chart won't get a top score; a balanced, high-quality output will.
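Putting the pieces together, the weighted combination (with the Stage 1 gate applied first) is a one-liner; the function name and signature here are ours, the weights are the paper's:

```python
def combined_reward(format_ok: bool, r_text: float, r_code: float, r_vis: float) -> float:
    """Two-stage reward: malformed outputs score 0; otherwise
    R = 0.50*R_text + 0.25*R_code + 0.25*R_vis."""
    if not format_ok:
        return 0.0
    return 0.50 * r_text + 0.25 * r_code + 0.25 * r_vis

# Values from the Concrete Walkthrough below: R_text = 1.0, R_code = 1.0, R_vis = 0.9
print(combined_reward(True, 1.0, 1.0, 0.9))  # 0.975
```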
- GRPO Policy Update
- What happens: For each prompt, the model generates a group of candidate outputs (e.g., 8). Rewards are standardized within the group to compute advantages (how much better or worse each sample is than the group average). The policy is updated with clipped steps and a KL penalty that keeps it close to a stable reference policy.
- Why it exists: Group-based relative learning stabilizes training, avoids training a critic network, and scales to long outputs like code.
- Example: If two candidates are excellent and the rest are average, GRPO increases the probability of producing excellent ones next time.
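A sketch of the group-relative advantage computation described above (only the advantage step; the full GRPO update additionally applies clipped policy-ratio steps and the KL penalty toward the reference model):

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each candidate's reward against its own group
    (e.g., the 8 samples drawn for one prompt). Candidates above the
    group mean get positive advantages and are reinforced."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of 8 candidate rewards for one prompt
print(group_advantages([0.975, 0.40, 0.55, 0.0, 0.62, 0.91, 0.30, 0.48]))
```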
The Secret Sauce (why this method is clever):
- It pushes learning beyond "did the code run?" to "did the final picture answer the question clearly?", which is what people truly care about.
- The two-stage reward prevents wasted computation on malformed outputs and focuses learning on structured, evaluable results.
- GRPO learns from several attempts at once, comparing and improving without needing a separate value model, making it efficient and stable.
- Using off-the-shelf open-source LLM/VLM judges gives strong, privacy-friendly, and practical signals during training.
Concrete Walkthrough:
- Input: Year, Renewables_Share (%) for 2017–2021. Question: "Which year had the biggest increase in renewables share?"
- Generation: The model outputs answer: '2020' and code that computes diff = share[t] - share[t-1], draws a bar chart of diffs, and highlights the max bar.
- Execution: Code runs; chart shows bars for each year's increase with labels.
- Reward: Text judge matches '2020' to ground truth (1.0). Code judge gives 1 for execution and 1 for intent (1.0). VLM scores readability 0.9 and correctness 0.9 (avg 0.9). Combined R = 0.5*1.0 + 0.25*1.0 + 0.25*0.9 = 0.975.
- GRPO: Among 8 candidates, this one ranks near the top, so the model updates toward similar future behavior.
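A minimal, self-contained version of what the walkthrough describes might look like the following (the table values are invented, and the exact code the model generates may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy version of the walkthrough table (share values are invented)
df = pd.DataFrame({
    "Year": [2017, 2018, 2019, 2020, 2021],
    "Renewables_Share": [14.0, 15.2, 16.0, 18.5, 19.3],  # percent
})

# diff = share[t] - share[t-1], as in the walkthrough
df["Increase"] = df["Renewables_Share"].diff()
changes = df.dropna(subset=["Increase"])

best_year = int(changes.loc[changes["Increase"].idxmax(), "Year"])
answer = str(best_year)  # short text answer, e.g. '2020'

# Bar chart of year-over-year increases, with the maximum bar highlighted
colors = ["tab:orange" if y == best_year else "tab:blue" for y in changes["Year"]]
plt.bar(changes["Year"].astype(str), changes["Increase"], color=colors)
plt.xlabel("Year")
plt.ylabel("Year-over-year increase in renewables share (%)")
plt.title(f"Largest increase in renewables share: {answer}")
plt.show()
```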
04 Experiments & Results
The Test: The authors evaluate four dimensions that reflect what users really want:
- Answer Correctness (binary match to ground truth).
- Code Executability (does the code run?).
- Chart Readability (1–5: labels, layout, fonts, clutter).
- Chart Correctness (1–5: does the visualization faithfully answer the question?). A combined pass requires the code to run, the answer to match, and both visual scores to be at least 3.5.
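Expressed as a check (a direct transcription of the criterion above; the function name is ours):

```python
def combined_pass(code_ran: bool, answer_match: bool,
                  readability: float, correctness: float) -> bool:
    """Combined pass: the code runs, the answer matches the ground truth,
    and both 1-5 visual scores are at least 3.5."""
    return code_ran and answer_match and readability >= 3.5 and correctness >= 3.5
```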
The Competition: They compare RL-Text2Vis against leading closed-source models (GPT-4o, Gemini 1.5/2.0 Flash), strong open-source baselines (Mistral, Llama 3.1, CodeLlama), Qwen2.5 in zero-shot and SFT forms, and even another architecture (Llama-3.1-8B) to show framework generality.
In-Domain Scoreboard (Text2Vis):
- RL-Text2Vis-14B jumps code execution success from 78% to 97% versus its zero-shot base, while chart readability rises from 3.12 to 4.10 and chart correctness from 2.94 to 4.03. Answer match also improves (29% → 35%).
- RL-Text2Vis-7B similarly boosts readability (2.81 → 3.84) and correctness (2.69 → 3.86) over zero-shot, with higher executability and answer match.
- Compared to GPT-4o, RL-Text2Vis achieves about a 22% relative improvement in chart quality (readability + correctness) and posts a comparable final pass rate (GPT-4o ~30%, RL-Text2Vis-14B ~29%), while clearly leading on visual clarity metrics.
- Supervised fine-tuning provides smaller gains than RL; SFT helps executability somewhat but cannot capture post-execution visual quality as effectively.
Out-of-Domain Generalization:
- VIS-Eval: RL-Text2Vis-7B improves code execution (57% → 72%), readability (1.50 → 2.50), and correctness (0.69 → 1.37), despite schema diversity and layout challenges.
- NVBench: RL-Text2Vis-7B lifts executability (75% → 93%), readability (2.64 → 3.47), and correctness (2.34 → 3.28). The 14B model goes further (up to 96% executability and 3.95/3.59 on readability/correctness).
- PandasPlotBench: Gains persist (e.g., 7B executability 65% → 75%, readability 2.42 → 3.32, correctness 2.49 → 3.37), showing robustness in plain Pandas plotting tasks.
Human vs Automated Judging:
- Human annotations on the official set correlate strongly with automated evaluations (Pearson r ≈ 0.88–0.91 across metrics), reinforcing that the improvements are real and perceptible.
Ablations and Surprises:
- Reward Component Ablation: Removing any component (format, answer, or code+visual) hurts performance; the full multi-objective reward performs best, confirming that all three parts matter.
- Group Size Matters: Sampling more candidates per prompt (8 vs 4) stabilizes learning and yields better results, aligning with GRPO's design philosophy.
- Scaling/Architecture Checks: Even a different family (Llama-3.1-8B) benefits from the same RL-Text2Vis recipe, and a small 3B model improves executability/readability though its answer accuracy plateaus, suggesting some reasoning limits at small scales.
Context for Numbers:
- Think of 97% executability as turning code crashes from about 1 in 5 to about 1 in 33, which means much less debugging.
- Raising readability/correctness by around a full point on a 1–5 scale is like moving from a cluttered, confusing slideshow to a clear, polished presentation.
- Beating GPT-4o on chart clarity while remaining open-source means teams can get top-tier visuals without relying on closed models.
05 Discussion & Limitations
Limitations:
- Compute and Memory: The 14B model performs best but needs substantial GPU resources; the 7B variant is more budget-friendly with solid gains.
- Domain Specialization: While results transfer well, highly specialized domains (e.g., nuanced medical or financial charts) aren't fully tested and may need domain-specific rewards or guardrails.
- Static Visuals Only: Interactive dashboards or multi-view analytics are out of scope for now; extending to interactivity is a key next step.
- Reward Model Bias: Although cross-judge checks show high agreement, any learned evaluator can have biases. Ongoing auditing is important.
Required Resources:
- GPUs for RL fine-tuning (e.g., A100/H100 class), a safe sandbox for code execution, and access to open-source LLM/VLM evaluators.
- A benchmark or in-house dataset with questions, tables, and ground-truth answers for text scoring.
When NOT to Use:
- Ultra-low-latency settings that cannot afford code execution or in-loop image scoring.
- Environments without safe sandboxes (security risk) or where code execution is disallowed.
- Tasks demanding strict formal guarantees beyond what learned evaluators can provide.
Open Questions:
- Interactive & Multiview Extension: How to define rewards for interaction quality (tooltips, brushing, filtering) and coordinated views?
- Human-in-the-Loop Feedback: Can lightweight human preferences be blended with automatic judges to reduce bias and improve faithfulness?
- Robustness & Safety: How to harden against adversarial prompts or data edge cases that could yield misleading visuals?
- Reward Hacking Prevention: How to detect when the model learns to game evaluators rather than truly improve visualization quality?
- Domain-Aware Rewards: Can we encode domain-specific best practices (e.g., clinical charting standards) into the reward to boost safety and trust?
06 Conclusion & Future Work
Three-Sentence Summary: The paper introduces RL-Text2Vis, a reinforcement learning framework that trains models to produce both a correct short answer and a clear, faithful chart from a natural-language question over tabular data. It uses GRPO with a multi-objective reward that evaluates text correctness, code validity, and visualization quality after actually running the code and rendering the chart. This approach substantially improves executability and visual quality, rivaling or surpassing proprietary systems on key visualization metrics while remaining open and practical.
Main Achievement: Turning the finished chartâand its readability and alignmentâinto a training signal via a carefully balanced, multi-objective reward, optimized efficiently with GRPO.
Future Directions:
- Add support for interactive and multi-view visualizations with new reward components that score interactivity and coordinated insights.
- Explore domain-specific evaluators (e.g., healthcare) and human-in-the-loop signals to improve safety and faithfulness.
- Scale to larger backbones and more diverse datasets, and further study reward design to prevent evaluator gaming.
Why Remember This: It shows how to teach models to care about what people actually see and use (the final picture), not just whether the code runs. By aligning text, code, and vision with post-execution feedback, RL-Text2Vis sets a template for training multimodal systems that must be both correct and clear.
Practical Applications
- BI copilots that answer questions over company spreadsheets and produce clean, trustworthy charts instantly.
- Self-service analytics in dashboards where users type questions and receive both an answer and an interpretable visualization.
- Education tools that turn student questions about datasets into readable plots for classroom learning.
- Newsrooms that need fast, accurate charts for data-driven stories with transparent code behind each figure.
- Healthcare analytics prototypes that plot cohort trends clearly (with domain safeguards and review).
- Financial reporting assistants that generate audit-friendly visuals and short summaries of key metrics.
- Data science assistants that draft correct plotting code from Pandas DataFrames to speed up EDA.
- Customer support analytics that visualize ticket trends, categories, and response times on demand.
- Quality assurance bots that re-check visualizations for readability/alignment before reports are published.
- Governance tools that store the text, code, and rendered chart for compliance and reproducibility.