ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought
Key Summary
- Chain-of-Thought (CoT) makes AI think step by step, but it is slow because it writes many tokens one by one.
- ReGuLaR teaches AI to think silently in a few hidden steps (latent states) instead of writing long explanations.
- It uses a Variational Auto-Encoder (VAE) so each hidden step is sampled from a smart distribution, not guessed blindly.
- The key trick is to turn the written reasoning into compact images and use their visual features to guide those hidden steps.
- This guidance (via KL divergence) keeps the hidden thoughts accurate while still being much shorter and faster.
- Across math datasets, ReGuLaR beats past latent-reasoning methods and even matches or beats explicit CoT in multimodal tasks.
- It can compress an entire reasoning chain into just one latent step and still outperform strong baselines.
- The model stays text-in/text-out at test time; images are only used to train better hidden thinking.
- It scales well to bigger models and different backbones, and needs fewer reasoning steps for the same or better accuracy.
Why This Research Matters
Fast and accurate reasoning makes AI more useful in classrooms, on phones, and in low-power settings by cutting token-by-token generation costs. By guiding hidden thinking with visual priors, we keep answers reliable while using far fewer steps. This opens the door to handling longer contexts and richer content without exploding compute needs. Training with images that include diagrams, charts, or molecule drawings teaches the model to use structure that text alone might miss. Because visuals are only used in training, everyday use stays simple: plain text in, plain text out. The approach scales across different model sizes and backbones, suggesting a robust path forward for efficient, high-quality reasoning.
Detailed Explanation
01 Background & Problem Definition
Hook (Chain-of-Thought): You know how you show your work in math class, writing each step so your teacher sees how you solved the problem? That's like how many AIs solve tough questions: they write out a long Chain-of-Thought (CoT) before giving the final answer. The Concept: CoT is an AI strategy where the model generates step-by-step text to reason.
- How it works: (1) Read the question, (2) write intermediate steps as tokens, (3) then produce the final answer.
- Why it matters: Without CoT, models often skip logic and make mistakes; with CoT, accuracy goes up, but it becomes slow because it types many tokens. Anchor: When an AI solves "What's 37×19?", CoT helps it write partial products before saying the result.
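For instance, an explicit CoT for that question might write the partial products first: 37×19 = 37×10 + 37×9 = 370 + 333 = 703, and only then state the answer 703.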
Hook (Variational Auto-Encoding): Imagine packing a suitcase for a long trip: you squeeze a lot into a small space but still keep what you need. The Concept: A Variational Auto-Encoder (VAE) is a way to compress information into a small, meaningful code (a distribution), then use it to reconstruct useful outputs.
- How it works: (1) Encode data into a distribution (mean and spread), (2) sample a code from it, (3) decode that code to get back something useful, (4) keep the code close to a helpful prior using a penalty.
- Why it matters: Without smart compression, important details get lost; the VAE keeps compressed codes informative and stable. Anchor: Like shrinking a big poster into a postcard that still lets you recognize the scene.
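To make the suitcase analogy concrete, here is a tiny NumPy sketch of the VAE recipe above (encode to a mean and spread, sample a code, keep it near a prior with a penalty). It is a toy illustration, not the paper's actual model; all sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # Map the input to a distribution over codes (mean and log-variance).
    return x @ W_mu, x @ W_logvar

def sample(mu, logvar):
    # Reparameterization: code = mean + std * noise, so gradients can flow.
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, logvar):
    # Penalty that keeps the sampled codes close to a simple prior N(0, I).
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

x = rng.standard_normal(16)               # toy "data" vector
W_mu = rng.standard_normal((16, 4)) * 0.1
W_logvar = rng.standard_normal((16, 4)) * 0.1

mu, logvar = encode(x, W_mu, W_logvar)
z = sample(mu, logvar)                    # the compact code: the "packed suitcase"
print(z.shape, kl_to_standard_normal(mu, logvar))
```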
The world before: Large Language Models (LLMs) got much better at tough problems by using CoT. But CoT is slow and costly at test time because the model must write many tokens one by one. Many of these tokens are repetitive or not crucial. Teams tried "latent reasoning," which means thinking in hidden vectors (continuous space) instead of visible words. This can be much faster because you skip writing out steps.
The problem: Early latent-reasoning methods often got worse accuracy. Why? They compressed the reasoning into a few hidden vectors but had little guidance to make those vectors carry the right meaning. As the model passed hidden states forward, tiny mistakes piled up (error accumulation), leading to semantic drift, like getting fuzzier at each step.
Hook (Latent Reasoning): Imagine doing math in your head instead of writing every step: much faster if you don't lose track. The Concept: Latent reasoning is solving problems via a few hidden states (vectors) instead of long text steps.
- How it works: (1) Read the question, (2) update a small number of hidden reasoning states, (3) produce the final answer from those states.
- Why it matters: Without good guidance, those hidden states drift and the answer comes out wrong. Anchor: It's like mental math: quick but risky if you forget a step; guidance keeps it accurate.
Failed attempts: Some methods fed the last hidden state back in as input in a loop (Coconut), or averaged/clustered token embeddings into fewer chunks (CoLaR). These helped speed but often dropped meaning because hidden states weren't anchored to a strong target. When there are no visible tokens, there's nothing to "lock onto."
The gap: We needed a way to tell the model exactly what each hidden step should capture, without making it write the full text. We needed a compact, information-rich guide for each hidden step.
Hook (Rendered CoT): Imagine turning your notes into a single page of neatly arranged pictures and words: easy to skim but still dense with meaning. The Concept: Rendered CoT is taking the written reasoning and turning it into images, then extracting dense visual features.
- How it works: (1) Split the reasoning into segments, (2) render each segment as an image, (3) use a visual encoder to get a compact vector, (4) use it to guide each hidden state.
- Why it matters: Without this visual guidance, hidden states lose structure; with it, they stay faithful to the original logic. Anchor: Like a study guide sheet with highlighted steps that keeps your memory on track.
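As a rough picture of steps (2) and (3), the sketch below renders one reasoning segment as a small image and pools it into a single feature vector. The Pillow rendering and the average-pooling "encoder" are stand-ins chosen for illustration; the actual method would use its own renderer and a strong, frozen visual encoder.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_segment(text: str, size=(256, 64)) -> Image.Image:
    # Render one reasoning segment as a compact grayscale image.
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((4, 4), text, fill=0)
    return img

def toy_visual_encoder(img: Image.Image, dim: int = 64) -> np.ndarray:
    # Stand-in for a real visual encoder: pool pixel blocks into a fixed-size vector.
    # The actual method would use a frozen OCR-style encoder, not simple pooling.
    arr = np.asarray(img, dtype=np.float32) / 255.0
    blocks = np.array_split(arr.flatten(), dim)
    return np.array([b.mean() for b in blocks])

segment = "5 x 4 = 20 seats are filled"
feature = toy_visual_encoder(render_segment(segment))
print(feature.shape)  # (64,) -> one dense vector guiding one latent step
```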
Real stakes: Faster reasoning lowers cost, reduces energy, and makes LLMs practical on devices with limited compute. Better compression means longer or multimodal contexts fit easily. And if you can guide hidden thinking with visual anchors, you can even inject diagrams or graphs during training so the model learns to reason across text and visuals, yet still answer with plain text at test time.
Hook (Compressed Reasoning State): Think of a summary card that captures a whole chapter of a book. The Concept: A compressed reasoning state is a short, dense hidden vector that stands in for many text steps.
- How it works: (1) Map a chunk of reasoning to a distribution, (2) sample a vector (the state), (3) pass it forward to build the next state, (4) decode the final answer from the sequence of states.
- Why it matters: Without careful design, summaries miss key facts; the right constraints keep them meaningful. Anchor: It's like a flashcard that really captures the key proof steps, not just buzzwords.
Hook (Visual-Textual Compression): Have you noticed how a small infographic can explain a whole page of text? The Concept: Visual-textual compression turns long text into images whose visual features carry dense meaning.
- How it works: (1) Render text on a page with high information density, (2) use a visual encoder to turn it into a compact vector, (3) feed that vector as a guide.
- Why it matters: Without dense carriers, compression throws away structure; images keep layout and relations. Anchor: A well-designed cheat sheet lets you recall the whole topic at a glance.
Hook (KL Divergence Regularization): When you summarize a story, you check that your summary stays close to the original plot. The Concept: KL divergence regularization nudges the model's sampled hidden state to stay close to a target prior built from the rendered segment.
- How it works: (1) Build a prior from the image features, (2) build a posterior from the modelās current context, (3) penalize their difference, (4) train so posterior matches prior while still solving the task.
- Why it matters: Without this pull, hidden states drift; with it, each state stays on-topic. Anchor: Like a teacher's rubric keeping student summaries faithful to the source.
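For the mathematically curious, the penalty has a simple closed form when both the posterior and the prior are diagonal Gaussians. The snippet below computes it; the unit-variance prior is an assumption made for this sketch, not a detail taken from the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal covariances, summed over dimensions.
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Posterior predicted from the question + previous latent states (toy numbers).
mu_q, logvar_q = np.array([0.2, -0.1]), np.array([-1.0, -1.2])
# Prior mean taken from the rendered segment's visual feature; unit variance assumed.
mu_p, logvar_p = np.array([0.5, 0.0]), np.zeros(2)

print(kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p))
```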
Hook (Multi-Modal Reasoning): Sometimes a picture (like a chart) plus text tells the full story. The Concept: Multi-modal reasoning uses both words and visuals to think better.
- How it works: (1) Combine text and diagrams during training, (2) compress both into a unified guide, (3) learn hidden steps that understand cross-modal relations.
- Why it matters: Without visuals, some tasks miss crucial structure; with them, complex problems become simpler. Anchor: Reading a science lab sheet with a diagram makes the steps easy to follow.
02 Core Idea
Aha! One-sentence insight: If we turn the model's written reasoning into compact pictures and use those pictures to guide a variational latent-thinking process, we can make the model think in a few faithful hidden steps instead of writing many tokens.
Three analogies:
- Study guide: Write a long solution once, then make a one-page visual cheat sheet. Next time, use the cheat sheet to think faster without re-writing everything.
- GPS vs. wandering: Hidden states used to wander; now rendered images act like GPS pins that keep each step on the right path.
- Zipped video: You can stream a movie smoothly when it's well-compressed; the visual prior is the smart codec that keeps quality while shrinking size.
Before vs. After:
- Before: CoT was accurate but slow because it had to write many tokens. Latent methods were fast but lost accuracy because the hidden states lacked guidance.
- After: ReGuLaR keeps speed (few hidden steps) and restores accuracy by anchoring each hidden step to a dense, image-derived prior of the corresponding reasoning segment. In multimodal settings, it can even beat explicit CoT because the visual prior captures structure text misses.
Why it works (intuition, no equations):
- The VAE setup makes each hidden step a sample from a learned distribution. But sampling only helps if the target distribution is informative. The rendered segment provides that target (a prior), so the sampled hidden state is pulled toward the right meaning. The KL term is the gentle leash. The latent reasoning loss and the final answer loss ensure the hidden steps are not just close to the prior, but also useful for predicting tokens and the outcome. This combo prevents drift and preserves semantics.
Building blocks (the idea in pieces):
- Segmenting: Split the original reasoning chain into K chunks (or even put all of it into one chunk for extreme compression).
- Rendering: Turn each chunk into a dense image (high semantic density layout).
- Visual encoding: Feed the image to a strong OCR-style visual encoder to get a compact vector.
- Adapter: Map the visual vector into the model's hidden space; this becomes the mean of the prior for that step (see the sketch right after this list).
- Posterior sampler: From the current question and previous latent states, predict a posterior (mean and spread), then sample the next hidden state.
- KL regularization: Keep the posterior close to the visual prior so each state stays faithful to its segment.
- Latent reasoning loss: Encourage hidden states to be predictive of the segmentās tokens (semantic consistency).
- Answer head: After a few latent steps (or one!), switch to the language head to produce the final answer.
- No visuals at test time: The model runs on text-only inputs; the visual prior trained the hidden thinking so it stays on track without needing images later.
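Here is the promised sketch of the adapter building block: a small projection that maps a visual feature into the LLM's hidden space, whose output is then read as the prior mean for one latent step. The two-layer MLP shape and the dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualPriorAdapter(nn.Module):
    """Maps a frozen visual feature to a prior mean in the LLM's hidden space."""

    def __init__(self, visual_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # Output is interpreted as the mean of the Gaussian prior for one latent step.
        return self.net(visual_feat)

adapter = VisualPriorAdapter()
prior_mean = adapter(torch.randn(1, 1024))  # one rendered segment -> one prior mean
print(prior_mean.shape)                     # torch.Size([1, 2048])
```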
Multiple views of the same core:
- Structural view: Visual prior = anchor, posterior = student guess, KL = gentle correction, latent loss + answer loss = task grounding.
- Efficiency view: Replace hundreds of tokens with a handful of vector updates; less decoding, less latency.
- Robustness view: The prior protects against error snowballs in hidden-space loops.
03 Methodology
High-level pipeline (training): Question → split the reasoning into K segments → render each segment → visual-encode each image → adapt into prior means → for step k, predict the posterior from the question and past z's and sample zk → (latent reasoning loss + KL to prior) → after K steps, generate the final answer (answer loss).
High-level pipeline (inference): Question → iteratively sample a few zk (until a special stop token) → generate the final answer. No images used.
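To see how these pieces interlock during training, here is a hedged, toy-scale sketch of one training step. Real components (the LLM, its language head, the frozen visual encoder) are replaced by tiny stand-ins, and names such as `latent_head`, `lm_head`, and the 0.1 KL weight are assumptions; only the overall flow mirrors the training pipeline described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, V, K = 64, 1000, 3                 # hidden size, toy vocab size, number of latent steps
latent_head = nn.Linear(H, 2 * H)     # predicts posterior mean and log-variance
lm_head = nn.Linear(H, V)             # toy stand-in for the model's language head
prior_means = torch.randn(K, H)       # adapter(visual_encoder(rendered segment k)), precomputed
segment_tokens = torch.tensor([17, 42, 5])   # one token sampled from each teacher segment
answer_token = torch.tensor([7])             # toy final-answer token

def kl_to_prior(mu_q, logvar_q, mu_p):
    # KL( N(mu_q, var_q) || N(mu_p, I) ); a unit-variance prior is assumed for this sketch.
    return 0.5 * torch.sum(torch.exp(logvar_q) + (mu_q - mu_p) ** 2 - 1.0 - logvar_q)

context = torch.randn(H)              # stands in for the encoded question
kl_loss = latent_loss = 0.0
for k in range(K):
    mu_q, logvar_q = latent_head(context).chunk(2)            # posterior from context so far
    z_k = mu_q + torch.exp(0.5 * logvar_q) * torch.randn(H)   # reparameterized sample
    kl_loss = kl_loss + kl_to_prior(mu_q, logvar_q, prior_means[k])
    # Latent reasoning loss: z_k should let the language head score its segment's tokens.
    latent_loss = latent_loss + F.cross_entropy(lm_head(z_k).unsqueeze(0), segment_tokens[k:k+1])
    context = context + z_k            # toy state update; the real model attends over past z's

# Answer loss: condition on the question plus all latent states and predict the answer.
answer_loss = F.cross_entropy(lm_head(context).unsqueeze(0), answer_token)
total_loss = answer_loss + latent_loss + 0.1 * kl_loss        # 0.1 is an assumed KL weight
total_loss.backward()
print(float(total_loss))
```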
Step-by-step recipe with reasons and examples:
- Prepare the teacher signals (offline):
- What: Split the full written Chain-of-Thought (CoT) into K segments (e.g., by sentences). Render each segment as a compact image. Run a frozen visual encoder to get a vector for each image; store them.
- Why: These image vectors act like dense, near-lossless summaries of each segment. Without them, hidden states have no clear target and drift.
- Example: A 6-line math solution becomes 6 images; each image's vector describes that line's meaning and layout.
- Build the prior for each step:
- What: Pass each image vector through a small adapter to match the LLM's hidden size. Treat it as the mean of a normal distribution (the prior) for that step.
- Why: The prior defines where each hidden state should live. Without a good prior, the model guesses blindly.
- Example: For segment 3 ("add the partial products"), its image vector becomes the prior mean for z3.
- Predict the posterior and sample a latent state:
- What: Given the question and previous latent states, the model's latent head outputs a posterior (mean and log-variance). Sample zk = mean + std × noise (reparameterization).
- Why: Sampling captures uncertainty and avoids "mean collapse." Without sampling, you get blurry, average states that miss crisp logic.
- Example: For z2, the model samples a vector that best represents "carry over the 1" with some uncertainty.
- Keep posterior close to prior (KL term):
- What: Add a KL divergence penalty between the posterior and the image-based prior for each step.
- Why: This softly pulls sampled states toward the correct semantic region; without it, errors compound.
- Example: If z4 drifts, KL nudges it back toward the vector representing "simplify the fraction."
- Teach semantic faithfulness (latent reasoning loss):
- What: Ask the language head to score actual tokens from that segment given zk, and maximize their likelihood (in practice, sampling a token from the segment keeps training efficient).
- Why: This tells zk what words it should be able to reconstruct; without it, zk might be close to the prior but not useful for predicting the text it stands for.
- Example: If segment 5 contains "therefore x=12," zk should help predict tokens like "x=12."
- Produce the final answer (answer loss):
- What: After building the sequence Z = [z1..zK], switch to answer generation. Condition the language head on (Q, Z) and train it to produce the correct final answer.
- Why: We ultimately care about correct answers. Without this, the model might preserve meaning but not solve the task.
- Example: For a word problem, after latent steps, the model answers "126."
- Inference without images:
- What: At test time, the model only gets the question. It iteratively samples latent states using its learned posterior until a special stop token appears, then generates the answer.
- Why: The visual prior is a training-time teacher. At test time, the student thinks efficiently on its own.
- Example: New math problem: the model takes 3 latent steps, then answers, with no CoT written out.
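And here is how text-only inference might look in code: sample a few latent states from the learned posterior, stop when the model signals it is done, then decode the answer. The `stop_head` standing in for the special stop token and the `max_steps` cap are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

H, V, max_steps = 64, 1000, 8
latent_head = nn.Linear(H, 2 * H)     # learned posterior over the next latent state
stop_head = nn.Linear(H, 1)           # assumed head playing the role of the stop token
lm_head = nn.Linear(H, V)             # toy stand-in for answer generation

@torch.no_grad()
def latent_inference(question_state: torch.Tensor) -> torch.Tensor:
    context = question_state
    for _ in range(max_steps):                                 # no images, no written CoT
        mu, logvar = latent_head(context).chunk(2)
        z = mu + torch.exp(0.5 * logvar) * torch.randn(H)      # one silent "thought"
        context = context + z
        if torch.sigmoid(stop_head(context)) > 0.5:            # model decides it has thought enough
            break
    return lm_head(context).argmax()                           # first answer token (toy decoding)

print(latent_inference(torch.randn(H)))
```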
What breaks without each step:
- No rendering/visual prior: Hidden states drift; accuracy drops sharply.
- No KL: Catastrophic failure in ablations; the posterior wanders far from meaningful states.
- No latent reasoning loss: States don't align with their segments; weaker token-level grounding.
- No answer loss: Good segments but poor final answers.
Concrete data example (GSM8K-style):
- Question: "If a bus has 24 seats and 5 rows are full with 4 students each, how many seats are left?"
- Segments (teacher CoT): (1) 5×4=20 seats filled, (2) total seats=24, (3) 24-20=4 left.
- Training: Each segment rendered → vector → prior mean. The model samples z1..z3 with KL to priors, learns to predict tokens like "5×4=20," then learns to output "4" at the end.
- Inference: It produces 2-3 latent states and directly outputs "4."
The secret sauce:
- Using rendered CoT images as priors. Images carry dense structure (layout, symbols, ordering) that pooling text embeddings can lose. Aligning the posterior to these priors via KL is what keeps the few latent steps semantically rich. This also naturally supports multi-modal training (e.g., add a small diagram), which later helps even without images at test time.
04Experiments & Results
The test: The authors measured (1) Accuracy: percent of correct answers, and (2) Reasoning Length: how many latent steps were used. They compared ReGuLaR to strong baselines (iCoT, CODI, Coconut, CoLaR) across math datasets like GSM8K-Aug, GSM-Hard, SVAMP, MultiArith, and also on harder sets (GSM8K-Aug-NL, AQUA-RAT, MATH). They repeated runs with different seeds for reliability.
The competition: CoLaR and Coconut are the strongest prior latent methods; explicit CoT is strong but slow. ReGuLaR aims to be both fast (few steps) and accurate.
Scoreboard with context:
- Main table (LLaMA-3.2-1B-Instruct backbone): On GSM8K-Aug, ReGuLaR gets about 34.9% accuracy with ~3.69 latent steps, while CoLaR gets 26.6% with ~5.63 steps. That's like getting a solid B when the best rival gets a C, and doing it 35% faster (shorter reasoning length).
- Averages over four math datasets: ReGuLaR improves both accuracy and efficiency, hitting the best average accuracy and the fewest steps (~3.03 vs. CoLaR's ~4.70).
Generalizability: Swapping the backbone to DeepSeek-R1-Distill-Qwen-1.5B, ReGuLaR still wins on both accuracy and fewer steps. On GSM-Hard, CoLaR uses ~12.8 steps; ReGuLaR gets higher accuracy with only ~3.1 steps, like finishing the test in a quarter of the time with a better score.
Compression analysis: For the same compression rate (how many text tokens per latent step), ReGuLaR consistently beats CoLaR across backbones. Even as compression gets more aggressive (harder), ReGuLaR keeps a lead, evidence that visual guidance preserves meaning better than pooling text embeddings.
Scalability: Across model sizes (LLaMA 1B, 3B, 8B), ReGuLaR keeps or widens its margin over baselines. Bigger backbones mean better absolute scores for everyone, but ReGuLaR's advantage stays steady.
Extreme compression (K=1): Put the entire reasoning into a single rendered image; train with just one latent state. ReGuLaR still beats CoLaR on GSM8K-Aug-NL, AQUA-RAT, and MATH. On MATH, average accuracy jumps from ~7.76% (CoLaR, ~62 steps!) to ~11.9% (ReGuLaR, 1 step). That's like acing a quiz with 1 note card when others carry a whole binder.
Beyond text (molecular captioning, multimodal reasoning): With textual + 2D molecular graphs rendered together into one image, ReGuLaR achieves state-of-the-art BLEU/METEOR/ROUGE scores across backbones, beating explicit CoT that uses hundreds of text steps. Even when removing the 2D graphs (text-only rendering), ReGuLaR stays competitive while using just one latent step.
Surprising findings:
- KL is crucial: Remove it and performance collapses (<14% in ablations). This shows the prior really anchors meaning.
- Probabilistic beats deterministic: Sampling avoids "mean collapse" and gives crisper hidden states.
- Vision-based regularization beats text-based pooling: Rendering preserves layout and structure that matter for reasoning.
- Tiny visual encoder mode works almost as well as large modes: Because final features are pooled into one vector, high-resolution internal tokens give diminishing returns; great for efficiency.
05 Discussion & Limitations
Limitations:
- Needs rendering during training: Pre-rendering CoT into images and running a visual encoder adds setup cost (though it's offline and frozen). If rendering quality is poor, the prior may be less helpful.
- Segment choices: Splitting reasoning into K segments (or K=1) is a design knob; bad segmentation can weaken guidance.
- Domain assumptions: The approach shines when layout/structure help (math steps, formulas, diagrams). For purely free-form prose, the visual edge may be smaller.
- No visual input at inference: That's a feature (no extra cost), but it also means the model relies on what it internalized; if the test distribution shifts a lot, priors learned from training visuals might help less.
Required resources:
- A frozen visual encoder (e.g., OCR-trained), a small adapter MLP, and a latent reasoning head.
- Training compute (e.g., GPUs) for LoRA finetuning; storage for precomputed visual features.
When not to use:
- If problems are trivial (no need for reasoning) or tiny models suffice with direct answers.
- If the domain has no helpful structure to capture visually and you cannot provide reliable training CoT.
- If strict latency budgets forbid even minimal latent steps (though ReGuLaR is already very short).
Open questions:
- Theory: Under what conditions can latent reasoning provably match or exceed explicit CoT?
- Data: How to build large, high-quality datasets with rich, structured CoT to push this further?
- Segmentation: Can the model learn how many segments (K) to use on its own?
- Priors: Beyond images, can other compact signals (graphs, tables) serve as even better priors?
06 Conclusion & Future Work
Three-sentence summary: ReGuLaR turns long, written reasoning into compact images and uses their visual features to guide a variational latent thinking process inside an LLM. This keeps each hidden step faithful to the original logic (via KL regularization) while slashing the number of steps and the decoding cost. As a result, it outperforms prior latent methods, scales across backbones, and can even beat explicit CoT in multimodal tasks.
Main achievement: Showing that rendered CoT as a visual prior inside a VAE-style latent reasoning framework can preserve semantics so well that a handful (or even one) hidden step replaces long chains of tokens without sacrificing accuracy.
Future directions: Build larger, richer reasoning datasets to test the limits; explore learned segmentation and adaptive K; extend priors beyond images (e.g., structured graphs) and study theoretical guarantees for when latent beats explicit CoT.
Why remember this: It's a simple but powerful recipe: use pictures of your thoughts to teach your brain to think silently. This could make future AI systems faster, cheaper, and better at using varied information, all while staying just as smart (or smarter).
Practical Applications
- Homework helpers that solve multi-step math with less delay on low-end devices.
- Tutoring systems that reason through science problems (with charts/diagrams during training) but answer in plain text.
- Customer support bots that infer solutions with few hidden steps, reducing server costs.
- On-device assistants that handle complex scheduling or budgeting without long chains of generated text.
- Medical triage assistants trained with diagrammatic workflows, answering quickly in text-only at use time.
- Code-assist tools that compress multi-step refactoring plans into a few latent thoughts for faster suggestions.
- Financial analysis bots that learned from chart-augmented rationales but reply concisely in text.
- Educational content generators that reason silently and produce step-aware yet compact explanations.
- Search agents that internally chain fewer steps while preserving logic, improving latency.
- Robotics planners trained with visual task diagrams, executing quick, reliable text commands.