Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Intermediate
Boqiang Zhang, Lei Ke, Ruihan Yang et al. Ā· 3/6/2026
arXiv

Key Summary

  • Penguin-VL shows that small vision-language models (2B and 8B) can be very strong if you give them a better vision encoder, not just a bigger brain.
  • Instead of starting the vision encoder from CLIP/SigLIP contrastive training, Penguin-Encoder starts from a text-only LLM and is adapted to see images.
  • This switch fixes an 'objective mismatch'—contrastive encoders hide tiny details that matter for captions, OCR, charts, and reasoning.
  • A special warm-up with three reconstruction losses (amplitude, direction, and relation) teaches the LLM-initialized encoder to keep fine visual details.
  • For videos, a Temporal Redundancy-Aware (TRA) compressor spends more tokens on keyframes and fewer on repetitive frames, saving compute without losing meaning.
  • Across many benchmarks, Penguin-VL-2B and 8B rival or beat larger models, especially on documents, charts, and long video reasoning.
  • The project emphasizes efficient data curation, high-quality re-captioning, and a two-stage instruction tuning to harmonize image and video skills.
  • Ablations show the LLM-based vision encoder and the relation loss are the main drivers of the gains, not just more data or parameters.
  • This design is practical for phones, robots, and edge devices because it delivers high fidelity with lower latency and memory.
  • Bottom line: Better visual representations + smart training > simply scaling model size.

Why This Research Matters

Penguin-VL proves that smarter visual representations can replace brute-force scaling, making high-quality multimodal AI practical for phones, laptops, and robots. It reads documents and charts with care, helping with office workflows, education, and accessibility. Its video understanding works efficiently, so long clips can be summarized or searched with fewer resources. This reduces cost, latency, and energy, enabling broader, greener deployment. Better fine-grained perception also unlocks safety, compliance, and analytics use cases where tiny details matter. Finally, the approach opens a path to more general, compact, and reliable multimodal agents.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook) Imagine you're building a robot helper. You could give it a giant brain so it "knows" a lot, or you could give it great eyes so it "sees" a lot. If you only make the brain enormous but the eyes blurry, it will still miss the tiny details that matter.

🄬 Filling (The Actual Concept)

  • What it is: A Vision Language Model (VLM) is a computer program that learns to understand images and videos (vision) together with words and sentences (language).
  • How it works (high level):
    1. A vision encoder turns pixels into tokens (tiny pieces of meaning).
    2. A projector reshapes those tokens so a language model can read them.
    3. A large language model (LLM) reasons over both visual and text tokens to answer questions or describe scenes.
  • Why it matters: If the vision encoder blurs details (like a smudge over small text or chart lines), the LLM can’t reason correctly, no matter how smart it is.

šŸž Bottom Bread (Anchor) When you ask, ā€œWhat’s the total price on this receipt?ā€ the VLM must read tiny numbers and align them with labels. Good "eyes" make the difference between the right total and a random guess.

The World Before

  • Most modern VLMs used contrastive learning (like CLIP/SigLIP) to pretrain vision encoders. This teaches the model to say which image best matches a caption and which caption best matches an image.
  • It worked well for ā€œthis picture is a dog vs. a catā€ but often missed small, task-critical details (like digits in a document or exact positions in a chart). Why? Because contrastive training rewards category-level sameness. It’s okay if different photos of the same class look ā€œthe same,ā€ which can wash away fine-grained cues.

šŸž Top Bread (Hook) You know how a friend who’s great at guessing ā€œthe main ideaā€ of a story might skip over tiny clues in a mystery novel? That’s like contrastive learning.

🄬 Filling

  • What it is: Contrastive learning compares pairs of images and texts so matching pairs get close and mismatched pairs move apart.
  • How it works:
    1. Show the model an image and many captions.
    2. Push the right image-caption pair together and the wrong ones apart.
    3. Repeat this at massive scale.
  • Why it matters: It’s great for recognition, but it doesn’t directly teach the model to generate detailed descriptions or reason step-by-step like a language model does.

šŸž Bottom Bread If you show a line chart and ask, ā€œWhich month dips the lowest?ā€, a contrastive-trained encoder might know it’s a chart but may lose the exact pixel-level minimum.

The Problem

  • Research systems kept getting bigger to score higher. But huge models are hard to deploy on phones, robots, or anywhere with tight memory/latency limits.
  • Worse, the strengths were uneven. Some models did great on images but stumbled on video temporal reasoning, or vice versa.

Failed Attempts

  • Pure contrastive pretraining: Scales well but suppresses detail, needs huge datasets, and can be unstable.
  • Directly aligning a contrastive encoder to an LLM with language modeling loss: Can overfit to the training images and be picky about data quality.

The Gap

  • We needed a vision encoder that speaks the LLM’s ā€œlanguageā€ from day one, keeps fine details, and scales efficiently—without a giant compute bill.

šŸž Top Bread (Hook) Imagine teaching a new student (vision) in a classroom where the teacher (LLM) already has strong language skills. If the student starts school already speaking the teacher’s language, class goes smoothly.

🄬 Filling

  • What it is: Starting the vision encoder from a text-only LLM (not a contrastive vision model) so their internal ā€œlanguagesā€ are naturally aligned.
  • How it works:
    1. Take an LLM backbone and convert its attention to bidirectional (so it can see the whole image at once).
    2. Add 2D positional hints so it understands where patches are in the image.
    3. Warm it up with special reconstruction losses so it learns to keep tiny visual details.
  • Why it matters: Now the encoder produces visual features the LLM naturally understands and can reason over.

šŸž Bottom Bread (Anchor) A small 2B Penguin-VL model reads a complex form and answers layout-sensitive questions more accurately than similar-sized baselines. The difference is in the better-aligned ā€œeyes.ā€

Real Stakes

  • Everyday uses: Reading receipts, comparing charts, summarizing documents, captioning family videos, assisting with homework.
  • On-device AI: Phones and robots need compact, fast, reliable models.
  • Accessibility: Better OCR and layout reading help screen readers and assistive tech.
  • Safety and analytics: Understanding long videos to detect key events with fewer compute resources.
  • Sustainable AI: Smarter training and encoding instead of endless parameter scaling reduces cost and energy.

02 Core Idea

šŸž Top Bread (Hook) Imagine upgrading a camera not by buying a bigger lens, but by teaching it your own language so you instantly understand what it sees. That’s the trick Penguin-VL pulls off.

🄬 Filling (The Actual Concept)

  • One-sentence ā€œAha!ā€: Initialize the vision encoder from a text-only LLM, then gently adapt it to images and videos using bidirectional attention, 2D positional hints, and a warm-up with reconstruction losses—so the LLM and the ā€œeyesā€ think alike from the start.

Multiple Analogies (3 ways)

  1. Language Exchange Program: Instead of forcing the vision side to learn a new dialect (contrastive), we start with a fluent speaker of the LLM’s language and teach it to "see." Fewer misunderstandings.
  2. Orchestra Tuning: Rather than having the violin (vision) and piano (LLM) tune separately, we tune the violin from the piano’s pitch. The duet sounds harmonious immediately.
  3. Map vs. Compass: Contrastive training gives you a compass (category direction). LLM-initialized vision gives you a detailed map (fine-grained landmarks) that makes step-by-step navigation (reasoning) easier.

Before vs. After

  • Before: Big models with contrastive encoders; good at recognition but weaker on dense details; scaling-heavy; uneven on video temporal reasoning.
  • After: Compact models with LLM-initialized encoders; sharper fine details; better doc/chart/video reasoning; improved data efficiency; smoother modality alignment.

Why It Works (intuition)

  • Objective alignment: Language modeling is generative and sequential; contrastive is discriminative and global. Starting from an LLM aligns the encoder’s inner space with the decoder’s needs, making visual tokens ā€œfeelā€ like text tokens the LLM can reason about.
  • Architectural perks: Modern LLM backbones include stability tricks (like QK norm) and scale-friendly designs. Switching attention to bidirectional plus 2D positions equips them for images.
  • Gentle adaptation: A warm-up with three reconstruction losses (amplitude, direction/cosine, relation) teaches the encoder to preserve both tiny details and token-to-token relationships.
  • Smart efficiency: For videos, Temporal Redundancy-Aware compression spends tokens where motion changes, not where frames repeat.

Building Blocks

  1. Penguin-Encoder (LLM-initialized vision): Convert causal to bidirectional attention and add 2D-RoPE.
  2. Mixed Supervision Warm-up: Combine language modeling with reconstruction (amplitude, direction, relation) so features keep detail and structure.
  3. Simple Projector: A tiny MLP aligns visual features to the LLM’s hidden size—no heavy resamplers.
  4. Unified Training Recipe: Low-to-high resolution curriculum; high-quality re-captioned data; region grounding and region captions for fine-grained localization.
  5. Video TRA Compression: Allocate more tokens to keyframes with changes; fewer to in-between frames; respect a minimum token floor for semantic integrity.

šŸž Bottom Bread (Anchor) Ask Penguin-VL: ā€œWhich year on this multiseries line chart has the lowest blue line value?ā€ The LLM-initialized encoder keeps exact line dips and relations, and the LLM decodes the right year—without a giant model size.

03 Methodology

High-level Overview
Input (image/video + optional text) → Vision encoder (LLM-initialized, bidirectional + 2D-RoPE) → MLP projector → LLM decoder → Output (answer/caption/summary)

Step-by-step with Sandwich Explainers for Key Pieces

  1. Vision Encoder from a Text-only LLM šŸž Top Bread (Hook) You know how switching from reading word-by-word to scanning a whole page helps you understand a picture book’s layout? Images need that kind of all-at-once attention.

🄬 Filling

  • What it is: Start from a small LLM backbone (e.g., Qwen3-0.6B), convert its causal attention into bidirectional full attention, and add 2D-RoPE so it understands image coordinates.
  • How it works:
    1. Replace causal masks with full attention so any patch can attend to any other patch.
    2. Add 2D rotary positional embeddings so tokens know their row and column.
    3. Feed visual patches as tokens; process at native or downscaled resolution depending on token budget.
  • Why it matters: Causal text attention reads left-to-right; images need to see everything at once to keep spatial detail and context.

šŸž Bottom Bread (Anchor) On a receipt, the model keeps left-to-right and top-to-bottom reading order intact because tokens attend across the whole page.

  2. Warm-up with Reconstruction Losses šŸž Top Bread (Hook) Imagine tracing a drawing while also checking your angles and how parts relate. You won’t just copy lines—you’ll preserve shapes and proportions.

🄬 Filling

  • What it is: A three-part reconstruction objective that teaches the encoder to match a teacher’s visual feature space while keeping fine details and inter-token relations.
  • How it works (formulas with examples):
    • Amplitude loss: L_A = (1/N) Ī£ |F_s āˆ’ F_t|. Example: if F_s = (2, 3) and F_t = (1, 5), then |F_s āˆ’ F_t| = (1, 2); the average is (1 + 2)/2 = 1.5, so L_A = 1.5.
    • Direction loss (cosine): L_D = (1/N) Ī£ (F_s Ā· F_t) / (‖F_s‖ ‖F_t‖). Example: if F_s = (3, 4) and F_t = (6, 8), then F_s Ā· F_t = 3Ɨ6 + 4Ɨ8 = 50, ‖F_s‖ = 5, ‖F_t‖ = 10, so the cosine is 50/(5Ɨ10) = 1 and L_D = 1.
    • Relation loss: L_R = (1/N) Ī£ ‖F_s F_sᵀ/‖F_s‖ āˆ’ F_t F_tᵀ/‖F_t‖‖². Example: take two tokens with F_s = (1, 0) and F_t = (0.8, 0.6). Then F_s F_sᵀ = [[1, 0], [0, 0]] (norm 1) and F_t F_tᵀ = [[0.64, 0.48], [0.48, 0.36]] with ‖F_t‖ = 1. The difference matrix is [[0.36, āˆ’0.48], [āˆ’0.48, āˆ’0.36]]; its squared Frobenius norm is 0.36² + 2Ɨ0.48² + 0.36² = 0.1296 + 0.4608 + 0.1296 = 0.72, so L_R = 0.72 for this tiny example.
  • Why it matters: Amplitude keeps magnitudes, direction keeps angles, and relation keeps token-to-token structure—critical for charts, documents, and dense scenes.

šŸž Bottom Bread (Anchor) On a line chart, relation loss helps preserve how peaks and valleys relate across time, so the model can identify the exact month with the lowest value.

  3. Simple Vision–Language Projector (MLP) šŸž Top Bread (Hook) Think of an adapter that lets a foreign plug fit your wall outlet. Simple but necessary.

🄬 Filling

  • What it is: A tiny two-layer MLP with GELU that matches the vision feature size to the LLM’s hidden size.
  • How it works: Linear → GELU → Linear; no fancy pooling or spatial rearrangement.
  • Why it matters: Keeps things efficient and preserves token granularity, which helps detailed reasoning.

šŸž Bottom Bread (Anchor) Because tokens are not overly compressed, the LLM can still spot the tiny serial number in a product photo.

  4. Data Curation and Re-captioning šŸž Top Bread (Hook) Imagine building a study guide: you pick high-quality examples, toss out duplicates, and write clear notes.

🄬 Filling

  • What it is: A multi-stage pipeline to gather diverse images and videos, filter and de-duplicate them, and generate rich, structured long captions.
  • How it works:
    1. Aggregate from many sources; remove low-res or corrupted samples.
    2. Cluster embeddings to balance semantic diversity.
    3. Prompt a strong annotator to produce structured attributes (subjects, actions, spatial relations, OCR, etc.) and then synthesize a single long caption.
  • Why it matters: Rich, reliable supervision teaches finer distinctions (like layout and small text) that generic captions miss.

šŸž Bottom Bread (Anchor) A historical poster with tiny, fancy text gets a caption that includes the exact words and layout, improving OCR and document QA performance.

  5. Video: Temporal Redundancy-Aware (TRA) Token Compression šŸž Top Bread (Hook) You don’t write down every second of a movie—just the scenes where something changes.

🄬 Filling

  • What it is: A dynamic policy that spends more tokens on keyframes (big changes) and fewer on repetitive frames, under a global token budget.
  • How it works (key equations):
    • Budget check: Ī£_{k∈K} T_k + Ī£_{i∈I} T_i ≤ T_max. Example: if T_max = 2000, three keyframes have T_k = 400 each (1200 total) and ten intermediate frames have T_i = 50 each (500 total); the sum is 1700 ≤ 2000, so no compression is needed.
    • Synchronous scaling: T_k ← αT_k, T_i ← αT_i until the budget fits. Example: if the total is 2600 and T_max = 2000, pick α = 2000/2600 ā‰ˆ 0.77 so all counts scale (e.g., 400 → 308, 50 → 38).
    • Saturation floor: if T_i reaches T_min, clamp the intermediate frames there and continue scaling keyframes only. Example: if T_min = 32 and intermediates hit 32 while the total is still over budget, reduce keyframe tokens (e.g., 308 → 280) until the sum fits.
  • Why it matters: Saves compute while protecting the most informative frames and maintaining spatial-temporal consistency.

šŸž Bottom Bread (Anchor) In a 3-minute cooking video, TRA keeps high resolution for the ā€œadd eggsā€ and ā€œflip pancakeā€ moments, and compresses the long stirring parts.

  6. Training Stages (Recipe)
  • Stage 1: Penguin-Encoder Training
    • Low-res pretraining (ā‰ˆ223M samples): cap at ~2048 visual tokens; supervise with noisy captions + reconstruction (amplitude/direction/relation), including lots of unlabeled charts.
    • High-res finetuning (ā‰ˆ47M samples): up to ~10240 visual tokens; remove reconstruction branch; train on high-quality re-captions for fine alignment.
  • Stage 2: VLM Pretraining (ā‰ˆ121M samples): Jointly train vision encoder + projector + LLM across general captions, documents, OCR, grounding, region captions, math, code, science, and text-only to prevent language forgetting.
  • Stage 3: Supervised Fine-Tuning (ā‰ˆ39M image QA + 3.7M video SFT): Two-stage instruction tuning to harmonize image and video reasoning, including temporal grounding and ordering.

The Secret Sauce

  • Start the vision encoder from a text LLM to align objectives and inherit semantic priors.
  • Use relation loss to explicitly preserve inter-patch structure—key for charts, documents, and temporal reasoning.
  • Apply TRA so video tokens are spent where changes happen, not on redundancy.

04 Experiments & Results

The Test: What and Why

  • The authors evaluate on a broad suite spanning documents/OCR/charts (InfoVQA, ChartQA, DocVQA, OCRBench, CharXiv), math/logic (MathVista, MathVerse, LogicVista), general knowledge and multi-image reasoning (AI2D, RealWorldQA, V-star, MMMU-Pro, BLINK), and video understanding (MVBench, VideoMME, EgoSchema, MMVU, ActivityNetQA, NextQA, Charades-STA, LongVideoBench, Perception Test).
  • The goal: show that compact models (2B/8B) with a better vision encoder can match or beat larger baselines, especially on fine-grained and temporal tasks.

The Competition

  • 2B class: Qwen3-VL-2B, InternVL3.5-2B, SmolVLM2-2.2B, Gemma3n-E2B-it.
  • 8B class: Qwen3-VL-8B, InternVL3.5-8B, and closed OpenAI GPT-5 nano (where available).

The Scoreboard with Context

  • Documents/Charts (8B): Penguin-VL hits 96.2 on DocVQA and 90.5 on ChartQA—like acing a test with A+ while others get A or B+. On OCRBench it’s close to the top but slightly behind Qwen3-VL.
  • General Knowledge and Multi-image (8B): Leads on AI2D and V-star; competitive on MMMU-Pro and BLINK, signaling strong structured diagram reasoning and high-res detail handling.
  • Math/Logic (8B): Tops MathVista (77.4) but is edged out by some baselines on MathVerse/LogicVista—great at visual math grounding; still room on deep abstract reasoning chains.
  • Video (8B): Shines on long-form and temporal tasks—67.0 on LongVideoBench and 85.4 on NextQA—indicating excellent temporal coherence and event localization.
  • 2B Highlights: Despite the small size, Penguin-VL-2B often beats or matches peers. It’s notably strong on documents/charts and long video tasks (e.g., LongVideoBench 59.5 vs. Qwen3-VL-2B’s 52.1). On Charades-STA (temporal grounding), it beats InternVL3.5 by a huge margin (56.2 vs. 21.9)—that’s like going from a D to a B+.

Surprising Findings

  • The biggest jumps appear where fine-grained structure matters most: documents, charts, and temporal grounding. This strongly supports the paper’s claim that better-aligned visual representations—not model size—drive the gains.
  • Even with fewer pretraining samples than massive contrastive pipelines, the LLM-initialized encoder delivers higher ceilings once data/task complexity scales.

Ablations that Explain the Gains

  • LLM-based init vs. random: +3.3 average points with the same setup—starting from an LLM truly helps.
  • Relation loss: Adds a notable bump over amplitude + direction alone, confirming that supervising token relationships is essential for dense perception.
  • Against SigLIP2 and other contrastive encoders under matched backbones and recipes, Penguin-Encoder leads on AI2D, MathVista, ChartQA, MMMU-Pro, and RealWorldQA, showing the architecture and objective alignment are key.

05 Discussion & Limitations

Limitations

  • Math/Logic depth: On MathVerse and LogicVista, Penguin-VL can trail top baselines, suggesting that deeper chain-of-thought math SFT or post-training could help.
  • OCR in the wild: While strong on documents, Penguin sometimes lags the very best OCR-specialized models on unconstrained scene text (OCRBench).
  • Data pipeline reliance: High-quality re-captioning and curated datasets are important; results could vary with weaker data or fewer annotations.
  • Longest videos: Though TRA scales well, extremely long videos beyond frame/token budgets may still need hierarchical or streaming strategies.
  • No audio: The system focuses on vision + language; multimodal audio understanding is out of scope here.

Required Resources

  • Training uses large curated corpora (hundreds of millions of samples across stages) and long context windows (up to ~10,240 visual tokens, ~16k total).
  • Compute: While more efficient than mega-scale contrastive training, full training still requires multi-GPU clusters. Inference is lightweight compared to giants and suits edge devices better, especially at 2B.

When NOT to Use

  • If your task is pure OCR in messy street scenes and you need the absolute top accuracy, a specialized OCR model might outperform.
  • Extremely long-horizon video analytics (hours) without chunking or streaming.
  • Tasks needing audio cues (speech, music, sound events) since audio isn’t modeled.

Open Questions

  • How far does LLM-initialized vision scale? Does the advantage persist at much larger or tinier sizes?
  • Can RL-based post-training further enhance temporal reasoning and agentic behavior?
  • How robust is the approach to domain shift (e.g., medical imaging, satellite data) without domain-specific SFT?
  • What’s the best way to incorporate streaming updates for real-time applications on-device?

06 Conclusion & Future Work

Three-sentence Summary
Penguin-VL replaces contrastive-pretrained vision backbones with an LLM-initialized vision encoder adapted for images and videos using bidirectional attention, 2D positions, and reconstruction warm-ups. This better alignment preserves fine details and relationships, enabling compact 2B/8B models to outperform peers on documents, charts, and long video reasoning without brute-force scaling. Experiments and ablations confirm the vision encoder—and especially the relation loss and TRA video compression—are the primary drivers of the gains.

Main Achievement
Showing that the ā€œoptics,ā€ not just the ā€œbrain size,ā€ drive VLM quality: an LLM-based vision encoder with mixed supervision sets new efficiency and fidelity marks for compact multimodal models.

Future Directions

  • Real-time, streaming inference with early exiting and dynamic token budgets.
  • Post-training with reinforcement learning to improve long-horizon reasoning and agentic skills.
  • Stronger math/logic SFT and domain-targeted finetuning (e.g., specialized OCR or scientific figures).

Why Remember This
Penguin-VL flips a long-held assumption: you don’t need massive contrastive pretraining or ever-bigger models to get sharp multimodal reasoning. If your vision encoder thinks in the LLM’s language and preserves relationships between patches and frames, you get clearer eyesight—and smarter answers—on a tighter budget.

Practical Applications

  • On-device document assistants that extract totals, addresses, and signatures from scanned forms.
  • Business analytics that read and compare charts across reports without cloud compute.
  • Educational helpers that explain diagrams, math visuals, and lab figures step-by-step.
  • Customer support tools that understand UI screenshots to guide users through fixes.
  • Video summarizers that keep key scenes sharp while compressing repetitive footage.
  • Retail shelf scanners that read tiny labels and count items accurately with a phone camera.
  • Robotics perception that preserves small, safety-critical details while running on edge hardware.
  • Accessibility tools that read complex layouts aloud with correct text order and structure.
  • Compliance auditing that checks invoices, receipts, and statements for exact values and mismatches.
  • Scientific figure readers that interpret plots and annotate key trends and anomalies.
#Vision Language Model#LLM-based Vision Encoder#Contrastive Learning#2D-RoPE#Bidirectional Attention#Temporal Redundancy-Aware Compression#Document Understanding#ChartQA#Video Reasoning#OCR#Region Grounding#Fine-grained Representation#Multimodal Alignment#Data Efficiency#Edge AI