OpenVision 3: A Family of Unified Visual Encoders for Both Understanding and Generation
Key Summary
- OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
- It stacks a Vision Transformer (ViT) on top of a frozen VAE encoder so images become compact 'latents' first, then become smart tokens with meaning.
- The same tokens are trained in two ways at once: to rebuild the original picture (reconstruction) and to match and describe text (contrastive learning and captioning).
- By training these two goals together, the model surprisingly gets better at both: learning meaning helps drawing details, and learning details helps meaning.
- On understanding tasks with LLaVA-1.5, OpenVision 3 matches or beats CLIP on several benchmarks like SeedBench and POPE while using a unified tokenizer.
- For generation with the RAE framework, it makes higher-quality images than CLIP-based systems, with a better gFID score (1.89 vs. 2.54 on ImageNet-256).
- For reconstruction, it beats other unified tokenizers by a wide margin (e.g., PSNR 30.33 dB vs. 25.34 dB for UniTok on ImageNet).
- Its training recipe is simple and clear: freeze the VAE, train the ViT and text parts from scratch, use progressive resolution, and balance the losses (understanding weighted a bit more).
- Because one encoder handles both jobs, systems become simpler, faster to plug in, and easier to extend to new multimodal tasks.
- The team releases code, data, and checkpoints to help the community build better unified multimodal models.
Why This Research Matters
OpenVision 3 shows that one encoder can be great at both understanding and generating images, which simplifies products and reduces costs. Because the same tokens handle both jobs, systems become more consistent: what the model says about an image matches what it can redraw. This unlocks better assistants that can describe, edit, and create visuals using one shared brain. It also helps research by providing a clean, reproducible recipe for training unified continuous tokenizers from scratch. In practical terms, it can improve content creation, accessibility tools, education apps, and safety systems all at once. The strong generation scores mean higher-quality images, and the solid understanding scores mean better grounding in language. Together, these make multimodal AI more capable and trustworthy.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you have one magic backpack that works for school and for sports. You carry fewer things, move faster, and worry less. Wouldn't it be great if computers could have one 'magic backpack' for pictures too, both to understand them and to draw them?
The Concept (Unified Multimodal Models, or UMMs): UMMs are AI systems that can look at images, talk about them, and even create new ones, all in one place. How it works: 1) They turn images into tokens (little chunks of information). 2) They also work with words as tokens. 3) They learn how pictures and words connect. 4) They do tasks like answering questions about images or creating new pictures from text. Why it matters: Without a good shared way to represent images, the system gets clunky, often using different tools for understanding and for drawing, which makes things slower and less consistent. Anchor: Think of GPT-4o or Gemini: you show a photo and ask a question, or ask it to make an image. A great unified system lets both happen smoothly with the same internal picture-language.
Hook: You know how a pencil sharpener and a crayon sharpener are different? Old systems used two different 'sharpeners' for the same picture.
The Concept (Tokenizer): A tokenizer turns a big picture into bite-sized tokens the model can understand. How it works: 1) Split or compress the image. 2) Encode important bits into a set of tokens. 3) Feed tokens to different parts of the AI. Why it matters: If you use separate tokenizers, one for meaning and one for pixels, you double the work and make it harder for the two parts to work together. Anchor: Some past models used a semantic tokenizer (like CLIP) for understanding, plus a VAE-style tokenizer for generation, so the same image got processed twice.
Hook: Picture using only LEGO bricks that must snap to fixed studs: you can build a lot, but tiny curves are hard.
The Concept (Discrete tokenizers): Discrete tokenizers turn images into codes from a fixed codebook. How it works: 1) Learn a dictionary of visual codes. 2) Replace each patch with its nearest code. 3) Use codes for both understanding and generation. Why it matters: Fixed codes can cause discretization errors; fine details and smooth textures can get lost, hurting generation quality. Anchor: Methods like VQ-based tokenizers sometimes made images look blocky or less detailed when decoding.
Hook: Now imagine soft clay instead of LEGO: you can make smooth shapes and fine details.
The Concept (Continuous tokenizers): Continuous tokenizers keep features as smooth vectors, not fixed codes. How it works: 1) Compress the image into a continuous latent. 2) Process those latents with a neural network (like a ViT). 3) Keep everything differentiable and detailed. Why it matters: You keep subtle gradients and textures, which helps both understanding and generating crisp images. Anchor: OpenVision 3 uses a continuous approach (VAE latents + ViT) so the same tokens work for meaning and for drawing.
The world before: Researchers often picked two tokenizers: one for high-level meaning (e.g., CLIP-like encoders) and another for pixel-perfect reconstruction (e.g., VAE latents). This worked but had downsides: more complexity, doubled compute at inference, and weaker synergy, since each side learned in its own corner.
The problem: Can we find one simple visual representation (continuous, compact, and powerful) that serves understanding and generation equally well, without trading one off for the other?
Failed attempts:
- Dual-tokenizer designs (e.g., separate semantic and pixel tokenizers) increased complexity and didn't fully share knowledge.
- Discrete unified tokenizers reduced the gap but introduced quantization errors, hurting image quality.
- Some recent continuous designs existed but lacked a transparent, from-scratch training recipe that the community could reproduce.
The gap: A clean, practical, from-scratch training method for a continuous unified tokenizer that:
- Uses one set of tokens for both tasks,
- Preserves pixel details, and
- Gains rich semantic alignment with text.
Real stakes:
- Simpler systems: one encoder to plug into chatbots, captioners, and image generators.
- Better consistency: the way the model sees an image is the same way it draws one.
- Lower cost and easier deployment: fewer moving parts.
- Better apps: smarter photo helpers, safer content filters, and creative tools that describe and edit without losing detail.
OpenVision 3 steps into this space with a simple stack: freeze a good VAE encoder, add a trainable ViT on its latents, and jointly train the same tokens to reconstruct images and to align with text through contrastive learning and captioning.
02 Core Idea
Hook: You know how a Swiss Army knife folds many tools into one handle? One handle, many jobs, no swapping.
The Concept (Core idea in one sentence): OpenVision 3 learns a single set of continuous image tokens, made by stacking a ViT on top of a VAE encoder, and trains them simultaneously to rebuild images and to understand images with text. How it works: 1) Compress the image into VAE latents (details preserved). 2) Pass latents through a ViT to get unified tokens (smart features). 3) Use the same tokens in two branches: (a) reconstruction (pixel-level) and (b) understanding (contrastive + captioning). 4) Train both branches together so they help each other. Why it matters: Without a single unified space, we either lose detail, lose meaning, or juggle two encoders. This design keeps both, simply. Anchor: With OpenVision 3, the tokens that help answer "What's in this photo?" are exactly the ones that help redraw it cleanly.
Multiple analogies:
- Two teachers, one notebook: The art teacher (reconstruction) and the language teacher (semantics) both write in the same notebook (tokens). The student learns better because the notes connect drawing with describing.
- Noise-canceling choir: Many voices (losses) sing together. When tuned right, they cancel mistakes and amplify harmony, making the song (representation) clearer.
- Camera + Tour Guide: The VAE keeps the high-resolution snapshot; the ViT acts like a guide who explains what's important. Together they produce photos you can both admire and talk about.
Before vs. After:
- Before: Separate tokenizers or discrete codes; complexity, inefficiency, and quality trade-offs.
- After: One continuous tokenizer; simpler pipeline; better reconstruction than other unified approaches; understanding on par with CLIP; stronger generation than CLIP-based RAE.
Why it works (intuition):
- The VAE latents keep pixel-rich structure so details aren't lost early.
- The ViT learns to organize those details into meaningful concepts.
- Reconstruction loss pressures tokens to remember fine textures; contrastive and captioning losses pressure tokens to latch onto semantics.
- Surprisingly, these pressures align: knowing what's important helps draw it well; drawing things well helps recognize what they are.
- Adding a bit of noise during training makes tokens robust and generalizable.
Building blocks (explained with the Sandwich Pattern):
Hook: Imagine one master key that opens both the library (knowledge) and the art studio (creativity). Unified Visual Encoder: It's one encoder that turns images into tokens useful for both understanding and generation. How it works: 1) Take VAE latents. 2) Run a ViT to produce unified tokens. 3) Use them for two branches: reconstruction and understanding. Why it matters: Without a single encoder, we waste time switching keys and never fully share what one room teaches the other. Anchor: The same tokens help a chatbot describe your photo and help a generator recreate it.
Hook: Think of a shrink-ray that makes a big photo small but keeps its essence. VAE (Variational Autoencoder): A VAE compresses images into smooth, continuous latents that can be decoded back into pictures. How it works: 1) Encoder squeezes the image. 2) Latents store key details. 3) Decoder expands latents back to pixels. Why it matters: If early compression throws away too much, you can't draw crisp images later. Anchor: Your 256×256 cat becomes a neat latent grid the model can still turn back into a cat.
Hook: You know how you scan a comic page panel by panel to understand the story? ViT (Vision Transformer): A ViT looks at an image in patches and learns relationships between them to form smart features. How it works: 1) Split latents into tiny patches. 2) Use attention to see which parts matter together. 3) Output tokens that summarize the scene. Why it matters: Without this, the model may miss which details connect (like whiskers to a cat's face). Anchor: The ViT notices the triangle ears and round eyes go together: "catness."
Hook: Imagine sorting socks: pairs that match go together, mismatches stay apart. Contrastive Learning: It teaches image and text to match when they describe the same thing and to repel when they don't. How it works: 1) Embed image tokens. 2) Embed caption text. 3) Pull true pairs closer; push wrong pairs apart. Why it matters: Without it, the model can mix up "dog" and "cat" even if it draws either well. Anchor: A cat photo should be close to "a small gray cat" and far from "a red fire truck."
Hook: Think of telling a friend what's in a photo, word by word. Captioning Objectives: The model learns to generate a sentence that describes the image. How it works: 1) Feed unified tokens to a text decoder. 2) Predict the next word again and again. 3) Learn to mention the right objects and details. Why it matters: Without language practice, the model might recognize things but can't explain them clearly. Anchor: From a beach photo it writes, "Two kids building a sandcastle by the ocean at sunset."
03 Methodology
High-level recipe: Input image → VAE encoder (frozen) → ViT encoder (trainable) → two branches (reconstruction and understanding) → outputs (reconstructed image, aligned semantics).
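To make that data flow concrete, here is a minimal PyTorch sketch of the stack. Every name here (the UnifiedTokenizer class, the vae_encoder argument, the layer sizes) is an illustrative placeholder rather than the released code; it only assumes the frozen VAE, the 8× latent downsampling, and the 2×2 ViT patches described in this section.

```python
# Minimal sketch of the OpenVision 3 pipeline: frozen VAE encoder -> trainable
# ViT -> one set of unified tokens that both branches will consume.
import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    def __init__(self, vae_encoder: nn.Module, latent_ch: int = 16, dim: int = 768):
        super().__init__()
        self.vae_encoder = vae_encoder.eval()      # stand-in for the frozen FLUX.1 VAE encoder
        for p in self.vae_encoder.parameters():
            p.requires_grad_(False)                # VAE stays frozen during training
        # 2x2 patchify on VAE latents -> 16x overall spatial compression vs. pixels
        self.patch_embed = nn.Conv2d(latent_ch, dim, kernel_size=2, stride=2)
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=12
        )

    def forward(self, image: torch.Tensor):
        with torch.no_grad():
            z_vae = self.vae_encoder(image)        # (B, C, H/8, W/8) continuous latents
        tok = self.patch_embed(z_vae)              # (B, dim, H/16, W/16)
        z_u = self.vit(tok.flatten(2).transpose(1, 2))   # (B, N, dim) unified tokens
        return z_u, z_vae                          # the same z_u feeds both branches
```

The point of the sketch is simply that one forward pass produces one set of tokens, which the two branches below then train in different directions.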
Step-by-step with what, why, and example:
- VAE encode the image into latents.
- What: Use FLUX.1 VAE to downsample the image by 8× in height and width, producing continuous latents (z_vae).
- Why: Compact latents keep details but make learning faster and smoother than raw pixels.
- Example: A 256×256 cat becomes a smaller grid of numbers capturing fur texture and shape.
- ViT encodes the latents into unified tokens.
- What: Feed z_vae into a ViT with small 2×2 patches (overall 16× compression with VAE). Output unified tokens z_u.
- Why: The ViT organizes details into meaningful patterns (e.g., ears + eyes + whiskers = cat).
- What breaks without it: Tokens won't capture relationships; both captioning and reconstruction may be fuzzy.
- Example: The ViT spots the cat on a couch, not just random fur pixels.
- Reconstruction branch (draw it back):
- What: Add Gaussian noise to z_u (for robustness), then a ViT decoder with 1×1 patches plus a linear layer maps noised tokens back to VAE latents (z_hat_vae). The VAE decoder turns them into a reconstructed image (x_hat).
- Why: Forces tokens to remember fine details; noise teaches them to be sturdy even if slightly perturbed.
- What breaks without it: Images degrade; tokens drift toward only "meaning" and forget textures and edges.
- Losses used:
- Pixel-level reconstruction loss on x vs. x_hat (to keep images visibly correct).
- Latent-level loss on z_vae vs. z_hat_vae (to align internal codes).
- LPIPS perceptual loss (to preserve human-perceived quality).
- Example: The model recreates the cat's fur lines and couch fabric, not just a blurry blob.
Hook: Think of polishing a mirror: noise helps you learn to see clearly even with smudges. Reconstruction Losses: These are goals that make the model rebuild the original picture faithfully. How it works: 1) Compare reconstructed image to the real one (pixel). 2) Compare reconstructed latents to original latents. 3) Compare perceptual features (LPIPS). Why it matters: Without them, the model might know there's a cat but can't draw its whiskers. Anchor: If you remove these losses, outputs get blurrier and textures vanish.
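Below is a compact sketch of how these three reconstruction terms could be combined, assuming a decoder that maps noised unified tokens back to VAE latents and a vae_decode function for latents-to-pixels; the specific loss types (L1 vs. L2), the noise level, and the equal weighting are illustrative, not the authors' exact choices. LPIPS comes from the community lpips package.

```python
# Sketch of the reconstruction branch losses: pixel-level, latent-level, and
# perceptual (LPIPS), with Gaussian noise added to the unified tokens first.
import torch
import torch.nn.functional as F
import lpips                                    # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")               # expects images scaled to [-1, 1]

def reconstruction_loss(z_u, z_vae, x, decoder, vae_decode, noise_std=0.1):
    z_noised = z_u + noise_std * torch.randn_like(z_u)    # robustness noise on tokens
    z_hat_vae = decoder(z_noised)                         # predicted VAE latents
    x_hat = vae_decode(z_hat_vae)                         # reconstructed image
    loss_latent = F.mse_loss(z_hat_vae, z_vae)            # latent-level term
    loss_pixel = F.l1_loss(x_hat, x)                      # pixel-level term
    loss_perc = lpips_fn(x_hat, x).mean()                 # perceptual (LPIPS) term
    return loss_pixel + loss_latent + loss_perc           # weights here are placeholders
```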
- Understanding branch (explain it):
- What: Use the same z_u to do two things:
- Contrastive learning with a text encoder: pull matching image-caption pairs together, push mismatches apart.
- Captioning with a text decoder: predict the caption word by word from the image tokens.
- Why: These make tokens carry clear semantic meaning.
- What breaks without it: The model might draw nicely but mix up labels or fail to explain images.
- Example: It learns "gray tabby cat" vs. "brown dog" and writes accurate sentences.
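The two understanding losses can be sketched in the same style. Here img_proj, text_encoder, and text_decoder are placeholders for whatever projection head, text encoder, and text decoder sit alongside the ViT, and the mean-pooling, temperature, and decoder signature are assumptions for illustration rather than details from the paper.

```python
# Sketch of the understanding-branch losses on the same unified tokens z_u:
# a CLIP-style contrastive loss plus a next-token captioning loss.
import torch
import torch.nn.functional as F

def contrastive_loss(z_u, caption_ids, img_proj, text_encoder, temperature=0.07):
    img_emb = F.normalize(img_proj(z_u.mean(dim=1)), dim=-1)      # pool tokens -> (B, D)
    txt_emb = F.normalize(text_encoder(caption_ids), dim=-1)      # (B, D)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # true pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def captioning_loss(z_u, caption_ids, text_decoder):
    # The decoder attends to the image tokens and predicts each next caption token.
    logits = text_decoder(caption_ids[:, :-1], context=z_u)       # (B, T-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids[:, 1:].reshape(-1))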
Training objective and balance:
- Overall loss = λ_rec × reconstruction losses + λ_und × understanding losses.
- They set the understanding weight to about 2× the reconstruction weight during training to keep semantics strong without hurting generation.
- Surprisingly, even training just one side helps the other (synergy).
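Put together, a single training step might look like the sketch below, reusing the loss sketches above; model.encode and the heads container are placeholders, and only the roughly 2:1 weighting in favor of understanding comes from the text.

```python
# How the two branches share one backward pass on the same unified tokens.
w_rec, w_und = 1.0, 2.0   # illustrative values; only the ~2x ratio is stated in the recipe

def training_step(model, batch, optimizer, heads):
    x, caption_ids = batch
    z_u, z_vae = model.encode(x)                               # unified tokens + frozen VAE latents
    loss_rec = reconstruction_loss(z_u, z_vae, x,              # pixel + latent + LPIPS (see above)
                                   heads.decoder, heads.vae_decode)
    loss_und = (contrastive_loss(z_u, caption_ids, heads.img_proj, heads.text_encoder)
                + captioning_loss(z_u, caption_ids, heads.text_decoder))
    total = w_rec * loss_rec + w_und * loss_und                # understanding weighted ~2x
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total)
```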
Data and schedule:
- Data: DataComp recaptioned by LLaVA-Llama-3 (high-quality captions).
- Stages: Progressive resolution: pretrain at 128×128, then finetune at 224→256. Most epochs at low resolution for efficiency (≈10:1).
- Optimization: Freeze the FLUX.1 VAE; randomly initialize and train the ViT encoder/decoder, text encoder/decoder, and linear layer. Use large batches (8K → 4K), cosine learning rate schedule (8e-6 → 4e-7).
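For the optimization side, a runnable toy version of the cosine schedule looks like this; the learning-rate endpoints (8e-6 → 4e-7) come from the recipe above, while the optimizer choice, weight decay, step count, and dummy model are placeholders.

```python
# Toy cosine learning-rate schedule decaying 8e-6 -> 4e-7 over training.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)          # stand-in for the trainable ViT + text modules
total_steps = 10_000             # placeholder; the real schedule length is not given here

trainable = [p for p in model.parameters() if p.requires_grad]   # frozen VAE params excluded
optimizer = torch.optim.AdamW(trainable, lr=8e-6, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=4e-7
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy loss just to drive the loop
    loss.backward()
    optimizer.step()
    scheduler.step()                                # lr follows the cosine curve down to 4e-7
```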
Secret sauce (what's clever):
- Simple stack: VAE latents + trainable ViT = continuous unified tokens.
- Dual-goal training: Reconstruction + semantics on the exact same tokens.
- Noise in reconstruction: makes tokens robust for generation.
- Clear from-scratch recipe: Freeze VAE, train others, progressive resolution, balanced losses.
Concrete walk-through with example data:
- Input: "A small gray cat on a blue couch."
- VAE encoder: makes a compact latent picture with couch fabric grain and fur edges intact.
- ViT encoder: converts latents into tokens that capture "catness" and context ("blue couch").
- Reconstruction branch: decodes tokens back to a crisp cat image; pixel/latent/LPIPS losses guide texture fidelity.
- Understanding branch:
- Contrastive: pulls the cat image token near the text "a small gray cat on a blue couch" and away from "a red fire truck."
- Captioning: trains the text decoder to write that sentence from the image tokens.
- Output: One token set that both draws the cat well and describes it correctly.
04 Experiments & Results
Hook: You know how in school you don't just take one test? You take reading, writing, and art to show you're good at many things.
The Concept (Evaluation suite): The team tested OpenVision 3 on three fronts: reconstruction (can it copy well?), generation (can it create well?), and understanding (can it explain well?). How it works:
- Reconstruction: Measure how close the rebuilt images are to the originals.
- Generation: Train a generator on top of the tokenizer and measure sample quality/diversity.
- Understanding: Plug the tokenizer into a VLM (LLaVA-1.5) and measure task accuracy. Why it matters: A unified encoder must pass all three, not just one. Anchor: It's like getting good grades in math, art, and language: one report card, many skills.
Reconstruction metrics (what they mean):
- PSNR/SSIM/LPIPS: higher PSNR/SSIM and lower LPIPS mean the copy looks closer to the original, both numerically and perceptually. Hook: Think of checking a photocopy: sharper, more similar pages mean better copying. PSNR/SSIM/LPIPS: These measure how sharp and similar the reconstructed image is (PSNR, SSIM) and how human-like the quality feels (LPIPS). How it works: Compare original vs. reconstructed images at pixel and perceptual levels. Why it matters: Without good scores here, the "drawing" side is weak. Anchor: OpenVision 3 hit PSNR 30.33 dB on ImageNet, like getting an A when others got C's.
- rFID (reconstruction FID): lower is better; it checks the realism of reconstructions compared to real images. Hook: Imagine comparing a batch of recopied photos to originals; if they "feel" real, the score goes down (good). rFID: A statistic that says how close reconstructed images look to real ones in a learned feature space. How it works: Compute distribution distance between reconstructions and ground-truth images. Why it matters: Captures perceptual realism, not just pixel math. Anchor: On ImageNet, OpenVision 3's rFID is 0.216, beating many unified baselines.
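As a quick illustration of what these numbers measure, here is a minimal PSNR check in PyTorch; SSIM and LPIPS need dedicated libraries (e.g., torchmetrics and lpips), so they are only noted in the comments. The image tensors and the [0, 1] value range are assumptions for the example.

```python
# Compute PSNR between an image and its reconstruction (higher = closer copy).
import torch

def psnr(x: torch.Tensor, x_hat: torch.Tensor, max_val: float = 1.0) -> float:
    mse = torch.mean((x - x_hat) ** 2)                 # mean squared pixel error
    return float(10.0 * torch.log10(max_val ** 2 / mse))

original = torch.rand(1, 3, 256, 256)
reconstruction = original + 0.01 * torch.randn_like(original)   # a very faithful copy
print(f"PSNR: {psnr(original, reconstruction.clamp(0, 1)):.2f} dB")   # roughly 40 dB here
# SSIM: torchmetrics.functional has an implementation; LPIPS: see the lpips package above.
```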
Generation metrics (what they mean):
- gFID (generation FID): lower is better; realism of generated images.
- IS (Inception Score): higher is better; variety and confidence.
- Precision/Recall: keep samples sharp (precision) and cover diverse modes (recall). Hook: Think of baking cookies: precision is how perfectly shaped each cookie is; recall is how many different flavors you can bake. gFID/IS/Pre/Rec: Together, they tell you if images look real, are diverse, and cover the dataset's variety. How it works: Compare generated images to real ones using learned features and classifier signals. Why it matters: A great generator is both sharp and adventurous. Anchor: With RAE on ImageNet-256, OpenVision 3 scored gFID 1.89, better than CLIP's 2.54, like jumping from a B to an A+.
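Both rFID and gFID come from the same Fréchet distance formula. The sketch below computes it with NumPy/SciPy on random stand-in features; in practice the feature vectors would come from an Inception network run over real and generated (or reconstructed) images.

```python
# Frechet Inception Distance between two sets of feature vectors.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f).real        # matrix square root; drop tiny imaginary parts
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

real = np.random.randn(500, 64)                # stand-ins for Inception features
fake = real + 0.1 * np.random.randn(500, 64)   # similar distribution -> low FID
print(f"FID: {fid(real, fake):.3f}")
```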
Understanding metrics:
- Benchmarks: MME, SeedBench, ScienceQA, GQA, and POPE test general comprehension, scientific Q&A, reasoning, and hallucination resistance. Hook: These are like quizzes that ask, "Do you really understand what you see?" Understanding Benchmarks: Standard tests where models read images, reason, and answer questions. How it works: Plug OpenVision 3 into LLaVA-1.5 and measure accuracy/scores. Why it matters: If unified tokens are truly semantic, scores should match top encoders like CLIP. Anchor: OpenVision 3 matches or beats CLIP on SeedBench and POPE at equal token counts.
Scoreboard highlights (with context):
- Reconstruction (ImageNet, 256×256): PSNR 30.33 dB vs. UniTok 25.34 dB; LPIPS 0.061 vs. 0.132 (lower is better); rFID 0.216, stronger than other unified tokenizers. That's like drawing a portrait with sharper eyes and clearer hair strands.
- Generation (ImageNet-256, RAE): gFID 1.89 (better than CLIP + RAE at 2.54), IS 289.2, Precision 0.84, Recall 0.59, like baking cookies that are both prettier and come in more yummy flavors.
- Understanding (LLaVA-1.5): Keeps pace with OpenAI CLIP; for example, SeedBench 62.4 vs. 62.2 (B/16) and 66.0 vs. 65.4 (L/14), POPE 83.7 vs. 82.9 (B/16) and 85.3 vs. 84.7 (L/14). That's like tying or edging out a star student on comprehension tests.
Surprising findings (synergy):
- Train only semantics? Reconstruction loss still drops a lot; meaning helps drawing.
- Train only reconstruction? Caption/contrastive get better, too; drawing helps meaning. This suggests the unified tokens naturally align both goals, instead of fighting each other.
05 Discussion & Limitations
Limitations:
- Resolution constraints: The design is tied to the VAE's downsampling; extreme high-res use may need adjustments.
- Data dependence: Strong performance uses high-quality recaptions (LLaVA-Llama-3); weaker captions could reduce semantic learning.
- Compute needs: Large batches and long pretraining (even if efficient compared to alternatives) still require serious hardware.
- Domain shift: Very unusual image domains (e.g., medical or satellite) might need further finetuning to keep both detail and semantics.
- Frozen VAE: While simplifying training, a frozen VAE may cap potential if domain-specific textures are important.
Required resources:
- Hardware: Multi-accelerator training (e.g., TPU/GPU clusters) to handle 8Kā4K batch sizes.
- Data: Large-scale image-text datasets with reliable captions (e.g., DataComp recaptioned).
- Software: Access to the FLUX.1 VAE, ViT and text backbones, and the training recipe (which the authors open source).
When NOT to use:
- Purely semantic tasks with no need for generation or reconstruction: CLIP-like encoders might suffice and be cheaper.
- Pure pixel-perfect reproduction without language grounding: specialized generation-only tokenizers may be slightly simpler.
- Extremely tight memory or latency budgets where a two-branch training regime isn't feasible.
- Highly specialized domains where you must end-to-end tune the VAE (but OV3 freezes it).
Open questions:
- Can the same recipe scale to ultra-high resolutions (e.g., 1024+) without losing efficiency?
- How well does the approach extend to video (temporal consistency) as a unified visual tokenizer?
- What's the best dynamic weighting between reconstruction and semantics over training time?
- Can adaptive noise schedules improve robustness even more for generation?
- How do different text encoders/decoders (or multilingual settings) change the synergy?
06 Conclusion & Future Work
Three-sentence summary: OpenVision 3 builds a single, continuous visual tokenizer by stacking a ViT on top of a frozen VAE encoder, then jointly training the same tokens to both reconstruct images and align with text through contrastive learning and captioning. This yields strong, transferable representations that perform on par with CLIP for understanding and surpass it for generation under RAE, while significantly outperforming prior unified tokenizers on reconstruction. The training recipe is simple, reproducible, and efficient, making unified modeling more practical.
Main achievement: Showing that one continuous token space can serve both image understanding and generation at high quality, with a clear from-scratch training method and state-of-the-art results among unified tokenizers.
Future directions: Extend the approach to higher resolutions and video; explore adaptive loss balancing; combine with stronger text models; and investigate domain-specific tuning, multilingual captions, and interactive editing.
Why remember this: It's a clean, effective proof that we don't need two separate vision encoders; one unified encoder can do both jobs well, simplifying systems and enabling richer multimodal AI.
Practical Applications
- Build one plug-in vision encoder for chatbots that both describe user photos and generate helpful illustrative images.
- Create image editors that can explain changes in words and then apply those changes precisely using the same tokens.
- Develop accessibility tools that describe scenes to users and generate simplified visuals or tactile diagrams from the same representation.
- Upgrade content moderation by unifying recognition (what's in the image) and counterfactual generation (what to blur or replace) with one backbone.
- Improve e-commerce search by connecting product photos and descriptions tightly, and generating realistic product variants on demand.
- Boost education apps that auto-caption diagrams and create new practice images aligned to the same concepts.
- Power creative tools where a user's text prompt and an example image share one token space for accurate style transfer and edits.
- Speed up multimodal RAG systems by using one encoder for both visual grounding and illustrative generation.
- Enable consistent dataset curation: caption images and verify reconstructions to filter low-quality data with one model.
- Prototype video extensions where a unified image tokenizer lays the groundwork for consistent frame understanding and generation.