
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Intermediate
Zhixiang Wei, Yi Li, Zhehan Kan et al. Ā· 1/27/2026
arXiv Ā· PDF

Key Summary

  • Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
  • It switches the training goal from ā€˜vision as input’ to ā€˜vision as target,’ so the model must remember fine visual details.
  • A special Synergistic Vision Tokenizer turns image patches into learnable visual tokens that mix meaning (what it is) and shape (where and how it looks).
  • A unified image-text vocabulary lets the model predict visual tokens and text tokens in the same step-by-step way.
  • A multi-label next-token method (NTP-M) teaches the model that one image patch can have many labels (like color, object, and depth) at once.
  • Without adding task-specific heads, the model handles dense tasks like segmentation and depth, and text-style tasks like detection and grounding.
  • On popular tests, Youtu-VL performs competitively or better than other models of similar size, especially on fine-grained vision tasks.
  • Scaling studies show the new training (VLUAS) keeps improving with more data and avoids early plateaus common in text-only supervision.
  • This design simplifies the system into one general model that can act as a visual agent across many tasks.
  • The approach reduces hallucinations by forcing the model to match what’s truly in the picture.

Why This Research Matters

Youtu-VL shows a simple way to make one model see and think with high detail across many tasks, without adding a bunch of special parts. This can make robots safer and more reliable because they understand exact locations, shapes, and distances. It helps assistive tools read documents and signs with fewer mistakes, improving accessibility. In creative tools, it supports precise edits like selecting hair strands or thin wires in photos. It also reduces hallucinations by forcing answers to match the actual image, which makes AI more trustworthy. Finally, it offers a cleaner path to general visual agents that can understand, localize, and act in the real world.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook) Imagine you’re telling a friend about a picture. If you only say a few big things like ā€œIt’s a beach,ā€ you might miss the tiny seashells, the footprints, or the exact shape of the waves. Now imagine a computer that learns like that—great at the big idea, but fuzzy on the tiny details that really matter.

🄬 Filling (The Actual Concept)

  • What it is: Vision-Language Models (VLMs) are AIs that look at images and read/write text so they can answer questions about pictures, describe scenes, and solve visual problems.
  • How it works (step by step):
    1. A vision encoder turns the picture into a set of features (like notes about colors, shapes, and positions).
    2. A language model reads text tokens (pieces of words) and uses attention to decide what’s important.
    3. The model tries to predict the next token (word or symbol) to complete an answer or description.
  • Why it matters: Without a careful way to keep tiny visual details, the model may answer in a general way (e.g., ā€œa birdā€) but miss the crucial parts (e.g., ā€œa puffin with an orange beak standing on rock #3ā€).

šŸž Bottom Bread (Anchor) When you ask, ā€œWhich window has the crack?ā€ a coarse model might just say ā€œthe left window,ā€ but a detail-aware model can say ā€œthe second pane from the top, left side, small diagonal crack.ā€

šŸž Top Bread (Hook) You know how your eyes jump around a page to the words that matter most? That’s like having a spotlight that shines brighter on the important bits.

🄬 The Concept: Attention Mechanisms

  • What it is: An attention mechanism helps the model focus on the most useful parts of an image or sentence.
  • How it works:
    1. Look at all parts (words or patches).
    2. Score how relevant each part is to the current goal.
    3. Give more weight to high-score parts.
    4. Use these weighted parts to make the next prediction.
  • Why it matters: Without attention, the model treats ā€œtheā€ and ā€œgiraffeā€ equally, which makes answers vague or wrong.

šŸž Anchor When asked, ā€œWhat color is the giraffe’s tongue?ā€, attention helps the model zoom in on the giraffe’s mouth, not the sky.

šŸž Top Bread (Hook) Think of LEGO bricks: to build anything, you first break ideas into smaller pieces you can snap together.

🄬 The Concept: Tokenization

  • What it is: Tokenization breaks text (and now image content) into small units called tokens.
  • How it works:
    1. Split text into subwords like ā€œpenā€ + ā€œguin.ā€
    2. Map image patches to discrete visual tokens (numbers from a codebook).
    3. Feed these tokens into the model so it can learn patterns.
  • Why it matters: Without tokens, the model can’t build up complex meanings step by step.

šŸž Anchor To say ā€œpenguin on ice,ā€ the model uses a few text tokens and, with Youtu-VL, also predicts image tokens that describe the penguin’s look and location.

šŸž Top Bread (Hook) Imagine coloring every pixel of a picture by labeling what it is—sky, tree, road—like a super-detailed coloring book.

🄬 The Concept: Semantic Segmentation

  • What it is: Assigning a category label to every pixel of an image.
  • How it works:
    1. Divide the image into a grid or consider each pixel.
    2. Predict which category each spot belongs to.
    3. Smooth the result so boundaries look clean.
  • Why it matters: Without segmentation, the model can’t know exactly where things are, only that they exist.

šŸž Anchor In a street photo, segmentation helps color the road gray, the cars red, and the sidewalks tan—all in the right places.

šŸž Top Bread (Hook) When you look at a photo, you can feel what’s closer or farther away. That’s depth—like a 3D secret hidden inside a 2D picture.

🄬 The Concept: Depth Estimation

  • What it is: Predicting how far every pixel is from the camera.
  • How it works:
    1. Turn depth into bins (like near to far).
    2. Predict which bin each pixel belongs to.
    3. Optionally change bins back into real distances.
  • Why it matters: Without depth, robots and agents can’t tell which objects are in the way or how to grasp them.

šŸž Anchor If a robot wants to pick up an apple on a table, depth tells it how far to reach, not just where the apple is in 2D.

šŸž Top Bread (Hook) Think of writing a story one word at a time. Each new word depends on what you already wrote.

🄬 The Concept: Next Token Prediction (NTP)

  • What it is: The model learns to guess the next token in a sequence.
  • How it works:
    1. Read the previous tokens.
    2. Consider the context (image + text).
    3. Predict the next token.
    4. Repeat.
  • Why it matters: Without NTP, the model can’t produce coherent sentences—or, in Youtu-VL, coherent visual token streams.

šŸž Anchor When asked, ā€œWhat animal is sitting on the ice?ā€ the model steps through tokens to output ā€œA penguin,ā€ and with Youtu-VL, also predicts visual tokens that describe that specific penguin.

  • The World Before: Most VLMs used images just as helpers for text. Training mainly optimized text output, so tiny image details got ignored.
  • The Problem: Models often missed small, important clues (like a tiny logo or an exact pixel location) and struggled with dense tasks like segmentation and depth without extra modules.
  • Failed Attempts: People added task-specific decoders or special tokens for each task. It worked but made systems complicated and less unified.
  • The Gap: A simple, single model that keeps fine visual detail and handles many vision tasks without bolt-on parts.
  • Real Stakes: This matters for safer robots, better medical pre-screening, helpful accessibility tools (reading signs, forms), and reliable assistants that don’t hallucinate details.

02 Core Idea

šŸž Top Bread (Hook) You know how a good art teacher doesn’t just ask you to talk about a painting—they also ask you to redraw parts of it so you notice the exact shapes and colors? That second part makes you really see.

🄬 The Concept: Vision-Language Unified Autoregressive Supervision (VLUAS)

  • What it is: A training style where the model predicts not only words but also visual tokens, so it must remember fine visual details.
  • How it works:
    1. Build one big vocabulary that includes both text tokens and visual tokens.
    2. Feed images as continuous features (so input is high quality) but make the model predict discrete visual tokens (so targets are stable and detailed).
    3. Train the model to guess the next token, whether it’s a word or a visual token.
    4. Use this same setup for many tasks (captioning, grounding, segmentation, depth).
  • Why it matters: Without making vision a target, models slip into text-only habits, lose detail, and plateau early.

šŸž Bottom Bread (Anchor) When asked ā€œDraw a box around the red cup,ā€ the model doesn’t just say ā€œThere is a red cup.ā€ It predicts the exact coordinate tokens and visual details needed to localize it precisely.

Multiple Analogies for the Aha Moment

  • Teacher analogy: Don’t just explain the picture—recreate tiny parts. Predicting visual tokens forces attention to details.
  • Chef analogy: Don’t just name the dish—list the exact ingredients and chop sizes. Visual tokens are the chopped ingredients.
  • Map analogy: Don’t just say ā€œGo north.ā€ Give turn-by-turn steps. Unified tokens are the directions that keep you on the precise path.

šŸž Top Bread (Hook) Imagine writing a story where some words are regular words, and some words are tiny picture-pieces. You write them in order, all mixed together.

🄬 The Concept: Unified Autoregressive Modeling

  • What it is: Predicting a single mixed stream of text and visual tokens, one after another.
  • How it works:
    1. Put text tokens and visual tokens into one sequence.
    2. Predict the next token from that same unified set.
    3. Repeat to build answers and dense maps.
  • Why it matters: Without one stream, the model treats images as sidekicks and never fully learns how vision and language interact step by step.

šŸž Anchor While answering ā€œHow many people are wearing hats?ā€, the model alternates between reasoning tokens and visual tokens that track detected heads and hat regions.

šŸž Top Bread (Hook) You know how a dictionary tells you both the word and what it means? Imagine a dictionary that also stores little building blocks for image parts.

🄬 The Concept: Image-Text Vocabulary

  • What it is: A combined dictionary that contains normal text tokens and a large set of visual tokens (from a learned codebook).
  • How it works:
    1. Learn a codebook that turns image patches into token IDs.
    2. Merge those IDs with the normal text token list.
    3. Let the model predict from this merged list.
  • Why it matters: Without shared tokens, the model can’t smoothly switch between describing and pinpointing pixels.

šŸž Anchor To say ā€œpenguin at (x, y),ā€ the model outputs the word ā€œpenguinā€ and tokenized coordinates chosen from the same overall vocabulary.

šŸž Top Bread (Hook) Think of two friends: one is great at language and meanings, the other at shapes and boundaries. Together, they notice both what things are and exactly where they are.

🄬 The Concept: Synergistic Vision Tokenizer

  • What it is: A tokenizer that fuses high-level semantics (from a language-aligned vision encoder) and crisp shapes/boundaries (from a structure-focused encoder) before turning image patches into discrete tokens.
  • How it works:
    1. Take semantic features (what it is) and geometric features (where/shape).
    2. Use cross-attention to let them talk and combine.
    3. Quantize the fused features into a large visual codebook of token IDs.
  • Why it matters: Without this synergy, tokens might be either too vague (only meaning) or too noisy (only texture) and miss fine boundaries.

šŸž Anchor Around a penguin’s beak, the tokenizer’s fused token keeps both the ā€œbeakā€ meaning and its sharp edge so the model can localize it exactly.

šŸž Top Bread (Hook) Imagine a translator helping two people speak different languages, making sure meaning and structure stay aligned.

🄬 The Concept: Cross-Attention Fusion Mechanism

  • What it is: A way for semantic features and geometric features to exchange information so each patch token encodes both meaning and shape.
  • How it works:
    1. Create queries from geometry, keys/values from semantics.
    2. Match what shapes need with the best semantic clues.
    3. Produce fused features that are both meaning-rich and boundary-aware.
  • Why it matters: Without fusion, boundaries can get blurry or labels can drift away from the right spot.

šŸž Anchor On a street scene, cross-attention helps keep the ā€œcarā€ tokens aligned exactly with the car’s outline, not the road next to it.

Before vs After

  • Before: Vision was just context. Training optimized text only, losing detail and stalling on dense tasks without extra heads.
  • After: Vision becomes a prediction target. The model keeps fine detail, learns dense tasks directly, and scales better.

Why It Works (Intuition)

  • Predicting visual tokens is like asking the model to ā€˜redraw’ the picture in token form, preventing detail loss.
  • Keeping image input continuous (no quantization error on input) but predicting discrete tokens (stable targets) balances fidelity and learnability.
  • Unifying tokens means one brain for many tasks, not a toolbox full of separate gadgets.

Building Blocks

  • Unified vocabulary: one dictionary for words and image pieces.
  • Synergistic tokenizer: tokens that carry both meaning and shape.
  • Unified next-token prediction: one loop that can produce text answers and dense visual outputs.
  • Multi-label extension (NTP-M): one patch can carry object, color, and depth at the same time.

03 Methodology

High-Level Recipe: Input → Vision Encoder → Spatial Merge Projector → LLM with Unified Vocabulary → Mixed Text-and-Visual Token Predictions

Step A: Vision Encoder (Continuous Input Path)

  • What happens: The image is processed at native resolution by a strong vision encoder (SigLIP-2). It creates continuous embeddings rich in semantics and location cues.
  • Why this step exists: If inputs were discretized here, you’d lose fine details before the model even starts thinking.
  • Example: A 2048Ɨ2048 photo remains high-detail; the encoder preserves tiny logo patterns.

Step B: Spatial Merge Projector

  • What happens: Adjacent 2Ɨ2 patch features are merged to reduce token count to 1/4, then an MLP maps them into the LLM’s input space.
  • Why it matters: Without this, the sequence would be too long and slow. Merging keeps detail while speeding up training.
  • Example: Four tiny road patches become one token, but lane markings are still traceable.
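
A hedged sketch of the 2Ɨ2 merge plus MLP projection, with assumed feature sizes:

```python
import torch
import torch.nn as nn

class SpatialMergeProjector(nn.Module):
    """Toy version: group each 2x2 patch neighborhood, then project into the LLM's space."""
    def __init__(self, vision_dim=64, llm_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * 4, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, features):                      # features: (batch, H, W, vision_dim)
        b, h, w, c = features.shape
        merged = features.reshape(b, h // 2, 2, w // 2, 2, c)
        merged = merged.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        return self.mlp(merged)                        # token count reduced to 1/4

proj = SpatialMergeProjector()
print(proj(torch.randn(1, 8, 8, 64)).shape)            # torch.Size([1, 16, 128])
```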

Step C: Unified Vocabulary and Mixed Prediction

  • What happens: The LLM uses one big dictionary: normal text tokens plus visual tokens from the codebook. It predicts the next token, which could be a word or an image token.
  • Why it matters: Without one dictionary, the model would juggle two separate worlds and never master their step-by-step dance.
  • Example: Answering ā€œFind the cat and draw a boxā€ outputs class words and coordinate tokens in a single stream.

šŸž Top Bread (Hook) Think of building a box of color swatches so you can describe any shade exactly by picking its closest card.

🄬 The Concept: Visual Codebook

  • What it is: A learned set of prototype vectors; each image patch gets mapped to the nearest prototype and becomes that token ID.
  • How it works:
    1. Fuse semantic and geometric features (from SigLIP-2 and DINO-like encoders) using cross-attention.
    2. Project and quantize the fused features into codebook entries.
    3. Train with perceptual and adversarial losses so tokens capture structure and meaning, not just raw pixels.
  • Why it matters: Without a good codebook, visual tokens are blurry or repetitive and can’t supervise details.

šŸž Anchor Edges of a penguin’s flipper map to specific codebook entries that repeat across similar flippers in other images.

šŸž Top Bread (Hook) You know how a mall map uses exact coordinates so you can find a shop without guessing?

🄬 The Concept: Axis-Specific Vocabulary and Absolute Pixel Coordinates

  • What it is: Special tokens <x_#> and <y_#> that directly represent pixel positions on X and Y axes.
  • How it works:
    1. Expand the vocabulary with tokens for X and Y coordinates (e.g., 0–2048).
    2. Predict sequences like X1, Y1, X2, Y2 to represent boxes or keypoints.
    3. No normalization tricks—use real pixels to avoid confusion and scaling errors.
  • Why it matters: Without distinct axis tokens and absolute coordinates, the model can mix up X and Y or mis-scale boxes.

šŸž Anchor To box a red cup at the top left, the model outputs <x_120> <y_80> <x_260> <y_220>—simple and precise.

šŸž Top Bread (Hook) Imagine coloring a picture by picking the strongest color card for each patch.

🄬 The Concept: Dense Prediction from Standard VLM Outputs

  • What it is: Use the model’s own output scores (logits) over the unified vocabulary to produce dense maps (like segmentation or depth) without extra decoders.
  • How it works:
    1. For each category (e.g., ā€œroad,ā€ ā€œtreeā€), gather the scores of the tokens that spell that category.
    2. Average those scores per patch, reshape to a grid, and upsample.
    3. Pick the highest-scoring category per patch to get the final map; optionally refine with a CRF.
  • Why it matters: Without this, people add task-specific heads that make the system complex and less unified.

šŸž Anchor For ā€œroadā€ vs ā€œsidewalk,ā€ the model’s own scores pick the winner per patch to create a clean segmentation mask.

šŸž Top Bread (Hook) Sometimes a sticker can say more than one thing: shiny, blue, star-shaped. A single patch in an image can also hold multiple truths at once.

🄬 The Concept: Multi-Label Next Token Prediction (NTP-M)

  • What it is: A training trick that allows each patch to have multiple correct tokens (object, attribute, depth bin), not just one.
  • How it works:
    1. Create a multi-label target for each patch (many 1’s across different token IDs).
    2. Train separate yes/no decisions for each token ID.
    3. Sample only the most relevant negatives (the few confusing wrong labels) so learning doesn’t get drowned by easy negatives.
  • Why it matters: Without NTP-M, the model would pretend each patch has a single label and lose rich information.

šŸž Anchor A patch on a ā€œred car on roadā€ can be car + red + near-depth. NTP-M teaches the model to hold all three truths.

Training Stages (Simple View)

  • Stage 1–2 (Text Only): Build a strong brain for language and reasoning.
  • Stage 3 (Multimodal Foundation): Mix images and text; train the visual tokenizer and start unified supervision (predict text and visual tokens together).
  • Stage 4 (Task Adaptation): Teach many tasks like grounding, detection, pose, segmentation, depth, OCR, STEM, GUI; apply NTP-M for dense vision.

Secret Sauce

  • Asymmetric inputs/targets: Keep image input continuous for fidelity, but predict discrete visual tokens for stable, detail-rich supervision.
  • Unified tokens: One loop to rule text and vision makes the model general without bolt-on parts.
  • NTP-M: Teaches patches to be multi-talented (object + attribute + geometry).

04 Experiments & Results

The Test: What and Why

  • Visual Grounding: Can the model find exactly which object a phrase refers to? Important for pointing-and-acting.
  • Object Detection: Can it list objects with precise boxes? Key for counting and interaction.
  • Semantic Segmentation: Can it paint every pixel with the right label? Vital for safe navigation and editing.
  • Depth Estimation: Can it sense near vs far from a single image? Needed for 3D understanding.
  • Pose Estimation, Classification, Counting: Measures fine localization, category knowledge, and number sense.
  • General Multimodal VQA and OCR: Checks reasoning, chart reading, and document understanding.

The Competition: Who We Compared Against

  • General VLMs with standard architectures (e.g., Qwen3-VL, InternVL-3.5).
  • Vision-centric VLMs with added task heads or special tokens (e.g., VisionLLM v2, UFO, GiT).
  • Classic specialist models for single tasks (e.g., Mask2Former for segmentation, UniDepth-v2 for depth).

The Scoreboard (With Context)

  • Visual Grounding (RefCOCO family): Youtu-VL averages in the low 90s (%), which is like getting an A when strong peers sit at high-B to A- levels—showing it really locks onto the right object.
  • Object Detection (COCO val): 47.1% mAP. That’s neck-and-neck with GiT (46.7%) and close to UFO (48.9%), even though UFO uses extra task-specific designs. It’s like tying for silver without special gear.
  • Semantic Segmentation (ADE20k): 54.2% mIoU, beating GiT (47.8%). This is a solid step up, showing the dense prediction method works well.
  • COCOStuff: 52.5% mIoU without fine-tuning, noticeably ahead of some baselines. Like getting a sturdy A- where others get a B.
  • Depth (NYUv2): About 90% on a key accuracy-at-threshold metric, close to larger or specialized systems, but achieved inside a single general model.
  • Pose (MPII): 89.1% PCKh@0.5—competitive with specialist-level performance while remaining a unified VLM.
  • Counting (CountBench, TallyQA): Strong results, often ahead of similar-size general VLMs, thanks to precise localization.
  • OCR/Charts/Docs (TextVQA, ChartQA, DocVQA, CharXiv): Robust across many datasets, with notable strength on chart reasoning and layout consistency.
  • General VQA and Reasoning (MMBench, MathVerse, LogicVista, MMMU family): Competitive to strong on many, with room to grow on the most knowledge-heavy, long-context tasks.

Surprising Findings

  • Dense Prediction Without Extra Heads: Youtu-VL can produce segmentation and depth maps simply by reading its own output scores—no extra decoders required. That’s like building a bridge without needing a second crane.
  • Less Hallucination: On tricky tests that try to fool models into saying an object exists when it doesn’t, Youtu-VL is more cautious and image-grounded, likely because it’s trained to predict visual tokens that must match real details.
  • Better Scaling: When trained longer with VLUAS, performance keeps rising instead of flattening early. Think of climbing a hill and discovering it’s a gently rising slope, not a sudden cliff.

What It Means

  • Treating visual details as targets, not just inputs, helps the model keep tiny but crucial information.
  • A single architecture can cover a wide span of vision tasks competitively, reducing engineering complexity.
  • This sets the stage for generalist ā€œvisual agentsā€ that can understand, localize, and act with fewer moving parts.

05 Discussion & Limitations

Limitations (Honest View)

  • Low-Resolution Weakness: On small or blurry images, very tiny details can still be hard to recover perfectly.
  • Geometry Sensitivity: Depth and precise 3D understanding can depend on camera settings; unusual lenses or scenes may reduce zero-shot accuracy.
  • Very Deep Knowledge Reasoning: On expert-level or very long, multi-step problems, pure language specialists can still lead; there’s room to grow.

Required Resources

  • Big Training Mix: The system uses trillions of tokens across stages—large data curation and compute are needed.
  • Strong Encoders: It relies on advanced vision encoders and a learned codebook.
  • Long Context: Handling dense tasks and multi-image understanding benefits from longer context windows.

When NOT to Use

  • Pure Text-Only Expert Exams: If you only need top-tier language-only reasoning at graduate-level depth, a text-specialist LLM may be better.
  • Ultra-Precise Scientific 3D Without Calibration: For lab-grade measurements where camera intrinsics are critical, a calibrated specialist may be safer.
  • Extreme Tiny-Object Forensics on Low-Res Images: A dedicated super-resolution + specialist pipeline can still be preferable.

Open Questions

  • Even Better Tokens: Can the visual codebook become adaptive per scene, or combine with continuous hints without losing the simple training loop?
  • Richer Geometry: How to fold in camera parameters and multi-view cues so depth and pose generalize more broadly?
  • Long-Context Reasoning: How to scale stable chain-of-thought across many pages and images without drifting?
  • Safer and Fairer: How to further reduce hallucinations and biases while keeping strong performance?
  • Efficiency: Can we keep the unified strengths but cut compute and memory so it runs widely on edge devices?

06 Conclusion & Future Work

Three-Sentence Summary

  • Youtu-VL trains a single model to predict both words and visual tokens, switching from ā€œvision as inputā€ to ā€œvision as target.ā€
  • This unified supervision preserves fine details and enables dense tasks like segmentation and depth without extra decoders.
  • Experiments show competitive or superior performance across many benchmarks, with better scaling and fewer hallucinations.

Main Achievement

  • The #1 contribution is Vision-Language Unified Autoregressive Supervision (VLUAS): one mixed token stream for image and text that makes the model keep and use fine-grained visual information.

Future Directions

  • Strengthen geometry with better handling of camera intrinsics and multi-view cues.
  • Improve long-context reasoning and expert-domain knowledge.
  • Explore lighter, faster versions for on-device use while keeping unified capabilities.

Why Remember This

  • Youtu-VL shows that a single, standard VLM can be a true visual generalist by learning to predict visual details directly. This is a blueprint for simpler, stronger visual agents that see, think, and act without a pile of task-specific gadgets.

Practical Applications

  • Photo editing that can select and modify very fine details (hair, thin lines) using dense segmentation.
  • Robotics navigation and grasping that uses depth and precise localization without extra task modules.
  • Retail shelf analytics that count and locate products accurately from camera feeds.
  • Medical pre-screening tools that segment regions (e.g., organs) and estimate structures with clearer boundaries.
  • Document understanding systems that read forms, charts, and receipts with better layout grounding.
  • AR/VR scene understanding that labels and measures objects for interactive overlays.
  • Wildlife monitoring that detects, localizes, and counts animals in complex scenes.
  • Traffic analysis that segments roads, sidewalks, and vehicles for safer city planning.
  • GUI automation agents that robustly find buttons and fields and execute multi-step tasks.
  • Industrial inspection that detects tiny defects and measures their exact positions.
#Vision-Language Models #Unified Autoregressive Supervision #Visual Tokenization #Synergistic Vision Tokenizer #Image-Text Vocabulary #Dense Prediction #Semantic Segmentation #Depth Estimation #Object Detection #Visual Grounding #Multi-Label NTP #Cross-Attention Fusion #Hallucination Reduction #Generalist Visual Agent #Scaling Laws