SAMTok: Representing Any Mask with Two Words
Key Summary
- SAMTok turns any object's mask in an image into just two special "words" so language models can handle pixels like they handle text.
- This keeps everything simple: no new heavy vision heads, no tricky losses, just the usual next-token prediction used in chat models.
- It is trained on 209 million masks and can faithfully reconstruct masks from those two tokens using a SAM2-based encoder-decoder.
- By adding these mask words to an MLLM's vocabulary, the model can read masks (understand) and write masks (generate) in plain dialogue.
- A simple text-only reward lets the model learn better mask generation with reinforcement learning, avoiding complex pixel IoU reward plumbing.
- On many benchmarks (region captioning, GRES, GCG, interactive segmentation), models with SAMTok reach state-of-the-art or comparable results.
- RL with a textual answer-matching reward brings big gains on GRES and GCG, improving both localization and recall.
- The approach scales easily across different MLLMs because the tokenizer is decoupled and uses only two tokens per mask.
- This unifies lots of pixel-level tasks into one language-like interface, making training and inference faster and simpler.
- It points to a future where models talk about and precisely highlight image regions as easily as they use words.
Why This Research Matters
SAMTok turns pixel-precise shapes into short, language-like tokens, letting models talk about exact image regions as easily as words. This makes interactive tools such as photo editors, design assistants, and accessibility helpers more accurate and responsive. It also cuts engineering complexity, because training can reuse the same simple recipe used for text, and reinforcement learning can use fast, text-style rewards. With two tokens per mask, responses stay short and efficient, which matters for real-time apps and large-scale deployments. The approach generalizes well across tasks and models, hinting at a common "mask language" for the MLLM ecosystem. In short, SAMTok is a clean bridge between language and pixels that unlocks precise, scalable multimodal interaction.
Detailed Explanation
01 Background & Problem Definition
You know how you can point at a picture and say, "Describe that cat's tail," and a friend immediately knows exactly which part you mean? Computers haven't always been so good at that. AI models could talk about whole images, but when it came to pinpointing exact pixels, like outlining the tail, the wheel of a bike, or a tiny button, things got complicated.
The world before: Multimodal large language models (MLLMs) had become great at chatting about images, reading text in them, and answering questions. But when someone wanted pixel-perfect actions, like "highlight the third leaf," "cut out the sticker," or "segment the left window," systems needed extra add-ons: special region encoders to feed masks into the model, and special decoders to spit out masks. Each of these parts had its own training rules. It was like building a robot with many arms that didn't quite agree on how to move.
The problem: Four frictions made scaling pixel-wise MLLMs hard. First, "mask in" and "mask out" were handled by different, complicated parts. Second, reinforcement learning (RL), which supercharges models, was tricky because most systems used continuous (not word-like) features for masks, so rewards were hard to define and compute. Third, training these systems together with normal text and VQA data was messy, since different tasks needed different losses and pipelines. Fourth, some tried to represent masks as long text strings (like polygons or run-length codes), but that created long, slow outputs: dozens to hundreds of tokens for a single mask.
Failed attempts: Teams tried boxes or points to simplify the problem, but these are fuzzy pointers, not exact shapes. Others built powerful segmentation heads attached to MLLMs, but then you had to co-train everything with custom losses and wiring. Some encoded masks as images or long token streams, which worked but made inference heavy and slow.
The gap: What if we could treat a mask like a single, short word: compact, precise, and easy to learn with the same training recipe as text? If masks could be read and written the same way models handle language, we could drop the extra vision heads, avoid special losses, and even do RL with simple text-style rewards.
Real stakes: Think about photo editing where you say, "Remove the shadow under the second cup," and the model cleanly selects exactly that region. Or interactive design tools that highlight parts while you chat. Or accessibility tools that precisely name and outline objects. Or robots that need pinpoint, pixel-accurate directions like "grasp only the red cap." A language-native mask interface makes these interactions faster, clearer, and easier to scale across many models and tasks.
02 Core Idea
Hook: Imagine you have a magic label maker that can stick a tiny, two-letter tag on any shape in a picture, and when your friend sees the two letters, they can redraw that exact shape perfectly.
The Concept: SAMTok is a "mask tokenizer" that turns any region mask into exactly two special tokens (two "words"), and turns those two tokens back into the original mask with high fidelity.
- How it works (big picture):
- 1) Take an image and a mask. 2) Encode the mask into a compact vector. 3) Discretize it into two codewords. 4) Later decode those two codewords (plus the image) back into the full mask. 5) Add these codewords to an MLLM's vocabulary so the model can read and write them like normal words.
- Why it matters: Without a tiny, word-like mask, we need bulky modules, long token strings, or special losses. With two tokens, training is as easy as text training, and reinforcement learning becomes simple.
Anchor: In a photo of a garden, the mask for "the third red tulip" becomes two tokens like <|mt_0011|><|mt_0347|>. The model can now "say" these two tokens to produce that exact tulip's shape, or read them to understand which tulip you mean.
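To make the two-word format concrete, here is a tiny, hedged sketch of how such tokens could be written and read as strings. The <|mt_...|> naming follows the examples in this article; the exact delimiters, zero-padding, and helper names are illustrative assumptions, not the paper's code.

```python
import re

def format_mask_word(code_a: int, code_b: int) -> str:
    """Wrap two codebook indices as the two-token mask 'word' shown above (format assumed)."""
    return f"<|mt_start|><|mt_{code_a:04d}|><|mt_{code_b:04d}|><|mt_end|>"

def parse_mask_words(text: str) -> list:
    """Recover (code_a, code_b) pairs from a model response."""
    pattern = r"<\|mt_start\|><\|mt_(\d+)\|><\|mt_(\d+)\|><\|mt_end\|>"
    return [(int(a), int(b)) for a, b in re.findall(pattern, text)]

print(format_mask_word(11, 347))  # <|mt_start|><|mt_0011|><|mt_0347|><|mt_end|>
print(parse_mask_words("Highlight <|mt_start|><|mt_0011|><|mt_0347|><|mt_end|> please"))  # [(11, 347)]
```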
Multiple analogies for the same idea:
- Library card analogy: Each mask gets a two-code library card. Show the card, and the librarian (decoder) retrieves the exact book (mask) instantly.
- GPS pin analogy: Two short coordinates act like a precise GPS pin for a shape; the decoder navigates directly to the right pixels.
- Secret handshake analogy: Two quick moves (tokens) are enough to unlock the exact mask; longer dances (long token strings) aren't needed.
Before vs. after:
- Before: MLLMs needed specialized region encoders for inputs, segmentation decoders for outputs, and special losses. RL required mask IoU tools and extra models.
- After: MLLMs just read and write two mask tokens using normal next-token prediction. RL can use simple text-style rewards by checking token matches, with no extra mask decoders in the RL loop.
Why it works (intuition, not equations):
- Many different masks can be represented by a compact embedding. Residual vector quantization "snaps" that embedding to two nearby codes that keep the important details while staying short. A strong decoder (built on SAM2) uses the image and these two codes to reconstruct the detailed mask.
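The snapping intuition can be shown in a few lines of NumPy. This is a toy sketch with made-up codebook sizes and a synthetic embedding, not SAMTok's trained quantizer; it only illustrates the two-step pick-a-code-then-quantize-the-residual mechanic.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K = 16, 512                                   # toy embedding size and codebook size
cb1 = rng.normal(size=(K, dim))                    # coarse codebook
cb2 = 0.1 * rng.normal(size=(K, dim))              # smaller-scale codebook for residuals

# Pretend the encoder produced this mask embedding (built from two codes plus noise
# so the toy example quantizes cleanly).
z = cb1[123] + cb2[45] + 0.01 * rng.normal(size=dim)

i = int(np.argmin(np.linalg.norm(cb1 - z, axis=1)))              # step 1: nearest coarse code
j = int(np.argmin(np.linalg.norm(cb2 - (z - cb1[i]), axis=1)))   # step 2: nearest code to the leftover
print("selected codes:", i, j)                                    # 123 45
print("reconstruction error:", np.linalg.norm(z - (cb1[i] + cb2[j])))
```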
Building blocks (each explained with a Hook-Concept-Anchor sandwich):
Hook: You know how we use words to label things so everyone understands?
Mask Tokenization: Turning a detailed pixel shape (mask) into a tiny, fixed pair of tokens. It works by first encoding the mask as a vector, then discretizing it into two codebook entries. Without tokenization, MLLMs can't treat masks like language, making training clunky.
Anchor: "Highlight the small blue sticker" becomes two mask tokens, so the model knows the exact sticker region.
Hook: Imagine summarizing a complex doodle into a single secret note.
Mask Encoder: A SAM2-based module that converts a 2D mask into a compact vector. It adds the mask prompt to image features and outputs a single mask embedding. Without it, there's no clean summary to discretize.
Anchor: The region "cat's left ear" becomes a single embedding ready to quantize.
Hook: Think of rounding to the nearest two "Lego bricks" to build a tiny but accurate model.
Residual Vector Quantization: A two-step snapping that finds the closest code, then snaps the leftover difference to a second code. This keeps details while using only two small tokens. Without RQ, you'd need huge codebooks or lose fidelity.
Anchor: An odd-shaped leaf gets represented by codes A and B; the decoder rebuilds the full leaf outline.
Hook: Like guessing the next word in a sentence you're reading.
Next-token Prediction: The same training rule used for text also trains the model to output the right mask tokens. It simply learns which two tokens complete the response. Without this, we'd need special segmentation losses and code.
Anchor: For "segment the 'striped mug'," the model outputs the two tokens that decode to that mug's mask.
Hook: Picture a chat where you can insert a tiny symbol that means "this exact region."
Unified Mask-Token Interface: Adding mask tokens to the model's vocabulary so it can read masks in prompts and write masks in answers. Without it, mask-in and mask-out need separate machinery.
Anchor: A prompt includes <|mt_start|><|mt_0011|><|mt_0347|><|mt_end|> to refer to a tulip; the answer uses the same style to return masks.
Hook: Like a quiz where you get points only if your answer exactly matches the key.
Textual Answer-Matching Reward: In RL, count how many predicted mask tokens match the ground-truth tokens. It's simple, fast, and avoids extra tools. Without it, RL needs pixel IoU and extra decoders.
Anchor: The model proposes three mask words; two appear in the ground truth, so the reward is 2/3.
Hook: Training a dog with treats for the right trick.
Reinforcement Learning with GRPO: An RL method that nudges the model to pick better mask tokens over time, guided by the text-style reward. Without RL, the model might plateau after supervised training.
Anchor: After RL, the model finds "the leftmost spoon" more reliably and draws cleaner edges.
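For intuition, here is a minimal sketch of the group-relative scoring that gives GRPO its name: several responses are sampled for the same prompt, each gets a reward (here, the mask-token matching score described above), and responses are pushed up or down relative to the group average. The reward values are made up, and the actual policy update (clipped ratios, KL regularization) is omitted.

```python
import numpy as np

# Made-up rewards for a group of sampled responses to the same prompt,
# e.g., the fraction of predicted mask words that match the ground truth.
group_rewards = np.array([0.75, 0.50, 1.00, 0.25])

# Group-relative advantage: responses better than the group average are
# reinforced, worse-than-average ones are discouraged.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)
print(advantages.round(2))   # roughly [ 0.45 -0.45  1.34 -1.34]
```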
03 Methodology
At a high level: Image + (optional) mask → SAMTok Encoder → Two mask tokens (via residual quantization) → SAMTok Decoder (with image) → Reconstructed mask → Plug tokens into MLLM to read/write masks via next-token prediction.
Stage A: Tokenizing masks into two tokens (Encoder + Quantizer)
- What happens: Given an image and a region mask, the SAMTok encoder (built on SAM2) fuses the image features with a dense representation of the mask prompt, then outputs a compact mask embedding (a single vector). Residual vector quantization (RQ) snaps this vector to two codebook entries: first the nearest code, then the nearest code to the leftover residual. These two codes are mapped to two special tokens in the MLLM's vocabulary.
- Why this step exists: We need a short, discrete, information-rich handle for each mask so MLLMs can treat masks like words. Without quantization, the model would need continuous features (hard for RL) or long token strings (slow and costly).
- Example: In a kitchen photo, the region "the handle of the striped mug" becomes <|mt_0011|><|mt_0347|>. Two tokens, one exact handle. (A small code sketch covering Stages A and B follows Stage B below.)
Stage B: Decoding two tokens back into a mask (Decoder)
- What happens: During inference, the two tokens map back to their codebook vectors. The decoder (full SAM2) takes the image and these vectors as sparse prompts, attends across image features, and reconstructs the 2D mask.
- Why this step exists: To visualize or use the predicted mask, we must turn tokens back into pixels. Without the decoder, we'd have labels but no shapes.
- Example: Given <|mt_0011|><|mt_0347|> and the kitchen image, the model redraws the mug handle mask accurately.
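The sketch below strings Stages A and B together with stand-ins: a toy pooling-plus-projection function plays the role of the SAM2-based encoder, random arrays play the role of the learned codebooks, and the decoder stub just returns the quantized embedding that the real SAM2 decoder would turn back into pixels. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 16, 1024                                    # toy embedding size and codebook size
cb1, cb2 = rng.normal(size=(K, DIM)), rng.normal(size=(K, DIM))
proj = rng.normal(size=(8 * 8, DIM))                 # stand-in for learned encoder weights

def encode_mask(mask: np.ndarray) -> np.ndarray:
    """Stand-in for the SAM2-based encoder (image features omitted in this stub):
    pool the mask to 8x8 and project it to one compact embedding."""
    h, w = mask.shape
    pooled = mask.reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3)).reshape(-1)
    return pooled @ proj

def tokenize(mask: np.ndarray) -> tuple:
    """Stage A: mask embedding -> two codebook indices via residual quantization."""
    z = encode_mask(mask)
    i = int(np.argmin(np.linalg.norm(cb1 - z, axis=1)))
    j = int(np.argmin(np.linalg.norm(cb2 - (z - cb1[i]), axis=1)))
    return i, j

def detokenize(i: int, j: int) -> np.ndarray:
    """Stage B stub: the real SAMTok decoder feeds cb1[i] + cb2[j] plus the image
    into a SAM2 decoder to reconstruct the 2D mask; here we just return the embedding."""
    return cb1[i] + cb2[j]

mask = np.zeros((64, 64)); mask[20:40, 10:30] = 1.0  # toy rectangular mask
i, j = tokenize(mask)
print(f"mask -> <|mt_{i:04d}|><|mt_{j:04d}|>; decoder prompt shape: {detokenize(i, j).shape}")
```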
Stage C: Unifying with MLLMs via vocabulary extension
- What happens: We add mask tokens to the MLLM's vocabulary, like adding new words. Now prompts can include mask tokens (mask-in), and the model can answer with mask tokens (mask-out). Training uses the same next-token prediction as normal text.
- Why this step exists: It merges pixel tasks into the language pipeline so we don't need special heads or losses. Without it, tasks would remain fragmented.
- Example: Prompt: "For <|mt_start|><|mt_0011|><|mt_0347|><|mt_end|>, describe it." Answer: "It's a short, curved handle with blue stripes." Or: "Find the 'green button'" → <|mt_start|><|mt_0210|><|mt_0455|><|mt_end|>.
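As a concrete illustration of the vocabulary extension, here is a minimal sketch using the Hugging Face Transformers API. The base model name, the number of mask codes, and the token naming are assumptions (a small text-only model stands in for the actual multimodal backbone); the paper's training code may differ, but the idea is the same: mask tokens become ordinary vocabulary entries trained with next-token prediction.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"            # small stand-in for the real MLLM backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Two delimiter tokens plus one token per codebook entry (1024 is an assumed size).
mask_tokens = ["<|mt_start|>", "<|mt_end|>"] + [f"<|mt_{k:04d}|>" for k in range(1024)]
num_added = tokenizer.add_tokens(mask_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))        # new embedding rows for the new "mask words"
print(f"added {num_added} mask tokens; vocabulary size is now {len(tokenizer)}")
```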
Stage D: Supervised fine-tuning (SFT) with mixed tasks
- What happens: Convert diverse datasets (region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, interactive segmentation) into dialogue-style text. All masks inside these datasets are pre-tokenized to two tokens. Train the MLLM only with next-token prediction.
- Why this step exists: It standardizes training across many tasks and scales easily. Without this unification, each task would require separate designs and losses.
- Example: For GCG, interleave mask tokens with the phrases they ground in a caption: "A man holds <|mt(apple)|> an apple and looks at <|mt(dog)|> a dog." The model learns to align text spans and mask tokens.
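A hedged sketch of that conversion, with made-up field names and a simplified chat format; the paper's templates will differ, but the point is that once masks are pre-tokenized, every task reduces to plain text targets.

```python
def refseg_to_dialogue(phrase, mask_word):
    """Referring segmentation -> a text-only training example (field names assumed)."""
    return [
        {"role": "user", "content": f"Segment the '{phrase}' in the image."},
        {"role": "assistant", "content": mask_word},
    ]

def gcg_to_dialogue(caption_parts):
    """GCG -> a caption with mask words interleaved next to grounded phrases.
    caption_parts is a list of (text span, mask word or None); placement is one possible convention."""
    answer = " ".join(f"{span} {mw}" if mw else span for span, mw in caption_parts)
    return [
        {"role": "user", "content": "Describe the image and ground each mentioned object."},
        {"role": "assistant", "content": answer},
    ]

print(gcg_to_dialogue([
    ("A man holds an apple", "<|mt_start|><|mt_0102|><|mt_0733|><|mt_end|>"),
    ("and looks at a dog", "<|mt_start|><|mt_0045|><|mt_0511|><|mt_end|>"),
]))
```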
Stage E: Reinforcement learning (GRPO) with text-only rewards
- What happens: Roll out the model on tasks like GRES or GCG. Extract all predicted mask tokens from its responses. Count how many match the ground-truth token set (after deduplication). Compute a simple reward = matches / max(number predicted, number in ground truth). Optimize with GRPO.
- Why this step exists: RL pushes the model to choose better tokens for harder cases (crowded scenes, subtle relations) without requiring pixel IoU computations or external decoders during reward calculation. Without RL, improvements may stall after SFT.
- Example: The model predicts four mask pairs in a caption; three pairs match the ground-truth set, so the reward is 3/4.
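Here is a minimal sketch of that text-only reward. The token pattern, the deduplication rule, and the exact normalization are assumptions based on the description above, not the paper's implementation.

```python
import re

MASK_PAIR = re.compile(r"<\|mt_(\d+)\|><\|mt_(\d+)\|>")

def mask_token_reward(prediction: str, ground_truth: str) -> float:
    pred = set(MASK_PAIR.findall(prediction))        # deduplicated predicted pairs
    gt = set(MASK_PAIR.findall(ground_truth))        # deduplicated ground-truth pairs
    if not pred or not gt:
        return 0.0
    return len(pred & gt) / max(len(pred), len(gt))  # matches / max(#predicted, #ground truth)

pred = "a man <|mt_0102|><|mt_0733|> holds a phone <|mt_0001|><|mt_0002|>"
gt   = "a man <|mt_0102|><|mt_0733|> holds an apple <|mt_0045|><|mt_0511|>"
print(mask_token_reward(pred, gt))                   # 1 match / max(2, 2) = 0.5
```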
The Secret Sauce
- Two-token design: Just two tokens per mask keeps inputs/outputs short, speeds training/inference, and reduces confusion compared to long polygon/RLE strings.
- Residual vector quantization: Two small codebooks, two steps, high fidelity. It balances accuracy with short outputs, making it easy for MLLMs to learn distribution over mask tokens.
- SAM2 backbone: A strong, proven segmentation foundation makes both encoding (summarizing the mask) and decoding (reconstructing the mask) reliable.
- Language-native RL: Because masks are words, we can do RL by matching words: fast, scalable, and tool-free.
Sandwich callouts inside the method:
- Hook: Labeling a shape with a nickname.
Mask Tokenization: Two tokens that uniquely identify a region's shape; encoder + RQ produce them; decoder recovers pixels. Without it, masks stay outside the language loop.
Anchor: "the left window" → <|mt_0005|><|mt_0321|>.
- Hook: Snapping a sketch to the closest stencil, then fixing small gaps with a second stencil.
Residual Vector Quantization: First code captures the big picture; second code refines details. Without it, you'd need either a huge codebook or lose detail.
Anchor: A wavy leaf = Code 12 + Code 347.
- Hook: Guessing the next word in a sentence.
Next-token Prediction: The same training rule for text now also teaches the model which two mask tokens to output. Without this, extra segmentation losses are required.
Anchor: Prompt: "Segment 'blue sock'." Model outputs the token pair for that sock.
- Hook: Stickers in a story.
Unified Interface: Mask tokens become stickers you can place in prompts or answers. Without them, mask-in and mask-out need separate systems.
Anchor: Prompt includes a mask; the answer returns another mask; both are two-token stickers.
04 Experiments & Results
The Test: The authors evaluate three abilities: (1) generating text and masks together, (2) answering with masks from text (text-to-mask), and (3) describing a given region (mask-to-text). They also test multi-round interactive segmentation (keeping track of regions over several turns), scene graph parsing (predicting objects and their relations plus masks), and visual grounding (boxes derived from masks). The key goal: show that a two-token mask language works across many tasks using standard training.
The Competition: Strong baselines include LISA, GLaMM, OMG-LLaVA, Sa2VA, and others with specialized mask heads or losses. The SAMTok versions fine-tune QwenVL family models (3B-7B) and sometimes apply RL (GRPO) with a text-only reward.
The Scoreboard with context:
- Grounded Conversation Generation (GCG): Models must write a caption while producing correct masks for mentioned phrases. SAMTok-based Qwen models reach new SOTA on validation: better caption scores (e.g., +1.3% METEOR, +5.5% CIDEr vs prior best) and stronger mask metrics (+5.3% AP50, +5.2% mIoU, +4.7% Recall). On test, similar gains hold. In school terms, SAMTok gets an A+ where others get solid As or Bs, especially on the exactness of region picks.
- Multi-round Interactive Segmentation (MR-RefCOCO/+/g, MR-PACO): SAMTok shines at long conversations about parts and objects. It improves average cIoU by big margins (e.g., +7.7% across rounds on MR-RefCOCO/+/g and +10.7% on MR-PACO), meaning it remembers who's who and where they are across turns, like keeping track of characters and props in a long play.
- Text-to-Mask (GRES and RefCOCO family): Even trained only with next-token prediction, SAMTok models beat or match methods using special segmentation losses. On GRES, average gIoU rises by about +1.5% over the previous best and N-acc by +4.3%. Zero-shot GroundingSuite also favors SAMTok (67.8 vs. 62.6), showing strong generalization. Think of it as correctly coloring inside the lines more often, even when the coloring book is new.
- Mask-to-Text (Region Captioning): On DLC-Bench, SAMTok models get close to expert systems (65.6 vs. 67.3) and outperform general MLLMs by a large margin. On MDVP-Bench (documents, multi-panels, screenshots), SAMTok excels in 3 of 4 subsets, meaning it grounds regions precisely in unusual scene types.
- Visual Grounding (REC): When SAMTok outputs masks and the system evaluates bounding boxes derived from those masks, accuracy jumps notably over vanilla QwenVL at both 3B and 7B sizes. So even if you only need boxes, starting from good masks gives better boxes.
Surprising Findings:
- Text-only RL reward works well: By simply checking whether the predicted mask tokens match ground-truth tokens (no pixel IoU calculators), RL brings big gains. On GRES val: +6.8% gIoU, +4.9% cIoU, and +18.9% N-acc after RL. On GCG val, mask metrics also rise meaningfully (+4.5% AP50, +2.0% mIoU, +6.6% Recall). Caption quality dips slightly when not rewarded for text, which is expected.
- Two tokens are enough: With a solid decoder and residual quantization, two tokens per mask keep performance high while making training/inference short and sweet. Ablations show the chosen 2-step RQ with modest codebooks balances fidelity and learnability best.
- Plug-and-play across MLLMs: Because SAMTok is decoupled, it works with different vision encoders (tile-based or adaptive-resolution) using the same data. This hints at a standard, reusable "mask word" layer for the MLLM ecosystem.
What the numbers mean (kid-friendly):
- gIoU/cIoU/mIoU: Bigger is better; it means the drawn outline hugs the real shape more closely. (A tiny IoU example follows this list.)
- AP50/Recall: Think of AP50 as scoring how often the modelâs masks are good enough to pass a 50% overlap test and Recall as how many true items it found. Higher means it finds and outlines more of the right stuff.
- METEOR/CIDEr: These grade the "writing" part, the captions. Higher is better wording and alignment to references.
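For readers who want the underlying formula, here is a minimal IoU sketch; gIoU, cIoU, and mIoU are, roughly, different ways of averaging this same overlap score over examples or pixels. The arrays and the empty-mask convention are illustrative assumptions.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                   # a common convention for no-target cases
    return float(np.logical_and(pred, gt).sum() / union)

pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True   # 4 predicted pixels
gt   = np.zeros((4, 4), dtype=bool); gt[1:3, 1:4] = True     # 6 ground-truth pixels
print(round(iou(pred, gt), 3))                               # 4 overlap / 6 union = 0.667
```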
Bottom line: Across writing, pointing, and outlining, SAMTok's two-token trick helps models keep their answers neat, precise, and fast, while staying inside the familiar world of language training and RL.
05 Discussion & Limitations
Limitations:
- 2D only (for now): SAMTok trains on 2D image masks. It doesn't reconstruct video masks yet, so tracking the same object over time still needs future extensions.
- Masks only: Points, lines, and boxes aren't yet tokenized in this framework. Adding them could make interactions even more flexible but requires new tokenizer designs.
- Decoder dependence: High-fidelity reconstruction relies on a strong SAM2-based decoder. In domains far from training data (e.g., unusual sensors), quality may drop until adapted data or fine-tuning is used.
- Token collisions under extreme density: In scenes with many tiny, similar parts packed together, learning to emit the exact pair for each distinct part can still be challenging.
Required Resources:
- Training compute: SAMTok training used large-scale data (209M masks) and strong GPUs (e.g., A100 80GB). While inference is lightweight, reproducing training from scratch needs substantial compute.
- Data curation: The approach benefits from diverse, high-quality masks. New domains (medical, satellite, UI) may require good coverage for best results.
When NOT to Use:
- If you strictly need temporal consistency (video segmentation) without adding a video-aware decoder, SAMTok-2D alone won't suffice.
- If a task only needs coarse boxes and never needs precise shapes, a box-only system may be simpler.
- Ultra-low-resource training settings where adding any tokens or external decoder is impossible.
Open Questions:
- Video mask tokenization: What's the best way to extend two-token masks across time: add a tiny "motion" token, or learn a temporal codebook?
- More geometry types: Can we build a shared vocabulary for points, lines, boxes, and masks that still keeps outputs short and RL-friendly?
- Finer RL signals: Beyond exact token matching, can we design text-native rewards that softly grade "almost-right" masks without computing pixel IoU?
- Multilingual masks: How do mask words interact with multilingual vocabularies and code-switched prompts?
- Generative editing: Can the same two-token interface guide pixel-level image editing or generation with precise, conversational control?
06 Conclusion & Future Work
Three-sentence summary: SAMTok turns any image region mask into exactly two special tokens and back again, letting MLLMs read and write pixel-precise shapes as easily as words. This removes the need for special segmentation heads and losses, enabling simple next-token training and a text-only reward for RL. Across many benchmarks, models equipped with SAMTok achieve state-of-the-art or comparable performance while staying fast and scalable.
Main achievement: A unified, language-native interface for pixel-level understanding and generation (two tokens per mask) that makes supervision and reinforcement learning for masks as simple as for text.
Future directions: Extend from images to videos, add tokenizers for points/lines/boxes, explore softer text-native rewards, and apply the interface to image generation/editing and broader VQA tasks. Investigate domain adaptation for specialized areas (e.g., UIs, documents, scientific imagery) with minimal extra data.
Why remember this: Like the jump from coordinates to place names on a map, SAMTok names any shape with just two "words," letting language models converse about exact pixels. It simplifies training, speeds inference, and opens the door to robust RL without complicated tools. This elegant bridge between language and pixels could become a standard layer in future multimodal systems.
Practical Applications
- Voice-guided photo editing: "Remove the shadow under the second cup," with precise, two-token masks.
- Design assistants: Highlight and describe specific UI elements in mockups or screenshots during a chat.
- Educational tools: Let students ask for parts of diagrams (e.g., "label the mitochondria") and get exact outlines and explanations.
- Accessibility support: Precisely identify and describe objects or regions in images for users with low vision.
- Robotics: Give pixel-accurate instructions like "grasp only the red cap," improving reliability of manipulation.
- AR overlays: Place annotations exactly on object parts (e.g., "show torque points on this bike") with crisp masks.
- E-commerce: Let users circle or refer to product parts ("show me only the collar") for better search and recommendations.
- Medical imaging workflows (with caution): Highlight regions of interest specified by clinicians for review, keeping the interface simple and consistent.
- Document understanding: Accurately segment tables, figures, or panels in complex layouts for extraction.
- Content moderation or redaction: Precisely mask sensitive regions (faces, license plates) via conversational commands.