Towards Pixel-Level VLM Perception via Simple Points Prediction
Key Summary
- SimpleSeg teaches a multimodal language model to outline objects by writing down a list of points, like connecting the dots, instead of using a special segmentation decoder.
- It keeps everything in plain text (coordinates from 0 to 1), so the model can talk about images and also precisely show where things are at the pixel level.
- Training happens in two steps: first supervised fine-tuning (to learn the format and basics), then reinforcement learning (to improve whole-shape accuracy using an IoU-based reward).
- A simple trick—tracing the object boundary as an ordered polygon—controls token length and makes outputs human-readable and easy to check.
- On standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg), SimpleSeg matches or beats many complex decoder-based methods, especially after the RL step.
- The method generalizes beyond real photos to drawings, charts, GUIs, and anime, thanks to its language-aligned output format.
- A key finding is that standard MLLM architectures already have strong low-level perception inside them; they just need the right training and target format to unlock it.
- Ablations show a sweet spot in how many points to use: too few miss details, too many cause long-sequence errors; RL naturally finds a good balance.
- Limitations include very long sequences for highly curved shapes, trouble at sharp corners and thin parts when points are too sparse, and handling holes inside objects.
Why This Research Matters
Precise, text-native coordinates let one model both explain and act, enabling tools that can talk about an object and also select it exactly. This helps interactive photo editing, where you must grab only the part you mean (like just the bracelet). It supports GUI agents that need to click the right button, not the one next to it, by grounding actions at the pixel level. It improves accessibility tools that highlight the exact region being described or read aloud. It also simplifies engineering: no extra decoders, fewer moving parts, and outputs that are easy to audit and debug. Because the method generalizes to cartoons, charts, and screens, it fits real-world apps far beyond natural photos.
Detailed Explanation
01 Background & Problem Definition
You know how you can circle a picture of a cat exactly along its fur, not just draw a big box around it? For a long time, most AI models that both see and talk (multimodal large language models, or MLLMs) were great at describing what’s in a picture and answering questions about it, but they didn’t draw that tight, neat outline. They could point with a bounding box, which is fast but fuzzy: boxes cover lots of extra background and miss fine details like tails, whiskers, or laces on shoes.
Imagine trying to do super-precise photo edits, like changing only the icing on a donut, or building a robot helper that taps exactly the right button in a phone app. Coarse boxes aren’t good enough for those jobs. People built special add-on parts (decoders) that could make pixel-perfect masks. Those worked, but they made the model more complicated, harder to train end-to-end, and less like a pure language model. Other groups tried to keep everything as text by writing masks as long strings (like run-length encoding) or by using very short polygons. Those stayed in the language space but either blew up the token count or lost fine details.
So the field had a trade-off: keep it simple and language-like but lose accuracy, or add complex decoders to get accuracy but lose the elegant, unified interface. That left a gap for anyone who wanted one model that reasons with language and also draws precise boundaries without extra parts.
Before we dive into the new idea, let’s warm up with the key building blocks, in the order we’ll need them.
🍞 Top Bread (Hook): You know how a puppy learns tricks better with treats? 🥬 The Concept (Reinforcement Learning): What it is: Reinforcement Learning (RL) is a way for a model to learn by trying, getting a score (reward), and doing more of what earns higher scores. How it works:
- The model makes a full attempt (like drawing a whole outline).
- A judge gives a reward based on how good the attempt is.
- The model updates itself to make high-reward attempts more likely next time.
Why it matters: When the goal is a whole shape, judging token-by-token is weak; RL lets us grade the final outline. 🍞 Bottom Bread (Anchor): Like a dog trying to sit nicely; when it sits just right, it gets a treat and repeats that behavior.
🍞 Top Bread (Hook): Imagine a coach marking mistakes on your homework so next time you write your answers in the right format. 🥬 The Concept (Supervised Fine-Tuning, SFT): What it is: SFT teaches a model from examples so it learns the correct output format and basic skills. How it works:
- Show input-output pairs (question → correct coordinates).
- Nudge the model to match the example answers.
- Repeat until it reliably outputs well-formed results.
Why it matters: Without SFT, the model might produce messy text and brackets that don’t parse. 🍞 Bottom Bread (Anchor): Like practicing math worksheets so you remember to write answers neatly with units.
🍞 Top Bread (Hook): Imagine labeling your closet so you can always find socks, shirts, and hats fast. 🥬 The Concept (Data Annotation Pipeline): What it is: A pipeline that automatically labels images with objects, masks, and point sequences. How it works:
- Detect objects/text with GroundingDINO.
- Get masks from SAM.
- Convert each mask to an ordered contour (points) using a boundary-tracing algorithm.
- Optionally add short descriptions with a VLM.
Why it matters: Good labels at scale are the fuel that trains the model reliably. 🍞 Bottom Bread (Anchor): Like sorting toys into labeled bins so clean-up and playtime are easy.
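To make the mask-to-polygon step concrete, here is a minimal Python sketch assuming OpenCV and a binary mask as input; the detection (GroundingDINO) and segmentation (SAM) calls are omitted. OpenCV's findContours implements the Suzuki–Abe boundary-following algorithm named later in the Methodology.

```python
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray) -> list:
    """Convert a binary mask (H, W) into one ordered boundary polygon with
    coordinates normalized to [0, 1]."""
    h, w = mask.shape
    # cv2.findContours implements the Suzuki-Abe boundary-following algorithm.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    boundary = max(contours, key=cv2.contourArea).reshape(-1, 2)
    # Normalize so the points live in the same text space as language output.
    return [[round(float(x) / w, 3), round(float(y) / h, 3)] for x, y in boundary]
```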
🍞 Top Bread (Hook): Think of connect-the-dots pages—you reveal a shape by listing points in order. 🥬 The Concept (Point Prediction): What it is: Predicting a series of 2D points that follow an object’s edge. How it works:
- Read the image and the question.
- Output the next coordinate one step at a time.
- Keep going until the shape loops back to the start.
Why it matters: Points are compact, readable, and live in the same text space as language. 🍞 Bottom Bread (Anchor): Writing [(0.8, 0.4), (0.78, 0.45), …] to show exactly where the mouth is.
🍞 Top Bread (Hook): When telling a story, each sentence must follow the last to make sense. 🥬 The Concept (Sequence Generation Problem): What it is: Creating a meaningful list (sequence) of items, one after another. How it works:
- Decide the first item (e.g., the first point).
- At each step, choose the next one based on what’s already written.
- Stop when the list completes the intended structure.
Why it matters: Good outlines need good order; random points won’t form a proper shape. 🍞 Bottom Bread (Anchor): Like spelling a word letter-by-letter so it reads correctly when finished.
🍞 Top Bread (Hook): You know coloring books where you try to stay inside the lines? 🥬 The Concept (IoU-based Reward): What it is: A score that measures how much your predicted area overlaps the true area. How it works:
- Turn your predicted points into a filled mask.
- Compare with the ground-truth mask.
- The higher the overlap (IoU), the better the reward.
Why it matters: It teaches the model to fit the real shape tightly, not just some of the points. 🍞 Bottom Bread (Anchor): If your coloring matches the picture perfectly, you get a high score.
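Here is a minimal sketch of how such an IoU-based reward can be computed, assuming OpenCV and NumPy; the paper's actual reward code may differ in details.

```python
import cv2
import numpy as np

def iou_reward(points, gt_mask):
    """points: [[x, y], ...] normalized to [0, 1]; gt_mask: binary (H, W) array."""
    h, w = gt_mask.shape
    # Rasterize the predicted points into a filled mask.
    poly = np.round(np.array(points) * [w, h]).astype(np.int32)
    pred = np.zeros((h, w), dtype=np.uint8)
    cv2.fillPoly(pred, [poly], 1)
    # Reward grows with the overlap between prediction and ground truth.
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return float(inter) / union if union > 0 else 0.0
```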
🍞 Top Bread (Hook): Think of fencing your yard with straight segments instead of a wiggly line. 🥬 The Concept (Polygonal Representation of Masks): What it is: Representing an object by a polygon traced along its boundary. How it works:
- Find the boundary of the mask.
- Sample points along the boundary in order (clockwise).
- Connect them to form a closed polygon.
Why it matters: It’s compact, human-readable, and easy for language models to output as text. 🍞 Bottom Bread (Anchor): A list like [[x1, y1], [x2, y2], …] that cleanly fences the object.
🍞 Top Bread (Hook): If you only have 10 stickers, you must place them wisely. 🥬 The Concept (Token Budget Control): What it is: Managing how many text tokens (points) you spend on an outline. How it works:
- Use a tolerance to decide how many points to keep.
- Keep enough to capture curves, but not too many to waste tokens.
- Let training adjust lengths toward a sweet spot.
Why it matters: Too short misses detail; too long causes errors and slowdown. 🍞 Bottom Bread (Anchor): Using just enough dots to draw the bracelet clearly without making the list huge.
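One standard way to realize this tolerance (assumed here; the article does not name the paper's exact algorithm) is Douglas–Peucker simplification, which OpenCV exposes as approxPolyDP. A minimal sketch:

```python
import cv2
import numpy as np

def sparsify(points: np.ndarray, epsilon: float) -> np.ndarray:
    """Drop near-redundant vertices; larger epsilon -> fewer points (fewer tokens)."""
    contour = points.reshape(-1, 1, 2).astype(np.float32)
    # cv2.approxPolyDP implements Douglas-Peucker polyline simplification.
    return cv2.approxPolyDP(contour, epsilon, True).reshape(-1, 2)

# A 360-point circle keeps shrinking as the tolerance grows.
theta = np.linspace(0, 2 * np.pi, 360, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1) * 100
for eps in (0.5, 2.0, 8.0):
    print(eps, len(sparsify(circle, eps)))  # point count falls as eps rises
```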
These simple pieces set the stage: the world needed pixel-level precision inside a language model, without heavy add-ons. The paper’s gap-filling idea is to turn segmentation into writing ordered coordinates—something language models are born to do—then polish it with RL that rewards accurate shapes. This matters for photo editors, assistive tools, and GUI agents that must interact exactly where things are, down to the pixel.
02 Core Idea
🍞 Top Bread (Hook): Imagine turning a tracing task into writing instructions: “Start here, then go there, then there…” until the outline closes. 🥬 The Concept (SimpleSeg): What it is: SimpleSeg reframes segmentation as predicting a sequence of boundary points entirely in text, using a standard multimodal LLM—no special decoder. How it works:
- Represent each object as a polygon: [[x1, y1], [x2, y2], …].
- Train the model with SFT so it outputs well-formed coordinates and basic localization.
- Use RL with an IoU-based reward so the whole outline better matches the true mask, improving closure and fine details.
Why it matters: It keeps perception inside the language space—simple, unified, human-readable—and still reaches near state-of-the-art accuracy. 🍞 Bottom Bread (Anchor): When asked “Where is the mouth?”, the model replies with about 70 normalized points that precisely trace the lips.
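To see how "polygons as text" stays checkable, here is a small, hypothetical parser for the JSON-like answers shown in this article; the exact grammar and function name are illustrative, not taken from the paper's code.

```python
import json

def parse_polygon(answer: str):
    """Parse an answer like '[[[0.925, 0.400], [0.916, 0.388], [0.916, 0.456]]]'
    into a point list, or return None when the format check fails."""
    try:
        points = json.loads(answer)[0]      # first (outer) polygon
        assert len(points) >= 3             # need at least a triangle
        assert all(len(p) == 2 and 0.0 <= p[0] <= 1.0 and 0.0 <= p[1] <= 1.0
                   for p in points)         # normalized coordinates only
        return points
    except (ValueError, AssertionError, TypeError, IndexError, KeyError):
        return None                         # malformed -> zero reward in RL
```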
The “Aha!” in one sentence: If a language model is great at generating sequences, then let segmentation be the sequence of points that trace the object—then teach it with a reward that cares about the whole shape.
Three analogies to cement the idea:
- Connect-the-dots: Instead of painting every pixel, just list the dots around the edge in the right order and connect them.
- Recipe steps: The model writes a recipe for the outline—each step is the next coordinate—until it serves a closed shape.
- GPS breadcrumbs: Rather than drawing the whole road, drop precise GPS points that a navigator can follow exactly.
Before vs. After:
- Before: Precise segmentation needed extra decoders or clunky text encodings; language-only outputs often lost detail or blew up token counts.
- After: A plain MLLM can write polygons as text and, with RL’s IoU reward, reach fine, pixel-level precision while staying simple and unified.
Why it works (intuition, no equations):
- Language models excel at next-token prediction. A polygon is just the next-point prediction problem.
- IoU-based rewards grade the final shape, not just each token, so the model learns global properties like closure and thin-structure fit.
- Clockwise ordering and a light JSON-like grammar reduce confusion, so decoding stays stable (see the orientation check sketched just after this list).
- Token density (how many points) trades detail for reliability; RL nudges the model toward a balanced length automatically.
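To make the clockwise convention from the list above concrete, here is a standard shoelace-formula check (a generic geometry utility, not the paper's code):

```python
def signed_area(points):
    """Shoelace sum over a closed polygon given as [[x, y], ...]."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0

def is_clockwise(points):
    # In image coordinates (x right, y down), a positive signed area means the
    # vertices run clockwise on screen.
    return signed_area(points) > 0

assert is_clockwise([[0, 0], [1, 0], [1, 1], [0, 1]])  # top-left, going around
```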
Building blocks, each with a quick sandwich:
🍞 Hook: You know how practicing penmanship teaches you to keep letters neat before writing essays? 🥬 Concept (SFT for format and basics): What it is: SFT teaches the model to output valid, clean coordinate lists and basic grounding. How it works: show Q→A pairs; match examples; ensure brackets and ordering; learn point/box/mask tasks. Why it matters: Without SFT, the model’s answers won’t parse, and polygons might not close. 🍞 Anchor: “Give the polygon at [0.5, 0.7]” → The model returns a tidy list like [[[0.49,0.59], … ]].
🍞 Hook: A judge scores the final drawing, not each pencil stroke. 🥬 Concept (RL with IoU reward): What it is: A second-stage trainer that scores whole shapes by overlap. How it works: generate polygon; rasterize to mask; compute IoU (plus small extras like centroid distance and format checks); update policy to raise scores. Why it matters: It fixes global mistakes—gaps, wiggles, and missed thin parts—that token-level loss won’t catch. 🍞 Anchor: The model closes a tiny gap in a bracelet outline because IoU reward goes up when the loop is sealed.
🍞 Hook: One toolbox, many jobs. 🥬 Concept (Unified query interface): What it is: Inputs/outputs share a 4-tuple [text, point, bbox, mask], so tasks mix-and-match. How it works: ask text→bbox, point→mask, bbox→mask, text→mask, etc. Why it matters: More supervision from the same data; one consistent language for many tasks. 🍞 Anchor: “What is the polygon of the object at point [0.66, 0.35]?” → polygon; “Describe the box of the dog” → [x1, y1, x2, y2].
🍞 Hook: Short, clear lists beat giant paragraphs. 🥬 Concept (Polygons as text, not dense masks): What it is: Output boundary points in normalized [0,1] coordinates. How it works: contour tracing → ordered points → optional sparsify with tolerance ε. Why it matters: Human-readable, token-efficient, and easy for tools to consume. 🍞 Anchor: Debugging becomes simple when you can just print the coordinates and plot them.
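And the "just print and plot" debugging loop really is this short, sketched here with matplotlib (the sample points are made up):

```python
import matplotlib.pyplot as plt

# Plot a predicted polygon straight from the model's text output.
points = [[0.49, 0.59], [0.55, 0.58], [0.58, 0.63], [0.52, 0.66]]
xs, ys = zip(*(points + points[:1]))  # repeat the first point to close the loop
plt.plot(xs, ys, marker="o")
plt.gca().invert_yaxis()              # image coordinates: y grows downward
plt.title("Predicted polygon")
plt.show()
```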
Taken together, SimpleSeg shows that “sequence skill” equals “segmentation skill” if you choose the right sequence (points) and the right teacher (IoU reward).
03 Methodology
At a high level: Image + Query → (SFT-taught Text Outputter) → (RL Polisher with IoU Reward) → Polygon-as-Text → Mask (if needed).
Step-by-step recipe:
1. Build training data (automatic annotation pipeline):
- What happens: Use GroundingDINO to find objects tied to phrases, SAM to produce masks, then convert each mask to an ordered polygon using a boundary-following algorithm (e.g., Suzuki–Abe). Normalize coordinates to [0,1]. Optionally attach short captions via a VLM. Store outputs in a clean JSON-like grammar so parsing is robust.
- Why it exists: The model needs many examples that already live in the same text space it will output. Strong labels and consistent formatting anchor learning.
- Example: “Where is the lightning?” → [[[0.925, 0.400], [0.916, 0.388], …, [0.916, 0.456]]].
2. Represent everything in one language (unified query interface):
- What happens: Treat [text, point, bbox, mask] as both inputs and outputs. Ask any Cartesian combo: text→bbox, text→mask, point→mask, bbox→mask, etc.
- Why it exists: It multiplies supervision (derive points/boxes from masks; masks from boxes) and keeps a single, consistent format for instruction tuning and RL.
- Example: “Give the polygon of the object at [0.500, 0.700].” → returns a well-formed polygon list.
3. Keep outputs light and readable (polygonal masks + token budget control):
- What happens: Convert each mask to a clockwise-ordered polygon; then sparsify using tolerance ε to control the number of points (token length). Smaller ε = more points (more detail); larger ε = fewer points (more compact).
- Why it exists: Long outputs are error-prone and slow; too short misses curved details. A tunable ε gives the right balance.
- Example: ε tuned so a bracelet uses ~200 tokens instead of 70 (too rough) or 800+ (too long).
4. Stage I training (SFT):
- What happens: Supervised fine-tuning on instruction-response pairs covering (text↔point), (text↔bbox), and (text/point→mask). The aim is to learn formatting rules (brackets, commas), valid coordinates, consistent clockwise order, and basic grounding.
- Why it exists: Without SFT, the model might output malformed polygons or jumble point order; even great perception won’t help if the answer can’t be parsed.
- Example: “Describe the bounding box of the person at point [0.45, 0.6].” → [0.404, 0.423, 0.590, 0.832].
5. Stage II training (RL with GSPO):
- What happens: Use Group Sequence Policy Optimization (GSPO) to optimize sequence-level objectives. The reward includes: (a) Mask IoU reward (primary), (b) centroid distance (small helper), and (c) format reward (must be parseable). No hard length penalty; the model learns to adjust length.
- Why it exists: Many polygons map to the same mask. Token-by-token matching overfits to one annotation. RL encourages any valid, high-IoU trajectory—tight shapes, closed loops, and fewer odd repeats.
- Example: For high-density ground truth, the model learns to skip redundant vertices, keeping IoU high while shortening the sequence.
6. Grammar and decoding hygiene:
- What happens: Use a minimal JSON-like grammar and enforce clockwise order. Evaluate format correctness; malformed answers get zero reward in RL.
- Why it exists: Reduces decoding entropy and improves training stability; unordered or random points won’t form valid shapes.
- Example: Random-order samples produced chaotic or repeated points; enforcing clockwise yielded clean polygons.
7. From polygons to other tasks:
- What happens: Convert predicted polygons to boxes with simple min/max, or answer point queries. The same model serves segmentation (RES), comprehension (REC), and promptable modes (point→mask, bbox→mask).
- Why it exists: The single interface lets one trained model handle many localization tasks without re-architecture.
- Example: The REC score (Acc@0.5) is computed by turning the predicted polygon into a bounding box and checking IoU ≥ 0.5.
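A minimal sketch of that polygon-to-box conversion and the Acc@0.5 check; the sample values are illustrative, not from the paper.

```python
def polygon_to_bbox(points):
    """Derive [x1, y1, x2, y2] from polygon vertices with simple min/max."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# REC scoring: a prediction counts as a hit when box IoU >= 0.5.
pred_points = [[0.40, 0.42], [0.59, 0.42], [0.59, 0.83], [0.40, 0.83]]
gt_box = [0.404, 0.423, 0.590, 0.832]
hit = box_iou(polygon_to_bbox(pred_points), gt_box) >= 0.5
```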
Secret sauce (why this simple recipe works unusually well):
- Turning segmentation into next-point prediction exploits the LLM’s strongest muscle—sequence generation—while staying inside the text space for interpretability.
- RL aligns training with the final goal (good masks), not just local token matches, so subtle geometry (thin parts, closure) improves.
- Clockwise ordering + light grammar lowers confusion, stabilizing long outputs.
- Adjustable ε plus RL self-regularizes output length: the model spends tokens where they matter and trims where they don’t.
Quick sandwiches for key method pieces:
🍞 Hook: Sorting mail into neat piles. 🥬 Concept (Data Annotation Pipeline): What it is: An automated way to get phrase-grounded boxes, masks, and ordered polygons. How it works: detect (GroundingDINO) → segment (SAM) → trace (Suzuki–Abe) → normalize/format → optional caption. Why it matters: Scales training with consistent labels that match the output format. 🍞 Anchor: “Where is the zebra?” → returns a zebra polygon from web data.
🍞 Hook: Writing just enough steps in a recipe. 🥬 Concept (Token Budget Control): What it is: Choosing the right number of points. How it works: set ε; fewer points for simple shapes, more for curvy ones; RL fine-tunes the balance. Why it matters: Prevents both underfitting and long-horizon mistakes. 🍞 Anchor: A circle needs more points than a rectangle to look smooth.
🍞 Hook: Practicing scales before a piano recital. 🥬 Concept (SFT): What it is: Teach format and basics first. How it works: instruction tuning on text↔point/box/mask tasks. Why it matters: Ensures parseable, ordered outputs. 🍞 Anchor: “Point out the middle giraffe.” → [0.483, 0.396].
🍞 Hook: Getting a gold star when the whole project looks great. 🥬 Concept (RL with IoU): What it is: Reward the final outline’s overlap with ground truth. How it works: polygon → mask → IoU (plus centroid/format) → policy update via GSPO. Why it matters: Teaches global geometry, not just tokens. 🍞 Anchor: The outline tightens around thin tails because IoU rises when edges align.
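Pulling the reward pieces together: a hedged sketch of a format-gated reward (IoU primary, small centroid-distance helper) plus the group-relative advantage normalization typical of GSPO/GRPO-style trainers. The 0.1 weight and the normalization details are assumptions, not the paper's exact recipe; it reuses parse_polygon and iou_reward from the sketches above.

```python
import numpy as np

def total_reward(answer, gt_mask, w_centroid=0.1):
    """Format-gated reward: IoU is primary, plus a small centroid-distance
    helper. The 0.1 weight is illustrative, not the paper's value."""
    points = parse_polygon(answer)        # from the parsing sketch above
    if points is None:
        return 0.0                        # malformed answers earn zero reward
    iou = iou_reward(points, gt_mask)     # from the IoU sketch above
    h, w = gt_mask.shape
    ys, xs = np.nonzero(gt_mask)
    gt_centroid = np.array([xs.mean() / w, ys.mean() / h])
    pred_centroid = np.mean(points, axis=0)
    dist = np.linalg.norm(pred_centroid - gt_centroid)
    return iou + w_centroid * (1.0 - min(1.0, dist))

def group_advantages(rewards):
    """GSPO/GRPO-style trainers compare a group of sampled answers to each
    other: rewards are normalized within the group to form advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)
```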
04 Experiments & Results
The tests and why they matter:
- Referring Expression Segmentation (RES): Given a phrase like “the black mouth,” the model must return the exact object mask. This checks true pixel-level grounding.
- Referring Expression Comprehension (REC): Given the phrase, predict a bounding box (we derive it from our polygon). This tests practical detection ability from the same model.
Competition (who we compared to):
- Decoder-based systems: NEXT-Chat, LISA, PixelLM, AnyRef, GSVA, LaSagnA, Groundhog, and Text4Seg w/ SAM-refiner. These often use special segmentation heads.
- Decoder-free systems: Text4Seg and UFO. These keep outputs in language space but historically struggled with fidelity or token efficiency.
Scoreboard with context:
- On RES (cIoU on RefCOCO, RefCOCO+, RefCOCOg), SimpleSeg matches or surpasses many strong baselines. With pre-training plus SFT+RL, the Qwen2.5-VL-7B and Kimi-VL variants reached average cIoU around the mid-70s (up to about 80.9 on some splits), like getting an A in a class where most students get B's.
- On REC (Acc@0.5), SimpleSeg achieved state-of-the-art average accuracy (around 87%+), even edging out methods that include extra mask refiners. This means our precise polygons also make great boxes.
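For readers who want the metrics pinned down: under the standard RES conventions, cIoU accumulates intersections and unions across the whole dataset, while the gIoU used in the ablations below averages per-image IoUs. A minimal sketch under those standard definitions (the paper's evaluation code may differ in details):

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: total intersection over total union across all samples."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def giou(preds, gts):
    """Mean of per-image IoUs (the per-sample average reported in RES ablations)."""
    return float(np.mean([np.logical_and(p, g).sum() / np.logical_or(p, g).sum()
                          for p, g in zip(preds, gts)]))
```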
Ablations (what changed what):
- Training stages (gIoU on validation):
  - SFT alone delivered a strong baseline (~61–66 gIoU across datasets).
  - Adding RL boosted gIoU by roughly +10 points, indicating whole-sequence rewards crucially improve closure and fine detail.
  - Pre-training with large web data further lifted SFT and SFT+RL by several points, showing that more diverse data strengthens perceptual priors.
- Point density ε (token count vs. performance):
  - Too few points (very short sequences) underfit curved shapes; performance drops (e.g., 35.6 cIoU at ~78 tokens).
  - Too many points (very long sequences) accumulate long-horizon decoding errors; performance also drops (e.g., 72.5 cIoU at ~859 tokens).
  - A middle ground (~200 tokens) gave the best results. RL then fine-tuned lengths adaptively.
- Reward design:
  - IoU reward is the backbone. Adding a small centroid-distance term gave a slight bump.
  - Hard length penalties hurt performance; the model naturally found reasonable lengths during RL without being forced.
- Training dynamics during RL:
  - Even without length rewards, response length shifted toward a better accuracy-efficiency balance.
  - With dense initial polygons, the model trimmed redundant vertices; with sparse ones, it slightly increased points to refine shape.
- Ordering of points:
  - Clockwise ordering is key. Random or row-major point orders confused the model, causing repeated or chaotic points and invalid polygons.
Generalization and surprises:
- Natural and non-natural images: The model precisely segmented targets in photos (e.g., lightning) and also in synthetic content (anime, charts, GUIs). That shows the language-aligned output travels well across domains.
- Extended tasks: The same model did point→mask and bbox→mask (SAM-like prompting), text→point, and text→bbox, all via the unified interface.
- Unexpectedly helpful behavior: During RL, the model self-adjusted output length toward the “sweet spot” without explicit penalties—trading a few points for shorter sequences when shapes were over-detailed, and adding a few when too sparse.
Big picture: SimpleSeg doesn’t just keep up with specialized decoders; on REC it takes the lead, and on RES it’s often comparable or better—while staying simpler and fully in the language space. That combination of performance, simplicity, and interpretability is rare.
05 Discussion & Limitations
Limitations:
- Long sequences for highly curved or very high-resolution objects can still strain decoding; mistakes may accumulate over many tokens.
- Sharp corners and thin structures are sensitive to over-sparsification; if ε is too large, detail is lost, lowering IoU.
- Objects with holes (like a donut’s hole) and complex topologies are challenging when represented as a single simple polygon.
- Outputs are text; while human-readable, rasterizing many polygons at extreme resolutions can be slower than native pixel decoders.
Required resources:
- Training used 32 GPUs, large web-scale pre-training data (optional but helpful), and curated SFT/RL datasets (RefCOCO series). RL adds compute overhead relative to SFT-only training.
When not to use:
- If you need ultra-fast, high-resolution panoptic segmentation over entire images in real-time, a specialized decoder might still be more efficient.
- In medical or scientific imaging where micron-level accuracy and topology (holes, multiple components) must be guaranteed, polygon-only text outputs may be limiting without extensions.
- If bandwidth limits forbid longer text outputs, extremely curved shapes may exceed a tight token budget.
Open questions:
- Multi-component and holed shapes: How to elegantly extend the text grammar to multiple polygons and interior rings while keeping decoding stable?
- Better rewards: Can topology-aware or boundary-precision rewards (e.g., Chamfer/F-score) boost thin-structure fidelity without hurting stability?
- Active token control: Can the model predict an explicit target length given a complexity estimate of the object to minimize errors?
- Curriculum learning: Would gradually increasing shape complexity or ε scheduling further stabilize training?
- Cross-task synergy: How far can this unified interface scale—panoptic segmentation, depth edges, or even 3D contours—still within text?
06 Conclusion & Future Work
Three-sentence summary: SimpleSeg turns segmentation into writing down a list of boundary points, so a standard multimodal language model can outline objects precisely without any special decoder. It learns the basics with supervised fine-tuning and then sharpens whole shapes using reinforcement learning with an IoU-based reward, unlocking strong pixel-level perception that was already latent in the architecture. On standard benchmarks, it matches or beats many complex systems while staying simple, unified, and human-readable.
Main achievement: Proving that a plain MLLM can reach high-fidelity, pixel-level segmentation by predicting point sequences in text, then polishing them with sequence-level rewards—no architectural add-ons required.
Future directions: Extend the text grammar for multiple polygons and holes, explore richer geometry-aware rewards, develop adaptive token budgeting, and push to broader tasks (panoptic sets, tool use, GUI agents) where precise coordinates enable reliable action. Investigate curriculum strategies and hybrid representations to handle extreme curvature at low token cost.
Why remember this: It shows that the core skill of language models—generating good sequences—can directly become the core skill of fine-grained vision—drawing good boundaries. With one simple change of viewpoint (points-as-text) plus the right reward, perception and reasoning live together in a single, elegant interface.
Practical Applications
- Interactive photo editing: Select and modify only the precise object (e.g., change the color of a shirt without touching the jacket).
- GUI automation: Robustly click or drag exactly on small buttons or sliders using text-to-mask to guide actions.
- AR overlays: Place labels and guidance exactly on object boundaries for heads-up instructions.
- Robotic grasping: Outline target objects precisely before planning a safe, accurate grasp.
- Medical pre-annotation: Provide quick, editable polygons for clinicians to refine, speeding up labeling workflows.
- Content moderation and blurring: Mask only sensitive regions (faces, license plates) without over-blurring.
- Document and chart analysis: Isolate plots, axes, or legends precisely for data extraction.
- Education tools: Let students query parts of diagrams and get exact outlines with textual explanations.
- Creative design: Vector-like polygon outputs plug into design tools for clean editing and compositing.
- Video frame assistance: Use per-frame polygons to guide tracking or post-processing in video pipelines.