Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation
Key Summary
- Medical SAM3 is a text-prompted medical image segmentation model that was fully fine-tuned on 33 diverse datasets to work across many imaging types like ultrasound, X-ray, endoscopy, and pathology.
- The key idea is to stop relying on boxes or clicks and instead teach the model to turn medical words into precise masks directly.
- Compared to the original SAM3 (no medical training), Medical SAM3 shows big jumps in accuracy, especially on tough cases with tiny vessels or low contrast.
- Internal tests improved average Dice from 54.0% to 77.0% and IoU from 43.3% to 67.3% using only text prompts.
- External tests on totally unseen datasets improved average Dice from 11.9% to 73.9%, showing strong generalization.
- A simple, unified recipe (image + mask + text for training, text-only at test time) lets the model handle many organs and modalities with one interface.
- Smart training choices (high resolution, layer-wise learning rate decay, and set-prediction losses) help the model align language with medical visuals.
- This approach reduces the need for privileged spatial prompts like ground-truth boxes, which are unrealistic in real clinics.
- Medical SAM3 supports both 2D and slice-based 3D workflows via a detector-tracker with memory to connect neighboring slices.
- The work highlights that robust medical promptability is mainly a representation alignment problem, not just prompt engineering.
Why This Research Matters
This work means clinicians can ask for what they need in simple words and get precise outlines across many kinds of medical images, all from one model. It reduces the need for hand-drawn boxes or specialized models per department, saving time and resources. Hospitals with fewer experts or smaller datasets can still benefit from strong segmentation using a shared, universal tool. Clinical trials and registries can get more consistent measurements across sites, improving fairness and reliability. Faster, cleaner segmentations support safer procedures and better follow-up, especially for small or hard-to-see findings. By focusing on language-vision alignment, the model is easier to interact with and more adaptable as clinical needs evolve.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're looking at a treasure map, and you need to color in the exact shape of a hidden island so ships can avoid it. Doctors do something similar with medical images: they color the exact shape of organs or problems so care teams can make safe plans.
The Concept: Medical image segmentation means outlining the exact pixels (or voxels) of things like hearts, lungs, tumors, or vessels in a scan. How it works:
- A computer looks at the image.
- It decides which pixels belong to the target (like a polyp) and which don't.
- It makes a mask: a coloring-book page where the target is filled in. Why it matters: Without segmentation, doctors have fuzzier measurements and less precise plans, which can affect diagnoses, treatments, and follow-up. Anchor: On a colonoscopy video, segmentation can mark the exact area of a polyp so a doctor can remove it completely.
Hook: You know how you can ask a smart speaker, "Play jazz," and it understands what to do? Wouldn't it be nice if medical AI could also understand plain words?
The Concept: Prompt-based interaction lets us tell a model what to find using instructions, such as words ("polyp"), points, or boxes. How it works:
- You give a prompt describing the target.
- The model uses that hint to search the image.
- It outputs a precise mask. Why it matters: Without prompts, one model can't easily switch tasks; with prompts, a single model can help across many targets. Anchor: Type "thyroid gland" when viewing an ultrasound image, and the model outlines the thyroid.
Hook: Think of a giant Swiss Army knife: one tool with many skills. Wouldn't one AI that works across many medical tasks be handy?
The Concept: Foundation models are big, pre-trained models that can be adapted to many tasks with small changes or prompts. How it works:
- Train on huge, diverse data to learn general patterns.
- Reuse those patterns for new jobs using prompts or fine-tuning.
- Quickly switch tasks without starting from scratch. Why it matters: Without a foundation, each new task needs its own model and lots of labels. Anchor: One model can segment lungs on X-rays one day and blood vessels in eye photos the next, just by changing the prompt.
Hook: Imagine you learned biking on flat roads. Riding on a rocky trail (a different terrain) is harder, right?
The Concept: Domain shift means the data the model sees now (medical scans) is very different from what it learned on (natural photos), so it struggles. How it works:
- The model learns from natural images.
- It meets medical images with different textures and meanings.
- Its old "rules" don't work well; performance drops. Why it matters: Without fixing domain shift, zero-shot models miss targets or make messy masks. Anchor: A model trained on cats and cars fails to find tiny retinal vessels in eye scans.
Hook: It's easier to solve a maze if someone draws a box around the prize. But what if no one can draw the box for you?
The Concept: Privileged spatial prompts (like ground-truth boxes) give away the location, making the task much easier. How it works:
- A box tells the model "look here."
- The model refines boundaries inside the box.
- It looks strong, but only because it was told where to look. Why it matters: In real clinics, boxes aren't always available; relying on them can be unrealistic. Anchor: A model that needs a perfect box for each polyp won't help much in busy endoscopy rooms.
Before this work, medical segmentation often used specialist models trained for one dataset at a time. These worked well when labels were plentiful and test data looked like the training data. Promptable models like SAM and SAM3 offered a unified interface, but in medicine they often needed strong geometric hints (like boxes) or failed under text-only prompting due to domain shift.
People tried parameter-efficient adapters, prompt engineering, and clever 3D architectures. These helped here and there but didn't solve the core problem: reliably turning medical language ("calcified node," "intraretinal fluid") into clean, precise masks without location hints.
The gap: a single promptable model that understands medical imagery well enough to use only words and still find the right pixels, across many organs, scanners, and hospitals.
Real stakes include faster triage, safer surgeries, more consistent measurements in clinical trials, and help for smaller clinics that can't afford bespoke models. This paper fills the gap by fully adapting a concept-promptable foundation model (SAM3) to medicine, training it on a large, diverse, text-and-mask-aligned corpus so that words alone can drive accurate segmentation.
02 Core Idea
Hook: Imagine training a sniffer dog to find different scents just from their names ("lavender," "cinnamon," "smoke"): no one points to where the smell is; the dog just knows how to search.
The Concept: The "aha!" is to fully fine-tune a concept-promptable model (SAM3) on medical images so that text alone ("polyp," "thyroid," "retinal vessel") reliably turns into the right mask, with no boxes and no clicks. How it works:
- Gather many medical images with masks and clear text labels.
- Train the entire model (not just a small adapter) at high resolution so it learns medical textures.
- Align text meanings with spatial features so words become smart search queries. Why it matters: Without this, text-only prompts fail; the model either misses the target or over-segments. Anchor: Type "placental vessel" on a fetoscopy frame, and the model highlights the correct vessels, even though no one drew a box.
Three analogies:
- Librarian: You say "mystery novels," and the librarian walks you to the exact shelf; here, text guides the model to the exact pixels.
- Treasure map: The word "island" becomes a compass; the model navigates currents (textures) to outline the shore (mask).
- Search dog: Say the scent; the dog searches everywhere, no pointing needed, and finds the hidden source.
Before vs. After:
- Before: Good results only with boxes or clicks; text-only often crumbled under domain shift.
- After: Strong, text-only segmentation across many modalities, with big gains especially on small, thin, or low-contrast targets.
Why it works (intuition):
- Full fine-tuning re-teaches the model's deep layers to recognize medical patterns (stains, vessels, soft tissues) while keeping lower layers stable.
- High-resolution inputs preserve tiny details (capillaries, lesion edges) that matter clinically.
- Training only with text prompts forces tight language-visual alignment: words become spatial queries.
- A set-prediction objective teaches the model to discover and outline exactly what was asked: no duplicates, no misses.
Building blocks (each with a tiny sandwich):
Hook: You know how a recipe card lists ingredients, the dish name, and a photo? The Concept: A unified input triplet (image, mask, text) pairs what we see, the correct outline, and what we call it. How it works: Bundle each sample as image + gold mask + prompt text; do this across 33 datasets. Why it matters: Without consistent triplets, the model can't learn how words map to shapes. Anchor: Image of retina + vessel mask + text "retinal blood vessel."
Hook: When you learn piano, your basics stay steady while advanced songs change more. The Concept: Fine-tuning updates the whole model to the medical domain. How it works: Train all layers so the model's brain truly learns medicine. Why it matters: Without full tuning, text-only performance remains shaky. Anchor: The same model now segments both thyroid and placenta from words alone.
Hook: You don't turn every dial the same; some need gentle nudges. The Concept: Layer-wise learning rate decay lowers how fast early layers change and lets later layers adapt more. How it works: Shallow layers keep universal edges and textures; deeper layers learn medical semantics. Why it matters: Without it, the model forgets general vision or fails to specialize. Anchor: Edges still pop, but now they mean "vessel wall" or "lesion border."
Hook: Reading the word "polyp" should make you look for roundish, mucosal bumps. The Concept: Text-driven segmentation treats words as spatial search queries. How it works: The text encoder turns words into a direction for the mask head. Why it matters: Without text alignment, words don't reliably point to the right pixels. Anchor: Prompt "intraretinal fluid," and get fluid pockets on OCT outlined.
Hook: Sorting toys means finding each toy once, not three times or zero. The Concept: A set-prediction objective finds the correct set of targets and masks, one-to-one with ground truth. How it works: Match predictions to truth, train to cover each target cleanly. Why it matters: Without it, the model might duplicate or miss instances. Anchor: Exactly one mask per polyp, no extras, no misses.
Together, these pieces transform SAM3 into Medical SAM3: a single, promptable model that uses words to precisely color the right pixels across many medical domains.
03 Methodology
High-level overview: Input (medical image + text prompt) → encoders (vision + text) → detector proposes mask(s) that match the concept → optional tracker propagates masks across neighboring slices/frames with a memory bank → merge detections and propagated masks → output final segmentation.
Step 1: Unify medical data into 2D at high resolution. Hook: Think of flipping through a photo album page by page, even if each photo came from a different camera. The Concept: A unified 2D formulation treats all images (X-ray, ultrasound, endoscopy, pathology, or slices from 3D scans) as high-resolution 2D inputs. How it works:
- Resize/prepare images to 1008×1008 to preserve fine details.
- If a study is 3D, process it slice-by-slice (sequentially).
- Keep one framing so the same model can see every modality. Why it matters: Without this, juggling many geometries complicates training and can blur tiny structures. Anchor: A fetoscopy frame and a pathology crop both enter as crisp 1008×1008 images.
What breaks without it: Small vessels or thin membranes vanish at low resolution; the model loses clinically crucial details.
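A minimal sketch of this unified 2D formulation, assuming PyTorch-style tensors; the helper names and the volume-shape heuristic are hypothetical, not from the paper's released code:
```python
import numpy as np
import torch
import torch.nn.functional as F

TARGET_SIZE = 1008  # high resolution preserves thin vessels and membranes

def to_model_input(image_2d: np.ndarray) -> torch.Tensor:
    """Resize one 2D image (H, W) or (H, W, C) to 1008x1008 for the model."""
    t = torch.as_tensor(image_2d, dtype=torch.float32)
    if t.ndim == 2:                       # grayscale: add a channel dimension
        t = t.unsqueeze(-1)
    t = t.permute(2, 0, 1).unsqueeze(0)   # (1, C, H, W)
    t = F.interpolate(t, size=(TARGET_SIZE, TARGET_SIZE),
                      mode="bilinear", align_corners=False)
    return t

def iter_2d_views(study: np.ndarray):
    """Yield 2D views: the image itself, or each slice of a 3D volume."""
    if study.ndim == 2 or (study.ndim == 3 and study.shape[-1] in (1, 3)):
        yield to_model_input(study)       # already a 2D image
    else:                                 # (D, H, W) volume: slice-by-slice
        for z in range(study.shape[0]):
            yield to_model_input(study[z])
```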
Step 2: Train from triplets (image, mask, text). Hook: A flashcard shows a picture, its label, and the correct answer; that's how we learn quickly. The Concept: Each training sample is (I, M, t): image, gold mask, and concept text. How it works:
- Use native dataset labels (e.g., "gland," "thyroid gland," "polyp").
- Curate a consistent vocabulary across datasets.
- Train so that t guides the model to paint M on I. Why it matters: Without clean pairs, the model can't learn text-to-pixel alignment. Anchor: Retina photo + vessel mask + "retinal blood vessel" teaches the word-to-shape link.
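A small sketch of how such (image, mask, text) triplets could be assembled, with an illustrative label-to-text dictionary; the record fields and mapping are assumptions for illustration only:
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TripletSample:
    image_path: str   # path to the 2D image or slice
    mask_path: str    # path to the gold segmentation mask
    text: str         # concept prompt, e.g. "retinal blood vessel"

# A curated vocabulary keeps concept names consistent across datasets
# that label the same structure differently (illustrative entries only).
CANONICAL_TEXT: Dict[str, str] = {
    "vessel": "retinal blood vessel",
    "thyroid": "thyroid gland",
    "polyp": "polyp",
}

def build_triplets(records: List[dict]) -> List[TripletSample]:
    """Turn raw dataset records into (image, mask, text) training triplets."""
    triplets = []
    for r in records:
        text = CANONICAL_TEXT.get(r["label"], r["label"])
        triplets.append(TripletSample(r["image"], r["mask"], text))
    return triplets
```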
Step 3: Full-model fine-tuning (with stratified rates). Hook: In a band, the drummer keeps steady beats (don't change too much), while the lead guitarist explores new riffs (change more). The Concept: Full fine-tuning updates all parameters, with layer-wise learning rate decay to protect early features and specialize deep ones. How it works:
- Smaller learning rates for shallow layers; larger for deeper ones.
- Train the vision and text backbones plus the decoder and heads.
- Keep high-res inputs to capture edges and micro-textures. Why it matters: Without full, stratified tuning, text-only prompting stays unreliable under domain shift. Anchor: After tuning, "polyp" reliably finds polyps on multiple endoscopy datasets.
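A minimal sketch of layer-wise learning rate decay, assuming the encoder exposes an ordered list of blocks (the attribute name `model.blocks`, the base rate, and the decay factor are illustrative, not the paper's exact values):
```python
import torch

def layerwise_param_groups(model, base_lr=1e-4, decay=0.8):
    """Give earlier encoder blocks smaller learning rates than later ones."""
    groups = []
    num_blocks = len(model.blocks)
    for i, block in enumerate(model.blocks):
        # block 0 gets the strongest decay, the last block almost none
        scale = decay ** (num_blocks - 1 - i)
        groups.append({"params": block.parameters(), "lr": base_lr * scale})
    # decoder and heads (everything outside the blocks) train at the full rate
    head_params = [p for n, p in model.named_parameters()
                   if not n.startswith("blocks.")]
    groups.append({"params": head_params, "lr": base_lr})
    return groups

# Usage: optimizer = torch.optim.AdamW(layerwise_param_groups(model), weight_decay=0.05)
```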
Step 4: Text-driven semantic alignment (no boxes, no clicks). Hook: Saying "find the exits" in a stadium shouldn't require you to point at them. The Concept: The model learns to treat a word embedding as a spatial search query. How it works:
- The text encoder turns the prompt into a vector.
- The mask head uses that vector to scan visual features for matching patterns.
- Training never uses privileged boxes, so the model learns true localization. Why it matters: Without this, performance collapses when boxes aren't available in clinics. Anchor: Type "thyroid gland," and the ultrasound gland is outlined with no clicks.
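A sketch of the core idea of a text embedding acting as a spatial query: score every location in the visual feature map against the prompt vector. The shapes, normalization, and temperature are illustrative assumptions, not the model's actual mask head:
```python
import torch
import torch.nn.functional as F

def text_query_mask_logits(visual_feats: torch.Tensor,
                           text_embed: torch.Tensor) -> torch.Tensor:
    """Compare every spatial location against the prompt embedding.

    visual_feats: (B, C, H, W) feature map from the vision encoder
    text_embed:   (B, C) embedding of the prompt, e.g. "thyroid gland"
    returns:      (B, H, W) mask logits, with no box or click required
    """
    v = F.normalize(visual_feats, dim=1)          # unit-norm visual features
    t = F.normalize(text_embed, dim=1)            # unit-norm text query
    logits = torch.einsum("bchw,bc->bhw", v, t)   # cosine-similarity map
    return logits / 0.07                          # temperature-style scaling
```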
Step 5: Detector-tracker with memory (for sequential slices/frames). Hook: When reading a comic, each panel helps you understand the next. The Concept: A detector segments the current frame; a tracker propagates masks from the previous frame using a memory bank. How it works:
- At time t, the detector proposes masks from text and image.
- The tracker brings forward likely masks from t-1.
- Merge them to get the final mask; update memory. Why it matters: Without memory, small or faint structures can flicker across slices. Anchor: In a CT slice stack, vessels remain continuous across neighboring slices.
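A simplified sketch of the detect-propagate-merge loop. The `detector` and `tracker` callables, the memory size, and the union-style merge rule are all assumptions made to illustrate the flow, not the model's actual interfaces:
```python
import numpy as np
from collections import deque

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def merge_masks(detected, propagated, iou_thresh=0.5):
    """Keep detector masks; add propagated masks the detector failed to re-find."""
    merged = list(detected)
    for p in propagated:
        if all(iou(p, d) < iou_thresh for d in detected):
            merged.append(p)
    return merged

def segment_sequence(slices, prompt, detector, tracker):
    """Detect on each slice, propagate from the previous one, then merge."""
    memory = deque(maxlen=3)                  # small memory bank of recent masks
    results = []
    for image in slices:
        detected = detector(image, prompt)    # masks from text + current slice
        propagated = tracker(image, memory)   # masks carried forward from t-1
        merged = merge_masks(detected, propagated)
        memory.append(merged)                 # update memory for the next slice
        results.append(merged)
    return results
```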
Step 6: Set-prediction objective with matching. Hook: If you hand out name tags, each person should get exactly one. The Concept: Set prediction makes the model produce a set of masks that match targets one-to-one. How it works:
- A matcher pairs each prediction with its best ground-truth instance.
- Extra predictions get matched to "no object."
- Losses train classification, presence, and mask quality together. Why it matters: Without one-to-one matching, the model can duplicate masks or skip targets. Anchor: Each polyp appears once: no doubles, no misses.
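A minimal matching sketch using the Hungarian solver, with a Dice-only cost for brevity; the full objective also mixes classification and presence terms, so treat this as an illustration rather than the exact matcher:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-6)

def match_predictions(pred_masks, gt_masks):
    """One-to-one assignment: each ground-truth instance gets exactly one prediction.

    Predictions left unmatched are later trained toward "no object".
    """
    cost = np.zeros((len(pred_masks), len(gt_masks)))
    for i, p in enumerate(pred_masks):
        for j, g in enumerate(gt_masks):
            cost[i, j] = 1.0 - dice(p, g)   # low cost = good overlap
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```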
Step 7: Robust mask supervision. Hook: Coloring neatly means staying inside the lines and fully covering the shape. The Concept: Combine pixel-wise and overlap-focused losses (like focal and Dice) so masks are complete and boundaries are sharp. How it works:
- A pixel-level loss balances foreground/background.
- A Dice-style overlap loss encourages tight coverage.
- A presence loss reinforces "is there an instance?" Why it matters: Without strong mask losses, boundaries get fuzzy and small targets disappear. Anchor: Retinal vessels appear as thin, continuous strands instead of blobby streaks.
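A hedged sketch of how such a combined objective can be written: a focal pixel loss, a Dice overlap loss, and a presence term. The loss weights are illustrative placeholders, not the paper's values:
```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Pixel-wise loss that down-weights easy background pixels."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    return (a_t * (1 - p_t) ** gamma * bce).mean()

def dice_loss(logits, target, eps=1e-6):
    """Overlap loss that rewards covering the whole target tightly."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return 1 - ((2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_objective(mask_logits, mask_target, presence_logit, presence_target,
                   w_focal=1.0, w_dice=1.0, w_presence=1.0):  # illustrative weights
    presence = F.binary_cross_entropy_with_logits(presence_logit, presence_target)
    return (w_focal * focal_loss(mask_logits, mask_target)
            + w_dice * dice_loss(mask_logits, mask_target)
            + w_presence * presence)
```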
Concrete example (endoscopy):
- Input: Colonoscopy frame; text = "polyp."
- The vision encoder extracts features; the text encoder produces a "polyp" vector.
- The detector proposes masks matching polyp-like textures.
- If it's a video/slice sequence, the tracker carries the previous mask forward.
- The best-confidence mask becomes the output; in multi-class, query each class and pick pixels with top confidence.
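A small sketch of the multi-class case described above: query the model once per concept name and let each pixel take the class with the highest confidence. The `model(image, text)` call returning a probability map is an assumed interface for illustration:
```python
import numpy as np

def multi_class_segment(model, image, class_names, threshold=0.5):
    """Query each concept by text and pick the top-scoring class per pixel.

    Assumes model(image, text) returns a probability map in [0, 1]
    with the same spatial size as the image (hypothetical interface).
    """
    prob_maps = np.stack([model(image, name) for name in class_names])  # (K, H, W)
    best_class = prob_maps.argmax(axis=0)            # index of the winning concept
    best_prob = prob_maps.max(axis=0)
    labels = np.where(best_prob > threshold, best_class + 1, 0)  # 0 = background
    return labels
```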
Training setup (practical bits):
- Hardware: 4× H100 80GB GPUs; up to 10 epochs; AdamW optimizer; warmup + inverse-square-root decay (see the schedule sketch after this list).
- Data: 33 datasets; ~65k training and 11.5k validation images; text-only prompts in both training and testing.
- Selection: Choose checkpoint by validation performance.
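A minimal sketch of a warmup plus inverse-square-root learning rate schedule of the kind named above, paired with AdamW; the warmup length and base rate are illustrative, not the paper's exact settings:
```python
import torch

def inv_sqrt_schedule(optimizer, warmup_steps=1000):
    """Linear warmup, then learning rate proportional to 1/sqrt(step)."""
    def lr_lambda(step):
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps              # linear ramp-up
        return (warmup_steps / step) ** 0.5          # inverse-square-root decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage sketch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# scheduler = inv_sqrt_schedule(optimizer, warmup_steps=1000)
# each training step: optimizer.step(); scheduler.step()
```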
The secret sauce:
- Full-model adaptation at high resolution teaches true medical textures.
- Text-only training forces deep language-vision alignment.
- Layer-wise learning rate decay balances stability and specialization.
- Set prediction plus memory ensures clean, consistent masks across frames/slices.
04 Experiments & Results
Hook: If two basketball teams play on the same court with the same rules, the score tells you who's really better. We did the same with models: same text-only prompts, same datasets, fair comparison.
The Concept: The team measured segmentation quality with Dice and IoU on internal (held-out) tasks and external (never-seen) datasets, comparing Medical SAM3 to the original SAM3 checkpoint. How it works:
- Internal tests: 10 tasks drawn from the fine-tuning corpus but on held-out splits.
- External tests: 7 datasets excluded from training to test generalization.
- Only text prompts were used, with no spatial hints, to reflect real deployment. Why it matters: Without text-only evaluation, results might hide behind unrealistic boxes or clicks. Anchor: Prompt "polyp" on colonoscopy frames and see which model paints the polyp more accurately.
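For reference, both metrics are simple overlap measures on binary masks; a minimal sketch of how they are typically computed (not code from the paper):
```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps=1e-6) -> float:
    """Dice = 2 * |P intersect G| / (|P| + |G|); 1.0 is a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou_score(pred: np.ndarray, gt: np.ndarray, eps=1e-6) -> float:
    """IoU = |P intersect G| / |P union G|; stricter than Dice on partial overlap."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)
```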
The competition: Original SAM3 vs. Medical SAM3 (ours). Same backbone family, but ours is fully fine-tuned on medical triplets.
Scoreboard with context:
- Internal average: Dice rose from 54.0% to 77.0%, and IoU from 43.3% to 67.3%. That's like moving from a C to a strong A- overall.
- External average: Dice leapt from 11.9% to 73.9%, and IoU from 8.0% to 64.4%, turning a near-fail (F) on totally new tests into a solid A.
Examples (internal):
- Retinal vessels (DRIVE): Dice 24.8% → 55.8% (a big jump on thin, low-contrast structures).
- Fetal head (PS-FH-AOP'23): 65.7% → 91.6% (cleaner boundaries on ultrasound).
- Placental vessels (FetoPlac): 56.6% → 77.0% (better small-structure adherence).
- Pathology glands (GlaS'15): 68.9% → 88.2% (adapts to staining/textures).
- PAPILA (optic disc): already strong, now 99.4% Dice, near-perfect.
Examples (external, never trained on):
- Endoscopic polyps: Some baselines were effectively 0% Dice; Medical SAM3 hit 87.9% (CVC) and 86.1% (ETIS). That's from total miss to excellent.
- Ultrasound (HC18): 23.9% → 92.6% Dice, a massive recovery.
- Dermatology (PH2): 18.4% → 92.7% Dice, robust lesion outlining from text only.
- Retinal datasets (CHASE, STARE): strong gains (e.g., CHASE 17.9% → 62.6%).
Surprising findings:
- Text-only prompting works extremely well after holistic adaptation, even for datasets the model never saw, which suggests the learned language-vision alignment is robust.
- The biggest wins are on tiny or low-contrast targets (retinal vessels, polyps), where naive text-only prompting used to fail.
- High-resolution training pays off: thin structures and tricky borders improve markedly.
Why these numbers matter:
- Dice in the 80-90%+ range indicates clinically useful precision on many tasks.
- External generalization means hospitals can see strong results without site-specific retraining.
- Consistency across modalities (ultrasound, fundus, pathology, endoscopy, X-ray) shows the model truly acts as a universal, prompt-driven segmenter.
Anchor: With just the word "polyp," Medical SAM3 highlights the right blob in a new hospital's videos: no boxes, no extra training.
05 Discussion & Limitations
Limitations:
- Compute intensity: Full-model, high-resolution tuning is heavy. Smaller sites might prefer lighter adapters, but those may reduce robustness.
- 3D continuity: Treating volumes slice-by-slice is universal, but it underuses true 3D context; complex shapes might lose inter-slice smoothness.
- Prompt simplicity: This study uses atomic concepts (single words/short terms). Real clinics use synonyms, attributes, and multi-part descriptions.
- Validation breadth: Strong internal/external results are promising, but wider multi-center studies and uncertainty estimates are needed for deployment confidence.
Required resources:
- GPUs with large memory (e.g., H100-class) for training; fewer resources suffice for inference but benefit from good VRAM.
- A curated label-to-text dictionary to stabilize prompts across datasets.
- Access to diverse training data (the paper used 33 datasets) or a strong pretrained checkpoint (their release helps here).
When not to use:
- Tasks needing strict volumetric consistency (e.g., fine neuroanatomy across 3D MRI) may need native 3D prompting or explicit 3D constraints.
- Scenarios requiring interactive control (scribbles, clicks) as the main workflow; Medical SAM3 focuses on text-only prompting first.
- Rare modalities or exotic imaging physics not represented in training may need brief adaptation.
Open questions:
- Can parameter-efficient fine-tuning or distillation keep most gains while lowering cost?
- How to make prompts robust to synonyms, attributes ("irregular," "calcified"), and compositions ("multiple small polyps")?
- What's the best way to bring in native 3D reasoning without losing universality and speed?
- How to quantify and communicate uncertainty so clinicians know when to trust or double-check a mask?
- How to keep performance stable over time as scanners, protocols, and populations change (domain drift)?
06 Conclusion & Future Work
Three-sentence summary: Medical SAM3 turns plain medical words into precise segmentation masks by fully fine-tuning a concept-promptable foundation model on a large, diverse, text-and-mask-aligned medical corpus. This holistic adaptation fixes domain shift, removes the crutch of privileged spatial prompts, and delivers large, consistent gains across many modalities, including on unseen datasets. The result is a single, universal, text-driven segmenter suitable for broad clinical and research scenarios.
Main achievement: Proving that robust, text-only medical segmentation is possible and practical when the entire model is adapted, aligning language to medical visuals at high resolution with set-prediction training.
Future directions:
- Parameter-efficient and distilled variants to cut compute while preserving robustness.
- Native 3D prompting and inter-slice constraints for volumetric continuity.
- Richer, synonym-tolerant, and compositional prompts to match clinical language.
- Calibration and uncertainty tools for trustworthy deployment.
Why remember this: It shifts medical segmentation from "draw me a box" to "tell me what you want," unifying many tasks under one simple interface, and shows that the secret to promptable medicine isn't fancier prompts, but deeper alignment between words and pixels.
Practical Applications
- Text-guided polyp segmentation during colonoscopy for quicker detection and removal planning.
- Automated ultrasound measurements (e.g., fetal head or thyroid) by prompting the target structure.
- Retinal vessel and lesion outlining from fundus images to assist screening for eye diseases.
- Pathology region-of-interest masking (e.g., glands, tumor areas) to support diagnostics and research.
- Chest X-ray organ or abnormality segmentation for triage and report support.
- Quality control in clinical trials by standardizing segmentation across sites with text-only prompts.
- Rapid dataset labeling: use text prompts to pre-segment and then let experts correct quickly.
- Education and simulation: students type a structure name and see accurate masks for learning anatomy.
- Telemedicine support: remote clinicians prompt key structures for consistent measurements.
- Archive mining: search large image archives by concept and retrieve regions (e.g., "intraretinal fluid").