
Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Intermediate
Aryan Das, Tanishq Rachamalla, Koushik Biswas et al. (2/16/2026)
arXiv

Key Summary

  • This paper builds a medical image segmentation system that uses both pictures (like X-rays) and words (short clinical text) at the same time.
  • It introduces two key building blocks: MoDAB to mix image and text smartly, and SSMix to remember long-range details efficiently.
  • A new SEU Loss teaches the model to be accurate (spatial overlap), keep structure (spectral consistency), and be honest about doubt (low entropy where sure, higher when unsure).
  • Across three datasets (QaTa-COV19 X-rays, MosMed++ CT, Kvasir-SEG endoscopy), the method beats previous top models by clear margins.
  • The model is efficient: about 39.9M parameters and 17.87 GFLOPs, smaller and faster than many transformer-heavy baselines.
  • Removing text or the special modules makes accuracy drop a lot, showing both are necessary.
  • Uncertainty guidance helps the model avoid overconfident mistakes on blurry or noisy images.
  • The approach keeps the text model frozen (BioViL CXR-BERT) and uses a ConvNeXt-Tiny image encoder for strong but lightweight features.
  • Results suggest vision-language alignment plus uncertainty-aware training is a safe and practical path for clinical tools.

Why This Research Matters

This approach can help doctors get faster and more reliable outlines of diseases, even when images are blurry or labels are scarce. By using text like short clinical prompts, it reflects how clinicians actually think—combining what they see with what they read. The uncertainty maps show where the AI is unsure, guiding safer decisions and better teamwork between humans and machines. Its efficiency makes it more practical for hospitals that can’t run huge, slow models. Better masks mean clearer treatment plans, fewer unnecessary follow-ups, and more time for patient care. Over time, this could reduce costs, improve outcomes, and increase trust in AI-assisted medicine.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how doctors look at an X-ray or a CT scan while also reading short notes to decide what’s going on? They combine what they see with what they read to make a better decision.

🥬 Filling (The Actual Concept):

  • What it is: Medical image segmentation is when a computer colors in exactly the parts of a medical picture that matter (like a lung spot or a polyp), so doctors can see them clearly.
  • How it works (before this paper): Most systems looked only at the picture. Some newer ones also tried to use text, like a short description, to guide what to color. But they often ignored how unsure the computer might be, and struggled to align words and pictures well.
  • Why it matters: In medicine, being clear and reliable is everything. If the model is wrong but very confident, it can mislead care. If it ignores helpful text clues, it misses context doctors actually use.

🍞 Bottom Bread (Anchor): Imagine a computer outlining a fuzzy pneumonia patch on an X-ray while also reading, “opacities in right lower lobe.” If it trusts both image and words and knows when it’s unsure, it can outline more precisely and warn the doctor about any doubtful borders.

🍞 Top Bread (Hook): Imagine trying to color a tiny shape in a blurry photo. If no one tells you what you’re looking for, you’ll guess. But if someone whispers, “Circle the spotty patch near the bottom-right,” it’s way easier.

🥬 Filling:

  • What it is: Vision-language segmentation (VLS) uses text (like short prompts or report snippets) to guide which parts of an image to segment.
  • How it works: 1) Read the image. 2) Read the text. 3) Fuse them so the text points the model toward the right visual areas. 4) Output a mask (the colored-in region).
  • Why it matters: Without language guidance, the model may color the wrong area or need lots of labeled data. With language, it needs fewer labels and gets smarter hints.

🍞 Bottom Bread (Anchor): A prompt like “mark ground-glass opacities” helps the system highlight the hazy lung regions that match that description.

🍞 Top Bread (Hook): You know how sometimes you’re 100% sure of an answer, and other times you say, “I think so, but I’m not sure”? That “not sure” is uncertainty, and it’s important to say out loud.

🥬 Filling:

  • What it is: Uncertainty is a measure of how confident the model is about each pixel it colors.
  • How it works: The model’s probabilities can be turned into a number called entropy. High entropy = it’s unsure; low entropy = it’s sure.
  • Why it matters: Without uncertainty awareness, the model might look confident even when it’s wrong, which is risky in healthcare.

🍞 Bottom Bread (Anchor): If the mask edge around a polyp is fuzzy, the model can flag that area as uncertain, telling the doctor, “Please double-check here.”
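The entropy described above is easy to compute per pixel. Here is a minimal binary-segmentation sketch of the idea (the paper's exact formulation may differ):

```python
import numpy as np

def entropy_map(probs, eps=1e-8):
    """Per-pixel binary entropy: peaks at p = 0.5 (unsure),
    drops toward 0 as p approaches 0 or 1 (confident)."""
    p = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

probs = np.array([0.99, 0.50])  # a confident pixel vs. a fuzzy border pixel
h = entropy_map(probs)          # h[1] ≈ log 2 ≈ 0.693 (unsure), h[0] near 0
```

Mapping `h` back onto the image grid gives exactly the "please double-check here" heatmap described above.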

The world before: Classic models like U-Net colored in medical images well when data was clean and labels were plentiful. But in real clinics, images can be noisy or low contrast, and labeled data is often limited. Even newer transformer-style models improved the context understanding but were heavy (big and slow) and still didn’t fully use text or say how unsure they were.

The problem: Two big gaps remained. First, lining up what the words mean (“ground-glass opacities”) with where those patterns live in the image is tricky. Second, models didn’t learn to be careful in ambiguous areas because training didn’t really teach them about uncertainty.

Failed attempts: Many tried only image models (missing helpful text), or used text but with bulky transformers, or trained with simple losses that reward overlap but ignore shape structure and uncertainty. The results: decent masks on easy cases, but overconfident errors on tough, blurry scans.

The gap: We needed a method that 1) fuses text and image features in a structured, efficient way that captures long-range connections without huge compute, and 2) trains with a loss that balances accuracy, shape structure, and honest uncertainty.

Real stakes: In daily life, this means faster, safer triage of chest X-rays, more accurate polyp borders during colonoscopy, and better decision support when images are noisy. It can reduce unnecessary follow-ups from false alarms and help doctors spend less time correcting masks and more time with patients.

02Core Idea

🍞 Top Bread (Hook): Imagine building a Lego model using both the picture on the box (image) and the written instructions (text), while also circling any parts you’re unsure about so a friend can help.

🥬 Filling:

  • What it is (one sentence): The key insight is to fuse images and text efficiently while training the model to be both accurate and honest about uncertainty, using a special mixer (SSMix), a cross-modal attention block (MoDAB), and a new all-in-one loss (SEU Loss).

Multiple analogies:

  1. Tour guide + map: The image is the map, the text is the guide’s voice, and uncertainty is the guide saying, “I’m not fully sure about this alley; let’s check.”
  2. Chef + recipe: The image is the cooking pot; the text is the recipe instructions; uncertainty is the chef noting, “This sauce might be too thick—taste test!”
  3. Teacher + worksheet: The image is the worksheet diagram; the text is the hint; uncertainty is the student marking a question with a question mark.

Before vs After:

  • Before: Heavy transformers tried to align words and pictures but were costly; training focused on overlap (Dice) and ignored structure and uncertainty.
  • After: Lightweight SSMix models long-range patterns cheaply; MoDAB aligns words to pixels precisely; SEU Loss rewards overlap, preserves shape structure (frequency domain), and teaches confidence calibration.

Why it works (intuition):

  • Words narrow the search: Text points the model toward the relevant visual regions, shrinking the “where to look” space.
  • Efficient long memory: SSMix (a state-space mixer) captures long-range dependencies with linear-time dynamics, avoiding bulky attention everywhere.
  • Aligned focus: MoDAB’s cross-attention lets visual queries pull the most relevant text cues token-by-token.
  • Smart learning target: SEU Loss balances three forces—spatial overlap (be accurate), spectral consistency (keep the shape/topology), and entropy (be honest about doubt)—so the model learns robustly, especially on noisy scans.

Building blocks (each as a Sandwich):

🍞 Top Bread (Hook): You know how when you’re drawing, you glance at the picture and also read the caption to understand what to draw? 🥬 Filling:

  • What it is: MoDAB (Modality Decoding Attention Block) is a module that mixes visual features with text features using self-attention and cross-attention.
  • How it works: 1) The image features learn relationships among themselves (self-attention). 2) The image features ask the text features for help (cross-attention) to highlight relevant parts. 3) A learnable scale blends this guidance back into the visual stream.
  • Why it matters: Without MoDAB, the image and text are like two people talking past each other; alignment becomes weak and masks drift. 🍞 Bottom Bread (Anchor): If the text says “right lower lobe,” MoDAB helps the image features emphasize that region before decoding the final mask.
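The cross-attention-plus-residual pattern in MoDAB can be sketched numerically. This is a minimal single-head illustration, not the paper's implementation; the shapes, the softmax details, and the value of the learnable scale α here are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def modab_cross_step(img_tokens, txt_tokens, alpha=0.1):
    """Image tokens query text tokens; the result is blended back
    into the visual stream with a learnable scale alpha."""
    d = img_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / np.sqrt(d)  # (N_img, N_txt)
    attended = softmax(scores) @ txt_tokens          # text-informed features
    return img_tokens + alpha * attended             # residual fusion

# A visual token aligned with the first text token pulls mostly from it
img = np.array([[10.0, 0.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
fused = modab_cross_step(img, txt, alpha=1.0)
```

The residual form means that even when α shrinks toward zero, the original visual features pass through untouched, which matches the "tune how much text to use" behavior described later.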

🍞 Top Bread (Hook): Imagine reading a long story and remembering details from the beginning without rereading every page. 🥬 Filling:

  • What it is: SSMix (State Space Mixer) is a lightweight sequence mixer that captures long-range dependencies in the text stream (and its alignment) efficiently.
  • How it works: 1) Expand and split channels; 2) Apply depthwise 1D convolutions to learn local patterns; 3) Compute state-space updates (a fast way to remember long-range patterns); 4) Gate and blend results; 5) Project to match spatial tokens.
  • Why it matters: Without SSMix, long-distance relationships across tokens (like “opacity” connected to “lower lobe”) are weaker or costlier to learn. 🍞 Bottom Bread (Anchor): With SSMix, the system can connect “ground-glass” and “hazy” even if those words are far apart in the sentence.
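The long-memory core of a state-space mixer is a linear recurrence that runs in one O(T) pass. The sketch below is a generic diagonal scan; the paper's SSMix adds channel expansion, depthwise convolutions, input-dependent (selective) parameters, and gating, all omitted here:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Diagonal linear state-space recurrence:
    h_t = A * h_{t-1} + B * x_t ;  y_t = C . h_t"""
    h = np.zeros_like(B)
    ys = []
    for xt in x:              # one pass over the token sequence
        h = A * h + B * xt    # elementwise state update (long memory)
        ys.append(C @ h)      # read out the current state
    return np.array(ys)

# A spike at t=0 keeps echoing through later outputs: long-range memory
A = np.array([0.9, 0.5])      # per-state decay rates
B = np.ones(2)
C = np.ones(2)
y = ssm_scan(np.array([1.0, 0.0, 0.0]), A, B, C)
# y decays smoothly instead of vanishing after one step
```

Because the state `h` carries information forward indefinitely, "ground-glass" can still influence the representation when "hazy" arrives many tokens later, without any quadratic attention.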

🍞 Top Bread (Hook): Think of grading a drawing contest: you check if the drawing overlaps the outline, if the overall shape looks right, and if the artist admits when parts are guesses. 🥬 Filling:

  • What it is: SEU Loss combines Dice (overlap), spectral consistency (shape/topology in frequency space), and entropy (uncertainty) into one training target.
  • How it works: 1) Dice pushes accurate pixel overlap. 2) Spectral term aligns global structure via Fourier magnitudes. 3) Entropy regularizes predictions to be confident where appropriate and cautious where ambiguous.
  • Why it matters: Without SEU Loss, models can overfit to crisp edges, lose global shape, and act overconfident in noisy areas. 🍞 Bottom Bread (Anchor): A polyp’s border stays smooth and plausible (spectral), tightly overlaps ground truth (Dice), and flags fuzzy edges (entropy).
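The three forces can be sketched in a few lines of numpy. The λ weights match the training details reported later (λF=0.3, λE=0.1), but the per-term normalization is an assumption, not the paper's exact code:

```python
import numpy as np

def seu_loss(pred, target, lam_f=0.3, lam_e=0.1, eps=1e-8):
    """Sketch of the SEU idea: Dice overlap + spectral (Fourier-magnitude)
    consistency + entropy regularization, for one binary mask."""
    p = np.clip(pred, eps, 1 - eps)
    # 1) Dice loss: penalizes poor pixel overlap
    dice = 1 - (2 * (p * target).sum() + eps) / (p.sum() + target.sum() + eps)
    # 2) Spectral term: match global shape via FFT magnitude spectra
    spec = np.abs(np.abs(np.fft.fft2(p)) - np.abs(np.fft.fft2(target))).mean()
    # 3) Entropy term: regularize per-pixel confidence
    ent = -(p * np.log(p) + (1 - p) * np.log(1 - p)).mean()
    return dice + lam_f * spec + lam_e * ent

target = np.zeros((4, 4)); target[1:3, 1:3] = 1.0   # small square lesion
pred_good = np.clip(target, 0.05, 0.95)             # near-perfect mask
pred_bad  = np.clip(1 - target, 0.05, 0.95)         # inverted mask
```

A near-perfect mask scores far lower than an inverted one on both the Dice and spectral terms, which is the gradient signal the model trains against.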

03Methodology

At a high level: Image + Text → Encode each → Fuse with MoDAB (self-attention + cross-attention) and SSMix → Decode to full-size mask → Train with SEU Loss.

Step 1: Visual Encoding (ConvNeXt-Tiny)

  • What happens: The image is resized (224×224) and passed through a ConvNeXt-Tiny encoder to get multi-scale feature maps (from coarse to fine details).
  • Why this step exists: Different layers capture textures, shapes, and semantics; the decoder later needs these scales to rebuild sharp masks.
  • Example: An X-ray produces four feature maps; the finest one preserves lung edges, while deeper ones capture “disease-like” patterns.

Step 2: Text Encoding (BioViL CXR-BERT, frozen)

  • What happens: The text prompt (e.g., “Ground-glass opacities in right lower lobe”) is tokenized and mapped to embeddings using a pretrained, frozen medical text model.
  • Why this step exists: Using a frozen, domain-tuned language encoder preserves stable, clinically informed semantics and saves compute.
  • Example: Tokens like “opacities,” “right,” “lower,” “lobe” become vectors carrying both meaning and medical context.

Step 3: SSMix prepares text for fusion

  • What happens: Text embeddings are projected, split, convolved, and passed through a selective state-space scan that models long-distance relationships; then reprojected to match the spatial token dimension.
  • Why this step exists: It gives the text stream long memory and structure without expensive global attention.
  • Example: Even if “right” appears far from “lower lobe” in the sentence, SSMix keeps their connection active.

Step 4: MoDAB Self-Attention on image features

  • What happens: The image tokens attend to each other (self-attention) with positional encodings, learning which spatial parts relate.
  • Why this step exists: Structures like a diffuse lesion may require comparing distant pixels; self-attention links them.
  • Example: Shaded patches in opposite lung corners can still influence each other if they form a known pattern.

Step 5: MoDAB Cross-Attention (image queries, text keys/values)

  • What happens: Each image token asks, “Which text tokens help me?” Cross-attention brings in the most relevant words to each spot.
  • Why this step exists: It anchors visual features to clinical semantics, making “what to look for” explicit at each location.
  • Example: Pixels in the right lower lung area listen more to tokens “right,” “lower lobe,” and “opacity.”

Step 6: Residual fusion with learnable scaling

  • What happens: The cross-attended result is normalized and scaled by a learnable factor α, then added back to the image stream.
  • Why this step exists: It stabilizes training and lets the model tune “how much text to use” per layer.
  • Example: If text is too generic, α shrinks its impact; if precise, α increases it.

Step 7: Decoder with skip connections and subpixel upsampling

  • What happens: The fused features are reshaped into spatial maps and progressively upsampled via transposed convolutions. At each scale, they’re concatenated with matching encoder features and refined by small conv blocks. A final subpixel upsampling (pixel shuffle) sharpens resolution, followed by average pooling and a 1×1 conv to get class probabilities.
  • Why this step exists: Coarse-to-fine reconstruction keeps both global context and crisp edges; pixel shuffle helps create cleaner, high-res outputs.
  • Example: The final mask cleanly traces a polyp boundary while maintaining smooth texture transitions.

Step 8: SEU Loss for training

  • What happens: The model learns from three signals at once: Dice overlap, spectral (Fourier magnitude) similarity, and entropy regularization.
  • Why this step exists: Dice boosts local accuracy, spectral preserves realistic shapes, entropy reduces overconfident mistakes.
  • Example: A hazy COVID lesion stays anatomically plausible (spectral), aligns with ground truth (Dice), and marks uncertain borders (entropy).

Training details (so it’s reproducible):

  • Inputs: 224×224 images, batch size 32. Optimizer: AdamW with cosine annealing. Max 200 epochs, early stopping with patience 20, and at least 20 epochs minimum. Learning rates: 5e-4 (MosMed++), 3e-4 (QaTa-COV19, Kvasir-SEG). Loss weights: λF=0.3 (spectral), λE=0.1 (entropy).
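The cosine annealing schedule mentioned above follows a standard form; this sketch assumes no warmup or restarts, which the paper does not specify:

```python
import numpy as np

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine annealing: smooth decay from lr_max down to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + np.cos(np.pi * epoch / total_epochs))

# With the base LR of 3e-4 (QaTa-COV19, Kvasir-SEG) over 200 epochs:
# epoch 0 -> 3e-4, epoch 100 -> 1.5e-4, epoch 200 -> 0
```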

The Secret Sauce (why this method feels clever):

  • SSMix gives long-range power at a fraction of the cost of full attention, so the model is nimble.
  • MoDAB pinpoints where text should influence image features, so guidance is precise, not blurry.
  • SEU Loss teaches the model to be accurate, anatomically faithful, and humbly honest about doubt—critical for clinical trust.

Concept Sandwiches for new pieces introduced here:

🍞 Top Bread (Hook): Picture asking a friend which words in a sentence help you spot a shape in a picture. 🥬 Filling:

  • What it is: Cross-Attention lets image tokens query text tokens to fetch the most relevant clues.
  • How it works: Each image location forms a question; text tokens act as answers; the best matches are weighted more.
  • Why it matters: Without it, text guidance becomes vague and less helpful. 🍞 Bottom Bread (Anchor): The pixel in “right lower lobe” pays close attention to the words “right,” “lower,” and “lobe.”

🍞 Top Bread (Hook): Think of converting a blocky image into a sharper, smoother one by reorganizing pixels cleverly. 🥬 Filling:

  • What it is: Pixel Shuffle (subpixel upsampling) rearranges channel information into higher spatial resolution.
  • How it works: A conv expands channels; pixel shuffle redistributes them as a finer grid.
  • Why it matters: Without it, upsampling can look jagged or blurry. 🍞 Bottom Bread (Anchor): The final mask edges around a small polyp look crisp after pixel shuffle, not stair-stepped.
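Mechanically, pixel shuffle is just a reshape-and-transpose. The sketch below mirrors the standard channel-to-space semantics (as in torch.nn.PixelShuffle); it illustrates the operation, not the paper's decoder code:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) tensor into (C, H*r, W*r):
    r*r channel groups become an r-by-r block of finer pixels."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# Four 1x1 channels become one 2x2 spatial grid
x = np.arange(4.0).reshape(4, 1, 1)
out = pixel_shuffle(x, 2)            # [[0, 1], [2, 3]]
```

No values are interpolated or invented; information the convolution already packed into channels is simply laid out at a finer resolution, which is why edges come out crisp rather than blurry.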

04Experiments & Results

The Test: What did they measure and why?

  • They measured Dice score (how much the predicted mask overlaps the true mask), mIoU (average overlap across classes), model size (parameters), and speed (FLOPs). These tell us accuracy and practicality.

🍞 Top Bread (Hook): Imagine grading how well two drawings overlap by tracing both and seeing how much area matches. 🥬 Filling:

  • What it is: Dice score measures overlap between predicted and true regions.
  • How it works: Twice the shared area divided by the sum of both areas; higher is better.
  • Why it matters: Without a good overlap measure, we can’t tell if the mask is actually correct. 🍞 Bottom Bread (Anchor): If the model colors almost exactly the same pixels as the ground truth, Dice nears 1 (or 100%).
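The Dice formula above in a few lines (a standard implementation, with a small epsilon assumed for empty-mask safety):

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2*|A intersect B| / (|A| + |B|); 1.0 is a perfect mask."""
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

pred   = np.array([1, 1, 0, 0])   # model colors pixels 0 and 1
target = np.array([1, 0, 0, 0])   # truth is only pixel 0
# one shared pixel out of three colored in total: dice = 2/3
```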

🍞 Top Bread (Hook): Think of two shapes on tracing paper; mIoU asks, on average, how much they cover the same ground out of all the ground they cover together. 🥬 Filling:

  • What it is: mIoU is mean Intersection over Union across classes.
  • How it works: Intersection area divided by union area, averaged; it penalizes extra spillover and missing parts.
  • Why it matters: Without mIoU, a model might look good on one class but fail overall. 🍞 Bottom Bread (Anchor): A clean polyp outline both reduces missed areas and avoids coloring outside the border, boosting mIoU.
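And the matching mIoU sketch (standard definitions; the epsilon is an assumed guard for empty classes):

```python
import numpy as np

def iou(pred, target, eps=1e-8):
    """Intersection over union for one binary mask pair."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

def miou(preds, targets):
    """Mean IoU: average the per-class IoU across all classes."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

pred   = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
# one shared pixel, two pixels in the union: IoU = 1/2
```

Note that these same masks score 2/3 under Dice but only 1/2 under IoU: IoU always penalizes spillover and misses at least as hard, which is why reporting both gives a fuller picture.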

Datasets and comparisons:

  • QaTa-COV19 (X-rays), MosMed++ (CT), Kvasir-SEG (endoscopy). They compared against popular image-only models (U-Net, nnUNet, TransUNet, Swin-UNet, U-Mamba) and vision-language models (CLIP-based, LViT, Ariadne, MAdapter, etc.).

Scoreboard with context (selected highlights):

  • QaTa-COV19: 92.24% Dice, 84.9% mIoU. This is like scoring an A+ when other strong students (MAdapter at ~90.07% Dice) got an A- to B+.
  • MosMed++: 79.67% Dice, 66.38% mIoU, edging out MAdapter (~78.40% Dice). On tough CT slices, this consistent lead matters.
  • Kvasir-SEG: 93.83% Dice, 87.62% mIoU, topping UCTransNet and MAdapter—an outstanding performance on precise polyp borders.

Efficiency: With ~39.9M parameters and ~17.87 GFLOPs, the model stays lighter than many transformer-heavy baselines (e.g., 100M+ params), showing a strong accuracy-to-efficiency tradeoff.

Surprising/Informative Findings (Ablations):

  • No text at inference: Dice drops to ~87.28% (Kvasir-SEG)—text truly guides segmentation.
  • Remove MoDAB in training: Bigger drop to ~85.15%, proving structured fusion is essential.
  • Replace SSMix with a plain linear layer: Dice ~91.72%—long-range, efficient mixing really helps.
  • Replace cross-attention with simple addition: Dice ~92.11%—precise token-to-token matching matters.
  • Replace SEU with Dice-only or BCE-only: performance declines (e.g., ~93.44% or ~92.03% Dice), confirming the value of spectral + uncertainty terms.

Takeaway: The method isn’t just a little better; it’s better and leaner. The biggest gains appear where images are tricky or descriptions carry crucial hints—exactly the cases clinicians care about.

05Discussion & Limitations

Limitations:

  • Text quality matters: If prompts are vague or mismatched to the image, guidance can weaken, and α must down-weight text influence.
  • Domain shifts: New scanners or very different patient populations may require finetuning or prompt adaptation.
  • 3D context: Current setup focuses on 2D slices/resized images; full 3D volumes may need architectural tweaks and more memory.
  • Calibration scope: Entropy regularization helps, but additional calibration (e.g., temperature scaling) could further improve reliability.

Required Resources:

  • A modern GPU (e.g., A30-class) is recommended for smooth training; inference is relatively light (~17.87 GFLOPs).
  • Pretrained encoders (ConvNeXt-Tiny for vision, BioViL CXR-BERT for text) and datasets with consistent text prompts.

When NOT to Use:

  • No text available or text is systematically misleading; then an image-only model or prompt engineering may be safer.
  • Real-time constraints with ultra-high-resolution 3D volumes where even lightweight attention may be too slow without optimization.
  • Tasks where spectral structure is not meaningful (e.g., highly stochastic textures) might benefit less from Fourier alignment.

Open Questions:

  • How best to craft, select, or learn prompts automatically for diverse pathologies and body parts?
  • Can we extend SSMix and MoDAB to full 3D scans or video endoscopy while keeping efficiency?
  • How do uncertainty maps influence radiologist trust and decision time in user studies?
  • Could active learning use the uncertainty maps to reduce labeling costs further?

06Conclusion & Future Work

Three-sentence summary: This work presents an uncertainty-aware vision-language segmentation system that fuses images and text efficiently (MoDAB + SSMix) and learns with a unified SEU Loss balancing overlap, structure, and confidence. It achieves state-of-the-art accuracy on three medical datasets while being significantly more efficient than many transformer-heavy baselines. The model is not only precise but also transparent about where it’s unsure—critical for clinical safety.

Main Achievement: Showing that structured, efficient vision-language fusion plus a holistic loss (SEU) produces masks that are both accurate and clinically trustworthy, outperforming larger models.

Future Directions: Better prompt design and automatic prompt generation; extending to 3D volumes and sequences; richer uncertainty calibration; integrating patient metadata from electronic health records; and user studies that measure trust and workflow impact.

Why Remember This: It’s a blueprint for how to combine pictures and words in medicine without overbuilding: use efficient long-range mixers, targeted cross-attention, and a loss that teaches accuracy, shape, and humility—exactly what clinicians need from AI helpers.

Practical Applications

  • Interactive radiology: A clinician types a short prompt (e.g., “highlight right lower lobe opacities”) to get a guided lung lesion mask on X-rays or CTs.
  • Endoscopy support: More precise, uncertainty-aware polyp segmentation during colonoscopy to assist real-time decisions.
  • Triage tools: Rapid, efficient segmentation on portable X-rays in low-resource settings with clear uncertainty flags.
  • Quality control: Use uncertainty maps to prioritize human review where the model is least confident.
  • Active learning: Select high-uncertainty regions first to label, reducing annotation costs over time.
  • Clinical reporting: Generate segmentation overlays aligned with report phrases, improving explainability and trust.
  • Dataset bootstrapping: Use text prompts to create initial masks on partially labeled datasets, accelerating curation.
  • Telemedicine: Provide robust, lightweight segmentation for remote diagnostics where bandwidth and compute are limited.
  • Cross-modality adaptation: Apply prompts tailored to different organs (lungs, bowel, liver) to guide segmentation in new tasks.
  • Education and training: Teach residents with side-by-side masks and uncertainty maps, illustrating where AI is confident vs. cautious.
#medical image segmentation#vision-language segmentation#uncertainty estimation#entropy regularization#spectral (Fourier) consistency#state space model#Mamba-inspired mixer#cross-attention fusion#multimodal learning#ConvNeXt-Tiny#BioViL CXR-BERT#Dice score#mIoU#polyp segmentation#COVID-19 X-ray/CT