Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Intermediate
Yi Zhou, Xuechao Zou, Shun Zhang et al. Ā· 12/28/2025
arXiv Ā· PDF

Key Summary

  • Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
  • It fights a big problem called pseudo-label drift, where early mistakes snowball during training and make the model worse over time.
  • One student gets explicit help from language (text embeddings from CLIP), while the other gets implicit help from learnable queries tuned to the data.
  • A global-local fusion rule uses confidence to decide which student to trust for each pixel: CLIP is better at big-picture meaning, DINOv3 at fine details.
  • Consistency training keeps each student steady across weak and strong augmentations of the same image.
  • Across six remote sensing datasets (WHDLD, LoveDA, Potsdam, GID-15, MER, MSL), Co2S achieves leading mIoU, especially when labels are extremely scarce.
  • Ablations show that both explicit (text) and implicit (queries) guidance matter, and that mixing CLIP with DINOv3 beats using either alone.
  • Pseudo-label quality becomes stable very early (over 95% in the first epoch on WHDLD 1/24), showing strong resistance to drift.
  • The method uses ViT backbones, standard augmentations, and confidence thresholds; it is practical with modern GPUs.
  • Co2S can save annotation time and cost while producing sharper boundaries and more semantically correct maps.

Why This Research Matters

Remote sensing maps guide real decisions: where to build roads, how to protect forests, and how to respond to floods and fires. Co2S produces accurate pixel-level maps with far fewer labels, saving months of expert annotation time and lowering costs. Its global-plus-local design keeps categories right and boundaries crisp, which matters for zoning lines, levees, and narrow streets. The approach also works on Martian terrain, helping planetary science teams classify rocks and soil with minimal manual labeling. By stabilizing pseudo-labels early, Co2S avoids common semi-supervised pitfalls, making learning from unlabeled data far more trustworthy. This means faster updates to maps after disasters and more reliable monitoring of environmental change.

Detailed Explanation

01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re coloring a giant neighborhood map from space. You have to color every tiny square: buildings in one color, roads in another, trees in green, and water in blue. That’s a lot of coloring—and you don’t have time to label every single square by hand.

🄬 Filling (The Actual Concept): What it is: Remote sensing semantic segmentation is teaching computers to color every pixel in satellite or aerial images with the right class (like road, building, tree, or water). How it works (step by step):

  1. We collect big pictures of Earth from satellites or planes.
  2. Experts label some images pixel by pixel (very time-consuming).
  3. A model learns patterns that connect pixels to classes.
  4. The model then predicts labels for new, unlabeled images.

Why it matters: Without good segmentation, maps are outdated, disaster response is slower, and city planning becomes guesswork.

šŸž Bottom Bread (Anchor): When a flood happens, a good segmentation model quickly marks which areas are water so rescue teams know where to go.

šŸž Top Bread (Hook): You know how you learn from a few examples and then guess the rest on your own? Computers try that too when we don’t have enough labeled pictures.

🄬 Filling (The Actual Concept): What it is: Semi-supervised segmentation lets a model learn from a few labeled images plus many unlabeled ones by making its own guesses (pseudo-labels) to keep learning. How it works:

  1. Start with a small set of labeled images.
  2. Predict labels for unlabeled images (these are pseudo-labels).
  3. Keep training the model using these pseudo-labels (if they seem confident enough).
  4. Repeat and improve.

Why it matters: Without semi-supervision, we’d need to label everything by hand, which is slow and expensive.

šŸž Bottom Bread (Anchor): If you only have 1 labeled picture out of 8, semi-supervised learning still helps the model learn to color the other 7 pretty well.

šŸž Top Bread (Hook): Imagine you make a small mistake in a math problem and then use that wrong answer to solve the next step—you’ll drift farther from the truth.

🄬 Filling (The Actual Concept): What it is: Pseudo-label drift is when a model’s early wrong guesses become ā€œfake teachersā€ that keep reinforcing the same mistakes. How it works:

  1. The model makes a guess on an unlabeled pixel.
  2. If the guess is wrong but looks confident, it’s used as training truth.
  3. The model gets nudged to repeat that wrong guess.
  4. Over time, errors snowball into bigger errors.

Why it matters: Without stopping drift, models confuse similar classes (e.g., roads vs. paved areas) and draw messy boundaries.

šŸž Bottom Bread (Anchor): A river bank might be mistaken as a road; once that error sticks, the model keeps calling that river edge a road in other images too.

šŸž Top Bread (Hook): Think of two friends studying together—one is great at seeing the big picture, the other is awesome at spotting tiny details. Together they ace the test.

🄬 Filling (The Actual Concept): What it is: The paper proposes Co2S, a semi-supervised framework where two different students (CLIP and DINOv3) learn together to avoid pseudo-label drift. How it works:

  1. One student (CLIP) brings global, language-aware semantics (big-picture meaning).
  2. The other (DINOv3) brings local, detail-rich features (sharp boundaries, textures).
  3. Each student learns from labeled data and from its own consistent predictions under augmentations.
  4. A confidence-based rule decides which student supervises each pixel in unlabeled images, stabilizing learning.

Why it matters: Without two complementary students and a smart referee, the model would keep reinforcing its own mistakes.

šŸž Bottom Bread (Anchor): On tricky scenes with buildings next to roads and water, CLIP helps pick the right class, and DINOv3 helps draw crisp edges—so the map is both right and neat.

šŸž Top Bread (Hook): Before this work, people tried a few popular tricks—but each had problems like wobbling training or locked-in biases.

🄬 Filling (The Actual Concept): What it is: Previous methods include GAN-based approaches, consistency methods, and pseudo-labeling. How it works (short tour):

  • GANs: Train a ā€œcheckerā€ network to tell real from fake predictions; models can wobble and be hard to stabilize.
  • Consistency (e.g., FixMatch, UniMatch): Make predictions match under different views; still can lock in early mistakes.
  • Pseudo-labeling (e.g., U2PL, DWL): Use high-confidence predictions as labels; if early guesses are wrong, noise accumulates.

Why it matters: Without stronger outside guidance (like language or diverse priors), these methods can confuse lookalike classes and drift.

šŸž Bottom Bread (Anchor): A paved plaza and a road both look gray; earlier methods often mix them up, and that confusion spreads as training goes on.

šŸž Top Bread (Hook): So what was missing? A wise pairing: one student that ā€œspeaksā€ language and sees context, plus another that zooms into texture.

🄬 Filling (The Actual Concept): What it is: The gap was a stable way to combine vision-language priors (CLIP) with self-supervised priors (DINOv3) so each can fix the other. How it works:

  1. Feed CLIP explicit text embeddings for classes.
  2. Give DINOv3 learnable queries that adapt to the data.
  3. Fuse global and local predictions per pixel using confidence.
  4. Keep each student consistent across augmentations.

Why it matters: Without this combo, training either collapses into bias or misses fine details.

šŸž Bottom Bread (Anchor): With Co2S, ā€œroadā€ is anchored by language (CLIP) and outlined by texture (DINOv3), delivering cleaner and more accurate maps.

02Core Idea

šŸž Top Bread (Hook): You know how a great team uses a map and a magnifying glass—one for direction, one for detail—and checks each other’s work as they go?

🄬 Filling (The Actual Concept): What it is (one sentence): Co2S’s key insight is to pair a language-aware global thinker (CLIP) with a detail-focused local thinker (DINOv3) and let them co-teach each pixel based on confidence, stopping errors from snowballing. How it works (the recipe intuition):

  1. Two different students process the same image.
  2. CLIP gets explicit class hints from text; DINOv3 learns implicit class hints via queries.
  3. Each student must be consistent under weak/strong augmentations.
  4. A per-pixel referee trusts the more confident student, nudging the other to agree.

Why it matters: Without this, semi-supervised training often reinforces its own early mistakes (pseudo-label drift).

šŸž Bottom Bread (Anchor): In a scene where a gray parking lot sits beside a gray road, CLIP reminds the model which region fits the ā€œroadā€ concept; DINOv3 ensures the boundary between them is sharp, and the confidence referee picks who leads for each pixel.

Multiple Analogies:

  1. Study buddies: One friend remembers the textbook (CLIP’s language), the other sees tiny typos (DINOv3’s details). They check each answer together and follow whoever is more certain.
  2. Coach and scout: The coach knows the playbook (global semantics), the scout spots the players’ foot positions (local cues). A judge (confidence) decides whose call to follow.
  3. Weather forecast: A satellite view gives the big system (CLIP), local sensors catch small gusts (DINOv3). A control center blends them, trusting the more reliable reading at each spot.

Before vs. After:

  • Before: Models relied on their own pseudo-labels without strong outside anchors, so similar-looking classes blurred together and boundaries were messy.
  • After: Language-guided global anchors plus local detail cues, fused by confidence, keep semantics right and edges crisp—even with few labels.

Why It Works (intuition behind the math):

  • Complementary priors reduce correlated errors: CLIP’s text-aligned prototypes steer category meaning; DINOv3’s self-supervised dense features sharpen boundaries.
  • Per-pixel confidence arbitration prevents mutual drift: the student with a clearer signal tutors the other, pixel by pixel.
  • Consistency losses tame each student’s internal noise across augmentations, providing steady self-supervision.
  • Stability loss links the two students: when both are confident, they must agree—this couples their strengths while damping weaknesses.

Building Blocks (as mini-concepts using the sandwich pattern):

šŸž Hook: Imagine labeling stickers with words like ā€œroadā€ and ā€œtreeā€ to guide your coloring. 🄬 Concept: Explicit semantic guidance (CLIP text embeddings).

  • What: Fixed class embeddings from a text encoder act like semantic anchors.
  • How: Write class prompts (e.g., ā€œa photo of a highwayā€), encode them, average concepts per class, then match pixels to class vectors.
  • Why: Without anchors, the model can drift and mix up lookalike classes.

šŸž Anchor: The word ā€œwaterā€ helps the model prefer rivers and lakes over shiny rooftops.

šŸž Hook: Think of a set of blank, tunable magnets that learn what each class ā€˜feels’ like from the data. 🄬 Concept: Implicit semantic guidance (learnable queries).

  • What: Trainable vectors (one per class) that adapt to the dataset’s visual statistics.
  • How: Start randomly, update during training to align with class-specific patterns.
  • Why: Without adaptive queries, you miss dataset-specific textures (e.g., local road materials).

šŸž Anchor: The ā€œbuildingā€ query learns the signature of rooftops in that city’s imagery.

šŸž Hook: Picture a referee who chooses which friend’s answer to trust for each quiz question. 🄬 Concept: Global-local collaborative fusion with confidence.

  • What: A per-pixel rule that follows the more confident student.
  • How: Compute each student’s class probabilities; if one is more certain (above a threshold), use it to teach the other.
  • Why: Without arbitration, both students could reinforce each other’s mistakes.

šŸž Anchor: For a fuzzy river edge, DINOv3’s sharpness might win; for choosing ā€œriverā€ vs. ā€œroad,ā€ CLIP’s semantics might lead.

šŸž Hook: It’s like checking your answer on the same problem after changing the font and colors—the meaning should stay the same. 🄬 Concept: Consistency regularization.

  • What: Each student’s prediction should match across weak and strong augmentations of the same image.
  • How: Create weak/strong/feature-perturbed views; use confident predictions from the weak view to supervise the others.
  • Why: Without this, the model is easily fooled by harmless changes.

šŸž Anchor: A road is still a road even if the image is slightly darker or blurred.

03Methodology

High-Level Overview: Input (labeled + unlabeled images) → Two Students in Parallel (CLIP student with explicit text guidance; DINOv3 student with implicit query guidance) → Per-student Consistency Losses → Global-Local Confidence Fusion (stability loss) → Segmentation Output.

Step 1: Prepare the data and views

šŸž Hook: You know how a teacher shows a clean worksheet and also a scribbled one to see if you still recognize the same answers?

🄬 Concept: Weak and strong data views.

  • What: Make several versions of each image: a weakly changed one, strongly changed ones, and a feature-perturbed one.
  • How: Weak view uses simple resize and flip; strong views add heavy color jitter, grayscale, blur, and CutMix; feature-perturbation uses dropout in the network.
  • Why: Without varied views, the model can overfit to exact appearances and fail to generalize.

šŸž Anchor: A road stays a road even if the picture is a bit blurrier or colors are shifted.
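Here is a minimal sketch of the weak/strong views from Step 1 using torchvision transforms. The augmentation strengths and the simplified CutMix helper are assumptions for illustration, not the paper's exact settings; note that the strong view is built on top of the weak view, so pixels stay aligned and the weak-view prediction can later supervise them.

```python
import torch
from torchvision import transforms

# Weak view: light geometric changes only.
weak_aug = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
])

# Strong view: heavy photometric changes applied on top of the weak view.
strong_jitter = transforms.Compose([
    transforms.ColorJitter(0.5, 0.5, 0.5, 0.25),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=5),
])

def cutmix(batch_a, batch_b, size=64):
    """Paste a random square from batch_b into batch_a (simplified CutMix)."""
    x = batch_a.clone()
    _, _, h, w = x.shape
    top = torch.randint(0, h - size, (1,)).item()
    left = torch.randint(0, w - size, (1,)).item()
    x[:, :, top:top + size, left:left + size] = batch_b[:, :, top:top + size, left:left + size]
    return x

img = torch.rand(3, 256, 256)                  # a dummy image tensor in [0, 1]
x_weak = weak_aug(img)
x_strong = strong_jitter(x_weak)               # same geometry, different appearance
x_mix = cutmix(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))
```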

Step 2: Two heterogeneous students process the same views

šŸž Hook: Think of one student reading a captioned atlas (CLIP) and another using a high-zoom magnifier (DINOv3).

🄬 Concept: Dual-student architecture (CLIP + DINOv3).

  • What: Two ViT-based students with different pretraining: CLIP (vision-language) and DINOv3 (self-supervised vision).
  • How: Both extract feature maps from the weak view for supervision and from strong/perturbed views for consistency.
  • Why: Without heterogeneity, co-training often copies the same mistakes, causing drift.

šŸž Anchor: CLIP’s features capture scene meaning (global), DINOv3’s features capture edges and textures (local).
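The sketch below shows the shape of the dual-student setup: two backbones with different pretraining behind the same ā€œimage in, per-pixel features outā€ interface. The tiny convolutional stand-ins are placeholders for the actual CLIP and DINOv3 ViT-B/16 backbones, whose loading details are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    """A segmentation student: backbone features -> per-pixel embeddings."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone   # a CLIP or DINOv3 ViT in the real system

    def forward(self, x):
        feats = self.backbone(x)   # (B, D, h, w) coarse feature map
        # Upsample back to input resolution so every pixel gets an embedding.
        return F.interpolate(feats, size=x.shape[-2:], mode="bilinear", align_corners=False)

# Placeholder backbones standing in for CLIP-ViT and DINOv3-ViT (stride-16 patches).
clip_student = Student(nn.Conv2d(3, 512, kernel_size=16, stride=16))
dino_student = Student(nn.Conv2d(3, 768, kernel_size=16, stride=16))

x = torch.randn(2, 3, 224, 224)
f_clip, f_dino = clip_student(x), dino_student(x)   # (2, 512, 224, 224), (2, 768, 224, 224)
```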

Step 3: Give the CLIP student explicit class anchors from text

šŸž Hook: Label your crayon box: ā€œroad,ā€ ā€œbuilding,ā€ ā€œtree,ā€ ā€œwater.ā€ It’s easier to pick the right color.

🄬 Concept: Text prototypes via prompt ensembling.

  • What: For each class, craft several related words (e.g., for road: highway, main road, street), encode them with CLIP’s text encoder, and average them into a class prototype.
  • How: Encode prompts → average per class → stack prototypes → compare pixel features to prototypes by dot-product.
  • Why: Without explicit anchors, similar classes (e.g., plaza vs. road) blur together.

šŸž Anchor: The text vector for ā€œbuildingā€ nudges pixels with rooftop patterns to be labeled ā€œbuilding.ā€
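A minimal sketch of prompt ensembling, assuming OpenAI's clip package (any CLIP implementation with a text encoder works the same way); the concept lists are illustrative, not the paper's exact prompt sets.

```python
import torch
import clip  # assumes OpenAI's CLIP package is installed; open_clip works similarly

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Illustrative concept lists; the paper ensembles several related terms per class.
class_concepts = {
    "road":     ["a photo of a highway", "a photo of a main road", "a photo of a street"],
    "building": ["a photo of a building", "a photo of a rooftop"],
    "water":    ["a photo of a river", "a photo of a lake"],
}

prototypes = []
with torch.no_grad():
    for cls, prompts in class_concepts.items():
        tokens = clip.tokenize(prompts).to(device)
        emb = model.encode_text(tokens).float()          # (num_prompts, 512)
        emb = emb / emb.norm(dim=-1, keepdim=True)       # normalize each prompt embedding
        prototypes.append(emb.mean(dim=0))               # average into one class prototype
text_prototypes = torch.stack(prototypes)                # (num_classes, 512)

# Pixel features from the CLIP student (random stand-ins here) are scored
# against the prototypes by dot product to get per-pixel class logits.
dim = text_prototypes.shape[-1]
pixel_feats = torch.randn(2, dim, 64, 64, device=device)
pixel_feats = pixel_feats / pixel_feats.norm(dim=1, keepdim=True)
logits = torch.einsum("bdhw,cd->bchw", pixel_feats, text_prototypes)  # (2, 3, 64, 64)
```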

Step 4: Give the DINOv3 student adaptive class queries

šŸž Hook: Imagine sticky notes that learn the feel of each class as you practice.

🄬 Concept: Learnable class queries.

  • What: One trainable vector per class that adapts to the dataset’s visual distribution.
  • How: Start random; optimize end-to-end so these queries align to class-specific patterns in DINOv3’s feature space.
  • Why: Without implicit adaptation, you can’t capture local, dataset-specific textures (e.g., Mars rocks vs. Earth gravel).

šŸž Anchor: The ā€œtreeā€ query learns leaf textures that distinguish trees from low vegetation.
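A minimal sketch of implicit guidance with learnable class queries; the module name, feature dimension, and normalization choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryHead(nn.Module):
    """One trainable query vector per class, scored against dense features."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # Starts random and is optimized end-to-end with the rest of the student.
        self.class_queries = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)

    def forward(self, pixel_feats):                     # pixel_feats: (B, D, H, W)
        q = self.class_queries / self.class_queries.norm(dim=-1, keepdim=True)
        f = pixel_feats / pixel_feats.norm(dim=1, keepdim=True)
        # Dot product between each pixel embedding and each class query -> logits.
        return torch.einsum("bdhw,cd->bchw", f, q)      # (B, num_classes, H, W)

head = QueryHead(num_classes=6, feat_dim=768)           # e.g., 6 WHDLD classes, DINOv3-like dim
logits = head(torch.randn(2, 768, 64, 64))
print(logits.shape)  # torch.Size([2, 6, 64, 64])
```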

Step 5: Per-student consistency with pseudo-labels

šŸž Hook: If you recognize your friend with glasses on or off, your recognition is consistent.

🄬 Concept: Weak-to-strong consistency.

  • What: Use the weak-view prediction (only if confident) as pseudo-labels to train strong and feature-perturbed views.
  • How: Compute class probabilities on the weak view; take the argmax label per pixel if confidence ≄ threshold (e.g., 0.95); apply masked cross-entropy to strong/perturbed logits.
  • Why: Without this, predictions can flip with tiny changes, making training unstable.

šŸž Anchor: A highway remains ā€œroadā€ under color jitter and blur; the model is trained to keep that decision.
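A minimal sketch of this weak-to-strong consistency loss for one student; the 0.95 threshold matches the example above, and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_strong, conf_thresh=0.95):
    """Confident weak-view predictions supervise the strong/perturbed view."""
    with torch.no_grad():
        probs = torch.softmax(logits_weak, dim=1)       # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                 # (B, H, W) confidence and label
        mask = (conf >= conf_thresh).float()
    # Masked cross-entropy: low-confidence pixels contribute nothing.
    per_pixel = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

loss = consistency_loss(torch.randn(2, 6, 64, 64), torch.randn(2, 6, 64, 64))
```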

Step 6: Global-local collaborative fusion (stability loss)

šŸž Hook: For each puzzle piece, pick the teammate who is more certain.

🄬 Concept: Confidence-based mutual supervision.

  • What: For each pixel in the weak view, compare CLIP and DINOv3 confidences. Teach the less confident one using the more confident one; if both are confident, they should match.
  • How: Compute softmax probabilities per student; pick the higher confidence; apply a mean-squared-error loss between the two probability vectors on confident pixels; use a ramp-up weight so early training is gentle.
  • Why: Without this stability link, the students might drift apart or reinforce the wrong answers. šŸž Anchor: If CLIP is very sure a pixel is ā€œwaterā€ but DINOv3 hesitates, DINOv3 learns from CLIP for that pixel; if DINOv3 is sure about a thin edge, CLIP refines its boundary.
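A minimal sketch of the per-pixel confidence arbitration behind the stability loss. The MSE form, the stop-gradient on the leader's probabilities, and the shared threshold follow the description above but are written as plausible assumptions, not the authors' exact code.

```python
import torch

def stability_loss(logits_clip, logits_dino, conf_thresh=0.95):
    """The less confident student is pulled toward the more confident one, per pixel."""
    p_clip = torch.softmax(logits_clip, dim=1)          # (B, C, H, W)
    p_dino = torch.softmax(logits_dino, dim=1)
    conf_clip, _ = p_clip.max(dim=1)                    # (B, H, W)
    conf_dino, _ = p_dino.max(dim=1)

    # Only pixels where at least one student is confident take part.
    confident = (torch.maximum(conf_clip, conf_dino) >= conf_thresh).float()
    clip_leads = (conf_clip >= conf_dino).float() * confident
    dino_leads = (conf_dino > conf_clip).float() * confident

    # The leader's distribution is treated as a fixed target (detach = no gradient).
    to_dino = ((p_dino - p_clip.detach()) ** 2).mean(dim=1) * clip_leads
    to_clip = ((p_clip - p_dino.detach()) ** 2).mean(dim=1) * dino_leads
    return (to_dino + to_clip).sum() / confident.sum().clamp(min=1.0)

loss = stability_loss(torch.randn(2, 6, 64, 64), torch.randn(2, 6, 64, 64))
```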

Step 7: Total training objective

šŸž Hook: It’s like grades from homework (labeled data), self-check quizzes (consistency), and peer review (stability) all combining into your final score.

🄬 Concept: Combined loss (supervised + consistency + stability).

  • What: A weighted sum of supervised loss on labeled pixels, consistency loss for each student, and the stability loss tying both students together.
  • How: Use cross-entropy for labeled data; masked cross-entropy for consistency; MSE for stability; apply a cosine ramp-up for stability weight.
  • Why: Without balancing these parts, the model can either overfit to the small labels or get led astray by noisy pseudo-labels.

šŸž Anchor: Early on, supervised and light consistency lead; as training stabilizes, the stability link strengthens to align both students.
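A minimal sketch of how the three loss terms could be combined, with a cosine ramp-up on the stability weight as described; the specific weights and ramp length are illustrative assumptions.

```python
import math

def stability_weight(step, ramp_steps=4000, w_max=1.0):
    """Cosine ramp-up: starts near 0 and smoothly reaches w_max at ramp_steps."""
    t = min(step / ramp_steps, 1.0)
    return w_max * 0.5 * (1.0 - math.cos(math.pi * t))

def total_loss(loss_sup, loss_cons_clip, loss_cons_dino, loss_stab, step):
    # Supervised cross-entropy on labeled pixels + each student's consistency
    # term + the ramped stability term that couples the two students.
    return loss_sup + loss_cons_clip + loss_cons_dino + stability_weight(step) * loss_stab

print(stability_weight(0), stability_weight(2000), stability_weight(4000))  # 0.0, 0.5, 1.0
```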

Concrete Data Example:

  • Suppose a pixel near a river bank: CLIP is 0.90 ā€œwater,ā€ DINOv3 is 0.96 ā€œwater.ā€ Since DINOv3 is more confident and above threshold, its distribution supervises CLIP there (stability). For a broader region over the river surface, CLIP may be 0.98 ā€œwaterā€ vs. DINOv3 at 0.88; CLIP then supervises DINOv3 there. Across augmentations, each student must keep labeling the same pixels consistently (consistency).
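Plugging those numbers into the arbitration rule (a toy sketch using the 0.95 threshold from Step 5):

```python
def who_teaches(conf_clip, conf_dino, thresh=0.95):
    """Decide which student supervises the other for one pixel."""
    if max(conf_clip, conf_dino) < thresh:
        return "neither (both too unsure, pixel skipped)"
    return "CLIP teaches DINOv3" if conf_clip >= conf_dino else "DINOv3 teaches CLIP"

print(who_teaches(0.90, 0.96))  # river-bank pixel    -> DINOv3 teaches CLIP
print(who_teaches(0.98, 0.88))  # river-surface pixel -> CLIP teaches DINOv3
```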

Secret Sauce (what makes Co2S clever):

  • Heterogeneous priors: Language-aligned global semantics (CLIP) plus detail-rich self-supervised features (DINOv3) reduce correlated mistakes.
  • Pixel-wise arbitration: Confidence decides whose signal to trust per pixel, preventing lock-in of early errors.
  • Dual guidance: Explicit (text) anchors ensure semantic correctness; implicit (queries) adapt to dataset specifics for sharp boundaries.
  • Gentle coupling: A ramped stability loss ties students together only when they’re ready, improving robustness.

04Experiments & Results

šŸž Top Bread (Hook): Think of a science fair where many teams bring their best maps. The judges score how well each pixel is colored correctly—higher scores mean cleaner, smarter maps.

🄬 The Test (What they measured and why):

  • What it is: Mean Intersection over Union (mIoU) measures how much the predicted area of a class overlaps the true area (higher is better). It’s like grading how well your colored region matches the correct outline.
  • Why it matters: In remote sensing, accurate area shapes and boundaries are crucial—mIoU captures both correctness and coverage.

šŸž Anchor: If the model says a big region is ā€œwater,ā€ mIoU checks how much that region truly is water vs. where it’s wrong.

The Competition (Who it was compared against):

  • Supervised baseline (OnlySup): trained only on labeled data.
  • General semi-supervised: FixMatch, UniMatch, U2PL.
  • Remote-sensing-specific: WSCL, DWL, MUCA.

Datasets (Diverse challenges):

  • WHDLD: 6 classes at 2 m resolution.
  • LoveDA: Urban and rural scenes at 0.3 m.
  • Potsdam: Ultra-high-resolution aerial imagery (5 cm).
  • GID-15: 15 fine-grained classes over huge areas.
  • MER & MSL: Mars surface segmentation with class imbalance.

The Scoreboard (with context):

  • WHDLD (1/24 labels): Co2S hits 61.1% mIoU, beating UniMatch by 3.7%. That’s like moving from a solid B to an A- when others plateau.
  • LoveDA (1/40 labels): Co2S reaches 58.2% mIoU, a 12.3% jump over OnlySup. That’s a big lift in a very small-label regime.
  • Potsdam (1/32 labels): Co2S gets 74.3% mIoU, matching or surpassing specialized methods like DWL at low-label splits and scaling to 80%+ as labels increase.
  • GID-15 (1/8 labels): Co2S reaches 75.4% mIoU vs. UniMatch at 73.9%, handling many fine-grained classes well.
  • MER & MSL (1/8 to 1/4 labels): Co2S leads with up to 60.9% (MER) and 65.9% (MSL), showing strength in highly imbalanced, extraterrestrial scenes.

Surprising (and important) findings:

  • Early pseudo-label stability: On WHDLD (1/24), Co2S’s pseudo-label accuracy surges past 95% in the first epoch and stays steady. Competing methods plateau around 80–88% and wobble more. This shows Co2S quickly locks onto reliable supervision.
  • Visual quality: Co2S reduces semantic confusion (e.g., impervious surface vs. clutter) and sharpens tiny structures (e.g., cars, small rocks) better than baselines.

Ablation Highlights (what matters most):

  • Explicit vs. implicit guidance: Using only learnable queries (implicit) from scratch can underperform early due to few labels. Using only text embeddings (explicit) helps a lot. Using both is best (61.09% mIoU on WHDLD 1/24).
  • Heterogeneity helps: Two DINOv3 students (local+local) perform poorly (45.20% mIoU), two CLIP students (global+global) do better (60.78%) but lack complementarity—mixing CLIP + DINOv3 wins (61.09%).
  • Loss components: Adding consistency to supervised improves results; adding stability on top improves further—each piece contributes.
  • Swapping DINOv3 for other self-supervised models (MAE, BEiTv2, iBOT, SimMIM) still beats CLIP+CLIP; among the self-supervised backbones tested, DINOv3 performs best.

What this means in practice:

  • Co2S shines when labeled data is extremely scarce; the advantage grows as labels get fewer.
  • The method generalizes across Earth datasets and even Mars imagery, showing robustness to domain shifts and class imbalance.
  • The global-local fusion and dual guidance are the engine behind both semantic correctness and boundary precision.

05Discussion & Limitations

šŸž Top Bread (Hook): Even the best study plan has trade-offs—you still need good books, time, and the right tools.

🄬 Limitations (be specific):

  • Prompt quality: Explicit guidance depends on well-chosen text prompts and concept sets; poor prompts reduce CLIP’s anchoring power.
  • Sparse-label cold start: Learnable queries start random and may need time (or enough labels) to stabilize; the stability ramp helps, but very extreme scarcity can still slow convergence.
  • Threshold sensitivity: Confidence thresholds (e.g., 0.95) balance quality vs. coverage; too high wastes data, too low invites noise.
  • Compute and memory: Two ViT-B/16 students plus multiple views require modern GPUs and careful batching.
  • Domain gaps: If imagery differs a lot from CLIP’s pretraining world (e.g., unusual sensors), text anchors may need adaptation.

Required Resources:

  • Hardware: 1–2 high-memory GPUs (e.g., RTX 3090 or similar), mixed-precision recommended.
  • Data: Small labeled subset plus large unlabeled pool; class names and concept lists for prompts.
  • Software: ViT backbones, CLIP text encoder, DINOv3 (or another self-supervised model), standard augmentations.

When NOT to Use:

  • When you have abundant pixel-level labels (full supervision may be simpler and faster).
  • When classes cannot be described well by text (weakens explicit guidance), and no reliable self-supervised prior is available.
  • When compute is too limited for dual-student and multi-view training.

Open Questions:

  • Better prompts: Can automatic prompt search or large language models craft stronger, domain-specific text anchors?
  • Adaptive thresholds: Can we learn per-class or per-region confidence thresholds to use more unlabeled data safely?
  • More modalities: How does adding SAR, multispectral, or elevation improve global-local fusion?
  • Continual learning: Can Co2S adapt online as new unlabeled streams arrive without forgetting?
  • Smaller backbones: How to keep the stability benefits with lightweight models for edge devices?

šŸž Bottom Bread (Anchor): Think of Co2S as two teammates who need good instructions (prompts), a fair referee (confidence rules), and enough practice time (compute). With those in place, they’re very hard to beat—especially when examples are scarce.

06Conclusion & Future Work

Three-Sentence Summary:

  • Co2S pairs a language-savvy global student (CLIP) with a detail-focused local student (DINOv3) and lets them co-teach each pixel based on confidence, stopping pseudo-label drift in semi-supervised segmentation.
  • Explicit text embeddings anchor semantics, learnable queries adapt to local textures, and a stability loss fuses their strengths while consistency losses steady each student.
  • Across six benchmarks, Co2S delivers leading mIoU, with the biggest advantage when labels are scarcest.

Main Achievement:

  • A stable, heterogeneous dual-student framework that synergizes vision-language and self-supervised priors to maintain semantic correctness and sharp boundaries under extreme label scarcity.

Future Directions:

  • Automate and personalize prompts; learn adaptive confidence thresholds; expand to multimodal sensors (e.g., multispectral, SAR); and compress models for edge deployment.
  • Explore continual and open-vocabulary settings where new classes appear without extra labels.

Why Remember This:

  • Co2S shows that combining complementary priors (global semantics + local details) with pixel-wise confidence arbitration is a powerful recipe to prevent error snowballing in semi-supervised learning.
  • It turns a common weakness—pseudo-label drift—into a strength by making two different students hold each other to a higher standard.

Practical Applications

  • Rapid flood extent mapping from fresh satellite images using minimal labeled samples.
  • City infrastructure mapping (roads, buildings, parking lots) with reduced annotation effort.
  • Agricultural monitoring (fields, crops, irrigation) where labeled data is scarce.
  • Forest and biodiversity mapping to track deforestation or habitat changes efficiently.
  • Disaster damage assessment (collapsed buildings, blocked roads) shortly after events.
  • Coastline and water resource monitoring with sharper boundaries and fewer labels.
  • Large-area land-cover updates for national mapping agencies on tight budgets.
  • Planetary surface analysis (e.g., Mars) where labeled examples are extremely limited.
  • Change detection pipelines that benefit from stable pseudo-labels over time.
  • Bootstrapping segmentation for new sensors or regions by leveraging CLIP text anchors.
Tags: semi-supervised segmentation Ā· remote sensing Ā· pseudo-label drift Ā· CLIP Ā· DINOv3 Ā· vision-language model Ā· self-supervised learning Ā· global-local fusion Ā· consistency regularization Ā· learnable queries Ā· text embeddings Ā· dual-student architecture Ā· mIoU Ā· ViT-B/16 Ā· confidence thresholding