
C-RADIOv4 (Tech Report)

Intermediate
Mike Ranzinger, Greg Heinrich, Collin McCarthy et al. Ā· 1/24/2026
arXiv Ā· PDF

Key Summary

  • C-RADIOv4 is a single vision model that learns from several expert models at once and keeps their best skills while staying fast.
  • It uses updated teachers (SigLIP2, DINOv3, SAM3) and new training tricks to work well at many image sizes without getting confused.
  • A shift-equivariant loss makes the student copy true image details while ignoring each teacher’s fixed noise patterns.
  • A balanced summary loss uses angles to keep one teacher from shouting louder than the others, so the student learns fairly from all.
  • Stochastic resolutions (randomly training at many sizes) give smooth resolution scaling and better low-resolution accuracy.
  • ViTDet-mode lets most attention be in windows, making high-resolution inference much faster with only tiny quality tradeoffs.
  • On ImageNet-1K, C-RADIOv4 matches or beats DINOv3 in kNN accuracy from 256px and up, despite being far smaller.
  • On ADE20k, it keeps improving as resolution rises, showing robust any-resolution behavior.
  • It can replace SAM3’s vision encoder, staying competitive on natural images and even fixing a tricky 'person' query case.
  • The SO400M version is small, fast, permissively licensed, and useful for real products and research.

Why This Research Matters

Real products see images in many sizes, from phone snapshots to 4K frames, so a backbone that stays steady across resolutions cuts bugs and costs. By learning fairly from multiple expert teachers, C-RADIOv4 delivers both strong text-image alignment and crisp dense perception in one model. Shift-equivariant training helps ignore teacher-specific artifacts, improving reliability in safety-critical uses like robotics and driving. ViTDet-mode makes high-resolution processing much faster, enabling near real-time segmentation and analysis. Because it can replace SAM3’s encoder and is permissively licensed, teams can drop it into existing systems and immediately benefit. The smaller SO400M variant brings strong performance to tighter compute budgets. Overall, it helps unify fragmented vision stacks into one dependable, efficient backbone.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine a superhero team where each hero has a special power—one can read labels, one can see shapes in the dark, and one can trace perfect outlines. Wouldn’t it be amazing if one new hero could learn the best parts from all three and use those skills anywhere, from a tiny phone screen to a huge billboard?

🄬 The World Before: Vision AI models were great, but usually at only one or two things. Some were awesome at matching pictures to text (like knowing that this picture is 'a red sneaker'), others excelled at dense perception (like finding object boundaries for every pixel), and some specialized in cutting out objects precisely (segmentation). When we tried to combine these skills, the usual way was to let a 'student' model copy from a 'teacher' model. That worked, but it copied only one teacher at a time, and often at a single fixed image size.

šŸž Anchor: Think of a student practicing only with the math teacher—they’ll do fine on algebra but might stumble in geometry or science.

šŸž Hook: You know how zooming a photo on your phone can make the picture look better or worse, depending on the app? Models had the same trouble.

🄬 The Problem: When models trained at one or two resolutions saw different-sized images at test time, they sometimes 'mode switched'—their behavior changed unpredictably. A model might be confident at 384px but confused at 768px. This broke consistency and made real-world use risky, because image sizes vary a lot.

šŸž Anchor: It’s like wearing shoes that only fit when your feet are exactly one size—take a step and suddenly the shoes don’t fit.

šŸž Hook: Imagine coloring a picture with markers that sometimes leave dots on the edges, no matter what you draw. Those dots aren’t part of the picture—they’re just marker quirks.

🄬 Failed Attempts: Early agglomerative models distilled features from multiple teachers, but they also copied each teacher’s 'fixed-pattern noise'—weird speckles near borders or windows that had nothing to do with the image itself. Balancing teachers helped for spatial features (PHI-S), and training at two resolutions reduced 'mode switching,' but some quirks lingered, and summaries (the 'headline' tokens) weren’t fairly balanced.

šŸž Anchor: The student wasn’t just learning the lesson; they were copying the teacher’s smudged handwriting too.

šŸž Hook: Picture three coaches: one for language-and-vision (SigLIP2), one for dense scene understanding (DINOv3), and one for concept-driven segmentation (SAM3). Upgrade the coaches; upgrade the player.

🄬 The Gap: We needed a student that could (1) learn from stronger, newer teachers, (2) stay steady across many resolutions, (3) ignore teacher-specific noise, (4) balance summary signals so no teacher dominates, and (5) run fast at high resolution. Previous versions got parts of this right but not all together.

šŸž Anchor: We wanted one Swiss Army knife that actually works well in every mode, not a bundle of fragile tools.

šŸž Hook: Think of everyday moments—your camera app tagging photos, a robot finding toys on the floor, or a doctor’s tool outlining organs in a scan.

🄬 Real Stakes: In real life, images come in all sizes and settings. A consistent, any-resolution model saves memory, latency, and engineering time. If it can also slot into big systems like SAM3 and run faster with windowed attention, teams can build better products—from AR glasses to self-driving cars—without juggling many separate models.

šŸž Anchor: With C-RADIOv4, you can use one reliable vision backbone across phones, robots, and cloud servers and expect strong, steady behavior.

02 Core Idea

šŸž Hook: You know how a great student listens to three teachers, practices at different zoom levels, and focuses on the lesson—not the chalk squeaks? That’s the idea here.

🄬 The Aha Moment (one sentence): Teach one student to copy the best, task-specific skills from several expert teachers at many resolutions while ignoring each teacher’s fixed quirks, using angle-balanced summaries and shift-equivariant feature matching.

šŸž Anchor: It’s like building a champion who learns language labels from Coach SigLIP2, pixel-perfect shapes from Coach DINOv3, and concept-guided outlining from Coach SAM3—then plays well on any-sized field.

šŸž Hook (Analogy 1—Cooking): Imagine tasting three chefs’ dishes, learning each signature spice, and then cooking a new dish that blends the best flavors without copying any kitchen noise like a buzzing fridge.

🄬 Analogy 2—Maps: You study a city using a street map (text-image), a terrain map (dense features), and a landmark map (segmentation). You learn the city itself, not the printing artifacts on any single map.

🄬 Analogy 3—Music: Three musicians teach you rhythm, melody, and harmony. You train with songs played at slow, medium, and fast tempos. You learn the music, not the metronome ticks.

šŸž Anchor: The result is a musician who can switch tempos smoothly and still play the same song beautifully.

🄬 Before vs After:

  • Before: Multi-teacher students sometimes changed personality with image size, copied teacher noise, and let one teacher dominate summaries.
  • After: C-RADIOv4 samples many resolutions during training, aligns teacher-student features with random shifts to kill fixed-pattern copying, and balances summary angles so no teacher wins unfairly. It also speeds up high-res inference with ViTDet-mode.

šŸž Anchor: Previously, the student could ace the test only at one desk height; now, it aces it sitting anywhere.

🄬 Why It Works (intuition):

  • Random shifts make it impossible to match location-bound artifacts; the student must learn true content.
  • Balancing angles (not just magnitudes) means the student cares about being directionally right relative to each teacher’s usual spread, so teachers with naturally wider cones don’t overwhelm the loss.
  • Training across many sizes builds a smooth skill curve; the model doesn’t flip modes when you zoom in or out.
  • Windowed attention reduces the quadratic pain of global attention at high resolution while keeping a few global layers for scene-wide context.

šŸž Anchor: It’s like practicing soccer on fields of many sizes, focusing on the ball (content), not the field’s chalk patterns (noise), and using sprints in local zones with occasional full-field scans to stay fast and smart.

🄬 Building Blocks (each with a mini Sandwich):

  • šŸž You know how learning from three teachers gives a rounder education? 🄬 Multi-Teacher Distillation: One student copies signals from SigLIP2 (text-aligned), DINOv3 (dense semantics), and SAM3 (concept segmentation). How: run all teachers, align features, compute balanced losses, update student. Why: single teachers are narrow; you want their combined strengths. šŸž Example: The student learns 'bike' from text-image, 'wheel edges' from dense features, and 'bike mask' from SAM3.
  • šŸž Imagine zooming a camera from tiny to huge. 🄬 Resolution Scaling + Stochastic Resolutions: Train by randomly picking from many low and high sizes so the model behaves smoothly across resolutions. Why: fixed sizes cause mode switching. šŸž Example: Train at 192, 384, 768, 1024px; test at 1536px still works great.
  • šŸž Picture sliding a photo slightly; the content stays the same. 🄬 Shift Equivariance (Loss + MESA): Randomly shift crops for student and each teacher independently; only align overlapping regions. Also, match the EMA student with a different crop. Why: prevents copying fixed border/window artifacts. šŸž Example: The student can’t memorize a border speckle because it moves every time.
  • šŸž Think of sharing the mic fairly among speakers. 🄬 Balanced Summary Loss: Use angle-based loss normalized by each teacher’s angular dispersion so wider cones don’t shout louder. Why: cosine alone lets some teachers dominate. šŸž Example: DINOv3’s wide cone no longer drowns out SigLIP2’s summaries.
  • šŸž Like sharpening a photo the right way. 🄬 FeatSharp Upsampling: For fixed-resolution teachers (e.g., SigLIP2 at 384px), upsample features with FeatSharp, not naive bilinear. Why: preserves crisp features for high-res training. šŸž Example: Upsampled edges look clean, so the student learns correct boundaries.
  • šŸž Like practicing on a grid to run faster. 🄬 ViTDet-Mode: Use mostly windowed attention with a few global layers; faster at high resolution with minimal quality drop. Why: global attention is expensive. šŸž Example: 4096px images run far faster with similar accuracy.

03 Methodology

At a high level: Images at many sizes → run teachers and student with independent random shifts → align overlapping features → compute two kinds of losses (spatial feature loss and angle-balanced summary loss) plus a shift-equivariant MESA regularizer → train with DAMP weight noise → export a single student backbone that supports global or ViTDet windowed attention.
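Before the step-by-step recipe, here is a minimal, self-contained sketch of what one distillation step in this style could look like. Treat it as an illustration under stated assumptions rather than the paper's implementation: the tiny stand-in backbones, the 50/50 low/high resolution split, and the plain MSE/cosine losses are placeholders for the real SigLIP2/DINOv3/SAM3 teachers (each with its own adapter head), PHI-S normalization, shift-equivariant alignment, and angle-balanced summary loss described in the steps below.

```python
# Minimal sketch of one multi-teacher distillation step (illustrative only).
# The tiny backbones, loss forms, and sampling ratio below are stand-ins; the
# real recipe uses SigLIP2/DINOv3/SAM3 teachers with per-teacher adapters,
# PHI-S normalization, shift alignment, MESA, and DAMP (see the recipe below).
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 16
LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]


class TinyBackbone(nn.Module):
    """Stand-in ViT-like encoder returning (summary token, spatial tokens)."""

    def __init__(self, dim=64):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, x):
        grid = self.patchify(x)                    # (B, C, H/16, W/16)
        spatial = grid.flatten(2).transpose(1, 2)  # (B, N, C) patch tokens
        summary = spatial.mean(dim=1)              # (B, C) "headline" token
        return summary, spatial


student = TinyBackbone()
teachers = {"siglip2": TinyBackbone(), "dinov3": TinyBackbone(), "sam3": TinyBackbone()}
for t in teachers.values():
    t.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)


def train_step(images):
    # Stochastic resolution: draw a size from the low or high set each step
    # (the 50/50 split here is an arbitrary placeholder).
    res = random.choice(LOW_RES if random.random() < 0.5 else HIGH_RES)
    x = F.interpolate(images, size=(res, res), mode="bilinear", align_corners=False)

    s_summary, s_spatial = student(x)
    loss = x.new_zeros(())
    for name, teacher in teachers.items():
        with torch.no_grad():
            t_summary, t_spatial = teacher(x)
        # Spatial loss: the paper additionally shifts student and teacher crops
        # independently and compares only overlapping tokens (steps 3-4 below).
        loss = loss + F.mse_loss(s_spatial, t_spatial)
        # Summary loss: the paper uses an angle-based, dispersion-balanced form
        # (step 5 below); plain cosine distance is only a placeholder here.
        loss = loss + (1.0 - F.cosine_similarity(s_summary, t_summary, dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage: one step on a random batch of RGB images.
print(train_step(torch.rand(2, 3, 256, 256)))
```

The point of the sketch is the shape of the loop: sample a resolution, run every teacher on the same resized image, sum one spatial loss and one summary loss per teacher, and take a single optimizer step.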

Step-by-step recipe:

  1. Choose and prepare teachers
  • What: Use SigLIP2-g-384 (text-image), DINOv3-7B (dense SSL), SAM3 (concept segmentation). For SAM3’s fixed 1152Ɨ1152 input, use mosaic augmentation.
  • Why: Each teacher brings complementary strengths; SAM3 compatibility unlocks segmentation head reuse.
  • Example: For a stadium photo, SigLIP2 helps label 'bike/helmet/person,' DINOv3 clarifies boundaries and surfaces, SAM3 guides object masks.
  2. Stochastic resolution sampling
  • What: Randomly pick a resolution from low set {128, 192, 224, 256, 384, 432} or high set {512, 768, 1024, 1152} each step. Use aspect-preserving resize. For SigLIP2 at high-res training, apply FeatSharp 3Ɨ upsampling from 384px to 1152px; use raw outputs in low-res. SAM3 gets 1152 with mosaic.
  • Why: Builds smooth behavior across sizes and boosts low-res quality; prevents mode switching.
  • Example: The same zoo image might be seen at 192px on one batch and 1024px on another.
  3. Independent random shifts and alignment
  • What: For every image, crop/shift the student and each teacher differently (by patch multiples). Compute a mapping F_S→T to align student tokens to the teacher’s coordinates, and only compare overlapping positions.
  • Why: Forces the student to match content, not fixed-position noise like border speckles or window artifacts.
  • Example: If the teacher’s border has a ghost line, shifting ensures it won’t line up, so copying it won’t reduce loss.
  4. Spatial feature loss (shift-equivariant; sketched in code after this recipe, together with MESA and DAMP)
  • What: Normalize teacher features with PHI-S. Compute L_spatial by comparing aligned spatial tokens over the overlap set Ī© after applying F_S→T to the student features.
  • Why: Matches dense semantics while removing the ability to memorize fixed noise.
  • Example: On a street scene, road and sidewalk tokens align; fake high-energy blobs at borders don’t consistently align and thus don’t help the loss.
  5. Summary loss with angular dispersion balancing (a code sketch appears after the 'secret sauce' list below)
  • What: Instead of plain cosine, compute the angle Θ(x, y) between student and teacher summaries and divide by that teacher’s dispersion Disp(Θ_y) (its typical cone radius). Aggregate across teachers.
  • Why: Normalizes how spread-out each teacher’s summaries are; prevents teachers with wider cones (like DINOv3) from dominating the loss.
  • Example: If DINOv3’s summary vectors vary widely, their angles are scaled down so SigLIP2’s tighter cone gets fair weight.
  6. Shift-equivariant MESA regularization
  • What: Maintain an EMA (slow-moving average) of the student. Compute a second loss by comparing the current student to its EMA, but again with different crops and an alignment transform F_S→Ŝ. Apply LayerNorm before comparison.
  • Why: Encourages flat, stable optima and suppresses fixed-pattern artifacts from creeping into the backbone or adapters.
  • Example: The student and its EMA must agree even when the crop shifts, so brittle, position-locked tricks are penalized.
  7. DAMP weight noise
  • What: During training, multiply weights by small random factors (DAMP) to improve robustness.
  • Why: Makes the model less sensitive to small corruptions and distribution shifts.
  • Example: Slightly perturbed filters still produce correct predictions, like a driver staying steady on a bumpy road.
  8. Any-resolution support with ViTDet-mode (a windowed-attention code sketch appears after the 'Mini Sandwiches' list below)
  • What: Export the student with a switch: full global attention or windowed attention (e.g., window sizes 6Ɨ6 to 32Ɨ32 tokens) plus a few global layers. Choose via a vitdet_window_size flag.
  • Why: At high resolutions, windowed attention is much faster; keeping some global layers preserves scene context.
  • Example: At 4096px, window size 8 or 16 runs far faster than pure global attention for near-identical quality.
  9. Heads and evaluation
  • What: Provide a SigLIP2-aligned head for zero-shot classification and use frozen features for kNN evaluations. For segmentation, plug the backbone into SAM3’s decoder stack.
  • Why: Measures both open-vocabulary recognition and dense understanding; demonstrates drop-in replacement for SAM3.
  • Example: Zero-shot ImageNet at 1024px peaks; ADE20k linear probe improves with resolution; SA-Co/Gold instance segmentation is competitive on natural images.
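To make steps 3, 4, 6, and 7 concrete, here is a hedged sketch in which the student-to-teacher mapping F_S→T is reduced to an integer translation in patch units and plain MSE stands in for the paper's PHI-S-normalized comparison. The helper names (overlap_feature_loss, apply_damp_) are invented for illustration and are not from the released code.

```python
# Sketch of shift-equivariant alignment (steps 3-4), MESA (step 6), and DAMP
# (step 7). Helper names and the plain-MSE losses are illustrative assumptions.
import torch
import torch.nn.functional as F


def overlap_feature_loss(a_feats, b_feats, a_off, b_off):
    """Compare two (B, H, W, C) token grids taken from independently shifted
    crops of the same image. a_off / b_off are (row, col) crop offsets in patch
    units; only tokens inside the shared region Omega are compared."""
    B, H, W, C = a_feats.shape
    top, left = max(a_off[0], b_off[0]), max(a_off[1], b_off[1])
    bottom = min(a_off[0] + H, b_off[0] + H)
    right = min(a_off[1] + W, b_off[1] + W)
    if bottom <= top or right <= left:
        return a_feats.new_zeros(())  # no overlap, nothing to compare
    a_roi = a_feats[:, top - a_off[0]:bottom - a_off[0], left - a_off[1]:right - a_off[1]]
    b_roi = b_feats[:, top - b_off[0]:bottom - b_off[0], left - b_off[1]:right - b_off[1]]
    return F.mse_loss(a_roi, b_roi)


# Spatial distillation loss over the overlap (steps 3-4).
student_feats = torch.randn(2, 24, 24, 64, requires_grad=True)
teacher_feats = torch.randn(2, 24, 24, 64)
l_spatial = overlap_feature_loss(student_feats, teacher_feats, a_off=(3, 0), b_off=(0, 2))

# Shift-equivariant MESA (step 6): agree with the EMA student under a
# different crop, after LayerNorm.
ema_feats = torch.randn(2, 24, 24, 64)  # would come from the EMA copy of the student
l_mesa = overlap_feature_loss(
    F.layer_norm(student_feats, (64,)), F.layer_norm(ema_feats, (64,)),
    a_off=(1, 4), b_off=(5, 0),
)


# DAMP-style multiplicative weight noise (step 7), applied to a module in place.
def apply_damp_(module, sigma=0.01):
    """Multiply each weight by a small random factor (no gradient tracking)."""
    with torch.no_grad():
        for p in module.parameters():
            p.mul_(1.0 + sigma * torch.randn_like(p))


layer = torch.nn.Linear(64, 64)
apply_damp_(layer, sigma=0.01)
print(float(l_spatial), float(l_mesa))
```

Because the two crops land at different global positions every time, a border speckle or window seam in a teacher's features never lines up consistently inside the overlap, so copying it cannot reliably lower the loss; the same overlap trick is what makes the MESA term shift-equivariant too.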

The secret sauce:

  • Random, independent shifts for student and each teacher + alignment: you can’t memorize artifacts you can’t consistently line up.
  • Angle-balanced summaries: fairness between teachers with different summary spreads.
  • Stochastic resolutions: a smooth, zoom-proof skill curve.
  • ViTDet-mode: practical speed-ups that keep quality high.
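Here is one concrete reading of the angle-balanced summary loss from step 5, sketched under assumptions: the teacher's angular dispersion Disp(Θ_y) is approximated by the mean pairwise angle between that teacher's summaries in the current batch, which may not match the paper's exact estimator or aggregation.

```python
# Sketch of an angle-based, dispersion-balanced summary loss (step 5).
# The dispersion estimate (mean pairwise angle within the batch) is an
# assumption for illustration; the paper defines its own Disp(Theta_y).
import torch
import torch.nn.functional as F

EPS = 1e-6


def angle(a, b):
    """Angle in radians between matching rows of a and b, shape (B, D) -> (B,)."""
    cos = F.cosine_similarity(a, b, dim=-1).clamp(-1 + EPS, 1 - EPS)
    return torch.arccos(cos)


def angular_dispersion(teacher_summaries):
    """Rough spread of a teacher's summary cone: mean pairwise angle in the batch."""
    z = F.normalize(teacher_summaries, dim=-1)
    cos = (z @ z.T).clamp(-1 + EPS, 1 - EPS)
    theta = torch.arccos(cos)
    off_diag = theta[~torch.eye(len(z), dtype=torch.bool)]  # drop self-angles
    return off_diag.mean().clamp_min(EPS)


def balanced_summary_loss(student_summary, teacher_summaries):
    """Average over teachers of angle(student, teacher) / teacher dispersion,
    so teachers with naturally wider cones do not dominate the objective."""
    total = 0.0
    for t_summary in teacher_summaries.values():
        disp = angular_dispersion(t_summary).detach()
        total = total + (angle(student_summary, t_summary) / disp).mean()
    return total / len(teacher_summaries)


# Usage with random stand-in summaries for three teachers.
student = torch.randn(8, 512, requires_grad=True)
teachers = {"siglip2": torch.randn(8, 512), "dinov3": torch.randn(8, 512), "sam3": torch.randn(8, 512)}
print(balanced_summary_loss(student, teachers).item())
```

Dividing by each teacher's dispersion puts the angles into comparable units, so a teacher whose summaries naturally spread over a wide cone (like DINOv3) contributes on the same footing as one with a tight cone.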

Mini Sandwiches for key ingredients:

  • šŸž You know how enlarging a photo can blur details unless you use a smart algorithm? 🄬 FeatSharp: A feature-aware upsampler that keeps edges crisp when lifting SigLIP2 features from 384px to 1152px; without it, student learns fuzzier boundaries. šŸž Example: Shoe laces stay sharp, not smeared.
  • šŸž Think of a calm driver who doesn’t overreact to pebbles on the road. 🄬 MESA + DAMP: Together, they stabilize training and improve robustness; without them, the model might chase tiny artifacts or overfit. šŸž Example: Predictions don’t wobble when the image shifts a little.
  • šŸž Grids make big tasks manageable. 🄬 ViTDet-mode: Local windows speed things up; a few global layers keep the big picture; without it, high-res is slow. šŸž Example: 4K images run much faster on an A100 with small windows, with negligible accuracy loss.

04 Experiments & Results

The tests and why they matter:

  • Zero-shot classification on ImageNet-1K: checks open-vocabulary recognition using text prompts without task-specific training.
  • kNN classification: probes raw feature quality and clustering, comparable across backbones (used by DINOv2/3); a minimal sketch of the protocol follows this list.
  • ADE20k linear probe: tests dense semantics across resolutions.
  • Probe3D: evaluates 3D awareness (depth, surface normals, NAVI, SPair) from 2D features.
  • SA-Co/Gold (SAM3 benchmark): tests instance segmentation via SAM3 decoder with RADIO as the vision backbone.
  • Latency vs resolution: measures real-world speed with and without ViTDet windows on A100 GPUs.
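The kNN protocol is essentially: embed the training and test images with the frozen backbone, then label each test embedding by a vote among its nearest training neighbors. Below is a minimal cosine-similarity sketch with random stand-in embeddings and an unweighted majority vote; the DINOv2/v3-style evaluation weights the neighbor votes and sweeps k, so treat this only as the shape of the computation.

```python
# Minimal cosine-similarity kNN classification over frozen embeddings.
# Random tensors stand in for real backbone features; the DINO-style protocol
# additionally weights neighbor votes, which is omitted here.
import torch
import torch.nn.functional as F


def knn_predict(train_emb, train_labels, test_emb, k=20, num_classes=1000):
    train_emb = F.normalize(train_emb, dim=-1)
    test_emb = F.normalize(test_emb, dim=-1)
    sims = test_emb @ train_emb.T                      # (n_test, n_train) cosine similarities
    _, idx = sims.topk(k, dim=-1)                      # indices of the k nearest neighbors
    votes = train_labels[idx]                          # (n_test, k) neighbor labels
    counts = F.one_hot(votes, num_classes).sum(dim=1)  # votes per class
    return counts.argmax(dim=-1)                       # unweighted majority vote


# Usage with random stand-ins for frozen-backbone embeddings.
train_emb, train_labels = torch.randn(5000, 768), torch.randint(0, 1000, (5000,))
test_emb, test_labels = torch.randn(100, 768), torch.randint(0, 1000, (100,))
pred = knn_predict(train_emb, train_labels, test_emb)
print("accuracy:", (pred == test_labels).float().mean().item())
```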

Competition and baselines:

  • Prior RADIO models (v2.5, v3) and state-of-the-art DINOv2/DINOv3 families.
  • For segmentation replacement, compare using SAM3’s own encoder vs. RADIO as a drop-in.

Scoreboard with context:

  • ImageNet-1K zero-shot vs resolution: C-RADIOv4-H now peaks at 1024px and is notably better at low resolutions than earlier RADIO generations. That’s like getting consistent B+/A- across all quiz sizes instead of A at one size and C at another.
  • kNN vs DINOv3: Starting at 256px, C-RADIOv4-H matches or beats DINOv3 in kNN accuracy despite having about an order of magnitude fewer parameters than DINOv3-7B. That’s like running as fast as a heavyweight sprinter while wearing much lighter shoes.
  • ADE20k linear probe scaling: C-RADIOv4-H improves steadily from 512→1536px (e.g., 55.20→57.72), mirroring DINOv3’s strong scaling. This shows true any-resolution behavior, even beyond trained sizes.
  • Probe3D: C-RADIOv4-H improves over prior RADIOs on several metrics (e.g., SPair and NAVI), indicating better geometric awareness. Think of recognizing both 'what it is' and 'how it sits in 3D space.'
  • SAM3 replacement on SA-Co/Gold: RADIO is second-best overall. It’s close to SAM3 on natural-image splits ('metaclip_nps' and 'sa1b_nps') but leaves a larger gap on 'fg_sports_equipment' and 'wiki_common.' That means RADIO is already strong where images look like everyday photos, with room to grow in niche or synthetic-like domains.
  • Curious 'person' case: The public SAM3 demo struggles with the 'person' text query, but with RADIO as the encoder, masks appear correctly—even in ViTDet-mode. This hints that RADIO’s representations can cross certain thresholds cleanly.
  • Speed gains with ViTDet-mode: For SO400M, window size ≤12 is faster than SAM3’s ViT-L+ encoder. ViT-H with window size 8 is nearly as fast. Across 256–4096px, ViTDet-mode drastically reduces latency growth. That’s like finishing the same homework in half the time, even when the pages are poster-sized.

Surprising findings:

  • DINOv3-H+ sometimes outperforms 7B on kNN, suggesting that bigger isn’t always better for raw clustering metrics.
  • C-RADIOv4’s big low-res gains imply that stochastic resolutions and shift-equivariant losses reduce the old 'mode-switching' more than expected.
  • The 'person' query fix suggests subtle representation differences can unlock decoder thresholds in concept-driven segmentation.

05 Discussion & Limitations

Limitations:

  • Uneven SAM3 replacement: RADIO lags more on specific SA-Co/Gold splits (e.g., 'fg_sports_equipment', 'wiki_common'), showing that some concept domains aren’t matched as tightly yet.
  • Residual noise risk: While shift-equivariance helps a lot, some teacher quirks can still leak into adapters or backbones in edge cases.
  • Window tradeoffs: Smaller ViTDet windows may slightly reduce quality, and hardware may shrink or erase speed gaps between similar window sizes.
  • Resource needs: Training with multiple large teachers (e.g., DINOv3-7B, SAM3) and stochastic resolutions is compute- and memory-heavy.

Required resources:

  • Multi-GPU setup with high memory for teacher-student distillation at varied resolutions; storage for teacher checkpoints; acceleration libraries for ViTDet-mode.

When not to use:

  • Ultra-specialized domains with unique artifacts (e.g., medical imaging modalities not represented in teachers) where a dedicated, task-specific model may outperform a generalist backbone.
  • Extreme real-time constraints on edge devices where even SO400M may be too heavy; a smaller distilled student might be needed.

Open questions:

  • How to further reduce the SAM3 gap on challenging splits—better concept alignment, domain augmentations, or decoder-aware distillation?
  • Can dispersion balancing be extended temporally for video models, or across multiple summary tokens per scale?
  • Are there smarter shift strategies (e.g., random rotations/perspective) that help without harming alignment?
  • Can we learn an adaptive window schedule that picks window sizes by content or resolution dynamically?
  • How do these techniques generalize to multimodal video-language tasks and 3D vision backbones?

06 Conclusion & Future Work

Three-sentence summary: C-RADIOv4 is a single vision backbone that learns from multiple top teachers (SigLIP2, DINOv3, SAM3), stays stable across many image sizes, and runs fast at high resolutions. It achieves this with shift-equivariant feature matching, angle-balanced summary losses, stochastic resolutions, and an optional ViTDet-mode for efficiency. The result is strong zero-shot and dense performance that competes with much larger models, plus drop-in replacement for SAM3’s encoder.

Main achievement: Unifying diverse teacher skills into one compact, any-resolution model that avoids teacher noise and balances their influence, while delivering practical speed-ups via windowed attention.

Future directions: Improve SAM3 replacement on harder domains, extend dispersion balancing and shift strategies, explore adaptive windowing, and broaden to video and 3D-native tasks. Investigate even smaller students with similar quality and more robust text alignment across languages.

Why remember this: C-RADIOv4 shows that with the right training recipe—fair multi-teacher learning, noise-proof alignment, and smart resolution practice—one model can be versatile, accurate, and fast enough for real-world use, from research labs to products.

Practical Applications

  • Use C-RADIOv4 as a drop-in encoder for SAM3 to accelerate high-resolution instance segmentation with minimal quality loss.
  • Power open-vocabulary image tagging and search with the SigLIP2-aligned head for zero-shot classification.
  • Deploy on robots and drones for stable perception across changing camera resolutions and viewpoints.
  • Enable efficient 4K document OCR pipelines by switching to ViTDet-mode for faster processing.
  • Improve AR/VR scene understanding by leveraging strong dense features for segmentation and surface cues.
  • Use as a backbone for vision-language models (VLMs) to boost text-image alignment and dense grounding.
  • Apply to autonomous driving stacks for robust multi-scale perception without juggling multiple backbones.
  • Accelerate satellite and aerial imagery analysis by combining windowed attention with any-resolution support.
  • Run kNN-based analytics on embeddings for fast category discovery or anomaly detection without classifier training.
  • Build compact, production-ready systems using the SO400M model to balance accuracy, speed, and cost.
Tags: C-RADIOv4, agglomerative vision models, multi-teacher distillation, SigLIP2, DINOv3, SAM3, shift equivariant loss, balanced summary loss, MESA, DAMP, ViTDet, FeatSharp, any-resolution support, zero-shot classification, ADE20k linear probe
Version: 1