Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Intermediate
Matthew Gwilliam, Xiao Wang, Xuefeng Hu et al. · 1/20/2026
arXiv · PDF

Key Summary

  • This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
  • HUVR is a hyper-network that predicts the weights of many small per-patch image functions (INRs) so it can rebuild images quickly and accurately.
  • It adds a new global token (like a summary token) and cleverly uses patch tokens to control per-patch INRs, keeping both spatial detail and whole-image meaning.
  • HUVR produces standard-size embeddings and also very small compressed ones called Tiny Tokens (TinToks) for speed and storage savings.
  • By learning from a strong teacher model (knowledge distillation), HUVR gains high-level understanding while still being great at pixel-level reconstruction.
  • On ImageNet, HUVR matches or beats DINOv3 for classification and improves ADE20K segmentation mIoU while also giving much better reconstruction PSNR.
  • TinToks at 96× compression far outperform PCA-compressed baselines for recognition and reconstruction, and even work as latents for diffusion image generation.
  • A new patch-wise hyper-network design and global-token modulation deliver state-of-the-art INR hyper-network reconstruction with far fewer training epochs.
  • The method unifies recognition and generation, and even unifies across token sizes, enabling one encoder to serve many tasks and devices.

Why This Research Matters

HUVR means one vision model can serve many jobs at once: understand what’s in an image, reconstruct it crisply, and even help generate new ones. This reduces the need to juggle multiple specialized models, saving compute, cost, and engineering complexity. Tiny Tokens make large-scale search, recommendation, and on-device AI far more efficient without throwing away useful meaning. For creative tools, the same features that classify can also drive high-quality photo editing, restoration, and synthesis. In robotics or AR, compact yet semantic embeddings enable faster perception on limited hardware. Over time, unifying recognition and generation into a single encoder could simplify AI pipelines across industry, research, and consumer apps.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how some people are great at recognizing faces in photos, while others are amazing at drawing pictures from memory? Imagine trying to find someone who can do both really well—spot the face and also sketch it back perfectly.

🥬 Filling (The Actual Concept: The World Before) What it is: Before this paper, computer vision models were mostly split into two camps: recognition models (that understand images) and generation models (that draw or reconstruct images). How it works (before):

  1. Recognition models (like DINOv3, SigLIP) turn images into feature vectors for tasks like classification, detection, or segmentation.
  2. Generation models (like VAEs used by diffusion) turn images into latents that are great for reconstructing or synthesizing pictures. Why it matters: Using two different worlds means two training pipelines, two model families, and no single set of features that works excellently for both understanding and drawing.

🍞 Bottom Bread (Anchor): If you wanted a photo app that both sorts your pictures by what’s inside and also redraws or edits them cleanly, you usually needed two separate models. That’s clunky and expensive.

🍞 Top Bread (Hook): Imagine packing for a trip: a big suitcase for everything versus a tiny backpack for just the essentials. Different tasks need different sizes of “packing.”

🥬 Filling (The Problem) What it is: Even when we get good features for one task, the right “token size” (how big the embeddings are) differs across tasks and devices. How it works (current pain):

  1. Large embeddings are powerful but slow and costly for search or mobile.
  2. Super-small embeddings are fast but often lose detail, hurting accuracy or reconstruction. Why it matters: At web scale (billions of images), shaving off just a few numbers per image can save tons of compute and memory.

🍞 Bottom Bread (Anchor): A giant image search system needs tiny, fast fingerprints for images; a desktop photo editor may prefer richer features. Ideally, one encoder should give both.
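
To see what “a few numbers per image” means at web scale, here is a quick back-of-the-envelope calculation. It is only an illustration: the one-billion-image corpus, float32 storage, and one stored vector per image are assumptions, while the 768 and 32 widths come from the ViT-B tokens and TinToks discussed later.

```python
# Illustrative storage arithmetic (assumptions: 1 billion images, float32,
# one global embedding vector stored per image).
NUM_IMAGES = 1_000_000_000
BYTES_PER_FLOAT = 4  # float32

def storage_gb(dim: int) -> float:
    """Total storage in gigabytes for one `dim`-dimensional vector per image."""
    return NUM_IMAGES * dim * BYTES_PER_FLOAT / 1e9

print(f"768-d standard embeddings: {storage_gb(768):,.0f} GB")  # ~3,072 GB
print(f" 32-d TinTok embeddings:   {storage_gb(32):,.0f} GB")   # ~128 GB
```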

🍞 Top Bread (Hook): Have you ever tried to glue two finished puzzles together? It kind of works, but the seams show.

🥬 Filling (Failed Attempts) What it is: Prior unification attempts usually stick a recognition encoder behind a generator (or vice versa) after the fact. How it works: They pretrain a model for one job, then adapt for the other later. Why it matters: Mismatched goals and formats limit quality; you don’t get a truly shared, native representation.

🍞 Bottom Bread (Anchor): It’s like teaching someone to sprint first and asking them to later become a ballet dancer. Possible, but not natural.

🍞 Top Bread (Hook): Imagine a sketchbook where each page can transform into detailed artwork when you shine the right light on it.

🥬 Filling (The Gap) What it is: We needed one native encoder that: (1) learns features useful from pixels up to semantics, (2) reconstructs images well, and (3) outputs both big and tiny tokens. How it works: That would require a design that keeps patch-level detail, global meaning, and compressibility. Why it matters: A single model becomes a foundation for classification, segmentation, depth, retrieval, compression, and generation.

🍞 Bottom Bread (Anchor): One upload to the cloud gives you both a compact search key and a reconstruction-friendly code—no second model needed.

🍞 Top Bread (Hook): Think of a choir that can sing quiet lullabies and booming opera with the same voices—just arranged differently.

🥬 Filling (Why This Paper Exists) What it is: This paper presents HUVR, a hyper-network for implicit neural representation (INR) that unifies recognition and generation. How it works:

  1. Use a Vision Transformer (ViT) to encode images into tokens.
  2. Create very small Tiny Tokens (TinToks) via learnable downsampling.
  3. Predict patch-wise INR weights so each patch can be reconstructed quickly.
  4. Distill knowledge from a strong teacher (like DINOv3) so features carry high-level meaning. Why it matters: Now a single encoder natively supports classification, segmentation, depth estimation, reconstruction, and even acts as a latent space for diffusion.

🍞 Bottom Bread (Anchor): On ImageNet and ADE20K, HUVR matches or beats recognition baselines while also giving stronger reconstruction than VAE latents of similar size—showing true unification.

02 Core Idea

🍞 Top Bread (Hook): You know how a really good map helps you both find where you are and also figure out how to get somewhere new? It helps you see and plan at the same time.

🥬 Filling (Aha! in one sentence) What it is: The key insight is to make a single encoder that learns to both understand and reconstruct images by predicting the weights of small, per-patch image functions (INRs), while also producing compact Tiny Tokens—and to teach it good semantics via distillation from a strong teacher. How it works (high-level):

  1. Encode the image into tokens with a ViT.
  2. Shrink them into TinToks (tiny embeddings) with learnable downsampling.
  3. Use a decoder and projections to form modulation matrices that adjust a shared base INR per patch.
  4. Reconstruct the image patches and align tokens to a teacher’s features. Why it matters: Without it, you get either great recognition or great reconstruction—but not both, and not in tiny token sizes.

🍞 Bottom Bread (Anchor): HUVR’s TinToks let a search engine store tiny, fast keys while a graphics app uses the same model to faithfully reconstruct or edit images.

Multiple Analogies:

  • Toolbelt analogy: Before, you needed two toolbelts—one for understanding (recognition) and one for building (generation). HUVR is a single smart tool that switches modes.
  • Orchestra analogy: Patch tokens play instruments for local details; a global token conducts the theme; together they perform both the notes (pixels) and the song meaning (semantics).
  • Recipe analogy: The base INR is the house recipe; patch tokens and the global token are spices and timing; modulating them cooks each patch just right.

Before vs After:

  • Before: Recognition encoders ignored pixel-accurate reconstruction; generators lacked strong semantics for dense tasks; compression usually harmed understanding.
  • After: One encoder outputs both standard-size features and TinToks; it reconstructs well, classifies well, segments well, and feeds diffusion reasonably—unified by design.

Why it Works (intuition, no equations):

  • INRs compress in two ways: (1) each image becomes a small set of numbers; (2) a shared base INR captures what’s common across images. That double compression encourages features that carry both pixel detail and semantics.
  • Predicting per-patch INRs keeps spatial alignment, so dense tasks (segmentation, depth) don’t lose where-things-are.
  • A global token lets the model summarize the whole image and also helps form modulation matrices with patch tokens—linking global meaning to local detail.
  • Distilling from a strong teacher injects mature semantic structure, so TinToks stay meaningful even when tiny.

Building Blocks (each as Sandwich explanations):

🍞 Hook: Imagine a coloring book page that can come to life if you know the secret formula for each point on the page. 🥬 Implicit Neural Representation (INR):

  • What: An INR is a tiny neural network that takes a pixel coordinate and outputs a color.
  • How:
    1. Take a pixel’s (x, y) position.
    2. Run it through a small MLP with positional encoding.
    3. Predict the RGB color.
  • Why: Without INRs, you need to store every pixel directly. INRs let you store a small function instead—great for compression and reconstruction.

🍞 Anchor: Instead of saving a 256×256 image as 65,536 colors, we save a few thousand numbers that a network uses to redraw the whole picture.
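
To make the idea concrete, here is a minimal INR sketch in PyTorch: a pixel coordinate goes through a positional encoding and a small MLP, and RGB comes out. The class name, hidden width, and number of frequencies are illustrative choices, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Minimal coordinate-to-color network: (x, y) -> (r, g, b). Illustrative only."""
    def __init__(self, hidden: int = 64, num_freqs: int = 8):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 * 2 * num_freqs  # sin and cos for x and y at each frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def positional_encoding(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [0, 1] -> (N, 4 * num_freqs)
        freqs = 2.0 ** torch.arange(self.num_freqs, device=coords.device) * torch.pi
        angles = coords.unsqueeze(-1) * freqs                   # (N, 2, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.positional_encoding(coords))

# Query every pixel of a 256×256 grid to redraw an image from the stored network.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 256), torch.linspace(0, 1, 256), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # (65536, 2)
rgb = TinyINR()(coords).reshape(256, 256, 3)           # reconstructed image
```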

🍞 Hook: Think of a chef who doesn’t cook the dish but writes the recipe that another chef follows. 🥬 Hyper-Networks:

  • What: A hyper-network takes an image and outputs the weights for another small network (the INR).
  • How:
    1. Learn a shared base INR on the whole dataset.
    2. For each new image, predict small “modulation” weights.
    3. Combine base + modulation to get a per-image (or per-patch) INR that reconstructs quickly.
  • Why: Without hyper-networks, you’d train a new INR from scratch for every image—too slow.

🍞 Anchor: Instead of baking a cake from zero each time, the chef tweaks a master batter with a few add-ins to match today’s flavor.
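
Here is a minimal hyper-network sketch under assumed shapes: an image embedding is mapped to a modulation that rescales one layer of a shared base INR. The names (`HyperHead`, `base_w2`) and sizes are stand-ins for illustration, not the paper’s code.

```python
import torch
import torch.nn as nn

D_EMB = 768  # image embedding width (e.g., one ViT token), assumed
D_HID = 64   # hidden width of the shared base INR layer, assumed

class HyperHead(nn.Module):
    """Predicts a per-image modulation for one layer of the shared base INR."""
    def __init__(self):
        super().__init__()
        self.to_modulation = nn.Linear(D_EMB, D_HID * D_HID)

    def forward(self, image_emb: torch.Tensor, base_w2: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D_EMB); base_w2: (D_HID, D_HID), shared across the dataset
        m = self.to_modulation(image_emb).view(-1, D_HID, D_HID)
        return base_w2.unsqueeze(0) * m  # per-image weights = base * modulation

base_w2 = nn.Parameter(torch.randn(D_HID, D_HID) * 0.02)  # learned once, reused everywhere

# One forward pass gives each new image its own specialized INR weights,
# instead of optimizing an INR from scratch per image.
image_emb = torch.randn(4, D_EMB)               # a batch of 4 image embeddings
per_image_w2 = HyperHead()(image_emb, base_w2)  # (4, 64, 64)
```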

🍞 Hook: Picture a librarian who can file books by chapter (patches) and also write a summary card for each book. 🥬 Vision Transformers (ViT), Patch and Global Tokens:

  • What: A ViT splits an image into patch tokens and can keep a global summary token.
  • How:
    1. Break the image into patches.
    2. Turn patches into tokens through attention.
    3. Keep a global token to summarize the whole image.
  • Why: Without patch tokens, you lose where-things-are; without a global token, you lose the big picture.

🍞 Anchor: For segmentation, patch tokens keep spatial detail; for classification, the global token makes the final call.
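
A minimal sketch of this token layout, using the 256×256 image and 16×16 patches from the running example; the 768-dim width and the tiny `nn.TransformerEncoder` are stand-ins for a real ViT, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 256, 256)                            # one RGB image
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # 16×16 patches -> 768-d tokens

patch_tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 256, 768): 256 patch tokens
global_token = nn.Parameter(torch.zeros(1, 1, 768))          # learnable whole-image summary token
tokens = torch.cat([global_token.expand(1, -1, -1), patch_tokens], dim=1)  # (1, 257, 768)

# Attention lets every token see every other token, so the global token can summarize
# the scene while patch tokens keep track of where things are.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,  # tiny for illustration; a real ViT-B uses 12 layers
)
encoded = encoder(tokens)
global_out, patch_out = encoded[:, 0], encoded[:, 1:]  # for classification vs. dense tasks
```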

🍞 Hook: Imagine squishing a big pillow into a tiny travel pouch—still comfy when you open it up. 🥬 Tiny Tokens (TinToks):

  • What: TinToks are very small embeddings that still keep useful meaning for recognition and reconstruction.
  • How:
    1. Learnable layers downsample standard tokens into TinToks.
    2. A decoder upsamples and refines them when needed.
    3. Projections turn tokens into modulation matrices for INRs.
  • Why: Without TinToks, large-scale search and on-device AI would be too slow and heavy.

🍞 Anchor: A 32-number TinTok can still help classify and even reconstruct surprisingly well, beating PCA and even matching or beating VAE latents at similar sizes.
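
A minimal sketch of the down/up pair, using the 768 → 32 sizes from the example above; the single-linear-layer design here is an assumption for illustration.

```python
import torch
import torch.nn as nn

D_STD, D_TINY = 768, 32            # standard token width vs. TinTok width (example sizes)

down = nn.Linear(D_STD, D_TINY)    # learnable compression into TinToks
up = nn.Linear(D_TINY, D_STD)      # learnable expansion back for the decoder

tokens = torch.randn(1, 257, D_STD)   # global + 256 patch tokens from the encoder
tintoks = down(tokens)                # (1, 257, 32): what you would store or index
recovered = up(tintoks)               # (1, 257, 768): handed to the decoder for reconstruction

# Each token shrinks by 768 / 32 = 24× here; the pair is trained end-to-end so the tiny
# codes stay useful for both recognition (distillation loss) and reconstruction (pixel loss).
print(tintoks.shape, recovered.shape)
```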

🍞 Hook: Think of a smart student learning tips from a top teacher to study faster and understand deeper. 🥬 Knowledge Distillation:

  • What: Distillation teaches our model to match features from a strong teacher (like DINOv3).
  • How:
    1. Compare our encoder and decoder token features to the teacher’s.
    2. Use losses to nudge them closer.
    3. Separate heads for global and patch tokens keep both views sharp.
  • Why: Without distillation, INRs might be great at pixels but weak at semantics; distillation balances both.

🍞 Anchor: After distillation, HUVR’s tokens line up with DINOv3’s structure, helping it score A-grades on classification while staying great at reconstruction.
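
A minimal sketch of the feature-matching loss, assuming the teacher’s features are precomputed and frozen; the separate heads for global and patch tokens mirror the description above, but the widths and loss weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_STUDENT, D_TEACHER = 768, 1024   # e.g., ViT-B student, ViT-L teacher (example widths)

# Separate projection heads so global and patch tokens are matched independently.
head_global = nn.Linear(D_STUDENT, D_TEACHER)
head_patch = nn.Linear(D_STUDENT, D_TEACHER)

def distillation_loss(student_global, student_patches, teacher_global, teacher_patches,
                      w_global: float = 1.0, w_patch: float = 1.0) -> torch.Tensor:
    """Weighted MSE between projected student tokens and frozen teacher tokens."""
    loss_g = F.mse_loss(head_global(student_global), teacher_global)
    loss_p = F.mse_loss(head_patch(student_patches), teacher_patches)
    return w_global * loss_g + w_patch * loss_p

# Example shapes: batch of 4 images, 256 patches each.
s_g, s_p = torch.randn(4, D_STUDENT), torch.randn(4, 256, D_STUDENT)
t_g, t_p = torch.randn(4, D_TEACHER), torch.randn(4, 256, D_TEACHER)  # teacher outputs
loss = distillation_loss(s_g, s_p, t_g, t_p)
```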

03 Methodology

At a high level: input image → ViT encoder (standard tokens) → downsample to TinToks → upsample and refine in a transformer decoder → project to INR modulation matrices (global × patch) → per-patch INR reconstruction → losses (pixel, optional SSIM/LPIPS, and distillation to the teacher at the encoder and decoder).

Step-by-step recipe (with why and examples):

  1. Input and Patching
  • What happens: Take an RGB image (e.g., 256×256). Split it into fixed-size patches (e.g., 16×16), giving 256 patches.
  • Why this exists: Patches preserve where-things-are, crucial for segmentation and depth.
  • Example: A cat photo becomes 256 mini-tiles, each turned into a token.
  2. ViT Encoding (Standard Tokens)
  • What happens: A ViT with Rotary Positional Embeddings encodes all patch tokens and a global token. The output token dimension might be 768 (ViT-B) or 1024 (ViT-L).
  • Why: Attention lets tokens share context; the global token captures whole-image meaning.
  • Example: The global token learns “this is a cat on a couch,” while patch tokens learn fur, eyes, and cushion textures.
  3. Downsample to TinToks (Tiny Tokens)
  • What happens: A learnable linear layer reduces each token from a big dimension (e.g., 768) to a tiny one (e.g., 32). This creates TinToks for both patch and global tokens.
  • Why: Tiny embeddings make search, storage, and on-device inference fast.
  • Example: Each 768-number patch vector becomes just 32 numbers—like a zip file for meaning.
  4. Decoder Refinement
  • What happens: A learnable layer upsamples the TinToks back to a larger width, and a transformer decoder (a few layers) refines them to help reconstruction quality. Final projections map decoder tokens to two sizes: d_in for patch tokens and d_out for the global token.
  • Why: The decoder gives TinToks a chance to breathe back into richer features for accurate reconstruction.
  • Example: After decoding, the model can recover finer fur strands in the cat’s patch.
  5. Global × Patch Modulation to Build INR Matrices
  • What happens: Project the global token to size d_out and each patch token to size d_in. Multiply global × patch^T to get a modulation matrix M for that patch.
  • Why: Without this step, we can’t turn tokens into the exact shapes needed to adjust the INR weights.
  • Example: If d_in=256 and d_out=256, global(256) × patch(256)^T gives a 256×256 M that tweaks the INR’s second layer for that patch (see the code sketch after this list).
  6. Base INR and Patch-wise Weight Modulation
  • What happens: Keep a shared base INR (tiny MLP+upsample). For each patch, modulate only the INR’s second layer weights with M (copy first and last layers from base). Now you have a per-patch INR specialized for that patch.
  • Why: Modulating one strategic layer is enough for high quality with far fewer parameters; patch-wise keeps spatial alignment.
  • Example: The “eye patch” INR learns glossy highlights; the “couch patch” INR learns woven texture.
  7. Per-Patch Reconstruction and Stitching
  • What happens: Each patch INR predicts RGB for its strided coordinates (e.g., stride=4 for speed). A small conv+PixelShuffle upsamples to the patch’s full resolution. Concatenate all patches back into the full image.
  • Why: This lets the model reconstruct the image with high fidelity while staying efficient.
  • Example: The cat’s eye patch outputs sharper reflections; all patches together rebuild the full 256×256 image.
  8. Losses: Pixel + Optional Perceptual + Distillation
  • What happens:
    • Pixel loss: MSE between the input and reconstructed image.
    • Optional perceptual losses: SSIM and LPIPS to better match human visual quality.
    • Distillation: Align encoder and decoder tokens (global and patches) to a teacher (e.g., DINOv3) via separate linear heads and weighted losses.
  • Why: Pixel-level loss teaches drawing; distillation teaches understanding; both are needed for a unified model.
  • Example: If reconstruction is blurry, SSIM/LPIPS help; if classification is weak, increasing global-token distillation helps.
  9. Outputs
  • What happens: The model can output standard-size tokens for maximum accuracy, or TinToks for compactness. It can reconstruct images, power linear-probe classification, segmentation, depth estimation, and serve as diffusion latents.
  • Why: Different apps and hardware need different trade-offs.
  • Example: Cloud search uses TinToks for speed; photo editing uses standard tokens for detail.
  10. Secret Sauce (what makes it clever)
  • Patch tokens as weight tokens: reuse the tokens that already know where-things-are to control per-patch INRs—no waste.
  • Global token modulation: a clean way to make d_in × d_out matrices from two vectors, and a natural CLS token for recognition.
  • TinToks: learned compression that stays meaningful for both recognition and reconstruction.
  • Distillation at encoder and decoder, with separate heads for global and patch tokens: injects semantics while keeping pixels sharp.
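
To tie steps 3 through 7 together, here is a hedged end-to-end sketch of the modulation path: TinToks are expanded back up, projected into global and patch vectors, combined by an outer product into one matrix M per patch, and M rescales the second layer of a shared base INR before each patch is rendered. All sizes, layer choices, and names (proj_in, proj_out, the hidden widths) are illustrative stand-ins, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

B, P = 2, 256           # batch size, patches per image (a 16×16 grid of 16-pixel patches)
D_STD, D_TINY = 768, 32
D_IN, D_OUT = 256, 256  # shape of the modulated INR layer, as in the running example
COORD_DIM = 32          # width of the encoded (x, y) coordinate features, assumed

# Placeholder encoder outputs standing in for the ViT tokens.
global_tok = torch.randn(B, D_STD)
patch_toks = torch.randn(B, P, D_STD)

# Steps 3-4: compress to TinToks, then expand again before prediction
# (a transformer decoder would refine these; omitted here for brevity).
down, up = nn.Linear(D_STD, D_TINY), nn.Linear(D_TINY, D_STD)
global_tok, patch_toks = up(down(global_tok)), up(down(patch_toks))

# Step 5: project, then take an outer product to get one modulation matrix per patch.
proj_out = nn.Linear(D_STD, D_OUT)      # applied to the global token
proj_in = nn.Linear(D_STD, D_IN)        # applied to every patch token
g = proj_out(global_tok)                # (B, D_OUT)
p = proj_in(patch_toks)                 # (B, P, D_IN)
M = torch.einsum("bpi,bo->bpio", p, g)  # (B, P, D_IN, D_OUT): per-patch modulation

# Step 6: shared base INR; only its second layer is rescaled per patch.
base_w1 = nn.Parameter(torch.randn(COORD_DIM, D_IN) * 0.02)  # copied as-is
base_w2 = nn.Parameter(torch.randn(D_IN, D_OUT) * 0.02)      # modulated below
base_w3 = nn.Parameter(torch.randn(D_OUT, 3) * 0.02)         # copied as-is (RGB head)
w2 = base_w2 * M                                             # (B, P, D_IN, D_OUT)

# Step 7: every patch INR predicts RGB at strided coordinates, here 4×4 points per
# 16-pixel patch (stride 4); a learned conv + PixelShuffle would then restore full
# resolution before the patches are stitched back into the image.
coords = torch.randn(16, COORD_DIM)                          # encoded (x, y) query points
h = torch.relu(torch.einsum("nc,ci->ni", coords, base_w1))   # (16, D_IN), shared layer 1
h = torch.relu(torch.einsum("ni,bpio->bpno", h, w2))         # (B, P, 16, D_OUT), modulated layer 2
rgb = torch.einsum("bpno,oc->bpnc", h, base_w3)              # (B, P, 16, 3)
print(rgb.shape)  # torch.Size([2, 256, 16, 3])
```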

Mini Sandwich blocks for special mechanics:

🍞 Hook: Think of a dimmer switch that brightens or softens just the right part of a room. 🥬 Modulation Matrix (M):

  • What: A matrix made from global × patch token that scales a specific INR layer’s weights.
  • How: Project global to d_out and patch to d_in; multiply to get M (d_in × d_out); elementwise-multiply M with that INR layer.
  • Why: Without M, tokens can’t precisely control the INR layer to match each patch.

🍞 Anchor: The couch patch’s M tells the INR, “Boost the texture filters here; soften the color transitions there.”

🍞 Hook: Like reading with a magnifying glass to catch tiny details your eyes might miss. 🥬 Distillation Loss:

  • What: A loss that nudges our tokens to match a teacher model’s features.
  • How: Compare our encoder and decoder outputs (global/patch) to a teacher’s; apply weighted MSE losses.
  • Why: Without this, we’d get great pixels but weaker semantics; with it, we balance both.

🍞 Anchor: The model learns “cat-ness” from the teacher while still mastering shiny eyes and soft fur through reconstruction.

04 Experiments & Results

The Tests (what they measured and why):

  • Recognition: Linear-probe classification on ImageNet-1k (original and ReaL labels), ObjectNet (robustness), plus fine-grained datasets (Cars, CUB, DTD, Flowers, Food).
  • Dense tasks: ADE20K semantic segmentation and NYUv2 depth (to test patch-level spatial understanding).
  • Reconstruction: PSNR, SSIM, and LPIPS on ImageNet validation images (never seen during training).
  • Generation: Replace Stable Diffusion’s VAE latents with HUVR TinToks in a DiT and compare FID/IS/Precision/Recall.
  • Hyper-network comparisons: ImageNette, CelebA, LSUN Churches to test INR hyper-network quality and training efficiency.

The Competition:

  • DINOv3 (self-distillation for strong recognition and dense features)
  • C-RADIOv3 and SigLIP 2 (state-of-the-art recognition/text-image setups)
  • SD VAE (standard latents for diffusion models)
  • Prior INR hyper-networks (TransINR, IPC, LA-IPC, ANR)

Scoreboard with Context:

  • Standard-size recognition (ViT-B/16): HUVR hits about 85.0% Top-1 on ImageNet (vs. DINOv3 ~84.6%), which is like nudging from an A to a slightly higher A while also adding new talents (reconstruction) for free.
  • Dense tasks (ViT-B/16 ADE20K): HUVR ~52.0 mIoU vs. DINOv3 ~50.8 mIoU—think of it as seeing a few more puzzle pieces’ boundaries correctly.
  • Reconstruction: With TinToks, HUVR outperforms a Stable Diffusion VAE of similar embedding size by about +1.26 PSNR, and with standard settings it shows even larger reported gains over baselines (e.g., +4.84 PSNR vs. the SD VAE at equal size).
  • TinToks (compression): At 96× compression, HUVR’s TinToks dramatically beat PCA-compressed DINOv3 features on ImageNet linear-probe (e.g., +48% absolute improvement in some tiny-dimension regimes). That’s like cramming a big textbook into a pamphlet and still acing the quiz.
  • Generation with DiT: Replacing VAE latents with TinToks yields promising but not SOTA FID/IS yet. Larger TinToks (e.g., d_t=256) help, and longer training improves quality.
  • INR Hyper-network SOTA: On ImageNette/LSUN/CelebA, HUVR’s patch-wise modulation achieves top PSNR with far fewer training epochs than prior works—like finishing the race first even with fewer practice laps.

Surprising or Noteworthy Findings:

  • Patch-wise design is a game-changer: Switching from image-wise to patch-wise INR prediction jumps PSNR massively (e.g., to >51 dB in ablations) before final trade-offs.
  • Global token helps twice: It boosts reconstruction and also provides a clean CLS token for recognition.
  • Distillation trade-offs: Distilling to both global and patch tokens improves classification/segmentation but can reduce reconstruction slightly. The team counters this by training longer and using stronger teachers (ViT-L/H).
  • Decoder attention trade-off: Removing attention can boost TinTok classification with less compute, but hurts reconstruction—an explicit speed–quality dial.
  • Teacher size crossover: Larger teachers (ViT-L/H) can look worse at very few epochs but win clearly with longer training.

Takeaways:

  • Unified tokens: HUVR’s standard tokens rival state-of-the-art recognition. TinToks stay useful even at extreme compression, a rare feat.
  • Unified tasks: The same encoder supervises classification, segmentation, depth estimation, reconstruction, and can feed diffusion.
  • Unified training: Pixel-level learning plus feature distillation pulls both ends (drawing and understanding) toward a balanced, strong center.

05 Discussion & Limitations

Limitations (honest view):

  • Scale: Pretraining wasn’t as massive or curated as some competitors (e.g., SigLIP 2’s huge data), so there are niches where others still edge out HUVR.
  • VLM alignment: HUVR is text-free; to plug into vision-language models, it would need text-aligned pretraining.
  • Generation not SOTA yet: While promising as diffusion latents, TinToks don’t beat the best VAEs today; specialized tuning is likely needed.
  • Trade-offs exist: Distillation settings, decoder attention, and TinTok size shift accuracy vs. reconstruction vs. speed—no single setting wins every metric.

Required Resources:

  • A solid GPU setup (or cluster) for ViT-B/L pretraining (tens of epochs over large image sets).
  • Teacher checkpoints (e.g., DINOv3-L/H) for distillation.
  • Storage for datasets like DataComp and ImageNet22k.

When NOT to Use:

  • If you only need text-aligned features for VLM tasks without any reconstruction needs, a specialized CLIP/SigLIP family model may be simpler.
  • If you require the absolute best diffusion generation quality today with minimal engineering, standard VAE latents remain a safer bet.
  • Extremely tight real-time constraints without room for any decoder or per-patch INR step might prefer a pure recognition encoder.

Open Questions:

  • Best practices for diffusion with TinToks: How to exploit the global/patch token structure in DiTs for bigger gains?
  • Multi-teacher distillation: Can mixing teachers (DINOv3, RADIO, SigLIP) remove trade-offs across tasks?
  • Token-size scheduling: Can we adapt token size per image or task on the fly for optimal compute-quality trade-offs?
  • Beyond images: How does the patch-wise INR hyper-network extend to video and 3D while keeping TinToks compact and semantic?

06 Conclusion & Future Work

Three-sentence summary: HUVR is a hyper-networked vision encoder that predicts per-patch INRs and learns via both pixel reconstruction and feature distillation, unifying recognition and generation in a single model. It outputs standard tokens for maximum accuracy and Tiny Tokens for extreme compression, all while maintaining strong semantics and high-fidelity reconstruction. Experiments show competitive or superior recognition and dense-task performance alongside markedly improved reconstruction, and even promising diffusion results using TinToks as latents.

Main Achievement: Demonstrating that a single, natively unified encoder can excel at both understanding and rebuilding images—and do so across token sizes—by combining patch-wise INR modulation, a global token, TinToks, and targeted distillation.

Future Directions:

  • Tighten diffusion quality by tailoring DiT architectures and schedules to HUVR’s global+patch token structure.
  • Explore multi-teacher distillation for balanced excellence across recognition, dense tasks, and reconstruction.
  • Extend HUVR to video (temporal TinToks) and 3D, preserving compactness and semantics.
  • Add text alignment to plug into VLM ecosystems without losing generative strengths.

Why Remember This: HUVR shows that “see” and “draw” don’t have to live in separate worlds—and that one encoder can serve phones, servers, and artists alike with features that scale from tiny to rich, from pixels to semantics.

Practical Applications

  • Image search at web scale using TinToks for fast retrieval with low storage.
  • On-device photo organization and classification with tiny, power-friendly embeddings.
  • High-quality image reconstruction and compression for cloud backup and streaming.
  • Semantic photo editing: precise object selection (segmentation) plus faithful reconstruction.
  • Cold-start for diffusion: use TinToks as latents to initialize generation or editing workflows.
  • AR/VR perception on edge devices: compact tokens for real-time scene understanding.
  • Robotics vision: unified features for recognition, depth cues, and reconstruction on limited compute.
  • Content moderation and deduplication at scale with compact yet semantic codes.
  • Medical or satellite image triage: quick screening with tiny tokens, detailed follow-up with standard tokens.
  • Dataset curation: reconstructable embeddings let you verify and visualize samples without storing raw images.
#Implicit Neural Representation#Hyper-Networks#Vision Transformer#Tiny Tokens#Knowledge Distillation#Image Reconstruction#Semantic Segmentation#Linear Probing#Compressed Embeddings#Diffusion Transformers#Universal Vision Encoding#Patch-wise Modulation#PSNR SSIM LPIPS#DINOv3 Distillation#Unified Recognition and Generation
Version: 1