Grounding and Enhancing Informativeness and Utility in Dataset Distillation
Key Summary
- This paper tackles dataset distillation by giving a clear, math-backed way to keep only the most useful bits of data, so models can learn well from far fewer images.
- It introduces two key ideas: Informativeness (the most important parts inside each image) and Utility (which whole images matter most for training).
- To find informative parts, it uses Shapley Values from game theory, which fairly score each patch of an image for how much it helps the model's prediction.
- To pick high-utility images, it uses Gradient Norms, which upper-bound how much a sample can change the model's learning steps.
- The method, called InfoUtil, first crops the most informative patches, then keeps the most useful samples, and finally reconstructs full images with soft labels from a teacher model.
- InfoUtil is both interpretable (you can see and justify what it kept) and efficient (it runs on a single A100 GPU and avoids huge memory needs).
- On ImageNet-1K with ResNet-18, InfoUtil improves accuracy by 6.1 percentage points over the previous best at 1 image per class, and also wins across many other datasets.
- Ablations show both parts matter: Shapley-based cropping and gradient-norm selection each help, and together they help the most.
- Using a slightly noisy selection of patches keeps diversity high, which turns out to be crucial, especially at larger image-per-class settings.
- The approach generalizes across architectures and even helps in continual learning, suggesting the distilled data is broadly useful.
Why This Research Matters
Training powerful models usually takes lots of data, time, and energy, which many teams don't have. This work shows how to shrink datasets to a tiny, well-chosen core while keeping strong accuracy, reducing costs and carbon footprint. It also makes the choices interpretable: you can point to the exact patches and samples and explain why they were kept. That transparency helps in regulated fields like healthcare or finance, where clear reasons matter. Because the distilled data still works across different model architectures and in continual learning, it's broadly useful. Overall, InfoUtil turns small-data training from a gamble into a reliable, theory-backed practice.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine packing for a long trip with only a tiny backpack. You can't bring everything, so you choose only the items that give you the most value for the least space.
The Concept (Deep Learning): What it is: Deep learning is a way for computers to learn patterns (like recognizing cats) by practicing on many examples. How it works:
- Show the model many labeled examples.
- It makes a guess and measures error.
- It adjusts its inner knobs (weights) to do better next time.
- Repeat until it gets good.
Why it matters: Without lots of practice data, deep learning models struggle to learn well.
Anchor: When your phone recognizes your face, it's because a deep learning model learned patterns from many face photos.
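To make the "guess, measure error, adjust" loop concrete, here is a minimal PyTorch sketch; the toy model, fake data, learning rate, and step count are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the "guess, measure error, adjust weights" loop (illustrative only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Fake batch standing in for real labeled examples.
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

for step in range(100):                 # repeat until it gets good
    logits = model(images)              # the model makes a guess
    loss = loss_fn(logits, labels)      # measure the error
    optimizer.zero_grad()
    loss.backward()                     # compute how to adjust the weights
    optimizer.step()                    # nudge the inner knobs (weights)
```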
Hook: You know how you can make a fruit smoothie that tastes like the whole fruit bowl but takes up less space?
The Concept (Dataset Distillation): What it is: Dataset distillation makes a tiny, super-concentrated version of a big dataset that still teaches a model nearly as well. How it works:
- Start with a big dataset.
- Create or select a few super-informative samples.
- Train models on this tiny set.
- Aim for accuracy close to training on the full data.
Why it matters: It saves time, memory, money, and energy while keeping performance high.
Anchor: Instead of 1 million images, you might train on only a few per class and still get strong results.
The World Before: People had two main ways to distill data. Matching-based methods tried to match training dynamics (gradients, features, or learning paths) between real and synthetic data, but were heavy on compute and memory. Knowledge-distillation methods used a teacher network to guide synthetic images, which often worked well but felt like magic tricks: fast, strong, but hard to explain.
The Problem: Two big challenges kept showing up.
- Efficiency vs performance: Stronger matching often demanded huge GPU memory and long runtimes, sometimes multiple 80GB GPUs for medium-size tasks.
- Interpretability: Many steps (like random patch crops and heuristic scoring) weren't principled. Why keep this crop? Why toss that one? It was hard to justify, especially in high-stakes settings.
Hook: Think of a museum guide choosing which artifacts to display. They must pick items that are both rich in information (tell a great story) and truly useful (teach visitors the key lessons).
The Concept (Information Theory): What it is: Information theory studies how to measure and communicate information. How it works:
- Decide what counts as a "signal."
- Measure how much uncertainty it reduces.
- Prefer signals that carry more meaningful bits.
Why it matters: Without measuring information, you might keep flashy but unhelpful data.
Anchor: A sharp, close-up photo of a bird's beak carries more class-specific information than a blurry background of trees.
Hook: When friends plan a group project, you want to fairly credit who contributed what.
The Concept (Game Theory): What it is: Game theory studies how players cooperate or compete and how to split rewards fairly. How it works:
- Define the "game" and the "players."
- Consider all coalitions (who teams up with whom).
- Assign each player a fair share of the outcome.
Why it matters: Without fairness rules, credit (or blame) gets misassigned.
Anchor: In an image, "players" can be patches; we want to fairly credit patches that truly help predict the right class.
Failed Attempts: Prior work like RDED used random cropping plus a loss score. It was fast and decent, but random crops often missed the most meaningful regions (like cutting out background sky instead of the bird), and the scoring wasn't tied to a firm theory.
The Gap: We needed a principled, interpretable path to two things at once: (1) keep the most informative parts inside each image and (2) choose the most useful whole images for training.
Real Stakes: This matters when data or compute is scarce, such as training on a laptop, doing privacy-preserving learning where only small synthetic sets are shared, updating edge devices, or reducing carbon footprint. It also matters for trust: if we can show exactly why we kept certain patches and samples, regulators and practitioners can understand and verify the process.
02 Core Idea
Hook: Think of packing a moving box. You keep the most important pieces of each item (like manuals or key parts) and only pack items that truly matter for your new home.
The Concept (Aha!): What it is: The key insight is to balance two things, Informativeness (which parts inside an image matter most) and Utility (which whole images most affect learning), and to measure both with principled tools. How it works:
- Measure patch importance inside each image with Shapley Values (a fair-scoring rule from game theory).
- Measure sample usefulness with Gradient Norms (how strongly a sample can change learning).
- Keep only the top informative patches and the top useful samples.
Why it matters: Without balancing both, you might keep sharp patches from unhelpful images, or helpful images with unhelpful patches.
Anchor: Like highlighting the key sentences (informative patches) and keeping only the best study guides (useful samples).
Multiple Analogies:
- Sponge squeeze: Squeeze a soaked sponge (big dataset) to drip out only the richest water (informative patches), then pour only what truly nourishes the plant (high-utility samples) into the pot (the model).
- School notes: First, highlight the most important lines in your textbook (informative patches). Then pick the few pages that, if you study them, boost your test score the most (high-utility samples).
- Sports practice: Focus drills on the moves that define the game (informative) and schedule the practices that improve your team fastest (utility).
Hook: You know how some paragraphs carry more meaning than others, even inside the same page?
The Concept (Informativeness): What it is: Informativeness says which parts of a single image carry the crucial clues. How it works:
- Treat patches as "players."
- Ask how much each patch changes the model's belief when added to others.
- Score patches fairly using Shapley Values.
Why it matters: If you cut out the wrong region, the synthetic image won't teach the right lesson.
Anchor: For "otter," the otter's face fur and whiskers beat ocean waves as clues.
Hook: In class, some questions change how you understand the topic much more than others.
The Concept (Utility): What it is: Utility says which whole images most change and guide the model during training. How it works:
- Estimate how removing a sample would change learning progress.
- Use Gradient Norm as a principled, proven upper bound for that impact.
- Keep samples with the biggest gradient norms.
Why it matters: Without utility, you might store images that look sharp but barely move the needle during training.
Anchor: A tricky, borderline image that pushes the model to refine its boundary usually has high utility.
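To make the upper-bound claim concrete, here is a hedged sketch of the standard first-order argument (our notation, not necessarily the paper's exact derivation): a sample's one-step influence on training is bounded by its own gradient norm.

```latex
% One SGD step on sample x_i changes the weights by
\Delta\theta = -\eta\, \nabla_\theta \ell(x_i;\theta).
% To first order, its effect on the loss at any point x is
\ell(x;\theta+\Delta\theta)-\ell(x;\theta)
  \approx -\eta\,\big\langle \nabla_\theta \ell(x;\theta),\, \nabla_\theta \ell(x_i;\theta) \big\rangle,
% and by the Cauchy-Schwarz inequality its magnitude is at most
\eta\,\|\nabla_\theta \ell(x;\theta)\|\;\|\nabla_\theta \ell(x_i;\theta)\|.
% Holding everything else fixed, a sample's possible impact therefore scales with its own
% gradient norm, which is why ranking by \|\nabla_\theta \ell(x_i;\theta)\| is a principled proxy.
```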
Before vs After:
- Before: Heuristics like random crops and ad-hoc scores. Decent results but hard to explain, sometimes missing key parts.
- After: A theory-backed recipe: Shapley-based attribution for patches (informative) and gradient-norm selection for images (useful). Clear reasons for each choice, better focus, and strong performance.
Why It Works (intuition): Shapley Values ensure fair credit to patches by checking their marginal contribution across many coalitions, so we don't get fooled by patches that only look bright but add little. Gradient Norms capture how much a sample can bend the learning direction, so we keep the samples the model can learn the most from right now. Add a bit of noise to patch scores to keep diversity, and you cover more of the concept space.
Hook: Imagine a master checklist that defines the best mini-dataset possible.
The Concept (Optimal Dataset Distillation): What it is: A mathematical goal that says: pick masks inside images to maximize informativeness, then pick samples to maximize utility. How it works:
- For each image, find the best informative mask (patches) that keeps the model's output close to the original.
- Among these compressed samples, rank by gradient norm (utility) and keep the top ones.
- Reconstruct full-size images and assign soft labels from a teacher.
Why it matters: Without a crisp target, we can't be sure we're keeping the truly best tiny dataset.
Anchor: It's like a two-step filter: first zoom in on the right spots in photos, then keep only the photos that will teach your future self the most.
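One way to write this two-step target down (a sketch in our own notation; the paper's exact formulation may differ): choose a compact patch mask per image that preserves the teacher's prediction, then keep the compressed samples whose gradient norms are largest.

```latex
% Step 1 (informativeness): for each image x_i, choose a binary patch mask m_i over P patches
\max_{m_i \in \{0,1\}^{P}} \ \mathrm{Info}\!\left(m_i \odot x_i\right)
\quad \text{s.t.}\quad
\left\| f_{\mathcal{T}}(m_i \odot x_i) - f_{\mathcal{T}}(x_i) \right\| \le \epsilon,
\qquad \|m_i\|_0 \le k.

% Step 2 (utility): among the compressed samples, keep those with the largest gradient norms
% (up to the images-per-class budget), where f_T is the teacher and l the training loss
\mathcal{S} \;=\; \operatorname*{arg\,top}_{i}\ \big\| \nabla_\theta\, \ell\!\left(f_{\mathcal{T}}(m_i \odot x_i),\, y_i\right) \big\|.
```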
03 Methodology
At a high level: Big Dataset → Step A: Find Informative Patches (Shapley) → Step B: Pick High-Utility Samples (Gradient Norm) → Reconstruct Images + Soft Labels → Distilled Dataset.
Hook: Think of building a tiny, powerful study guide.
The Concept (Knowledge Distillation & Soft Labels): What it is: A teacher model guides the creation of synthetic data, and soft labels give probabilities over classes, not just the single "hard" class. How it works:
- Train or use a teacher network on real data.
- Generate soft labels (probability vectors) for synthetic patches/images.
- Students learn richer signals (not just right/wrong) from soft labels.
Why it matters: Without soft labels, synthetic data may overfit or miss relationships between classes.
Anchor: For an image that looks 70% "otter" and 20% "seal," the student learns how classes are related.
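To show how soft labels typically enter training, here is a hedged PyTorch sketch of a standard temperature-scaled distillation loss; the function name soft_label_loss, the temperature of 4.0, and the random logits are our assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened student and teacher distributions.

    teacher_logits would come from the pretrained teacher on the synthetic
    image (or patch); the temperature value here is an illustrative choice.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=1)
    teacher_probs = F.softmax(teacher_logits / t, dim=1)  # the "soft label"
    # Scale by t^2 so gradients keep a similar magnitude across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Example: a 70% "otter" / 20% "seal" soft label teaches class relations
# that a one-hot "otter" label would throw away.
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
loss = soft_label_loss(student_logits, teacher_logits)
```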
Step A: Game-theoretic Informativeness Maximization (Attribution Cropping)
- What happens: Treat each image as a set of patches (players). Use Shapley Values to score how much each patch contributes to the correct class. Keep only the top-scoring regions (e.g., crop to 1/4 size).
- Why this step exists: Random crops miss key semantics. We need a fair, theoretically grounded way to choose the right regions.
- Example with data: Suppose an ImageNet "brambling" image is split into a 4×4 grid. After estimating Shapley scores, the patch around the bird's head scores highest. We crop that region to form a compressed sample.
- Diversity via noise: Add small random noise to the attribution heatmap so repeated crops don't all hit the exact same pixel zone. This keeps varied but still good patches.
- What breaks without it: You'd often keep background instead of the object; the tiny dataset loses meaning.
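As a rough illustration of "score patches, add a little noise, average-pool, crop the best region", here is a hedged PyTorch sketch. The function name attribution_crop, the 4×4 grid, the half-side (quarter-area) crop, and the noise scale are our assumptions; patch_scores would come from the Shapley estimation sketched after the next concept block.

```python
import torch
import torch.nn.functional as F

def attribution_crop(image, patch_scores, grid=4, crop_ratio=0.5, noise_std=0.05):
    """Crop the most informative region of `image` given a patch attribution map.

    image:        (C, H, W) tensor.
    patch_scores: (grid, grid) tensor of attribution scores (e.g., Shapley estimates).
    crop_ratio:   0.5 keeps a 2x2 block of the 4x4 grid, i.e. a quarter-area crop.
    """
    c, h, w = image.shape
    # Small noise keeps repeated crops from always landing on the same pixels (diversity).
    noisy = patch_scores + noise_std * torch.randn_like(patch_scores)

    # Average-pool the patch map over crop-sized windows and take the best window.
    win = max(1, int(grid * crop_ratio))                       # e.g., 2 patches per side
    pooled = F.avg_pool2d(noisy[None, None], kernel_size=win, stride=1)[0, 0]
    flat = torch.argmax(pooled).item()
    top, left = divmod(flat, pooled.shape[1])

    ph, pw = h // grid, w // grid                              # patch size in pixels
    return image[:, top * ph:(top + win) * ph, left * pw:(left + win) * pw]
```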
Hook: When dividing group project credit, you want fair shares that add up to the whole.
The Concept (Shapley Value): What it is: A fair scoring rule from game theory for each player's contribution. How it works:
- Consider all subsets that don't include a patch.
- Measure how the output changes when adding that patch.
- Average these marginal gains with a principled weighting.
Why it matters: Without fairness properties (like efficiency and symmetry), scores can be biased or inconsistent.
Anchor: If two patches contribute equally in all cases, Shapley gives them equal credit.
Practical estimation: Exact Shapley is expensive, so KernelSHAP estimates it efficiently. After getting a patch-level map, average-pool to choose the best crop. Use a 4×4 grid for most images; the crop size is often 1/4.
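The paper relies on KernelSHAP; as a simpler stand-in that conveys the same idea, here is a hedged Monte Carlo permutation sketch that scores each grid patch by its average marginal contribution to the teacher's confidence in the true class. Masking by zeroing pixels, the permutation count, and the helper names are our assumptions, and this brute-force version is far slower than KernelSHAP.

```python
import torch

@torch.no_grad()
def mc_shapley_patches(model, image, label, grid=4, n_perms=64):
    """Monte Carlo estimate of per-patch Shapley values for one image.

    The "value" of a patch coalition is the model's probability for the true
    class when only those patches are visible (the rest are zeroed out).
    """
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    n = grid * grid

    def coalition_value(present):
        masked = torch.zeros_like(image)
        for p in range(n):
            if present[p]:
                r, col = divmod(p, grid)
                masked[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw] = \
                    image[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw]
        probs = torch.softmax(model(masked[None]), dim=1)
        return probs[0, label].item()

    shapley = torch.zeros(n)
    for _ in range(n_perms):
        order = torch.randperm(n)
        present = [False] * n
        prev = coalition_value(present)          # empty coalition
        for p in order.tolist():
            present[p] = True
            cur = coalition_value(present)       # add patch p to the coalition
            shapley[p] += cur - prev             # its marginal contribution
            prev = cur
    return (shapley / n_perms).reshape(grid, grid)
```

The resulting (grid, grid) map is exactly the kind of patch_scores input assumed by the cropping sketch above.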
Step B: Principled Utility Maximization (Gradient Norm Scoring)
- What happens: For each compressed sample, compute its gradient norm with the teacher model. Rank samples by this norm and keep the top ones.
- Why this step exists: Utility reflects how much a sample can steer learning. The paper proves utility is upper-bounded by gradient norm, so this is a principled proxy that's fast to compute.
- Example with data: Two otter crops both look good. One yields a large gradient norm, the other a small one. Keep the large one: it will teach the student more.
- What breaks without it: You might store pretty but low-impact samples, wasting precious space in the tiny dataset.
Hook: Steeper hills change your walking direction more than flat ground.
The Concept (Gradient Norm): What it is: The size of the gradient for a sample, i.e., how strongly it pulls the model to update. How it works:
- Feed the sample through the model.
- Compute the loss gradient with respect to model weights.
- Measure its size (norm) and use it as the sample's score.
Why it matters: Low-norm samples won't change the model much; high-norm ones do.
Anchor: Hard, borderline cases with confusing features often have higher gradient norms.
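A hedged sketch of gradient-norm scoring: run each compressed sample through the teacher, take the loss gradient with respect to the weights, and rank by its L2 norm. The cross-entropy loss, the per-sample loop, and the function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_norm_score(model, image, label):
    """L2 norm of the loss gradient w.r.t. model weights for a single sample."""
    model.zero_grad(set_to_none=True)
    loss = F.cross_entropy(model(image[None]), torch.tensor([label]))
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()

def select_high_utility(model, samples, labels, keep):
    """Keep the `keep` samples with the largest gradient norms."""
    scores = [gradient_norm_score(model, x, y) for x, y in zip(samples, labels)]
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:keep]], [labels[i] for i in ranked[:keep]]
```

Since the budgets are stated per class (IPC), the ranking would presumably be applied within each class; the sketch above ignores that bookkeeping.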
Reconstruction + Soft Labels
- What happens: Merge several compressed crops (e.g., a 2×2 grid of quarter-size patches) into a full-size synthetic image. Assign soft labels by resizing patches and using the teacher's logits (optionally from an early or late training stage depending on the data budget).
- Why this step exists: Students expect normal-sized inputs. Soft labels provide richer guidance than hard labels, improving generalization.
- Example: Four 112×112 bird head patches combine into a 224×224 "bird" image. Each patch carries its own soft label slice, giving fine-grained supervision.
- What breaks without it: Students might get low-resolution, mismatched labels or lose the rich class relations.
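A hedged sketch of this reconstruction step as described above: tile four quarter-size crops into one full-size image and record the teacher's softened probabilities for each resized crop as its soft label. The 2×2 layout mirrors the text, but the function name, temperature, and resizing choice are our assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reconstruct_with_soft_labels(teacher, crops, temperature=1.0, out_size=224):
    """Tile four quarter-size crops into one full-size image with per-patch soft labels.

    crops: list of 4 tensors, each (C, out_size // 2, out_size // 2).
    Returns the mosaic image and a (4, num_classes) tensor of soft labels.
    """
    half = out_size // 2
    c = crops[0].shape[0]
    mosaic = torch.zeros(c, out_size, out_size)
    soft_labels = []
    for i, crop in enumerate(crops):
        r, col = divmod(i, 2)                                   # position in the 2x2 grid
        mosaic[:, r * half:(r + 1) * half, col * half:(col + 1) * half] = crop
        # Resize the crop back up so the teacher sees its usual input size,
        # then keep the softened class probabilities as this patch's soft label.
        resized = F.interpolate(crop[None], size=out_size, mode="bilinear", align_corners=False)
        soft_labels.append(torch.softmax(teacher(resized)[0] / temperature, dim=0))
    return mosaic, torch.stack(soft_labels)
```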
Secret Sauce
- Shapley + Grad Norm synergy: Patches are selected fairly and meaningfully; samples are kept for their learning impact. Together, they avoid "sharp but useless" and "useful but off-target."
- Diversity noise: Small randomness during patch picking maintains coverage of distinct object parts (e.g., face, wing, tail), which boosts performance, especially at larger image-per-class settings.
- Efficiency: Training-free selection avoids heavy bi-level optimization; runs on a single A100 with low memory.
04 Experiments & Results
The Test: The authors measured Top-1 accuracy (how often the model's top guess is correct) after training from scratch on the tiny distilled sets. They also checked time and peak memory to see if the method is practical.
The Competition: They compared against prior state-of-the-art methods, including matching-based MTT/DATM and knowledge-distillation-based SRe2L and RDED. They also tested cross-architecture setups (teacher vs student not the same) and compared attribution methods (Shapley vs Grad-CAM).
Datasets and Settings: Results span CIFAR-10/100, Tiny-ImageNet, ImageNette, ImageWoof, ImageNet-100, and the full ImageNet-1K. They tried multiple images-per-class (IPC) budgets: tiny (1, 10) and larger (50, 100, 200). Students include ConvNet, ResNet-18/50/101, MobileNet-V2, VGG-11, Swin-Tiny.
Scoreboard with Context:
- ImageNet-1K, ResNet-18:
  - IPC=1: InfoUtil ~12.7% vs RDED ~6.6% (about +6.1 points). That's like moving from an F to a solid D in an extremely hard one-example-per-class test, which is surprisingly meaningful.
  - IPC=10: InfoUtil ~44.2% vs RDED ~42.0% (about +2.2 points).
  - IPC=50: InfoUtil ~58.0% vs RDED ~56.5% (about +1.5 points).
- ImageNet-100, ResNet-101, IPC=10: InfoUtil beats RDED by about 16 points, like jumping from a B- to a strong A.
- ImageWoof, ResNet-18, IPC=10: InfoUtil gains ~12.9 points over RDED, again a big win when the budget is small.
- CIFAR/Tiny-ImageNet: Consistent improvements across many settings; on Tiny-ImageNet with ResNet-101 at IPC=50, improvements exceed 13 points.
Efficiency:
- Time and memory are far lower than heavy training-based methods like TESLA. InfoUtil is training-free in the selection steps, making it practical on a single A100 GPU.
- For huge datasets, distillation completes in hours rather than days, which is critical for real-world workflows.
Ablations (what mattered most):
- Utility alone helps: Switching from loss-based to gradient-norm-based scoring significantly lifts accuracy (e.g., +4.6 points on ImageNette IPC=50, +1.5 on ImageNet-1K IPC=10).
- Together is best: Combining Shapley-based cropping (informativeness) with gradient-norm selection (utility) gives the highest scores (e.g., ImageNette IPC=50 rises further to ~86.2%).
- Noise helps diversity: Removing the noise that perturbs the attribution map hurts performance, often by large margins (e.g., a drop of 15+ points at IPC=50), showing diversity is essential.
- Shapley vs Grad-CAM: Shapley wins clearly (e.g., on ImageNet-1K IPC=10, ~43.9% vs ~30.4%). The fair-attribution foundation seems to translate to better distilled data.
Surprising Findings:
- Early vs late teacher labels: At ultra-low budgets (IPC=1), early-stage, higher-entropy soft labels can work better than a fully-trained teacher, likely because they preserve diversity that the student needs when data is scarce. At larger budgets (IPC ≥ 10), the fully-trained teacher's lower-entropy, sharper labels tend to win.
- Cross-architecture generalization: Distilled sets still help when the student differs from the teacher (e.g., VGG teacher to Swin student), indicating the synthetic data captures general, transferable cues.
- Continual learning: In a 5-step class-incremental setup, InfoUtil stays ahead at every step, suggesting the distilled images retain robust, reusable knowledge.
05 Discussion & Limitations
Limitations:
- Teacher dependence: Soft label quality relies on the teacher. A weak or biased teacher can pass on its mistakes.
- Attribution cost: Exact Shapley is expensive; KernelSHAP approximations are needed. Very high-res images or many patches raise costs.
- Patch granularity: Using a coarse 4×4 grid may miss fine details; using a finer grid increases compute.
- Gradient noise: Small batches or unstable optimization can produce noisy gradient norms; careful settings help.
- Domain shift: If test data are very different from the teacherās world, both attribution and utility estimates can degrade.
Required Resources:
- A pretrained (or partially trained) teacher model.
- One modern GPU (e.g., A100) for efficient, training-free selection plus light reconstruction.
- Libraries for attribution (e.g., KernelSHAP) and gradient computation.
When NOT to Use:
- If you must keep original images for legal/audit reasons (no synthesis allowed).
- If your task isnāt vision-like or lacks a reliable teacher.
- If your model is tiny and trains instantly on the full dataset (distillation overhead may not pay off).
Open Questions:
- Adaptive patching: Can we learn patch shapes/sizes instead of fixed grids?
- Faster, tighter utility proxies: Beyond gradient norm, can we estimate impact even more cheaply and accurately?
- Robust teachers: How to avoid passing on teacher bias; can multi-teacher ensembles help?
- Beyond images: How to bring principled informativeness/utility to text, audio, and multimodal data?
- Safety & fairness: How to certify that distilled datasets preserve accuracy without amplifying biases?
06 Conclusion & Future Work
Three-sentence summary: This paper defines a principled target for "optimal dataset distillation" and delivers a practical method, InfoUtil, that balances what matters inside images (informativeness) with which images matter most (utility). It uses Shapley Values to fairly find the most informative patches and Gradient Norms to keep the most impactful samples, then reconstructs full images with soft labels. The result is a compact, interpretable, and efficient dataset that trains strong models on a fraction of the data.
Main Achievement: Turning dataset distillation from heuristics into a theory-backed, two-step pipeline (Shapley-guided patch selection plus gradient-norm sample selection) that is both interpretable and fast.
Future Directions: Learn adaptive patch shapes and scales; explore ensembles and robustness for teachers; extend to text, audio, and multimodal distillation; design even cheaper, tighter utility estimators; add fairness and safety checks into the selection loop.
Why Remember This: InfoUtil shows you don't have to choose between speed, accuracy, and transparency; you can have all three by fairly scoring what's inside each image and wisely choosing which images to keep. It's a blueprint for building small, mighty datasets that travel light but teach a lot.
Practical Applications
- Speed up model training on edge devices by using tiny, highly informative synthetic datasets.
- Share privacy-preserving distilled datasets instead of real images when data sensitivity is a concern.
- Bootstrap new projects quickly by training initial models on distilled data before full-scale training.
- Enable faster model iteration (A/B testing) with compact datasets that train in hours, not days.
- Support continual learning by providing distilled samples that retain knowledge across stages.
- Reduce cloud compute costs and energy usage by avoiding massive full-dataset training cycles.
- Create cross-architecture benchmarks: distill with one teacher, evaluate on varied students.
- Assist low-resource labs or classrooms to train meaningful models with minimal hardware.
- Accelerate transfer learning by distilling only the most relevant classes or domains.
- Improve data curation: use attribution maps to identify which regions and images truly matter.