CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models
Key Summary
- Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.
- CAPTAIN is a training-free method that reduces copying by gently nudging the model's hidden features during image generation.
- It starts with a special noise setup that mixes a reference image's broad structure with random fine details to avoid early copying.
- It chooses the best moments (timesteps) to intervene by watching how well the image matches the prompt using CLIP scores.
- It pinpoints exactly where copying happens by combining Bright Ending attention with concept-specific attention maps.
- Then it injects features from a safe, non-memorized reference image only into the suspicious regions to steer away from copied looks.
- CAPTAIN keeps images well-matched to the prompt while cutting down on memorization more than past methods that tweak guidance or prompts.
- It works fast at inference time, needs no retraining, and adds only a small compute overhead.
- On Stable Diffusion v1.4, CAPTAIN lowers copy-detection scores while raising alignment scores compared to strong baselines.
- Ablations show both parts, frequency-based initialization and localized feature injection, are needed for the best trade-off.
Why This Research Matters
CAPTAIN tackles a real problem: AI images that accidentally copy training data can violate privacy and copyright. By reducing memorization without sacrificing how well images match prompts, CAPTAIN helps creators, companies, and users trust generative tools more. It works without retraining large models, which saves time and money and makes it easier to adopt in production. It uses public, licensed references and adds only small overhead, so it fits normal workflows. This approach can also guide safer content creation in fields like advertising, education, and design where originality matters. As text-to-image systems spread, practical safeguards like CAPTAIN help align innovation with responsible use.
Detailed Explanation
01 Background & Problem Definition
You know how sometimes when you try to draw something from memory, your drawing ends up looking exactly like a picture you once saw? That can be a problem if you were supposed to make something new. In the world of AI, diffusion models can do something similar: they try to make new images from text, but sometimes they accidentally recreate pictures from their training data. That behavior is called memorization, and it raises real privacy and copyright concerns.
Top Bread (Hook): Imagine your class is asked to make new posters about animals, but a few students trace from old posters in the school archive. Even if the tracing looks great, it's not original work.
The Concept (Latent Space): What it is: Latent space is a hidden "idea space" where the model keeps compressed, organized representations of images and concepts. How it works: (1) The model encodes images into a compact, multi-channel grid of numbers; (2) It does its thinking and editing there because it's faster and keeps structure; (3) It then decodes back into pixels for you to see. Why it matters: Without latent space, generation would be slow and clumsy, and it couldn't easily reason about structure and style.
Bottom Bread (Anchor): It's like a backstage where actors (features) get costumed and lined up before stepping onto the stage (the final image).
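To make the hidden "idea space" concrete, here is a minimal sketch of encoding an image into Stable Diffusion's latent space and decoding it back. It assumes the diffusers library's AutoencoderKL and the standard SD v1.4 scaling constant 0.18215; neither detail comes from this summary itself.

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE (the encoder/decoder) of Stable Diffusion v1.4.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1   # a dummy 512x512 RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # -> 1 x 4 x 64 x 64 "idea space" grid
    recon = vae.decode(latents / 0.18215).sample                # -> back to 1 x 3 x 512 x 512 pixels

print(latents.shape, recon.shape)  # the latent grid is 8x smaller per side than the pixel image
```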
Top Bread (Hook): Think of a fuzzy photo slowly becoming clear as you wipe away steam from a mirror after a hot shower.
The Concept (Denoising): What it is: Denoising is the step-by-step process of turning noisy latent features into a clean image that matches a text prompt. How it works: (1) Start from noisy latent features; (2) At each step, the model predicts and removes a bit of noise; (3) Coarse layout forms early, details arrive late; (4) After many steps, decode to pixels. Why it matters: Without careful denoising, the model can't form stable shapes or fine details, and it may latch onto memorized patterns.
Bottom Bread (Anchor): It's like sculpting from a rough block: big shapes first, then fine chiseling.
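The step-by-step loop can be sketched in a few lines. This assumes a diffusers-style UNet and scheduler (for example, UNet2DConditionModel and DDIMScheduler); it is a generic latent-diffusion skeleton, not the authors' code.

```python
import torch

def denoise(unet, scheduler, latents, text_emb, num_steps=50):
    """Generic latent denoising loop: coarse layout forms early, details late (a sketch)."""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:                      # timesteps run from noisiest to cleanest
        with torch.no_grad():
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample  # remove a bit of noise
    return latents                                     # decode with the VAE afterwards
```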
Before CAPTAIN, people tried several inference-time tricks to reduce memorization without retraining huge models. Many adjusted classifier-free guidance (CFG) strength, nudged prompt embeddings, or edited cross-attention. These helped sometimes but often caused a trade-off: when you push hard to avoid copying, the image drifts away from the prompt; when you keep the prompt tight, copying can sneak back in.
Top Bread (Hook): You know how giving a friend louder directions doesn't always help if they keep turning onto the same wrong street?
The Concept (Classifier-Free Guidance, CFG): What it is: CFG is a way to push the model to follow the text more strongly by comparing conditional and unconditional predictions. How it works: (1) Run the denoiser twice, once with the prompt and once without; (2) Take the difference; (3) Multiply by a scale; (4) Add it back to guide toward the prompt. Why it matters: Without CFG, images might be off-topic; too much CFG can pull the model into memorized ruts.
Bottom Bread (Anchor): It's like turning up the GPS voice; helpful, but if the map has a bias, you might still end up at the old destination.
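CFG boils down to one formula: guided = eps_uncond + scale * (eps_cond - eps_uncond). A minimal sketch, again assuming a diffusers-style UNet; the default scale of 7.5 is the usual Stable Diffusion setting, not a value from this paper.

```python
import torch

def cfg_noise_prediction(unet, latents, t, cond_emb, uncond_emb, scale=7.5):
    """One classifier-free guidance step: push the prediction toward the prompt."""
    with torch.no_grad():
        eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample      # with the prompt
        eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample  # with an empty prompt
    # Larger scale = stronger pull toward the text (and, sometimes, toward memorized ruts).
    return eps_uncond + scale * (eps_cond - eps_uncond)
```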
Researchers also discovered a clue called Bright Ending (BE) attention: memorized patches often over-focus on the last text token at the final step. That spotlight can help find where copying is happening.
Top Bread (Hook): Picture a theater spotlight that, at the very end of the show, shines way too brightly on a tiny spot; it tells you, "Look here!"
The Concept (Bright Ending Attention): What it is: BE attention is a pattern where the last denoising step gives unusually high attention to the final text token in areas that are likely memorized. How it works: (1) Run a quick pass; (2) Read the last step's cross-attention; (3) Find patches where the final token gets abnormally high focus; (4) Mark those spots as suspects. Why it matters: Without BE attention, it's hard to know where the image might be copying.
Bottom Bread (Anchor): Like noticing the spotlight lingers on a specific corner of the stage; that's where something fishy might be.
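A toy sketch of the idea: read the last denoising step's cross-attention, take the column for the final text token, and flag patches whose focus is abnormally high. The mean-plus-two-standard-deviations threshold is an illustrative assumption; the paper's exact BE criterion may differ.

```python
import torch

def bright_ending_mask(last_step_attn, final_token_idx, z=2.0):
    """
    Flag patches whose last-step attention to the final text token is abnormally high.
    last_step_attn: (num_patches, num_text_tokens), averaged over attention heads.
    """
    focus = last_step_attn[:, final_token_idx]    # how much each patch looks at the last token
    threshold = focus.mean() + z * focus.std()    # "abnormally high" = well above typical
    return focus > threshold                      # boolean mask of suspicious patches

# Toy usage: a 64x64 latent grid flattened to 4096 patches, 77 text tokens (SD's CLIP length).
attn = torch.rand(4096, 77)
suspects = bright_ending_mask(attn, final_token_idx=76)
print(suspects.float().mean().item())             # fraction of patches flagged
```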
The missing piece was a way to keep strong prompt alignment while gently steering away from copied structures, and to do it exactly when and where it matters, without retraining. That's where CAPTAIN comes in: it changes the model's hidden features directly, at the right time and in the right place, using safe, non-memorized reference features so the image stays on-topic but not copied.
Top Bread (Hook): Imagine you're decorating a cake. If one spot looks suspiciously like a store-bought design you saw, you carefully re-ice just that area with your own pattern.
The Concept (Spatial Localization): What it is: Spatial localization is figuring out exactly which parts (patches) of the image are likely copied and tied to the prompt concept. How it works: (1) Use BE attention to find suspicious spots; (2) Use concept attention to find where the target word appears; (3) Intersect the two to get a precise mask; (4) Only change those pixels/latents. Why it matters: Without localization, you'd edit the whole image and ruin things that were already fine.
Bottom Bread (Anchor): Like fixing only the smudged frosting flower instead of scraping off the entire cake.
02 Core Idea
The key insight in one sentence: If you gently replace only the suspicious hidden features, at the right time and place, with safe, semantically matching features, you can stop copying without losing what the prompt asks for.
Top Bread (Hook): Think about painting a mural: you plan the big shapes first, then add details; if you notice one corner looks like a traced poster, you repaint just that corner while keeping the rest.
The Concept (Timestep Selection): What it is: Timestep selection means choosing the part of the denoising timeline where changes will help most: after meaning is stable but before details are locked. How it works: (1) Track image-text alignment with CLIP as the model denoises; (2) Find when the score crosses above average (meaning is present); (3) Stop before the score's change drops sharply (details are settling); (4) Use this window for edits. Why it matters: Without picking the right moment, edits either get washed away (too early) or cause damage (too late).
Bottom Bread (Anchor): It's like adding colors when the sketch is ready but before the paint dries.
Top Bread (Hook): You know how a music teacher can tell if a tune is on-topic by hearing just a few notes?
The Concept (CLIP Score): What it is: CLIP score measures how well the current image matches the text. How it works: (1) Encode the image; (2) Encode the text; (3) Compute their cosine similarity; (4) Monitor how it rises as denoising proceeds. Why it matters: Without a guide like CLIP, you don't know when the image has the right meaning and when to intervene.
Bottom Bread (Anchor): Like checking if a melody fits the song's lyrics before adding more instruments.
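A minimal CLIP-score sketch using Hugging Face transformers; the openai/clip-vit-base-patch32 checkpoint is an assumed choice, not necessarily the one used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image embedding and the CLIP text embedding."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

# Toy usage; during generation, this score is tracked over denoising steps to form a curve.
score = clip_score(Image.new("RGB", (512, 512), "red"), "a red teapot on a wooden table")
print(round(score, 3))
```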
CAPTAIN's "aha!" is to avoid tugging on CFG or scrambling prompts, and instead touch the model's hidden features (latents) directly. It adds semantic feature injection: replacing only the likely-copied bits with features from a non-memorized reference image that still matches the idea of the prompt word (like "zebra" or "teapot").
Top Bread (Hook): Imagine seasoning soup: if one spoonful tastes too much like a packet mix you once used, you add fresh herbs just to that spoonful area and stir it in.
The Concept (Semantic Feature Injection): What it is: A localized swap-in of reference latent features that match the concept but differ from memorized patterns. How it works: (1) Pick a safe reference image related to the target word; (2) Encode it into latents; (3) Use a mask to target suspicious spots; (4) Blend in the reference features with a small strength. Why it matters: Without semantic injection, you either don't break the copying pattern or you break the prompt meaning.
Bottom Bread (Anchor): Like replacing traced pencil lines with your own sketch lines in just the traced area.
Top Bread (Hook): Think of starting a sandcastle on a beach: if your base shape copies someone else's moat and towers, it's hard to be original later.
The Concept (Noise Initialization): What it is: A special starting setup that mixes the reference image's broad, low-frequency structure with random, high-frequency noise. How it works: (1) Take the reference latent; (2) Keep its low frequencies (big shapes); (3) Keep random noise for high frequencies (tiny details); (4) Start denoising from this blend. Why it matters: Without this, early steps can fall back on memorized structures before you even begin to edit.
Bottom Bread (Anchor): Like laying a fresh, unique base for your sandcastle so later towers don't match anyone else's.
Top Bread (Hook): Picture a translator who whispers helpful hints to the artist at just the right moments.
The Concept (CAPTAIN): What it is: A training-free pipeline that initializes with frequency-aware noise, finds the right timesteps, pinpoints suspicious regions, and injects safe, matching features to avoid copying. How it works: (1) Retrieve a non-memorized reference; (2) Frequency-based initialization; (3) Timestep window via CLIP; (4) Spatial mask via BE and concept attention; (5) Localized feature injection. Why it matters: Without CAPTAIN's coordinated steps, you either lose prompt alignment, fail to stop copying, or both.
Bottom Bread (Anchor): It's like a coach that helps you start with a fresh plan, know when to adjust, where to adjust, and what to replace so the final artwork is truly yours.
Multiple analogies:
- Cooking: Season only where it's too salty, and do it before the dish is plated.
- Carpentry: Sand and replace only the warped plank, during the framing stage.
- Writing: Swap a paragraph that sounds copied with a paraphrase that keeps the idea but uses your own words.
Before vs. After: Before, pushing CFG or scrambling prompts often hurt meaning or left copying. After CAPTAIN, you get strong prompt alignment and lower copying by targeting the right hidden spots at the right times.
Why it works (intuition): Copying is a structural habit that emerges when meaning is set but details are still flexible. By initializing away from common structures, editing during the semantic-but-not-frozen window, and replacing only suspect regions with safe-but-related features, CAPTAIN shifts the generation path into new territory while staying faithful to the prompt.
Building blocks recap:
- Latent space and denoising (the canvas and the process),
- Fourier-based initialization (fresh base),
- CLIP-guided timestep window (good timing),
- BE + concept attention (precise location),
- Semantic feature injection (gentle, targeted fix).
03 Methodology
At a high level: Prompt + Reference image → Frequency-based initialization → Denoising with monitoring → Find best timestep window → Find suspicious regions → Inject safe features (repeat in window) → Final image.
Step 1: Frequency-based initialization
Top Bread (Hook): You know how a song has deep bass (low frequency) and twinkly highs (high frequency)?
The Concept (Fourier Transform): What it is: A way to split a signal into low and high frequencies. How it works: (1) Transform features into frequency space; (2) Use masks to keep lows from the reference and highs from random noise; (3) Inverse-transform back; (4) Start denoising from this blended latent. Why it matters: Without separating frequencies, the early structure can echo memorized layouts.
Bottom Bread (Anchor): Like keeping the slow drumbeat from one track and the sparkling notes from random riffs to make a fresh intro.
How CAPTAIN uses it: It encodes the safe reference image into latents. Then: keep low frequencies (broad shapes) from the reference latent, keep high frequencies (fine randomness) from Gaussian noise, and blend them. This gives a non-memorized but semantically friendly starting point.
Example: Prompt: "A red teapot on a wooden table." Reference image: a different teapot on a different table from a royalty-free site. Low frequencies give gentle table-plane and object blobs; high frequencies add unique texture randomness so the model doesn't fall into a memorized exact teapot.
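A minimal sketch of this frequency blend using PyTorch's FFT utilities. The circular low-pass mask and the cutoff fraction are illustrative assumptions, not the paper's exact settings.

```python
import torch

def frequency_blend(ref_latent, noise, cutoff_frac=0.25):
    """Keep low frequencies (broad structure) from the reference latent and
    high frequencies (fine randomness) from Gaussian noise."""
    ref_f = torch.fft.fftshift(torch.fft.fft2(ref_latent), dim=(-2, -1))
    noise_f = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))

    _, _, h, w = ref_latent.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_pass = (radius <= cutoff_frac * min(h, w) / 2).float()   # 1 near the center = low freqs

    blended_f = low_pass * ref_f + (1 - low_pass) * noise_f      # lows from ref, highs from noise
    return torch.fft.ifft2(torch.fft.ifftshift(blended_f, dim=(-2, -1))).real

# Toy usage with SD-sized latents (1 x 4 x 64 x 64): reference structure plus fresh detail noise.
ref = torch.randn(1, 4, 64, 64)
init_latent = frequency_blend(ref, torch.randn_like(ref))
```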
Step 2: Denoising with semantic monitoring
We decode partial latents along the way and compute CLIP scores against the prompt to see when meaning appears and stabilizes.
Top Bread (Hook): Imagine checking a sketch every few minutes to see if it still looks like "a teapot on a table."
The Concept (CLIP Score): What it is: A similarity score between the image and text. How it works: (1) Encode tentative image; (2) Encode text; (3) Cosine similarity; (4) Track curve over timesteps. Why it matters: Without CLIP, you don't know when to edit for best effect.
Bottom Bread (Anchor): Like asking, "Does this still look like a teapot?" before adding fine details.
We look for when the CLIP curve first rises above average (meaning present) and stops changing fast (details settling). That window is where edits stick but don't break the final look.
Step 3: Finding the timestep injection window
We set t_high when CLIP first exceeds the average for the run and t_low just before a sharp drop in the curve's rate of change. In practice (for SD v1.4, 50 steps), the window corresponds to the phase where structure is set but details are malleable. Intervening here avoids early wash-out and late overcorrection.
Top Bread (Hook): It's like knowing the best moment to frost cupcakes: after they cool, before the frosting hardens.
The Concept (Timestep Selection): What it is: Picking when to edit so changes hold and help. How it works: (1) Track CLIP; (2) Choose the window after meaning appears, before details freeze; (3) Edit only in this window. Why it matters: Without timing, you either do nothing useful or smudge finished parts.
Bottom Bread (Anchor): You frost when cupcakes are cool; too early it melts, too late it won't stick.
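A sketch of turning the per-step CLIP curve into a window. The specific rules (mean crossing for the start, a sharp slowdown in the curve's rate of change for the end, and the drop_ratio constant) follow the description above but are simplified assumptions, not the authors' exact statistics.

```python
import numpy as np

def select_injection_window(clip_scores, drop_ratio=0.5):
    """Return (start, end) step indices for injection, counting from the noisiest step.
    The start index corresponds to t_high (earlier, noisier) and the end to t_low."""
    scores = np.asarray(clip_scores)
    deltas = np.diff(scores)                              # how fast alignment is improving

    above_avg = np.nonzero(scores > scores.mean())[0]
    start = int(above_avg[0]) if len(above_avg) else 0    # meaning has appeared

    # End just before the curve's rate of change drops sharply (details settling).
    early_pace = deltas[: start + 1].mean()
    slowing = np.nonzero(deltas[start:] < drop_ratio * early_pace)[0]
    end = start + int(slowing[0]) if len(slowing) else len(scores) - 1
    return start, end

curve = [0.18, 0.20, 0.24, 0.27, 0.285, 0.29, 0.291, 0.292, 0.292]
print(select_injection_window(curve))   # (3, 4): edit while meaning is set but details are soft
```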
Step 4: Localizing suspicious regions
We combine two attention maps:
- BE mask: flags patches that over-focus on the final token at the last step (a sign of memorization).
- Concept mask: highlights where the chosen concept word (like "teapot") lives in the image. We multiply them and threshold to get a crisp binary mask of likely memorized concept regions (a code sketch of this step follows below).
Top Bread (Hook): Like using two clues on a treasure map, the X and the compass, to get the exact spot.
The Concept (Spatial Localization): What it is: Pinpointing where to edit. How it works: (1) BE attention finds suspicious focus; (2) Concept attention finds where the word appears; (3) Intersection gives the mask; (4) Threshold for a clean region. Why it matters: Without localization, changing the whole image ruins good parts.
Bottom Bread (Anchor): Fix just the suspicious teapot spout, not the whole photo.
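A minimal sketch of the intersection-and-threshold step on toy attention maps. The min-max normalization is an assumption, while the default threshold τ = 0.1 matches the value reported later in the ablations.

```python
import torch

def localize_memorized_regions(be_map, concept_map, tau=0.1):
    """Intersect the Bright Ending map with the concept-token attention map,
    then threshold into a crisp binary mask (1 = likely memorized concept patch)."""
    def normalize(m):                                        # min-max scaling is an assumption here
        return (m - m.min()) / (m.max() - m.min() + 1e-8)

    combined = normalize(be_map) * normalize(concept_map)    # high only where both clues agree
    return (combined > tau).float()

# Toy usage on a 64x64 latent grid.
be_map = torch.rand(64, 64)        # e.g., Bright Ending scores reshaped to 2D
concept_map = torch.rand(64, 64)   # cross-attention map for the concept word ("teapot")
mask = localize_memorized_regions(be_map, concept_map, tau=0.1)
```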
Step 5: Semantic feature injection
Within the window and mask, we blend in a small amount (strength δ) of the reference latent features. This gently shifts the local content away from copied shapes while keeping the concept consistent.
Top Bread (Hook): Think of gently dabbing new paint onto only the traced lines.
The Concept (Semantic Feature Injection): What it is: A masked, small-strength replacement with safe, matching features. How it works: (1) Encode reference; (2) Make the mask; (3) Blend with strength δ (like 0.1); (4) Continue denoising so the change propagates naturally. Why it matters: Without a light, semantic touch, you either don't break copying or you break the prompt.
Bottom Bread (Anchor): Like swapping a traced flower with your own hand-drawn petals, leaving the rest of the bouquet intact.
Example with data: Suppose the model keeps making the same chair pose from training for "a cozy reading nook." The mask targets the chair back and armrest. Injecting reference features from a different chair latent reshapes those parts while the rug, lamp, and window remain as they were.
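The injection itself is a small masked blend. A minimal sketch that reads "blend in a small amount (strength δ)" as a convex mix inside the mask; the exact update rule in the paper may differ.

```python
import torch

def inject_reference_features(latent, ref_latent, mask, delta=0.1):
    """Blend reference features into the masked region only; spots outside the mask are untouched.
    latent, ref_latent: (1, 4, h, w) latents; mask: (h, w) binary map from Step 4."""
    m = mask[None, None]                                      # broadcast over batch and channel dims
    return (1 - delta * m) * latent + delta * m * ref_latent  # convex mix with small strength delta

# Toy usage during the injection window.
z = torch.randn(1, 4, 64, 64)                  # current denoising latent
z_ref = torch.randn(1, 4, 64, 64)              # encoded safe reference image
mask = (torch.rand(64, 64) > 0.9).float()      # stand-in for the Step 4 mask
z = inject_reference_features(z, z_ref, mask)  # denoising then continues as normal
```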
Step 6: Repeat until the window ends
At each step in the chosen window, recompute the clean prediction, inject within the mask, and move to the next step. After the window, the model finalizes details normally. The decoder then turns the latent into the final image.
Secret sauce (what's clever):
- Decoupling structure and detail at the start (frequency-based init) reduces early drift into memorized layouts.
- Editing at the right time (CLIP-guided window) ensures changes neither disappear nor break the image.
- Editing in the right place (BE ∩ concept) ensures you only touch the risky regions tied to the prompt.
- Editing in the right way (semantic injection) keeps meaning while changing look.
Safety and practicality: CAPTAIN retrieves public, licensed reference images, runs fully at inference time, and adds only modest latency (~3 seconds per image reported). No retraining, no access to training sets required.
04 Experiments & Results
The test: The authors asked a simple question: can we make images that match the prompt well while reducing how much they resemble training images? They measured two main things:
- SSCD (Self-Supervised Copy Detection): lower is better (less like training images).
- CLIP score: higher is better (better prompt alignment). They also checked image quality and diversity using FID and LPIPS (lower is better on both).
The competition: CAPTAIN was compared with strong inference-time baselines:
- BE (Bright Ending): finds suspicious regions but struggles to strongly suppress memorization without losing meaning.
- PRSS: perturbs prompts and re-anchors guidance; can reduce copying but may drift from the intent.
- Wen et al. (prompt embedding adjustments): mitigates memorization by adjusting prompt embeddings.
- Han et al. (noise initialization): changes start noise to avoid memorization but may cause visual side effects or alignment drops.
Scoreboard (with context): On Stable Diffusion v1.4 with 500 memorization-triggering prompts, CAPTAIN achieved around 0.25 SSCD (lower is better) and 0.29 CLIP (higher is better). Think of CLIP like a report card for staying on topic: CAPTAIN scored like an A- to A while many others got closer to a B. Meanwhile, SSCD is like a plagiarism checker: CAPTAIN's score is significantly lower than methods that keep high alignment, and lower than or comparable to those that sacrificed alignment to reduce copying. CAPTAIN also posted strong FID and LPIPS, showing it preserved visual quality and diversity.
Surprising findings:
- You don't need to push CFG harder or rewrite prompts; precise latent edits work better. That's like fixing a paragraph directly instead of shouting instructions louder.
- The combination matters: frequency-based initialization sets a fresh base, but without localized injection, you can only go so far; injection alone is powerful but can be unstable without the right start.
Ablations (what each part does):
- Initialization only: Helps a bit (SSCD improves), but can't adapt later; it's a fixed nudge.
- Injection only: Very sensitive to strength δ; too low and copying remains, too high and prompt meaning drops. It's powerful but needs stabilization.
- Both together: Best balance, with good CLIP and low SSCD consistently across δ. Like having both a good recipe and good timing in the kitchen.
- Mask threshold τ: Larger τ narrows the mask (fewer pixels edited) and may reduce SSCD further, but it can hurt CLIP since you edit less of the concept region. The default τ = 0.1 gave a stable trade-off.
Generalization: On Stable Diffusion 2.0 (with a more de-duplicated training set), CAPTAIN still improved the privacy-utility trade-off, lowering SSCD further while nudging CLIP upward compared to reported baselines. This suggests CAPTAIN's core idea, editing the right hidden spots at the right time, transfers beyond a single model version.
Speed and practicality: The method added modest overhead (~3 seconds per image in their setup). Compared with baselines, CAPTAIN remained practical for real use, especially given it requires no retraining and uses public reference images.
Takeaway: The numbers and pictures together show CAPTAIN can be both respectful of the prompt (high CLIP) and respectful of originality (low SSCD), a tough balance that prior methods often struggled to achieve simultaneously.
05 Discussion & Limitations
Limitations:
- Dependence on reference retrieval: If the external reference is poorly matched or not very novel, the injection can be less effective. This is like seasoning with the wrong herb; it won't taste right.
- Mask reliability on abstract prompts: If the prompt is vague (e.g., "the feeling of nostalgia"), spatial masks from BE and concept attention can be too small, too large, or fuzzy. Then edits may miss the real copied bits or affect the wrong places.
- FAISS index requirement: To estimate novelty, CAPTAIN relies on a precomputed embedding index. Porting to other model/dataset combos means building a new index.
- Extra components: Frequency transforms and CLIP-based window selection add small but non-zero overhead.
Required resources:
- A diffusion model (e.g., SD v1.4 or SD 2.0) and its encoder/decoder.
- CLIP encoders (vision and text) for alignment tracking and retrieval scoring.
- Access to public image APIs (e.g., Pexels, Unsplash) or user-provided references.
- Optional FAISS index and perceptual hash set for novelty estimation.
- A GPU for real-time-ish inference.
When NOT to use:
- Highly sensitive or offline-only environments with no access to safe reference images and no prebuilt novelty index.
- Prompts that require exact reproduction of a specific known artwork or person (CAPTAIN will intentionally steer away from lookalikes).
- Cases where you must avoid any extra latency, even a few seconds per image.
Open questions:
- Can the reference be built from text-only cues (e.g., a learned concept prior) to remove dependence on external images?
- Could the mask be improved using learned detectors of memorization, not just BE and concept attention heuristics?
- Can timestep selection be adapted per-sample more precisely, beyond global statistics, to further boost stability?
- How does CAPTAIN interact with newer diffusion backbones (e.g., DiTs) and larger guidance mechanisms?
- Is there a way to self-generate safe references on-device to avoid network calls entirely?
Overall, CAPTAIN is a practical, training-free step forward that targets memorization where and when it happens, but it benefits from good references and precise masks. Future work can make it more autonomous, robust for abstract prompts, and even lighter-weight.
06 Conclusion & Future Work
Three-sentence summary: CAPTAIN reduces unwanted copying in text-to-image diffusion models by changing the model's hidden features directly, only where and when needed, using safe, semantically matching references. It blends a frequency-aware start, a CLIP-guided editing window, and a precise attention-based mask to inject just enough new features to break memorized patterns while keeping the prompt meaning. Experiments show better alignment and lower copy-detection than strong baselines, with practical runtime and no retraining.
Main achievement: CAPTAIN demonstrates that targeted latent-space feature injection, scheduled at the right timesteps and localized to suspicious concept regions, can decouple prompt faithfulness from structural memorization in a training-free way.
Future directions: Improve reference selection (or remove the need via learned priors), sharpen masks with more reliable memorization detectors, refine per-sample timestep selection, and adapt the pipeline to newer backbones and modalities (e.g., video).
Why remember this: CAPTAIN reframes memorization mitigation from pushing guidance knobs to performing surgical, semantic edits inside the model's hidden space. That perspective (edit the right thing, in the right place, at the right time) offers a blueprint for safer, more original generative AI that still does exactly what you asked for.
Practical Applications
- Creative studios generate on-brand images while reducing the risk of copying protected training images.
- Advertising teams produce unique visuals that match campaign prompts without reproducing stock photos.
- Educational content creators make illustrations that fit lessons while avoiding lookalikes of textbook images.
- Product mockups and concept art stay faithful to descriptions yet remain novel for IP safety.
- Newsrooms and blogs illustrate stories with prompt-driven art that avoids memorized watermarks or logos.
- Designers rapidly explore styles (e.g., rugs, posters) while minimizing overlap with known catalog photos.
- Game developers prototype assets that follow text briefs but steer clear of training-set character poses.
- Researchers and auditors assess and mitigate memorization risk in deployed diffusion systems.
- User-facing AI apps offer a "less copying" mode to respect originality and reduce legal exposure.
- Enterprises deploy text-to-image safely without retraining, using CAPTAIN as an inference-time guardrail.