
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Intermediate
Yexin Liu, Manyuan Zhang, Yueze Wang et al. · 12/9/2025
arXiv · PDF

Key Summary

  • OpenSubject is a giant video-based dataset (2.5M samples, 4.35M images) built to help AI make pictures that keep each person or object looking like themselves, even in busy scenes.
  • It uses a four-step pipeline (see the sketch after this list): pick good videos, find and pair frames of the same subject, synthesize training images by smart in/out-painting, and auto-check plus caption everything with a vision–language model (VLM).
  • The key idea is to learn 'cross-frame identity priors' from videos, so the model recognizes the same subject across different angles, lighting, and places.
  • OpenSubject supports both subject-driven image generation and subject-driven manipulation (replacing one subject while keeping the rest of the scene intact).
  • A new benchmark (OSBench) tests four tasks and scores identity fidelity, prompt adherence, manipulation fidelity, and background consistency with a VLM judge.
  • Models fine-tuned on OpenSubject show stronger identity preservation and much better manipulation consistency, especially in multi-subject scenes.
  • An open-source baseline (OmniGen2) improved its average OSBench score from 6.43 to 7.22 after training on OpenSubject.
  • Gains also transfer to other benchmarks like OmniContext (better at multi-reference composition) and ImgEdit (better at precise edits).
  • The pipeline mixes real video identity with synthetic, carefully verified inputs, reducing bias and boosting variety across scenes and viewpoints.
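
To make that four-step flow concrete, here is a tiny Python skeleton. It is only an illustration: the function names and the placeholder body are hypothetical, and only the step order and the curation thresholds (720p resolution, 5.8 aesthetic score, described in the Methodology section) come from the paper.

```python
from typing import Optional


def curate(clip: dict) -> bool:
    """Step 1: keep only sharp, aesthetically filtered clips (thresholds from the paper)."""
    return clip.get("height", 0) >= 720 and clip.get("aesthetic", 0.0) >= 5.8


def build_sample(clip: dict) -> Optional[dict]:
    """Steps 2-4 are placeholders for subject mining, synthesis, and VLM checks."""
    if not curate(clip):
        return None
    # Step 2: verify the subject across frames and pair the two most diverse ones.
    # Step 3: synthesize a training input via mask-guided out-/in-painting.
    # Step 4: VLM-verify the result (regenerate on failure) and write captions.
    return {"clip_id": clip["id"], "status": "stub"}  # placeholder output


if __name__ == "__main__":
    print(build_sample({"id": "demo", "height": 1080, "aesthetic": 6.1}))
```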

Why This Research Matters

OpenSubject helps AI keep people and objects recognizable across different places and poses, a core need in personalized media. It enables safer, cleaner subject replacement for photos and videos, preserving the rest of the scene so edits don’t break reality. Creative industries can re-render characters consistently across shots, saving time while maintaining continuity. E-commerce can swap models or products while leaving studio lighting and backgrounds untouched, keeping catalogs uniform. Family and memory apps can place loved ones into new scenes without altering their identity. Educational and accessibility tools benefit from reliable identity preservation when re-contextualizing known items. By grounding training in real video variety, the approach reduces bias and makes models more robust in the wild.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how you can spot your best friend from far away, even if they wear a hat today and a hoodie tomorrow? Your brain keeps a strong picture of their identity no matter the setting. 🥬 Filling (The Actual Concept): Subject-driven image generation is when an AI draws or edits pictures of a specific person or object while keeping that subject’s unique look the same. How it works (big picture): 1) The AI sees example images of the subject (the references). 2) It reads a prompt like 'the same dog on a beach at sunset'. 3) It produces a new image that should keep the same face/fur/markings while changing the scene. Why it matters: Without this, pictures drift—faces change shape, special markings vanish, or wrong people appear. 🍞 Bottom Bread (Anchor): 'Make my toy robot stand on a mountain at sunrise' should still look exactly like my toy, not a different robot.

🍞 Top Bread (Hook): Imagine a school photo album where each student shows up across many pages in different activities but still looks like themselves. 🥬 Filling (The Actual Concept): Identity fidelity is how closely the new picture matches the real appearance of the chosen subject. How it works: 1) The model collects clues (face shape, hairstyle, markings, clothing cues). 2) It keeps these features consistent as it draws in new places. 3) A checker compares the output with references. Why it matters: If identity fidelity breaks, the result might be pretty but it’s no longer that person or object. 🍞 Bottom Bread (Anchor): Your friend with freckles and curly hair should not come out freckle-free and straight-haired in the 'ski trip' photo.

🍞 Top Bread (Hook): Think of making a class collage where 3–4 classmates are placed into a new scene without mixing them up. 🥬 Filling (The Actual Concept): Multi-subject generation means composing several known subjects into one image, preserving each identity and following the prompt. How it works: 1) Read multiple references. 2) Keep a separate identity memory for each. 3) Arrange them as the text requests (who stands where, who holds what). Why it matters: Without this, faces swap, people merge, or one person becomes a stranger. 🍞 Bottom Bread (Anchor): 'Put Alex and Priya on a red couch in a sunny living room' should show both Alex and Priya correctly, not two Alexes or two look-alikes.

🍞 Top Bread (Hook): Imagine a flipbook of the same person walking through a park—their face stays the same, but the background and angles change. 🥬 Filling (The Actual Concept): A video-derived dataset is built by using many frames from real videos, which show the same subjects under different views, lighting, and scenes. How it works: 1) Gather high-quality clips. 2) Find frames that truly show the same subject. 3) Pair the most different-looking frames to maximize variety. Why it matters: Still photos can be too similar or biased; videos naturally cover different angles and contexts, making identity learning stronger. 🍞 Bottom Bread (Anchor): A skater filmed during a trick looks like themselves from start to finish; that variety trains the AI to recognize and preserve identity anywhere.

The world before: Earlier subject-driven systems worked okay for one person in a simple, portrait-like setting. But they slipped when scenes got complex, when several subjects needed to appear together, or when the subject had to be swapped into a busy photo without breaking the background.
The problem: How do we teach a model to keep each subject’s identity perfect while also following prompts in open, messy scenes with multiple people or objects?

Failed attempts:
• Synthesis-only pairing (generate pairs with another model): fast but inherits the generator’s biases and can break identity.
• Retrieval-only pairing (collect web images): real but skewed to celebrities, weak on everyday objects, and hard to scale for multi-subject diversity.

The gap: We need identity-consistent, diverse, scalable training pairs that cover varied viewpoints and contexts, especially for multiple subjects and precise subject replacement (manipulation).

Real stakes:
• Family photos where you place grandparents into a new scene without changing their faces.
• Product catalogs swapping colors or models without breaking the studio background.
• Games and films re-rendering characters consistently across shots.
• Accessibility tools that keep a pet’s look accurate when placed in new surroundings.

OpenSubject fills the gap by building a video-grounded yet synthetic-paired corpus. It uses real videos for true identities and variety across frames, plus careful inpainting/outpainting to create training inputs that challenge the model to change scenes while preserving who is who.

02 Core Idea

🍞 Top Bread (Hook): Imagine recognizing your friend whether they’re facing left, right, under bright sun, or in the shade—you use clues from many moments, not just one snapshot. 🥬 Filling (The Actual Concept): The 'aha!' idea is to learn cross-frame identity priors from videos, then synthesize smart training inputs so models master both identity preservation and scene flexibility. How it works: 1) Use videos to collect multiple, varied looks of the same subject. 2) Automatically verify subjects and pick the most different-looking frames as pairs. 3) Synthesize inputs via mask-guided outpainting/inpainting so the model must keep the subject but change context or swap targets. 4) Use a VLM to verify quality and write captions. Why it matters: This turns scattered, biased pairs into rich, identity-true, multi-reference training that scales. 🍞 Bottom Bread (Anchor): Two frames of the same dog—one near a tree, one on a sidewalk—teach the model 'this is the same dog' even as the background changes, so it can place the dog anywhere later without losing its spots.

Multiple analogies:
• Lego analogy: Each video frame is a Lego brick with the same mini-figure (identity) but different scenery; OpenSubject picks the bricks that are most different and teaches the model to rebuild new scenes while keeping the mini-figure intact.
• Detective analogy: The model collects clues about a subject across frames (face shape, markings) and ignores misleading background hints, like a detective focusing on a suspect’s traits, not the room wallpaper.
• Choir analogy: Multiple references sing the same identity 'melody' in different keys (lighting/angles). Training teaches the model to recognize the melody no matter the key.

Before vs. After: Before—Models often nailed simple, single-subject portraits but stumbled when composing 2–4 identities or replacing one subject in a crowded scene. After—With video-derived priors and verified synthetic inputs, models hold identities steady across new layouts and keep non-target regions untouched when editing.

Why it works (intuition):
• Diversity with consistency: Videos supply many looks of the same subject, so the model learns what stays the same (identity) and what can change (context).
• Targeted synthesis: Outpainting surrounds a known subject with fresh context; inpainting forces clean replacement without harming the background—exactly the skills needed at test time.
• Automated quality gates: A VLM checks correctness, so the dataset favors clear, unambiguous supervision.

Building blocks (each a 'sandwich' concept):

🍞 Hook: You know how a friend can look at a picture and explain it in words? 🥬 VLM (Vision–Language Model): What it is—an AI that understands images and text together. How—1) See image, 2) read or write text about it, 3) answer questions or verify details. Why—It can label, verify, and caption at scale. 🍞 Anchor: It confirms 'this is a person' and writes a short prompt like 'replace the woman on the right with the subject from image 2'.

🍞 Hook: Imagine finishing a jigsaw by drawing the missing borders around a known center piece. 🥬 Outpainting: What—it expands an image around a kept subject using a mask. How—1) Keep subject pixels, 2) mask the rest, 3) generate new surroundings. Why—Teaches the model to keep identity but vary context. 🍞 Anchor: Keep the person; generate a new kitchen around them.
🍞 Hook: Think of fixing a hole in a photo so the patch blends in. 🥬 Inpainting: What—fill or replace content inside a selected box or mask. How—1) Erase region, 2) tell which subject should go there, 3) generate a seamless insert. Why—Trains precise subject replacement while preserving everything else. 🍞 Anchor: Replace the cat in the couch photo with the dog from the reference, leaving the couch untouched.

🍞 Hook: Like tracing an outline before you color. 🥬 Segmentation mask: What—an exact cut-out of the subject’s shape. How—1) Detect subject, 2) segment pixels, 3) use as a guide. Why—Prevents background leakage and makes cleaner edits. 🍞 Anchor: A neat silhouette of a person avoids boxy edges.

🍞 Hook: It’s like a 'find the object' highlighter driven by words. 🥬 Grounding-DINO: What—an open-set detector that links text labels to boxes in images. How—1) Read labels, 2) propose boxes, 3) score matches. Why—Ensures the right thing is localized. 🍞 Anchor: 'woman' or 'mural' gets correctly boxed before editing.
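
To make the two editing modes concrete, here is a small NumPy sketch of how the editable region differs between them: outpainting keeps the subject silhouette and regenerates everything around it, while inpainting erases a detected box and fills only that box. The mask shape and box coordinates are made-up toy values, not anything from the paper.

```python
import numpy as np

H, W = 512, 512

# Pretend a segmenter (e.g., SAM2) produced this subject silhouette.
subject_mask = np.zeros((H, W), dtype=bool)
subject_mask[128:384, 160:352] = True

# Generation branch (outpainting): keep subject pixels, regenerate everything else.
outpaint_region = ~subject_mask

# Manipulation branch (inpainting): erase the detected target box and fill it
# with the reference subject, leaving the rest of the image untouched.
x0, y0, x1, y1 = 300, 200, 460, 440          # hypothetical Grounding-DINO box
inpaint_region = np.zeros((H, W), dtype=bool)
inpaint_region[y0:y1, x0:x1] = True

print(f"outpainting edits {outpaint_region.mean():.0%} of the image")
print(f"inpainting edits {inpaint_region.mean():.0%} of the image")
```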

03 Methodology

At a high level: Input (video clips) → Step A (video curation) → Step B (cross-frame subject mining & pairing) → Step C (identity-preserving reference synthesis) → Step D (verification & captioning) → Output (training pairs for generation and manipulation + benchmark).

Step A: Video curation. 🍞 Hook: Imagine picking only sharp, well-lit pages for your flipbook. 🥬 What it is—Choose high-quality clips with stable subjects. How—1) Collect from OpenHumanVid, OpenVid-1M, OpenS2V. 2) Filter out low resolution (<720p) or low aesthetic (<5.8). Why—Blurry or messy videos teach weak identity signals. 🍞 Anchor: Keep only crisp 720p/1080p scenes with a clear main person/object.

Step B: Cross-frame subject mining and pairing. 🍞 Hook: Think of choosing two yearbook photos of the same student that look most different (one outdoors, one indoors) but are definitely the same person. 🥬 What it is—Find reliable frames of the same subject and pick the most diverse pair. How—1) Frame sampling: grab four mid-sequence frames to avoid transitions. 2) Clip-level category consensus with a VLM (Qwen2.5-VL-7B): keep frames that agree on the same subject types (e.g., 'person'). 3) Local verification with Grounding-DINO and a VLM: check boxes, size, occlusion, blur, facial visibility. 4) Diversity-aware pairing with DINOv2 embeddings: choose the two frames with the largest cosine distance (most different contexts/views; a short code sketch of this rule appears at the end of this section). Why—If frames are too similar or mislabeled, the model won’t learn to generalize identity across varied scenes. 🍞 Anchor: Pick the person on a city street and the same person in a kitchen—same face, very different backgrounds.

Side sandwiches for tools:
• 🍞 Hook: Turning pictures into fingerprint-like numbers to compare. 🥬 DINOv2 embeddings: What—image features that let us measure visual difference. How—1) Encode frames, 2) compute distances, 3) pick farthest pair. Why—Maxes out diversity without losing identity. 🍞 Anchor: Two far-apart vectors → two visually distinct frames.
• 🍞 Hook: A word-powered spotlight. 🥬 Grounding-DINO: What—text-guided detection. How—1) read 'person', 2) find boxes, 3) filter by geometry/confidence. Why—Pinpoints the right subject area. 🍞 Anchor: Confidently box 'woman' while ignoring background chairs.

Step C: Identity-preserving reference image synthesis. This creates the training inputs the model will see. Two branches: generation (mask-guided outpainting) and manipulation (box-guided inpainting).

Generation branch. 🍞 Hook: Like pasting a cutout of a person onto a fresh poster and painting a new world around them. 🥬 What it is—Keep the subject pixels, generate the rest so the model learns to put known subjects into new contexts. How—1) Fine mask construction: refine boxes with SAM2 to get the subject’s exact shape. 2) Topology normalization: resolve overlapping masks to avoid identity leakage. 3) Geometry-aware augmentations: scale small subjects to ~30–40% area; re-center with jitter for layout diversity. 4) Mask-guided outpainting with FLUX.1 Fill [dev] to synthesize surroundings. 5) Irregular boundary erosion: roughen mask edges to avoid straight-line seams (banding). Why—If the context doesn’t change, the model overfits; if edges are too neat, artifacts appear; precise masks keep the true identity intact. 🍞 Anchor: The same child is re-centered and surrounded by a new living room that wasn’t in the original frame.
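
A toy sketch of the geometry-aware augmentation and boundary-erosion ideas just described, assuming NumPy and SciPy are available. The uniform random-depth erosion here is a simplification of the paper's irregular edge roughening, and the mask size, jitter range, and depths are invented values.

```python
import numpy as np
from scipy.ndimage import binary_erosion

rng = np.random.default_rng(0)
H, W = 512, 512

# Toy subject mask (a rectangle standing in for a SAM2 silhouette).
mask = np.zeros((H, W), dtype=bool)
mask[200:320, 220:300] = True

# Geometry-aware augmentation: how much to enlarge the subject so it covers
# roughly 30-40% of the canvas, plus a small jittered offset (5% is invented).
target_frac = rng.uniform(0.30, 0.40)
scale = np.sqrt(target_frac * H * W / mask.sum())
offset = rng.integers(-int(0.05 * W), int(0.05 * W), size=2)
print(f"scale subject by x{scale:.2f}, shift center by {offset} px")

# Boundary erosion: shave a random-depth rim off the mask so the outpainting
# seam is not a clean straight edge (simplified vs. the paper's irregular erosion).
depth = int(rng.integers(2, 9))
eroded = binary_erosion(mask, iterations=depth)
print(f"eroded {int(mask.sum() - eroded.sum())} boundary pixels (depth={depth})")
```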
Manipulation branch. 🍞 Hook: Imagine replacing a sticker on a poster without wrinkling the poster itself. 🥬 What it is—Box-guided inpainting: erase a target region and insert the reference identity while preserving everything else. How—1) Choose multi-object images with low overlap. 2) Use the box as the erase mask. 3) Inpaint with FLUX.1 Fill [dev] using the reference subject. Why—Teaches precise replacement while keeping backgrounds and neighbors unchanged. 🍞 Anchor: Replace 'the cat on the couch' with 'the dog from image 2' and keep the couch, pillows, and lighting identical.

Side sandwiches for helpers:
• 🍞 Hook: Scissors that perfectly follow the outline. 🥬 SAM2 segmentation: What—pixel-accurate subject cutouts. How—1) seed from detection, 2) refine to silhouette, 3) output mask. Why—Avoids boxy background leakage. 🍞 Anchor: Hair strands stay, wall pixels go.
• 🍞 Hook: Scooting a sticker and resizing it until the page looks balanced. 🥬 Geometry-aware augmentations: What—controlled scale/position changes. How—1) detect small vs. large, 2) scale to target ranges, 3) jitter position. Why—Expands layout diversity so the model doesn’t memorize one spot. 🍞 Anchor: The person sometimes slightly left, sometimes centered—identity stays, layout varies.
• 🍞 Hook: Tearing paper edges so they blend into a collage. 🥬 Irregular boundary erosion: What—make mask borders a bit rough. How—1) random erosion depth, 2) apply along edges, 3) break straight seams. Why—Prevents visible cut-and-paste lines. 🍞 Anchor: No more black bars at the top/bottom; blends look natural.

Step D: Verification and captioning. 🍞 Hook: Like a careful editor who both checks your work and writes the figure captions. 🥬 What it is—Use a VLM to auto-check samples and to write short and long captions. How—1) VLM artifact checks (geometry errors, texture issues, lighting, background mismatch). 2) Fail → re-synthesize with a new seed. 3) Pass → generate two captions (short/long) in either generation or editing style. Why—Keeps quality high and provides rich text supervision for training. 🍞 Anchor: The VLM rejects a warped arm, regenerates the sample, then writes 'Replace the woman on the right with the subject from image 2; keep the studio lighting the same.'

Secret sauce:
• Video grounding supplies honest identity variety.
• Diversity-aware pairing forces context shifts.
• Mask-guided out/in-painting exactly matches test-time needs (keep identity, change scene or swap target).
• Automated VLM verification scales quality control.
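
The 'diversity-aware pairing' mentioned in Step B and in the secret sauce boils down to one rule: among verified frames of the same subject, keep the pair whose DINOv2 features are farthest apart in cosine distance. A minimal sketch, assuming the embeddings are already computed (the toy features below are random stand-ins, not real DINOv2 outputs):

```python
import numpy as np


def most_diverse_pair(features: np.ndarray) -> tuple[int, int]:
    """Pick the two frames whose feature vectors are farthest apart (cosine distance)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine_distance = 1.0 - normed @ normed.T          # pairwise distance matrix
    i, j = np.unravel_index(np.argmax(cosine_distance), cosine_distance.shape)
    return int(i), int(j)


# Toy stand-in features for four verified frames of the same subject;
# in the real pipeline these would come from a DINOv2 image encoder.
frames = np.random.default_rng(1).normal(size=(4, 768))
print(most_diverse_pair(frames))
```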

04 Experiments & Results

🍞 Top Bread (Hook): Picture a report card with four subjects: drawing one person, drawing several people, replacing one person in a simple scene, and replacing one person in a busy scene. 🥬 Filling (The Actual Concept): OSBench is a benchmark that tests subject-driven generation (single- and multi-subject) and subject-driven manipulation (single- and multi-subject) using a VLM judge to score key skills. How it works: 1) Four tasks, 60 items each. 2) For generation: Prompt Adherence (PA) and Identity Fidelity (IF), plus an Overall (geometric mean). 3) For manipulation: Manipulation Fidelity (MF) and Background Consistency (BC), plus an Overall. Why it matters: It fairly checks if models both follow instructions and preserve who is who, especially when scenes are complex. 🍞 Bottom Bread (Anchor): A model might follow 'sunset on a beach' perfectly (high PA) but change the person’s face (low IF)—OSBench reveals that.

The competition: Closed-source baselines include Gemini 2.5 Flash Image Preview and GPT-4o; open-source baselines include UNO, DreamO, XVerse, Qwen-Image-Edit-2509, and OmniGen2. Most models do noticeably worse on multi-subject generation and especially on manipulation in busy scenes, showing where the field still struggles.

Scoreboard with context (highlights from the paper):
• On OSBench, strong closed-source models still drop sharply for multi-subject manipulation; for example, Gemini 2.5 Flash Image Preview’s Overall is 5.12 in the hardest manipulation case.
• Open-source subject editors struggle most: UNO, DreamO, and XVerse show extremely low manipulation scores (often below 1 for Overall) in multi-subject cases.
• Fine-tuning OmniGen2 on a mix of 500k OpenSubject samples and 100k text-to-image (T2I) samples (for prompt skills) lifts the average OSBench score from 6.43 to 7.22. The biggest wins are in manipulation: single-subject Overall +0.81; multi-subject Overall +1.91, driven by BC (+1.93) and MF (+0.48). Meaning: OpenSubject training helps the model swap identities correctly and keep the rest of the scene stable.

Surprising findings:
• Adding only extra text-to-image (T2I) data improved prompt adherence but hurt identity fidelity and manipulation—evidence that plain T2I isn’t a substitute for identity-consistent supervision.
• Adding OpenSubject reversed the declines and produced across-the-board gains, especially where identity and precise localization matter.

Transfer to other benchmarks:
• OmniContext: Average improves 7.18 → 7.34, with the largest gains in MULTIPLE cases (Character +0.23, Object +0.24, Char.+Obj. +0.42) and SCENE categories (+0.12 to +0.21). Translation: Better at blending multiple references in larger scenes.
• ImgEdit: Overall improves 3.44 → 3.72. Biggest jumps in Extract (+0.84), Hybrid (+0.76), Add (+0.71), and Background (+0.56), indicating stronger localization and structural edits.

Visual comparisons (qualitative): Models trained on OpenSubject keep identities steadier in multi-subject scenes, follow attribute instructions better, and confine edits to the marked area. Competing models often drift identity or unintentionally alter background elements.

🍞 Sandwiches for metrics:
• 🍞 Hook: Following a recipe exactly. 🥬 Prompt Adherence (PA): What—how well outputs match the text (attributes, counts, relations). How—1) Read prompt, 2) check each requirement, 3) score 0–10. Why—Good pictures must follow instructions. 🍞 Anchor: If the prompt says 'two men at a wooden table,' generating three is a PA failure.
• 🍞 Hook: Recognizing your friend across selfies. 🥬 Identity Fidelity (IF): What—how well the subject matches the references. How—1) Compare face/body/clothes cues, 2) penalize mismatches, 3) score 0–10. Why—Keeps who-is-who correct. 🍞 Anchor: The same hairstyle and jawline should appear across scenes.
• 🍞 Hook: Doing a perfect sticker swap. 🥬 Manipulation Fidelity (MF): What—accuracy of the edit in the target region. How—1) Compare edited part to the reference, 2) check pose/attributes, 3) score 0–10. Why—Measures correct replacement. 🍞 Anchor: Replace 'cat' with 'the dog from image 2' exactly where told.
• 🍞 Hook: Not knocking over other Lego when changing one piece. 🥬 Background Consistency (BC): What—how unchanged non-edited regions stay. How—1) Compare outside the edit box, 2) penalize lighting/layout shifts, 3) score 0–10. Why—Prevents collateral damage. 🍞 Anchor: Couch texture and lamp position should remain the same after swapping the pet.
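
The Overall scores combine two 0–10 sub-scores with a geometric mean (stated for the generation tasks; I assume the same rule for MF and BC on the manipulation side), so a model cannot hide a weak axis behind a strong one. A tiny illustration of that scoring rule, with made-up scores:

```python
from math import sqrt


def overall(score_a: float, score_b: float) -> float:
    """Geometric mean of two 0-10 sub-scores (e.g., PA & IF, or MF & BC)."""
    return sqrt(score_a * score_b)


# Following the prompt perfectly while losing the identity still scores
# poorly overall, which is exactly what OSBench is designed to expose.
print(overall(9.0, 9.0))   # 9.0  -> strong on both axes
print(overall(10.0, 2.0))  # ~4.5 -> high PA, low IF drags the Overall down
```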

05 Discussion & Limitations

Limitations:
• Data source dependency: If input videos skew toward certain people, places, or lighting, the learned identity priors may still carry bias.
• Synthetic steps: In/out-painting uses a base model; any artifacts or style biases there can leak into supervision (though VLM verification helps).
• Hard corner cases: Extremely heavy occlusions, motion blur, or micro-identities (tiny faces) remain challenging.
• Scaling compute: Building and training with millions of samples requires serious GPU time and efficient pipelines.

Required resources:
• Storage for 4.35M images and metadata.
• GPUs for in/out-painting synthesis and model fine-tuning (the paper cites 16× H800 for one run).
• VLM inference capacity for verification and captioning.

When NOT to use:
• If you only need generic text-to-image without preserving a specific identity, simpler T2I data is cheaper.
• If privacy constraints forbid handling identifiable subjects, even with acceptable-use policies—use synthetic or anonymized identities instead.
• If the task involves domains far outside the dataset (e.g., medical scans, microscopy), the video priors here may not transfer.

Open questions:
• Robustness: How to better handle extreme pose changes, partial occlusions, or minuscule subjects while keeping identity perfect?
• Fairness: How to balance categories and demographics so identity fidelity is strong across all groups?
• Efficiency: Can we shrink the pipeline (lighter VLM checks, faster masks) while keeping quality?
• Editing control: How to combine precise spatial instructions (like scribbles or point prompts) with multi-identity conditioning for even cleaner manipulations?
• Unified evaluation: Beyond VLM judges, can we design standardized, human-calibrated audits for identity and background preservation at scale?

06 Conclusion & Future Work

Three-sentence summary: OpenSubject builds a massive, video-derived dataset and pipeline that captures cross-frame identity priors and synthesizes high-quality training inputs for both subject-driven generation and manipulation. By pairing diverse frames of the same subject and using mask-guided in/out-painting with VLM verification, it teaches models to preserve identity while changing context or swapping targets. Experiments show large gains, especially in multi-subject and manipulation scenarios, with improvements that transfer to other benchmarks.

Main achievement: Proving that video-grounded, diversity-aware pairing plus verified synthesis is the missing recipe for scaling identity-consistent generation and robust, background-safe manipulation.

Future directions: Expand category balance and demographics, reduce synthesis bias, add finer-grained spatial controls, and develop lighter pipelines and broader human-calibrated evaluations.

Why remember this: It reframes personalization as a video problem—learning who a subject is across many moments—then turns that understanding into practical, reliable image generation and editing that stays true to the subject and the scene.

Practical Applications

  • Personalized character rendering for films and games that keeps identity across scenes.
  • Photo editing tools that replace one person or pet without changing the background.
  • E-commerce pipelines that swap products or models while preserving studio setups.
  • Marketing content generation that composes multiple brand assets consistently.
  • Social media filters that keep your exact face while changing settings or styles.
  • Family memory apps that place relatives into new group photos while staying authentic.
  • Virtual try-on that preserves model identity while changing outfits or accessories.
  • Education demos that consistently re-contextualize lab objects or museum artifacts.
  • Storyboarding tools that keep recurring characters stable across different panels.
  • AR overlays that insert known objects or characters into live scenes without drift.
#subject-driven generation #identity fidelity #video-derived dataset #cross-frame priors #inpainting #outpainting #vision–language model #Grounding-DINO #SAM2 #DINOv2 embeddings #multi-subject composition #manipulation fidelity #background consistency #Diffusion Transformer #benchmarking with VLM
Version: 1