
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Beginner
Runze He, Yiji Cheng, Tiankai Hang et al. · 1/8/2026
arXiv · PDF

Key Summary

  • Re-Align is a new way for AI to make and edit pictures by thinking in clear steps before drawing.
  • It uses a special plan called In-Context Chain-of-Thought (IC-CoT) that first writes a simple goal caption and then links each reference image to its job.
  • This clear plan stops the AI from mixing up which parts come from which reference picture.
  • A gentle training method (reinforcement learning) rewards the AI when the final image matches its own step-by-step plan.
  • The reward is measured by how well the generated image matches the plan’s caption using an image–text similarity score.
  • To keep training stable, Re-Align makes several different reasoning plans for the same prompt so the AI explores more and learns better.
  • A large, carefully filtered dataset (Re-Align-410K) with both prompts and reasoning plans helps the model learn many kinds of generation and editing tasks.
  • On standard tests, Re-Align beats other methods of similar size in following instructions and keeping subjects consistent across images.
  • It works for both creating new images from references and editing existing images using multiple reference pictures.
  • The big idea is aligning what the AI says it will do (reasoning) with what it actually draws (generation).

Why This Research Matters

Creative work often needs combining parts from several photos into a single, clear image, and Re-Align makes that reliable. It helps professionals keep brand elements (like logos and styles) consistent while following complex instructions. It reduces trial-and-error by making the AI first write a simple plan and then draw to match it, saving time and effort. It supports both making new images and editing existing ones, so one method covers many real tasks. Because it rewards matches between the plan and the picture, it builds trust: what the AI says it will do is what you actually get. Over time, this approach can enable safer, more controllable image tools for classrooms, studios, and businesses.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re building a LEGO scene using several boxes of pieces. A friend tells you, “Use the red wheels from Box 1, the blue roof from Box 2, and make it look like the picture in this magazine.” If the instructions are messy or you mix up which box is which, the final build won’t match what your friend wanted.

🥬 The World Before: For a long time, AI could draw cool images from text prompts like “a panda riding a bike.” It could also edit a single picture when told, “make the sky pink.” But people often want more flexible instructions that mix words and several pictures, such as, “Put the cat from this photo on the couch from that photo, in the style of this sketch.” These mixed inputs are called image–text interleaved prompts, and they’re the normal way that humans describe complex visual ideas.

🍞 Anchor: Think of asking an artist: “Use this face (photo A), this outfit (photo B), and place it in this room (photo C).” That’s what users want from modern image tools.

🍞 Hook: You know how teachers say, “Show your work!” in math? That’s because writing down the steps helps catch mistakes and proves your answer fits the question.

🥬 The Problem: Some advanced multimodal AIs can ‘explain’ what they think they should do, but when it’s time to actually draw, the final picture doesn’t match their own explanation. The thinking and the drawing aren’t aligned. In complex, interleaved prompts with multiple reference images, models often confuse which reference supplies which part (like using the hat from the wrong person).

🍞 Anchor: It’s like a student explaining perfectly how to solve a math problem, then writing the wrong final answer.

🍞 Hook: Imagine a recipe card that says “Make spaghetti,” but doesn’t say what sauce, how much salt, or what to do with the mushrooms you laid out.

🥬 Failed Attempts: Previous methods tried two main things. First, “bigger brains” for understanding: large multimodal models that can read and reason about complex prompts. Second, “better brushes” for drawing: improved image generators like diffusion or flow-matching models. But without a bridge between the reasoning and the drawing, the model still treated every reference loosely, leading to mismatched subjects, missing edits, or styles bleeding into the wrong parts.

🍞 Anchor: The chef (AI) read the whole recipe blog post (reasoning) but still cooked a dish that didn’t match the short recipe card (final target) because the steps weren’t tied directly to the cooking.

🍞 Hook: Picture traffic cones guiding cars onto the right lanes so they don’t crash into each other.

🥬 The Gap: What was missing is a simple, structured plan that locks thinking to doing: a way to clearly write down what the final picture should be (a short, target caption) and to label how each reference image contributes (who brings the car, the hat, the background). Without this structure, the AI keeps guessing and mixing references.

🍞 Anchor: We need a plan that says, “Final image: a woman in a brown jacket in a garden. Reference 1 gives the person and background. Reference 2 gives the jacket.”

🍞 Hook: Remember the gold star stickers teachers give when your work matches the assignment?

🥬 Real Stakes: People use these systems to customize product photos, create marketing visuals, make consistent characters in comics, or edit family pictures. If the model misplaces a logo, swaps faces incorrectly, or forgets the requested style, it wastes time and can break trust. A method that tightly aligns “what I asked for,” “what the AI said it would do,” and “what it actually drew” makes creative work faster, safer, and more reliable.

🍞 Anchor: A designer can say, “Put our new sneaker (Photo A) on this table (Photo B) in the ad style (Photo C),” and get exactly that—first written as a plan, then drawn as the final image.

02Core Idea

🍞 Hook: You know how coaches tell athletes to visualize the play before they run it? First, you plan the moves; then you execute them.

🥬 The Aha! Moment (One Sentence): Make the AI commit to a clear, structured plan—what to draw and which reference supplies what—then train it so the final picture must match that plan.

🍞 Anchor: First, the AI writes a tiny caption of the target image and lists which reference image provides each part; then it draws an image that should fit that caption and list.

🍞 Hook: Imagine a shopping list that says exactly which store to visit for each item.

🥬 Multiple Analogies:

  1. Recipe Card: IC-CoT is the recipe. It says the final dish (the short caption) and which ingredients come from each pantry (reference images). The cooking (image generation) must match the recipe.
  2. LEGO Blueprint: IC-CoT labels which box each brick comes from and shows the final build sketch. Building follows the labels.
  3. Theater Casting: IC-CoT assigns roles: who’s the actor, who brings the costume, what backdrop to use. The performance (image) follows that script.

🍞 Anchor: If the plan says, “Final: a black-and-white sketch of a man and woman on a brown leather sofa; man from Image 1, woman from Image 2, sofa from Image 3, style from Image 4,” the drawing should match exactly.

🍞 Hook: You know how checklists stop pilots from forgetting steps?

🥬 Before vs After:

  • Before: The model reasoned in long, messy paragraphs and then drew something loosely related, often mixing up references.
  • After: The model writes a crisp, structured IC-CoT: a short target caption (semantic guidance) plus labeled relations (reference association). Then it draws to match that plan. The result is tighter instruction following and better subject consistency.

🍞 Anchor: Fewer surprises: the hat from the right person ends up on the right head, in the right scene, with the right style.

🍞 Hook: Think of a game where you get points when your result matches your plan.

🥬 Why It Works (Intuition):

  • Clear Target: A compact caption focuses the generator on a single, well-defined image goal, shrinking confusion from long, tangled prompts.
  • Untangled References: Labeling each reference image’s job prevents accidental swaps and style leaks.
  • Gentle Rewards: A similarity score between the plan’s caption and the final image nudges learning toward tighter plan–picture alignment, without building a giant custom judge for every task.

🍞 Anchor: If the caption says “a barn owl on the cushioned chair in a cozy home office,” and the image looks like that, the reward is high; if not, it’s low.

🍞 Hook: If all your practice questions are too similar, small changes look bigger than they are.

🥬 Building Blocks:

  • Structured Reasoning (IC-CoT): Two parts: out_caption (what the final image should look like) and relation_i tags (who contributes what from each reference image).
  • Surrogate Reward: Use a standard image–text similarity to reward matches between the generated image and the IC-CoT caption.
  • Group Relative Policy Optimization (GRPO): A team-based RL training that improves the model by comparing samples in a group—no extra value network needed.
  • Reasoning-Induced Diversity: Create several different reasoning chains for the same prompt so the samples in a group vary more. This stabilizes RL training and avoids amplifying tiny differences.
  • Re-Align-410K Dataset: Many examples with prompts, references, plans (IC-CoT), and images so the model learns both to think and to draw.

🍞 Anchor: Together, these blocks act like a good teacher: demand a plan, check the plan against the result, give points for matching, and encourage trying slightly different plans to learn faster.

03Methodology

At a high level: Input (interleaved images + text) → Plan (IC-CoT) → Draw (image) → Check alignment (reward) → Improve (GRPO).
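For readers who like code, here is a heavily simplified, hypothetical sketch of that loop. Every helper below is a toy stand-in for a component explained in the steps that follow; none of this is the authors’ actual API.

```python
import random

# Hypothetical, heavily simplified sketch of one Re-Align training step.
# Every helper is a toy stand-in for a component described in Steps 1-5 below.

def make_plan(prompt, refs):                      # Step 1: write the IC-CoT plan
    return {"out_caption": f"an image satisfying: {prompt}",
            "relations": {f"image_{i + 1}": "role to be assigned" for i in range(len(refs))}}

def generate_image(plan, refs):                   # Step 2: draw from the plan
    return {"described_as": plan["out_caption"]}  # placeholder "image"

def surrogate_reward(image, caption):             # Step 3: plan-image similarity
    return random.random()                        # stand-in for a CLIP-style score

def training_step(prompt, refs, group_size=4):
    group = []
    for _ in range(group_size):                   # Step 5: several diverse attempts per prompt
        plan = make_plan(prompt, refs)
        image = generate_image(plan, refs)
        group.append(surrogate_reward(image, plan["out_caption"]))
    mean = sum(group) / len(group)
    return [r - mean for r in group]              # Step 4: group-relative learning signal (GRPO)

print(training_step("put the cat from Image 1 on the couch from Image 2", ["cat.jpg", "couch.jpg"]))
```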

Step 0. The Input 🍞 Hook: You know when you get instructions like “Use this picture of a puppy and this picture of a beach, and make the puppy sit on the beach at sunset”? 🥬 What: Image–text interleaved prompts mix several reference images with a written instruction. How it works: The model reads the instruction and looks at all references together. Why it matters: Without understanding all parts together, the model could ignore key references or misread the goal. 🍞 Anchor: Prompt: “Put the cat from Image 1 on the couch from Image 2 in the watercolor style of Image 3.”

Step 1. Make the Plan (IC-CoT) 🍞 Hook: Before building a LEGO kit, you skim the picture on the box and label which bag holds which parts. 🥬 What: IC-CoT is a structured mini-plan with two pieces: a short target caption (out_caption) and several labeled relations (relation_i) that say what each reference image contributes. How it works (recipe style):

  1. Predict out_caption: a concise description of the final image.
  2. For each reference image i, write relation_i: explain its role (e.g., subject face, jacket, sofa, style, background).
  3. Keep the plan clear and compact. Why it matters: Without a clear plan, the generator guesses, leading to reference confusion and missed instructions. 🍞 Anchor: Example IC-CoT: out_caption: “A woman in a brown leather jacket and blue checked shirt, holding flowers in a garden.” relation_1: “The woman’s face, hair, pose, and background from Image 1.” relation_2: “The brown jacket and blue checked shirt from Image 2.”
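To make the plan’s shape concrete, here is a minimal sketch of an IC-CoT record as plain data. The field names mirror the terminology above (out_caption, relation_i); the exact format the model actually emits may differ.

```python
# A hypothetical IC-CoT plan for the jacket-swap example above.
# Keys mirror the terms in the text; the real serialization may differ.
ic_cot = {
    "out_caption": (
        "A woman in a brown leather jacket and blue checked shirt, "
        "holding flowers in a garden."
    ),
    "relation_1": "The woman's face, hair, pose, and background come from Image 1.",
    "relation_2": "The brown jacket and blue checked shirt come from Image 2.",
}

def format_plan(plan: dict) -> str:
    """Render the structured plan as the compact text the generator is pointed at."""
    lines = [f"Target: {plan['out_caption']}"]
    lines += [f"{key}: {value}" for key, value in sorted(plan.items()) if key.startswith("relation_")]
    return "\n".join(lines)

print(format_plan(ic_cot))
```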

Step 2. Draw the Image (Flow-based Generation) 🍞 Hook: After your plan, you actually build the LEGO model. 🥬 What: The model generates an image guided by the plan. How it works: A powerful image generator (rectified flow/diffusion-like) uses the out_caption and relations to synthesize pixels that fit the plan. It treats the plan as the target to aim for. Why it matters: Without directly feeding the plan into drawing, the final image can drift away from the intended goal. 🍞 Anchor: If out_caption says “black-and-white sketch style,” the generator renders lines and shading, not colorful paint.
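The real generator is a large rectified-flow model, but the core sampling mechanic is simple: start from noise and repeatedly step along a learned velocity field that is conditioned on the plan. The sketch below shows a generic Euler-style rectified-flow sampling loop with a dummy velocity network; it illustrates the mechanism, not the paper’s architecture.

```python
import torch

# Generic rectified-flow sampling sketch (not the paper's model).
# A real velocity network would be a large transformer conditioned on the
# IC-CoT text and reference-image features; here it is a random stand-in.
class DummyVelocityNet(torch.nn.Module):
    def __init__(self, channels=4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, x, t, cond):
        # Broadcast the timestep as an extra channel; `cond` is ignored in this stub.
        t_map = t.view(-1, 1, 1, 1).expand(x.shape[0], 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

def sample_rectified_flow(velocity_net, cond, steps=20, shape=(1, 4, 32, 32)):
    """Integrate dx/dt = v(x, t, cond) step by step, from noise toward an image."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        with torch.no_grad():
            x = x + velocity_net(x, t, cond) * dt  # simple Euler step
    return x

latent = sample_rectified_flow(DummyVelocityNet(), cond="plan text placeholder")
print(latent.shape)
```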

Step 3. Check Alignment (Surrogate Reward) 🍞 Hook: After drawing, you compare your picture to the box art. 🥬 What: A surrogate reward scores how well the final image matches the out_caption using a standard image–text similarity model (like CLIP). How it works:

  1. Encode the generated image.
  2. Encode the out_caption from the plan.
  3. Compute their similarity; higher means better match. Why it matters: Building a custom judge for every possible editing/generation task is expensive; this simple score gives a consistent, general signal. 🍞 Anchor: If the plan says “owl on a cushioned chair in a cozy office,” and the picture clearly shows that, the similarity score is high; if the owl is missing or the scene is wrong, it’s low.
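Below is a minimal sketch of such a surrogate reward using the publicly available CLIP model from the Hugging Face transformers library. The specific checkpoint and the raw cosine-similarity scale are illustrative choices; the paper’s actual scorer may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One possible surrogate reward: CLIP image-text cosine similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def surrogate_reward(image: Image.Image, out_caption: str) -> float:
    inputs = processor(text=[out_caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())  # cosine similarity in [-1, 1]

generated = Image.new("RGB", (512, 512))  # stand-in for the model's output image
print(surrogate_reward(generated, "a barn owl on a cushioned chair in a cozy home office"))
```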

Step 4. Improve the Model (GRPO) 🍞 Hook: A soccer team watches replays and chooses better plays next time. 🥬 What: Group Relative Policy Optimization is an RL method that compares several generated samples as a group and nudges the model to favor the higher-reward ones. How it works:

  1. For a single prompt, generate multiple candidate plans and images.
  2. Score each result using the surrogate reward.
  3. Update the model to increase the chance of producing higher-scoring samples while keeping changes stable. Why it matters: Without RL, the model doesn’t steadily learn from successes and failures across diverse prompts; without grouping, training is less stable and more memory-hungry. 🍞 Anchor: From five attempts, keep leaning toward the one that best matches the plan and instruction.
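The heart of GRPO is a group-relative advantage: each sample’s reward is compared to the mean and spread of its own group, so no separate value network is needed. A minimal sketch follows, assuming the standard mean/standard-deviation normalization; the full objective also includes stability terms (clipping, KL penalties) omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each sample's reward against its own group (GRPO-style).

    rewards: shape (num_prompts, group_size), one row per prompt.
    Positive advantage means better than the group average, negative means worse.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Five candidate images for one prompt, scored by the surrogate reward.
rewards = torch.tensor([[0.31, 0.27, 0.35, 0.29, 0.33]])
print(group_relative_advantages(rewards))
# The policy update then raises the likelihood of plans/images with positive
# advantages while a clipping/KL term keeps the changes stable.
```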

Step 5. Encourage Variety (Reasoning-Induced Diversity) 🍞 Hook: If you always practice the same math problem, you don’t learn as much. 🥬 What: For each prompt, write slightly different IC-CoT plans so the model explores varied reasoning paths and image outcomes. How it works:

  1. Generate multiple IC-CoTs with small wording or emphasis changes.
  2. Produce images from each plan.
  3. Compare them; diverse results make the reward more informative and training more stable. Why it matters: If all samples look nearly the same, tiny differences in scores get over-amplified, confusing learning. 🍞 Anchor: One plan says “focus on jacket texture,” another says “focus on garden lighting.” The best-balanced image wins more reward.
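The toy numbers below illustrate the intuition: when a group’s samples are nearly identical, the spread (standard deviation) is tiny, so normalization turns meaningless noise into large advantages; a diverse group has a spread that reflects real quality differences. This is only an illustration of the failure mode described above, not the authors’ exact analysis.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Nearly identical samples: the spread is just noise, yet it gets amplified.
near_duplicates = torch.tensor([0.500, 0.501, 0.499, 0.500, 0.502])
print(near_duplicates.std())                        # tiny spread
print(group_relative_advantages(near_duplicates))   # large advantages from pure noise

# Diverse reasoning chains produce genuinely different images and rewards,
# so the same normalization now ranks real quality differences.
diverse_samples = torch.tensor([0.35, 0.52, 0.61, 0.44, 0.58])
print(diverse_samples.std())
print(group_relative_advantages(diverse_samples))
```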

Step 6. Train with the Right Data (Re-Align-410K) 🍞 Hook: A good workbook has clear examples and answer keys. 🥬 What: A large dataset containing reference images, adaptive instructions, IC-CoT plans, and target images across many tasks (generation and editing). How it works:

  1. Pick varied references (people, objects, scenes).
  2. Use a strong language-vision model to write instructions tailored to the references.
  3. Generate structured reasoning (IC-CoT) that predicts the goal without peeking at the answer image.
  4. Use a top-tier generator to make the target image.
  5. Filter by quality, instruction following, and plan–image match to keep only strong samples. Why it matters: Without rich, clean data that includes both plans and images, the model can’t learn to tightly connect thinking and drawing. 🍞 Anchor: Example data: references (a person, a jacket), instruction (“replace the white dress with the brown jacket from image 2”), IC-CoT plan, and a correct target photo.
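Here is a hypothetical sketch of the kind of quality gate described in item 5: keep only samples whose image quality, instruction following, and plan–image match all clear a threshold. The field names and thresholds are made up for illustration; the actual Re-Align-410K filtering pipeline is more involved.

```python
# Hypothetical filtering pass over candidate training samples.
# Scores would come from automatic judges; names/thresholds are illustrative.
def keep_sample(sample: dict,
                min_quality: float = 0.7,
                min_instruction_following: float = 0.7,
                min_plan_image_match: float = 0.7) -> bool:
    return (sample["image_quality"] >= min_quality
            and sample["instruction_following"] >= min_instruction_following
            and sample["plan_image_match"] >= min_plan_image_match)

candidates = [
    {"id": "a", "image_quality": 0.91, "instruction_following": 0.88, "plan_image_match": 0.84},
    {"id": "b", "image_quality": 0.95, "instruction_following": 0.52, "plan_image_match": 0.80},
]
dataset = [s for s in candidates if keep_sample(s)]
print([s["id"] for s in dataset])  # only "a" survives the gate
```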

The Secret Sauce 🍞 Hook: Great cookies come from both a good recipe and tasting as you go. 🥬 What makes Re-Align clever is connecting a compact, structured plan (IC-CoT) directly to both drawing and learning through a simple alignment score and stable RL. How it works: The plan simplifies the target, the reward checks plan–image match, diversity keeps learning steady, and GRPO turns comparisons into improvements. Why it matters: This turns fuzzy, multi-image instructions into clear, checkable steps the generator can reliably follow. 🍞 Anchor: The AI doesn’t just talk about an owl-in-office picture; it writes a neat plan and then proves it by drawing the matching image.

04Experiments & Results

The Test: What did they measure and why? 🍞 Hook: When you take a quiz, the teacher checks both if you followed directions and if your answer looks like what was asked. 🥬 What: They measured Prompt Following (PF: did the image follow the instruction?) and Subject Consistency (SC: did the image faithfully use the referenced people/objects?). An Overall Score combines PF and SC. Why it matters: Without good PF, the picture ignores the request; without good SC, the picture misuses references. 🍞 Anchor: If you ask “put the husky from Image 2 into Image 1,” PF checks if a husky is added; SC checks if it’s the same husky.
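As a toy illustration, the snippet below combines a PF score and an SC score into one overall number using a geometric mean, which punishes failing either criterion. This aggregation is an assumption made for intuition only; the benchmark’s real judging (typically a vision-language grader) and exact formula may differ.

```python
import math

def overall_score(prompt_following: float, subject_consistency: float) -> float:
    """Combine PF and SC (each on a 0-10 scale) with a geometric mean.

    An image that ignores the instruction (PF near 0) or mangles the
    references (SC near 0) scores near 0 overall. This is one reasonable
    choice, not necessarily the benchmark's exact formula.
    """
    return math.sqrt(prompt_following * subject_consistency)

print(overall_score(9.0, 7.5))   # strong on both axes
print(overall_score(9.5, 1.0))   # followed the prompt but lost the subject
```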

The Competition: Who was compared? 🍞 Hook: Think of a science fair with several strong teams. 🥬 What: Re-Align was compared with notable systems: BAGEL, OmniGen2, Echo-4o, Qwen-Image-Edit (multi-image version), and DreamOmni2—well-known models for in-context image generation/editing. Why it matters: Beating strong peers of similar size shows the method itself is effective, not just raw scale. 🍞 Anchor: It’s like winning against other A-team clubs, not just beginners.

The Scoreboard: Results with context 🍞 Hook: An 87% on a hard test means more than a 90% on an easy one. 🥬 What:

  • On OmniContext (a popular benchmark), Re-Align reached an overall average of about 8.21, which is like getting an A when most others get a B or B+ under the same rules.
  • On DreamOmni2Bench, which includes both generation and editing tasks (Add, Replace, Global style/attribute changes, and Local edits), Re-Align consistently scored higher PF and SC across categories than peers of similar scale. It excelled especially when multiple references or scene changes were involved—cases where confusion usually happens. Why it matters: Higher PF and SC together mean the image both obeys instructions and stays faithful to the references—a double win. 🍞 Anchor: Re-Align didn’t just add a dog; it added the right dog, in the right place, with the right style.

Surprising Findings 🍞 Hook: Sometimes a tiny trick makes a big difference. 🥬 What surprised the researchers:

  • Reasoning-Induced Diversity (making slightly different plans for the same prompt) was key for stable RL progress. Without it, rewards became noisy and less helpful.
  • The simple surrogate reward based on plan caption vs image similarity was enough to noticeably boost plan–image alignment, without training a huge task-specific judge. Why it matters: Simple, general tools—structured plans plus a generic similarity score—can go far when used thoughtfully. 🍞 Anchor: Like tasting cookie dough with a basic spoon: you don’t need a fancy tester to know it’s sweet enough.

Ablations (what breaks without each piece) 🍞 Hook: Remove a bike’s chain, and pedaling won’t move you. 🥬 What they checked:

  • Without IC-CoT (no structured plan), results dropped: more reference mix-ups and weaker instruction following.
  • Swapping IC-CoT for unstructured, long reasoning performed worse: the compact plan beat rambling free-form thoughts.
  • Adding RL alignment (GRPO) improved text–image consistency; adding diversity (RID) on top stabilized training and lifted overall scores. Why it matters: Each part of the recipe—plan, reward, RL, diversity—pulls its weight. 🍞 Anchor: The final system is strongest when all parts are included: plan first, draw second, check third, improve fourth.

Robustness to Number of References 🍞 Hook: Juggling one ball is easy; juggling four takes skill. 🥬 What: Across tasks with one to four reference images, Re-Align stayed near the top or at the top for PF and SC, showing it can juggle multiple references without mixing them up. Why it matters: Real users often bring several references; reliable performance there is crucial. 🍞 Anchor: Even with four references—person, hat, logo, and scene—the model kept roles straight and the style on target.

05Discussion & Limitations

Limitations 🍞 Hook: Even great tools have things they can’t do yet. 🥬 What it can’t do (yet):

  • Scale Gap: It doesn’t match huge, closed models trained on massive private data—so some edge cases or rare styles can still fail.
  • Text-Only Reasoning: IC-CoT is textual; adding visual step-by-step hints (like intermediate sketches or masks) might help in very tricky edits.
  • Narrow Edits Unseen in Training: Very specific tasks (e.g., matching a rare font exactly or obeying a niche color-coding rule) may be misunderstood or weakly followed.
  • Hallucination Under Extreme Complexity: With many vague instructions and many references, occasional mix-ups can appear. 🍞 Anchor: Like a strong student who still struggles with a few ultra-hard, unusual puzzles.

Required Resources 🍞 Hook: To bake a big cake, you need a big oven. 🥬 What you need: A unified multimodal model backbone, GPUs for fine-tuning (they used dozens of high-end GPUs), the Re-Align-410K dataset, and inference-time compute for high-resolution images. Why it matters: Training from scratch is heavy; reusing the method on a similar backbone lowers cost, but it’s still non-trivial. 🍞 Anchor: It’s doable in a well-equipped lab or company; lighter variants could be distilled for smaller setups.

When Not to Use 🍞 Hook: Don’t use a paint roller to write your name. 🥬 Situations to avoid:

  • Pure text-only tasks without images: simpler text-to-image systems may suffice.
  • Demands for pixel-perfect, CAD-level geometry edits: specialized tools might be better.
  • Extremely long, contradictory prompts with too many references and no clear priorities: cleaning the prompt first helps. 🍞 Anchor: If you only need “a blue circle,” this method is overkill; if you need to carefully swap items among many photos, it shines.

Open Questions 🍞 Hook: Future science fairs need new ideas. 🥬 What we still don’t know:

  • Visual Chain-of-Thought: Can we add visual intermediate steps (sketches, masks, layout diagrams) to boost alignment further?
  • Task-Specific Rewards: Would lightweight, plug-in reward models for certain edits (e.g., face identity, logo correctness) help without losing generality?
  • Broader Safety and Robustness: How to ensure no bias, misuse, or identity errors across varied cultures and edge cases? 🍞 Anchor: Next versions might write a plan, draw a rough draft, and then a final image—checked by smart, targeted rewards.

06Conclusion & Future Work

Three-Sentence Summary

  • Re-Align aligns an AI’s step-by-step plan with its final drawing for in-context image generation and editing.
  • It uses a structured plan (IC-CoT) that states a short target caption and assigns each reference image a clear role, then reinforces matches between plan and picture using a simple reward and stable RL.
  • With a large, curated dataset and a diversity trick for training, it outperforms similar-sized peers on following instructions and keeping subjects consistent.

Main Achievement

  • Turning messy, multi-image instructions into a compact, checkable plan that the generator can reliably follow—closing the gap between reasoning and image creation.

Future Directions

  • Add visual Chain-of-Thought (intermediate sketches or masks), develop plug-in rewards for tough sub-tasks (logos, faces, fonts), scale data and models, and distill the method for lighter use.

Why Remember This

  • Re-Align’s big idea is simple but powerful: make the model say clearly what it will draw and who contributes what, then reward it when the drawing matches. This blueprint-plus-reward recipe is a practical path to more trustworthy, controllable image creation and editing from complex, mixed media prompts.

Practical Applications

  • Marketing design: Combine a product from one photo with a styled background from another while preserving brand colors and logos.
  • E-commerce: Place the same model or item in different scenes to create consistent catalog images quickly.
  • Film and game concept art: Compose multiple character references into a single scene while keeping each identity and costume correct.
  • Education projects: Students can provide references and instructions to generate illustrations that match lesson themes.
  • Photo editing: Replace or add objects (e.g., swap a dress or add a backpack) using trusted references while keeping lighting and style consistent.
  • Comics and storytelling: Keep characters’ faces, outfits, and styles consistent across panels by reusing references with clear role labels.
  • Brand mockups: Insert new logos into varied scenes and ensure they match the requested material or style from references.
  • Interior and product visualization: Transfer materials (wood, metal, fabric) from a reference onto target furniture or gadgets.
  • Social media content: Remix elements from multiple images into eye-catching, on-brand posts without accidental mix-ups.
  • Rapid prototyping: Explore several structured plans (IC-CoTs) for the same idea to see diverse, high-quality concepts fast.
#In-Context Image Generation#Reference-based Image Editing#Structured Reasoning#Chain-of-Thought#IC-CoT#Reinforcement Learning#GRPO#Surrogate Reward#Rectified Flow#CLIP Similarity#Subject Consistency#Prompt Following#Multimodal Alignment#Dataset Construction#Reasoning-Induced Diversity