
OmniPSD: Layered PSD Generation with Diffusion Transformer

Intermediate
Cheng Liu, Yiren Song, Haofan Wang et al. · 12/10/2025
arXiv · PDF

Key Summary

  • OmniPSD is a new AI that can both make layered Photoshop (PSD) files from words and take apart a flat image into clean, editable layers.
  • It keeps transparency (the see‑through parts) accurate by using a special RGBA-VAE so edges and soft shadows look right.
  • For text-to-PSD, it places four pictures in a 2×2 grid (full poster, foreground, midground, background) so the model can learn how layers relate.
  • For image-to-PSD, it works step by step: extract text, remove it, extract objects, and rebuild the background, saving each piece as an RGBA layer.
  • It mixes two Diffusion Transformer models from the FLUX family: Flux-dev for making images from text and Flux-Kontext for smart editing.
  • A large 200k layered poster dataset lets the model learn real designer workflows, not just toy examples.
  • In tests, OmniPSD beats or matches strong baselines on realism, prompt alignment, and reconstruction, while uniquely outputting true layered PSDs.
  • Results show lower errors (higher PSNR/SSIM) for reconstruction and better FID/CLIP scores for generation, plus high GPT-4 judge scores.
  • Designers can change text, swap foreground objects, or recolor backgrounds instantly because layers stay clean and semantically correct.

Why This Research Matters

Graphic design runs on layers. If AI only makes flat pictures, every small change becomes hard and risky. OmniPSD brings AI into real designer workflows by creating or recovering true PSD layers with accurate transparency, so teams can localize text, swap products, and reuse assets safely. This saves hours of manual cutouts, reduces errors like halos, and keeps brand visuals consistent. It also helps non-experts make polished posters that professionals can refine. In short, it turns AI images from “pretty but fixed” into “pretty and editable,” unlocking speed without losing control.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine making a birthday poster in an app like Photoshop. You keep the photo of your friend on one layer, the confetti on another, and the text on its own layer. That way, you can fix a typo or move the confetti without ruining the photo.

🥬 The Concept (PSD and Layers): A PSD file is a picture made of stacked, editable layers, each possibly with transparency (alpha), so designers can tweak parts without touching everything else. How it works:

  1. Each element (text, icons, photos) lives on its own layer.
  2. Layers stack from background (bottom) to foreground (top).
  3. The alpha channel stores how see-through each pixel is. Why it matters: Without layers and alpha, you get a single flat picture—changing one thing (like text) is painful and messy.

🍞 Anchor: If your poster says “Saturday” but the party moves to “Sunday,” editable text layers let you fix it in seconds.

🍞 Hook: You know how a sticker can be partly transparent around its edges so it blends nicely when you stick it on a window?

🥬 The Concept (Alpha/Transparency): The alpha channel is a special fourth channel (besides red, green, blue) that controls how see-through each pixel is. How it works:

  1. 0% alpha = fully transparent; 100% alpha = fully solid.
  2. When layers stack, alpha decides what shows through.
  3. Soft shadows, glass, fog, and anti-aliased text rely on smooth alpha. Why it matters: Without alpha, edges look jagged, halos appear, and layers don’t blend naturally.

🍞 Anchor: A cloud icon with soft, fuzzy edges looks right because its alpha gently fades at the border.
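
If you like seeing the rule as code, here is a minimal NumPy sketch of the standard "over" compositing step that stacks one RGBA layer on another. The array names and the toy colors are illustrative, not taken from the paper.

```python
import numpy as np

def composite_over(top: np.ndarray, bottom: np.ndarray) -> np.ndarray:
    """Blend two RGBA images (H, W, 4), values in [0, 1], with the 'over' rule."""
    rgb_t, a_t = top[..., :3], top[..., 3:4]
    rgb_b, a_b = bottom[..., :3], bottom[..., 3:4]
    a_out = a_t + a_b * (1.0 - a_t)                                  # combined coverage
    rgb_out = (rgb_t * a_t + rgb_b * a_b * (1.0 - a_t)) / np.clip(a_out, 1e-8, None)
    return np.concatenate([rgb_out, a_out], axis=-1)

# Toy example: a half-transparent red sticker over a solid teal background.
h, w = 4, 4
background = np.zeros((h, w, 4)); background[..., :] = [0.0, 0.4, 0.4, 1.0]
sticker = np.zeros((h, w, 4)); sticker[..., :] = [1.0, 0.0, 0.0, 0.5]
flat = composite_over(sticker, background)   # what a viewer would actually display
```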

🍞 Hook: Think of artists building a painting from background sky to mid hills to front flowers.

🥬 The Concept (The World Before): Most AI image generators made beautiful but flat pictures—like a printed photo—good for looking, bad for editing. How it worked:

  1. The AI read your prompt and generated a single RGB image.
  2. If you wanted changes, you edited pixels or tried inpainting.
  3. Transparency was often guessed afterward by a separate tool. Why it failed: Post-hoc matting/segmentation often misaligned edges, broke soft glows, and produced halos, so designers couldn’t rely on it for clean PSD workflows.

🍞 Anchor: Changing the color of just the wave on a flat poster often spilled into the ocean behind it.

🍞 Hook: Imagine trying to take apart a baked cake into flour, eggs, and sugar. That’s what old methods tried to do—separate layers after everything was already mixed.

🥬 The Concept (The Problem): Designers need editable, layered outputs directly, or faithful decompositions from a flat image back into layers—with alpha intact. How it works today (before):

  1. Generate a flat image, then segment objects.
  2. Guess transparency with matting tools.
  3. Try to rebuild layers. Why it breaks: Errors pile up between steps; soft effects, shadows, and overlapping elements are especially fragile.

🍞 Anchor: A glow around text becomes a chunky outline after post-processing instead of a smooth, editable effect layer.

🍞 Hook: Picture a teacher who can both write a neat new essay and also take a messy essay and separate sentences into topics and subtopics.

🥬 The Concept (The Gap): We lacked a single system that could do both text-to-layered-PSD creation and image-to-PSD decomposition while truly understanding transparency and how layers relate. How it should work:

  1. Learn alpha-aware visual features.
  2. Understand layer order and occlusion.
  3. Produce or recover clean, semantically labeled layers. Why it matters: This makes AI useful for real design work—quick edits, brand-safe updates, and clean asset reuse.

🍞 Anchor: A marketing team can spin out localized posters by swapping text layers and a few foreground icons, leaving everything else untouched.

🍞 Hook: Think of a librarian who not only gives you the right book from a description but can also take a mixed box of pages and sort them back into the correct chapters.

🥬 The Concept (OmniPSD’s Promise): OmniPSD is a unified Diffusion Transformer framework that both generates layered PSDs from text and decomposes flat images back into layers, using a transparency-savvy autoencoder and in-context learning. How it works:

  1. A shared RGBA-VAE learns to represent images with alpha.
  2. A text generator (Flux-dev) makes 2×2 grid outputs for multiple layers at once.
  3. An editor (Flux-Kontext) iteratively extracts text/objects and restores background. Why it matters: You get editable, designer-grade layers—accurate edges, realistic blending, and consistent structure—from either words or an input image.

🍞 Anchor: Type “minimalist eco poster…” and receive separate, editable layers: teal background, Earth with clouds, a blue wave, small plants, and bubble accents.

02 Core Idea

🍞 Hook: You know how assembling LEGO is easier when you see the picture on the box and how the pieces connect?

🥬 The Concept (Aha! Moment): Train one transparency-aware system that learns layers together—by letting them “see” each other in a 2×2 grid—and also learns to peel layers off a flat image step by step. How it works:

  1. Use a Diffusion Transformer backbone for powerful image reasoning.
  2. Share an RGBA-VAE so alpha (transparency) is treated as a first-class signal.
  3. For text-to-PSD, generate a 2×2 grid: full poster, foreground, midground, background, so spatial attention aligns them.
  4. For image-to-PSD, iteratively extract text/objects and rebuild the background via flow-matching in latent space. Why it matters: Without shared alpha-aware features and in-context layer reasoning, you get messy edges, mismatched layers, and broken layouts.

🍞 Anchor: The model creates a neat Earth layer with soft cloud halos that perfectly sit over a teal background, all editable.

Three analogies:

  1. Orchestra: Each section (strings = background, brass = midground, soloist = foreground) follows a shared score (grid + prompts) so the final music (poster) is harmonious.
  2. Jigsaw: The 2×2 grid shows the whole picture and key pieces, helping the model learn exactly how parts fit.
  3. Archaeology: The decomposition method brushes away text and objects layer by layer, revealing the clean background like uncovering buried artifacts.

Before vs After:

  • Before: Flat images, after-the-fact segmentation, frequent halo artifacts, and weak control of layout/occlusion.
  • After: Direct, multi-layer PSDs from text; faithful layer recovery from a single image; crisp transparency and consistent structure.

Why it works (intuition):

  • Diffusion Transformers are great at global attention—perfect for understanding relationships across a grid of layers.
  • A shared RGBA-VAE ensures alpha edges and soft effects are encoded consistently for both generation and editing.
  • Iterative, in-context editing mirrors how designers work: remove text, patch background, then handle objects.

Building blocks (each with a mini-sandwich):

🍞 Hook: Imagine a sculptor who refines a statue gradually. 🥬 The Concept (Diffusion Transformer): A model that iteratively transforms noisy representations into clean images using attention to look across the whole canvas. How it works: Look at all tokens, decide what matters, refine step by step. Why it matters: Without global attention, layers won’t stay consistent. 🍞 Anchor: It keeps the blue wave aligned with the ocean theme while respecting the Earth’s position.
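
To make "refine step by step" concrete, here is a hedged sketch of a simple Euler-style sampling loop over a model that predicts how to move the latent at each step. The `model` call signature is a hypothetical stand-in, not the actual FLUX interface.

```python
import torch

@torch.no_grad()
def refine_step_by_step(model, noise: torch.Tensor, condition: torch.Tensor, steps: int = 28):
    """Illustrative Euler loop: start from noise and refine toward a clean latent grid."""
    x = noise                                     # pure Gaussian noise in latent space
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1, 1, 1), i * dt)
        velocity = model(x, t, condition)         # attention looks across the whole canvas/grid
        x = x + dt * velocity                     # one small refinement step
    return x                                      # decode with the RGBA-VAE afterwards
```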

🍞 Hook: Think of special glasses that let you see both color and see-through parts. 🥬 The Concept (RGBA-VAE): An autoencoder that compresses and reconstructs images with transparency so edges, glows, and shadows survive. How it works: Encode RGBA into a latent code; decode it back faithfully. Why it matters: Without it, alpha breaks and layers look fake. 🍞 Anchor: The cloud ring around Earth keeps its soft fade when exported as a layer.
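
As a rough picture of what "encode RGBA, decode it back" means, here is a toy PyTorch autoencoder that keeps all four channels end to end. It is a plain autoencoder sketch (no variational sampling) with made-up layer sizes, not the paper's RGBA-VAE.

```python
import torch
import torch.nn as nn

class TinyRGBAAutoencoder(nn.Module):
    """Toy stand-in for an RGBA autoencoder: 4 channels in, 4 channels out."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgba: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(rgba)      # alpha is compressed together with the colors
        return self.decoder(latent)      # reconstruction keeps all four channels

model = TinyRGBAAutoencoder()
layer = torch.rand(1, 4, 64, 64)         # one RGBA layer, values in [0, 1]
reconstruction = model(layer)            # soft alpha edges should survive the round trip
```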

🍞 Hook: You know how a storyboard shows the whole story and key scenes side by side? 🥬 The Concept (2×2 Grid + Hierarchical Prompts): Place full poster, foreground, midground, background together so attention learns their relationships; describe each one with its own caption. How it works: Joint generation in one pass with layer-specific text. Why it matters: Without shared context, layers drift off-theme. 🍞 Anchor: The wave’s light blue and the background teal stay harmonious.

🍞 Hook: Like peeling stickers off a book cover without tearing the paper. 🥬 The Concept (Iterative Decomposition via Flow Matching): Learn a smooth path from input image to each target layer; alternate extraction and background restoration. How it works: Predict a flow that moves latents toward the desired layer; repeat per element. Why it matters: One-shot separation often confuses overlaps. 🍞 Anchor: Remove the text first, patch the background cleanly, then pull out plants and bubbles as separate layers.
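
A common way to train such a flow (rectified-flow style) is sketched below: interpolate between noise and the target-layer latent, and teach the network to predict the straight-line velocity between them. The `velocity_net` call signature is an assumption for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_matching_loss(velocity_net: nn.Module,
                       target_latent: torch.Tensor,
                       condition: torch.Tensor) -> torch.Tensor:
    """One training step of a rectified-flow style objective (illustrative sketch)."""
    noise = torch.randn_like(target_latent)              # x0: pure noise
    t = torch.rand(target_latent.shape[0], 1, 1, 1)      # random time in (0, 1)
    x_t = (1.0 - t) * noise + t * target_latent          # straight-line interpolation
    true_velocity = target_latent - noise                # constant velocity along that line
    pred_velocity = velocity_net(x_t, t, condition)      # model sees noisy latent + condition
    return F.mse_loss(pred_velocity, true_velocity)
```

At inference time, integrating the learned velocity field (as in the earlier Euler-loop sketch) moves the latent deterministically from the conditioned input toward the desired layer.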

🍞 Hook: Imagine snapping on small adapters to a big robot to do specialized chores. 🥬 The Concept (LoRA Adapters): Lightweight add-ons that specialize the editor for text extraction, object extraction, and erasure. How it works: Fine-tune only tiny rank-limited weights. Why it matters: Without LoRA, training is heavy and less flexible. 🍞 Anchor: Switch from removing text to restoring background by swapping adapters.
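
Here is a minimal sketch of the LoRA idea itself: a frozen weight plus a small trainable low-rank update. The rank and dimensions are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a small trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # big weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)     # A: d -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)      # B: r -> d
        nn.init.zeros_(self.up.weight)                                # start as a no-op adapter
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

adapter = LoRALinear(nn.Linear(768, 768))   # e.g., wrapped around one attention projection
```

Swapping adapters then just means loading a different pair of small `down`/`up` matrices while the base model stays untouched.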

03 Methodology

At a high level: Input (text or image) → RGBA-VAE encode → Diffusion Transformer (generation or editing) with in-context cues → RGBA-VAE decode → RGBA layers → Assemble into PSD.

Step-by-step (Text-to-PSD):

  1. Input: A hierarchical prompt (sketched in code after this step list) like:
  • poster: minimalist eco poster, deep teal background, cloud-ringed Earth centered, light blue wave on top, plants and bubbles as accents
  • foreground: circular Earth with clouds
  • midground: wavy, light blue ocean pattern
  • background: solid deep teal fill
  Why this exists: Layer-specific text gives each panel a job; without it, elements blur together. Example: The word “wave” only guides the midground panel, preventing it from swallowing the Earth.
  2. 2×2 Grid In-Context Setup: Arrange panels as [full poster | foreground; midground | background]. Why this exists: Spatial self-attention lets panels “see” each other; without it, the foreground may not match the full layout. Example: The foreground Earth aligns with the same center found in the top-left full poster.

  3. RGBA-VAE Encode: Turn the 2×2 RGBA grid into latent tokens. Why this exists: Keeps transparency precise while compressing data; without it, alpha becomes a guess. Example: The cloud ring retains feathered edges in latent form.

  4. Diffusion Transformer Generation (Flux-dev): In one forward pass, the model denoises the latent grid guided by the captions. Why this exists: Joint generation avoids stage-mismatch; without joint attention, colors and positions can drift. Example: The teal background, blue wave, and Earth colors stay consistent.

  5. RGBA-VAE Decode: Recover each panel as an RGBA image. Why this exists: Bring back pixel-accurate alpha and color; without it, exportable layers look off. Example: Exporting to PSD preserves transparent halos and soft shadows.

  6. Build PSD: Save background, midground, foreground as separate layers; render text last via OCR/font pipeline if needed. Why this exists: Designers need editable text objects; if text is baked into pixels, typos are hard to fix. Example: You can swap the typeface or change “Eco Day” to another language.
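
Below is a minimal sketch of what steps 1 and 2 could look like in code: a per-panel caption dictionary and a helper that tiles four RGBA panels into the 2×2 canvas. The dictionary keys and the `tile_2x2` helper are illustrative, not the paper's actual interface.

```python
import numpy as np

# Hypothetical hierarchical prompt: one caption per panel of the 2x2 grid.
hierarchical_prompt = {
    "poster": ("minimalist eco poster, deep teal background, cloud-ringed Earth centered, "
               "light blue wave on top, plants and bubbles as accents"),
    "foreground": "circular Earth with clouds",
    "midground": "wavy, light blue ocean pattern",
    "background": "solid deep teal fill",
}

def tile_2x2(poster, foreground, midground, background):
    """Arrange four RGBA panels (H, W, 4) as [poster | foreground; midground | background]."""
    top = np.concatenate([poster, foreground], axis=1)
    bottom = np.concatenate([midground, background], axis=1)
    return np.concatenate([top, bottom], axis=0)

panels = [np.zeros((256, 256, 4), dtype=np.float32) for _ in range(4)]
grid = tile_2x2(*panels)   # one (512, 512, 4) canvas the model generates jointly
```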

The secret sauce (Text-to-PSD):

  • Spatial in-context learning via the 2×2 grid makes cross-layer coherence emerge naturally from standard attention—no extra custom cross-layer module.
  • Hierarchical captions are a simple but powerful control signal that keeps roles clear.

Step-by-step (Image-to-PSD):

  1. Input: A single flattened poster image. Why this exists: Real designs often start from a flat file or a screenshot; without decomposition, editing is limited. Example: You receive a client’s flat poster and must localize text and replace a sticker.

  2. RGBA-VAE Encode: Convert the image to latent tokens that include alpha capacity. Why this exists: Alpha-aware encoding supports crisp boundaries on extraction; without it, edges fray. Example: Fine text strokes remain clean after extraction.

  3. Iterative Experts with LoRA on Flux-Kontext (a pseudocode sketch of this loop follows the step list):

  • Text Extraction Expert: Finds text regions and outputs an RGBA text layer (transparent outside letters). Why: Text’s sharp edges and outlines need special handling; without it, text blends into the background. Example: The word “ECO” becomes an editable, cut-out layer.

  • Text Erasure / Background Restoration Expert: Removes the extracted text and reconstructs what was behind it. Why: Reveals deeper content and keeps the background usable; without it, you leave blank holes. Example: After erasing “ECO,” the teal background is smoothly filled.

  • Object Extraction Expert: Pulls out foreground items (Earth, plants, bubbles) as RGBA layers. Why: Complex occlusions require careful separation; without it, shadows and glows break. Example: Plants lift cleanly with soft edges.

  • Object Erasure / Background Restoration Expert: Patches the uncovered areas after each extraction. Why: Keeps the composite consistent layer after layer; skipping this causes ghosts and seams. Example: Where the Earth sat, the background reappears without a halo.

  4. Flow Matching Formulation: Each expert learns a smooth vector field in latent space that moves from the conditioned input toward the target layer. Why this exists: Deterministic ODE paths lead to stable, fast inference; without it, sampling becomes noisy and slow. Example: The path from full poster latent to “foreground Earth” latent is direct and repeatable.

  5. Editable Text Recovery: OCR → font retrieval → vector re-rendering. Why this exists: Designers need true text objects, not just pixels; without it, you can’t change wording cleanly. Example: Swap the font to the brand’s official typeface in one click.

  6. Build PSD: Stack {foreground layers..., background} and vector text layers in correct order. Why this exists: Final PSD mirrors real workflows; without correct stacking, occlusions look wrong. Example: Clouds sit above the Earth, bubbles above the wave, text on top.
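
Putting the image-to-PSD steps together, here is a hedged pseudocode sketch of the alternating extract→erase loop. Every expert call (`extract_text`, `extract_object`, `erase_and_restore`) is a hypothetical placeholder for the corresponding LoRA expert, not a real API.

```python
def decompose_poster(flat_image, experts, max_objects: int = 8):
    """Sketch of the alternating extract -> erase cycle (all expert calls are placeholders)."""
    layers = []
    current = flat_image

    # 1) Peel off text first: sharp strokes are easiest to lift before anything else moves.
    text_layer = experts.extract_text(current)            # RGBA layer, transparent outside letters
    layers.append(("text", text_layer))
    current = experts.erase_and_restore(current, text_layer)   # fill in what was behind the text

    # 2) Then peel off foreground objects one at a time, restoring the background after each.
    for _ in range(max_objects):
        object_layer = experts.extract_object(current)
        if object_layer is None:                           # nothing left to extract
            break
        layers.append(("object", object_layer))
        current = experts.erase_and_restore(current, object_layer)

    layers.append(("background", current))                 # what remains is the clean background
    return layers                                          # stack in reverse order to build the PSD
```

Recompositing the returned layers in reverse order should reproduce the input image, which is what the reconstruction metrics in the next section check.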

The secret sauce (Image-to-PSD):

  • Alternating extract→erase cycles mirror human editing logic and prevent compounding artifacts.
  • Shared RGBA-VAE + LoRA experts keep all steps speaking the same “alpha-aware language.”

04 Experiments & Results

The test: Two main tracks.

  • Text-to-PSD: Given hierarchical prompts, generate layered outputs (foreground(s) + background) and evaluate realism (FID), text alignment (CLIP), and structural coherence (GPT-4 judge).
  • Image-to-PSD: Given a flat poster, decompose and then recomposite; evaluate reconstruction error (MSE/PSNR/SSIM), realism (FID/CLIP), and structural coherence (GPT-4 judge).

The competition: Compared to LayerDiffuse SDXL and GPT-Image-1 for text-to-PSD; for image-to-PSD, there is no directly comparable prior method, so baselines include commercial RGBA-capable tools and a proxy pipeline that generates on a white background and estimates masks with SAM2 (which is not alpha-aware).

Scoreboard with context:

  • Text-to-PSD (Table 2): OmniPSD achieves FID 30.43 (lower is better), CLIP 37.64 (higher is better), and GPT-4 Score 0.90. That’s like getting an A when others get a B: GPT-Image-1 (FID 53.21, CLIP 35.59, GPT-4 0.84) and LayerDiffuse SDXL (FID 89.35, CLIP 24.78, GPT-4 0.66).
  • Image-to-PSD (Table 1): OmniPSD gets PSNR 24.0 and SSIM 0.952 with CLIP-I 0.959 and GPT-4 0.92. Think of PSNR as how close your reassembled puzzle is to the original; higher means fewer visible mistakes (a small PSNR sketch follows this list). Competing setups show notably worse reconstruction (e.g., PSNR ~16–19, SSIM 0.65–0.82, GPT-4 0.64–0.86).
  • Component-by-component (Table 3): Text extraction/erasure and object extraction/erasure each perform strongly (e.g., text erasure PSNR 26.37; full pipeline PSNR ~23.98), indicating that the alternating extract→erase steps work as intended.
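
For readers who want the headline metric made concrete, PSNR is just a log-scaled mean-squared error between the original and the recomposited image; a minimal sketch:

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the recomposite is closer to the original."""
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```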

Transparency and VAE ablation:

  • RGBA-VAE (Table 4) dramatically improves reconstruction quality over alternatives: PSNR 32.5, SSIM 0.945, LPIPS 0.0348 (lower is better), beating Alpha-VAE and others by a wide margin. This means edges, soft shadows, and semi-transparent text are preserved sharply.

Prompt structure ablation:

  • Removing layer-specific prompts hurts: FID degrades to 38.56 and CLIP to 34.31 vs 30.43/37.64 with full OmniPSD. Clear roles per layer matter.

User study highlights:

  • Participants preferred OmniPSD for layering reasonableness and overall quality (e.g., 4.39–4.72 scores vs baselines around 3.3–4.3), praising “clear layer separation” and “realistic transparency.”

Surprising findings:

  • The simple 2×2 grid provides enough in-context signal that no special cross-layer module is needed; attention alone learns harmony across layers.
  • Rendering text last (vectorized) preserves typography much better than trying to generate pixel-text in the diffusion pass.

Takeaway: OmniPSD isn’t just making nice pictures; it’s making designer-ready parts that fit together cleanly and can be edited safely.

05 Discussion & Limitations

Limitations:

  • Style coverage: Trained on layered posters, it may be less perfect on wildly different domains (e.g., complex UI mockups, technical CAD drawings) without adaptation.
  • Fine typography effects: Extremely intricate text effects (emboss, displacement maps) may still need manual tweaking after recovery.
  • Small objects: Tiny, low-contrast items can be harder to separate cleanly in one shot and may benefit from iterative zoomed passes.
  • Layer count choices: The 2×2 grid encourages one background and a small number of foreground layers; many-layer scenes might require multiple passes.

Required resources:

  • A GPU setup for diffusion transformers; LoRA fine-tuning for specialized adapters.
  • The RGBA-VAE weights and the FLUX backbones (Flux-dev and Flux-Kontext).
  • The 200k layered dataset (or similar) for best performance.

When NOT to use:

  • If you only need a single flat image with no future edits, a simpler text-to-image model might be faster.
  • For vector-only deliverables (pure SVG logos), use native vector generation tools; OmniPSD focuses on RGBA raster layers plus vectorized text.
  • Ultra-precise scientific diagrams where exact geometry is critical may be better served by programmatic or vector-first tools.

Open questions:

  • Scaling to arbitrary numbers of layers: How to extend beyond 2×2 without losing attention focus?
  • Rich effect layers: Can we recover editable Photoshop effect stacks (e.g., separate shadow, glow, blend modes) rather than just RGBA results?
  • Interactive loops: How best to incorporate user hints (scribbles, clicks) mid-generation to adjust layer roles on the fly?
  • Cross-domain generalization: What data or training tricks help transfer from posters to product mockups, comics, or infographics without retraining from scratch?

06 Conclusion & Future Work

Three-sentence summary: OmniPSD is a unified Diffusion Transformer framework that both generates layered PSDs from text and decomposes flat images back into clean, alpha-accurate layers. It uses a shared RGBA-VAE and clever in-context setups—a 2×2 grid for generation and iterative extract→erase cycles for decomposition—to keep layers semantically consistent and transparencies crisp. Experiments, ablations, and user studies show OmniPSD delivers designer-ready layers that outperform prior approaches on realism, alignment, and reconstruction.

Main achievement: Turning layered, transparency-aware design from a fragile, multi-stage workaround into a single, reliable system that works in both directions—text→PSD and image→PSD.

Future directions:

  • Support for more layers and richer effect decomposition (shadows, glows, blend modes as separate editable items).
  • Tighter vector integration to output hybrid PSD/SVG projects with fully editable shapes.
  • Interactive, hint-driven generation and decomposition to meet designer intent faster.

Why remember this: OmniPSD shows that with the right shared representation (RGBA-VAE) and in-context cues (2×2 grid and iterative flows), AI can finally speak the language of professional design—layers, transparency, and structure—so edits are safe, assets are reusable, and creativity speeds up.

Practical Applications

  • Fast localization: Replace text layers to translate posters into multiple languages while keeping layout intact.
  • Brand updates: Swap logos or product shots on dedicated foreground layers without touching backgrounds.
  • A/B testing: Generate several foreground variations (different icons or colors) and compare quickly.
  • Template creation: Make layered starter PSDs from prompts for marketing or event graphics.
  • Asset libraries: Extract clean RGBA stickers (plants, bubbles, icons) from existing flat posters for reuse.
  • Error fixing: Remove typos by re-rendering vector text layers instead of repainting pixels.
  • Style adaptation: Recolor or texture backgrounds as separate layers to match seasonal campaigns.
  • Education: Teach layer-based design by showing decompositions of professional posters.
  • Prepress cleanup: Recover soft shadows and edges with proper alpha for high-quality print.
  • Rapid mockups: Compose mood boards with editable layers that art directors can fine-tune.
Tags: OmniPSD · layered PSD generation · RGBA-VAE · Diffusion Transformer · Flux-dev · Flux-Kontext · in-context learning · flow matching · spatial attention · text-to-PSD · image-to-PSD decomposition · alpha channel · transparent layers · hierarchical prompts · 2×2 grid