
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Intermediate
Xin He, Longhui Wei, Jianbo Ouyang et al. · 12/4/2025
arXiv · PDF

Key Summary

  • EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.
  • Its big trick is shrinking images into far fewer tokens (pieces) using a 32x autoencoder, so the model runs faster and cheaper without losing much quality.
  • Because both the understanding side and the generation side shrink images by the same amount, EMMA can fuse their information by stacking channels instead of piling up extra tokens.
  • A shared-and-decoupled network lets early layers share knowledge across tasks while later layers specialize for either understanding or generation.
  • A small Mixture-of-Experts in the vision encoder adds a STEM expert for charts, math, and documents, boosting accuracy with only a tiny parameter increase.
  • For understanding tasks EMMA learns by next-token prediction; for generation it uses flow matching with velocity prediction, so both sides train smoothly together.
  • On tough tests EMMA-4B beats larger unified models like BAGEL-7B in both accuracy and efficiency, and reaches 0.91–0.93 on GenEval for text-to-image quality.
  • In image editing, EMMA uses about one-fifth the visual tokens of some rivals yet stays competitive and preserves subject consistency.
  • The model shows surprising skills, like following Chinese editing prompts even without Chinese editing data, thanks to strong multilingual understanding.
  • EMMA demonstrates that careful design (32x compression, channel-wise fusion, shared-then-specialized layers, light MoE) can make unified multimodal models both powerful and efficient.

Why This Research Matters

EMMA makes powerful image-and-text AI faster and cheaper by shrinking how much it has to read per image, which means more people can use it on everyday devices. It unifies reading, drawing, and fixing images in one brain, so skills transfer: what it learns about understanding charts can help it generate better diagrams. Businesses get smarter tools that keep faces and identities consistent when editing, improving trust and safety in media. Students and teachers benefit from clearer document reading and chart Q&A while also getting high-quality illustrations and step-by-step visual explanations. Designers and photographers gain quick, faithful edits without juggling multiple apps. As evaluation catches up (e.g., measuring identity consistency), EMMA’s balanced design will matter even more. Overall, it shows that clever architecture choices can beat just making models bigger.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine your school backpack trying to carry every subject’s books at once—math, art, science, and reading. If you just cram everything in, the backpack gets heavy and messy. But if you organize smartly, the same backpack can carry it all with ease.

🥬 The World Before: AI had separate backpacks for different subjects. One AI could read and understand pictures and text (understanding), another could create images from words (generation), and another could edit images (editing). Running each one separately was slow, expensive, and hard to coordinate. People started building unified multimodal models—like one backpack for all subjects—so understanding, generation, and editing could help each other. But early unified models hit two big snags: too many visual tokens (like too many tiny pieces to process) and a tug-of-war during training because understanding and generation need different skills.

🍞 Anchor: Think of a robot class helper that can read a worksheet (understand), draw a diagram (generate), and fix a mislabeled axis (edit). With one brain, it should be easy—but only if that brain is organized.

🍞 Hook: You know how squeezing a big sponge into a tiny cup makes it much easier to carry? That’s like compression for images.

🥬 The Problem: Previous unified models often used different ways and amounts to shrink images on the understanding side versus the generation side. Understanding encoders (like SigLIP families) chopped images into many tokens and sometimes reduced them further, while generation encoders (autoencoders) usually compressed less (like 8x), leaving more tokens. When the model tried to combine both sets, it had to glue them token-by-token—ballooning the context length, slowing everything, and making training imbalanced (one side overwhelmed the other).

🍞 Anchor: If math homework is on index cards and art homework is in a sketchbook, you can’t stack them neatly. Matching sizes makes organizing simple.

🍞 Hook: Picture a team where some parts of a job are the same for everyone, but other parts need specialists.

🥬 Failed Attempts: Three main directions emerged. (1) Unified architecture formats: fully tie understanding and generation in one end-to-end model. This unlocked synergy but caused token explosion and conflicting needs. (2) Unified task formats: bolt a generator onto an understanding model with bridges. Easier to build, but weaker fusion and slower hand-offs. (3) Unified learning paradigms: train both with the same loss style (e.g., only next-token or only diffusion). Cleaner math, but one-size-fits-all ignores that generation loves high-frequency details while understanding cares most about semantics.

🍞 Anchor: It’s like trying to teach both painting and algebra with only multiple-choice questions—or only paint-by-numbers. You miss what each really needs.

🍞 Hook: Imagine if both math problems and art projects used the same paper size so they could be filed together perfectly.

🥬 The Gap: What if both the understanding and generation branches compressed images by exactly the same ratio? Then their tokens could line up by spatial position. Instead of adding more tokens, you could stack channels—keeping the token count low while mixing the understanding branch’s meaning with the generation branch’s details. Also, what if the network shared early layers (general skills like reading instructions) but split later for specialization (understanding semantics vs. generating crisp details)? And what if a tiny STEM expert jumped in for charts and documents?

🍞 Anchor: Same-sized pages fit the same binder; early pages share the same header; later sections branch off for math vs. art; and there’s a visiting tutor for tricky STEM pages.

🍞 Hook: Why does this matter to you? Imagine photo apps that run fast on your phone, homework helpers that read charts and generate diagrams instantly, and editing tools that keep people’s faces consistent while changing backgrounds.

🥬 Real Stakes: Reducing token counts saves memory and speeds up training and inference, lowering costs. Balanced training means the model won’t be great at one thing but clumsy at another. Stronger perception (via the small STEM expert) helps with real tasks like reading receipts, understanding graphs, and solving visual math problems. Better editing that preserves subject consistency is crucial for safe media, design work, and education.

🍞 Anchor: A teacher who can read, draw, and fix your poster—fast, neatly, and accurately—helps you finish projects on time without wasting supplies.

02 Core Idea

🍞 Hook: You know how packing cubes let you fit more into a suitcase by keeping everything the same shape and size? Then you can stack cubes side-by-side without making the suitcase bigger.

🥬 The Aha Moment (one sentence): Make both the understanding and generation branches compress images by the same 32x ratio so their tokens align, then fuse them channel-wise, while a shared-then-specialized network and a tiny Mixture-of-Experts boost both accuracy and efficiency.

🍞 Anchor: Same-sized cubes slide neatly into the suitcase; early packing rules are shared, later choices specialize for shoes vs. shirts; and you call a sock-folding expert for tricky pairs.

— Multiple Analogies —

  1. Orchestra: Same tempo (32x compression) lets strings (understanding) and brass (generation) play in sync; you layer parts (channels) instead of adding more musicians (tokens); early rehearsals are joint, later sections practice their solos; a guest violinist (STEM expert) helps with a hard passage.
  2. Cooking: Chop all ingredients to the same dice (32x). Instead of more bowls (tokens), you stack flavors in layers (channels). The kitchen shares basic prep (shared layers) but has separate stations for baking vs. grilling (decoupled layers), plus a pastry chef (STEM expert) for delicate desserts.
  3. Sports: Everyone trains basic fitness together (shared shallow layers), but sprinters and marathoners split for specialized drills (decoupled deep layers). Using the same lap length (32x compression) makes relay handoffs clean (channel fusion). A coach for hurdles (STEM expert) jumps in when needed.

— Why It Works (intuition) —

  • Same compression aligns spatial grids, so you fuse meaning (understanding) and detail (generation) without multiplying token counts.
  • Channel-wise fusion keeps context short, which speeds learning and reduces confusion.
  • Shared shallow layers transfer general language–vision skills; decoupled deep layers let each task master what it needs most (semantics vs. high-frequency detail).
  • A tiny MoE gives specialized perception for STEM images without bloating the whole model.

— Building Blocks (with Sandwich explanations for each key concept) —

🍞 Hook: Imagine a Swiss Army knife—many tools fold into one handle. 🥬 Unified Multimodal Architecture: It’s one model that can read images and text, generate images from text, and edit images.

  • How it works:
    1. A vision encoder turns images into visual tokens.
    2. A language model processes text and visual tokens together.
    3. A generator decodes tokens back into images (for generation/editing).
  • Why it matters: Without one unified brain, you’d juggle multiple apps that don’t share skills or speed. 🍞 Anchor: One assistant that can caption a photo, draw a new scene, and tweak colors—no app switching.

🍞 Hook: You know those vacuum bags that shrink clothes for travel? 🥬 Autoencoder: A model that compresses an image into a tiny code and then reconstructs it.

  • How it works: (1) Encoder squeezes the image; (2) Latent tokens carry key info; (3) Decoder rebuilds the image.
  • Why it matters: Fewer tokens mean faster, cheaper training and inference. 🍞 Anchor: A 1024×1024 image becomes a small grid of codes you can store and later unzip into a picture.
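To make the squeeze-and-rebuild idea concrete, here is a toy PyTorch autoencoder. It is a minimal sketch rather than the paper's DCAE: five stride-2 convolutions give 32x per-dimension compression, the decoder mirrors them, and all layer sizes are made up for illustration.

```python
# Toy convolutional autoencoder (NOT the paper's DCAE): halving the spatial
# resolution five times (2^5 = 32) turns a 1024x1024 image into a 32x32 grid
# of latent codes that a decoder can expand back into pixels.
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        chans = [3, 32, 64, 128, 256, latent_channels]
        # Encoder: five stride-2 convolutions -> 32x per-dimension compression.
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1), nn.SiLU())
            for i in range(5)
        ])
        # Decoder: five stride-2 transposed convolutions back to pixel space.
        rev = list(reversed(chans))
        self.decoder = nn.Sequential(*[
            nn.Sequential(nn.ConvTranspose2d(rev[i], rev[i + 1], 4, stride=2, padding=1), nn.SiLU())
            for i in range(5)
        ])

    def forward(self, x):
        latent = self.encoder(x)      # (B, latent_channels, H/32, W/32) compact code
        recon = self.decoder(latent)  # (B, 3, H, W) toy reconstruction
        return latent, recon

x = torch.randn(1, 3, 1024, 1024)
latent, recon = ToyAutoencoder()(x)
print(latent.shape)  # torch.Size([1, 16, 32, 32]) -> 32x32 = 1024 latent positions
```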

🍞 Hook: Think of squeezing a sponge—less size, same sponge. 🥬 Compression Ratio: How much we shrink data size when encoding.

  • How it works: 32x compression shrinks each spatial dimension by a factor of 32, so the token grid has 1/1024 as many positions as the pixel grid.
  • Why it matters: High compression slashes token counts—and cost—if quality stays good. 🍞 Anchor: A 1024×1024 image → about 1024 tokens at 32x, not thousands.
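The token savings are easy to check with back-of-envelope arithmetic; the helper below is purely illustrative.

```python
# Token counts for a 1024x1024 image at different per-dimension compression ratios.
def visual_tokens(height: int, width: int, ratio: int) -> int:
    return (height // ratio) * (width // ratio)

for ratio in (8, 16, 32):
    print(f"{ratio}x -> {visual_tokens(1024, 1024, ratio)} tokens")
# 8x  -> 16384 tokens
# 16x ->  4096 tokens
# 32x ->  1024 tokens   (the grid EMMA works with)
```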

🍞 Hook: Stacking toppings on a burger adds flavor without making it wider. 🥬 Channel-wise Concatenation: Fuse understanding tokens and generation tokens by stacking channels at the same spatial positions.

  • How it works: (1) Match grids using same 32x; (2) Stack features along channels; (3) Keep token count the same.
  • Why it matters: You get both meaning and detail without exploding the context length. 🍞 Anchor: One token per spot, but with extra layers of info inside—like a stuffed burger patty.
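A tiny tensor-shape sketch shows why channel-wise fusion keeps the context short. The feature dimensions are made up, and a real model would project the wider fused token back to the model dimension with an adapter.

```python
# Channel-wise vs. token-wise fusion of two aligned 32x32 token grids
# (illustrative shapes only, not EMMA's actual feature dimensions).
import torch

batch, tokens, dim = 1, 32 * 32, 1024           # 1024 tokens per image
und_feats = torch.randn(batch, tokens, dim)     # semantic features (Und-Enc)
gen_feats = torch.randn(batch, tokens, dim)     # detail features (Gen-Enc)

token_wise = torch.cat([und_feats, gen_feats], dim=1)     # doubles the sequence
channel_wise = torch.cat([und_feats, gen_feats], dim=-1)  # same sequence, wider tokens

print(token_wise.shape)    # torch.Size([1, 2048, 1024]) -> longer context
print(channel_wise.shape)  # torch.Size([1, 1024, 2048]) -> same context length
# An adapter would then map each 2048-d fused token back to the model width,
# keeping the token count at 1024.
```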

🍞 Hook: In a classroom, everyone learns basics together, then splits into reading or art groups. 🥬 Shared-and-Decoupled Network: Early layers are shared across tasks; deeper layers specialize for understanding or generation.

  • How it works: (1) Share shallow attention (e.g., QK) to transfer general skills; (2) Keep task-specific parts (e.g., V projections) to preserve independence; (3) Fully decouple deeper layers.
  • Why it matters: Sharing helps tasks teach each other; decoupling prevents one task from hurting the other. 🍞 Anchor: A student takes shared language lessons, then chooses advanced literature or design studio.
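Here is one hedged way the shared-then-specialized idea could look inside a single attention layer, with Q/K projections shared and V plus the output projection kept per task. The module and its names are hypothetical, not EMMA's actual code.

```python
# Sketch of a shared-QK, task-specific-V attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQKAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)   # shared between tasks
        self.k_proj = nn.Linear(dim, dim)   # shared between tasks
        self.v_proj = nn.ModuleDict({       # task-specific values
            "und": nn.Linear(dim, dim), "gen": nn.Linear(dim, dim)})
        self.out_proj = nn.ModuleDict({     # task-specific outputs
            "und": nn.Linear(dim, dim), "gen": nn.Linear(dim, dim)})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        b, n, d = x.shape
        def split(t):  # (B, N, D) -> (B, heads, N, head_dim)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = split(self.q_proj(x)), split(self.k_proj(x))
        v = split(self.v_proj[task](x))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out_proj[task](out)

layer = SharedQKAttention()
tokens = torch.randn(2, 16, 512)
print(layer(tokens, task="und").shape, layer(tokens, task="gen").shape)
```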

🍞 Hook: Some doctors are generalists; others are specialists. 🥬 Mixture of Experts (MoE): A small router sends STEM images to a STEM expert encoder; others use a versatile expert.

  • How it works: (1) Router predicts image type; (2) Routes to STEM or general expert; (3) STEM expert is only tuned at the end with STEM data.
  • Why it matters: Specialized perception boosts accuracy on charts, math, and documents with tiny overhead. 🍞 Anchor: A triage nurse sends an X-ray case to a radiologist but a sprain to a general doctor.
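Below is a toy version of this routing logic, with a tiny linear router and two stand-in experts. The paper's experts are full vision encoders, so treat this purely as a sketch of the control flow.

```python
# Hard routing between a general expert and a STEM expert, per image.
import torch
import torch.nn as nn

class TinyVisionRouter(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.router = nn.Linear(feat_dim, 2)   # 0 = general, 1 = STEM
        self.experts = nn.ModuleList([nn.Linear(feat_dim, feat_dim) for _ in range(2)])

    def forward(self, pooled_feat: torch.Tensor) -> torch.Tensor:
        choice = self.router(pooled_feat).argmax(dim=-1)   # one expert per image
        out = torch.empty_like(pooled_feat)
        for idx, expert in enumerate(self.experts):
            mask = choice == idx
            if mask.any():
                out[mask] = expert(pooled_feat[mask])
        return out

feats = torch.randn(4, 256)             # pooled features for 4 images
print(TinyVisionRouter()(feats).shape)  # torch.Size([4, 256])
```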

03 Methodology

At a high level: Input (text + image) → Two encoders (Und-Enc with SigLIP2+shuffle, Gen-Enc with 32x autoencoder) → Channel-wise concatenation → Adapters into a shared-and-decoupled network → Task-specific training objectives (next-token for understanding; flow matching for generation) → Output (answers, new images, or edited images).

Step-by-step with what–why–example:

  1. Visual Understanding Encoder (Und-Enc: SigLIP2 with 2×2 pixel shuffle)
  • What happens: The image is patchified, embedded, and then the token grid is reduced 4× via pixel shuffle, reaching an overall ~32x compression. Native-resolution support is added by interpolating positional embeddings.
  • Why it exists: To extract strong semantic tokens (what’s in the image) at a low token count so the language model can reason efficiently.
  • Example: A 1024×1024 image becomes ~1024 tokens carrying clear meanings like “two dogs on a beach,” “blue sky,” “sign reads BEACH.”
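As a rough sketch of the arithmetic and the 2×2 pixel-shuffle (space-to-depth) step, assuming a patch size of 16 and an illustrative feature width:

```python
# Token-count arithmetic for the understanding branch, plus a pixel shuffle
# that merges each 2x2 neighborhood of tokens into one wider token.
import torch

patch, shuffle, image = 16, 2, 1024
tokens_per_side = image // (patch * shuffle)        # 1024 / 32 = 32
print(tokens_per_side ** 2)                         # 1024 tokens overall

grid = torch.randn(1, 64, 64, 768)                  # (B, H, W, C) tokens after patchify
b, h, w, c = grid.shape
shuffled = (grid.view(b, h // 2, 2, w // 2, 2, c)   # group 2x2 neighborhoods
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(b, h // 2, w // 2, 4 * c))
print(shuffled.shape)  # torch.Size([1, 32, 32, 3072]) -> 4x fewer tokens, 4x wider
```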
  2. Visual Generation Encoder (Gen-Enc: DCAE 32x)
  • What happens: A high-compression autoencoder encodes images (for editing/reference) or provides the latent space for decoding (for generation). EMMA freezes DCAE during most training stages to stabilize learning.
  • Why it exists: It dramatically reduces visual tokens for generation/editing compared to 8x AEs, cutting cost while preserving detail well enough.
  • Example: For a portrait edit, the reference image becomes a compact latent grid aligned to the same 32x scale as understanding tokens.
  3. Channel-wise Concatenation
  • What happens: Because both branches use 32x, their tokens share the same spatial grid. EMMA stacks the Und-Enc features and Gen-Enc features along channels per token location, not by adding more tokens.
  • Why it exists: Token-wise concatenation (used in some prior work) multiplies the context length; channel-wise concatenation keeps it steady, speeding training and inference.
  • Example: For a 32×32 token grid (≈1024 tokens), EMMA keeps it at 1024 tokens but each token is richer inside (more channels).
  4. Adapters and 2D→1D Positioning
  • What happens: Understanding and generation adapters project visual features into the unified model’s space. 2D positional encoding is first applied to inject spatial knowledge; then tokens enter the transformer with 1D RoPE so text and vision share a consistent sequence space.
  • Why it exists: Adapters align feature scales; 2D + 1D positions let the model respect both image layout and sequence order.
  • Example: The token at image row 5, col 12 knows both where it is in the image and where it sits in the joint text–vision sequence.
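A minimal sketch of that two-step positioning, using simple learned row/column embeddings as a stand-in for the actual 2D encoding and omitting RoPE details entirely:

```python
# 2D positions first, then one flattened sequence shared with text tokens.
import torch
import torch.nn as nn

dim, grid = 512, 32
row_emb = nn.Embedding(grid, dim)
col_emb = nn.Embedding(grid, dim)

vis = torch.randn(1, grid, grid, dim)                       # (B, H, W, D) adapted visual tokens
rows = torch.arange(grid).view(grid, 1).expand(grid, grid)  # row index per position
cols = torch.arange(grid).view(1, grid).expand(grid, grid)  # column index per position
vis = vis + row_emb(rows) + col_emb(cols)                   # inject 2D layout

vis_seq = vis.flatten(1, 2)                                 # (B, 1024, D) row-major order
text = torch.randn(1, 20, dim)                              # 20 text tokens
joint = torch.cat([text, vis_seq], dim=1)                   # one sequence for the transformer
print(joint.shape)                                          # torch.Size([1, 1044, 512])
# 1D positions 0..1043 (e.g., RoPE) would then apply over this joint sequence.
```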
  5. Shared-and-Decoupled Network
  • What happens: Shallow layers share parameters (e.g., QK in self-attention) so tasks learn from each other; selected task-specific modules (e.g., V projections) remain separate even in shallow layers; deeper layers are fully decoupled into understanding and generation branches.
  • Why it exists: Understanding needs strong semantics; generation needs semantics plus high-frequency details. Sharing basics while specializing late prevents conflicts.
  • Example: The model learns to follow instructions (shared), but one deep path becomes a great reader and reasoner; the other becomes a great painter and editor.
  6. Attention Strategy (hybrid masking)
  • What happens: For understanding, both text and vision use causal masks (look left only) to match next-token training. For generation, text stays causal while visual tokens can attend within the same image (to coordinate pixels) and to previous tokens.
  • Why it exists: Different tasks benefit from different attention views; generation needs within-image context for coherent details.
  • Example: While generating a cat, the ear token can check nearby fur tokens for consistent color and texture.
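The hybrid mask can be pictured as a small boolean matrix; the sketch below builds one for a toy sequence of text tokens followed by image tokens (illustrative only).

```python
# Hybrid mask: text stays causal, image tokens also attend to each other.
import torch

n_text, n_img = 6, 4
n = n_text + n_img
allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal: attend left only
allowed[n_text:, n_text:] = True                           # image block: full attention within the image
# allowed[i, j] == True means token i may attend to token j.
print(allowed.int())
```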
  7. Training Objectives
  • Understanding → Next-Token Prediction 🍞 Hook: When you read a sentence aloud, you can often guess the next word. 🥬 Next-Token Prediction: The model predicts the next token in the answer.

    • How it works: (1) Feed prompt + image tokens; (2) Predict next token; (3) Repeat.
    • Why it matters: It teaches precise, stepwise language reasoning grounded in visuals. 🍞 Anchor: “What color is the car?” → model predicts “… red.” one token at a time.
  • Generation → Flow Matching with Velocity Prediction 🍞 Hook: Think of guiding a toy boat smoothly down a river instead of jumping from rock to rock. 🥬 Flow Matching: The model learns a continuous path from noise/latent to the target image by predicting velocities.

    • How it works: (1) Sample intermediate states between noise and image; (2) Predict velocity vectors that point toward the target; (3) Train to follow the smooth path.
    • Why it matters: It stabilizes image generation and aligns well with transformer training. 🍞 Anchor: Turning a cloudy blur into a sharp “two dogs on a beach” picture by always nudging in the right direction.
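Here is a hedged sketch of how the two objectives can sit in one training step, with `model` standing in for EMMA via hypothetical heads (`understanding_logits`, `predict_velocity`); only the loss math follows the description above.

```python
# One training step combining next-token prediction and flow matching.
import torch
import torch.nn.functional as F

def training_losses(model, text_ids, labels, clean_latents):
    # --- Understanding: next-token prediction (cross-entropy on shifted labels) ---
    logits = model.understanding_logits(text_ids)            # (B, T, vocab), hypothetical head
    ntp_loss = F.cross_entropy(logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten())

    # --- Generation: flow matching with velocity prediction -----------------------
    x1 = clean_latents                                        # target latents, assumed (B, C, H, W)
    x0 = torch.randn_like(x1)                                 # pure noise
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)    # random time per sample
    xt = (1 - t) * x0 + t * x1                                # point on the straight noise->image path
    target_velocity = x1 - x0                                 # direction toward the image
    pred_velocity = model.predict_velocity(xt, t, text_ids)   # hypothetical head
    fm_loss = F.mse_loss(pred_velocity, target_velocity)

    return ntp_loss + fm_loss
```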
  8. Mixture of Experts (STEM expert + router)
  • What happens: A lightweight router decides if an input image is STEM-like (charts, documents, equations). If yes, route to a STEM expert (initialized from the general expert) that is tuned late on STEM data; otherwise, use the versatile expert.
  • Why it exists: STEM images need sharper text/structure perception; a tiny specialist boosts scores for a small parameter cost (~50M additional params).
  • Example: ChartQA and DocVQA images go to the STEM expert for improved OCR and layout understanding.
  9. Data and Training Stages (end-to-end)
  • Stage 0 Alignment: Freeze encoders and the unified model; train the understanding adapter to align vision tokens at 512×512.
  • Stage 1 PT: Train everything except DCAE; balanced batches for understanding and generation at 512×512.
  • Stage 2 SFT: Use native resolution for understanding and ~1K bucketed resolution for generation; rebalance data (e.g., 1:1 portraits vs. general, 1:1 STEM vs. general) and then mix in editing.
  • Stage 3 QT: Quality-tune with curated data at a 1:1:1 ratio across tasks.
  • Stage 4 ET & RT: Train only the STEM expert and the router on focused sets.
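One way to picture the schedule is as a config dictionary; the stage names, resolutions, and data ratios are copied from the description above, while the structure itself is only a guess at how such a schedule might be written down.

```python
# Hedged summary of the staged training schedule as a plain config dict.
TRAINING_STAGES = {
    "stage0_alignment": {"trainable": ["und_adapter"], "resolution": 512,
                         "data": "understanding only"},
    "stage1_pretrain":  {"frozen": ["dcae"], "resolution": 512,
                         "data": "balanced understanding + generation"},
    "stage2_sft":       {"resolution": "native (und) / ~1K buckets (gen)",
                         "data": "1:1 portraits vs. general, 1:1 STEM vs. general, then editing mixed in"},
    "stage3_quality":   {"data": "curated, 1:1:1 across understanding/generation/editing"},
    "stage4_expert":    {"trainable": ["stem_expert", "router"]},
}
```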

Secret Sauce (why this recipe is clever):

  • Same 32x on both sides unlocks channel-wise fusion—huge token savings with better cross-task information sharing.
  • Shared-then-specialized layers reduce learning conflicts while enabling transfer.
  • A tiny MoE targets the hardest image types without inflating the whole model.
  • Hybrid objectives fit each task’s nature, yet remain trainable in one system.

04 Experiments & Results

The Test: EMMA is evaluated across three families of tasks—multimodal understanding, text-to-image generation, and image editing—so we can see if one unified model truly does all three well while staying efficient. Benchmarks include MMVet, MMBench, MMMU, ChartQA, DocVQA, TextVQA (understanding); GenEval and DPG-Bench (generation); and GEdit-Bench-EN (editing).

The Competition: EMMA is compared to unified models (e.g., BAGEL-7B, BLIP3-o, UniWorld-V1, Janus Pro, OmniGen2), as well as specialist leaders for understanding (Qwen3-VL, InternVL3.5) and for generation (Qwen-Image, FLUX, SDXL, SANA). Importantly, several baselines use bigger LLMs (often ≥7B, sometimes 20B) or rely on tricks like prompt rewriting and reinforcement learning.

The Scoreboard (with context):

  • Multimodal Understanding: EMMA-4B scores 73.0 on MMVet versus BAGEL-7B at 67.2—a solid jump. Across 11 understanding datasets, adding the tiny MoE yields an average +0.4% gain, keeping EMMA competitive with state-of-the-art understanding specialists like Qwen3-VL and surpassing InternVL3.5 on average in the reported set.
  • Text-to-Image Generation: On GenEval, EMMA reaches 0.91 without prompt rewriting or RL (and 0.93 with rewriting), beating BAGEL-7B (0.82/0.88) and even rivaling strong dedicated generators. On DPG-Bench, EMMA achieves 85.63 overall, edging out BAGEL (85.07) and staying competitive with larger or more specialized models.
  • Image Editing: On GEdit-Bench-EN, EMMA’s overall score (6.53) is slightly above several unified baselines and near strong editors, despite using ~1/5 the visual tokens of certain rivals during reference handling. Notably, EMMA avoids training on GPT-Image-Edit-1.5M to preserve subject consistency, highlighting a different quality trade-off than some higher-scoring methods.

Efficiency Gains (the quiet superstar):

  • Visual tokens are dramatically reduced. In image editing, EMMA uses about 20% of the visual context tokens compared to some unified baselines (≈5× fewer), which cuts memory and latency while keeping or improving quality.
  • A 32x autoencoder on the generation side (and matched 32x on understanding via SigLIP2+shuffle) slashes token counts compared to typical 8x setups—yet EMMA still delivers crisp generations.

Surprising Findings (emergent skills):

  • Cross-lingual editing: Even without Chinese editing data, EMMA follows Chinese editing prompts. Likely because the understanding branch learned multilingual grounding from mixed understanding datasets—and generation benefits via the unified architecture.
  • Multi-step editing reasoning: Trained mostly on single-instruction edits, EMMA can still follow complex, chained instructions, probably because multimodal chain-of-thought data taught it to decompose and sequence actions.

Interpretation: Hitting 0.91–0.93 GenEval with a 4B unified model (no RL, no heavy prompt rewriting) is like getting an A+ while many bigger classmates get a B. Doing this while cutting tokens and preserving subject consistency shows the architecture’s balance between brains (semantics) and brush (detail).

05 Discussion & Limitations

Limitations:

  • Editing Corner Cases: Ultra-tricky edits (fine-grained relighting, hair wisps, tiny text swaps) can still challenge a 4B unified model, especially without massive edit-specific data. Larger models or targeted fine-tuning may help.
  • Evaluation Gaps: GEdit relies on vision-language judgment that underweights subject consistency. EMMA deliberately avoids GPT-Image-Edit-1.5M to keep identities stable, but that choice can underreport its strengths under current metrics.
  • Compression Trade-offs: 32x compression is efficient, but for extreme photorealism or micro-details, some scenarios might benefit from adaptive or mixed-ratio compression.
  • Router Dependence: The STEM MoE relies on a small router to recognize STEM-like images. Misrouting can reduce the benefit; future work could refine routing or allow soft mixtures.

Required Resources:

  • Data: Hundreds of millions of samples across understanding, generation, and editing (with quality tuning subsets) are helpful for best results.
  • Compute: End-to-end unified training with mixed objectives and native-resolution handling needs strong multi-GPU/TPU resources, though the 32x compression and token savings help.

When NOT to Use:

  • Video/Audio-Visual Tasks: EMMA targets images and text; video or audio-visual tasks would need temporal/audio modules.
  • Ultra-High-Fidelity Restoration: If the goal is maximal pixel-perfect restoration, a specialized, lower-compression generator might be preferred.
  • Niche Domains Without Data: Highly specialized scientific imaging without training exposure may need domain-specific experts or fine-tuning.

Open Questions:

  • Adaptive Compression: Can the model learn where to compress 16x vs. 32x dynamically for better detail without token bloat?
  • Better Editing Metrics: How can we directly measure subject/identity consistency, layout faithfulness, and localized edit correctness?
  • More Experts, Less Overhead: Can we add more tiny experts (e.g., medical scans, maps) with smarter routing while keeping parameters light?
  • Unified Learning: Is there a single training objective that gracefully handles both understanding and high-fidelity generation without conflict?

06 Conclusion & Future Work

Three-Sentence Summary: EMMA is a unified model that understands images, generates images from text, and edits images efficiently by matching a 32x compression on both the understanding and generation sides. This alignment enables channel-wise fusion (not token bloat), while a shared-and-decoupled network and a tiny STEM expert improve accuracy across tasks. As a result, EMMA beats larger unified models in both performance and efficiency and reaches top-tier text-to-image results without heavy tricks.

Main Achievement: Proving that equalized 32x compression plus channel-wise fusion, combined with shared-then-specialized layers and a lightweight MoE, forms a practical, high-performing recipe for truly unified multimodal modeling.

Future Directions: Add adaptive compression to preserve ultra-fine details when needed; broaden experts and routing to new domains; improve editing evaluation with identity/layout metrics; extend from images to video while keeping token efficiency; and explore training schedules that further harmonize understanding and generation learning.

Why Remember This: EMMA shows that smarter architecture—not just bigger models—can close the gap between reading and drawing, making AI that’s faster, cheaper, and better at real-world tasks like document understanding, chart Q&A, photorealistic generation, and careful, identity-safe editing.

Practical Applications

  • Smart document assistant that reads receipts, forms, and charts, then explains the results in plain language.
  • Classroom helper that solves visual math and science problems and generates labeled diagrams for lessons.
  • Design co-pilot that drafts concept art from briefs and then performs precise, identity-safe edits on photos.
  • E-commerce studio tool for virtual try-on, background swaps, and color changes while preserving product look.
  • Accessibility service that creates accurate image descriptions (alt text) and clarifies diagrams for low-vision users.
  • Photo app that performs complex edits (object add/remove/replace) reliably on mobile thanks to fewer tokens.
  • Data labeling accelerator that pre-annotates images and charts for faster, cheaper dataset creation.
  • Marketing content engine that generates on-brand images and then batch-edits them for new campaigns.
  • Research aid for chart/figure Q&A and quick diagram generation from textual hypotheses.
  • Multilingual editing assistant that follows instructions in different languages without retraining.
#EMMA#unified multimodal architecture#32x autoencoder#channel-wise concatenation#mixture of experts#shared-and-decoupled network#SigLIP2#DCAE#flow matching#next-token prediction#GenEval#MMVet#image editing#token efficiency#multimodal understanding