UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Intermediate
Lichen Ma, Xiaolong Fu, Gaojing Zhou et al. · 1/13/2026
arXiv · PDF

Key Summary

  • UM-Text is a single AI that understands both your words and your picture to add or change text in images so it looks like it truly belongs there.
  • It uses a Visual Language Model (VLM) to read your instruction and study the reference image, then plans the text content, position, and style automatically.
  • A new UM-Encoder blends several clues (language, tiny character pictures, and image context) so the diffusion model can draw crisp, correct letters.
  • A special Regional Consistency Loss checks the exact areas with text—both in hidden features and in the final image—to keep strokes sharp and styles consistent.
  • Training happens in three steps: teach the VLM to design, teach the diffusion model to render text, then align them so they cooperate smoothly.
  • The team built UM-DATA-200K, a large dataset with clean images and text layouts, to teach the model real layout and design skills.
  • On public benchmarks, UM-Text beats strong baselines like AnyText, AnyText-2, FLUX-Text, and DreamText in accuracy and realism, for both English and Chinese.
  • UM-Text works for poster design, image text editing, and cross-language image translation using only natural language instructions.
  • It reduces manual work (no more hand-picking fonts, sizes, and colors) by learning implicit style from the image context.
  • The result is more readable text with better placement, cleaner shapes, and stronger harmony with the background.

Why This Research Matters

UM-Text makes image text editing as easy as giving a natural instruction, so non-designers can produce professional-looking posters and product images. It keeps letters readable and styles consistent, saving hours of manual layout, font picking, and color matching. By working across languages, it supports global teams doing translation and localization directly inside images. Marketing teams can refresh campaigns faster, while educators and small businesses can create polished visuals without special tools. The approach points to a future where AI understands the whole scene, not just separate parts, leading to cleaner, more harmonious results. It also opens doors to accessibility tools that clarify or translate on-image text for broader audiences.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re decorating a birthday poster. You want “Happy Birthday!” to sit just right on the cake, match the icing color, curve along the frosting, and use a fun font. Doing all that by hand takes time and skill.

🥬 Filling (The Actual Concept): Before this research, AI could make pretty images and even add text, but it struggled to really understand both your words and the picture together. That meant humans often had to manually choose the text, font, size, color, and where to place it. Many systems could create text in images, yet they fell apart on tricky letters, non‑English scripts, or when the text needed to blend naturally with the scene. How it worked before:

  1. You type a prompt.
  2. The AI generates an image or edits an image.
  3. You manually fix the text: content, layout, font, and color.

Why that was hard: The AI didn’t fully ‘see’ the image context or ‘understand’ the instruction well enough to pick styles and layouts that fit.

Why it matters: Without better understanding, text looks pasted-on, inconsistent in style, or even unreadable.

🍞 Bottom Bread (Anchor): Think of a travel poster where “Visit Paris” should arc over the Eiffel Tower in elegant gold. Older methods often made blocky, off‑color words that hid the tower or clashed with the scene. You had to fix it yourself.

🍞 Top Bread (Hook): You know how a good party planner listens to what you want and studies the venue before deciding where the balloons, tables, and lights should go?

🥬 Filling (The Actual Concept): Visual Language Models (VLMs) are AIs that look at both images and text together, so they can connect what you ask with what’s in the picture. How it works:

  1. Read your instruction (“Put ‘Fresh Deals’ above the orange juice bottle”).
  2. Look at the image (find the bottle, colors, empty spaces).
  3. Propose text content, layout, and style ideas that match the scene.

Why it matters: Without a VLM, the model guesses blindly and often places or styles text poorly.

🍞 Bottom Bread (Anchor): Tell the system, “Add ‘Game Night’ where it balances with the board and dice.” The VLM notices the empty top-left corner and picks a matching color from the board edge.

🍞 Top Bread (Hook): Imagine having three friends helping with your poster: one who understands language, one who sees tiny letter shapes, and one who studies the photo’s vibe.

🥬 Filling (The Actual Concept): UM-Encoder is a module that mixes three kinds of information—language embeddings (what to write), character-level visual embeddings (how each letter looks), and multimodal context embeddings (what fits this image)—into one clear signal for image generation. How it works:

  1. Get language clues from T5 (the words to render).
  2. Get character-level visual clues from glyph images (crisp strokes for each character).
  3. Get context from the VLM (layout, style hints from the image).
  4. Align and combine them so the diffusion model can draw accurate, stylish text.

Why it matters: Without UM-Encoder, the generator might spell wrong, draw messy strokes, or ignore the image’s style.

🍞 Bottom Bread (Anchor): For “EddieWorld” on a sign, UM-Encoder helps choose a playful sign-like font, fits the word on the signboard properly, and keeps every “e” crisp and consistent.

🍞 Top Bread (Hook): Think of baking cookies on just the parts of the tray that have batter—you don’t want to heat up the empty spaces.

🥬 Filling (The Actual Concept): Regional Consistency Loss focuses the model’s attention on the exact text regions—both in hidden features (latent space) and in the final image (RGB)—so strokes are clean and shapes are correct. How it works:

  1. Use the VLM’s predicted layout to get a text region mask.
  2. In latent space, guide learning more inside the mask so updates don’t get “diluted.”
  3. In the final image, compare edges in the mask region to keep letters sharp.

Why it matters: Without it, curves and complex characters get wobbly or blurry, especially in editing.

🍞 Bottom Bread (Anchor): When editing Chinese poetry on a poster, the loss sharpens each stroke so similar-looking characters don’t get confused.

🍞 Top Bread (Hook): If you’re building a treehouse, you first sketch a plan, then gather tools, then practice putting pieces together so they fit.

🥬 Filling (The Actual Concept): The three-stage training strategy teaches the system step by step: first design (VLM), then render (diffusion), then align them to work as one. How it works:

  1. Pretrain UM-Designer (VLM) on UM-DATA-200K to plan layouts and text.
  2. Pretrain the diffusion model to render clean text in images.
  3. Align both so the designs become beautiful, accurate edits.

Why it matters: Skipping steps makes the team uncoordinated—great plans but messy drawings, or sharp drawings with poor planning.

🍞 Bottom Bread (Anchor): After training, when you say “Translate and restyle the label to match the product,” the model plans placements and colors, then renders crisp letters that blend with the product photo.

02 Core Idea

🍞 Top Bread (Hook): You know how a great poster artist doesn’t just write words anywhere—they read the brief, study the photo, choose a style, and place text so everything feels natural?

🥬 Filling (The Actual Concept): The key insight of UM-Text is to unify understanding (what to write and where/how to place it) with generation (drawing the text) in one model that uses a VLM plus a clever encoder and targeted losses. How it works (big picture):

  1. UM-Designer (a VLM) reads your instruction and sees the image to propose text content, layout boxes, and style hints.
  2. UM-Encoder blends language, per-character visual details, and image context into a single conditioning signal.
  3. A diffusion transformer uses that signal to paint crisp, stylistically consistent text right onto the image.

Why it matters: Without this unification, you must micromanage fonts, sizes, and layout, and the result often looks pasted-on.

🍞 Bottom Bread (Anchor): Say, “Replace ‘SALE’ with ‘LIMITED TIME’ in matching style, top-right.” The system plans the box, borrows colors, and renders letters that feel like they were always there.

Three friendly analogies:

  1. Orchestra: The VLM is the conductor reading the score (your instruction and image), UM-Encoder is the arranger blending parts, and diffusion is the musicians performing a polished piece.
  2. Kitchen: The VLM writes the recipe (content + layout), UM-Encoder preps the ingredients (language + glyphs + context), and diffusion cooks the dish (final text rendering) just right.
  3. Sports: The VLM draws the play, UM-Encoder coordinates positions, and diffusion executes the move to score (clean, on-style text).

Before vs. After:

  • Before: Separate tools guessed layouts, needed manual font tweaks, and struggled with non-English or complex letters.
  • After: One model designs and renders, placing accurate, readable text that matches the image style and supports multiple languages.

Why it works (intuition):

  • Context first: Let a VLM truly ‘see’ the image and instruction to suggest layout and style—so the plan fits the scene.
  • Right ingredients: Mix language tokens, character-level visual features, and context embeddings so the renderer knows exactly what and how to draw.
  • Focus where it counts: Use regional consistency losses to sharpen letters exactly where they appear, avoiding blurry strokes.

Building blocks (with mini “sandwiches”):

  • 🍞 Hook: Imagine a smart designer. 🥬 Concept: UM-Designer (VLM) plans content + layout. 🍞 Anchor: It picks a calm blue for “Sky Sale” to match the clouds.
  • 🍞 Hook: Think of an adapter that merges signals. 🥬 Concept: UM-Encoder fuses language, glyph, and context embeddings. 🍞 Anchor: “EAGLE” gets sharp ‘A’ peaks and a bold outdoor style.
  • 🍞 Hook: Picture an artist layering paint. 🥬 Concept: Diffusion transformer renders step-by-step from noisy hints to clear text. 🍞 Anchor: “TOGETHER” emerges cleanly over a picnic scene.
  • 🍞 Hook: Only sand where the wood is rough. 🥬 Concept: Regional Consistency Loss polishes strokes exactly in text regions. 🍞 Anchor: Chinese characters keep precise edges, not washed-out lines.
  • 🍞 Hook: Train like levels in a game. 🥬 Concept: Three-stage training builds design skill, drawing skill, then teamwork. 🍞 Anchor: After alignment, translations keep layout and texture harmony.

03 Methodology

High-level pipeline: Input (instruction + image) → UM-Designer (plan content, layout, style hints) → UM-Encoder (fuse language + glyph + context) → Diffusion Transformer (render) → Output (edited image with crisp, on-style text).

Step A: UM-Designer (VLM) for planning

  • What happens: The model reads your instruction (e.g., “Change ‘OPEN’ to ‘WELCOME’ in the shop’s style”) and studies the image. It proposes what text to write, where to place it (bounding boxes), and implicit attributes (colors, size, style cues) that match the scene.
  • Why this step exists: Guessing layout without seeing the image leads to awkward overlap, wrong colors, or unreadable text.
  • Example: On a café photo with warm wood tones, UM-Designer proposes “WELCOME” in a box above the door and suggests a cream color sampled from the awning.
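To make this concrete, here is a minimal sketch of the kind of plan such a designer could emit. The field names, box coordinates, and color value are illustrative assumptions, not the paper’s actual output format.

```python
# Hypothetical layout plan from a VLM-based designer (illustrative only; the
# real UM-Designer output schema is not specified in this summary).
plan = {
    "instruction": "Change 'OPEN' to 'WELCOME' in the shop's style",
    "texts": [
        {
            "content": "WELCOME",
            "bbox": [0.32, 0.18, 0.68, 0.26],  # normalized [x1, y1, x2, y2], above the door
            "style_hints": {
                "color": "#F2E8D5",            # cream sampled from the awning
                "weight": "medium-bold",
            },
        }
    ],
}
```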

Step B: Render character-level glyphs and extract visual features

  • What happens: The chosen text is rendered into tiny per-character images (like mini stamps). An OCR-style visual encoder turns each tiny glyph into a detailed feature describing its strokes.
  • Why this step exists: Word-level features miss subtle curves/serifs; per-character features keep strokes sharp and accurate.
  • Example: The letter “R” keeps its leg distinct from the bowl so it doesn’t blur into “P.”
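A minimal sketch of the “mini stamp” idea, assuming Pillow and a generic font; the font path, glyph size, and the `ocr_encoder` placeholder are assumptions, since the paper’s OCR-style encoder is a trained model rather than anything shown here.

```python
# Minimal sketch: render each character as a small glyph image ("mini stamp").
# Font path and size are assumptions; a trained OCR-style visual encoder would
# then turn each glyph into stroke-level features.
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont

def render_glyphs(text, size=64, font_path="DejaVuSans.ttf"):
    font = ImageFont.truetype(font_path, int(size * 0.8))
    glyphs = []
    for ch in text:
        canvas = Image.new("L", (size, size), color=0)  # black background, white glyph
        ImageDraw.Draw(canvas).text((size // 2, size // 2), ch,
                                    fill=255, font=font, anchor="mm")
        glyphs.append(torch.from_numpy(np.array(canvas)).float() / 255.0)
    return torch.stack(glyphs)                           # [num_chars, size, size]

glyph_images = render_glyphs("WELCOME")
# glyph_features = ocr_encoder(glyph_images)  # hypothetical per-character stroke features
```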

Step C: UM-Encoder fuses multimodal conditions

  • What happens: It aligns and blends three signals: T5 language embeddings (what to say), glyph visual embeddings (how each character looks), and VLM context embeddings (how to fit the scene). The result is the UM-Embedding that conditions the generator.
  • Why this step exists: Any single signal is incomplete. Language alone can’t shape strokes; glyphs alone ignore scene style; context alone can’t guarantee correct spelling.
  • Example: For “HAPPY,” the UM-Embedding keeps both Ps identical and cheerful, picks a matching yellow, and respects spacing above a balloon bouquet.
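Here is a minimal sketch of what “align and blend three signals” could look like, assuming each stream is already encoded; the dimensions, projection layers, and mixer depth are assumptions rather than UM-Encoder’s real architecture.

```python
# A simple fusion module: project each condition stream to a shared width,
# concatenate along the token axis, and mix with self-attention.
import torch
import torch.nn as nn

class SimpleFusionEncoder(nn.Module):
    def __init__(self, d_text=4096, d_glyph=512, d_ctx=3584, d_model=3072):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_model)    # T5 language embeddings
        self.proj_glyph = nn.Linear(d_glyph, d_model)  # per-character glyph features
        self.proj_ctx = nn.Linear(d_ctx, d_model)      # VLM context embeddings
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_emb, glyph_emb, ctx_emb):
        tokens = torch.cat(
            [self.proj_text(text_emb), self.proj_glyph(glyph_emb), self.proj_ctx(ctx_emb)],
            dim=1,
        )
        return self.mixer(tokens)  # one "UM-Embedding"-like conditioning sequence
```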

Step D: Masked conditioning for editing

  • What happens: The predicted layout becomes a mask that marks exactly where text goes. The model also gets a condition image (image × mask) and a mask latent to focus updates in the right areas.
  • Why this step exists: Editing only the needed region preserves the rest of the picture and avoids color bleeding.
  • Example: Replacing “SALE” on a price tag updates just the tag, keeping the product texture untouched.
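A small sketch of the masked-conditioning idea, assuming image and mask tensors shaped [B, C, H, W] and a `vae` object with an `encode()` method (both placeholders); the exact mask convention and how UM-Text packs these conditions may differ.

```python
# Sketch of masked conditioning for editing; `vae` and the mask convention are
# assumptions for illustration.
import torch.nn.functional as F

def build_edit_conditions(image, mask, vae):
    cond_image = image * mask                # "image × mask" condition described above;
                                             # whether the mask keeps or removes the text
                                             # region depends on its convention
    cond_latent = vae.encode(cond_image)     # latent of the condition image
    mask_latent = F.interpolate(mask, size=cond_latent.shape[-2:], mode="nearest")
    return cond_latent, mask_latent          # both are fed to the diffusion transformer
```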

Step E: Diffusion transformer renders the final text

  • What happens: Starting from a noisy latent, the diffusion transformer repeatedly refines the image latent while guided by the UM-Embedding, the mask latent, and the conditioned image latent, then decodes with a VAE to the final image.
  • Why this step exists: Gradual refinement produces cleaner, more realistic letters than a single-step paint.
  • Example: “CERTIFICATE” on a document becomes legible line by line, while paper texture stays intact.
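Below is a deliberately simplified denoising loop to show the shape of this step; `model`, `vae`, the step count, and the Euler-style update are placeholders and assumptions, not the actual flow-matching sampler.

```python
# Highly simplified refinement loop: start from noise, repeatedly refine the
# latent under the fused conditions, then decode to pixels.
import torch

@torch.no_grad()
def render(model, vae, um_embedding, cond_latent, mask_latent,
           steps=28, shape=(1, 16, 64, 64)):
    latent = torch.randn(shape)                                  # start from pure noise
    for t in torch.linspace(1.0, 0.0, steps):
        inputs = torch.cat([latent, cond_latent, mask_latent], dim=1)
        pred = model(inputs, timestep=t, context=um_embedding)   # predicted update
        latent = latent - pred / steps                           # one Euler-style step
    return vae.decode(latent)                                    # final edited image
```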

Step F: Regional Consistency Loss (two spaces)

  • What happens: The loss checks text regions:
    1. In latent space, it amplifies learning signals inside the masked area to avoid dilution by the background.
    2. In the final image, it compares edges (via Canny) inside the mask to keep strokes crisp.
  • Why this step exists: Complex scripts (like Chinese) and fine serifs need extra attention to avoid wobble or blur.
  • Example: For “紫禁城风云,” each stroke junction stays sharp, not smudged.
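A minimal sketch of a dual-space regional loss in this spirit; the weights, the exact terms, and the use of OpenCV’s Canny are assumptions (Canny itself is non-differentiable, so a trainable version would need a differentiable edge operator or edges computed only on targets).

```python
# Illustrative dual-space regional loss: a mask-focused latent term plus an
# edge-comparison term inside the text region.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def canny_edges(img_tensor):
    # img_tensor: [H, W, 3] in [0, 1] -> binary edge map as a float tensor.
    # Note: cv2.Canny is non-differentiable; shown only to illustrate the comparison.
    img = (img_tensor.detach().cpu().numpy() * 255).astype(np.uint8)
    edges = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_RGB2GRAY), 100, 200)
    return torch.from_numpy(edges).float() / 255.0

def regional_consistency_loss(pred_latent, target_latent, latent_mask,
                              pred_image, target_image, image_mask,
                              w_latent=1.0, w_edge=0.5):
    # Latent-space term: weight errors inside the text mask so they are not diluted.
    latent_term = F.mse_loss(pred_latent * latent_mask, target_latent * latent_mask)
    # Image-space term: compare edge maps only inside the text region.
    edge_term = F.l1_loss(canny_edges(pred_image) * image_mask,
                          canny_edges(target_image) * image_mask)
    return w_latent * latent_term + w_edge * edge_term
```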

Step G: Three-stage training to build skills progressively

  • Stage 1: Pretrain UM-Designer on UM-DATA-200K to learn detection, recognition, layout planning, and content generation from diverse posters and products.
    • Without it: Layout guesses are poor; style hints are weak.
    • Example: The VLM learns that big titles sit near poster tops with high contrast.
  • Stage 2: Pretrain diffusion (initialized from FLUX-Fill) on AnyWord-3M to master text rendering/editing broadly.
    • Without it: The generator can’t draw clean glyphs consistently.
    • Example: It learns to render both English and Chinese across many scenes.
  • Stage 3: Semantic alignment connects VLM plans to the diffusion renderer via the UM-Encoder’s connector.
    • Without it: Great plans, sloppy execution—or sharp rendering with mismatched layout.
    • Example: “Translate and restyle” produces aligned placement and matching textures.
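As a compact way to remember the schedule, here is an illustrative outline of the three stages; module and dataset names follow the list above, while the stage-3 data description and everything about the actual training loops are placeholders.

```python
# Illustrative staged-training outline (not the actual training code).
stages = [
    {"name": "stage1_designer",  "trains": "UM-Designer (VLM)",     "data": "UM-DATA-200K",
     "goal": "layout planning, text content, style hints"},
    {"name": "stage2_renderer",  "trains": "diffusion transformer", "data": "AnyWord-3M",
     "goal": "clean text rendering and editing"},
    {"name": "stage3_alignment", "trains": "UM-Encoder connector",  "data": "(paired plans and renders; not detailed here)",
     "goal": "make VLM plans and the renderer cooperate"},
]

for stage in stages:
    print(f"{stage['name']}: train {stage['trains']} on {stage['data']} -> {stage['goal']}")
```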

The secret sauce:

  • Unified planning + rendering: The VLM proposes context-aware layout and style; the diffusion model realizes them precisely.
  • Character-level fidelity: Per-character visual embeddings preserve stroke-level correctness.
  • Region-focused sharpening: Dual-space Regional Consistency Loss keeps letters crisp exactly where they appear.
  • Data pipeline: UM-DATA-200K gives the model realistic layout/text pairs to learn genuine design habits.

Concrete walk-through (English):

  • Input: “Replace ‘OPEN’ with ‘WELCOME’ above the door, same style.”
  • Plan: Box above door; color sampled from trim; medium-bold font.
  • Fuse: Language (WELCOME) + glyph features (W/M wide angles) + context (door frame geometry, palette).
  • Render: Diffusion refines masked region to clean letters matching shadows and texture.
  • Check: RC Loss keeps edges sharp; background stays untouched.

Concrete walk-through (Chinese):

  • Input: “把海报上的‘听风阁’改为‘伴月亭’,风格不变。”
  • Plan: Same box; similar brush-like stroke width; matching ink color.
  • Fuse: Chinese glyph features per character + VLM style cues from background.
  • Render: Crisp strokes; paper grain preserved; spacing balanced.
  • Check: Edge consistency ensures no stroke merges or gaps.

04 Experiments & Results

🍞 Top Bread (Hook): Imagine a spelling bee and an art contest combined—you must get every letter right and also make it look beautiful in the picture.

🥬 Filling (The Actual Concept): The team tested UM-Text on tasks that measure both correctness (did it write the right words?) and harmony (does the text look like it belongs?). They compared against strong systems like AnyText, AnyText-2, FLUX-Fill/FLUX-Text, UDiffText, and DreamText. How it works (tests):

  1. AnyText-benchmark (English/Chinese): Checks sentence accuracy (Sen.ACC), normalized edit distance (NED), and realism (FID, LPIPS).
  2. UDiffText benchmark: Tests reconstruction (redraw the original text) and editing (change selected text) using sequence accuracy (SeqAcc), FID, LPIPS.
  3. UMT-benchmark: End-to-end—use UM-Designer’s layout/text, then render with different editors (including UM-Text) to see who can deliver complete poster designs.

Why it matters: Good models must spell correctly, keep shapes crisp, and blend styles with the background across languages (a small sketch of the Sen.ACC and NED metrics appears after the anchor below).

🍞 Bottom Bread (Anchor): It’s like grading both the correct answer and the handwriting’s neatness on the same page.
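For readers who want the metrics pinned down, here is a small sketch of Sen.ACC and NED under their common definitions (exact-match rate over OCR’d outputs, and 1 minus normalized edit distance so that higher is better); the benchmark’s exact normalization may differ.

```python
# Sketch of sentence accuracy and NED-style similarity over OCR'd predictions.
def sentence_accuracy(predictions, references):
    # Fraction of samples where the recognized text exactly matches the target.
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def levenshtein(a, b):
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def ned_similarity(pred, ref):
    # 1 - normalized edit distance; higher is better, as in the reported tables.
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest
```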

The competition and scoreboard (with context):

  • AnyText-benchmark (Editing task):
    • English: UM-Text hits Sen.ACC ≈ 0.855 and NED ≈ 0.940 with FID ≈ 10.15 and LPIPS ≈ 0.066—like scoring an A+ for both correctness and looks when others are getting B’s.
    • Chinese: UM-Text reaches Sen.ACC ≈ 0.799 and NED ≈ 0.887 with FID ≈ 10.5 and LPIPS ≈ 0.048—showing strong multilingual ability where many models struggle.
  • UDiffText benchmark:
    • Reconstruction: UM-Text achieves near-top SeqAcc (≈ 0.99/0.98/0.97/0.96 across subsets) with much lower FID (≈ 6.57), meaning crisper, more realistic redraws. LPIPS is slightly higher than DreamText, likely because UM-Text better matches colors/textures, which can increase perceptual differences while still looking more natural.
    • Editing: UM-Text boosts SeqAcc to ≈ 0.93 across sets—like consistently getting the right words even after tricky changes.
  • UMT-benchmark (end-to-end design + render): UM-Text significantly outperforms previous approaches in Sen.ACC and NED for both English and Chinese when everyone uses the same UM-Designer layout/text. This shows UM-Text’s renderer makes the most of good plans.

Surprising findings:

  • Strong Chinese performance: Many models falter on complex Chinese characters; UM-Text keeps strokes distinct thanks to per-character embeddings and region-focused loss.
  • Style harmony improves realism metrics: By aligning colors and textures to the scene, UM-Text looks more ‘native,’ reducing FID even if LPIPS doesn’t always hit the absolute minimum.
  • Multi-turn robustness: Compared to a general multimodal assistant, UM-Text avoids unintended text edits and maintains consistent style over multiple instructions.

Takeaway: Across accuracy and realism, UM-Text is among the best—like a careful calligrapher who also happens to be a great graphic designer.

05 Discussion & Limitations

🍞 Top Bread (Hook): Even the best chef has limits—some dishes need rare ingredients or special tools.

🥬 Filling (The Actual Concept): UM-Text is powerful but not magic. It still has boundaries and needs resources. Limitations:

  • Language range: The paper demonstrates strong English and Chinese support; other scripts (e.g., Arabic, Devanagari) may require more training data and tailored glyph handling.
  • Very dense layouts: Extremely cluttered images or micro-text can challenge layout prediction and rendering clarity.
  • Artistic extremes: Wild handwriting or rare display fonts might need extra style prompts or fine-tuning.

Required resources:

  • Training uses significant compute (e.g., multiple A100 GPUs) and large datasets (AnyWord-3M plus UM-DATA-200K for design pretraining).
  • Inference is heavier than simple filters, though practical on modern GPUs.

When not to use:

  • Pure vector design pipelines needing exact commercial fonts for print typesetting (UM-Text matches style implicitly, not by exact licensed font files).
  • Ultra-precise corporate brand locks where pixel-perfect font metrics are legally required.

Open questions:
  • How to expand to more scripts and handwritten styles without massive extra data?
  • Can we let users ‘nudge’ layout and style interactively while keeping the unified pipeline simple?
  • How well does it handle extreme lighting/reflections where text must respect complex shadows and materials?
  • Could we distill the model to lighter versions for mobile or on-device editing while retaining accuracy?

🍞 Bottom Bread (Anchor): Think of UM-Text as a skilled poster artist: fantastic for most real-world jobs, but if you need the exact brand typeface measurements for a legal document, you’ll still pair it with a traditional typesetting tool.

06 Conclusion & Future Work

Three-sentence summary: UM-Text unifies understanding (what to write, where to place it, and how it should look) with generation (drawing it) to edit or add text in images using just natural language. It blends language, character-level glyph details, and image context through UM-Encoder and sharpens letters with a Regional Consistency Loss, trained via a three-stage strategy. The result is multilingual, style-consistent, high-fidelity text editing that beats prior systems on accuracy and realism.

Main achievement: Showing that a VLM-guided, multimodal-conditioned diffusion system can plan and render text end-to-end—so content, layout, and style are learned together rather than hand-specified.

Future directions:

  • Broaden script coverage (handwriting, Arabic, Indic families) and expand cross-language translation/editing.
  • Add gentle user controls (e.g., ‘slightly bolder,’ ‘shift left a bit’) without breaking the unified pipeline.
  • Lighter, faster variants for interactive desktop and mobile use.

Why remember this: UM-Text turns text-in-image editing from a fiddly manual chore into a smart, context-aware conversation—making posters, product images, and translations look like they were designed that way from the start.

Practical Applications

  • Poster and flyer design from a single instruction (e.g., “Add ‘Summer Sale’ top-center in the photo’s warm style”).
  • On-product label editing to update names, prices, or ingredients while matching existing textures and lighting.
  • Cross-language image translation that preserves layout and style (e.g., Chinese to English on packaging).
  • Social media graphics where slogans are placed and styled automatically to fit the photo.
  • Retail signage updates (e.g., ‘OPEN’ to ‘WELCOME’) that blend perfectly with storefront photos.
  • Educational visuals that add captions or annotations in the correct place and style on diagrams or maps.
  • E-commerce product cards that auto-generate consistent badges like ‘New’ or ‘-20%’ in brand-matching colors.
  • Event invitations that curve text along shapes and sample colors from the background automatically.
  • Digital scrapbooking or journaling that stylizes dates and titles to match page themes.
  • Localization of game UI screenshots by replacing on-image text while keeping the original art style.
Tags: visual text editing, multimodal diffusion, Visual Language Model, layout planning, glyph generation, UM-Encoder, regional consistency loss, text rendering, image-to-text editing, multilingual text in images, poster design automation, cross-language image translation, character-level embeddings, UM-DATA-200K, flow-matching diffusion