LongCat-Image Technical Report
Key Summary
- LongCat-Image is a small (6B) but mighty bilingual image generator that turns text into high-quality, realistic pictures and can also edit images very well.
- It beats or matches much larger models on many tests while using less memory and running faster, which makes it cheaper to deploy.
- The team cleaned their training data very carefully, removing AI-made images early on and filtering for high aesthetics to boost realism.
- They invented a multi-level captioning system so the model learns from short tags up to rich photographic descriptions, improving understanding.
- For Chinese text-in-image, they use character-level encoding inside quotes and lots of focused practice, reaching industry-leading accuracy and coverage.
- Training happens in smart stages (pre-training → mid-training → SFT → RL) so the model learns structure first, then beauty, then preferences.
- Reward models during RL check for OCR correctness, realism (via AIGC detection), distortion, and human preference, guiding the model toward better images.
- On key benchmarks like GenEval, WISE, GlyphDraw2, CVTG-2K, and a full Chinese character test, LongCat-Image achieves state-of-the-art or highly competitive scores.
- The image editing version keeps the original look while following edits precisely, thanks to data curation and a design that preserves visual consistency.
- They open-sourced models, checkpoints, and the full training toolchain so developers can build and improve on top of it easily.
Why This Research Matters
LongCat-Image makes high-quality image generation and editing more accessible by being both powerful and efficient. It raises the bar for text rendering—especially Chinese—unlocking real uses like posters, menus, signs, and educational images. Its strong photorealism helps e-commerce, advertising, and content creators produce believable visuals quickly. Because it runs with lower VRAM and latency, smaller teams can deploy it without massive budgets. The fully open toolchain invites researchers and developers to build specialized versions for their needs. Finally, its bilingual strength supports global creators who work across English and Chinese, making visual communication more inclusive.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re directing a super-fast art studio. You give a short note like “a panda riding a bike at sunset,” and a digital artist draws it perfectly. That’s what image AI aims to do—understand your words and paint them as you imagine.
🥬 The Concept (Diffusion Models): A diffusion model is an AI that learns to turn noisy “static” into clear images step by step. How it works (like a recipe): (1) Start with random noise; (2) Read the text to know what to draw; (3) Gently remove noise in many steps so shapes and textures appear; (4) Stop when the image looks right. Why it matters: Without this careful un-noising, images look blurry, warped, or fake.
🍞 Anchor: If you say “a red rocket on the moon,” the model starts from static and gradually reveals a crisp red rocket surrounded by grey craters.
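To make the un-noising recipe concrete, here is a minimal sketch of a text-conditioned denoising loop. It is not LongCat-Image's actual sampler: `denoiser`, `text_emb`, and the crude Euler-style update are placeholders standing in for the real DiT and scheduler.

```python
import torch

def sample_image(denoiser, text_emb, steps=30, shape=(1, 16, 64, 64)):
    """Minimal text-conditioned denoising loop (illustrative only)."""
    x = torch.randn(shape)                      # (1) start from random noise
    for i in reversed(range(steps)):            # (3) remove noise step by step
        t = torch.full((shape[0],), i / steps)  # normalized timestep
        pred = denoiser(x, t, text_emb)         # (2) the prompt steers every step
        x = x - pred / steps                    # crude Euler-style update
    return x                                    # (4) clean latent, ready for VAE decoding
```

Real samplers (DDIM, flow matching, and so on) use more careful update rules, but the loop shape is the same: noise in, prompt-guided refinement, clean image out.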
The world before: A few years ago, text-to-image models got good at basic scenes and cool art styles. But people wanted more: photos that feel real, beautiful compositions, and spot-on text rendering (especially hard for non-Latin scripts like Chinese). Companies tried a simple trick: make the models bigger—8B, 20B, even 80B parameters—hoping size alone would fix everything. That raised costs, slowed inference, and often didn’t deliver the hoped-for realism or dependable text-on-image.
The problem: Four tough challenges blocked everyday use. (1) Photorealism: Images sometimes had a plasticky, AI-ish sheen or physics mistakes. (2) Text rendering: Even top models often misspelled words or broke strokes, with Chinese especially tricky because of thousands of complex characters. (3) Efficiency: Huge models need expensive hardware and run slowly. (4) Accessibility: Developers needed not just a model, but a full, open toolchain to build real products.
Failed attempts:
- Brute-force scaling: Bigger didn’t always mean better pictures; it did mean higher bills and latency.
- Messy training data: Mixing in AI-generated images early sometimes taught the model to imitate “AI textures.”
- One-size-fits-all captions: Long, samey captions wasted token space and missed precise details.
- Complicated spatial hacks: Fancy positional-embedding tricks tried to fix aspect-ratio problems, but added complexity without clear gains.
The gap: We needed a right-sized model with a right-shaped training plan. That meant (a) data that’s squeaky clean and rich in detail, (b) captions at multiple levels so the model learns both big ideas and fine-grained photography hints, (c) a bilingual text system that treats quoted text character-by-character for accurate rendering, and (d) reinforcement learning (RL) that rewards realistic texture, accurate text, and human-preferred looks.
Real stakes:
- Posters, signage, menus, and memes rely on perfect text rendering.
- E-commerce needs photoreal product shots, often with text overlays.
- Social creators and teachers want quick, beautiful visuals in English and Chinese.
- Startups and researchers need small, fast models they can actually run and tune.
🍞 Hook (Text-to-Image): You know how you can describe a scene and your friend sketches it? Text-to-Image (T2I) does that with computers.
🥬 The Concept: T2I turns a written prompt into a picture by guiding the diffusion process with the meaning of your words. How it works: (1) Read the prompt; (2) Convert words into numbers the model understands; (3) Use those numbers to steer the un-noising steps; (4) Decode the final clean image. Why it matters: Without good T2I, prompts get misunderstood and the picture doesn’t match what you asked.
🍞 Anchor: Prompt: “A blue whale underwater with sunbeams.” Output: A deep-blue scene with a whale silhouette and bright shafts of light from above.
🍞 Hook (Text Rendering): Think of writing with a pen on a bumpy wall—hard to keep letters clean and straight.
🥬 The Concept: Text rendering in images means drawing letters/shapes correctly inside a complex scene. How it works: (1) Detect the requested text; (2) Place it in the right spot; (3) Keep fonts, colors, and layout consistent; (4) Blend with lighting and textures. Why it matters: If the letters are wrong or float weirdly, posters and street scenes look fake.
🍞 Anchor: Ask for “On a chalkboard, write ‘有机化学’ in white Song font.” The model should write crisp, legible characters with chalk texture that matches the board.
🍞 Hook (Photorealism): You know how a real photo shows tiny fibers on fabric and natural shadows? That’s what makes it feel real.
🥬 The Concept: Photorealism is producing images that look like real photos, with correct lighting, materials, and physics. How it works: (1) Train only on real images at first; (2) Encourage good composition and genuine textures; (3) Penalize fake-looking patterns. Why it matters: Without photorealism, images feel plastic or uncanny.
🍞 Anchor: A latte photo shows foam bubbles, ceramic reflections, and the café’s soft window light—all aligning like in real life.
What LongCat-Image brings: It hits a sweet spot—6B parameters—trained with a careful, three-stage curriculum and a bilingual text brain. It filters data hard, captions images at multiple levels for richer learning, uses character-level encoding for quoted text to nail Chinese rendering, and finishes with RL that rewards realism, good text, and beauty. The result: state-of-the-art text rendering, top-tier realism, fast inference, and a fully open toolchain so others can build on it.
02 Core Idea
🍞 Hook: Picture a student who doesn’t buy the biggest, heaviest textbook—but instead uses the clearest notes and practices exactly what the test cares about. They often do better with less.
🥬 The Concept: The key insight is “Smart data and training beat raw size.” LongCat-Image shows that a compact 6B diffusion model, fed carefully curated data and guided by targeted rewards, can match or beat much larger systems in realism, text rendering, and alignment. How it works: (1) Clean, diverse data with multi-level captions; (2) A bilingual text encoder with character-level handling for quoted text; (3) A simple, robust positional embedding; (4) A staged curriculum (pre → mid → SFT → RL) focused on realism and text accuracy; (5) Ensemble reward models for OCR, anti-AIGC texture, distortion, and human preference. Why it matters: Without this approach, teams overspend on huge models and still get worse text, slower speed, and lower realism.
🍞 Anchor: A 6B model that writes complex Chinese characters cleanly on posters, while keeping scenes photoreal—running fast on modest GPUs.
Multiple analogies:
- Chef’s menu: Rather than cooking every possible dish (giant model), pick the top recipes (curated data) and practice techniques (staged training). You’ll serve tastier meals faster.
- Orchestra: A smaller ensemble that rehearses the right passages (RL feedback) can sound better than a large, unfocused group.
- Camera focus: Turning the focus ring carefully (reward models) sharpens exactly what matters—text clarity, texture, and composition.
Before vs after:
- Before: Big models, mixed data (including AIGC), uneven text rendering (esp. Chinese), slow, costly, and closed tools.
- After: A lean model, strictly real images early on, focused text practice, strong bilingual rendering, faster speed, and a fully open ecosystem.
Why it works (intuition):
- Removing AIGC early prevents the model from learning “AI-ish” textures.
- Multi-Granularity Captioning packs more useful semantics per token, so the model understands both entities and photographic style.
- Character-level encoding in quotes makes the “draw this exact text” task simpler, especially for thousands of Chinese glyphs.
- Simple, 3D rotary positional embeddings handle different aspect ratios without brittle hacks.
- RL with the right rewards is like adding a coach who scores realism, text accuracy, and beauty—nudging the model’s habits in the right direction.
Building blocks (each with a Sandwich):
🍞 Hook (Multi-Granularity Captioning): You know how a good study guide has bullet points, summaries, and detailed explanations?
🥬 The Concept: Multi-Granularity Captioning teaches with four layers: entities, phrases, composition, and photographic details. How it works: (1) Extract key entities and attributes; (2) Add concise phrases; (3) Describe scene composition; (4) Provide compact, photography-aware descriptions. Why it matters: Without levels, captions get bloated or shallow, wasting tokens and missing crucial details.
🍞 Anchor: For a city nightscape, the model learns “skyscrapers,” “long exposure light trails,” and “warm/cool color contrast,” not just “a city at night.”
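As a concrete picture of what "four layers" means in practice, here is a toy record for one training image. The field names and sampling weights are assumptions for illustration; the report only states that weighted sampling favors the richer captions.

```python
import random

# Hypothetical caption record for one image; field names are illustrative.
captions = {
    "entities":     "skyscrapers, light trails, river",
    "phrases":      "a city skyline at night with long-exposure traffic trails",
    "composition":  "low-angle wide shot, skyline on the upper third, reflections in the river below",
    "photographic": "long exposure, warm/cool color contrast, tack-sharp tripod shot",
}

# Weighted sampling that leans toward the richer levels during training.
levels  = ["entities", "phrases", "composition", "photographic"]
weights = [0.1, 0.2, 0.3, 0.4]   # assumed weights, not values from the report
chosen  = random.choices(levels, weights=weights, k=1)[0]
print(chosen, "->", captions[chosen])
```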
🍞 Hook (Positional Embedding – MRoPE): Imagine labeling seats in a theater by row and column so everyone knows where to sit—even if the theater shape changes.
🥬 The Concept: Multimodal Rotary Position Embedding (MRoPE) helps the model know “where” tokens belong across text and images. How it works: (1) Use a 3D tag to separate types of tokens (noise, text, reference image); (2) Encode 2D positions for image patches; (3) Keep text consistent. Why it matters: Without clear positions, spatial layouts and aspect ratios confuse the model.
🍞 Anchor: When generating a wide poster vs a tall banner, MRoPE keeps elements placed correctly.
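Here is a rough sketch, under stated assumptions, of how 3D position ids could be assigned in the spirit of MRoPE: one axis tags the modality, the other two carry 2D coordinates for image patches, while text tokens reuse a flat index so they stay aspect-ratio agnostic. The exact axis order and ranges in LongCat-Image may differ.

```python
import torch

def mrope_ids(h_tokens, w_tokens, n_text, mod_noise=0, mod_text=1):
    """Assign (modality, row, col) ids in the spirit of 3D MRoPE (illustrative)."""
    # Text tokens: modality tag + the same sequential index on both spatial axes.
    idx = torch.arange(n_text)
    txt_ids = torch.stack([torch.full_like(idx, mod_text), idx, idx], dim=-1)

    # Image (noise) tokens: modality tag + true 2D grid coordinates.
    rows, cols = torch.meshgrid(torch.arange(h_tokens), torch.arange(w_tokens), indexing="ij")
    img_ids = torch.stack([torch.full_like(rows, mod_noise), rows, cols], dim=-1).reshape(-1, 3)

    return torch.cat([txt_ids, img_ids], dim=0)   # one (modality, row, col) triple per token

print(mrope_ids(h_tokens=64, w_tokens=64, n_text=77).shape)   # torch.Size([4173, 3])
```

A reference image would simply get a third modality tag, which is exactly what the editing design reuses later.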
🍞 Hook (Character-level Tokenizer for Quoted Text): Think of spelling a tricky word letter-by-letter to write it perfectly.
🥬 The Concept: For text inside quotes, the model parses characters individually (especially for Chinese), simplifying exact rendering. How it works: (1) Detect quoted text; (2) Split into characters; (3) Feed characters directly to the model. Why it matters: Without this, rare or complex characters often get mangled.
🍞 Anchor: “在黑板上写‘华’” yields the exact glyph strokes, not a look-alike.
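A minimal sketch of the quoted-text split, assuming a simple regex over common quote characters; the real tokenizer's quote set and marker handling are not spelled out here, so treat the details as placeholders.

```python
import re

# Quote characters handled by this toy splitter; the real set is an assumption.
QUOTED = re.compile(r'["“”‘’]([^"“”‘’]+)["“”‘’]')

def split_prompt(prompt):
    """Split a prompt into (is_quoted, text) chunks; quoted spans become character lists."""
    chunks, last = [], 0
    for m in QUOTED.finditer(prompt):
        if m.start() > last:
            chunks.append((False, prompt[last:m.start()]))   # normal text -> subword tokenizer
        chunks.append((True, list(m.group(1))))              # quoted text -> one token per character
        last = m.end()
    if last < len(prompt):
        chunks.append((False, prompt[last:]))
    return chunks

print(split_prompt('在黑板上写“有机化学”，白色宋体'))
# [(False, '在黑板上写'), (True, ['有', '机', '化', '学']), (False, '，白色宋体')]
```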
🍞 Hook (RLHF + Reward Models): Like a coach who checks your form, speed, and accuracy all at once.
🥬 The Concept: Reinforcement Learning with Human Feedback uses reward models to push outputs toward what people prefer. How it works: (1) Generate candidates; (2) Score with OCR, AIGC detector, distortion, and aesthetic preference; (3) Update the model to favor higher scores. Why it matters: Without targeted rewards, the model may look pretty but miss text, or be accurate but look fake.
🍞 Anchor: If the poster writes “SALE 50%” clearly and looks like a real photo, it scores high and the model learns that style.
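Conceptually, the ensemble can be thought of as one weighted score. The callables and weights below are placeholders rather than the report's exact recipe; the point is that all four signals pull on the same reward.

```python
def combined_reward(image, target_text, ocr, aigc_detector, distortion, preference,
                    weights=(0.3, 0.3, 0.2, 0.2)):
    """Blend the four reward signals into one scalar (illustrative aggregation)."""
    ocr_score     = ocr.text_accuracy(image, target_text)        # 1.0 = every glyph rendered correctly
    realism       = 1.0 - aigc_detector.prob_ai_generated(image)  # reward images that read as real photos
    no_distortion = 1.0 - distortion.severity(image)              # penalize warped faces, hands, objects
    liking        = preference.score(image)                       # learned human-preference score

    w_ocr, w_real, w_dist, w_pref = weights
    return w_ocr * ocr_score + w_real * realism + w_dist * no_distortion + w_pref * liking
```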
Put together, these parts let LongCat-Image do more with less—like an athlete who trains smart, not just hard.
03 Methodology
At a high level: Prompt (text, optionally with quoted text to render) → built-in prompt refinement → text encoding (bilingual, character-level inside quotes) → noisy latents in the VAE's latent space, denoised by the DiT (MM-DiT → Single-DiT with 3D MRoPE) → VAE decode → image. For editing: also feed in the source-image latents and the edit instruction, then generate the edited image.
🍞 Hook (Data Curation Pipeline): You know how you first tidy your desk, then label folders, then write neat notes?
🥬 The Concept: A four-step data pipeline: filter, extract meta info, multi-level captions, then stratify for staged training. How it works: (1) Remove duplicates, watermarks, low-quality, and AIGC; (2) Extract categories, styles, named entities, OCR, and aesthetics; (3) Produce entity→phrase→composition→photographic captions; (4) Split data into pre/mid/SFT subsets. Why it matters: Without this, the model learns from mess and gets messy results.
🍞 Anchor: An image of a football match is cleaned, labeled (teams, numbers), captioned at multiple levels, and routed to the right training stage.
- Filtering: MD5 hashing plus SigLIP embeddings for exact and near-duplicate removal; keep reasonable resolutions; detect and remove watermarks; require strong aesthetic scores; and, crucially, detect and remove AIGC early (a minimal sketch of this step follows this list).
- Meta extraction: Category (portrait, indoor, poster…), style descriptors, named entities (teams, brands), OCR text, and a two-axis aesthetic score (Quality and Artistry).
- Multi-Granularity Captioning: Qwen2.5-VL + InternVL2.5 plus a fine-tuned Photographic Captioner for dense-yet-compact descriptions. Weighted sampling favors richer captions.
- Stratification: Pre-training = mostly real, photoreal data; Mid-training = high-quality sharpeners + a gradual dose of stylized art; SFT = human-curated real and hand-checked synthetic for aesthetics and fidelity.
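A toy version of the dedup-and-quality gate from the Filtering bullet, assuming each record carries its raw bytes, size, and an aesthetic score, and that `embed_fn` returns an L2-normalized SigLIP-style embedding. Thresholds and field names are illustrative, not the report's settings.

```python
import hashlib
import numpy as np

def filter_images(records, embed_fn, sim_threshold=0.95, min_side=512, min_aesthetic=0.6):
    """Exact dedup (MD5), quality gates, and near-dup removal by cosine similarity."""
    seen_md5, kept, kept_embs = set(), [], []
    for rec in records:
        digest = hashlib.md5(rec["bytes"]).hexdigest()
        if digest in seen_md5:
            continue                                            # exact duplicate
        if min(rec["size"]) < min_side or rec["aesthetic"] < min_aesthetic:
            continue                                            # resolution / aesthetics gate
        # Watermark and AIGC detectors would add further gates right here.
        emb = embed_fn(rec["bytes"])                            # L2-normalized embedding
        if kept_embs and float(np.max(np.stack(kept_embs) @ emb)) > sim_threshold:
            continue                                            # near-duplicate of something already kept
        seen_md5.add(digest)
        kept.append(rec)
        kept_embs.append(emb)
    return kept
```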
🍞 Hook (Stratified Training – Pre/Mid/Post): Think of school: learn basics first, refine skills second, polish for the recital last.
🥬 The Concept: Train in stages—Pre-training (structure), Mid-training (quality/style), Post-training (SFT + RL for taste and precision). How it works: (1) Pre-train on progressive resolutions (256→512→512–1024) with real images; (2) Mid-train on stricter, high-quality data and a small, scheduled dose of art; (3) Post-train with SFT for aesthetics and RL for preferences and text accuracy. Why it matters: Without stages, the model either learns slowly or locks into bad habits.
🍞 Anchor: After pre-training knows “what’s where,” mid-training sharpens textures, and post-training chooses the most human-preferred look.
Model design:
- Diffusion backbone: Hybrid MM-DiT (double-stream early) then Single-DiT later (roughly 1:2 ratio), following FLUX.1-dev for efficient, powerful attention.
- VAE: 8× spatial downsample, then merge latent tokens 2×2 into a compact sequence (see the worked example after this list); high-fidelity reconstruction is key for tiny text details.
- Text encoder: Qwen2.5-VL-7B for strong English and Chinese; remove unused adaLN text-in-timestep tricks; use character-level tokenization inside quotes for accurate rendering.
- Positional embeddings: 3D MRoPE—one dimension for modality (noise, text, reference), two for 2D spatial coordinates. Works across aspect ratios and resolutions.
- Prompt engineering: A built-in prompt-rewriting step that reuses the same text encoder (no external API needed) to bridge the gap between dense training captions and short user prompts.
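To see why the VAE bullet above matters for speed, here is the worked sequence-length arithmetic: an 8× spatial downsample followed by a 2×2 token merge turns a 1024×1024 image into a 64×64 grid of 4,096 tokens for the DiT. Channel counts and any padding rules are omitted.

```python
def latent_tokens(height, width, vae_downsample=8, patch_merge=2):
    """Sequence length the DiT sees for one image, given the VAE bullet above."""
    lat_h, lat_w = height // vae_downsample, width // vae_downsample   # 8x spatial downsample
    return (lat_h // patch_merge) * (lat_w // patch_merge)             # 2x2 token merge

print(latent_tokens(1024, 1024))   # 128x128 latent -> 64x64 = 4096 tokens
print(latent_tokens(1536, 1024))   # a wide poster -> 96x64 = 6144 tokens; MRoPE keeps the layout coherent
```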
Training details:
- Pre-training: Progressive resolutions; real-time dashboards track loss, alignment, aesthetics, and OCR accuracy. Chinese text rendering is practiced with dynamically sampled SynthDoG data: error-prone characters get boosted, and the synthetic stream is faded out later to avoid overfitting to its simple textures (a toy sketch of this weighting follows this list).
- Mid-training: Stricter curation, human-in-the-loop, balanced domains; produce a “Developer Version” checkpoint—flexible and not over-aligned, ideal for community fine-tuning.
- SFT: Mix hand-picked real and top-quality synthetic; train multiple specialty models (e.g., lighting, portrait, style), then parameter-average to combine strengths; switch to uniform timestep sampling to refine high-frequency details.
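The dynamic character sampling from the pre-training bullet can be pictured as a tiny weighting rule: characters the model currently renders badly get sampled more often in the synthetic text stream. The floor and boost values are assumed knobs, not numbers from the report.

```python
import random

def character_sampling_weights(error_rate, floor=0.05, boost=4.0):
    """Up-weight characters with high recent OCR failure rates (illustrative)."""
    return {ch: floor + boost * err for ch, err in error_rate.items()}

error_rate = {"华": 0.02, "有": 0.01, "龘": 0.61, "齉": 0.78}   # toy OCR failure rates
weights = character_sampling_weights(error_rate)
chars, w = zip(*weights.items())
print(random.choices(chars, weights=w, k=8))   # rare, hard glyphs dominate the practice batch
```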
🍞 Hook (AIGC Detection as Guardrail): Like a food tester sniffing out fake ingredients before they hit the stew—and again later tasting the dish for freshness.
🥬 The Concept: Use an AIGC detector to remove AI-made images early and also to reward real-looking textures during RL. How it works: (1) Filter out synthetic at pre/mid stages; (2) In RL, give higher scores to images that “fool” the detector into thinking they’re real. Why it matters: Without this, the model picks up plasticky patterns and never reaches deep realism.
🍞 Anchor: Landscapes stop looking waxy and start showing natural grain, messy reflections, and real-life imperfections.
🍞 Hook (RL with DPO/GRPO/MPO): Think of three training drills: one from past examples (DPO), one with group comparisons (GRPO), and one streamlined solo practice (MPO).
🥬 The Concept: Reinforcement learning shapes the model toward OCR-correct, distortion-free, human-preferred, and realism-rich images. How it works: (1) DPO uses curated win/lose pairs to learn preferences offline; (2) GRPO samples groups and ranks within each group; (3) MPO samples one path per prompt with smart baselines and curriculum to improve efficiently. Why it matters: Without RL, the model may plateau on average quality and miss what humans actually like.
🍞 Anchor: If “SALE 50%” is crisp and the poster looks truly photographed, rewards go up and the model locks in that behavior.
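For the offline branch, the standard DPO objective on curated win/lose pairs looks like the sketch below. For diffusion models the log-likelihoods are usually approximated from denoising losses (as in Diffusion-DPO); the tensors here are placeholders just to show the shape of the update.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss: prefer the 'win' image more than the frozen reference does."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy call with fake log-probs, batch of 4.
base = torch.randn(4)
print(dpo_loss(base + 1.0, base - 1.0, base, base))   # loss drops as wins are favored
```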
Image editing:
- Architecture: Add a reference-image branch. Encode the source image with the VAE, mark its tokens via MRoPE’s modality slot, concatenate with noisy latents, and feed the joint sequence into the diffusion transformer. Encode both source and instruction in the text encoder with a distinct system prompt to signal “editing mode.”
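A minimal sketch of how the editing sequence could be assembled, reusing the modality-tagging idea from the MRoPE sketch earlier: source-image latents are concatenated with the noisy latents and instruction tokens, each group carrying its own modality tag. Names and shapes are placeholders, not the released implementation.

```python
import torch

def build_edit_sequence(noisy_latents, ref_latents, text_tokens):
    """Concatenate instruction, noisy, and reference tokens with per-group modality tags."""
    tokens = torch.cat([text_tokens, noisy_latents, ref_latents], dim=0)
    modality = torch.cat([
        torch.full((text_tokens.shape[0],), 1),    # edit instruction
        torch.full((noisy_latents.shape[0],), 0),  # latents being denoised
        torch.full((ref_latents.shape[0],), 2),    # source (reference) image
    ])
    return tokens, modality   # the diffusion transformer attends over the joint sequence

tokens, modality = build_edit_sequence(torch.randn(4096, 64), torch.randn(4096, 64), torch.randn(77, 64))
print(tokens.shape, modality.shape)   # torch.Size([8269, 64]) torch.Size([8269])
```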
🍞 Hook (Editing Consistency): Imagine painting over a photo: you want the new object, but the lighting, angles, and textures must still fit the original.
🥬 The Concept: Visual consistency is keeping the edited image faithful to the source scene. How it works: (1) Start from a mid-training T2I checkpoint (more flexible); (2) Jointly train T2I + editing tasks to prevent forgetting; (3) Curate and filter pairs with strict alignment; (4) Use SFT and DPO for stable, instruction-following edits. Why it matters: Without this, faces warp, lighting shifts, and edits look pasted-on.
🍞 Anchor: “Turn this bread into a cupcake with cream and a cherry” changes the pastry but keeps table, shadows, and camera viewpoint matching the original.
Secret sauce:
- Clean real data early; dynamic text rendering practice; bilingual character-level text encoding; simple but strong positional embedding; staged curriculum; and RL that cares about exactly what users care about—readable text, realism, and pleasing aesthetics.
04 Experiments & Results
🍞 Hook (Benchmarks): You know how a report card shows scores in math, reading, and science? Benchmarks are a model’s report card across different skills.
🥬 The Concept: They test alignment (did you follow the prompt?), knowledge, text rendering, and editing. How it works: (1) Use public test sets with clear scoring; (2) Compare against known strong models; (3) Explain scores with context. Why it matters: Without fair tests, claims of “state-of-the-art” don’t mean much.
🍞 Anchor: Scoring 0.87 on GenEval is like getting an A when most classmates hover at B.
The tests:
- GenEval: Checks attribute binding, counting, colors, and spatial layout accuracy. LongCat-Image’s overall score ~0.87—tied with or surpassing other top open-source systems.
- DPG-Bench: Dense, complex prompts that stress semantic alignment; the model is competitive with leading systems, showing it handles verbose, specific instructions.
- WISE: World-knowledge and reasoning tasks; LongCat-Image scores ~0.65 among diffusion models, strong for a non-LLM generator.
- GlyphDraw2: Text-in-image for English and Chinese, including hard “Complex” character sets; LongCat-Image averages ~0.95 and shines on complex Chinese, reflecting robust glyph handling.
- CVTG-2K: Multi-region English text rendering in real scenes; word accuracy averages ~0.866, a state-of-the-art result among peers.
- ChineseWord: 8,105 prompts covering common to long-tail characters; LongCat-Image reaches ~98.7 on Level-1 common chars and ~90.7 overall—industry-leading coverage and accuracy.
- Internal Poster & Scene: Business-oriented tests (posters and natural scene text) where it reaches SOTA-level results, proving practical reliability.
The competition: Compared against strong open-source and commercial models (e.g., FLUX.1-dev, SD-3/3.5, Qwen-Image, Hunyuan 3.0, Seedream 4.0), LongCat-Image often matches or surpasses them despite being smaller.
The scoreboard with context:
- GenEval 0.87: Equivalent to an A when others score A− to B+.
- WISE 0.65: A top diffusion score, showing it reasons about world facts better than many peers.
- GlyphDraw2 ~0.95 avg: Like nailing spelling tests in both English and Chinese, including difficult words.
- CVTG-2K ~0.866 word accuracy: Strong multi-box text rendering under real-world layouts.
- ChineseWord ~90.7 overall across all levels: Outstanding long-tail Chinese coverage; rare-character mastery stands out.
Editing results: The editing model exhibits state-of-the-art consistency and instruction-following among open-source peers. Qualitative examples show precise object changes, style shifts, and text edits while preserving lighting and perspective.
Surprising findings:
- Early AIGC removal helps realism later, even if it slows early convergence a bit. That trade pays off.
- A “Developer Version” mid-training checkpoint is more adaptable for editing than an over-polished SFT/RL model.
- Multi-Granularity Captioning improves both understanding and output quality; denser, photography-aware captions matter.
- Character-level tokenization for quoted text significantly boosts non-Latin rendering without extra specialized encoders.
Efficiency wins: With only 6B parameters and a fast architecture, VRAM use and latency are notably lower than 20B+ models or MoE designs. That translates to lower costs and smoother deployments, without giving up quality.
05 Discussion & Limitations
Limitations:
- Multi-character sequences (whole sentences) can still wobble compared to single characters, especially for rare combinations; more real text-in-scene data would help.
- Complex human edits that involve large pose or viewpoint shifts can still introduce artifacts, though video-frame pairs improve this.
- Aesthetic taste is subjective; while the model scores well, some commercial systems may edge it out on certain style preferences.
Required resources:
- Training needs large GPU clusters for multi-stage diffusion training and RL; however, inference is comparatively light for a 6B model.
- High-quality, human-curated data is critical (and expensive) for SFT and editing consistency.
When not to use:
- If you need pixel-perfect long paragraphs in a single image (e.g., book pages), specialized OCR-aware generators might be better.
- If your domain is ultra-niche (e.g., medical scans with strict regulations), consider domain-specific fine-tuning with expert-labeled data.
- If you must run on extremely tiny devices, even 6B may still be too large without further distillation.
Open questions:
- How far can character-level tokenization go for long texts without a dedicated glyph encoder?
- What’s the best mix of synthetic vs real text-in-scene data to maximize multi-line stability?
- Can we design universal reward models that correlate even better with human aesthetics across cultures?
- How well does the Developer Version accelerate community fine-tuning across many niches?
Overall, LongCat-Image honestly balances strengths—text rendering, realism, efficiency—with clear next-step goals: multi-line robustness, complex edits, and ever-better aesthetic alignment.
06 Conclusion & Future Work
Three-sentence summary: LongCat-Image proves that a compact 6B diffusion model, trained with carefully cleaned data, multi-level captions, bilingual character-level text handling, and targeted RL, can deliver state-of-the-art realism and text rendering. It excels especially in Chinese character coverage and editing consistency, while staying fast and affordable to deploy. The team open-sourced not just models but the full training toolchain, empowering broader research and real-world use.
Main achievement: Showing that “smart data + smart training” can beat brute-force size, achieving industry-leading bilingual text rendering and strong photorealism with a small, efficient model.
Future directions:
- Strengthen multi-line and paragraph-level text stability with more real-world, text-rich datasets.
- Enhance complex editing (pose/view shifts) using richer video-derived training pairs and better spatial alignment losses.
- Refine aesthetic reward models that generalize across cultures and tasks.
- Explore lightweight distillation for even cheaper on-device deployment.
Why remember this: LongCat-Image changes the conversation from “bigger is better” to “better is better.” It demonstrates that with disciplined data, bilingual smarts, and the right rewards, a lean diffusion model can write clean text in images, look truly photographic, and run fast—making high-quality visual creation more accessible to everyone.
Practical Applications
- Design bilingual posters and flyers with crisp, accurate text in both English and Chinese.
- Generate photoreal product images for e-commerce with brand names and prices rendered cleanly.
- Create classroom visuals (diagrams, labels, worksheets) that require precise on-image text.
- Edit marketing photos (change objects, adjust lighting, add slogans) while preserving realism.
- Prototype app UIs or dashboards with readable, layout-consistent text regions.
- Localize ads and signage for different markets by swapping languages with accurate fonts.
- Produce social media graphics and memes where text must blend naturally with backgrounds.
- Assist publishers by generating book covers or magazine spreads with exact typography.
- Augment datasets for OCR and scene-text models by generating varied, realistic text-in-scene images.
- Rapidly iterate concept art that mixes realistic scenes with stylized, on-image typography.