Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Key Summary
- Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.
- It uses a special bridge called the Mobile Conditioning Projector (MCP) to connect what it understands about language and images to how it draws pictures, without adding heavy parts.
- Instead of needing huge amounts of data, it learns well from just a few million pairs and then a tiny 105k new training samples shaped like quadruplets: (prompt, image, question, answer).
- On the GenEval test for image creation, Mobile-O scores 0.74, beating Show-O by 5 points and JanusFlow by 11.
- For image understanding across seven challenges, it does better than Show-O by 15.3% and JanusFlow by 5.1% on average.
- It makes a 512×512 image on an iPhone in about 3 seconds and stays under ~2 GB of memory, so it's practical for real use.
- The MCP fuses multiple layers of understanding, then compresses and refines them with depthwise-separable convolutions, making the bridge fast and accurate.
- A unified post-training step teaches understanding and generation together using the same sample, so each skill helps the other.
- It also supports simple image editing by conditioning on both an input image and an instruction, even with limited extra training.
- This design shows a path to private, offline, real-time multimodal AI on everyday devices.
Why This Research Matters
Mobile-O brings powerful image understanding and creation directly onto everyday devices, so you get speed and privacy without needing the cloud. This means you can analyze a photo, generate art, or edit an image even with poor internet, or no internet at all. For families and schools, this keeps kids' photos and data safely on-device while enabling creative, hands-on learning. For professionals, it enables field work (scanning charts, documenting scenes, or prototyping visuals) without latency or data costs. And for accessibility, it can describe surroundings in real time to help users better understand their environment. Overall, it shows how thoughtful design can make advanced AI practical, fast, and respectful of privacy.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you like using phone apps that can describe photos, translate signs, or even make fun pictures, but sometimes they're slow or need the internet? Wouldn't it be nice if they worked fast, right on your phone, without sending anything to the cloud?
🥬 Filling (The Actual Concept): Unified Multimodal Model
- What it is: A unified multimodal model is a single AI that can understand images and text and also generate images, all in one system.
- How it works (step by step):
- It looks at input (like a picture, a question, or a text prompt).
- It turns them into tokens (tiny units of meaning).
- It reasons about what the text and image mean together.
- It either answers questions (understanding) or creates a new image (generation).
- Why it matters: Without unifying these skills, we'd need separate apps and extra time to move information around, which is slow and memory-hungry, especially on phones.
🍞 Bottom Bread (Anchor) Imagine a single phone app that can read a menu photo and then instantly draw a fancy dish you describe, with no switching apps and no lag.
The World Before Before this research, many AIs could do one job well: understand pictures (like answering questions about a photo) or generate pictures from text (like drawing a scene you describe). If a system did both, it usually glued together big, heavy parts: a large vision-language brain plus a big image-making engine. They worked but were bulky, like carrying a full toolbox for a tiny screw.
The Problem Two big roadblocks stopped these models from running nicely on your phone:
- They were heavy: Big encoders and big diffusion models used lots of memory and computation. Phones don't like that.
- They were data-hungry: To make text and image parts talk well, people trained on 50 million to 1 billion samples. That's slow, costly, and still not friendly for quick iteration.
Failed Attempts
- One-transformer-for-everything designs were powerful but computationally expensive.
- Query-token bridges (special tokens between understanding and generation) helped big models, but aligning them well needed tons of data, too much for a compact, on-device system.
- Training styles:
- Joint training mixed separate datasets for understanding and generation, but balancing them was hard and often unaligned.
- Sequential training froze understanding while training generation, missing the chance for the skills to help each other.
The Gap What was missing was a way to:
- Connect understanding and generation efficiently without needing a mountain of data.
- Train both skills together using the same small example so they align naturally.
- Keep everything small and fast enough for a phone.
🍞 Top Bread (Hook) Imagine whispering the right instructions straight into the artist's ear instead of passing messages through three messengers who might garble it.
🥬 Filling (The Actual Concept): Diffusion Model (for generation)
- What it is: A diffusion model is an AI painter that starts from noise and gradually turns it into a clear image by following guidance.
- How it works:
- Start with noisy, static-like pixels.
- At each step, predict how to clean the noise a bit.
- Use the text (and sometimes image) guidance to steer the cleaning toward the right picture.
- Repeat until the image looks like the prompt.
- Why it matters: Without diffusion, we'd struggle to make crisp, detailed, controllable images.
🍞 Bottom Bread (Anchor) Ask for "a red balloon floating over a blue lake," and step by step the noise shapes into that exact scene.
Why This Paper Exists Mobile-O aims to make unified understanding-and-generation actually work on edge devices (like iPhones, Jetsons, and laptops) without cloud help. To do that, it introduces a smarter bridge between the language-and-vision brain and the diffusion painter, and a tiny training trick so both skills grow together instead of separately.
Real Stakes (Why You Should Care)
- Speed: Make and understand images in seconds on your phone.
- Privacy: Your photos stay on your device, which is safer and more comfortable.
- Battery and Cost: No heavy cloud calls or data plans.
- Reliability: Works offline, which is great for travel, classrooms, or remote areas.
- Creativity and Learning: Kids can ask questions about pictures and also make art right away, in one place.
🍞 Top Bread (Hook) You know how team projects work best when teammates talk directly and share the same notes?
🥬 Filling (The Actual Concept): Cross-Modal Conditioning
- What it is: Cross-modal conditioning is when information from one type (text or image) directly guides another type (like the image generator), so both stay in sync.
- How it works:
- Extract meaningful features from the text and/or image.
- Feed those features into the generator's attention.
- The generator focuses on what matters (colors, objects, counts, positions).
- Why it matters: Without it, the generator may ignore instructions or miscount objects.
🍞 Bottom Bread (Anchor) When you ask, "three green apples on a blue plate," conditioning helps the painter put three (not two, not five) green apples on blue, not red.
02 Core Idea
🍞 Top Bread (Hook) Imagine building a tiny, super-efficient bridge that lets your brain's reading-and-seeing side talk directly to your drawing side, so you can think and create quickly without getting tired.
🥬 Filling (The Actual Concept): The Key Insight
- What it is: Mobile-O's "aha!" is a lightweight bridge, the Mobile Conditioning Projector (MCP), that directly maps what the model understands (from the vision-language brain) into what the diffusion painter needs, plus a small unified post-training step that teaches both skills together using one compact sample format.
- How it works:
- Collect the last few layers of the modelās understanding features.
- Fuse them with learned weights (layerwise alignment) so the best mix pops out.
- Gently compress and refine them with depthwise-separable convolutions.
- Project them into exactly the shape the diffusion model expects.
- Post-train with quadruplets (prompt, image, question, answer) so understanding and generation help each other.
- Why it matters: Without this, you'd need heavy query tokens and huge datasets to align skills. With MCP + quadruplets, you get accurate, fast, and on-device multimodal AI.
🍞 Bottom Bread (Anchor) Think of it like giving the painter a neat, well-organized checklist directly from the reader, so the painter can draw the right thing fast.
Multiple Analogies (Same Idea, 3 Ways)
- Coach-and-Player: The vision-language model is a coach who sees the whole game; the diffusion model is the player who scores. MCP is the headset that sends just the right calls quickly, so the player acts fast and accurately.
- Recipe-and-Chef: The language-and-vision brain writes a clear recipe. MCP translates that recipe into the chefās language (exact temperatures and timings), so the chef (diffusion) cooks the dish perfectly.
- Highway On-Ramp: Understanding is the main road; generation is another. MCP is a short, smart on-ramp that merges traffic smoothly without long detours (no extra query tokens), so cars (features) don't slow down.
🍞 Top Bread (Hook) You know how measuring both reading and writing together gives a better idea of a student's skill than testing them separately?
🥬 Filling (The Actual Concept): Quadruplet Data Representation
- What it is: A training sample shaped like (generation prompt, image, question, answer) so one example teaches both image-making and image-understanding at the same time.
- How it works:
- Use the prompt to guide generation.
- Use the image and a question to guide understanding.
- Optimize both losses together.
- Repeat so the two skills build on each other.
- Why it matters: Without this, datasets are separate, and models don't learn how making images and understanding images relate.
🍞 Bottom Bread (Anchor) Picture a homework sheet where you first write a story (generation) and then answer questions about it (understanding); you learn faster because both tasks share the same context.
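The quadruplet format and its joint objective can be sketched in a few lines of Python. This is illustrative only, not the paper's code: the field names, the `joint_loss` helper, and the weighting term `lam` are assumptions.

```python
from dataclasses import dataclass

# One training sample teaches both skills at once (names are illustrative).
@dataclass
class Quadruplet:
    prompt: str    # text prompt used for image generation
    image: str     # path (or tensor) of the ground-truth image
    question: str  # understanding question about that image
    answer: str    # expected answer

def joint_loss(lm_loss: float, diffusion_loss: float, lam: float = 1.0) -> float:
    """Combine the understanding (language) loss and the generation
    (diffusion/flow-matching) loss computed from the same quadruplet."""
    return lm_loss + lam * diffusion_loss

sample = Quadruplet(
    prompt="a rabbit in a forest at golden hour",
    image="rabbit.png",
    question="What animal is in the picture?",
    answer="a rabbit",
)
total = joint_loss(lm_loss=0.8, diffusion_loss=0.5)
```

Because both losses flow through the same fused features, a gradient that improves counting in answers also nudges the generator toward drawing the right counts.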
Before vs After
- Before: Big models, extra query tokens, and massive data to align tasks; hard to run on phones.
- After: A compact 1.6B-parameter system with a tiny, smart connector (MCP) and a small unified post-training set that aligns skills well enough for real-time on-device use.
🍞 Top Bread (Hook) Imagine picking the best layers of a cake, not just the top frosting, to get the perfect bite.
🥬 Filling (The Actual Concept): Layerwise Alignment
- What it is: Learn to mix features from several recent layers of the understanding model so the generator gets a balanced view (details plus big picture).
- How it works:
- Take the last K layers.
- Learn weights for each.
- Softmax-normalize them so they form a clean blend.
- Pass the blend to the projector.
- Why it matters: Without it, you might overuse shallow or deep features and miss either details or overall structure.
🍞 Bottom Bread (Anchor) It's like blending the right mix of close-up and wide-angle photos before giving them to the painter.
Why It Works (Intuition, Not Equations)
- Direct, token-aligned conditioning avoids inventing new tokens that need tons of data to learn from scratch.
- Compressing and refining with depthwise-separable convolutions gives you the essence without the weight.
- Unified post-training with quadruplets ties both tasks to the same representation, so learning on one side helps the other.
🍞 Top Bread (Hook) Think of using a pencil with a thin tip instead of a chunky marker when you need fine details.
🥬 Filling (The Actual Concept): Depthwise-Separable Convolutions
- What it is: A light way to process sequences that's faster and smaller than full convolutions.
- How it works:
- Do depthwise filtering per channel (cheap).
- Mix channels with a pointwise step (cheap again).
- Add a tiny attention over channels to emphasize important parts.
- Why it matters: Without it, the connector would be heavy and slow, draining memory and battery.
🍞 Bottom Bread (Anchor) It's like sorting Legos by color (depthwise) and then snapping a few together (pointwise) to make exactly the piece you need.
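The depthwise-then-pointwise idea can be sketched with NumPy. This is a minimal illustration under assumed shapes, without padding, bias, or the channel-attention step the paper adds; the function name and arguments are made up for the example.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """Minimal depthwise-separable Conv1D sketch.

    x:          (seq_len, channels) token features
    dw_kernels: (channels, k) one small filter per channel (depthwise)
    pw_weights: (channels, out_channels) 1x1 mixing step (pointwise)

    Parameter count is C*k + C*C_out, versus C*C_out*k for a full
    Conv1D: roughly k times fewer weights.
    """
    seq_len, channels = x.shape
    k = dw_kernels.shape[1]
    out = np.empty((seq_len - k + 1, channels))
    # Depthwise: filter each channel independently (cheap).
    for c in range(channels):
        # Flip the kernel so np.convolve behaves like a conv layer's
        # cross-correlation.
        out[:, c] = np.convolve(x[:, c], dw_kernels[c][::-1], mode="valid")
    # Pointwise: mix channels with a 1x1 projection (cheap again).
    return out @ pw_weights

x = np.random.randn(16, 8)   # 16 tokens, 8 channels
dw = np.random.randn(8, 3)   # kernel size 3 per channel
pw = np.random.randn(8, 8)
y = depthwise_separable_conv1d(x, dw, pw)
print(y.shape)  # (14, 8)
```

The two cheap steps together approximate what one expensive full convolution would do, which is why the connector stays small.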
Building Blocks
- Vision-Language Model (VLM) for understanding, sized for mobile.
- Diffusion Transformer (DiT) for flexible, high-quality generation.
- Variational Autoencoder (VAE) to encode/decode images to smaller latents.
- MCP bridge for fast, aligned conditioning.
- Unified post-training with quadruplets so both skills improve together.
03 Methodology
At a High Level: Input → Vision-Language Model (understanding) → Mobile Conditioning Projector (bridge) → Diffusion Transformer (generation) → VAE decode (final image) and/or LLM decode (final answer)
Step 1: Vision-Language Understanding (VLM) 🍞 Hook: You know how you first read a map before deciding how to drive? 🥬 The Concept: Vision-Language Model (VLM)
- What it is: A compact model that looks at images and text together to build understanding.
- How it works:
- Encode the image into tokens (vision encoder).
- Mix tokens with the question or prompt (language model).
- Produce hidden states at each layer that capture meaning.
- Why it matters: Without a good understanding backbone, generation won't know what to draw, and Q&A won't be accurate. 🍞 Anchor: Given a photo of a bridge and the question "How many arches?", the VLM produces features that help count arches.
Step 2: Layerwise Alignment (Choosing the Right Mix) 🍞 Hook: Imagine choosing the best bites from several layers of a cake. 🥬 The Concept: Layerwise Alignment
- What it is: Learning weights to blend the last K layers of the VLM.
- How it works:
- Take hidden states from layers L−K+1 to L.
- Learn a weight for each layer.
- Softmax the weights so they form a clean blend.
- Get a fused representation that balances detail with context.
- Why it matters: Relying on a single layer can miss either fine details or the big picture. 🍞 Anchor: For "a red mug on a wooden table by a window," the blend catches red (detail), mug shape (object), table texture (context), and window lighting (scene).
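The softmax blend in this step can be sketched with NumPy; `fuse_layers`, its shapes, and the zero-initialized logits are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_layers(hidden_states, logits):
    """Softmax-weighted blend of the last K VLM layers.

    hidden_states: (K, seq_len, dim) hidden states from layers L-K+1 .. L
    logits:        (K,) learnable per-layer scores
    """
    w = np.exp(logits - logits.max())
    w = w / w.sum()  # softmax: weights are positive and sum to 1
    # Contract the K axis: weighted sum over layers -> (seq_len, dim)
    return np.tensordot(w, hidden_states, axes=1)

K, seq_len, dim = 4, 10, 32
h = np.random.randn(K, seq_len, dim)
fused = fuse_layers(h, np.zeros(K))  # zero logits = equal weights
print(fused.shape)  # (10, 32)
```

With zero logits the blend is just the mean of the K layers; training then shifts the weights toward whichever layers carry the most useful mix of detail and context.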
Step 3: Mobile Conditioning Projector (MCP) 🍞 Hook: Think of a tiny translator that turns your ideas into the artist's exact language. 🥬 The Concept: Mobile Conditioning Projector (MCP)
- What it is: A light bridge that reshapes VLM features into the format the diffusion model expects.
- How it works:
- Compress the fused features to a smaller size (Linear + LayerNorm).
- Refine with depthwise-separable Conv1D along the token sequence (cheap and token-aligned).
- Apply a tiny channel attention to emphasize important channels.
- Project to the diffusion modelās conditioning dimension.
- Feed the sequence as keys/values to all cross-attention layers in the DiT.
- Why it matters: Without MCP, you'd add heavy query tokens or big MLP stacks, needing huge data and more memory. 🍞 Anchor: The MCP turns "make two green apples on a blue plate by a sunny window" into a neat set of signals that the diffusion model can follow precisely.
Step 4: Diffusion Transformer (DiT) with Cross-Modal Conditioning 🍞 Hook: Imagine sculpting a statue by chipping away marble bit by bit, but with guidance. 🥬 The Concept: Cross-Modal Conditioning in a Diffusion Transformer
- What it is: A generator that uses attention to listen to MCP features at every denoising step.
- How it works:
- Start from a noisy latent image (VAE space).
- At each step, attend to MCP features that encode prompt/image meaning.
- Predict how to move the latent toward the target image (velocity/flow matching).
- After several steps, the latent forms the desired picture.
- Why it matters: Without cross-modal conditioning, images drift off-prompt (wrong colors, counts, positions). 🍞 Anchor: Ask for "two yellow ducks on a red raft." Attention ensures two ducks appear, not one or three, and the raft is red.
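The "attend to MCP features" step is ordinary cross-attention. Here is a single-head NumPy sketch with the learned projection matrices omitted for brevity (a real DiT block learns query/key/value projections); names and sizes are illustrative.

```python
import numpy as np

def cross_attention(queries, cond):
    """Image latents (queries) attend to MCP conditioning features
    (used as both keys and values in this simplified sketch).

    queries: (n_img, d) noisy image tokens
    cond:    (n_txt, d) MCP output sequence
    """
    d = queries.shape[-1]
    scores = queries @ cond.T / np.sqrt(d)  # (n_img, n_txt)
    # Softmax over the text tokens: each image token picks what to listen to.
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ cond  # (n_img, d)

img_tokens = np.random.randn(64, 32)  # e.g. an 8x8 latent grid, dim 32
mcp_feats = np.random.randn(20, 32)   # conditioning sequence from the MCP
out = cross_attention(img_tokens, mcp_feats)
print(out.shape)  # (64, 32)
```

Each denoising step repeats this lookup, which is how instructions like "two ducks" keep steering the latent at every stage instead of only once at the start.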
Step 5: VAE Encode/Decode 🍞 Hook: Think of zipping and unzipping a file to make it smaller to send and then restore it. 🥬 The Concept: Variational Autoencoder (VAE)
- What it is: A compact image compressor/decompressor.
- How it works:
- Encoder turns an image into a small latent grid.
- Generator edits/creates in this small space.
- Decoder turns the latent back into a full image.
- Why it matters: Without VAE, you'd generate pixels directly, which is much slower and heavier. 🍞 Anchor: It's like working with a sketch first (latent) and then painting the final big canvas (decoded image).
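Some quick arithmetic shows why working in latent space is so much cheaper. The 8× spatial downsampling and 4 latent channels below are typical of latent-diffusion VAEs and are assumptions here, not figures from the paper.

```python
# Pixel space vs. an assumed 8x-downsampled, 4-channel latent space.
height, width, channels = 512, 512, 3
latent_h, latent_w, latent_c = height // 8, width // 8, 4

pixel_values = height * width * channels        # 786,432 values
latent_values = latent_h * latent_w * latent_c  # 16,384 values
ratio = pixel_values // latent_values
print(ratio)  # 48: the denoiser works on ~48x fewer values per step
```

Since the denoiser runs many steps, that per-step saving multiplies, which is a big part of why generation fits in a phone's time and memory budget.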
Step 6: Training in Three Stages, Like a Recipe. Stage 1: Cross-Modal Alignment (Large but efficient pretrain)
- Freeze the VLM and VAE; train DiT + MCP on ~9M image–text pairs.
- Goal: Teach the painter to listen to the coach via MCP reliably.
- Why: Without this, the generator wonāt follow prompts consistently.
- Example: Learn general objects, colors, positions from JourneyDB and BLIP3o short captions.
Stage 2: Supervised Fine-Tuning (Target weak spots)
- Keep VLM/VAE frozen; continue training DiT + MCP on ~105k curated pairs.
- Goal: Fix gaps (e.g., human gestures, common objects, landmarks).
- Why: Without it, those cases remain shaky.
- Example: Improve hands, faces, everyday scenes.
Stage 3: Unified Multimodal Post-Training (Both skills together) 🍞 Hook: Like practicing reading and writing using the same story. 🥬 The Concept: Multimodal Post-Training with Quadruplets
- What it is: A small set of samples where each includes a generation prompt, the image, a question, and the answer.
- How it works:
- For each sample, compute a language loss (answer the question about the image).
- Compute a diffusion loss (generate the image from the prompt) using flow matching.
- Train both together so features align across tasks.
- Optional: Light LoRA on VLM and vision encoder to adapt gently without overfitting.
- Why it matters: Without joint post-training, skills improve separately and don't reinforce each other. 🍞 Anchor: One example teaches the model to draw "a rabbit in a forest at golden hour" and to answer "What animal is in the picture?", tightening the link between making and understanding.
Step 7: The Secret Sauce: Flow Matching (for stable, fast training) 🍞 Hook: Imagine following a river's flow instead of guessing where it might go each time. 🥬 The Concept: Flow Matching Objective
- What it is: A training target that teaches the model the best direction to move from noisy to clean images.
- How it works:
- Mix the clean latent with noise at strength t.
- Ask the model to predict the velocity that takes it toward the clean latent.
- Weight errors by how hard the step is.
- Repeat across many t values.
- Why it matters: Without flow matching, training can be slower and less stable. 🍞 Anchor: It's like getting a compass that always points you in the best direction back to the picture you want.
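The recipe above can be sketched with the common linear-interpolation formulation of flow matching: x_t = (1 − t)·x_clean + t·noise, with target velocity v = noise − x_clean. The paper's exact parameterization, schedule, and loss weighting may differ; this is a sketch of the general technique.

```python
import numpy as np

def flow_matching_pair(x_clean, t, rng):
    """Build one (noisy sample, velocity target) training pair.

    At t=0 the sample is the clean latent; at t=1 it is pure noise.
    The model is trained to predict v, the direction along the path.
    """
    noise = rng.standard_normal(x_clean.shape)
    x_t = (1.0 - t) * x_clean + t * noise
    v_target = noise - x_clean
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))  # a tiny "clean latent"
x_t, v = flow_matching_pair(x0, t=0.5, rng=rng)
# Training loss would be mean((model(x_t, t, cond) - v) ** 2),
# averaged over many random t values, as in the steps above.
```

Because the target direction is the same straight line at every t, the "compass" never contradicts itself, which is the intuition behind the stability claim.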
Concrete Example (Putting It All Together)
- Input: Prompt: "A red umbrella on a wooden bench in the rain." Question: "What object is on the bench?" Image: a matching scene.
- Understanding path: VLM reads the image and question, answers "umbrella."
- Generation path: MCP feeds fused features from text to DiT; DiT denoises latents guided by these features; VAE decodes the final image with the red umbrella.
- Joint training: The model gets feedback on both the answer and how on-prompt the generated image is.
Secret Sauce Summary
- A tiny, powerful bridge (MCP) that respects token alignment and keeps compute low.
- Layerwise fusion with learnable weights for the right semantic mix.
- Depthwise-separable Conv1D + channel attention for cheap, effective refinement.
- Quadruplet post-training so making and understanding images improve together.
04 Experiments & Results
The Test (What was measured and why)
- Image Understanding: General reasoning (MMMU, MM-Vet, SEED), reading text in images (TextVQA), charts (ChartQA), hallucination resistance (POPE), and scene understanding (GQA). These tell us if the model really "gets" pictures.
- Image Generation: GenEval, which checks alignment to prompts (objects, counts, colors, positions, attributes). This tells us if the images match instructions.
- Edge Speed: Real device latency on MacBook M2 Pro, Jetson Orin Nano, and iPhone 17 Pro. This shows real-world practicality.
The Competition (Who it was compared against)
- Unified peers under ~2B params: Janus, JanusFlow, Show-O.
- Generation-only models small and large (e.g., SDXL, SANA-0.6B) for context.
- Understanding-only baselines (e.g., FastVLM) to see if unified training helps understanding too.
The Scoreboard (With Context)
- GenEval Overall: 0.74 for Mobile-O (like scoring an A when others got a B).
- Beats Show-O-Clip-ViT by 5 points (0.69 → 0.74) and JanusFlow by 11 points (0.63 → 0.74).
- Especially strong on position and color-attribute checks.
- Understanding (average across seven benchmarks): Mobile-O outperforms Show-O by 15.3% and JanusFlow by 5.1% on average, while being smaller.
- On-Device Latency:
- iPhone 17 Pro: ~3.0 s for a 512×512 image; vision encoder ~102 ms; TTFT ~248 ms.
- MacBook M2 Pro: 2–8× faster for understanding, and 11–46× faster for generation vs. Janus/Show-O.
- Jetson Orin Nano: ~4 s per image vs. 22–52 s for others.
- Memory: Under ~2 GB on iPhone using MLX/Core ML, practical for real use.
Surprising or Notable Findings
- Unified Post-Training Helps Both Ways: Compared to the understanding-only FastVLM baseline, Mobile-O's unified setup improves average understanding by +1.6% while also adding strong generation, a win-win.
- Emergent Image Editing: Without changing the architecture, fine-tuning on 46k edit samples enables useful image edits (add/remove objects, color/style changes) with decent fidelity.
- Right Number of Layers: Fusing the last 4 VLM layers yields the best generation alignment (too few misses semantics; too many adds noise/redundancy).
- MCP vs. MLP Connectors: A simple MLP bridge needs more parameters and doesn't generalize as well on small data; MCP gets better alignment with fewer params.
Making Numbers Meaningful
- 0.74 on GenEval: That's like passing a detailed art exam where you must draw exactly what the teacher asked; most peers under 2B fell short.
- ~3 seconds on iPhone: That's the difference between "wait, loading…" and "done already!", smooth enough to feel instant.
- Under ~2 GB memory: Think of fitting your whole art-and-reading toolkit in a small backpack instead of a moving truck.
Ablations (What parts matter most)
- MCP design: Learnable layer fusion + Conv1D refinement + channel attention scored best with only ~2.4M trainable params in the connector.
- Post-training format: Quadruplets beat separate pairs, tightening cross-task alignment.
- Scaling up backbones: The same MCP + training recipe scales to larger components and improves both understanding and generation over their standalone versions.
05 Discussion & Limitations
Limitations (What it can't do yet)
- Text Encoder Power: Mobile-O reuses a small LLM for both understanding and as the text encoder for generation. This saves memory but may miss the depth a big dedicated text model (like a 2B+ language encoder) can provide in rare, complex prompts.
- Data Coverage: While data-efficient, the training corpus is still modest; very niche scenes or ultra-fine-grained prompts may need extra fine-tuning.
- Maximum Fidelity vs. Heaviest Models: Ultra-high-resolution artistic control can still favor very large, cloud-scale systems.
- Counting and Compositional Edge Cases: Greatly improved, but corner cases (e.g., many small objects, tricky occlusions) can still trip it up.
Required Resources
- Training: An 8ĆA100 (80GB) setup for a few days to run all stages comfortably; mixed precision and ZeRO optimization recommended.
- Inference: Modern phone or edge device; Mobile-O-0.5B fits under ~2 GB using MLX/Core ML. GPU-accelerated decoding helps.
- Data: A few million pretrain pairs plus ~105k unified post-training quadruplets.
When NOT to Use
- If you must render ultra-high-res (e.g., posters) with extreme photo realism under tight, unusual constraints, and you don't care about cloud cost or latency, a bigger cloud model may be better.
- If your task is text-only and requires heavy long-context reasoning (e.g., 100-page documents), a specialized large LLM might outperform this compact unified model.
- If you need highly specialized domain generation (e.g., medical imaging synthesis) and haven't fine-tuned on that domain, use caution and domain data.
Open Questions (What we still don't know)
- Can we get the benefits of a stronger text encoder without blowing up memory, perhaps via smarter distillation or retrieval-augmented text features?
- How far can compression go (quantization/pruning) before quality drops feel noticeable on phones?
- Can the unified quadruplet idea be extended to video (prompt, video, question, answer) for on-device video understanding/generation?
- What safety and fairness strategies work best on-device (e.g., content filters) without adding heavy compute?
- How to personalize on-device without retraining everything (adapters, LoRA, or small memory modules per user)?
06 Conclusion & Future Work
3-Sentence Summary Mobile-O is a compact, unified model that both understands images and generates them, designed to run fast on mobile devices. Its Mobile Conditioning Projector (MCP) efficiently maps understanding features into the diffusion generator, and a tiny unified post-training with quadruplets aligns both skills so they help each other. The result is state-of-the-art alignment and strong understanding in a 1.6B-parameter system that makes images in about 3 seconds on an iPhone while staying under ~2 GB of memory.
Main Achievement A practical recipe for on-device unified multimodal AI: a lightweight, layerwise-fused, depthwise-refined conditioning bridge (MCP) plus small, smart post-training that jointly boosts understanding and generation.
Future Directions
- Stronger but still tiny text encoders (via distillation or retrieval) to improve subtle prompt following.
- More efficient quantization and scheduling for even faster, lower-power phones.
- Scaling the unified quadruplet idea to image editing and video, and exploring richer supervision (e.g., style tags, spatial hints).
- Personalization on-device with small adapters for user styles and vocabularies.
Why Remember This Mobile-O shows that you don't need giant models and billion-sample datasets to get high-quality multimodal understanding and generation, if you build the right bridge and train the two skills together. It sets a new bar for private, offline, real-time AI that fits in your pocket and still plays well with real-world tasks.
Practical Applications
- Private photo helper: Describe, summarize, and search your own photos entirely on-device.
- On-the-go art studio: Generate illustrations or concepts for stories, games, or class projects without internet.
- Educational companion: Answer questions about diagrams or charts and create matching visuals for study guides.
- Travel assistant: Read foreign signs and generate quick visual notes or maps offline.
- Prototype and product design: Quickly visualize design prompts and edit mockups in the field.
- Accessible descriptions: Provide fast image-to-text support for users with low vision in real environments.
- Marketing and social media: Create on-prompt images and captions in seconds on a phone.
- AR/creative filters: Edit or restyle photos (e.g., color changes, style transfer) with low latency.
- Retail and inventory: Read labels and generate visual tags or product shots on handheld devices.
- Emergency/off-grid use: Document scenes and generate annotated visuals when no network is available.