AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Intermediate
Dongjie Cheng, Ruifeng Yuan, Yongqi Li et al. Ā· 1/25/2026
arXiv Ā· PDF

Key Summary

  • AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
  • It turns everything (words, pictures, and sounds) into one shared alphabet of tokens and learns to predict the next token.
  • A task-aware loss reweighting trick stops long audio or image sequences from overpowering shorter text tasks during training.
  • A tiny token-level perceptual alignment loss helps the model keep images looking coherent even when some tokens are off.
  • A finite-state decoding machine auto-picks safe greedy decoding for precise tasks (like ASR/TTS) and sampling for creative tasks (like image generation).
  • AR-Omni speaks in real time, with 146 ms first-token latency and a 0.88 real-time factor for speech generation.
  • It matches strong token-based baselines in speech (6.5 WER on VCTK TTS and 9.4 WER on LibriSpeech ASR) while staying unified and streaming.
  • Image captioning improves over its base (CIDEr 56.53), while text-to-image quality (CLIP score 0.24) trails diffusion-based systems.
  • The model uses one 7B-parameter Transformer plus lightweight tokenizers/detokenizers, keeping training and inference simple.
  • This work shows that an "any-to-any" multimodal assistant can be built with one decoder and one next-token objective.

Why This Research Matters

AR-Omni shows that we can build a true any-to-any assistant that listens, looks, and talks using one simple engine, which lowers complexity and cost. Real-time speech means conversations feel natural, not laggy, making AI more usable for accessibility, teaching, and customer support. Keeping everything autoregressive simplifies training and maintenance, so improvements in one place can benefit all modalities. The approach reduces the need for heavy add-on decoders, which helps with deployment on limited hardware. Even though its diffusion-free images trail the best diffusion systems, it preserves enough quality for many practical uses while staying fast. This blueprint can inspire future systems that add video and richer styles without giving up the unified design.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: You know how a Swiss Army knife has many tools in one handle, so you don’t have to carry separate scissors, a knife, and a screwdriver? Wouldn’t it be great if AI could also be one tool that talks, sees, and listens—all in one body?

🄬 The Concept (Multimodal Large Language Models, MLLMs): What it is: MLLMs are AI systems that can understand and create more than just text—they can also handle images and speech. How it works: 1) They take in different kinds of inputs (like words, pictures, and sounds), 2) convert them into numbers the computer understands, 3) reason about them together, and 4) produce answers in one or more kinds (like a sentence, a picture, or spoken audio). Why it matters: Without multimodal ability, an AI that only reads and writes text can’t look at a photo you show it or reply in your voice. šŸž Anchor: Asking, ā€œWhat’s in this photo?ā€ and hearing a clear spoken answer back is possible because the AI handles images and speech together.

šŸž Hook: Imagine you’re building with LEGO bricks. It’s easiest when every piece snaps into the same kind of studs. If every piece had a different connector, building anything would be a headache.

🄬 The Concept (Unified Autoregressive Modeling): What it is: A single next-token predictor that treats everything (text, image, speech) as a one-by-one sequence it keeps extending. How it works: 1) Turn each modality into discrete tokens, 2) mix them into one stream, 3) use one Transformer decoder to predict the next token, and 4) repeat until done. Why it matters: Without one simple next-token setup, you’d need separate expert decoders (like diffusion for images), making training and inference complicated and slow. šŸž Anchor: Just like typing letters one after another to make a sentence, AR-Omni types tokens one by one to make a paragraph, a picture, or a spoken reply.
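To make the idea concrete, here is a minimal sketch of that next-token loop in Python (PyTorch). The `model` callable, the prompt ids, and `eos_id` are placeholders rather than AR-Omni's actual interfaces; the point is simply that one decoder keeps extending one token stream, whatever the tokens happen to encode.

```python
import torch

def generate(model, prompt_ids, eos_id, max_new_tokens=256, greedy=True):
    """One decoder, one loop: keep predicting the next token over the joint
    vocabulary until an end marker appears. The emitted ids may encode text,
    image codes, or speech codes; the loop does not care which."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]      # next-token logits, shape (vocab_size,)
        if greedy:
            next_id = int(logits.argmax())              # deterministic choice
        else:
            probs = torch.softmax(logits, dim=-1)
            next_id = int(torch.multinomial(probs, 1))  # sampled choice
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```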

The World Before: LLMs became very good at language but mostly lived in a text-only world. Early MLLMs learned to ā€œseeā€ images and ā€œhearā€ speech to answer questions, yet most could only reply with text. Newer ā€œomniā€ models tried to reply in images and speech, but they often bolted on extra expert parts: diffusion models for pictures and special audio generators for speech. That made systems harder to train together and harder to run fast.

The Problem: Can we build a truly unified system that supports any input to any output (text, image, speech) using one decoder and one training objective—without relying on heavy expert decoders—and still be fast enough for real-time talking?

Failed Attempts: 1) Separate diffusion decoders for images gave high-quality pictures but added big, slow components. 2) Dual-codebook speech (semantic + acoustic) improved control but increased latency and complexity for streaming. 3) Naive next-token training across all modalities caused imbalance (long audio sequences dominated), unstable training, and lower visual fidelity.

The Gap: A single, simple, ā€œpure ARā€ approach that runs everything through one next-token predictor, yet handles practical hurdles: balancing tasks, keeping images crisp, and switching between stable (deterministic) and creative (sampling) decoding.

šŸž Hook: Think of balancing a busy classroom: if one loud student (like long audio) talks all the time, others (short text tasks) won’t get heard.

🄬 The Concept (Modality Imbalance): What it is: When training on mixed tasks, some modalities with more tokens (like speech) can overpower learning. How it works: The model sees more loss signal from longer streams and pays less attention to shorter ones. Why it matters: Without balancing, the model can underperform on text responses or other tasks that matter. šŸž Anchor: If the AI keeps getting long audio homework and short writing assignments, it might become great at listening but forget how to write well.

Real Stakes: A real ā€œomniā€ assistant can watch a science demo video, explain it out loud, draw a quick sketch, and then continue chatting—smoothly and fast. That helps accessibility (spoken replies), creativity (image generation), and productivity (no switching tools). But to be useful every day, it must be simple, stable, and real-time.

02Core Idea

šŸž Hook: Imagine you have one magic pen that can write stories, draw pictures, and play music—without switching tools. You just move the same pen differently, and it does the right thing.

🄬 The Concept (The Aha!): What it is: AR-Omni turns text, images, and speech into one shared alphabet of tokens and uses one autoregressive decoder (one-next-token-at-a-time) to generate any output from any input. How it works: 1) Tokenize text, images, and speech into discrete symbols, 2) blend them into a joint vocabulary, 3) train with a weighted next-token loss so no task dominates, 4) add a tiny perceptual alignment loss to keep generated images coherent, and 5) use a finite-state decoding machine to pick stable greedy decoding for precise tasks and creative sampling for open-ended ones. Why it matters: Without this unification, you need separate heavy expert decoders, which complicates training and slows down real-time use. šŸž Anchor: Like reading one line at a time and deciding the next letter to put down, AR-Omni writes the next token whether it’s part of a sentence, a picture, or a sound.

Multiple Analogies:

  1. One conveyor belt: All ingredients (words, pixels, sounds) go onto the same belt as tokens. The chef (decoder) adds the next piece each time, following the recipe (context) until the dish (final output) is done.
  2. One multilingual alphabet: Instead of switching languages with new letters each time, we merge alphabets into one super-alphabet. The writer picks the next character based on what came before.
  3. One orchestra conductor: The same conductor leads different sections. For strict parts (ASR/TTS), everyone follows the sheet exactly (greedy). For creative solos (T2I), there’s room to improvise (sampling).

Before vs After:

  • Before: Omni models often needed extra diffusion decoders for images or special audio generators. Training was less unified; streaming was harder and slower.
  • After: AR-Omni keeps everything in an autoregressive pipeline: one token stream, one objective, one decoder—plus small fixes for balance and fidelity—so it can understand and generate across modalities and speak in real time.

Why It Works (intuition, no equations):

  • Discretization makes every modality behave like text, so ā€œpredict the next tokenā€ scales to images and speech.
  • Reweighting the loss lets short, important targets (like answer text) speak loudly enough during training.
  • Perceptual alignment nudges the model toward visually similar codes even when it can’t guess the exact one, keeping pictures coherent.
  • A decoding state machine matches the right decoding style to the right task, balancing stability and creativity.

Building Blocks (each with a mini-sandwich):

šŸž Hook: You know how we learn a shared sign language so everyone can communicate? 🄬 The Concept (Joint Vocabulary): What it is: One big set of tokens that covers text, image, and speech codes. How it works: Merge sub-vocabularies; mark modality boundaries with special tokens; process everything as one sequence. Why it matters: Without one joint set, the model can’t flow smoothly from one modality to another. šŸž Anchor: The model can read ā€œtext → <boi> image tokens <eoi> → <boa> audio tokens <eoa>ā€ all in one go.

šŸž Hook: Think of turning a song, a picture, and a poem into LEGO bricks so they all snap together. 🄬 The Concept (Discrete Tokenizers): What it is: Converters that turn continuous signals into discrete tokens. How it works: BPE for text; VQ for images; a single-codebook acoustic tokenizer for speech. Why it matters: Without discretizing, the next-token game can’t be played across modalities. šŸž Anchor: A photo becomes a short list of codebook IDs the model can generate one by one.

šŸž Hook: When you draw, you sometimes miss a line but still keep the shape right. 🄬 The Concept (Token-level Perceptual Alignment Loss): What it is: A gentle nudge that aligns the model’s hidden states with embeddings of the correct image tokens. How it works: Compare the model’s internal representation to a frozen embedding for the target visual code; encourage closeness. Why it matters: Without it, missing the exact code can lead to blocky or incoherent images. šŸž Anchor: Even if one ā€œleafā€ token is off, the tree still looks like a tree.

šŸž Hook: In class, if math is falling behind, the teacher gives a bit more math practice that week. 🄬 The Concept (Task-aware Loss Reweighting): What it is: Adjusting training weights so important response tokens (like text answers) don’t get drowned out by long sequences. How it works: Heavier weights on response tails for X→T tasks like ASR or captioning. Why it matters: Without it, speech tokens (many per second) could hog the learning. šŸž Anchor: The model keeps its writing sharp while also learning to listen and draw.

šŸž Hook: Sometimes you want a precise answer; other times you want a creative sketch. 🄬 The Concept (Finite-State Decoding Machine): What it is: A small controller that switches decoding styles by task. How it works: Greedy for stable tasks (ASR/TTS); sampling for creative ones (T2I/open-ended text). Why it matters: Without it, one decoding style fits poorly across very different tasks. šŸž Anchor: Read numbers exactly for a recipe (greedy), but brainstorm cake designs with flair (sampling).

03Methodology

High-Level Recipe: Input → Tokenize into one stream → Unified AR decoding with task-aware loss → Optional perceptual alignment on image tokens → Finite-state decoding to produce output → Detokenize to text/image/audio.

Step-by-Step (with ā€œwhyā€ and examples):

  1. Tokenize each modality
  • What happens: Text is split by a BPE tokenizer; images are turned into discrete visual codes by a VQ-style tokenizer; speech is compressed into a single-codebook acoustic token stream (low tokens per second) suitable for streaming.
  • Why this exists: The model needs everything in the same ā€œtoken worldā€ to do next-token prediction across modalities.
  • Example: ā€œDescribe this imageā€ + a photo become: [USER text tokens, <boi>, image tokens, <eoi>, <eoh>].
  2. Build a joint, interleaved stream with markers
  • What happens: Merge modality tokens into one sequence with boundary tokens: <boa>/<eoa> for audio, <boi>/<eoi> for image, <eoh> for end-of-user input, <eom>/<eos> for assistant turn/end. Text bridges meaning between parts.
  • Why this exists: The decoder must know where each modality starts/ends and which tokens should be produced next.
  • Example: For TTS: <bos> USER: Convert to speech: ā€œHello!ā€ <eoh> ASSISTANT: <boa> audio tokens <eoa> <eos>.

šŸž Hook: Like using traffic lights to keep different vehicles organized on one road. 🄬 The Concept (Interleaved Modeling with Special Tokens): What it is: Mixing modalities in one timeline, marked by clear start/end tokens. How it works: Place boundary tokens around non-text segments and an end-of-human marker; decode causally. Why it matters: Without clear markers, the model could confuse when to read or speak or draw. šŸž Anchor: It’s like ā€œText… now picture starts… picture ends… now audio starts… audio ends.ā€

  3. Unified autoregressive decoding (one Transformer, one objective)
  • What happens: A 7B-parameter Transformer predicts the next token p(x_t | x_<t) over the joint vocabulary. The same decoder handles T→T, I→T, S→T, T→I, and T→S.
  • Why this exists: Simplicity, scalability, and shared learning across modalities.
  • Example: For ASR, after reading <boa> audio tokens, the model emits text tokens of the transcript.
  4. Weighted next-token loss to fix imbalance
  • What happens: During training, apply higher weights to response text tokens for X→T tasks so shorter answers aren’t overshadowed by long inputs (like speech).
  • Why this exists: Prevents long modalities from dominating learning.
  • Example: In image captioning (I→T), the caption tokens get higher weight so the model learns to write better descriptions.
  5. Token-level perceptual alignment for image fidelity
  • What happens: For steps predicting image tokens, add a small auxiliary loss that pulls the model’s hidden state toward a frozen embedding of the correct visual code.
  • Why this exists: Cross-entropy treats all wrong codes equally; this loss respects visual similarity, keeping shapes and structures coherent.
  • Example: If the exact brick code is missed, the model still picks a visually similar brick, so the house looks right.
  6. Stable training with residual-post-norm (swin-norm)
  • What happens: Normalize the residual branches inside each Transformer block to keep training stable on long, interleaved sequences.
  • Why this exists: Mixed-modality, long-context training can become unstable or collapse; this normalization smooths optimization.
  • Example: Loss curves stay smooth instead of spiking late in training.

šŸž Hook: Like keeping your balance on a long hike by adjusting how you step. 🄬 The Concept (Residual-Post-Norm / Swin-Norm): What it is: A way of placing normalization after residual additions in Transformer blocks. How it works: Each block does attention, adds back a residual, then normalizes; then feed-forward, add residual, normalize. Why it matters: Without it, optimization can wobble or spike on long sequences. šŸž Anchor: It’s like steadying each step so you don’t trip when the path (sequence) gets long.

  7. Finite-state decoding machine for task-aware outputs
  • What happens: At inference, a tiny controller chooses decoding style: greedy for deterministic tasks (ASR/TTS), sampling for creative ones (image generation, open-ended text). It can switch states when the modality changes.
  • Why this exists: One decoding style is not best for all tasks; matching style to task improves quality and user experience.
  • Example: The model writes exact transcripts but samples diverse images for the same caption.
  8. Streaming speech with low-latency tokens
  • What happens: Because the speech tokenizer uses a single acoustic codebook at a low token rate, the system can decode audio as soon as a small chunk of tokens is ready, enabling early playback.
  • Why this exists: Dual-codebook methods often need pairs of aligned tokens before decoding, which increases first-token delay.
  • Example: AR-Omni starts speaking after about 146 ms and keeps up with real time (RTF ā‰ˆ 0.88).

šŸž Hook: Like starting a song as soon as the first notes are ready, not waiting for the whole track to download. 🄬 The Concept (First-Token Latency and Real-Time Factor): What it is: FTL is time until the first playable audio token; RTF is total generation time divided by audio duration. How it works: Lower FTL feels snappy; RTF < 1 means faster-than-real-time. Why it matters: Without low FTL/RTF, voice assistants feel laggy. šŸž Anchor: If your friend replies after a fraction of a second and then keeps pace, it feels like a real conversation.

Data and Training Flow:

  • Pretraining mix balances text-only : text-image : text-speech at 0.5 : 1 : 2 until any bucket empties, keeping a controlled composition.
  • Finetuning on interleaved instruction data (AnyInstruct-style), with noise-augmented speech and speech-centric dialogues to match real use.
  • Loss during finetuning applies only to assistant response tokens to sharpen instruction following.
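As an illustration of the 0.5 : 1 : 2 mixing rule above, here is one way such ratio-controlled sampling could be written. The bucket contents and stopping rule are simplified assumptions, not the paper's data pipeline.

```python
import random

def mix_pretraining_stream(text_only, text_image, text_speech,
                           ratios=(0.5, 1.0, 2.0), seed=0):
    """Draw examples from three buckets in a fixed ratio and stop as soon as
    any bucket empties, keeping the overall composition controlled."""
    rng = random.Random(seed)
    buckets = [list(text_only), list(text_image), list(text_speech)]
    mixed = []
    while all(buckets):                                  # stop when any bucket runs out
        idx = rng.choices(range(3), weights=ratios, k=1)[0]
        mixed.append(buckets[idx].pop())
    return mixed
```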

Secret Sauce:

  • One clean AR pipeline + three light, practical fixes (loss reweighting, perceptual nudge, decoding FSM) = any-to-any generation that stays real-time, stable, and decoder-free.

04Experiments & Results

šŸž Hook: Imagine a school decathlon where one student must read aloud, paint, and sing on the same day—and finish on time.

🄬 The Concept (The Test Plan): What it is: AR-Omni is tested on image understanding (I→T), image generation (T→I), speech recognition (S→T), and text-to-speech (T→S), plus latency and throughput. How it works: Use standard datasets (MS-COCO, LibriSpeech, VCTK) and standard metrics (CIDEr, CLIP score, WER, FTL, RTF). Why it matters: Without broad tests, we can’t know if one model truly handles any-to-any well and fast. šŸž Anchor: Like checking reading scores, art critique, and singing tempo to judge overall talent.

Competitors and Setup:

  • Baselines include diffusion-free AR models such as Chameleon/Anole (text↔image) and token-based any-to-any models (AnyGPT, MiO), which often rely on diffusion decoders for images or dual-codebook speech.
  • AR-Omni uses one 7B Transformer and a lightweight VQGAN detokenizer for images (∼41M params), no diffusion UNet (∼900M) needed.

Metrics (mini-sandwich each):

šŸž Hook: Like grading how well a caption matches a set of teacher-written captions. 🄬 The Concept (CIDEr for Captioning): What it is: A score that measures similarity between a generated caption and many references, focusing on important n-grams. How it works: TF-IDF weights plus cosine similarity over n-gram features. Why it matters: Without a focused comparison, generic captions could fool simpler metrics. šŸž Anchor: AR-Omni’s 56.53 CIDEr shows solid alignment with the reference descriptions.

šŸž Hook: Like checking if a picture and its title fit well together. 🄬 The Concept (CLIP Score for T2I): What it is: Similarity between the CLIP embedding of a generated image and the caption text. How it works: Encode both with CLIP ViT-L; compute cosine similarity; average over samples. Why it matters: Without it, we can’t gauge if the image matches the prompt meaningfully. šŸž Anchor: AR-Omni’s 0.24 CLIP score is decent for diffusion-free AR but below diffusion-based systems.

šŸž Hook: Like counting how many words you misheard when writing down a dictation. 🄬 The Concept (WER for ASR/TTS): What it is: Word Error Rate = (substitutions + deletions + insertions) / reference words. How it works: Compare generated transcript with ground truth using alignment. Why it matters: Without WER, we can’t objectively judge speech intelligibility. šŸž Anchor: AR-Omni’s 9.4 WER (LibriSpeech ASR) and 6.5 WER (VCTK TTS) show good recognition and intelligible synthesis.

Scoreboard with Context:

  • Image Captioning (I→T, MS-COCO, zero-shot): AR-Omni CIDEr 56.53. Within diffusion-free AR setups, it improves over its base (Anole), indicating any-to-any training did not harm and may help captioning.
  • Text-to-Image (T→I, MS-COCO): CLIP score 0.24 using a 41M detokenizer (no diffusion). Diffusion-based systems score higher but require ~900M-parameter UNets. AR-Omni keeps a single, simple pipeline at some quality cost.
  • ASR (S→T, LibriSpeech test-clean): 9.4 WER with a single-codebook speech tokenizer that produces far fewer tokens/sec than dual-codebook baselines, helping efficiency. Accuracy is comparable among any-to-any models.
  • TTS (T→S, VCTK, zero-shot): 6.5 WER, 146 ms first-token latency, and 0.88 RTF—faster than real time. This is like getting your first musical note almost instantly and then playing at or above concert speed.

Surprising/Notable Findings:

  • Diffusion-free AR-Omni maintains strong captioning and competitive speech quality while enabling streaming—rare among any-to-any models.
  • Single-codebook speech tokens reduce latency and complexity versus dual-codebook methods yet remain intelligible.
  • Any-to-any training only slightly lowers T2I CLIP compared to its AR base, suggesting unified training preserves most image generation ability without diffusion.

Ablations (why the pieces matter):

  • Removing perceptual loss slightly hurts I2T/T2I and TTS while modestly helping ASR, implying it mainly benefits generation fidelity (vision and speech textures).
  • Removing swin-norm hurts TTS notably, showing its role in stable speech generation quality.
  • Training with a simple, unweighted next-token loss leads to higher ASR/TTS errors and drops T2I, plus unstable late-stage loss spikes—confirming the need for task-aware reweighting and stabilization.

05Discussion & Limitations

Limitations:

  • Diffusion-free AR image generation lags behind diffusion-based systems in fine detail and overall CLIP alignment. If photo-realism is the top priority, diffusion remains stronger.
  • Joint training across long, mixed sequences is sensitive; although stabilized here, scaling to even longer videos or higher-res images may need additional tricks.
  • Single-codebook speech is simple and streamable, but very fine prosody/style control might be less flexible than richer multi-codebook designs.

Required Resources:

  • A single 7B-parameter Transformer with modality tokenizers/detokenizers (e.g., VQGAN ∼41M for images), trained on multiple public datasets (LAION, GigaSpeech, etc.).
  • Modern GPUs (e.g., A100s) and standard training recipes (Adam, LR warmup, gradient clipping). Storage and bandwidth for large-scale mixed-modality corpora.

When NOT to Use:

  • If you need state-of-the-art text-to-image photorealism above all else, diffusion-based image generators may be better.
  • If your speech use case requires extremely precise speaker cloning or nuanced prosody control beyond intelligibility and speed, specialized TTS might be preferable.
  • If ultra-low-latency on tiny devices with no GPU is required, a lighter monomodal model could be more practical.

Open Questions:

  • Can diffusion-free AR image quality be boosted (e.g., better visual tokenizers, hierarchical decoding, hybrid losses) without sacrificing unification and speed?
  • How well does the unified AR approach extend to video tokens while staying real-time?
  • Can we add richer speaker/style control to single-codebook speech without increasing latency?
  • How far can task-aware reweighting be automated (e.g., learned schedules) to adapt across new domains without manual tuning?
  • What are the best safety and alignment practices when generating across modalities in one stream (e.g., watermarking audio/images, robust content filters)?

06Conclusion & Future Work

Three-Sentence Summary: AR-Omni shows that one autoregressive decoder with one next-token objective and one joint vocabulary can support any-to-any generation—text, images, and speech—without external expert decoders. It stays practical by rebalancing training with task-aware loss weights, sharpening visuals with a tiny perceptual alignment loss, and choosing the right decoding style per task via a finite-state machine, enabling real-time streaming speech. Across benchmarks, it delivers competitive multimodal quality while remaining simple and fast.

Main Achievement: Proving that a unified, diffusion-free, single-AR-decoder architecture can cover tri-modal input/output with real-time speech, narrowing the gap to complex systems that rely on multiple expert components.

Future Directions: Improve diffusion-free AR image quality through better visual tokenizers and decoding schedules; extend to video while keeping latency low; enrich controllability for speech style; and explore learned weighting schemes that self-balance tasks. Safety, watermarking, and reliability across modalities are also key for deployment.

Why Remember This: AR-Omni is a clear step toward a true ā€œomniā€ assistant that listens, looks, and talks—using one simple engine. It trades extra parts for clever training and decoding choices, showing that simplicity can scale across modalities without giving up speed. That blueprint can guide future multimodal systems toward unified, real-time intelligence.

Practical Applications

  • Voice-enabled tutors that can describe a diagram, answer questions out loud, and sketch a quick visual example.
  • Customer support agents that listen to a caller, summarize an attached photo of a device, and reply with clear spoken steps.
  • Accessibility aids that read text in images aloud and generate simple illustrative pictures for explanations.
  • Creative tools that turn spoken prompts into draft images while also narrating design options in real time.
  • Smart home assistants that understand voice, identify scenes from camera snapshots, and respond in natural speech.
  • Language learning apps that show pictures for vocabulary, pronounce words instantly, and caption users’ photos.
  • Meeting helpers that transcribe speech, generate visual summaries, and provide audio recaps right away.
  • Robotics interfaces that process camera input, follow spoken commands, and provide verbal feedback during tasks.
  • Interactive kiosks that can converse with users, interpret a displayed image, and guide them with voice.
  • Telehealth screeners that listen to patient speech, review an uploaded image (like a rash photo), and respond with clear spoken guidance (non-diagnostic).
Tags: autoregressive modeling Ā· multimodal large language model Ā· any-to-any generation Ā· speech tokenization Ā· VQ tokenizer Ā· perceptual alignment loss Ā· loss reweighting Ā· finite-state decoding Ā· streaming TTS Ā· ASR Ā· image captioning Ā· text-to-image Ā· real-time factor Ā· first-token latency Ā· unified decoder