FlowAct-R1: Towards Interactive Humanoid Video Generation
Key Summary
- FlowAct-R1 is a new system that makes lifelike human videos in real time, so the on-screen person can react quickly as you talk to them.
- It keeps the video smooth and consistent over long conversations by generating it in small chunks and using special memories to avoid drift.
- A Multimodal Diffusion Transformer (MMDiT) blends text, audio, and visuals so lips, expressions, and body gestures match what’s being said.
- A chunkwise diffusion forcing strategy plus a self-forcing variant trains the model to handle the exact errors it will see during streaming.
- The system is distilled down to only 3 denoising steps (3 NFEs), reaching 25 fps at 480p with about 1.5 seconds to the first frame.
- A structured memory bank (reference, long-term, short-term, denoising stream) keeps identity, motion, and transitions stable and natural.
- An MLLM planner proposes next actions so the avatar smoothly switches between speaking, listening, thinking, and idling.
- Compared to leading methods, FlowAct-R1 is both real time and vivid, avoiding repetitive motions while generalizing to many character styles.
- Operator-level speedups, quantization, and parallelism make it practical to deploy on modern GPUs without sacrificing quality.
Why This Research Matters
Interactive avatars that react instantly and move naturally can make online classes, telehealth, and customer support feel more human and effective. Real-time, full-body expressiveness helps sign-language guidance, emotion-sensitive tutoring, and public speaking practice. Because FlowAct-R1 keeps long videos stable, it can host live streams or webinars without breaking immersion. Low latency matters in conversations—fast reactions build trust and engagement. Efficient distillation and system optimizations mean this can be deployed practically, not just as a lab demo. With responsible safeguards, such avatars can improve accessibility, companionship, and global communication. The research also advances the broader field of streaming generative AI, showing how to bridge training and real-world use.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine FaceTime with a super actor who can smile, nod, and gesture naturally while talking to you—without awkward pauses or jerky motions.
🥬 Filling (The Actual Concept – Video Generation):
- What it is: Video generation is teaching computers to create moving pictures that look real.
- How it works: 1) Compress real video into smaller pieces called latents; 2) Learn patterns of motion and appearance; 3) Generate new frames step by step; 4) Decode back into full video.
- Why it matters: Without strong video generation, avatars look stiff, off-sync, or break after a few seconds.
🍞 Bottom Bread (Anchor): Think of a digital newscaster who reads headlines while blinking, nodding, and moving hands naturally—video generation powers all of that.
The World Before: For years, AI was good at making short, pretty video clips or at syncing just the mouth to speech. Lip-sync systems nailed the “mouth” part but not the “whole person”—they missed eyebrows, head tilts, and body language. Diffusion models produced beautiful motion but were too slow to react live. Even when some systems streamed in real time, they often repeated motions or felt robotic, and many only handled cropped faces, not full bodies.
The Problem: Real-time interactive humanoid video is like juggling three balls at once: speed (low delay), quality (high fidelity), and endurance (no falling apart over long chats). When you talk with an avatar for minutes, tiny errors can add up. The person’s face might slowly morph, the body might drift, and motions can loop or repeat. Plus, the avatar needs to switch states—listening quietly, speaking with emphasis, thinking, or idling—without looking fake.
🍞 Top Bread (Hook): You know how a dancer looks smooth if every move connects, but clumsy if one step slips?
🥬 Filling (Temporal Consistency):
- What it is: Temporal consistency means each frame flows naturally into the next over time.
- How it works: 1) Keep identity stable (same face, hair, clothes); 2) Smooth motion (no sudden jumps); 3) Maintain scene layout; 4) Use memory to remind the model what just happened.
- Why it matters: Without it, long videos drift—faces warp, hands teleport, or motions repeat.
🍞 Bottom Bread (Anchor): Like a cartoon that stays on-model from scene to scene, not a flipbook gone wrong.
Failed Attempts: Prior systems either generated short, high-quality clips (too slow for live use) or streamed quickly but with robotic motions and repetition. Some only focused on faces, losing the expressiveness of full-body gestures. Others didn’t bridge the gap between how they were trained (clean, fixed-length clips) and how they were used (messy, never-ending streams), so errors snowballed.
The Gap: What was missing was a way to: 1) Generate in small live chunks while remembering the past; 2) Train the model to expect the messy, imperfect history of a real stream; and 3) Make everything efficient enough to keep 25 frames every second with barely any wait for the first frame.
Real Stakes: This matters for live streaming, tutoring, customer support, video conferencing, and accessibility. A lifelike, responsive avatar can teach, translate, comfort, or perform—instantly. If it lags or drifts, people disengage. If it stays smooth and expressive, it feels human and helpful.
02 Core Idea
🍞 Top Bread (Hook): Imagine building a long train one car at a time while it’s already moving, and making sure each new car matches the style and speed of the train so the ride stays smooth.
🥬 Filling (The Aha! Moment):
- What it is: FlowAct-R1 streams lifelike humanoid video by generating it in small chunks, training on the same kinds of streaming errors it will see, and speeding the denoising to just three quick steps.
- How it works: 1) Use a Multimodal Diffusion Transformer (MMDiT) to fuse audio, text, and visuals; 2) Generate video chunk by chunk (chunkwise diffusion forcing); 3) Train with self-forcing so the model practices with its own imperfect memories; 4) Keep a smart memory bank to stabilize identity and motion; 5) Distill the sampler to 3 steps for real-time speed; 6) Use an MLLM to plan natural next actions.
- Why it matters: Without chunkwise generation plus self-forcing and memory, videos drift or repeat; without distillation, it’s too slow.
🍞 Bottom Bread (Anchor): Like a live TV puppeteer who listens to the script (text), hears the actor (audio), remembers the last pose (memory), and moves the puppet smoothly—fast enough for a broadcast.
Multiple Analogies:
- Lego City: Build the city block by block (chunks), keep a map of what’s already built (memory), and check the storybook for the next scene (MLLM actions).
- Orchestra: Conductor (MMDiT) mixes instruments (audio, text, visuals) so the music (video) stays in rhythm (temporal consistency) at concert tempo (real time).
- Cooking Show: Prep in small batches (chunks), taste and adjust (self-forcing), follow a recipe card (action plan), and serve each plate hot (low latency).
🍞 Top Bread (Hook): You know how blueprints help builders coordinate electricians, plumbers, and carpenters?
🥬 Filling (MMDiT Architecture):
- What it is: A Multimodal Diffusion Transformer that aligns text, audio, and video latents through attention.
- How it works: 1) Encode text/audio/video into tokens; 2) Use cross-attention to share cues; 3) Denoise tokens into clean frames.
- Why it matters: Without it, lips, expressions, and gestures wouldn’t match speech and meaning.
🍞 Bottom Bread (Anchor): When you say “I’m excited!”, MMDiT helps the avatar smile, widen eyes, and add upbeat gestures together.
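To make the fusion idea concrete, here is a minimal PyTorch sketch of a cross-attention block where video tokens attend to audio and text tokens. The dimensions, module layout, and names are illustrative assumptions, not FlowAct-R1’s actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Video tokens (queries) attend to concatenated audio + text tokens (keys/values)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens, text_tokens):
        context = torch.cat([audio_tokens, text_tokens], dim=1)       # (B, Ta + Tt, D)
        attended, _ = self.cross_attn(self.norm1(video_tokens), context, context)
        x = video_tokens + attended        # video tokens absorb the speech/meaning cues
        return x + self.ff(self.norm2(x))  # local mixing after fusion

# Toy usage: one short chunk of video latents, matching audio frames, a brief prompt.
block = CrossModalBlock()
v, a, t = torch.randn(1, 13, 512), torch.randn(1, 13, 512), torch.randn(1, 8, 512)
print(block(v, a, t).shape)                # torch.Size([1, 13, 512])
```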
🍞 Top Bread (Hook): Imagine writing a story page by page while keeping characters and plot consistent.
🥬 Filling (Chunkwise Diffusion Forcing):
- What it is: Generate video in short chunks, each trained to continue smoothly from the last.
- How it works: 1) Split video into small segments; 2) Condition each new chunk on recent memory; 3) Denoise quickly; 4) Append and repeat.
- Why it matters: Without chunks, streaming would be slow or fall apart on long videos.
🍞 Bottom Bread (Anchor): Like releasing new comic pages weekly that fit perfectly with the previous issue.
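A rough sketch of that chunk-by-chunk loop, with a placeholder denoiser standing in for the real model; the chunk size, latent dimension, and memory window are assumptions.

```python
import torch

def denoise_chunk(noisy, memory, cond, steps=3):
    """Stand-in for the few-step denoiser; the real model refines `noisy`
    conditioned on recent memory and audio/text features."""
    for _ in range(steps):
        noisy = 0.9 * noisy                               # placeholder update
    return noisy

def stream_chunks(num_chunks, chunk_latents=13, dim=512, cond=None):
    memory = []                                           # recent clean chunks
    for _ in range(num_chunks):
        noisy = torch.randn(1, chunk_latents, dim)        # each chunk starts from noise
        clean = denoise_chunk(noisy, memory[-3:], cond)   # condition on recent memory
        memory.append(clean)                              # append and repeat
        yield clean                                       # decoded and streamed downstream

for chunk in stream_chunks(4):
    print(chunk.shape)                                    # torch.Size([1, 13, 512])
```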
🍞 Top Bread (Hook): Practicing with the same wind and bumps you’ll face on race day makes you better.
🥬 Filling (Self-Forcing):
- What it is: Train using the model’s own generated memories so it learns to correct real streaming errors.
- How it works: 1) Create “generated GT” latents by noising/denoising real frames; 2) Sample these in training; 3) Learn to recover when history is imperfect.
- Why it matters: Without self-forcing, tiny errors pile up until motion drifts or identity warps.
🍞 Bottom Bread (Anchor): Like rehearsing a speech with background noise so you won’t freeze on stage.
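A minimal sketch of the self-forcing idea: sometimes swap the clean history for a “generated GT” version made by noising and re-denoising it. The probability, noise level, and denoiser call are assumptions.

```python
import random
import torch

def make_generated_gt(gt_latents, denoiser, noise_level=0.4):
    """Produce an imperfect 'generated GT' memory by noising and re-denoising
    the real latents, mimicking what the model will see during streaming."""
    noisy = gt_latents + noise_level * torch.randn_like(gt_latents)
    with torch.no_grad():
        return denoiser(noisy)

def sample_history(gt_history, denoiser, p_self_forcing=0.5):
    """With some probability, train on the imperfect history instead of the clean one."""
    if random.random() < p_self_forcing:
        return make_generated_gt(gt_history, denoiser)
    return gt_history

# Toy usage with a stand-in denoiser that only partially cleans its input.
fake_denoiser = lambda x: 0.9 * x
history = sample_history(torch.randn(1, 13, 512), fake_denoiser)
```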
🍞 Top Bread (Hook): You know how a diary helps you remember the big moments and what just happened yesterday?
🥬 Filling (Memory Bank – Reference/Long-term/Short-term/Denoising Stream):
- What it is: A structured memory that anchors identity, keeps long-range actions, smooths recent motion, and holds the next chunk being generated.
- How it works: 1) Reference latent locks identity; 2) Long-term queue stores fully denoised past chunks; 3) Short-term latent keeps immediate continuity; 4) Denoising stream updates the next frames.
- Why it matters: Without these, faces drift, motions jerk, and transitions feel fake.
🍞 Bottom Bread (Anchor): It’s like a scrapbook (long-term), a sticky note (short-term), a portrait (reference), and your current draft (denoising stream).
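Here is one way the four-part memory could be organized in code; the field names, queue length, and tensor shapes are assumptions based on the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional
import torch

@dataclass
class MemoryBank:
    reference: torch.Tensor                                            # portrait: identity anchor
    long_term: deque = field(default_factory=lambda: deque(maxlen=3))  # scrapbook: past clean chunks
    short_term: Optional[torch.Tensor] = None                          # sticky note: last finished chunk
    denoising_stream: Optional[torch.Tensor] = None                    # current draft being refined

    def commit(self, clean_chunk: torch.Tensor) -> None:
        """Move a finished chunk into memory once it is fully denoised."""
        self.long_term.append(clean_chunk)
        self.short_term = clean_chunk

bank = MemoryBank(reference=torch.randn(1, 13, 512))
bank.commit(torch.randn(1, 13, 512))
print(len(bank.long_term), bank.short_term.shape)                      # 1 torch.Size([1, 13, 512])
```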
Before vs After:
- Before: Either pretty but short, or fast but repetitive/face-only.
- After: FlowAct-R1 is fast, full-body, expressive, and steady for as long as you talk.
Why It Works (intuition): Keep chunks short to react quickly, keep memories to stay consistent, teach the model on the same bumpy data it will face live, and shrink the denoising to just a few smart steps so speed meets quality.
Building Blocks: MMDiT fusion; chunkwise diffusion forcing; self-forcing; structured memory bank; few-step distillation; system optimizations; MLLM action planning.
03 Methodology
High-level Recipe: Input (audio + text + single reference image) → Encode into tokens → Fuse with MMDiT → Chunkwise denoise with memory bank → Decode frames → Stream video out continuously.
🍞 Top Bread (Hook): Picture packing a suitcase: you roll clothes (compress), keep essentials handy (memory), and add outfits day by day (chunks) without wrinkling the earlier ones.
🥬 Filling (Compression into Latents – VAE/tokens):
- What it is: Turn big frames into smaller, learnable numbers called latents/tokens.
- How it works: 1) A VAE compresses each frame; 2) Text becomes semantic tokens; 3) Audio becomes acoustic tokens via Whisper features; 4) All align at 25 tokens per second.
- Why it matters: Without compression, real-time processing would be too slow.
🍞 Bottom Bread (Anchor): Like zipping a giant photo into a small file so you can send it quickly.
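A toy frame autoencoder to show what “compressing frames into latents” looks like in code; the real system uses a trained video VAE, and the channel counts, strides, and resolution here are purely illustrative.

```python
import torch
import torch.nn as nn

class TinyFrameVAE(nn.Module):
    """Toy autoencoder: 8x spatial downsampling per side, so each frame becomes
    a much smaller latent grid before the transformer ever sees it."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        z = self.encoder(frames)        # compress: far fewer numbers per frame
        return self.decoder(z), z       # decode back to pixels for streaming out

vae = TinyFrameVAE()
frames = torch.randn(2, 3, 480, 832)    # two RGB frames; the resolution is an assumption
recon, z = vae(frames)
print(z.shape, recon.shape)             # (2, 8, 60, 104) and (2, 3, 480, 832)
```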
🍞 Top Bread (Hook): Imagine a team huddle where everyone shares quick updates to stay in sync.
🥬 Filling (Multimodal Fusion – cross-attention):
- What it is: Mix audio, text, and visual tokens so lips, expressions, and gestures match.
- How it works: 1) Cross-attention lets video tokens “listen” to audio/text; 2) Windowed spatial and shot-based temporal attention keep compute low; 3) Fake-causal masking stops future frames from leaking back.
- Why it matters: Without good fusion and masking, motions would misalign and stability would drop.
🍞 Bottom Bread (Anchor): The avatar hears the word “surprised” and instantly adds raised brows and a small gasp.
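A small sketch of a chunk-level “fake-causal” mask that lets tokens see their own chunk and earlier chunks but never future ones; the chunk sizes and the True-means-blocked convention are assumptions.

```python
import torch

def fake_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Boolean mask where True means 'do not attend': tokens may look at their own
    chunk and earlier chunks, but never at chunks that come later in time."""
    n = num_chunks * tokens_per_chunk
    chunk_id = torch.arange(n) // tokens_per_chunk
    return chunk_id.unsqueeze(1) < chunk_id.unsqueeze(0)   # [i, j]: j is in a future chunk

mask = fake_causal_mask(num_chunks=3, tokens_per_chunk=4)
print(mask.int())
# Each block-row k allows attention to chunks 0..k (zeros) and blocks chunks > k (ones),
# so information from future frames cannot leak back into the chunk being generated.
```

A mask built this way can be handed to standard attention layers (for example as `attn_mask` in `torch.nn.MultiheadAttention`, where True marks positions to block).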
Detailed Steps per 0.5s Chunk (25 fps at 480p):
- Inputs update: Latest 0.5 s of audio (16 kHz waveform encoded into features aligned at 25 per second), the current/next text prompt, and the fixed reference image.
- Memory prep: Reference latent; long-term queue (up to 3 fully denoised past chunks); short-term latent (last finished chunk); denoising stream (3×3 latents, one micro-step each in parallel).
- Denoising (3 NFEs): The MMDiT refines noisy latents for the next 0.5s until they’re clean.
- Memory refinement cycles: Periodically re-noise and fix short-term memory using reference + long-term as anchors to remove artifacts.
- Decode and stream: VAE decodes latents into frames; a pipeline overlaps denoising and decoding to avoid idle time.
- Action planning: An MLLM reads recent audio/text and the reference to suggest next micro-actions (e.g., pause, nod, gesture right hand), guiding smoother transitions.
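Putting the steps above together, here is a rough sketch of the per-chunk streaming loop. The function names (denoiser, vae_decode, plan_action), the refinement interval, and the memory-bank interface (matching the earlier MemoryBank sketch) are all assumptions, not the released API.

```python
import torch

def refine_short_term(denoiser, bank, noise_level=0.3):
    """Re-noise the short-term memory and denoise it again against the reference and
    long-term anchors held in the bank, to scrub small accumulated artifacts."""
    noisy = bank.short_term + noise_level * torch.randn_like(bank.short_term)
    return denoiser(noisy, None, None, bank)

def generate_stream(denoiser, vae_decode, plan_action, get_audio_chunk, bank,
                    num_chunks, nfe=3, refine_every=4):
    for i in range(num_chunks):
        audio = get_audio_chunk(i)                       # latest 0.5 s of audio features
        action = plan_action(audio, bank)                # MLLM-suggested micro-action
        latents = torch.randn(1, 13, 512)                # noisy latents for the next 0.5 s
        for _ in range(nfe):                             # 3 denoising steps (3 NFEs)
            latents = denoiser(latents, audio, action, bank)
        bank.commit(latents)                             # update long-/short-term memory
        if (i + 1) % refine_every == 0:                  # periodic memory refinement
            bank.short_term = refine_short_term(denoiser, bank)
        yield vae_decode(latents)                        # frames streamed out to the client
```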
Why each step exists and what breaks without it:
- Tokenization: Without it, compute explodes; latency spikes.
- Cross-attention fusion: Without it, lips and gestures don’t match words or tone.
- Chunkwise generation: Without it, you can’t stream in real time.
- Structured memory: Without it, identity drifts and motion stutters.
- Memory refinement: Without it, small artifacts accumulate into obvious glitches.
- Distilled 3-step denoising: Without it, you can’t hit 25 fps.
- Async pipeline + operator fusions + quantization: Without them, GPUs sit idle or bottlenecks stall video.
Concrete Mini-Example:
- Input: Say the sentence, “Welcome back! Today we’ll unbox a robot.” The text prompt says “cheerful intro, small right-hand wave.”
- Audio tokens peak on “Welcome back!”
- Fusion: Cross-attention aligns peaks with a wide smile and eyebrow lift.
- Chunk 1: Lip-sync on “Welcome back!”, shoulder squares, tiny wave starts.
- Memory stores chunk 1’s clean latent.
- Chunk 2: Continues “Today we’ll…”, smooths the wave to a stop, head nods.
- Refinement: Short-term memory is cleaned to erase tiny jitter.
- Video streams at 25 fps with ~1.5s initial wait.
🍞 Top Bread (Hook): Training for a marathon? You start with short runs, learn pacing, and practice on the same route you’ll race.
🥬 Filling (Training Curriculum + Self-Forcing):
- What it is: Three stages—autoregressive adaptation, joint audio-motion training, and multi-stage distillation to 3 NFEs—plus self-forcing to match streaming errors.
- How it works: 1) Intra-segment training for local smoothness; 2) Cross-segment training for transitions across changing prompts; 3) Keep a weighted image-to-video loss to start clean; 4) Fake-causal masking for stability; 5) Self-forcing by sampling generated-GT latents to simulate inference noise; 6) Distill steps and guidance into a fast student.
- Why it matters: Without this, long videos repeat, drift, or run too slow.
🍞 Bottom Bread (Anchor): Like rehearsing scene changes on the actual stage and wearing your real costume so nothing surprises you on opening night.
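As one hedged illustration, a single training step might combine a self-forced history with a weighted image-to-video term roughly like this; how the real system weights and combines these losses is not specified here, so treat every number, name, and call as a placeholder.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, denoiser, gt_chunk, gt_history,
                  i2v_weight=0.2, p_self_forcing=0.5, noise_level=0.4):
    # Self-forcing: sometimes train on an imperfect, re-denoised history instead of
    # the clean ground truth, to mimic what streaming inference actually provides.
    if random.random() < p_self_forcing:
        with torch.no_grad():
            gt_history = denoiser(gt_history + noise_level * torch.randn_like(gt_history))

    noisy = gt_chunk + torch.randn_like(gt_chunk)        # corrupt the chunk to be denoised
    pred = model(noisy, gt_history)                      # predict the clean chunk from history
    denoise_loss = F.mse_loss(pred, gt_chunk)
    # Extra weight on the earliest latents of the chunk: a stand-in for the
    # "weighted image-to-video loss" that keeps the start of generation clean.
    i2v_loss = F.mse_loss(pred[:, :1], gt_chunk[:, :1])
    return denoise_loss + i2v_weight * i2v_loss
```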
🍞 Top Bread (Hook): Trimming a long essay into a sharp summary makes it faster to read without losing meaning.
🥬 Filling (Efficient Distillation – NFEs/CFG):
- What it is: Compress many denoising steps into just 3, and fold classifier-free guidance into the model.
- How it works: 1) Teach a student to mimic teacher outputs at different guidance scales with a single embedding; 2) Distill micro-steps into 3 big steps; 3) Use DMD-style few-step score distillation adjusted for chunks; 4) Initialize each distillation stage from the last to stay stable.
- Why it matters: Without distillation, latency kills interactivity.
🍞 Bottom Bread (Anchor): Like learning the key moves of a dance so you can perform fluidly without practicing every tiny drill on stage.
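A generic sketch of folding classifier-free guidance into a one-pass student via a guidance-scale embedding; this follows the common guidance-distillation recipe rather than the paper’s exact procedure, and all module and function names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidanceEmbedding(nn.Module):
    """Maps a scalar guidance scale to a conditioning vector the student can read."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, scale):
        return self.proj(scale.unsqueeze(-1))            # (B,) -> (B, dim)

def cfg_distillation_loss(student, teacher, guidance_emb, noisy, cond, scale):
    with torch.no_grad():
        # Teacher runs twice (conditional + unconditional) and combines with CFG.
        cond_out = teacher(noisy, cond)
        uncond_out = teacher(noisy, torch.zeros_like(cond))
        target = uncond_out + scale.view(-1, 1, 1) * (cond_out - uncond_out)
    # Student runs once and is told the guidance scale through the embedding,
    # so classifier-free guidance is folded into a single forward pass.
    pred = student(noisy, cond, guidance_emb(scale))
    return F.mse_loss(pred, target)
```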
Secret Sauce:
- The pairing of chunkwise diffusion forcing with self-forcing and a smart memory bank keeps streams both quick and steady—a balance that prior systems struggled to get all at once.
04 Experiments & Results
The Test: Researchers compared FlowAct-R1 with three strong baselines—KlingAvatar 2.0, LiveAvatar, and OmniHuman-1.5—focusing on real-time responsiveness, long-duration stability, and how natural the behavior looked and sounded.
What They Measured and Why:
- Motion naturalness: Does the body gesture, nod, and pause like a person?
- Lip-sync accuracy: Do mouth shapes match sounds precisely?
- Frame structure stability: Does the face/body stay consistent without warping?
- Motion richness: Are there diverse, non-repetitive movements?
- Throughput/latency: Can it stream 25 fps at 480p with low time-to-first-frame (TTFF)?
The Competition:
- OmniHuman-1.5: Similar model family but tops out around 30s, no streaming.
- KlingAvatar 2.0: Great visuals, up to ~5 minutes, but no live streaming and shows motion repetition.
- LiveAvatar: Real-time streaming but tends to repeat motions, harming naturalness.
The Scoreboard (with context):
- FlowAct-R1 achieves 25 fps at 480p with ~1.5s TTFF—like a video call that starts almost immediately and stays smooth.
- In a 20-person user study (GSB metric), FlowAct-R1 was preferred for behavioral naturalness, reduced repetition, and overall realism, while also being the only one to pair real-time streaming with vivid full-body control and long-duration stability.
- Compared to LiveAvatar, FlowAct-R1’s chunkwise diffusion forcing plus MLLM action planning reduces repetitive loops, so it feels more like a real human driving the motion.
- Compared to KlingAvatar 2.0, FlowAct-R1 answers in real time and holds up over longer interactions without losing quality, even though KlingAvatar has strong single-clip quality.
- Compared to OmniHuman-1.5, FlowAct-R1 keeps the backbone strengths (quality, control) but extends to arbitrary lengths and streaming.
Surprising Findings:
- Short-term memory had the biggest impact on cumulative errors; by periodically repairing it (memory refinement), long videos stayed smooth and on-model.
- Removing classifier-free guidance via an embedding and distillation preserved quality while unlocking latency gains—an efficiency win without obvious quality loss.
- Chunk-aware distillation (teaching the student on the same streaming rollout it will use) improved both stability and speed compared to vanilla step distillation.
Takeaway: FlowAct-R1 didn’t just edge out others on one metric; it combined speed, stability, and vivid full-body expressiveness—something prior systems typically traded off.
05 Discussion & Limitations
Limitations:
- Domain shifts: Extremely wild camera motion or heavy occlusions may still cause drift or artifacts despite memory refinement.
- Hardware needs: Real-time 25 fps at 480p relies on strong GPUs (e.g., A100) plus careful kernel fusions and parallelism.
- Content controls: While full-body control is strong, ultra-precise finger-level choreography or multi-person interactions may require extra conditioning or models.
- Data dependence: High-quality, behavior-annotated training data is crucial; weak annotations could reduce naturalness.
- Ethics and safety: The same realism that delights can deceive; guardrails, watermarks, and access control are necessary.
Required Resources:
- A modern data center–class GPU or a well-optimized multi-GPU setup; FP8-capable inference stack helps.
- Preprocessing pipeline (VAE, Whisper features, text annotations) and streaming-friendly loaders.
- MLLM for action planning and short-interval prompt updates.
When NOT to Use:
- Ultra-low-power edge devices without GPU acceleration.
- Scenarios demanding 4K/60 fps today with the same latency budget.
- Highly choreographed multi-person scenes without additional pose or scene constraints.
Open Questions:
- How to scale to multi-person, object-rich scenes with equal stability and speed?
- Can memory refinement become fully self-tuning and event-triggered to further cut artifacts?
- What’s the best way to compress to mobile-class hardware without losing expressivity?
- How can we add robust watermarking or provenance while keeping latency low?
- Can the MLLM planner be unified end-to-end with the generator for even smoother intent-to-motion mapping?
06 Conclusion & Future Work
3-Sentence Summary: FlowAct-R1 streams lifelike humanoid video by generating small chunks conditioned on structured memories, fusing audio/text/visuals with an MMDiT backbone. It practices on the same imperfect histories it will face live (self-forcing), then distills the sampler to 3 fast steps, reaching 25 fps at 480p with ~1.5s TTFF. The result is vivid, stable, full-body behavior over arbitrary durations.
Main Achievement: Unifying chunkwise diffusion forcing, self-forcing, a structured memory bank, and few-step distillation into a single system that is both real time and richly expressive—closing the usual gap between speed and naturalness.
Future Directions: Extend to multi-person interactions and explicit object/scene grounding; push to higher resolutions and frame rates; tighten end-to-end planning with the MLLM; advance safety with watermarking and robust identity controls; and compress further for edge deployment.
Why Remember This: FlowAct-R1 shows you don’t have to choose between fast or lifelike—you can have both by matching training to streaming reality, keeping smart memories, and shrinking denoising to just a few powerful steps.
Practical Applications
- Live customer support avatars that answer questions and gesture naturally in real time.
- Virtual teaching assistants that lip-sync to lessons, point, and encourage students on cue.
- Telehealth companions that maintain eye contact, nod, and mirror empathy during sessions.
- Real-time translation avatars that mouth and gesture the translated speech smoothly.
- Corporate training presenters for onboarding and safety demos with full-body cues.
- Interactive streamers and virtual influencers who respond instantly to chat and prompts.
- Accessible sign-language helpers with synchronized facial expressions and hand motions (with extra conditioning).
- Video meeting co-hosts that summarize, gesture to slides, and keep engagement high.
- Retail concierges in kiosks who greet, explain products, and react to voice queries.
- Role-play coaches for public speaking or interview prep, giving natural, timely feedback.