
Plenoptic Video Generation

Intermediate
Xiao Fu, Shitao Tang, Min Shi et al. Ā· 1/8/2026
arXiv Ā· PDF

Key Summary

  • PlenopticDreamer is a new way to re-render a video along different camera paths while keeping everything consistent across views and over time.
  • It treats a scene like a plenoptic function (a big map of all the light in space and time) and learns to generate new views that stay in sync.
  • Instead of making each new view separately, it works autoregressively: it looks at several past generated videos plus their cameras and then makes the next one.
  • A 3D Field-of-View (FOV) retrieval step picks the most relevant earlier videos based on how much they see the same parts of the scene.
  • Cameras are encoded with Plücker raymaps so the model knows exactly where each pixel ray travels in 3D.
  • Progressive context-scaling trains the model to handle more and more context over time, which stabilizes learning.
  • Self-conditioned training fine-tunes the model on its own outputs, making it robust to small mistakes that can grow over long sequences.
  • A long-video conditioning trick stitches long sequences together by overlapping a few frames between chunks to keep continuity.
  • On two benchmarks (Basic and Agibot), it outperforms prior systems in view synchronization, visual quality, and camera accuracy.
  • This unlocks smoother camera-controlled storytelling, better robotic viewpoint switching (head to gripper), and more immersive content.

Why This Research Matters

Consistent multi-view videos are the bridge from pretty clips to believable worlds. With synchronized spatio-temporal memory, creators can freely direct camera moves without the scene falling apart, enhancing films, games, and AR/VR experiences. Robots benefit because switching from a head camera to a gripper camera no longer scrambles object identities or positions. Educators and scientists can re-explore recorded experiments from new viewpoints while trusting that measurements stay consistent. Even on consumer devices, camera-aware video editing becomes simpler and more reliable, reducing reshoots and saving time. In short, this work makes AI-generated videos feel like the same real place, not a patchwork of lucky guesses.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how when you walk around a room, everything still feels like the same room even though you see it from different angles? Your brain remembers what’s behind the couch, even when you can’t see it from where you’re standing.

🄬 Filling (The Actual Concept): Plenoptic Function

  • What it is: The plenoptic function is a fancy name for the complete ā€œlight mapā€ of a scene—what light travels in every direction at every point and time.
  • How it works: 1) Imagine the scene as a big invisible bubble. 2) At every point, light shoots in many directions. 3) If you know all these light directions, you can recreate any camera view. 4) Over time, this also includes changes (like people moving).
  • Why it matters: Without this idea, a model might guess each new view separately and forget that they should all be different slices of the same world. šŸž Bottom Bread (Anchor): Think of a museum diorama. If you could see the diorama’s light from any angle, you could place your camera anywhere and still get a correct view of the same mini-world.
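To make the idea concrete, here is a minimal sketch (not from the paper) of what "querying the plenoptic function" looks like in code: every query is a position, a viewing direction, and a time, and the answer is a color. The `SceneModel` class is a hypothetical stand-in for whatever actually stores the scene (a learned video generator, a captured light field, and so on).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PlenopticQuery:
    position: Tuple[float, float, float]   # (x, y, z): where the observer stands
    direction: Tuple[float, float]         # (theta, phi): which way the pixel ray looks
    time: float                            # t: which moment of the video

class SceneModel:
    """Hypothetical stand-in for anything that can answer plenoptic queries."""

    def radiance(self, q: PlenopticQuery) -> Tuple[float, float, float]:
        # A real model would return the RGB light arriving along this ray at this time.
        raise NotImplementedError("replace with a learned or measured scene model")

def render_view(scene: SceneModel, pixel_queries: List[PlenopticQuery]):
    """Rendering any camera view is just evaluating the plenoptic function once per pixel."""
    return [scene.radiance(q) for q in pixel_queries]
```

The point of the sketch is the interface, not the implementation: if a model can answer these queries consistently, any camera placement becomes a set of lookups into the same world.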

šŸž Top Bread (Hook): Imagine you filmed your friend dancing from the front. Now you wish you had also filmed from the side and from above—without asking your friend to dance again.

🄬 Filling (The Actual Concept): Generative Video Re-rendering

  • What it is: It’s when an AI takes an existing video and regenerates it from new camera paths while keeping the scene and actions the same.
  • How it works: 1) Feed the original video into the model. 2) Tell it the new camera path (like ā€œrotate leftā€ or ā€œmove closerā€). 3) The model imagines the unseen sides and renders new frames. 4) It repeats this for the whole video.
  • Why it matters: Without reliable re-rendering, creators must physically refilm from every angle; for robots, switching viewpoints would be clumsy. šŸž Bottom Bread (Anchor): Like asking a skilled animator to redraw your soccer game from the goalie’s view, not just the main broadcast camera.

šŸž Top Bread (Hook): Suppose five friends each remember different parts of a story. If one friend retells the whole story alone, they might get parts wrong. If they compare notes first, the story stays consistent.

🄬 Filling (The Actual Concept): The Problem Before This Paper

  • What it is: Prior camera-controlled video models often made each new view independently, so the ā€œguessesā€ (hallucinations) about unseen regions didn’t match across different views.
  • How it works (the failure mode): 1) Each view is generated separately. 2) Unseen areas are guessed randomly. 3) Different runs guess differently. 4) Results don’t align across cameras—signs shift, textures warp, people’s clothes change.
  • Why it matters: Without cross-view consistency, you can’t form a trustworthy 3D sense of the scene. It breaks immersion and confuses robots. šŸž Bottom Bread (Anchor): Think of a door that appears red from the left view and blue from the right view—not believable, not usable.

šŸž Top Bread (Hook): Imagine a scrapbook where you save the best photos that overlap with what you’ll shoot next, so your next photo set matches the look and layout of the last ones.

🄬 Filling (The Actual Concept): Spatio-temporal Memory

  • What it is: A way for the model to remember what it previously generated across space (views) and time (frames) so future generations stay consistent.
  • How it works: 1) Store past generated videos and their camera info. 2) Select the past ones that see similar parts of the scene. 3) Condition the new generation on these. 4) Keep doing this step-by-step.
  • Why it matters: Without memory, each new view can drift, causing flicker, misalignment, and mismatched details. šŸž Bottom Bread (Anchor): If a robot saw a cup on the table in a head view, it should also see the same cup from the gripper view in the same place and color.

šŸž Top Bread (Hook): Think of a movie director carefully planning camera moves so the audience sees the same scene from different angles without losing track.

🄬 Filling (The Actual Concept): Camera Control

  • What it is: Precisely telling the model how the camera moves (rotate, tilt, zoom, move) so it renders the right view.
  • How it works: 1) Provide camera pose and intrinsics (like focal length). 2) Convert them into a form the AI understands. 3) Guide generation to match these settings. 4) Check accuracy with pose estimation tools.
  • Why it matters: Without strict camera control, the output view can wobble or be wrong, breaking realism and usability. šŸž Bottom Bread (Anchor): Saying ā€œtilt up and zoom in from 50mm to 100mmā€ should make the video clearly tilt and change depth-of-field accordingly.

The World Before: Video generators got good at making short clips and even controlling a single camera path. But as soon as people tried multi-view re-rendering—like turning left, then doing an arc, then changing elevation—the same object could look different between views. This was mostly because diffusion models add randomness and forget long-range details. Failed attempts included generating each view separately or using shallow memory (like a few frames) or static 3D features that weren’t updated as new views were rendered. The gap: a way to synchronize the model’s imagination across time and viewpoints so ā€œwhat’s behind the couchā€ stays the same everywhere. Real stakes: filmmakers, game creators, and robot systems all need consistent multi-view understanding; otherwise, the story breaks or the robot grabs the wrong thing.

02Core Idea

šŸž Top Bread (Hook): Imagine building a giant Lego city with friends. If each friend builds their block in isolation, streets won’t meet and colors won’t match. But if you all share the same city map and keep checking each other’s pieces, the city snaps together perfectly.

🄬 Filling (The Actual Concept): The Aha! Moment

  • What it is: Synchronize the model’s ā€œhallucinationsā€ by reusing the most relevant past generations (plus their cameras) every time you make a new view, so all views agree about the same world.
  • How it works: 1) Keep a memory bank of previous video-camera pairs. 2) Pick the ones that see similar parts of the scene using a 3D Field-of-View (FOV) similarity. 3) Feed those as context (multi-in) and generate only one new target view (single-out). 4) Repeat autoregressively for all required views and chunks.
  • Why it matters: Without synchronized memory, different views invent different details; with it, the scene stays coherent in space and time. šŸž Bottom Bread (Anchor): If a robot’s right side panel is shiny silver in one view, it stays shiny silver in all other views later.

Three Analogies for the Same Idea:

  1. Choir analogy: One lead singer starts a tune; new singers join by listening to the closest voices (FOV retrieval) so everyone harmonizes (consistent hallucinations).
  2. Puzzle analogy: You place a piece by checking neighboring pieces that touch the same sky or grass (retrieved context); you don’t dump random pieces elsewhere.
  3. GPS convoy analogy: Each car (a new view) follows the convoy’s shared route (memory bank) and checks the cars with the most overlapping path (FOV) to stay in formation.

Before vs After:

  • Before: Each view was generated in a separate run; unseen regions got guessed differently, leading to mismatched textures, shapes, and lighting.
  • After: Each new view is generated while looking back at the best-matching earlier views; the model carries over the same ā€œworld facts,ā€ keeping geometry, materials, and motion in sync.

šŸž Top Bread (Hook): Imagine you don’t read an entire encyclopedia at once. You read a few relevant pages, write your summary, and then use that summary plus some new pages to write the next part.

🄬 Filling (The Actual Concept): Multi-in–Single-out Autoregressive Generation

  • What it is: A generation style where the model uses several context videos in and produces one target video out, then repeats step-by-step.
  • How it works: 1) Start with the source video. 2) Retrieve top-k past video-camera pairs using 3D FOV similarity. 3) Pack their tokens plus the target camera into the model. 4) Generate the next video. 5) Add this new pair to memory and continue.
  • Why it matters: Without autoregressive reuse of relevant context, long stories fall apart; with it, details carry forward naturally. šŸž Bottom Bread (Anchor): Like making a comic book: you look at the last few panels (context) to draw the next panel so characters and backgrounds match.
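A minimal sketch of the multi-in–single-out loop described above, assuming hypothetical helpers `retrieve_top_k` (the FOV retrieval) and `generate_view` (the conditioned video generator); the (video, camera) tuples and the default `k` are placeholders for illustration.

```python
def rerender_all_views(source_video, source_camera, target_cameras,
                       retrieve_top_k, generate_view, k=4):
    """Autoregressive multi-in-single-out generation (illustrative sketch):
    each new view is generated from the k most relevant earlier (video, camera)
    pairs, then added to the memory bank for the views that come after it."""
    memory_bank = [(source_video, source_camera)]              # starts with the real footage
    outputs = []
    for target_camera in target_cameras:
        context = retrieve_top_k(memory_bank, target_camera, k)  # 3D FOV retrieval
        new_video = generate_view(context, target_camera)        # single view out
        memory_bank.append((new_video, target_camera))           # remember the result
        outputs.append(new_video)
    return outputs
```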

šŸž Top Bread (Hook): Picking teammates matters—choose the kids who practiced the same drill when you’re about to perform that drill.

🄬 Filling (The Actual Concept): 3D Field-of-View (FOV) Retrieval

  • What it is: A way to select the most relevant earlier videos based on how much of the 3D space their cameras see in common with the next target camera.
  • How it works: 1) Build view frustums (3D cones) for each camera. 2) Monte Carlo–sample points per frame. 3) Count co-visible points between context and target cameras across frames. 4) Average to get video-level similarity; pick top-k.
  • Why it matters: Without smart retrieval, the model might condition on unrelated views and drift off course. šŸž Bottom Bread (Anchor): If you’re going to look at the robot’s right side next, you retrieve past views that also looked at the right side—not the left.
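Here is one way the co-visibility score could be approximated with Monte Carlo sampling, as a rough sketch only; the sampling region, frustum test, and camera conventions (3x3 intrinsics `K`, 4x4 world-to-camera matrices) are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def in_frustum(points, K, w2c, width, height, near=0.1, far=20.0):
    """Boolean mask: which 3D world points project inside this camera's image and depth range."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # homogeneous coords
    cam = (w2c @ pts_h.T).T[:, :3]                                        # world -> camera frame
    z = cam[:, 2]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                      # perspective divide
    return (z > near) & (z < far) & \
           (uv[:, 0] >= 0) & (uv[:, 0] < width) & \
           (uv[:, 1] >= 0) & (uv[:, 1] < height)

def fov_similarity(target_traj, context_traj, K, width, height, n_points=4096, seed=0):
    """Video-level FOV similarity: sample random 3D points, count points co-visible
    from the target and context cameras at each frame, and average over frames."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-5.0, 5.0, size=(n_points, 3))    # assumed scene bounding box
    per_frame = [
        (in_frustum(pts, K, t, width, height) & in_frustum(pts, K, c, width, height)).mean()
        for t, c in zip(target_traj, context_traj)
    ]
    return float(np.mean(per_frame))
```

The context videos with the top-k similarity scores are the ones packed into the model as conditioning.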

šŸž Top Bread (Hook): When learning piano, you don’t jump straight to a symphony—you start with short pieces and grow longer.

🄬 Filling (The Actual Concept): Progressive Context-Scaling

  • What it is: Gradually train the model to handle more context videos over time (e.g., from 1 to k) for stable learning.
  • How it works: 1) Start with small context size. 2) Train till stable. 3) Increase context size. 4) Repeat until target size.
  • Why it matters: Without this, training with large context can be unstable and fail to converge. šŸž Bottom Bread (Anchor): Like lifting light weights first, then heavier ones—your muscles (the model) learn safely.

šŸž Top Bread (Hook): After practicing a speech, you record yourself, watch it, and then practice again fixing your mistakes.

🄬 Filling (The Actual Concept): Self-Conditioning

  • What it is: Fine-tune the model on its own generated videos as conditioning so it learns to recover from its small errors over time.
  • How it works: 1) Train normally. 2) Use the model to generate videos for training scenes. 3) Replace clean contexts with these generated ones. 4) Fine-tune so the model gets robust to imperfections.
  • Why it matters: Without this, tiny errors can snowball in long sequences into glare, flicker, or distortions. šŸž Bottom Bread (Anchor): Like practicing basketball with slightly slippery balls so regular balls feel easy and you don’t lose control.

šŸž Top Bread (Hook): For a long parade, organizers overlap groups slightly so the route feels continuous with no awkward gaps.

🄬 Filling (The Actual Concept): Long-Video Conditioning

  • What it is: Split long videos into chunks but keep some overlapping frames from the end of one chunk as clean inputs for the next.
  • How it works: 1) Chop the video into overlapping sub-clips. 2) Feed the overlap as context for the next chunk. 3) Generate the next chunk. 4) Repeat.
  • Why it matters: Without overlap, seams appear between chunks, breaking temporal smoothness. šŸž Bottom Bread (Anchor): Like stitching quilts with shared edges so the big blanket feels seamless.
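A tiny sketch of the chunking logic; the chunk length and overlap below are placeholders, not the paper's settings.

```python
def chunk_with_overlap(num_frames: int, chunk_len: int = 93, overlap: int = 9):
    """Split a long video into sub-clips that share `overlap` frames, so each new chunk
    can condition on clean frames from the end of the previous one (illustrative sketch)."""
    chunks, start = [], 0
    while start < num_frames:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start = end - overlap          # the next chunk re-uses the last `overlap` frames
    return chunks

# chunk_with_overlap(250) -> [(0, 93), (84, 177), (168, 250)]
```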

03Methodology

At a high level: Source video and target camera paths → Memory bank retrieval (3D FOV) → Pack context videos and camera rays → Diffusion Transformer denoising (flow-matching) → Target video → Add to memory and repeat; for long videos, use chunk overlaps.

šŸž Top Bread (Hook): Think of a well-organized binder: each finished page goes into a section so you can quickly find the pages that help with the next one.

🄬 Filling (The Actual Concept): Memory Bank

  • What it is: A growing library of past generated video–camera pairs you can look up later.
  • How it works: 1) Start with the source video and its camera pose. 2) After generating a target view, store it with its camera. 3) Keep doing this; the bank gets richer. 4) Retrieval picks the most useful entries later.
  • Why it matters: Without a memory bank, the model forgets its own decisions and re-invents details inconsistently. šŸž Bottom Bread (Anchor): Your robot’s right side, once generated, is saved so later views can match its same screws and shine.
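As a sketch, the memory bank can be as simple as a growing list of (video, camera) entries with a retrieval hook; the `similarity` argument below stands in for the 3D FOV score, and the class is an illustration rather than the paper's data structure.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class MemoryBank:
    """Growing library of generated (video, camera) pairs (illustrative sketch)."""
    entries: List[Tuple[Any, Any]] = field(default_factory=list)

    def add(self, video, camera) -> None:
        self.entries.append((video, camera))

    def retrieve(self, target_camera, similarity: Callable, k: int = 4):
        """Return the k stored pairs whose cameras overlap most with the target camera."""
        ranked = sorted(self.entries,
                        key=lambda entry: similarity(entry[1], target_camera),
                        reverse=True)
        return ranked[:k]
```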

šŸž Top Bread (Hook): To aim a flashlight, you think about the ray of light leaving it and where it hits on the wall.

🄬 Filling (The Actual Concept): Plücker Raymaps (Camera Encoding)

  • What it is: A way to encode, for every pixel, the 3D line (ray) it travels along so the model knows exactly which part of space it’s looking at.
  • How it works: 1) Take camera intrinsics (like focal length) and extrinsics (position and rotation). 2) For each pixel, compute a 6D line representation (Plücker coordinates). 3) Pack these into a raymap aligned with video frames. 4) Add these ray tokens to video tokens before attention so the model bakes in camera geometry.
  • Why it matters: Without precise rays, the model can’t align views correctly, hurting camera accuracy and 3D consistency. šŸž Bottom Bread (Anchor): A pixel on the left edge points to a different 3D line than a pixel at the center; raymaps tell the model that difference.
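A minimal numpy sketch of how a per-pixel Plücker raymap can be computed. The camera conventions (3x3 intrinsics `K`, 4x4 camera-to-world pose `c2w`, pixel-center offsets) are assumptions for illustration; the paper's exact encoding pipeline may differ.

```python
import numpy as np

def plucker_raymap(K, c2w, width, height):
    """Per-pixel Plücker encoding: for each pixel, the 6D vector (d, o x d), where o is
    the camera center and d the unit ray direction in world coordinates."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)   # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)                      # (H, W, 3) homogeneous
    dirs_cam = pix @ np.linalg.inv(K).T                                   # back-project to rays
    dirs_world = dirs_cam @ c2w[:3, :3].T                                 # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = c2w[:3, 3]                                                   # camera center in world
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)                  # (H, W, 6) raymap
```

Because the moment term depends on both where the camera sits and where each pixel looks, two pixels (or two cameras) that see the same 3D line get the same encoding, which is exactly the geometric hint the model needs.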

Step-by-step recipe:

  1. Inputs: You have a source video (like a person walking), plus a list of target camera trajectories (e.g., rotate left, tilt up, zoom out).
  2. Retrieval via 3D FOV: Build frustums for each candidate context camera and the target, sample 3D points, compute co-visibility scores over frames, and pick top-k videos that best see the same space as the target.
  3. Pack tokens: Convert videos into latent tokens (ā€œpatchifyā€), temporally concatenate context tokens with the noisy target tokens, and add camera ray tokens channel-wise.
  4. Denoise with a Video Diffusion Transformer under flow-matching: start from noise for the target video, then let a learned ā€œvelocity fieldā€ push the noisy frames toward clean frames step-by-step, guided by context and camera rays.
  5. Output and loop: Decode the cleaned latent video, save it (with its camera) into the memory bank, move on to the next target camera. For long sequences, chunk the video and pass a slice of clean overlap into the next chunk.

šŸž Top Bread (Hook): Imagine a river current that gently pushes colored dye into a clear picture as it flows.

🄬 Filling (The Actual Concept): Flow-based Video Diffusion Transformer (Flow-Matching)

  • What it is: A diffusion-style model that learns a smooth ā€œpushā€ (velocity field) to transform noise into a clean video, conditioned on context.
  • How it works: 1) Mix the real video with noise at different levels. 2) Train the transformer to predict the direction that moves noisy frames toward clean frames. 3) At inference, start from noise and follow these predicted pushes over time until the picture becomes clear. 4) The model uses attention to blend information from context videos and camera rays.
  • Why it matters: Without a stable way to go from noise to video, results are blurry and inconsistent. šŸž Bottom Bread (Anchor): Starting from TV-static, the model ā€˜flows’ the frames until your dancing robot appears sharply in the target view.
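To make the "velocity field" idea concrete, here is a simplified single-step training objective in PyTorch. The tensor shapes, the straight-line noise path, and the model signature are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, clean_latents, context_tokens, raymaps):
    """One simplified flow-matching training step: sample a point on the straight path
    between noise and the clean video latents, and train the model to predict the
    velocity (clean - noise) that pushes it toward the clean video.
    `model` is a hypothetical video diffusion transformer conditioned on context tokens
    and camera raymaps; `clean_latents` is assumed to have shape (B, T, C, H, W)."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1, 1, 1, device=clean_latents.device)
    noisy = (1.0 - t) * noise + t * clean_latents        # point along the noise -> video path
    target_velocity = clean_latents - noise              # direction of the flow at that point
    predicted_velocity = model(noisy, t, context_tokens, raymaps)
    return F.mse_loss(predicted_velocity, target_velocity)
```

At inference the model starts from pure noise and integrates these predicted velocities over a few steps until the target view emerges.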

šŸž Top Bread (Hook): When writing an essay, you paste notes from the most related pages at the top of your draft so they guide your next paragraph.

🄬 Filling (The Actual Concept): Temporal Concatenation (In-Context Conditioning)

  • What it is: Putting the tokens of several context videos in front of the target’s tokens so the model can attend to them when generating.
  • How it works: 1) Patchify each context video into tokens. 2) Concatenate them along the time dimension with the target’s noisy tokens. 3) Apply self- and cross-attention so the target can ā€˜look’ at context. 4) Decode to frames.
  • Why it matters: Without seeing the context tokens, the model can’t copy consistent details like the same poster on a wall across views. šŸž Bottom Bread (Anchor): The model sees the earlier ā€œright side of the roomā€ video tokens as it draws the next ā€œarc-rightā€ camera view.
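A rough sketch of the packing step, assuming latent tokens of shape (batch, time, channels, height, width) and additive injection of the camera ray tokens; the exact fusion scheme in the paper may differ.

```python
import torch

def pack_in_context(context_latents, target_noisy_latents, context_raymaps, target_raymap):
    """Temporal concatenation for in-context conditioning (illustrative sketch):
    context video tokens are placed in front of the noisy target tokens along the
    time axis, and each token gets its camera ray encoding added so attention can
    relate pixels across views."""
    videos = torch.cat(list(context_latents) + [target_noisy_latents], dim=1)  # (B, T_total, C, H, W)
    rays = torch.cat(list(context_raymaps) + [target_raymap], dim=1)
    return videos + rays   # the transformer attends over this packed sequence
```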

šŸž Top Bread (Hook): If too many helpers talk at once, you form smaller groups, combine their answers, and move on.

🄬 Filling (The Actual Concept): Divide-and-Conquer Retrieval (When l > k)

  • What it is: If too many relevant context videos exist, merge subsets into a representative context to keep diversity without overload.
  • How it works: 1) Take the first m videos beyond capacity. 2) Merge their camera paths into a span that covers their FOVs. 3) Infer a merged context video. 4) Replace the m with this merged one and continue until l ≤ k.
  • Why it matters: Without this, you either ignore valuable views or blow past memory limits. šŸž Bottom Bread (Anchor): Like summarizing three similar angles into one helpful summary view.
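A sketch of the divide-and-conquer idea, with `merge_group` standing in for "merge the cameras into a covering span and generate one representative context video"; the helper and the grouping policy are hypothetical, not the paper's code.

```python
def reduce_contexts(relevant, k, m, merge_group):
    """If more than k relevant context videos exist, repeatedly replace a group of m
    videos with a single merged context until at most k remain (illustrative sketch)."""
    assert m >= 2, "each merge must shrink the list, so groups need at least 2 videos"
    contexts = list(relevant)
    while len(contexts) > k:
        group, rest = contexts[:m], contexts[m:]
        contexts = [merge_group(group)] + rest   # one representative replaces m originals
    return contexts
```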

šŸž Top Bread (Hook): Don’t try to read a whole novel on day one—build up.

🄬 Filling (The Actual Concept): Progressive Context-Scaling in Training

  • What it is: Train with small context first, then slowly increase to the target k.
  • How it works: 1) Train k=1 until stable. 2) Train k=2, then k=3, etc. 3) Only some layers (like self-attention and camera encoder) are fine-tuned. 4) This stabilizes optimization.
  • Why it matters: Without it, large-context training can diverge and hurt camera accuracy. šŸž Bottom Bread (Anchor): The paper shows removing this step increases translation error and reveals hidden artifacts.
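A schematic training schedule for this idea, assuming a hypothetical `train_stage` routine; the stage lengths, context ceiling, and module names are placeholders rather than the paper's configuration.

```python
def progressive_context_scaling(model, train_stage, max_context=6, steps_per_stage=10_000):
    """Train with a growing number of context videos (k = 1, 2, ..., max_context),
    fine-tuning only selected modules at each stage; illustrative sketch only."""
    trainable = ["self_attention", "camera_encoder"]     # assumed names of the unfrozen modules
    for k in range(1, max_context + 1):
        train_stage(model,
                    context_size=k,                      # context videos per training sample
                    trainable_modules=trainable,
                    num_steps=steps_per_stage)
    return model
```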

šŸž Top Bread (Hook): Practice with your own recordings to toughen up.

🄬 Filling (The Actual Concept): Self-Conditioned Training

  • What it is: After initial training on ground-truth context, generate synthetic contexts and train on those so the model survives imperfections.
  • How it works: 1) Generate contexts for many scenes. 2) Replace clean contexts with these synthetic ones. 3) Fine-tune. 4) Improves robustness for long sequences.
  • Why it matters: Without it, long videos tend to over-expose or drift. šŸž Bottom Bread (Anchor): In ablations, removing self-conditioning worsens FVD and image quality and adds artifacts.
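And a schematic of the self-conditioned fine-tuning loop, again with hypothetical helpers (`generate_contexts`, `fine_tune_step`) and no claim about the paper's exact schedule.

```python
def self_conditioned_finetune(model, scenes, generate_contexts, fine_tune_step, num_epochs=1):
    """Fine-tune on the model's own generations: replace the clean context videos with
    contexts the model produced itself, so it learns to recover from its own small errors
    (illustrative sketch)."""
    for _ in range(num_epochs):
        for scene in scenes:
            synthetic_context = generate_contexts(model, scene)      # the model's own outputs
            fine_tune_step(model,
                           context=synthetic_context,                # imperfect conditioning
                           target=scene.ground_truth_video)          # still supervised by real frames
    return model
```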

Concrete example with data:

  • Input: A 93-frame video of a robot in a lab. Target cameras: rotation right, arc left, zoom out.
  • Step A (Retrieval): For rotation right, the system selects earlier views that also saw the robot’s right side (highest co-visibility).
  • Step B (Generation): It concatenates those tokens, adds Plücker raymaps for the new camera, and denoises from noise to produce the new 93-frame video.
  • Step C (Loop): That new video is stored and helps the next arc-left generation; long videos are chunked with overlapping frames to keep continuity.

Secret sauce:

  • Synchronizing hallucinations via 3D FOV–guided retrieval plus autoregressive, multi-in–single-out conditioning. This pairing turns scattered guesses into a single, shared world model across views and time.

04Experiments & Results

The Test: The paper checks three big things: visual quality (does it look good and stable?), camera accuracy (does the output match the commanded camera motion?), and view synchronization (do different views agree on the same details over time?). They use FVD for video fidelity, PSNR for pixel faithfulness in robotics, TransErr/RotErr for camera pose accuracy, and matched pixels (RoMa) for multi-view sync.

The Competition: PlenopticDreamer is compared against strong camera-controlled generators: ReCamMaster, TrajectoryCrafter, Trajectory-Attention, and a retrained ReCamMaster* that uses similar ingredients (Plücker rays and the same backbone) for fairness.

Benchmarks and Setups:

  • Basic benchmark: 100 in-the-wild videos with 12-step camera trajectories (rotation, arcs, azimuth/elevation changes, zooms). Training used MultiCamVideo (~136K episodes) + SynCamVideo (~34K episodes).
  • Agibot benchmark: Robotic manipulation with head-view and two gripper-views; 200 test videos. Training sampled ~145K episodes from Agibot.

The Scoreboard with Context:

  • Basic benchmark: PlenopticDreamer achieves an FVD around 425.8 (lower is better) versus 675.4 for ReCamMaster*—that’s like going from a shaky C+ to a crisp A- in overall video realism. Camera translation and rotation errors are competitive or better than baselines, especially under large view shifts. Most importantly, view synchronization (matched pixels) jumps: for the hardest 12-shot setting, it reaches about 41.2K matched pixels, beating the next best by notable margins. That’s like all cameras finally agreeing on the same poster, outlet, and roof edge at once.
  • Agibot benchmark: PSNR improves from 13.8 (ReCamMaster*) to 14.5, and matched pixels from 13.2K to 15.3K. In plain terms, the gripper-view videos line up better with the head-view story and show fewer distortions, which matters when a robot needs consistent, trustworthy visuals.

Qualitative Highlights:

  • Basic: In scenes with big viewpoint changes, baselines often warp objects (like traffic lights or wall paintings) or let details change across views; PlenopticDreamer keeps those details stable. Zoom and focal length control produce believable depth-of-field changes.
  • Agibot: Head-to-gripper and gripper-to-gripper transformations remain coherent across tasks (pouring, scooping, unscrewing), while the retrained baseline still slips on object shapes or textures.

Surprising/Notable Findings:

  • Retrieval matters a lot: Randomly picking context videos tanks multi-view synchronization and introduces mismatched hallucinations. FOV-based retrieval is a big driver of consistency.
  • More isn’t always better: Increasing context size beyond a point gives diminishing returns because errors and noise can accumulate; a sweet spot (e.g., kā‰ˆ4–6, depending on budget) balances coverage and stability.
  • Training strategy pays off: Progressive context-scaling reduces camera errors and prevents unstable training; self-conditioning reduces long-range artifacts and improves FVD and human-perceived image quality.

Takeaway: The method doesn’t just make videos prettier—it makes multiple views tell the same story. That’s the leap from ā€œcool demoā€ to ā€œreliable multi-view world building.ā€

05Discussion & Limitations

Limitations:

  • Long-shot fragility: Despite self-conditioning, very long sequences can still show over-exposure, drifting textures, or mild geometric distortions.
  • Complex human motion: Dancing or fast, non-rigid actions can degrade due to pretraining biases and the difficulty of tracking fine deformations across many views.
  • Compute footprint: Training uses large backbones and multiple GPUs; context parallelism and long videos add memory pressure.
  • Retrieval assumptions: FOV-based retrieval assumes reasonable pose inputs; bad camera estimates can misguide selection.

Required Resources:

  • Hardware: Multi-GPU (e.g., H100-class) for fine-tuning and inference at high resolution/long duration; high-speed storage for memory banks.
  • Data: Multi-view video datasets with accurate camera poses; curated synthetic + real scenes for diversity; robotic datasets for head-to-gripper tasks.
  • Software: Video diffusion transformer stack with flow-matching; camera encoders for Plücker rays; retrieval and chunking pipelines.

When NOT to Use:

  • Extremely rapid non-rigid motion (crowded dance floors, waving fabrics) where pose estimates are poor and cross-view matching is hard.
  • Very sparse or unreliable camera metadata; without decent pose information, FOV retrieval and camera control underperform.
  • Ultra-low-latency, on-device scenarios without enough memory/compute to store and retrieve context or run a DiT-based generator.

Open Questions:

  • Better long-horizon stability: Can ā€œSelf-Forcing/Forcing-styleā€ training and online adaptation keep quality high for multi-minute videos?
  • Richer geometry cues: Could hybrid 3D memory (surfels/point clouds) plus video memory improve both speed and consistency?
  • Learning to retrieve: Can the retrieval policy be learned end-to-end to select context more intelligently than hand-crafted FOV overlap?
  • Robustness to pose noise: How to remain stable with noisy or missing camera intrinsics/extrinsics?
  • Scaling laws: What are the best trade-offs among context size, chunk length, and compute for different content types?

06Conclusion & Future Work

Three-sentence summary: PlenopticDreamer re-renders videos along new camera paths by synchronizing its ā€œimaginedā€ details across time and viewpoints. It does this with an autoregressive, multi-in–single-out design, 3D FOV–based retrieval of past video contexts, precise camera ray encoding, and stabilizing training tactics (progressive context-scaling and self-conditioning). The result is state-of-the-art view synchronization with strong camera control and high visual fidelity on both everyday scenes and robotics.

Main achievement: Turning scattered, per-view guesses into one coherent, shared world by making every new generation consult the best-matching earlier views.

Future directions: Improve long-range stability with advanced self-forcing or online learning; combine video memory with explicit 3D world memory; learn retrieval policies; harden against noisy camera inputs; and scale to even longer, higher-resolution sequences.

Why remember this: It’s a step from ā€œpretty clipsā€ to ā€œconsistent worlds.ā€ With synchronized spatio-temporal memory, creators can direct cameras freely, robots can switch viewpoints safely, and multi-view storytelling becomes both flexible and faithful to the same underlying scene.

Practical Applications

  • Film previsualization: Re-render a scene from new camera paths to test shots before expensive reshoots.
  • Game development: Generate consistent multi-view cutscenes and camera fly-throughs from a single base clip.
  • AR/VR production: Create immersive, view-stable experiences where users can look around without artifacts.
  • Robotics: Convert head-view demonstrations into gripper-view perspectives for better manipulation learning.
  • Sports analysis: Re-render plays from different sidelines or elevated angles to study tactics without extra cameras.
  • Education and labs: Revisit experiments from novel viewpoints while keeping measurements visually consistent.
  • Real estate and tourism: Turn walkthroughs into smooth multi-angle tours without multiple physical recordings.
  • Content creation: Apply precise camera moves (tilt, arc, zoom, focal length changes) for dynamic storytelling.
  • Safety training: Re-render accident simulations from multiple viewpoints to improve situational understanding.
  • Broadcast and news: Generate alternate camera angles for limited-footage events to enhance coverage.
#plenoptic function#camera-controlled video generation#video re-rendering#multi-view consistency#spatio-temporal memory#autoregressive generation#3D FOV retrieval#Plücker raymaps#flow-matching#video diffusion transformer#progressive context-scaling#self-conditioning#long-video conditioning#view synchronization#camera pose control