360Anything: Geometry-Free Lifting of Images and Videos to 360°
Key Summary
- This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera's settings, such as field of view or tilt.
- The trick is to treat both the input picture and the target panorama as sequences of tokens and let a diffusion transformer learn how they relate; no geometry formulas required.
- They fix the common "seam line" in panoramas by discovering it comes from zero-padding inside the VAE encoder and replacing it with Circular Latent Encoding.
- The method learns to place the input view correctly on a gravity-aligned (upright) 360° canvas, even when the camera was tilted or zoomed.
- On standard image tests (Laval Indoor and SUN360), it beats prior methods on most quality metrics and ties or nearly ties on the rest.
- On videos, it strongly improves sharpness, smoothness, and overall realism, with much lower FVD and better PSNR/LPIPS than methods that even use ground-truth camera info.
- It can guess camera field-of-view and orientation from a single image competitively with specialized supervised methods, zero-shot.
- The panoramas are consistent enough to reconstruct 3D scenes using 3D Gaussian Splatting for free-view exploration.
- Because it avoids camera calibration, this approach works better "in the wild" on casual photos and videos from phones, drones, or the internet.
- The main ideas (sequence concatenation, gravity-aligned training, and circular latent encoding) are simple, scalable, and remove fragile dependencies.
Why This Research Matters
This work lets anyone lift ordinary photos and videos into full, VR-ready panoramas without fiddling with camera calibration. That means creators, educators, and app developers can build immersive environments from everyday media, not just special 360° rigs. Robots and AR devices can imagine what's outside their narrow view to plan or overlay information more safely. Filmmakers and game studios can quickly generate wraparound worlds from scouting shots or concept clips. Because seams are fixed at the source, outputs look professional without post-processing hacks. Overall, it lowers the barrier to high-quality 360° content and makes world-building more robust and scalable.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're standing in the middle of a playground and want a picture that shows everything around you at once: the swings, the slide, your friends, the sky, and the ground. A normal photo can't do that. A 360° panorama can.
The Concept (360° Panorama Generation): Creating a special wide image that wraps all the way around you, like peeling a sticker from around a globe and laying it flat.
- How it works (recipe):
- 1) Capture views in all directions. 2) Map the whole sphere onto a rectangle (called an "equirectangular projection"). 3) Blend everything so there are no visible edges.
- Why it matters: Without it, we miss context. You can't build immersive worlds or VR scenes from narrow peeks.
Anchor: Street View lets you look left, right, up, and down; it feels like being inside the photo.
The World Before: AI got very good at making photos and even videos, but most were narrow "window" views, like peeking through a keyhole. If you wanted a full wraparound scene for VR or for building 3D worlds, you needed a panorama. Turning a normal picture or video into a panorama sounds easy, but there was a catch: older methods demanded knowing exactly how the camera was set, namely its field of view (zoom) and its tilt/rotation. That information is often missing or wrong in real life.
Hook: You know how maps flatten a round Earth into a rectangle? That's what panoramas do too, and the stretching can cause weird edges.
The Concept (Equirectangular Projection, ERP): ERP is a way to lay the whole sphere (all directions) onto a flat rectangle.
- How it works:
- 1) Imagine the world as a sphere. 2) Use latitude/longitude to place every direction on a grid. 3) Paint pixels according to those directions.
- Why it matters: This format is standard for 360° images and videos, but the left and right edges should meet perfectly, or you'll see a seam.
Anchor: Think of a world map poster: you can see everything, but the left and right edges should line up.
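To make the mapping concrete, here is a minimal Python sketch of the ERP convention: a 3D viewing direction becomes a (column, row) position on the flat rectangle. The axis convention, function name, and image size are illustrative choices, not taken from the paper.

```python
import numpy as np

def direction_to_erp_pixel(d, width=2048, height=1024):
    """Map a unit 3D direction to (column, row) on an equirectangular image.

    Convention (an assumption for this sketch): x points right, y points up,
    z points forward; longitude sweeps left-right, latitude sweeps up-down.
    """
    x, y, z = d / np.linalg.norm(d)
    lon = np.arctan2(x, z)                   # -pi .. pi, 0 at the forward direction
    lat = np.arcsin(y)                       # -pi/2 .. pi/2, 0 at the horizon
    u = (lon / (2 * np.pi) + 0.5) * width    # column: wraps around at the edges
    v = (0.5 - lat / np.pi) * height         # row: 0 at the top (zenith)
    return u, v

# The forward direction lands in the middle of the panorama.
print(direction_to_erp_pixel(np.array([0.0, 0.0, 1.0])))  # ~ (1024, 512)
```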
The Problem: Previous systems said, "Project your small photo into ERP using the exact camera settings, then fill in the missing parts." But this falls apart if your camera info is unknown or even a little noisy. In the wild (think phone videos, vlog clips, drone shots), metadata is missing or messy. Models would place the input in the wrong spot, distort people, or draw a visible seam down the panorama.
Failed Attempts: People tried stronger geometric tools (special spherical convolutions, cubemaps, and careful projections). They also used tricks during sampling (rotating or blending) to hide seams. This helped a bit, but added complexity and still depended on fragile camera estimates.
Hook: You know how you can learn to place a puzzle piece by just looking at its edges and picture, not by measuring angles with a protractor?
The Concept (In-the-wild Learning): Train on lots of real, messy data so the model learns to handle variety, without hand-coded rules.
- How it works: 1) Feed the model diverse examples. 2) Let attention learn where pieces fit. 3) Reward outputs that look correct everywhere.
- Why it matters: The model doesn't crumble when metadata is missing or noisy.
Anchor: It's like learning to ride a bike on bumpy streets, so flat sidewalks later feel easy.
The Gap: We needed a method that (1) didn't require camera calibration, (2) still placed the input image/video correctly on the 360° canvas, and (3) produced truly seamless panoramas with no edge lines.
Real Stakes: If you post a phone clip, can AI expand it into a full VR scene? If a robot has one forward camera, can it imagine the rest of the room to plan better? If filmmakers have a normal shot, can they generate a wraparound set for previews? Can game level designers quickly build explorable worlds from a few photos? These use cases demand robust, calibration-free, seam-free panoramas.
Another Hidden Problem: Modern diffusion models work in a VAE's "latent" space, which uses convolution layers that often do zero-padding. For a panorama, zero-padding at the left and right edges breaks the "wraparound" nature, making a seam show up during training.
Hook: Imagine reading a sentence that wraps from the last word straight back to the first word with no pause. If someone inserts blank spaces at the edges, it ruins the flow.
The Concept (VAE and Zero-Padding): A VAE compresses images into a smaller code (latent) and then expands them back. Convolutions often add zeros at the edges (zero-padding) to keep sizes consistent.
- How it works:
- 1) Encode the image to latents. 2) Process with conv layers (with padding). 3) Decode back to pixels.
- Why it matters: If a wraparound image gets zeros at its borders, the "circle" is broken in the latent, and training then learns a built-in seam.
Anchor: Like cutting a hula hoop to store it flat: once cut, it's no longer a perfect circle.
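A small, self-contained demonstration of why this matters (a toy 1-D convolution, not the paper's actual VAE): a seamless panorama has no privileged left edge, so filtering should give the same result whether you rotate the panorama before or after the filter. Zero padding breaks this near the borders; circular padding preserves it.

```python
import torch
import torch.nn.functional as F

# A toy 1-D "panorama": one row of pixels that wraps around.
torch.manual_seed(0)
signal = torch.randn(1, 1, 64)        # (batch, channels, width)
kernel = torch.randn(1, 1, 5)         # stand-in for one conv layer of a VAE

def filter_with(pad_mode, x):
    return F.conv1d(F.pad(x, (2, 2), mode=pad_mode), kernel)

# Rotating the panorama (torch.roll) and then filtering should match
# filtering and then rotating, because the image has no real "edge".
shift = 7
for mode in ["constant", "circular"]:     # "constant" = zero padding
    a = filter_with(mode, signal.roll(shift, dims=-1))
    b = filter_with(mode, signal).roll(shift, dims=-1)
    mismatch = (a - b).abs().max().item()
    print(f"{mode:8s} padding, max mismatch after rotation: {mismatch:.4f}")
# Zero padding breaks the wraparound (large mismatch near the edges);
# circular padding keeps the filtering rotation-equivariant (mismatch ~0).
```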
This paper fixes both the "need camera info" problem and the "seam" problem, enabling robust, geometry-free 360° lifting for images and videos.
02 Core Idea
The Aha! Moment in one sentence: Treat the input picture/video and the 360° panorama as two sequences of tokens, let a diffusion transformer learn their relationship through attention, and fix seams at the source with circular latent encoding; no camera calibration needed.
Multiple Analogies:
- Puzzle Table: Pour both the input photo pieces and the big panorama pieces onto one table. A smart friend (the transformer) looks at all pieces at once and snaps them into place by matching colors and edges, no measuring tape required.
- Orchestra Conductor: Give the conductor (transformer) both the soloist's melody (input view) and the full symphony (panorama) on the same sheet. Attention lets the conductor keep the instruments in sync, with no extra footnotes about where to stand.
- Magnetic Fridge Poetry: Mix words from two sets (input and panorama) on the same fridge. The transformer's attention pulls the right words together to make a complete, smooth poem (a seamless 360° scene).
Hook: You know how you can understand a story better if you read all the sentences together instead of one at a time?
The Concept (Diffusion Transformers): A diffusion transformer is a model that denoises noisy images/videos by looking at every part and how the parts relate, using attention over sequences of tokens.
- How it works:
- 1) Turn images into latent tokens. 2) Add noise. 3) Step by step, the model removes noise while paying attention to all tokens.
- Why it matters: Attention lets the model learn placement and geometry from patterns in data, not from hand-coded geometry.
Anchor: Like restoring a blurry picture by repeatedly sharpening it while checking how all regions fit together.
Hook: Imagine lining up two trains of toy cars, one for the small input and one for the big panorama, and letting a smart robot look at both trains together to figure out where each car should go.
The Concept (Sequence Concatenation): Put the input view tokens and the target panorama tokens in one long line so attention can connect them.
- How it works:
- 1) Encode the input and the panorama to latents. 2) Concatenate the latent sequences. 3) Use attention to learn where the input belongs on the 360° canvas.
- Why it matters: No need to know field-of-view or camera tilt. The model learns to "place" the input correctly just from data.
Anchor: Like sliding your puzzle piece into the big picture by eye, without rulers or angles.
Hook: You know how if your photo is tilted, everything feels off? It's easier if every panorama is always upright.
The Concept (Canonical, Gravity-Aligned Panoramas): Always generate panoramas in a standard upright orientation, regardless of the input camera's tilt/roll.
- How it works:
- 1) Pre-align training videos so gravity points down. 2) Train the model to always output upright panoramas. 3) The model learns to infer the input's pose and place it in that upright frame.
- Why it matters: Consistent outputs are easier to learn and look natural, reducing distortions.
Anchor: Think of a world where every map is drawn with north at the top. It's easier to compare and navigate.
Hook: If the edges of a scarf need to meet perfectly when wrapped, don't sew blank cloth onto the borders.
The Concept (Circular Latent Encoding): Encode panoramas with circular padding so the latent representation wraps around cleanly, with no hard edges in training.
- How it works:
- 1) Before encoding, copy a small strip from the left edge to the right and from the right to the left (circular pad). 2) Encode with the VAE. 3) Drop the extra strips so the sequence length stays the same.
- Why it matters: This fixes the true cause of seams, edge discontinuities in the latent space, so the model learns seamless panoramas.
Anchor: Like moving the first and last words together before compressing a looping sentence so the loop stays smooth.
Before vs After:
- Before: Methods projected the input into ERP using precise camera settings and then outpainted. They also used inference-time tricks to hide seams.
- After: 360Anything learns placement from token attention (no camera info), outputs upright by design, and removes seams at the source with circular latent encoding.
Why It Works (intuition): Attention compares every token with every other token, so the model can match objects in the input view (e.g., the corner of a couch) with where those objects should appear on the 360° canvas, even when the camera's zoom or tilt changes. A consistent upright target removes ambiguity. And fixing the latent seam prevents the model from inheriting a baked-in edge.
Building Blocks:
- Diffusion Transformer backbone for images/videos.
- Sequence concatenation for geometry-free conditioning.
- Canonical (gravity-aligned) training targets for consistency.
- Circular Latent Encoding to eliminate seams during training.
- Camera-augmented crops during training so the model learns many FoVs and tilts, improving robustness.
Hook: How does it know how wide the camera sees and which way it's tilted, without being told?
The Concept (Zero-shot Camera FoV and Pose Estimation via the Panorama): After generating an upright panorama, you can search for the FoV and tilt that best re-project the panorama back to the original input.
- How it works:
- 1) Generate the panorama. 2) Try different FoV/tilt/roll combinations to reproject it back. 3) Pick the one that matches the input best.
- Why it matters: The model implicitly learns geometry; you can read it out by matching.
Anchor: Like guessing a camera's zoom by finding which setting reproduces the exact same picture from your full 360° scene.
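Below is a minimal brute-force version of this read-out idea: render candidate perspective crops from the generated panorama and keep the camera settings whose crop best matches the original input. The projection convention, search ranges, step sizes, and the mean-squared-error score are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def erp_to_perspective(pano, fov_deg, pitch_deg, roll_deg, out_hw=(64, 64)):
    """Render a pinhole view from an equirectangular panorama (nearest-neighbour sampling).

    The camera looks along +z at yaw 0; pitch tilts it up/down and roll twists it.
    Illustrative helper, not the paper's implementation.
    """
    Hp, Wp, _ = pano.shape
    H, W = out_hw
    f = 0.5 * W / np.tan(np.radians(fov_deg) / 2)            # focal length in pixels

    # Pixel grid -> camera-space rays.
    xs, ys = np.meshgrid(np.arange(W) - W / 2 + 0.5, np.arange(H) - H / 2 + 0.5)
    rays = np.stack([xs, -ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Apply roll (around z) first, then pitch (around x), to get world directions.
    cr, sr = np.cos(np.radians(roll_deg)), np.sin(np.radians(roll_deg))
    cp, sp = np.cos(np.radians(pitch_deg)), np.sin(np.radians(pitch_deg))
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    d = rays @ (Rx @ Rz).T

    # Directions -> ERP pixel coordinates (same convention as the earlier sketch).
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))
    u = ((lon / (2 * np.pi) + 0.5) * Wp).astype(int) % Wp
    v = np.clip(((0.5 - lat / np.pi) * Hp).astype(int), 0, Hp - 1)
    return pano[v, u]

def estimate_camera(pano, input_view):
    """Zero-shot FoV / pitch / roll: coarse grid search for the best reprojection match."""
    best, best_err = None, np.inf
    for fov in range(40, 121, 5):
        for pitch in range(-30, 31, 5):
            for roll in range(-15, 16, 5):
                pred = erp_to_perspective(pano, fov, pitch, roll, input_view.shape[:2])
                err = np.mean((pred.astype(float) - input_view.astype(float)) ** 2)
                if err < best_err:
                    best, best_err = (fov, pitch, roll), err
    return best

# Example with a synthetic panorama and a crop whose parameters we pretend not to know.
pano = np.random.rand(256, 512, 3)
input_view = erp_to_perspective(pano, fov_deg=70, pitch_deg=10, roll_deg=-5)
print(estimate_camera(pano, input_view))   # -> (70, 10, -5)
```

In practice a coarse-to-fine search or gradient-based refinement would replace the brute-force grid, but the principle is the same: the generated panorama already encodes the camera, so matching reads it out.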
03 Methodology
High-level overview: Perspective image or video + caption → VAE encode to latents → concatenate input and (noisy) panorama tokens → diffusion transformer denoises with attention → circular latent decoding → seamless, upright 360° panorama.
Step-by-step:
- Inputs and Canonicalization
- What happens: For training, panorama videos from the wild are first "straightened." The method stabilizes frame-to-frame rotations and aligns gravity to the vertical, so all targets are upright.
- Why this step exists: If targets tilt randomly, the model must learn many distortion patterns, making training harder and outputs wobblier.
- Example: A handheld 360° street video is stabilized so buildings stand straight in every frame. The model then always learns to paint upright city panoramas.
Hook: Like making sure every notebook page sits straight before you start copying notes.
The Concept (Gravity-Aligned Canonical Frame): Always use an upright target so the model learns one standard orientation.
- How it works:
- 1) Estimate per-frame pose. 2) Remove inter-frame rotations (stabilize). 3) Align gravity to vertical.
- Why it matters: Cleaner, more natural outputs and easier learning.
Anchor: Maps read better when north is always up. Same for panoramas.
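A minimal sketch of the alignment step, assuming some per-frame gravity estimate is available (for example from a pose estimator or IMU): compute the rotation that takes the estimated gravity direction onto the canonical "down" axis, then resample the panorama under that rotation. The numbers and variable names are illustrative.

```python
import numpy as np

def rotation_aligning(a, b):
    """Smallest rotation (3x3 matrix) taking unit vector a onto unit vector b (Rodrigues form)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):                      # opposite vectors: rotate 180 deg about any orthogonal axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

# Example: gravity was estimated as pointing slightly off-vertical in the capture frame.
g_estimated = np.array([0.05, -0.99, 0.10])      # hypothetical per-frame gravity estimate
g_canonical = np.array([0.0, -1.0, 0.0])         # "down" in the upright, gravity-aligned frame
R = rotation_aligning(g_estimated, g_canonical)

# R would be applied to every viewing direction of the panorama (and composed with the
# stabilizing inter-frame rotations) before resampling, so all training targets are upright.
print(np.round(R @ (g_estimated / np.linalg.norm(g_estimated)), 3))   # -> [0. -1. 0.]
```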
- Latent Encoding with Circular Continuity
- What happens: Convert images/videos into compact latent tokens with a VAE. For panoramas, apply Circular Latent Encoding: pad the left edge on the right and the right edge on the left before encoding; then drop the pad after.
- Why this step exists: Normal zero-padding breaks the wraparound nature, baking a seam into the latent target. Fixing it here removes seams at the root.
- Example: On a 2048-pixel-wide ERP image, copy a small 256-pixel slice from each edge to the opposite side before encoding; afterwards, trim the corresponding latent columns so the sequence length stays the same.
Hook: To keep a bracelet round, don't add gaps at the clasp when you measure it.
The Concept (Circular Latent Encoding): Preserve wraparound continuity in the latent space.
- How it works:
- 1) Circularly pad the panorama. 2) Encode with the VAE. 3) Drop the padded strips.
- Why it matters: Seams disappear because the model trains on seam-free latents.
Anchor: Like taping a loop of paper into a ring before flattening it, so the edges meet perfectly when restored.
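Here is a runnable sketch of Circular Latent Encoding with a single strided convolution standing in for the VAE encoder (the real pipeline uses a pretrained image/video VAE). The 256-pixel pad follows the example above; the 8x downsampling factor is an assumption typical of image VAEs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the VAE encoder: one strided conv with 8x spatial downsampling.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)

def circular_latent_encode(pano, pad_px=256, down=8):
    """Encode an ERP panorama with wraparound continuity preserved in the latent."""
    # 1) Circularly pad the width: copy a strip from each side to the opposite side.
    padded = F.pad(pano, (pad_px, pad_px, 0, 0), mode="circular")
    # 2) Encode the padded panorama.
    latent = encoder(padded)
    # 3) Drop the latent columns that came from the padding, so the width (and hence
    #    the transformer's sequence length) matches a normal, unpadded encoding.
    pad_lat = pad_px // down
    return latent[..., pad_lat:-pad_lat]

pano = torch.randn(1, 3, 1024, 2048)              # (batch, channels, height, width) ERP image
latent = circular_latent_encode(pano)
print(latent.shape)                               # torch.Size([1, 4, 128, 256])
```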
- Sequence Concatenation Conditioning
- What happens: Encode the perspective input (image or video) into latents and simply place those tokens before the (noisy) panorama tokens. Feed the whole long sequence into the diffusion transformer.
- Why this step exists: It lets attention naturally learn placement and geometry without needing explicit camera FoV or pose.
- Example: Tokens from the input frame showing "a couch and a window" attend to panorama tokens where those objects should be on the 360° canvas.
Hook: If all puzzle pieces are on one table, you can match them by sight without labels.
The Concept (Sequence Concatenation): Condition by lining up input and target tokens in a single sequence.
- How it works:
- 1) Encode both. 2) Concatenate. 3) Attend globally to align and outpaint.
- Why it matters: Works "in the wild" without fragile projections.
Anchor: Like lining up two trains so the couplers meet naturally.
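A minimal sketch of sequence-concatenation conditioning, with a generic Transformer encoder standing in for the diffusion transformer. The real model also injects the diffusion timestep, text conditioning, and positional information, all omitted here; the latent sizes are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for the diffusion transformer: a plain Transformer encoder over tokens.
d_model = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

# Latents for the conditioning view and the (noisy) panorama, flattened to token sequences.
input_latent = torch.randn(1, d_model, 16, 16)       # 16x16 latent for the input view
noisy_pano_latent = torch.randn(1, d_model, 32, 64)  # 32x64 latent for the ERP target

def to_tokens(latent):
    """(B, C, H, W) -> (B, H*W, C): one token per latent 'pixel'."""
    return latent.flatten(2).transpose(1, 2)

input_tokens = to_tokens(input_latent)            # (1, 256, 64)
pano_tokens = to_tokens(noisy_pano_latent)        # (1, 2048, 64)

# Sequence concatenation: the conditioning tokens simply sit in front of the panorama tokens,
# and full self-attention lets every panorama token look at every input token (and vice versa).
joint = torch.cat([input_tokens, pano_tokens], dim=1)     # (1, 2304, 64)
out = backbone(joint)

# Only the panorama part of the output is used as the denoising prediction.
pano_pred = out[:, input_tokens.shape[1]:, :]
print(pano_pred.shape)                            # torch.Size([1, 2048, 64])
```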
- Diffusion Transformer Denoising
- What happens: The model starts with noisy panorama tokens and repeatedly denoises them, guided by attention to the input tokens and the caption.
- Why this step exists: Diffusion provides high-quality, stable generation at large resolutions.
- Example: Over 50 steps, blurry walls become crisp, and the missing parts around the input view fill in realistically.
Hook: Restoring a foggy window by carefully wiping in small circles until you can see clearly.
The Concept (Diffusion Transformers): A transformer uses attention to remove noise step by step, guided by the input and the text.
- How it works:
- 1) Start noisy. 2) Attend across all tokens. 3) Predict and subtract noise repeatedly.
- Why it matters: Produces detailed, coherent panoramas and videos.
Anchor: Like polishing a dull stone into a shiny gem by checking all its facets each time.
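A schematic sampling loop in the rectified-flow style used by several recent diffusion transformers, again with a stand-in network. The actual sampler, noise schedule, timestep conditioning, and guidance are more involved than this sketch.

```python
import torch
import torch.nn as nn

# Stand-in denoiser: predicts a "velocity" for the panorama tokens given the joint sequence.
# (In the real model this is the diffusion transformer conditioned on input tokens and caption.)
d_model = 64
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

@torch.no_grad()
def sample_panorama(input_tokens, n_pano_tokens=2048, steps=50):
    """Schematic Euler sampler: x moves from pure noise (t=1) toward clean latents (t=0)
    along the predicted velocity field, conditioned on the input view by concatenation."""
    x = torch.randn(input_tokens.shape[0], n_pano_tokens, d_model)   # start from noise
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        joint = torch.cat([input_tokens, x], dim=1)        # condition via concatenation
        velocity = denoiser(joint)[:, input_tokens.shape[1]:, :]
        x = x + (t_next - t) * velocity                    # one Euler step toward t=0
    return x                                               # denoised panorama latents

# Small demo sizes so the sketch runs quickly; the real model uses many more tokens and ~50 steps.
input_tokens = torch.randn(1, 256, d_model)
pano_latents = sample_panorama(input_tokens, n_pano_tokens=512, steps=8)
print(pano_latents.shape)                                  # torch.Size([1, 512, 64])
```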
- Circular Latent Decoding
- What happens: Decode the denoised panorama tokens back into an ERP image or video. Because training targets were seam-free, outputs are seam-free too, with no need for blending tricks.
- Why this step exists: Get final pixels for viewing and for projecting to perspective windows.
- Example: The left and right edges align perfectly when wrapped, so VR viewers see no line.
Training Details that Boost Robustness:
- Camera augmentations: Randomly crop input views with many FoVs (30°-120°) and tilts/rolls. This teaches the model to place any view onto the 360° canvas (a parameter-sampling sketch follows this list).
- Text captions: Short prompts describe the scene; the model learns semantics that improve outpainting.
- Video motion: Use both simulated and real camera trajectories to handle complex in-the-wild motion and maintain upright panoramas over time.
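A tiny sketch of the camera-augmentation sampling mentioned above. The 30°-120° FoV range comes from the text; the yaw, pitch, and roll ranges are assumptions. Each sampled camera would drive a panorama-to-perspective warp (like the erp_to_perspective sketch earlier) to cut one training input view from a gravity-aligned panorama.

```python
import random

def sample_training_camera():
    """Randomly sample a virtual camera for cropping a perspective input view from a panorama."""
    return {
        "fov_deg": random.uniform(30.0, 120.0),    # zoom: narrow telephoto to wide angle
        "yaw_deg": random.uniform(-180.0, 180.0),  # look in any horizontal direction
        "pitch_deg": random.uniform(-45.0, 45.0),  # tilt up or down (assumed range)
        "roll_deg": random.uniform(-20.0, 20.0),   # twist the horizon (assumed range)
    }

print(sample_training_camera())
```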
Secret Sauce (why it feels simple yet strong):
- Replace geometry dependencies with attention over a joint token sequence.
- Replace inference-time seam band-aids with a principled latent-space fix.
- Replace random-orientation targets with a single upright frame that is easier to learn, scale, and use.
Hook: Can it build a 3D world afterwards?
The Concept (3D Gaussian Splatting from Panoramas): Use the generated 360° frames like many cameras around you to reconstruct a 3D scene for fly-throughs.
- How it works:
- 1) Estimate camera poses across frames. 2) Optimize many small 3D "Gaussians" to match the images. 3) Render from new views in real time.
- Why it matters: Shows the panoramas are geometrically consistent, not just pretty pictures.
Anchor: Like placing tiny glowing marbles in space until they reproduce your photos; then you can walk around them.
04 Experiments & Results
What they measured and why:
- Image quality on standard panorama datasets (Laval Indoor, SUN360) to test how well outpainted panoramas match the real world and the prompt.
- Video quality (sharpness, realism, smoothness) and faithfulness to the input view on the Argus benchmark split to see if motion stays consistent and upright.
- Seam artifacts with a discontinuity score (DS) to check for edge lines.
- Camera understanding (field-of-view and orientation) in a zero-shot setting to test implicit geometry learning.
- 3D scene reconstruction from generated panoramas to validate geometric consistency.
Competitors:
- Images: OmniDreamer, PanoDiffusion, Diffusion360, CubeDiff (previous SOTA).
- Videos: Imagine360, Argus, ViewPoint (all use geometry or assumptions; some even use ground-truth camera info).
Scoreboard with context:
- Images (Laval Indoor, SUN360):
  - FID ↓: 8.0 (Laval) and 22.4 (SUN360). This is like getting an A when others hover around the B to C range; lower is better.
  - KID ↓: 0.22 (Laval) and 1.27 (SUN360); again, lower is better and clearly competitive or best.
  - CLIP-FID ↓: 4.6 (Laval) and 7.3 (SUN360); near the top, slightly behind CubeDiff on Laval but ahead on SUN360.
  - FAED ↓ (full-panorama quality): 9.8 (Laval) and 3.8 (SUN360), roughly a 50% reduction vs. prior SOTA, meaning much better whole-panorama geometry and appearance.
  - CLIP-Score ↑: Best or near-best, showing strong text alignment.
- Videos (Argus split; real and simulated camera trajectories):
  - PSNR ↑ and LPIPS ↓ in the input-covered region: Higher PSNR and lower LPIPS than Imagine360, Argus, and ViewPoint, even though some baselines had ground-truth camera info. That's like drawing the input view more accurately than methods that got extra hints.
  - FVD ↓: 483 (real cams) and 433 (sim cams) vs. 844-1532 for baselines; a huge drop, like moving from a B- to an A+ in overall video realism and consistency.
  - VBench Imaging/Aesthetic/Motion ↑: Improvements across the board, showing nicer visuals and smoother motion.
- Seam Discontinuity (DS ↓):
  - Images: DS ~3.9 with Circular Latent Encoding vs. ~9.9 vanilla and ~5.3 with blended decoding. That's a clean removal of the edge line without post-hoc blurring.
  - Videos: DS ~13.3 with CLE vs. ~35.5 vanilla and ~19.8 blended, again much cleaner (a simple seam-measurement proxy is sketched after this list).
- Zero-shot Camera Understanding:
  - Field of view (FoV) error (degrees): mean ~3.9 (NYUv2), ~5.68 (ETH3D), ~5.21 (iBims-1), averaging ~4.93°, competitive with DUSt3R and MoGe, which are specialized.
  - Orientation (roll/pitch) error (degrees): 0.87/2.56 (MegaDepth) and 0.68/1.23 (LaMAR), only about 0.5° behind the current SOTA, GeoCalib.
- 3D Reconstruction:
  - Using generated panoramas to train a 3D Gaussian Splatting model yields navigable 3D scenes, indicating strong geometry consistency beyond just pretty pixels.
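For intuition about what a seam-discontinuity score measures, here is a toy seam check; it is an illustrative proxy only, not necessarily the paper's exact DS definition.

```python
import numpy as np

def seam_discontinuity(pano):
    """Toy seam proxy: how different are the left-most and right-most pixel columns,
    relative to the typical difference between neighbouring columns inside the image?"""
    edge_gap = np.abs(pano[:, 0].astype(float) - pano[:, -1].astype(float)).mean()
    interior_gap = np.abs(np.diff(pano.astype(float), axis=1)).mean()
    return edge_gap / (interior_gap + 1e-8)

# For a seamlessly wrapping image the ratio stays small (the wrap looks like any other
# column boundary); a visible vertical line at the wrap point makes it much larger.
smooth = np.tile(np.sin(np.linspace(0, 2 * np.pi, 512, endpoint=False)), (256, 1))
broken = smooth.copy()
broken[:, -1] += 1.0                              # paint an artificial seam at the right edge
print(seam_discontinuity(smooth), seam_discontinuity(broken))
```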
Surprising findings:
- Even without explicit camera metadata, the model preserves the input region better (higher PSNR, lower LPIPS) than methods that carefully unproject the input using ground-truth cameras.
- Training with random camera FoVs and tilts made the model stronger even on the standard test setting (fixed 90°). Variety teaches robust geometry.
- Making targets upright (canonical) reduces distortions and boosts realism, even if it slightly lowers pixel-match scores like PSNR in some ablations.
- Fixing seam artifacts at the VAE latent stage beats inference-time band-aids and adds no extra compute at sampling.
05 Discussion & Limitations
Limitations (be specific):
- Long videos: Panorama frames are big (ERP packs a whole sphere). With current compute, the model handles about 81 frames; longer stories need either more memory or an autoregressive extension.
- Complex physics/dynamics: Scenes with fast, intricate physics (e.g., splashing liquids, crowds) can still challenge fidelity and temporal stability.
- Dataset biases: Training on internet 360° content can introduce habits (e.g., occasional black borders or tripod-like objects at the bottom), which may appear in outputs.
- Upsampling: Off-the-shelf video upscalers, built for perspective views, can re-introduce ERP seams or distort spherical structure; panorama-aware upscalers are needed.
Required Resources:
- A pre-trained image/video diffusion transformer (e.g., FLUX, Wan) and a VAE.
- GPU memory to handle long token sequences at high ERP resolution (e.g., 1024×2048 images; 512×1024 videos with many frames).
- Optional pre-processing for training videos to canonicalize gravity and stabilize motion.
When NOT to Use:
- If you need exact, metric-accurate geometry for measurement (e.g., surveying), a calibration-free generative model is not a substitute for precise SfM/SLAM.
- Real-time use on edge devices: full diffusion transformers at panorama resolutions may be too heavy without distillation.
- Highly dynamic, deforming scenes where outpainting beyond input evidence could hallucinate unacceptable content.
Open Questions:
- Can we extend context length and temporal horizon efficiently (e.g., causal/AR diffusion transformers) for minute-long 360° videos?
- How can we build panorama-aware super-resolution that preserves ERP continuity without seams or distortions?
- Can we integrate lightweight, uncertainty-aware camera hints when available without re-introducing brittleness?
- How can we further reduce bias (e.g., tripod artifacts) via data curation or targeted fine-tuning while keeping generality?
- Can this geometry-free conditioning help other tasks (e.g., multi-view 3D training) without explicit calibration?
06 Conclusion & Future Work
Three-sentence summary: 360Anything turns ordinary photos and videos into seamless, gravity-aligned 360° panoramas without needing camera calibration, by treating both input and output as token sequences that a diffusion transformer aligns with attention. It fixes the root cause of panorama seams by introducing Circular Latent Encoding, so the VAE's latent space wraps around cleanly. The result is state-of-the-art image and video quality, competitive zero-shot camera estimation, and panoramas consistent enough for 3D reconstruction.
Main achievement: Replacing fragile geometry dependencies with simple, scalable sequence concatenation plus a principled seam fix, enabling robust, calibration-free 360° lifting in the wild.
Future directions:
- Longer, higher-resolution 360° videos via efficient or autoregressive diffusion transformers.
- Panorama-aware upsampling that keeps the ERP structure seamless at 4K/8K resolutions.
- Hybrid approaches that can optionally accept weak camera hints without losing robustness when hints are missing.
- Stronger temporal world models for complex dynamics and physics.
Why remember this: It shows that attention over joint token sequences can learn camera geometry implicitly, and that fixing seams at their true source (the latent encoding) is cleaner and stronger than patching at inference time, pointing toward simpler, more scalable, and more reliable generative world building.
Practical Applications
- Turn a phone photo into a VR-ready 360° background for virtual tours or classrooms.
- Outpaint a forward-facing drone video into a full wraparound scene for safer navigation planning.
- Create quick 360° location previews for film sets from a few scouting shots.
- Generate panoramic backdrops for game levels from concept art or reference images.
- Produce immersive real estate walkthroughs from simple room videos without special cameras.
- Expand POV sports clips into 360° highlights for replay and coaching analysis.
- Assist AR apps by imagining the unseen surroundings to place stable, context-aware overlays.
- Build 3D scene reconstructions (via 3D Gaussian Splatting) from generated panoramas for free-view exploration.
- Enhance travel and museum apps with instant 360° environments built from normal visitor photos.
- Automate panorama completion in photo editors without manual camera calibration.