360Anything: Geometry-Free Lifting of Images and Videos to 360°
Key Summary
- This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera's settings, such as field of view or tilt.
- The trick is to treat both the input picture and the target panorama as sequences of tokens and let a diffusion transformer learn how they relate; no geometry formulas required.
- They fix the common "seam line" in panoramas by discovering it comes from zero-padding inside the VAE encoder and replacing it with Circular Latent Encoding.
- The method learns to place the input view correctly on a gravity-aligned (upright) 360° canvas, even when the camera was tilted or zoomed.
- On standard image tests (Laval Indoor and SUN360), it beats prior methods on most quality metrics and ties or nearly ties on the rest.
- On videos, it strongly improves sharpness, smoothness, and overall realism, with much lower FVD and better PSNR/LPIPS than methods that even use ground-truth camera info.
- It can guess camera field-of-view and orientation from a single image competitively with specialized supervised methods, zero-shot.
- The panoramas are consistent enough to reconstruct 3D scenes using 3D Gaussian Splatting for free-view exploration.
- Because it avoids camera calibration, this approach works better "in the wild" on casual photos and videos from phones, drones, or the internet.
- The main ideas (sequence concatenation, gravity-aligned training, and circular latent encoding) are simple, scalable, and remove fragile dependencies.
Why This Research Matters
This work lets anyone lift ordinary photos and videos into full, VR-ready panoramas without fiddling with camera calibration. That means creators, educators, and app developers can build immersive environments from everyday media, not just special 360° rigs. Robots and AR devices can imagine what's outside their narrow view to plan or overlay information more safely. Filmmakers and game studios can quickly generate wraparound worlds from scouting shots or concept clips. Because seams are fixed at the source, outputs look professional without post-processing hacks. Overall, it lowers the barrier to high-quality 360° content and makes world-building more robust and scalable.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're standing in the middle of a playground and want a picture that shows everything around you at once: the swings, the slide, your friends, the sky, and the ground. A normal photo can't do that. A 360° panorama can.
The Concept (360° Panorama Generation): Creating a special wide image that wraps all the way around you, like peeling a sticker from around a globe and laying it flat.
- How it works (recipe):
- 1) Capture views in all directions. 2) Map the whole sphere onto a rectangle (called an "equirectangular projection"). 3) Blend everything so there are no visible edges.
- Why it matters: Without it, we miss context. You can't build immersive worlds or VR scenes from narrow peeks.
Anchor: Street View lets you look left, right, up, and down; it feels like being inside the photo.
The World Before: AI got very good at making photos and even videos, but most were narrow "window" views, like peeking through a keyhole. If you wanted a full wraparound scene for VR or for building 3D worlds, you needed a panorama. Turning a normal picture or video into a panorama sounds easy, but there was a catch: older methods demanded knowing exactly how the camera was set, namely its field of view (zoom) and its tilt/rotation. That information is often missing or wrong in real life.
Hook: You know how maps flatten a round Earth into a rectangle? That's what panoramas do too, and the stretching can cause weird edges.
The Concept (Equirectangular Projection, ERP): ERP is a way to lay the whole sphere (all directions) onto a flat rectangle.
- How it works:
- 1) Imagine the world as a sphere. 2) Use latitude/longitude to place every direction on a grid. 3) Paint pixels according to those directions.
- Why it matters: This format is standard for 360° images and videos, but the left and right edges should meet perfectly, or you'll see a seam.
Anchor: Think of a world map poster: you can see everything, but the left and right edges should line up.
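To make the mapping concrete, here is a minimal Python sketch of the ERP convention: a 3D viewing direction becomes a (column, row) position on the flat rectangle. The axis convention, function name, and image size are illustrative choices, not taken from the paper.

```python
import numpy as np

def direction_to_erp_pixel(d, width=2048, height=1024):
    """Map a unit 3D direction to (column, row) on an equirectangular image.

    Convention (an assumption for this sketch): x points right, y points up,
    z points forward; longitude sweeps left-right, latitude sweeps up-down.
    """
    x, y, z = d / np.linalg.norm(d)
    lon = np.arctan2(x, z)                   # -pi .. pi, 0 at the forward direction
    lat = np.arcsin(y)                       # -pi/2 .. pi/2, 0 at the horizon
    u = (lon / (2 * np.pi) + 0.5) * width    # column: wraps around at the edges
    v = (0.5 - lat / np.pi) * height         # row: 0 at the top (zenith)
    return u, v

# The forward direction lands in the middle of the panorama.
print(direction_to_erp_pixel(np.array([0.0, 0.0, 1.0])))  # ~ (1024, 512)
```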
The Problem: Previous systems said, "Project your small photo into ERP using the exact camera settings, then fill in the missing parts." But this falls apart if your camera info is unknown or even a little noisy. In the wild (think phone videos, vlog clips, drone shots), metadata is missing or messy. Models would place the input in the wrong spot, distort people, or draw a visible seam down the panorama.
Failed Attempts: People tried stronger geometric tools (special spherical convolutions, cubemaps, and careful projections). They also used tricks during sampling (rotating or blending) to hide seams. This helped a bit, but added complexity and still depended on fragile camera estimates.
Hook: You know how you can learn to place a puzzle piece by just looking at its edges and picture, not by measuring angles with a protractor?
The Concept (In-the-wild Learning): Train on lots of real, messy data so the model learns to handle variety, without hand-coded rules.
- How it works: 1) Feed the model diverse examples. 2) Let attention learn where pieces fit. 3) Reward outputs that look correct everywhere.
- Why it matters: The model doesn't crumble when metadata is missing or noisy.
Anchor: It's like learning to ride a bike on bumpy streets, so flat sidewalks later feel easy.
The Gap: We needed a method that (1) didn't require camera calibration, (2) still placed the input image/video correctly on the 360° canvas, and (3) produced truly seamless panoramas with no edge lines.
Real Stakes: If you post a phone clip, can AI expand it into a full VR scene? If a robot has one forward camera, can it imagine the rest of the room to plan better? If filmmakers have a normal shot, can they generate a wraparound set for previews? Can game level designers quickly build explorable worlds from a few photos? These use cases demand robust, calibration-free, seam-free panoramas.
Another Hidden Problem: Modern diffusion models work in a VAE's "latent" space, which uses convolution layers that often do zero-padding. For a panorama, zero-padding at the left and right edges breaks the "wraparound" nature, making a seam show up during training.
Hook: Imagine reading a sentence that wraps from the last word straight back to the first word with no pause. If someone inserts blank spaces at the edges, it ruins the flow.
The Concept (VAE and Zero-Padding): A VAE compresses images into a smaller code (latent) and then expands them back. Convolutions often add zeros at the edges (zero-padding) to keep sizes consistent.
- How it works:
- 1) Encode the image to latents. 2) Process with conv layers (with padding). 3) Decode back to pixels.
- Why it matters: If a wraparound image gets zeros at its borders, the "circle" is broken in the latent, and training then learns a built-in seam.
Anchor: Like cutting a hula hoop to store it flat: once cut, it's no longer a perfect circle.
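A small, self-contained demonstration of why this matters (a toy 1-D convolution, not the paper's actual VAE): a seamless panorama has no privileged left edge, so filtering should give the same result whether you rotate the panorama before or after the filter. Zero padding breaks this near the borders; circular padding preserves it.

```python
import torch
import torch.nn.functional as F

# A toy 1-D "panorama": one row of pixels that wraps around.
torch.manual_seed(0)
signal = torch.randn(1, 1, 64)        # (batch, channels, width)
kernel = torch.randn(1, 1, 5)         # stand-in for one conv layer of a VAE

def filter_with(pad_mode, x):
    return F.conv1d(F.pad(x, (2, 2), mode=pad_mode), kernel)

# Rotating the panorama (torch.roll) and then filtering should match
# filtering and then rotating, because the image has no real "edge".
shift = 7
for mode in ["constant", "circular"]:     # "constant" = zero padding
    a = filter_with(mode, signal.roll(shift, dims=-1))
    b = filter_with(mode, signal).roll(shift, dims=-1)
    mismatch = (a - b).abs().max().item()
    print(f"{mode:8s} padding, max mismatch after rotation: {mismatch:.4f}")
# Zero padding breaks the wraparound (large mismatch near the edges);
# circular padding keeps the filtering rotation-equivariant (mismatch ~0).
```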
This paper fixes both the "need camera info" problem and the "seam" problem, enabling robust, geometry-free 360° lifting for images and videos.
02 Core Idea
The Aha! Moment in one sentence: Treat the input picture/video and the 360° panorama as two sequences of tokens, let a diffusion transformer learn their relationship through attention, and fix seams at the source with circular latent encoding; no camera calibration needed.
Multiple Analogies:
- Puzzle Table: Pour both the input photo pieces and the big panorama pieces onto one table. A smart friend (the transformer) looks at all pieces at once and snaps them into place by matching colors and edges, no measuring tape required.
- Orchestra Conductor: Give the conductor (transformer) both the soloist's melody (input view) and the full symphony (panorama) on the same sheet. Attention lets the conductor keep the instruments in sync, with no extra footnotes about where to stand.
- Magnetic Fridge Poetry: Mix words from two sets (input and panorama) on the same fridge. The transformer's attention pulls the right words together to make a complete, smooth poem (a seamless 360° scene).
Hook: You know how you can understand a story better if you read all the sentences together instead of one at a time?
The Concept (Diffusion Transformers): A diffusion transformer is a model that denoises noisy images/videos by looking at every part and how the parts relate, using attention over sequences of tokens.
- How it works:
- 1) Turn images into latent tokens. 2) Add noise. 3) Step by step, the model removes noise while paying attention to all tokens.
- Why it matters: Attention lets the model learn placement and geometry from patterns in data, not from hand-coded geometry.
Anchor: Like restoring a blurry picture by repeatedly sharpening it while checking how all regions fit together.
Hook: Imagine lining up two trains of toy cars, one for the small input and one for the big panorama, and letting a smart robot look at both trains together to figure out where each car should go.
The Concept (Sequence Concatenation): Put the input view tokens and the target panorama tokens in one long line so attention can connect them.
- How it works:
- 1) Encode the input and the panorama to latents. 2) Concatenate the latent sequences. 3) Use attention to learn where the input belongs on the 360° canvas.
- Why it matters: No need to know field-of-view or camera tilt. The model learns to "place" the input correctly just from data.
Anchor: Like sliding your puzzle piece into the big picture by eye, without rulers or angles.
Hook: You know how if your photo is tilted, everything feels off? It's easier if every panorama is always upright.
The Concept (Canonical, Gravity-Aligned Panoramas): Always generate panoramas in a standard upright orientation, regardless of the input camera's tilt/roll.
- How it works:
- 1) Pre-align training videos so gravity points down. 2) Train the model to always output upright panoramas. 3) The model learns to infer the input's pose and place it in that upright frame.
- Why it matters: Consistent outputs are easier to learn and look natural, reducing distortions.
Anchor: Think of a world where every map is drawn with north at the top. It's easier to compare and navigate.
Hook: If the edges of a scarf need to meet perfectly when wrapped, don't sew blank cloth onto the borders.
The Concept (Circular Latent Encoding): Encode panoramas with circular padding so the latent representation wraps around cleanly, with no hard edges in training.
- How it works:
- 1) Before encoding, copy a small strip from the left edge to the right and from the right to the left (circular pad). 2) Encode with the VAE. 3) Drop the extra strips so the sequence length stays the same.
- Why it matters: This fixes the true cause of seams, edge discontinuities in the latent space, so the model learns seamless panoramas.
Anchor: Like moving the first and last words together before compressing a looping sentence so the loop stays smooth.
Before vs After:
- Before: Methods projected the input into ERP using precise camera settings and then outpainted. They also used inference-time tricks to hide seams.
- After: 360Anything learns placement from token attention (no camera info), outputs upright by design, and removes seams at the source with circular latent encoding.
Why It Works (intuition): Attention compares every token with every other token, so the model can match objects in the input view (e.g., the corner of a couch) with where those objects should appear on the 360° canvas, even when the camera's zoom or tilt changes. A consistent upright target removes ambiguity. And fixing the latent seam prevents the model from inheriting a baked-in edge.
Building Blocks:
- Diffusion Transformer backbone for images/videos.
- Sequence concatenation for geometry-free conditioning.
- Canonical (gravity-aligned) training targets for consistency.
- Circular Latent Encoding to eliminate seams during training.
- Camera-augmented crops during training so the model learns many FoVs and tilts, improving robustness.
Hook: How does it know how wide the camera sees and which way it's tilted, without being told?
The Concept (Zero-shot Camera FoV and Pose Estimation via the Panorama): After generating an upright panorama, you can search for the FoV and tilt that best re-project the panorama back to the original input.
- How it works:
- 1) Generate the panorama. 2) Try different FoV/tilt/roll combinations to reproject it back. 3) Pick the one that matches the input best.
- Why it matters: The model implicitly learns geometry; you can read it out by matching.
Anchor: Like guessing a camera's zoom by finding which setting reproduces the exact same picture from your full 360° scene.
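Below is a minimal brute-force version of this read-out idea: render candidate perspective crops from the generated panorama and keep the camera settings whose crop best matches the original input. The projection convention, search ranges, step sizes, and the mean-squared-error score are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def erp_to_perspective(pano, fov_deg, pitch_deg, roll_deg, out_hw=(64, 64)):
    """Render a pinhole view from an equirectangular panorama (nearest-neighbour sampling).

    The camera looks along +z at yaw 0; pitch tilts it up/down and roll twists it.
    Illustrative helper, not the paper's implementation.
    """
    Hp, Wp, _ = pano.shape
    H, W = out_hw
    f = 0.5 * W / np.tan(np.radians(fov_deg) / 2)            # focal length in pixels

    # Pixel grid -> camera-space rays.
    xs, ys = np.meshgrid(np.arange(W) - W / 2 + 0.5, np.arange(H) - H / 2 + 0.5)
    rays = np.stack([xs, -ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Apply roll (around z) first, then pitch (around x), to get world directions.
    cr, sr = np.cos(np.radians(roll_deg)), np.sin(np.radians(roll_deg))
    cp, sp = np.cos(np.radians(pitch_deg)), np.sin(np.radians(pitch_deg))
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    d = rays @ (Rx @ Rz).T

    # Directions -> ERP pixel coordinates (same convention as the earlier sketch).
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))
    u = ((lon / (2 * np.pi) + 0.5) * Wp).astype(int) % Wp
    v = np.clip(((0.5 - lat / np.pi) * Hp).astype(int), 0, Hp - 1)
    return pano[v, u]

def estimate_camera(pano, input_view):
    """Zero-shot FoV / pitch / roll: coarse grid search for the best reprojection match."""
    best, best_err = None, np.inf
    for fov in range(40, 121, 5):
        for pitch in range(-30, 31, 5):
            for roll in range(-15, 16, 5):
                pred = erp_to_perspective(pano, fov, pitch, roll, input_view.shape[:2])
                err = np.mean((pred.astype(float) - input_view.astype(float)) ** 2)
                if err < best_err:
                    best, best_err = (fov, pitch, roll), err
    return best

# Example with a synthetic panorama and a crop whose parameters we pretend not to know.
pano = np.random.rand(256, 512, 3)
input_view = erp_to_perspective(pano, fov_deg=70, pitch_deg=10, roll_deg=-5)
print(estimate_camera(pano, input_view))   # -> (70, 10, -5)
```

In practice a coarse-to-fine search or gradient-based refinement would replace the brute-force grid, but the principle is the same: the generated panorama already encodes the camera, so matching reads it out.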
03 Methodology
High-level overview: Perspective image or video + caption → VAE encode to latents → concatenate input and (noisy) panorama tokens → diffusion transformer denoises with attention → circular latent decoding → seamless, upright 360° panorama.
Step-by-step:
- Inputs and Canonicalization
- What happens: For training, panorama videos from the wild are first "straightened." The method stabilizes frame-to-frame rotations and aligns gravity to the vertical, so all targets are upright.
- Why this step exists: If targets tilt randomly, the model must learn many distortion patterns, making training harder and outputs wobblier.
- Example: A handheld 360° street video is stabilized so buildings stand straight in every frame. The model then always learns to paint upright city panoramas.
Hook: Like making sure every notebook page sits straight before you start copying notes.
The Concept (Gravity-Aligned Canonical Frame): Always use an upright target so the model learns one standard orientation.
- How it works:
- 1) Estimate per-frame pose. 2) Remove inter-frame rotations (stabilize). 3) Align gravity to vertical.
- Why it matters: Cleaner, more natural outputs and easier learning.
Anchor: Maps read better when north is always up. Same for panoramas.
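A minimal sketch of the alignment step, assuming some per-frame gravity estimate is available (for example from a pose estimator or IMU): compute the rotation that takes the estimated gravity direction onto the canonical "down" axis, then resample the panorama under that rotation. The numbers and variable names are illustrative.

```python
import numpy as np

def rotation_aligning(a, b):
    """Smallest rotation (3x3 matrix) taking unit vector a onto unit vector b (Rodrigues form)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):                      # opposite vectors: rotate 180 deg about any orthogonal axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

# Example: gravity was estimated as pointing slightly off-vertical in the capture frame.
g_estimated = np.array([0.05, -0.99, 0.10])      # hypothetical per-frame gravity estimate
g_canonical = np.array([0.0, -1.0, 0.0])         # "down" in the upright, gravity-aligned frame
R = rotation_aligning(g_estimated, g_canonical)

# R would be applied to every viewing direction of the panorama (and composed with the
# stabilizing inter-frame rotations) before resampling, so all training targets are upright.
print(np.round(R @ (g_estimated / np.linalg.norm(g_estimated)), 3))   # -> [0. -1. 0.]
```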
- Latent Encoding with Circular Continuity
- What happens: Convert images/videos into compact latent tokens with a VAE. For panoramas, apply Circular Latent Encoding: pad the left edge on the right and the right edge on the left before encoding; then drop the pad after.
- Why this step exists: Normal zero-padding breaks the wraparound nature, baking a seam into the latent target. Fixing it here removes seams at the root.
- Example: On a 2048-pixel-wide ERP image, copy a small 256-pixel slice from each edge to the opposite side before encoding; afterwards, trim the corresponding latent columns so the sequence length stays the same.
Hook: To keep a bracelet round, don't add gaps at the clasp when you measure it.
The Concept (Circular Latent Encoding): Preserve wraparound continuity in the latent space.
- How it works:
- 1) Circularly pad the panorama. 2) Encode with the VAE. 3) Drop the padded strips.
- Why it matters: Seams disappear because the model trains on seam-free latents.
Anchor: Like taping a loop of paper into a ring before flattening it, so the edges meet perfectly when restored.
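Here is a runnable sketch of Circular Latent Encoding with a single strided convolution standing in for the VAE encoder (the real pipeline uses a pretrained image/video VAE). The 256-pixel pad follows the example above; the 8x downsampling factor is an assumption typical of image VAEs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the VAE encoder: one strided conv with 8x spatial downsampling.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)

def circular_latent_encode(pano, pad_px=256, down=8):
    """Encode an ERP panorama with wraparound continuity preserved in the latent."""
    # 1) Circularly pad the width: copy a strip from each side to the opposite side.
    padded = F.pad(pano, (pad_px, pad_px, 0, 0), mode="circular")
    # 2) Encode the padded panorama.
    latent = encoder(padded)
    # 3) Drop the latent columns that came from the padding, so the width (and hence
    #    the transformer's sequence length) matches a normal, unpadded encoding.
    pad_lat = pad_px // down
    return latent[..., pad_lat:-pad_lat]

pano = torch.randn(1, 3, 1024, 2048)              # (batch, channels, height, width) ERP image
latent = circular_latent_encode(pano)
print(latent.shape)                               # torch.Size([1, 4, 128, 256])
```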
- Sequence Concatenation Conditioning
- What happens: Encode the perspective input (image or video) into latents and simply place those tokens before the (noisy) panorama tokens. Feed the whole long sequence into the diffusion transformer.
- Why this step exists: It lets attention naturally learn placement and geometry without needing explicit camera FoV or pose.
- Example: Tokens from the input frame showing "a couch and a window" attend to panorama tokens where those objects should be on the 360° canvas.
Hook: If all puzzle pieces are on one table, you can match them by sight without labels.
The Concept (Sequence Concatenation): Condition by lining up input and target tokens in a single sequence.
- How it works:
- 1) Encode both. 2) Concatenate. 3) Attend globally to align and outpaint.
- Why it matters: Works "in the wild" without fragile projections.
Anchor: Like lining up two trains so the couplers meet naturally.
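A minimal sketch of sequence-concatenation conditioning, with a generic Transformer encoder standing in for the diffusion transformer. The real model also injects the diffusion timestep, text conditioning, and positional information, all omitted here; the latent sizes are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for the diffusion transformer: a plain Transformer encoder over tokens.
d_model = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

# Latents for the conditioning view and the (noisy) panorama, flattened to token sequences.
input_latent = torch.randn(1, d_model, 16, 16)       # 16x16 latent for the input view
noisy_pano_latent = torch.randn(1, d_model, 32, 64)  # 32x64 latent for the ERP target

def to_tokens(latent):
    """(B, C, H, W) -> (B, H*W, C): one token per latent 'pixel'."""
    return latent.flatten(2).transpose(1, 2)

input_tokens = to_tokens(input_latent)            # (1, 256, 64)
pano_tokens = to_tokens(noisy_pano_latent)        # (1, 2048, 64)

# Sequence concatenation: the conditioning tokens simply sit in front of the panorama tokens,
# and full self-attention lets every panorama token look at every input token (and vice versa).
joint = torch.cat([input_tokens, pano_tokens], dim=1)     # (1, 2304, 64)
out = backbone(joint)

# Only the panorama part of the output is used as the denoising prediction.
pano_pred = out[:, input_tokens.shape[1]:, :]
print(pano_pred.shape)                            # torch.Size([1, 2048, 64])
```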
- Diffusion Transformer Denoising
- What happens: The model starts with noisy panorama tokens and repeatedly denoises them, guided by attention to the input tokens and the caption.
- Why this step exists: Diffusion provides high-quality, stable generation at large resolutions.
- Example: Over 50 steps, blurry walls become crisp, and the missing parts around the input view fill in realistically.
Hook: Restoring a foggy window by carefully wiping in small circles until you can see clearly.
The Concept (Diffusion Transformers): A transformer uses attention to remove noise step by step, guided by the input and the text.
- How it works:
- 1) Start noisy. 2) Attend across all tokens. 3) Predict and subtract noise repeatedly.
- Why it matters: Produces detailed, coherent panoramas and videos.
Anchor: Like polishing a dull stone into a shiny gem by checking all its facets each time.
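A schematic sampling loop in the rectified-flow style used by several recent diffusion transformers, again with a stand-in network. The actual sampler, noise schedule, timestep conditioning, and guidance are more involved than this sketch.

```python
import torch
import torch.nn as nn

# Stand-in denoiser: predicts a "velocity" for the panorama tokens given the joint sequence.
# (In the real model this is the diffusion transformer conditioned on input tokens and caption.)
d_model = 64
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

@torch.no_grad()
def sample_panorama(input_tokens, n_pano_tokens=2048, steps=50):
    """Schematic Euler sampler: x moves from pure noise (t=1) toward clean latents (t=0)
    along the predicted velocity field, conditioned on the input view by concatenation."""
    x = torch.randn(input_tokens.shape[0], n_pano_tokens, d_model)   # start from noise
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        joint = torch.cat([input_tokens, x], dim=1)        # condition via concatenation
        velocity = denoiser(joint)[:, input_tokens.shape[1]:, :]
        x = x + (t_next - t) * velocity                    # one Euler step toward t=0
    return x                                               # denoised panorama latents

# Small demo sizes so the sketch runs quickly; the real model uses many more tokens and ~50 steps.
input_tokens = torch.randn(1, 256, d_model)
pano_latents = sample_panorama(input_tokens, n_pano_tokens=512, steps=8)
print(pano_latents.shape)                                  # torch.Size([1, 512, 64])
```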
- Circular Latent Decoding
- What happens: Decode the denoised panorama tokens back into an ERP image or video. Because training targets were seam-free, outputs are seam-free too, with no need for blending tricks.
- Why this step exists: Get final pixels for viewing and for projecting to perspective windows.
- Example: The left and right edges align perfectly when wrapped, so VR viewers see no line.
Training Details that Boost Robustness:
- Camera augmentations: Randomly crop input views with many FoVs (30°-120°) and tilts/rolls. This teaches the model to place any view onto the 360° canvas (a parameter-sampling sketch follows this list).
- Text captions: Short prompts describe the scene; the model learns semantics that improve outpainting.
- Video motion: Use both simulated and real camera trajectories to handle complex in-the-wild motion and maintain upright panoramas over time.
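A tiny sketch of the camera-augmentation sampling mentioned above. The 30°-120° FoV range comes from the text; the yaw, pitch, and roll ranges are assumptions. Each sampled camera would drive a panorama-to-perspective warp (like the erp_to_perspective sketch earlier) to cut one training input view from a gravity-aligned panorama.

```python
import random

def sample_training_camera():
    """Randomly sample a virtual camera for cropping a perspective input view from a panorama."""
    return {
        "fov_deg": random.uniform(30.0, 120.0),    # zoom: narrow telephoto to wide angle
        "yaw_deg": random.uniform(-180.0, 180.0),  # look in any horizontal direction
        "pitch_deg": random.uniform(-45.0, 45.0),  # tilt up or down (assumed range)
        "roll_deg": random.uniform(-20.0, 20.0),   # twist the horizon (assumed range)
    }

print(sample_training_camera())
```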
Secret Sauce (why it feels simple yet strong):
- Replace geometry dependencies with attention over a joint token sequence.
- Replace inference-time seam band-aids with a principled latent-space fix.
- Replace random-orientation targets with a single upright frame that is easier to learn, scale, and use.
Hook: Can it build a 3D world afterwards?
The Concept (3D Gaussian Splatting from Panoramas): Use the generated 360° frames like many cameras around you to reconstruct a 3D scene for fly-throughs.
- How it works:
- 1) Estimate camera poses across frames. 2) Optimize many small 3D "Gaussians" to match the images. 3) Render from new views in real time.
- Why it matters: Shows the panoramas are geometrically consistent, not just pretty pictures.
Anchor: Like placing tiny glowing marbles in space until they reproduce your photos; then you can walk around them.
04 Experiments & Results
What they measured and why:
- Image quality on standard panorama datasets (Laval Indoor, SUN360) to test how well outpainted panoramas match the real world and the prompt.
- Video quality (sharpness, realism, smoothness) and faithfulness to the input view on the Argus benchmark split to see if motion stays consistent and upright.
- Seam artifacts with a discontinuity score (DS) to check for edge lines.
- Camera understanding (field-of-view and orientation) in a zero-shot setting to test implicit geometry learning.
- 3D scene reconstruction from generated panoramas to validate geometric consistency.
Competitors:
- Images: OmniDreamer, PanoDiffusion, Diffusion360, CubeDiff (previous SOTA).
- Videos: Imagine360, Argus, ViewPoint (all use geometry or assumptions; some even use ground-truth camera info).
Scoreboard with context:
- Images (Laval Indoor, SUN360):
  - FID ↓: 8.0 (Laval) and 22.4 (SUN360). This is like getting an A when others hover around the B to C range; lower is better.
  - KID ↓: 0.22 (Laval) and 1.27 (SUN360); again, lower is better and clearly competitive or best.
  - CLIP-FID ↓: 4.6 (Laval) and 7.3 (SUN360); near the top, slightly behind CubeDiff on Laval but ahead on SUN360.
  - FAED ↓ (full-panorama quality): 9.8 (Laval) and 3.8 (SUN360), roughly a 50% reduction vs. prior SOTA, meaning much better whole-panorama geometry and appearance.
  - CLIP-Score ↑: Best or near-best, showing strong text alignment.
- Videos (Argus split; real and simulated camera trajectories):
  - PSNR ↑ and LPIPS ↓ in the input-covered region: Higher PSNR and lower LPIPS than Imagine360, Argus, and ViewPoint, even though some baselines had ground-truth camera info. That's like drawing the input view more accurately than methods that got extra hints.
  - FVD ↓: 483 (real cams) and 433 (sim cams) vs. 844-1532 for baselines; a huge drop, like moving from a B- to an A+ in overall video realism and consistency.
  - VBench Imaging/Aesthetic/Motion ↑: Improvements across the board, showing nicer visuals and smoother motion.
- Seam Discontinuity (DS ↓):
  - Images: DS ~3.9 with Circular Latent Encoding vs. ~9.9 vanilla and ~5.3 with blended decoding. That's a clean removal of the edge line without post-hoc blurring.
  - Videos: DS ~13.3 with CLE vs. ~35.5 vanilla and ~19.8 blended, again much cleaner (a simple seam-measurement proxy is sketched after this list).
- Zero-shot Camera Understanding:
  - Field of view (FoV) error (degrees): mean ~3.9 (NYUv2), ~5.68 (ETH3D), ~5.21 (iBims-1), averaging ~4.93°, competitive with DUSt3R and MoGe, which are specialized.
  - Orientation (roll/pitch) error (degrees): 0.87/2.56 (MegaDepth) and 0.68/1.23 (LaMAR), only about 0.5° behind the current SOTA, GeoCalib.
- 3D Reconstruction:
  - Using generated panoramas to train a 3D Gaussian Splatting model yields navigable 3D scenes, indicating strong geometry consistency beyond just pretty pixels.
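For intuition about what a seam-discontinuity score measures, here is a toy seam check; it is an illustrative proxy only, not necessarily the paper's exact DS definition.

```python
import numpy as np

def seam_discontinuity(pano):
    """Toy seam proxy: how different are the left-most and right-most pixel columns,
    relative to the typical difference between neighbouring columns inside the image?"""
    edge_gap = np.abs(pano[:, 0].astype(float) - pano[:, -1].astype(float)).mean()
    interior_gap = np.abs(np.diff(pano.astype(float), axis=1)).mean()
    return edge_gap / (interior_gap + 1e-8)

# For a seamlessly wrapping image the ratio stays small (the wrap looks like any other
# column boundary); a visible vertical line at the wrap point makes it much larger.
smooth = np.tile(np.sin(np.linspace(0, 2 * np.pi, 512, endpoint=False)), (256, 1))
broken = smooth.copy()
broken[:, -1] += 1.0                              # paint an artificial seam at the right edge
print(seam_discontinuity(smooth), seam_discontinuity(broken))
```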
Surprising findings:
- Even without explicit camera metadata, the model preserves the input region better (higher PSNR, lower LPIPS) than methods that carefully unproject the input using ground-truth cameras.
- Training with random camera FoVs and tilts made the model stronger even on the standard test setting (fixed 90°). Variety teaches robust geometry.
- Making targets upright (canonical) reduces distortions and boosts realism, even if it slightly lowers pixel-match scores like PSNR in some ablations.
- Fixing seam artifacts at the VAE latent stage beats inference-time band-aids and adds no extra compute at sampling.
05 Discussion & Limitations
Limitations (be specific):
- Long videos: Panorama frames are big (ERP packs a whole sphere). With current compute, the model handles about 81 frames; longer stories need either more memory or an autoregressive extension.
- Complex physics/dynamics: Scenes with fast, intricate physics (e.g., splashing liquids, crowds) can still challenge fidelity and temporal stability.
- Dataset biases: Training on internet 360° content can introduce habits (e.g., occasional black borders or tripod-like objects at the bottom), which may appear in outputs.
- Upsampling: Off-the-shelf video upscalers, built for perspective views, can re-introduce ERP seams or distort spherical structure; panorama-aware upscalers are needed.
Required Resources:
- A pre-trained image/video diffusion transformer (e.g., FLUX, Wan) and a VAE.
- GPU memory to handle long token sequences at high ERP resolution (e.g., 1024×2048 images; 512×1024 videos with many frames).
- Optional pre-processing for training videos to canonicalize gravity and stabilize motion.
When NOT to Use:
- If you need exact, metric-accurate geometry for measurement (e.g., surveying), a calibration-free generative model is not a substitute for precise SfM/SLAM.
- Real-time use on edge devices: full diffusion transformers at panorama resolutions may be too heavy without distillation.
- Highly dynamic, deforming scenes where outpainting beyond input evidence could hallucinate unacceptable content.
Open Questions:
- Can we extend context length and temporal horizon efficiently (e.g., causal/AR diffusion transformers) for minute-long 360° videos?
- How can we build panorama-aware super-resolution that preserves ERP continuity without seams or distortions?
- Can we integrate lightweight, uncertainty-aware camera hints when available without re-introducing brittleness?
- How can we further reduce bias (e.g., tripod artifacts) via data curation or targeted fine-tuning while keeping generality?
- Can this geometry-free conditioning help other tasks (e.g., multi-view 3D training) without explicit calibration?
06 Conclusion & Future Work
Three-sentence summary: 360Anything turns ordinary photos and videos into seamless, gravity-aligned 360° panoramas without needing camera calibration, by treating both input and output as token sequences that a diffusion transformer aligns with attention. It fixes the root cause of panorama seams by introducing Circular Latent Encoding, so the VAE's latent space wraps around cleanly. The result is state-of-the-art image and video quality, competitive zero-shot camera estimation, and panoramas consistent enough for 3D reconstruction.
Main achievement: Replacing fragile geometry dependencies with simple, scalable sequence concatenation plus a principled seam fix, enabling robust, calibration-free 360° lifting in the wild.
Future directions:
- Longer, higher-resolution 360° videos via efficient or autoregressive diffusion transformers.
- Panorama-aware upsampling that keeps the ERP structure seamless at 4K/8K resolutions.
- Hybrid approaches that can optionally accept weak camera hints without losing robustness when hints are missing.
- Stronger temporal world models for complex dynamics and physics.
Why remember this: It shows that attention over joint token sequences can learn camera geometry implicitly, and that fixing seams at their true source (the latent encoding) is cleaner and stronger than patching at inference time, pointing toward simpler, more scalable, and more reliable generative world building.
Practical Applications
- Turn a phone photo into a VR-ready 360° background for virtual tours or classrooms.
- Outpaint a forward-facing drone video into a full wraparound scene for safer navigation planning.
- Create quick 360° location previews for film sets from a few scouting shots.
- Generate panoramic backdrops for game levels from concept art or reference images.
- Produce immersive real estate walkthroughs from simple room videos without special cameras.
- Expand POV sports clips into 360° highlights for replay and coaching analysis.
- Assist AR apps by imagining the unseen surroundings to place stable, context-aware overlays.
- Build 3D scene reconstructions (via 3D Gaussian Splatting) from generated panoramas for free-view exploration.
- Enhance travel and museum apps with instant 360° environments built from normal visitor photos.
- Automate panorama completion in photo editors without manual camera calibration.