PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction
Key Summary
- PLANING is a new way to build 3D worlds from a moving single camera by combining two kinds of pieces: sharp triangles for shape and soft Gaussians for looks.
- It separates geometry (where things are) from appearance (how things look), so each can be optimized well without fighting the other.
- In streaming mode, it adds, trains, and prunes pieces on-the-fly using smart filters, so the map stays compact and fast.
- Triangles act like sturdy walls and edges, while neural Gaussians paint textures and lighting anchored to those walls.
- Compared to strong baselines, PLANING reconstructs scenes over 5× faster than 2D Gaussian Splatting while matching or beating visual quality.
- It improves dense mesh accuracy (Chamfer-L2) by 18.52% over PGSR and boosts rendering quality by 1.31 dB PSNR over ARTDECO.
- A global map update keeps the model aligned as camera poses are refined, reducing drift.
- The triangle "soup" makes it easy to extract clean planes, which are great for fast robot simulation and training.
- It works well across indoor and outdoor datasets and is robust in hard areas like low light and textureless walls.
- Limitations include handling glass/transparent things and very distant skies, which can confuse appearance signals.
Why This Research Matters
PLANING makes live 3D mapping faster, cleaner, and more reliable by letting triangles handle shape and Gaussians handle looks. This balance means AR apps can anchor virtual furniture accurately while still looking realistic. Robots gain stable planes and edges to walk on and grasp against, which boosts safety and autonomy. The compact, clean planes export quickly into simulators, speeding up reinforcement learning for locomotion and manipulation. City-scale mapping becomes more feasible thanks to fewer primitives and dynamic loading. Overall, PLANING reduces the usual trade-off between beauty and accuracy, making streaming 3D reconstruction practical for everyday tools and large deployments.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're building a Lego city while you walk around it with a phone camera. You don't finish the whole city at the end; you keep snapping pieces together as you move. That's what fast 3D reconstruction wants to do: keep up with you in real time.
🥬 Filling (The Actual Concept):
- What it is: Streaming 3D reconstruction is making a 3D model live, frame by frame, while a camera moves.
- How it works:
- The camera grabs a new image.
- The system guesses where the camera is now.
- It adds or improves parts of the 3D model using that image.
- It repeats quickly so the 3D scene grows smoothly.
- Why it matters: Without streaming, you must wait until after recording to see results. That's too slow for robots, AR glasses, or live mapping in large spaces.
🍞 Bottom Bread (Anchor): Think of a phone app that lets you walk around your room and instantly see a 3D model filling in on-screen as you move. That's streaming 3D reconstruction.
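The frame-by-frame loop above can be sketched in a few lines. This is a toy outline with stub functions of my own; `estimate_pose` and `update_map` are hypothetical placeholders, not the paper's actual components:

```python
# Minimal sketch of a streaming reconstruction loop. The two helpers are
# stand-ins: a real system would run a visual tracker and a mapper here.
def estimate_pose(frame, prev_pose):
    # Stub: guess where the camera is now (here, just advance a counter).
    return prev_pose + 1

def update_map(scene, frame, pose):
    # Stub: add or refine the pieces of the model visible from this pose.
    scene.append((pose, frame))
    return scene

def stream_reconstruct(frames):
    scene, pose = [], 0
    for frame in frames:                       # the camera grabs a new image
        pose = estimate_pose(frame, pose)      # where is the camera now?
        scene = update_map(scene, frame, pose) # grow/improve the 3D model
    return scene                               # the scene grows frame by frame

model = stream_reconstruct(["f0", "f1", "f2"])
```

The point of the structure is that each frame triggers exactly one incremental update, which is what keeps the model live instead of batch-built after recording.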
The world before: Earlier, NeRFs made scenes look photorealistic but were slow and hidden inside big neural nets, making edges fuzzy and geometry hard to edit. 3D Gaussian Splatting sped up rendering a lot, but its round, fuzzy blobs are great at color and light, not at drawing crisp walls, doors, and sharp corners. In live (streaming) settings, most methods either chose great appearance (pretty pictures) or great geometry (clean shapes), but not both at once.
The problem: When you try to get one system to do both jobs, surface shape and surface look, with the same kind of primitive (like only Gaussians), the two goals can argue. Fuzzy blobs love to match pixels (appearance), but they don't naturally lock onto straight edges or flat planes. This causes geometry drift, redundant blobs, and unstable training, especially when views are sparse or poses keep changing.
Failed attempts:
- Only Gaussians: Pretty views but messy, overfull geometry and weak edges.
- Only triangles/rectangles: Clear edges but weaker textures and view-dependent effects.
- Dual-branch with heavy SDFs + Gaussians: Better separation but slower and harder to optimize online.
- Pure feed-forward pose-free recon: Fast and robust starts, but often less accurate and less consistent over long sequences.
🍞 Top Bread (Hook): You know how a house needs sturdy studs (structure) and then paint (appearance)? Mixing up the two doesn't help: you wouldn't paint thin air or use paint to hold up the roof.
🥬 Filling (The Actual Concept):
- What it is: Triangle primitives are tiny flat pieces (like mini roof shingles) that build crisp surfaces.
- How it works:
- Each triangle has three learnable corners (vertices).
- A special rasterizer computes where it should appear in the image, including sharp edges.
- Triangles get adjusted using depth and normal hints so surfaces become clean and stable.
- Why it matters: Without triangle anchors, edges get blurry, walls wobble, and planes don't line up, especially in streaming.
🍞 Bottom Bread (Anchor): Imagine tiling a wall with small flat tiles. Because each tile is a straight-edged piece, the wall ends up flat and sharp.
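As a tiny self-contained illustration (my own toy, not the paper's rasterizer), a triangle's surface normal falls straight out of its three learnable corners via a cross product, which is what lets depth and normal hints supervise the vertices directly:

```python
# Toy example: derive a triangle's unit normal from its three vertices.
import math

def triangle_normal(v0, v1, v2):
    # Two edge vectors spanning the triangle.
    e1 = [b - a for a, b in zip(v0, v1)]
    e2 = [b - a for a, b in zip(v0, v2)]
    # Cross product e1 x e2 is perpendicular to the triangle's face.
    n = [e1[1] * e2[2] - e1[2] * e2[1],
         e1[2] * e2[0] - e1[0] * e2[2],
         e1[0] * e2[1] - e1[1] * e2[0]]
    norm = math.sqrt(sum(c * c for c in n))
    return [c / norm for c in n]

# A triangle lying flat in the xy-plane points straight up (+z).
n = triangle_normal([0, 0, 0], [1, 0, 0], [0, 1, 0])
```

Because the normal is a smooth function of the vertices, a normal-map prior can push the corners around by gradient descent until the face lines up with the observed surface.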
The gap this paper fills: We need a representation that gives triangles the steering wheel for geometry, while letting Gaussians handle the paint job, loosely coupled so each can do its best, but still cooperate. We also need a streaming-friendly recipe that: (1) smartly decides where to add pieces, (2) separates geometry vs. appearance training, and (3) keeps the whole map aligned when camera poses get refined.
🍞 Top Bread (Hook): Picture a coloring book. First, you draw the outlines neatly. Then, you color them in. Mixing both at once with one fat marker makes a mess.
🥬 Filling (The Actual Concept):
- What it is: Neural Gaussians are soft, learnable blobs that model color, lighting, and view-dependent effects.
- How it works:
- For each triangle, attach some Gaussians.
- Decode their size, rotation, and color from triangle features plus their own features.
- Render images fast and smooth, while triangles keep structure sharp.
- Why it matters: Without Gaussians, textures look flat and lighting effects vanish; triangles alone can't capture shiny, view-dependent details.
🍞 Bottom Bread (Anchor): The triangle is the page outline; neural Gaussians are the crayons that shade and highlight.
Real stakes:
- For AR: A headset needs sharp walls and floors to stick virtual objects in place, but also nice textures so the world looks real.
- For robots: They need trustworthy planes and edges for safe walking and grasping, updated live.
- For mapping and simulation: Clean, compact planes mean faster, cheaper, and more reliable environments for training AI.
🍞 Top Bread (Hook): If your GPS learns you turned a corner, you'd want your map to rotate with you, not pretend you still face north.
🥬 Filling (The Actual Concept):
- What it is: Global map adjustment is the step that re-aligns the 3D model whenever the camera pose estimates improve.
- How it works:
- Keep track of which frame created each piece.
- When the backend refines that frameās pose, compute the change.
- Move/rotate the attached triangles and Gaussians by that change.
- Why it matters: Without it, the model lags behind corrected poses, causing double walls and ghosting.
🍞 Bottom Bread (Anchor): It's like straightening a painting on the wall after nudging the nail: it keeps the picture aligned with the frame.
02 Core Idea
The "Aha!" in one sentence: Give geometry to sharp triangles and give appearance to soft neural Gaussians, and let them cooperate loosely so each can excel without tripping the other.
Three analogies:
- Coloring book: Triangles are the ink outlines; Gaussians are the crayons. Clean lines plus rich color beats trying to do both with one smudgy marker.
- House building: Triangles are the studs and drywall; Gaussians are the paint and lighting. Structure first, then appearance.
- Orchestra: Triangles keep the beat (structure), Gaussians add melody and harmony (appearance). Separate parts, better music.
🍞 Top Bread (Hook): You know how mixing all chores together (laundry, cooking, homework) creates chaos, but separating them gets things done right?
🥬 Filling (The Actual Concept):
- What it is: The Triangle-Gaussian Representation is a hybrid scene model that decouples geometry (triangles) from appearance (neural Gaussians) but keeps them linked.
- How it works:
- Learn triangles with a custom rasterizer to nail crisp surfaces, supervised by depth and normals.
- Anchor flexible Gaussians to each triangle to render colors and view-dependent effects via small neural decoders.
- Train with separate geometry and appearance losses so signals don't fight.
- Still pass some appearance gradients back to triangles for gentle geometry refinement.
- Why it matters: Without decoupling, appearance gradients can warp geometry; without coupling, appearance ignores structure. The loose link keeps them coordinated.
🍞 Bottom Bread (Anchor): It's like having a blueprint crew (triangles) and a painting crew (Gaussians) sharing a walkie-talkie. The painters ask for slight tweaks where needed; the builders keep walls straight.
Before vs. After:
- Before: Single-primitive methods struggled to keep both sharp structure and high-fidelity rendering. Streaming versions often ballooned in primitive count, ran slower, and drifted in geometry.
- After: PLANING achieves crisp planes and edges, high PSNR/SSIM, fewer primitives, and fast streaming training. It exports clean planar pieces for downstream tasks.
Why it works (intuition):
- Triangles have edges and faces that align naturally with indoor layouts (walls, floors, tables). This gives the model a stable skeleton.
- Gaussians, anchored to those triangles, only need to model appearance. Anchoring prevents them from drifting off-surface.
- Separate losses avoid tug-of-war. Occasional gradients from appearance still help triangles snap tighter to cues in the image.
- Smart initialization and pruning reduce redundancy so the model stays lean and fast.
Building blocks (small pieces that make it click):
- Learnable triangles with a local frame and an edge-preserving contribution function for sharp-yet-differentiable rendering.
- Neural Gaussians attached to triangles with tiny MLPs that decode scale/rotation/color from triangle and Gaussian features.
- Streaming scaffold: A frontend tracks the camera and picks keyframes; a backend refines poses; a mapper adds/prunes/optimizes primitives.
- Two filters decide where to add triangles: a photometric filter (finds places the render looks wrong, especially high-frequency spots) and a spatial filter (prevents crowding where coverage is already good).
- Global map adjustment: When the backend tweaks poses, the entire triangle-Gaussian set is transformed consistently, preventing ghosting.
🍞 Top Bread (Hook): Imagine cleaning your room by first placing the big furniture (triangles) and then adding posters and lights (Gaussians). You work faster and keep things tidy.
🥬 Filling (The Actual Concept):
- What it is: Streaming 3D reconstruction with PLANING is an on-the-fly pipeline that keeps geometry slim and visuals sharp.
- How it works:
- Frontend: track camera, choose keyframes, get depth/normal hints.
- Mapper: at key places, add triangles where needed, attach Gaussians, train quickly, prune extras.
- Backend: close loops and refine poses; then do a global adjustment so the map stays aligned.
- Why it matters: Without this staged, decoupled flow, online systems get bloated, drift, or look bad.
🍞 Bottom Bread (Anchor): As you walk down a hallway, PLANING adds concise triangle patches on walls and floor only where missing, paints them with Gaussians, and straightens the whole model whenever the pose gets better.
03 Methodology
At a high level: Unposed images → (Frontend tracking) → (Backend pose refinement) → (Mapper: initialize triangles + attach Gaussians + train + prune) → Output: clean triangles for geometry + Gaussians for rendering, plus optional planes and dense meshes.
Step-by-step, like a recipe:
- Input and tracking
- What happens: The system ingests a monocular video. The frontend estimates camera motion and selects keyframes. It also gets dense geometric hints (depth and normals) from a feed-forward prior.
- Why this step exists: Streaming needs quick, robust pose and geometry hints to avoid drifting and to place new pieces in the right spots.
- Example: You pan across a living room. The frontend marks a frame with the couch fully visible as a keyframe and estimates the pose and a rough depth/normal map for it.
- Backend global pose optimization
- What happens: It detects loop closures and refines camera poses over keyframes with bundle adjustment.
- Why: Camera poses keep improving over time. If we don't refine globally, small errors add up to drift.
- Example: You circle the room and return to the doorway. The backend notices the loop and tightens all trajectories.
🍞 Top Bread (Hook): When you hang multiple picture frames, you step back to realign them so the whole wall looks straight.
🥬 Filling (The Actual Concept):
- What it is: Global map adjustment moves all scene pieces to match the improved poses.
- How it works: For each primitive, compute the relative transform from old to new keyframe pose and apply it to triangle vertices and Gaussian parameters.
- Why it matters: Without this, even corrected poses don't fix the already-built map, causing misalignments.
🍞 Bottom Bread (Anchor): After straightening the nail, you nudge each photo so the entire gallery wall is neat again.
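A minimal sketch of that relative-transform update, assuming a simplified 2D pose (rotation angle plus translation) of my own instead of the full SE(3) poses used in practice; the structure of the correction, the new pose composed with the inverse of the old one, is the same:

```python
# Toy 2D version of global map adjustment: move a primitive's vertex by the
# relative transform between its keyframe's old and refined pose.
import math

def apply_pose(pose, point):
    # pose = (theta, tx, ty): rotate the point by theta, then translate.
    th, tx, ty = pose
    x, y = point
    return (math.cos(th) * x - math.sin(th) * y + tx,
            math.sin(th) * x + math.cos(th) * y + ty)

def inverse(pose):
    # (R, t)^-1 = (R^-1, -R^-1 t)
    th, tx, ty = pose
    c, s = math.cos(-th), math.sin(-th)
    return (-th, c * -tx - s * -ty, s * -tx + c * -ty)

def compose(a, b):
    # a after b: rotation angles add, b's translation is carried through a.
    x, y = apply_pose(a, (b[1], b[2]))
    return (a[0] + b[0], x, y)

def global_adjust(vertex, old_pose, new_pose):
    # Relative correction T = new_pose * old_pose^-1, applied to the vertex.
    rel = compose(new_pose, inverse(old_pose))
    return apply_pose(rel, vertex)

# If the backend shifts the keyframe one unit in x, the vertex follows.
v = global_adjust((1.0, 2.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Every primitive only needs to remember which keyframe created it; one composed transform then moves triangle vertices and Gaussian centers consistently.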
- Mapper: primitive initialization (deciding where to add triangles)
- What happens: Two filters pick locations that truly need new geometry. a) Photometric filter: compares a sharpened view of the real image vs. the current render. Where they differ (especially in edges or textures), insertion probability rises. b) Spatial filter: checks for nearby triangles at the right scale (depth-adaptive), avoiding overcrowding.
- Why: Without the photometric filter, we'd miss places where the render is wrong; without the spatial filter, we'd explode in primitive count.
- Example with data: A pixel on a lamp edge has high LoG difference (render missed the edge). If no nearby triangle covers it (by depth-based vicinity), we seed a new triangle there.
🍞 Top Bread (Hook): Like a gardener pruning and planting only where the hedge is thin, not where it's already bushy.
🥬 Filling (The Actual Concept):
- What it is: Photometric and spatial filtering decide where new geometry is worth it.
- How it works: Compute edge-like differences to spot bad reconstructions, then use a depth-aware neighborhood to avoid duplicates.
- Why it matters: Without them, the model becomes redundant and slow; with them, it stays compact and fast.
🍞 Bottom Bread (Anchor): You patch holes in a wall where paint is missing but don't keep repainting the same perfect spot.
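A rough sketch of the two filters, with my own simplifications: a plain discrete Laplacian stands in for the LoG response, images are nested lists, and the thresholds and radius factor are made-up constants:

```python
# Toy insertion test: photometric filter (is the render missing detail here?)
# followed by a spatial filter (is coverage already good nearby?).
def laplacian(img, x, y):
    # Discrete Laplacian: large magnitude at edges / high-frequency detail.
    return (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
            - 4 * img[y][x])

def needs_triangle(gt, render, x, y, depth, existing, photo_thresh=0.5):
    # Photometric filter: compare edge responses of real image vs. render.
    if abs(laplacian(gt, x, y) - laplacian(render, x, y)) < photo_thresh:
        return False  # the render already explains this pixel well enough
    # Spatial filter: reject if a triangle sits within a depth-adaptive radius.
    radius = 0.05 * depth  # farther pixels tolerate sparser coverage
    return all((ex - x) ** 2 + (ey - y) ** 2 >= radius ** 2
               for ex, ey in existing)

gt = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]       # a bright dot the render missed
render = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
seed = needs_triangle(gt, render, 1, 1, 10.0, existing=[])
```

Only pixels that pass both tests seed a new triangle, which is what keeps the primitive count from ballooning during streaming.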
- Triangle creation and parameters
- What happens: For a selected pixel, we back-project it to 3D, orient a small triangle using the local normal prior, and set its initial size (scaled with depth and camera focal length). Opacity is lowered where confidence is low. Each triangle gets a feature vector for appearance linkage.
- Why: Good initialization makes training stable and keeps edges sharp from the start.
- Example: On a white wall at 3 m depth, we place a coin-sized triangle aligned to the wall's normal; on a table edge closer to the camera, the triangle is smaller and more precise.
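Here is a sketch of the seeding step under a standard pinhole model. This is my own toy; in particular, the target footprint of 8 pixels is an assumption for illustration, not the paper's value:

```python
# Toy triangle seeding: back-project the selected pixel, then size the
# triangle with depth over focal length so it covers a roughly constant
# number of pixels on screen.
def backproject(u, v, depth, fx, fy, cx, cy):
    # Pinhole camera: pixel (u, v) at a given depth -> 3D camera-frame point.
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def triangle_size(depth, focal, pixels=8.0):
    # A triangle spanning ~`pixels` pixels projects to size depth * pixels / f,
    # so farther surfaces get proportionally larger triangles.
    return depth * pixels / focal

center = backproject(320, 240, 3.0, 500.0, 500.0, 320.0, 240.0)
size = triangle_size(3.0, 500.0)
```

The orientation would then come from the normal prior at that pixel, and the opacity from the prior's confidence.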
- Attaching neural Gaussians
- What happens: Each triangle hosts several Gaussians. Their position is anchored at the triangle's barycenter plus a small offset. Tiny MLPs decode each Gaussian's scale/rotation based on triangle and Gaussian features. Colors come from spherical harmonics, seeded from pixel color.
- Why: Anchoring keeps Gaussians glued to real surfaces. Decoding lets them adapt to local detail without exploding in number.
- Example: A triangle on a bookshelf hosts more Gaussians if the texture is detailed (book spines), fewer if it's smooth (a painted wall).
🍞 Top Bread (Hook): Think of magnets (Gaussians) snapping onto a steel sheet (triangle) so they don't drift around.
🥬 Filling (The Actual Concept):
- What it is: Neural Gaussians anchored to triangles for flexible, high-quality appearance.
- How it works: The triangle's feature and each Gaussian's feature feed small networks that predict the Gaussian's final size, orientation, and color.
- Why it matters: Without anchoring, Gaussians wander; without networks, they can't adapt detail efficiently.
🍞 Bottom Bread (Anchor): The magnets adjust their shapes a bit to fill gaps nicely, but they stay stuck to the sheet.
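A toy version of the decoding idea, with my own weights and feature sizes (the paper's decoders also predict rotation and color): concatenate the triangle feature with a Gaussian's own feature and push them through a tiny MLP whose output is squashed to a positive scale:

```python
# Toy Gaussian decoder: features in, positive scale out. The Gaussian's
# position stays anchored to its host triangle; only its shape is decoded.
import math

def decode_scale(x, w1, w2):
    # One hidden layer with ReLU, then softplus so the scale is always > 0.
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    out = sum(wi * hi for wi, hi in zip(w2, h))
    return math.log1p(math.exp(out))  # softplus

tri_feat, gauss_feat = [0.2, -0.1], [0.4]
x = tri_feat + gauss_feat                      # concatenate the two features
w1 = [[0.5, 0.1, -0.2], [0.3, -0.4, 0.2]]      # hypothetical trained weights
w2 = [1.0, -0.5]
scale = decode_scale(x, w1, w2)                # decoded Gaussian scale
```

Sharing the triangle feature across all of its Gaussians is what lets a handful of anchored blobs adapt to local texture without growing in number.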
- Training with decoupled losses
- What happens: a) Geometry loss: triangles are supervised by multi-view depth and normal priors; an opacity entropy loss prunes weak triangles. b) Appearance loss: Gaussians are trained to match image colors and structures (L2 + SSIM + regularization), and gradients can gently adjust triangles too.
- Why: Separate objectives prevent appearance from breaking geometry, while still letting appearance refine geometry where safe.
- Example: If the rendered wall looks slightly bulged, appearance gradients nudge triangles flatter; if a triangle never contributes, opacity pruning removes it.
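The two objectives can be sketched on scalars (my simplification; the real losses operate on rendered images and also include SSIM and regularizers):

```python
# Toy decoupled losses: geometry supervised by priors plus an opacity
# entropy term, appearance supervised by photometric error.
import math

def geometry_loss(pred_depth, prior_depth, opacity):
    depth_term = abs(pred_depth - prior_depth)
    # Opacity entropy pushes opacities toward 0 or 1, so weak triangles
    # become clearly prunable instead of lingering half-transparent.
    p = min(max(opacity, 1e-6), 1 - 1e-6)
    entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return depth_term + 0.1 * entropy  # 0.1 is a made-up weight

def appearance_loss(pred_color, gt_color):
    return (pred_color - gt_color) ** 2  # L2 stand-in for L2 + SSIM

total = geometry_loss(2.9, 3.0, 0.95) + appearance_loss(0.6, 0.5)
```

Because the two terms touch different parameters (triangles vs. Gaussians), appearance error cannot directly warp the surfaces; only a controlled trickle of appearance gradient reaches the triangles.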
- Global optimization and outputs
- What happens: After the streaming pass, a brief global optimization polishes everything. Triangles can be fused into a dense mesh via TSDF. Planes are extracted from the triangle soup with a coarse-to-fine method.
- Why: Final polish ensures consistency; plane extraction creates compact, simulation-ready assets.
- Example: The living room becomes a clean set of big planes (floor, walls, ceiling) plus a tidy mesh of furniture.
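A toy version of grouping a triangle "soup" into planes, using my own one-pass clustering rule (far simpler than the paper's coarse-to-fine method): triangles whose unit normals are nearly parallel and whose plane offsets agree get merged:

```python
# Toy plane extraction: each triangle is (unit normal n, offset d) with the
# plane equation n . x = d; near-duplicates collapse into one plane.
def group_planes(triangles, normal_tol=0.99, offset_tol=0.05):
    planes = []   # representative (normal, offset) per discovered plane
    labels = []   # plane index assigned to each input triangle
    for n, d in triangles:
        for i, (pn, pd) in enumerate(planes):
            dot = sum(a * b for a, b in zip(n, pn))
            if dot > normal_tol and abs(d - pd) < offset_tol:
                labels.append(i)   # joins an existing plane
                break
        else:
            planes.append((n, d))  # starts a new plane
            labels.append(len(planes) - 1)
    return labels, planes

soup = [((0, 0, 1), 0.0), ((0, 0, 1), 0.01), ((1, 0, 0), 2.0)]
labels, planes = group_planes(soup)  # two floor triangles merge into one plane
```

Because the triangles already lie flat on real surfaces, even a crude rule like this recovers big clean planes, which is what makes the export to simulators cheap.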
The secret sauce:
- Loosely coupled design: triangles give geometry backbone; Gaussians add appearance skin; small, controlled gradients flow between them.
- Streaming-aware initialization + filtering: add only what's needed, where it's needed.
- Global map adjustment: always keep the model in sync with pose updates, cutting drift and ghosting.
- Edge-preserving triangle rasterization: differentiable yet crisp edges for strong geometry learning.
04 Experiments & Results
The test: The authors measured two families of things, how accurate the 3D shape is and how good the images look from new views, plus how fast and compact the method is. Geometry used Chamfer distance (lower is better) and F-score (higher is better). Rendering used PSNR/SSIM/LPIPS (higher PSNR/SSIM and lower LPIPS mean better-looking images). They also tracked training time and how many primitives the model used.
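Toy versions of the two headline metrics (my simplification to 1D point lists and a scalar MSE) make the "lower/higher is better" directions concrete:

```python
# Chamfer-L2 averages nearest-neighbor squared distances in both directions;
# PSNR converts mean squared error into decibels (higher = cleaner image).
import math

def chamfer_l2(a, b):
    def one_way(src, dst):
        return sum(min((s - d) ** 2 for d in dst) for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def psnr(mse, peak=1.0):
    return 10 * math.log10(peak ** 2 / mse)

d = chamfer_l2([0.0, 1.0], [0.0, 1.1])  # small -> shapes nearly coincide
q = psnr(0.001)                          # low error -> high dB
```

So an 18.52% drop in Chamfer-L2 means predicted and ground-truth surfaces sit measurably closer, and a +1.31 dB PSNR gain means the per-pixel error shrank by roughly a quarter.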
The competition: They compared against strong per-scene methods (2DGS, PGSR, MeshSplatting), streaming methods (ARTDECO, OnTheFly-NVS, S3PO-GS, MonoGS), and planar methods (PlanarSplatting, AirPlanes). Everyone got the same geometric priors when applicable to keep it fair.
Scoreboard with context:
- Speed and efficiency: On ScanNetV2, PLANING reconstructs scenes in under 100 seconds, over 5× faster than 2D Gaussian Splatting, while matching or beating quality. Think of finishing your homework in 20 minutes when others need over an hour, and still getting a top grade.
- Geometry accuracy: Dense mesh Chamfer-L2 improves by 18.52% over PGSR. That's like shrinking your measuring error from 10 cm to a bit over 8 cm: noticeably crisper shapes. F-scores also rise, showing more correct surface matches.
- Rendering quality: PLANING surpasses ARTDECO by 1.31 dB PSNR on average. A 1+ dB PSNR gain is meaningful, like nudging a B+ toward an A in picture clarity, especially alongside better SSIM and LPIPS.
- Compactness: Thanks to the filters and pruning, PLANING uses far fewer primitives than many baselines, which cuts memory and speeds up training and rendering.
Surprising or notable findings:
- Tough regions (low light, blank walls): PLANING holds up better. Triangles keep the geometry steady even when textures are weak, and Gaussians fill in appearance without causing blobs to drift.
- Plane extraction quality: Because triangles form clear surfaces, extracting big, clean planes is easy. This makes exporting to simulators (like Isaac) much faster than processing heavy meshes.
- Global map update matters: Turning it off increases misalignments. With it, the model stays aligned to refined poses, improving both geometry and rendering.
- Hybrid beats single-primitive: Ablations show that removing triangles hurts both geometry sharpness and rendering, while removing the hybrid anchoring bloats Gaussians and reduces quality.
Across datasets (ScanNet++, ScanNetV2, VR-NeRF, FAST-LIVO2, KITTI, Waymo), PLANING consistently ranked at or near the top in both geometry and appearance. It's not just one scene type; it's a robust pattern.
Key numbers to remember:
- 5× faster than 2DGS for streaming reconstruction with similar or better visual quality.
- +1.31 dB PSNR over ARTDECO on average (better rendering).
- −18.52% Chamfer-L2 vs. PGSR (better geometry).
- Strong F-score and SSIM/LPIPS across varied indoor/outdoor scenes.
Takeaway: The hybrid triangle-Gaussian design provides a clean geometry backbone and high-fidelity appearance, all while staying fast and compact in streaming mode.
05 Discussion & Limitations
Limitations (be specific):
- Glass and transparency: Neural Gaussians depend on reliable appearance cues; see-through or reflective objects confuse gradients and can nudge geometry the wrong way.
- Distant backgrounds/sky: The framework focuses on solid surfaces; very far, low-parallax regions aren't modeled explicitly and can look inconsistent.
- Dynamic scenes: The method assumes mostly static environments; lots of moving people or objects can disrupt consistency.
- Prior dependence: Geometric priors (depth/normals) boost stability; if priors are poor or missing, training can degrade.
Required resources:
- A modern GPU (e.g., RTX 4090 in the paper) for interactive training.
- Feed-forward models (e.g., MASt3R) for geometric hints.
- Enough CPU/GPU memory for primitives, with optional dynamic loading for large spaces.
When not to use:
- Scenes dominated by transparent or highly specular geometry (glass art installations, aquariums).
- Outdoor panoramas where sky and very distant scenery matter more than near surfaces.
- Highly dynamic events (crowded concerts) where structure changes constantly.
Open questions and future work:
- Better handling of transparency/specularity (e.g., separating reflective layers or learning robust appearance cues).
- Unified far-field modeling (skyboxes or layered backgrounds) to improve outdoor realism.
- Tighter end-to-end coupling with pose estimation, possibly reducing reliance on external priors.
- Curved-surface anchoring (beyond planes/triangles) without losing simplicity.
- Adaptive Gaussian budgeting that reallocates capacity on-the-fly based on uncertainty.
- Even more robust large-scale streaming with smarter CPU-GPU swapping and multi-agent capture.
06 Conclusion & Future Work
Three-sentence summary: PLANING introduces a hybrid, loosely coupled triangle-Gaussian representation that cleanly separates geometry from appearance for streaming 3D reconstruction. Triangles give crisp, stable structure while neural Gaussians deliver high-fidelity, view-dependent appearance, all kept in sync with global map updates. The result is state-of-the-art quality with far fewer primitives and much faster streaming than prior methods, plus clean planes that plug right into simulators.
Main achievement: Showing that a decoupled-yet-linked geometry-appearance design, with triangles as structural anchors and Gaussians as appearance painters, solves the long-standing trade-off between sharp geometry and photorealistic rendering in a streaming setting.
Future directions: Improve handling of transparent/reflective materials and distant backgrounds; explore end-to-end pose-map co-optimization; extend anchors to richer geometric primitives; and scale further with smarter resource management. The clean planes also invite new applications in robotics, AR, and large-scale mapping.
Why remember this: It's the "coloring book" idea brought to 3D streaming: draw the outlines with triangles, color them with Gaussians, and keep the sketch aligned as you move. That simple separation makes reconstruction faster, cleaner, and more useful for real-world tasks.
Practical Applications
- AR interior design: place virtual furniture that stays put on crisp, well-detected floors and walls.
- Robot navigation: use extracted planes for reliable footstep planning and obstacle avoidance in real time.
- Rapid site capture: stream a clean 3D model of a room or building during walkthroughs for facility management.
- Simulation bootstrapping: export compact planes to Isaac or similar simulators for faster robot policy training.
- On-the-fly inspection: highlight geometry mismatches (e.g., missing edges) during construction quality checks.
- VR set reconstruction: quickly build sharp, lightweight environments for virtual production or training.
- Large-scale mapping: use dynamic loading to reconstruct long corridors and multi-room spaces on limited GPUs.
- Pose refinement: feed plane constraints back into the tracker to reduce drift in challenging sequences.
- Content creation: generate clean meshes and planes that are easy to edit for game and film assets.
- Disaster response: rapidly map interiors with minimal equipment to plan safe routes for first responders.