Papers (1055)

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Bowen Wen, Shaurya Dewan et al. · Dec 11 · arXiv

Fast-FoundationStereo is a stereo vision system that estimates depth from two cameras in real time while still working well on new scenes it was never trained on.

#stereo matching #zero‑shot generalization #knowledge distillation

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Intermediate

Tjark Behrens, Anton Obukhov et al. · Dec 11 · arXiv

StereoSpace turns a single photo into a stereo image pair without ever estimating a depth map.

#stereo generation #monocular-to-stereo #diffusion models

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Intermediate

Tsai-Shien Chen, Aliaksandr Siarohin et al. · Dec 11 · arXiv

Omni-Attribute is a new image encoder that encodes only the attributes of a picture you ask for (like hairstyle or lighting) and ignores the rest.

#open-vocabulary attribute encoder #attribute disentanglement #visual concept personalization

Bidirectional Normalizing Flow: From Data to Noise and Back

Intermediate

Yiyang Lu, Qiao Sun et al. · Dec 11 · arXiv

Normalizing flows are models that learn to turn real images into simple noise and then back again.

#Normalizing Flow #Bidirectional Normalizing Flow #Hidden Alignment

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Intermediate

Yiwen Tang, Zoey Guo et al. · Dec 11 · arXiv

This paper asks whether reinforcement learning (RL) can improve text-to-3D generation and shows that the answer is yes, provided the training and rewards are designed carefully.

#Reinforcement Learning #Text-to-3D Generation #Hi-GRPO

Stronger Normalization-Free Transformers

Intermediate

Mingzhi Chen, Taiming Lu et al. · Dec 11 · arXiv

This paper shows that we can remove normalization layers from Transformers and still train them well by using a simple point‑wise function called Derf.

#Normalization‑free Transformers #LayerNorm replacement #Point‑wise activation

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Intermediate

Yulu Gan, Ligeng Zhu et al. · Dec 11 · arXiv

FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.

#motion understanding #spatio-temporal reasoning #video question answering

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Intermediate

Peiying Zhang, Nanxuan Zhao et al. · Dec 11 · arXiv

DuetSVG is a new model that learns to make SVG graphics by generating an image and the matching SVG code together, like sketching first and then tracing neatly.

#DuetSVG #multimodal generation #SVG generation

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Intermediate

Kehong Gong, Zhengyu Wen et al. · Dec 11 · arXiv

MoCapAnything is a system that turns a single ordinary video into a 3D animation that can drive any rigged character, not just humans or one type of animal.

#motion capture #category-agnostic mocap #monocular video

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Intermediate

Zongzhao Li, Xiangzhe Kong et al. · Dec 11 · arXiv

The paper defines Microscopic Spatial Intelligence (MiSI) as the skill an AI needs to understand tiny 3D structures such as molecules from 2D pictures and text, just as scientists do.

#microscopic spatial intelligence #vision-language models #orthographic projection

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Intermediate

Jingli Lin, Runsen Xu et al. · Dec 11 · arXiv

This paper introduces MMSI-Video-Bench, a large, carefully hand-built benchmark that tests how well AI understands space and motion in videos.

#video-based spatial intelligence #multimodal large language models #spatial construction

Scaling Behavior of Discrete Diffusion Language Models

Intermediate

Dimitri von Rütte, Janis Fluri et al. · Dec 11 · arXiv

This paper studies how a newer kind of language model, the discrete diffusion language model (DLM), improves as data, model size, and compute increase.

#discrete diffusion #language models #scaling laws
