How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)


4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Intermediate
Chiao-An Yang, Ryo Hachiuma et al. · Dec 18 · arXiv

This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.

#4D perception #multimodal large language models #perceptual distillation

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Intermediate
Junbo Li, Peng Zhou et al. · Dec 18 · arXiv

Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.

#Turn-PPO #multi-turn reinforcement learning #agentic LLMs
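The core idea — one advantage value per conversation turn rather than per token — can be sketched in a few lines. This is a minimal illustration with hypothetical names (`Turn`, `turn_level_advantages`), not the paper's actual implementation, and it uses a simple mean-return baseline in place of a learned value function:

```python
# Hypothetical sketch of turn-level advantage estimation:
# each turn is treated as one macro-action, and every token
# in that turn shares the turn's advantage.
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    token_ids: List[int]  # tokens the agent emitted in this turn
    reward: float         # reward observed after this turn


def turn_level_advantages(turns: List[Turn], gamma: float = 0.99) -> List[float]:
    """Compute a discounted return per turn, then subtract a
    baseline (here simply the mean return) to get one advantage
    per turn instead of one per token."""
    returns: List[float] = []
    g = 0.0
    for t in reversed(turns):          # accumulate returns backwards
        g = t.reward + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]


def broadcast_to_tokens(turns: List[Turn], advantages: List[float]) -> List[float]:
    """Every token in a turn inherits that turn's advantage,
    so the PPO update judges the turn as a whole."""
    return [a for t, a in zip(turns, advantages) for _ in t.token_ids]
```

In full PPO the baseline would come from a critic and the advantages would feed the clipped surrogate objective; the broadcast step is what distinguishes turn-level from token-level credit assignment.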

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Intermediate
Hanlin Wang, Hao Ouyang et al. · Dec 18 · arXiv

WorldCanvas lets you make videos where things happen exactly how you ask by combining three inputs: text (what happens), drawn paths called trajectories (when and where it happens), and reference images (who it is).

#WorldCanvas #promptable world events #trajectory-controlled video generation

Next-Embedding Prediction Makes Strong Vision Learners

Beginner
Sihan Xu, Ziqiao Ma et al. · Dec 18 · arXiv

This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch’s embedding in an image sequence, just like language models predict the next word.

#self-supervised learning #vision transformer #autoregression
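The analogy to next-word prediction can be made concrete: split an image into a raster-ordered sequence of patches, embed them, and train a predictor to map each embedding to the next one. The sketch below uses NumPy and hypothetical names (`patchify`, `next_embedding_loss`); it is an illustration of the idea, not the paper's code, and it stands in a toy linear predictor for the transformer:

```python
# Hypothetical sketch of next-embedding prediction for vision:
# predict patch embedding t+1 from embeddings up to t, the way a
# language model predicts the next token.
import numpy as np


def patchify(image: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split an HxW image into non-overlapping flattened patches,
    ordered left-to-right, top-to-bottom (raster order)."""
    h, w = image.shape
    patches = [image[i:i + patch, j:j + patch].ravel()
               for i in range(0, h, patch)
               for j in range(0, w, patch)]
    return np.stack(patches)  # shape: (num_patches, patch * patch)


def next_embedding_loss(embeddings: np.ndarray, predictor) -> float:
    """Mean squared error between each predicted next embedding
    and the actual next embedding in the sequence."""
    preds = predictor(embeddings[:-1])  # predictions for positions 1..T-1
    targets = embeddings[1:]            # ground-truth embeddings 1..T-1
    return float(np.mean((preds - targets) ** 2))


rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
emb = patchify(img)                     # 4 patches of 16 dims each
W = np.eye(emb.shape[1])                # toy linear "predictor"
loss = next_embedding_loss(emb, lambda e: e @ W)
```

In the actual method the predictor would be a vision transformer trained on this objective; the appeal the blurb points at is that the target is simply the next patch's embedding, with no pixel reconstruction or contrastive pairs.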

EasyV2V: A High-quality Instruction-based Video Editing Framework

Intermediate
Jinjie Mai, Chaoyang Wang et al. · Dec 18 · arXiv

EasyV2V is a simple but powerful system that edits videos by following plain-language instructions like “make the shirt blue starting at 2 seconds.”

#instruction-based video editing #spatiotemporal mask #text-to-video fine-tuning

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Intermediate
Qihao Liu, Chengzhi Mao et al. · Dec 18 · arXiv

AuditDM is a friendly "auditor" model that hunts for cases where vision-language models get things wrong and then creates targeted practice examples to fix them.

#AuditDM #model auditing #cross-model divergence

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Intermediate
Chaoyang Wang, Kaituo Feng et al. · Dec 18 · arXiv

AdaTooler-V teaches an image-and-video AI to first ask, “Do I really need a tool?” before using one, which saves time and boosts accuracy.

#adaptive tool-use #multimodal chain-of-thought #visual tool interactions

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Intermediate
Guibao Shen, Yihua Du et al. · Dec 18 · arXiv

StereoPilot is a new AI that turns regular 2D videos into 3D (stereo) videos quickly and with high quality.

#stereo video conversion #monocular-to-stereo #depth ambiguity

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Intermediate
Xin Lin, Meixi Song et al. · Dec 18 · arXiv

This paper builds a foundation model called DAP that estimates real-world (metric) depth from any 360° panorama, indoors or outdoors.

#panoramic depth estimation #metric depth #360-degree vision

Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Intermediate
Peter Chen, Xiaopeng Li et al. · Dec 18 · arXiv

The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.

#RLVR #Group Relative Policy Optimization #ratio clipping

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Intermediate
Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan et al. · Dec 18 · arXiv

This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.

#long-form video understanding #multimodal reasoning #audio-visual-speech alignment

Animate Any Character in Any World

Intermediate
Yitong Wang, Fangyun Wei et al. · Dec 18 · arXiv

AniX is a system that lets you place any character into any 3D world and control them with plain language, like “run forward” or “play a guitar.”

#AniX #3D Gaussian Splatting #world models