Papers (1061)

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

RoboMME is a large new benchmark that checks whether robot policies can remember important things over time, not just react to what they see right now.

#robot memory #long-horizon manipulation #vision-language-action (VLA)

Helios: Real Real-Time Long Video Generation Model

Intermediate

Shenghai Yuan, Yuanyang Yin et al. · Mar 4 · arXiv

Helios is a 14-billion-parameter video model that can make minute-long videos in real time at about 19.5 frames per second on a single NVIDIA H100 GPU.

#real-time video generation #long video diffusion #autoregressive diffusion

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

Intermediate

Harman Singh, Xiuyu Li et al. · Mar 4 · arXiv

The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.

#pairwise self-verification #test-time scaling #parallel reasoning
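
To make the contrast in the summary concrete, here is a minimal sketch of the two selection styles: scoring each candidate answer alone versus comparing candidates head-to-head and keeping the one with the most wins. The function names and the toy comparator are hypothetical stand-ins for model calls; the listing does not describe the paper's actual procedure, so this should not be read as its algorithm.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_pointwise(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Judge each candidate on its own and keep the highest-scoring one."""
    return max(candidates, key=score)


def select_pairwise(candidates: Sequence[T], prefer: Callable[[T, T], bool]) -> T:
    """Compare every pair once (round-robin) and keep the candidate with the most wins.

    `prefer(a, b)` is a hypothetical judge that returns True when `a` beats `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    best_index = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best_index]


# Toy stand-in for a model judge: the answer closer to 42 wins.
def toy_prefer(a: int, b: int) -> bool:
    return abs(a - 42) <= abs(b - 42)


chosen = select_pairwise([40, 42, 45], toy_prefer)
```

A round-robin needs a number of comparisons that grows quadratically with the number of candidates, so a real system would likely compare only a subset of pairs; that trade-off is outside what this sketch shows.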

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Intermediate

Lingen Li, Guangzhi Wang et al. · Mar 4 · arXiv

CubeComposer is a new AI method that turns a normal forward-facing video into a full 360° VR video at true 4K quality without using super-resolution upscaling.

#360° video generation #cubemap #spatio-temporal autoregression

RIVER: A Real-Time Interaction Benchmark for Video LLMs

Intermediate

Yansong Shi, Qingsong Zhao et al. · Mar 4 · arXiv

RIVER Bench is a new test that checks how well AI can watch a video stream and talk with you in real time.

#RIVER Bench #online video understanding #multimodal large language models

Phi-4-reasoning-vision-15B Technical Report

Intermediate

Jyoti Aneja, Michael Harrison et al. · Mar 4 · arXiv

Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.

#multimodal reasoning #vision-language model #mid-fusion

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Intermediate

Jialong Chen, Xander Xu et al. · Mar 4 · arXiv

SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.

#SWE-CI #continuous integration #code maintainability

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Intermediate

Qinsi Wang, Hancheng Ye et al. · Mar 4 · arXiv

This paper shows that teaching AI to first draw a simple map of a text (nodes and links) before answering questions makes it smarter and more reliable.

#Structure of Thought #Text-to-Structure #Intermediate Representation
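
As a rough illustration of what "a simple map of a text (nodes and links)" could look like, the sketch below stores hand-written (source, relation, target) links and answers a question by following them. The example sentences, relation names, and helper functions are all hypothetical; the listing does not give the paper's prompt or data format, so this only illustrates the general idea of an intermediate structure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Link = Tuple[str, str, str]  # (source node, relation, target node)


def build_graph(links: Sequence[Link]) -> Dict[str, List[Tuple[str, str]]]:
    """Group links by source node so they can be followed step by step."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, relation, target in links:
        graph[source].append((relation, target))
    return dict(graph)


def follow(graph: Dict[str, List[Tuple[str, str]]],
           start: str,
           relations: Sequence[str]) -> Optional[str]:
    """Walk from `start` along the given relations; return None if a step is missing."""
    node = start
    for relation in relations:
        targets = [t for r, t in graph.get(node, []) if r == relation]
        if not targets:
            return None
        node = targets[0]  # Toy choice: take the first matching link.
    return node


# Hypothetical structure for the text:
# "Heavy rain causes flooding. Flooding damages crops."
links = [
    ("rain", "causes", "flooding"),
    ("flooding", "damages", "crops"),
]
graph = build_graph(links)

# "What does rain end up damaging?" becomes a two-step walk over the map.
answer = follow(graph, "rain", ["causes", "damages"])
```

In this toy setup the structure is written by hand; in the setting the summary describes, the model itself would produce the nodes and links before answering.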

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

Intermediate

Zonglin Yang, Lidong Bing · Mar 4 · arXiv

Scientists want AI to propose brand‑new hypotheses directly from a research background, but training a model to do this end‑to‑end is mathematically intractable because the search space explodes combinatorially.

#scientific discovery #hypothesis generation #P(h|b)

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Intermediate

Mohamed Elmoghany, Liangbing Zhao et al. · Mar 4 · arXiv

InfinityStory is a new system that can make very long videos (even hours) where the world stays the same and characters transition smoothly between shots.

#long-form video generation #background consistency #multi-agent planning

Utonia: Toward One Encoder for All Point Clouds

Intermediate

Yujia Zhang, Xiaoyang Wu et al. · Mar 3 · arXiv

Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.

#Utonia #point cloud #self-supervised learning

MIBURI: Towards Expressive Interactive Gesture Synthesis

Intermediate

M. Hamza Mughal, Rishabh Dabral et al. · Mar 3 · arXiv

MIBURI is a system that makes a talking digital character move its body and face expressively in real time while it speaks.

#co-speech gesture synthesis #embodied conversational agents #causal generation
