Papers (1262)

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison et al. · Mar 4 · arXiv

Phi-4-reasoning-vision-15B is a small, open-weight AI model that understands images and text together and is especially good at math, science, and using computer screens.

#multimodal reasoning #vision-language model #mid-fusion

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Intermediate

Jialong Chen, Xander Xu et al. · Mar 4 · arXiv

SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.

#SWE-CI #continuous integration #code maintainability

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Intermediate

Qinsi Wang, Hancheng Ye et al. · Mar 4 · arXiv

This paper shows that teaching AI to first draw a simple map of a text (nodes and links) before answering questions makes its answers more accurate and reliable.

#Structure of Thought #Text-to-Structure #Intermediate Representation

MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

Intermediate

Zonglin Yang, Lidong Bing · Mar 4 · arXiv

Scientists want AI to propose brand‑new hypotheses directly from a research background, but training a model to do this end‑to‑end is mathematically intractable because the search space explodes combinatorially.

#scientific discovery #hypothesis generation #P(h|b)

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Intermediate

Mohamed Elmoghany, Liangbing Zhao et al. · Mar 4 · arXiv

InfinityStory is a new system that can make very long videos (even hours long) in which the world stays consistent and characters transition smoothly between shots.

#long-form video generation #background consistency #multi-agent planning

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Beginner

Weicai Yan, Yuhong Dai et al. · Mar 3 · arXiv

Proact-VL is a video LLM that knows not only what to say about a live video but also when to say it, like a great sports commentator.

#Proactive VideoLLM #real-time commentary #streaming video understanding

Utonia: Toward One Encoder for All Point Clouds

Intermediate

Yujia Zhang, Xiaoyang Wu et al. · Mar 3 · arXiv

Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.

#Utonia #point cloud #self-supervised learning

MIBURI: Towards Expressive Interactive Gesture Synthesis

Intermediate

M. Hamza Mughal, Rishabh Dabral et al. · Mar 3 · arXiv

MIBURI is a system that makes a talking digital character move its body and face expressively in real time while it speaks.

#co-speech gesture synthesis #embodied conversational agents #causal generation

CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Intermediate

Hanyang Wang, Yiyang Liu et al. · Mar 3 · arXiv

This paper recasts a popular image-guidance trick (Classifier-Free Guidance) as a feedback-control problem, much like keeping a car steady in its lane.

#Classifier-Free Guidance #Sliding Mode Control #Diffusion Models

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Intermediate

Shengbang Tong, David Fan et al. · Mar 3 · arXiv

The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.

#multimodal pretraining #representation autoencoder #RAE

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Intermediate

Zimo Wen, Boxiu Li et al. · Mar 3 · arXiv

This paper builds UniG2U-Bench, a large benchmark for finding out when generating images actually helps models understand images and text together.

#Unified multimodal models #Vision-language models #Generation-to-Understanding (G2U)

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Intermediate

Aradhye Agarwal, Gurdit Siyan et al. · Mar 3 · arXiv

Agentic AIs don’t just chat; they plan, use tools, and take many steps, so one wrong click can cause real harm.

#MOSAIC #agentic safety #plan-check-act
