Papers (1055)

OmniGAIA: Towards Native Omni-Modal AI Agents

Xiaoxi Li, Wenxiang Jiao et al. · Feb 26 · arXiv

OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.

#OmniGAIA #OmniAtlas #Tool-Integrated Reasoning

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Intermediate

Hongrui Jia, Chaoya Jiang et al. · Feb 26 · arXiv

Large multimodal models (LMMs) can look at pictures and read text, but they still miss tricky cases, like tiny chart labels or multi-step math.

#Large Multimodal Models #Diagnostic-driven Progressive Evolution #Reinforcement Learning

Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Intermediate

Nils Schwager, Simon Münker et al. · Feb 26 · arXiv

This paper tests whether AI can realistically guess what a specific social media user would comment when they see a new post.

#Conditioned Comment Prediction #LLM user simulation #implicit conditioning

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Intermediate

Qianben Chen, Tianrui Qin et al. · Feb 26 · arXiv

This paper shows that letting an AI search many sources in parallel can beat making it reason through long, slow chains.

#agentic search #parallel evidence acquisition #plan refinement

dLLM: Simple Diffusion Language Modeling

Intermediate

Zhanhui Zhou, Lingjie Chen et al. · Feb 26 · arXiv

dLLM is a single, open-source toolbox that standardizes how diffusion language models are trained, run, and tested.

#diffusion language models #masked diffusion #block diffusion

Transformers converge to invariant algorithmic cores

Intermediate

Joshua S. Schiffman · Feb 26 · arXiv

Different transformers may have very different weights, but they often hide the same tiny "engine" inside that actually does the task.

#algorithmic cores #mechanistic interpretability #transformers

Causal Motion Diffusion Models for Autoregressive Motion Generation

Intermediate

Qing Yu, Akihisa Watanabe et al. · Feb 26 · arXiv

The paper introduces CMDM, a new way to make computer-generated human motions that feel smooth over time and match the meaning of a text prompt.

#causal diffusion #autoregressive motion generation #text-to-motion

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Intermediate

Zezhou Wang, Youjie Li et al. · Feb 25 · arXiv

This paper makes training giant AI models faster and lighter on memory by introducing a new way to split tensors called RaggedShard.

#FSDP #ZeRO #RaggedShard

Solaris: Building a Multiplayer Video World Model in Minecraft

Intermediate

Georgy Savva, Oscar Michel et al. · Feb 25 · arXiv

Solaris is a new AI that can imagine the future video of two Minecraft players at the same time, keeping both camera views consistent with each other.

#multiplayer world model #video diffusion transformer #Minecraft dataset

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Intermediate

Hanna Yukhymenko, Anton Alexandrov et al. · Feb 25 · arXiv

The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping questions and answers correctly connected.

#machine translation #multilingual benchmarks #test-time compute scaling

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Intermediate

Rui Yang, Qianhui Wu et al. · Feb 25 · arXiv

GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.

#GUI agent #visual grounding #long-horizon navigation

World Guidance: World Modeling in Condition Space for Action Generation

Intermediate

Yue Su, Sijin Chen et al. · Feb 25 · arXiv

WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.

#Vision-Language-Action #world modeling #condition space
