Papers (1055)

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger, Pierre Andrews et al. · Feb 12 · arXiv

Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.

#Gaia2 #ARE platform #asynchronous environments

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

Intermediate

MiniCPM Team, Wenhao An et al. · Feb 12 · arXiv

MiniCPM-SALA is a 9B-parameter language model that combines two kinds of attention, sparse and linear, to read very long texts quickly and accurately.

#long-context modeling #sparse attention #linear attention

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Intermediate

Futing Wang, Jianhao Yan et al. · Feb 12 · arXiv

The paper teaches language models to explore more ideas while thinking, so they can solve harder problems.

#In-Context Exploration #Test-Time Scaling #Chain-of-Thought

Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use

Intermediate

Hanbing Liu, Chunhao Tian et al. · Feb 12 · arXiv

This paper tackles a simple but serious question: can AI agents use paid tools to finish multi-step tasks without blowing the budget?

#budget-constrained tool use #agentic LLMs #inference-time planning

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Intermediate

Ahmadreza Jeddi, Marco Ciccone et al. · Feb 11 · arXiv

LoopFormer is a Transformer that thinks in loops and can flex its thinking time up or down based on the compute you give it.

#Looped Transformers #Elastic Depth #Shortcut Consistency

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Intermediate

Dawid J. Kopiczko, Sagar Vaze et al. · Feb 11 · arXiv

The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.

#Supervised Fine-Tuning #Chain-of-Thought #Data Repetition

GENIUS: Generative Fluid Intelligence Evaluation Suite

Intermediate

Ruichuan An, Sihan Yang et al. · Feb 11 · arXiv

The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.

#Generative Fluid Intelligence #Unified Multimodal Models #Interleaved Multimodal Context

PhyCritic: Multimodal Critic Models for Physical AI

Intermediate

Tianyi Xiong, Shihao Wang et al. · Feb 11 · arXiv

PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.

#Physical AI #Multimodal critic #Self-referential training

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Intermediate

Wayne Chi, Yixiong Fang et al. · Feb 11 · arXiv

GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.

#GameDevBench #Godot #multimodal agents

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Intermediate

Yicheng Chen, Zerun Ma et al. · Feb 11 · arXiv

DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.

#data recipe #data processing pipeline #reinforcement learning

RISE: Self-Improving Robot Policy with Compositional World Model

Intermediate

Jiazhi Yang, Kunyang Lin et al. · Feb 11 · arXiv

RISE lets a robot learn safely and cheaply by practicing in its imagination instead of always in the real world.

#Reinforcement Learning #World Models #Compositional World Model

ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

Intermediate

Ammar Ali, Baher Mohammad et al. · Feb 11 · arXiv

ROCKET is a fast, training-free way to shrink big AI models while keeping most of their smarts.

#model compression #training-free compression #sparse factorization
