How I Study AI - Learn AI Papers & Lectures the Easy Way

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger, Pierre Andrews et al. · Feb 12 · arXiv

Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.

#Gaia2 #ARE platform #asynchronous environments

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Intermediate

Shuangshuang Ying, Zheyu Wang et al. · Jan 29 · arXiv

This paper builds a safe science “playground” called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning) without mixing them up.

#retrieval-augmented generation #document-grounded reasoning #deep research benchmark

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Beginner

Sara Papi, Javier Garcia Gilabert et al. · Dec 18 · arXiv

This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.

#speech translation #Speech-LLM #cascaded ASR-MT
