Papers (5)

#LLM-as-judge

Legal RAG Bench: an end-to-end benchmark for legal RAG

Abdur-Rahman Butler, Umar Butler · Mar 2 · arXiv

Legal RAG Bench is a new, end-to-end test that checks how well legal AI systems find information and use it to answer tough, real-world legal questions.

#legal RAG #retrieval-augmented generation #embedding models

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Beginner

Chris Samarinas, Haw-Shiuan Chang et al. · Feb 26 · arXiv

SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.

#retrieval-augmented reasoning #reinforcement learning #GRPO

MediX-R1: Open Ended Medical Reinforcement Learning

Beginner

Sahal Shaji Mullappilly, Mohammed Irfan Kurpath et al. · Feb 26 · arXiv

MediX-R1 teaches medical AI models to give clear, free-form answers (not just A, B, C, or D) and to explain their thinking.

#medical multimodal RL #open-ended reinforcement learning #composite reward

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner

Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks whether AI agents can follow everyday instructions the way people actually give them.

#AgentIF-OneDay #instruction following #AI agents

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Beginner

Qiang Zhang, Boli Chen et al. · Jan 10 · arXiv

ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.

#ArenaRL #reinforcement learning #relative ranking
