Papers21

#LLM-as-a-judge

Specificity-aware reinforcement learning for fine-grained open-world classification

Samuele Angheben, Davide Berasi et al.Mar 3arXiv

This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.

#open-world classification#fine-grained recognition#large multimodal models

Not triaged yet

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Intermediate

Qiyuan Zhang, Yufei Wang et al.Mar 2arXiv

Longer explanations are not always better; the shape of thinking matters.

#Generative Reward Models#Chain-of-Thought#Breadth-CoT

Not triaged yet

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Intermediate

Qiyuan Zhang, Junyi Zhou et al.Mar 2arXiv

RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.

#RubricBench#rubric-guided evaluation#reward models

Not triaged yet

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Intermediate

Qianben Chen, Tianrui Qin et al.Feb 26arXiv

This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.

#agentic search#parallel evidence acquisition#plan refinement

Not triaged yet

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Intermediate

Hanna Yukhymenko, Anton Alexandrov et al.Feb 25arXiv

The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping questions and answers correctly connected.

#machine translation#multilingual benchmarks#test-time compute scaling

Not triaged yet

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Intermediate

Dongming Jiang, Yi Li et al.Feb 22arXiv

This paper explains how AI agents remember things across long conversations and why many current tests don’t truly measure that memory.

#agentic memory#memory-augmented generation#long-context LLMs

Not triaged yet

DREAM: Deep Research Evaluation with Agentic Metrics

Intermediate

Elad Ben Avraham, Changhao Li et al.Feb 21arXiv

Deep research agents write long reports, but old tests often judge only how smooth they sound and whether they add links, not whether the facts are true today or the logic really holds.

#deep research agents#agentic evaluation#capability parity

Not triaged yet

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Intermediate

Pinyi Zhang, Ting-En Lin et al.Feb 12arXiv

This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.

#personalized reward modeling#generative reward model#evaluation chain

Not triaged yet

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Intermediate

Yicheng Chen, Zerun Ma et al.Feb 11arXiv

DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.

#data recipe#data processing pipeline#reinforcement learning

Not triaged yet

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Intermediate

Haozhen Zhang, Haodong Yue et al.Feb 5arXiv

BudgetMem is a way for AI helpers to build and use memory on the fly, picking how much thinking to spend so answers are both good and affordable.

#runtime memory extraction#budget-tier routing#reinforcement learning

Not triaged yet

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate

Azmine Toushik Wasi, Wahid Faisal et al.Feb 3arXiv

SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.

#SpatiaLab#spatial reasoning#vision-language models

Not triaged yet

No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Intermediate

Vynska Amalia Permadi, Xingwei Tan et al.Feb 3arXiv

This paper builds ID-MoCQA, a new two-step (multi-hop) quiz set about Indonesian culture that makes AI connect clues before answering.

#multi-hop question answering#cultural reasoning#Indonesian culture

Not triaged yet

1 2