How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (21)

#LLM-as-a-judge

Specificity-aware reinforcement learning for fine-grained open-world classification

Intermediate
Samuele Angheben, Davide Berasi et al. · Mar 3 · arXiv

This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.

#open-world classification, #fine-grained recognition, #large multimodal models

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Intermediate
Qiyuan Zhang, Yufei Wang et al. · Mar 2 · arXiv

Longer explanations are not always better; the shape of the reasoning (how broad and how deep it goes) matters too.

#Generative Reward Models, #Chain-of-Thought, #Breadth-CoT

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Intermediate
Qiyuan Zhang, Junyi Zhou et al. · Mar 2 · arXiv

RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.

#RubricBench, #rubric-guided evaluation, #reward models

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Intermediate
Qianben Chen, Tianrui Qin et al. · Feb 26 · arXiv

This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.

#agentic search, #parallel evidence acquisition, #plan refinement

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Intermediate
Hanna Yukhymenko, Anton Alexandrov et al. · Feb 25 · arXiv

The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping questions and answers correctly connected.

#machine translation, #multilingual benchmarks, #test-time compute scaling

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Intermediate
Dongming Jiang, Yi Li et al. · Feb 22 · arXiv

This paper explains how AI agents remember things across long conversations and why many current tests don’t truly measure that memory.

#agentic memory, #memory-augmented generation, #long-context LLMs

DREAM: Deep Research Evaluation with Agentic Metrics

Intermediate
Elad Ben Avraham, Changhao Li et al. · Feb 21 · arXiv

Deep research agents write long reports, but older tests often judge only how fluent the reports sound and whether they include links, not whether the facts are still true today or the logic really holds.

#deep research agents, #agentic evaluation, #capability parity

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Intermediate
Pinyi Zhang, Ting-En Lin et al. · Feb 12 · arXiv

This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.

#personalized reward modeling, #generative reward model, #evaluation chain

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Intermediate
Yicheng Chen, Zerun Ma et al. · Feb 11 · arXiv

DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.

#data recipe, #data processing pipeline, #reinforcement learning

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Intermediate
Haozhen Zhang, Haodong Yue et al. · Feb 5 · arXiv

BudgetMem is a way for AI helpers to build and use memory on the fly, picking how much thinking to spend so answers are both good and affordable.

#runtime memory extraction, #budget-tier routing, #reinforcement learning

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate
Azmine Toushik Wasi, Wahid Faisal et al. · Feb 3 · arXiv

SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.

#SpatiaLab, #spatial reasoning, #vision-language models

No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Intermediate
Vynska Amalia Permadi, Xingwei Tan et al. · Feb 3 · arXiv

This paper builds ID-MoCQA, a new two-step (multi-hop) quiz set about Indonesian culture that makes AI connect clues before answering.

#multi-hop question answering, #cultural reasoning, #Indonesian culture