How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (19)


FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Intermediate
Zhen Wang, Fan Bai et al. · Feb 2 · arXiv

FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.

#FIRE-Bench · #scientific agents · #rediscovery benchmark

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Intermediate
Yu Zeng, Wenxuan Huang et al. · Feb 2 · arXiv

The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.

#multimodal large language model · #visual question answering · #vision deep research

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Intermediate
Ajay Patel, Colin Raffel et al. · Jan 29 · arXiv

Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning massive web documents into instruction-and-answer pairs at huge scale (a rough sketch of the idea follows the tags below).

#FineInstructions · #synthetic instruction–answer pairs · #instruction-tuning pre-training
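
To make the FineInstructions idea concrete, here is a minimal, hypothetical sketch in Python. It is not the authors' pipeline: the generate function and the prompt wording are placeholders for whatever instruction-synthesis model and prompts the paper actually uses.

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; plug in any text-generation backend here.
    raise NotImplementedError

def document_to_instruction_pair(document: str) -> dict:
    # Ask the model to invent an instruction the document could answer,
    # then have it answer that instruction using only the document.
    instruction = generate(
        "Write one instruction or question that the following text would "
        f"fully answer:\n\n{document}"
    )
    answer = generate(
        "Answer the instruction below using only the given text.\n\n"
        f"Instruction: {instruction}\n\nText: {document}"
    )
    return {"instruction": instruction, "answer": answer}

Applied across a large web corpus, a converter along these lines yields instruction-and-answer pairs at pre-training scale, which is the scale the title emphasizes.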

Self-Improving Pretraining: using post-trained models to pretrain better models

Intermediate
Ellen Xiaoqing Tan, Shehzaad Dhuliawala et al. · Jan 29 · arXiv

This paper teaches language models to be safer, more factual, and higher quality during pretraining, not just after, by using reinforcement learning with a stronger model as a helper.

#self-improving pretraining · #reinforcement learning · #online DPO

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner
Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.

#AgentIF-OneDay · #instruction following · #AI agents

Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance

Intermediate
Qianli Ma, Chang Guo et al. · Jan 20 · arXiv

This paper turns rebuttal writing from ‘just write some text’ into ‘make a plan with proof, then write.’

#rebuttal generation · #multi-agent systems · #evidence-centric planning

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate
Keyu Li, Junhao Shi et al. · Jan 16 · arXiv

AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.

#autonomous agents · #long-horizon evaluation · #agent benchmarking

Reasoning Models Generate Societies of Thought

Intermediate
Junsol Kim, Shiyang Lai et al. · Jan 15 · arXiv

The paper shows that top reasoning AIs don’t just think longer—they act like a tiny team inside their heads, with different voices that ask, disagree, and then agree.

#society of thought · #reasoning reinforcement learning · #conversational behaviors

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Intermediate
Chengwen Liu, Xiaomin Yu et al. · Jan 11 · arXiv

VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.

#video deep research · #multimodal reasoning · #open-domain question answering

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Intermediate
Hongjun An, Yiliang Song et al. · Jan 10 · arXiv

The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.

#Preference-Undermining Attacks · #PUA · #sycophancy

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Beginner
Qiang Zhang, Boli Chen et al. · Jan 10 · arXiv

ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score (a toy sketch of the idea follows the tags below).

#ArenaRL · #reinforcement learning · #relative ranking
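
As a toy illustration of tournament-based relative ranking, and not the paper's actual algorithm, the sketch below compares a group of sampled answers pairwise and turns each answer's win rate into a relative reward. The judge_prefers function is a placeholder for whatever pairwise judge the method relies on.

from itertools import combinations

def judge_prefers(answer_a: str, answer_b: str) -> bool:
    # Placeholder: some judge (a model or an environment check) decides
    # which of the two answers is better.
    raise NotImplementedError

def tournament_rewards(answers: list[str]) -> list[float]:
    # Round-robin tournament: every answer plays every other answer once.
    if len(answers) < 2:
        return [0.0] * len(answers)  # no opponents, no relative signal
    wins = [0] * len(answers)
    for i, j in combinations(range(len(answers)), 2):
        if judge_prefers(answers[i], answers[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each answer's normalised win rate is its reward, so the signal is
    # relative to the other answers rather than an absolute score.
    n_opponents = len(answers) - 1
    return [w / n_opponents for w in wins]

This is the contrast the summary draws: rewards come from comparisons within the same batch of answers rather than from one noisy absolute score per answer.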

Over-Searching in Search-Augmented Large Language Models

Intermediate
Roy Xie, Deepak Gopinath et al. · Jan 9 · arXiv

The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.

#search-augmented LLMs · #over-searching · #abstention