How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (5)

#rubric-based evaluation

CL-bench: A Benchmark for Context Learning

Beginner
Shihan Dou, Ming Zhang et al. | Feb 3 | arXiv

CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.

#context learning, #benchmark, #rubric-based evaluation

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate
Keyu Li, Junhao Shi et al. | Jan 16 | arXiv

AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.

#autonomous agents, #long-horizon evaluation, #agent benchmarking

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Intermediate
Yifei Shen, Yilun Zhao et al. | Jan 14 | arXiv

This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.

#clinical text-to-SQL, #EHR, #MIMIC-IV

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Intermediate
Hao Bai, Alexey Taymanov et al. | Jan 5 | arXiv

WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.

#WebGym, #visual web agents, #vision-language models

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Intermediate
Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan et al. | Dec 18 | arXiv

This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.

#long-form video understanding, #multimodal reasoning, #audio-visual-speech alignment