How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)

#benchmarking

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Beginner
Zhiheng Song, Jingshuai Zhang et al. · Feb 26 · arXiv

MobilityBench is a big, carefully built test that checks how well AI helpers can plan real-world routes using natural language and map tools.

#MobilityBench #route-planning agents #large language models

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Beginner
Jiacheng Hou, Yining Sun et al. · Feb 10 · arXiv

Modern image editors can now follow visual prompts like arrows and scribbles, which opens a new way for attackers to hide harmful instructions inside images.

#vision-centric jailbreak #image editing safety #visual prompts

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Beginner
Xianyang Liu, Shangding Gu et al. · Feb 5 · arXiv

AgenticPay is a safe playground where AI agents practice buying and selling by talking, not just by typing numbers.

#multi-agent negotiation #language-mediated bargaining #LLM agents

PaperBanana: Automating Academic Illustration for AI Scientists

Beginner
Dawei Zhu, Rui Meng et al. · Jan 30 · arXiv

PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.

#academic illustration #methodology diagrams #visual language models

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Beginner
Nikita Gupta, Riju Chatterjee et al. · Jan 28 · arXiv

DeepSearchQA is a new test with 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just one fact.

#DeepSearchQA #agentic information retrieval #systematic collation

MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Beginner
Zecheng Tang, Baibei Ji et al. · Jan 17 · arXiv

This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.

#reward models #long-term memory #long-context reasoning

Video-Browser: Towards Agentic Open-web Video Browsing

Beginner
Zhengyang Liang, Yan Shu et al. · Dec 28 · arXiv

The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.

#agentic video browsing #pyramidal perception #video understanding

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation #multimodal evaluation #cross-modal alignment

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Beginner
Aileen Cheng, Alon Jacovi et al. · Dec 11 · arXiv

The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.

#LLM factuality #benchmarking #multimodal evaluation