Papers (10)

#Benchmarking

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

This paper builds a new test called Ref-Adv to check if AI can truly match tricky sentences to the right thing in a picture.

#Referring Expression Comprehension #Visual Grounding #Multimodal Large Language Models

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Intermediate

Xiangyi Li, Wenbo Chen et al. · Feb 13 · arXiv

SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.

#Agent Skills #LLM agents #Benchmarking

Adapting Vision-Language Models for E-commerce Understanding at Scale

Beginner

Matteo Nulli, Vladimir Orshulevich et al. · Feb 12 · arXiv

This paper shows a simple, repeatable way to teach general Vision-Language Models (VLMs) to understand e-commerce items much better without forgetting their general skills.

#Vision-Language Models #E-commerce adaptation #Attribute extraction

GENIUS: Generative Fluid Intelligence Evaluation Suite

Intermediate

Ruichuan An, Sihan Yang et al. · Feb 11 · arXiv

The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.

#Generative Fluid Intelligence #Unified Multimodal Models #Interleaved Multimodal Context

PhyCritic: Multimodal Critic Models for Physical AI

Intermediate

Tianyi Xiong, Shihao Wang et al. · Feb 11 · arXiv

PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.

#Physical AI #Multimodal critic #Self-referential training

ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

Intermediate

António Loison, Quentin Macé et al. · Jan 13 · arXiv

ViDoRe V3 is a big, carefully built test that checks how well AI systems find and use information from both text and pictures (like tables and charts) in real documents.

#Retrieval-Augmented Generation #Multimodal RAG #Visual Document Understanding

MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics

Intermediate

Zhuofan Shi, Hubao A et al. · Jan 5 · arXiv

MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.

#Molecular Dynamics #LAMMPS #Code Generation

ModelTables: A Corpus of Tables about Models

Intermediate

Zhengyuan Dong, Victor Zhong et al. · Dec 18 · arXiv

ModelTables is a giant, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.

#Model Lake #Model Cards #Scientific Tables

HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Beginner

Dan Ben-Ami, Gabriele Serussi et al. · Dec 16 · arXiv

HERBench is a new test that checks if video AI models can combine several clues spread across time, not just guess from one frame or language priors.

#Video Question Answering #Video-LLM #Multi-Evidence Integration

AI & Human Co-Improvement for Safer Co-Superintelligence

Beginner

Jason Weston, Jakob Foerster · Dec 5 · arXiv

This paper argues that the fastest and safest path to super-smart AI is for humans and AIs to improve together, not for AI to improve alone.

#Co-improvement #Human-AI collaboration #Co-superintelligence
