Papers (7)

#Benchmarking

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

This paper builds a new test called Ref-Adv to check if AI can truly match tricky sentences to the right thing in a picture.

#Referring Expression Comprehension #Visual Grounding #Multimodal Large Language Models

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Intermediate

Xiangyi Li, Wenbo Chen et al. · Feb 13 · arXiv

SkillsBench is a large benchmark that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.

#Agent Skills #LLM agents #Benchmarking

GENIUS: Generative Fluid Intelligence Evaluation Suite

Intermediate

Ruichuan An, Sihan Yang et al. · Feb 11 · arXiv

The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.

#Generative Fluid Intelligence #Unified Multimodal Models #Interleaved Multimodal Context

PhyCritic: Multimodal Critic Models for Physical AI

Intermediate

Tianyi Xiong, Shihao Wang et al. · Feb 11 · arXiv

PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.

#Physical AI #Multimodal critic #Self-referential training

ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

Intermediate

António Loison, Quentin Macé et al. · Jan 13 · arXiv

ViDoRe V3 is a large, carefully built benchmark that checks how well AI systems find and use information from both text and visual elements (like tables and charts) in real documents.

#Retrieval-Augmented Generation #Multimodal RAG #Visual Document Understanding

MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics

Intermediate

Zhuofan Shi, Hubao A et al. · Jan 5 · arXiv

MDAgent2 is a specialized assistant built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.

#Molecular Dynamics #LAMMPS #Code Generation

ModelTables: A Corpus of Tables about Models

Intermediate

Zhengyuan Dong, Victor Zhong et al. · Dec 18 · arXiv

ModelTables is a large, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.

#Model Lake #Model Cards #Scientific Tables
