This paper builds a big, fair playground (a benchmark) to test many EEG foundation models side by side under the same rules.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
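For context, here is a minimal PyTorch sketch of standard FSQ (Mentzer et al., 2023), simplified to odd level counts; the `bound` activation is the spot where a one-line fix of the kind described would go. The paper's actual change is not reproduced here.

```python
import torch

def bound(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Bounded activation: squash each channel into (-(L-1)/2, (L-1)/2).
    # This is the activation the summary says the paper changes by one line.
    # (Simplified to odd level counts; even counts need an extra half-step
    # offset in the full FSQ formulation.)
    return torch.tanh(z) * (levels - 1) / 2

def fsq_quantize(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Round each channel to its nearest integer level; the straight-through
    # estimator (STE) lets gradients skip the non-differentiable round().
    b = bound(z, levels)
    return b + (torch.round(b) - b).detach()

# Usage: 4 latent channels with 7, 5, 5, 5 levels -> 7*5*5*5 = 875 codes.
levels = torch.tensor([7.0, 5.0, 5.0, 5.0])
z = torch.randn(2, 16, 4)          # (batch, tokens, channels)
codes = fsq_quantize(z, levels)    # integer-valued, differentiable in z
```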
FutureOmni is the first benchmark that tests if multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.
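To make "step by step" concrete, here is a hypothetical Python sketch of how such a benchmark might grade a process reward model (PRM) against gold per-step labels; the trajectory schema and `prm_judge` interface are illustrative assumptions, not ToolPRMBench's actual API.

```python
# A PRM labels each tool call in a trajectory as correct or not; the
# benchmark scores the PRM by agreement with gold per-step labels.
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (tool_name, tool_arguments) -- assumed schema

def step_accuracy(
    trajectory: List[Step],
    gold_labels: List[bool],                        # which steps were right
    prm_judge: Callable[[List[Step], int], bool],   # PRM's verdict on step i
) -> float:
    # Fraction of steps where the PRM's verdict matches the gold label.
    hits = sum(
        prm_judge(trajectory, i) == gold
        for i, gold in enumerate(gold_labels)
    )
    return hits / len(gold_labels)

# Toy usage with a trivial judge that approves every step:
traj = [("search", "q=weather"), ("calculator", "2+2")]
print(step_accuracy(traj, [True, False], lambda t, i: True))  # 0.5
```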
This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
The paper shows that language models with a search tool often look up far more than they need, which wastes compute and can even make answers worse, especially on unanswerable questions.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
The paper tackles how AI agents can do genuine research on the open web when the answers are hidden inside long, messy videos rather than plain text.
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.
This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.