COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
This paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.
This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.
Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) in videos to tell if they are real or AI-generated, and it explains its decision with times and places in the video.
This paper is about making image and video generators turn the words you type into the right pictures and videos more reliably.
MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.
ShowTable is a new way for AI to turn a data table into a beautiful, accurate infographic using a think–make–check–fix loop.
DentalGPT is a special AI that looks at dental images and text together and explains what it sees like a junior dentist.
The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.