Legal RAG Bench is a new, end-to-end test that checks how well legal AI systems find information and use it to answer tough, real-world legal questions.
PhotoBench is a new test built from real people’s photo albums to see if AI can find photos based on what you truly mean, not just what is visible in the picture.
SciDER is a team of smart AI helpers that can run almost the whole research process: think of ideas, read raw data, write and run code, and improve itself with feedback.
This paper introduces a new way to make a language model pay extra attention to the exact words you highlight in a prompt.
CHIMERA is a small (about 9,000 examples) but very carefully built synthetic dataset that teaches AI to solve hard problems step by step.
Recurrent neural networks (RNNs) are fast but forgetful because they squeeze everything they’ve seen into a tiny, fixed memory.
Humans often make guesses about the world that are likely but not certain, and this paper studies how humans and AI compare at doing that.
This paper builds a giant, automatically generated video library called SVG2 that describes who is in each video, what they look like, and how they interact over time.
The paper tackles a new integrity problem in science: large language models sometimes invent realistic-looking citations that do not exist.
SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.
MediX-R1 teaches medical AI models to give clear, free-form answers (not just A, B, C, or D) and to explain their thinking.
The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?