Most reinforcement learning agents only get a simple pass/fail reward, which hides how good or bad their attempts really were.
This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.
This paper upgrades a small but mighty vision-language model called PaddleOCR-VL-1.5 to read and understand real-world, messy documents better than any model before it.
ConceptMoE teaches a language model to group easy, similar tokens into bigger ideas called concepts, so it spends more brainpower on the hard parts.
This paper shows that many reasoning failures in AI are caused by just a few distracting words in the prompt, not because the problems are too hard.
This paper introduces Foundation-Sec-8B-Reasoning, a small (8 billion parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.
DeepSearchQA is a new test with 900 real-world style questions that checks if AI agents can find complete lists of answers, not just one fact.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
TSRBench is a giant test that checks if AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.
The paper finds almost 300 accepted NLP papers (mostly in 2025) that include at least one fake or non-existent reference, which the authors call a HalluCitation.
LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.