Long tasks trip up most AI systems, which lose track of their goals and make small mistakes that snowball over many steps.
WildGraphBench is a new test that checks how well GraphRAG systems find and combine facts from messy, real-world web pages.
The paper asks AI to hunt for insights in big databases without being told exact questions, like a curious scientist instead of a test-taker.
Shampoo is a smart optimizer that can train models better than AdamW, but it has been slow in practice because it must compute expensive inverse matrix roots.
Large Vision-Language Models (LVLMs) are great with one picture but get confused when you give them several, often mixing details from different images.
VIBE is a new test that checks how well image-editing AI models follow visual instructions like arrows, boxes, and sketches—not just text.
The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.
The paper tests a simple but bold idea: show code to AI as pictures instead of plain text, then shrink those pictures to save tokens and time.
Mind-Brush turns image generation from a one-step 'read the prompt and draw' into a multi-step 'think, research, and create' process.
ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
TRIP-Bench is a new test that checks whether AI travel agents can plan real trips over many chat turns while following strict rules and adapting to changing user requests.
CoDiQ is a recipe for making hard-but-solvable math and coding questions on purpose, and it controls how hard they get while you generate them.