SpatialTree is a new, four-level "ability tree" that tests how multimodal AI models (ones that can both see and read) handle space, from basic seeing all the way up to acting in the world.
The paper turns video avatars from passive puppets into active doers that can plan, act, check their own work, and fix mistakes over many steps.
The paper shows that big sequence models (like transformers) quietly learn longer-horizon goals inside their hidden activations, even though they are trained to predict only one step at a time.
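A common way to test a claim like this is a linear probe: if a simple classifier can read a far-ahead goal out of the hidden activations, the model must already be encoding it. The sketch below shows that generic recipe with placeholder arrays standing in for real activations and labels; it is not the paper's actual setup.

```python
# Minimal linear-probe sketch: train a simple classifier on hidden activations
# taken at step t to predict an outcome that only happens many steps later.
# The arrays below are random stand-ins; swap in real model activations/labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

hidden_states_at_t = rng.normal(size=(1000, 64))     # [num_sequences, hidden_dim]
future_goal_labels = rng.integers(0, 2, size=1000)   # goal reached many steps later

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states_at_t, future_goal_labels, test_size=0.2, random_state=0
)

# The probe is deliberately simple (linear), so high held-out accuracy would mean
# the goal information is readily present in the activations, not invented by the probe.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out sequences:", probe.score(X_test, y_test))
```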
Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).
Search is not the same as research; real research needs planning, checking many sources, fixing mistakes, and writing a clear report.
Big vision-language models are super smart, but they are too large to run on phones and other small devices.
SlideTailor is an AI system that turns a scientific paper into personalized presentation slides that match what a specific user likes.
Large language models can say things that sound right but aren’t supported by the given document; this is called a faithfulness hallucination.
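One widely used way to flag this kind of hallucination (not necessarily the paper's method) is to treat the source document as a premise and each generated sentence as a hypothesis, then ask an off-the-shelf natural language inference (NLI) model whether the document entails it. A hedged sketch using a public NLI checkpoint:

```python
# Generic faithfulness check: does the source document entail the generated claim?
# Uses a public NLI model as a stand-in; the paper's own detector may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_score(document: str, claim: str) -> float:
    """Probability that `document` entails `claim`, according to the NLI model."""
    inputs = tokenizer(document, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits, dim=-1)
    # Look up which output index this checkpoint uses for "entailment".
    entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[entail_idx].item()

doc = "The study enrolled 120 patients and ran for six months."
claim = "The study lasted two years."
if entailment_score(doc, claim) < 0.5:
    print("possible faithfulness hallucination:", claim)
```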
This paper builds DiRL, a fast and careful way to finish training diffusion language models so they reason better.
This paper adds a tiny but powerful step called Early Knowledge Alignment (EKA) to multi-step retrieval systems so the model takes a quick, smart look at relevant information before it starts planning.
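The general shape of that "early look" idea can be sketched in a few lines: run one cheap retrieval pass on the raw question before the model writes its plan, so the plan is grounded in real evidence from the start. All function names below (retrieve, llm, answer_with_early_look) are hypothetical placeholders, not EKA's actual API.

```python
# Illustrative agent loop with an early retrieval step before planning.
# `retrieve` and `llm` are caller-supplied stand-ins for a real retriever and model.
from typing import Callable, List

def answer_with_early_look(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # e.g. BM25 or a dense retriever
    llm: Callable[[str], str],                   # any text-in / text-out model call
    max_steps: int = 4,
) -> str:
    # Early knowledge step: a quick retrieval on the raw question, *before* planning.
    early_context = "\n".join(retrieve(question, 3))

    # The plan is written with some real evidence already in view.
    plan = llm(f"Question: {question}\nKnown so far:\n{early_context}\n"
               f"Write a short step-by-step research plan.")

    notes = [early_context]
    for step in range(max_steps):
        query = llm(f"Plan:\n{plan}\nNotes so far:\n{chr(10).join(notes)}\n"
                    f"Next search query for step {step + 1} (or say DONE):")
        if "DONE" in query:
            break
        notes.append("\n".join(retrieve(query, 3)))

    return llm(f"Question: {question}\nEvidence:\n{chr(10).join(notes)}\nFinal answer:")
```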
Memory-T1 teaches chatty AI agents to keep track of when things happened across many conversations.
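A tiny sketch of what time-aware memory means in practice: store each remembered fact with the timestamp of the conversation it came from, so later questions like "what did I tell you last week?" can filter by time. This is purely illustrative; Memory-T1's actual memory design may be very different.

```python
# Toy time-aware conversational memory: facts are stored with the time they
# were said, and recall can be restricted to a time window.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class MemoryItem:
    when: datetime   # when the user said it, not when it was retrieved
    text: str

class TimeAwareMemory:
    def __init__(self) -> None:
        self.items: List[MemoryItem] = []

    def remember(self, when: datetime, text: str) -> None:
        self.items.append(MemoryItem(when, text))

    def recall_between(self, start: datetime, end: datetime) -> List[str]:
        return [m.text for m in self.items if start <= m.when <= end]

memory = TimeAwareMemory()
memory.remember(datetime(2024, 5, 1), "User adopted a puppy named Miso.")
memory.remember(datetime(2024, 5, 20), "User started a new job at a bakery.")

last_week = datetime(2024, 5, 25) - timedelta(days=7)
print(memory.recall_between(last_week, datetime(2024, 5, 25)))
# -> ['User started a new job at a bakery.']
```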