DeepSeek-OCR 2 teaches a computer to “read” pictures of documents in a smarter order, more like how people read.
LingBot-World is an open-source world model that turns video generation into an interactive, real-time simulator.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
This paper finds that about 1 in 4 attention heads in autoregressive video diffusion models attend almost entirely to the current frame and largely ignore past frames, wasting memory and time.
OmegaUse is a new AI that can use phones and computers by looking at screenshots and deciding where to click, type, or scroll, much like a careful human user.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
DenseGRPO teaches image models using lots of small, timely rewards instead of one final score at the end.
SPARK is a new way to train AI agents that saves compute by exploring more only at the most important moments.
VERGE is a teamwork system where an AI writer (an LLM) works with a strict math checker (an SMT solver) to make answers both smart and logically sound.
This paper shows a simple way for AI models to keep learning new things without forgetting what they already know.
Big AI models used to get better by getting wider or reading longer texts, but those tricks are slowing down.
The paper argues that making and using pictures inside an AI’s thinking can help it reason more like humans, especially on real-world physical and spatial problems.