Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.
Big picture: Vision-language models look at hundreds of image pieces (tokens), which makes them slow and prone to confident mistakes called hallucinations.
The paper asks a simple question: what must a vision model’s internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?
This paper teaches an AI to segment any object you name (open-vocabulary) much better by adding a few example pictures with pixel labels and smart retrieval.
The paper introduces CHAIN, a hands-on 3D playground that tests if AI can not only see objects but also plan and act under real physics.
Robots learn better when they get small hints at every step instead of only a final thumbs-up or thumbs-down.
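The step-wise-hint idea can be sketched with a toy example. Everything here is illustrative, not from the paper: a made-up three-step task where a sparse reward gives one score at the end, while a dense reward scores every step and so pinpoints where the agent went wrong.

```python
# Hypothetical toy task: produce the action sequence TARGET, one step at a time.
TARGET = [1, 0, 1]

def terminal_reward(actions):
    # Sparse signal: a single thumbs-up/thumbs-down after the whole episode.
    return 1.0 if actions == TARGET else 0.0

def dense_rewards(actions):
    # Dense signal: a small hint after every step (+1 right, -1 wrong).
    return [1.0 if a == t else -1.0 for a, t in zip(actions, TARGET)]

print(terminal_reward([1, 0, 0]))  # 0.0 — says only "you failed", not where
print(dense_rewards([1, 0, 0]))    # [1.0, 1.0, -1.0] — pinpoints step 3
```

The sparse score is identical for every wrong sequence, so credit assignment is hard; the dense signal localizes the error to the step that caused it.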
The paper argues that the fairest way to measure how generally smart an AI is would be to see how quickly and well it learns many different human-made games, just as a person would with the same time and practice.
OCR is like reading a page exactly as it is, and that strictness makes it perfect for fast, parallel generation.
ExStrucTiny is a new test (benchmark) that checks if AI can pull many connected facts from all kinds of documents and neatly put them into JSON, even when the question style and schema change.
Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.
NarraScore turns a video's changing story into a matching soundtrack by using emotion as the bridge.
The paper finds a hidden symmetry inside GRPO’s advantage calculation that accidentally stops models from exploring new good answers and from paying the right attention to easy versus hard problems at the right times.
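For context, the quantity the paper analyzes is GRPO's group-relative advantage: each sampled answer's reward is normalized by the mean and standard deviation of its group. The sketch below shows only that standard normalization, not the paper's symmetry analysis; the helper name and example rewards are illustrative.

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO advantage: (reward - group mean) / group std.

    With binary rewards, note how the normalized values depend only on the
    group's success rate, not on the problem's difficulty directly.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: uniform groups have std 0
    return [(r - mu) / sigma for r in rewards]

# A hard problem: one success out of four samples.
print(grpo_advantages([0, 0, 0, 1]))
# A saturated problem: every sample succeeds, so every advantage is 0.0
# and the group contributes no gradient at all.
print(grpo_advantages([1, 1, 1, 1]))
```

Advantages within a group always sum to zero, which is the kind of built-in balance between positive and negative updates that the paper's symmetry argument examines.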