This paper teaches a vision-language model to first find objects in real 3D space (not just in flat 2D pictures) and then reason about where things are.
The paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.
Large language models usually pin each word to a fixed position slot, which can waste effort and make it harder for the model to pick out the important parts of a long or noisy text.
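To make "fixed position slots" concrete, here is a minimal sketch of the common setup many Transformers use (all sizes here are made up for illustration): every slot index 0, 1, 2, ... gets its own learned vector, which is simply added to the word's embedding, so the model is tied to those absolute slots.

```python
# Minimal sketch of learned absolute position slots (illustrative sizes only).
import torch
import torch.nn as nn

vocab_size, max_len, dim = 1000, 128, 32
tok_emb = nn.Embedding(vocab_size, dim)   # one vector per word id
pos_emb = nn.Embedding(max_len, dim)      # one vector per position slot 0..127

tokens = torch.randint(0, vocab_size, (1, 10))          # a 10-word "sentence"
positions = torch.arange(tokens.shape[1]).unsqueeze(0)  # slots 0..9, fixed order
x = tok_emb(tokens) + pos_emb(positions)                # (1, 10, 32) model input
```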
LitePT is a new AI backbone for 3D point clouds that uses convolutions in early layers and attention in later layers to be both fast and accurate.
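Here is a minimal, hypothetical sketch of that "convolutions early, attention late" idea in PyTorch; the module names, stage layout, and sizes are illustrative assumptions, not LitePT's actual architecture or code.

```python
# Hybrid point-cloud backbone sketch: cheap local convolutions first,
# global self-attention afterwards (illustrative, not LitePT itself).
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Early stage: local feature mixing with pointwise convolutions
    over a (batch, channels, num_points) tensor."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=1), nn.BatchNorm1d(out_ch), nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=1), nn.BatchNorm1d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class AttentionStage(nn.Module):
    """Late stage: global mixing with multi-head self-attention
    over a (batch, num_points, channels) tensor."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection

class HybridPointBackbone(nn.Module):
    """Convolutions first (fast local geometry), attention last (global context)."""
    def __init__(self, in_ch: int = 3, dim: int = 64):
        super().__init__()
        self.early = ConvStage(in_ch, dim)
        self.late = AttentionStage(dim)

    def forward(self, points):                  # points: (batch, num_points, 3)
        x = self.early(points.transpose(1, 2))  # -> (batch, dim, num_points)
        x = x.transpose(1, 2)                   # -> (batch, num_points, dim)
        return self.late(x)                     # -> (batch, num_points, dim)

feats = HybridPointBackbone()(torch.randn(2, 1024, 3))
print(feats.shape)  # torch.Size([2, 1024, 64])
```

The split reflects the usual trade-off: convolution-style layers are cheap and capture local geometry, while the later attention stage mixes information across the whole cloud.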
GRAPE is a new way to tell Transformers where each word is in a sentence by using neat math moves called group actions.
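To give a flavor of what "group actions" means here, the sketch below uses the simplest example, plane rotations: position p acts on a feature vector by rotating it through an angle proportional to p, so comparing two rotated vectors depends only on how far apart the words are. This is an illustrative rotation-group example (essentially the well-known rotary trick), not GRAPE's actual construction.

```python
# Illustrative position-as-group-action sketch using 2D plane rotations.
import torch

def rotate_by_position(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a position-dependent rotation to each feature vector.

    x:   (num_tokens, dim) with even dim
    pos: (num_tokens,) integer positions
    Each coordinate pair (x1[i], x2[i]) is rotated by the angle pos * freqs[i].
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[:, None].float() * freqs[None, :]                     # (tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Rotations form a group: R(p)^T R(m) = R(m - p), so the score between a query
# rotated by position p and a key rotated by position m depends only on the
# relative distance m - p, not on the absolute slots.
q = torch.randn(6, 8)
k = torch.randn(6, 8)
pos = torch.arange(6)
scores = rotate_by_position(q, pos) @ rotate_by_position(k, pos).T
print(scores.shape)  # torch.Size([6, 6])
```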