People often pick CLIP-like models for image labeling, but this paper shows that large multimodal models (LMMs) can be just as good, or even better, when you give them a few examples in the prompt (in-context learning).
SimToolReal teaches a robot hand to use many different tools by practicing in simulation and then working in the real world without extra training.
VLingNav is a robot navigation system that sees, reads instructions, and acts, while deciding when to think hard and when to just move.
NitroGen is a vision-to-action AI that learns to play many video games by watching 40,000 hours of gameplay videos with on-screen controller overlays, spanning over 1,000 titles.
FINERWEB is a new, carefully built dataset pipeline for teaching computers to spot names of people, places, and more (named entity recognition) across 91 languages and 25 writing systems.