ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.
FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.