This paper builds a medical image segmentation system that uses both pictures (like X-rays) and words (short clinical text) at the same time.
JavisGPT is a single AI model that can both understand sounding videos (audio and video together) and create new ones in which the audio and video stay in sync.