Large Multimodal Models (LMMs) are good at reading text and interpreting images, but they usually do most of their reasoning in words alone, which limits deep visual reasoning.
StoryMem is a new way to make minute‑long, multi‑shot videos that keep the same characters, places, and style across many clips.
ReCo is a new way to edit videos by simply describing the desired change in words, with no extra masks needed.
InsertAnywhere is a two-stage system that lets you add a new object into any video so it looks like it was always there.
AniX is a system that lets you place any character into any 3D world and control them with plain language, like “run forward” or “play a guitar.”
EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.
Most image-similarity tools only notice surface appearance (color, shape, object class) and miss the deeper, human-like connections between images.
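To make that limitation concrete, here is a minimal sketch (my own illustration, not taken from any of the papers above) of a purely appearance-based similarity measure: comparing color histograms with cosine similarity. Two images that share a dominant color score as near-identical even when they depict unrelated things, which is exactly the kind of shallow match the summary describes.

```python
import numpy as np

def color_histogram(img, bins=8):
    # img: H x W x 3 uint8 array; build a joint RGB histogram and normalize it
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def cosine(a, b):
    # Standard cosine similarity between two flat vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two mostly-red images (say, an apple and a stop sign) look alike to a
# color histogram, while a blue image does not -- regardless of meaning.
red_a = np.full((4, 4, 3), (200, 10, 10), dtype=np.uint8)
red_b = np.full((4, 4, 3), (210, 15, 5), dtype=np.uint8)
blue  = np.full((4, 4, 3), (10, 10, 200), dtype=np.uint8)

print(cosine(color_histogram(red_a), color_histogram(red_b)))  # high (appearance match)
print(cosine(color_histogram(red_a), color_histogram(blue)))   # low  (appearance mismatch)
```

A measure like this has no notion of function, context, or story, so an apple and a stop sign can be "more similar" than an apple and an orange; capturing those human-like connections requires representations beyond raw appearance.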