This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times, all in one go.
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
JavisGPT is a single AI that can both understand sounding videos (audio + video together) and also create new ones that stay in sync.
Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) in videos to tell if they are real or AI-generated, and it explains its decision with times and places in the video.
This paper is about making image and video generators turn the words you type into the right pictures and videos more reliably.
ShowTable is a new way for AI to turn a data table into a beautiful, accurate infographic using a think–make–check–fix loop.
DentalGPT is a specialized AI that looks at dental images and text together and explains what it sees, much as a junior dentist would.
EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.
COOPER is a single AI model that both “looks better” (perceives depth and object boundaries) and “thinks smarter” (reasons step by step) to answer spatial questions about images.