The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
This paper teaches multimodal AI models not just to read pictures but also to imagine and think with pictures inside their heads.
VisionTrim makes picture-and-text AI models run much faster by keeping only the most useful visual pieces (tokens) and smartly merging the rest.
Innovator-VL is a new multimodal AI model that understands both pictures and text to help solve science problems without needing mountains of special data.
SimpleSeg teaches a multimodal language model to outline objects by writing down a list of points, like connecting the dots, instead of using a special segmentation decoder.
AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.
OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.
Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.
This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times—all in one go.
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
JavisGPT is a single AI that can both understand sounding videos (audio + video together) and create new ones that stay in sync.