Papers18

#multimodal large language model

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.

#multimodal large language model#visual question answering#vision deep research

Toward Cognitive Supersensing in Multimodal Large Language Model

Intermediate

Boyi Li, Yifan Shen et al.Feb 2arXiv

This paper teaches multimodal AI models to not just read pictures but to also imagine and think with pictures inside their heads.

#multimodal large language model#visual cognition#latent visual imagery

VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Intermediate

Hanxun Yu, Wentong Li et al.Jan 30arXiv

VisionTrim makes picture-and-text AI models run much faster by keeping only the most useful visual pieces (tokens) and smartly merging the rest.

#vision token compression#training-free acceleration#multimodal large language model

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Intermediate

Zichen Wen, Boxue Yang et al.Jan 27arXiv

Innovator-VL is a new multimodal AI model that understands both pictures and text to help solve science problems without needing mountains of special data.

#Innovator-VL#multimodal large language model#scientific reasoning

Towards Pixel-Level VLM Perception via Simple Points Prediction

Intermediate

Tianhui Song, Haoyu Lu et al.Jan 27arXiv

SimpleSeg teaches a multimodal language model to outline objects by writing down a list of points, like connecting the dots, instead of using a special segmentation decoder.

#SimpleSeg#multimodal large language model#decoder-free segmentation

AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Intermediate

Dongjie Cheng, Ruifeng Yuan et al.Jan 25arXiv

AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.

#autoregressive modeling#multimodal large language model#any-to-any generation

SAMTok: Representing Any Mask with Two Words

Intermediate

Yikang Zhou, Tao Zhang et al.Jan 22arXiv

SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.

#SAMTok#mask tokenizer#residual vector quantization

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Intermediate

Sheng-Yu Huang, Jaesung Choe et al.Jan 14arXiv

OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.

#OpenVoxel#Sparse Voxel Rasterization#training-free 3D understanding

Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate

Sashuai Zhou, Qiang Zhou et al.Jan 6arXiv

Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.

#reasoning-aware image generation#structured planning#edit-only prompt

MOSS Transcribe Diarize Technical Report

Beginner

MOSI. AI, : et al.Jan 4arXiv

This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times—all in one go.

#speaker diarization#speech recognition#end-to-end SATS

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Intermediate

Zhe Huang, Hao Wen et al.Dec 30arXiv

Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.

#multimodal large language model#video understanding#visual hallucination

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Intermediate

Kai Liu, Jungang Li et al.Dec 28arXiv

JavisGPT is a single AI that can both understand sounding videos (audio + video together) and also create new ones that stay in sync.

#multimodal large language model#audio-video synchronization#SyncFusion

1 2