Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
Robots often learn a bad habit called the vision shortcut: they guess the task from the scene alone and ignore the language instructions you give them.
Render-of-Thought (RoT) renders the model’s step-by-step reasoning as compact images instead of long text, letting it think faster with fewer tokens.
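To see why rendering text as images can save tokens, here is a back-of-the-envelope sketch comparing the two budgets. All numbers (4 characters per text token, 28x28-pixel patches, ~60 characters per 256-pixel line) are illustrative assumptions, not figures from the paper.

```python
# Hypothetical comparison: token cost of a reasoning trace kept as text
# vs. the same trace rendered into a fixed-width image and consumed as
# vision-patch tokens. Every constant here is an assumption for illustration.
def text_tokens(chars: int, chars_per_token: int = 4) -> int:
    """Rough text-token count at an assumed chars-per-token ratio."""
    return -(-chars // chars_per_token)  # ceiling division

def image_tokens(chars: int, chars_per_line: int = 60, line_px: int = 14,
                 width_px: int = 256, patch_px: int = 28) -> int:
    """Patch-token count if the text were rendered into a narrow image."""
    lines = -(-chars // chars_per_line)
    height_px = lines * line_px
    patches_w = -(-width_px // patch_px)
    patches_h = -(-height_px // patch_px)
    return patches_w * patches_h

trace_len = 2400  # a long chain-of-thought, in characters
print(text_tokens(trace_len), image_tokens(trace_len))  # -> 600 200
```

Under these toy assumptions the rendered trace costs roughly a third as many tokens, which is the kind of saving the summary points at.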
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.
The paper shows how to control accents in text-to-speech (TTS) by mixing simple, linguistics-based sound-change rules with speaker embeddings.
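The rule side of that idea can be sketched as ordinary phoneme rewriting. This is a minimal, hypothetical illustration of "linguistics-based sound-change rules": the rule names, the ARPAbet-style strings, and the regex patterns are all assumptions for demonstration, not the paper's actual rule set, and the speaker-embedding half of the system is not shown.

```python
import re

# Hypothetical accent rules: each maps a rule name to (pattern, replacement)
# pairs applied to a space-separated ARPAbet-style phoneme string.
ACCENT_RULES = {
    "rhotic_drop": [(r"R(?= |$)", "")],   # drop syllable-final /r/ sounds
    "th_stopping": [(r"\bTH\b", "T")],    # realize /th/ as /t/
}

def apply_accent(phonemes: str, accent: str) -> str:
    """Apply every sound-change rule registered for the chosen accent."""
    for pattern, repl in ACCENT_RULES.get(accent, []):
        phonemes = re.sub(pattern, repl, phonemes)
    return re.sub(r"\s+", " ", phonemes).strip()  # tidy leftover spaces

print(apply_accent("K AA R", "rhotic_drop"))   # -> "K AA"
print(apply_accent("TH IH NG K", "th_stopping"))  # -> "T IH NG K"
```

In the paper's setup, a rewritten phoneme string like this would then be synthesized conditioned on a speaker embedding, so accent and voice identity can be controlled independently.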
This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.
Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first getting the object’s shape and then figuring out how every part moves.
LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and document scans, turning them into clean, well-ordered text without fragile multi-step OCR pipelines.
OmniTransfer is a single system that learns from a whole reference video, not just one image, so it can copy how things look (identity and style) and how they move (motion, camera, effects).
The paper asks a simple question: Which step-by-step explanations from a teacher model actually help a student model learn to reason better?