How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (32)


Enhancing Spatial Understanding in Image Generation via Reward Modeling

Intermediate
Zhenyu Tang, Chaoran Feng et al. · Feb 27 · arXiv

This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.

#spatial reasoning · #reward modeling · #preference learning
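The reward-modeling idea behind this paper can be sketched in a few lines: score a generated layout against the spatial relation stated in the prompt, then use pairwise comparisons as a preference signal. Everything below (function names, the relation format, the binary scoring) is an invented illustration of the general technique, not the paper's actual reward model.

```python
def spatial_reward(layout, relation):
    """Return 1.0 if the layout satisfies the relation, else 0.0.

    layout   -- dict mapping object name -> (x, y) center coordinates
    relation -- tuple (obj_a, predicate, obj_b), e.g. ("cat", "left_of", "dog")
    """
    a, pred, b = relation
    ax, ay = layout[a]
    bx, by = layout[b]
    checks = {
        "left_of":  ax < bx,
        "right_of": ax > bx,
        "above":    ay < by,   # image coordinates: y grows downward
        "below":    ay > by,
    }
    return 1.0 if checks[pred] else 0.0

def prefer(layout_win, layout_lose, relation):
    """Pairwise preference label: does layout_win score at least as high?"""
    return spatial_reward(layout_win, relation) >= spatial_reward(layout_lose, relation)
```

In practice the reward model would be a learned network rather than a rule table, but the preference-learning loop it feeds is the same shape.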

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Intermediate
Kai Liu, Yanhao Zheng et al. · Feb 22 · arXiv

JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.

#joint audio-video generation · #multimodal diffusion transformer · #modality-specific mixture-of-experts

Computer-Using World Model

Intermediate
Yiming Guan, Rui Yu et al. · Feb 19 · arXiv

The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.

#world model · #GUI agent · #desktop automation
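The "imagine before acting" loop a world model enables can be sketched minimally: predict the next screen state for each candidate action, then pick the action whose imagined outcome matches the goal. The lookup table below stands in for a learned dynamics model; all state and action names are illustrative, not from the paper.

```python
# Toy transition table playing the role of a learned world model.
TRANSITIONS = {
    ("blank_doc", "click_bold"): "blank_doc_bold_on",
    ("blank_doc", "type_hello"): "doc_with_hello",
    ("doc_with_hello", "click_bold"): "doc_with_hello_bold_on",
}

def imagine(state, action):
    """World-model stub: predicted next state (None if unknown)."""
    return TRANSITIONS.get((state, action))

def plan_one_step(state, candidate_actions, goal_state):
    """Try each action in imagination; return the first that reaches the goal."""
    for action in candidate_actions:
        if imagine(state, action) == goal_state:
            return action
    return None
```

The key point is that `imagine` is called instead of clicking for real, so bad actions are filtered out before they touch the application.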

World Models for Policy Refinement in StarCraft II

Intermediate
Yixin Zhang, Ziyi Wang et al. · Feb 16 · arXiv

The paper builds StarWM, a ‘world model’ that lets a StarCraft II agent imagine what will happen a few seconds after it takes an action.

#world model · #action-conditioned dynamics · #StarCraft II

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Beginner
Dianyi Wang, Ruihang Li et al. · Feb 12 · arXiv

DeepGen 1.0 is a small 5B-parameter model that can both make new images and smartly edit existing ones from text instructions.

#Unified multimodal model · #Stacked Channel Bridging · #Think tokens

VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Intermediate
Issar Tzachor, Dvir Samuel et al. · Feb 8 · arXiv

VidVec shows that video-capable multimodal language models already hide strong matching signals between videos and sentences inside their middle layers.

#video–text retrieval · #multimodal large language models · #intermediate layer embeddings
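The retrieval step this finding suggests is simple once the embeddings exist: take hidden states from a middle layer of the model (rather than the last one), and rank videos for a text query by cosine similarity. The vectors below are made up for illustration; in practice they would come from the model's intermediate layers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_videos(text_emb, video_embs):
    """Return video ids sorted from most to least similar to the text embedding."""
    scored = [(vid, cosine(text_emb, emb)) for vid, emb in video_embs.items()]
    return [vid for vid, _ in sorted(scored, key=lambda x: -x[1])]
```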

HY3D-Bench: Generation of 3D Assets

Intermediate
Team Hunyuan3D et al. · Feb 3 · arXiv

HY3D-Bench is a complete, open-source “starter kit” for making and studying high-quality 3D objects.

#HY3D-Bench · #watertight meshes · #part-level decomposition

DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

Intermediate
Jiaming Zhou, Xuxin Cheng et al. · Jan 30 · arXiv

DIFFA-2 is a new audio AI that listens to speech, sounds, and music and answers questions about them using a diffusion-style language model instead of the usual step-by-step (autoregressive) method.

#Diffusion language models · #Audio understanding · #Large audio language model

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Intermediate
Seanie Lee, Sangwoo Park et al. · Jan 30 · arXiv

Large reasoning models got very good at thinking step-by-step, but that sometimes made them too eager to follow harmful instructions; THINKSAFE counters this by having the model generate its own safety-alignment data to steer it back toward refusing unsafe requests.

#THINKSAFE · #self-generated safety alignment · #refusal steering

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Intermediate
Mingshuang Luo, Shuang Liang et al. · Jan 29 · arXiv

DreamActor-M2 is a new way to make a still picture move by copying motion from a video while keeping the character’s look the same.

#character image animation · #spatiotemporal in-context learning · #video diffusion

Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

Intermediate
Chengzu Li, Zanyi Wang et al. · Jan 28 · arXiv

This paper shows that making short videos can help AI plan and reason in pictures better than writing out steps in text.

#video reasoning · #visual planning · #test-time scaling

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Intermediate
Tongcheng Fang, Hanling Zhang et al. · Jan 23 · arXiv

Videos are made of very long lists of tokens, and regular attention compares every pair of tokens, which is slow and expensive; SALAD reaches high-sparsity attention via efficient linear-attention tuning, cutting that cost for video diffusion transformers.

#SALAD · #sparse attention · #linear attention
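The linear-attention trick this line of work builds on: with a positive feature map phi, attention can be computed as phi(Q) @ (phi(K)^T @ V), whose cost grows linearly in sequence length instead of quadratically. This is a generic, unoptimized sketch of plain linear attention, not SALAD's actual kernels or tuning procedure.

```python
def phi(x):
    # Simple positive feature map (relu(x) + 1); elu(x) + 1 is also common.
    return [max(v, 0.0) + 1.0 for v in x]

def linear_attention(Q, K, V):
    """Q, K: n x d lists; V: n x m lists. Returns n x m outputs.

    Accumulates phi(K)^T V (d x m) and the key sum (d) in one pass over the
    sequence, so total cost is O(n * d * m) rather than O(n^2).
    """
    n, d, m = len(Q), len(Q[0]), len(V[0])
    KV = [[0.0] * m for _ in range(d)]
    ksum = [0.0] * d
    for i in range(n):
        k = phi(K[i])
        for a in range(d):
            ksum[a] += k[a]
            for b in range(m):
                KV[a][b] += k[a] * V[i][b]
    out = []
    for i in range(n):
        q = phi(Q[i])
        denom = sum(q[a] * ksum[a] for a in range(d))
        out.append([sum(q[a] * KV[a][b] for a in range(d)) / denom for b in range(m)])
    return out
```

Each output row is still a normalized weighted average of the values, which is why the rows below sum to 1 when V's rows are one-hot.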