Papers (943)

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Kaixin Ding, Yang Zhou et al., Dec 18, arXiv

Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.

#meta-gradient, #data selection, #text-to-image

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Intermediate

Shuyuan Tu, Yueming Pan et al., Dec 18, arXiv

FlashPortrait makes talking-portrait videos that keep a person’s identity steady for as long as you want—minutes or even hours.

#FlashPortrait, #portrait animation, #identity consistency

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Intermediate

Yushi Hu, Reyhane Askari-Hemmat et al., Dec 18, arXiv

Reward models are like scorekeepers that tell AI which answers people like more, and this paper builds the first big test for scorekeepers that judge both pictures and words together.

#Multimodal reward model, #Benchmarking omni models, #Interleaved text-image evaluation

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Intermediate

Tianyuan Qu, Lei Ke et al., Dec 18, arXiv

RePlan is a plan-then-execute system that first figures out exactly where to edit in a picture and then makes clean changes there.

#instruction-based image editing, #vision–language model (VLM), #diffusion model

Meta-RL Induces Exploration in Language Agents

Intermediate

Yulun Jiang, Liangze Jiang et al., Dec 18, arXiv

This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.

#Meta-Reinforcement Learning, #Language Agents, #Exploration vs Exploitation

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Intermediate

Xiaopeng Lin, Shijie Lian et al., Dec 18, arXiv

Robots learn best from what they would actually see, which is a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.

#egocentric vision, #first-person video, #vision-language model

Kling-Omni Technical Report

Intermediate

Kling Team, Jialu Chen et al., Dec 18, arXiv

Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.

#multimodal visual language, #MVL, #prompt enhancer

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Intermediate

Tianshuai Hu, Xiaolu Liu et al., Dec 18, arXiv

Traditional self-driving used separate boxes for seeing, thinking, and acting, but tiny mistakes in early boxes could snowball into big problems later.

#Vision-Language-Action, #End-to-End Autonomous Driving, #Dual-System VLA

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Intermediate

Hao Liang, Xiaochen Ma et al., Dec 18, arXiv

DataFlow is a building-block system that helps large language models get better data by unifying how we create, clean, check, and organize that data.

#DataFlow, #LLM data preparation, #operator pipeline

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Intermediate

Bingxiang He, Zekai Qu et al., Dec 18, arXiv

JustRL shows that a tiny, steady recipe for reinforcement learning (RL) can make a 1.5B-parameter language model much better at math without fancy tricks.

#Reinforcement Learning, #GRPO, #Policy Entropy

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Intermediate

Giorgos Petsangourakis, Christos Sgouropoulos et al., Dec 18, arXiv

Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.

#latent diffusion, #REGLUE, #representation entanglement

DeContext as Defense: Safe Image Editing in Diffusion Transformers

Intermediate

Linghui Shen, Mingyue Cui et al., Dec 18, arXiv

This paper protects your photos from being misused by new AI image editors that can copy your face or style from just one picture.

#Diffusion Transformer, #cross-attention, #in-context image editing
