Papers (807)

Towards Scalable Pre-training of Visual Tokenizers for Generation

Jingfeng Yao, Yuda Song et al. · Dec 15 · arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer #latent space #Vision Transformer

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate

Ziqi Ma, Hongqiao Chen et al. · Dec 15 · arXiv

Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.

#3D editing #image-to-3D #ControlNet

Towards Interactive Intelligence for Digital Humans

Intermediate

Yiyi Cai, Xuangeng Chu et al. · Dec 15 · arXiv

Digital humans used to just copy motions; this paper makes them think, speak, and move in sync like real people.

#interactive intelligence #digital human #multimodal avatar

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Intermediate

Enshen Zhou, Cheng Chi et al. · Dec 15 · arXiv

RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.

#RoboTracer #spatial trace #3D spatial referring

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Intermediate

Boxin Wang, Chankyu Lee et al. · Dec 15 · arXiv

The paper introduces Nemotron-Cascade, a step-by-step (cascaded) reinforcement learning recipe that trains an AI across domains like alignment, instructions, math, coding, and software engineering—one at a time.

#Cascaded Reinforcement Learning #RLHF #Instruction-Following RL

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Intermediate

Jianxiong Gao, Zhaoxi Chen et al. · Dec 15 · arXiv

LongVie 2 is a video world model that can generate controllable videos for 3–5 minutes while keeping the look and motion steady over time.

#long video generation #world model #multimodal control

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Intermediate

Jia-Nan Li, Jian Guan et al. · Dec 15 · arXiv

ReFusion is a new way for AI to write text faster by planning in chunks (called slots) and then filling each chunk carefully.

#ReFusion #masked diffusion model #parallel decoding

Memory in the Age of AI Agents

Intermediate

Yuyang Hu, Shichun Liu et al. · Dec 15 · arXiv

This survey explains how AI agents remember things and organizes the whole topic into three clear parts: forms, functions, and dynamics.

#Agent memory #LLM memory #Retrieval-augmented generation

Janus: Disaggregating Attention and Experts for Scalable MoE Inference

Intermediate

Zhexiang Zhang, Ye Wang et al. · Dec 15 · arXiv

Janus splits a Mixture-of-Experts (MoE) model into two parts, attention and experts, so each can use just the right number of GPUs.

#Mixture-of-Experts inference #disaggregated serving #activation load balancing

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Intermediate

Team Seedance, Heyi Chen et al. · Dec 15 · arXiv

Seedance 1.5 pro is a single model that generates video and sound jointly, so lips, music, and actions match naturally.

#audio-visual generation #diffusion transformer #cross-modal synchronization

Scaling Laws for Code: Every Programming Language Matters

Intermediate

Jian Yang, Shawn Guo et al. · Dec 15 · arXiv

Different programming languages scale differently when training code AI models, so treating them all the same wastes compute and lowers performance.

#multilingual code pre-training #scaling laws #language-specific scaling

RecTok: Reconstruction Distillation along Rectified Flow

Intermediate

Qingyu Shi, Size Wu et al. · Dec 15 · arXiv

RecTok is a new visual tokenizer that teaches the whole training path of a diffusion model (the forward flow) to be smart about image meaning, not just the starting latent features.

#Rectified Flow #Flow Matching #Visual Tokenizer
