Papers (784)

End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

Anfeng Xu, Tiantian Feng et al. · Jan 25 · arXiv

This paper builds one smart system that listens to child–adult conversations and writes what was said, who said it, and exactly when each person spoke.

#end-to-end ASR #speaker diarization #child speech

SkyReels-V3 Technique Report

Intermediate

Debang Li, Zhengcong Fei et al. · Jan 24 · arXiv

SkyReels-V3 is a single AI model that can make videos in three ways: from reference images, by extending an existing video, and by creating talking avatars from audio.

#video generation #diffusion transformer #multimodal in-context learning

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Intermediate

Mohammad Rifqi Farhansyah, Hanif Muhammad Zhafran et al. · Jan 24 · arXiv

Most people on Earth speak more than one language and often switch languages in the same chat, but AI tools aren’t tested well on this real behavior.

#code-switching #multilingual NLP #trilingual dialogue

C-RADIOv4 (Tech Report)

Intermediate

Mike Ranzinger, Greg Heinrich et al. · Jan 24 · arXiv

C-RADIOv4 is a single vision model that learns from several expert models at once and keeps their best skills while staying fast.

#C-RADIOv4 #agglomerative vision models #multi-teacher distillation

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Intermediate

Bin Lin, Zongjian Li et al. · Jan 23 · arXiv

This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.

#image generation #finite scalar quantization #iFSQ

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Intermediate

Zirui Wang, Junyi Zhang et al. · Jan 23 · arXiv

VisGym is a playground of 17 very different visual tasks that test and train AI models that see and talk (Vision–Language Models) to act over many steps.

#VisGym #Vision–Language Models #Multimodal Agents

Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Intermediate

Xuan-Phi Nguyen, Shrey Pandit et al. · Jan 23 · arXiv

Mixture-of-Experts (MoE) models often send far more tokens to a few “favorite” experts, which overloads some GPUs while others sit idle.

#Mixture-of-Experts #Expert Parallelism #Least-Loaded Expert Parallelism

LoL: Longer than Longer, Scaling Video Generation to Hour

Intermediate

Justin Cui, Jie Wu et al. · Jan 23 · arXiv

This paper fixes a big problem in long video-making AIs where the video keeps snapping back to the beginning, like a movie stuck on rewind.

#sink-collapse #Rotary Position Embedding #RoPE jitter

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Intermediate

Yuhang Wang, Yuling Shi et al. · Jan 23 · arXiv

Coding agents waste most of their tokens just reading giant files, which makes them slow and expensive.

#SWE-Pruner #context pruning #coding agents

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Intermediate

Tongcheng Fang, Hanling Zhang et al. · Jan 23 · arXiv

Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.

#SALAD #sparse attention #linear attention

Endless Terminals: Scaling RL Environments for Terminal Agents

Intermediate

Kanishk Gandhi, Shivam Garg et al. · Jan 23 · arXiv

Endless Terminals is an automatic factory that builds thousands of realistic, checkable computer-terminal tasks so AI agents can practice and improve with reinforcement learning.

#reinforcement learning #PPO #terminal agents

Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Intermediate

Dohun Lee, Chun-Hao Paul Huang et al. · Jan 22 · arXiv

Memory-V2V teaches video editing AIs to remember what they already changed so new edits stay consistent with old ones.

#multi-turn video editing #video-to-video diffusion #explicit memory
