Papers4

#ASR

DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

Jiaming Zhou, Xuxin Cheng et al.Jan 30arXiv

DIFFA-2 is a new audio AI that listens to speech, sounds, and music and answers questions about them using a diffusion-style language model instead of the usual step-by-step (autoregressive) method.

#Diffusion language models#Audio understanding#Large audio language model

Not triaged yet

Qwen3-ASR Technical Report

Intermediate

Xian Shi, Xiong Wang et al.Jan 29arXiv

Qwen3‑ASR is a family of speech models that hear, understand, and write down speech in 52 languages and dialects, plus they can tell you when each word was spoken.

#ASR#forced alignment#timestamps

Not triaged yet

AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Intermediate

Dongjie Cheng, Ruifeng Yuan et al.Jan 25arXiv

AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.

#autoregressive modeling#multimodal large language model#any-to-any generation

Not triaged yet

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Intermediate

Xiaojun Jia, Jie Liao et al.Dec 6arXiv

OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.

#multimodal large language models#jailbreak attacks#safety alignment

Not triaged yet