Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
RecTok is a new visual tokenizer that teaches the whole training path of a diffusion model (the forward flow) to be smart about image meaning, not just the starting latent features.
This paper shows that we can remove normalization layers from Transformers and still train them well by using a simple point‑by‑point function called Derf.