AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
Chroma 1.0 is a real-time, end-to-end speech-to-speech system that can talk back in a cloned copy of your voice with sub-second latency.