DREAM is a single model that both understands images (like CLIP) and makes images from text (like top text-to-image models).
This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or does it just help them pick the best of the answers they already know?
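To make the "smarter vs. better picking" distinction concrete, here is a minimal Python sketch, not from the paper: the 0.3 success rate and the base_model_answers helper are invented for illustration. If RL only improves selection, pass@1 climbs toward pass@k while pass@k itself stays flat; a genuinely smarter model raises pass@k too.

```python
import random

random.seed(0)

def base_model_answers(k):
    """Hypothetical stand-in: the base model samples k answers,
    each independently correct with probability 0.3."""
    return [random.random() < 0.3 for _ in range(k)]

def pass_at_k(trials, k):
    """Fraction of questions where at least one of k samples is correct."""
    return sum(any(base_model_answers(k)) for _ in range(trials)) / trials

# "Smarter" would mean pass@k rises: the model solves questions it could
# never solve before, even with many tries.
# "Better picking" means pass@1 rises toward pass@k: RL only surfaces an
# answer the model could already sample.
trials = 10_000
print("pass@1:", pass_at_k(trials, 1))  # ~0.30
print("pass@8:", pass_at_k(trials, 8))  # ~0.94, the ceiling of what it "knows"
```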
MAEB is a giant, fair report card for audio AI that tests 50+ models on 30 speech, music, environmental-sound, and audio–text tasks in 100+ languages.
Decoder-only language models can be great at making user profiles (embeddings), but how we let them look at the sequence (the attention mask) changes how smart those profiles are.
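A rough sketch of what that masking choice means, using toy tensors rather than any real model (the shapes and mean pooling here are illustrative assumptions): a causal mask lets each token see only its past, while a bidirectional mask lets every token see the whole sequence before the states are pooled into a profile.

```python
import torch

# Toy sketch: the "attention mask" here is the token-to-token visibility
# pattern inside self-attention, not the padding mask.
seq_len, dim = 6, 8
scores = torch.randn(seq_len, seq_len)  # raw attention scores
values = torch.randn(seq_len, dim)      # token value vectors

# Causal mask: token i may only attend to tokens <= i (standard decoder).
causal = torch.ones(seq_len, seq_len).tril().bool()
causal_attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Bidirectional mask: every token sees the whole sequence, so early tokens'
# representations can also reflect later behavior, often yielding richer profiles.
bidir_attn = torch.softmax(scores, dim=-1)

# Mean-pool the attended token states into a single profile embedding.
profile_causal = (causal_attn @ values).mean(dim=0)
profile_bidir = (bidir_attn @ values).mean(dim=0)
print(profile_causal.shape, profile_bidir.shape)  # torch.Size([8]) twice
```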
This paper builds a big, fair playground (a benchmark) to test many EEG foundation models side by side under the same rules.
This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.
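As an illustration of what "neatly separate object attributes" means in practice, here is a toy latent-traversal probe (the decode function is a made-up stand-in, not the paper's VAE): sweep one latent coordinate, hold the rest fixed, and check whether only one visual attribute changes.

```python
import torch

latent_dim = 16

def decode(z):
    """Hypothetical stand-in for a VAE decoder: z -> image."""
    torch.manual_seed(0)                  # fixed fake weights
    w = torch.randn(latent_dim, 3 * 32 * 32)
    return (z @ w).reshape(-1, 3, 32, 32)

z = torch.zeros(1, latent_dim)            # start at the latent origin
traversal = []
for v in torch.linspace(-3, 3, 7):        # sweep one coordinate
    z_step = z.clone()
    z_step[0, 5] = v                      # vary only dimension 5
    traversal.append(decode(z_step))

# If latents are semantically disentangled, these 7 decoded images differ in
# exactly one attribute; entangled latents change many attributes at once.
print(torch.cat(traversal).shape)         # torch.Size([7, 3, 32, 32])
```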
This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).
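One cheap way to see the contrast the question hinges on (a sketch with random tensors, not the paper's experiment): shuffling an encoder's patch grid destroys spatial structure but leaves the pooled global summary untouched, so a generator that still trains well on shuffled features is leaning on global semantics.

```python
import torch

# Toy ViT-style encoder output: a grid of patch features [num_patches, dim].
# "Global semantics" is the pooled summary; "spatial structure" is how those
# features are laid out on the grid.
num_patches, dim = 196, 768               # 14x14 grid, ViT-B width
feats = torch.randn(num_patches, dim)     # stand-in encoder output

global_vec = feats.mean(dim=0)            # global semantics: one vector

# Shuffling patches keeps the global summary (the mean is order-invariant)
# but destroys the spatial arrangement.
shuffled = feats[torch.randperm(num_patches)]
assert torch.allclose(shuffled.mean(dim=0), global_vec, atol=1e-5)
print(global_vec.shape)                   # torch.Size([768])
```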