Phi-4-reasoning-vision-15B is a small, open-weight AI model that understands images and text together and is especially strong at math, science, and tasks that involve reading and operating computer screens.
The paper builds v-Sonar, a bridge that maps images and videos into the same meaning space as text, called Sonar, so all modalities "speak" the same language.
dLLM is a single, open-source toolbox that standardizes how diffusion language models are trained, run, and tested.
The paper treats the last layer of a Large Language Model (the softmax over tokens) as an Energy-Based Model, which exposes a new signal the authors call spilled energy.
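The softmax-to-energy correspondence behind this idea is standard: if each token's energy is defined as its negative logit, the Boltzmann distribution over those energies is exactly the softmax. A minimal sketch of that view (the function name and example logits are illustrative, and the paper's "spilled energy" signal itself is not reproduced here):

```python
import math

def softmax_as_ebm(logits):
    """View a softmax over tokens as an Energy-Based Model:
    each token's energy is the negative logit, and the token
    probability is the Boltzmann distribution over energies."""
    energies = [-z for z in logits]                    # E(y) = -logit_y
    m = min(energies)                                  # shift for numerical stability
    weights = [math.exp(-(e - m)) for e in energies]   # exp(-E(y))
    Z = sum(weights)                                   # partition function
    return [w / Z for w in weights]                    # p(y) = exp(-E(y)) / Z

# The EBM view reproduces ordinary softmax probabilities:
logits = [2.0, 1.0, 0.1]
probs = softmax_as_ebm(logits)
print([round(p, 3) for p in probs])
```

Because the partition function over the vocabulary is computed exactly here, the result matches a plain softmax; the EBM framing just makes the energies explicit so quantities like the paper's signal can be defined over them.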
FutureOmni is the first benchmark that tests if multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
TranslateGemma is a family of open machine translation models fine-tuned from Gemma 3 to translate many languages more accurately.
JavisGPT is a single AI that can both understand videos with sound (audio + video together) and also create new ones that stay in sync.
Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.
OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.
DentalGPT is a special AI that looks at dental images and text together and explains what it sees like a junior dentist.
Time-series data are numbers tracked over time, like temperature each hour or traffic each day, and turning them into clear written summaries usually requires experts.
Before this work, most big language models generated text one token at a time (autoregressively), which made generation slow and hard to parallelize.
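The sequential bottleneck is easy to see in a toy sketch (the next-token function below is purely illustrative, not a real model): each step consumes all previously generated tokens, so the decoding loop is inherently serial, which is exactly the property diffusion language models aim to relax.

```python
def generate_autoregressive(next_token, prompt, max_new_tokens):
    """Greedy autoregressive decoding: every new token depends on
    all tokens produced so far, so the loop cannot be parallelized."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))  # step t needs steps 0..t-1
    return tokens

# Toy "model": emits the last token plus one (illustrative only).
toy_model = lambda toks: toks[-1] + 1
print(generate_autoregressive(toy_model, [1, 2, 3], 4))  # [1, 2, 3, 4, 5, 6, 7]
```

A diffusion language model instead refines many token positions at once over a fixed number of denoising steps, trading this per-token serial loop for parallel updates.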