Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.
Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.