This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or does it merely help them pick better among answers they already know?
Big picture: vision-language models must process hundreds of image pieces (tokens) for every picture, which makes them slow and prone to confident mistakes called hallucinations.
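To see where those "hundreds of tokens" come from, here is a quick back-of-the-envelope, assuming a typical ViT-style patchifier (the 336×336 image and 14-pixel patches used by models like LLaVA-1.5; the paper's exact encoder may differ):

```python
# Back-of-the-envelope: a ViT-style encoder cuts the image into a grid of patches,
# and each patch becomes one token the language model must attend over.
image_side, patch_side = 336, 14
tokens = (image_side // patch_side) ** 2
print(tokens)  # 576 image tokens per picture, before a single word of text
```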
ReGFT is a simple pre-RL step: show the model partial human hints, then have it finish solving in its own words, producing correct, model-style solutions for hard questions.
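A minimal sketch of what such a hint-then-solve data loop could look like. Here `model_generate`, `extract_final_answer`, and the hint schedule are hypothetical stand-ins for illustration, not ReGFT's published recipe:

```python
def model_generate(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fake model-style solution."""
    return prompt + " The product of 6 and 7 is 42, so the answer is 42."

def extract_final_answer(text: str) -> str:
    """Naive parser for 'the answer is X'; real code would match the dataset's format."""
    return text.rsplit("answer is", 1)[-1].strip(" .")

def hinted_solutions(question, human_solution, answer,
                     hint_fracs=(0.25, 0.5, 0.75), samples=4):
    """Reveal a growing prefix of the human solution, let the model finish
    in its own words, and keep only completions that reach the right answer."""
    kept = []
    for frac in hint_fracs:
        hint = human_solution[: int(len(human_solution) * frac)]
        prompt = f"{question}\nHint (partial solution): {hint}\nFinish the solution:"
        for _ in range(samples):
            solution = model_generate(prompt)
            if extract_final_answer(solution) == answer:
                kept.append({"question": question, "solution": solution})
    return kept

pairs = hinted_solutions("What is 6 * 7?", "Multiply 6 by 7 to get the result.", "42")
print(len(pairs))  # correct, model-style solutions ready for fine-tuning
```

The point of keeping only answer-verified completions is that the resulting training pairs are both correct and written in the model's own distribution, which is exactly what the line above describes.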
ArtLLM is a 3D large language model that turns a rough 3D shape (from an image, text, or mesh) into a complete, movable 3D object with parts and joints.
The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning-space text already lives in, so all modalities “speak” the same language.
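A toy sketch of the "bridge" idea: train a small projector so image features land next to their caption's embedding in the shared text space. The encoders below are random placeholders standing in for frozen pretrained models, not the real Sonar components, and real training would use actual image-caption pairs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
image_encoder = nn.Linear(512, 512)    # placeholder for a frozen vision backbone
text_encoder  = nn.Linear(300, 1024)   # placeholder for the frozen Sonar text encoder
projector     = nn.Linear(512, 1024)   # the learned bridge: image features -> text space

opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)
for step in range(200):
    with torch.no_grad():                                  # both encoders stay frozen
        img_feats = image_encoder(torch.randn(32, 512))
        txt_embed = text_encoder(torch.randn(32, 300))     # embeddings of paired captions
    pred = projector(img_feats)
    # Pull each projected image toward its caption's point in the shared space.
    loss = 1 - F.cosine_similarity(pred, txt_embed, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Once images and text occupy one space, anything built on top of the text embeddings (search, translation, classification) works on images for free, which is the payoff the summary is pointing at.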
LLaDA-o is a new AI that understands pictures and text and can also generate images, all in one model.
The paper introduces VGGT-Det, a new way to detect 3D objects indoors from many photos without needing sensor-provided camera poses or depth maps.
Qwen3-Coder-Next is an open-weight coding model that activates only 3B of its 80B total parameters per token, so it runs fast while still being smart.
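The "3B of 80B" trick is sparse mixture-of-experts routing: a small router picks a few experts per token, so most of the parameters sit idle on any given step. Below is a generic top-k MoE layer for illustration; Qwen3-Coder-Next's actual expert counts and routing details are assumptions not covered here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: each token activates only k of the experts."""
    def __init__(self, dim=64, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.router(x)                      # (tokens, num_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)   # keep only the k best experts per token
        top_w = F.softmax(top_w, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

Total capacity scales with the number of experts, but compute per token scales only with k, which is why an 80B-parameter model can run at a 3B-parameter cost.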
STMI is a new way to recognize the same object across different kinds of cameras (color, night-vision, and thermal) without discarding the details that are unique to each sensor.
Many real-life requests to AI helpers are vague, so agents must ask good questions before acting.
Modern music AIs can follow text, lyrics, and even example audio, but the automatic judges that score these songs have not kept up.
Big AI models keep getting wider (more neurons per layer) and deeper (more layers), which often makes training unstable and hyperparameters hard to reuse.
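One well-known remedy for this is width-aware hyperparameter scaling in the spirit of µP, sketched below. Whether this particular paper uses µP is an assumption, and the real recipe also treats input and output layers specially; this only shows the core idea of shrinking matrix learning rates as width grows:

```python
import torch
import torch.nn as nn

def mup_style_groups(model: nn.Module, base_width: int, width: int, base_lr: float):
    """Simplified µP-flavored rule for Adam-type optimizers: keep the tuned
    learning rate for biases/vectors, but scale hidden weight matrices by
    base_width / width, so hyperparameters tuned on a small model transfer."""
    matrices, vectors = [], []
    for p in model.parameters():
        (matrices if p.ndim >= 2 else vectors).append(p)
    return [
        {"params": matrices, "lr": base_lr * base_width / width},
        {"params": vectors,  "lr": base_lr},
    ]

# Hyperparameters tuned at width 256 get reused on a width-4096 model.
wide = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
opt = torch.optim.AdamW(mup_style_groups(wide, base_width=256, width=4096, base_lr=3e-4))
```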