DentalGPT is a specialized multimodal AI model that reads dental images and text together and explains what it sees much as a junior dentist would.
The paper asks how best to use expert step-by-step solutions (expert trajectories) when teaching large AI models to reason during post-training.
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
Vision-Language-Action (VLA) models are robots’ “see–think–do” brains that connect cameras (vision), words (language), and motors (action).
Time-series data are numbers tracked over time, like temperature each hour or traffic each day, and turning them into clear words usually needs experts.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
Fast-FoundationStereo is a stereo vision system that sees depth from two cameras in real time while still working well on brand‑new scenes it was never trained on.
StereoSpace turns a single photo into a full 3D-style stereo pair without ever estimating a depth map.
Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.
Normalizing flows are generative models that learn an invertible mapping from real images to simple noise and back again.
This paper asks whether reinforcement learning (RL) can improve text-to-3D generation and shows that it can, provided the training and rewards are designed carefully.
This paper shows that we can remove normalization layers from Transformers and still train them well by using a simple element-wise (point-by-point) function called Derf instead.
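
To make the last summary more concrete, here is a minimal sketch of what a point-by-point stand-in for a normalization layer could look like. The summary does not give the formula, so this assumes Derf has the form erf(alpha * x + shift), wrapped in a scale and offset; the parameter names (`alpha`, `shift`, `gamma`, `beta`) and their default values are illustrative assumptions, and in a real model they would be learned rather than fixed.

```python
import math


def derf(x, alpha=0.5, shift=0.0, gamma=1.0, beta=0.0):
    """Apply gamma * erf(alpha * v + shift) + beta to each element of x.

    Assumed form, for illustration only: every element is transformed
    independently, so no mean or variance over the batch or the
    features is ever computed (unlike a normalization layer).
    """
    return [gamma * math.erf(alpha * v + shift) + beta for v in x]


# Extreme activations are squashed smoothly into a bounded range,
# while values near zero pass through almost linearly.
activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
outputs = derf(activations)
```

Because each output depends only on its own input, the function needs no statistics from neighboring values, which is the property that lets it take the place of a normalization layer in this sketch.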