Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.
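To make "clean up noise" concrete, here is a minimal sketch (not the paper's code) of the standard noise-prediction loss these models train on, assuming a hypothetical network `eps_model(x_t, t)`; notice the target is the injected noise itself, with nothing about objects or layout:

```python
import torch

def denoising_loss(eps_model, x0, alphas_cumprod):
    """Standard DDPM-style objective: the model is only asked to
    recover the Gaussian noise added to a clean latent x0."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)           # noise level at step t
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # noised input
    pred = eps_model(x_t, t)                          # predict the noise
    return torch.nn.functional.mse_loss(pred, noise)  # no semantic signal
```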
This paper introduces a way to protect your photos from being misused by new AI image editors that can copy your face or style from just one picture.
This paper teaches a vision-language model to first find objects in real 3D space (not just 2D pictures) and then reason about where things are.
StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.
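The exact method isn't spelled out here, but the core idea can be sketched as follows, assuming a coarse-to-fine, scale-by-scale autoregressive generator with hypothetical `full_model`, `light_model`, and `predict_scale` interfaces; the point is simply to spend expensive compute only on the early, meaning-setting steps:

```python
def staged_generate(full_model, light_model, scales, split=2):
    """Sketch of stage-aware generation: early (coarse) scales fix the
    scene's semantics and structure with the full model; later (fine)
    scales only polish detail, so a cheaper model handles them."""
    tokens = []
    for i, scale in enumerate(scales):
        model = full_model if i < split else light_model
        tokens.append(model.predict_scale(tokens, scale))
    return tokens
```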
The paper defines Scientific General Intelligence (SGI) as an AI that can do science like a human scientist across the full loop: study, imagine, test, and understand.
This paper organizes how AI agents learn and improve into one simple map with four roads: A1, A2, T1, and T2.
INTELLECT-3 is a 106B-parameter Mixture-of-Experts model (about 12B active per token) trained with large-scale reinforcement learning; it beats many bigger models on math, coding, science, and reasoning tests.
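If "106B total but 12B active" sounds odd, here is a toy Mixture-of-Experts layer (not INTELLECT-3's architecture) showing how that works: all experts exist and count toward total parameters, but each token is routed to only k of them:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy MoE layer: n_experts hold the total parameters, but each
    token only runs through its top-k experts (the 'active' ones)."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):            # only k experts run per token
            for e in idx[:, j].unique():
                mask = idx[:, j] == e
                out[mask] += weights[mask, j:j+1] * self.experts[int(e)](x[mask])
        return out
```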
ModelTables is a giant, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.
TurboDiffusion speeds up video diffusion models by about 100–200 times while keeping video quality comparable to the original, slower models.
This paper asks whether we are judging AI answers the right way and introduces Sage, a new way to test AI judges without using human-graded answers.
Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.
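A minimal sketch of that memory loop, assuming hypothetical `video_model.next_frame` and `unproject` functions: each generated frame is conditioned on the current 3D map, then lifted back into 3D to grow it, so the scene stays consistent over time:

```python
import numpy as np

def generate_with_3d_memory(video_model, unproject, n_frames, prompt):
    """Sketch of point-cloud-as-memory video generation: condition each
    frame on the 3D map, then update the map from the new frame."""
    point_cloud = np.empty((0, 6))                  # rows of (x, y, z, r, g, b)
    frames = []
    for _ in range(n_frames):
        frame, depth, pose = video_model.next_frame(prompt, point_cloud)
        new_points = unproject(frame, depth, pose)  # lift pixels into 3D
        point_cloud = np.vstack([point_cloud, new_points])
        frames.append(frame)
    return frames
```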
Pixels are the raw stuff of images, and this paper shows you can learn great vision skills by predicting pixels directly, not by comparing fancy hidden features.
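As a rough illustration (not the paper's recipe), here is a masked pixel-prediction loss in the spirit of that claim, assuming a hypothetical `encoder_decoder(images, mask)` that outputs per-patch pixel predictions; the targets are raw pixels, with no feature-space teacher anywhere:

```python
import torch

def masked_pixel_loss(encoder_decoder, images, mask_ratio=0.75, patch=16):
    """Sketch: hide most patches, then regress the hidden patches'
    raw pixel values directly with a plain MSE loss."""
    B, C, H, W = images.shape
    n_patches = (H // patch) * (W // patch)
    # Flatten each patch's pixels: these are the regression targets.
    target = images.unfold(2, patch, patch).unfold(3, patch, patch)
    target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, n_patches, -1)
    mask = torch.rand(B, n_patches) < mask_ratio  # hide most patches
    pred = encoder_decoder(images, mask)          # (B, n_patches, C*patch*patch)
    return ((pred - target) ** 2)[mask].mean()    # pixel MSE on hidden patches
```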