Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.
This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.
Before this work, most text-to-image models relied on VAEs (autoencoders that squish images into small codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera’s settings like field of view or tilt.
This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.
Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.
ActionMesh is a fast, feed-forward AI that turns videos, images + text, text alone, or a given 3D model into an animated 3D mesh.
This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text describing the change you want.
SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.
The paper builds special Turkish legal AI models called Mecellem by teaching them from the ground up and then giving them extra law-focused lessons.
Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.
Stable-DiffCoder is a code-focused diffusion language model that learns to write and edit programs by filling in masked pieces, not just predicting the next token.