JavisGPT is a single AI model that can both understand videos with sound (audio and video together) and generate new ones that stay in sync.
The paper asks what a truly good diffusion-based language model should look like and lists five must-have properties.
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, describe them, and plan actions, outperforming many standard (autoregressive) models.
DreamOmni3 lets people edit and create images by combining text, example images, and quick hand-drawn scribbles.
Monadic Context Engineering (MCE) is a way to build AI agents using math-inspired Lego blocks called Functors, Applicatives, and Monads so state, errors, and side effects are handled automatically.
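To make the idea concrete, here is a minimal sketch of what "handled automatically" means, using Haskell's Either monad for errors; the step names (retrieveContext, draftAnswer, polish) are hypothetical illustrations, not the paper's actual API, and MCE itself may use richer structures for state and side effects.

```haskell
import Data.Char (toUpper)

-- Hypothetical agent steps; each returns a value inside a context
-- (Either String) that can carry a failure.
retrieveContext :: String -> Either String String
retrieveContext q
  | null q    = Left "empty query"
  | otherwise = Right ("docs about " ++ q)

draftAnswer :: String -> Either String String
draftAnswer ctx = Right ("Answer based on: " ++ ctx)

polish :: String -> Either String String
polish = Right . map toUpper

-- The Monad's bind (>>=) threads failures through the pipeline,
-- so no step has to write its own error-handling boilerplate.
runAgent :: String -> Either String String
runAgent query = retrieveContext query >>= draftAnswer >>= polish

main :: IO ()
main = do
  print (runAgent "monads")  -- Right "ANSWER BASED ON: DOCS ABOUT MONADS"
  print (runAgent "")        -- Left "empty query"
```

The same composition pattern extends to state (State monad) and side effects (IO), which is the "Lego block" claim: swap the context, keep the plumbing.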
This paper introduces Self-E, a text-to-image model trained from scratch that can generate good pictures with any number of sampling steps, from just a few to many.
Real-life instructions are often vague, so the paper creates a task where a robot can ask clarifying questions while it searches a big house for a very specific object.
The paper teaches vision-language models (AIs that both look and read) to focus on the right parts of a picture without needing extra tools at answer time.
ProEdit is a training-free, plug-and-play method that fixes a common problem in image and video editing: the model clings too hard to the original picture and refuses to change what you asked for.
Yume1.5 is a model that turns text or a single image into a living, explorable video world you can move through with keyboard controls.
SciEvalKit is a new open-source toolkit that tests AI on real scientific skills, not just trivia or simple Q&A.
SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of re-generating the whole picture.