ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
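To make the idea concrete, here is a minimal toy sketch of object-to-word linking, not the paper's actual model: `embed_region` and `embed_word` are hypothetical placeholders for an object encoder and a text encoder, and each detected object is linked to the word whose embedding sits closest to it.

```python
# Toy sketch of object-level matching, assuming we already have an
# image-region encoder and a text encoder (both hypothetical stand-ins here).
import numpy as np

rng = np.random.default_rng(0)

def embed_region(region_pixels: np.ndarray) -> np.ndarray:
    """Placeholder for an object/region encoder; returns a unit vector."""
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def embed_word(word: str) -> np.ndarray:
    """Placeholder for a text encoder; returns a unit vector."""
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Pretend a detector gave us three object crops from one photo.
object_crops = [np.zeros((32, 32, 3)) for _ in range(3)]
words = ["dog", "frisbee", "tree"]

obj_vecs = np.stack([embed_region(c) for c in object_crops])   # (3, 64)
word_vecs = np.stack([embed_word(w) for w in words])           # (3, 64)

# Cosine similarity between every object and every word (vectors are unit norm).
sims = obj_vecs @ word_vecs.T                                  # (3, 3)

for i, row in enumerate(sims):
    print(f"object {i} -> '{words[int(row.argmax())]}' (score {row.max():.2f})")
```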
The paper asks a simple question: if a language model becomes better at step-by-step reasoning (using RLVR), do its text embeddings also get better? The short answer is no.
This paper introduces CGPT, a way to help computers find the right tables by building smarter mini-versions of tables and training with tough practice questions.
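Reading "mini-versions of tables" as compact text sketches (table name, headers, a few sample rows) and "tough practice questions" as hard-negative contrastive training, a toy version of those two ingredients might look like this; `table_sketch`, `embed`, and `contrastive_loss` are illustrative stand-ins of my own, not the paper's code.

```python
# Sketch: (1) shrink each table to a short text summary, (2) train so the query
# scores the correct table above a deceptively similar "hard negative" table.
import numpy as np

rng = np.random.default_rng(1)

def table_sketch(name, columns, rows, n_rows=2):
    """Compact mini-version of a table: name, headers, and a couple of rows."""
    lines = [name, " | ".join(columns)]
    lines += [" | ".join(map(str, r)) for r in rows[:n_rows]]
    return "\n".join(lines)

def embed(text: str) -> np.ndarray:
    """Placeholder text encoder returning a unit vector."""
    v = rng.normal(size=32)
    return v / np.linalg.norm(v)

def contrastive_loss(query_vec, pos_vec, neg_vecs, temperature=0.05):
    """InfoNCE-style loss: the query should score the correct table highest."""
    scores = np.concatenate([[query_vec @ pos_vec], neg_vecs @ query_vec]) / temperature
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])                    # index 0 is the positive table

query = "average monthly revenue per store in 2023"
positive = table_sketch("store_revenue", ["store_id", "month", "revenue"],
                        [[1, "2023-01", 84000], [1, "2023-02", 91000]])
# Hard negative: a similar-looking but wrong table.
hard_negative = table_sketch("store_expenses", ["store_id", "month", "expenses"],
                             [[1, "2023-01", 56000], [1, "2023-02", 60500]])

loss = contrastive_loss(embed(query), embed(positive), np.stack([embed(hard_negative)]))
print(f"contrastive loss: {loss:.3f}")
```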
Agentic-R is a new way to teach a search retriever to find not just similar text, but the text that truly helps an AI get the final answer right.
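A toy way to picture that training signal, under my own simplifications rather than the paper's algorithm: the retriever's only reward is whether a downstream reader gets the final answer right, and a REINFORCE-style update nudges it toward the passages that earn that reward. The `reader_answers_correctly` stub stands in for running a real QA model.

```python
# Reward the retriever only when its chosen passage leads to the correct answer.
import numpy as np

rng = np.random.default_rng(2)

passages = [
    "The Eiffel Tower is 330 metres tall.",         # actually helpful
    "The Eiffel Tower is in Paris, France.",        # similar topic, not helpful
    "Tall towers are popular tourist attractions.", # vaguely similar, not helpful
]
question, gold_answer = "How tall is the Eiffel Tower?", "330 metres"

def reader_answers_correctly(passage: str) -> bool:
    """Stand-in for running a QA model on the passage and checking its answer."""
    return gold_answer.split()[0] in passage

# The "retriever" here is just one learnable score per passage (a softmax policy).
logits = np.zeros(len(passages))
lr = 1.0

for step in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    choice = rng.choice(len(passages), p=probs)          # sample a passage
    reward = 1.0 if reader_answers_correctly(passages[choice]) else 0.0
    grad = -probs                                        # d log pi / d logits
    grad[choice] += 1.0
    logits += lr * reward * grad                         # reinforce helpful picks

print("learned preference:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))
```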
Action100M is a gigantic video dataset with about 100 million labeled action moments built automatically from 1.2 million instructional videos.
HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.
This paper builds two companion models, Qwen3-VL-Embedding and Qwen3-VL-Reranker, that map text, images, visual documents, and videos into one shared space so search works across all of them.
This paper is about getting image and video generators to follow the text you type more reliably.
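The usual way such a pair is used is a two-stage pipeline: the embedding model handles cheap nearest-neighbour search over everything, and the reranker rescores only the short list. The sketch below shows that pattern with hypothetical `embed` and `rerank_score` stand-ins; it is not the actual Qwen3-VL API.

```python
# Stage 1: dense retrieval in one shared space; stage 2: rerank the top hits.
import numpy as np

rng = np.random.default_rng(3)

def embed(item: dict) -> np.ndarray:
    """Placeholder multimodal encoder: text, image, document page, or video."""
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def rerank_score(query: dict, item: dict) -> float:
    """Placeholder reranker that jointly scores the query and one candidate."""
    return float(embed(query) @ embed(item) + rng.normal(scale=0.01))

corpus = [
    {"modality": "text",  "content": "How to repot a cactus"},
    {"modality": "image", "content": "photo_of_cactus.jpg"},
    {"modality": "video", "content": "repotting_tutorial.mp4"},
    {"modality": "pdf",   "content": "gardening_guide_page_12.png"},
]
query = {"modality": "text", "content": "cactus repotting guide"}

# Stage 1: cheap similarity search over the whole corpus.
corpus_vecs = np.stack([embed(d) for d in corpus])
sims = corpus_vecs @ embed(query)
top_k = sims.argsort()[::-1][:2]

# Stage 2: slower, more careful rescoring of only the short list.
reranked = sorted(top_k, key=lambda i: rerank_score(query, corpus[i]), reverse=True)
for i in reranked:
    print(corpus[i]["modality"], corpus[i]["content"])
```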
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.
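In interface terms, that means the encoder takes both an image and an attribute prompt, so the same two images can score as near-duplicates for one attribute and as unrelated for another. Here is a toy sketch of that interface with a hypothetical `encode(image, attribute)` placeholder, not the paper's model.

```python
# Attribute-conditioned similarity: the comparison depends on what you ask for.
import numpy as np

rng = np.random.default_rng(4)

def encode(image: str, attribute: str) -> np.ndarray:
    """Placeholder: embed only the attribute of the image named in the prompt."""
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

def similarity(img_a: str, img_b: str, attribute: str) -> float:
    """Cosine similarity between the two images, restricted to one attribute."""
    return float(encode(img_a, attribute) @ encode(img_b, attribute))

a, b = "portrait_long_hair_daylight.jpg", "portrait_long_hair_studio.jpg"
print("hairstyle similarity:", round(similarity(a, b, "hairstyle"), 2))
print("lighting similarity: ", round(similarity(a, b, "lighting"), 2))
```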
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.
The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.