The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.
SPARK teaches AI to grade its own steps without needing the right answers written down anywhere.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.
Fairy2i turns any pre-trained real-valued Transformer layer into an exactly equivalent complex form, so the layer's outputs stay the same before quantization.
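To make the idea of an exact real-to-complex equivalence concrete, here is a minimal sketch of one standard construction, the widely-linear form w = A z + B conj(z). It is not taken from the paper and may differ from what Fairy2i does. The function names, the use of NumPy, and the choice to treat the first half of each vector as the real part and the second half as the imaginary part are all assumptions made for illustration.

```python
import numpy as np


def real_to_widely_linear(R):
    """Split a real (2n x 2m) weight matrix into complex matrices A and B
    such that w = A z + B conj(z) reproduces R exactly.

    Assumption (illustrative only): the first half of each vector holds
    the real part and the second half holds the imaginary part.
    """
    out2, in2 = R.shape
    if out2 % 2 or in2 % 2:
        raise ValueError("both dimensions must be even")
    n, m = out2 // 2, in2 // 2
    R11, R12 = R[:n, :m], R[:n, m:]
    R21, R22 = R[n:, :m], R[n:, m:]
    # Obtained by substituting x = (z + conj(z)) / 2 and y = (z - conj(z)) / (2i).
    A = 0.5 * ((R11 + R22) + 1j * (R21 - R12))
    B = 0.5 * ((R11 - R22) + 1j * (R21 + R12))
    return A, B


def apply_widely_linear(A, B, x):
    """Apply the complex form to a real input vector x of length 2m."""
    m = x.shape[0] // 2
    z = x[:m] + 1j * x[m:]
    w = A @ z + B @ np.conj(z)
    return np.concatenate([w.real, w.imag])


# Compare the original real layer with its complex form on random data.
rng = np.random.default_rng(0)
R = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
A, B = real_to_widely_linear(R)
max_abs_diff = np.max(np.abs(R @ x - apply_widely_linear(A, B, x)))
```

In this sketch the two forms agree up to floating-point rounding, which is the sense in which a complex rewrite can leave a pre-trained layer unchanged before any quantization step is applied.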
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.
This paper teaches image models to keep things consistent across multiple pictures (like the same character, art style, and story logic) using reinforcement learning (RL).
This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.
This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.
Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.
RealGen is a new way to make computer-made pictures look so real that they can fool expert detectors and even careful judges.
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.
VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).