RealGen is a new method for making computer-generated pictures look so realistic that they can fool both expert automated detectors and careful human judges.
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.
VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).
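To make the two views concrete, here is a toy sketch of what a tokenizer like this exposes. The codebook, patch encoding, and sizes are all made up for illustration, not taken from VQRAE: the continuous vector serves the understanding path, and the id of the nearest codebook entry serves as the discrete token for generation.

```python
import numpy as np

# Hypothetical 3-entry codebook of 2-d vectors (illustrative only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

def encode(patch):
    # Continuous view: the raw feature vector, used for understanding.
    continuous = np.asarray(patch, dtype=float)
    # Discrete view: the id of the closest codebook entry,
    # used as a token for generation/reconstruction.
    dists = np.linalg.norm(codebook - continuous, axis=1)
    token_id = int(dists.argmin())
    return continuous, token_id

feat, tok = encode([0.9, 1.1])
print(tok)  # 1 (closest to codebook row [1.0, 1.0])
```

The same input yields both representations at once, which is the property that lets one model understand and generate with a single tokenizer.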
ThreadWeaver teaches a language model to split big problems into smaller parts it can solve at the same time, like teammates working in parallel.
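The "teammates working in parallel" idea can be sketched in plain Python. This is a loose analogy with made-up names, not ThreadWeaver itself (which does the splitting inside a language model's decoding, not with threads): split the input, solve the pieces concurrently, then merge.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(numbers):
    # Stand-in for one "teammate" solving its own piece.
    return sum(numbers)

def solve_in_parallel(big_problem, num_parts=4):
    # Split the work into roughly equal chunks.
    chunks = [big_problem[i::num_parts] for i in range(num_parts)]
    # Solve all chunks at the same time.
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partial_results = list(pool.map(solve_subproblem, chunks))
    # Combine the partial answers into one final answer.
    return sum(partial_results)

print(solve_in_parallel(list(range(100))))  # 4950, same as summing directly
```

The key requirement, for the model as for the threads, is that the pieces are independent enough to be solved without waiting on each other.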
Recurrent Neural Networks (RNNs) are special neural networks that learn from sequences, like sentences or time series, by remembering what came before.
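The "remembering what came before" part can be shown with a minimal recurrent cell. The weights here are fixed, illustrative numbers, not a trained model: the hidden state h carries a compressed memory of everything seen so far.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # New memory = squashed mix of the old memory and the new input.
    return math.tanh(w_h * h + w_x * x)

def run_rnn(sequence):
    h = 0.0  # start with an empty memory
    for x in sequence:
        h = rnn_step(h, x)  # each step folds in what came before
    return h

print(run_rnn([0.1, 0.2, 0.3]))
```

Because each step depends on the previous hidden state, feeding the same numbers in a different order produces a different final state, which is exactly why RNNs are sensitive to sequence order.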
The paper introduces the Transformer, a model that understands and generates sequences (like sentences) using only attention, without RNNs or CNNs.
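The attention operation at the heart of the Transformer can be written in a few lines. This is the standard scaled dot-product attention from the paper, with tiny illustrative matrices: each query is compared against every key, the scores become weights via softmax, and the output is a weighted average of the values.

```python
import numpy as np

def attention(Q, K, V):
    # Compare each query with every key (dot products), scale by
    # sqrt of the key dimension, softmax into weights, average values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.array([[1.0, 0.0]])                # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])    # two keys
V = np.array([[10.0, 0.0], [0.0, 10.0]])  # two values
print(attention(Q, K, V))  # output leans toward the first value row
```

Unlike the RNN's step-by-step memory, every position attends to every other position in one shot, which is what makes the computation parallelizable.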
This paper teaches computers to understand words by also looking at the smaller pieces inside words, like 'un-', 'play', and '-ing'.
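A toy splitter shows the idea of breaking a word into those smaller pieces. The affix lists and length rules below are illustrative, not the paper's actual method (which learns subword representations rather than using hand-written rules):

```python
# Hypothetical affix lists for illustration only.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ing", "ed", "er", "s"]

def split_morphemes(word):
    # Peel off at most one known prefix and one known suffix,
    # keeping the stem long enough to stay meaningful.
    prefix, stem, suffix = "", word, ""
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            stem, suffix = stem[:-len(s)], s
            break
    return [m for m in (prefix, stem, suffix) if m]

print(split_morphemes("unplaying"))  # ['un', 'play', 'ing']
```

Representing a word by its pieces lets the model guess the meaning of words it has never seen, as long as it has seen the pieces before.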