DeepSeek-OCR 2 teaches a computer to "read" pictures of documents in a smarter order, more like how people read.
Putting the reading passage before the question and answer choices (CQO) makes language models much more accurate than putting it after (QOC), by about 15 percentage points on average.
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
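
The two prompt orderings compared in the CQO/QOC summary above can be sketched roughly as follows. This is only an illustration, assuming the letters stand for context (the reading passage), question, and options (the answer choices). The function name `build_prompt`, the section labels, and the sample text are hypothetical and not taken from the paper.

```python
from typing import Sequence


def build_prompt(context: str, question: str, options: Sequence[str], order: str = "CQO") -> str:
    """Assemble a multiple-choice prompt with its parts in the given order.

    `order` is a permutation of the letters C (context), Q (question),
    and O (options), e.g. "CQO" or "QOC". The labels are illustrative.
    """
    if sorted(order) != ["C", "O", "Q"]:
        raise ValueError("order must be a permutation of 'C', 'Q', 'O'")

    # Label the answer choices A, B, C, ...
    labeled = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    parts = {
        "C": f"Passage:\n{context}",
        "Q": f"Question: {question}",
        "O": f"Options:\n{labeled}",
    }
    # Join the three parts in the requested order, separated by blank lines.
    return "\n\n".join(parts[key] for key in order)


passage = "The sky is blue."
question = "What color is the sky?"
choices = ["Red", "Blue"]

cqo_prompt = build_prompt(passage, question, choices, order="CQO")  # passage first
qoc_prompt = build_prompt(passage, question, choices, order="QOC")  # passage last
```

Both prompts contain exactly the same text; only the position of the passage differs, which is the variable the summary says accounts for the accuracy gap.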