Decoder-only language models can produce strong user-profile embeddings, but the attention-masking scheme, which governs how each token can attend over the input sequence, has a large effect on the quality of those embeddings.
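A minimal sketch of the distinction, assuming standard scaled dot-product attention and a mean-pooled readout (the pooling choice is an illustrative assumption, not the source's exact setup): a causal mask restricts each position to earlier positions, while a bidirectional mask lets every position attend everywhere.

```python
import torch

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask is boolean, True = attention allowed."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, d = 6, 8
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

# Causal mask: position i attends only to positions j <= i (the decoder default).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask: every position attends to every other position,
# as in encoder-style models.
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

causal_out = masked_attention(q, k, v, causal)
bidi_out = masked_attention(q, k, v, bidirectional)

# Mean-pooling token outputs into one embedding is a common (assumed) readout;
# under the causal mask, early tokens never see later ones, which can
# impoverish the pooled representation.
causal_emb = causal_out.mean(dim=0)
bidi_emb = bidi_out.mean(dim=0)
print(causal_emb.shape, bidi_emb.shape)
```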
Placing the reading passage (context) before the question and answer options, the CQO ordering, makes language models substantially more accurate than placing it after them (QOC), by roughly 15 percentage points on average. A plausible mechanism is that under causal masking the question and option tokens can attend to the passage only when it precedes them.
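The two orderings can be made concrete with a pair of prompt builders. The exact template wording ("Question:", "Answer:", letter labels) is an assumption for illustration, not a verbatim prompt from the source.

```python
def build_cqo(context: str, question: str, options: list[str]) -> str:
    """CQO ordering: context first, then question, then options."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{context}\n\nQuestion: {question}\n{labeled}\nAnswer:"

def build_qoc(context: str, question: str, options: list[str]) -> str:
    """QOC ordering: question and options first, context last."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\n{labeled}\n\n{context}\nAnswer:"

print(build_cqo(
    "The sky appears blue because of Rayleigh scattering.",
    "Why is the sky blue?",
    ["Rayleigh scattering", "Ozone absorption"],
))
```

With a causal decoder, only the CQO prompt lets every question and option token condition on the passage, which is consistent with the accuracy gap reported above.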