Fast weight models remember context with a small, fixed-size memory that does not grow with the length of the input, but standard next-token training teaches them to think only one word ahead.
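To make the fixed-memory idea concrete, here is a minimal sketch of a linear-attention-style fast weight update, with illustrative names and dimensions of my choosing: each token writes a key-value association into a single d x d matrix with a rank-1 update, and reads from it with a query. The state stays the same size no matter how many tokens are processed.

```python
import numpy as np

def fast_weight_step(S, k, v, q):
    """One fast-weight step: write the (k, v) association into the
    fixed-size matrix state S, then read from it with query q."""
    S = S + np.outer(v, k)   # write: rank-1 update stores the new association
    y = S @ q                # read: retrieve a value for the current query
    return S, y

d = 8                        # illustrative head dimension
S = np.zeros((d, d))         # the entire memory: d * d numbers, fixed forever
rng = np.random.default_rng(0)
for _ in range(1000):        # 1000 tokens later, the memory is still d x d
    k, v, q = rng.normal(size=(3, d))
    S, y = fast_weight_step(S, k, v, q)
```

Because the whole memory is one matrix, old associations get overwritten as new ones arrive; that lossy compression is the price of constant memory.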
Standard attention is slow for long texts because it compares every word with every other word, so its cost grows quadratically with the sequence length.
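For contrast, a minimal numpy sketch of plain softmax attention (unmasked, single head, for brevity): the n x n score matrix built on the first line of the function is exactly where the quadratic cost lives.

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla softmax attention over n tokens of dimension d."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # shape (n, n): every word vs. every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 4096, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)    # materializes 4096 * 4096 scores for one pass
```

Doubling the text length quadruples the score matrix, which is why long contexts are expensive for standard attention but free, in memory terms, for the fast weight sketch above.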