Recursive transformers save parameter memory by applying the same shared layer repeatedly across depth, but tying the weights this way makes the model less expressive and typically hurts accuracy compared with an untied model of the same depth.
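To make the weight-sharing concrete, here is a minimal sketch of a recursive transformer in PyTorch. The class name, sizes, and loop count are illustrative assumptions, not from the original text; the point is only that one layer's parameters are reused for every depth step, so parameter count stays that of a single layer while compute matches a deeper model.

```python
import torch
import torch.nn as nn

class RecursiveTransformer(nn.Module):
    """Sketch: one shared layer applied num_loops times (assumed sizes)."""

    def __init__(self, d_model=256, n_heads=4, num_loops=6):
        super().__init__()
        # A single set of layer weights, shared across all loop iterations.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.num_loops = num_loops

    def forward(self, x):
        # Reapply the same layer num_loops times: compute cost is that of a
        # num_loops-deep stack, but parameters come from one layer only.
        for _ in range(self.num_loops):
            x = self.shared_layer(x)
        return x

# Usage: behaves like a 6-layer encoder with one layer's worth of parameters.
model = RecursiveTransformer()
tokens = torch.randn(2, 10, 256)   # (batch, seq_len, d_model)
out = model(tokens)
print(out.shape)                   # torch.Size([2, 10, 256])
```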
Large language models use rotary position embeddings (RoPE) to encode word order: pairs of query and key dimensions are treated as complex numbers and rotated by a position-dependent angle, but the attention score keeps only the real part of the resulting complex inner product and discards the imaginary part.
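The sketch below, again in PyTorch with assumed shapes and a hypothetical `rope_rotate` helper, shows where that happens: adjacent channel pairs are viewed as complex numbers and rotated, the query–key logit is the complex inner product, and standard RoPE attention keeps only its real part (the channel-pairing convention here is one common choice, not necessarily the one any specific implementation uses).

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotate channel pairs of x (seq_len, head_dim) as complex numbers."""
    half = x.shape[-1] // 2
    # Per-pair rotation frequencies and position-dependent angles.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]          # (seq, half)
    # View adjacent channel pairs as complex numbers and rotate by e^{i*theta}.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], half, 2))
    rot = torch.polar(torch.ones_like(angles), angles)
    return x_complex * rot                                        # complex-valued

seq_len, head_dim = 8, 64
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
pos = torch.arange(seq_len)

q_rot, k_rot = rope_rotate(q, pos), rope_rotate(k, pos)

# Complex inner product between every query position and key position.
scores_complex = q_rot @ k_rot.conj().T                           # (seq, seq), complex
scores = scores_complex.real       # standard RoPE attention uses only this
# scores_complex.imag is what gets thrown away.
print(scores.shape, scores_complex.imag.abs().mean())
```

Taking the real part of the complex inner product is exactly equivalent to the usual real-valued dot product between the rotated queries and keys, which is why the imaginary component never enters the softmax.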