Speculative decoding speeds up big language models by letting a small helper model guess the next several tokens and having the big model check them all at once.
RelayGen is a training-free way to switch between a big model and a small model while one answer is being generated.
DFlash is a new way to make big language models answer much faster without changing the final answers.
DEER is a new way to speed up big language models by letting a diffusion model draft many tokens at once and an autoregressive model double-check them.
ReFusion is a new way for AI to write text faster by planning in chunks (called slots) and then filling each chunk carefully.
ARBITRAGE makes AI solve step-by-step problems faster by only using the big, slow model when it is predicted to truly help.
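
The guess-then-check idea behind speculative decoding (the first summary above) can be sketched in a few lines. This is only an illustrative sketch under simplifying assumptions: both models are stand-in functions that return a single greedy next token, the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the check is written as a loop where a real system would use one batched forward pass of the big model. It is not the procedure of any specific method listed here.

```python
from typing import Callable, List, Sequence

# Assumption: a "model" is any function mapping a token context to its greedy next token.
Model = Callable[[Sequence[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Draft k tokens with the small model, then keep the prefix the big model agrees with."""
    # 1) The small model guesses k tokens one after another.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The big model checks each guess (done in one parallel pass in practice).
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First disagreement: keep the big model's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All guesses accepted: the big model's check also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy stand-ins: the "big" model counts upward modulo 10; the "small" one errs after a 4.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=12, k=5))
```

Because every kept token is either confirmed or supplied by the big model, this greedy variant produces the same sequence the big model would produce on its own; the speedup comes from how many guesses are accepted per check.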