Short videos are easy for AI to make sharp and lively, but long videos need stories and memory, and there isn't much training data for that.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
Big video makers (diffusion models) create great videos but are too slow because they rely on hundreds of tiny clean-up steps.