SageBwd is a way to make the Transformer's attention both fast and trainable by doing most of its large matrix multiplications in 8-bit instead of full precision.
Video generators are slow because attention compares every token with every other token, which becomes very costly for long videos.
SLA2 is a new way for AI to pay attention faster by smartly splitting work between two helpers: a precise one (sparse attention) and a speedy one (linear attention).
HySparse is a new way for AI models to pay attention that mixes a few full attention layers with many fast, memory‑saving sparse layers.
This paper speeds up how AI models read very long texts by carefully choosing which words (tokens) to focus on at each step.
The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.
Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.
AI programs called LLMs can now help write the tiny, super-fast pieces of code (kernels) that make GPUs run AI models efficiently.
HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.
Transformers are powerful but slow because regular self-attention compares every token with every other token, so its cost grows quadratically with sequence length and becomes impractical for long sequences.
FOCUSUI makes computer-using AI faster while staying accurate by looking only at the important parts of a screen.
InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.