Multi-agent LLM systems often give each agent its own LoRA adapter so it can play a specialized role, but the agents all rebuild nearly identical KV caches, wasting memory and time.
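Because every agent runs the same frozen base model and differs only in its small adapter weights, the key/value tensors computed for a shared prompt prefix can be identical across agents (assuming the adapters leave the K/V projections untouched). A toy sketch of prefix-cache sharing, with all names hypothetical and a hash standing in for the real per-token tensors:

```python
# Toy illustration of sharing a prompt-prefix KV cache across agents.
# Real systems cache per-layer key/value tensors, not hashes.

compute_calls = 0

def fake_kv(tokens):
    """Stand-in for the expensive per-token key/value computation."""
    global compute_calls
    compute_calls += 1
    return [hash(tok) for tok in tokens]  # pretend these are K/V tensors

class SharedPrefixCache:
    def __init__(self):
        self.cache = {}

    def get_kv(self, tokens):
        key = tuple(tokens)
        if key not in self.cache:
            self.cache[key] = fake_kv(tokens)
        return self.cache[key]

cache = SharedPrefixCache()
shared_prompt = ["system:", "you", "are", "a", "helpful", "agent"]

# Two agents with different LoRA roles reuse one cached prefix.
kv_a = cache.get_kv(shared_prompt)
kv_b = cache.get_kv(shared_prompt)
assert kv_a is kv_b        # same cached object, no recomputation
assert compute_calls == 1  # the prefix was encoded only once
```

Without the shared cache, each agent would call the expensive encode step once, duplicating work that is byte-for-byte identical.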
This paper shows a simple, single-model way to dub videos so that the new voice and the speaker's lip movements stay naturally in sync.
Diffusion models generate pictures from noise, but the results often fail to follow what the prompt actually asks for or to match what looks good to humans.
Traditional supervised fine-tuning (SFT) pushes a model to copy a single reference answer exactly, so it can overfit to the surface wording instead of the underlying idea.
The paper asks which small, add-on training methods (parameter-efficient fine-tuning, PEFT) work best when we teach language models with yes/no rewards we can check automatically (reinforcement learning with verifiable rewards, RLVR).
DreamOmni3 lets people edit and create images by combining text, example images, and quick hand-drawn scribbles.
C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.
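The core idea behind the memory saving is weight tying: one set of layer parameters is applied repeatedly, so the model gets deeper without getting bigger. A schematic sketch of the difference (a toy affine map stands in for a transformer block; not any specific paper's architecture):

```python
def make_layer(seed):
    """Stand-in for a transformer block: a fixed affine map on a scalar."""
    a, b = 1.0 + seed * 0.01, seed * 0.1
    return lambda x: a * x + b

def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

# Standard transformer: 4 distinct layers -> 4 sets of parameters.
standard = [make_layer(i) for i in range(4)]

# Recursive transformer: one shared layer applied 4 times -> 1 set.
shared = make_layer(1)
recursive = [shared] * 4

_ = forward(recursive, 1.0)  # still a 4-layer-deep computation

# Same depth, but the recursive model stores only one layer's weights.
n_unique_standard = len({id(layer) for layer in standard})
n_unique_recursive = len({id(layer) for layer in recursive})
assert n_unique_standard == 4 and n_unique_recursive == 1
```

The catch, as the summary says, is that forcing every depth step through the same function limits what the network can express, which is the accuracy gap such papers try to close.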
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
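Sliding Window Attention restricts each token to attending over only its most recent w predecessors (itself included), which keeps attention cost linear in sequence length; the linear-memory module is what carries information from beyond the window. A minimal mask construction, illustrative only and not InfiniteVL's actual implementation:

```python
def sliding_window_mask(seq_len, window):
    """allowed[i][j] is True iff query token i may attend to key token j:
    causal (j <= i) and within the last `window` positions (i - j < window)."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=2)
assert mask[4][4] and mask[4][3]  # self and previous token: inside the window
assert not mask[4][2]             # too far back: outside the window
assert not mask[2][3]             # future token: blocked by causality
```

Each row has at most `window` True entries, so the per-token attention work stays constant as the sequence grows.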