Papers (4)

#Throughput

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Intermediate

Hyesung Jeon, Hyeongju Ha et al. · Feb 1 · arXiv

Multi-agent LLM systems often give each agent its own LoRA adapter for a specialized role, but every agent then rebuilds almost the same KV cache, wasting memory and time.

#LoRA #Multi-LoRA #KV cache

Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Intermediate

Xuan-Phi Nguyen, Shrey Pandit et al. · Jan 23 · arXiv

Mixture-of-Experts (MoE) models often send far more tokens to a few “favorite” experts, which overloads some GPUs while others sit idle.

#Mixture-of-Experts #Expert Parallelism #Least-Loaded Expert Parallelism

NVIDIA Nemotron 3: Efficient and Open Intelligence

Intermediate

NVIDIA et al. · Dec 24 · arXiv

Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and cheaper.

#Nemotron 3 #Mixture-of-Experts #LatentMoE

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Intermediate

NVIDIA et al. · Dec 23 · arXiv

Nemotron 3 Nano is a new open-source language model that mixes two brain styles (Mamba and Transformer) and adds a team of special experts (MoE) so it thinks better while running much faster.

#Mixture-of-Experts #Mamba-2 #Transformer