Papers: 2

#Load Balancing

Revisiting Parameter Server in LLM Post-Training

Large language model (LLM) post-training has uneven work per GPU because some text sequences are much longer than others.

#On-Demand Communication #Fully Sharded Data Parallel #Parameter Server

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Intermediate

NVIDIA et al., Dec 23, arXiv

Nemotron 3 Nano is a new open-source language model that mixes two brain styles (Mamba and Transformer) and adds a team of special experts (MoE) so it thinks better while running much faster.

#Mixture-of-Experts #Mamba-2 #Transformer