Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 8: Parallelism 2 | How I Study AI
šŸ“š Stanford CS336: Language Modeling from Scratch — Lecture 8 of 17

Intermediate Ā· Stanford Online Ā· Deep Learning Ā· YouTube

Key Summary

  • This session explains how to speed up and scale training when one GPU or a simple setup is not enough. It reviews data parallelism (split data across devices) and pipeline parallelism (split model across devices), then dives into practical fixes for their main bottlenecks. The key tools are gradient accumulation, virtual batch size, and interleaved pipeline stages. You’ll learn the trade‑offs between memory use, communication overhead, and idle time.
  • Data parallelism lets each GPU process a different slice of the batch, compute gradients, and then combine them with an all‑reduce operation. The upside is easy scaling when you have lots of data; the downside is the cost and delay of synchronizing gradients across devices. This communication can become the training bottleneck as GPU counts grow. Managing when and how often you synchronize is crucial.
  • Pipeline parallelism splits the model’s layers across devices so a single example flows through the devices like water through a pipe. The drawback is pipeline bubbles: some devices sit idle while the pipe fills and empties. You get high utilization only in the middle of a step. Scheduling tricks and microbatching are used to reduce idle time.
  • Virtual batch size is the ā€˜effective’ batch size you train with, even if you can’t fit it in memory all at once. Gradient accumulation makes this possible by processing several mini‑batches in sequence, summing their gradients, and stepping the optimizer once at the end. You get the benefits of larger batches without needing more GPU memory per step. The trade‑off is extra time per update because you perform more forward/backward passes before stepping.
  • Gradient accumulation pairs naturally with data parallelism. Each device accumulates gradients locally across several mini‑batches, and only then runs the costly all‑reduce to aggregate gradients across devices. This reduces the number of synchronization events and can increase throughput. All devices should use the same per‑device batch for balanced work.

Why This Lecture Matters

Training modern language models pushes the limits of memory, compute, and network bandwidth. Engineers, researchers, and ML practitioners need reliable ways to scale beyond a single device without wasting resources or hurting convergence. The techniques here—gradient accumulation, virtual batch size, interleaving, and their combinations with data and pipeline parallelism—directly target the most common bottlenecks: memory limits, idle compute, and communication overhead. This knowledge lets you fit larger models, sustain higher throughput, and reduce training cost per token.

In real projects, you’ll often find that simply adding GPUs doesn’t speed things up because synchronization becomes dominant. Gradient accumulation lets you cut the number of all-reduces; interleaving and microbatching keep pipelines full; and careful batch/accumulation choices maintain stable optimization. These strategies translate to faster experiments, better hardware utilization, and the ability to train models that would otherwise not fit. They also reduce operational risks like out-of-memory errors and straggler-induced slowdowns.

From a career perspective, being able to diagnose and fix distributed training bottlenecks is highly valued. Teams building large models must balance compute and communication, and those who can orchestrate data parallelism, pipeline scheduling, and accumulation stand out. In an industry where training budgets and timelines are tight, mastering these techniques can directly improve product delivery speed and research velocity. As models and datasets continue to grow, the importance of efficient parallel training only increases.

Lecture Summary

01 Overview

This lecture focuses on practical strategies to scale and speed up training of large language models when a single GPU or naĆÆve setup becomes a bottleneck. It builds on two foundational ideas: data parallelism (splitting the data across devices) and pipeline parallelism (splitting the model across devices). While both approaches increase throughput and enable training larger models, each introduces its own bottleneck: data parallelism requires frequent network-wide gradient aggregation (an all-reduce), and pipeline parallelism suffers from pipeline bubbles, where some devices wait idle while others compute. The goal here is to introduce techniques that directly address these pain points without changing model quality.

The first big tool is gradient accumulation and the closely related idea of virtual batch size. Virtual batch size is the effective batch you want the optimizer to ā€œsee,ā€ even if your GPU cannot hold that many samples at once. Gradient accumulation achieves this by processing several smaller mini-batches sequentially, summing their gradients, and then applying a single optimizer update. This lets you enjoy many benefits of large batches—more stable gradients, potentially faster convergence or better final quality—without needing the memory for all samples simultaneously. The price you pay is extra compute time per update, because you must do multiple forward/backward passes before stepping the weights.
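To make this concrete, here is a minimal plain-Python sketch (no framework; the model, data, and function names are all illustrative) showing that accumulating per-micro-batch gradients, each scaled by 1/A, reproduces the gradient of the full virtual batch B_v = B Ɨ A:

```python
# Toy sketch: gradient accumulation over A micro-batches reproduces the
# gradient of one large batch of size B_v = B * A.
# Model: scalar linear model y_hat = w * x with squared-error loss.

def grad_squared_error(w, batch):
    """Mean gradient of (w*x - y)^2 w.r.t. w over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, microbatches):
    """Sum per-microbatch gradients, each normalized by A (the number of
    accumulation steps), mimicking `loss = loss / A` before backward()."""
    A = len(microbatches)
    total = 0.0
    for mb in microbatches:
        total += grad_squared_error(w, mb) / A
    return total

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5
full = grad_squared_error(w, data)                 # one big batch (B_v = 4)
accum = accumulated_grad(w, [data[:2], data[2:]])  # B = 2, A = 2
assert abs(full - accum) < 1e-12                   # identical update direction
```

The optimizer step taken after accumulation is therefore the same as a true large-batch step; only the wall-clock cost per update differs.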

Next, the lecture revisits data parallelism in light of gradient accumulation: if all-reduce synchronization is slowing you down, you can accumulate gradients locally for a few mini-batches on each device and then synchronize less often. This approach reduces the number of all-reduces and can significantly speed training on clusters where network bandwidth is limited. Importantly, per-device batch sizes and accumulation steps should be the same across devices to keep the workload balanced and avoid stragglers.
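Why this is safe to do follows from linearity: averaging across devices after every micro-batch and then summing gives the same result as summing locally and averaging once. A toy simulation (plain Python, not real DDP; `all_reduce_mean` is a stand-in for the collective operation) makes the point, with the sync count dropping from A to 1:

```python
# Toy simulation: accumulating locally for A micro-batches and all-reducing
# once yields the same combined gradient as all-reducing after every
# micro-batch, but with a single synchronization event instead of A.

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every device receives the mean."""
    return sum(values) / len(values)

# Per-device, per-microbatch gradients: grads[d][a] (D=2 devices, A=3 steps).
grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A = 3

# Frequent sync: all-reduce after every micro-batch, then sum the results.
frequent = sum(all_reduce_mean([g[a] for g in grads]) for a in range(A))

# Accumulate-then-sync: each device sums locally, one all-reduce at the end.
local_sums = [sum(g) for g in grads]
rare = all_reduce_mean(local_sums)

assert frequent == rare == 10.5   # same gradient, one third of the syncs
```

In PyTorch DDP this corresponds to wrapping the first Aāˆ’1 micro-batches in the `no_sync()` context so gradients accumulate locally and only the final backward triggers the all-reduce.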

The lecture then returns to pipeline parallelism and its central challenge: pipeline bubbles that happen while a stage waits to receive activations or gradients. One technique to cut bubble time is interleaved pipeline stages. Instead of giving each device one contiguous block of layers (e.g., layers 1–3 on device 1, 4–6 on device 2), you interleave (e.g., device 1 runs layers 1, 5, 9; device 2 runs 2, 6, 10; etc.). This alternation reduces how long devices wait between tasks, improving average utilization. However, it increases communication because activations hop more frequently between devices.
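The two layer-placement schemes can be sketched as simple assignment functions (illustrative names, layers numbered from 0 rather than 1):

```python
# Contiguous vs. round-robin (interleaved) layer-to-device assignment
# for num_layers layers spread across num_devices devices.

def contiguous(num_layers, num_devices):
    """Device d gets one contiguous block of layers."""
    per = num_layers // num_devices
    return {d: list(range(d * per, (d + 1) * per)) for d in range(num_devices)}

def interleaved(num_layers, num_devices):
    """Device d gets every num_devices-th layer, starting at layer d."""
    return {d: [l for l in range(num_layers) if l % num_devices == d]
            for d in range(num_devices)}

# 12 layers on 4 devices:
assert contiguous(12, 4)[0] == [0, 1, 2]    # device 0: a block of early layers
assert interleaved(12, 4)[0] == [0, 4, 8]   # device 0: layers spread through the model
```

With the interleaved mapping, every layer boundary except within a device's own slots is a cross-device hop, which is exactly the extra communication cost the lecture warns about.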

Finally, gradient accumulation is combined with pipeline parallelism. By splitting a large batch into multiple microbatches and streaming them through the pipe, you keep all stages busy while deferring the weight update until after several microbatches. This approach reduces idle time across the pipe but requires memory to store and combine gradients from multiple microbatches. As with all distributed strategies, the best setup depends on your model size, dataset, number and type of devices, network speed, and memory constraints.

This lecture is aimed at learners who already understand basic deep learning, training loops (forward pass, loss, backward pass, optimizer step), and the concepts of gradients and batches. It is appropriate for intermediate students who have seen data and pipeline parallelism at a high level and want to master the practical knobs to squeeze more efficiency out of their hardware. You should be comfortable with the idea that distributed training has both compute and communication components, and that performance is often limited by whichever is slower.

By the end, you will be able to: define and use virtual batch size; implement gradient accumulation to mimic large-batch training within small memory; combine accumulation with data parallelism to reduce all-reduce frequency; explain pipeline bubbles and apply interleaving to reduce idle time; and orchestrate gradient accumulation with pipeline parallelism to keep stages busy while controlling memory. You’ll also be able to choose among these tools based on constraints, and articulate the relationship between model parallelism and pipeline parallelism (pipeline is a specific kind of model parallelism).

The lecture is structured as follows. It starts with a quick review of data parallelism and pipeline parallelism with their bottlenecks: all-reduce overhead and pipeline bubbles, respectively. It then introduces virtual batch size and gradient accumulation, including the key identity B_v = B Ɨ A (effective batch equals actual per-device batch times accumulation steps). After that, it shows how accumulation reduces synchronization cost in data parallelism and how interleaving reduces idle time in pipeline parallelism. It closes by combining accumulation with pipeline parallelism to keep the pipe fuller, and ends with practical guidance on choosing the right mix of techniques for your setup and a clarification that pipeline parallelism is a form of model parallelism.

Key Takeaways

  • āœ“Profile before tuning: Measure forward time, backward time, and synchronization time to identify the real bottleneck. If sync dominates, accumulation can help; if idle time in pipelines dominates, increase microbatches or interleave. Avoid changing many knobs at once so you can see the impact of each fix. Keep a baseline run for comparison.
  • āœ“Use virtual batch size to plan stability: Decide the effective batch your optimizer needs, then reach it via B_v = B Ɨ A. Keep per-microbatch B small to fit memory and raise A to hit the target. Normalize loss by A or adjust learning rate to maintain consistent update magnitude. Validate that learning curves match a true large-batch run.
  • āœ“Reduce all-reduce frequency with GA in DP: Wrap the first Aāˆ’1 microbatches in a no-sync context to avoid premature gradient sharing. Synchronize only on the A-th microbatch, then step. This boosts compute-to-communication ratio and improves scaling on bandwidth-limited clusters. Ensure equal B and A on all devices to prevent stragglers.
  • āœ“Balance pipeline stages: Partition layers so each stage has similar compute time to minimize bubbles. If one stage is heavy, consider moving layers or interleaving. Re-measure after changes to confirm utilization gains. Uneven stages cause persistent idle time.
  • āœ“Use microbatching to fill the pipeline: Split batches into multiple microbatches so early stages can start new work while later stages finish old work. Accumulate gradients across microbatches and update once. Find the microbatch count that fits memory while keeping utilization high. Too few microbatches leave stages idle; too many can cause OOM.
  • āœ“Try interleaving only on fast interconnects: Interleaving reduces bubble time but increases activation transfers. On NVLink/InfiniBand, it often helps; on PCIe-only systems, it may hurt. Benchmark both contiguous and interleaved assignments. Choose the one with better tokens/sec and stable training.

Glossary

Data Parallelism (DP)

A way to speed up training by making copies of the model on many devices and giving each device a different chunk of data. Each device computes gradients on its chunk. Then all devices combine their gradients so they update the model the same way. It’s useful when you have lots of data to process. It’s limited by how fast devices can share gradients.

Pipeline Parallelism (PP)

A way to split a big model across devices by putting different layers on different devices. An input flows through the layers like an item on an assembly line. Each device works on its layers, then passes results forward. This allows training models that don’t fit on one GPU. It can suffer from idle time while the pipeline starts and stops.

Model Parallelism

Any method that splits parts of a model across multiple devices. Pipeline parallelism is one specific type where layers are arranged in sequence. Other styles can split within layers too. It’s used when a single device can’t hold the whole model. It helps with memory limits but adds communication needs.

All-Reduce

A network operation where all devices share and combine data (like summing gradients) and each device gets the final result. It’s how models in data parallelism keep weights in sync. The speed depends on network bandwidth and latency. If it’s slow, training slows down even if GPUs are fast. Reducing how often it happens can help.

Tags: data parallelism, pipeline parallelism, model parallelism, gradient accumulation, virtual batch size, all-reduce, microbatching, pipeline bubbles, interleaved stages, distributed training, throughput, communication overhead, GPU memory, synchronization, optimizer step, loss normalization, learning rate scaling, utilization, bandwidth, latency
  • Interleaved pipeline stages reorder layers so each device holds non‑contiguous layers (e.g., device 1 gets layers 1, 5, 9). This reduces the time to fill and drain the pipe since work alternates more frequently across devices. You typically see fewer idle gaps and better utilization. The downside is more frequent activation transfers, increasing communication overhead.
  • Gradient accumulation also helps pipeline parallelism by splitting a big batch into multiple microbatches that stream through the pipe. While one microbatch is waiting on a later stage, another microbatch can be entering the early stage, keeping devices busy. You delay the weight update until after several microbatches and then step once. This keeps the pipeline fuller, but you must store and combine gradients, which uses more memory.
  • Choosing among techniques depends on your constraints: model size, dataset size, number of GPUs, network bandwidth, and memory limits. Very large models often require pipeline parallelism to fit at all. Very large datasets push you toward data parallelism to process more examples per unit time. Tight memory budgets motivate gradient accumulation.
  • A simple mental model helps: data parallelism fights ā€˜not enough data throughput,’ pipeline parallelism fights ā€˜model too big to fit,’ and gradient accumulation fights ā€˜batch too big to fit.’ Interleaving fights pipeline idle time but adds communication. You mix and match to target the bottleneck that matters most in your setup. Measure, then tune.
  • Virtual batch size B_v is the product of actual per‑device batch B and the number of accumulation steps A: B_v = B Ɨ A. This equation guides you to hit the target effective batch your optimizer needs for stable training. You can keep B small to fit memory and raise A to reach the same B_v. Just remember you’ll do more compute per optimizer step.
  • Communication cost versus computation cost is the central trade‑off in distributed training. Reducing synchronization events (fewer all‑reduces) saves time when network bandwidth is limited. Cutting pipeline bubbles saves time when device compute would otherwise sit idle. Your best configuration balances these costs for your hardware.
  • Pipeline parallelism is a specific type of model parallelism where you split layers into ordered stages. Model parallelism is the broader term for placing different parts of a model on different devices in any arrangement. Recognizing this relationship helps you reason about other model parallel strategies. Pipeline interleaving is just one scheduling variant within this family.
02 Key Concepts

    • 01

      šŸŽÆ Data Parallelism: Splitting the same model across devices so each one trains on a different slice of the data. šŸ  It's like having multiple chefs cook the same recipe on different portions of ingredients and comparing notes at the end. šŸ”§ Each GPU runs forward/backward on its mini-batch, computes gradients, then all gradients are aggregated (usually with an all-reduce) so weights update the same everywhere. šŸ’” Without aggregation, models would drift apart and training would diverge; with too-frequent aggregation, network time can dominate. šŸ“ In practice, 8 GPUs each process a mini-batch of size 128, then synchronize gradients once so the model updates as if trained on a batch of 1024.

    • 02

      šŸŽÆ All-Reduce Bottleneck: The network-wide operation that combines gradients across devices. šŸ  Think of it like everyone on a group call trying to share their notes at the same time over a slow connection. šŸ”§ All-reduce sums or averages gradient tensors across GPUs; latency and bandwidth determine how long it takes. šŸ’” If this step is slow, adding more GPUs won't speed training because they spend time waiting to sync. šŸ“ On a cluster with limited bandwidth, gradient syncs after every mini-batch can stall overall throughput.

    • 03

      šŸŽÆ Pipeline Parallelism: Splitting the model's layers across devices in ordered stages so one example flows through them. šŸ  Like a factory assembly line where each station does a different part of the job. šŸ”§ Device 1 runs early layers, passes activations to device 2 for later layers, and so on; backward gradients flow in reverse. šŸ’” This lets very large models train when they can't fit on a single GPU. šŸ“ A 12-layer transformer might put layers 1–3 on GPU 1, 4–6 on GPU 2, 7–9 on GPU 3, and 10–12 on GPU 4.

    • 04

      šŸŽÆ Pipeline Bubbles: Idle time in pipeline parallelism while stages wait for activations or gradients. šŸ  Imagine a water pipe that must fill before water flows steadily and then must drain at the end. šŸ”§ At the start (fill) and end (drain) of a step, some devices have nothing to compute, reducing utilization. šŸ’” Without addressing bubbles, expensive GPUs sit idle and you lose the benefits of parallelism. šŸ“ With two stages and one big batch, the second device initially waits for the first to finish early layers, and later the first waits for gradients to return.

    • 05

      šŸŽÆ Virtual Batch Size (B_v): The effective batch size the optimizer sees, even if you don't load it at once. šŸ  It's like reading a 100-page book in four 25-page sittings but still saying you finished the whole book today. šŸ”§ If B is the per-device batch and A is the number of accumulation steps, then B_v = B Ɨ A. šŸ’” This lets you get large-batch behavior (stable gradients) on small-memory GPUs. šŸ“ Using B=256 and A=4 gives B_v=1024 without ever storing 1024 examples at once.

    • 06

      šŸŽÆ Gradient Accumulation: Summing gradients from multiple mini-batches before taking one optimizer step. šŸ  Like saving coins from several allowances and making one big purchase later. šŸ”§ You run forward/backward on small mini-batches, add gradients to a running total, and call optimizer.step() after A such mini-batches. šŸ’” Without this, you can't enjoy large-batch stability if memory is tight. šŸ“ Process four mini-batches of size 256, accumulate gradients, then update once to emulate batch size 1024.

    • 07

      šŸŽÆ Trade-offs of Gradient Accumulation: Memory-friendly large-batch training at the cost of time per update. šŸ  It's like taking more trips with a smaller backpack instead of one trip with a suitcase. šŸ”§ You do A times more forward/backward passes before each optimizer step, increasing wall-clock time per step. šŸ’” This helps when memory is the bottleneck but can slow learning dynamics if steps are too infrequent. šŸ“ Accumulating 8 times can halve steps per hour compared to no accumulation, even though total examples/hour may be similar.

    • 08

      šŸŽÆ Accumulation with Data Parallelism: Accumulate locally to reduce how often you all-reduce. šŸ  It's like everyone writes notes for several pages before the group shares summaries. šŸ”§ Each GPU processes several mini-batches, sums gradients, then all-reduces once per accumulation cycle so synchronization events drop. šŸ’” Fewer all-reduces mean less time spent waiting on the network. šŸ“ With four GPUs and target B_v=4000, each can use B=250 and A=4, syncing once instead of after every 250.

    • 09

      šŸŽÆ Balanced Batch Across Devices: Keep per-device work equal. šŸ  Like teammates splitting chores evenly to finish at the same time. šŸ”§ If one device has a larger per-device batch or different A, it becomes a straggler that delays synchronization and pipeline turns. šŸ’” Imbalanced loads waste time and reduce throughput. šŸ“ Ensure each of 8 GPUs uses the same batch size and accumulation steps so all-reduce starts promptly.

    • 10

      šŸŽÆ Interleaved Pipeline Stages: Place non-contiguous layers on each device to reduce bubble time. šŸ  It's like alternating tasks between two workers so neither sits idle for long. šŸ”§ Device 1 might own layers 1, 5, 9; device 2 owns 2, 6, 10; device 3 owns 3, 7, 11; device 4 owns 4, 8, 12; activations hop more frequently. šŸ’” This can improve utilization by shortening the fill and drain phases. šŸ“ With two devices and four layers, assigning (1,3) to device 1 and (2,4) to device 2 keeps both busier.

    • 11

      šŸŽÆ Communication Overhead of Interleaving: More hops mean more transfers. šŸ  Like passing a ball back and forth more often—everyone stays engaged, but more throws take time. šŸ”§ Interleaving increases the number of activation/gradient messages across devices, stressing bandwidth and latency. šŸ’” If your network is slow, the extra chatter can erase utilization gains. šŸ“ On PCIe-only systems, interleaving may hurt; on NVLink/InfiniBand, it can help.

    • 12

      šŸŽÆ Accumulation with Pipeline Parallelism: Stream microbatches through the pipe and step once after several. šŸ  Like a bakery sending a steady stream of smaller trays into the oven to keep it hot. šŸ”§ Break a large batch into microbatches, feed them through stages, accumulate gradients for each microbatch, then update once at the end. šŸ’” This fills the pipeline better and reduces idle time. šŸ“ Split a batch of 4 into two microbatches of 2 so while one microbatch is at later layers, the next starts at earlier layers.

    • 13

      šŸŽÆ Memory Impact of Accumulating in Pipelines: Stored gradients add up. šŸ  It's like stacking several trays of cookies on the counter until you package them all at once. šŸ”§ Accumulating across multiple microbatches means holding gradient or optimizer states until you step, which uses memory. šŸ’” If memory is already tight, too many microbatches can cause out-of-memory errors. šŸ“ Accumulating across 8 microbatches might require reducing per-microbatch size to fit.

    • 14

      šŸŽÆ Practical Selection Heuristics: Choose the technique that targets your bottleneck. šŸ  If the car can’t hold the luggage, split the load; if traffic is slow, avoid extra trips. šŸ”§ Use pipeline parallelism when model size doesn’t fit one GPU; use data parallelism for large datasets; add accumulation to reach a larger effective batch without extra memory; try interleaving to shrink bubbles. šŸ’” Measure hardware utilization, step time, and network usage to guide decisions. šŸ“ Start with data parallelism, add accumulation to reduce syncs, then consider pipeline or interleaving if the model still doesn’t fit or devices are idle.

    • 15

      šŸŽÆ Model vs. Pipeline Parallelism: Pipeline is a specific kind of model parallelism. šŸ  Model parallelism is like splitting a big puzzle among tables; pipeline is arranging tables in a line so pieces pass from one table to the next. šŸ”§ Model parallelism broadly means distributing different parts of a model across devices; pipeline specifically organizes them as sequential stages for streaming. šŸ’” Understanding this hierarchy helps you reason about alternatives and combinations. šŸ“ You might use tensor parallelism (another model-parallel style) alongside pipeline and data parallel strategies.

    • 16

      šŸŽÆ Activations: The outputs of each layer that must be sent forward (and kept for backward). šŸ  Like intermediate steps in a recipe you need to remember to finish the dish. šŸ”§ In pipelines, activations are transmitted between devices and cached for gradient computation. šŸ’” They drive both memory use and communication volume. šŸ“ Large sequence lengths and hidden sizes inflate activation sizes, increasing transfer time.

    • 17

      šŸŽÆ Gradient Aggregation Mechanics: How gradients from different mini-batches/devices get combined. šŸ  Like adding everyone’s scores to get a class total and then averaging. šŸ”§ Within a device, accumulation sums gradients across microbatches; across devices, all-reduce averages or sums them so all replicas agree. šŸ’” Correct aggregation ensures consistent updates and stable training. šŸ“ Normalizing by total examples keeps learning rate behavior consistent when changing B or A.

    • 18

      šŸŽÆ Microbatches vs. Minibatches: The smaller slices used inside a step vs. the per-device batch. šŸ  Cutting a sandwich into bites (microbatches) inside a meal (minibatch). šŸ”§ A minibatch is what each device processes at a time; microbatches are how you subdivide work for pipeline streaming or accumulation. šŸ’” Microbatching helps keep devices busy and fit memory budgets. šŸ“ A per-device minibatch of 256 can be processed as 8 microbatches of 32 each with accumulation.

    • 19

      šŸŽÆ Communication vs. Computation Balance: Which one limits you determines the best fix. šŸ  If the kitchen is fast but deliveries are slow, reduce trips; if deliveries are fast but the kitchen is slow, add cooks. šŸ”§ When communication is the bottleneck, reduce sync frequency (accumulation) or message size; when computation is the bottleneck, add parallel stages or improve utilization (interleave). šŸ’” Matching the remedy to the bottleneck yields the biggest gains. šŸ“ Profiling shows whether GPUs are compute-bound or waiting on network.

    • 20

      šŸŽÆ Batch Size and Convergence: Larger batches change gradient noise and optimization dynamics. šŸ  Stirring a big pot gives a smoother mix than a tiny cup with bumps. šŸ”§ Big batches provide more accurate gradient estimates but may require learning-rate adjustments and can reduce iteration-level feedback. šŸ’” Virtual large batches via accumulation give stability without fitting all samples at once. šŸ“ If you increase B_v, consider scaling the learning rate and monitoring loss smoothness.

    03 Technical Details

    Overall Architecture/Structure

    1. Data Parallelism (DP)
    • Role of each component: Every device (GPU) holds a full copy of the model. Each device receives a different subset of the batch, computes forward pass, computes loss and backward pass to obtain gradients. Then gradients are combined across devices (usually via an all-reduce) so all model replicas apply the same weight update.
    • Data flow: Input batch is split among devices. Each device computes activations and gradients locally. A synchronization step aggregates gradients across devices. The optimizer step updates identical weights on all devices.
    • Bottleneck: The all-reduce operation can dominate time if bandwidth/latency is limited or the number of devices is large. More devices mean larger total gradient volume and more frequent synchronization events unless mitigated.
    2. Pipeline Parallelism (PP)
    • Role of each component: The model is split across devices into ordered stages (e.g., Stage 1 = layers 1–3, Stage 2 = layers 4–6, etc.). A single example (or microbatch) flows forward through Stage 1, then Stage 2, etc. Backward gradients flow in reverse order (Stage N to Stage N-1).
    • Data flow: Stage 1 computes activations and sends them to Stage 2. During backprop, later stages compute gradients and send them upstream. At the beginning and end of a step, devices are underutilized—the classic ā€˜pipeline bubble.’
    • Bottleneck: Pipeline bubbles (idle periods) reduce effective throughput. Some communication is required to transmit activations and gradients between stages.
    3. Gradient Accumulation (GA) and Virtual Batch Size (B_v)
    • Role of each component: GA increases the effective batch size without increasing per-device memory by accumulating gradients across several microbatches before stepping the optimizer. Virtual batch size (B_v) is defined as the batch size the optimizer effectively experiences.
    • Data flow: For A accumulation steps, each mini/microbatch performs forward+backward, gradients are added to a running buffer, and weights are not updated until the A-th pass. Then a single optimizer step applies the accumulated gradients.
    • Bottleneck: GA increases compute time per update (more forward/backward cycles before a step) and may require careful memory management of gradient buffers.

    Formal Relationship and Scheduling

    • The key identity: B_v = B Ɨ A, where B is the actual per-device batch processed at a time, and A is the number of accumulation steps. If you have D devices in data parallel, the global effective batch is D Ɨ B Ɨ A (assuming identical B and A across devices). In pure single-device accumulation, D=1 and B_v=BƗA.
    • Practical normalization: Many implementations divide the loss by A (or by total batch size) so that the magnitude of gradients remains comparable whether or not accumulation is used. Alternatively, you can sum raw gradients and let the optimizer implicitly scale updates based on the final total.
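The scaling relationship above can be checked numerically with a plain-Python sketch (illustrative toy model, same scalar linear setup as any textbook example): summing raw per-micro-batch mean gradients gives A times the full-batch mean gradient, so you either divide each loss by A or scale the learning rate down by A.

```python
# Numeric check of the normalization note: without dividing the loss by A,
# the accumulated gradient is A times the full-batch mean gradient.

def mean_grad(w, batch):
    """d/dw of mean squared error for y_hat = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 1.0), (2.0, 2.0), (3.0, 2.0), (4.0, 5.0)]
micro = [data[:2], data[2:]]          # B = 2, A = 2
w, A = 1.0, 2

full = mean_grad(w, data)                        # gradient on the full batch
summed = sum(mean_grad(w, mb) for mb in micro)   # raw accumulation, no 1/A
assert abs(summed - A * full) < 1e-12            # A times larger
assert abs(summed / A - full) < 1e-12            # dividing by A restores parity
```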

    Implementation Details (General/PyTorch-like)

    • Basic training loop with GA:
      1. zero_grad() at the start of an accumulation cycle.
      2. For step in 1..A:
        • Forward pass on microbatch.
        • Compute loss; optionally divide by A to normalize.
        • Backward pass: loss.backward() accumulates gradients in .grad tensors (they add up by default).
      3. After A microbatches: optimizer.step() to update weights, then optimizer.zero_grad() to reset.
    • With Distributed Data Parallel (DDP):
      • DDP typically all-reduces gradients on every backward unless you wrap accumulation steps with a no_sync() context to avoid premature synchronization. This way, gradients accumulate locally for A-1 steps and sync only on the last.
      • Ensure each device uses the same B and A to keep computation balanced; otherwise, some devices will reach synchronization earlier and idle.
    • Loss scaling and normalization:
      • If you divide loss by A at each microbatch, the final accumulated gradient matches the non-accumulated case. If you do not normalize, gradients are A times larger, which you can compensate by lowering the learning rate by the same factor.
    • Gradient clipping:
      • Clip after accumulation, right before optimizer.step(), to clip the combined gradient as if it came from the full batch.
    • Mixed precision (FP16/bfloat16):
      • When using automatic mixed precision (AMP), keep accumulation buffers in stable precision (often FP32 master weights/gradients). Apply loss scaling as usual and unscale before clipping.

    Combining GA with Data Parallelism

    • Goal: Reduce the number of expensive all-reduces.
    • Mechanism: Each device accumulates gradients across A microbatches with DDP sync disabled (e.g., no_sync). On the A-th microbatch, allow DDP to synchronize via all-reduce so all replicas get the same combined gradient. Then step the optimizer.
    • Benefits: Fewer network synchronizations per unit of data processed; increased compute-to-communication ratio; better scaling when network is the bottleneck.
    • Considerations: Keep B and A identical across devices; monitor gradient memory usage; choose A to balance fewer syncs against longer time to each optimization step.

    Pipeline Parallelism Scheduling and Bubbles

    • Fill/drain effect: With S stages and M microbatches in flight, the pipeline reaches steady utilization only after initial fill (about S-1 intervals) and then loses utilization during final drain.
    • Microbatching: Splitting a large batch into M microbatches lets you start processing subsequent microbatches on early stages while later stages are still busy with earlier microbatches. This increases overlap and reduces idle time.
    • Gradient timing: Backprop must respect dependency order (later layers depend on earlier ones). Schedules carefully interleave forward and backward computations to minimize bubbles.
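
    Under common simplifying assumptions (equal stage times, a GPipe-style schedule), the fill/drain cost has a closed form worth sketching: a step takes roughly M + S - 1 intervals, of which S - 1 are bubble.

```python
# Back-of-envelope sketch (assumes equal stage times, GPipe-style schedule):
# fraction of a step lost to fill/drain bubbles with S stages, M microbatches.

def bubble_fraction(s_stages, m_microbatches):
    return (s_stages - 1) / (m_microbatches + s_stages - 1)

# More microbatches shrink the bubble:
f1 = bubble_fraction(4, 1)    # 0.75 -> mostly idle with a single microbatch
f8 = bubble_fraction(4, 8)    # ~0.27
f32 = bubble_fraction(4, 32)  # ~0.09
```

    This is why microbatching is the standard fix: the bubble shrinks roughly as 1/M, at the cost of more in-flight activations.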

    Interleaved Pipeline Stages

    • Mapping: Instead of contiguous blocks of layers per device, assign layers in a round-robin fashion (e.g., device 1: layers 1, 5, 9; device 2: layers 2, 6, 10; etc.). Hand-offs become more frequent: each device works on a small slice of a sample, passes it on, and soon receives its next slice.
    • Why it reduces bubbles: Shorter gaps between finishing one layer and receiving the next task mean less idle waiting during both fill and drain phases. The pipeline becomes more interleaved in time.
    • Communication cost: More hand-offs mean more activation and gradient transfers. If interconnect bandwidth is limited (e.g., PCIe), this overhead can offset reduced bubble time. If bandwidth is strong (e.g., NVLink/InfiniBand), interleaving often wins.
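
    The round-robin mapping above can be sketched in a few lines (1-indexed layers and devices, 12 layers on 4 devices to match the example):

```python
# Tiny sketch of layer-to-device mapping: interleaved (round-robin)
# versus contiguous blocks, for n_layers evenly divisible by n_devices.

def interleaved_device(layer, n_devices):
    return (layer - 1) % n_devices + 1

def contiguous_device(layer, n_layers, n_devices):
    per_device = n_layers // n_devices
    return (layer - 1) // per_device + 1

# 12 layers on 4 devices: device 1 holds layers 1, 5, 9 when interleaved,
# but layers 1, 2, 3 when contiguous.
interleaved = [l for l in range(1, 13) if interleaved_device(l, 4) == 1]
contiguous = [l for l in range(1, 13) if contiguous_device(l, 12, 4) == 1]
```

    Note the trade-off stated above: the interleaved mapping triples the number of cross-device hand-offs per sample here (one per layer boundary instead of one per block boundary).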

    Gradient Accumulation with Pipeline Parallelism

    • Mechanism: Choose a microbatch count M per step. For each microbatch, push forward through the pipeline; as earlier microbatches progress to later stages, start the next microbatch in earlier stages. Accumulate gradients across microbatches and step once after M are processed.
    • Benefit: Keeps all stages busier by overlapping work across microbatches, smoothing the fill/drain effect and improving throughput.
    • Memory consideration: Accumulating across M microbatches can increase memory because gradients and some activation states must persist until the final update. If memory gets tight, reduce per-microbatch size or M.
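
    The forward-pass schedule implied above can be sketched as a tiny simulation (one time unit per stage per microbatch is a simplifying assumption): each stage waits for both the activation from the previous stage and its own previous microbatch to finish.

```python
# Illustrative sketch: start times for S contiguous stages processing M
# microbatches, forward pass only, one time unit per (stage, microbatch).

def forward_schedule(s_stages, m_micro):
    start = [[0] * m_micro for _ in range(s_stages)]
    for s in range(s_stages):
        for m in range(m_micro):
            ready = start[s - 1][m] + 1 if s > 0 else 0  # activation arrives
            free = start[s][m - 1] + 1 if m > 0 else 0   # stage finishes prior work
            start[s][m] = max(ready, free)
    return start

sched = forward_schedule(3, 4)          # 3 stages, 4 microbatches
makespan = sched[-1][-1] + 1            # = M + S - 1 = 6 time units
utilization = 4 / makespan              # each stage does M = 4 units of work
```

    The makespan matches the M + S - 1 fill/drain formula, and utilization (4/6 here) rises toward 1 as M grows, which is exactly why accumulating over more microbatches keeps the pipe fuller.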

    Tools/Libraries (Conceptual)

    • Any deep learning framework (e.g., PyTorch) supports accumulation by controlling when optimizer.step() happens and whether to normalize the loss.
    • For DDP-style training, use a mechanism to avoid sync during accumulation steps (e.g., no_sync()).
    • For pipeline parallelism, use a pipelining library or framework that can partition layers and schedule microbatches (even if done manually, ensure correct order of forward/backward and careful tensor transfers between devices).

    Step-by-Step Implementation Guide

    1. Decide your target effective batch (B_v) based on optimizer stability and convergence goals.
    2. Choose per-device batch size (B) that fits memory comfortably with your model and sequence length.
    3. Compute accumulation steps (A) so that B_v = B Ɨ A (single device) or global batch = D Ɨ B Ɨ A (data parallel with D devices). Adjust A to balance memory and update frequency.
    4. If using data parallelism:
      • Wrap the model with DDP (or similar).
      • Use no_sync (or equivalent) for the first A-1 microbatches each step to avoid all-reduce; allow sync on the A-th microbatch.
      • Normalize loss by A (or adjust learning rate accordingly).
      • Ensure all devices have identical B and A.
    5. If using pipeline parallelism (contiguous stages):
      • Partition layers across devices so each device’s memory and compute load are balanced.
      • Split your per-step batch into M microbatches; schedule them to enter the pipeline sequentially.
      • Accumulate gradients over M microbatches and step once per full batch.
    6. If using interleaved stages:
      • Reassign layers in a round-robin pattern across devices.
      • Measure activation transfer times; ensure interconnect can handle more frequent messages.
      • Keep microbatching to exploit overlap and reduce bubbles.
    7. Monitor and tune:
      • Profile GPU utilization, step time breakdown (forward, backward, sync), and memory usage.
      • If communication dominates, increase A (accumulation) to sync less often; if idle time dominates in pipeline, increase M (microbatches) or enable interleaving.
      • Adjust learning rate when B_v changes; keep gradient clipping after accumulation.
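
    Step 3 of the guide can be sketched as a small helper (names are illustrative), including the divisibility check that keeps every device's B and A identical:

```python
# Sketch of step 3: derive accumulation steps A from the target effective
# batch B_v, per-device batch B, and device count D; B_v must divide evenly.

def accumulation_steps(b_v, b, d=1):
    global_per_pass = d * b
    if b_v % global_per_pass != 0:
        raise ValueError("B_v must be a multiple of D * B")
    return b_v // global_per_pass

a_single = accumulation_steps(1024, 256)   # single device: A = 4
a_dp = accumulation_steps(4096, 128, d=8)  # 8 devices: A = 4
```

    If B_v does not divide evenly, adjust B or B_v rather than letting devices run different A values, which would create stragglers at the synchronization point.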

    Tips and Warnings

    • Correctness of scaling: If you change B_v, revisit learning-rate schedules. Many recipes scale LR roughly linearly with batch size up to a point, but always validate.
    • Gradient zeroing: Only zero gradients at the start of an accumulation cycle, not after every microbatch; otherwise, you’ll lose accumulated information.
    • Numerical stability: With mixed precision, unscale gradients before clipping; keep master weights in FP32.
    • Memory budgeting: Accumulation increases lifetime of gradient and optimizer states; pipeline increases activation traffic and storage. Reduce sequence length or per-microbatch size if near limits.
    • Load balance: In pipelines, ensure each stage’s compute time is roughly equal; uneven partitions create their own bubbles.
    • Network-aware choices: Interleaving helps mostly when your interconnect is fast. On slow links, fewer, larger messages (contiguous stages) may be better.
    • Debugging distributed runs: Start small (few devices, small model), add logging around synchronization points, verify identical loss curves across configurations when only B/A/M change but B_v stays constant.

    Putting It Together

    • If your model fits one GPU and network is slow: Use single-device GA to hit desired B_v.
    • If your model fits but your dataset is huge and you have many GPUs: Use DP, and add GA to reduce all-reduce frequency.
    • If your model doesn’t fit one GPU: Use PP; add microbatches and GA to keep the pipe full; consider interleaving if you have fast interconnect.
    • Always measure: Effective throughput (tokens/sec), GPU utilization, memory headroom, and validation metrics. Tune B, A, and microbatch counts for your constraints.

    04 Examples

    • šŸ’”

      Virtual Batch Size Calculation: Suppose per-device batch B=256 and you want an effective batch B_v=1024. Choose accumulation steps A=4 so B_v=BƗA=1024 without increasing memory per microbatch. You run four forward/backward passes, add gradients each time, then do one optimizer step. The key point is achieving large-batch stability with small memory.

    • šŸ’”

      Reducing All-Reduce Frequency with GA: You have 4 GPUs and target B_v=4000. Without GA, each GPU uses B=1000 and you all-reduce once per mini-batch. With GA, set B=250 and A=4; each GPU processes four microbatches, then performs one all-reduce and a single update. The key point is cutting synchronization events by 4x while keeping the same global batch.

    • šŸ’”

      Basic Pipeline Bubble: With two devices and four layers, assign layers 1–2 to device 1 and layers 3–4 to device 2. For the first microbatch, device 2 idles until device 1 finishes layers 1–2 and sends activations; later, device 1 idles waiting for gradients from device 2. This fill and drain produce bubbles where devices are idle. The example highlights why pipeline parallelism can underutilize hardware.

    • šŸ’”

      Interleaved Two-Device Pipeline: With the same four-layer model, assign layers (1,3) to device 1 and (2,4) to device 2. Device 1 computes layer 1 and passes activations, device 2 computes layer 2 and passes back, device 1 computes layer 3, and device 2 finishes with layer 4. This pattern keeps both devices active more consistently, shrinking idle windows. The key point is reduced bubble time at the cost of more activation transfers.

    • šŸ’”

      GA with Pipeline Microbatching: Split a batch of 4 into two microbatches of 2. Start microbatch 1 at stage 1, then when microbatch 1 moves to stage 2, start microbatch 2 at stage 1. Accumulate gradients for both microbatches and update once at the end. The example shows how microbatching keeps stages busier.

    • šŸ’”

      Memory Trade-off in GA for Pipelines: You choose 8 microbatches per step to boost utilization. Gradients must be stored or accumulated until the step is taken, increasing memory footprint. If you hit out-of-memory, you reduce microbatch size or microbatch count. The example emphasizes balancing utilization with memory limits.

    • šŸ’”

      Balanced Per-Device Work in DP: In a 4-GPU DP setup, if GPU 0 uses B=256 and others use B=128, GPU 0 will take longer per iteration. The other GPUs will reach the all-reduce point earlier and wait idle. Matching batch sizes (and A) across devices prevents stragglers. The key point is synchronized progress to avoid idle time.

    • šŸ’”

      Loss Normalization with Accumulation: Without dividing loss by A, accumulated gradients are A times larger. If you keep the same learning rate, the effective step is too big and training can destabilize. Normalize per-microbatch loss by A (or scale LR accordingly) to maintain consistent update magnitude. The example shows how scaling keeps convergence behavior steady.

    • šŸ’”

      When Not to Interleave: On a system with only PCIe interconnect, frequent activation hops are slow. Interleaving increases the number of transfers and may exceed PCIe bandwidth, wiping out benefits. Keeping contiguous stages may perform better by reducing messages. The key point is matching interleaving to your interconnect speed.

    • šŸ’”

      Choosing A for DP: Profiling shows 40% of your step time is gradient synchronization. Increasing A from 1 to 4 reduces all-reduce frequency by 4Ɨ, improving overall throughput. If compute becomes the new bottleneck, consider increasing per-device batch B or adding GPUs. The example illustrates tuning A based on profiling.

    • šŸ’”

      Convergence Considerations for Larger B_v: You raise B_v from 512 to 4096 via GA to stabilize training. You adjust the learning rate upward moderately and monitor validation loss to ensure no degradation. The result is smoother loss with fewer optimizer updates per epoch, which is acceptable because each update now covers far more data. The key point is carefully pairing B_v changes with LR tuning.

    • šŸ’”

      Pipeline Stage Balancing: In a 12-layer model, layers 1–3 take 20 ms, 4–6 take 30 ms, 7–9 take 25 ms, 10–12 take 15 ms per microbatch. If you assign them as contiguous stages, the 30 ms stage becomes a bottleneck and creates bubbles. You repartition or interleave to even out per-stage time. The key point is balancing compute across stages to reduce idle time.

    • šŸ’”

      Microbatch Count vs. Memory: You test microbatch count M={2,4,8} in a pipeline. M=8 maximizes utilization but causes OOM; M=4 fits memory with good utilization; M=2 underutilizes the pipeline. You choose M=4 as the best compromise. The key point is empirically finding the sweet spot.

    • šŸ’”

      DP + GA Correctness Check: You run the same number of total examples per step with (B=256, A=4) and with (B=1024, A=1). After applying loss normalization, both configurations produce nearly identical gradients and similar loss curves. This confirms your GA implementation is correct. The key point is validating equivalence to large-batch training.

    • šŸ’”

      Profiling to Identify Bottlenecks: You instrument your training to log forward time, backward time, and sync time. Logs show sync time dominates, so you increase A and see sync time per step drop sharply. If forward/backward time now dominates, you consider kernel optimizations or stage rebalancing. The key point is data-driven tuning.
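
    The stage-balancing example above can be made concrete with a small brute-force search over contiguous partitions. The per-layer times below are hypothetical but sum to the 20/30/25/15 ms blocks from that example.

```python
# Hedged sketch of stage balancing: brute-force the contiguous split of
# 12 layers into 4 pipeline stages that minimizes the slowest stage.
from itertools import combinations

layer_ms = [5, 7, 8, 12, 10, 8, 9, 8, 8, 5, 5, 5]  # assumed per-layer costs (ms)

def best_partition(times, n_stages):
    n = len(times)
    best_max, best_stages = float("inf"), None
    # Try every choice of n_stages - 1 cut points between layers.
    for cuts in combinations(range(1, n), n_stages - 1):
        bounds = [0, *cuts, n]
        stage_ms = [sum(times[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_ms) < best_max:
            best_max, best_stages = max(stage_ms), stage_ms
    return best_max, best_stages

naive = [sum(layer_ms[i:i + 3]) for i in range(0, 12, 3)]  # [20, 30, 25, 15]
slowest, stages = best_partition(layer_ms, 4)
# The naive equal-layer-count split bottlenecks at 30 ms; the balanced
# split brings the slowest stage down, shrinking per-microbatch bubbles.
```

    Real frameworks balance on profiled stage times rather than layer counts for the same reason: the slowest stage sets the pipeline's clock.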

    05 Conclusion

    This lecture presented a practical toolkit for scaling training when models and datasets exceed the comfort zone of a single GPU. Data parallelism accelerates throughput by splitting data across identical model replicas, but the required all-reduce synchronization can bottleneck performance. Pipeline parallelism allows very large models to train by splitting layers across devices, yet pipeline bubbles reduce utilization as stages wait for inputs or gradients. Gradient accumulation and virtual batch size tie these strategies together by letting you emulate large-batch training within fixed memory, then synchronize less often to save network time. Interleaved pipeline stages further reduce idle time by alternating layers across devices, trading fewer bubbles for more frequent activation transfers.

    The most important relationships to remember are: B_v = B Ɨ A (effective batch equals per-device batch times accumulation steps), and pipeline parallelism is a specific type of model parallelism. Gradient accumulation can dramatically cut synchronization frequency in data parallelism and fill the pipe better in pipeline parallelism. Interleaving can improve utilization when your interconnect is fast enough to handle extra communication. Choosing the right combination depends on your hardware, network, memory, and convergence needs—so profile first, then tune.

    To practice, implement gradient accumulation on a single GPU and verify that scaling loss by 1/A yields the same updates as a larger true batch. Then add data parallelism and wrap accumulation with no_sync to reduce all-reduces. Next, try a toy pipeline split across two devices, measure bubbles, and add microbatches; finally, experiment with interleaving if your hardware supports fast links. Useful projects include building a small benchmark harness to sweep B, A, and microbatch counts and report utilization and step-time breakdowns.

    For next steps, explore more advanced scheduling strategies for pipelines, tensor parallelism (another model-parallel technique), and memory-saving methods like activation checkpointing and quantization (the next lecture’s topic). Also study learning-rate scaling rules for large batches and techniques such as gradient clipping and mixed precision to maintain stability. The core message is simple: distributed training performance is about balancing compute and communication while respecting memory constraints. By applying gradient accumulation, thoughtful synchronization, and careful pipeline scheduling, you can unlock substantial speedups without sacrificing correctness.

  • āœ“Scale learning rate with batch changes: When B_v increases, consider raising the learning rate moderately, but monitor loss for stability. Keep gradient clipping in place after accumulation to avoid spikes. Adjust warmup schedules if needed. Always validate on a held-out set.
  • āœ“Zero gradients at the right time: Call zero_grad() only at the start of an accumulation cycle, not after each microbatch. Otherwise, you’ll erase accumulated information and lose the benefits of GA. Double-check that gradients truly accumulate across microbatches. Confirm by inspecting .grad values.
  • āœ“Mind memory when accumulating in pipelines: Accumulating across many microbatches increases gradient/activation lifetimes. If nearing limits, reduce per-microbatch size or the number of microbatches. Consider shorter sequence lengths or smaller activations. Watch memory headroom during runs.
  • āœ“Keep per-device work equal in DP: Use the same per-device batch and accumulation steps everywhere. Any mismatch creates stragglers and wastes time. Recheck data loaders for even splits. If one device is slower hardware, adjust assignments or exclude it.
  • āœ“Normalize losses consistently: Either divide loss by A each microbatch or adjust the learning rate equivalently. Inconsistent scaling leads to different effective step sizes and unstable convergence. Document your chosen convention. Test equivalence with and without normalization to ensure correctness.
  • āœ“Clip after accumulation: Apply gradient clipping to the accumulated gradient right before optimizer.step(). This reflects the full-batch behavior you intend. Clipping per microbatch can distort the final gradient direction. Keep clipping thresholds consistent across runs.
  • āœ“Tune accumulation and microbatch counts empirically: Start with small A and microbatch counts, then increase until utilization stops improving or memory runs out. Record tokens/sec and step time components for each setting. Pick the best trade-off for your hardware. Avoid assuming linear gains.
  • āœ“Use simple schedules first: Begin with contiguous pipeline stages and moderate microbatching. Only adopt interleaving if profiling shows bubbles remain and your links are fast. Simple setups are easier to debug and often perform well. Add complexity only when it clearly helps.
  • āœ“Validate equivalence to large-batch training: Compare runs with true large batches (if they fit) to GA-based large batches. Learning curves and final metrics should align when scaling is correct. This builds confidence in your configuration. It also helps when documenting results.
  • Gradient Accumulation (GA)

    A trick to imitate large-batch training without loading all samples at once. You process several mini or microbatches, add their gradients together, and only then update the model. This saves memory because you never hold the entire big batch in memory. It costs extra time because you do more passes before each update. It’s great when memory is your limit.

    Virtual Batch Size (B_v)

    The effective batch size your optimizer experiences after accumulation: updates behave as if you had processed that many samples at once. You compute it by multiplying per-device batch size by accumulation steps (and by number of devices if doing data parallel). It helps you plan learning rates and stability. You don’t need the full B_v in memory at once.

    Batch Size (B)

    How many training examples a device processes at one time before computing a gradient. Bigger batches can give smoother gradients but need more memory. Smaller batches fit in memory but can be noisier. You can combine small B with accumulation to mimic a large batch. Choosing B is a trade-off between memory and stability.

    Accumulation Steps (A)

    How many mini or microbatches you process and add up before taking one optimizer step. More steps mean larger virtual batch size without extra memory per microbatch. But more steps also mean fewer updates per hour. You must scale loss or learning rate accordingly. It’s a knob to balance memory and speed.
