Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 8: Parallelism 2
Key Summary
- This session explains how to speed up and scale training when one GPU or a simple setup is not enough. It reviews data parallelism (split data across devices) and pipeline parallelism (split model across devices), then dives into practical fixes for their main bottlenecks. The key tools are gradient accumulation, virtual batch size, and interleaved pipeline stages. You’ll learn the trade‑offs between memory use, communication overhead, and idle time.
- Data parallelism lets each GPU process a different slice of the batch, compute gradients, and then combine them with an all‑reduce operation. The upside is easy scaling when you have lots of data; the downside is the cost and delay of synchronizing gradients across devices. This communication can become the training bottleneck as GPU counts grow. Managing when and how often you synchronize is crucial.
- Pipeline parallelism splits the model’s layers across devices so a single example flows through the devices like water through a pipe. The drawback is pipeline bubbles: some devices sit idle while the pipe fills and empties. You get high utilization only in the middle of a step. Scheduling tricks and microbatching are used to reduce idle time.
- Virtual batch size is the ‘effective’ batch size you train with, even if you can’t fit it in memory all at once. Gradient accumulation makes this possible by processing several mini‑batches in sequence, summing their gradients, and stepping the optimizer once at the end. You get the benefits of larger batches without needing more GPU memory per step. The trade‑off is extra time per update because you perform more forward/backward passes before stepping.
- Gradient accumulation pairs naturally with data parallelism. Each device accumulates gradients locally across several mini‑batches, and only then runs the costly all‑reduce to aggregate gradients across devices. This reduces the number of synchronization events and can increase throughput. All devices should use the same per‑device batch for balanced work.
Why This Lecture Matters
Training modern language models pushes the limits of memory, compute, and network bandwidth. Engineers, researchers, and ML practitioners need reliable ways to scale beyond a single device without wasting resources or hurting convergence. The techniques here—gradient accumulation, virtual batch size, interleaving, and their combinations with data and pipeline parallelism—directly target the most common bottlenecks: memory limits, idle compute, and communication overhead. This knowledge lets you fit larger models, sustain higher throughput, and reduce training cost per token.

In real projects, you’ll often find that simply adding GPUs doesn’t speed things up because synchronization becomes dominant. Gradient accumulation lets you cut the number of all-reduces; interleaving and microbatching keep pipelines full; and careful batch/accumulation choices maintain stable optimization. These strategies translate to faster experiments, better hardware utilization, and the ability to train models that would otherwise not fit. They also reduce operational risks like out-of-memory errors and straggler-induced slowdowns.

From a career perspective, being able to diagnose and fix distributed training bottlenecks is highly valued. Teams building large models must balance compute and communication, and those who can orchestrate data parallelism, pipeline scheduling, and accumulation stand out. In an industry where training budgets and timelines are tight, mastering these techniques can directly improve product delivery speed and research velocity. As models and datasets continue to grow, the importance of efficient parallel training only increases.
Lecture Summary
Overview
This lecture focuses on practical strategies to scale and speed up training of large language models when a single GPU or naïve setup becomes a bottleneck. It builds on two foundational ideas: data parallelism (splitting the data across devices) and pipeline parallelism (splitting the model across devices). While both approaches increase throughput and enable training larger models, each introduces its own bottleneck: data parallelism requires frequent network-wide gradient aggregation (an all-reduce), and pipeline parallelism suffers from pipeline bubbles, where some devices wait idle while others compute. The goal here is to introduce techniques that directly address these pain points without changing model quality.
The first big tool is gradient accumulation and the closely related idea of virtual batch size. Virtual batch size is the effective batch you want the optimizer to “see,” even if your GPU cannot hold that many samples at once. Gradient accumulation achieves this by processing several smaller mini-batches sequentially, summing their gradients, and then applying a single optimizer update. This lets you enjoy many benefits of large batches—more stable gradients, potentially faster convergence or better final quality—without needing the memory for all samples simultaneously. The price you pay is extra compute time per update, because you must do multiple forward/backward passes before stepping the weights.
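The accumulation loop described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the lecture's exact code; the model, sizes, and data are made up, and the key detail is scaling each loss by 1/A so the summed gradient matches the mean over the full virtual batch.

```python
import torch

# Minimal gradient-accumulation sketch (model, sizes, and data are illustrative).
# Virtual batch size: B_v = B * A, here 4 * 8 = 32.
B, A = 4, 8                                  # per-pass micro-batch size, accumulation steps
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(B * A, 16)
target = torch.randn(B * A, 1)

opt.zero_grad()
for step in range(A):
    xb = data[step * B:(step + 1) * B]
    yb = target[step * B:(step + 1) * B]
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    # Scale by 1/A so the accumulated gradient equals the mean over all B_v samples.
    (loss / A).backward()
opt.step()                                    # one optimizer update for the whole virtual batch
```

Note that memory holds only one micro-batch of activations at a time; the cost is A forward/backward passes per optimizer step.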
Next, the lecture revisits data parallelism in light of gradient accumulation: if all-reduce synchronization is slowing you down, you can accumulate gradients locally for a few mini-batches on each device and then synchronize less often. This approach reduces the number of all-reduces and can significantly speed training on clusters where network bandwidth is limited. Importantly, per-device batch sizes and accumulation steps should be the same across devices to keep the workload balanced and avoid stragglers.
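In PyTorch, skipping the intermediate all-reduces can be done with `DistributedDataParallel`'s `no_sync()` context. The sketch below is illustrative only: it runs as a single-process "gloo" group on CPU so it is self-contained, and the port, model, and data are arbitrary choices.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" group purely so the example is runnable on one CPU;
# in real training each rank would be a separate process/GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
A = 4                                         # accumulation steps between all-reduces
batches = [torch.randn(2, 8) for _ in range(A)]

opt.zero_grad()
for i, xb in enumerate(batches):
    loss = model(xb).pow(2).mean()
    if i < A - 1:
        with model.no_sync():                 # accumulate locally, skip the all-reduce
            (loss / A).backward()
    else:
        (loss / A).backward()                 # A-th micro-batch: gradients are all-reduced
opt.step()
dist.destroy_process_group()
```

With this pattern the cluster performs one all-reduce per optimizer step instead of one per micro-batch, raising the compute-to-communication ratio by roughly a factor of A.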
The lecture then returns to pipeline parallelism and its central challenge: pipeline bubbles that happen while a stage waits to receive activations or gradients. One technique to cut bubble time is interleaved pipeline stages. Instead of giving each device one contiguous block of layers (e.g., layers 1–3 on device 1, 4–6 on device 2), you interleave (e.g., device 1 runs layers 1, 5, 9; device 2 runs 2, 6, 10; etc.). This alternation reduces how long devices wait between tasks, improving average utilization. However, it increases communication because activations hop more frequently between devices.
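The two assignment schemes can be made concrete with a tiny mapping function. This is a sketch of the layer-to-device layout only (0-indexed, equal-sized stages assumed), not a full pipeline implementation.

```python
# Contiguous vs. interleaved layer-to-device assignment (illustrative, 0-indexed).
def contiguous(num_layers, num_devices):
    # Device d gets one contiguous block of num_layers // num_devices layers.
    per = num_layers // num_devices
    return {layer: layer // per for layer in range(num_layers)}

def interleaved(num_layers, num_devices):
    # Layer l goes to device l mod num_devices, so each device holds
    # several non-adjacent slices of the model.
    return {layer: layer % num_devices for layer in range(num_layers)}

print(contiguous(12, 4))    # layers 0-2 -> device 0, layers 3-5 -> device 1, ...
print(interleaved(12, 4))   # device 0 gets layers 0, 4, 8; device 1 gets 1, 5, 9; ...
```

Under the interleaved mapping, activations must hop to a new device after every layer rather than every block, which is why this scheme trades bubble time for extra communication.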
Finally, gradient accumulation is combined with pipeline parallelism. By splitting a large batch into multiple microbatches and streaming them through the pipe, you keep all stages busy while deferring the weight update until after several microbatches. This approach reduces idle time across the pipe but requires memory to store and combine gradients from multiple microbatches. As with all distributed strategies, the best setup depends on your model size, dataset, number and type of devices, network speed, and memory constraints.
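The benefit of more microbatches can be quantified with the standard bubble-fraction estimate for a GPipe-style schedule. Under the simplifying assumption that every stage takes one unit of time per microbatch, a pipe with p stages and m microbatches idles for a fraction (p − 1)/(m + p − 1) of the step:

```python
# Idle ("bubble") fraction of a simple GPipe-style schedule with p stages
# and m microbatches, assuming every stage takes one unit of time per microbatch.
def bubble_fraction(p, m):
    # The pipe spends p - 1 time steps filling and p - 1 draining,
    # out of a total schedule length proportional to m + p - 1.
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} microbatches: {bubble_fraction(4, m):.0%} idle")
```

The trend matches the lecture's point: raising m keeps stages busier, but each in-flight microbatch costs activation memory, so m is bounded by what fits on the devices.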
This lecture is aimed at learners who already understand basic deep learning, training loops (forward pass, loss, backward pass, optimizer step), and the concepts of gradients and batches. It is appropriate for intermediate students who have seen data and pipeline parallelism at a high level and want to master the practical knobs to squeeze more efficiency out of their hardware. You should be comfortable with the idea that distributed training has both compute and communication components, and that performance is often limited by whichever is slower.
By the end, you will be able to: define and use virtual batch size; implement gradient accumulation to mimic large-batch training within small memory; combine accumulation with data parallelism to reduce all-reduce frequency; explain pipeline bubbles and apply interleaving to reduce idle time; and orchestrate gradient accumulation with pipeline parallelism to keep stages busy while controlling memory. You’ll also be able to choose among these tools based on constraints, and articulate the relationship between model parallelism and pipeline parallelism (pipeline is a specific kind of model parallelism).
The lecture is structured as follows. It starts with a quick review of data parallelism and pipeline parallelism and their respective bottlenecks: all-reduce overhead and pipeline bubbles. It then introduces virtual batch size and gradient accumulation, including the key identity B_v = B × A (the effective batch equals the per-device micro-batch size times the number of accumulation steps). After that, it shows how accumulation reduces synchronization cost in data parallelism and how interleaving reduces idle time in pipeline parallelism. It closes by combining accumulation with pipeline parallelism to keep the pipe fuller, and ends with practical guidance on choosing the right mix of techniques for your setup and a clarification that pipeline parallelism is a form of model parallelism.
Key Takeaways
- ✓Profile before tuning: Measure forward time, backward time, and synchronization time to identify the real bottleneck. If sync dominates, accumulation can help; if idle time in pipelines dominates, increase microbatches or interleave. Avoid changing many knobs at once so you can see the impact of each fix. Keep a baseline run for comparison.
- ✓Use virtual batch size to plan stability: Decide the effective batch your optimizer needs, then reach it via B_v = B × A. Keep per-microbatch B small to fit memory and raise A to hit the target. Normalize loss by A or adjust learning rate to maintain consistent update magnitude. Validate that learning curves match a true large-batch run.
- ✓Reduce all-reduce frequency with GA in DP: Wrap the first A−1 microbatches in a no-sync context to avoid premature gradient sharing. Synchronize only on the A-th microbatch, then step. This boosts compute-to-communication ratio and improves scaling on bandwidth-limited clusters. Ensure equal B and A on all devices to prevent stragglers.
- ✓Balance pipeline stages: Partition layers so each stage has similar compute time to minimize bubbles. If one stage is heavy, consider moving layers or interleaving. Re-measure after changes to confirm utilization gains. Uneven stages cause persistent idle time.
- ✓Use microbatching to fill the pipeline: Split batches into multiple microbatches so early stages can start new work while later stages finish old work. Accumulate gradients across microbatches and update once. Find the microbatch count that fits memory while keeping utilization high. Too few microbatches leave stages idle; too many can cause OOM.
- ✓Try interleaving only on fast interconnects: Interleaving reduces bubble time but increases activation transfers. On NVLink/InfiniBand, it often helps; on PCIe-only systems, it may hurt. Benchmark both contiguous and interleaved assignments. Choose the one with better tokens/sec and stable training.
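The validation step suggested above (checking that accumulation reproduces a true large-batch run) can be done directly on gradients before ever comparing learning curves. This sketch, with an illustrative linear model and random data, confirms that A microbatches with loss scaled by 1/A yield the same gradient as one pass over the full virtual batch:

```python
import torch

# Verify that A accumulated microbatches (loss scaled by 1/A) reproduce the
# gradient of one full virtual batch B_v = B * A. Model and data are illustrative.
torch.manual_seed(0)
B, A = 4, 8
model = torch.nn.Linear(16, 1)
x = torch.randn(B * A, 16)
y = torch.randn(B * A, 1)

# Gradient from a single pass over the full virtual batch.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
g_full = model.weight.grad.clone()

# Gradient accumulated over A microbatches.
model.zero_grad()
for s in range(A):
    sl = slice(s * B, (s + 1) * B)
    (torch.nn.functional.mse_loss(model(x[sl]), y[sl]) / A).backward()
g_accum = model.weight.grad.clone()

print(torch.allclose(g_full, g_accum, atol=1e-6))  # True: the two gradients match
```

If this check fails in your own code, the usual culprit is a missing 1/A scaling or a loss that is summed rather than averaged per microbatch.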
Glossary
Data Parallelism (DP)
A way to speed up training by making copies of the model on many devices and giving each device a different chunk of data. Each device computes gradients on its chunk. Then all devices combine their gradients so they update the model the same way. It’s useful when you have lots of data to process. It’s limited by how fast devices can share gradients.
Pipeline Parallelism (PP)
A way to split a big model across devices by putting different layers on different devices. An input flows through the layers like an item on an assembly line. Each device works on its layers, then passes results forward. This allows training models that don’t fit on one GPU. It can suffer from idle time while the pipeline starts and stops.
Model Parallelism
Any method that splits parts of a model across multiple devices. Pipeline parallelism is one specific type where layers are arranged in sequence. Other styles can split within layers too. It’s used when a single device can’t hold the whole model. It helps with memory limits but adds communication needs.
All-Reduce
A network operation where all devices share and combine data (like summing gradients) and each device gets the final result. It’s how models in data parallelism keep weights in sync. The speed depends on network bandwidth and latency. If it’s slow, training slows down even if GPUs are fast. Reducing how often it happens can help.
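A toy simulation makes the semantics concrete: every device contributes a gradient vector, the vectors are combined elementwise (here by summing), and every device receives the identical result. This is only a sketch of the semantics, not of any real communication algorithm such as ring all-reduce.

```python
# Toy all-reduce: each "device" contributes a gradient vector and all of
# them receive the elementwise sum (simulated with plain Python lists).
def all_reduce(grads_per_device):
    total = [sum(vals) for vals in zip(*grads_per_device)]
    return [list(total) for _ in grads_per_device]   # every device gets the same result

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 devices, 2-element gradients
print(all_reduce(grads))                              # three copies of [9.0, 12.0]
```

In practice the combine-and-distribute happens over the network (e.g. via NCCL on GPUs), which is why its cost scales with gradient size and link bandwidth rather than with compute speed.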
