• This lesson teaches two big ways to train neural networks on many GPUs: data parallelism and model parallelism. Data parallelism copies the whole model to every GPU and splits the dataset into equal shards, then averages gradients to take one update step. Model parallelism splits the model itself across GPUs and passes activations forward and gradients backward between them.
• In data parallelism, each GPU computes gradients on its own data shard, then gradients are combined by either a parameter server or an all-reduce operation. A parameter server is a central hub that gathers gradients, updates the weights, and broadcasts new weights back. All-reduce has GPUs talk directly to each other so that every GPU ends up with the same summed (or averaged) gradient.
• Parameter servers are simple and flexible but can become a bottleneck because all traffic flows through them. You can add multiple parameter servers and shard parameters to ease the load, but that makes orchestration more complex. All-reduce avoids a single bottleneck and scales better with more GPUs.
• All-reduce can be implemented in different topologies, mainly tree-based or ring-based. Tree all-reduce reduces values up a tree and broadcasts down, often fast for a small number of GPUs. Ring all-reduce passes chunks around a loop, often better for very large GPU counts and robust to link failures.
• Choosing between tree and ring all-reduce depends on GPU count and network layout (topology). With few GPUs, tree often wins; with many GPUs, ring often wins. Good libraries can auto-select the best algorithm for your hardware.
• Model parallelism is used when a single model doesn't fit on one GPU. The model is split into parts and placed on different GPUs, with activations streaming forward across GPUs and gradients streaming backward. Two main flavors are pipeline parallelism and tensor parallelism.
Why This Lecture Matters
Large language models need huge amounts of compute and memory, and single-GPU training is often too slow or outright impossible. This lecture equips practitioners (ML engineers, researchers, and infrastructure teams) with the essential strategies to scale training across many GPUs efficiently. Data parallelism speeds up training when models fit on a GPU, while model parallelism makes training feasible when models do not fit. Knowing when to apply parameter servers versus all-reduce, and how to choose between tree-based and ring-based collectives, directly affects time-to-train and cloud costs. Understanding pipeline and tensor parallelism unlocks the ability to train state-of-the-art models by spreading both layers and layer computations across devices.
In real projects, these methods solve problems like central bottlenecks, underutilized GPUs, and memory limits. They help teams reach target throughput, handle hardware failures more gracefully, and leverage existing network topologies for optimal performance. Mastery of these techniques translates into building reliable training pipelines, reducing iteration cycles, and making better use of expensive GPU clusters. For careers, this knowledge is core to modern LLM development and highly valued in roles focused on model scaling and distributed systems. As models and datasets continue to grow, the importance of sound parallelism strategies only increases, making these concepts foundational in today's AI industry.
Lecture Summary
01 Overview
This lecture focuses on how to train large language models efficiently across multiple GPUs using parallelism. It explains the two main families of approaches you need to know: data parallelism and model parallelism. Data parallelism is about splitting the dataset into equal pieces (shards) and replicating the same model on every GPU, then combining gradients to perform one update step. Model parallelism is about splitting the model itself across GPUs so that different parts of the model live on different devices, allowing you to train models that don't fit on a single GPU. Within these families, the lecture dives into practical patterns and communication strategies that determine whether training scales smoothly or stalls.
The intended audience is learners who already understand basic supervised learning and gradient-based training, including concepts like datasets, loss functions, forward/backward passes, and stochastic gradient descent. You should be comfortable with the idea of gradients and model parameters and have some familiarity with GPU-based training. You do not need advanced systems knowledge to follow along; the lecture teaches core communication patterns (like all-reduce) from first principles.
By the end, you will be able to describe and compare parameter servers and all-reduce for gradient aggregation; understand how tree-based and ring-based all-reduce work; and when to choose one over the other based on the number of GPUs and network topology. You will also be able to explain model parallelism and its two main strategies (pipeline parallelism and tensor parallelism) and apply row-parallel and column-parallel splits within tensor parallelism. You will recognize the tradeoffs in communication vs. computation for each approach and understand the practical challenges such as pipeline bubbles, bottlenecks, and partitioning the model across GPUs.
The lecture is structured in two major parts. First, it covers data parallelism: how the dataset is divided, how identical model replicas compute local gradients, and how gradients are combined with either a parameter server or an all-reduce. It then compares two all-reduce algorithms (tree and ring) and discusses when each is preferable. Second, it moves to model parallelism: how and why to split models, the flow of activations forward and gradients backward across devices, and the two key strategies: pipeline parallelism and tensor parallelism. For tensor parallelism, it distinguishes row parallelism (split weight matrix rows) and column parallelism (split columns), describing the communication needs in forward and backward passes. Finally, it highlights that practical systems often combine these methods (e.g., data + tensor + pipeline) and that finding the best partition is a hard graph-partitioning problem often handled by specialized libraries.
Key Takeaways
• Start with data parallelism if your model fits on one GPU. Split each batch evenly across GPUs and aggregate gradients using an efficient all-reduce. Ensure each replica applies exactly the same update to stay synchronized. Watch for imbalanced shards that create stragglers.
• Use all-reduce instead of a parameter server when scaling to many GPUs. All-reduce removes the central bottleneck and often improves throughput and scaling. Let your communication library pick ring or tree automatically based on topology. Validate by profiling step times before and after the change.
• If a parameter server is required, shard parameters across multiple servers. Routing gradient slices to the right server lowers bottlenecks. Monitor server CPU/GPU and network usage to detect saturation. Consider redundancy to avoid single points of failure.
• Choose tree all-reduce for small GPU counts and ring all-reduce for large counts, as a rule of thumb. Tree reduces latency with fewer rounds; ring leverages bandwidth well at scale. Always confirm with benchmarks on your hardware. Topology-aware libraries can outperform hand-picked settings.
• Adopt model parallelism when the model no longer fits in GPU memory. Split the model across devices so activations flow forward and gradients flow backward. Combine it with data parallelism for both memory fit and throughput. Profile activation sizes to plan inter-GPU transfers.
• Apply pipeline parallelism to group layers into balanced stages. Use microbatches to fill the pipeline and shrink bubbles. Rebalance stages if one becomes a bottleneck. Aim for steady-state overlap of forward and backward to maximize utilization.
Glossary
Data Parallelism
A way to train by copying the whole model to many GPUs and splitting the data across them. Each GPU computes gradients on its own data slice. The gradients are then combined (summed or averaged) so every model copy updates the same way. It's like a team reading different pages of a book and agreeing on one summary. This makes training faster when the model fits on one GPU.
Model Parallelism
A way to train models that don't fit on a single GPU by splitting the model across multiple GPUs. Different parts of the model live on different devices. Activations move forward from one GPU to the next; gradients move backward. It's like an assembly line of stations. This lets you train much bigger models than one GPU can hold.
Parameter Server
A central server (or servers) that stores and updates model parameters during training. Workers send gradients to it; it updates the weights and broadcasts the new version. It's simple and can continue even if some workers fail. But it can become a bottleneck because all traffic flows through it.
All-Reduce
A communication operation where every GPU contributes a value and every GPU receives the combined result, like a sum. It avoids a central server by having GPUs talk directly. It keeps parameters synchronized by ensuring everyone sees the same summed gradients. It scales well as you add more GPUs.
• Pipeline parallelism groups consecutive layers into stages on different GPUs and runs microbatches through the stages like cars on an assembly line. The challenge is keeping all stages busy to maximize throughput and avoid "bubbles" where a stage waits idle. When tuned well, it enables training models much larger than a single GPU can hold.
• Tensor parallelism splits heavy layers across GPUs so each GPU computes a slice of the same layer. Two core patterns are row parallelism (split rows of the weight matrix) and column parallelism (split columns). Row parallelism needs no forward-pass communication but needs backward communication, while column parallelism flips that tradeoff.
• In row parallelism, each GPU stores different rows of a weight matrix and multiplies with the full input, then concatenates outputs. This often requires replicating the input on all GPUs. The backward pass needs a communication step to combine gradients for the input.
• In column parallelism, each GPU stores different columns of the weight matrix and multiplies with a slice of the input, then sums partial results. This reduces input replication but needs a forward-pass sum across GPUs. The backward pass can be computed locally.
• Putting it all together, real systems combine data parallelism with pipeline and tensor parallelism to reach high efficiency. The best mix balances memory, compute, and network costs. Libraries and planners can automatically partition layers to minimize communication.
• Partitioning a model across GPUs is a graph partitioning problem: split nodes (layers) and edges (data flows) to lower cross-GPU traffic and balance load. This is difficult and often handled by tooling. In practice, automated strategies pick good splits given layer sizes and interconnect speeds.
• Overall, data parallelism improves throughput when memory per GPU is sufficient for the model, while model parallelism is necessary when the model itself is too large to fit. All-reduce is the standard gradient-combining method for data parallel, with strong scaling properties. Pipeline and tensor parallelism make giant models trainable by spreading layers and layer math across devices.
• Robustness and fault tolerance differ by strategy: parameter servers can continue if a few workers fail, but can be a single point of failure; all-reduce removes that single point yet may stall on worker failure unless handled. Ring all-reduce can route around a broken link with performance loss. Careful monitoring and retry logic help keep training stable.
• Efficiency requires keeping GPUs fed with data (overlapping compute and communication) and avoiding long idle gaps. Choosing shard sizes, microbatch counts, and all-reduce topology affects utilization. Good defaults plus profiling lead to strong speedups.
• The math stays simple: the true gradient is the sum (or mean) of per-shard gradients, and all methods work to compute that same result efficiently. Averaging is just summing then dividing by the number of GPUs. Synchronization ensures every replica applies the same update so parameters stay identical.
02 Key Concepts
01
Data Parallelism (Definition): Data parallelism is a training strategy that copies the entire model onto many GPUs and splits the dataset into equal parts. Each GPU computes gradients on its data shard, and then all gradients are combined into one update. It works like a team reading a big book by splitting the pages among them, then meeting to agree on the summary. Technically, if each GPU computes ∂L(D_i)/∂θ, those are summed (and usually averaged) to estimate the full gradient ∂L(D)/∂θ. Without it, training on huge datasets would be slow and limited to one GPU. For example, with 8 GPUs, each processes one-eighth of the batch, then gradients are averaged to update all replicas identically.
02
Gradient Aggregation (Definition): Gradient aggregation is the step where the per-GPU gradients are combined into one global gradient. Think of it as collecting everyone's notes and merging them into one final document. Technically, gradients from shards D1...DN are summed and optionally divided by N to average. This ensures every replica applies the exact same update and stays in sync. Without it, replicas would drift apart and no single, consistent model would exist. For instance, 4 GPUs each compute g1..g4, and an all-reduce computes g = (g1+g2+g3+g4)/4.
03
Parameter Server (Definition): A parameter server is a central node (or nodes) that stores model parameters, receives gradients from workers, updates, and then broadcasts new parameters. It's like a librarian who collects all edits to a shared book and then publishes the new edition. Technically, workers send gradients to the server, the server computes θ ← θ − η·(sum of gradients), then broadcasts θ. It matters because it's simple and robust to worker failures, but the server can become a bottleneck. For example, 16 workers send their gradients to one server, which updates θ and returns it to all.
04
Parameter Server Bottleneck (Definition): The bottleneck is when the central server can't keep up with incoming gradients and outgoing parameter updates. It's like a single checkout line in a busy store causing a long queue. Technically, all network traffic converges on the server; its bandwidth and CPU/GPU capacity limit throughput. This matters because it caps scaling and wastes GPU time. For example, adding more workers beyond 8 yields little speedup because the server saturates on communication.
05
Sharded Parameter Servers (Definition): Sharding parameters means spreading different slices of the model's weights across multiple parameter servers. It's like splitting the job of a librarian among several librarians, each handling different chapters. Technically, gradients for weight slice A go to server A, for slice B go to server B, etc.; each server updates its slice and returns it. This reduces bottlenecks but adds complexity in routing and bookkeeping. For instance, an embedding table might be split so each server stores a different range of rows.
06
All-Reduce (Definition): All-reduce is a collective operation where every GPU contributes a value (like gradients) and every GPU receives the global reduction (e.g., the sum). It's like a roundtable where everyone shares numbers, and everyone leaves with the same total. Technically, it computes y[i] = sum over all ranks' x[i], distributing the result to all ranks. This matters because it removes the single central bottleneck and scales well with more GPUs. For example, each GPU holds a gradient vector; after all-reduce, all GPUs hold the summed vector and can average locally.
07
Tree All-Reduce (Definition): Tree all-reduce arranges GPUs in a tree: children send values to parents, parents sum and pass up, then the root broadcasts down. It's like passing buckets of water up a human ladder and then pouring the full bucket back down. Technically, reduction takes O(log N) steps up the tree and O(log N) back down. It matters for small to medium GPU counts where latency is key. For example, with 8 GPUs, values reduce in 3 uplink steps and broadcast back down in 3 steps.
08
Ring All-Reduce (Definition): Ring all-reduce arranges GPUs in a loop, passing chunks around so each GPU contributes and receives parts until everyone has the sum. It's like passing a message around a circle, adding your piece before passing it on. Technically, it takes N-1 stages to circulate contributions, and can be bandwidth-efficient at large scale. It matters because it scales well with many GPUs and tolerates link issues (with degraded speed). For instance, with 32 GPUs, the ring completes reduction with a steady stream of chunk transfers.
09
Choosing All-Reduce Algorithm (Definition): The algorithm choice depends on GPU count and network topology (how devices are wired). It's like choosing roads: city streets (tree) are fast for a few stops; highways (ring) are efficient for long trips. Technically, tree often wins at low N due to fewer steps; ring often wins at high N due to sustained bandwidth use. It matters for practical speedups and cost. For example, libraries may auto-pick tree for 4 GPUs and ring for 64 GPUs.
10
Model Parallelism (Definition): Model parallelism splits the model itself across GPUs because it's too big for one device. It's like dividing a long assembly line into stations, each doing part of the job. Technically, different layers (or layer pieces) live on different GPUs; forward activations flow across GPUs, and backward gradients flow in reverse. It matters because without it, giant models can't be trained at all. For example, layers 1-12 on GPU0, 13-24 on GPU1, 25-36 on GPU2.
11
Pipeline Parallelism (Definition): Pipeline parallelism groups consecutive layers into stages placed on different GPUs and streams microbatches through them. It's like an assembly line where while one car is at station 2, the next car starts at station 1. Technically, microbatches move forward stage by stage; backward moves in reverse order. It matters because it can keep all GPUs busy and enable very large models. For example, with 3 stages and 8 microbatches, all stages work in parallel after the pipeline fills.
12
Pipeline Bubbles (Definition): Bubbles are idle periods in the pipeline where some stages aren't doing work. It's like empty spots between cars on an assembly line. Technically, they occur during the warm-up and cool-down phases or when one stage is slower than others. They reduce utilization and overall throughput. For example, if stage 2 is twice as slow, stages 1 and 3 spend time waiting.
13
Tensor Parallelism (Definition): Tensor parallelism splits individual layers (like big matrix multiplications) across GPUs so each computes a slice. It's like several people dividing up a giant spreadsheet calculation. Technically, weights and inputs are partitioned so outputs can be concatenated or summed to form the full result. It matters for layers too big or heavy for one GPU. For example, split a 16384×16384 weight into halves so each GPU multiplies part of it.
14
Row Parallelism (Definition): Row parallelism splits the rows of a weight matrix across GPUs so each GPU computes its portion of the output rows. It's like dividing a tall stack of rows among teammates, each computing their rows. Technically, each GPU multiplies its row-slice Wi by the full input X, then outputs are concatenated. It matters because it avoids forward-pass communication but requires backward communication for input gradients. For example, two GPUs compute [W1; W2]X and concat the results.
15
Column Parallelism (Definition): Column parallelism splits the columns of a weight matrix across GPUs and typically splits the input to match. It's like dividing a wide spreadsheet by columns among teammates. Technically, each GPU multiplies its column-slice Wi with its input slice Xi; partial outputs are summed. It matters because it avoids backward-pass communication but requires forward summation. For example, two GPUs compute W1X1 and W2X2 and sum to get WX.
16
Forward and Backward Flows (Definition): Forward flow means activations move from earlier layers to later layers; backward flow means gradients move in reverse during backpropagation. It's like sending a package forward across stations and returning a signed receipt backward. Technically, model parallelism routes tensors between GPUs in both directions. It matters because these transfers define communication cost and scheduling. For example, GPU0 sends its activation to GPU1 in forward, and GPU1 sends gradients back to GPU0 in backward.
17
Network Topology (Definition): Network topology describes how GPUs and servers are physically and logically connected. It's like the road map between cities: some routes are direct, others are indirect. Technically, bandwidth and latency between pairs of GPUs can differ, influencing all-reduce performance. It matters because the best algorithm depends on this map. For example, on a single node with NVLink, ring might be great; across nodes with a fat-tree fabric, a tree-based algorithm may win.
18
Scalability and Fault Tolerance (Definition): Scalability is how well performance improves as you add more GPUs; fault tolerance is how well the system continues if something breaks. It's like adding more workers to a job and still finishing even if one leaves. Technically, parameter servers can continue with fewer workers but may be a single point of failure; all-reduce has no central point but can stall if a rank fails. It matters for long, expensive training runs. For example, a failed network link in a ring can be routed around but runs slower.
19
Graph Partitioning for Model Placement (Definition): Assigning layers to GPUs is a graph partitioning problem: split the computation graph to minimize cross-GPU communication and balance load. It's like cutting a network of roads into regions so most driving stays within each region. Technically, edges represent tensor flows; cuts represent communications. It matters because poor splits cause bottlenecks. For example, putting two heavy-attention blocks on the same GPU and lighter parts elsewhere can balance compute and reduce transfers.
20
Combining Parallelisms (Definition): Real systems mix data parallelism, pipeline parallelism, and tensor parallelism to reach better speed and fit. It's like using multiple teamwork strategies at once: divide the book pages, divide the chapters, and divide the calculations. Technically, data-parallel replicas can each be internally model-parallel. It matters because no single method dominates all constraints. For example, each pipeline stage uses tensor parallelism inside, and many such pipelines run in data parallel.
03 Technical Details
Overall Architecture and Data Flow
Baseline supervised training
We begin with a dataset D made of pairs (X, Y). A model with parameters θ maps X to ŷ and uses a loss L(ŷ, Y). Stochastic gradient descent (SGD) or a variant computes gradients ∂L/∂θ, then updates θ.
On one GPU, a training step is: forward pass on a batch → compute loss → backward pass for gradients → update parameters.
Data parallelism: replicating models, sharding data
Concept: Make N identical copies of the model on N GPUs. Split each training batch into N shards (equal size ideally) and send one shard to each GPU. Every replica computes gradients on its shard. Then combine all gradients to get a global gradient (sum or average), and update parameters identically on every GPU.
Why it works: Gradients are additive. If L(D) = Σ_i L(D_i), then ∂L/∂θ = Σ_i ∂L(D_i)/∂θ. Averaging is just dividing by N after summation. Thus, combining per-shard gradients exactly reproduces the gradient of the full batch, assuming identical model states.
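This additivity can be checked in a few lines of plain Python; the toy 1-D least-squares model below is purely illustrative:

```python
# Toy check of gradient additivity: for L(D) = sum over (x, y) of (w*x - y)^2,
# dL/dw = sum of 2*x*(w*x - y). Summing per-shard gradients reproduces the
# full-batch gradient exactly; averaging just divides by the shard count.
def grad(w, shard):
    return sum(2 * x * (w * x - y) for x, y in shard)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

# Split into 2 equal shards, as 2 data-parallel "GPUs" would.
shards = [data[:2], data[2:]]
per_shard = [grad(w, s) for s in shards]

full = grad(w, data)
combined = sum(per_shard)
assert abs(full - combined) < 1e-12
```

This equality holds only when every replica starts the step with identical parameters, which is exactly what the synchronized update guarantees.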
Gradient aggregation options
Parameter server: One (or several) central nodes store and update θ. Workers (GPUs) send gradients, the server updates θ, and broadcasts the new θ. Advantage: simplicity, robustness to some worker failures, flexible infrastructure. Disadvantage: central bottleneck; risk of single point of failure unless sharded or replicated.
All-reduce: No central node. Each GPU contributes its gradient vector and receives the global sum (or average) through a collective operation. Advantage: no single bottleneck; scales well with many GPUs; often higher throughput. Disadvantage: requires all participants to coordinate; a failing participant can stall the collective without special handling.
All-reduce algorithms
Tree-based all-reduce: GPUs logically form a tree. In the reduce phase, leaves send values up to parents, parents add and pass up until the root gets the total. Then a broadcast phase sends the total back down so everyone has the result. Time complexity involves O(log N) steps for up and O(log N) for down. Communication fan-in/fan-out can align well with certain fabrics (e.g., fat-tree networks), and it's often faster at small N.
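A minimal single-process simulation of the reduce-then-broadcast pattern (assuming a power-of-two rank count; real libraries handle arbitrary counts and run the rounds in parallel):

```python
# Simulated tree all-reduce over N ranks (N a power of two): pairwise
# reduction halves the number of active senders each round (log2 N rounds),
# then the root's total is broadcast back down (another log2 N rounds).
def tree_all_reduce(values):
    n = len(values)
    vals = list(values)
    # Reduce phase: at stride s, rank i accumulates the value from rank i+s.
    s = 1
    while s < n:
        for i in range(0, n, 2 * s):
            vals[i] += vals[i + s]
        s *= 2
    # Broadcast phase: the root (rank 0) sends the total back to everyone.
    total = vals[0]
    return [total] * n

result = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
# Every rank ends with the global sum, 36, after 3 + 3 rounds.
assert result == [36] * 8
```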
Ring-based all-reduce: GPUs form a logical ring. Each GPU splits its gradient into N chunks. In the reduce-scatter phase, chunks circulate around the ring for N-1 steps; each step, a GPU adds incoming chunks to its local partial sums and forwards them. In the all-gather phase, the reduced chunks circulate again so all GPUs receive every chunk. With proper chunking and overlap, ring uses link bandwidth efficiently and often excels at large N.
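The two phases can be sketched as a single-process simulation (illustrative only; real implementations overlap these transfers across links):

```python
# Simulated ring all-reduce: each of the N ranks holds a vector split into
# N chunks. Reduce-scatter circulates chunks for N-1 steps so each rank ends
# up owning one fully reduced chunk; all-gather circulates the reduced chunks
# for another N-1 steps so every rank holds the complete sum.
def ring_all_reduce(vectors):
    n = len(vectors)                      # number of ranks == chunks per rank
    chunks = [list(v) for v in vectors]   # chunks[rank][chunk_index]
    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r+1,
    # which adds it to its own partial sum for that chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] += chunks[r][c]
    # Now rank r owns the fully reduced chunk (r + 1) mod n.
    # All-gather: circulate the reduced chunks so every rank gets every sum.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks

result = ring_all_reduce([[1, 2], [3, 4]])
assert result == [[4, 6], [4, 6]]
```

Because each step moves chunks of size n/N rather than the whole vector, the per-step payload shrinks as ranks are added, which is why the ring stays bandwidth-efficient at scale.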
Choice: For few GPUs or high-latency links, tree can win due to fewer synchronization stages. For many GPUs with good bandwidth, ring often wins by fully utilizing links. Many libraries analyze topology and pick automatically.
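A back-of-envelope way to compare the two is the standard alpha-beta (latency-bandwidth) cost model. The constants below are assumptions for illustration, not measurements, and the crossover depends on message size as well as GPU count:

```python
from math import log2

def tree_cost(N, n, alpha, beta):
    # ~2*log2(N) rounds (reduce up, broadcast down), full message each round.
    return 2 * log2(N) * (alpha + beta * n)

def ring_cost(N, n, alpha, beta):
    # 2*(N-1) rounds (reduce-scatter + all-gather) of n/N-sized chunks.
    return 2 * (N - 1) * (alpha + beta * n / N)

alpha, beta = 1e-5, 1e-9          # 10 us latency, 1 GB/s (assumed values)
small_msg, big_msg = 1e4, 1e8     # 10 KB vs 100 MB of gradients

# Latency-dominated small messages favor the tree's fewer rounds...
assert tree_cost(64, small_msg, alpha, beta) < ring_cost(64, small_msg, alpha, beta)
# ...while bandwidth-dominated large messages favor the ring's small chunks.
assert ring_cost(64, big_msg, alpha, beta) < tree_cost(64, big_msg, alpha, beta)
```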
Multi-parameter servers (sharding)
When a parameter server becomes a bottleneck, split weights across multiple servers. Workers must route gradient segments to the correct server responsible for that parameter slice. Each server updates its slice and returns it. This introduces routing and consistency management overhead but allows scaling beyond a single server's capacity.
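A minimal sketch of the routing idea (all names hypothetical; a real system would run each server as a separate process and route gradients over the network):

```python
# Sketch of sharded parameter servers: a flat parameter vector is split into
# contiguous slices, one per server; each server applies SGD to its slice
# when the matching gradient segment is routed to it.
def make_servers(params, num_servers, lr):
    size = len(params)
    bounds = [size * i // num_servers for i in range(num_servers + 1)]
    shards = [params[bounds[i]:bounds[i + 1]] for i in range(num_servers)]

    def push_and_pull(grads):
        # Route each gradient segment to the server owning that slice,
        # update there, then gather the new full parameter vector.
        for i, shard in enumerate(shards):
            g = grads[bounds[i]:bounds[i + 1]]
            for j in range(len(shard)):
                shard[j] -= lr * g[j]
        return [w for shard in shards for w in shard]

    return push_and_pull

step = make_servers([1.0, 2.0, 3.0, 4.0], num_servers=2, lr=0.1)
new_params = step([10.0, 10.0, 10.0, 10.0])
# Each server updated its own slice: w - 0.1 * 10 = w - 1.0
assert new_params == [0.0, 1.0, 2.0, 3.0]
```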
Model Parallelism: Splitting the Model
Why model parallelism?
When a model's parameters, activations, and optimizer states exceed a single GPU's memory, replication is impossible. Model parallelism divides the model across multiple GPUs. The forward pass sends activations from one partition to the next; backpropagation sends gradients in reverse.
Pipeline parallelism
Structure: Divide the network into S sequential stages, each a block of consecutive layers. Place each stage on a different GPU. Instead of pushing one big batch through stage-by-stage, split it into M microbatches. Start microbatch 1 at stage 1; when it moves to stage 2, stage 1 starts microbatch 2, and so on. After warm-up, all stages work concurrently.
Forward/backward scheduling: A common schedule is 1F1B (one forward, one backward per microbatch in steady state) to overlap directions and reduce memory. During backward, gradients flow from the last stage back to the first stage across the same inter-stage links.
Bubbles and balance: If one stage has more compute than others, it becomes the bottleneck. Balancing layers across stages reduces idle time. Increasing microbatch count M helps fill the pipeline and shrink bubbles, but too many microbatches may add overhead or memory pressure.
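Under the simplifying assumption of equal per-microbatch stage times, the bubble share of a GPipe-style forward sweep can be computed directly:

```python
# With S stages and M microbatches (equal stage times assumed), a forward
# sweep occupies S + M - 1 time steps per stage, of which S - 1 are
# warm-up/drain bubbles where some stage sits idle.
def bubble_fraction(num_stages, num_microbatches):
    total = num_stages + num_microbatches - 1
    return (num_stages - 1) / total

# Raising the microbatch count shrinks the idle share, as the text notes.
assert bubble_fraction(4, 32) < bubble_fraction(4, 4)
```

For example, 4 stages with 4 microbatches waste 3/7 of the steps, while the same pipeline with 32 microbatches wastes only 3/35.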
Tensor parallelism
Goal: Split a heavy layerâs compute across GPUs so no single GPU does the entire matrix multiplication or attention block. Two main patterns are row parallel and column parallel for linear layers.
Row parallel: Split the weight by rows: W = [W1; W2; …]. Each GPU stores Wi and multiplies it by the full input X to produce a slice of the output Oi = WiX. Outputs are concatenated to form O = concat(Oi). Communication: forward concatenation is inexpensive (metadata or layout), but backward requires combining gradients to compute dX = Σ_i Wi^T dOi across GPUs (a communication step). Input X must be accessible to each GPU (replicated or broadcast).
Column parallel: Split the weight by columns: W = [W1 W2 …]. Split inputs accordingly: X = [X1; X2; …]. Each GPU computes Pi = Wi Xi. Outputs are summed across GPUs: O = Σ_i Pi. Communication: forward requires a sum/all-reduce-like operation to combine partial outputs; backward allows each GPU to compute its local gradients without extra global steps for dX if X was split to match W.
Tradeoffs: Row parallel avoids forward communication but needs backward communication and input replication; column parallel needs forward summation but can localize backward computation and avoid replicating the whole input.
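Both splits can be verified against the full product on toy matrices (plain Python, two simulated GPUs):

```python
# Check that row-parallel and column-parallel splits both reproduce W @ X.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4x2 weight
X = [[1, 0, 2], [0, 1, 3]]             # 2x3 input (full input)

# Row parallel: GPU0 holds rows 0-1 of W, GPU1 holds rows 2-3; each
# multiplies the FULL input, and outputs are concatenated along rows.
O_row = matmul(W[:2], X) + matmul(W[2:], X)

# Column parallel: GPU0 holds column 0 of W and row 0 of X, GPU1 the rest;
# the partial products are summed elementwise (the forward communication).
P0 = matmul([[row[0]] for row in W], [X[0]])
P1 = matmul([[row[1]] for row in W], [X[1]])
O_col = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(P0, P1)]

assert O_row == matmul(W, X) == O_col
```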
Combining model and data parallelism
Practical large-scale training stacks combine them. For example, each data-parallel replica contains a pipeline of stages, and each stage uses tensor parallelism to split heavy layers. Gradient aggregation within each data-parallel group uses all-reduce; activations/gradients move between pipeline stages; tensor-parallel ranks exchange partial layer results.
Communication patterns and synchronization
Synchronization points occur at: (a) gradient all-reduce in data parallel, (b) inter-stage sends/receives in pipelines, and (c) partial result exchanges in tensor parallel layers. Overlapping compute with communication (e.g., starting an all-reduce for earlier layers while computing later layers' backward) improves utilization. Communication libraries exploit multiple streams and priorities to hide latency behind computation.
Practical scheduling considerations
Microbatching: Choose M (microbatches) large enough to keep the pipeline full but small enough to fit memory and not fragment performance. On the data-parallel side, choose per-GPU batch size to keep high GPU utilization while avoiding out-of-memory.
Load balancing: Adjust stage boundaries so each stage has similar compute time. For transformer models, attention and feed-forward blocks can be profiled to split stages evenly.
Failure modes and resilience
Parameter server: If a worker fails, the server can continue with fewer gradients per step (depending on configuration) or wait until the worker restarts. The server is a potential single point of failure unless replicated. Multi-server sharding reduces single-point risk.
All-reduce: If one rank fails, the collective can stall. Some systems attempt reconfiguration or elastic training to continue, but the basic primitive assumes all ranks participate. Rings can sometimes route around link failures; hardware/libraries determine behavior.
Topology-aware choices
Single node (NVLink/NVSwitch): High intra-node bandwidth favors ring or tree depending on GPU count and switch fabric. Multi-node clusters: latency across nodes increases; hierarchical collectives (intra-node then inter-node) often perform best. Libraries can automatically pick and sometimes combine algorithms (e.g., tree across nodes, ring within nodes).
Training loop sketches
Data parallel with all-reduce: (1) Broadcast initial θ to all GPUs. (2) For each step: split batch into N shards; each GPU runs forward+backward to compute local gradient g_i; perform all-reduce to compute g = Σ g_i; each GPU updates θ ← θ − η·g/N. (3) Repeat; parameters remain identical across GPUs.
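The loop can be condensed into a single-process simulation (toy 1-D least-squares model; the names are illustrative, not a real framework API):

```python
# Simulated data-parallel SGD step across N "GPUs": each replica computes a
# local gradient on its shard, an all-reduce forms the sum, and every
# replica applies the same averaged update.
def dp_step(theta, shards, lr):
    n = len(shards)
    # Local gradient of L = sum (theta*x - y)^2 on each shard.
    local_grads = [sum(2 * x * (theta * x - y) for x, y in shard)
                   for shard in shards]
    g = sum(local_grads)                              # all-reduce (sum)
    return [theta - lr * g / n for _ in range(n)]     # identical updates

thetas = dp_step(0.0, [[(1.0, 1.0)], [(2.0, 2.0)]], lr=0.05)
# Replicas stay bit-identical after the synchronized update.
assert thetas[0] == thetas[1]
```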
Data parallel with parameter server: (1) Server hosts θ; workers pull θ. (2) Workers compute local gradients and push to server. (3) Server updates θ and pushes new θ to workers. (4) Workers proceed with next batch using updated θ.
Pipeline parallel: (1) Partition layers into S stages on S GPUs. (2) Split batch into M microbatches. (3) Start microbatch 1 at stage 1; once sent to stage 2, stage 1 immediately starts microbatch 2; continue until steady state. (4) Backward flows in reverse order; careful scheduling overlaps forward and backward where possible.
Tensor parallel row split: (1) Each GPU holds Wi rows. (2) Replicate or broadcast input X to all GPUs. (3) Compute Oi = WiX locally; concatenate O. (4) Backward requires a reduction for dX = Σ Wiᵀ dOi.
Tensor parallel column split: (1) Each GPU holds Wi columns and corresponding Xi. (2) Compute Pi = WiXi locally. (3) Sum O = Σ Pi across GPUs (forward communication). (4) Backward can compute local dWi and dXi with minimal extra global communication.
Memory and bandwidth considerations
Data parallel: Per-GPU memory does not shrink as you add GPUs, because each replica stores the full model and optimizer state; this works only if the model fits on one GPU. Communication per step scales with model size for gradient aggregation.
Model parallel: Reduces per-GPU model memory but increases activation and gradient traffic between GPUs. Pipeline parallel stores only a segment of layers per GPU but needs activation transfers; tensor parallel reduces per-layer memory, but each layer incurs communication for partial results.
Tuning tips
Start with data parallel when the model fits on a GPU; use all-reduce with a well-supported library that chooses algorithms based on topology. If the model is too big, add pipeline parallelism to split layers, then add tensor parallelism to split heavy layers. Balance stages and pick microbatch counts that minimize bubbles. Profile to confirm that communication overlaps with compute and to find bottlenecks.
Tooling (general guidance)
While specific tools aren't required to understand the concepts, typical implementations rely on collective communication libraries that provide all-reduce, and higher-level frameworks that orchestrate data, pipeline, and tensor parallel patterns. These tools usually auto-detect topology and choose ring/tree variants, group ranks into data-parallel and model-parallel groups, and schedule microbatches.
Step-by-Step Implementation Guide (Conceptual)
A) Implementing data parallel with all-reduce (conceptually)
Initialize N processes (one per GPU) and form a data-parallel group of all ranks.
Broadcast initial parameters θ to all ranks to ensure identical starting states.
For each training step:
Split batch into N equal shards (or use distributed samplers so each rank reads a disjoint shard).
Each rank runs forward pass on its shard, computes loss, runs backward to produce a local gradient vector g_i matching θ.
Invoke all-reduce(sum) over g_i so every rank receives g = Σ g_i.
Optionally divide by N to get the average gradient, then apply the optimizer update θ ← θ − η·g/N locally on each rank.
Repeat steps until convergence, ensuring that every rank updates identically.
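The steps above can be simulated in plain NumPy — no real GPUs or communication library; the "all-reduce" is just a sum over a Python list — to check the key invariant: averaging equal-shard gradients reproduces the full-batch update, and every replica stays identical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                        # number of simulated GPUs (ranks)
theta = rng.normal(size=8)   # parameters, "broadcast" so all ranks start identical
lr = 0.1

# Toy problem: mean squared error of a linear model y = X @ theta.
X = rng.normal(size=(4096, 8))
y = rng.normal(size=4096)

def local_gradient(theta, X_shard, y_shard):
    """Gradient of mean squared error on one data shard."""
    resid = X_shard @ theta - y_shard
    return 2.0 * X_shard.T @ resid / len(y_shard)

# Each rank holds its own parameter copy (identical after broadcast).
params = [theta.copy() for _ in range(N)]
shards = np.array_split(np.arange(len(y)), N)   # equal, disjoint shards

# One training step: local gradients, all-reduce(sum), identical update.
grads = [local_gradient(params[i], X[s], y[s]) for i, s in enumerate(shards)]
g_sum = np.sum(grads, axis=0)        # what all-reduce(sum) returns on every rank
for i in range(N):
    params[i] = params[i] - lr * g_sum / N

# All replicas stay identical, and the step equals a full-batch update.
print(np.allclose(params[0], params[3]))
print(np.allclose(params[0], theta - lr * local_gradient(theta, X, y)))
```

Because the shards are equal-sized, averaging the N shard gradients is exactly the full-batch gradient, which is why every rank can apply the same update and never drift.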
B) Implementing a simple parameter server setup (conceptually)
Start a server process that stores θ.
Worker ranks request θ before each step or subscribe to updates.
Each worker runs forward+backward on its shard and sends gradients to the server.
The server aggregates gradients, updates θ, and transmits the updated θ to workers.
Workers proceed to the next batch using the new θ.
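A minimal in-process sketch of this loop, with the server and workers as ordinary Python objects rather than separate processes; the class and method names (`ParameterServer`, `pull`, `push`) are illustrative, not a real framework API.

```python
import numpy as np

class ParameterServer:
    """Toy synchronous parameter server: stores θ, aggregates worker gradients."""
    def __init__(self, theta, lr):
        self.theta = theta.copy()
        self.lr = lr
        self.buffer = []            # gradients pushed by workers this round

    def pull(self):
        return self.theta.copy()    # workers fetch the current parameters

    def push(self, grad):
        self.buffer.append(grad)    # workers send their local gradients

    def apply_update(self):
        # Average buffered worker gradients, then take one SGD step.
        g = np.mean(self.buffer, axis=0)
        self.theta -= self.lr * g
        self.buffer.clear()

rng = np.random.default_rng(1)
server = ParameterServer(theta=rng.normal(size=4), lr=0.5)

# One synchronous round: 8 workers pull θ, compute a gradient, push it back.
for _ in range(8):
    theta = server.pull()
    grad = 2 * theta                # pretend the loss is ||θ||², so ∇ = 2θ
    server.push(grad)
server.apply_update()

# With ∇ = 2θ and lr = 0.5, one averaged step drives θ exactly to zero.
print(server.theta)
```

In a real deployment `pull`/`push` would be RPCs over the network, which is exactly the traffic that makes a single server a bottleneck at scale.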
C) Implementing pipeline parallelism (conceptually)
Partition the model into S contiguous stages based on layer compute cost so each stage is balanced.
Place each stage on a different GPU.
Choose M microbatches per global batch to keep stages busy.
Execute forward passes in a pipeline: when stage k finishes microbatch m, it sends activations to stage k+1 and starts microbatch m+1.
After forward warm-up, start backward passes in reverse order. Use schedules that overlap forward and backward to improve utilization.
Adjust M or stage boundaries if profiling shows long bubbles or imbalances.
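The warm-up behavior can be illustrated with a toy schedule generator (forward passes only, uniform stage times assumed), showing where the (S − 1)-tick bubble comes from and how more microbatches shrink its share.

```python
# Simulated forward-only pipeline schedule (GPipe-style warm-up).
def forward_schedule(S, M):
    """Return, per tick, which microbatch each stage works on (None = idle)."""
    ticks = []
    for t in range(M + S - 1):         # total forward ticks = M + S - 1
        row = []
        for k in range(S):
            m = t - k                  # stage k runs k ticks behind stage 0
            row.append(m if 0 <= m < M else None)
        ticks.append(row)
    return ticks

S, M = 3, 8                            # 3 stages, 8 microbatches per batch
sched = forward_schedule(S, M)
busy = sum(x is not None for row in sched for x in row)   # S*M useful slots
total = len(sched) * S                                    # all stage-ticks
print(len(sched))                  # 10 ticks to finish all forwards
print(round(1 - busy / total, 3))  # idle ("bubble") fraction: 0.2
```

Raising M to 32 with the same 3 stages drops the idle fraction to 2/34 ≈ 0.06, which is the standard argument for many small microbatches; real schedulers also interleave backward passes, which this sketch omits.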
D) Implementing tensor parallelism for a linear layer (conceptually)
Row parallel:
Split the weight matrix by rows across R GPUs: W = [W1; …; WR].
Ensure input X is available on all R GPUs (replication/broadcast).
Each GPU computes Oi = WiX. Concatenate O = concat(Oi).
For backward, compute local dWi and participate in a communication step to form dX = Σ Wiᵀ dOi.
Column parallel:
Split the weight by columns across C GPUs: W = [W1 … WC]. Split input X accordingly: X = [X1; …; XC].
Each GPU computes Pi = WiXi locally.
Sum outputs O = Σ Pi across GPUs (a forward communication step).
Backward for dWi and dXi can be local; global communication needs are reduced compared to row-split.
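Both splits can be checked numerically with NumPy on a toy layer; the per-"GPU" slices here are just array views, and the communication steps are simulated by `concatenate` (row split) and `sum` (column split).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8                                  # tiny layer size, for clarity only
W = rng.normal(size=(d, d))
X = rng.normal(size=(d, 5))            # batch of 5 column vectors
O_ref = W @ X                          # single-GPU reference output

# Row parallel: each "GPU" holds a slice of W's rows plus the full (replicated) X.
W_rows = np.split(W, 2, axis=0)        # W = [W1; W2]
O_row = np.concatenate([Wi @ X for Wi in W_rows], axis=0)   # concat partials

# Column parallel: each "GPU" holds a slice of W's columns and the matching X rows.
W_cols = np.split(W, 2, axis=1)        # W = [W1 W2]
X_rows = np.split(X, 2, axis=0)        # X = [X1; X2]
O_col = sum(Wi @ Xi for Wi, Xi in zip(W_cols, X_rows))      # forward all-reduce

print(np.allclose(O_row, O_ref))       # row split: concatenation recovers O
print(np.allclose(O_col, O_ref))       # column split: summation recovers O
```

This makes the tradeoff concrete: the row split communicates nothing in forward (but must reduce dX in backward), while the column split pays its reduction up front in forward.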
Tips and Warnings
Batch and shard balance: Keep shards equal in size to ensure each GPU has similar work; imbalance creates stragglers.
Communication overlap: Start all-reduce as soon as parts of the gradient are ready to hide latency behind remaining computation.
Bottlenecks: Watch for parameter server saturation or a slow pipeline stage; fix by sharding servers or rebalancing stages.
Memory pressure: Pipeline increases activation memory if microbatches are large; tensor parallel reduces per-layer memory but adds communication overhead; find the right tradeoff.
Fault handling: Plan for worker restart or elastic training if possible; otherwise, a failed rank can stall collectives.
Topology awareness: Group ranks so that heavy communication occurs on the fastest links (e.g., keep tensor-parallel ranks on the same node when possible).
04 Examples
💡
Four-GPU Data Parallel Step: Suppose you have 4 GPUs and a batch of 4096 examples. You split the batch into four shards of 1024 each. Each GPU computes its gradient on its shard. An all-reduce operation sums the four gradients, every GPU divides by 4, and all update their weights identically.
💡
Parameter Server with 8 Workers: You run 1 parameter server holding θ and 8 GPU workers. Each worker computes its local gradient and sends it to the server. The server sums all gradients, applies the update, and broadcasts the new θ. Training continues even if one worker is temporarily down, but the server can become the traffic bottleneck.
💡
Sharded Parameter Servers for a Large Embedding: A huge embedding table is split across 2 servers, each storing half the rows. Workers route gradients for rows 0–49M to server A and rows 50M–99M to server B. Each server updates its shard and returns the updated weights. This reduces pressure on any single server but requires careful routing logic.
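The routing logic from this example can be sketched in a few lines; the server names, 100M-row table, and 50M split point come from the example above, while the function names and dict-based gradient format are hypothetical.

```python
# Route sparse embedding-row gradients to the parameter-server shard that owns
# the row: server "A" owns rows 0..49,999,999, server "B" owns the rest.
SHARD_BOUNDARY = 50_000_000

def server_for_row(row_id):
    """Pick which parameter-server shard owns this embedding row."""
    return "A" if row_id < SHARD_BOUNDARY else "B"

def route_gradients(row_grads):
    """Group sparse row gradients by destination server before pushing."""
    routed = {"A": {}, "B": {}}
    for row_id, grad in row_grads.items():
        routed[server_for_row(row_id)][row_id] = grad
    return routed

routed = route_gradients({12: 0.1, 49_999_999: 0.2, 50_000_000: 0.3, 99_000_000: 0.4})
print(sorted(routed["A"]))   # [12, 49999999]
print(sorted(routed["B"]))   # [50000000, 99000000]
```

Real systems often hash row IDs across shards instead of using contiguous ranges, which balances load better when access patterns are skewed.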
💡
Tree All-Reduce with 8 GPUs: GPUs form a binary tree. Leaves send values up to parents; parents sum and forward upward until the root holds the total. Then the root broadcasts the total back down so everyone receives it. The process completes in logarithmic steps up and down the tree.
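A toy simulation of this reduce-then-broadcast flow, with plain Python integers standing in for gradient tensors; a real implementation would move tensors with NCCL or MPI collectives.

```python
# Toy tree all-reduce over 8 "ranks": reduce up a binary tree, broadcast down.
def tree_all_reduce(values):
    n = len(values)
    vals = list(values)
    # Reduce phase: at distance d, rank i absorbs rank i + d (log2(n) rounds).
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            if i + d < n:
                vals[i] += vals[i + d]
        d *= 2
    # Broadcast phase: push the root's total back down the same tree.
    return [vals[0]] * n

print(tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # every rank gets 36
```

With 8 ranks this takes 3 reduce rounds and 3 broadcast rounds, matching the "logarithmic steps up and down the tree" described above.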
💡
Ring All-Reduce with 16 GPUs: GPUs form a ring, and gradients are split into 16 chunks. In reduce-scatter, chunks circulate for 15 steps with each GPU adding incoming chunks to local sums. In all-gather, reduced chunks circulate so every GPU ends up with all chunks. The steady stream keeps links busy and often scales well.
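The two phases can be simulated in plain Python (shown with 4 ranks rather than 16, for brevity); nested lists stand in for per-rank gradient chunks, and each inner loop models one simultaneous hop around the ring.

```python
# Toy ring all-reduce over N ranks with N chunks each: reduce-scatter then
# all-gather, each phase taking N - 1 hops. Models the data movement only.
def ring_all_reduce(grads):
    N = len(grads)                        # ranks == chunks, one chunk per rank
    chunks = [list(g) for g in grads]     # chunks[rank][chunk_index]
    # Reduce-scatter: after N-1 steps, rank r owns the full sum of chunk (r+1) % N.
    for s in range(N - 1):
        for r in range(N):
            c = (r - s) % N               # chunk rank r forwards this step
            chunks[(r + 1) % N][c] += chunks[r][c]
    # All-gather: circulate the completed chunks until everyone has all of them.
    for s in range(N - 1):
        for r in range(N):
            c = (r + 1 - s) % N
            chunks[(r + 1) % N][c] = chunks[r][c]
    return chunks

grads = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4 ranks, 4 chunks
out = ring_all_reduce(grads)
expected = [sum(g[c] for g in grads) for c in range(4)]
print(all(row == expected for row in out))   # every rank holds the summed gradient
```

Note that each rank sends and receives one chunk per hop, so link utilization stays constant regardless of N; this is why ring all-reduce uses bandwidth so efficiently at large scale.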
💡
Choosing All-Reduce by Scale: With 4 GPUs on a single node, a tree-based algorithm often finishes quickly due to low step count. With 64 GPUs across multiple nodes, ring-based reduction may outperform because it uses bandwidth efficiently over many hops. Libraries typically auto-select using topology hints. This choice impacts your step time directly.
💡
Three-Stage Pipeline Parallelism: A transformer with 36 layers is split into 3 stages of 12 layers each across 3 GPUs. A batch is split into 8 microbatches that move through the pipeline. After initial warm-up, all 3 GPUs work simultaneously on different microbatches. At the end, backward flows in reverse, also overlapped to keep stages busy.
💡
Pipeline Bubble Example: Stage 2 contains more compute (e.g., larger attention heads) than stages 1 and 3. During steady state, stages 1 and 3 sometimes wait for stage 2 to finish, creating bubbles. Repartitioning layers or increasing microbatches reduces the idle time. The result is higher overall throughput.
💡
Row-Parallel Linear Layer: A 4096×4096 weight matrix is split into two 2048-row slices across 2 GPUs. Each GPU multiplies its slice with the full input X and produces a 2048×batch output. The two outputs are concatenated to form the full 4096×batch result. Backprop then requires a reduction to compute gradients for X.
💡
Column-Parallel Linear Layer: The same 4096×4096 weight matrix is split into two 2048-column slices, and the input is split to match. Each GPU multiplies its weight slice by its input slice to produce a partial 4096×batch output. The partial outputs are summed across GPUs for the final result. Backward can be handled mostly locally.
💡
Combining Data + Pipeline + Tensor: Each data-parallel replica is a 4-stage pipeline across 4 GPUs; inside each stage, heavy linear layers are split across 2 tensor-parallel GPUs. During a step, gradients are all-reduced inside each data-parallel group, activations move between stages, and tensor-parallel ranks exchange partial results. This composition trains a model far too large for one GPU while maintaining good throughput.
💡
Fault Tolerance Scenarios: In a parameter-server setup, if a worker fails, the server can still update using remaining gradients or wait until the worker rejoins. In all-reduce, a failed rank can stall the operation unless the framework supports elastic reconfiguration. With a ring topology, a broken link can sometimes be bypassed by rerouting, though performance declines. Planning for failures avoids long downtime.
💡
Topology-Aware Grouping: On a machine with 8 GPUs connected via a fast switch, you place tensor-parallel groups within the node for best bandwidth. Across nodes, you use hierarchical collectives: first reduce within a node, then reduce across nodes. This reduces cross-node traffic and speeds up gradient aggregation. The outcome is faster, more stable scaling.
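A NumPy sketch of this two-level reduction, assuming 2 nodes with 4 GPUs each; summing along each array axis stands in for the intra-node and inter-node collectives.

```python
import numpy as np

# Toy hierarchical all-reduce: sum within each node first, then across nodes,
# then broadcast back. Only one value per node crosses the slow inter-node link.
rng = np.random.default_rng(3)
nodes, gpus_per_node = 2, 4
grads = rng.normal(size=(nodes, gpus_per_node, 6))   # one gradient per GPU

intra = grads.sum(axis=1)    # step 1: fast intra-node reduction (one per node)
inter = intra.sum(axis=0)    # step 2: slow inter-node reduction (1 value/node)
result = np.broadcast_to(inter, grads.shape)   # step 3: broadcast back to all GPUs

# Matches a flat all-reduce over all 8 GPUs.
print(np.allclose(result[0, 0], grads.reshape(-1, 6).sum(axis=0)))
```

The payoff is in traffic counts: a flat reduction sends 8 gradient-sized messages over the network, while the hierarchical version sends only one per node across the slow link.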
💡
Batch and Microbatch Tuning: You experiment with per-GPU batch size to avoid out-of-memory while keeping high utilization. For pipelines, you vary microbatch count to find the sweet spot with minimal bubbles. Profiling shows that 16 microbatches per batch balances memory and utilization well. Overall step time decreases and throughput increases.
05 Conclusion
Training large language models efficiently requires splitting work across many GPUs. This lecture presented two main ways to do that: data parallelism and model parallelism. Data parallelism keeps a full copy of the model on each GPU and shards the data, then combines gradients, either with a parameter server or an all-reduce. All-reduce avoids a central bottleneck and typically scales better; tree and ring topologies each shine under different GPU counts and network layouts. When the model itself is too big to fit on one GPU, model parallelism is essential. Pipeline parallelism splits layers into stages that run like an assembly line, while tensor parallelism splits heavy layers themselves by rows or columns. Row parallelism avoids forward communication but needs backward communication; column parallelism flips that tradeoff. In practice, you often combine data, pipeline, and tensor parallelism to meet memory limits and achieve strong throughput.
To put this into practice, start by checking whether your model fits in a single GPU. If it does, use data parallelism with an efficient all-reduce and confirm parameters stay synchronized. If your model doesn't fit, add pipeline parallelism to partition layers and use microbatches to fill the pipeline, then add tensor parallelism to split the heaviest layers. Balance pipeline stages to reduce bubbles, and choose microbatch counts that fit memory. Profile your runs to find communication bottlenecks and to verify that compute overlaps with communication. For resilience, plan for failures: parameter servers can keep going with fewer workers, while all-reduce may need elastic strategies.
Next steps include learning more about topology-aware collectives, hierarchical reductions across nodes, and automated partitioners that solve the graph partitioning problem for you. Explore how scheduling policies (like 1F1B) affect memory and throughput, and how different parallelisms interact with optimizer states and checkpointing. As you progress, you'll gain the ability to design hybrid parallel strategies tailored to your model size, hardware layout, and training goals.
The core message is simple: the math of gradient addition is straightforward, but performance depends on smart communication and partitioning. Data parallelism speeds up training when the model fits; model parallelism makes massive models trainable. Choosing and combining these methods thoughtfully is the key to scaling language model training efficiently and reliably.
✓ Use tensor parallelism to split heavy layers along rows or columns. Row splits avoid forward communication but require backward reductions and input replication. Column splits need forward summation but keep backward more local and avoid full input replication. Pick based on where your communication budget fits best.
✓ Overlap communication with computation wherever possible. Launch gradient all-reduces as soon as partial gradients are ready. Pipeline activations while earlier layers finish computing. This keeps GPUs busy and reduces perceived latency.
✓ Tune batch size and microbatch count for your memory and throughput goals. Larger per-GPU batches may raise utilization but risk out-of-memory. More microbatches reduce pipeline bubbles but increase overhead. Find the balance using short, controlled experiments.
✓ Be topology-aware in grouping GPUs for communication. Keep the heaviest communication (like tensor-parallel exchanges) on the fastest links. Use hierarchical collectives across nodes. This minimizes slow cross-node traffic.
✓ Monitor for bottlenecks and fix them systematically. If a parameter server saturates, shard; if a pipeline stage is slow, rebalance or adjust microbatches. If all-reduce dominates time, test alternative algorithms or hierarchies. Continuous profiling guides effective changes.
✓ Plan for failures and long runs. Understand how your chosen method reacts to node or link loss. Use retry and restart strategies, and consider elastic training features if available. Avoid single points of failure when possible.
✓ Combine data, pipeline, and tensor parallelism for best results on giant models. Data parallelism increases throughput, pipeline parallelism fits depth, and tensor parallelism fits width. The right mix depends on model shape and hardware. Iteratively refine the design as model size changes.
✓ Keep replicas in sync by averaging gradients and applying identical optimizer steps. Even tiny drifts can cause divergence. Verify synchronization points are correct and that random seeds or dropout states are handled consistently. Consistency ensures stable, reproducible training.
✓ Use balanced shards and stages to avoid idle time. Uneven data shards or lopsided stage assignments cause waiting and waste. Rebalance when GPU utilization varies widely. Better balance yields immediate throughput gains.
Tree All-Reduce
An all-reduce algorithm that arranges GPUs in a tree. Children send values to parents, and the root gets the total; then the total is broadcast back down. It can be fast with small numbers of GPUs. The number of steps grows like the height of the tree.
Ring All-Reduce
An all-reduce algorithm that arranges GPUs in a ring. Partial results are passed around the ring and combined at each hop. It often performs well with many GPUs by keeping links busy. It can continue if traffic is rerouted around a failed link, though more slowly.
Gradient Aggregation
The process of combining gradients computed on different GPUs into one global gradient. Usually done by summing and then averaging. It ensures all model replicas perform the same parameter update. Without it, each replica would drift to a different model.
Shard
A piece of a larger whole split up for easier handling. In training, you can shard data (split a dataset) or shard a model (split parameters). Shards let many GPUs work in parallel and share results later. Balanced shards prevent idle time.