• This lesson teaches two big ways to train neural networks on many GPUs: data parallelism and model parallelism. Data parallelism copies the whole model to every GPU and splits the dataset into equal shards, then averages gradients to take one update step. Model parallelism splits the model itself across GPUs and passes activations forward and gradients backward between them.
• In data parallelism, each GPU computes gradients on its own data shard, then gradients are combined by either a parameter server or an all-reduce operation. A parameter server is a central hub that gathers gradients, updates the weights, and broadcasts new weights back. All-reduce has GPUs talk directly to each other so that every GPU ends up with the same summed (or averaged) gradient.
• Parameter servers are simple and flexible but can become a bottleneck because all traffic flows through them. You can add multiple parameter servers and shard parameters to ease the load, but that makes orchestration more complex. All-reduce avoids a single bottleneck and scales better with more GPUs.
• All-reduce can be implemented in different topologies, mainly tree-based or ring-based. Tree all-reduce reduces values up a tree and broadcasts down, often fast for a small number of GPUs. Ring all-reduce passes chunks around a loop, often better for very large GPU counts and robust to link failures.
• Choosing between tree and ring all-reduce depends on GPU count and network layout (topology). With few GPUs, tree often wins; with many GPUs, ring often wins. Good libraries can auto-select the best algorithm for your hardware.
• Model parallelism is used when a single model doesn't fit on one GPU. The model is split into parts and placed on different GPUs, with activations streaming forward across GPUs and gradients streaming backward. Two main flavors are pipeline parallelism and tensor parallelism.
Why This Lecture Matters
Large language models need huge amounts of compute and memory, and single-GPU training is often too slow or outright impossible. This lecture equips practitioners (ML engineers, researchers, and infrastructure teams) with the essential strategies to scale training across many GPUs efficiently. Data parallelism speeds up training when models fit on a GPU, while model parallelism makes training feasible when models do not fit. Knowing when to apply parameter servers versus all-reduce, and how to choose between tree-based and ring-based collectives, directly affects time-to-train and cloud costs. Understanding pipeline and tensor parallelism unlocks the ability to train state-of-the-art models by spreading both layers and layer computations across devices.
In real projects, these methods solve problems like central bottlenecks, underutilized GPUs, and memory limits. They help teams reach target throughput, handle hardware failures more gracefully, and leverage existing network topologies for optimal performance. Mastery of these techniques translates into building reliable training pipelines, reducing iteration cycles, and making better use of expensive GPU clusters. For careers, this knowledge is core to modern LLM development and highly valued in roles focused on model scaling and distributed systems. As models and datasets continue to grow, the importance of sound parallelism strategies only increases, making these concepts foundational in today's AI industry.
Lecture Summary
01 Overview
This lecture focuses on how to train large language models efficiently across multiple GPUs using parallelism. It explains the two main families of approaches you need to know: data parallelism and model parallelism. Data parallelism is about splitting the dataset into equal pieces (shards) and replicating the same model on every GPU, then combining gradients to perform one update step. Model parallelism is about splitting the model itself across GPUs so that different parts of the model live on different devices, allowing you to train models that don't fit on a single GPU. Within these families, the lecture dives into practical patterns and communication strategies that determine whether training scales smoothly or stalls.
The intended audience is learners who already understand basic supervised learning and gradient-based training, including concepts like datasets, loss functions, forward/backward passes, and stochastic gradient descent. You should be comfortable with the idea of gradients and model parameters and have some familiarity with GPU-based training. You do not need advanced systems knowledge to follow along; the lecture teaches core communication patterns (like all-reduce) from first principles.
By the end, you will be able to describe and compare parameter servers and all-reduce for gradient aggregation; understand how tree-based and ring-based all-reduce work; and when to choose one over the other based on the number of GPUs and network topology. You will also be able to explain model parallelism and its two main strategies (pipeline parallelism and tensor parallelism) and apply row-parallel and column-parallel splits within tensor parallelism. You will recognize the tradeoffs in communication vs. computation for each approach and understand the practical challenges such as pipeline bubbles, bottlenecks, and partitioning the model across GPUs.
The lecture is structured in two major parts. First, it covers data parallelism: how the dataset is divided, how identical model replicas compute local gradients, and how gradients are combined with either a parameter server or an all-reduce. It then compares two all-reduce algorithms (tree and ring) and discusses when each is preferable. Second, it moves to model parallelism: how and why to split models, the flow of activations forward and gradients backward across devices, and the two key strategies: pipeline parallelism and tensor parallelism. For tensor parallelism, it distinguishes row parallelism (split weight matrix rows) and column parallelism (split columns), describing the communication needs in forward and backward passes. Finally, it highlights that practical systems often combine these methods (e.g., data + tensor + pipeline) and that finding the best partition is a hard graph-partitioning problem often handled by specialized libraries.
Key Takeaways
• Start with data parallelism if your model fits on one GPU. Split each batch evenly across GPUs and aggregate gradients using an efficient all-reduce. Ensure each replica applies exactly the same update to stay synchronized. Watch for imbalanced shards that create stragglers.
• Use all-reduce instead of a parameter server when scaling to many GPUs. All-reduce removes the central bottleneck and often improves throughput and scaling. Let your communication library pick ring or tree automatically based on topology. Validate by profiling step times before and after the change.
• If a parameter server is required, shard parameters across multiple servers. Routing gradient slices to the right server lowers bottlenecks. Monitor server CPU/GPU and network usage to detect saturation. Consider redundancy to avoid single points of failure.
• Choose tree all-reduce for small GPU counts and ring all-reduce for large counts, as a rule of thumb. Tree reduces latency with fewer rounds; ring leverages bandwidth well at scale. Always confirm with benchmarks on your hardware. Topology-aware libraries can outperform hand-picked settings.
• Adopt model parallelism when the model no longer fits in GPU memory. Split the model across devices so activations flow forward and gradients flow backward. Combine it with data parallelism for both memory fit and throughput. Profile activation sizes to plan inter-GPU transfers.
• Apply pipeline parallelism to group layers into balanced stages. Use microbatches to fill the pipeline and shrink bubbles. Rebalance stages if one becomes a bottleneck. Aim for steady-state overlap of forward and backward to maximize utilization.
Glossary
Data Parallelism
A way to train by copying the whole model to many GPUs and splitting the data across them. Each GPU computes gradients on its own data slice. The gradients are then combined (summed or averaged) so every model copy updates the same way. It's like a team reading different pages of a book and agreeing on one summary. This makes training faster when the model fits on one GPU.
Model Parallelism
A way to train models that don't fit on a single GPU by splitting the model across multiple GPUs. Different parts of the model live on different devices. Activations move forward from one GPU to the next; gradients move backward. It's like an assembly line of stations. This lets you train much bigger models than one GPU can hold.
Parameter Server
A central server (or servers) that stores and updates model parameters during training. Workers send gradients to it; it updates the weights and broadcasts the new version. It's simple and can continue even if some workers fail. But it can become a bottleneck because all traffic flows through it.
All-Reduce
A communication operation where every GPU contributes a value and every GPU receives the combined result, like a sum. It avoids a central server by having GPUs talk directly. It keeps parameters synchronized by ensuring everyone sees the same summed gradients. It scales well as you add more GPUs.
• Pipeline parallelism groups consecutive layers into stages on different GPUs and runs microbatches through the stages like cars on an assembly line. The challenge is keeping all stages busy to maximize throughput and avoid "bubbles" where a stage waits idle. When tuned well, it enables training models much larger than a single GPU can hold.
• Tensor parallelism splits heavy layers across GPUs so each GPU computes a slice of the same layer. Two core patterns are row parallelism (split rows of the weight matrix) and column parallelism (split columns). Row parallelism needs no forward-pass communication but needs backward communication, while column parallelism flips that tradeoff.
• In row parallelism, each GPU stores different rows of a weight matrix and multiplies with the full input, then concatenates outputs. This often requires replicating the input on all GPUs. The backward pass needs a communication step to combine gradients for the input.
• In column parallelism, each GPU stores different columns of the weight matrix and multiplies with a slice of the input, then sums partial results. This reduces input replication but needs a forward-pass sum across GPUs. The backward pass can be computed locally.
• Putting it all together, real systems combine data parallelism with pipeline and tensor parallelism to reach high efficiency. The best mix balances memory, compute, and network costs. Libraries and planners can automatically partition layers to minimize communication.
• Partitioning a model across GPUs is a graph partitioning problem: split nodes (layers) and edges (data flows) to lower cross-GPU traffic and balance load. This is difficult and often handled by tooling. In practice, automated strategies pick good splits given layer sizes and interconnect speeds.
• Overall, data parallelism improves throughput when memory per GPU is sufficient for the model, while model parallelism is necessary when the model itself is too large to fit. All-reduce is the standard gradient-combining method for data parallel, with strong scaling properties. Pipeline and tensor parallelism make giant models trainable by spreading layers and layer math across devices.
• Robustness and fault tolerance differ by strategy: parameter servers can continue if a few workers fail, but can be a single point of failure; all-reduce removes that single point yet may stall on worker failure unless handled. Ring all-reduce can route around a broken link with performance loss. Careful monitoring and retry logic help keep training stable.
• Efficiency requires keeping GPUs fed with data (overlapping compute and communication) and avoiding long idle gaps. Choosing shard sizes, microbatch counts, and all-reduce topology affects utilization. Good defaults plus profiling lead to strong speedups.
• The math stays simple: the true gradient is the sum (or mean) of per-shard gradients, and all methods work to compute that same result efficiently. Averaging is just summing then dividing by the number of GPUs. Synchronization ensures every replica applies the same update so parameters stay identical.
02 Key Concepts
01
Data Parallelism (Definition): Data parallelism is a training strategy that copies the entire model onto many GPUs and splits the dataset into equal parts. Each GPU computes gradients on its data shard, and then all gradients are combined into one update. It works like a team reading a big book by splitting the pages among them, then meeting to agree on the summary. Technically, if each GPU computes ∂L(D_i)/∂θ, those are summed (and usually averaged) to estimate the full gradient ∂L(D)/∂θ. Without it, training on huge datasets would be slow and limited to one GPU. For example, with 8 GPUs, each processes one-eighth of the batch, then gradients are averaged to update all replicas identically.
02
Gradient Aggregation (Definition): Gradient aggregation is the step where the per-GPU gradients are combined into one global gradient. Think of it as collecting everyone's notes and merging them into one final document. Technically, gradients from shards D1...DN are summed and optionally divided by N to average. This ensures every replica applies the exact same update and stays in sync. Without it, replicas would drift apart and no single, consistent model would exist. For instance, 4 GPUs each compute g1..g4, and an all-reduce computes g = (g1+g2+g3+g4)/4.
03
Parameter Server (Definition): A parameter server is a central node (or nodes) that stores model parameters, receives gradients from workers, updates, and then broadcasts new parameters. It's like a librarian who collects all edits to a shared book and then publishes the new edition. Technically, workers send gradients to the server, the server computes θ ← θ − η·(sum of gradients), then broadcasts θ. It matters because it's simple and robust to worker failures, but the server can become a bottleneck. For example, 16 workers send their gradients to one server, which updates θ and returns it to all.
04
Parameter Server Bottleneck (Definition): The bottleneck is when the central server can't keep up with incoming gradients and outgoing parameter updates. It's like a single checkout line in a busy store causing a long queue. Technically, all network traffic converges on the server; its bandwidth and CPU/GPU capacity limit throughput. This matters because it caps scaling and wastes GPU time. For example, adding more workers beyond 8 yields little speedup because the server saturates on communication.
05
Sharded Parameter Servers (Definition): Sharding parameters means spreading different slices of the model's weights across multiple parameter servers. It's like splitting the job of a librarian among several librarians, each handling different chapters. Technically, gradients for weight slice A go to server A, for slice B go to server B, etc.; each server updates its slice and returns it. This reduces bottlenecks but adds complexity in routing and bookkeeping. For instance, an embedding table might be split so each server stores a different range of rows.
06
All-Reduce (Definition): All-reduce is a collective operation where every GPU contributes a value (like gradients) and every GPU receives the global reduction (e.g., the sum). It's like a roundtable where everyone shares numbers, and everyone leaves with the same total. Technically, it computes y[i] = sum over all ranks' x[i], distributing the result to all ranks. This matters because it removes the single central bottleneck and scales well with more GPUs. For example, each GPU holds a gradient vector; after all-reduce, all GPUs hold the summed vector and can average locally.
07
Tree All-Reduce (Definition): Tree all-reduce arranges GPUs in a tree: children send values to parents, parents sum and pass up, then the root broadcasts down. It's like passing buckets of water up a human ladder and then pouring the full bucket back down. Technically, reduction takes O(log N) steps up the tree and O(log N) back down. It matters for small to medium GPU counts where latency is key. For example, with 8 GPUs, values reduce in 3 uplink steps and broadcast back down in 3 steps.
08
Ring All-Reduce (Definition): Ring all-reduce arranges GPUs in a loop, passing chunks around so each GPU contributes and receives parts until everyone has the sum. It's like passing a message around a circle, adding your piece before passing it on. Technically, it takes N-1 stages to circulate contributions, and can be bandwidth-efficient at large scale. It matters because it scales well with many GPUs and tolerates link issues (with degraded speed). For instance, with 32 GPUs, the ring completes reduction with a steady stream of chunk transfers.
09
Choosing All-Reduce Algorithm (Definition): The algorithm choice depends on GPU count and network topology (how devices are wired). It's like choosing roads: city streets (tree) are fast for a few stops; highways (ring) are efficient for long trips. Technically, tree often wins at low N due to fewer steps; ring often wins at high N due to sustained bandwidth use. It matters for practical speedups and cost. For example, libraries may auto-pick tree for 4 GPUs and ring for 64 GPUs.
10
Model Parallelism (Definition): Model parallelism splits the model itself across GPUs because it's too big for one device. It's like dividing a long assembly line into stations, each doing part of the job. Technically, different layers (or layer pieces) live on different GPUs; forward activations flow across GPUs, and backward gradients flow in reverse. It matters because without it, giant models can't be trained at all. For example, layers 1-12 on GPU0, 13-24 on GPU1, 25-36 on GPU2.
11
Pipeline Parallelism (Definition): Pipeline parallelism groups consecutive layers into stages placed on different GPUs and streams microbatches through them. It's like an assembly line where while one car is at station 2, the next car starts at station 1. Technically, microbatches move forward stage by stage; backward moves in reverse order. It matters because it can keep all GPUs busy and enable very large models. For example, with 3 stages and 8 microbatches, all stages work in parallel after the pipeline fills.
12
Pipeline Bubbles (Definition): Bubbles are idle periods in the pipeline where some stages aren't doing work. It's like empty spots between cars on an assembly line. Technically, they occur during the warm-up and cool-down phases or when one stage is slower than others. They reduce utilization and overall throughput. For example, if stage 2 is twice as slow, stages 1 and 3 spend time waiting.
13
Tensor Parallelism (Definition): Tensor parallelism splits individual layers (like big matrix multiplications) across GPUs so each computes a slice. It's like several people dividing up a giant spreadsheet calculation. Technically, weights and inputs are partitioned so outputs can be concatenated or summed to form the full result. It matters for layers too big or heavy for one GPU. For example, split a 16384×16384 weight into halves so each GPU multiplies part of it.
14
Row Parallelism (Definition): Row parallelism splits the rows of a weight matrix across GPUs so each GPU computes its portion of the output rows. It's like dividing a tall stack of rows among teammates, each computing their rows. Technically, each GPU multiplies its row-slice Wi by the full input X, then outputs are concatenated. It matters because it avoids forward-pass communication but requires backward communication for input gradients. For example, two GPUs compute [W1; W2]X and concat the results.
15
Column Parallelism (Definition): Column parallelism splits the columns of a weight matrix across GPUs and typically splits the input to match. It's like dividing a wide spreadsheet by columns among teammates. Technically, each GPU multiplies its column-slice Wi with its input slice Xi; partial outputs are summed. It matters because it avoids backward-pass communication but requires forward summation. For example, two GPUs compute W1X1 and W2X2 and sum to get WX.
16
Forward and Backward Flows (Definition): Forward flow means activations move from earlier layers to later layers; backward flow means gradients move in reverse during backpropagation. It's like sending a package forward across stations and returning a signed receipt backward. Technically, model parallelism routes tensors between GPUs in both directions. It matters because these transfers define communication cost and scheduling. For example, GPU0 sends its activation to GPU1 in forward, and GPU1 sends gradients back to GPU0 in backward.
17
Network Topology (Definition): Network topology describes how GPUs and servers are physically and logically connected. It's like the road map between cities: some routes are direct, others are indirect. Technically, bandwidth and latency between pairs of GPUs can differ, influencing all-reduce performance. It matters because the best algorithm depends on this map. For example, on a single node with NVLink, ring might be great; across nodes with a fat-tree fabric, a tree-based algorithm may win.
18
Scalability and Fault Tolerance (Definition): Scalability is how well performance improves as you add more GPUs; fault tolerance is how well the system continues if something breaks. It's like adding more workers to a job and still finishing even if one leaves. Technically, parameter servers can continue with fewer workers but may be a single point of failure; all-reduce has no central point but can stall if a rank fails. It matters for long, expensive training runs. For example, a failed network link in a ring can be routed around but runs slower.
19
Graph Partitioning for Model Placement (Definition): Assigning layers to GPUs is a graph partitioning problem: split the computation graph to minimize cross-GPU communication and balance load. It's like cutting a network of roads into regions so most driving stays within each region. Technically, edges represent tensor flows; cuts represent communications. It matters because poor splits cause bottlenecks. For example, putting two heavy-attention blocks on the same GPU and lighter parts elsewhere can balance compute and reduce transfers.
20
Combining Parallelisms (Definition): Real systems mix data parallelism, pipeline parallelism, and tensor parallelism to reach better speed and fit. It's like using multiple teamwork strategies at once: divide the book pages, divide the chapters, and divide the calculations. Technically, data-parallel replicas can each be internally model-parallel. It matters because no single method dominates all constraints. For example, each pipeline stage uses tensor parallelism inside, and many such pipelines run in data parallel.
03 Technical Details
Overall Architecture and Data Flow
Baseline supervised training
We begin with a dataset D made of pairs (X, Y). A model with parameters θ maps X to ŷ and uses a loss L(ŷ, Y). Stochastic gradient descent (SGD) or a variant computes gradients ∂L/∂θ, then updates θ.
On one GPU, a training step is: forward pass on a batch → compute loss → backward pass for gradients → update parameters.
Data parallelism: replicating models, sharding data
Concept: Make N identical copies of the model on N GPUs. Split each training batch into N shards (equal size ideally) and send one shard to each GPU. Every replica computes gradients on its shard. Then combine all gradients to get a global gradient (sum or average), and update parameters identically on every GPU.
Why it works: Gradients are additive. If L(D) = Σ_i L(D_i), then ∂L/∂θ = Σ_i ∂L(D_i)/∂θ. Averaging is just dividing by N after summation. Thus, combining per-shard gradients exactly reproduces the gradient of the full batch, assuming identical model states.
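This additivity can be checked in a few lines of plain Python; the toy 1-D least-squares model below is purely illustrative:

```python
# Toy check of gradient additivity: for L(D) = sum over (x, y) of (w*x - y)^2,
# dL/dw = sum of 2*x*(w*x - y). Summing per-shard gradients reproduces the
# full-batch gradient exactly; averaging just divides by the shard count.
def grad(w, shard):
    return sum(2 * x * (w * x - y) for x, y in shard)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

# Split into 2 equal shards, as 2 data-parallel "GPUs" would.
shards = [data[:2], data[2:]]
per_shard = [grad(w, s) for s in shards]

full = grad(w, data)
combined = sum(per_shard)
assert abs(full - combined) < 1e-12
```

This equality holds only when every replica starts the step with identical parameters, which is exactly what the synchronized update guarantees.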
Gradient aggregation options
Parameter server: One (or several) central nodes store and update θ. Workers (GPUs) send gradients, the server updates θ, and broadcasts the new θ. Advantage: simplicity, robustness to some worker failures, flexible infrastructure. Disadvantage: central bottleneck; risk of single point of failure unless sharded or replicated.
All-reduce: No central node. Each GPU contributes its gradient vector and receives the global sum (or average) through a collective operation. Advantage: no single bottleneck; scales well with many GPUs; often higher throughput. Disadvantage: requires all participants to coordinate; a failing participant can stall the collective without special handling.
All-reduce algorithms
Tree-based all-reduce: GPUs logically form a tree. In the reduce phase, leaves send values up to parents, parents add and pass up until the root gets the total. Then a broadcast phase sends the total back down so everyone has the result. Time complexity involves O(log N) steps for up and O(log N) for down. Communication fan-in/fan-out can align well with certain fabrics (e.g., fat-tree networks), and it's often faster at small N.
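A minimal single-process simulation of the reduce-then-broadcast pattern (assuming a power-of-two rank count; real libraries handle arbitrary counts and run the rounds in parallel):

```python
# Simulated tree all-reduce over N ranks (N a power of two): pairwise
# reduction halves the number of active senders each round (log2 N rounds),
# then the root's total is broadcast back down (another log2 N rounds).
def tree_all_reduce(values):
    n = len(values)
    vals = list(values)
    # Reduce phase: at stride s, rank i accumulates the value from rank i+s.
    s = 1
    while s < n:
        for i in range(0, n, 2 * s):
            vals[i] += vals[i + s]
        s *= 2
    # Broadcast phase: the root (rank 0) sends the total back to everyone.
    total = vals[0]
    return [total] * n

result = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
# Every rank ends with the global sum, 36, after 3 + 3 rounds.
assert result == [36] * 8
```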
Ring-based all-reduce: GPUs form a logical ring. Each GPU splits its gradient into N chunks. In the reduce-scatter phase, chunks circulate around the ring for N-1 steps; each step, a GPU adds incoming chunks to its local partial sums and forwards them. In the all-gather phase, the reduced chunks circulate again so all GPUs receive every chunk. With proper chunking and overlap, ring uses link bandwidth efficiently and often excels at large N.
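The two phases can be sketched as a single-process simulation (illustrative only; real implementations overlap these transfers across links):

```python
# Simulated ring all-reduce: each of the N ranks holds a vector split into
# N chunks. Reduce-scatter circulates chunks for N-1 steps so each rank ends
# up owning one fully reduced chunk; all-gather circulates the reduced chunks
# for another N-1 steps so every rank holds the complete sum.
def ring_all_reduce(vectors):
    n = len(vectors)                      # number of ranks == chunks per rank
    chunks = [list(v) for v in vectors]   # chunks[rank][chunk_index]
    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r+1,
    # which adds it to its own partial sum for that chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] += chunks[r][c]
    # Now rank r owns the fully reduced chunk (r + 1) mod n.
    # All-gather: circulate the reduced chunks so every rank gets every sum.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks

result = ring_all_reduce([[1, 2], [3, 4]])
assert result == [[4, 6], [4, 6]]
```

Because each step moves chunks of size n/N rather than the whole vector, the per-step payload shrinks as ranks are added, which is why the ring stays bandwidth-efficient at scale.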
Choice: For few GPUs or high-latency links, tree can win due to fewer synchronization stages. For many GPUs with good bandwidth, ring often wins by fully utilizing links. Many libraries analyze topology and pick automatically.
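A back-of-envelope way to compare the two is the standard alpha-beta (latency-bandwidth) cost model. The constants below are assumptions for illustration, not measurements, and the crossover depends on message size as well as GPU count:

```python
from math import log2

def tree_cost(N, n, alpha, beta):
    # ~2*log2(N) rounds (reduce up, broadcast down), full message each round.
    return 2 * log2(N) * (alpha + beta * n)

def ring_cost(N, n, alpha, beta):
    # 2*(N-1) rounds (reduce-scatter + all-gather) of n/N-sized chunks.
    return 2 * (N - 1) * (alpha + beta * n / N)

alpha, beta = 1e-5, 1e-9          # 10 us latency, 1 GB/s (assumed values)
small_msg, big_msg = 1e4, 1e8     # 10 KB vs 100 MB of gradients

# Latency-dominated small messages favor the tree's fewer rounds...
assert tree_cost(64, small_msg, alpha, beta) < ring_cost(64, small_msg, alpha, beta)
# ...while bandwidth-dominated large messages favor the ring's small chunks.
assert ring_cost(64, big_msg, alpha, beta) < tree_cost(64, big_msg, alpha, beta)
```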
Multi-parameter servers (sharding)
When a parameter server becomes a bottleneck, split weights across multiple servers. Workers must route gradient segments to the correct server responsible for that parameter slice. Each server updates its slice and returns it. This introduces routing and consistency management overhead but allows scaling beyond a single server's capacity.
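A minimal sketch of the routing idea (all names hypothetical; a real system would run each server as a separate process and route gradients over the network):

```python
# Sketch of sharded parameter servers: a flat parameter vector is split into
# contiguous slices, one per server; each server applies SGD to its slice
# when the matching gradient segment is routed to it.
def make_servers(params, num_servers, lr):
    size = len(params)
    bounds = [size * i // num_servers for i in range(num_servers + 1)]
    shards = [params[bounds[i]:bounds[i + 1]] for i in range(num_servers)]

    def push_and_pull(grads):
        # Route each gradient segment to the server owning that slice,
        # update there, then gather the new full parameter vector.
        for i, shard in enumerate(shards):
            g = grads[bounds[i]:bounds[i + 1]]
            for j in range(len(shard)):
                shard[j] -= lr * g[j]
        return [w for shard in shards for w in shard]

    return push_and_pull

step = make_servers([1.0, 2.0, 3.0, 4.0], num_servers=2, lr=0.1)
new_params = step([10.0, 10.0, 10.0, 10.0])
# Each server updated its own slice: w - 0.1 * 10 = w - 1.0
assert new_params == [0.0, 1.0, 2.0, 3.0]
```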
Model Parallelism: Splitting the Model
Why model parallelism?
When a model's parameters, activations, and optimizer states exceed a single GPU's memory, replication is impossible. Model parallelism divides the model across multiple GPUs. The forward pass sends activations from one partition to the next; backpropagation sends gradients in reverse.
Pipeline parallelism
Structure: Divide the network into S sequential stages, each a block of consecutive layers. Place each stage on a different GPU. Instead of pushing one big batch through stage-by-stage, split it into M microbatches. Start microbatch 1 at stage 1; when it moves to stage 2, stage 1 starts microbatch 2, and so on. After warm-up, all stages work concurrently.
Forward/backward scheduling: A common schedule is 1F1B (one forward, one backward per microbatch in steady state) to overlap directions and reduce memory. During backward, gradients flow from the last stage back to the first stage across the same inter-stage links.
Bubbles and balance: If one stage has more compute than others, it becomes the bottleneck. Balancing layers across stages reduces idle time. Increasing microbatch count M helps fill the pipeline and shrink bubbles, but too many microbatches may add overhead or memory pressure.
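Under the simplifying assumption of equal per-microbatch stage times, the bubble share of a GPipe-style forward sweep can be computed directly:

```python
# With S stages and M microbatches (equal stage times assumed), a forward
# sweep occupies S + M - 1 time steps per stage, of which S - 1 are
# warm-up/drain bubbles where some stage sits idle.
def bubble_fraction(num_stages, num_microbatches):
    total = num_stages + num_microbatches - 1
    return (num_stages - 1) / total

# Raising the microbatch count shrinks the idle share, as the text notes.
assert bubble_fraction(4, 32) < bubble_fraction(4, 4)
```

For example, 4 stages with 4 microbatches waste 3/7 of the steps, while the same pipeline with 32 microbatches wastes only 3/35.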
Tensor parallelism
Goal: Split a heavy layerâs compute across GPUs so no single GPU does the entire matrix multiplication or attention block. Two main patterns are row parallel and column parallel for linear layers.
Row parallel: Split the weight by rows: W = [W1; W2; …]. Each GPU stores Wi and multiplies it by the full input X to produce a slice of the output Oi = WiX. Outputs are concatenated to form O = concat(Oi). Communication: forward concatenation is inexpensive (metadata or layout), but backward requires combining gradients to compute dX = Σ_i Wi^T dOi across GPUs (a communication step). Input X must be accessible to each GPU (replicated or broadcast).
Column parallel: Split the weight by columns: W = [W1 W2 …]. Split inputs accordingly: X = [X1; X2; …]. Each GPU computes Pi = Wi Xi. Outputs are summed across GPUs: O = Σ_i Pi. Communication: forward requires a sum/all-reduce-like operation to combine partial outputs; backward allows each GPU to compute its local gradients without extra global steps for dX if X was split to match W.
Tradeoffs: Row parallel avoids forward communication but needs backward communication and input replication; column parallel needs forward summation but can localize backward computation and avoid replicating the whole input.
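Both splits can be verified against the full product on toy matrices (plain Python, two simulated GPUs):

```python
# Check that row-parallel and column-parallel splits both reproduce W @ X.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4x2 weight
X = [[1, 0, 2], [0, 1, 3]]             # 2x3 input (full input)

# Row parallel: GPU0 holds rows 0-1 of W, GPU1 holds rows 2-3; each
# multiplies the FULL input, and outputs are concatenated along rows.
O_row = matmul(W[:2], X) + matmul(W[2:], X)

# Column parallel: GPU0 holds column 0 of W and row 0 of X, GPU1 the rest;
# the partial products are summed elementwise (the forward communication).
P0 = matmul([[row[0]] for row in W], [X[0]])
P1 = matmul([[row[1]] for row in W], [X[1]])
O_col = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(P0, P1)]

assert O_row == matmul(W, X) == O_col
```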
Combining model and data parallelism
Practical large-scale training stacks combine them. For example, each data-parallel replica contains a pipeline of stages, and each stage uses tensor parallelism to split heavy layers. Gradient aggregation within each data-parallel group uses all-reduce; activations/gradients move between pipeline stages; tensor-parallel ranks exchange partial layer results.
Communication patterns and synchronization
Synchronization points occur at: (a) gradient all-reduce in data parallel, (b) inter-stage sends/receives in pipelines, and (c) partial result exchanges in tensor parallel layers. Overlapping compute with communication (e.g., starting an all-reduce for earlier layers while computing later layers' backward) improves utilization. Communication libraries exploit multiple streams and priorities to hide latency behind computation.
Practical scheduling considerations
Microbatching: Choose M (microbatches) large enough to keep the pipeline full but small enough to fit memory and not fragment performance. On the data-parallel side, choose per-GPU batch size to keep high GPU utilization while avoiding out-of-memory.
Load balancing: Adjust stage boundaries so each stage has similar compute time. For transformer models, attention and feed-forward blocks can be profiled to split stages evenly.
Failure modes and resilience
Parameter server: If a worker fails, the server can continue with fewer gradients per step (depending on configuration) or wait until the worker restarts. The server is a potential single point of failure unless replicated. Multi-server sharding reduces single-point risk.
All-reduce: If one rank fails, the collective can stall. Some systems attempt reconfiguration or elastic training to continue, but the basic primitive assumes all ranks participate. Rings can sometimes route around link failures; hardware/libraries determine behavior.
Topology-aware choices
Single node (NVLink/NVSwitch): High intra-node bandwidth favors ring or tree depending on GPU count and switch fabric. Multi-node clusters: latency across nodes increases; hierarchical collectives (intra-node then inter-node) often perform best. Libraries can automatically pick and sometimes combine algorithms (e.g., tree across nodes, ring within nodes).
Training loop sketches
Data parallel with all-reduce: (1) Broadcast initial θ to all GPUs. (2) For each step: split batch into N shards; each GPU runs forward+backward to compute local gradient g_i; perform all-reduce to compute g = Σ g_i; each GPU updates θ ← θ − η·g/N. (3) Repeat; parameters remain identical across GPUs.
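The loop can be condensed into a single-process simulation (toy 1-D least-squares model; the names are illustrative, not a real framework API):

```python
# Simulated data-parallel SGD step across N "GPUs": each replica computes a
# local gradient on its shard, an all-reduce forms the sum, and every
# replica applies the same averaged update.
def dp_step(theta, shards, lr):
    n = len(shards)
    # Local gradient of L = sum (theta*x - y)^2 on each shard.
    local_grads = [sum(2 * x * (theta * x - y) for x, y in shard)
                   for shard in shards]
    g = sum(local_grads)                              # all-reduce (sum)
    return [theta - lr * g / n for _ in range(n)]     # identical updates

thetas = dp_step(0.0, [[(1.0, 1.0)], [(2.0, 2.0)]], lr=0.05)
# Replicas stay bit-identical after the synchronized update.
assert thetas[0] == thetas[1]
```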
Data parallel with parameter server: (1) Server hosts θ; workers pull θ. (2) Workers compute local gradients and push to server. (3) Server updates θ and pushes new θ to workers. (4) Workers proceed with next batch using updated θ.
Pipeline parallel: (1) Partition layers into S stages on S GPUs. (2) Split batch into M microbatches. (3) Start microbatch 1 at stage 1; once sent to stage 2, stage 1 immediately starts microbatch 2; continue until steady state. (4) Backward flows in reverse order; careful scheduling overlaps forward and backward where possible.
Tensor parallel row split: (1) Each GPU holds Wi rows. (2) Replicate or broadcast input X to all GPUs. (3) Compute Oi = WiX locally; concatenate O. (4) Backward requires a reduction for dX = Σ Wiᵀ dOi.
Tensor parallel column split: (1) Each GPU holds Wi columns and corresponding Xi. (2) Compute Pi = WiXi locally. (3) Sum O = Σ Pi across GPUs (forward communication). (4) Backward can compute local dWi and dXi with minimal extra global communication.
Memory and bandwidth considerations
Data parallel: Per-GPU memory does not shrink as you add GPUs, because each replica stores the full model and optimizer state; this works only if the model fits on one GPU. Communication per step scales with model size for gradient aggregation.
Model parallel: Reduces per-GPU model memory but increases activation and gradient traffic between GPUs. Pipeline parallel stores only a segment of layers per GPU but needs activation transfers; tensor parallel reduces per-layer memory, but each layer incurs communication for partial results.
Tuning tips
Start with data parallel when the model fits on a GPU; use all-reduce with a well-supported library that chooses algorithms based on topology. If the model is too big, add pipeline parallelism to split layers, then add tensor parallelism to split heavy layers. Balance stages and pick microbatch counts that minimize bubbles. Profile to confirm that communication overlaps with compute and to find bottlenecks.
Tooling (general guidance)
While specific tools aren't required to understand the concepts, typical implementations rely on collective communication libraries that provide all-reduce, and higher-level frameworks that orchestrate data, pipeline, and tensor parallel patterns. These tools usually auto-detect topology and choose ring/tree variants, group ranks into data-parallel and model-parallel groups, and schedule microbatches.
Step-by-Step Implementation Guide (Conceptual)
A) Implementing data parallel with all-reduce (conceptually)
Initialize N processes (one per GPU) and form a data-parallel group of all ranks.
Broadcast initial parameters θ to all ranks to ensure identical starting states.
For each training step:
Split batch into N equal shards (or use distributed samplers so each rank reads a disjoint shard).
Each rank runs forward pass on its shard, computes loss, runs backward to produce a local gradient vector g_i matching θ.
Invoke all-reduce(sum) over g_i so every rank receives g = Σ g_i.
Optionally divide by N to get the average gradient, then apply the optimizer update θ ← θ − η·g/N locally on each rank.
Repeat steps until convergence, ensuring that every rank updates identically.
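The steps above can be simulated in plain NumPy — no real GPUs or communication library; the "all-reduce" is just a sum over a Python list — to check the key invariant: averaging equal-shard gradients reproduces the full-batch update, and every replica stays identical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                        # number of simulated GPUs (ranks)
theta = rng.normal(size=8)   # parameters, "broadcast" so all ranks start identical
lr = 0.1

# Toy problem: mean squared error of a linear model y = X @ theta.
X = rng.normal(size=(4096, 8))
y = rng.normal(size=4096)

def local_gradient(theta, X_shard, y_shard):
    """Gradient of mean squared error on one data shard."""
    resid = X_shard @ theta - y_shard
    return 2.0 * X_shard.T @ resid / len(y_shard)

# Each rank holds its own parameter copy (identical after broadcast).
params = [theta.copy() for _ in range(N)]
shards = np.array_split(np.arange(len(y)), N)   # equal, disjoint shards

# One training step: local gradients, all-reduce(sum), identical update.
grads = [local_gradient(params[i], X[s], y[s]) for i, s in enumerate(shards)]
g_sum = np.sum(grads, axis=0)        # what all-reduce(sum) returns on every rank
for i in range(N):
    params[i] = params[i] - lr * g_sum / N

# All replicas stay identical, and the step equals a full-batch update.
print(np.allclose(params[0], params[3]))
print(np.allclose(params[0], theta - lr * local_gradient(theta, X, y)))
```

Because the shards are equal-sized, averaging the N shard gradients is exactly the full-batch gradient, which is why every rank can apply the same update and never drift.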
B) Implementing a simple parameter server setup (conceptually)
Start a server process that stores θ.
Worker ranks request θ before each step or subscribe to updates.
Each worker runs forward+backward on its shard and sends gradients to the server.
The server aggregates gradients, updates θ, and transmits the updated θ to workers.
Workers proceed to the next batch using the new θ.
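A minimal in-process sketch of this loop, with the server and workers as ordinary Python objects rather than separate processes; the class and method names (`ParameterServer`, `pull`, `push`) are illustrative, not a real framework API.

```python
import numpy as np

class ParameterServer:
    """Toy synchronous parameter server: stores θ, aggregates worker gradients."""
    def __init__(self, theta, lr):
        self.theta = theta.copy()
        self.lr = lr
        self.buffer = []            # gradients pushed by workers this round

    def pull(self):
        return self.theta.copy()    # workers fetch the current parameters

    def push(self, grad):
        self.buffer.append(grad)    # workers send their local gradients

    def apply_update(self):
        # Average buffered worker gradients, then take one SGD step.
        g = np.mean(self.buffer, axis=0)
        self.theta -= self.lr * g
        self.buffer.clear()

rng = np.random.default_rng(1)
server = ParameterServer(theta=rng.normal(size=4), lr=0.5)

# One synchronous round: 8 workers pull θ, compute a gradient, push it back.
for _ in range(8):
    theta = server.pull()
    grad = 2 * theta                # pretend the loss is ||θ||², so ∇ = 2θ
    server.push(grad)
server.apply_update()

# With ∇ = 2θ and lr = 0.5, one averaged step drives θ exactly to zero.
print(server.theta)
```

In a real deployment `pull`/`push` would be RPCs over the network, which is exactly the traffic that makes a single server a bottleneck at scale.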
C) Implementing pipeline parallelism (conceptually)
Partition the model into S contiguous stages based on layer compute cost so each stage is balanced.
Place each stage on a different GPU.
Choose M microbatches per global batch to keep stages busy.
Execute forward passes in a pipeline: when stage k finishes microbatch m, it sends activations to stage k+1 and starts microbatch m+1.
After forward warm-up, start backward passes in reverse order. Use schedules that overlap forward and backward to improve utilization.
Adjust M or stage boundaries if profiling shows long bubbles or imbalances.
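The warm-up behavior can be illustrated with a toy schedule generator (forward passes only, uniform stage times assumed), showing where the (S − 1)-tick bubble comes from and how more microbatches shrink its share.

```python
# Simulated forward-only pipeline schedule (GPipe-style warm-up).
def forward_schedule(S, M):
    """Return, per tick, which microbatch each stage works on (None = idle)."""
    ticks = []
    for t in range(M + S - 1):         # total forward ticks = M + S - 1
        row = []
        for k in range(S):
            m = t - k                  # stage k runs k ticks behind stage 0
            row.append(m if 0 <= m < M else None)
        ticks.append(row)
    return ticks

S, M = 3, 8                            # 3 stages, 8 microbatches per batch
sched = forward_schedule(S, M)
busy = sum(x is not None for row in sched for x in row)   # S*M useful slots
total = len(sched) * S                                    # all stage-ticks
print(len(sched))                  # 10 ticks to finish all forwards
print(round(1 - busy / total, 3))  # idle ("bubble") fraction: 0.2
```

Raising M to 32 with the same 3 stages drops the idle fraction to 2/34 ≈ 0.06, which is the standard argument for many small microbatches; real schedulers also interleave backward passes, which this sketch omits.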
D) Implementing tensor parallelism for a linear layer (conceptually)
Row parallel:
Split the weight matrix by rows across R GPUs: W = [W1; …; WR].
Ensure input X is available on all R GPUs (replication/broadcast).
Each GPU computes Oi = WiX. Concatenate O = concat(Oi).
For backward, compute local dWi and participate in a communication step to form dX = Σ Wiᵀ dOi.
Column parallel:
Split the weight by columns across C GPUs: W = [W1 … WC]. Split input X accordingly: X = [X1; …; XC].
Each GPU computes Pi = WiXi locally.
Sum outputs O = Σ Pi across GPUs (a forward communication step).
Backward for dWi and dXi can be local; global communication needs are reduced compared to row-split.
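Both splits can be checked numerically with NumPy on a toy layer; the per-"GPU" slices here are just array views, and the communication steps are simulated by `concatenate` (row split) and `sum` (column split).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8                                  # tiny layer size, for clarity only
W = rng.normal(size=(d, d))
X = rng.normal(size=(d, 5))            # batch of 5 column vectors
O_ref = W @ X                          # single-GPU reference output

# Row parallel: each "GPU" holds a slice of W's rows plus the full (replicated) X.
W_rows = np.split(W, 2, axis=0)        # W = [W1; W2]
O_row = np.concatenate([Wi @ X for Wi in W_rows], axis=0)   # concat partials

# Column parallel: each "GPU" holds a slice of W's columns and the matching X rows.
W_cols = np.split(W, 2, axis=1)        # W = [W1 W2]
X_rows = np.split(X, 2, axis=0)        # X = [X1; X2]
O_col = sum(Wi @ Xi for Wi, Xi in zip(W_cols, X_rows))      # forward all-reduce

print(np.allclose(O_row, O_ref))       # row split: concatenation recovers O
print(np.allclose(O_col, O_ref))       # column split: summation recovers O
```

This makes the tradeoff concrete: the row split communicates nothing in forward (but must reduce dX in backward), while the column split pays its reduction up front in forward.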
Tips and Warnings
Batch and shard balance: Keep shards equal in size to ensure each GPU has similar work; imbalance creates stragglers.
Communication overlap: Start all-reduce as soon as parts of the gradient are ready to hide latency behind remaining computation.
Bottlenecks: Watch for parameter server saturation or a slow pipeline stage; fix by sharding servers or rebalancing stages.
Memory pressure: Pipeline increases activation memory if microbatches are large; tensor parallel reduces per-layer memory but adds communication overhead; find the right tradeoff.
Fault handling: Plan for worker restart or elastic training if possible; otherwise, a failed rank can stall collectives.
Topology awareness: Group ranks so that heavy communication occurs on the fastest links (e.g., keep tensor-parallel ranks on the same node when possible).
04 Examples
💡
Four-GPU Data Parallel Step: Suppose you have 4 GPUs and a batch of 4096 examples. You split the batch into four shards of 1024 each. Each GPU computes its gradient on its shard. An all-reduce operation sums the four gradients, every GPU divides by 4, and all update their weights identically.
💡
Parameter Server with 8 Workers: You run 1 parameter server holding θ and 8 GPU workers. Each worker computes its local gradient and sends it to the server. The server sums all gradients, applies the update, and broadcasts the new θ. Training continues even if one worker is temporarily down, but the server can become the traffic bottleneck.
💡
Sharded Parameter Servers for a Large Embedding: A huge embedding table is split across 2 servers, each storing half the rows. Workers route gradients for rows 0–49M to server A and rows 50M–99M to server B. Each server updates its shard and returns the updated weights. This reduces pressure on any single server but requires careful routing logic.
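The routing logic from this example can be sketched in a few lines; the server names, 100M-row table, and 50M split point come from the example above, while the function names and dict-based gradient format are hypothetical.

```python
# Route sparse embedding-row gradients to the parameter-server shard that owns
# the row: server "A" owns rows 0..49,999,999, server "B" owns the rest.
SHARD_BOUNDARY = 50_000_000

def server_for_row(row_id):
    """Pick which parameter-server shard owns this embedding row."""
    return "A" if row_id < SHARD_BOUNDARY else "B"

def route_gradients(row_grads):
    """Group sparse row gradients by destination server before pushing."""
    routed = {"A": {}, "B": {}}
    for row_id, grad in row_grads.items():
        routed[server_for_row(row_id)][row_id] = grad
    return routed

routed = route_gradients({12: 0.1, 49_999_999: 0.2, 50_000_000: 0.3, 99_000_000: 0.4})
print(sorted(routed["A"]))   # [12, 49999999]
print(sorted(routed["B"]))   # [50000000, 99000000]
```

Real systems often hash row IDs across shards instead of using contiguous ranges, which balances load better when access patterns are skewed.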
💡
Tree All-Reduce with 8 GPUs: GPUs form a binary tree. Leaves send values up to parents; parents sum and forward upward until the root holds the total. Then the root broadcasts the total back down so everyone receives it. The process completes in logarithmic steps up and down the tree.
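A toy simulation of this reduce-then-broadcast flow, with plain Python integers standing in for gradient tensors; a real implementation would move tensors with NCCL or MPI collectives.

```python
# Toy tree all-reduce over 8 "ranks": reduce up a binary tree, broadcast down.
def tree_all_reduce(values):
    n = len(values)
    vals = list(values)
    # Reduce phase: at distance d, rank i absorbs rank i + d (log2(n) rounds).
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            if i + d < n:
                vals[i] += vals[i + d]
        d *= 2
    # Broadcast phase: push the root's total back down the same tree.
    return [vals[0]] * n

print(tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # every rank gets 36
```

With 8 ranks this takes 3 reduce rounds and 3 broadcast rounds, matching the "logarithmic steps up and down the tree" described above.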
💡
Ring All-Reduce with 16 GPUs: GPUs form a ring, and gradients are split into 16 chunks. In reduce-scatter, chunks circulate for 15 steps with each GPU adding incoming chunks to local sums. In all-gather, reduced chunks circulate so every GPU ends up with all chunks. The steady stream keeps links busy and often scales well.
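The two phases can be simulated in plain Python (shown with 4 ranks rather than 16, for brevity); nested lists stand in for per-rank gradient chunks, and each inner loop models one simultaneous hop around the ring.

```python
# Toy ring all-reduce over N ranks with N chunks each: reduce-scatter then
# all-gather, each phase taking N - 1 hops. Models the data movement only.
def ring_all_reduce(grads):
    N = len(grads)                        # ranks == chunks, one chunk per rank
    chunks = [list(g) for g in grads]     # chunks[rank][chunk_index]
    # Reduce-scatter: after N-1 steps, rank r owns the full sum of chunk (r+1) % N.
    for s in range(N - 1):
        for r in range(N):
            c = (r - s) % N               # chunk rank r forwards this step
            chunks[(r + 1) % N][c] += chunks[r][c]
    # All-gather: circulate the completed chunks until everyone has all of them.
    for s in range(N - 1):
        for r in range(N):
            c = (r + 1 - s) % N
            chunks[(r + 1) % N][c] = chunks[r][c]
    return chunks

grads = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4 ranks, 4 chunks
out = ring_all_reduce(grads)
expected = [sum(g[c] for g in grads) for c in range(4)]
print(all(row == expected for row in out))   # every rank holds the summed gradient
```

Note that each rank sends and receives one chunk per hop, so link utilization stays constant regardless of N; this is why ring all-reduce uses bandwidth so efficiently at large scale.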
💡
Choosing All-Reduce by Scale: With 4 GPUs on a single node, a tree-based algorithm often finishes quickly due to low step count. With 64 GPUs across multiple nodes, ring-based reduction may outperform because it uses bandwidth efficiently over many hops. Libraries typically auto-select using topology hints. This choice impacts your step time directly.
💡
Three-Stage Pipeline Parallelism: A transformer with 36 layers is split into 3 stages of 12 layers each across 3 GPUs. A batch is split into 8 microbatches that move through the pipeline. After initial warm-up, all 3 GPUs work simultaneously on different microbatches. At the end, backward flows in reverse, also overlapped to keep stages busy.
💡
Pipeline Bubble Example: Stage 2 contains more compute (e.g., larger attention heads) than stages 1 and 3. During steady state, stages 1 and 3 sometimes wait for stage 2 to finish, creating bubbles. Repartitioning layers or increasing microbatches reduces the idle time. The result is higher overall throughput.
💡
Row-Parallel Linear Layer: A 4096×4096 weight matrix is split into two 2048-row slices across 2 GPUs. Each GPU multiplies its slice with the full input X and produces a 2048×batch output. The two outputs are concatenated to form the full 4096×batch result. Backprop then requires a reduction to compute gradients for X.
💡
Column-Parallel Linear Layer: The same 4096×4096 weight matrix is split into two 2048-column slices, and the input is split to match. Each GPU multiplies its weight slice by its input slice to produce a partial 4096×batch output. The partial outputs are summed across GPUs for the final result. Backward can be handled mostly locally.
💡
Combining Data + Pipeline + Tensor: Each data-parallel replica is a 4-stage pipeline across 4 GPUs; inside each stage, heavy linear layers are split across 2 tensor-parallel GPUs. During a step, gradients are all-reduced inside each data-parallel group, activations move between stages, and tensor-parallel ranks exchange partial results. This composition trains a model far too large for one GPU while maintaining good throughput.
💡
Fault Tolerance Scenarios: In a parameter-server setup, if a worker fails, the server can still update using remaining gradients or wait until the worker rejoins. In all-reduce, a failed rank can stall the operation unless the framework supports elastic reconfiguration. With a ring topology, a broken link can sometimes be bypassed by rerouting, though performance declines. Planning for failures avoids long downtime.
💡
Topology-Aware Grouping: On a machine with 8 GPUs connected via a fast switch, you place tensor-parallel groups within the node for best bandwidth. Across nodes, you use hierarchical collectives: first reduce within a node, then reduce across nodes. This reduces cross-node traffic and speeds up gradient aggregation. The outcome is faster, more stable scaling.
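A NumPy sketch of this two-level reduction, assuming 2 nodes with 4 GPUs each; summing along each array axis stands in for the intra-node and inter-node collectives.

```python
import numpy as np

# Toy hierarchical all-reduce: sum within each node first, then across nodes,
# then broadcast back. Only one value per node crosses the slow inter-node link.
rng = np.random.default_rng(3)
nodes, gpus_per_node = 2, 4
grads = rng.normal(size=(nodes, gpus_per_node, 6))   # one gradient per GPU

intra = grads.sum(axis=1)    # step 1: fast intra-node reduction (one per node)
inter = intra.sum(axis=0)    # step 2: slow inter-node reduction (1 value/node)
result = np.broadcast_to(inter, grads.shape)   # step 3: broadcast back to all GPUs

# Matches a flat all-reduce over all 8 GPUs.
print(np.allclose(result[0, 0], grads.reshape(-1, 6).sum(axis=0)))
```

The payoff is in traffic counts: a flat reduction sends 8 gradient-sized messages over the network, while the hierarchical version sends only one per node across the slow link.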
💡
Batch and Microbatch Tuning: You experiment with per-GPU batch size to avoid out-of-memory while keeping high utilization. For pipelines, you vary microbatch count to find the sweet spot with minimal bubbles. Profiling shows that 16 microbatches per batch balances memory and utilization well. Overall step time decreases and throughput increases.
05 Conclusion
Training large language models efficiently requires splitting work across many GPUs. This lecture presented two main ways to do that: data parallelism and model parallelism. Data parallelism keeps a full copy of the model on each GPU and shards the data, then combines gradients, either with a parameter server or an all-reduce. All-reduce avoids a central bottleneck and typically scales better; tree and ring topologies each shine under different GPU counts and network layouts. When the model itself is too big to fit on one GPU, model parallelism is essential. Pipeline parallelism splits layers into stages that run like an assembly line, while tensor parallelism splits heavy layers themselves by rows or columns. Row parallelism avoids forward communication but needs backward communication; column parallelism flips that tradeoff. In practice, you often combine data, pipeline, and tensor parallelism to meet memory limits and achieve strong throughput.
To put this into practice, start by checking whether your model fits in a single GPU. If it does, use data parallelism with an efficient all-reduce and confirm parameters stay synchronized. If your model doesn't fit, add pipeline parallelism to partition layers and use microbatches to fill the pipeline, then add tensor parallelism to split the heaviest layers. Balance pipeline stages to reduce bubbles, and choose microbatch counts that fit memory. Profile your runs to find communication bottlenecks and to verify that compute overlaps with communication. For resilience, plan for failures: parameter servers can keep going with fewer workers, while all-reduce may need elastic strategies.
Next steps include learning more about topology-aware collectives, hierarchical reductions across nodes, and automated partitioners that solve the graph partitioning problem for you. Explore how scheduling policies (like 1F1B) affect memory and throughput, and how different parallelisms interact with optimizer states and checkpointing. As you progress, you'll gain the ability to design hybrid parallel strategies tailored to your model size, hardware layout, and training goals.
The core message is simple: the math of gradient addition is straightforward, but performance depends on smart communication and partitioning. Data parallelism speeds up training when the model fits; model parallelism makes massive models trainable. Choosing and combining these methods thoughtfully is the key to scaling language model training efficiently and reliably.
✓ Use tensor parallelism to split heavy layers along rows or columns. Row splits avoid forward communication but require backward reductions and input replication. Column splits need forward summation but keep backward more local and avoid full input replication. Pick based on where your communication budget fits best.
✓ Overlap communication with computation wherever possible. Launch gradient all-reduces as soon as partial gradients are ready. Pipeline activations while earlier layers finish computing. This keeps GPUs busy and reduces perceived latency.
✓ Tune batch size and microbatch count for your memory and throughput goals. Larger per-GPU batches may raise utilization but risk out-of-memory. More microbatches reduce pipeline bubbles but increase overhead. Find the balance using short, controlled experiments.
✓ Be topology-aware in grouping GPUs for communication. Keep the heaviest communication (like tensor-parallel exchanges) on the fastest links. Use hierarchical collectives across nodes. This minimizes slow cross-node traffic.
✓ Monitor for bottlenecks and fix them systematically. If a parameter server saturates, shard; if a pipeline stage is slow, rebalance or adjust microbatches. If all-reduce dominates time, test alternative algorithms or hierarchies. Continuous profiling guides effective changes.
✓ Plan for failures and long runs. Understand how your chosen method reacts to node or link loss. Use retry and restart strategies, and consider elastic training features if available. Avoid single points of failure when possible.
✓ Combine data, pipeline, and tensor parallelism for best results on giant models. Data parallelism increases throughput, pipeline parallelism fits depth, and tensor parallelism fits width. The right mix depends on model shape and hardware. Iteratively refine the design as model size changes.
✓ Keep replicas in sync by averaging gradients and applying identical optimizer steps. Even tiny drifts can cause divergence. Verify synchronization points are correct and that random seeds or dropout states are handled consistently. Consistency ensures stable, reproducible training.
✓ Use balanced shards and stages to avoid idle time. Uneven data shards or lopsided stage assignments cause waiting and waste. Rebalance when GPU utilization varies widely. Better balance yields immediate throughput gains.
Tree All-Reduce
An all-reduce algorithm that arranges GPUs in a tree. Children send values to parents, and the root gets the total; then the total is broadcast back down. It can be fast with small numbers of GPUs. The number of steps grows like the height of the tree.
Ring All-Reduce
An all-reduce algorithm that arranges GPUs in a ring. Partial results are passed around the ring and combined at each hop. It often performs well with many GPUs by keeping links busy. It can continue if traffic is rerouted around a failed link, though more slowly.
Gradient Aggregation
The process of combining gradients computed on different GPUs into one global gradient. Usually done by summing and then averaging. It ensures all model replicas perform the same parameter update. Without it, each replica would drift to a different model.
Shard
A piece of a larger whole split up for easier handling. In training, you can shard data (split a dataset) or shard a model (split parameters). Shards let many GPUs work in parallel and share results later. Balanced shards prevent idle time.