•GPUs (Graphics Processing Units) are critical for deep learning because they run thousands of simple math operations at the same time. Language models like Transformers rely on huge numbers of matrix multiplications, which are perfect for parallel processing. CPUs have a few strong cores for complex, step-by-step tasks, while GPUs have many simpler cores for doing lots of math in parallel. Using GPUs correctly can make training and inference dramatically faster.
•Historically, GPUs were built to draw graphics by processing many pixels, textures, and vertices in parallel. Those same traits—parallel math and high throughput—also fit deep learning workloads. This shift from graphics-only to general-purpose computing is why CUDA exists. CUDA lets programs run non-graphics math on NVIDIA GPUs efficiently.
•CPUs focus on low latency and complex control, with features like large caches and branch prediction. GPUs focus on high throughput, sacrificing some single-task flexibility for massive parallel execution. A helpful analogy: CPUs are Formula One race cars—fast and precise on tricky tracks; GPUs are fleets of trucks—slower individually but able to move huge loads together. This difference explains why deep learning thrives on GPUs.
•GPUs commonly follow a SIMD (Single Instruction, Multiple Data) style: the same instruction runs across many data elements at once. In practice, tasks are split into threads, grouped into blocks, and arranged into grids to cover the full workload. Each thread does a tiny piece of the job, enabling huge parallelism. This structure is what makes array and matrix operations lightning fast on GPUs.
•Deep learning frameworks like PyTorch and TensorFlow wrap GPU details so you don’t write CUDA directly. In PyTorch, you check for a GPU and move models and tensors to it with .to(device). Example: device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model.to(device); data = data.to(device). Note that tensor .to() returns a new tensor, so you must assign the result, while module .to() moves parameters in place. Once data and models are on the GPU, operations run there automatically.
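A minimal, runnable sketch of this pattern (the linear model and tensor shapes are placeholders for illustration; it falls back to the CPU when no GPU is visible):

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(16, 4).to(device)    # module .to() moves parameters in place
data = torch.randn(8, 16).to(device)   # tensor .to() returns a NEW tensor

output = model(data)                   # runs on the GPU when one is available
print(output.device)
```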
Why This Lecture Matters
This content matters to anyone training or deploying modern deep learning models: ML engineers, data scientists, researchers, and students building LLMs or other neural networks. Without GPUs, training times are too long and costs too high, making many real projects impractical. Knowing how to place models and data on the GPU, manage batch sizes, and use mixed precision translates directly into faster experiments and lower cloud bills. For teams scaling training, understanding data parallelism and DDP avoids common pitfalls and unlocks near-linear speedups across multiple GPUs. Monitoring with nvidia-smi and fixing bottlenecks ensures hardware dollars become real performance. This knowledge helps you be effective in industry roles where training efficiency is a core metric, from startups to big labs. In today’s AI landscape, the ability to use GPUs well is a baseline skill, and mastering these patterns is a career accelerator. It lets you take advantage of the hardware that underpins state-of-the-art models and deliver results faster and more reliably.
Lecture Summary
01 Overview
This lecture teaches how and why GPUs power modern deep learning, especially large language models (LLMs) built on Transformers. It begins with a brief recap of language modeling fundamentals—perplexity as an evaluation metric, classic models like n-grams and RNNs, and the modern Transformer architecture (self-attention, multi-head attention, feed-forward layers, residual connections, normalization, and positional encodings). With that foundation, the lecture explains the hardware side: why the math inside Transformers (heavy matrix multiplications and additions) maps perfectly to GPU parallelism. You learn the history of GPUs, originally made for graphics rendering, and how their architecture—many simpler cores optimized for throughput—differs from CPUs, which have fewer, stronger cores focused on low latency and complex control flow.
The target audience is students with basic deep learning understanding who are ready to run models efficiently. You should know core training loops (forward, backward, optimization), tensors, and the basics of PyTorch, along with an idea of what matrix multiplications and attention do in a Transformer. No prior CUDA programming is required; you’ll rely on high-level frameworks. Beginners can follow the analogies and code snippets, while practitioners will benefit from best practices and performance tips.
After this lecture, you will be able to: decide when and why to use GPUs, understand the high-level GPU execution model (SIMD, threads, blocks, grids), place PyTorch models and tensors on the GPU correctly, minimize slow CPU↔GPU transfers, scale from one to many GPUs using data parallelism, and unlock performance with batch sizing, mixed precision (torch.cuda.amp), and gradient accumulation. You will also learn to monitor your GPU with nvidia-smi and identify common bottlenecks like small batch sizes or excessive data movement.
The lecture is structured to connect concepts end-to-end. First, it links the math of deep learning to GPU strengths by contrasting CPUs and GPUs using a clear analogy: CPUs are like Formula One race cars (few, very fast, great at complex maneuvers) while GPUs are like fleets of trucks (many, slower individually, but can move a lot at once). Next, it demystifies GPU execution using SIMD and the hierarchy of threads, blocks, and grids. Then, it becomes practical: using CUDA-enabled GPUs in PyTorch with simple device selection and .to(device) placement. The lecture emphasizes best practices: use the largest batch that fits, reduce data transfers, train with mixed precision for speed and memory savings, and use gradient accumulation for effective large batches without VRAM overflow. Finally, it covers multi-GPU strategies—data parallelism (DataParallel vs DistributedDataParallel) and model parallelism for models too large for a single device—and closes with monitoring tools like nvidia-smi and TensorBoard. The overarching message is that GPUs are essential, but thoughtful usage and monitoring are what deliver real-world speedups.
Key Takeaways
✓Always verify your setup before coding optimizations. Run torch.cuda.is_available() and nvidia-smi to confirm the GPU is visible and drivers are correct. If these checks fail, fix the environment first instead of debugging training loops. A solid foundation saves hours later.
✓Place model and tensors on the same device. Call model.to(device) once, move inputs with inputs = inputs.to(device) at the start of each step (tensor .to() is not in-place), and avoid mixing CPU and GPU tensors in the same operation. Device mismatches cause errors or slow implicit transfers. Keep the whole forward/backward/step on the GPU.
✓Feed the GPU with the largest batch size that fits. Start big and shrink if you hit out-of-memory; watch nvidia-smi for memory usage. Larger batches improve throughput and utilization. If VRAM is tight, use gradient accumulation to hit your target effective batch size.
✓Enable mixed precision early. Use torch.cuda.amp.autocast and GradScaler to speed up training and cut memory use with minimal code changes. Monitor loss curves to ensure stability. On modern NVIDIA GPUs, this is usually a free win.
✓Minimize CPU↔GPU transfers. Keep tensors on the GPU throughout the training step, and avoid frequent .cpu() calls in the hot path. Use pin_memory=True and non_blocking=True to speed copies when needed. Fewer transfers mean higher GPU utilization.
✓Optimize the data pipeline. Increase DataLoader num_workers, prefetch batches, and move heavy preprocessing out of the training step. An underfed GPU is wasted compute. Measure utilization while tuning these knobs.
✓Prefer DDP over DataParallel for multi-GPU. Launch one process per GPU with torchrun and use DistributedSampler to shard the dataset. DDP scales better and overlaps communication with computation. Use DataParallel only for quick tests.
Glossary
GPU (Graphics Processing Unit)
A special computer chip designed to handle many simple math tasks in parallel. It acts like a big team of workers each doing small pieces at the same time. GPUs are great for deep learning because models need lots of repeated math. They trade single-task flexibility for massive speed with many tasks.
CPU (Central Processing Unit)
The main processor in a computer that handles general tasks and complex decisions. It is very good at doing things in order and changing plans quickly. CPUs have a few very strong cores and large caches to keep data close. They are great at controlling programs and preparing work for GPUs.
Core
A single processing unit inside a CPU or GPU. CPUs have a few strong cores; GPUs have many simpler cores. More cores mean more tasks can run at once. GPU cores do simple math in parallel over large datasets.
Throughput
How much work is finished in a given time. GPUs are designed for high throughput by doing many tasks at once. Higher throughput means more samples or more math per second. It’s different from latency, which is about how fast a single task finishes.
Latency
How long a single task takes from start to finish. CPUs are optimized for low latency, completing individual operations quickly. GPUs accept higher per-task latency in exchange for overall throughput. It’s the counterpart of throughput, which measures total work completed per second.
•Transferring data between CPU and GPU is relatively slow, so minimizing transfers is key. Load and preprocess data efficiently, and keep tensors on the GPU as long as possible. Frequent host-device copies can bottleneck otherwise fast GPU compute. Aim to stage the entire training step on the GPU end-to-end when feasible.
•Multiple GPUs don’t speed up your code automatically—you must program for it. Data parallelism splits a batch across GPUs so each one processes a shard and gradients are synchronized. In PyTorch, DataParallel is simple but slower; DistributedDataParallel (DDP) is the recommended, scalable approach. Model parallelism splits the model across GPUs and is used when the model cannot fit on a single GPU.
•Batch size strongly affects GPU utilization and memory use. Larger batches can increase throughput by feeding the GPU more work per step, but they consume more VRAM. The practical rule is to use the largest batch size that fits in memory without OOM (out-of-memory) errors. If VRAM is tight, use gradient accumulation to simulate larger batches.
•Mixed precision training uses 16-bit (FP16) and 32-bit (FP32) numbers together. FP16 is faster and uses less memory, but it can lose numerical accuracy on some operations. PyTorch’s torch.cuda.amp automates when to use FP16 vs FP32 to balance speed and stability. This often gives large speedups on modern GPUs with Tensor Cores.
•Gradient accumulation lets you simulate a larger batch size by summing gradients over multiple mini-batches before updating weights. This reduces VRAM pressure while preserving the signal of a big batch. It’s a simple loop change: call loss.backward() several times and then optimizer.step() once. Combine it with mixed precision for even better efficiency.
•Monitoring GPU usage is essential to know if you’re efficient. Use nvidia-smi to view GPU utilization, memory usage, and temperature in real time. Low utilization suggests bottlenecks like tiny batch sizes, excessive data transfer, or slow data loading. Tools like TensorBoard can also track device stats during training.
•Recap of prerequisites: understanding Transformers, attention, and training loops makes the GPU content meaningful. The lecture connects algorithmic needs (matrix multiplications, attention) to hardware realities (parallel compute, memory bandwidth). Mastering device placement, parallelism, and precision unlocks the performance of modern LLM training. The final message: GPUs are necessary, but using them well is what truly accelerates your work.
02 Key Concepts
01
What makes GPUs essential: GPUs are specialized processors that run many simple operations at once, perfect for deep learning’s heavy math. It’s like hiring a whole team of helpers to carry many boxes together instead of one strong person doing trips alone. Technically, GPUs maximize throughput by parallelizing operations like matrix multiplications across thousands of cores. Without them, training would be painfully slow and often impractical for large models. For example, training a Transformer on a CPU could take weeks, whereas the same job on a GPU may complete in days or hours.
02
CPU vs GPU mental model: A CPU has a few powerful cores optimized for complex, step-by-step tasks with low latency. Think of a Formula One car—extremely fast and precise on complicated tracks, but it carries very little cargo. GPUs have many simpler cores focused on throughput, like a fleet of trucks moving huge loads in parallel. Technically, CPUs prioritize branch prediction and large caches, while GPUs simplify control logic to allocate silicon to more cores. In practice, Transformers prefer trucks over race cars because the workload is massive and parallelizable.
03
SIMD (Single Instruction, Multiple Data): SIMD means executing the same instruction over many data items simultaneously. Imagine telling 1,000 students to each add two numbers at the same time, instead of one student doing them one by one. The GPU broadcasts one instruction to many threads that each handle different elements (like array positions). Without SIMD-style execution, array and matrix operations would be limited by sequential loops. For example, adding two arrays element-wise is naturally a SIMD operation.
04
Threads, blocks, and grids: A thread is the smallest unit of GPU work, threads form blocks, and blocks form a grid for the full task. Picture a city (grid) divided into neighborhoods (blocks) filled with workers (threads) each doing a tiny task. Technically, the kernel (the function you run on the GPU) is launched over a grid, and the runtime schedules threads on GPU cores. This hierarchy organizes parallelism and allows scaling to large datasets. For instance, summing a 1M-element array might use thousands of threads grouped into many blocks.
05
Why deep learning maps well to GPUs: DL layers like matmuls and convolutions boil down to many independent multiply-adds. It’s like tiling a floor with millions of identical tiles—perfect for a big crew of identical workers. Technically, these operations are batched, vectorized, and expressed as kernels that saturate GPU cores and memory bandwidth. Without GPUs, the compute and memory demands of Transformers would be prohibitive. A single attention head performing QK^T, softmax, and weighted sums is dominated by matmuls ideal for GPUs.
06
CUDA and OpenCL at a glance: CUDA is NVIDIA’s toolkit for programming GPUs; OpenCL is an open standard across vendors. Think of CUDA like a well-paved highway specific to NVIDIA vehicles, while OpenCL is a general road network for many makes. In practice, deep learning frameworks often run fastest on CUDA with NVIDIA cards. Without these toolkits, developers would struggle to access GPU compute from high-level languages. PyTorch and TensorFlow hide most CUDA details but rely on it under the hood.
07
PyTorch device placement: In PyTorch, you choose a device and move models and tensors there with .to(device). It’s like moving your tools and materials into a workshop so all the work happens in one place. Technically, once tensors are on 'cuda:0', operations are enqueued on that GPU and run asynchronously. Forgetting to move data to the same device as the model causes errors or slow host-device copies. For example: device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model.to(device); inputs = inputs.to(device).
08
Minimizing CPU↔GPU transfers: Transfers across PCIe are slow compared to on-device compute. It’s like sending items by a narrow bridge—crossings are costly, so you want to carry big loads infrequently. Technically, host-to-device and device-to-host copies block or slow the pipeline and reduce GPU utilization. Too many transfers can make a powerful GPU sit idle. For instance, keeping preprocessing and batching on the CPU while repeatedly sending small tensors can bottleneck training.
09
Batch size and utilization: Batch size is how many samples you process per step; bigger batches feed more parallel work to the GPU. It’s like giving a factory a full truckload to process rather than a handful at a time. Technically, larger batches improve arithmetic intensity and kernel efficiency but consume more VRAM. Choose the largest batch that fits without out-of-memory errors to keep utilization high. If memory is tight, use gradient accumulation to simulate larger effective batches.
10
Mixed precision training (torch.cuda.amp): Mixed precision uses FP16 where safe and FP32 where needed. Imagine using a smaller, lighter measuring cup when exactness isn’t critical, switching to a larger precise one for delicate measurements. Technically, autocast picks lower precision for many ops, while a GradScaler prevents underflow in gradients. This saves memory and speeds up compute on GPUs with specialized hardware. For example, you can wrap forward and loss in autocast() and use GradScaler for backward and step.
11
Gradient accumulation: This means summing gradients over several mini-batches before updating weights. It’s like saving coins over several days before making one purchase, rather than paying every day. Technically, you call loss.backward() multiple times and delay optimizer.step() and zero_grad() until after N mini-batches. This allows large effective batch sizes with limited VRAM. For example, accumulating over 4 steps turns a batch of 32 into an effective batch of 128.
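A sketch of the accumulation loop under these assumptions (toy linear model and random data; N = 4 turns mini-batches of 32 into an effective batch of 128):

```python
import torch
import torch.nn as nn

# Toy setup; the model, data, and learning rate are illustrative placeholders.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
N = 4  # accumulation steps: 4 mini-batches of 32 -> effective batch of 128

for step in range(8):
    x = torch.randn(32, 16, device=device)
    y = torch.randn(32, 1, device=device)
    # Divide by N so the accumulated gradient matches the mean over a big batch.
    loss = nn.functional.mse_loss(model(x), y) / N
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % N == 0:
        optimizer.step()                   # one update per N mini-batches
        optimizer.zero_grad(set_to_none=True)
```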
12
Multi-GPU data parallelism: Data parallelism splits each batch across multiple GPUs and averages gradients. Think of four kitchens cooking the same recipe on different portions of the meal, then combining the dishes. Technically, each replica processes a shard and gradients are synchronized (all-reduced) before the optimizer step. PyTorch’s DataParallel is simple but has overhead; DistributedDataParallel (DDP) is faster and the recommended approach. Without data parallelism, you can’t scale throughput across GPUs easily.
13
Model parallelism for oversized models: Model parallelism spreads different parts of a single model across GPUs. It’s like building a long assembly line where each station handles a different stage. Technically, layers are placed on different devices and tensors are passed between them. This is necessary when a single GPU’s memory can’t hold the whole model. For example, you might put the encoder on one GPU and the decoder on another for a very large network.
14
Monitoring with nvidia-smi and TensorBoard: nvidia-smi shows utilization, memory, and temperature so you can spot idle GPUs. It’s like a car dashboard for your GPU’s speed and fuel usage. Technically, you want high utilization (close to 100%) and stable memory levels during training. Low utilization reveals bottlenecks like I/O, small batches, or poor parallel scaling. TensorBoard can log and visualize device metrics over time to guide tuning.
15
Why CPUs still matter: CPUs handle control-heavy logic, data loading, and orchestration. They’re the race cars that make split-second decisions and set up work for the trucks. Technically, CPUs run the Python runtime, data pipelines, and launch GPU kernels. If the CPU or data loader is slow, the GPU may starve. For instance, a non-optimized DataLoader can cause the GPU to wait between steps.
16
PyTorch code pattern for single GPU: The standard flow is select device, move model and data, run forward, compute loss, backward, and step. It’s like setting up your station, doing the work, and cleaning up—every time. Technically, tensors and model parameters must share the same device. Mixing CPU and GPU tensors in ops throws errors or triggers slow copies. A minimal template keeps everything on 'cuda:0' during the step.
17
Choosing DataParallel vs DistributedDataParallel: DataParallel uses one process with threads and scatters batches internally; it’s easier but less efficient. DDP uses one process per GPU, reducing GIL and scheduling overhead. Think of DDP as each cook in their own kitchen versus many cooks cramped in one kitchen. Technically, DDP overlaps communication with computation and scales better across nodes. Use DDP for serious multi-GPU training.
18
Practical signs of underutilization: If nvidia-smi shows low utilization and memory swings, you’re likely bottlenecked elsewhere. It’s like workers waiting around for materials. Technically, small batches, too many host-device copies, slow data loading, or sync points can stall the GPU. Fixes include increasing batch size, pinning memory, prefetching, and reducing CPU-GPU chatter. Profiling helps confirm which step is slow.
19
Historical context of GPU for DL: GPUs were built for graphics—transforming vertices, texturing, and shading millions of pixels. That workload mirrors deep learning’s parallel math needs. When people realized this, they adapted GPUs for general compute via CUDA. Now, modern DL frameworks wrap these capabilities with high-level APIs. This history explains why NVIDIA tooling dominates DL performance.
03 Technical Details
Overall Architecture/Structure
Where GPUs fit in the deep learning stack:
At the top level, you write training code in Python using a high-level framework like PyTorch. Your model uses layers (Linear, Attention, LayerNorm) that internally call efficient kernels (compiled routines) for operations such as matrix multiplications and convolutions.
These kernels are implemented in libraries like cuBLAS (for BLAS-level matrix ops) and cuDNN (for deep neural network primitives) built on CUDA. CUDA is the driver/toolkit that allows code to run on NVIDIA GPUs.
The GPU executes kernels across many cores in parallel. Data sits in GPU memory (VRAM). Transfers between CPU memory (RAM) and GPU memory occur over a bus such as PCIe; these transfers are relatively slow compared to on-GPU compute.
Efficient training means: (a) keep data and model on the GPU to avoid transfers, (b) batch work to exploit parallelism, (c) use mixed precision to reduce memory and speed up math, and (d) scale across multiple GPUs when needed.
CPU vs GPU roles in a training loop:
CPU: loads data (often via DataLoader), performs light preprocessing, launches GPU kernels, handles logging and checkpointing.
GPU: performs the heavy math in the forward and backward passes for tensors placed on it.
If the CPU or data pipeline cannot keep up, the GPU will be underutilized; the goal is to pipeline work so the GPU is busy most of the time.
SIMD (Single Instruction, Multiple Data) describes the basic idea: apply the same instruction to many data elements simultaneously. In CUDA, this is realized by launching many threads that execute the same kernel (function) with different indices referencing different portions of data.
Threads are grouped into blocks, and blocks are organized into a grid. You specify the grid and block sizes when launching a kernel (frameworks do this internally). The runtime schedules blocks on the GPU’s streaming multiprocessors (SMs) so many threads run concurrently.
In deep learning, you seldom write custom kernels; you rely on PyTorch and vendor libraries that pick appropriate launch parameters. But conceptually, each element or tile of a matrix multiplication can be processed by different threads, enabling massive parallelism.
Data flow for a typical training step (single GPU):
Step A: Dataloader on CPU reads a batch and optionally applies augmentations. Ideally, it uses multiple workers, pinned memory, and prefetching to overlap with GPU compute.
Step B: The batch tensor is transferred once to the GPU (inputs = inputs.to('cuda:0')). The model (model.to('cuda:0')) already resides on the GPU.
Step C: Forward pass runs: matmuls, attention, activations—all as GPU kernels. Intermediate tensors remain on the GPU.
Step D: Loss is computed on the GPU; backward() computes gradients via automatic differentiation, launching more kernels.
Step E: Optimizer.step() updates parameters (often on GPU-resident tensors). Optionally, mixed precision scaling adjusts gradients before the update.
Step F: Only necessary outputs (like scalar losses for logging) are brought back to the CPU; avoid copying large tensors back.
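The steps above can be sketched as a single runnable training step (the model, data, and hyperparameters are toy placeholders for illustration):

```python
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(64, 16)             # Step A: batch produced on the CPU
targets = torch.randint(0, 4, (64,))

inputs = inputs.to(device)               # Step B: one transfer per batch
targets = targets.to(device)

outputs = model(inputs)                  # Step C: forward runs as GPU kernels
loss = criterion(outputs, targets)       # Step D: loss and backward stay on device
loss.backward()
optimizer.step()                         # Step E: update GPU-resident parameters
optimizer.zero_grad(set_to_none=True)

print(loss.item())                       # Step F: copy back only the scalar loss
```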
Code/Implementation Details (PyTorch)
Device selection and placement:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
target = target.to(device)
All tensors used together must be on the same device; mixing devices triggers errors or implicit (slow) copies. Check torch.cuda.is_available() to guard against missing GPU.
Single-GPU training loop (with best practices):
Create model and move to GPU.
Create optimizer (e.g., AdamW) after moving model so optimizer references GPU parameters.
In each iteration:
Load batch from DataLoader (CPU); ensure num_workers > 0 for speed, and pin_memory=True so host-to-device copies are faster.
inputs = inputs.to(device, non_blocking=True) if using pinned memory to overlap copy with compute.
With torch.cuda.amp.autocast(enabled=True): run forward, compute loss.
scaler.scale(loss).backward() to handle FP16 gradients safely.
The scaler (GradScaler) prevents underflow by dynamically adjusting scaling of the loss/gradients in mixed precision.
Mixed precision snippet:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
This pattern uses FP16 where safe and FP32 when necessary, often delivering speedups and lower memory use on modern NVIDIA GPUs.
Gradient accumulation pattern:
Suppose you want an effective batch size of B_eff but can only fit B = B_eff / N per mini-batch:

for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / N  # scale so the summed gradient matches the big batch
    loss.backward()
    if (step + 1) % N == 0:
        optimizer.step()
        optimizer.zero_grad()
With AMP: wrap forward in autocast() and use scaler for backward/step; call scaler.step()/update() every N mini-batches.
This approach accumulates gradients across N mini-batches to replicate the effect of a larger batch.
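A combined AMP-plus-accumulation sketch under these assumptions (toy model and data; enabled=torch.cuda.is_available() is an added fallback so the pattern degrades to plain FP32 on CPU-only machines):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=torch.cuda.is_available())
N = 4  # accumulation steps

for step in range(8):
    x = torch.randn(32, 16, device=device)
    y = torch.randn(32, 1, device=device)
    with autocast(enabled=torch.cuda.is_available()):
        loss = nn.functional.mse_loss(model(x), y) / N
    scaler.scale(loss).backward()         # scaled backward every mini-batch
    if (step + 1) % N == 0:
        scaler.step(optimizer)            # one optimizer step per N mini-batches
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```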
Multi-GPU with DataParallel vs DistributedDataParallel:
torch.nn.DataParallel(model) replicates the model on multiple GPUs in one process, splits input across devices, gathers outputs back. It’s easy to use but can have overhead on the main device and is generally slower.
torch.nn.parallel.DistributedDataParallel (DDP) launches one process per GPU (commonly via torchrun; the older torch.distributed.launch is deprecated). Each process holds a model replica bound to a single GPU, processes a shard of the batch, and gradients are all-reduced across processes. DDP overlaps computation and communication and scales much better.
Sketch with DDP (single node, multiple GPUs):
torchrun --nproc_per_node=NUM_GPUS train.py
Inside train.py: initialize process group, set local_rank GPU via environment variable, move model to that device, wrap with DDP(model, device_ids=[local_rank]). Use DistributedSampler on the dataset so each process sees unique data.
Key detail: set find_unused_parameters carefully if your graph has conditional branches; unnecessary settings can slow training.
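A minimal single-node skeleton along these lines (the CPU/'gloo' fallback and toy model are assumptions added so the sketch can be exercised without a GPU; a real job would launch it with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun provides these; defaults let the sketch run as one CPU process.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))

    use_cuda = torch.cuda.is_available()
    backend = 'nccl' if use_cuda else 'gloo'   # nccl for GPUs, gloo for CPU
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if use_cuda:
        local_rank = int(os.environ.get('LOCAL_RANK', 0))
        torch.cuda.set_device(local_rank)
        device = torch.device('cuda', local_rank)
        model = DDP(nn.Linear(16, 4).to(device), device_ids=[local_rank])
    else:
        device = torch.device('cpu')
        model = DDP(nn.Linear(16, 4))          # CPU replica, no device_ids

    out = model(torch.randn(8, 16, device=device))
    dist.destroy_process_group()
    return out.shape

if __name__ == '__main__':
    main()
```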
Model parallelism basics:
When the model doesn’t fit on one GPU, split layers across devices: e.g., encoder on cuda:0, decoder on cuda:1. The forward pass moves tensors between devices at boundary layers. PyTorch lets you assign submodules to devices and manage the transfers.
Model parallelism is more complex: you must manage inter-device communication and ensure that sequential dependencies are respected. Pipeline parallelism (staging micro-batches through model partitions) can increase utilization by keeping all devices busy, but requires orchestration.
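A toy sketch of splitting a model across two devices (the layer sizes are placeholders, and falling back to the CPU for both halves when fewer than two GPUs are present is an added assumption so the sketch runs anywhere):

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise both "devices" are the CPU.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device('cuda:0') if two_gpus else torch.device('cpu')
dev1 = torch.device('cuda:1') if two_gpus else torch.device('cpu')

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 64).to(dev0)  # first half on device 0
        self.decoder = nn.Linear(64, 8).to(dev1)   # second half on device 1

    def forward(self, x):
        h = self.encoder(x.to(dev0))
        return self.decoder(h.to(dev1))            # explicit transfer at the boundary

out = TwoDeviceNet()(torch.randn(4, 32))
```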
Data transfer costs and memory considerations:
Host (CPU RAM) ↔ Device (GPU VRAM) transfers traverse a bus like PCIe and are much slower than on-device memory operations. Minimize transfers by keeping data and intermediate tensors on the GPU during the forward/backward/step.
Use pin_memory=True in DataLoader and non_blocking=True in .to() calls to speed up transfers and overlap copy with compute.
Monitor VRAM with nvidia-smi; if you’re close to the limit, consider smaller batch sizes, mixed precision, gradient checkpointing (if available), or gradient accumulation.
Monitoring and diagnosing performance:
nvidia-smi shows GPU utilization (%), memory (MiB), temperature, and running processes. High utilization (near 100%) indicates the GPU is busy; low utilization suggests bottlenecks (data loading, small batches, frequent sync points).
Watch memory usage: if it’s low but utilization is low too, you might be bottlenecked by input pipeline or too much CPU↔GPU transfer. If memory is near full and you see OOM errors, reduce batch size or enable mixed precision.
TensorBoard can log device metrics over time to see trends and confirm improvements when you change settings.
Putting it together for Transformers:
Transformers rely on matrix multiplies (QK^T, attention value weighting, and feed-forward layers), softmax, and elementwise ops—excellent GPU workloads.
For training: maximize batch size within VRAM, enable AMP for speed, use gradient accumulation if needed, and ensure data loading keeps pace. If you have multiple GPUs, prefer DDP for scaling. Monitor with nvidia-smi and iterate on bottlenecks.
Tools/Libraries Used
CUDA: NVIDIA’s platform for general-purpose GPU computing. Deep learning frameworks use CUDA to drive kernels on NVIDIA GPUs.
PyTorch: A deep learning framework that abstracts CUDA kernels and provides high-level APIs for tensors, autograd, and distributed training.
TensorBoard: Visualization tool to track metrics (loss, accuracy) and device stats over time.
nvidia-smi: Command-line tool to inspect GPU utilization, memory usage, and processes. Useful for quick checks during training.
Step-by-Step Implementation Guide
Step 1: Verify GPU availability
Install NVIDIA drivers and CUDA toolkit compatible with your PyTorch version (pip/conda installs often bundle CUDA runtime).
In Python: import torch; print(torch.cuda.is_available()) should be True on a correctly configured system.
Check nvidia-smi in a terminal to see your GPU(s) listed and their current utilization/memory.
Step 2: Set device and move model/tensors
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model = model.to(device)
Step 3: Configure the DataLoader
Use DataLoader(dataset, batch_size=B, shuffle=True, num_workers=N, pin_memory=True)
num_workers > 0 to parallelize data loading; pin_memory=True to accelerate transfers to GPU.
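For example (the dataset, batch size, and worker count below are illustrative values to tune for your machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for a real one.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=2, pin_memory=True)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
n_batches = 0
for inputs, targets in loader:
    # non_blocking=True only overlaps the copy when the source tensor is pinned
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    n_batches += 1
```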
Step 4: Mixed precision training (recommended)
from torch.cuda.amp import autocast, GradScaler; scaler = GradScaler()
Training loop: move inputs/targets with non_blocking=True, wrap forward and loss with autocast(), apply scaler.scale(loss).backward(), then scaler.step(optimizer); scaler.update()
Step 5: Gradient accumulation (if VRAM-limited)
Decide accumulation_steps = N; divide loss by N before backward.
Call backward N times; call step/update and zero_grad only every N steps.
Step 6: Scale to multiple GPUs with DDP (when available)
Launch with torchrun --nproc_per_node=NUM_GPUS train.py
In code: initialize process group, set local_rank device, wrap model with DDP. Use DistributedSampler so each process sees a unique data shard.
Step 7: Monitor and iterate
Use nvidia-smi to watch utilization and memory; aim for high utilization and stable memory.
If utilization is low: increase batch size, speed up data loader, reduce CPU↔GPU transfers, enable AMP, consider gradient accumulation.
Tips and Warnings
Device mismatches cause errors: ensure both model and all tensors are on the same device before operations.
Beware frequent .cpu() calls: moving tensors back to CPU for logging or metrics can stall the pipeline; move only scalars or summaries.
Start with AMP: on modern NVIDIA GPUs, torch.cuda.amp usually yields immediate speedups with minimal code changes.
✓Use the largest batch size that fits in memory; if you hit OOM, back off, and combine with gradient accumulation to reach the target effective batch size.
Prefer DDP over DataParallel: for multi-GPU training, DDP scales better and reduces bottlenecks.
Keep data pipeline fast: increase num_workers, use pin_memory, and prefetch batches so the GPU is not waiting.
Checkpoints and logging: write less frequently or asynchronously to avoid blocking the training loop.
Beware of CPU bottlenecks: heavy Python-side work (complex transforms inside the loop) can slow down the whole system.
Validate mixed precision stability: while AMP is robust, watch for loss spikes; if needed, exclude numerically sensitive ops from autocast.
Keep transfers minimal: stage as much of the step as possible on the GPU and avoid per-sample transfers; prefer batched moves.
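The last three tips can be condensed into one pattern, sketched here with a toy linear model: create tensors directly on the device and move only scalars back to the host for logging.

```python
import torch
from torch import nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(8, 2).to(device)
criterion = nn.CrossEntropyLoss()

# Create tensors directly on the device instead of building on CPU and moving.
inputs = torch.randn(32, 8, device=device)
targets = torch.randint(0, 2, (32,), device=device)

loss = criterion(model(inputs), targets)

# Good: .item() moves a single scalar to the host for logging.
logged_loss = loss.item()

# Avoid in the hot path: copying full tensors back every step, e.g.
#   predictions = model(inputs).cpu()   # stalls the GPU pipeline
print(f"loss={logged_loss:.4f}")
```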
In summary, the technical backbone of efficient deep learning is thoughtful GPU usage: correct device placement, large and well-fed batches, mixed precision, gradient accumulation when memory-constrained, scalable multi-GPU strategies, and continuous monitoring to eliminate bottlenecks. With these patterns, the math-heavy core of Transformers maps cleanly to the GPU’s massively parallel engine.
04 Examples
💡
Array addition on GPU: Input two arrays A and B of length 1,000,000, and compute C = A + B. Processing: the GPU launches many threads so each thread adds one pair A[i] + B[i]; the same instruction runs across all indices (SIMD-style). Output: C contains the element-wise sums computed in parallel. Key point: simple, independent per-element operations scale perfectly on GPUs.
💡
PyTorch device placement basics: Input a small CNN model and a batch of images on the CPU. Processing: select device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'), then call model.to(device) and inputs.to(device) before forward(). Output: forward and backward execute on the GPU with significant speedup compared to CPU. Key point: moving both model and tensors to the same device is mandatory for GPU acceleration.
💡
Checking GPU availability: Input a Python REPL with PyTorch installed. Processing: run torch.cuda.is_available() and then nvidia-smi in the terminal to verify a visible device and driver. Output: True from PyTorch and a nvidia-smi table listing utilization and memory. Key point: always verify hardware and drivers before debugging higher-level code.
💡
Large batch vs small batch: Input the same dataset and model trained with batch_size=16 versus batch_size=128. Processing: measure nvidia-smi utilization during training; with larger batches, GPU utilization rises and steps/time improve until VRAM is filled. Output: faster throughput with batch_size=128 but possibly OOM if memory is insufficient. Key point: choose the largest batch that fits to improve utilization.
💡
Mixed precision with AMP: Input a Transformer model and data loader. Processing: wrap forward and loss in autocast(), use GradScaler to scale loss, call scaler.step(optimizer). Output: training time per epoch decreases and memory usage drops with similar accuracy after tuning. Key point: AMP automates precision selection to gain speed and save memory.
💡
Gradient accumulation to simulate big batches: Input a GPU that fits only batch_size=32 but you want effective 128. Processing: divide loss by 4 and call backward over 4 mini-batches before optimizer.step(); zero gradients after the step. Output: model updates as if trained with batch_size=128 but within memory limits. Key point: accumulate gradients when VRAM is the bottleneck.
💡
Data parallelism with DDP: Input a server with 4 GPUs and a training script. Processing: launch with torchrun --nproc_per_node=4 train.py; inside, initialize process groups, set device per process, wrap model with DistributedDataParallel, and use DistributedSampler. Output: each GPU processes a quarter of each batch and gradients are synchronized, reducing total training time. Key point: DDP is the scalable, recommended way to use multiple GPUs.
💡
Model parallelism for oversized models: Input a model too large for one GPU. Processing: place early layers on cuda:0 and later layers on cuda:1; forward pass moves tensors between devices at the boundary. Output: the full model runs without OOM, though with some communication overhead. Key point: split the model across devices when it doesn’t fit in a single GPU’s memory.
💡
Minimizing data transfers: Input a training loop that logs full prediction tensors to CPU each step. Processing: change code to keep tensors on GPU and only move scalar loss.item() and occasional small summaries to CPU. Output: higher GPU utilization and faster epochs due to fewer host-device transfers. Key point: avoid unnecessary .cpu() calls during the hot path.
💡
Monitoring with nvidia-smi: Input an active training job. Processing: run watch -n 1 nvidia-smi to refresh every second, and observe utilization, memory, and process IDs. Output: you see utilization climb near 100% during forward/backward and dip during data loading if there’s a bottleneck. Key point: use nvidia-smi to catch idle periods and guide performance tuning.
💡
Diagnosing low utilization from data loader: Input a model with utilization hovering at 40%. Processing: increase DataLoader num_workers, enable pin_memory, and prefetch batches; ensure preprocessing isn’t inside the training step. Output: utilization rises to 90–100% as the GPU is kept fed. Key point: the input pipeline can starve the GPU if not configured.
💡
Avoiding device mismatch errors: Input code with model on GPU but target tensor on CPU. Processing: running loss = criterion(output, target) throws a device mismatch error or silently triggers slow transfers. Output: after moving target.to(device), the error disappears and performance improves. Key point: keep all tensors involved in an operation on the same device.
💡
Comparing DataParallel vs DDP: Input the same script run with DataParallel and then with DDP on 4 GPUs. Processing: measure epoch time and GPU utilization; DDP shows better scaling and less main-device overhead. Output: DDP achieves higher throughput and more balanced utilization across GPUs. Key point: prefer DDP for multi-GPU jobs to minimize overhead.
05 Conclusion
This lecture connects the mathematical demands of modern language models with the hardware designed to meet them: GPUs. Transformers rely on massive numbers of parallelizable operations—especially matrix multiplications—making them a natural fit for GPUs’ many-core, high-throughput design. CPUs act like expert coordinators handling complex control flow, while GPUs are the muscle that performs the heavy lifting in parallel. The key ideas—SIMD-style execution, threads/blocks/grids, and the need to minimize CPU↔GPU transfers—explain why practical speedups depend on both hardware and how you use it.
On the implementation side, PyTorch simplifies GPU usage with device placement (model.to(device), tensor.to(device)), and it integrates CUDA under the hood so you rarely need to write low-level code. Performance best practices include choosing the largest batch size that fits in memory, enabling mixed precision with torch.cuda.amp for speed and memory savings, and using gradient accumulation to simulate big batches when VRAM is limited. When scaling to multiple GPUs, DistributedDataParallel is the preferred approach over DataParallel because it scales more efficiently. For models that exceed a single GPU’s memory, model parallelism spreads layers across devices at the cost of added communication.
Monitoring is non-negotiable: nvidia-smi and TensorBoard help confirm whether your GPU is well-utilized or starved by data loading, small batches, or frequent transfers. The cycle is measure, change, and re-measure until utilization is high and bottlenecks are removed. As a next step, practice by converting a CPU-only training loop to GPU, enabling AMP, and experimenting with batch sizes and gradient accumulation. If you have access to multiple GPUs, implement DDP and compare performance to a single GPU run.
The core message is simple but powerful: GPUs make deep learning possible at modern scales, but merely having a GPU is not enough—using it well is what delivers real speed. Learn to keep data on the device, feed the GPU large, efficient batches, choose the right precision, and scale in a principled way. Mastery of these patterns will pay off across model types and tasks, from small experiments to large-scale LLM training.
✓Use gradient accumulation to simulate large batches. Divide loss by the number of accumulation steps and only call optimizer.step() after several mini-batches. This approach reaches a large effective batch size without extra VRAM. Combine with AMP for best results.
✓Monitor constantly with nvidia-smi. Aim for high utilization (near 100%) and stable memory usage during steady-state training. Low utilization indicates bottlenecks like small batches or slow data loading. Fix the bottleneck, then re-measure.
✓Keep logging light in the training loop. Converting large tensors to CPU for logs every step can stall training. Log scalars frequently but defer heavy logs or do them less often. This keeps the GPU pipeline flowing.
✓Structure your code for clarity in device placement. Centralize device selection and moves to avoid scattered .to() calls that are easy to miss. Clear patterns reduce the chance of silent performance bugs. Consistency makes debugging faster.
✓Validate improvements with timing and metrics. Compare epoch times, steps per second, and utilization before and after each change. Evidence-based tuning ensures you don’t regress performance accidentally. Keep notes for reproducibility.
✓Scale responsibly to multiple GPUs. Start with one GPU, optimize it, then scale with DDP. Poor single-GPU performance will only get worse on many GPUs. Strong single-GPU baselines make multi-GPU wins easier.
✓Be mindful of numerical stability in mixed precision. While AMP is robust, watch for rare overflows or loss spikes. Exclude sensitive ops from autocast if necessary. GradScaler usually prevents issues by adjusting scale dynamically.
✓Control randomness and reproducibility when benchmarking. Fix seeds and keep settings constant when comparing CPU vs GPU or different batch sizes. This isolates performance changes from training noise. Fair comparisons lead to trustworthy conclusions.
Latency
The time it takes to finish a single task or respond once. CPUs focus on low latency so each job finishes quickly. GPUs accept higher latency for one job but finish many jobs together faster overall. Balancing latency and throughput depends on your goal.
SIMD (Single Instruction, Multiple Data)
A way of computing where one instruction is applied to many pieces of data at the same time. This is perfect for operations like adding two arrays. GPUs use this idea to run thousands of threads with the same steps on different elements. It boosts speed for repetitive math.
Thread (GPU)
The smallest unit of work that runs on the GPU. Each thread handles a tiny part of the data. Many threads run together to finish a big job. Threads are grouped for scheduling and memory sharing.
Block (GPU)
A group of threads that run together on the GPU. Threads in a block can share fast local memory. Blocks make it easier to divide the work into chunks. Many blocks form a grid for the full workload.