•GPUs (Graphics Processing Units) are critical for deep learning because they run thousands of simple math operations at the same time. Language models like Transformers rely on huge numbers of matrix multiplications, which are perfect for parallel processing. CPUs have a few strong cores for complex, step-by-step tasks, while GPUs have many simpler cores for doing lots of math in parallel. Using GPUs correctly can make training and inference dramatically faster.
•Historically, GPUs were built to draw graphics by processing many pixels, textures, and vertices in parallel. Those same traits—parallel math and high throughput—also fit deep learning workloads. This shift from graphics-only to general-purpose computing is why CUDA exists. CUDA lets programs run non-graphics math on NVIDIA GPUs efficiently.
•CPUs focus on low latency and complex control, with features like large caches and branch prediction. GPUs focus on high throughput, sacrificing some single-task flexibility for massive parallel execution. A helpful analogy: CPUs are Formula One race cars—fast and precise on tricky tracks; GPUs are fleets of trucks—slower individually but able to move huge loads together. This difference explains why deep learning thrives on GPUs.
•GPUs commonly follow a SIMD (Single Instruction, Multiple Data) style: the same instruction runs across many data elements at once. In practice, tasks are split into threads, grouped into blocks, and arranged into grids to cover the full workload. Each thread does a tiny piece of the job, enabling huge parallelism. This structure is what makes array and matrix operations lightning fast on GPUs.
•Deep learning frameworks like PyTorch and TensorFlow wrap GPU details so you don’t write CUDA directly. In PyTorch, you check for a GPU and move models and tensors to it with .to(device). Example: device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model.to(device); data = data.to(device). Note that tensor .to() returns a new tensor, so you must assign the result, while module .to() moves parameters in place. Once data and models are on the GPU, operations run there automatically.
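A minimal, runnable sketch of this pattern (the linear model and tensor shapes are placeholders for illustration; it falls back to the CPU when no GPU is visible):

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(16, 4).to(device)    # module .to() moves parameters in place
data = torch.randn(8, 16).to(device)   # tensor .to() returns a NEW tensor

output = model(data)                   # runs on the GPU when one is available
print(output.device)
```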
Why This Lecture Matters
This content matters to anyone training or deploying modern deep learning models: ML engineers, data scientists, researchers, and students building LLMs or other neural networks. Without GPUs, training times are too long and costs too high, making many real projects impractical. Knowing how to place models and data on the GPU, manage batch sizes, and use mixed precision translates directly into faster experiments and lower cloud bills. For teams scaling training, understanding data parallelism and DDP avoids common pitfalls and unlocks near-linear speedups across multiple GPUs. Monitoring with nvidia-smi and fixing bottlenecks ensures hardware dollars become real performance. This knowledge helps you be effective in industry roles where training efficiency is a core metric, from startups to big labs. In today’s AI landscape, the ability to use GPUs well is a baseline skill, and mastering these patterns is a career accelerator. It lets you take advantage of the hardware that underpins state-of-the-art models and deliver results faster and more reliably.
Lecture Summary
01 Overview
This lecture teaches how and why GPUs power modern deep learning, especially large language models (LLMs) built on Transformers. It begins with a brief recap of language modeling fundamentals—perplexity as an evaluation metric, classic models like n-grams and RNNs, and the modern Transformer architecture (self-attention, multi-head attention, feed-forward layers, residual connections, normalization, and positional encodings). With that foundation, the lecture explains the hardware side: why the math inside Transformers (heavy matrix multiplications and additions) maps perfectly to GPU parallelism. You learn the history of GPUs, originally made for graphics rendering, and how their architecture—many simpler cores optimized for throughput—differs from CPUs, which have fewer, stronger cores focused on low latency and complex control flow.
The target audience is students with basic deep learning understanding who are ready to run models efficiently. You should know core training loops (forward, backward, optimization), tensors, and the basics of PyTorch, along with an idea of what matrix multiplications and attention do in a Transformer. No prior CUDA programming is required; you’ll rely on high-level frameworks. Beginners can follow the analogies and code snippets, while practitioners will benefit from best practices and performance tips.
After this lecture, you will be able to: decide when and why to use GPUs, understand the high-level GPU execution model (SIMD, threads, blocks, grids), place PyTorch models and tensors on the GPU correctly, minimize slow CPU↔GPU transfers, scale from one to many GPUs using data parallelism, and unlock performance with batch sizing, mixed precision (torch.cuda.amp), and gradient accumulation. You will also learn to monitor your GPU with nvidia-smi and identify common bottlenecks like small batch sizes or excessive data movement.
The lecture is structured to connect concepts end-to-end. First, it links the math of deep learning to GPU strengths by contrasting CPUs and GPUs using a clear analogy: CPUs are like Formula One race cars (few, very fast, great at complex maneuvers) while GPUs are like fleets of trucks (many, slower individually, but can move a lot at once). Next, it demystifies GPU execution using SIMD and the hierarchy of threads, blocks, and grids. Then, it becomes practical: using CUDA-enabled GPUs in PyTorch with simple device selection and .to(device) placement. The lecture emphasizes best practices: use the largest batch that fits, reduce data transfers, train with mixed precision for speed and memory savings, and use gradient accumulation for effective large batches without VRAM overflow. Finally, it covers multi-GPU strategies—data parallelism (DataParallel vs DistributedDataParallel) and model parallelism for models too large for a single device—and closes with monitoring tools like nvidia-smi and TensorBoard. The overarching message is that GPUs are essential, but thoughtful usage and monitoring are what deliver real-world speedups.
Key Takeaways
✓Always verify your setup before coding optimizations. Run torch.cuda.is_available() and nvidia-smi to confirm the GPU is visible and drivers are correct. If these checks fail, fix the environment first instead of debugging training loops. A solid foundation saves hours later.
✓Place model and tensors on the same device. Call model.to(device) once, move inputs with inputs = inputs.to(device) at the start of each step (tensor .to() is not in-place), and avoid mixing CPU and GPU tensors in the same operation. Device mismatches cause errors or slow implicit transfers. Keep the whole forward/backward/step on the GPU.
✓Feed the GPU with the largest batch size that fits. Start big and shrink if you hit out-of-memory; watch nvidia-smi for memory usage. Larger batches improve throughput and utilization. If VRAM is tight, use gradient accumulation to hit your target effective batch size.
✓Enable mixed precision early. Use torch.cuda.amp.autocast and GradScaler to speed up training and cut memory use with minimal code changes. Monitor loss curves to ensure stability. On modern NVIDIA GPUs, this is usually a free win.
✓Minimize CPU↔GPU transfers. Keep tensors on the GPU throughout the training step, and avoid frequent .cpu() calls in the hot path. Use pin_memory=True and non_blocking=True to speed copies when needed. Fewer transfers mean higher GPU utilization.
✓Optimize the data pipeline. Increase DataLoader num_workers, prefetch batches, and move heavy preprocessing out of the training step. An underfed GPU is wasted compute. Measure utilization while tuning these knobs.
✓Prefer DDP over DataParallel for multi-GPU. Launch one process per GPU with torchrun and use DistributedSampler to shard the dataset. DDP scales better and overlaps communication with computation. Use DataParallel only for quick tests.
Glossary
GPU (Graphics Processing Unit)
A special computer chip designed to handle many simple math tasks in parallel. It acts like a big team of workers each doing small pieces at the same time. GPUs are great for deep learning because models need lots of repeated math. They trade single-task flexibility for massive speed with many tasks.
CPU (Central Processing Unit)
The main processor in a computer that handles general tasks and complex decisions. It is very good at doing things in order and changing plans quickly. CPUs have a few very strong cores and large caches to keep data close. They are great at controlling programs and preparing work for GPUs.
Core
A single processing unit inside a CPU or GPU. CPUs have a few strong cores; GPUs have many simpler cores. More cores mean more tasks can run at once. GPU cores do simple math in parallel over large datasets.
Throughput
How much work is finished in a given time. GPUs are designed for high throughput by doing many tasks at once. Higher throughput means more samples or more math per second. It’s different from latency, which is about how fast a single task finishes.
Latency
How long a single task takes from start to finish. CPUs are optimized for low latency, completing individual operations quickly. GPUs accept higher per-task latency in exchange for overall throughput. It’s the counterpart of throughput, which measures total work completed per second.
•Transferring data between CPU and GPU is relatively slow, so minimizing transfers is key. Load and preprocess data efficiently, and keep tensors on the GPU as long as possible. Frequent host-device copies can bottleneck otherwise fast GPU compute. Aim to stage the entire training step on the GPU end-to-end when feasible.
•Multiple GPUs don’t speed up your code automatically—you must program for it. Data parallelism splits a batch across GPUs so each one processes a shard and gradients are synchronized. In PyTorch, DataParallel is simple but slower; DistributedDataParallel (DDP) is the recommended, scalable approach. Model parallelism splits the model across GPUs and is used when the model cannot fit on a single GPU.
•Batch size strongly affects GPU utilization and memory use. Larger batches can increase throughput by feeding the GPU more work per step, but they consume more VRAM. The practical rule is to use the largest batch size that fits in memory without OOM (out-of-memory) errors. If VRAM is tight, use gradient accumulation to simulate larger batches.
•Mixed precision training uses 16-bit (FP16) and 32-bit (FP32) numbers together. FP16 is faster and uses less memory, but it can lose numerical accuracy on some operations. PyTorch’s torch.cuda.amp automates when to use FP16 vs FP32 to balance speed and stability. This often gives large speedups on modern GPUs with Tensor Cores.
•Gradient accumulation lets you simulate a larger batch size by summing gradients over multiple mini-batches before updating weights. This reduces VRAM pressure while preserving the signal of a big batch. It’s a simple loop change: call loss.backward() several times and then optimizer.step() once. Combine it with mixed precision for even better efficiency.
•Monitoring GPU usage is essential to know if you’re efficient. Use nvidia-smi to view GPU utilization, memory usage, and temperature in real time. Low utilization suggests bottlenecks like tiny batch sizes, excessive data transfer, or slow data loading. Tools like TensorBoard can also track device stats during training.
•Recap of prerequisites: understanding Transformers, attention, and training loops makes the GPU content meaningful. The lecture connects algorithmic needs (matrix multiplications, attention) to hardware realities (parallel compute, memory bandwidth). Mastering device placement, parallelism, and precision unlocks the performance of modern LLM training. The final message: GPUs are necessary, but using them well is what truly accelerates your work.
02 Key Concepts
01
What makes GPUs essential: GPUs are specialized processors that run many simple operations at once, perfect for deep learning’s heavy math. It’s like hiring a whole team of helpers to carry many boxes together instead of one strong person doing trips alone. Technically, GPUs maximize throughput by parallelizing operations like matrix multiplications across thousands of cores. Without them, training would be painfully slow and often impractical for large models. For example, training a Transformer on a CPU could take weeks, whereas the same job on a GPU may complete in days or hours.
02
CPU vs GPU mental model: A CPU has a few powerful cores optimized for complex, step-by-step tasks with low latency. Think of a Formula One car—extremely fast and precise on complicated tracks, but it carries very little cargo. GPUs have many simpler cores focused on throughput, like a fleet of trucks moving huge loads in parallel. Technically, CPUs prioritize branch prediction and large caches, while GPUs simplify control logic to allocate silicon to more cores. In practice, Transformers prefer trucks over race cars because the workload is massive and parallelizable.
03
SIMD (Single Instruction, Multiple Data): SIMD means executing the same instruction over many data items simultaneously. Imagine telling 1,000 students to each add two numbers at the same time, instead of one student doing them one by one. The GPU broadcasts one instruction to many threads that each handle different elements (like array positions). Without SIMD-style execution, array and matrix operations would be limited by sequential loops. For example, adding two arrays element-wise is naturally a SIMD operation.
04
Threads, blocks, and grids: A thread is the smallest unit of GPU work, threads form blocks, and blocks form a grid for the full task. Picture a city (grid) divided into neighborhoods (blocks) filled with workers (threads) each doing a tiny task. Technically, the kernel (the function you run on the GPU) is launched over a grid, and the runtime schedules threads on GPU cores. This hierarchy organizes parallelism and allows scaling to large datasets. For instance, summing a 1M-element array might use thousands of threads grouped into many blocks.
05
Why deep learning maps well to GPUs: DL layers like matmuls and convolutions boil down to many independent multiply-adds. It’s like tiling a floor with millions of identical tiles—perfect for a big crew of identical workers. Technically, these operations are batched, vectorized, and expressed as kernels that saturate GPU cores and memory bandwidth. Without GPUs, the compute and memory demands of Transformers would be prohibitive. A single attention head performing QK^T, softmax, and weighted sums is dominated by matmuls ideal for GPUs.
06
CUDA and OpenCL at a glance: CUDA is NVIDIA’s toolkit for programming GPUs; OpenCL is an open standard across vendors. Think of CUDA like a well-paved highway specific to NVIDIA vehicles, while OpenCL is a general road network for many makes. In practice, deep learning frameworks often run fastest on CUDA with NVIDIA cards. Without these toolkits, developers would struggle to access GPU compute from high-level languages. PyTorch and TensorFlow hide most CUDA details but rely on it under the hood.
07
PyTorch device placement: In PyTorch, you choose a device and move models and tensors there with .to(device). It’s like moving your tools and materials into a workshop so all the work happens in one place. Technically, once tensors are on 'cuda:0', operations are enqueued on that GPU and run asynchronously. Forgetting to move data to the same device as the model causes errors or slow host-device copies. For example: device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model.to(device); inputs = inputs.to(device).
08
Minimizing CPU↔GPU transfers: Transfers across PCIe are slow compared to on-device compute. It’s like sending items by a narrow bridge—crossings are costly, so you want to carry big loads infrequently. Technically, host-to-device and device-to-host copies block or slow the pipeline and reduce GPU utilization. Too many transfers can make a powerful GPU sit idle. For instance, keeping preprocessing and batching on the CPU while repeatedly sending small tensors can bottleneck training.
09
Batch size and utilization: Batch size is how many samples you process per step; bigger batches feed more parallel work to the GPU. It’s like giving a factory a full truckload to process rather than a handful at a time. Technically, larger batches improve arithmetic intensity and kernel efficiency but consume more VRAM. Choose the largest batch that fits without out-of-memory errors to keep utilization high. If memory is tight, use gradient accumulation to simulate larger effective batches.
10
Mixed precision training (torch.cuda.amp): Mixed precision uses FP16 where safe and FP32 where needed. Imagine using a smaller, lighter measuring cup when exactness isn’t critical, switching to a larger precise one for delicate measurements. Technically, autocast picks lower precision for many ops, while a GradScaler prevents underflow in gradients. This saves memory and speeds up compute on GPUs with specialized hardware. For example, you can wrap forward and loss in autocast() and use GradScaler for backward and step.
11
Gradient accumulation: This means summing gradients over several mini-batches before updating weights. It’s like saving coins over several days before making one purchase, rather than paying every day. Technically, you call loss.backward() multiple times and delay optimizer.step() and zero_grad() until after N mini-batches. This allows large effective batch sizes with limited VRAM. For example, accumulating over 4 steps turns a batch of 32 into an effective batch of 128.
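A sketch of the accumulation loop under these assumptions (toy linear model and random data; N = 4 turns mini-batches of 32 into an effective batch of 128):

```python
import torch
import torch.nn as nn

# Toy setup; the model, data, and learning rate are illustrative placeholders.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
N = 4  # accumulation steps: 4 mini-batches of 32 -> effective batch of 128

for step in range(8):
    x = torch.randn(32, 16, device=device)
    y = torch.randn(32, 1, device=device)
    # Divide by N so the accumulated gradient matches the mean over a big batch.
    loss = nn.functional.mse_loss(model(x), y) / N
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % N == 0:
        optimizer.step()                   # one update per N mini-batches
        optimizer.zero_grad(set_to_none=True)
```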
12
Multi-GPU data parallelism: Data parallelism splits each batch across multiple GPUs and averages gradients. Think of four kitchens cooking the same recipe on different portions of the meal, then combining the dishes. Technically, each replica processes a shard and gradients are synchronized (all-reduced) before the optimizer step. PyTorch’s DataParallel is simple but has overhead; DistributedDataParallel (DDP) is faster and the recommended approach. Without data parallelism, you can’t scale throughput across GPUs easily.
13
Model parallelism for oversized models: Model parallelism spreads different parts of a single model across GPUs. It’s like building a long assembly line where each station handles a different stage. Technically, layers are placed on different devices and tensors are passed between them. This is necessary when a single GPU’s memory can’t hold the whole model. For example, you might put the encoder on one GPU and the decoder on another for a very large network.
14
Monitoring with nvidia-smi and TensorBoard: nvidia-smi shows utilization, memory, and temperature so you can spot idle GPUs. It’s like a car dashboard for your GPU’s speed and fuel usage. Technically, you want high utilization (close to 100%) and stable memory levels during training. Low utilization reveals bottlenecks like I/O, small batches, or poor parallel scaling. TensorBoard can log and visualize device metrics over time to guide tuning.
15
Why CPUs still matter: CPUs handle control-heavy logic, data loading, and orchestration. They’re the race cars that make split-second decisions and set up work for the trucks. Technically, CPUs run the Python runtime, data pipelines, and launch GPU kernels. If the CPU or data loader is slow, the GPU may starve. For instance, a non-optimized DataLoader can cause the GPU to wait between steps.
16
PyTorch code pattern for single GPU: The standard flow is select device, move model and data, run forward, compute loss, backward, and step. It’s like setting up your station, doing the work, and cleaning up—every time. Technically, tensors and model parameters must share the same device. Mixing CPU and GPU tensors in ops throws errors or triggers slow copies. A minimal template keeps everything on 'cuda:0' during the step.
17
Choosing DataParallel vs DistributedDataParallel: DataParallel uses one process with threads and scatters batches internally; it’s easier but less efficient. DDP uses one process per GPU, reducing GIL and scheduling overhead. Think of DDP as each cook in their own kitchen versus many cooks cramped in one kitchen. Technically, DDP overlaps communication with computation and scales better across nodes. Use DDP for serious multi-GPU training.
18
Practical signs of underutilization: If nvidia-smi shows low utilization and memory swings, you’re likely bottlenecked elsewhere. It’s like workers waiting around for materials. Technically, small batches, too many host-device copies, slow data loading, or sync points can stall the GPU. Fixes include increasing batch size, pinning memory, prefetching, and reducing CPU-GPU chatter. Profiling helps confirm which step is slow.
19
Historical context of GPU for DL: GPUs were built for graphics—transforming vertices, texturing, and shading millions of pixels. That workload mirrors deep learning’s parallel math needs. When people realized this, they adapted GPUs for general compute via CUDA. Now, modern DL frameworks wrap these capabilities with high-level APIs. This history explains why NVIDIA tooling dominates DL performance.
03 Technical Details
Overall Architecture/Structure
Where GPUs fit in the deep learning stack:
At the top level, you write training code in Python using a high-level framework like PyTorch. Your model uses layers (Linear, Attention, LayerNorm) that internally call efficient kernels (compiled routines) for operations such as matrix multiplications and convolutions.
These kernels are implemented in libraries like cuBLAS (for BLAS-level matrix ops) and cuDNN (for deep neural network primitives) built on CUDA. CUDA is the driver/toolkit that allows code to run on NVIDIA GPUs.
The GPU executes kernels across many cores in parallel. Data sits in GPU memory (VRAM). Transfers between CPU memory (RAM) and GPU memory occur over a bus such as PCIe; these transfers are relatively slow compared to on-GPU compute.
Efficient training means: (a) keep data and model on the GPU to avoid transfers, (b) batch work to exploit parallelism, (c) use mixed precision to reduce memory and speed up math, and (d) scale across multiple GPUs when needed.
CPU vs GPU roles in a training loop:
CPU: loads data (often via DataLoader), performs light preprocessing, launches GPU kernels, handles logging and checkpointing.
GPU: performs the heavy math in the forward and backward passes for tensors placed on it.
If the CPU or data pipeline cannot keep up, the GPU will be underutilized; the goal is to pipeline work so the GPU is busy most of the time.
SIMD (Single Instruction, Multiple Data) describes the basic idea: apply the same instruction to many data elements simultaneously. In CUDA, this is realized by launching many threads that execute the same kernel (function) with different indices referencing different portions of data.
Threads are grouped into blocks, and blocks are organized into a grid. You specify the grid and block sizes when launching a kernel (frameworks do this internally). The runtime schedules blocks on the GPU’s streaming multiprocessors (SMs) so many threads run concurrently.
In deep learning, you seldom write custom kernels; you rely on PyTorch and vendor libraries that pick appropriate launch parameters. But conceptually, each element or tile of a matrix multiplication can be processed by different threads, enabling massive parallelism.
Data flow for a typical training step (single GPU):
Step A: Dataloader on CPU reads a batch and optionally applies augmentations. Ideally, it uses multiple workers, pinned memory, and prefetching to overlap with GPU compute.
Step B: The batch tensor is transferred once to the GPU (inputs = inputs.to('cuda:0')). The model (model.to('cuda:0')) already resides on the GPU.
Step C: Forward pass runs: matmuls, attention, activations—all as GPU kernels. Intermediate tensors remain on the GPU.
Step D: Loss is computed on the GPU; backward() computes gradients via automatic differentiation, launching more kernels.
Step E: Optimizer.step() updates parameters (often on GPU-resident tensors). Optionally, mixed precision scaling adjusts gradients before the update.
Step F: Only necessary outputs (like scalar losses for logging) are brought back to the CPU; avoid copying large tensors back.
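The steps above can be sketched as a single runnable training step (the model, data, and hyperparameters are toy placeholders for illustration):

```python
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(64, 16)             # Step A: batch produced on the CPU
targets = torch.randint(0, 4, (64,))

inputs = inputs.to(device)               # Step B: one transfer per batch
targets = targets.to(device)

outputs = model(inputs)                  # Step C: forward runs as GPU kernels
loss = criterion(outputs, targets)       # Step D: loss and backward stay on device
loss.backward()
optimizer.step()                         # Step E: update GPU-resident parameters
optimizer.zero_grad(set_to_none=True)

print(loss.item())                       # Step F: copy back only the scalar loss
```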
Code/Implementation Details (PyTorch)
Device selection and placement:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
target = target.to(device)
All tensors used together must be on the same device; mixing devices triggers errors or implicit (slow) copies. Check torch.cuda.is_available() to guard against missing GPU.
Single-GPU training loop (with best practices):
Create model and move to GPU.
Create optimizer (e.g., AdamW) after moving model so optimizer references GPU parameters.
In each iteration:
Load batch from DataLoader (CPU); ensure num_workers > 0 for speed, and pin_memory=True so host-to-device copies are faster.
inputs = inputs.to(device, non_blocking=True) if using pinned memory to overlap copy with compute.
With torch.cuda.amp.autocast(enabled=True): run forward, compute loss.
scaler.scale(loss).backward() to handle FP16 gradients safely.
The scaler (GradScaler) prevents underflow by dynamically adjusting scaling of the loss/gradients in mixed precision.
Mixed precision snippet:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
This pattern uses FP16 where safe and FP32 when necessary, often delivering speedups and lower memory use on modern NVIDIA GPUs.
Gradient accumulation pattern:
Suppose you want an effective batch size of B_eff but can only fit B = B_eff / N per mini-batch:

for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / N  # scale so the summed gradient matches the big batch
    loss.backward()
    if (step + 1) % N == 0:
        optimizer.step()
        optimizer.zero_grad()
With AMP: wrap forward in autocast() and use scaler for backward/step; call scaler.step()/update() every N mini-batches.
This approach accumulates gradients across N mini-batches to replicate the effect of a larger batch.
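A combined AMP-plus-accumulation sketch under these assumptions (toy model and data; enabled=torch.cuda.is_available() is an added fallback so the pattern degrades to plain FP32 on CPU-only machines):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=torch.cuda.is_available())
N = 4  # accumulation steps

for step in range(8):
    x = torch.randn(32, 16, device=device)
    y = torch.randn(32, 1, device=device)
    with autocast(enabled=torch.cuda.is_available()):
        loss = nn.functional.mse_loss(model(x), y) / N
    scaler.scale(loss).backward()         # scaled backward every mini-batch
    if (step + 1) % N == 0:
        scaler.step(optimizer)            # one optimizer step per N mini-batches
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```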
Multi-GPU with DataParallel vs DistributedDataParallel:
torch.nn.DataParallel(model) replicates the model on multiple GPUs in one process, splits input across devices, gathers outputs back. It’s easy to use but can have overhead on the main device and is generally slower.
torch.nn.parallel.DistributedDataParallel (DDP) launches one process per GPU (commonly via torchrun; the older torch.distributed.launch is deprecated). Each process holds a model replica bound to a single GPU, processes a shard of the batch, and gradients are all-reduced across processes. DDP overlaps computation and communication and scales much better.
Sketch with DDP (single node, multiple GPUs):
torchrun --nproc_per_node=NUM_GPUS train.py
Inside train.py: initialize process group, set local_rank GPU via environment variable, move model to that device, wrap with DDP(model, device_ids=[local_rank]). Use DistributedSampler on the dataset so each process sees unique data.
Key detail: set find_unused_parameters carefully if your graph has conditional branches; unnecessary settings can slow training.
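A minimal single-node skeleton along these lines (the CPU/'gloo' fallback and toy model are assumptions added so the sketch can be exercised without a GPU; a real job would launch it with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun provides these; defaults let the sketch run as one CPU process.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))

    use_cuda = torch.cuda.is_available()
    backend = 'nccl' if use_cuda else 'gloo'   # nccl for GPUs, gloo for CPU
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if use_cuda:
        local_rank = int(os.environ.get('LOCAL_RANK', 0))
        torch.cuda.set_device(local_rank)
        device = torch.device('cuda', local_rank)
        model = DDP(nn.Linear(16, 4).to(device), device_ids=[local_rank])
    else:
        device = torch.device('cpu')
        model = DDP(nn.Linear(16, 4))          # CPU replica, no device_ids

    out = model(torch.randn(8, 16, device=device))
    dist.destroy_process_group()
    return out.shape

if __name__ == '__main__':
    main()
```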
Model parallelism basics:
When the model doesn’t fit on one GPU, split layers across devices: e.g., encoder on cuda:0, decoder on cuda:1. The forward pass moves tensors between devices at boundary layers. PyTorch lets you assign submodules to devices and manage the transfers.
Model parallelism is more complex: you must manage inter-device communication and ensure that sequential dependencies are respected. Pipeline parallelism (staging micro-batches through model partitions) can increase utilization by keeping all devices busy, but requires orchestration.
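A toy sketch of splitting a model across two devices (the layer sizes are placeholders, and falling back to the CPU for both halves when fewer than two GPUs are present is an added assumption so the sketch runs anywhere):

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise both "devices" are the CPU.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device('cuda:0') if two_gpus else torch.device('cpu')
dev1 = torch.device('cuda:1') if two_gpus else torch.device('cpu')

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 64).to(dev0)  # first half on device 0
        self.decoder = nn.Linear(64, 8).to(dev1)   # second half on device 1

    def forward(self, x):
        h = self.encoder(x.to(dev0))
        return self.decoder(h.to(dev1))            # explicit transfer at the boundary

out = TwoDeviceNet()(torch.randn(4, 32))
```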
Data transfer costs and memory considerations:
Host (CPU RAM) ↔ Device (GPU VRAM) transfers traverse a bus like PCIe and are much slower than on-device memory operations. Minimize transfers by keeping data and intermediate tensors on the GPU during the forward/backward/step.
Use pin_memory=True in DataLoader and non_blocking=True in .to() calls to speed up transfers and overlap copy with compute.
Monitor VRAM with nvidia-smi; if you’re close to the limit, consider smaller batch sizes, mixed precision, gradient checkpointing (if available), or gradient accumulation.
Monitoring and diagnosing performance:
nvidia-smi shows GPU utilization (%), memory (MiB), temperature, and running processes. High utilization (near 100%) indicates the GPU is busy; low utilization suggests bottlenecks (data loading, small batches, frequent sync points).
Watch memory usage: if it’s low but utilization is low too, you might be bottlenecked by input pipeline or too much CPU↔GPU transfer. If memory is near full and you see OOM errors, reduce batch size or enable mixed precision.
TensorBoard can log device metrics over time to see trends and confirm improvements when you change settings.
Putting it together for Transformers:
Transformers rely on matrix multiplies (QK^T, attention value weighting, and feed-forward layers), softmax, and elementwise ops—excellent GPU workloads.
For training: maximize batch size within VRAM, enable AMP for speed, use gradient accumulation if needed, and ensure data loading keeps pace. If you have multiple GPUs, prefer DDP for scaling. Monitor with nvidia-smi and iterate on bottlenecks.
Tools/Libraries Used
CUDA: NVIDIA’s platform for general-purpose GPU computing. Deep learning frameworks use CUDA to drive kernels on NVIDIA GPUs.
PyTorch: A deep learning framework that abstracts CUDA kernels and provides high-level APIs for tensors, autograd, and distributed training.
TensorBoard: Visualization tool to track metrics (loss, accuracy) and device stats over time.
nvidia-smi: Command-line tool to inspect GPU utilization, memory usage, and processes. Useful for quick checks during training.
Step-by-Step Implementation Guide
Step 1: Verify GPU availability
Install NVIDIA drivers and CUDA toolkit compatible with your PyTorch version (pip/conda installs often bundle CUDA runtime).
In Python: import torch; print(torch.cuda.is_available()) should be True on a correctly configured system.
Check nvidia-smi in a terminal to see your GPU(s) listed and their current utilization/memory.
Step 2: Set device and move model/tensors
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'); model = model.to(device)
Step 3: Configure the DataLoader
Use DataLoader(dataset, batch_size=B, shuffle=True, num_workers=N, pin_memory=True)
num_workers > 0 to parallelize data loading; pin_memory=True to accelerate transfers to GPU.
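For example (the dataset, batch size, and worker count below are illustrative values to tune for your machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for a real one.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=2, pin_memory=True)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
n_batches = 0
for inputs, targets in loader:
    # non_blocking=True only overlaps the copy when the source tensor is pinned
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    n_batches += 1
```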
Step 4: Mixed precision training (recommended)
from torch.cuda.amp import autocast, GradScaler; scaler = GradScaler()
Training loop: move inputs/targets with non_blocking=True, wrap forward and loss with autocast(), apply scaler.scale(loss).backward(), then scaler.step(optimizer); scaler.update()
Step 5: Gradient accumulation (if VRAM-limited)
Decide accumulation_steps = N; divide loss by N before backward.
Call backward N times; call step/update and zero_grad only every N steps.
Step 6: Scale to multiple GPUs with DDP (when available)
Launch with torchrun --nproc_per_node=NUM_GPUS train.py
In code: initialize process group, set local_rank device, wrap model with DDP. Use DistributedSampler so each process sees a unique data shard.
Step 7: Monitor and iterate
Use nvidia-smi to watch utilization and memory; aim for high utilization and stable memory.
If utilization is low: increase batch size, speed up data loader, reduce CPU↔GPU transfers, enable AMP, consider gradient accumulation.
Tips and Warnings
Device mismatches cause errors: ensure both model and all tensors are on the same device before operations.
Beware frequent .cpu() calls: moving tensors back to CPU for logging or metrics can stall the pipeline; move only scalars or summaries.
Start with AMP: on modern NVIDIA GPUs, torch.cuda.amp usually yields immediate speedups with minimal code changes.
✓Use the largest batch size that fits in memory; if you hit OOM, back off, and combine with gradient accumulation to reach the target effective batch size.
Prefer DDP over DataParallel: for multi-GPU training, DDP scales better and reduces bottlenecks.
Keep data pipeline fast: increase num_workers, use pin_memory, and prefetch batches so the GPU is not waiting.
Checkpoints and logging: write less frequently or asynchronously to avoid blocking the training loop.
Beware of CPU bottlenecks: heavy Python-side work (complex transforms inside the loop) can slow down the whole system.
Validate mixed precision stability: while AMP is robust, watch for loss spikes; if needed, exclude numerically sensitive ops from autocast.
Keep transfers minimal: stage as much of the step as possible on the GPU and avoid per-sample transfers; prefer batched moves.
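The last three tips can be condensed into one pattern, sketched here with a toy linear model: create tensors directly on the device and move only scalars back to the host for logging.

```python
import torch
from torch import nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(8, 2).to(device)
criterion = nn.CrossEntropyLoss()

# Create tensors directly on the device instead of building on CPU and moving.
inputs = torch.randn(32, 8, device=device)
targets = torch.randint(0, 2, (32,), device=device)

loss = criterion(model(inputs), targets)

# Good: .item() moves a single scalar to the host for logging.
logged_loss = loss.item()

# Avoid in the hot path: copying full tensors back every step, e.g.
#   predictions = model(inputs).cpu()   # stalls the GPU pipeline
print(f"loss={logged_loss:.4f}")
```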
In summary, the technical backbone of efficient deep learning is thoughtful GPU usage: correct device placement, large and well-fed batches, mixed precision, gradient accumulation when memory-constrained, scalable multi-GPU strategies, and continuous monitoring to eliminate bottlenecks. With these patterns, the math-heavy core of Transformers maps cleanly to the GPU’s massively parallel engine.
04 Examples
💡
Array addition on GPU: Input two arrays A and B of length 1,000,000, and compute C = A + B. Processing: the GPU launches many threads so each thread adds one pair A[i] + B[i]; the same instruction runs across all indices (SIMD-style). Output: C contains the element-wise sums computed in parallel. Key point: simple, independent per-element operations scale perfectly on GPUs.
💡
PyTorch device placement basics: Input a small CNN model and a batch of images on the CPU. Processing: select device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'), then call model.to(device) and inputs.to(device) before forward(). Output: forward and backward execute on the GPU with significant speedup compared to CPU. Key point: moving both model and tensors to the same device is mandatory for GPU acceleration.
💡
Checking GPU availability: Input a Python REPL with PyTorch installed. Processing: run torch.cuda.is_available() and then nvidia-smi in the terminal to verify a visible device and driver. Output: True from PyTorch and a nvidia-smi table listing utilization and memory. Key point: always verify hardware and drivers before debugging higher-level code.
💡
Large batch vs small batch: Input the same dataset and model trained with batch_size=16 versus batch_size=128. Processing: measure nvidia-smi utilization during training; with larger batches, GPU utilization rises and steps/time improve until VRAM is filled. Output: faster throughput with batch_size=128 but possibly OOM if memory is insufficient. Key point: choose the largest batch that fits to improve utilization.
💡
Mixed precision with AMP: Input a Transformer model and data loader. Processing: wrap forward and loss in autocast(), use GradScaler to scale loss, call scaler.step(optimizer). Output: training time per epoch decreases and memory usage drops with similar accuracy after tuning. Key point: AMP automates precision selection to gain speed and save memory.
💡
Gradient accumulation to simulate big batches: Input a GPU that fits only batch_size=32 but you want effective 128. Processing: divide loss by 4 and call backward over 4 mini-batches before optimizer.step(); zero gradients after the step. Output: model updates as if trained with batch_size=128 but within memory limits. Key point: accumulate gradients when VRAM is the bottleneck.
💡
Data parallelism with DDP: Input a server with 4 GPUs and a training script. Processing: launch with torchrun --nproc_per_node=4 train.py; inside, initialize process groups, set device per process, wrap model with DistributedDataParallel, and use DistributedSampler. Output: each GPU processes a quarter of each batch and gradients are synchronized, reducing total training time. Key point: DDP is the scalable, recommended way to use multiple GPUs.
💡
Model parallelism for oversized models: Input a model too large for one GPU. Processing: place early layers on cuda:0 and later layers on cuda:1; forward pass moves tensors between devices at the boundary. Output: the full model runs without OOM, though with some communication overhead. Key point: split the model across devices when it doesn’t fit in a single GPU’s memory.
💡
Minimizing data transfers: Input a training loop that logs full prediction tensors to CPU each step. Processing: change code to keep tensors on GPU and only move scalar loss.item() and occasional small summaries to CPU. Output: higher GPU utilization and faster epochs due to fewer host-device transfers. Key point: avoid unnecessary .cpu() calls during the hot path.
💡
Monitoring with nvidia-smi: Input an active training job. Processing: run watch -n 1 nvidia-smi to refresh every second, and observe utilization, memory, and process IDs. Output: you see utilization climb near 100% during forward/backward and dip during data loading if there’s a bottleneck. Key point: use nvidia-smi to catch idle periods and guide performance tuning.
💡
Diagnosing low utilization from data loader: Input a model with utilization hovering at 40%. Processing: increase DataLoader num_workers, enable pin_memory, and prefetch batches; ensure preprocessing isn’t inside the training step. Output: utilization rises to 90–100% as the GPU is kept fed. Key point: the input pipeline can starve the GPU if not configured.
💡
Avoiding device mismatch errors: Input code with model on GPU but target tensor on CPU. Processing: running loss = criterion(output, target) throws a device mismatch error or silently triggers slow transfers. Output: after moving target.to(device), the error disappears and performance improves. Key point: keep all tensors involved in an operation on the same device.
💡
Comparing DataParallel vs DDP: Input the same script run with DataParallel and then with DDP on 4 GPUs. Processing: measure epoch time and GPU utilization; DDP shows better scaling and less main-device overhead. Output: DDP achieves higher throughput and more balanced utilization across GPUs. Key point: prefer DDP for multi-GPU jobs to minimize overhead.
05 Conclusion
This lecture connects the mathematical demands of modern language models with the hardware designed to meet them: GPUs. Transformers rely on massive numbers of parallelizable operations—especially matrix multiplications—making them a natural fit for GPUs’ many-core, high-throughput design. CPUs act like expert coordinators handling complex control flow, while GPUs are the muscle that performs the heavy lifting in parallel. The key ideas—SIMD-style execution, threads/blocks/grids, and the need to minimize CPU↔GPU transfers—explain why practical speedups depend on both hardware and how you use it.
On the implementation side, PyTorch simplifies GPU usage with device placement (model.to(device), tensor.to(device)), and it integrates CUDA under the hood so you rarely need to write low-level code. Performance best practices include choosing the largest batch size that fits in memory, enabling mixed precision with torch.cuda.amp for speed and memory savings, and using gradient accumulation to simulate big batches when VRAM is limited. When scaling to multiple GPUs, DistributedDataParallel is the preferred approach over DataParallel because it scales more efficiently. For models that exceed a single GPU’s memory, model parallelism spreads layers across devices at the cost of added communication.
Monitoring is non-negotiable: nvidia-smi and TensorBoard help confirm whether your GPU is well-utilized or starved by data loading, small batches, or frequent transfers. The cycle is measure, change, and re-measure until utilization is high and bottlenecks are removed. As a next step, practice by converting a CPU-only training loop to GPU, enabling AMP, and experimenting with batch sizes and gradient accumulation. If you have access to multiple GPUs, implement DDP and compare performance to a single GPU run.
The core message is simple but powerful: GPUs make deep learning possible at modern scales, but merely having a GPU is not enough—using it well is what delivers real speed. Learn to keep data on the device, feed the GPU large, efficient batches, choose the right precision, and scale in a principled way. Mastery of these patterns will pay off across model types and tasks, from small experiments to large-scale LLM training.
✓Use gradient accumulation to simulate large batches. Divide loss by the number of accumulation steps and only call optimizer.step() after several mini-batches. This approach reaches a large effective batch size without extra VRAM. Combine with AMP for best results.
✓Monitor constantly with nvidia-smi. Aim for high utilization (near 100%) and stable memory usage during steady-state training. Low utilization indicates bottlenecks like small batches or slow data loading. Fix the bottleneck, then re-measure.
✓Keep logging light in the training loop. Converting large tensors to CPU for logs every step can stall training. Log scalars frequently but defer heavy logs or do them less often. This keeps the GPU pipeline flowing.
✓Structure your code for clarity in device placement. Centralize device selection and moves to avoid scattered .to() calls that are easy to miss. Clear patterns reduce the chance of silent performance bugs. Consistency makes debugging faster.
✓Validate improvements with timing and metrics. Compare epoch times, steps per second, and utilization before and after each change. Evidence-based tuning ensures you don’t regress performance accidentally. Keep notes for reproducibility.
✓Scale responsibly to multiple GPUs. Start with one GPU, optimize it, then scale with DDP. Poor single-GPU performance will only get worse on many GPUs. Strong single-GPU baselines make multi-GPU wins easier.
✓Be mindful of numerical stability in mixed precision. While AMP is robust, watch for rare overflows or loss spikes. Exclude sensitive ops from autocast if necessary. GradScaler usually prevents issues by adjusting scale dynamically.
✓Control randomness and reproducibility when benchmarking. Fix seeds and keep settings constant when comparing CPU vs GPU or different batch sizes. This isolates performance changes from training noise. Fair comparisons lead to trustworthy conclusions.
Latency
The time it takes to finish a single task or respond once. CPUs focus on low latency so each job finishes quickly. GPUs accept higher latency for one job but finish many jobs together faster overall. Balancing latency and throughput depends on your goal.
SIMD (Single Instruction, Multiple Data)
A way of computing where one instruction is applied to many pieces of data at the same time. This is perfect for operations like adding two arrays. GPUs use this idea to run thousands of threads with the same steps on different elements. It boosts speed for repetitive math.
Thread (GPU)
The smallest unit of work that runs on the GPU. Each thread handles a tiny part of the data. Many threads run together to finish a big job. Threads are grouped for scheduling and memory sharing.
Block (GPU)
A group of threads that run together on the GPU. Threads in a block can share fast local memory. Blocks make it easier to divide the work into chunks. Many blocks form a grid for the full workload.