Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 2: PyTorch, Resource Accounting | How I Study AI
Beginner
Stanford Online
Deep Learning · YouTube

Key Summary

  • This session teaches two essentials for building language models: PyTorch basics and resource accounting. PyTorch is a library for working with tensors (multi‑dimensional arrays) and can run on CPU or GPU. You learn how to create tensors, perform math (including matrix multiplies), reshape, index/slice, and use automatic differentiation to compute gradients for training.
  • Tensors come in shapes like scalars (0D), vectors (1D), matrices (2D), and higher dimensions. You can create them from Python lists, choose data types (like float32, float16, int64), and make special tensors like all‑zeros, all‑ones, or random values. Operations are element‑wise by default, and matrix multiplication uses torch.matmul with shape compatibility rules.
  • Automatic differentiation (autograd) lets PyTorch compute derivatives for you. By setting requires_grad=True on tensors used as parameters, PyTorch records operations in a compute graph. Calling .backward() on a scalar loss backpropagates gradients, filling .grad on the parameters so you can update them.
  • A simple linear regression training loop demonstrates the workflow: define parameters w and b, compute predictions y_hat, define mean squared error loss, call loss.backward(), and update parameters repeatedly. This shows the core pattern used in training neural networks: forward pass, loss, backward pass, and parameter update. The same pattern scales to big language models.
  • Moving work to the GPU can dramatically speed training and inference. You detect a device (like 'cuda' if available) and move tensors and models with .to(device). Once moved, all operations run on that device automatically until you move them back.
  • Backpropagation is the algorithm autograd uses under the hood to compute gradients. It applies the chain rule from calculus, starting from the loss and moving backward through each layer. You don’t manually code it—PyTorch does it when you call .backward().

Why This Lecture Matters

This material is crucial for anyone who wants to train language models or any deep learning model efficiently and safely. Engineers, researchers, and students often hit practical walls—not because they don’t know the theory, but because they don’t manage compute, memory, and data budgets well. These walls show up as out‑of‑memory crashes, week‑long training runs that go nowhere, or cloud bills that balloon unexpectedly. Learning PyTorch fundamentals lets you express models cleanly, push work to GPUs, and use autograd to train without writing gradients by hand. Learning resource accounting gives you a framework to measure first, then improve, so you can make confident, cost‑aware decisions. In real projects, you must select a batch size that fits your GPU, pick a dtype that balances speed and stability, and choose tricks like gradient checkpointing or mixed precision to scale. You need profiling skills to identify true bottlenecks and not waste time on micro‑optimizations that don’t matter. Token counting helps you plan training schedules and compare the cost of training from scratch versus fine‑tuning. These habits transfer directly to production systems where uptime and cost control are non‑negotiable. Investing in these basics also helps your career: teams prize people who can build models that work within constraints, explain trade‑offs, and deliver results predictably. With industry moving toward ever larger models, the ability to control resources while preserving performance is a standout skill.

Lecture Summary


01Overview

This lecture teaches two pillars you need before building language models: how to use PyTorch effectively and how to do resource accounting responsibly. PyTorch is a popular deep learning library that works with tensors—multi‑dimensional arrays of numbers—and can run its computations on both CPUs and GPUs. You learn how to create tensors from lists, specify data types (like 32‑bit floats or 64‑bit integers), and build special tensors full of zeros, ones, or random values. You also practice basic math operations (addition, subtraction, multiplication, division) that are element‑wise by default, and you learn how to do matrix multiplication with torch.matmul when shapes line up. Beyond math, you learn how to reshape tensors without changing their underlying data and how to index and slice to pick out elements or ranges.
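The operations above can be sketched in a few lines of PyTorch (a minimal illustration; the random values will differ run to run):

```python
import torch

# Create tensors from lists, with explicit dtypes
x = torch.tensor([1.0, 2.0, 3.0])                # float32 by default
idx = torch.tensor([0, 2], dtype=torch.int64)    # integer tensor for indexing

# Special tensors
zeros = torch.zeros((2, 3))
ones = torch.ones((2, 3))
rnd = torch.rand((2, 3))                         # uniform random values in [0, 1)

# Element-wise math (shapes must match) and matrix multiply (inner dims must match)
s = zeros + ones                                 # shape [2, 3]
A = torch.rand((3, 4))
B = torch.rand((4, 5))
C = torch.matmul(A, B)                           # [3, 4] @ [4, 5] -> shape [3, 5]

# Reshape (total element count preserved) and slicing
m = torch.arange(16).reshape(4, 4)
flat = m.reshape(2, 8)                           # same 16 numbers, new view
first_col = m[:, 0]                              # shape [4]
```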

The lecture’s second half focuses on automatic differentiation (autograd). This is the engine that computes gradients—the slopes that tell you how to adjust your model’s parameters to reduce error—so you can train models. By marking tensors as requires_grad=True, PyTorch builds a computation graph as you do operations. When you compute a loss and call .backward(), PyTorch runs backpropagation, which is just the chain rule from calculus applied repeatedly from the loss back through each operation. The result is that parameter tensors (like weights and biases) get their gradients filled in (w.grad, b.grad, etc.), and you can update them with a simple rule like param -= learning_rate * param.grad.
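A minimal autograd example, assuming a single scalar parameter:

```python
import torch

# Mark a tensor as a parameter so autograd tracks operations on it
w = torch.tensor([3.0], requires_grad=True)

# Forward pass builds the compute graph: y = w^2
y = (w ** 2).sum()

# Backward pass applies the chain rule; dy/dw = 2w = 6
y.backward()
print(w.grad)  # tensor([6.])
```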

To make all this concrete, the lecture walks through a linear regression example. Linear regression predicts y from x using a straight line y_hat = w * x + b. You define w and b as tensors with requires_grad=True, compute predictions, measure error with mean squared error (MSE), call backward to get gradients, and update w and b. Repeating this loop reduces the loss. This mirrors the pattern used for neural networks and language models: forward pass, compute loss, backward pass, update parameters. Finally, you see how to accelerate these steps by moving tensors and models to a GPU with .to(device), which can make training much faster.
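The walkthrough above can be condensed into a runnable sketch (the synthetic data, seed, learning rate, and step count here are illustrative choices, not the lecture's exact values):

```python
import torch

torch.manual_seed(0)
# Synthetic data following y = 2x + 0.5 plus a little noise
x = torch.rand(100)
y = 2.0 * x + 0.5 + 0.01 * torch.randn(100)

# Trainable parameters
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

lr = 0.1
for step in range(1000):
    y_hat = w * x + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # mean squared error
    loss.backward()                    # fills w.grad and b.grad
    with torch.no_grad():              # update without autograd tracking
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                     # reset gradients for the next step
    b.grad.zero_()
```

After enough steps, w and b should land close to the true values 2.0 and 0.5.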

The second pillar is resource accounting: tracking compute, memory, and data so you can train large models safely and cost‑effectively. Compute can be measured in time or in FLOPs (floating‑point operations). Memory is measured in bytes and includes model parameters, activations, and optimizer states. Data is measured in bytes or tokens—the units language models actually process. You use profilers (e.g., PyTorch Profiler and NVIDIA Nsight) to measure compute and memory hotspots and then optimize. Strategies to reduce compute include using smaller models and smaller batch sizes; tuning learning rate thoughtfully also affects how efficiently you reach good performance. Memory can be reduced with smaller models, smaller batches, lower‑precision data types like float16, and gradient checkpointing, which saves memory by not storing every intermediate activation and recomputing some during backpropagation. Data usage can be reduced with smaller datasets, careful augmentation (when appropriate), or transfer learning—starting from a model trained elsewhere and fine‑tuning it on your task with fewer tokens.
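The accounting above can be made concrete with back‑of‑the‑envelope arithmetic (the 6‑FLOPs‑per‑parameter‑per‑token figure is a widely used estimate for dense transformer training, and the model and token counts below are arbitrary examples):

```python
# Back-of-the-envelope resource accounting (pure arithmetic; the numbers
# below are common rules of thumb, not exact for every model or optimizer).

params = 1_000_000_000             # a 1B-parameter model
bytes_per_param_fp32 = 4
bytes_per_param_fp16 = 2

# Memory just for the parameters; activations, gradients, and optimizer
# states add substantially more on top of this.
param_mem_fp32_gb = params * bytes_per_param_fp32 / 1e9   # 4.0 GB
param_mem_fp16_gb = params * bytes_per_param_fp16 / 1e9   # 2.0 GB

# A widely used estimate for training compute:
# ~6 FLOPs per parameter per token (forward + backward combined).
tokens = 20_000_000_000            # 20B training tokens
train_flops = 6 * params * tokens  # ~1.2e20 FLOPs
```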

An important nuance is the trade‑off of using smaller data types. Float16 uses half the memory of float32 and can speed things up on modern GPUs, but it loses precision and can make training unstable. Mixed precision training mitigates this by using float16 for activations and gradients while keeping parameters in float32 for stability. PyTorch supports this with Automatic Mixed Precision (AMP), which handles most details for you.
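A minimal AMP sketch, assuming the common autocast + GradScaler pattern; on a CPU‑only machine it simply runs with autocast disabled:

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"          # float16 autocast is a GPU feature

model = torch.nn.Linear(16, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)   # no-op when disabled

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

for _ in range(3):
    opt.zero_grad()
    # Run the forward pass in reduced precision where safe (when enabled)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # scale loss to avoid fp16 underflow
    scaler.step(opt)                # unscale gradients, then optimizer step
    scaler.update()                 # adjust the scale factor for next step
```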

In summary, you come away understanding how to use PyTorch tensors and autograd to build and train models, and how to think like a good engineer about compute, memory, and data budgets. You learn to measure first, then optimize, and to pick techniques like gradient checkpointing and AMP when they make sense. This foundation prepares you for the next step: the Transformer architecture that powers modern language models.

Key Takeaways

  • ✓Start every project with a measurement plan. Profile a short training run to record time per step, peak memory, and token throughput. Use these numbers to choose batch size, model size, and precision. Re‑measure after each major change so you know what really helped.
  • ✓Keep tensors, models, and inputs on the same device. After selecting device='cuda' when available, always move both data and model with .to(device). Device mismatches cause errors and slowdowns. Building a tiny helper function to fetch the device and move objects avoids many bugs.
  • ✓Use shapes and dtypes deliberately. Print x.shape and x.dtype when debugging, and assert expected shapes at boundaries. Prefer float32 for stability and float16 with AMP for memory/speed when supported. Keep integer tensors for indices and counts to avoid accidental casts.
  • ✓Treat matrix multiplication shapes as contracts. Before calling torch.matmul, check that the inner dimensions match (m×n times n×p). If they don’t, rethink your data pipeline or reshape appropriately. Shape errors are among the most common beginner issues. Validating shapes early saves time.
  • ✓Structure your training loop clearly: forward → loss → backward → update → zero grads. Don’t forget to zero gradients after each update to avoid accumulation. Wrap updates in torch.no_grad() to prevent autograd from tracking them. Consistency keeps your loop correct and debuggable.
  • ✓Tune learning rate with intention. Too high leads to exploding loss; too low wastes compute time. Use small pilot runs and learning rate ranges to find a sweet spot. Remember that the best learning rate can change with batch size and precision settings.
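One way to sketch the device helper suggested above (the function name get_device is a hypothetical choice):

```python
import torch

def get_device() -> torch.device:
    """Pick the best available device once, then reuse it everywhere."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = get_device()
model = torch.nn.Linear(8, 2).to(device)   # move model parameters
x = torch.randn(4, 8, device=device)       # create data on the same device
out = model(x)                             # no device-mismatch errors
```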

Glossary

Tensor

A tensor is a multi‑dimensional array of numbers that PyTorch uses to hold data. It can be 0D (a single number), 1D (a list), 2D (a grid), or higher. Tensors have shapes that describe their size and dtypes that describe their precision. They can live on CPU or GPU. Tensors are the basic building blocks of all model computations.

Scalar

A scalar is a tensor with zero dimensions—just one number. It has no length, rows, or columns. Scalars often represent losses or single parameters. They are simple but essential pieces in computations. PyTorch still treats them as tensors.

Vector

A vector is a one‑dimensional tensor, like a list of numbers. It has a length but no rows or columns. Vectors often represent features for one item. They are a common input or output shape in models. Operations on vectors happen element by element.

Matrix

A matrix is a two‑dimensional tensor with rows and columns. It is used for linear algebra, like transforming features. Many neural network layers boil down to matrix multiplications. Understanding matrices helps you reason about shapes. Matrix math is highly optimized on GPUs.

dtype (data type)

A dtype describes how each number in a tensor is stored: float32 (the common default), float16 (half precision), int64 (for indices), and so on. The dtype controls memory use and numerical precision, so choosing it deliberately affects both speed and stability. Tensors in an operation should have compatible dtypes to avoid errors or accidental casts.
  • Resource accounting means tracking compute, memory, and data so you don’t blow your budget or crash machines. Compute can be measured by time or FLOPs (floating‑point operations); memory by bytes; data by bytes or number of tokens. Using profilers (like PyTorch Profiler or NVIDIA Nsight) helps you measure and optimize.
  • To reduce compute, you can choose a smaller model, smaller batch size, and tune learning rate. Smaller models and batches use fewer operations per step, but may affect accuracy. Learning rate must be balanced so training converges without wasting steps.
  • Memory usage can be lowered with smaller models, smaller batch sizes, or smaller data types (like float16). Gradient checkpointing further saves memory by keeping only some activations during the forward pass and recomputing the rest on the backward pass. This trades extra compute for lower memory, which often pays off.
  • Data usage is tracked in bytes or tokens, the units language models train on. You can reduce data needs by using a smaller dataset, doing data augmentation where appropriate, or fine‑tuning a pre‑trained model (transfer learning). Token counting with your tokenizer ensures you understand your training scale.
  • Using smaller data types like float16 saves memory and can speed up training, but risks numerical precision issues. Mixed precision training uses both float16 and float32 to keep stability while saving memory. PyTorch’s Automatic Mixed Precision (AMP) makes this easy.
  • The core message: know your tools (PyTorch tensors, autograd, GPU devices) and know your limits (compute, memory, data). Measure first with profilers; then make surgical changes like adjusting batch size, enabling checkpointing, or using AMP. This discipline lets you train models efficiently and sustainably.
02Key Concepts

    • 01

      Tensors in PyTorch: A tensor is a multi‑dimensional array of numbers. Think of it like a spreadsheet for 2D, a stack of spreadsheets for 3D, and more stacks for higher dimensions. In PyTorch, you create them with torch.tensor([...]) or with helpers like torch.zeros, torch.ones, and torch.rand. They can live on CPU or GPU and have different data types like float32 or int64. Tensors hold the data that flows through your model, so understanding shapes and dtypes is essential.

    • 02

      Scalar, vector, matrix, higher‑order tensors: A scalar is a single number (0D), a vector is a list of numbers (1D), and a matrix is a grid with rows and columns (2D). Higher‑order tensors just add more axes, like a pile of matrices (3D) or video frames over time (4D). This matters because neural networks often operate on batches (extra dimension) of sequences or images. If you know how to read and change shapes, you can feed data in correctly. For example, a batch of 32 vectors of length 128 is a tensor of shape [32, 128].

    • 03

      Creating tensors and choosing dtypes: You can create tensors from Python lists and control their data types via dtype= arguments (e.g., torch.float32, torch.int64). The dtype affects memory usage and precision: float16 uses half the memory of float32 but is less precise. Matching dtypes avoids errors when doing operations together. Choosing smaller dtypes can save memory and speed, but you must ensure accuracy stays acceptable. For model parameters and calculations that need stability, float32 is safer.

    • 04

      Special tensors and random initialization: torch.zeros and torch.ones create tensors filled with zeros or ones; torch.rand creates random values between 0 and 1. Random initialization is common for starting model parameters before training. The shapes you pass (like (2, 3)) determine how many rows and columns are created. This is like ordering a blank grid of a certain size to fill with numbers. Good initialization and shapes that match your model’s needs are key to training properly.

    • 05

      Element‑wise math operations: Adding, subtracting, multiplying, and dividing tensors operate element by element when shapes match. If x and y are the same shape, x + y adds each position together to make a new tensor. This is like adding two equally sized grids of numbers cell by cell. Element‑wise math builds up computations that later get combined into losses. It’s the bread and butter of neural network layers’ internal calculations.

    • 06

      Matrix multiplication with torch.matmul: Matrix multiplication combines rows of the first matrix with columns of the second, following linear algebra rules. If x has shape [m, n] and y has shape [n, p], then torch.matmul(x, y) produces shape [m, p]. This operation is the heart of linear layers and attention mechanisms. Think of it as mixing features using learned weights to produce new features. Getting shapes right is crucial—if they don’t align, matmul can’t proceed.

    • 07

      Reshaping tensors without copying data: torch.reshape changes how you view the same underlying numbers in a different shape, as long as the total element count stays the same. For example, reshaping a 4x4 tensor (16 elements) to 2x8 still uses the same 16 numbers. This is like reorganizing the same set of Lego bricks into a new shape without adding or removing bricks. Reshaping helps prepare data for specific layers or batched processing. A mismatch in element counts raises an error.

    • 08

      Indexing and slicing to access data: You can pick out single elements with indices (x[0]) and use negative indices (x[-1]) to count from the end. Slicing (x[1:3]) returns ranges and excludes the last index. This is like picking certain rows or columns from a table. Indexing and slicing make it easy to inspect data, split batches, or feed only parts of tensors. Careful slicing is also used to build training batches and sequence windows for language models.

    • 09

      Automatic differentiation (autograd): Autograd computes gradients (derivatives) automatically by tracking operations on tensors that require gradients. When you call backward() on a scalar loss, it applies the chain rule from calculus to compute how each parameter affected the loss. The gradients get stored in each parameter’s .grad field. Without autograd, you’d need to derive and code every gradient by hand, which is error‑prone and slow. Autograd makes training many‑layer models practical.

    • 10

      Backpropagation conceptually: Backpropagation is the procedure that propagates gradients from the output (loss) back to inputs and parameters. It starts at the loss, computes gradients for the last operation, then moves step by step backward through the compute graph. This relies on the chain rule to combine partial derivatives. It’s like tracing a river from the ocean back to the streams that fed it, assigning credit or blame along the way. PyTorch implements this when you call .backward().

    • 11

      Linear regression training loop: Linear regression predicts y from x with y_hat = w*x + b, where w and b are trainable parameters. The loss is mean squared error, the average of (y_hat − y)^2 across samples. You create w and b with requires_grad=True, compute the loss, call loss.backward() to fill w.grad and b.grad, then update them. Repeating this loop makes the line fit the data better. This simple loop mirrors how all neural networks get trained.

    • 12

      Moving work to the GPU: Using .to(device) moves tensors and models to a target device like 'cuda' for GPU. Once there, computations can run much faster due to parallel processing. It’s like switching from a bicycle (CPU) to a motorcycle (GPU) for heavy tasks. You must keep tensors that interact on the same device. Forgetting to move either the model or the data can cause device mismatch errors.

    • 13

      Resource accounting—what and why: Resource accounting means tracking compute (time/FLOPs), memory (bytes), and data (bytes/tokens) during training. It keeps your experiments safe, efficient, and affordable. Without it, you can crash machines with out‑of‑memory errors or run up huge bills. Measuring first, then optimizing, is the disciplined way to work. It’s like budgeting money before going shopping.

    • 14

      Measuring compute and using profilers: Compute can be measured in wall‑clock time or estimated by counting FLOPs. Profilers such as PyTorch Profiler and NVIDIA Nsight help you see which operations take the most time. This guides you to focus optimization where it matters. It’s like checking which appliances in your house use the most electricity before deciding what to replace. Accurate measurement prevents guesswork.

    • 15

      Reducing compute load: You can reduce compute by choosing a smaller model and smaller batch size, which lowers work per step. Adjusting the learning rate affects how many steps you need to converge; it must be balanced. Profiling helps you identify slow layers or kernels to optimize. Sometimes algorithmic changes (like simplifying the model) bring the biggest wins. Always verify that accuracy stays good enough after changes.

    • 16

      Measuring and reducing memory usage: Memory usage includes parameters, activations, gradients, and optimizer states. Tools in PyTorch and NVIDIA Nsight can report peak memory, helping you spot overuse. To cut memory, use smaller models, smaller batches, or smaller dtypes like float16. Gradient checkpointing reduces activation storage by recomputing some activations during backward, trading compute for memory. This often prevents out‑of‑memory errors on large models.

    • 17

      Tracking and reducing data usage: Data can be counted in bytes or tokens (for language tasks). Counting tokens helps estimate training cost and schedule. You can reduce data needs via smaller datasets, data augmentation (for modalities where it makes sense), or transfer learning, where you fine‑tune a model pre‑trained elsewhere. Transfer learning lets you leverage prior knowledge and train with fewer tokens. This is like learning a second language faster because you already know a related one.

    • 18

      Gradient checkpointing: Gradient checkpointing stores only some forward activations and recomputes the rest during backpropagation. It lowers memory demands at the cost of extra compute. In many large models, memory is the tighter bottleneck, so this trade‑off is worth it. PyTorch provides torch.utils.checkpoint to apply this without writing complex code. It enables training deeper or wider models on the same hardware.

    • 19

      Precision trade‑offs and mixed precision: Using float16 halves memory for many tensors and can speed up training, but precision loss can cause instability or poor convergence. Mixed precision training keeps sensitive values (like parameters or certain accumulations) in float32 and uses float16 for others. PyTorch’s Automatic Mixed Precision (AMP) manages casts and scaling for you. This preserves stability while gaining memory and speed benefits. It’s a practical default on modern GPUs.

    • 20

      Putting it all together for language models: The PyTorch workflow (tensors → forward pass → loss → backward → update) is the same pattern used for language models. Resource accounting ensures you choose a model size, batch size, and precision that fit your hardware and budget. Profiling points you to the biggest wins, and techniques like checkpointing and AMP let you scale further. Careful token counting keeps your data budget in check. These practices make training modern models feasible and reproducible.
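Gradient checkpointing from concept 18 can be sketched with torch.utils.checkpoint (the block shape is illustrative, and the use_reentrant=False flag assumes a reasonably recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to store
block = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
x = torch.randn(4, 32, requires_grad=True)

# checkpoint() runs the block without saving intermediate activations,
# then recomputes them during backward: extra compute, less memory.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```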

    03Technical Details

    Overall structure and workflow

    1. Working with tensors
    • What: Tensors are multi‑dimensional arrays of numbers. Shapes describe their dimensions, like [batch, features] or [rows, cols]. Dtypes describe precision and memory—float32 (common default), float16 (half precision), int64 (for indices and integer values), etc.
    • How: Create tensors with torch.tensor([...], dtype=...), or generate with torch.zeros(shape), torch.ones(shape), torch.rand(shape). Inspect shapes with x.shape and dtypes with x.dtype. Convert dtypes with x.to(torch.float16) or x.to(torch.int64). Move between CPU and GPU with x.to(device).
    • Why it matters: Correct shapes and dtypes prevent errors and control memory and speed. For neural networks, using the right shapes for batches, sequences, and features is essential. Choosing smaller dtypes can save memory for large models.
    2. Basic math and linear algebra
    • Element‑wise ops: x + y, x - y, x * y, x / y operate per element for same‑shape tensors. Broadcasting rules can expand singleton dimensions automatically, but beginners should first ensure exact shape matches to avoid confusion.
    • Matrix multiply: torch.matmul(A, B) follows linear algebra rules: [m, n] @ [n, p] → [m, p]. This is the core of linear layers and attention operations. It performs many multiply‑adds internally and is highly optimized on GPUs.
    • Reshape and view: torch.reshape(x, new_shape) returns a tensor with the same data arranged differently, as long as total elements match. For example, 4x4 (16) → 2x8 (16) is valid. Reshaping prepares data for layers or flattens features for linear layers.
    • Index and slice: x[i] accesses an element or row; x[i:j] slices ranges (exclusive of j). Negative indices count from the end. Use this for batching, windowing sequences, or inspecting subsets.
    3. Autograd and backpropagation
    • Concept: Autograd builds a computation graph from operations on tensors that require gradients. Each operation records how outputs depend on inputs.
    • Backward pass: When you call loss.backward(), PyTorch traverses this graph from the loss back to the parameters, computing gradients using the chain rule. The gradient for each parameter is stored in param.grad.
    • Updating parameters: After backward, you update parameters: param -= lr * param.grad (or use an optimizer like torch.optim.SGD). Then zero out gradients with param.grad.zero_() or optimizer.zero_grad() before the next step to prevent accumulation.
    • Importance: Without autograd, you would have to derive and implement gradients manually for every operation. Autograd makes deep learning feasible and fast to prototype.
    4. Simple linear regression example
    • Setup: Suppose x is a tensor of inputs and y is the target outputs, both shapes [N] or [N, 1]. Initialize parameters: w = torch.randn(1, requires_grad=True), b = torch.randn(1, requires_grad=True).
    • Forward: y_hat = w * x + b (broadcasting handles the addition). Compute loss = ((y_hat - y) ** 2).mean() for mean squared error.
    • Backward and update: loss.backward() computes w.grad and b.grad. Update inside torch.no_grad(): w -= lr * w.grad; b -= lr * b.grad (updating a leaf tensor that requires grad must not itself be tracked by autograd). Then zero gradients: w.grad.zero_(); b.grad.zero_(). Repeat for many steps until loss decreases sufficiently.
    • Notes: Use a small learning rate to avoid overshooting. If x and y are batched, ensure shapes match. This toy example mirrors the training loop used for complex models.
    5. Using GPUs
    • Device selection: device = 'cuda' if torch.cuda.is_available() else 'cpu'. Move tensors and models: x = x.to(device); model.to(device).
    • Consistency: All tensors interacting in an operation must be on the same device. Mixing CPU and GPU tensors raises errors.
    • Benefit: GPUs run many operations in parallel, accelerating matrix multiplies and convolution-like operations used in deep learning.
    6. Resource accounting (compute, memory, data)
    • What to track: Compute (time/FLOPs), Memory (bytes), Data (bytes/tokens). These determine cost, speed, and feasibility.
    • Why: Big models and datasets can be expensive and fragile to run. Measuring and budgeting prevents out-of-memory crashes and surprise bills.
    • How to measure compute: Time each epoch or step with Python timers or profilers. Estimate FLOPs with profiling tools that count operations. PyTorch Profiler can record operator times and GPU kernel activity; NVIDIA Nsight can give low-level GPU metrics.
    • How to measure memory: Track peak memory during runs. PyTorch offers memory stats (e.g., torch.cuda.max_memory_allocated()) and profiler support. Nsight tools show GPU memory allocation and fragmentation.
    • How to measure data: Count bytes on disk and, for language models, count tokens with your tokenizer. Tokens are the subword units models actually consume, so they are the true training currency.
    7. Reducing compute
    • Smaller models: Fewer layers, smaller hidden sizes, fewer parameters reduce FLOPs per step.
    • Smaller batch sizes: Fewer examples per step reduce compute and memory per step. Total training time may change depending on convergence speed and data pipeline throughput.
    • Learning rate tuning: The learning rate affects how quickly you reach a good solution. Too high may diverge; too low may require many more steps. Pick a rate that converges efficiently.
    • Profiling-driven changes: Identify slow layers or data bottlenecks and optimize them first for the biggest gains.
    8. Reducing memory
    • Smaller models and batches: Directly lower parameter count and activation footprints.
    • Smaller dtypes: Float16 halves memory for many tensors compared to float32. Int types help for indices.
    • Gradient checkpointing: Keeps only a subset of activations, recomputing the rest in backward. Saves memory at the cost of extra compute; often worth it for large models where memory is tight.
    • Implementation: Use torch.utils.checkpoint.checkpoint(function, *args) to wrap expensive subgraphs. Be mindful that recomputation increases step time.
    9. Reducing data usage
    • Smaller datasets: Train with fewer samples or tokens if acceptable.
    • Token budgeting: For language tasks, limit total tokens processed per epoch or per project to control costs.
    • Data augmentation (when applicable): For images, rotate or add noise to create more variety from limited data. For language, augmentation is trickier and task-dependent.
    • Transfer learning: Start from a model pre-trained elsewhere and fine-tune on your data, reducing the tokens needed to reach good performance.
    10. Precision trade‑offs and AMP
    • Float16 pros/cons: Pros are lower memory and often faster compute. Cons are reduced precision, potential for gradient underflow/overflow, and convergence issues.
    • Mixed precision: Use float16 for activations/gradients and float32 for parameters/accumulations to maintain stability.
    • AMP in PyTorch: Automatic Mixed Precision manages casting and dynamic loss scaling to prevent numerical issues. It simplifies adoption; most code changes are minimal.

    Code and implementation details

    A) Creating tensors

    • From lists: x = torch.tensor([1.0, 2.0, 3.0]) creates a 1D float32 tensor by default. Explicit dtype: torch.tensor([1, 2, 3], dtype=torch.int64) for integers.
    • Special tensors: torch.zeros((2, 3)), torch.ones((2, 3)), torch.rand((2, 3)) produce 2x3 tensors. torch.rand yields uniform random numbers in [0,1).
    • Inspecting: x.shape gives dimensions; x.dtype gives data type. Printing a tensor shows values and dtype.

    B) Basic math

    • Element‑wise ops: Given x and y with the same shape (e.g., [2, 3]): x + y, x - y, x * y, x / y yield same‑shaped results. Keep dtypes compatible (float with float) to avoid implicit casts or errors.
    • Matrix multiply: A = torch.rand((3, 4)); B = torch.rand((4, 5)); C = torch.matmul(A, B) → C.shape is [3, 5]. If dimensions don’t match (e.g., A is [3, 4] and B is [3, 5]), matmul raises an error.

    C) Reshaping and indexing

    • Reshape: x = torch.arange(16).reshape(4, 4) makes a 4x4 matrix. x.reshape(2, 8) produces a 2x8 view of the same 16 values. Total elements must match.
    • Indexing: x[0] returns the first row; x[-1] returns the last row. x[1:3] returns rows 1 and 2 (excludes index 3). Combining with column indices (e.g., x[:, 0]) selects the first column.
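The indexing rules above, verified on a small tensor:

```python
import torch

x = torch.arange(16).reshape(4, 4)   # rows: [0..3], [4..7], [8..11], [12..15]

print(x[0].tolist())     # [0, 1, 2, 3]       first row
print(x[-1].tolist())    # [12, 13, 14, 15]   last row (negative index)
print(x[1:3].tolist())   # rows 1 and 2 (index 3 excluded)
print(x[:, 0].tolist())  # [0, 4, 8, 12]      first column
```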

    D) Autograd workflow

    • Require gradients: w = torch.randn(1, requires_grad=True) tells PyTorch to track ops on w for gradient computation. Values without requires_grad won’t produce gradients.
    • Compute loss: loss = some_function_of_parameters(). The loss must be a scalar to call backward directly; if not, reduce it (e.g., .mean()).
    • Backward: loss.backward() traverses the graph in reverse and populates .grad for parameters. After updating parameters, zero gradients before the next step to avoid accumulation: w.grad.zero_().

    E) Linear regression end‑to‑end

    • Data: x = torch.rand(100); y = 2.0*x + 0.5 + noise, where noise is small random jitter (e.g., 0.01*torch.randn(100)). The targets must depend on x so the model can recover the slope and intercept.
    • Parameters: w, b = torch.randn(1, requires_grad=True), torch.randn(1, requires_grad=True).
    • Training loop (pseudocode): for each step: y_hat = w * x + b; loss = ((y_hat - y) ** 2).mean(); loss.backward(); then, inside with torch.no_grad(): w -= lr * w.grad and b -= lr * b.grad; finally w.grad.zero_() and b.grad.zero_().
    • Explanation: The with torch.no_grad() block prevents autograd from tracking the update itself. After many steps, w and b approach values that minimize MSE.
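The full loop, runnable end-to-end (the noise scale, learning rate, and step count here are illustrative choices, not prescribed by the lecture):

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = 2x + 0.5 plus a little noise.
x = torch.rand(100)
y = 2.0 * x + 0.5 + 0.01 * torch.randn(100)

w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
lr = 0.1

for step in range(1000):
    y_hat = w * x + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # MSE loss
    loss.backward()                    # backward pass
    with torch.no_grad():              # update without autograd tracking
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                     # reset for the next step
    b.grad.zero_()

print(w.item(), b.item())  # approach 2.0 and 0.5
```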

    F) GPU usage

    • Device pick: device = 'cuda' if torch.cuda.is_available() else 'cpu'.
    • Moving data: x = x.to(device); y = y.to(device). Keep all related tensors and the model on the same device.
    • Performance: GPUs accelerate dense math like matrix multiplies. This is especially helpful for large batches and big models.
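A minimal device-selection sketch; it falls back to CPU when no GPU is present:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Move related tensors to the same device before combining them.
x = torch.rand(4, 4).to(device)
y = torch.rand(4, 4).to(device)
z = torch.matmul(x, y)   # runs on the GPU when available

print(z.device)
```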

    G) Profiling compute and memory

    • PyTorch Profiler: Wrap training steps inside a profiler context to record operator times, CPU/GPU activity, and memory. Analyze profiles to find slow or memory‑heavy operations.
    • NVIDIA Nsight: System‑level GPU profiling provides kernel‑level timing and memory insights. Use it when you need deep performance diagnosis.
    • FLOPs estimation: Some profilers estimate FLOPs per layer. Comparing FLOPs across models helps you predict training time and cost.
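A minimal sketch of the PyTorch Profiler wrapping a hot loop; the exact table contents vary by machine (add ProfilerActivity.CUDA when profiling on a GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.rand(256, 256)
B = torch.rand(256, 256)

# Record operator-level CPU timing for the wrapped work.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        torch.matmul(A, B)

# Summarize where time went, sorted by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```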

    H) Memory optimization strategies

    • Model size: Reduce layer counts or hidden sizes. Parameter sharing or pruning can help in advanced setups.
    • Batch size: Lowering batch size often avoids out‑of‑memory errors; gradient accumulation can emulate larger batches if needed.
    • Dtypes: Consider float16 with AMP to balance speed and stability.
    • Gradient checkpointing: Wrap submodules or blocks where activations are large. For example, outputs = checkpoint(block, inputs) in a transformer block saves memory by recomputing activations during backward.
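A small checkpointing sketch using torch.utils.checkpoint; the toy block stands in for a transformer sublayer with large activations:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy "block" whose activations we would rather recompute than store.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 16),
)

inputs = torch.rand(8, 16, requires_grad=True)

# checkpoint skips saving intermediate activations in the forward pass
# and recomputes them during backward (trading compute for memory).
outputs = checkpoint(block, inputs, use_reentrant=False)
loss = outputs.sum()
loss.backward()   # gradients still flow to inputs and parameters
```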

    I) Data budgeting and token counting

    • Tokenization: Language models process tokens (subword pieces), not raw characters or words. Counting tokens processed per step and per epoch shows your true data usage.
    • Dataset size: Control epochs and samples per epoch to cap total tokens. Track cumulative tokens to prevent overspending time and compute.
    • Transfer learning: Fine‑tune from pre‑trained checkpoints to cut required tokens for good performance.
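A back-of-envelope token ledger like the one described above; all the numbers here are hypothetical placeholders:

```python
# Hypothetical budget: track cumulative tokens to cap total data usage.
batch_size = 32        # sequences per step
seq_len = 512          # tokens per sequence
steps_per_epoch = 1000
epochs = 3

tokens_per_step = batch_size * seq_len             # 16,384
tokens_per_epoch = tokens_per_step * steps_per_epoch
total_tokens = tokens_per_epoch * epochs

print(f"{total_tokens:,} tokens total")
```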

    J) Precision management with AMP

    • Dynamic loss scaling: AMP scales losses up to avoid float16 underflow, then scales gradients back down. This maintains useful gradient ranges.
    • Autocast: PyTorch’s autocast context (torch.autocast) runs eligible ops in float16 and others in float32 automatically. You keep accuracy while gaining speed/memory wins.
    • Parameter storage: Keep parameters/optimizer states in float32 for stability; run activations in float16 where safe.
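A minimal autocast sketch. This example uses CPU autocast with bfloat16 so it runs anywhere; on a CUDA GPU you would typically use device_type="cuda" with float16 and pair it with a gradient scaler (torch.cuda.amp.GradScaler) for dynamic loss scaling:

```python
import torch

model = torch.nn.Linear(8, 4)   # parameters stay in float32
x = torch.rand(2, 8)

# autocast runs eligible ops in lower precision automatically,
# while parameters and sensitive ops remain in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)            # low-precision activations
print(model.weight.dtype)   # weights remain float32
```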

    Tips and warnings

    • Device mismatches: If you see errors like Tensors are on different GPUs, ensure all tensors and the model are on the same device.
    • Gradient accumulation: If you reduce batch size due to memory limits, accumulate gradients across several steps before updating to simulate a larger batch.
    • Learning rate balance: A too‑low learning rate can waste compute by needing many steps; too high can diverge. Use learning rate schedules or small pilot runs to tune efficiently.
    • Zeroing gradients: Always zero gradients after updates to prevent unintended accumulation from previous steps.
    • Numerical stability: When using float16, expect occasional instability; rely on AMP and monitor loss for NaNs or infs. Fall back to float32 for sensitive parts if needed.
    • Measure before optimizing: Profile to find real bottlenecks instead of guessing. Optimization without measurement can waste time and harm model quality.
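The gradient-accumulation tip can be sketched as follows; the model, data, and accum_steps value are illustrative:

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4   # simulate a 4x larger effective batch

data = [(torch.rand(2, 8), torch.rand(2, 1)) for _ in range(8)]

for i, (x, y) in enumerate(data):
    loss = ((model(x) - y) ** 2).mean()
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        opt.step()                    # update once per accum_steps batches
        opt.zero_grad()               # clear before the next accumulation
```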

    Q&A highlights embedded

    • Backpropagation: It’s the chain rule applied backward from the loss through each operation to parameters. PyTorch implements it with autograd and .backward(); you don’t code gradients by hand.
    • Gradient checkpointing: Saves memory by storing fewer activations in the forward pass and recomputing them during backward. Use torch.utils.checkpoint to apply it. It trades compute for memory and is often the right choice when memory is the bottleneck.
    • Float16 trade‑offs: Float16 reduces memory and can be faster, but it can hurt convergence due to reduced precision. Mixed precision (with AMP) uses float16 where safe and float32 where needed, offering a strong compromise for stability and efficiency.

    05Conclusion

    You learned the two pillars needed to start training language models effectively: PyTorch fundamentals and resource accounting. On the PyTorch side, you saw how to create and manipulate tensors, choose dtypes, perform element‑wise math and matrix multiplication, reshape and slice data, and use autograd to compute gradients automatically. The linear regression example demonstrated the full training loop—forward pass, loss computation, backward pass, and parameter updates—and showed how that pattern scales up to neural networks. You also learned how to move computations to GPUs using .to(device), unlocking large speedups by leveraging parallel hardware.

    On the resource accounting side, you learned to measure and manage compute, memory, and data. Compute is tracked in time and FLOPs; memory in bytes with tools like PyTorch Profiler and NVIDIA Nsight; data in bytes and especially tokens for language tasks. You saw practical strategies to reduce each: smaller models and batches, careful learning rate tuning, smaller dtypes, gradient checkpointing, token budgeting, augmentation where appropriate, and transfer learning to cut data demands. You also learned why precision choices matter and how mixed precision (AMP) helps balance speed and stability.

    The most important habit to build is to measure before you optimize. Profilers show where your time and memory go, letting you make focused, high‑impact changes. Combine this discipline with engineering tools like checkpointing, AMP, and device management, and you can train models that fit your hardware and budget without surprises. With these foundations in place, you’re ready to explore the Transformer architecture next, using the same training loop and resource mindset to build modern language models.

  • ✓Choose batch size to fit memory and throughput. If you hit OOM, lower the batch size first and consider gradient accumulation to regain effective batch size. This keeps training stable without changing the learning dynamics too much. Always verify that throughput remains acceptable.
  • ✓Leverage GPUs for heavy workloads. Moving computation to 'cuda' often gives order‑of‑magnitude speedups for matrix-heavy code. Ensure the data loader keeps the GPU fed to avoid idle time. Monitor GPU utilization to confirm gains.
  • ✓Adopt mixed precision (AMP) on modern GPUs. It typically halves activation memory and speeds up training while staying stable. Keep a close eye on loss for NaNs or infs, and be ready to disable AMP for very sensitive parts. In most cases, AMP is a net win with minimal code changes.
  • ✓Use gradient checkpointing when memory is the bottleneck. Wrap large submodules to store fewer activations and recompute them in backward. Expect a moderate increase in step time in exchange for fitting larger models or batches. This trade‑off is often crucial for large language models.
  • ✓Plan and track token budgets for language tasks. Count tokens per batch and per epoch to forecast total training cost and time. Compare training from scratch versus fine‑tuning with transfer learning to save tokens. Keeping a token ledger makes decisions transparent.
  • ✓Profile before optimizing architecture. High‑level changes—like reducing hidden size or removing a layer—can cut compute dramatically. Confirm with profiler data which parts dominate runtime and memory. Avoid micro‑optimizing code paths that are not hotspots.
  • ✓Be explicit about precision. Decide when to use float32 versus float16 and justify it based on stability and performance. Document your AMP settings and any exceptions. Clarity prevents subtle bugs and eases collaboration.
  • ✓Guard against silent shape mistakes. Add assertions or small unit tests that check shapes after key transforms (reshape, transpose, concatenate). Catching a wrong shape early prevents cascading errors. This is especially important before matmul operations.
  • ✓Keep experiments reproducible. Fix random seeds for initial tests, and log all key settings: model size, batch size, learning rate, precision, and device. Reproducibility helps you compare runs and justify changes. It also makes debugging far simpler.
  • ✓Balance compute and memory trade‑offs consciously. Smaller models and batches reduce both, while checkpointing reduces memory at a compute cost. Pick the combination that meets your budget and timeline. Document the rationale for future you and teammates.
  • ✓Know when to fine‑tune instead of train from scratch. If a suitable pre‑trained model exists, transfer learning can massively cut data and compute needs. Evaluate early to prevent over‑spending tokens. This strategy accelerates projects without sacrificing quality.

    dtype

    The dtype tells you how numbers are stored, like float32 or float16. It sets precision and memory usage. Smaller dtypes use less memory but can be less accurate. Matching dtypes across tensors avoids errors. Some operations require certain dtypes.

    torch.tensor

    torch.tensor creates a PyTorch tensor from Python data like lists. You can specify dtype to control precision. The resulting tensor can be used in computations and tracked by autograd. It is the standard way to get data into PyTorch. It supports both CPU and GPU placement.

    torch.zeros / torch.ones / torch.rand

    These functions create tensors filled with zeros, ones, or random numbers. You pass the desired shape, and PyTorch allocates the memory. They are convenient for initialization and testing. torch.rand draws uniform values in [0, 1). They help bootstrap model parameters and test data.

    Element‑wise operation

    An element‑wise operation applies the same math to each position in two same‑shaped tensors. Examples include +, -, *, and /. Each output cell depends only on the matching input cells. This is simple and fast to compute. Many layers use element‑wise steps between larger operations.
