Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 2: PyTorch, Resource Accounting | How I Study AI
Beginner
Stanford Online
Deep Learning · YouTube

Key Summary

  • This session teaches two essentials for building language models: PyTorch basics and resource accounting. PyTorch is a library for working with tensors (multi‑dimensional arrays) and can run on CPU or GPU. You learn how to create tensors, perform math (including matrix multiplies), reshape, index/slice, and use automatic differentiation to compute gradients for training.
  • Tensors come in shapes like scalars (0D), vectors (1D), matrices (2D), and higher dimensions. You can create them from Python lists, choose data types (like float32, float16, int64), and make special tensors like all‑zeros, all‑ones, or random values. Operations are element‑wise by default, and matrix multiplication uses torch.matmul with shape compatibility rules.
  • Automatic differentiation (autograd) lets PyTorch compute derivatives for you. By setting requires_grad=True on tensors used as parameters, PyTorch records operations in a compute graph. Calling .backward() on a scalar loss backpropagates gradients, filling .grad on the parameters so you can update them.
  • A simple linear regression training loop demonstrates the workflow: define parameters w and b, compute predictions y_hat, define mean squared error loss, call loss.backward(), and update parameters repeatedly. This shows the core pattern used in training neural networks: forward pass, loss, backward pass, and parameter update. The same pattern scales to big language models.
  • Moving work to the GPU can dramatically speed training and inference. You detect a device (like 'cuda' if available) and move tensors and models with .to(device). Once moved, all operations run on that device automatically until you move them back.
  • Backpropagation is the algorithm autograd uses under the hood to compute gradients. It applies the chain rule from calculus, starting from the loss and moving backward through each layer. You don’t manually code it—PyTorch does it when you call .backward().

Why This Lecture Matters

This material is crucial for anyone who wants to train language models or any deep learning model efficiently and safely. Engineers, researchers, and students often hit practical walls—not because they don’t know the theory, but because they don’t manage compute, memory, and data budgets well. These walls show up as out‑of‑memory crashes, week‑long training runs that go nowhere, or cloud bills that balloon unexpectedly. Learning PyTorch fundamentals lets you express models cleanly, push work to GPUs, and use autograd to train without writing gradients by hand. Learning resource accounting gives you a framework to measure first, then improve, so you can make confident, cost‑aware decisions. In real projects, you must select a batch size that fits your GPU, pick a dtype that balances speed and stability, and choose tricks like gradient checkpointing or mixed precision to scale. You need profiling skills to identify true bottlenecks and not waste time on micro‑optimizations that don’t matter. Token counting helps you plan training schedules and compare the cost of training from scratch versus fine‑tuning. These habits transfer directly to production systems where uptime and cost control are non‑negotiable. Investing in these basics also helps your career: teams prize people who can build models that work within constraints, explain trade‑offs, and deliver results predictably. With industry moving toward ever larger models, the ability to control resources while preserving performance is a standout skill.

Lecture Summary


01Overview

This lecture teaches two pillars you need before building language models: how to use PyTorch effectively and how to do resource accounting responsibly. PyTorch is a popular deep learning library that works with tensors—multi‑dimensional arrays of numbers—and can run its computations on both CPUs and GPUs. You learn how to create tensors from lists, specify data types (like 32‑bit floats or 64‑bit integers), and build special tensors full of zeros, ones, or random values. You also practice basic math operations (addition, subtraction, multiplication, division) that are element‑wise by default, and you learn how to do matrix multiplication with torch.matmul when shapes line up. Beyond math, you learn how to reshape tensors without changing their underlying data and how to index and slice to pick out elements or ranges.
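The operations above can be sketched in a few lines of PyTorch (a minimal illustration; the random values will differ run to run):

```python
import torch

# Create tensors from lists, with explicit dtypes
x = torch.tensor([1.0, 2.0, 3.0])                # float32 by default
idx = torch.tensor([0, 2], dtype=torch.int64)    # integer tensor for indexing

# Special tensors
zeros = torch.zeros((2, 3))
ones = torch.ones((2, 3))
rnd = torch.rand((2, 3))                         # uniform random values in [0, 1)

# Element-wise math (shapes must match) and matrix multiply (inner dims must match)
s = zeros + ones                                 # shape [2, 3]
A = torch.rand((3, 4))
B = torch.rand((4, 5))
C = torch.matmul(A, B)                           # [3, 4] @ [4, 5] -> shape [3, 5]

# Reshape (total element count preserved) and slicing
m = torch.arange(16).reshape(4, 4)
flat = m.reshape(2, 8)                           # same 16 numbers, new view
first_col = m[:, 0]                              # shape [4]
```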

The lecture’s second half focuses on automatic differentiation (autograd). This is the engine that computes gradients—the slopes that tell you how to adjust your model’s parameters to reduce error—so you can train models. By marking tensors as requires_grad=True, PyTorch builds a computation graph as you do operations. When you compute a loss and call .backward(), PyTorch runs backpropagation, which is just the chain rule from calculus applied repeatedly from the loss back through each operation. The result is that parameter tensors (like weights and biases) get their gradients filled in (w.grad, b.grad, etc.), and you can update them with a simple rule like param -= learning_rate * param.grad.
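A minimal autograd example, assuming a single scalar parameter:

```python
import torch

# Mark a tensor as a parameter so autograd tracks operations on it
w = torch.tensor([3.0], requires_grad=True)

# Forward pass builds the compute graph: y = w^2
y = (w ** 2).sum()

# Backward pass applies the chain rule; dy/dw = 2w = 6
y.backward()
print(w.grad)  # tensor([6.])
```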

To make all this concrete, the lecture walks through a linear regression example. Linear regression predicts y from x using a straight line y_hat = w * x + b. You define w and b as tensors with requires_grad=True, compute predictions, measure error with mean squared error (MSE), call backward to get gradients, and update w and b. Repeating this loop reduces the loss. This mirrors the pattern used for neural networks and language models: forward pass, compute loss, backward pass, update parameters. Finally, you see how to accelerate these steps by moving tensors and models to a GPU with .to(device), which can make training much faster.
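The walkthrough above can be condensed into a runnable sketch (the synthetic data, seed, learning rate, and step count here are illustrative choices, not the lecture's exact values):

```python
import torch

torch.manual_seed(0)
# Synthetic data following y = 2x + 0.5 plus a little noise
x = torch.rand(100)
y = 2.0 * x + 0.5 + 0.01 * torch.randn(100)

# Trainable parameters
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

lr = 0.1
for step in range(1000):
    y_hat = w * x + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # mean squared error
    loss.backward()                    # fills w.grad and b.grad
    with torch.no_grad():              # update without autograd tracking
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                     # reset gradients for the next step
    b.grad.zero_()
```

After enough steps, w and b should land close to the true values 2.0 and 0.5.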

The second pillar is resource accounting: tracking compute, memory, and data so you can train large models safely and cost‑effectively. Compute can be measured in time or in FLOPs (floating‑point operations). Memory is measured in bytes and includes model parameters, activations, and optimizer states. Data is measured in bytes or tokens—the units language models actually process. You use profilers (e.g., PyTorch Profiler and NVIDIA Nsight) to measure compute and memory hotspots and then optimize. Strategies to reduce compute include using smaller models and smaller batch sizes; tuning learning rate thoughtfully also affects how efficiently you reach good performance. Memory can be reduced with smaller models, smaller batches, lower‑precision data types like float16, and gradient checkpointing, which saves memory by not storing every intermediate activation and recomputing some during backpropagation. Data usage can be reduced with smaller datasets, careful augmentation (when appropriate), or transfer learning—starting from a model trained elsewhere and fine‑tuning it on your task with fewer tokens.
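The accounting above can be made concrete with back‑of‑the‑envelope arithmetic (the 6‑FLOPs‑per‑parameter‑per‑token figure is a widely used estimate for dense transformer training, and the model and token counts below are arbitrary examples):

```python
# Back-of-the-envelope resource accounting (pure arithmetic; the numbers
# below are common rules of thumb, not exact for every model or optimizer).

params = 1_000_000_000             # a 1B-parameter model
bytes_per_param_fp32 = 4
bytes_per_param_fp16 = 2

# Memory just for the parameters; activations, gradients, and optimizer
# states add substantially more on top of this.
param_mem_fp32_gb = params * bytes_per_param_fp32 / 1e9   # 4.0 GB
param_mem_fp16_gb = params * bytes_per_param_fp16 / 1e9   # 2.0 GB

# A widely used estimate for training compute:
# ~6 FLOPs per parameter per token (forward + backward combined).
tokens = 20_000_000_000            # 20B training tokens
train_flops = 6 * params * tokens  # ~1.2e20 FLOPs
```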

An important nuance is the trade‑off of using smaller data types. Float16 uses half the memory of float32 and can speed things up on modern GPUs, but it loses precision and can make training unstable. Mixed precision training mitigates this by using float16 for activations and gradients while keeping parameters in float32 for stability. PyTorch supports this with Automatic Mixed Precision (AMP), which handles most details for you.
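A minimal AMP sketch, assuming the common autocast + GradScaler pattern; on a CPU‑only machine it simply runs with autocast disabled:

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"          # float16 autocast is a GPU feature

model = torch.nn.Linear(16, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)   # no-op when disabled

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

for _ in range(3):
    opt.zero_grad()
    # Run the forward pass in reduced precision where safe (when enabled)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # scale loss to avoid fp16 underflow
    scaler.step(opt)                # unscale gradients, then optimizer step
    scaler.update()                 # adjust the scale factor for next step
```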

In summary, you come away understanding how to use PyTorch tensors and autograd to build and train models, and how to think like a good engineer about compute, memory, and data budgets. You learn to measure first, then optimize, and to pick techniques like gradient checkpointing and AMP when they make sense. This foundation prepares you for the next step: the Transformer architecture that powers modern language models.

Key Takeaways

  • ✓Start every project with a measurement plan. Profile a short training run to record time per step, peak memory, and token throughput. Use these numbers to choose batch size, model size, and precision. Re‑measure after each major change so you know what really helped.
  • ✓Keep tensors, models, and inputs on the same device. After selecting device='cuda' when available, always move both data and model with .to(device). Device mismatches cause errors and slowdowns. Building a tiny helper function to fetch the device and move objects avoids many bugs.
  • ✓Use shapes and dtypes deliberately. Print x.shape and x.dtype when debugging, and assert expected shapes at boundaries. Prefer float32 for stability and float16 with AMP for memory/speed when supported. Keep integer tensors for indices and counts to avoid accidental casts.
  • ✓Treat matrix multiplication shapes as contracts. Before calling torch.matmul, check that the inner dimensions match (m×n times n×p). If they don’t, rethink your data pipeline or reshape appropriately. Shape errors are among the most common beginner issues. Validating shapes early saves time.
  • ✓Structure your training loop clearly: forward → loss → backward → update → zero grads. Don’t forget to zero gradients after each update to avoid accumulation. Wrap updates in torch.no_grad() to prevent autograd from tracking them. Consistency keeps your loop correct and debuggable.
  • ✓Tune learning rate with intention. Too high leads to exploding loss; too low wastes compute time. Use small pilot runs and learning rate ranges to find a sweet spot. Remember that the best learning rate can change with batch size and precision settings.
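One way to sketch the device helper suggested above (the function name get_device is a hypothetical choice):

```python
import torch

def get_device() -> torch.device:
    """Pick the best available device once, then reuse it everywhere."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = get_device()
model = torch.nn.Linear(8, 2).to(device)   # move model parameters
x = torch.randn(4, 8, device=device)       # create data on the same device
out = model(x)                             # no device-mismatch errors
```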

Glossary

Tensor

A tensor is a multi‑dimensional array of numbers that PyTorch uses to hold data. It can be 0D (a single number), 1D (a list), 2D (a grid), or higher. Tensors have shapes that describe their size and dtypes that describe their precision. They can live on CPU or GPU. Tensors are the basic building blocks of all model computations.

Scalar

A scalar is a tensor with zero dimensions—just one number. It has no length, rows, or columns. Scalars often represent losses or single parameters. They are simple but essential pieces in computations. PyTorch still treats them as tensors.

Vector

A vector is a one‑dimensional tensor, like a list of numbers. It has a length but no rows or columns. Vectors often represent features for one item. They are a common input or output shape in models. Operations on vectors happen element by element.

Matrix

A matrix is a two‑dimensional tensor with rows and columns. It is used for linear algebra, like transforming features. Many neural network layers boil down to matrix multiplications. Understanding matrices helps you reason about shapes. Matrix math is highly optimized on GPUs.

dtype (data type)

A dtype describes how each number in a tensor is stored: float32 (the common default), float16 (half precision), int64 (for indices), and so on. The dtype controls memory use and numerical precision, so choosing it deliberately affects both speed and stability. Tensors in an operation should have compatible dtypes to avoid errors or accidental casts.
  • Resource accounting means tracking compute, memory, and data so you don’t blow your budget or crash machines. Compute can be measured by time or FLOPs (floating‑point operations); memory by bytes; data by bytes or number of tokens. Using profilers (like PyTorch Profiler or NVIDIA Nsight) helps you measure and optimize.
  • To reduce compute, you can choose a smaller model, smaller batch size, and tune learning rate. Smaller models and batches use fewer operations per step, but may affect accuracy. Learning rate must be balanced so training converges without wasting steps.
  • Memory usage can be lowered with smaller models, smaller batch sizes, or smaller data types (like float16). Gradient checkpointing further saves memory by keeping only some activations during the forward pass and recomputing the rest on the backward pass. This trades extra compute for lower memory, which often pays off.
  • Data usage is tracked in bytes or tokens, the units language models train on. You can reduce data needs by using a smaller dataset, doing data augmentation where appropriate, or fine‑tuning a pre‑trained model (transfer learning). Token counting with your tokenizer ensures you understand your training scale.
  • Using smaller data types like float16 saves memory and can speed up training, but risks numerical precision issues. Mixed precision training uses both float16 and float32 to keep stability while saving memory. PyTorch’s Automatic Mixed Precision (AMP) makes this easy.
  • The core message: know your tools (PyTorch tensors, autograd, GPU devices) and know your limits (compute, memory, data). Measure first with profilers; then make surgical changes like adjusting batch size, enabling checkpointing, or using AMP. This discipline lets you train models efficiently and sustainably.
02Key Concepts

    • 01

      Tensors in PyTorch: A tensor is a multi‑dimensional array of numbers. Think of it like a spreadsheet for 2D, a stack of spreadsheets for 3D, and more stacks for higher dimensions. In PyTorch, you create them with torch.tensor([...]) or with helpers like torch.zeros, torch.ones, and torch.rand. They can live on CPU or GPU and have different data types like float32 or int64. Tensors hold the data that flows through your model, so understanding shapes and dtypes is essential.

    • 02

      Scalar, vector, matrix, higher‑order tensors: A scalar is a single number (0D), a vector is a list of numbers (1D), and a matrix is a grid with rows and columns (2D). Higher‑order tensors just add more axes, like a pile of matrices (3D) or video frames over time (4D). This matters because neural networks often operate on batches (extra dimension) of sequences or images. If you know how to read and change shapes, you can feed data in correctly. For example, a batch of 32 vectors of length 128 is a tensor of shape [32, 128].

    • 03

      Creating tensors and choosing dtypes: You can create tensors from Python lists and control their data types via dtype= arguments (e.g., torch.float32, torch.int64). The dtype affects memory usage and precision: float16 uses half the memory of float32 but is less precise. Matching dtypes avoids errors when doing operations together. Choosing smaller dtypes can save memory and speed, but you must ensure accuracy stays acceptable. For model parameters and calculations that need stability, float32 is safer.

    • 04

      Special tensors and random initialization: torch.zeros and torch.ones create tensors filled with zeros or ones; torch.rand creates random values between 0 and 1. Random initialization is common for starting model parameters before training. The shapes you pass (like (2, 3)) determine how many rows and columns are created. This is like ordering a blank grid of a certain size to fill with numbers. Good initialization and shapes that match your model’s needs are key to training properly.

    • 05

      Element‑wise math operations: Adding, subtracting, multiplying, and dividing tensors operate element by element when shapes match. If x and y are the same shape, x + y adds each position together to make a new tensor. This is like adding two equally sized grids of numbers cell by cell. Element‑wise math builds up computations that later get combined into losses. It’s the bread and butter of neural network layers’ internal calculations.

    • 06

      Matrix multiplication with torch.matmul: Matrix multiplication combines rows of the first matrix with columns of the second, following linear algebra rules. If x has shape [m, n] and y has shape [n, p], then torch.matmul(x, y) produces shape [m, p]. This operation is the heart of linear layers and attention mechanisms. Think of it as mixing features using learned weights to produce new features. Getting shapes right is crucial—if they don’t align, matmul can’t proceed.

    • 07

      Reshaping tensors without copying data: torch.reshape changes how you view the same underlying numbers in a different shape, as long as the total element count stays the same. For example, reshaping a 4x4 tensor (16 elements) to 2x8 still uses the same 16 numbers. This is like reorganizing the same set of Lego bricks into a new shape without adding or removing bricks. Reshaping helps prepare data for specific layers or batched processing. A mismatch in element counts raises an error.

    • 08

      Indexing and slicing to access data: You can pick out single elements with indices (x[0]) and use negative indices (x[-1]) to count from the end. Slicing (x[1:3]) returns ranges and excludes the last index. This is like picking certain rows or columns from a table. Indexing and slicing make it easy to inspect data, split batches, or feed only parts of tensors. Careful slicing is also used to build training batches and sequence windows for language models.

    • 09

      Automatic differentiation (autograd): Autograd computes gradients (derivatives) automatically by tracking operations on tensors that require gradients. When you call backward() on a scalar loss, it applies the chain rule from calculus to compute how each parameter affected the loss. The gradients get stored in each parameter’s .grad field. Without autograd, you’d need to derive and code every gradient by hand, which is error‑prone and slow. Autograd makes training many‑layer models practical.

    • 10

      Backpropagation conceptually: Backpropagation is the procedure that propagates gradients from the output (loss) back to inputs and parameters. It starts at the loss, computes gradients for the last operation, then moves step by step backward through the compute graph. This relies on the chain rule to combine partial derivatives. It’s like tracing a river from the ocean back to the streams that fed it, assigning credit or blame along the way. PyTorch implements this when you call .backward().

    • 11

      Linear regression training loop: Linear regression predicts y from x with y_hat = w*x + b, where w and b are trainable parameters. The loss is mean squared error, the average of (y_hat − y)^2 across samples. You create w and b with requires_grad=True, compute the loss, call loss.backward() to fill w.grad and b.grad, then update them. Repeating this loop makes the line fit the data better. This simple loop mirrors how all neural networks get trained.

    • 12

      Moving work to the GPU: Using .to(device) moves tensors and models to a target device like 'cuda' for GPU. Once there, computations can run much faster due to parallel processing. It’s like switching from a bicycle (CPU) to a motorcycle (GPU) for heavy tasks. You must keep tensors that interact on the same device. Forgetting to move either the model or the data can cause device mismatch errors.

    • 13

      Resource accounting—what and why: Resource accounting means tracking compute (time/FLOPs), memory (bytes), and data (bytes/tokens) during training. It keeps your experiments safe, efficient, and affordable. Without it, you can crash machines with out‑of‑memory errors or run up huge bills. Measuring first, then optimizing, is the disciplined way to work. It’s like budgeting money before going shopping.

    • 14

      Measuring compute and using profilers: Compute can be measured in wall‑clock time or estimated by counting FLOPs. Profilers such as PyTorch Profiler and NVIDIA Nsight help you see which operations take the most time. This guides you to focus optimization where it matters. It’s like checking which appliances in your house use the most electricity before deciding what to replace. Accurate measurement prevents guesswork.

    • 15

      Reducing compute load: You can reduce compute by choosing a smaller model and smaller batch size, which lowers work per step. Adjusting the learning rate affects how many steps you need to converge; it must be balanced. Profiling helps you identify slow layers or kernels to optimize. Sometimes algorithmic changes (like simplifying the model) bring the biggest wins. Always verify that accuracy stays good enough after changes.

    • 16

      Measuring and reducing memory usage: Memory usage includes parameters, activations, gradients, and optimizer states. Tools in PyTorch and NVIDIA Nsight can report peak memory, helping you spot overuse. To cut memory, use smaller models, smaller batches, or smaller dtypes like float16. Gradient checkpointing reduces activation storage by recomputing some activations during backward, trading compute for memory. This often prevents out‑of‑memory errors on large models.

    • 17

      Tracking and reducing data usage: Data can be counted in bytes or tokens (for language tasks). Counting tokens helps estimate training cost and schedule. You can reduce data needs via smaller datasets, data augmentation (for modalities where it makes sense), or transfer learning, where you fine‑tune a model pre‑trained elsewhere. Transfer learning lets you leverage prior knowledge and train with fewer tokens. This is like learning a second language faster because you already know a related one.

    • 18

      Gradient checkpointing: Gradient checkpointing stores only some forward activations and recomputes the rest during backpropagation. It lowers memory demands at the cost of extra compute. In many large models, memory is the tighter bottleneck, so this trade‑off is worth it. PyTorch provides torch.utils.checkpoint to apply this without writing complex code. It enables training deeper or wider models on the same hardware.

    • 19

      Precision trade‑offs and mixed precision: Using float16 halves memory for many tensors and can speed up training, but precision loss can cause instability or poor convergence. Mixed precision training keeps sensitive values (like parameters or certain accumulations) in float32 and uses float16 for others. PyTorch’s Automatic Mixed Precision (AMP) manages casts and scaling for you. This preserves stability while gaining memory and speed benefits. It’s a practical default on modern GPUs.

    • 20

      Putting it all together for language models: The PyTorch workflow (tensors → forward pass → loss → backward → update) is the same pattern used for language models. Resource accounting ensures you choose a model size, batch size, and precision that fit your hardware and budget. Profiling points you to the biggest wins, and techniques like checkpointing and AMP let you scale further. Careful token counting keeps your data budget in check. These practices make training modern models feasible and reproducible.
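Gradient checkpointing from concept 18 can be sketched with torch.utils.checkpoint (the block shape is illustrative, and the use_reentrant=False flag assumes a reasonably recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to store
block = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
x = torch.randn(4, 32, requires_grad=True)

# checkpoint() runs the block without saving intermediate activations,
# then recomputes them during backward: extra compute, less memory.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```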

    03Technical Details

    Overall structure and workflow

    1. Working with tensors
    • What: Tensors are multi‑dimensional arrays of numbers. Shapes describe their dimensions, like [batch, features] or [rows, cols]. Dtypes describe precision and memory—float32 (common default), float16 (half precision), int64 (for indices and integer values), etc.
    • How: Create tensors with torch.tensor([...], dtype=...), or generate with torch.zeros(shape), torch.ones(shape), torch.rand(shape). Inspect shapes with x.shape and dtypes with x.dtype. Convert dtypes with x.to(torch.float16) or x.to(torch.int64). Move between CPU and GPU with x.to(device).
    • Why it matters: Correct shapes and dtypes prevent errors and control memory and speed. For neural networks, using the right shapes for batches, sequences, and features is essential. Choosing smaller dtypes can save memory for large models.
    2. Basic math and linear algebra
    • Element‑wise ops: x + y, x - y, x * y, x / y operate per element for same‑shape tensors. Broadcasting rules can expand singleton dimensions automatically, but beginners should first ensure exact shape matches to avoid confusion.
    • Matrix multiply: torch.matmul(A, B) follows linear algebra rules: [m, n] @ [n, p] → [m, p]. This is the core of linear layers and attention operations. It performs many multiply‑adds internally and is highly optimized on GPUs.
    • Reshape and view: torch.reshape(x, new_shape) returns a tensor with the same data arranged differently, as long as total elements match. For example, 4x4 (16) → 2x8 (16) is valid. Reshaping prepares data for layers or flattens features for linear layers.
    • Index and slice: x[i] accesses an element or row; x[i:j] slices ranges (exclusive of j). Negative indices count from the end. Use this for batching, windowing sequences, or inspecting subsets.
    3. Autograd and backpropagation
    • Concept: Autograd builds a computation graph from operations on tensors that require gradients. Each operation records how outputs depend on inputs.
    • Backward pass: When you call loss.backward(), PyTorch traverses this graph from the loss back to the parameters, computing gradients using the chain rule. The gradient for each parameter is stored in param.grad.
    • Updating parameters: After backward, you update parameters: param -= lr * param.grad (or use an optimizer like torch.optim.SGD). Then zero out gradients with param.grad.zero_() or optimizer.zero_grad() before the next step to prevent accumulation.
    • Importance: Without autograd, you would have to derive and implement gradients manually for every operation. Autograd makes deep learning feasible and fast to prototype.
    4. Simple linear regression example
    • Setup: Suppose x is a tensor of inputs and y is the target outputs, both shapes [N] or [N, 1]. Initialize parameters: w = torch.randn(1, requires_grad=True), b = torch.randn(1, requires_grad=True).
    • Forward: y_hat = w * x + b (broadcasting handles the addition). Compute loss = ((y_hat - y) ** 2).mean() for mean squared error.
    • Backward and update: loss.backward() computes w.grad and b.grad. Update inside torch.no_grad(): w -= lr * w.grad; b -= lr * b.grad (updating a leaf tensor that requires grad must not itself be tracked by autograd). Then zero gradients: w.grad.zero_(); b.grad.zero_(). Repeat for many steps until loss decreases sufficiently.
    • Notes: Use a small learning rate to avoid overshooting. If x and y are batched, ensure shapes match. This toy example mirrors the training loop used for complex models.
    5. Using GPUs
    • Device selection: device = 'cuda' if torch.cuda.is_available() else 'cpu'. Move tensors and models: x = x.to(device); model.to(device).
    • Consistency: All tensors interacting in an operation must be on the same device. Mixing CPU and GPU tensors raises errors.
    • Benefit: GPUs run many operations in parallel, accelerating matrix multiplies and convolution-like operations used in deep learning.
    6. Resource accounting (compute, memory, data)
    • What to track: Compute (time/FLOPs), Memory (bytes), Data (bytes/tokens). These determine cost, speed, and feasibility.
    • Why: Big models and datasets can be expensive and fragile to run. Measuring and budgeting prevents out-of-memory crashes and surprise bills.
    • How to measure compute: Time each epoch or step with Python timers or profilers. Estimate FLOPs with profiling tools that count operations. PyTorch Profiler can record operator times and GPU kernel activity; NVIDIA Nsight can give low-level GPU metrics.
    • How to measure memory: Track peak memory during runs. PyTorch offers memory stats (e.g., torch.cuda.max_memory_allocated()) and profiler support. Nsight tools show GPU memory allocation and fragmentation.
    • How to measure data: Count bytes on disk and, for language models, count tokens with your tokenizer. Tokens are the subword units models actually consume, so they are the true training currency.
    7. Reducing compute
    • Smaller models: Fewer layers, smaller hidden sizes, fewer parameters reduce FLOPs per step.
    • Smaller batch sizes: Fewer examples per step reduce compute and memory per step. Total training time may change depending on convergence speed and data pipeline throughput.
    • Learning rate tuning: The learning rate affects how quickly you reach a good solution. Too high may diverge; too low may require many more steps. Pick a rate that converges efficiently.
    • Profiling-driven changes: Identify slow layers or data bottlenecks and optimize them first for the biggest gains.
    8. Reducing memory
    • Smaller models and batches: Directly lower parameter count and activation footprints.
    • Smaller dtypes: Float16 halves memory for many tensors compared to float32. Int types help for indices.
    • Gradient checkpointing: Keeps only a subset of activations, recomputing the rest in backward. Saves memory at the cost of extra compute; often worth it for large models where memory is tight.
    • Implementation: Use torch.utils.checkpoint.checkpoint(function, *args) to wrap expensive subgraphs. Be mindful that recomputation increases step time.
    9. Reducing data usage
    • Smaller datasets: Train with fewer samples or tokens if acceptable.
    • Token budgeting: For language tasks, limit total tokens processed per epoch or per project to control costs.
    • Data augmentation (when applicable): For images, rotate or add noise to create more variety from limited data. For language, augmentation is trickier and task-dependent.
    • Transfer learning: Start from a model pre-trained elsewhere and fine-tune on your data, reducing the tokens needed to reach good performance.
    10. Precision trade‑offs and AMP
    • Float16 pros/cons: Pros are lower memory and often faster compute. Cons are reduced precision, potential for gradient underflow/overflow, and convergence issues.
    • Mixed precision: Use float16 for activations/gradients and float32 for parameters/accumulations to maintain stability.
    • AMP in PyTorch: Automatic Mixed Precision manages casting and dynamic loss scaling to prevent numerical issues. It simplifies adoption; most code changes are minimal.

    Code and implementation details

    A) Creating tensors

    • From lists: x = torch.tensor([1.0, 2.0, 3.0]) creates a 1D float32 tensor by default. Explicit dtype: torch.tensor([1, 2, 3], dtype=torch.int64) for integers.
    • Special tensors: torch.zeros((2, 3)), torch.ones((2, 3)), torch.rand((2, 3)) produce 2x3 tensors. torch.rand yields uniform random numbers in [0,1).
    • Inspecting: x.shape gives dimensions; x.dtype gives data type. Printing a tensor shows values and dtype.

    B) Basic math

    • Element‑wise ops: Given x and y with the same shape (e.g., [2, 3]): x + y, x - y, x * y, x / y yield same‑shaped results. Keep dtypes compatible (float with float) to avoid implicit casts or errors.
    • Matrix multiply: A = torch.rand((3, 4)); B = torch.rand((4, 5)); C = torch.matmul(A, B) → C.shape is [3, 5]. If dimensions don’t match (e.g., A is [3, 4] and B is [3, 5]), matmul raises an error.

    C) Reshaping and indexing

    • Reshape: x = torch.arange(16).reshape(4, 4) makes a 4x4 matrix. x.reshape(2, 8) produces a 2x8 view of the same 16 values. Total elements must match.
    • Indexing: x[0] returns the first row; x[-1] returns the last row. x[1:3] returns rows 1 and 2 (excludes index 3). Combining with column indices (e.g., x[:, 0]) selects the first column.
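The indexing rules above, verified on a small tensor:

```python
import torch

x = torch.arange(16).reshape(4, 4)   # rows: [0..3], [4..7], [8..11], [12..15]

print(x[0].tolist())     # [0, 1, 2, 3]       first row
print(x[-1].tolist())    # [12, 13, 14, 15]   last row (negative index)
print(x[1:3].tolist())   # rows 1 and 2 (index 3 excluded)
print(x[:, 0].tolist())  # [0, 4, 8, 12]      first column
```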

    D) Autograd workflow

    • Require gradients: w = torch.randn(1, requires_grad=True) tells PyTorch to track ops on w for gradient computation. Values without requires_grad won’t produce gradients.
    • Compute loss: loss = some_function_of_parameters(). The loss must be a scalar to call backward directly; if not, reduce it (e.g., .mean()).
    • Backward: loss.backward() traverses the graph in reverse and populates .grad for parameters. After updating parameters, zero gradients before the next step to avoid accumulation: w.grad.zero_().

    E) Linear regression end‑to‑end

    • Data: x = torch.rand(100); y = 2.0*x + 0.5 + noise, where noise is small random jitter (e.g., 0.01*torch.randn(100)). The targets must depend on x so the model can recover the slope and intercept.
    • Parameters: w, b = torch.randn(1, requires_grad=True), torch.randn(1, requires_grad=True).
    • Training loop (pseudocode): for each step: y_hat = w * x + b; loss = ((y_hat - y) ** 2).mean(); loss.backward(); then, inside with torch.no_grad(): w -= lr * w.grad and b -= lr * b.grad; finally w.grad.zero_() and b.grad.zero_().
    • Explanation: The with torch.no_grad() block prevents autograd from tracking the update itself. After many steps, w and b approach values that minimize MSE.
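The full loop, runnable end-to-end (the noise scale, learning rate, and step count here are illustrative choices, not prescribed by the lecture):

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = 2x + 0.5 plus a little noise.
x = torch.rand(100)
y = 2.0 * x + 0.5 + 0.01 * torch.randn(100)

w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
lr = 0.1

for step in range(1000):
    y_hat = w * x + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # MSE loss
    loss.backward()                    # backward pass
    with torch.no_grad():              # update without autograd tracking
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                     # reset for the next step
    b.grad.zero_()

print(w.item(), b.item())  # approach 2.0 and 0.5
```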

    F) GPU usage

    • Device pick: device = 'cuda' if torch.cuda.is_available() else 'cpu'.
    • Moving data: x = x.to(device); y = y.to(device). Keep all related tensors and the model on the same device.
    • Performance: GPUs accelerate dense math like matrix multiplies. This is especially helpful for large batches and big models.
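A minimal device-selection sketch; it falls back to CPU when no GPU is present:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Move related tensors to the same device before combining them.
x = torch.rand(4, 4).to(device)
y = torch.rand(4, 4).to(device)
z = torch.matmul(x, y)   # runs on the GPU when available

print(z.device)
```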

    G) Profiling compute and memory

    • PyTorch Profiler: Wrap training steps inside a profiler context to record operator times, CPU/GPU activity, and memory. Analyze profiles to find slow or memory‑heavy operations.
    • NVIDIA Nsight: System‑level GPU profiling provides kernel‑level timing and memory insights. Use it when you need deep performance diagnosis.
    • FLOPs estimation: Some profilers estimate FLOPs per layer. Comparing FLOPs across models helps you predict training time and cost.
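A minimal sketch of the PyTorch Profiler wrapping a hot loop; the exact table contents vary by machine (add ProfilerActivity.CUDA when profiling on a GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.rand(256, 256)
B = torch.rand(256, 256)

# Record operator-level CPU timing for the wrapped work.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        torch.matmul(A, B)

# Summarize where time went, sorted by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```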

    H) Memory optimization strategies

    • Model size: Reduce layer counts or hidden sizes. Parameter sharing or pruning can help in advanced setups.
    • Batch size: Lowering batch size often avoids out‑of‑memory errors; gradient accumulation can emulate larger batches if needed.
    • Dtypes: Consider float16 with AMP to balance speed and stability.
    • Gradient checkpointing: Wrap submodules or blocks where activations are large. For example, outputs = checkpoint(block, inputs) in a transformer block saves memory by recomputing activations during backward.
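A small checkpointing sketch using torch.utils.checkpoint; the toy block stands in for a transformer sublayer with large activations:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy "block" whose activations we would rather recompute than store.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 16),
)

inputs = torch.rand(8, 16, requires_grad=True)

# checkpoint skips saving intermediate activations in the forward pass
# and recomputes them during backward (trading compute for memory).
outputs = checkpoint(block, inputs, use_reentrant=False)
loss = outputs.sum()
loss.backward()   # gradients still flow to inputs and parameters
```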

    I) Data budgeting and token counting

    • Tokenization: Language models process tokens (subword pieces), not raw characters or words. Counting tokens processed per step and per epoch shows your true data usage.
    • Dataset size: Control epochs and samples per epoch to cap total tokens. Track cumulative tokens to prevent overspending time and compute.
    • Transfer learning: Fine‑tune from pre‑trained checkpoints to cut required tokens for good performance.
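A back-of-envelope token ledger like the one described above; all the numbers here are hypothetical placeholders:

```python
# Hypothetical budget: track cumulative tokens to cap total data usage.
batch_size = 32        # sequences per step
seq_len = 512          # tokens per sequence
steps_per_epoch = 1000
epochs = 3

tokens_per_step = batch_size * seq_len             # 16,384
tokens_per_epoch = tokens_per_step * steps_per_epoch
total_tokens = tokens_per_epoch * epochs

print(f"{total_tokens:,} tokens total")
```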

    J) Precision management with AMP

    • Dynamic loss scaling: AMP scales losses up to avoid float16 underflow, then scales gradients back down. This maintains useful gradient ranges.
    • Autocast: PyTorch’s autocast context (torch.autocast) runs eligible ops in float16 and others in float32 automatically. You keep accuracy while gaining speed/memory wins.
    • Parameter storage: Keep parameters/optimizer states in float32 for stability; run activations in float16 where safe.
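A minimal autocast sketch. This example uses CPU autocast with bfloat16 so it runs anywhere; on a CUDA GPU you would typically use device_type="cuda" with float16 and pair it with a gradient scaler (torch.cuda.amp.GradScaler) for dynamic loss scaling:

```python
import torch

model = torch.nn.Linear(8, 4)   # parameters stay in float32
x = torch.rand(2, 8)

# autocast runs eligible ops in lower precision automatically,
# while parameters and sensitive ops remain in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)            # low-precision activations
print(model.weight.dtype)   # weights remain float32
```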

    Tips and warnings

    • Device mismatches: If you see errors like Tensors are on different GPUs, ensure all tensors and the model are on the same device.
    • Gradient accumulation: If you reduce batch size due to memory limits, accumulate gradients across several steps before updating to simulate a larger batch.
    • Learning rate balance: A too‑low learning rate can waste compute by needing many steps; too high can diverge. Use learning rate schedules or small pilot runs to tune efficiently.
    • Zeroing gradients: Always zero gradients after updates to prevent unintended accumulation from previous steps.
    • Numerical stability: When using float16, expect occasional instability; rely on AMP and monitor loss for NaNs or infs. Fall back to float32 for sensitive parts if needed.
    • Measure before optimizing: Profile to find real bottlenecks instead of guessing. Optimization without measurement can waste time and harm model quality.
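The gradient-accumulation tip can be sketched as follows; the model, data, and accum_steps value are illustrative:

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4   # simulate a 4x larger effective batch

data = [(torch.rand(2, 8), torch.rand(2, 1)) for _ in range(8)]

for i, (x, y) in enumerate(data):
    loss = ((model(x) - y) ** 2).mean()
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        opt.step()                    # update once per accum_steps batches
        opt.zero_grad()               # clear before the next accumulation
```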

    Q&A highlights embedded

    • Backpropagation: It’s the chain rule applied backward from the loss through each operation to parameters. PyTorch implements it with autograd and .backward(); you don’t code gradients by hand.
    • Gradient checkpointing: Saves memory by storing fewer activations in the forward pass and recomputing them during backward. Use torch.utils.checkpoint to apply it. It trades compute for memory and is often the right choice when memory is the bottleneck.
    • Float16 trade‑offs: Float16 reduces memory and can be faster, but it can hurt convergence due to reduced precision. Mixed precision (with AMP) uses float16 where safe and float32 where needed, offering a strong compromise for stability and efficiency.

    05Conclusion

    You learned the two pillars needed to start training language models effectively: PyTorch fundamentals and resource accounting. On the PyTorch side, you saw how to create and manipulate tensors, choose dtypes, perform element‑wise math and matrix multiplication, reshape and slice data, and use autograd to compute gradients automatically. The linear regression example demonstrated the full training loop—forward pass, loss computation, backward pass, and parameter updates—and showed how that pattern scales up to neural networks. You also learned how to move computations to GPUs using .to(device), unlocking large speedups by leveraging parallel hardware.

    On the resource accounting side, you learned to measure and manage compute, memory, and data. Compute is tracked in time and FLOPs; memory in bytes with tools like PyTorch Profiler and NVIDIA Nsight; data in bytes and especially tokens for language tasks. You saw practical strategies to reduce each: smaller models and batches, careful learning rate tuning, smaller dtypes, gradient checkpointing, token budgeting, augmentation where appropriate, and transfer learning to cut data demands. You also learned why precision choices matter and how mixed precision (AMP) helps balance speed and stability.

    The most important habit to build is to measure before you optimize. Profilers show where your time and memory go, letting you make focused, high‑impact changes. Combine this discipline with engineering tools like checkpointing, AMP, and device management, and you can train models that fit your hardware and budget without surprises. With these foundations in place, you’re ready to explore the Transformer architecture next, using the same training loop and resource mindset to build modern language models.

  • ✓Choose batch size to fit memory and throughput. If you hit OOM, lower the batch size first and consider gradient accumulation to regain effective batch size. This keeps training stable without changing the learning dynamics too much. Always verify that throughput remains acceptable.
  • ✓Leverage GPUs for heavy workloads. Moving computation to 'cuda' often gives order‑of‑magnitude speedups for matrix-heavy code. Ensure the data loader keeps the GPU fed to avoid idle time. Monitor GPU utilization to confirm gains.
  • ✓Adopt mixed precision (AMP) on modern GPUs. It typically halves activation memory and speeds up training while staying stable. Keep a close eye on loss for NaNs or infs, and be ready to disable AMP for very sensitive parts. In most cases, AMP is a net win with minimal code changes.
  • ✓Use gradient checkpointing when memory is the bottleneck. Wrap large submodules to store fewer activations and recompute them in backward. Expect a moderate increase in step time in exchange for fitting larger models or batches. This trade‑off is often crucial for large language models.
  • ✓Plan and track token budgets for language tasks. Count tokens per batch and per epoch to forecast total training cost and time. Compare training from scratch versus fine‑tuning with transfer learning to save tokens. Keeping a token ledger makes decisions transparent.
  • ✓Profile before optimizing architecture. High‑level changes—like reducing hidden size or removing a layer—can cut compute dramatically. Confirm with profiler data which parts dominate runtime and memory. Avoid micro‑optimizing code paths that are not hotspots.
  • ✓Be explicit about precision. Decide when to use float32 versus float16 and justify it based on stability and performance. Document your AMP settings and any exceptions. Clarity prevents subtle bugs and eases collaboration.
  • ✓Guard against silent shape mistakes. Add assertions or small unit tests that check shapes after key transforms (reshape, transpose, concatenate). Catching a wrong shape early prevents cascading errors. This is especially important before matmul operations.
  • ✓Keep experiments reproducible. Fix random seeds for initial tests, and log all key settings: model size, batch size, learning rate, precision, and device. Reproducibility helps you compare runs and justify changes. It also makes debugging far simpler.
  • ✓Balance compute and memory trade‑offs consciously. Smaller models and batches reduce both, while checkpointing reduces memory at a compute cost. Pick the combination that meets your budget and timeline. Document the rationale for future you and teammates.
  • ✓Know when to fine‑tune instead of train from scratch. If a suitable pre‑trained model exists, transfer learning can massively cut data and compute needs. Evaluate early to prevent over‑spending tokens. This strategy accelerates projects without sacrificing quality.

    dtype

    The dtype tells you how numbers are stored, like float32 or float16. It sets precision and memory usage. Smaller dtypes use less memory but can be less accurate. Matching dtypes across tensors avoids errors. Some operations require certain dtypes.

    torch.tensor

    torch.tensor creates a PyTorch tensor from Python data like lists. You can specify dtype to control precision. The resulting tensor can be used in computations and tracked by autograd. It is the standard way to get data into PyTorch. It supports both CPU and GPU placement.

    torch.zeros / torch.ones / torch.rand

    These functions create tensors filled with zeros, ones, or random numbers. You pass the desired shape, and PyTorch allocates the memory. They are convenient for initialization and testing. torch.rand draws uniform values in [0, 1). They help bootstrap model parameters and test data.

    Element‑wise operation

    An element‑wise operation applies the same math to each position in two same‑shaped tensors. Examples include +, -, *, and /. Each output cell depends only on the matching input cells. This is simple and fast to compute. Many layers use element‑wise steps between larger operations.
