Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 6: Kernels, Triton

Intermediate
Stanford Online
LLM · YouTube

Key Summary

  • Modern language models are expensive to run because they perform many matrix multiplications. The main cost comes from both compute and moving data in and out of GPU memory. Optimizing the low-level code that runs these operations can make inference and training much faster and cheaper.
  • A kernel is a small function that runs on the GPU in parallel across many lightweight workers. Instead of a CPU loop that processes one item at a time, a kernel lets thousands of GPU threads work together. Writing good kernels is about splitting the work and reading/writing memory efficiently.
  • High-level ML frameworks like PyTorch or TensorFlow call low-level kernels written in C++/CUDA under the hood. We can keep the nice Python API but still get GPU speed by improving these kernels. Changing the kernel often speeds up the whole model.
  • Triton is a domain-specific language (DSL) that lets you write GPU kernels in a Python-like way. It is easier to write than raw CUDA but still generates fast low-level code. Triton compiles your function just-in-time (JIT) and produces GPU code optimized for your hardware.
  • A simple Triton 'add' kernel loads elements from two input arrays, adds them, and stores the result. Each GPU program instance handles a block of elements and uses masks to avoid out-of-bounds memory accesses. You launch the kernel with a grid that says how many blocks to run.
  • Launching a Triton kernel looks like calling a function with special syntax: kernel[grid](args...). You allocate tensors on the GPU with PyTorch, set a BLOCK_SIZE, set a grid (how many programs), and pass the number of elements. The kernel computes in parallel across those programs.
  • Performance on GPUs is often limited by memory bandwidth, not pure math. Triton tries to make memory access patterns fast and coalesced (neighboring threads read neighboring addresses). It also uses techniques like loop unrolling and smart register allocation.

Why This Lecture Matters

Anyone building or deploying large language models faces significant compute and cost challenges, with matrix multiplications dominating both training and inference. Kernel-level optimization directly attacks the biggest bottlenecks, unlocking lower latency, higher throughput, and lower energy use. For ML engineers, learning kernels and Triton means you can tailor fast paths for your specific shapes and workflows instead of waiting on generic library updates. For research scientists, it empowers experiments with new layer designs or fusions that standard libraries don’t yet support. For systems engineers and platform teams, it reduces cloud GPU bills and improves service reliability by speeding up hot paths.

This knowledge applies to real projects by letting you write custom kernels for elementwise ops, reductions, and matmuls, and by tuning tile sizes to your hardware. It makes it possible to fuse steps (e.g., bias + activation) to cut memory traffic and kernel launch overhead. Mastering masks, strides, and pointer arithmetic helps you build robust kernels that work on arbitrary shapes, which is essential in production.

The industry increasingly rewards engineers who can bridge algorithm design and hardware efficiency, and Triton offers a practical route to that skill without diving fully into CUDA. As models grow and serve more users, efficiency matters as much as accuracy. Faster kernels decrease energy consumption, align with sustainability goals, and make advanced AI more accessible. In a competitive landscape, the ability to optimize at the kernel level can differentiate products by speed and cost, turning infrastructure into a strategic advantage.

Lecture Summary


01Overview

This lecture teaches how to make language models faster and cheaper by optimizing the low-level GPU code that actually does the math. It focuses on kernels—the tiny GPU programs that implement operations like matrix multiplication—and shows how to write them using Triton, a high-level, Python-like domain-specific language (DSL) that compiles to efficient CUDA code. The key motivation is that large language models (LLMs) are dominated by matrix multiplications during both training and inference, and the cost of these operations—in time, money, and energy—is very high. Improving the speed of these core kernels delivers immediate and large benefits to end-to-end performance.

You’ll see where computation happens when you write PyTorch code, and how frameworks delegate to deeply optimized kernels implemented in lower-level languages. You’ll learn what a kernel is and why parallelism and memory access patterns matter more than almost anything else on GPUs. Then you will dive into Triton: what it is, how it compiles, and how it optimizes for specific hardware. The lecture walks through two code examples—a simple elementwise add kernel and a tiled matrix multiplication kernel—explaining how they partition work, coordinate across many GPU program instances, and keep memory access efficient. It also highlights practical concerns like avoiding out-of-bounds reads with masks, picking block sizes, and understanding shared memory.

This material is for students and practitioners building or deploying LLMs who want to go beyond high-level APIs and control performance-critical parts of the stack. A basic understanding of Python and PyTorch is helpful. Knowing that GPUs run many threads in parallel and that matrix multiplication is central to neural networks will make it easier to follow. No previous CUDA experience is required because Triton abstracts most low-level details, but an interest in performance and hardware-aware programming will help you get the most from it.

After completing this lecture, you’ll be able to explain what GPU kernels are and why they dominate runtime for LLMs. You’ll be able to write and launch simple Triton kernels and reason about block sizes, grids, masks, and pointer arithmetic. You’ll understand the role of memory bandwidth, shared memory, and tiling in speeding up matrix multiplication. You’ll also know the trade-offs between using Triton versus writing raw CUDA and when each might make sense. Practically, you’ll be prepared to explore optimizing key building blocks like matmul and prepare for more advanced fusions and attention mechanisms in future work.

The lecture is structured in four parts. First, it motivates kernel optimization by connecting LLM cost and performance to matrix multiplications and memory movement. Second, it defines kernels and explains how high-level frameworks call low-level GPU code. Third, it introduces Triton and shows, step by step, how to write and launch a simple add kernel, then a more complex tiled matmul using shared memory-style tiling concepts and block pointers. Fourth, it compares Triton to CUDA, discussing benefits and limitations, and concludes with a reminder that as models scale, hardware-aware optimization grows in importance and is a powerful tool in every ML engineer’s toolbox.

Key Takeaways

  • ✓Always compute the grid from problem size and tile size. For 1D ops, use grid=(ceil_div(n, BLOCK_SIZE),) so every element is covered. If grid×BLOCK_SIZE exceeds n, rely on masks to stay safe. Never under-cover the data or you will leave elements unprocessed.
  • ✓Use masks for all boundary conditions. Create boolean masks for loads and stores so out-of-bounds lanes do nothing. This allows one kernel to handle many shapes and avoids crashes. It also simplifies host code because you don’t need special-case kernels.
  • ✓Prefer tl.constexpr for tile sizes and other compile-time parameters. When the compiler knows sizes, it can unroll loops and pack registers better. This often yields measurable speedups with minimal effort. Pass BLOCK_SIZEs explicitly in your kernel signature.
  • ✓Keep memory access coalesced. Arrange offsets so adjacent lanes access adjacent addresses. For vectors, offsets = base + tl.arange(0, BLOCK_SIZE) is a good default. Check any 2D indexing still produces contiguous regions for efficient loads.
  • ✓Accumulate in higher precision when multiplying. Use float32 for the accumulator even if inputs are float16 or bfloat16. Cast to the output type at the end to balance speed and accuracy. This reduces numerical error in deep loops.
  • ✓Start with correctness on small problems. Validate kernels against PyTorch reference ops (e.g., torch.add, torch.matmul) for small shapes. Then scale up and profile. Fix indexing and mask issues before chasing performance.
  • ✓Tune tile sizes empirically. Try several BLOCK_SIZE_M/N/K combinations and measure on your target GPU. Larger tiles increase reuse but can hurt occupancy if they use too many registers. Find a sweet spot by benchmarking.
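As a concrete sketch of the grid arithmetic in the takeaways above (plain Python, no GPU required; `ceil_div` and `coverage` are illustrative helper names, not Triton APIs):

```python
def ceil_div(n: int, block: int) -> int:
    # Smallest grid size such that grid * block covers all n elements.
    return (n + block - 1) // block

def coverage(n: int, block: int):
    # Count lanes that survive the mask `offsets < n` across the whole grid.
    grid = ceil_div(n, block)
    active = sum(
        1
        for pid in range(grid)
        for lane in range(block)
        if pid * block + lane < n
    )
    return grid, active
```

With n=1000 and BLOCK_SIZE=256 this gives a grid of 4 programs and exactly 1000 active lanes; the 24 extra lanes in the last block are the ones the mask turns off.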

Glossary

Kernel

A small function that runs on the GPU and executes many parallel operations at once. It replaces a slow loop on the CPU with many workers doing the same kind of work on different data. Kernels are where most heavy math in ML actually happens. They decide how to split the job and how to read and write memory. Good kernels make models fast; bad ones make them slow.

Triton

A Python-like language for writing GPU kernels that compiles to fast low-level code. It lets you write compact code with high-level operations like tl.load and tl.store. The compiler optimizes your code for your specific GPU. Triton aims to be easier than CUDA but still very fast.

CUDA

NVIDIA’s platform and programming model for writing GPU code. It is powerful and gives you deep control but can be complex and verbose. Many ML libraries rely on CUDA kernels for speed. Triton can generate CUDA code under the hood.

GPU

A graphics processing unit designed to run many simple operations in parallel. GPUs are great for matrix math, which is full of repeated operations. They have many cores and high memory bandwidth. ML models use GPUs to run quickly.

#triton · #gpu kernel · #cuda · #matrix multiplication · #tiling · #shared memory · #memory bandwidth · #coalesced access · #program_id · #tl.constexpr · #tl.load · #tl.store · #block size · #grid launch · #jit compilation · #register allocation · #loop unrolling · #pytorch · #inference optimization · #hardware-aware optimization
  • Shared memory is a small, very fast scratchpad all threads in a block can use. By loading tiles of A and B matrices into shared memory, you reduce slow global memory reads. Triton can help you orchestrate this tiling for faster matrix multiplication.
  • A matrix multiplication kernel breaks the big problem into tiles along M, N, and K dimensions. Each program accumulates a submatrix (tile) of the output by looping over K tiles. The kernel uses block pointers and masks to load only valid ranges.
  • In the provided matmul example, the intent is to tile A and B, load tiles, multiply-accumulate, then store a result tile. The code shows masks that should reference a K-offset vector; a variable like offsets_k must be defined for correctness. Fixing that mask is essential to avoid reading invalid memory.
  • Triton compared to CUDA: CUDA is the standard and very powerful but low-level and verbose. Triton is higher-level, easier to write, and portable while still generating CUDA under the hood. The trade-off is learning a new DSL and accepting that the ecosystem is still evolving.
  • The big picture: as LLMs grow, kernel and memory optimizations matter more. Even small percentage gains in matmul speed multiply across billions of operations. Knowing kernels and Triton helps you cut costs, save energy, and ship faster-serving models.
02Key Concepts

    • 01

      What a kernel is: A kernel is a small function that runs on the GPU across many lightweight workers in parallel. It replaces a slow CPU for-loop with thousands of concurrent operations. The kernel’s job is to divide work and move data efficiently so each worker handles a slice of the problem. Think of it like dividing a big pizza into many slices so many people can eat at once. Without kernels, deep learning operations would be too slow to be practical.

    • 02

      Where computation really happens: When you write PyTorch or TensorFlow code in Python, your operations are mapped to lower-level kernels in C++/CUDA. The high-level API is the friendly face, but the heavy lifting is done by these compiled kernels. Speed depends on how well these kernels use the GPU hardware. This layering lets you write simple code while still achieving high performance. Improving kernels can accelerate entire models without changing user-facing code.

    • 03

      Why matrix multiplication dominates: In transformers, most compute cost is in matrix multiplications (matmuls) inside linear layers and attention. Matmuls consist of many multiply-accumulate operations that are easy to parallelize. However, moving the required data from memory can be the actual bottleneck. Optimizing both compute and memory access patterns is key. Faster matmuls ripple through to faster training and inference.

    • 04

      GPU parallelism basics: GPUs run many threads that execute similar instructions on different data (SIMD/SPMD style). In Triton, each program instance processes a block of elements, coordinated by program IDs along different axes. This scheme maps neatly onto tiles of tensors and encourages coalesced memory access. Parallelism works best when work is evenly split and memory locations are predictable. Poorly divided work or scattered memory access wastes bandwidth and time.

    • 05

      Memory bandwidth as a bottleneck: Memory bandwidth is how fast data can be read from or written to memory. GPUs have high bandwidth, but LLMs demand even more, so memory often limits speed. Coalesced access means adjacent threads read adjacent addresses, using the bus efficiently. Using fast on-chip storage like shared memory reduces trips to slower global memory. Careful tiling and reuse of data improve bandwidth utilization.

    • 06

      Triton overview: Triton is a Python-like DSL for writing GPU kernels that compiles to fast machine code (e.g., CUDA). It offers high-level constructs (like tl.load, tl.store, program_id) while still exposing low-level control over memory and tiling. Triton’s JIT compiler optimizes code for your specific GPU. It automates things like loop unrolling and register allocation. This balance makes it easier than raw CUDA while keeping performance high.

    • 07

      The add kernel structure: The add kernel loads a block of elements from two input arrays, adds them, and stores results. It computes a per-program offset using program_id and BLOCK_SIZE. A mask ensures it doesn’t read or write past the end. The code vectorizes loads and stores over a small range using tl.arange. These patterns generalize to many elementwise ops.

    • 08

      Launching a Triton kernel: You allocate tensors on the GPU using PyTorch, choose a BLOCK_SIZE, and compute a grid specifying how many programs to run. You then call kernel[grid](args...) to launch. The kernel runs with many identical program instances, each handling a different slice of data. Correct grid sizing ensures every element gets processed. Too small a grid leaves work undone; too large relies on masks to safely ignore out-of-range indices.

    • 09

      Compile-time constants (tl.constexpr): Some kernel parameters like BLOCK_SIZE work best when the compiler knows them at compile time. Triton’s tl.constexpr annotation lets it specialize and optimize code for those values. This can enable loop unrolling and better register allocation. Passing such parameters explicitly is a common tuning strategy. It’s one lever for performance without changing algorithmic behavior.

    • 10

      Masks for safety and performance: Masks guard memory operations so threads don’t access out-of-bounds addresses. In Triton, you pass mask= to tl.load and tl.store to selectively enable lanes. This avoids undefined behavior and crashes while allowing over-provisioned grids. Masking is especially useful for ragged edges when tensor sizes aren’t multiples of block sizes. It makes kernels robust to many shapes.

    • 11

      Shared memory and tiling ideas: Shared memory is a fast on-chip scratchpad for threads in the same block; in Triton, you emulate this benefit by tiling and reusing data with block pointers. The idea is to load a tile of A and B once and use it multiple times to accumulate output tiles. This reduces global memory traffic and increases arithmetic intensity. Properly chosen tile sizes balance reuse with occupancy. Tiling is the core of high-performance matmul.

    • 12

      Matmul kernel organization: A tiled matmul kernel partitions the output matrix into BLOCK_SIZE_M by BLOCK_SIZE_N tiles. It loops over K in steps of BLOCK_SIZE_K, loading corresponding A and B tiles. Each step computes a partial product and accumulates in registers. After looping, it stores the result tile back to global memory. Masks ensure only valid rows/cols get read or written at matrix edges.

    • 13

      Block pointers and strides: tl.make_block_ptr creates structured pointers aware of base address, shape, strides, offsets, and tile shape. This conveys to Triton how to walk matrices efficiently. Strides map 2D indices to 1D memory addresses (e.g., row-major layout). tl.advance moves the block pointer to the next K tile. These abstractions help generate coalesced, cache-friendly memory operations.

    • 14

      Accumulator precision: Accumulation often uses higher precision (e.g., float32) even if inputs are lower precision, to avoid numeric error. After finishing, the accumulator is cast to the output dtype. This pattern balances speed and accuracy. It is standard practice in matmul kernels. It can affect both model quality and performance.

    • 15

      Under-the-hood compiler optimizations: Triton’s compiler applies loop unrolling, register allocation, and memory access optimization. It targets your specific GPU architecture to pick good code paths. These choices influence occupancy, instruction scheduling, and bandwidth use. The goal is to fully utilize compute units while avoiding stalls. Auto-optimization reduces the need for manual micro-tuning.

    • 16

      CUDA vs Triton trade-offs: CUDA is mature, powerful, and the industry standard for NVIDIA GPUs, but is lower-level and verbose. Triton is easier to write, more Pythonic, and portable, while still generating CUDA under the hood. Triton’s ecosystem is newer and evolving, which means learning a fresh DSL and accepting occasional rough edges. For many ML workloads, Triton hits a sweet spot of ease and speed. For niche, bleeding-edge tuning, raw CUDA may still be necessary.

    • 17

      Practical limits and robustness: Kernels must handle sizes not divisible by block sizes, making masks essential. Grids must be chosen to cover all work without starving the GPU. Memory access must be coalesced whenever possible to avoid bandwidth waste. Debugging involves checking shapes, masks, and pointer math first. Small mistakes in indexing can silently hurt performance or correctness.

    • 18

      Why this matters as models scale: As LLMs grow, even 5–10% speedups on matmul save huge compute and energy across billions of calls. Optimized kernels can reduce latency noticeably and lower serving costs. They also make training more affordable and sustainable. Hardware-aware programming becomes a core skill for ML practitioners. Triton offers a practical path to those wins without diving deep into CUDA immediately.

    • 19

      Energy and cost implications: Faster kernels are greener because they finish work with less total energy. They reduce cloud GPU hours and operational bills for inference at scale. Some organizations find kernel work pays for itself quickly in cost savings. It can also enable features that were too slow before. Efficiency is a key competitive advantage for ML systems.

    • 20

      From toy kernels to real workloads: Start with elementwise ops to learn Triton’s model, then move to reductions and matmul. Combine these building blocks for complex layers like attention. You can integrate Triton kernels into PyTorch pipelines with minimal changes. Validating correctness with small test tensors is a must. Then profile and tune tile sizes for your target GPU.

    03Technical Details

    Overall Architecture/Structure

    1. High-level frameworks to kernels
    • You write Python code using PyTorch or TensorFlow. Each tensor operation you call (e.g., add, matmul) maps to one or more kernels. The framework handles autograd, scheduling, and device placement, but the performance-critical work happens in compiled kernels on the GPU. Optimizing those kernels often yields the biggest speedups.

    • The stack looks like: Python (model code) → Framework (PyTorch) → Backend (C++/CUDA/Triton) → GPU Hardware. Triton slots into the backend layer, letting you write kernels in a Pythonic way that compile to CUDA or other low-level representations.

    2. GPU execution model (as used by Triton)
    • GPUs run many lightweight parallel instances. In Triton’s SPMD (Single Program, Multiple Data) model, your kernel function is instantiated many times (program instances). Each instance handles a slice of the tensor defined by program_id(axis=...). The host specifies how many instances to launch via the grid.

    • Inside a kernel, you compute a block’s starting index using program_id × BLOCK_SIZE. tl.arange(0, BLOCK_SIZE) yields per-lane indices within that block. Combined, offsets = start + arange gives the addresses each lane will touch. Masking ensures lanes that fall beyond the end of the data do nothing.

    3. Memory hierarchy and bandwidth
    • Global memory (VRAM) is large but relatively slow to access. Shared memory is a small on-chip buffer shared by threads in the same block (fast). Registers are the fastest and store per-lane temporary values. L2/L1 caches sit between global memory and compute.

    • The chief challenge is feeding data to compute units without stalling. Memory bandwidth (how quickly you can move data) often caps performance. Coalesced memory access—where adjacent lanes access adjacent addresses—maximizes throughput. Tiling into shared memory and reusing tiles increases arithmetic intensity (more computation per byte fetched), improving performance.
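The payoff of tiling can be estimated with a quick back-of-the-envelope helper (a hypothetical function, assuming fp16 inputs at 2 bytes per element and counting only the tile loads for one K-step):

```python
def tile_arithmetic_intensity(bm: int, bn: int, bk: int, dtype_bytes: int = 2) -> float:
    # One K-step of a tiled matmul: load a (bm x bk) tile of A and a
    # (bk x bn) tile of B, then perform 2*bm*bn*bk flops (multiply + add)
    # against those bytes. Higher flops-per-byte means less pressure on
    # memory bandwidth.
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * dtype_bytes
    return flops / bytes_moved
```

A 64×64×32 tile reaches 32 flops per byte loaded, while a 16×16×32 tile manages only 8 — quadrupling the tile height and width quadruples the reuse of each loaded byte, which is exactly why tiling raises arithmetic intensity.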

    Code/Implementation Details

    A) Elementwise add kernel

    Provided kernel:

    import triton
    import triton.language as tl

    @triton.jit
    def add(x_ptr, y_ptr, output_ptr, n_elements: tl.constexpr):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        output = x + y
        tl.store(output_ptr + offsets, output, mask=mask)

    • What it does: Adds two same-shaped 1D tensors elementwise and writes to output.
    • Arguments: x_ptr, y_ptr, output_ptr are pointers to GPU memory holding the inputs/outputs. n_elements is the total number of elements (compile-time constant here), used to bounds-check.
    • tl.program_id(axis=0): Returns which program instance you are (0..grid[0]-1). Each instance processes a block of elements.
    • BLOCK_SIZE and tl.arange(0, BLOCK_SIZE): Define how many elements a program handles and their per-lane indices. offsets are the absolute indices for this instance.
    • mask = offsets < n_elements: Prevents reading/writing beyond array end when n_elements isn’t a multiple of BLOCK_SIZE.
    • tl.load and tl.store: Vectorized memory operations. Adding x and y is a vectorized elementwise add.
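To make these pieces concrete, the same indexing-and-masking scheme can be simulated on the CPU in plain Python (illustrative only — `add_kernel_sim` mimics what each program instance does; it is not Triton code):

```python
def add_kernel_sim(x, y, n_elements, BLOCK_SIZE, grid):
    # CPU simulation of the SPMD add kernel: each "program instance" (pid)
    # handles one block of offsets, and the mask suppresses lanes that
    # fall past the end of the data.
    output = [0.0] * n_elements
    for pid in range(grid):                                     # tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = [block_start + i for i in range(BLOCK_SIZE)]  # tl.arange
        for off in offsets:
            if off < n_elements:                                # mask = offsets < n_elements
                output[off] = x[off] + y[off]                   # masked load / add / store
    return output
```

Running it with n_elements=10, BLOCK_SIZE=4, grid=3 touches every element exactly once even though 3 × 4 = 12 lanes are launched — the two extra lanes are masked off, just as in the real kernel.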

    Recommended refinement: pass BLOCK_SIZE as tl.constexpr to make it explicit and help the compiler specialize:

    @triton.jit
    def add(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(output_ptr + offsets, x + y, mask=mask)

    B) Launching the add kernel from Python

    import torch

    x = torch.randn(1024, device='cuda')
    y = torch.randn(1024, device='cuda')
    output = torch.zeros(1024, device='cuda')
    BLOCK_SIZE = 256
    grid = (4,)
    add[grid](x, y, output, x.numel())

    • Tensors: x, y, output are 1D tensors on the GPU (device='cuda'). x.numel() = 1024 gives the total element count.
    • BLOCK_SIZE: Each program handles 256 elements.
    • grid = (4,): Launches 4 program instances in axis 0. 4 × 256 = 1024 elements, so every element is covered exactly once.
    • add[grid](...): Special Triton syntax to JIT-compile and launch the kernel.

    If using the refined kernel with explicit BLOCK_SIZE in the signature, pass it too: add[grid](x, y, output, x.numel(), BLOCK_SIZE)

    C) Under-the-hood: JIT and optimizations

    • @triton.jit compiles the decorated function into GPU code on first call for given constant parameters (like BLOCK_SIZE). The compiler emits CUDA-level code for NVIDIA GPUs, applying optimizations such as loop unrolling (turning small loops into straight-line code), register allocation (deciding which values stay in fast registers), and memory access optimization (arranging loads/stores for coalescing).
    • Because tl.constexpr values are known at compile time, the compiler can generate specialized, faster code based on those sizes.

    D) Tiled matrix multiplication kernel Goal: Compute C[M, N] = A[M, K] @ B[K, N] using tiles. Each program computes a tile of C of shape (BLOCK_SIZE_M, BLOCK_SIZE_N), accumulating over K in steps of BLOCK_SIZE_K.

    Provided skeleton:

    import triton
    import triton.language as tl

    @triton.jit
    def matmul(a_ptr, b_ptr, output_ptr, M, N, K,
               BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr):
        pid_m = tl.program_id(axis=0)
        pid_n = tl.program_id(axis=1)

        block_start_m = pid_m * BLOCK_SIZE_M
        block_start_n = pid_n * BLOCK_SIZE_N

        offsets_m = block_start_m + tl.arange(0, BLOCK_SIZE_M)
        offsets_n = block_start_n + tl.arange(0, BLOCK_SIZE_N)

        a_block_ptr = tl.make_block_ptr(
            base_ptr=a_ptr,
            shape=(M, K),
            strides=(K, 1),
            offsets=(block_start_m, 0),
            block_shape=(BLOCK_SIZE_M, BLOCK_SIZE_K),
            order=(1, 0)
        )
        b_block_ptr = tl.make_block_ptr(
            base_ptr=b_ptr,
            shape=(K, N),
            strides=(N, 1),
            offsets=(0, block_start_n),
            block_shape=(BLOCK_SIZE_K, BLOCK_SIZE_N),
            order=(1, 0)
        )

        accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_SIZE_K):
            a = tl.load(a_block_ptr, mask=(offsets_m[:, None] < M) & (offsets_k[None, :] < K))
            b = tl.load(b_block_ptr, mask=(offsets_k[:, None] < K) & (offsets_n[None, :] < N))
            accumulator += tl.dot(a, b)
            a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_SIZE_K))
            b_block_ptr = tl.advance(b_block_ptr, (BLOCK_SIZE_K, 0))

        output = accumulator.to(output_ptr.dtype.element_ty)
        tl.store(output_ptr + offsets_m[:, None] * N + offsets_n[None, :], output,
                 mask=(offsets_m[:, None] < M) & (offsets_n[None, :] < N))
    • Intent: Use block pointers to describe tiles of A and B matrices and iterate along K. The accumulator holds a BLOCK_SIZE_M × BLOCK_SIZE_N partial result in higher precision (float32). After looping over K tiles, the result is cast to output dtype and written to the correct output slice with masking at the boundaries.
    • Important fix: offsets_k is referenced but not defined. Define it inside the K loop so it indexes the current tile along K — the loop offset k must be included, otherwise the mask never detects the final partial tile:

    offsets_k = k + tl.arange(0, BLOCK_SIZE_K)

    Then the masks become valid. A corrected version of the innermost section:

    for k in range(0, K, BLOCK_SIZE_K):
        offsets_k = k + tl.arange(0, BLOCK_SIZE_K)
        a = tl.load(a_block_ptr, mask=(offsets_m[:, None] < M) & (offsets_k[None, :] < K))
        b = tl.load(b_block_ptr, mask=(offsets_k[:, None] < K) & (offsets_n[None, :] < N))
        accumulator += tl.dot(a, b)

    • Strides: For A[M, K] in row-major, stride for row is K and for column is 1. For B[K, N], stride for row is N and for column is 1. These strides map 2D indices to addresses. tl.make_block_ptr uses shape, strides, and offsets to locate tiles.
    • tl.advance: Moves the pointer to the next tile along K by adding (row_delta, col_delta) in tile units.
    • Mask logic: At matrix edges (where the last tile may be partially outside M, N, or K), masks prevent out-of-bounds loads/stores. This lets a uniform grid cover the entire matrix without conditional control flow.
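The stride rule can be captured in a one-line helper (illustrative, mirroring the row-major mapping described above):

```python
def linear_index(row: int, col: int, stride_row: int, stride_col: int = 1) -> int:
    # Row-major mapping from a 2D index to a flat memory offset:
    # linear_index = row * stride_row + col * stride_col.
    return row * stride_row + col * stride_col
```

For A with shape (M, K) stored row-major, the strides are (K, 1): element A[2, 5] with K = 8 sits at flat offset 2*8 + 5 = 21, which is exactly how tl.make_block_ptr walks the matrix from shape and strides.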

    Role of Each Component

    • pid_m, pid_n: Identify which output tile this program computes along M and N grids.
    • BLOCK_SIZE_M/N/K: Tuneable tile sizes; compile-time constants for specialization. Larger tiles can increase reuse but may reduce occupancy; smaller tiles may increase overhead.
    • accumulator: Keeps partial results in registers; prefer float32 for accumulation to reduce numerical error.
    • tl.dot: Performs a matrix multiply between the loaded A and B tiles and returns a BLOCK_SIZE_M × BLOCK_SIZE_N result to accumulate.
    • tl.store with stride arithmetic: output_ptr + offsets_m[:, None] * N + offsets_n[None, :] computes linear indices for a 2D region in row-major layout.

    Tools/Libraries Used

    • PyTorch: Allocates tensors on the GPU and provides easy interop with Triton by passing PyTorch tensors (their underlying storage) as pointers to kernels.
    • Triton: DSL and compiler. Key language elements used: • @triton.jit: JIT-compile the function to GPU code. • triton.language as tl: Namespace for kernel intrinsics (program_id, arange, load, store, zeros, dot, make_block_ptr, advance). • tl.constexpr: Marks parameters known at compile time for specialization.
    • CUDA (implicitly): The generated code targets NVIDIA GPUs, leveraging CUDA capabilities under the hood.

    Step-by-Step Implementation Guide

    1. Install and set up
    • Ensure a machine with an NVIDIA GPU and CUDA-compatible drivers. Install PyTorch with CUDA support. Install Triton (often pip install triton). Verify a minimal Triton script runs without errors.
    2. Write and run the add kernel
    • Define the kernel (with explicit BLOCK_SIZE as tl.constexpr). Choose BLOCK_SIZE=256 as a starting point. Create input tensors with torch.randn and an output with torch.zeros on device='cuda'. Set grid=(ceil_div(n, BLOCK_SIZE),). Launch kernel, then check correctness with torch.allclose(output, x + y).
    3. Choose block sizes and grids
    • Compute grid size as grid=(math.ceil(n / BLOCK_SIZE),). If grid × BLOCK_SIZE > n, rely on mask to avoid out-of-bounds. If it’s less than n, some elements won’t be processed. Start with 128 or 256 and adjust based on profiling.
    4. Profile
    • Use torch.cuda.synchronize() before/after timing to measure only GPU time. Use torch.cuda.Event for accurate timing across many runs. Compare against torch.add to ensure the kernel is in the right performance ballpark. Expect the Triton version to be competitive with framework kernels for educational cases.
    5. Implement a simple matmul
    • Start with shapes M, N, K multiples of your tile sizes (e.g., M=N=K=1024, BLOCK_SIZE_M=BLOCK_SIZE_N=64, BLOCK_SIZE_K=32). Compute grid=(ceil_div(M, BLOCK_SIZE_M), ceil_div(N, BLOCK_SIZE_N)). Initialize output to zeros. Launch kernel with tl.constexpr tile sizes.
    6. Add masks and edge handling
    • Add offsets_k = tl.arange(0, BLOCK_SIZE_K) and create masks for A and B loads and for C stores. Test with sizes not divisible by tile sizes (e.g., M=1000, N=1010, K=1030). Verify correctness against torch.matmul within a tolerance (consider accumulation dtype and casting).
    7. Tune tile sizes
    • Try different BLOCK_SIZE_M/N/K values (e.g., 64×64×32, 128×64×32) and measure. Larger tiles can reduce memory traffic but may increase register pressure or reduce occupancy. Look for a sweet spot that is consistently faster on your GPU. Keep masks and correctness tests in place.
    8. Verify numerical behavior
    • Accumulate in float32 and cast to output dtype at the end. Compare to PyTorch matmul for various dtypes (float16, bfloat16) to ensure acceptable error. If errors grow, consider smaller BLOCK_SIZE_K or mixed-precision strategies.
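The tiled matmul and masking steps above can be sketched in pure Python. `tiled_matmul_sim` is an illustrative name; the `min(...)` bounds play the role of Triton's masks at ragged edges, and the outer two loops stand in for the 2D grid of programs:

```python
import math

def tiled_matmul_sim(A, B, BM=4, BN=4, BK=2):
    """Pure-Python sketch of the tiled matmul loop structure with edge masks."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # 2D grid: one "program" per output tile
    for pid_m in range(math.ceil(M / BM)):
        for pid_n in range(math.ceil(N / BN)):
            # accumulate partial products over K in steps of BK
            for k0 in range(0, K, BK):
                for i in range(pid_m * BM, min((pid_m + 1) * BM, M)):      # row mask
                    for j in range(pid_n * BN, min((pid_n + 1) * BN, N)):  # col mask
                        for k in range(k0, min(k0 + BK, K)):               # K mask
                            C[i][j] += A[i][k] * B[k][j]
    return C

# Odd sizes exercise the partial tiles at every edge
A = [[float(i + j) for j in range(3)] for i in range(5)]      # 5x3
B = [[float(i * j + 1) for j in range(6)] for i in range(3)]  # 3x6
C = tiled_matmul_sim(A, B)
```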

    Tips and Warnings

    • Always mask: Out-of-bounds memory accesses can cause crashes or silent data corruption. Masks on tl.load/tl.store make kernels robust to ragged edges.
    • Keep memory coalesced: Ensure offsets are contiguous across lanes. For 1D add, x_ptr + offsets yields coalesced loads/stores if offsets is a consecutive range.
    • Pass compile-time sizes as tl.constexpr: This enables stronger compiler optimizations. It’s a low-effort way to gain performance.
    • Check pointer math: For 2D arrays, get strides right. In row-major, linear_index = row * stride_row + col * stride_col; common is stride_row = N and stride_col = 1 for shape (rows, cols) with contiguous storage.
    • Start simple: Get correctness on small inputs (e.g., 64 elements) before scaling. Print or assert shapes and check masks cover what you expect (e.g., sum of mask equals elements processed).
    • Balance tile sizes: Oversized tiles can increase register pressure and reduce occupancy, hurting performance. Undersized tiles increase overhead and reduce reuse. Tuning is empirical.
    • Compare to strong baselines: torch.add and torch.matmul are highly optimized; use them as references to judge your kernel’s performance. Don’t be discouraged—matching them is hard; the goal is to learn and target custom cases.
    • Be aware of hardware variability: Optimal tile sizes may differ across GPUs (e.g., RTX vs A100). Triton specializes code per compile-time configuration, so benchmark on your target device.
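The pointer-math tip above can be captured in a tiny helper (pure Python, illustrative names), which is worth unit-testing before embedding the same arithmetic in a kernel:

```python
def linear_index(row, col, stride_row, stride_col):
    """Row-major pointer math: element (row, col) lives at this flat offset."""
    return row * stride_row + col * stride_col

# A contiguous array of shape (rows, cols) = (3, 4) has
# stride_row = 4 and stride_col = 1.
flat = list(range(12))        # stand-in for the underlying storage
rows, cols = 3, 4
offset = linear_index(2, 1, cols, 1)   # 2*4 + 1 = 9
assert flat[offset] == 9
```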

    Data Flow Summary

    • Elementwise add: Host computes grid from n and BLOCK_SIZE → each program computes block_start from pid → the offsets vector generates per-lane addresses → masked tl.load of x and y → local add → masked tl.store to output.
    • Tiled matmul: Host computes 2D grid from M, N and tile sizes → each program chooses its output tile via (pid_m, pid_n) → for k in [0..K) with step BLOCK_SIZE_K: block_ptrs load A/B tiles with masks → tl.dot to get partial result → accumulate → advance pointers → after loop, cast and store output tile with mask.

    CUDA vs Triton Mechanics

    • CUDA: You write kernels in C++-like syntax, manage thread/block indexing, shared memory explicitly, and handle many low-level details. You compile with nvcc, link, and call kernels from host code. Control is maximal, but code is verbose and complex.
    • Triton: You write kernels in a Python file, with a compact API for indexing, loading/storing, and tile management. You launch with kernel[grid](args...), and Triton compiles just-in-time, optimizing for your hardware. It hides many complexities while letting you express the core performance-critical structure.

    04 Examples

    • 💡

      Elementwise Add Basics: Inputs are two 1D tensors x and y of length 1024 on the GPU. The kernel computes offsets for each program using pid and BLOCK_SIZE, loads x and y at those offsets with a mask, adds them, and stores into output. With BLOCK_SIZE=256 and grid=(4,), each program handles 256 elements. The output matches torch.add(x, y) for all indices.

    • 💡

      Guarding Ragged Ends with Masks: Suppose n_elements=1000 but BLOCK_SIZE=256 and grid=(4,), which covers 1024 indices. Offsets 1000–1023 are beyond the end, but mask=(offsets < 1000) disables those lanes. tl.load and tl.store don’t touch invalid memory thanks to mask. This keeps one clean kernel for all sizes without special cases.
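A quick pure-Python check of the per-program mask counts for this example (illustrative, not Triton code):

```python
n, BLOCK_SIZE, grid = 1000, 256, 4
active = []
for pid in range(grid):
    offsets = range(pid * BLOCK_SIZE, (pid + 1) * BLOCK_SIZE)
    mask = [o < n for o in offsets]
    active.append(sum(mask))
# Programs 0-2 are fully active; program 3 has 1000 - 768 = 232 live lanes.
assert active == [256, 256, 256, 232]
assert sum(active) == n
```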

    • 💡

      Choosing the Grid: For n_elements=10_000 and BLOCK_SIZE=256, compute grid=(ceil(10000/256),)=(40,). That launches 40 programs, each doing 256 elements, totaling 10,240 potential lanes. The mask ensures only the first 10,000 are active. This strategy is simple and safe.

    • 💡

      Coalesced Memory Access: Using offsets = base + tl.arange(0, BLOCK_SIZE) makes neighboring lanes access neighboring addresses. tl.load(x_ptr + offsets) then issues coalesced reads, which are faster. If offsets were scattered, the GPU would perform many small, inefficient memory transactions. Coalescing boosts effective bandwidth.

    • 💡

      Matmul Tile Ownership: With M=N=K=1024 and BLOCK_SIZE_M=BLOCK_SIZE_N=64, the grid becomes (16, 16). Program (pid_m=3, pid_n=7) owns rows 192–255 and cols 448–511 of the output. It loops over K in steps of BLOCK_SIZE_K (e.g., 32), accumulating partial results for its 64×64 tile. After finishing, it writes the tile back using a masked store.
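The ownership arithmetic can be verified in a few lines of plain Python (same pid_m, pid_n, and tile sizes as in the example):

```python
BM = BN = 64
pid_m, pid_n = 3, 7
rows = range(pid_m * BM, (pid_m + 1) * BM)   # rows 192..255 of the output
cols = range(pid_n * BN, (pid_n + 1) * BN)   # cols 448..511 of the output
assert (rows.start, rows.stop - 1) == (192, 255)
assert (cols.start, cols.stop - 1) == (448, 511)
```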

    • 💡

      Fixing the offsets_k Bug: The provided matmul code references offsets_k but doesn’t define it. Define offsets_k = tl.arange(0, BLOCK_SIZE_K) before loads. Now masks like (offsets_k[None, :] < K) correctly guard the K dimension. Without this, the code could read invalid memory or crash.

    • 💡

      Block Pointers and Strides: For A[M, K] in row-major, a_ptr’s stride for rows is K and for columns is 1. tl.make_block_ptr(base=a_ptr, shape=(M, K), strides=(K, 1), offsets=(block_start_m, 0), block_shape=(BM, BK), order=(1, 0)) tells Triton how to fetch the tile A[block_start_m:block_start_m+BM, 0:BK]. Similarly for B with shape=(K, N) and strides=(N, 1). These descriptors help Triton generate efficient loads.

    • 💡

      Accumulator Precision and Casting: In matmul, accumulator = tl.zeros((BM, BN), dtype=tl.float32) ensures stable accumulation even if inputs are float16. After looping over K tiles, output = accumulator.to(output_ptr.dtype.element_ty) casts to the desired output dtype. This balances speed with numeric stability. Comparing to torch.matmul verifies acceptable error.
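Python's `struct` module can round-trip values through IEEE half precision (format code `'e'`), which makes the accumulation hazard concrete. This is a pure-Python illustration of why float32 accumulators matter, not the Triton code path:

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision (struct format 'e')."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulate 4096 ones. A float16 accumulator stalls at 2048, because
# above 2048 the spacing between representable fp16 values is 2, and
# 2048 + 1 rounds back down to 2048 (ties-to-even).
acc16 = 0.0
acc32 = 0.0
for _ in range(4096):
    acc16 = to_fp16(acc16 + 1.0)
    acc32 = acc32 + 1.0     # Python float stands in for a float32 accumulator
assert acc16 == 2048.0
assert acc32 == 4096.0
```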

    • 💡

      Launching 2D Grids: For matmul, the grid is 2D: grid=(ceil_div(M, BM), ceil_div(N, BN)). Each program computes one BM×BN output tile, reading pid_m and pid_n to locate it. This 2D layout maps naturally to matrices and keeps indexing simple.

    • 💡

      Performance Tuning via Tile Sizes: Try BM, BN, BK combinations like 64×64×32 vs 128×64×32. Measure runtime with torch.cuda.Event and multiple iterations. Larger tiles may reduce memory traffic but could lower occupancy if they use too many registers. Empirical testing finds a good balance for your GPU.

    • 💡

      Comparing to PyTorch Matmul: Create random A and B and compute C_torch = A @ B. Run the Triton matmul and compare with torch.allclose(C_triton, C_torch, atol=1e-2, rtol=1e-2) for float16/bfloat16. If mismatch occurs, check masks and pointer arithmetic. This ensures correctness before deeper tuning.

    • 💡

      Elementwise Add vs Framework Op: Timing add kernel vs torch.add shows how close a simple Triton kernel can get to a framework primitive. Use 10–100 warm-up runs and 100 timed runs with synchronization to get stable numbers. If slower, confirm coalescing and BLOCK_SIZE choices. Such experiments build intuition for memory-bound kernels.

    • 💡

      Memory-Bound Behavior: Increase tensor size from 1e5 to 1e7 elements for the add kernel. You’ll see runtime scale roughly with data size, showing memory bandwidth limits. Changing BLOCK_SIZE won’t change speed much once coalescing is good. This illustrates why bandwidth optimizations are critical.

    • 💡

      Robust Edge Handling: Set M=1000, N=1010, K=1030 with tile sizes 64×64×32. The last tiles along each dimension are partial. Masks on loads and stores prevent errors and produce correct results. This pattern keeps kernels simple and general-purpose.

    • 💡

      Grid Too Small Failure Mode: If grid=(3,) with BLOCK_SIZE=256 for n=1024, you process only 768 elements. The remaining 256 never run, and output is partially wrong. Always compute grid from size and BLOCK_SIZE; masks don’t fix under-coverage.
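This failure mode can be checked directly; the sketch below (illustrative, pure Python) counts how many indices three programs ever touch:

```python
n, BLOCK_SIZE = 1024, 256
bad_grid = 3                        # should be ceil(1024 / 256) = 4
processed = set()
for pid in range(bad_grid):
    for o in range(pid * BLOCK_SIZE, (pid + 1) * BLOCK_SIZE):
        if o < n:                   # the mask only prevents overshoot,
            processed.add(o)        # it cannot create missing programs
assert len(processed) == 768        # 256 elements are never processed
```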

    05 Conclusion

    Large language models rely heavily on matrix multiplications, and the performance of these operations hinges on the quality of the GPU kernels that implement them. By understanding where computation really happens—inside compiled kernels called by high-level frameworks—you gain leverage over the most expensive parts of training and inference. Triton provides a practical, Python-like way to write these kernels while still generating highly optimized GPU code tailored to your hardware. The elementwise add example shows the fundamentals: program IDs define work slices, arange creates per-lane offsets, and masks ensure safety at boundaries. The tiled matmul example adds the core ideas of high-performance GPU computing: tiling along M, N, and K, using structured block pointers and strides, accumulating partial results with stable precision, and carefully masking loads and stores.

    The most important lesson is that memory bandwidth and data movement patterns often limit speed more than raw compute. Coalesced access, tiling, and reuse are the keys to unlocking performance. Triton’s compiler helps with low-level optimizations like loop unrolling and register allocation, letting you focus on the algorithmic structure—how to partition work and how to move data. Compared to raw CUDA, Triton reduces complexity and speeds development, although it is a newer ecosystem that you must learn and keep up with as it evolves.

    To practice, implement and validate the add kernel, then build a tiled matmul and fix edge cases with masks. Profile with different tile sizes to understand trade-offs between reuse and occupancy. Next steps include exploring fused kernels for transformer blocks, experimenting with attention mechanisms, and comparing Triton performance with vendor libraries. Continue learning about GPU memory hierarchies, occupancy, and numerical precision strategies to push performance further.

    The core message to remember: as models scale, hardware-aware kernel optimization becomes a major lever for speed, cost, and energy savings. Even modest gains in matmul performance have large end-to-end effects. With Triton, you can attain much of the power of CUDA with less complexity, making kernel optimization accessible and impactful for anyone building modern language models.

  • ✓Time kernels correctly with CUDA synchronization. Wrap timing with torch.cuda.synchronize() before and after measurements. Use torch.cuda.Event for precise timings over many iterations. Ignore the first few warm-up runs to get stable numbers.
  • ✓Use block pointers and strides to simplify tile loads. tl.make_block_ptr encodes shape, strides, offsets, and tile shape. It helps the compiler generate efficient, coalesced loads. Verify your strides match the actual memory layout.
  • ✓Fix obvious indexing bugs early. If you use offsets_k in masks, define it with tl.arange(0, BLOCK_SIZE_K). Incorrect or missing indices often cause hard-to-debug errors or wrong results. Add asserts and small tests to catch mistakes early.
  • ✓Balance reuse and occupancy. Bigger tiles reduce memory traffic but can reduce the number of active programs. Too small tiles increase overhead and underutilize compute. Tune with profiling, not guesses.
  • ✓Rely on the compiler but guide it. Triton’s JIT applies loop unrolling and register allocation automatically. Giving it compile-time constants and clean access patterns lets it shine. Focus on the algorithmic structure and memory movement.
  • ✓Edge cases need careful masks. For M, N, or K not divisible by tile sizes, compute masks over both axes. Ensure both loads (A, B) and the final store (C) apply masks. Test with odd sizes to confirm robustness.
  • ✓Compare to strong baselines to set expectations. Highly optimized vendor libraries are tough to beat. Use them as references to validate performance and correctness. Aim for learning and targeted custom wins.
  • ✓Keep code simple and readable. Clear naming like offsets_m, offsets_n, and offsets_k reduces mistakes. Encapsulate index math cleanly. Simpler kernels are easier to debug and tune.
  • ✓Profile where it matters. Focus on matmul and other heavy ops first since they dominate runtime. Small wins there produce large end-to-end gains. Don’t over-optimize cold paths.
  • ✓Document assumptions about layouts and dtypes. Note that your code assumes row-major layout and specific strides. Record acceptable dtype combinations and precision choices. This prevents confusion during future changes or reuse.
  • Matrix Multiplication (Matmul)

    An operation that multiplies two matrices to make a new matrix. It is the workhorse for neural network layers and attention. It involves many multiply-accumulate steps. Efficient matmul is crucial for LLM speed.

    Memory Bandwidth

    How fast data can be moved between memory and compute units. On GPUs, it often limits speed more than the number of math units. If data arrives slowly, compute units sit idle. Efficient access patterns use bandwidth better.

    Shared Memory

    A small, very fast memory area on the GPU that threads in the same block can share. Using it reduces the need to fetch the same data from slow global memory many times. It enables tiling and reuse. It’s key to high-performance kernels.

    Register

    The fastest storage in a GPU, used for temporary values. Each thread has its own registers. Too many registers per thread can reduce how many threads run at once. Keeping hot data in registers speeds up compute.
