Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 6: Kernels, Triton

Intermediate
Stanford Online
LLM · YouTube

Key Summary

  • Modern language models are expensive to run because they perform many matrix multiplications. The main cost comes from both compute and moving data in and out of GPU memory. Optimizing the low-level code that runs these operations can make inference and training much faster and cheaper.
  • A kernel is a small function that runs on the GPU in parallel across many lightweight workers. Instead of a CPU loop that processes one item at a time, a kernel lets thousands of GPU threads work together. Writing good kernels is about splitting the work and reading/writing memory efficiently.
  • High-level ML frameworks like PyTorch or TensorFlow call low-level kernels written in C++/CUDA under the hood. We can keep the nice Python API but still get GPU speed by improving these kernels. Changing the kernel often speeds up the whole model.
  • Triton is a domain-specific language (DSL) that lets you write GPU kernels in a Python-like way. It is easier to write than raw CUDA but still generates fast low-level code. Triton compiles your function just-in-time (JIT) and produces GPU code optimized for your hardware.
  • A simple Triton 'add' kernel loads elements from two input arrays, adds them, and stores the result. Each GPU program instance handles a block of elements and uses masks to avoid out-of-bounds memory accesses. You launch the kernel with a grid that says how many blocks to run.
  • Launching a Triton kernel looks like calling a function with special syntax: kernel[grid](args...). You allocate tensors on the GPU with PyTorch, set a BLOCK_SIZE, set a grid (how many programs), and pass the number of elements. The kernel computes in parallel across those programs.
  • Performance on GPUs is often limited by memory bandwidth, not pure math. Triton tries to make memory access patterns fast and coalesced (neighboring threads read neighboring addresses). It also uses techniques like loop unrolling and smart register allocation.

Why This Lecture Matters

Anyone building or deploying large language models faces significant compute and cost challenges, with matrix multiplications dominating both training and inference. Kernel-level optimization directly attacks the biggest bottlenecks, unlocking lower latency, higher throughput, and lower energy use. For ML engineers, learning kernels and Triton means you can tailor fast paths for your specific shapes and workflows instead of waiting on generic library updates. For research scientists, it empowers experiments with new layer designs or fusions that standard libraries don’t yet support. For systems engineers and platform teams, it reduces cloud GPU bills and improves service reliability by speeding up hot paths.

This knowledge applies to real projects by letting you write custom kernels for elementwise ops, reductions, and matmuls, and by tuning tile sizes to your hardware. It makes it possible to fuse steps (e.g., bias + activation) to cut memory traffic and kernel launch overhead. Mastering masks, strides, and pointer arithmetic helps you build robust kernels that work on arbitrary shapes, which is essential in production.

The industry increasingly rewards engineers who can bridge algorithm design and hardware efficiency, and Triton offers a practical route to that skill without diving fully into CUDA. As models grow and serve more users, efficiency matters as much as accuracy. Faster kernels decrease energy consumption, align with sustainability goals, and make advanced AI more accessible. In a competitive landscape, the ability to optimize at the kernel level can differentiate products by speed and cost, turning infrastructure into a strategic advantage.

Lecture Summary


01Overview

This lecture teaches how to make language models faster and cheaper by optimizing the low-level GPU code that actually does the math. It focuses on kernels—the tiny GPU programs that implement operations like matrix multiplication—and shows how to write them using Triton, a high-level, Python-like domain-specific language (DSL) that compiles to efficient CUDA code. The key motivation is that large language models (LLMs) are dominated by matrix multiplications during both training and inference, and the cost of these operations—in time, money, and energy—is very high. Improving the speed of these core kernels delivers immediate and large benefits to end-to-end performance.

You’ll see where computation happens when you write PyTorch code, and how frameworks delegate to deeply optimized kernels implemented in lower-level languages. You’ll learn what a kernel is and why parallelism and memory access patterns matter more than almost anything else on GPUs. Then you will dive into Triton: what it is, how it compiles, and how it optimizes for specific hardware. The lecture walks through two code examples—a simple elementwise add kernel and a tiled matrix multiplication kernel—explaining how they partition work, coordinate across many GPU program instances, and keep memory access efficient. It also highlights practical concerns like avoiding out-of-bounds reads with masks, picking block sizes, and understanding shared memory.

This material is for students and practitioners building or deploying LLMs who want to go beyond high-level APIs and control performance-critical parts of the stack. A basic understanding of Python and PyTorch is helpful. Knowing that GPUs run many threads in parallel and that matrix multiplication is central to neural networks will make it easier to follow. No previous CUDA experience is required because Triton abstracts most low-level details, but an interest in performance and hardware-aware programming will help you get the most from it.

After completing this lecture, you’ll be able to explain what GPU kernels are and why they dominate runtime for LLMs. You’ll be able to write and launch simple Triton kernels and reason about block sizes, grids, masks, and pointer arithmetic. You’ll understand the role of memory bandwidth, shared memory, and tiling in speeding up matrix multiplication. You’ll also know the trade-offs between using Triton versus writing raw CUDA and when each might make sense. Practically, you’ll be prepared to explore optimizing key building blocks like matmul and prepare for more advanced fusions and attention mechanisms in future work.

The lecture is structured in four parts. First, it motivates kernel optimization by connecting LLM cost and performance to matrix multiplications and memory movement. Second, it defines kernels and explains how high-level frameworks call low-level GPU code. Third, it introduces Triton and shows, step by step, how to write and launch a simple add kernel, then a more complex tiled matmul using shared memory-style tiling concepts and block pointers. Fourth, it compares Triton to CUDA, discussing benefits and limitations, and concludes with a reminder that as models scale, hardware-aware optimization grows in importance and is a powerful tool in every ML engineer’s toolbox.

Key Takeaways

  • ✓Always compute the grid from problem size and tile size. For 1D ops, use grid=(ceil_div(n, BLOCK_SIZE),) so every element is covered. If grid×BLOCK_SIZE exceeds n, rely on masks to stay safe. Never under-cover the data or you will leave elements unprocessed.
  • ✓Use masks for all boundary conditions. Create boolean masks for loads and stores so out-of-bounds lanes do nothing. This allows one kernel to handle many shapes and avoids crashes. It also simplifies host code because you don’t need special-case kernels.
  • ✓Prefer tl.constexpr for tile sizes and other compile-time parameters. When the compiler knows sizes, it can unroll loops and pack registers better. This often yields measurable speedups with minimal effort. Pass BLOCK_SIZEs explicitly in your kernel signature.
  • ✓Keep memory access coalesced. Arrange offsets so adjacent lanes access adjacent addresses. For vectors, offsets = base + tl.arange(0, BLOCK_SIZE) is a good default. Check any 2D indexing still produces contiguous regions for efficient loads.
  • ✓Accumulate in higher precision when multiplying. Use float32 for the accumulator even if inputs are float16 or bfloat16. Cast to the output type at the end to balance speed and accuracy. This reduces numerical error in deep loops.
  • ✓Start with correctness on small problems. Validate kernels against PyTorch reference ops (e.g., torch.add, torch.matmul) for small shapes. Then scale up and profile. Fix indexing and mask issues before chasing performance.
  • ✓Tune tile sizes empirically. Try several BLOCK_SIZE_M/N/K combinations and measure on your target GPU. Larger tiles increase reuse but can hurt occupancy if they use too many registers. Find a sweet spot by benchmarking.
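As a concrete sketch of the grid arithmetic in the takeaways above (plain Python, no GPU required; `ceil_div` and `coverage` are illustrative helper names, not Triton APIs):

```python
def ceil_div(n: int, block: int) -> int:
    # Smallest grid size such that grid * block covers all n elements.
    return (n + block - 1) // block

def coverage(n: int, block: int):
    # Count lanes that survive the mask `offsets < n` across the whole grid.
    grid = ceil_div(n, block)
    active = sum(
        1
        for pid in range(grid)
        for lane in range(block)
        if pid * block + lane < n
    )
    return grid, active
```

With n=1000 and BLOCK_SIZE=256 this gives a grid of 4 programs and exactly 1000 active lanes; the 24 extra lanes in the last block are the ones the mask turns off.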

Glossary

Kernel

A small function that runs on the GPU and executes many parallel operations at once. It replaces a slow loop on the CPU with many workers doing the same kind of work on different data. Kernels are where most heavy math in ML actually happens. They decide how to split the job and how to read and write memory. Good kernels make models fast; bad ones make them slow.

Triton

A Python-like language for writing GPU kernels that compiles to fast low-level code. It lets you write compact code with high-level operations like tl.load and tl.store. The compiler optimizes your code for your specific GPU. Triton aims to be easier than CUDA but still very fast.

CUDA

NVIDIA’s platform and programming model for writing GPU code. It is powerful and gives you deep control but can be complex and verbose. Many ML libraries rely on CUDA kernels for speed. Triton can generate CUDA code under the hood.

GPU

A graphics processing unit designed to run many simple operations in parallel. GPUs are great for matrix math, which is full of repeated operations. They have many cores and high memory bandwidth. ML models use GPUs to run quickly.

#triton · #gpu kernel · #cuda · #matrix multiplication · #tiling · #shared memory · #memory bandwidth · #coalesced access · #program_id · #tl.constexpr · #tl.load · #tl.store · #block size · #grid launch · #jit compilation · #register allocation · #loop unrolling · #pytorch · #inference optimization · #hardware-aware optimization
  • Shared memory is a small, very fast scratchpad all threads in a block can use. By loading tiles of A and B matrices into shared memory, you reduce slow global memory reads. Triton can help you orchestrate this tiling for faster matrix multiplication.
  • A matrix multiplication kernel breaks the big problem into tiles along M, N, and K dimensions. Each program accumulates a submatrix (tile) of the output by looping over K tiles. The kernel uses block pointers and masks to load only valid ranges.
  • In the provided matmul example, the intent is to tile A and B, load tiles, multiply-accumulate, then store a result tile. The code shows masks that should reference a K-offset vector; a variable like offsets_k must be defined for correctness. Fixing that mask is essential to avoid reading invalid memory.
  • Triton compared to CUDA: CUDA is the standard and very powerful but low-level and verbose. Triton is higher-level, easier to write, and portable while still generating CUDA under the hood. The trade-off is learning a new DSL and accepting that the ecosystem is still evolving.
  • The big picture: as LLMs grow, kernel and memory optimizations matter more. Even small percentage gains in matmul speed multiply across billions of operations. Knowing kernels and Triton helps you cut costs, save energy, and ship faster-serving models.
02Key Concepts

    • 01

      What a kernel is: A kernel is a small function that runs on the GPU across many lightweight workers in parallel. It replaces a slow CPU for-loop with thousands of concurrent operations. The kernel’s job is to divide work and move data efficiently so each worker handles a slice of the problem. Think of it like dividing a big pizza into many slices so many people can eat at once. Without kernels, deep learning operations would be too slow to be practical.

    • 02

      Where computation really happens: When you write PyTorch or TensorFlow code in Python, your operations are mapped to lower-level kernels in C++/CUDA. The high-level API is the friendly face, but the heavy lifting is done by these compiled kernels. Speed depends on how well these kernels use the GPU hardware. This layering lets you write simple code while still achieving high performance. Improving kernels can accelerate entire models without changing user-facing code.

    • 03

      Why matrix multiplication dominates: In transformers, most compute cost is in matrix multiplications (matmuls) inside linear layers and attention. Matmuls consist of many multiply-accumulate operations that are easy to parallelize. However, moving the required data from memory can be the actual bottleneck. Optimizing both compute and memory access patterns is key. Faster matmuls ripple through to faster training and inference.

    • 04

      GPU parallelism basics: GPUs run many threads that execute similar instructions on different data (SIMD/SPMD style). In Triton, each program instance processes a block of elements, coordinated by program IDs along different axes. This scheme maps neatly onto tiles of tensors and encourages coalesced memory access. Parallelism works best when work is evenly split and memory locations are predictable. Poorly divided work or scattered memory access wastes bandwidth and time.

    • 05

      Memory bandwidth as a bottleneck: Memory bandwidth is how fast data can be read from or written to memory. GPUs have high bandwidth, but LLMs demand even more, so memory often limits speed. Coalesced access means adjacent threads read adjacent addresses, using the bus efficiently. Using fast on-chip storage like shared memory reduces trips to slower global memory. Careful tiling and reuse of data improve bandwidth utilization.

    • 06

      Triton overview: Triton is a Python-like DSL for writing GPU kernels that compiles to fast machine code (e.g., CUDA). It offers high-level constructs (like tl.load, tl.store, program_id) while still exposing low-level control over memory and tiling. Triton’s JIT compiler optimizes code for your specific GPU. It automates things like loop unrolling and register allocation. This balance makes it easier than raw CUDA while keeping performance high.

    • 07

      The add kernel structure: The add kernel loads a block of elements from two input arrays, adds them, and stores results. It computes a per-program offset using program_id and BLOCK_SIZE. A mask ensures it doesn’t read or write past the end. The code vectorizes loads and stores over a small range using tl.arange. These patterns generalize to many elementwise ops.

    • 08

      Launching a Triton kernel: You allocate tensors on the GPU using PyTorch, choose a BLOCK_SIZE, and compute a grid specifying how many programs to run. You then call kernel[grid](args...) to launch. The kernel runs with many identical program instances, each handling a different slice of data. Correct grid sizing ensures every element gets processed. Too small a grid leaves work undone; too large relies on masks to safely ignore out-of-range indices.

    • 09

      Compile-time constants (tl.constexpr): Some kernel parameters like BLOCK_SIZE work best when the compiler knows them at compile time. Triton’s tl.constexpr annotation lets it specialize and optimize code for those values. This can enable loop unrolling and better register allocation. Passing such parameters explicitly is a common tuning strategy. It’s one lever for performance without changing algorithmic behavior.

    • 10

      Masks for safety and performance: Masks guard memory operations so threads don’t access out-of-bounds addresses. In Triton, you pass mask= to tl.load and tl.store to selectively enable lanes. This avoids undefined behavior and crashes while allowing over-provisioned grids. Masking is especially useful for ragged edges when tensor sizes aren’t multiples of block sizes. It makes kernels robust to many shapes.

    • 11

      Shared memory and tiling ideas: Shared memory is a fast on-chip scratchpad for threads in the same block; in Triton, you emulate this benefit by tiling and reusing data with block pointers. The idea is to load a tile of A and B once and use it multiple times to accumulate output tiles. This reduces global memory traffic and increases arithmetic intensity. Properly chosen tile sizes balance reuse with occupancy. Tiling is the core of high-performance matmul.

    • 12

      Matmul kernel organization: A tiled matmul kernel partitions the output matrix into BLOCK_SIZE_M by BLOCK_SIZE_N tiles. It loops over K in steps of BLOCK_SIZE_K, loading corresponding A and B tiles. Each step computes a partial product and accumulates in registers. After looping, it stores the result tile back to global memory. Masks ensure only valid rows/cols get read or written at matrix edges.

    • 13

      Block pointers and strides: tl.make_block_ptr creates structured pointers aware of base address, shape, strides, offsets, and tile shape. This conveys to Triton how to walk matrices efficiently. Strides map 2D indices to 1D memory addresses (e.g., row-major layout). tl.advance moves the block pointer to the next K tile. These abstractions help generate coalesced, cache-friendly memory operations.

    • 14

      Accumulator precision: Accumulation often uses higher precision (e.g., float32) even if inputs are lower precision, to avoid numeric error. After finishing, the accumulator is cast to the output dtype. This pattern balances speed and accuracy. It is standard practice in matmul kernels. It can affect both model quality and performance.

    • 15

      Under-the-hood compiler optimizations: Triton’s compiler applies loop unrolling, register allocation, and memory access optimization. It targets your specific GPU architecture to pick good code paths. These choices influence occupancy, instruction scheduling, and bandwidth use. The goal is to fully utilize compute units while avoiding stalls. Auto-optimization reduces the need for manual micro-tuning.

    • 16

      CUDA vs Triton trade-offs: CUDA is mature, powerful, and the industry standard for NVIDIA GPUs, but is lower-level and verbose. Triton is easier to write, more Pythonic, and portable, while still generating CUDA under the hood. Triton’s ecosystem is newer and evolving, which means learning a fresh DSL and accepting occasional rough edges. For many ML workloads, Triton hits a sweet spot of ease and speed. For niche, bleeding-edge tuning, raw CUDA may still be necessary.

    • 17

      Practical limits and robustness: Kernels must handle sizes not divisible by block sizes, making masks essential. Grids must be chosen to cover all work without starving the GPU. Memory access must be coalesced whenever possible to avoid bandwidth waste. Debugging involves checking shapes, masks, and pointer math first. Small mistakes in indexing can silently hurt performance or correctness.

    • 18

      Why this matters as models scale: As LLMs grow, even 5–10% speedups on matmul save huge compute and energy across billions of calls. Optimized kernels can reduce latency noticeably and lower serving costs. They also make training more affordable and sustainable. Hardware-aware programming becomes a core skill for ML practitioners. Triton offers a practical path to those wins without diving deep into CUDA immediately.

    • 19

      Energy and cost implications: Faster kernels are greener because they finish work with less total energy. They reduce cloud GPU hours and operational bills for inference at scale. Some organizations find kernel work pays for itself quickly in cost savings. It can also enable features that were too slow before. Efficiency is a key competitive advantage for ML systems.

    • 20

      From toy kernels to real workloads: Start with elementwise ops to learn Triton’s model, then move to reductions and matmul. Combine these building blocks for complex layers like attention. You can integrate Triton kernels into PyTorch pipelines with minimal changes. Validating correctness with small test tensors is a must. Then profile and tune tile sizes for your target GPU.

    03Technical Details

    Overall Architecture/Structure

    1. High-level frameworks to kernels
    • You write Python code using PyTorch or TensorFlow. Each tensor operation you call (e.g., add, matmul) maps to one or more kernels. The framework handles autograd, scheduling, and device placement, but the performance-critical work happens in compiled kernels on the GPU. Optimizing those kernels often yields the biggest speedups.

    • The stack looks like: Python (model code) → Framework (PyTorch) → Backend (C++/CUDA/Triton) → GPU Hardware. Triton slots into the backend layer, letting you write kernels in a Pythonic way that compile to CUDA or other low-level representations.

    2. GPU execution model (as used by Triton)
    • GPUs run many lightweight parallel instances. In Triton’s SPMD (Single Program, Multiple Data) model, your kernel function is instantiated many times (program instances). Each instance handles a slice of the tensor defined by program_id(axis=...). The host specifies how many instances to launch via the grid.

    • Inside a kernel, you compute a block’s starting index using program_id × BLOCK_SIZE. tl.arange(0, BLOCK_SIZE) yields per-lane indices within that block. Combined, offsets = start + arange gives the addresses each lane will touch. Masking ensures lanes that fall beyond the end of the data do nothing.

    3. Memory hierarchy and bandwidth
    • Global memory (VRAM) is large but relatively slow to access. Shared memory is a small on-chip buffer shared by threads in the same block (fast). Registers are the fastest and store per-lane temporary values. L2/L1 caches sit between global memory and compute.

    • The chief challenge is feeding data to compute units without stalling. Memory bandwidth (how quickly you can move data) often caps performance. Coalesced memory access—where adjacent lanes access adjacent addresses—maximizes throughput. Tiling into shared memory and reusing tiles increases arithmetic intensity (more computation per byte fetched), improving performance.
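The payoff of tiling can be estimated with a quick back-of-the-envelope helper (a hypothetical function, assuming fp16 inputs at 2 bytes per element and counting only the tile loads for one K-step):

```python
def tile_arithmetic_intensity(bm: int, bn: int, bk: int, dtype_bytes: int = 2) -> float:
    # One K-step of a tiled matmul: load a (bm x bk) tile of A and a
    # (bk x bn) tile of B, then perform 2*bm*bn*bk flops (multiply + add)
    # against those bytes. Higher flops-per-byte means less pressure on
    # memory bandwidth.
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * dtype_bytes
    return flops / bytes_moved
```

A 64×64×32 tile reaches 32 flops per byte loaded, while a 16×16×32 tile manages only 8 — quadrupling the tile height and width quadruples the reuse of each loaded byte, which is exactly why tiling raises arithmetic intensity.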

    Code/Implementation Details

    A) Elementwise add kernel

    Provided kernel:

    import triton
    import triton.language as tl

    @triton.jit
    def add(x_ptr, y_ptr, output_ptr, n_elements: tl.constexpr):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        output = x + y
        tl.store(output_ptr + offsets, output, mask=mask)

    • What it does: Adds two same-shaped 1D tensors elementwise and writes to output.
    • Arguments: x_ptr, y_ptr, output_ptr are pointers to GPU memory holding the inputs/outputs. n_elements is the total number of elements (compile-time constant here), used to bounds-check.
    • tl.program_id(axis=0): Returns which program instance you are (0..grid[0]-1). Each instance processes a block of elements.
    • BLOCK_SIZE and tl.arange(0, BLOCK_SIZE): Define how many elements a program handles and their per-lane indices. offsets are the absolute indices for this instance.
    • mask = offsets < n_elements: Prevents reading/writing beyond array end when n_elements isn’t a multiple of BLOCK_SIZE.
    • tl.load and tl.store: Vectorized memory operations. Adding x and y is a vectorized elementwise add.
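To make these pieces concrete, the same indexing-and-masking scheme can be simulated on the CPU in plain Python (illustrative only — `add_kernel_sim` mimics what each program instance does; it is not Triton code):

```python
def add_kernel_sim(x, y, n_elements, BLOCK_SIZE, grid):
    # CPU simulation of the SPMD add kernel: each "program instance" (pid)
    # handles one block of offsets, and the mask suppresses lanes that
    # fall past the end of the data.
    output = [0.0] * n_elements
    for pid in range(grid):                                     # tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = [block_start + i for i in range(BLOCK_SIZE)]  # tl.arange
        for off in offsets:
            if off < n_elements:                                # mask = offsets < n_elements
                output[off] = x[off] + y[off]                   # masked load / add / store
    return output
```

Running it with n_elements=10, BLOCK_SIZE=4, grid=3 touches every element exactly once even though 3 × 4 = 12 lanes are launched — the two extra lanes are masked off, just as in the real kernel.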

    Recommended refinement: pass BLOCK_SIZE as tl.constexpr to make it explicit and help the compiler specialize:

    @triton.jit
    def add(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(output_ptr + offsets, x + y, mask=mask)

    B) Launching the add kernel from Python

    import torch

    x = torch.randn(1024, device='cuda')
    y = torch.randn(1024, device='cuda')
    output = torch.zeros(1024, device='cuda')
    BLOCK_SIZE = 256
    grid = (4,)
    add[grid](x, y, output, x.numel())

    • Tensors: x, y, output are 1D tensors on the GPU (device='cuda'). x.numel() = 1024 gives the total element count.
    • BLOCK_SIZE: Each program handles 256 elements.
    • grid = (4,): Launches 4 program instances in axis 0. 4 × 256 = 1024 elements, so every element is covered exactly once.
    • add[grid](...): Special Triton syntax to JIT-compile and launch the kernel.

    If using the refined kernel with explicit BLOCK_SIZE in the signature, pass it too: add[grid](x, y, output, x.numel(), BLOCK_SIZE)

    C) Under-the-hood: JIT and optimizations

    • @triton.jit compiles the decorated function into GPU code on first call for given constant parameters (like BLOCK_SIZE). The compiler emits CUDA-level code for NVIDIA GPUs, applying optimizations such as loop unrolling (turning small loops into straight-line code), register allocation (deciding which values stay in fast registers), and memory access optimization (arranging loads/stores for coalescing).
    • Because tl.constexpr values are known at compile time, the compiler can generate specialized, faster code based on those sizes.

    D) Tiled matrix multiplication kernel Goal: Compute C[M, N] = A[M, K] @ B[K, N] using tiles. Each program computes a tile of C of shape (BLOCK_SIZE_M, BLOCK_SIZE_N), accumulating over K in steps of BLOCK_SIZE_K.

    Provided skeleton:

    import triton
    import triton.language as tl

    @triton.jit
    def matmul(a_ptr, b_ptr, output_ptr, M, N, K,
               BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr):
        pid_m = tl.program_id(axis=0)
        pid_n = tl.program_id(axis=1)

        block_start_m = pid_m * BLOCK_SIZE_M
        block_start_n = pid_n * BLOCK_SIZE_N

        offsets_m = block_start_m + tl.arange(0, BLOCK_SIZE_M)
        offsets_n = block_start_n + tl.arange(0, BLOCK_SIZE_N)

        a_block_ptr = tl.make_block_ptr(
            base_ptr=a_ptr,
            shape=(M, K),
            strides=(K, 1),
            offsets=(block_start_m, 0),
            block_shape=(BLOCK_SIZE_M, BLOCK_SIZE_K),
            order=(1, 0)
        )
        b_block_ptr = tl.make_block_ptr(
            base_ptr=b_ptr,
            shape=(K, N),
            strides=(N, 1),
            offsets=(0, block_start_n),
            block_shape=(BLOCK_SIZE_K, BLOCK_SIZE_N),
            order=(1, 0)
        )

        accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_SIZE_K):
            a = tl.load(a_block_ptr, mask=(offsets_m[:, None] < M) & (offsets_k[None, :] < K))
            b = tl.load(b_block_ptr, mask=(offsets_k[:, None] < K) & (offsets_n[None, :] < N))
            accumulator += tl.dot(a, b)
            a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_SIZE_K))
            b_block_ptr = tl.advance(b_block_ptr, (BLOCK_SIZE_K, 0))

        output = accumulator.to(output_ptr.dtype.element_ty)
        tl.store(output_ptr + offsets_m[:, None] * N + offsets_n[None, :], output,
                 mask=(offsets_m[:, None] < M) & (offsets_n[None, :] < N))
    • Intent: Use block pointers to describe tiles of A and B matrices and iterate along K. The accumulator holds a BLOCK_SIZE_M × BLOCK_SIZE_N partial result in higher precision (float32). After looping over K tiles, the result is cast to output dtype and written to the correct output slice with masking at the boundaries.
    • Important fix: offsets_k is referenced but not defined. Define it inside the K loop so it indexes the current tile along K — the loop offset k must be included, otherwise the mask never detects the final partial tile:

    offsets_k = k + tl.arange(0, BLOCK_SIZE_K)

    Then the masks become valid. A corrected version of the innermost section:

    for k in range(0, K, BLOCK_SIZE_K):
        offsets_k = k + tl.arange(0, BLOCK_SIZE_K)
        a = tl.load(a_block_ptr, mask=(offsets_m[:, None] < M) & (offsets_k[None, :] < K))
        b = tl.load(b_block_ptr, mask=(offsets_k[:, None] < K) & (offsets_n[None, :] < N))
        accumulator += tl.dot(a, b)

    • Strides: For A[M, K] in row-major, stride for row is K and for column is 1. For B[K, N], stride for row is N and for column is 1. These strides map 2D indices to addresses. tl.make_block_ptr uses shape, strides, and offsets to locate tiles.
    • tl.advance: Moves the pointer to the next tile along K by adding (row_delta, col_delta) in tile units.
    • Mask logic: At matrix edges (where the last tile may be partially outside M, N, or K), masks prevent out-of-bounds loads/stores. This lets a uniform grid cover the entire matrix without conditional control flow.
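The stride rule can be captured in a one-line helper (illustrative, mirroring the row-major mapping described above):

```python
def linear_index(row: int, col: int, stride_row: int, stride_col: int = 1) -> int:
    # Row-major mapping from a 2D index to a flat memory offset:
    # linear_index = row * stride_row + col * stride_col.
    return row * stride_row + col * stride_col
```

For A with shape (M, K) stored row-major, the strides are (K, 1): element A[2, 5] with K = 8 sits at flat offset 2*8 + 5 = 21, which is exactly how tl.make_block_ptr walks the matrix from shape and strides.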

    Role of Each Component

    • pid_m, pid_n: Identify which output tile this program computes along M and N grids.
    • BLOCK_SIZE_M/N/K: Tuneable tile sizes; compile-time constants for specialization. Larger tiles can increase reuse but may reduce occupancy; smaller tiles may increase overhead.
    • accumulator: Keeps partial results in registers; prefer float32 for accumulation to reduce numerical error.
    • tl.dot: Performs a matrix multiply between the loaded A and B tiles and returns a BLOCK_SIZE_M × BLOCK_SIZE_N result to accumulate.
    • tl.store with stride arithmetic: output_ptr + offsets_m[:, None] * N + offsets_n[None, :] computes linear indices for a 2D region in row-major layout.

    Tools/Libraries Used

    • PyTorch: Allocates tensors on the GPU and provides easy interop with Triton by passing PyTorch tensors (their underlying storage) as pointers to kernels.
    • Triton: DSL and compiler. Key language elements used: • @triton.jit: JIT-compile the function to GPU code. • triton.language as tl: Namespace for kernel intrinsics (program_id, arange, load, store, zeros, dot, make_block_ptr, advance). • tl.constexpr: Marks parameters known at compile time for specialization.
    • CUDA (implicitly): The generated code targets NVIDIA GPUs, leveraging CUDA capabilities under the hood.

    Step-by-Step Implementation Guide

    1. Install and set up
    • Ensure a machine with an NVIDIA GPU and CUDA-compatible drivers. Install PyTorch with CUDA support. Install Triton (often pip install triton). Verify a minimal Triton script runs without errors.
    2. Write and run the add kernel
    • Define the kernel (with explicit BLOCK_SIZE as tl.constexpr). Choose BLOCK_SIZE=256 as a starting point. Create input tensors with torch.randn and an output with torch.zeros on device='cuda'. Set grid=(ceil_div(n, BLOCK_SIZE),). Launch kernel, then check correctness with torch.allclose(output, x + y).
    3. Choose block sizes and grids
    • Compute grid size as grid=(math.ceil(n / BLOCK_SIZE),). If grid × BLOCK_SIZE > n, rely on mask to avoid out-of-bounds. If it’s less than n, some elements won’t be processed. Start with 128 or 256 and adjust based on profiling.
    4. Profile
    • Use torch.cuda.synchronize() before/after timing to measure only GPU time. Use torch.cuda.Event for accurate timing across many runs. Compare against torch.add to ensure the kernel is in the right performance ballpark. Expect the Triton version to be competitive with framework kernels for educational cases.
    5. Implement a simple matmul
    • Start with shapes M, N, K multiples of your tile sizes (e.g., M=N=K=1024, BLOCK_SIZE_M=BLOCK_SIZE_N=64, BLOCK_SIZE_K=32). Compute grid=(ceil_div(M, BLOCK_SIZE_M), ceil_div(N, BLOCK_SIZE_N)). Initialize output to zeros. Launch kernel with tl.constexpr tile sizes.
    6. Add masks and edge handling
    • Add offsets_k = tl.arange(0, BLOCK_SIZE_K) and create masks for A and B loads and for C stores. Test with sizes not divisible by tile sizes (e.g., M=1000, N=1010, K=1030). Verify correctness against torch.matmul within a tolerance (consider accumulation dtype and casting).
    7. Tune tile sizes
    • Try different BLOCK_SIZE_M/N/K values (e.g., 64×64×32, 128×64×32) and measure. Larger tiles can reduce memory traffic but may increase register pressure or reduce occupancy. Look for a sweet spot that is consistently faster on your GPU. Keep masks and correctness tests in place.
    8. Verify numerical behavior
    • Accumulate in float32 and cast to output dtype at the end. Compare to PyTorch matmul for various dtypes (float16, bfloat16) to ensure acceptable error. If errors grow, consider smaller BLOCK_SIZE_K or mixed-precision strategies.
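The tiled matmul and masking steps above can be sketched in pure Python. `tiled_matmul_sim` is an illustrative name; the `min(...)` bounds play the role of Triton's masks at ragged edges, and the outer two loops stand in for the 2D grid of programs:

```python
import math

def tiled_matmul_sim(A, B, BM=4, BN=4, BK=2):
    """Pure-Python sketch of the tiled matmul loop structure with edge masks."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # 2D grid: one "program" per output tile
    for pid_m in range(math.ceil(M / BM)):
        for pid_n in range(math.ceil(N / BN)):
            # accumulate partial products over K in steps of BK
            for k0 in range(0, K, BK):
                for i in range(pid_m * BM, min((pid_m + 1) * BM, M)):      # row mask
                    for j in range(pid_n * BN, min((pid_n + 1) * BN, N)):  # col mask
                        for k in range(k0, min(k0 + BK, K)):               # K mask
                            C[i][j] += A[i][k] * B[k][j]
    return C

# Odd sizes exercise the partial tiles at every edge
A = [[float(i + j) for j in range(3)] for i in range(5)]      # 5x3
B = [[float(i * j + 1) for j in range(6)] for i in range(3)]  # 3x6
C = tiled_matmul_sim(A, B)
```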

    Tips and Warnings

    • Always mask: Out-of-bounds memory accesses can cause crashes or silent data corruption. Masks on tl.load/tl.store make kernels robust to ragged edges.
    • Keep memory coalesced: Ensure offsets are contiguous across lanes. For 1D add, x_ptr + offsets yields coalesced loads/stores if offsets is a consecutive range.
    • Pass compile-time sizes as tl.constexpr: This enables stronger compiler optimizations. It’s a low-effort way to gain performance.
    • Check pointer math: For 2D arrays, get strides right. In row-major, linear_index = row * stride_row + col * stride_col; common is stride_row = N and stride_col = 1 for shape (rows, cols) with contiguous storage.
    • Start simple: Get correctness on small inputs (e.g., 64 elements) before scaling. Print or assert shapes and check masks cover what you expect (e.g., sum of mask equals elements processed).
    • Balance tile sizes: Oversized tiles can increase register pressure and reduce occupancy, hurting performance. Undersized tiles increase overhead and reduce reuse. Tuning is empirical.
    • Compare to strong baselines: torch.add and torch.matmul are highly optimized; use them as references to judge your kernel’s performance. Don’t be discouraged—matching them is hard; the goal is to learn and target custom cases.
    • Be aware of hardware variability: Optimal tile sizes may differ across GPUs (e.g., RTX vs A100). Triton specializes code per compile-time configuration, so benchmark on your target device.
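The pointer-math tip above can be captured in a tiny helper (pure Python, illustrative names), which is worth unit-testing before embedding the same arithmetic in a kernel:

```python
def linear_index(row, col, stride_row, stride_col):
    """Row-major pointer math: element (row, col) lives at this flat offset."""
    return row * stride_row + col * stride_col

# A contiguous array of shape (rows, cols) = (3, 4) has
# stride_row = 4 and stride_col = 1.
flat = list(range(12))        # stand-in for the underlying storage
rows, cols = 3, 4
offset = linear_index(2, 1, cols, 1)   # 2*4 + 1 = 9
assert flat[offset] == 9
```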

    Data Flow Summary

    • Elementwise add: Host computes grid from n and BLOCK_SIZE → each program computes block_start from pid → the offsets vector generates per-lane addresses → masked tl.load of x and y → local add → masked tl.store to output.
    • Tiled matmul: Host computes 2D grid from M, N and tile sizes → each program chooses its output tile via (pid_m, pid_n) → for k in [0..K) with step BLOCK_SIZE_K: block_ptrs load A/B tiles with masks → tl.dot to get partial result → accumulate → advance pointers → after loop, cast and store output tile with mask.

    CUDA vs Triton Mechanics

    • CUDA: You write kernels in C++-like syntax, manage thread/block indexing, shared memory explicitly, and handle many low-level details. You compile with nvcc, link, and call kernels from host code. Control is maximal, but code is verbose and complex.
    • Triton: You write kernels in a Python file, with a compact API for indexing, loading/storing, and tile management. You launch with kernel[grid](args...), and Triton compiles just-in-time, optimizing for your hardware. It hides many complexities while letting you express the core performance-critical structure.

    04 Examples

    • 💡

      Elementwise Add Basics: Inputs are two 1D tensors x and y of length 1024 on the GPU. The kernel computes offsets for each program using pid and BLOCK_SIZE, loads x and y at those offsets with a mask, adds them, and stores into output. With BLOCK_SIZE=256 and grid=(4,), each program handles 256 elements. The output matches torch.add(x, y) for all indices.

    • 💡

      Guarding Ragged Ends with Masks: Suppose n_elements=1000 but BLOCK_SIZE=256 and grid=(4,), which covers 1024 indices. Offsets 1000–1023 are beyond the end, but mask=(offsets < 1000) disables those lanes. tl.load and tl.store don’t touch invalid memory thanks to mask. This keeps one clean kernel for all sizes without special cases.
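A quick pure-Python check of the per-program mask counts for this example (illustrative, not Triton code):

```python
n, BLOCK_SIZE, grid = 1000, 256, 4
active = []
for pid in range(grid):
    offsets = range(pid * BLOCK_SIZE, (pid + 1) * BLOCK_SIZE)
    mask = [o < n for o in offsets]
    active.append(sum(mask))
# Programs 0-2 are fully active; program 3 has 1000 - 768 = 232 live lanes.
assert active == [256, 256, 256, 232]
assert sum(active) == n
```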

    • 💡

      Choosing the Grid: For n_elements=10_000 and BLOCK_SIZE=256, compute grid=(ceil(10000/256),)=(40,). That launches 40 programs, each doing 256 elements, totaling 10,240 potential lanes. The mask ensures only the first 10,000 are active. This strategy is simple and safe.

    • 💡

      Coalesced Memory Access: Using offsets = base + tl.arange(0, BLOCK_SIZE) makes neighboring lanes access neighboring addresses. tl.load(x_ptr + offsets) then issues coalesced reads, which are faster. If offsets were scattered, the GPU would perform many small, inefficient memory transactions. Coalescing boosts effective bandwidth.

    • 💡

      Matmul Tile Ownership: With M=N=K=1024 and BLOCK_SIZE_M=BLOCK_SIZE_N=64, the grid becomes (16, 16). Program (pid_m=3, pid_n=7) owns rows 192–255 and cols 448–511 of the output. It loops over K in steps of BLOCK_SIZE_K (e.g., 32), accumulating partial results for its 64×64 tile. After finishing, it writes the tile back using a masked store.
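The ownership arithmetic can be verified in a few lines of plain Python (same pid_m, pid_n, and tile sizes as in the example):

```python
BM = BN = 64
pid_m, pid_n = 3, 7
rows = range(pid_m * BM, (pid_m + 1) * BM)   # rows 192..255 of the output
cols = range(pid_n * BN, (pid_n + 1) * BN)   # cols 448..511 of the output
assert (rows.start, rows.stop - 1) == (192, 255)
assert (cols.start, cols.stop - 1) == (448, 511)
```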

    • 💡

      Fixing the offsets_k Bug: The provided matmul code references offsets_k but doesn’t define it. Define offsets_k = tl.arange(0, BLOCK_SIZE_K) before loads. Now masks like (offsets_k[None, :] < K) correctly guard the K dimension. Without this, the code could read invalid memory or crash.

    • 💡

      Block Pointers and Strides: For A[M, K] in row-major, a_ptr’s stride for rows is K and for columns is 1. tl.make_block_ptr(base=a_ptr, shape=(M, K), strides=(K, 1), offsets=(block_start_m, 0), block_shape=(BM, BK), order=(1, 0)) tells Triton how to fetch the tile A[block_start_m:block_start_m+BM, 0:BK]. Similarly for B with shape=(K, N) and strides=(N, 1). These descriptors help Triton generate efficient loads.

    • 💡

      Accumulator Precision and Casting: In matmul, accumulator = tl.zeros((BM, BN), dtype=tl.float32) ensures stable accumulation even if inputs are float16. After looping over K tiles, output = accumulator.to(output_ptr.dtype.element_ty) casts to the desired output dtype. This balances speed with numeric stability. Comparing to torch.matmul verifies acceptable error.
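Python's `struct` module can round-trip values through IEEE half precision (format code `'e'`), which makes the accumulation hazard concrete. This is a pure-Python illustration of why float32 accumulators matter, not the Triton code path:

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision (struct format 'e')."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulate 4096 ones. A float16 accumulator stalls at 2048, because
# above 2048 the spacing between representable fp16 values is 2, and
# 2048 + 1 rounds back down to 2048 (ties-to-even).
acc16 = 0.0
acc32 = 0.0
for _ in range(4096):
    acc16 = to_fp16(acc16 + 1.0)
    acc32 = acc32 + 1.0     # Python float stands in for a float32 accumulator
assert acc16 == 2048.0
assert acc32 == 4096.0
```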

    • 💡

      Launching 2D Grids: For matmul, the grid is 2D: grid=(ceil_div(M, BM), ceil_div(N, BN)). Each program computes one BM×BN output tile, reading pid_m and pid_n to locate it. This 2D layout maps naturally to matrices and keeps indexing simple.

    • 💡

      Performance Tuning via Tile Sizes: Try BM, BN, BK combinations like 64×64×32 vs 128×64×32. Measure runtime with torch.cuda.Event and multiple iterations. Larger tiles may reduce memory traffic but could lower occupancy if they use too many registers. Empirical testing finds a good balance for your GPU.

    • 💡

      Comparing to PyTorch Matmul: Create random A and B and compute C_torch = A @ B. Run the Triton matmul and compare with torch.allclose(C_triton, C_torch, atol=1e-2, rtol=1e-2) for float16/bfloat16. If mismatch occurs, check masks and pointer arithmetic. This ensures correctness before deeper tuning.

    • 💡

      Elementwise Add vs Framework Op: Timing add kernel vs torch.add shows how close a simple Triton kernel can get to a framework primitive. Use 10–100 warm-up runs and 100 timed runs with synchronization to get stable numbers. If slower, confirm coalescing and BLOCK_SIZE choices. Such experiments build intuition for memory-bound kernels.

    • 💡

      Memory-Bound Behavior: Increase tensor size from 1e5 to 1e7 elements for the add kernel. You’ll see runtime scale roughly with data size, showing memory bandwidth limits. Changing BLOCK_SIZE won’t change speed much once coalescing is good. This illustrates why bandwidth optimizations are critical.

    • 💡

      Robust Edge Handling: Set M=1000, N=1010, K=1030 with tile sizes 64×64×32. The last tiles along each dimension are partial. Masks on loads and stores prevent errors and produce correct results. This pattern keeps kernels simple and general-purpose.

    • 💡

      Grid Too Small Failure Mode: If grid=(3,) with BLOCK_SIZE=256 for n=1024, you process only 768 elements. The remaining 256 never run, and output is partially wrong. Always compute grid from size and BLOCK_SIZE; masks don’t fix under-coverage.
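This failure mode can be checked directly; the sketch below (illustrative, pure Python) counts how many indices three programs ever touch:

```python
n, BLOCK_SIZE = 1024, 256
bad_grid = 3                        # should be ceil(1024 / 256) = 4
processed = set()
for pid in range(bad_grid):
    for o in range(pid * BLOCK_SIZE, (pid + 1) * BLOCK_SIZE):
        if o < n:                   # the mask only prevents overshoot,
            processed.add(o)        # it cannot create missing programs
assert len(processed) == 768        # 256 elements are never processed
```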

    05 Conclusion

    Large language models rely heavily on matrix multiplications, and the performance of these operations hinges on the quality of the GPU kernels that implement them. By understanding where computation really happens—inside compiled kernels called by high-level frameworks—you gain leverage over the most expensive parts of training and inference. Triton provides a practical, Python-like way to write these kernels while still generating highly optimized GPU code tailored to your hardware. The elementwise add example shows the fundamentals: program IDs define work slices, arange creates per-lane offsets, and masks ensure safety at boundaries. The tiled matmul example adds the core ideas of high-performance GPU computing: tiling along M, N, and K, using structured block pointers and strides, accumulating partial results with stable precision, and carefully masking loads and stores.

    The most important lesson is that memory bandwidth and data movement patterns often limit speed more than raw compute. Coalesced access, tiling, and reuse are the keys to unlocking performance. Triton’s compiler helps with low-level optimizations like loop unrolling and register allocation, letting you focus on the algorithmic structure—how to partition work and how to move data. Compared to raw CUDA, Triton reduces complexity and speeds development, although it is a newer ecosystem that you must learn and keep up with as it evolves.

    To practice, implement and validate the add kernel, then build a tiled matmul and fix edge cases with masks. Profile with different tile sizes to understand trade-offs between reuse and occupancy. Next steps include exploring fused kernels for transformer blocks, experimenting with attention mechanisms, and comparing Triton performance with vendor libraries. Continue learning about GPU memory hierarchies, occupancy, and numerical precision strategies to push performance further.

    The core message to remember: as models scale, hardware-aware kernel optimization becomes a major lever for speed, cost, and energy savings. Even modest gains in matmul performance have large end-to-end effects. With Triton, you can attain much of the power of CUDA with less complexity, making kernel optimization accessible and impactful for anyone building modern language models.

  • ✓Time kernels correctly with CUDA synchronization. Wrap timing with torch.cuda.synchronize() before and after measurements. Use torch.cuda.Event for precise timings over many iterations. Ignore the first few warm-up runs to get stable numbers.
  • ✓Use block pointers and strides to simplify tile loads. tl.make_block_ptr encodes shape, strides, offsets, and tile shape. It helps the compiler generate efficient, coalesced loads. Verify your strides match the actual memory layout.
  • ✓Fix obvious indexing bugs early. If you use offsets_k in masks, define it with tl.arange(0, BLOCK_SIZE_K). Incorrect or missing indices often cause hard-to-debug errors or wrong results. Add asserts and small tests to catch mistakes early.
  • ✓Balance reuse and occupancy. Bigger tiles reduce memory traffic but can reduce the number of active programs. Too small tiles increase overhead and underutilize compute. Tune with profiling, not guesses.
  • ✓Rely on the compiler but guide it. Triton’s JIT applies loop unrolling and register allocation automatically. Giving it compile-time constants and clean access patterns lets it shine. Focus on the algorithmic structure and memory movement.
  • ✓Edge cases need careful masks. For M, N, or K not divisible by tile sizes, compute masks over both axes. Ensure both loads (A, B) and the final store (C) apply masks. Test with odd sizes to confirm robustness.
  • ✓Compare to strong baselines to set expectations. Highly optimized vendor libraries are tough to beat. Use them as references to validate performance and correctness. Aim for learning and targeted custom wins.
  • ✓Keep code simple and readable. Clear naming like offsets_m, offsets_n, and offsets_k reduces mistakes. Encapsulate index math cleanly. Simpler kernels are easier to debug and tune.
  • ✓Profile where it matters. Focus on matmul and other heavy ops first since they dominate runtime. Small wins there produce large end-to-end gains. Don’t over-optimize cold paths.
  • ✓Document assumptions about layouts and dtypes. Note that your code assumes row-major layout and specific strides. Record acceptable dtype combinations and precision choices. This prevents confusion during future changes or reuse.
  • Matrix Multiplication (Matmul)

    An operation that multiplies two matrices to make a new matrix. It is the workhorse for neural network layers and attention. It involves many multiply-accumulate steps. Efficient matmul is crucial for LLM speed.

    Memory Bandwidth

    How fast data can be moved between memory and compute units. On GPUs, it often limits speed more than the number of math units. If data arrives slowly, compute units sit idle. Efficient access patterns use bandwidth better.

    Shared Memory

    A small, very fast memory area on the GPU that threads in the same block can share. Using it reduces the need to fetch the same data from slow global memory many times. It enables tiling and reuse. It’s key to high-performance kernels.

    Register

    The fastest storage in a GPU, used for temporary values. Each thread has its own registers. Too many registers per thread can reduce how many threads run at once. Keeping hot data in registers speeds up compute.
