• Language modeling means predicting the next token (a token is a small piece of text like a word or subword) given all tokens before it. If you can estimate this next-token probability well, you can generate text by sampling one token at a time and appending it to the history. This step-by-step sampling turns probabilities into full sentences or paragraphs. Good models assign high probability to likely tokens and low probability to unlikely ones.
• A neural language model uses a neural network to map tokens to vectors (called embeddings) and then predicts the next token with a softmax layer. The softmax converts raw scores into a probability distribution over the entire vocabulary, giving one probability for each possible next token. The model learns its internal weights by comparing its predicted probabilities to the actual next token and improving over time.
• A feedforward language model looks at a fixed window of previous tokens, such as the last 3 or 10. It processes their embeddings through hidden layers and outputs a probability distribution. This works, but the input size scales linearly with the window, and the model cannot remember anything beyond that fixed limit. Choosing the window size is tricky, and the parameter count grows quickly as you enlarge it.
• Recurrent Neural Networks (RNNs) solve the fixed-window problem using a hidden state that acts like memory. At each step, the RNN updates this hidden state using the current token and the previous hidden state. In theory, it can remember information from very far back, giving it a variable-length context. The next token's probability is predicted from the current hidden state.
• Training basic RNNs can be difficult because of vanishing gradients, where learning signals fade as they move backward through many time steps. When gradients vanish, the model struggles to learn long-range connections between far-apart words. This makes it hard to link, for example, a subject early in a sentence with a verb much later, so long-term dependencies may be missed.
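The fading described in the last bullet is plain arithmetic. A toy sketch (the 0.5 per-step derivative is illustrative, not taken from any real network) showing how a learning signal shrinks as it flows back through many time steps:

```python
# Toy illustration of vanishing gradients: if each recurrent step
# contributes a derivative of about 0.5, the gradient reaching a
# token t steps back shrinks like 0.5 ** t.
STEP_DERIVATIVE = 0.5

def gradient_after(steps, step_derivative=STEP_DERIVATIVE):
    """Magnitude of the learning signal after flowing back `steps` time steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= step_derivative
    return grad

short_range = gradient_after(3)   # 0.125 -- still a usable signal
long_range = gradient_after(30)   # ~1e-9 -- effectively zero, so early tokens barely learn
```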
Why This Lecture Matters
This material is essential for anyone building or using language models: engineers, researchers, data scientists, and product teams. Picking the right architecture determines how well your model handles context, long-range dependencies, and scale. Feedforward models are easy but limited, RNNs handle sequences but can be hard to optimize, and LSTMs/GRUs manage memory better; Transformers dominate modern applications due to their strong performance and parallelism. Understanding these trade-offs helps you choose wisely for tasks like assistants, summarizers, or code completion.
The hyperparameters covered (learning rate, batch size, depth, hidden size, and dropout) directly control training stability, speed, and accuracy. Poor choices waste compute and time, while good choices can unlock major gains without changing the dataset. The lecture's emphasis on validation sets and search strategies gives you a practical, repeatable way to tune models responsibly and avoid overfitting. In real projects, this knowledge reduces trial-and-error, improves reliability, and speeds iteration.
Career-wise, these concepts are foundational in NLP and large language model development. Teams expect engineers to understand architecture options and to run careful hyperparameter experiments. Industry-wide, Transformers and attention have reshaped how we approach sequence modeling, making these ideas crucial for staying current. Mastering these basics prepares you for later topics like evaluation, scaling, and safety, and it positions you to contribute meaningfully to LLM-based products.
Lecture Summary
01 Overview
This lecture teaches the core model architectures used for language modeling and the key hyperparameters that determine how well they train. Language modeling means predicting the next token given all previous tokens; with a good next-token probability distribution, you can generate text one token at a time. The class starts from the simple feedforward language model, moves through recurrent models (RNNs, LSTMs, GRUs), and arrives at Transformers, which are the backbone of modern large language models. Along the way, you learn about embeddings, hidden layers, and softmax outputs: how tokens become vectors, how those vectors are processed, and how probabilities over the vocabulary are produced.
The lecture suits learners who know basic neural networks and are comfortable with the idea of tokens, vectors, and probability distributions. If you've seen introductory NLP or deep learning (like CS224N), the material will feel like a natural next step. No code is strictly required here, but understanding basic training ideas, like gradients and overfitting, helps. The focus is on concepts and design choices that guide practical modeling decisions.
By the end, you will know what makes each architecture tick and when you might choose one over another. You will be able to explain why feedforward models are limited by fixed context windows, how RNNs use a hidden state as memory, and how LSTMs and GRUs use gates to manage information flow and combat vanishing gradients. You will also understand why Transformers' self-attention excels at capturing long-range dependencies and has become the dominant approach. On the hyperparameters side, you will learn what learning rate, batch size, number of layers, hidden size, and dropout do; how they interact; and how to tune them wisely with validation sets and search strategies.
The lecture is structured in two major parts. First are the architectures: feedforward language models, RNNs (with LSTM and GRU variants), and Transformers with self-attention, including what the attention mechanism does and why it matters. Second are hyperparameters: what they are, why they matter, and practical strategies for choosing them. The flow builds from simple to complex. It starts by grounding you in the next-token prediction task and the building blocks (embeddings, hidden layers, softmax), then shows how each architecture solves the context and dependency problem differently. Finally, it emphasizes the importance of hyperparameters and shows how to pick them systematically using validation and automated search methods.
Key Takeaways
• Define the task precisely: predict p(next token | previous tokens) and use softmax to get a valid probability distribution. This clarity guides how you design inputs, outputs, and loss. During generation, sample one token at a time and append it to the context. Keep your data pipeline consistent between training and inference for best results.
• Start simple with embeddings, hidden layers, and softmax, then scale complexity as needed. Embeddings translate tokens into numeric vectors the network can learn from. Hidden layers combine these vectors into useful features. The final softmax converts scores into probabilities for sampling or evaluation.
• Know the feedforward LM's limits: a fixed window can miss important context. Enlarging the window increases parameters and still caps memory. Use it for short contexts or small projects where simplicity is key. For long-range needs, plan to move to RNNs or Transformers.
• Use RNNs to handle variable-length sequences when you need a memory across time. The hidden state carries forward what the model has learned so far. However, watch for vanishing gradients that hinder learning long-range patterns. Consider LSTM/GRU if you need stronger long-distance handling.
• Prefer LSTM or GRU when long-range dependencies matter but you still want a recurrent approach. Their gates control what to keep, forget, and output to fight vanishing gradients. They often outperform plain RNNs on longer sequences. GRU can be a faster, simpler alternative if compute is limited.
• Choose Transformers for state-of-the-art performance and scalability. Self-attention directly relates distant tokens without stepping through each intermediate time step. This leads to better long-range modeling and efficient parallel training. When in doubt for large corpora, Transformers are usually the best bet.
Glossary
Language modeling
Predicting the next token in a sequence given all the previous tokens. It turns reading history into a guess about the next piece of text. If the guesses are good, whole sentences and stories sound natural. This is the foundation of many text applications like chat and writing aids.
Token
A small unit of text like a word or a piece of a word. Computers handle text as tokens instead of full words so they can cover many languages and rare terms. Tokens are mapped to IDs so models can process them. Strings get split into tokens before modeling.
Vocabulary
The set of all tokens the model knows. Each token has a unique ID, like entries in a dictionary. The vocabulary size affects model size and output layer shape. A larger vocabulary can capture more words but uses more memory.
Embedding
A vector that represents a token's meaning in numbers. Similar meanings end up with similar vectors. The model learns embeddings during training. They help the network compare and combine words based on meaning.
Embedding layer
A lookup table that returns a token's embedding vector when given its ID. It is a big matrix with one row per token. The layer is learned so the vectors become useful for prediction. It's the first step after tokenization.
• LSTMs (Long Short-Term Memory networks) were created to fight vanishing gradients by adding gates that control what to keep, what to forget, and what to output. They maintain a cell state (long-term memory) and a hidden state (short-term memory). The input gate decides how much new information to add, the forget gate decides what to drop from memory, and the output gate decides what to reveal. This design helps LSTMs learn longer-range patterns more reliably.
• GRUs (Gated Recurrent Units) are a simpler alternative to LSTMs with fewer gates and parameters. They combine some of the LSTM functions into fewer steps, often training a bit faster. GRUs still maintain the idea of controlling information flow to reduce vanishing gradients. They are popular when speed and simplicity matter.
• Transformers use attention instead of recurrence to model sequences. Attention lets the model look at all tokens and weigh how relevant each is to the current position. This self-attention mechanism helps capture long-range dependencies directly and often more effectively than RNNs. Transformers have become the basis of many state-of-the-art language models.
• A Transformer is made of stacked layers that each include self-attention and feedforward sublayers. In an encoder-decoder setup, the encoder processes inputs and the decoder generates outputs, attending to both previous outputs and encoded inputs. For language modeling, the model focuses on previous tokens to predict the next one. The attention weights are computed, normalized with softmax, and used to blend information from all positions.
• Choosing model hyperparameters (learning rate, batch size, number of layers, hidden size, and dropout rate) has a large impact on performance. Learning rate controls how big each training step is; too big can overshoot, too small can crawl. Batch size trades noisy updates for speed and stability; too small is noisy, too large can be slow to update. Depth, width, and dropout balance capacity and overfitting.
• You pick hyperparameters using a validation set or automated search. A validation set is held-out data used to score different choices and keep the one that works best before final testing. Search strategies include grid search (systematic), random search (sampled), and Bayesian optimization (guided by past results). This process often requires careful experiments and patience.
• In practice, each architecture has strengths: feedforward models are simple but limited; RNNs handle sequence length but can be hard to train; LSTMs/GRUs improve memory handling; Transformers deliver top results with attention. The best choice depends on your data size, task needs, and compute budget. Hyperparameters must be tuned for each setup to get strong performance. These foundations set you up for evaluation and improvement techniques later.
02 Key Concepts
01
What is language modeling? (Definition) Language modeling is the task of predicting the next token given all previous tokens. (Analogy) It's like trying to guess the next word in a sentence by remembering everything you've read so far. (Technical) The model estimates a probability distribution p(next token | previous tokens), then samples from it to generate text. (Why it matters) Without a good next-token probability, generated text becomes random and incoherent. (Example) Given "The cat sat on the", a good model assigns high probability to "mat" and low probability to unrelated words like "galaxy."
02
Embeddings (Definition) Embeddings turn tokens into vectors of numbers the model can process. (Analogy) Think of embeddings like GPS coordinates that place words in a map of meaning. (Technical) They are learned lookup vectors from an embedding matrix, so similar words often end up near each other in vector space. (Why it matters) Raw tokens are symbols; embeddings let neural networks measure similarity and combine meanings. (Example) "cat" and "kitten" embeddings end up closer than "cat" and "car."
03
Softmax output (Definition) Softmax converts raw scores into a valid probability distribution over the vocabulary. (Analogy) It's like turning a set of messy votes into a clean poll result where all percentages add up to 100%. (Technical) It exponentiates each score and normalizes by the sum, producing nonnegative numbers that sum to 1. (Why it matters) The model must produce probabilities to sample or evaluate next tokens. (Example) If the scores favor "mat," softmax converts that preference into a high probability for "mat."
04
Feedforward language model (Definition) A feedforward LM predicts the next token from a fixed window of previous tokens using a standard neural network. (Analogy) It's like reading through a small peephole that only shows the last few words. (Technical) It concatenates or combines embeddings from the last n tokens, passes them through hidden layers, and applies softmax. (Why it matters) It's simple and works for short contexts but can't go beyond its fixed window. (Example) With a context window of 3, the model sees only "sat on the" to predict the next word after "The cat sat on the."
05
Context window limitation (Definition) The context window is how many previous tokens the model can see. (Analogy) It's like trying to solve a puzzle while only allowed to view a few pieces at a time. (Technical) In feedforward LMs, increasing the window increases parameters roughly linearly and still caps the memory. (Why it matters) Important clues can be outside the window, hurting long-range understanding. (Example) A reference made five sentences earlier gets lost if the window only covers the last 10 tokens.
06
Text generation by sampling (Definition) Sampling is generating text by picking the next token from the model's probability distribution and appending it, step by step. (Analogy) It's like building a sentence one Lego brick at a time, choosing the brick that fits best at each step. (Technical) After each sample, the new token joins the history, and the model predicts again for the next position. (Why it matters) This is how models turn probabilities into full outputs. (Example) Starting with "Once upon a," the model might produce "time," then "there," then "was," forming a story.
07
Recurrent Neural Network (RNN) LM (Definition) An RNN uses a hidden state that carries information forward through a sequence. (Analogy) Think of it as a notepad you update after reading each word, keeping track of the story so far. (Technical) At time t, it computes h_t = f(x_t, h_{t-1}), then uses h_t to predict the next token with softmax. (Why it matters) It can, in theory, use information from arbitrarily far back. (Example) The subject at the start of a sentence can influence the verb at the end through the hidden state.
08
Vanishing gradient problem (Definition) Vanishing gradients are extremely small learning signals that fade as they backpropagate through many time steps. (Analogy) It's like whispering a message down a long line of people; by the time it reaches the start, the message is too faint. (Technical) Repeated multiplications by small derivatives cause gradients to shrink towards zero in long sequences. (Why it matters) The model fails to learn long-range dependencies well. (Example) An RNN may not connect a pronoun with the correct noun it refers to if they are far apart.
09
LSTM essentials (Definition) LSTMs add gates and a cell state to control what to remember, forget, and output. (Analogy) It's like a smart notebook with tabs for long-term storage and sticky notes for short-term thoughts. (Technical) The input gate writes new info, the forget gate erases stale info, and the output gate reveals what's needed for prediction. (Why it matters) This structure keeps gradients healthier and preserves useful information longer. (Example) An LSTM can remember the topic of a paragraph while still focusing on the current sentence.
10
GRU essentials (Definition) GRUs are a simpler gated recurrent unit that also controls information flow. (Analogy) Think of it as a lighter backpack with fewer compartments than an LSTM, but still organized. (Technical) GRUs use update and reset gates to blend past and new information with fewer parameters. (Why it matters) They often train faster while still handling longer dependencies better than plain RNNs. (Example) A GRU can track a storyline across multiple sentences with less computation than an LSTM.
11
Attention mechanism (Definition) Attention lets a model weigh different parts of the input by relevance. (Analogy) It's like highlighting the most important words in a passage while skimming the rest. (Technical) The model computes attention scores between positions, turns them into weights with softmax, and forms a weighted sum of values. (Why it matters) It directly focuses on useful information, enabling strong long-range connections. (Example) The word "it" can attend back to "the book" to resolve what "it" refers to.
12
Self-attention (Definition) Self-attention applies attention within the same sequence, letting each position look at all other positions. (Analogy) It's like a group discussion where each sentence in a paragraph can reference any other sentence. (Technical) Each token becomes a query, key, and value; attention weights between queries and keys mix the values into new representations. (Why it matters) It captures dependencies without recurrence, often more efficiently. (Example) In "The animal didn't cross because it was too tired," "it" can strongly attend to "animal."
13
Transformer architecture (Definition) A Transformer is a stack of layers using self-attention and feedforward sublayers. (Analogy) It's like multiple rounds of smart highlighting and rewriting, each round refining the understanding. (Technical) Encoders and decoders can be used together; for language modeling, the model focuses on previous tokens to predict the next. (Why it matters) Transformers achieve state-of-the-art results and scale well. (Example) Models like GPT-3 and BERT are built on Transformer ideas.
14
Stacking layers (Definition) Stacking multiple layers deepens the model's processing steps. (Analogy) It's like passing your notes through several rounds of editing to improve clarity. (Technical) Each layer refines representations from the previous layer, enabling complex patterns to emerge. (Why it matters) Depth increases capacity but can risk overfitting if too large. (Example) A 12-layer Transformer can solve harder patterns than a 2-layer one.
15
Learning rate (Definition) Learning rate controls how big each parameter update is during training. (Analogy) It's like the size of steps you take while hiking: too big and you trip, too small and you crawl. (Technical) Optimizers multiply gradients by the learning rate to move weights; scheduling and adaptive methods adjust it over time. (Why it matters) A bad learning rate can prevent convergence or make training very slow. (Example) Lowering the learning rate after initial progress can stabilize training.
16
Batch size (Definition) Batch size is how many examples you use to compute a gradient update at once. (Analogy) It's like averaging many opinions before making a decision versus reacting to one at a time. (Technical) Larger batches give smoother gradients but fewer updates per epoch; smaller batches give noisier but more frequent updates. (Why it matters) It affects speed, stability, and generalization. (Example) Using a batch size of 64 may balance stability and compute limits on a modest GPU.
17
Number of layers (Definition) This is how deep your network is. (Analogy) It's like adding floors to a building: more floors mean more rooms to hold ideas. (Technical) More layers increase representational power but also training difficulty and overfitting risk. (Why it matters) Too few layers underfit; too many waste compute and may memorize noise. (Example) Starting with 4-6 layers and scaling up as needed is a common practice.
18
Hidden size (Definition) Hidden size is how wide each hidden layer is: the number of neurons per layer. (Analogy) Wider hallways fit more people; wider layers can carry more features. (Technical) Larger widths store richer representations but increase parameters and compute. (Why it matters) Too small can't capture patterns; too large can overfit and slow training. (Example) Choosing 512 vs. 1024 hidden units changes both capacity and GPU memory use.
19
Dropout (Definition) Dropout randomly turns off some neurons during training to prevent overfitting. (Analogy) It's like practicing a song with some instruments muted so everyone learns to carry the tune. (Technical) Each unit is dropped with a fixed probability, forcing the network to build redundant, robust features. (Why it matters) It improves generalization to unseen data. (Example) A dropout rate of 0.1-0.3 is often used to regularize medium-sized models.
20
Validation set (Definition) A validation set is held-out data used to choose hyperparameters and avoid overfitting. (Analogy) It's like testing your speech with a small audience before the big event. (Technical) You train on the training set, tune using validation performance, and only evaluate final results on a separate test set. (Why it matters) It prevents accidentally picking hyperparameters that only work on training data. (Example) Try learning rates 3e-4 and 1e-4 and keep the one with better validation loss.
21
Hyperparameter search (Definition) These are strategies to explore and pick good hyperparameters. (Analogy) It's like trying different recipes to find the tastiest cake. (Technical) Grid search tests a fixed set of values, random search samples values, and Bayesian optimization uses previous results to guide the next trials. (Why it matters) Good hyperparameters can dramatically improve performance. (Example) Random search over learning rate and dropout often beats a tiny hand-picked grid.
22
Choosing architectures (Definition) Selecting the model family that fits your data and goals. (Analogy) It's like choosing the right vehicle: a bike, a car, or a train depending on the trip. (Technical) Feedforward is simple but limited; RNN/LSTM/GRU handle sequences; Transformers scale and excel on complex tasks. (Why it matters) The wrong choice wastes time and compute and may underperform. (Example) For large text corpora and long contexts, a Transformer is usually best.
23
Attention weights as probabilities (Definition) Attention weights show how much each token focuses on others. (Analogy) It's like distributing your attention points across words in a sentence. (Technical) Scores are normalized by softmax to sum to 1, forming a weighted sum of value vectors. (Why it matters) This explicit weighting helps interpret what the model considers relevant. (Example) In a question, the attention may peak on the noun phrase that contains the answer.
24
Trade-offs in depth and width (Definition) Balancing how many layers (depth) and how many units per layer (width). (Analogy) It's like choosing between a tall building or a wider one, both offering more space but with different costs. (Technical) Depth adds sequential processing steps; width enriches each step's capacity. (Why it matters) The best mix depends on your data and compute. (Example) If training is unstable with very deep models, slightly reduce depth and increase hidden size instead.
03 Technical Details
Overall Architecture/Structure
The Language Modeling Objective
Goal: Estimate p(x_t | x_1, ..., x_{t-1}) for each position t in a sequence of tokens. A token is a small chunk of text like a word or subword.
Data flow (inference): Start with a prompt (context), compute the next-token probability distribution, sample a token, append it, and repeat.
Data flow (training): For each position t, the model predicts x_t from x_{<t}; compare the predicted distribution to the true token and update weights to improve future predictions.
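The inference loop can be sketched end to end. Here `next_token_probs` is a hypothetical stand-in for a trained model: it reads probabilities from a hand-written table instead of computing them with embeddings, hidden layers, and a softmax.

```python
import random

# Toy stand-in for a trained model: a hand-written table of
# next-token probabilities, keyed by the most recent token only.
TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def next_token_probs(context):
    """p(next token | previous tokens); this toy only looks at the last token."""
    return TABLE.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_new_tokens=5, seed=0):
    """Sample one token at a time, append it to the history, and repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights)[0]  # sample from the distribution
        if token == "<eos>":
            break
        tokens.append(token)  # the new token joins the context
    return tokens
```

A real model replaces `next_token_probs` with a forward pass; the sampling loop itself is unchanged.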
Core Building Blocks
Embedding Layer: A large lookup table that maps token IDs to dense vectors. If the vocabulary has V tokens and the embedding size is d, the table is V by d. When you input a token ID, you retrieve a d-dimensional vector representing that token's meaning.
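As a rough sketch of that lookup, with toy sizes and random vectors standing in for learned ones:

```python
import numpy as np

V, d = 5, 3  # tiny vocabulary of 5 tokens, 3-dimensional embeddings
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(V, d))  # V-by-d matrix, learned during training

token_id = 2
vector = embedding_table[token_id]  # "lookup" is just selecting row `token_id`
```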
Hidden Layers: These layers transform embeddings into richer features. They combine signals so the model can form patterns like syntax, topic, or semantics.
Softmax Output: Converts final scores (logits) for all V tokens into probabilities. The highest-probability token is most likely to be the next token.
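A minimal softmax implementation, assuming the standard exponentiate-and-normalize form (the max subtraction is a common numerical-stability trick, not something the lecture specifies):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities: nonnegative, summing to 1."""
    z = logits - np.max(logits)   # shift for numerical stability; result is unchanged
    exp = np.exp(z)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, -1.0]))  # highest logit gets highest probability
```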
Three Architectural Families
Feedforward LM: Uses a fixed window of previous tokens.
Recurrent LMs (RNN, LSTM, GRU): Use a hidden state passed through time.
Transformer: Uses self-attention to mix information across positions.
Feedforward Language Model: Structure and Data Flow
Inputs: The last n tokens (the context window). Each token goes through the embedding layer to become a vector.
Combine: Concatenate or otherwise combine n embeddings into one representation. Pass through one or more fully connected layers with nonlinearities.
Output: Final hidden layer outputs logits for each vocabulary token; softmax turns them into probabilities for the next token.
Parameter growth: If you increase n (the window size), the input dimension grows roughly linearly, increasing parameter count in the first hidden layer. This leads to heavier compute and memory usage, and still only allows n tokens of context.
Training: For each position t, use the previous n tokens as input and the actual token x_t as target. Update weights to improve the predicted distribution.
Inference: Slide the window forward one token at a time. For long text, the modelâs view is limited to the fixed window; it does not remember beyond it.
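The whole feedforward pipeline above can be sketched with toy sizes and random, untrained weights; `feedforward_lm` and all the dimensions here are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, H = 50, 8, 3, 16            # vocab size, embedding dim, window size, hidden size

E = rng.normal(size=(V, d))          # embedding table
W1 = rng.normal(size=(n * d, H))     # first hidden layer: input dim grows with n
b1 = np.zeros(H)
W2 = rng.normal(size=(H, V))         # output layer: one logit per vocabulary token
b2 = np.zeros(V)

def feedforward_lm(window):
    """Predict next-token probabilities from the last n token IDs."""
    x = np.concatenate([E[t] for t in window])   # concatenate n embeddings -> (n*d,)
    h = np.tanh(x @ W1 + b1)                     # hidden layer with nonlinearity
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                       # softmax over the vocabulary

probs = feedforward_lm([4, 17, 2])               # any n previous token IDs
```

Note that `W1` holds `n * d * H` parameters, so widening the window grows the first layer linearly, matching the parameter-growth point above.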
RNN Language Model: Structure and Data Flow
Hidden State as Memory: Maintain a vector h_t that summarizes all tokens up to time t. At each step: h_t = f(x_t, h_{t-1}). Here, x_t is the embedding of the current token.
Prediction: Use h_t to compute logits and apply softmax to get p(x_{t+1} | x_{<=t}). The model processes tokens sequentially.
Training: Unroll the network through time over a sequence. Compute the loss at each step and backpropagate gradients through the chain of time steps (backpropagation through time). Long sequences can cause gradients to vanish as they pass through many steps.
Vanishing Gradients: If derivatives at each step are small, multiplying many such terms causes the gradient to approach zero, weakening learning signals for early positions.
Practical Considerations: RNNs can, in theory, leverage long context. In practice, training stability and optimization challenges often limit effective memory, hence the development of gated variants.
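A minimal forward pass, assuming the common tanh update h_t = tanh(x_t Wx + h_{t-1} Wh + b); the weights are random stand-ins for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, H = 4, 6                          # embedding dim, hidden size
Wx = rng.normal(size=(d, H)) * 0.5   # input-to-hidden weights
Wh = rng.normal(size=(H, H)) * 0.5   # hidden-to-hidden weights
b = np.zeros(H)

def rnn_forward(embeddings):
    """Process a sequence one step at a time: h_t = f(x_t, h_{t-1})."""
    h = np.zeros(H)                  # initial hidden state (often zeros)
    states = []
    for x in embeddings:             # strictly sequential: no parallelism over time
        h = np.tanh(x @ Wx + h @ Wh + b)
        states.append(h)
    return states

sequence = rng.normal(size=(5, d))   # 5 token embeddings
states = rnn_forward(sequence)       # one hidden state per position
```

Each `states[t]` would feed a softmax layer to predict token t+1.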
LSTM: Gated Recurrence with Cell and Hidden States
States: Cell state c_t (long-term memory) and hidden state h_t (short-term output-like state). The design separates storage from exposure.
Gates (conceptually): Input gate decides how much new information to write to c_t. Forget gate decides how much of old c_{t-1} to erase. Output gate decides how much of c_t to reveal as h_t.
Update Flow (high-level):
Compute gate activations from x_t and h_{t-1}.
Propose new content (a candidate update) from x_t and h_{t-1}.
Form new cell state c_t by mixing old c_{t-1} (through forget gate) and candidate content (through input gate).
Form new hidden state h_t by filtering c_t through the output gate.
Why it helps: The cell state provides a more direct path for gradients to flow across many steps, reducing vanishing gradients and enabling learning of long-range dependencies.
Prediction: h_t feeds into a softmax layer to predict the next token.
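The update flow above, sketched with random weights; bias terms are omitted for brevity, and the gate equations follow the standard LSTM formulation rather than anything lecture-specific:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, H = 4, 5
# One weight matrix per gate plus the candidate, each acting on [x_t, h_{t-1}].
Wi, Wf, Wo, Wc = (rng.normal(size=(d + H, H)) * 0.5 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(z @ Wi)              # input gate: how much new info to write
    f = sigmoid(z @ Wf)              # forget gate: how much old memory to keep
    o = sigmoid(z @ Wo)              # output gate: how much of the cell to reveal
    c_tilde = np.tanh(z @ Wc)        # candidate content
    c = f * c_prev + i * c_tilde     # new cell state (long-term memory)
    h = o * np.tanh(c)               # new hidden state (short-term, fed to softmax)
    return h, c

h, c = lstm_step(rng.normal(size=d), np.zeros(H), np.zeros(H))
```

The additive `f * c_prev` term is the "more direct path" for gradients mentioned above.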
GRU: Simpler Gated Recurrent Unit
States: Typically a single hidden state h_t, no separate cell state.
Gates: Update and reset gates control how much of the previous state to keep and how much to mix with new input.
Update Flow (high-level):
Compute gates from x_t and h_{t-1}.
Build a candidate new state using x_t and a gated version of h_{t-1}.
Mix candidate and old state according to the update gate to get h_t.
Benefits: Fewer parameters than LSTM; often faster to train while remaining effective against vanishing gradients.
Prediction: h_t goes through softmax to predict the next token.
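A matching sketch for the GRU step, again with random weights and biases omitted; this follows the common update/reset-gate formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, H = 4, 5
Wz, Wr, Wn = (rng.normal(size=(d + H, H)) * 0.5 for _ in range(3))

def gru_step(x, h_prev):
    z_in = np.concatenate([x, h_prev])
    z = sigmoid(z_in @ Wz)                             # update gate: keep old vs take new
    r = sigmoid(z_in @ Wr)                             # reset gate: how much past to use
    n = np.tanh(np.concatenate([x, r * h_prev]) @ Wn)  # candidate new state
    return (1 - z) * n + z * h_prev                    # mix candidate and old state

h = gru_step(rng.normal(size=d), np.zeros(H))
```

Three weight matrices versus the LSTM's four (and no separate cell state) is where the parameter savings come from.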
Transformer: Attention-Centric Architecture
Self-Attention: Each token position computes query, key, and value vectors from its embedding. Attention scores between a query and all keys determine weights; the weighted sum of values becomes the new representation for that position.
Stacked Layers: Alternating self-attention and feedforward sublayers refine representations. Layer normalization and residual (skip) connections help training stability.
Encoder-Decoder (conceptual): Encoders read inputs; decoders generate outputs one token at a time while attending to encoder outputs (for tasks like translation). For pure language modeling over a single stream of text, the model focuses on previous tokens to predict the next.
Long-Range Dependencies: Self-attention provides a direct path between any pair of positions. This often captures distant relationships more easily than passing information through many recurrent steps.
Output: A final linear layer and softmax convert the last-layer representations into next-token probabilities.
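A single self-attention layer (one head, no causal mask or positional encoding, random weights) can be sketched as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
T, d_model, d_k = 4, 8, 8            # sequence length, model dim, key dim
X = rng.normal(size=(T, d_model))    # one row per token position

Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values for every position

scores = Q @ K.T / np.sqrt(d_k)      # relevance of every position to every other
weights = softmax(scores)            # each row is a probability distribution
output = weights @ V                 # weighted blend of values -> new representations
```

For language modeling, a causal mask would additionally set scores for future positions to a large negative value before the softmax, so each token attends only to previous tokens.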
Comparing Architectures
Feedforward LM: Simple and fast but limited to fixed windows. Parameter count and compute rise with window size.
RNN: Variable-length context via hidden state; sequential processing limits parallelism; training can be harder due to vanishing gradients.
Transformer: Parallelizable over sequence positions (self-attention can process tokens together during training) and strong at long-range relationships; foundation of modern large language models.
Hyperparameters: Roles and Effects
Learning Rate: Scales gradient steps. Too high can cause divergence or oscillation; too low makes learning slow. Schedules (like decaying the rate over time) and adaptive methods (that adjust per-parameter or over time) can help find a good path.
Batch Size: Larger batches yield smoother gradients but fewer updates per epoch; smaller batches provide more updates and noise that sometimes helps generalization. Often constrained by memory.
Number of Layers: Controls depth and capacity. More layers can learn more complex patterns but increase overfitting risk and training difficulty.
Hidden Size: Width of hidden representations. Larger width increases representational capacity and cost.
Dropout Rate: Probability of dropping units during training. Helps regularize and improve generalization.
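To make the learning-rate effect concrete, here is a minimal sketch that minimizes the toy function f(x) = x² by gradient descent. The specific step sizes are invented for illustration; the point is the qualitative behavior:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    # Minimize f(x) = x**2; its gradient is 2*x.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

# A moderate rate steadily approaches the minimum at x = 0 ...
good = gradient_descent(lr=0.1)
# ... while too large a rate overshoots further on every step and diverges.
bad = gradient_descent(lr=1.1)
```

The same pattern appears in neural network training: a diverging or oscillating loss often signals that the learning rate is too high.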
Selecting Hyperparameters
Validation Set: Split data into training and validation sets. Train on training, pick hyperparameters that score best on validation, and keep test data separate for final evaluation.
Search Methods: Grid search (systematic combinations), random search (sampled combinations), and Bayesian optimization (guided by past evaluations). Random search often explores more diverse settings for the same budget than tiny grids.
Practical Strategy: Start with sensible defaults, run small pilot experiments to check learning, then refine choices. Monitor training and validation metrics to avoid overfitting.
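Random search is easy to sketch. In this toy example, a made-up scoring function stands in for a real train-and-validate run (in practice you would train a model at each sampled setting and record its validation score):

```python
import math
import random

def validation_score(lr, dropout):
    # Stand-in for a real train-and-evaluate run: an invented surface
    # that peaks near lr = 0.01 and dropout = 0.2.
    return -(math.log10(lr) + 2) ** 2 - (dropout - 0.2) ** 2

random.seed(0)
best = None
for _ in range(20):
    # Sample the learning rate on a log scale and dropout uniformly.
    lr = 10 ** random.uniform(-4, 0)
    dropout = random.uniform(0.0, 0.5)
    score = validation_score(lr, dropout)
    if best is None or score > best[0]:
        best = (score, lr, dropout)
best_score, best_lr, best_dropout = best
```

Sampling the learning rate on a log scale is a common choice, since plausible values span several orders of magnitude.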
Step-by-Step Implementation Guide (Conceptual)
A) Feedforward LM
Tokenization and Vocabulary: Map text to token IDs; decide on a window size n.
Embeddings: Build an embedding matrix; convert each of the last n tokens to vectors.
Combine and Transform: Concatenate embeddings (or average) and pass through one or more fully connected layers with nonlinearities.
Output: Use a final linear layer to produce logits for all vocabulary tokens; apply softmax to get probabilities.
Training Loop: For each training batch, compute loss comparing the softmax distribution to the true next token; update parameters.
Inference: Slide the window; sample the next token; append and repeat.
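The steps above can be sketched as a single forward pass in plain Python. The tiny vocabulary, dimensions, and random weights are invented for illustration; a real model would learn the weights during training:

```python
import math
import random

random.seed(0)
vocab = ["<s>", "the", "cat", "sat"]        # toy vocabulary
V, D, N, H = len(vocab), 4, 2, 8            # vocab, embed dim, window, hidden

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

E = rand_matrix(V, D)        # embedding matrix (one row per token)
W1 = rand_matrix(N * D, H)   # concatenated window -> hidden layer
W2 = rand_matrix(H, V)       # hidden layer -> vocabulary logits

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(window_ids):
    # 1) Look up and concatenate the embeddings of the last N tokens.
    x = [value for t in window_ids for value in E[t]]
    # 2) One fully connected hidden layer with a tanh nonlinearity.
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(N * D)))
         for j in range(H)]
    # 3) Linear projection to logits, then softmax over the vocabulary.
    logits = [sum(h[j] * W2[j][k] for j in range(H)) for k in range(V)]
    return softmax(logits)

probs = forward([vocab.index("the"), vocab.index("cat")])
```

Note how the first-layer weight matrix has N * D rows: this is why parameter count grows directly with the window size.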
B) RNN/LSTM/GRU LM
Tokenization and Vocabulary: As above; no fixed window is required.
Embeddings: Convert each token to its vector representation.
Recurrent Step: For each time step, update hidden state (and cell state for LSTM) using current embedding and previous state.
Output: From h_t, compute logits and softmax for the next token.
Training Loop: Unroll across time for each sequence segment; compute losses; backpropagate through time; update parameters.
Inference: Start with an initial hidden state (often zero); after each sampled token, feed it back and continue.
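The recurrent update h_t = tanh(W_xh x_t + W_hh h_{t-1}) can be sketched as follows, with made-up dimensions and untrained random weights for illustration:

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

def rnn_step(x, h_prev, Wxh, Whh):
    # h_t = tanh(Wxh . x_t + Whh . h_{t-1}): the new state mixes the
    # current input with the previous memory.
    return [math.tanh(sum(x[i] * Wxh[i][j] for i in range(len(x)))
                      + sum(h_prev[i] * Whh[i][j] for i in range(len(h_prev))))
            for j in range(len(h_prev))]

Wxh = rand_matrix(2, 3)   # input-to-hidden weights (2-dim input, 3 hidden units)
Whh = rand_matrix(3, 3)   # hidden-to-hidden weights
h = [0.0, 0.0, 0.0]       # start from a zero hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:   # a toy 2-step input sequence
    h = rnn_step(x, h, Wxh, Whh)
```

The same fixed-size state vector is reused at every step, which is how the model handles variable-length context without a fixed window.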
C) Transformer LM (conceptual for language modeling)
Tokenization and Vocabulary: Map text to token IDs; add positional information so the model knows order.
Embeddings + Position Information: Sum token embeddings with position encodings.
Self-Attention Layers: For each layer, compute queries, keys, values; form attention weights; mix values to produce new representations.
Feedforward Sublayer: Apply a position-wise feedforward network to each position; use residual connections and normalization.
Output: A final projection to vocabulary size followed by softmax returns the next-token distribution.
Training Loop: Compute loss over all positions where the target is the next token; update parameters.
Inference: Generate tokens autoregressively; at each step, use the previously generated tokens to predict the next.
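One detail worth seeing concretely is the causal mask that keeps a Transformer language model autoregressive. In this toy sketch, scores for future positions are set to negative infinity before the softmax, so they receive exactly zero attention weight:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores, position):
    # Scores for positions after `position` are masked with -inf, so the
    # softmax assigns them exactly zero weight: a token may attend only
    # to itself and to earlier tokens.
    masked = [s if j <= position else float("-inf")
              for j, s in enumerate(scores)]
    return softmax(masked)

# Position 1 may see positions 0 and 1, but not the future position 2.
w = causal_attention_weights([0.5, 1.0, 2.0], position=1)
```

Without this mask, the model could "peek" at the token it is supposed to predict, and training would not match autoregressive inference.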
Tips and Warnings
Context vs. Capacity: Feedforward models can't grow context without rapidly growing parameters. If you need long context, prefer RNN/LSTM/GRU or a Transformer.
Training Stability: RNNs may suffer from vanishing gradients. Gated variants help; monitor losses and consider careful initialization and learning rate choices.
Parallelism: Transformers allow more parallel computation during training than RNNs, often speeding up large-scale training.
Overfitting: Large depth/width can memorize training data. Use dropout and monitor validation performance.
Hyperparameter Sensitivity: Learning rate is often the most sensitive; tune it first. Batch size influences stability and speed, but pick one that fits memory.
Experiment Discipline: Always evaluate on validation data when adjusting hyperparameters; avoid peeking at the test set.
Tools/Libraries (General Guidance)
While no specific tools are required here, practical implementations typically use deep learning frameworks like PyTorch or TensorFlow. They provide embedding layers, RNN/LSTM/GRU modules, and attention layers out of the box.
Basic usage involves defining layers, writing a forward pass to compute logits, and a training loop that computes loss and updates parameters with an optimizer.
In Summary
The feedforward LM is simple and limited by a fixed context window. RNNs replace that window with a hidden state but face vanishing gradients; LSTMs/GRUs add gates to better manage memory. Transformers use self-attention to directly connect distant tokens and now dominate language modeling. Hyperparameters (especially learning rate, batch size, depth, hidden size, and dropout) strongly shape training outcomes. Use validation sets and systematic search to choose them well.
Conclusion
This material brings together the essential model choices for language modeling and the hyperparameters that make them work well. A feedforward language model reads a fixed set of previous tokens, runs them through hidden layers, and predicts the next token via softmax. Its main weakness is the fixed window, which grows parameters and still cannot remember beyond its size. Recurrent models address that with a hidden state that carries memory through time, but plain RNNs often struggle with vanishing gradients and have trouble learning long-range dependencies reliably.
LSTMs and GRUs introduce gates that decide what to remember, what to forget, and what to show the next step, greatly improving the ability to learn long-term patterns. Transformers replace recurrence with self-attention, allowing each position to directly focus on any other part of the sequence. This makes it easier to capture distant relationships and scales well, which is why Transformers power many of today's leading language models.
Hyperparameters such as learning rate, batch size, number of layers, hidden size, and dropout determine how smoothly and effectively training progresses. There is no single perfect setting; tuning with a validation set and searches like grid, random, or Bayesian optimization often yields the best configuration. A good workflow is to start with reasonable defaults, run small experiments, watch training and validation behavior, and refine the choices step by step.
The core message is to match the architecture to the task and data, then tune hyperparameters with discipline. Feedforward models are simple; RNNs capture sequences; LSTMs/GRUs manage long-term memory; Transformers excel at long-range attention and scale. With careful selection and tuning, you can build language models that predict the next token accurately and generate coherent text.
• Tune learning rate first; it's the most sensitive hyperparameter. If training diverges, lower it; if learning is too slow, raise it. Consider using a learning rate schedule to improve stability over time. Keep careful notes on what settings work.
• Pick a batch size that fits your hardware while maintaining stable updates. Larger batches give smoother gradients but fewer updates per epoch; smaller batches add noise that can sometimes help generalization. Try a few sizes and watch validation metrics for guidance. Don't sacrifice too much stability just to use a larger batch.
• Balance the number of layers and hidden size to match your dataset and compute. More layers and width increase capacity but risk overfitting. If training becomes unstable or too slow, scale back and try modest increases. Always check validation performance before scaling further.
• Use dropout to fight overfitting, especially as models get larger. Start with a small rate (like 0.1–0.3) and adjust based on validation performance. Remember that too much dropout can slow learning. Find the sweet spot where the model generalizes well.
• Always use a validation set to choose hyperparameters and avoid overfitting to the training set. Do not tune on the test set. Track validation loss or accuracy to compare runs fairly. Save the best model according to validation metrics, not training metrics.
• Try simple hyperparameter searches before complex ones. Random search over a few key parameters can outperform small grids and is easy to run. If you have more time or budget, consider guided methods like Bayesian optimization. Keep experiments organized and reproducible.
• Match the architecture to the problem's context requirements. Short sequences with limited context may work with feedforward models. Longer or more complex dependencies call for LSTM/GRU or Transformers. Consider your compute budget and time constraints as part of the decision.
• Observe training curves to diagnose issues early. Spiking loss may signal a too-high learning rate. Flat curves may mean learning rate is too low or the model is underpowered. Use these signals to adjust hyperparameters methodically.
• Design experiments to change one major factor at a time. If you alter many settings at once, it's hard to know what helped. Start with learning rate, then adjust batch size, depth, width, and dropout. Keep a log and name runs clearly.
• Ensure inference mirrors training assumptions. If you trained to predict the next token given previous tokens, generate autoregressively in the same way. Feed each sampled token back into the model for the next prediction. Differences between train and test procedures can hurt results.
Hidden layer
A layer that transforms embeddings into richer features using learned weights and nonlinear functions. It compresses and mixes signals to find useful patterns. Multiple hidden layers can capture complex structures. They sit between input embeddings and the output.
Softmax
A function that turns scores into probabilities that add to 1. It highlights big scores and shrinks small ones. This lets the model produce a true probability distribution over the vocabulary. It's used at the last step before sampling.
Probability distribution
A list of probabilities for all possible outcomes that sum to 1. In language modeling, it assigns a probability to each possible next token. Higher probabilities mean the model believes that token fits better. Sampling uses this distribution to pick the next token.
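Both definitions can be seen in a few lines of Python: softmax turns raw scores into a distribution that sums to 1, and sampling picks the next token in proportion to it (the token names here are invented for illustration):

```python
import math
import random

def softmax(scores):
    # Exponentiate (after subtracting the max for stability) and normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # the highest score gets the largest share

random.seed(0)
# random.choices picks an element in proportion to the given weights.
next_token = random.choices(["cat", "dog", "sat"], weights=probs)[0]
```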