Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 1: Overview and Tokenization
Key Summary
- This session introduces a brand-new course on building language models from scratch. You learn what language modeling is, where it’s used (speech recognition, translation, text generation, classification), and how the different modeling families work. The class emphasizes implementing models yourself in Python and PyTorch, plus how to train and evaluate them.
- A language model assigns probabilities to word sequences and predicts the next word. For example, given “the cat sat on the,” it estimates which word most likely comes next (like “mat”). Thinking in probabilities lets computers judge how natural or likely a sentence is.
- Classic N-gram models estimate the next word using counts of short word windows (like bigrams and trigrams). You count how often sequences occur in a corpus and compute conditional probabilities. This approach is simple, fast, and a foundation for understanding modern neural methods.
- Neural language models learn patterns beyond fixed windows and can handle long-range relationships. Recurrent neural networks (RNNs) and LSTMs capture sequences over time. Transformers use attention to focus on important words and are the state of the art for large language models.
- Evaluation focuses on perplexity and BLEU. Perplexity is like the average number of choices the model considers at each step; lower is better. BLEU compares generated text to human references and is common in machine translation.
- Tokenization breaks text into pieces called tokens, often words and punctuation. Simple whitespace splitting fails on punctuation and contractions (like “I’m”). Smarter tokenizers split “I’m” into “I” and “’m” and separate punctuation like periods.
- Common tokenization tools include NLTK and spaCy, plus specialized tokenizers for code or social media. Good tokenization matters because models learn from and predict tokens; poor tokenization confuses models and lowers accuracy.
Why This Lecture Matters
This lecture sets a practical foundation for anyone who needs to build or understand language systems. Software engineers, data scientists, and researchers gain a clear roadmap from raw text to working models. By explaining where language models are used—speech recognition, translation, generation, and classification—you can connect the concepts to real products like voice assistants, chatbots, and content tools.

The emphasis on tokenization and subwords addresses one of the most common failure points in NLP pipelines. Many projects underperform not because the model is weak, but because the text was split and normalized poorly. Learning robust tokenization, handling contractions and punctuation, and normalizing numbers immediately improves model quality. Subword tokenization (BPE, WordPiece, SentencePiece) further solves rare-word and vocabulary-size problems, empowering you to scale beyond toy datasets toward real-world corpora.

Understanding classic N-gram models prepares you to reason about probabilities, sparsity, and evaluation with perplexity and BLEU. These skills transfer directly to modern neural models like Transformers. With the course’s coding focus in Python and PyTorch and GPU support, you can implement and train models that matter in industry settings. Building this knowledge now strengthens your career, as language modeling underpins today’s most impactful AI applications—from large language models powering enterprise copilots to tools that automate analysis and content creation. In short, mastering these fundamentals lets you diagnose problems, design better pipelines, and communicate model behavior clearly to teammates and stakeholders. It’s the difference between hoping a prebuilt model works and confidently building one that does.
Lecture Summary
Overview
This first lecture launches a hands-on course about building language models—the computer systems that assign probabilities to sequences of words and can predict the next word. The focus is on learning both the ideas and the practical skills to construct models from scratch. The class covers classic N-gram models, modern neural models (RNNs, LSTMs, Transformers), training processes, evaluation metrics, and especially the crucial preprocessing step: tokenization. Along the way, you will learn how to prepare data, select hyperparameters, and measure how well your model works.
The lecture starts with course goals: deeply understand language modeling; implement systems in Python and PyTorch; and master training and evaluation. You’ll see where language models are used every day—speech recognition, machine translation, text generation, and text classification. For example, in speech recognition, a language model helps choose between ambiguous sound interpretations like “recognize speech” versus “wreck a nice beach.” In translation, it pushes the system toward fluent, grammatical outputs. In generation, it produces new text that resembles the training data; and in classification, it helps distinguish categories like spam vs. not spam.
You learn what a language model does in precise terms: it estimates the probability of a word sequence, such as P(w1, w2, …, wn), and uses that to predict likely continuations, like the word after “the cat sat on the.” Classic N-gram models compute these probabilities from counts of short windows of words (like bigrams and trigrams). Neural language models go further by learning richer patterns and capturing long-distance relationships: RNNs process sequences step by step, LSTMs handle long-term dependencies, and Transformers use attention to focus on the most relevant words, defining today’s state-of-the-art.
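The counting recipe for N-grams can be sketched in a few lines. This is an illustrative bigram estimator, not code from the course; the toy corpus and function names are assumptions made for the example.

```python
from collections import defaultdict

def train_bigram(tokens):
    """Count bigrams and unigrams, then form conditional probabilities."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for w1, w2 in zip(tokens, tokens[1:]):
        unigram[w1] += 1
        bigram[(w1, w2)] += 1
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

corpus = "the cat sat on the mat the cat sat on the hat".split()
probs = train_bigram(corpus)
print(probs[("the", "cat")])  # → 0.5 (2 of the 4 "the" tokens are followed by "cat")
```

Note how quickly sparsity bites: any bigram that never occurs in the corpus gets no entry at all, which is exactly the problem that smoothing and backoff (previewed at the end of the lecture) address.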
Evaluation uses metrics that quantify how well a model performs. Perplexity measures how many choices a model effectively juggles on average at each step—the lower, the better. BLEU (often used for machine translation) compares model output to human references to gauge similarity and fluency. These metrics guide you to improve models and compare different approaches.
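The "effective number of choices" intuition for perplexity follows directly from its definition as the exponentiated average negative log-probability per token. A minimal sketch (the function name and inputs are illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

A perfect model (probability 1.0 at every step) reaches the minimum perplexity of 1; anything worse than uniform guessing over the vocabulary signals a broken model or pipeline.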
Before you can train any model, you must tokenize text—that is, break it into tokens. Naive whitespace splitting fails on real text because it mishandles punctuation and contractions. A better tokenizer splits “I’m going to the store.” into [“I”, “’m”, “going”, “to”, “the”, “store”, “.”]. Tokenization tools like NLTK and spaCy can handle these rules, and there are specialized tokenizers for domains like code and social media. You also learn about tokenization challenges: contractions, punctuation, and numbers (which come in many formats, like 1,000.00 and “one thousand”). Normalizing numbers can simplify the model’s job.
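The store example can be reproduced with a small rule-based tokenizer. This regex sketch is a simplification of what NLTK and spaCy actually do (they apply many more rules), but it captures the two fixes the lecture highlights: splitting off contractions and separating punctuation.

```python
import re

# Match, in order of preference: a contraction suffix ('m, 's, 'll),
# a run of word characters, or a single punctuation character.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("I'm going to the store."))
# → ['I', "'m", 'going', 'to', 'the', 'store', '.']
```

Whitespace splitting on the same sentence would produce the single messy token `store.`, which is exactly the failure mode the lecture warns about.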
Subword tokenization addresses rare words and large vocabularies by breaking words into smaller pieces. For example, “unbelievable” becomes “un,” “believe,” and “able.” Common subword methods include Byte Pair Encoding (BPE), WordPiece, and SentencePiece. BPE starts with characters and merges the most frequent adjacent pairs repeatedly until reaching a desired vocabulary size. A toy example with words like “low,” “lower,” “newest,” and “widest” shows how frequent pairs such as “e”+“s” merge into “es,” and then “es”+“t” into “est.”
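The BPE merge loop from the toy example can be sketched as follows. The word frequencies here are assumptions chosen so that "e"+"s" and then "es"+"t" are the most frequent pairs, mirroring the lecture's walkthrough; a real implementation would also track the learned merge rules for use at inference time.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps a word (as a tuple of symbols) to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the lecture: word → frequency, words as character tuples.
vocab = {
    tuple("low"): 5, tuple("lower"): 2,
    tuple("newest"): 6, tuple("widest"): 3,
}
for _ in range(2):
    pair = most_frequent_pair(vocab)
    print("merging", pair)
    vocab = merge_pair(vocab, pair)
```

After two merges, "newest" and "widest" both end in the single symbol "est". Repeating this loop until the vocabulary reaches a target size is the whole BPE training algorithm.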
The lecture also covers course structure and logistics. You’ll have lectures, homework, and a final project; grading is 60% homework, 30% project, and 10% participation. Prerequisites include introductory programming (CS106A or equivalent) and probability/statistics (CS109 or equivalent); deep learning familiarity is helpful but not required. Resources include a course website with notes and assignments, Piazza for Q&A, recorded lectures, and office hours offered by the instructor and TAs (Alice, Bob, and Carol). You’ll code in Python and PyTorch and have access to GPUs for training. For extra learning, the PyTorch website and CS230 materials provide excellent tutorials.
Finally, you get a preview of what’s next: a deeper dive into N-gram language models, including smoothing, backoff, and interpolation—methods that handle the problem of unseen word sequences by sensibly redistributing probability. This foundation prepares you for the neural approaches covered later, culminating in modern transformer-based large language models.
Key Takeaways
- Start with clean tokenization. Split punctuation and contractions, and choose a consistent number normalization policy. Test tokenizers (NLTK, spaCy) on real samples from your domain. Keep the same rules for both training and inference to avoid mismatches. Small tokenization fixes often yield big accuracy gains.
- Use subword tokenization to control vocabulary size. Train BPE/WordPiece/SentencePiece on representative data. Begin with a 30K vocabulary and adjust based on coverage and performance. Subwords reduce OOVs and improve generalization to rare words. They also save memory compared to giant word-level vocabularies.
- Build a simple N-gram model first. Implement bigrams or trigrams and compute perplexity on a held-out set. This baseline clarifies how data and tokenization affect results. It helps you understand probabilities and sparsity before jumping to neural models. You’ll make better decisions later with this grounding.
- Measure with perplexity and, if relevant, BLEU. Perplexity tells you how well the model predicts next words. BLEU is useful for translation or text generation tasks with references. Track these metrics as you change tokenization and vocabulary size. Let the numbers guide your improvements.
- Normalize numbers early. Decide how to handle commas, decimals, and word numbers (“one thousand”). Map equivalent forms to a single representation. This prevents the model from wasting capacity on superficial differences. It simplifies counts and improves generalization.
- Treat punctuation as tokens. Separate periods, commas, and question marks from words. This reveals sentence boundaries and improves downstream tasks. You’ll get cleaner counts and better learning of language structure. Avoid attaching punctuation to words (as in “store.”).
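One possible number-normalization policy is to collapse every digit-based form to a single placeholder token. The `<NUM>` placeholder and the regex below are assumptions for illustration, not a policy prescribed by the lecture; mapping word numbers like “one thousand” to the same placeholder would additionally require a lexicon lookup, which is omitted here.

```python
import re

# Digit-based numbers: an optional thousands-separated integer part
# followed by an optional decimal part, e.g. 42, 1,000.00, 3.14.
NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def normalize_numbers(text, placeholder="<NUM>"):
    """Replace every digit-based number with one placeholder token."""
    return NUM_RE.sub(placeholder, text)

print(normalize_numbers("It cost 1,000.00 dollars, about 3.14 per unit."))
# → 'It cost <NUM> dollars, about <NUM> per unit.'
```

Whatever policy you pick, apply it identically at training and inference time; a mismatch here silently degrades the model.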
Glossary
Language model (LM)
A system that assigns probabilities to sequences of words and predicts likely next words. It helps computers judge how natural a sentence is. By learning patterns from text, it can continue text or rank choices. LMs power tasks like speech recognition and translation. Good LMs make outputs fluent and sensible.
Tokenization
The process of splitting text into small pieces called tokens, such as words and punctuation. It makes raw text easier for models to understand. Good tokenization handles contractions and punctuation correctly. It reduces confusion and makes patterns clearer. Poor tokenization leads to messy learning.
Token
A unit of text used by a model, often a word, punctuation mark, or subword piece. Tokens are like puzzle pieces; the model learns how they fit together. They form sequences that the model predicts over. Clear token boundaries improve learning. Tokens are mapped to IDs for model input.
N-gram
A sequence of N tokens. N-grams capture short context windows in text. Models use them to estimate the next word’s probability from recent words. Larger N captures more context but increases sparsity. N-grams are a classic, simple modeling approach.
