• Language modeling means predicting the next token (a token is a small piece of text like a word or subword) given all tokens before it. If you can estimate this next-token probability well, you can generate text by sampling one token at a time and appending it to the history. This step-by-step sampling turns probabilities into full sentences or paragraphs. Good models assign high probability to likely tokens and low probability to unlikely ones.
• A neural language model uses a neural network to map tokens to vectors (called embeddings) and then predicts the next token with a softmax layer. The softmax converts raw scores into a probability distribution over the entire vocabulary, giving one probability for each possible next token. The model learns its internal weights by comparing its predicted probabilities to the actual next token and improving over time.
• A feedforward language model looks at a fixed window of previous tokens, such as the last 3 or 10. It processes their embeddings through hidden layers and outputs a probability distribution. This works, but the input size scales linearly with the window, and the model cannot remember anything beyond that fixed limit. Choosing the window size is tricky, and the parameter count grows quickly as you enlarge it.
• Recurrent Neural Networks (RNNs) solve the fixed-window problem using a hidden state that acts like memory. At each step, the RNN updates this hidden state using the current token and the previous hidden state. In theory, it can remember information from very far back, giving it a variable-length context. The next token's probability is predicted from the current hidden state.
• Training basic RNNs can be difficult because of vanishing gradients, where learning signals fade as they move backward through many time steps. When gradients vanish, the model struggles to learn long-range connections between far-apart words. This makes it hard to link, for example, a subject early in a sentence with a verb much later, so long-term dependencies may be missed.
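The fading described in the last bullet is plain arithmetic. A toy sketch (the 0.5 per-step derivative is illustrative, not taken from any real network) showing how a learning signal shrinks as it flows back through many time steps:

```python
# Toy illustration of vanishing gradients: if each recurrent step
# contributes a derivative of about 0.5, the gradient reaching a
# token t steps back shrinks like 0.5 ** t.
STEP_DERIVATIVE = 0.5

def gradient_after(steps, step_derivative=STEP_DERIVATIVE):
    """Magnitude of the learning signal after flowing back `steps` time steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= step_derivative
    return grad

short_range = gradient_after(3)   # 0.125 -- still a usable signal
long_range = gradient_after(30)   # ~1e-9 -- effectively zero, so early tokens barely learn
```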
Why This Lecture Matters
This material is essential for anyone building or using language models: engineers, researchers, data scientists, and product teams. Picking the right architecture determines how well your model handles context, long-range dependencies, and scale. Feedforward models are easy but limited, RNNs handle sequences but can be hard to optimize, and LSTMs/GRUs manage memory better; Transformers dominate modern applications due to their strong performance and parallelism. Understanding these trade-offs helps you choose wisely for tasks like assistants, summarizers, or code completion.
The hyperparameters covered (learning rate, batch size, depth, hidden size, and dropout) directly control training stability, speed, and accuracy. Poor choices waste compute and time, while good choices can unlock major gains without changing the dataset. The lecture's emphasis on validation sets and search strategies gives you a practical, repeatable way to tune models responsibly and avoid overfitting. In real projects, this knowledge reduces trial-and-error, improves reliability, and speeds iteration.
Career-wise, these concepts are foundational in NLP and large language model development. Teams expect engineers to understand architecture options and to run careful hyperparameter experiments. Industry-wide, Transformers and attention have reshaped how we approach sequence modeling, making these ideas crucial for staying current. Mastering these basics prepares you for later topics like evaluation, scaling, and safety, and it positions you to contribute meaningfully to LLM-based products.
Lecture Summary
01 Overview
This lecture teaches the core model architectures used for language modeling and the key hyperparameters that determine how well they train. Language modeling means predicting the next token given all previous tokens; with a good next-token probability distribution, you can generate text one token at a time. The class starts from the simple feedforward language model, moves through recurrent models (RNNs, LSTMs, GRUs), and arrives at Transformers, which are the backbone of modern large language models. Along the way, you learn about embeddings, hidden layers, and softmax outputs: how tokens become vectors, how those vectors are processed, and how probabilities over the vocabulary are produced.
The lecture suits learners who know basic neural networks and are comfortable with the idea of tokens, vectors, and probability distributions. If you've seen introductory NLP or deep learning (like CS224N), the material will feel like a natural next step. No code is strictly required here, but understanding basic training ideas, like gradients and overfitting, helps. The focus is on concepts and design choices that guide practical modeling decisions.
By the end, you will know what makes each architecture tick and when you might choose one over another. You will be able to explain why feedforward models are limited by fixed context windows, how RNNs use a hidden state as memory, and how LSTMs and GRUs use gates to manage information flow and combat vanishing gradients. You will also understand why Transformers' self-attention excels at capturing long-range dependencies and has become the dominant approach. On the hyperparameters side, you will learn what learning rate, batch size, number of layers, hidden size, and dropout do; how they interact; and how to tune them wisely with validation sets and search strategies.
The lecture is structured in two major parts. First are the architectures: feedforward language models, RNNs (with LSTM and GRU variants), and Transformers with self-attention, including what the attention mechanism does and why it matters. Second are hyperparameters: what they are, why they matter, and practical strategies for choosing them. The flow builds from simple to complex. It starts by grounding you in the next-token prediction task and the building blocks (embeddings, hidden layers, softmax), then shows how each architecture solves the context and dependency problem differently. Finally, it emphasizes the importance of hyperparameters and shows how to pick them systematically using validation and automated search methods.
Key Takeaways
• Define the task precisely: predict p(next token | previous tokens) and use softmax to get a valid probability distribution. This clarity guides how you design inputs, outputs, and loss. During generation, sample one token at a time and append it to the context. Keep your data pipeline consistent between training and inference for best results.
• Start simple with embeddings, hidden layers, and softmax, then scale complexity as needed. Embeddings translate tokens into numeric vectors the network can learn from. Hidden layers combine these vectors into useful features. The final softmax converts scores into probabilities for sampling or evaluation.
• Know the feedforward LM's limits: a fixed window can miss important context. Enlarging the window increases parameters and still caps memory. Use it for short contexts or small projects where simplicity is key. For long-range needs, plan to move to RNNs or Transformers.
• Use RNNs to handle variable-length sequences when you need a memory across time. The hidden state carries forward what the model has learned so far. However, watch for vanishing gradients that hinder learning long-range patterns. Consider LSTM/GRU if you need stronger long-distance handling.
• Prefer LSTM or GRU when long-range dependencies matter but you still want a recurrent approach. Their gates control what to keep, forget, and output to fight vanishing gradients. They often outperform plain RNNs on longer sequences. GRU can be a faster, simpler alternative if compute is limited.
• Choose Transformers for state-of-the-art performance and scalability. Self-attention directly relates distant tokens without stepping through each intermediate time step. This leads to better long-range modeling and efficient parallel training. When in doubt for large corpora, Transformers are usually the best bet.
Glossary
Language modeling
Predicting the next token in a sequence given all the previous tokens. It turns reading history into a guess about the next piece of text. If the guesses are good, whole sentences and stories sound natural. This is the foundation of many text applications like chat and writing aids.
Token
A small unit of text like a word or a piece of a word. Computers handle text as tokens instead of full words so they can cover many languages and rare terms. Tokens are mapped to IDs so models can process them. Strings get split into tokens before modeling.
Vocabulary
The set of all tokens the model knows. Each token has a unique ID, like entries in a dictionary. The vocabulary size affects model size and output layer shape. A larger vocabulary can capture more words but uses more memory.
Embedding
A vector that represents a token's meaning in numbers. Similar meanings end up with similar vectors. The model learns embeddings during training. They help the network compare and combine words based on meaning.
Embedding layer
A lookup table that returns a token's embedding vector when given its ID. It is a big matrix with one row per token. The layer is learned so the vectors become useful for prediction. It's the first step after tokenization.
• LSTMs (Long Short-Term Memory networks) were created to fight vanishing gradients by adding gates that control what to keep, what to forget, and what to output. They maintain a cell state (long-term memory) and a hidden state (short-term memory). The input gate decides how much new information to add, the forget gate decides what to drop from memory, and the output gate decides what to reveal. This design helps LSTMs learn longer-range patterns more reliably.
• GRUs (Gated Recurrent Units) are a simpler alternative to LSTMs with fewer gates and parameters. They combine some of the LSTM functions into fewer steps, often training a bit faster. GRUs still maintain the idea of controlling information flow to reduce vanishing gradients. They are popular when speed and simplicity matter.
• Transformers use attention instead of recurrence to model sequences. Attention lets the model look at all tokens and weigh how relevant each is to the current position. This self-attention mechanism helps capture long-range dependencies directly and often more effectively than RNNs. Transformers have become the basis of many state-of-the-art language models.
• A Transformer is made of stacked layers that each include self-attention and feedforward sublayers. In an encoder-decoder setup, the encoder processes inputs and the decoder generates outputs, attending to both previous outputs and encoded inputs. For language modeling, the model focuses on previous tokens to predict the next one. The attention weights are computed, normalized with softmax, and used to blend information from all positions.
• Choosing model hyperparameters (learning rate, batch size, number of layers, hidden size, and dropout rate) has a large impact on performance. Learning rate controls how big each training step is; too big can overshoot, too small can crawl. Batch size trades noisy updates for speed and stability; too small is noisy, too large can be slow to update. Depth, width, and dropout balance capacity and overfitting.
• You pick hyperparameters using a validation set or automated search. A validation set is held-out data used to score different choices and keep the one that works best before final testing. Search strategies include grid search (systematic), random search (sampled), and Bayesian optimization (guided by past results). This process often requires careful experiments and patience.
• In practice, each architecture has strengths: feedforward models are simple but limited; RNNs handle sequence length but can be hard to train; LSTMs/GRUs improve memory handling; Transformers deliver top results with attention. The best choice depends on your data size, task needs, and compute budget. Hyperparameters must be tuned for each setup to get strong performance. These foundations set you up for evaluation and improvement techniques later.
02 Key Concepts
01
What is language modeling? (Definition) Language modeling is the task of predicting the next token given all previous tokens. (Analogy) It's like trying to guess the next word in a sentence by remembering everything you've read so far. (Technical) The model estimates a probability distribution p(next token | previous tokens), then samples from it to generate text. (Why it matters) Without a good next-token probability, generated text becomes random and incoherent. (Example) Given "The cat sat on the", a good model assigns high probability to "mat" and low probability to unrelated words like "galaxy."
02
Embeddings (Definition) Embeddings turn tokens into vectors of numbers the model can process. (Analogy) Think of embeddings like GPS coordinates that place words in a map of meaning. (Technical) They are learned lookup vectors from an embedding matrix, so similar words often end up near each other in vector space. (Why it matters) Raw tokens are symbols; embeddings let neural networks measure similarity and combine meanings. (Example) "cat" and "kitten" embeddings end up closer than "cat" and "car."
03
Softmax output (Definition) Softmax converts raw scores into a valid probability distribution over the vocabulary. (Analogy) It's like turning a set of messy votes into a clean poll result where all percentages add up to 100%. (Technical) It exponentiates each score and normalizes by the sum, producing nonnegative numbers that sum to 1. (Why it matters) The model must produce probabilities to sample or evaluate next tokens. (Example) If the scores favor "mat," softmax converts that preference into a high probability for "mat."
04
Feedforward language model (Definition) A feedforward LM predicts the next token from a fixed window of previous tokens using a standard neural network. (Analogy) It's like reading through a small peephole that only shows the last few words. (Technical) It concatenates or combines embeddings from the last n tokens, passes them through hidden layers, and applies softmax. (Why it matters) It's simple and works for short contexts but can't go beyond its fixed window. (Example) With a context window of 3, the model sees only "sat on the" to predict the next word after "The cat sat on the."
05
Context window limitation (Definition) The context window is how many previous tokens the model can see. (Analogy) It's like trying to solve a puzzle while only allowed to view a few pieces at a time. (Technical) In feedforward LMs, increasing the window increases parameters roughly linearly and still caps the memory. (Why it matters) Important clues can be outside the window, hurting long-range understanding. (Example) A reference made five sentences earlier gets lost if the window only covers the last 10 tokens.
06
Text generation by sampling (Definition) Sampling is generating text by picking the next token from the model's probability distribution and appending it, step by step. (Analogy) It's like building a sentence one Lego brick at a time, choosing the brick that fits best at each step. (Technical) After each sample, the new token joins the history, and the model predicts again for the next position. (Why it matters) This is how models turn probabilities into full outputs. (Example) Starting with "Once upon a," the model might produce "time," then "there," then "was," forming a story.
07
Recurrent Neural Network (RNN) LM (Definition) An RNN uses a hidden state that carries information forward through a sequence. (Analogy) Think of it as a notepad you update after reading each word, keeping track of the story so far. (Technical) At time t, it computes h_t = f(x_t, h_{t-1}), then uses h_t to predict the next token with softmax. (Why it matters) It can, in theory, use information from arbitrarily far back. (Example) The subject at the start of a sentence can influence the verb at the end through the hidden state.
08
Vanishing gradient problem (Definition) Vanishing gradients are extremely small learning signals that fade as they backpropagate through many time steps. (Analogy) It's like whispering a message down a long line of people; by the time it reaches the start, the message is too faint. (Technical) Repeated multiplications by small derivatives cause gradients to shrink towards zero in long sequences. (Why it matters) The model fails to learn long-range dependencies well. (Example) An RNN may not connect a pronoun with the correct noun it refers to if they are far apart.
09
LSTM essentials (Definition) LSTMs add gates and a cell state to control what to remember, forget, and output. (Analogy) It's like a smart notebook with tabs for long-term storage and sticky notes for short-term thoughts. (Technical) The input gate writes new info, the forget gate erases stale info, and the output gate reveals what's needed for prediction. (Why it matters) This structure keeps gradients healthier and preserves useful information longer. (Example) An LSTM can remember the topic of a paragraph while still focusing on the current sentence.
10
GRU essentials (Definition) GRUs are a simpler gated recurrent unit that also controls information flow. (Analogy) Think of it as a lighter backpack with fewer compartments than an LSTM, but still organized. (Technical) GRUs use update and reset gates to blend past and new information with fewer parameters. (Why it matters) They often train faster while still handling longer dependencies better than plain RNNs. (Example) A GRU can track a storyline across multiple sentences with less computation than an LSTM.
11
Attention mechanism (Definition) Attention lets a model weigh different parts of the input by relevance. (Analogy) It's like highlighting the most important words in a passage while skimming the rest. (Technical) The model computes attention scores between positions, turns them into weights with softmax, and forms a weighted sum of values. (Why it matters) It directly focuses on useful information, enabling strong long-range connections. (Example) The word "it" can attend back to "the book" to resolve what "it" refers to.
12
Self-attention (Definition) Self-attention applies attention within the same sequence, letting each position look at all other positions. (Analogy) It's like a group discussion where each sentence in a paragraph can reference any other sentence. (Technical) Each token becomes a query, key, and value; attention weights between queries and keys mix the values into new representations. (Why it matters) It captures dependencies without recurrence, often more efficiently. (Example) In "The animal didn't cross because it was too tired," "it" can strongly attend to "animal."
13
Transformer architecture (Definition) A Transformer is a stack of layers using self-attention and feedforward sublayers. (Analogy) It's like multiple rounds of smart highlighting and rewriting, each round refining the understanding. (Technical) Encoders and decoders can be used together; for language modeling, the model focuses on previous tokens to predict the next. (Why it matters) Transformers achieve state-of-the-art results and scale well. (Example) Models like GPT-3 and BERT are built on Transformer ideas.
14
Stacking layers (Definition) Stacking multiple layers deepens the model's processing steps. (Analogy) It's like passing your notes through several rounds of editing to improve clarity. (Technical) Each layer refines representations from the previous layer, enabling complex patterns to emerge. (Why it matters) Depth increases capacity but can risk overfitting if too large. (Example) A 12-layer Transformer can solve harder patterns than a 2-layer one.
15
Learning rate (Definition) Learning rate controls how big each parameter update is during training. (Analogy) It's like the size of steps you take while hiking: too big and you trip, too small and you crawl. (Technical) Optimizers multiply gradients by the learning rate to move weights; scheduling and adaptive methods adjust it over time. (Why it matters) A bad learning rate can prevent convergence or make training very slow. (Example) Lowering the learning rate after initial progress can stabilize training.
16
Batch size (Definition) Batch size is how many examples you use to compute a gradient update at once. (Analogy) It's like averaging many opinions before making a decision versus reacting to one at a time. (Technical) Larger batches give smoother gradients but fewer updates per epoch; smaller batches give noisier but more frequent updates. (Why it matters) It affects speed, stability, and generalization. (Example) Using a batch size of 64 may balance stability and compute limits on a modest GPU.
17
Number of layers (Definition) This is how deep your network is. (Analogy) It's like adding floors to a building: more floors mean more rooms to hold ideas. (Technical) More layers increase representational power but also training difficulty and overfitting risk. (Why it matters) Too few layers underfit; too many waste compute and may memorize noise. (Example) Starting with 4-6 layers and scaling up as needed is a common practice.
18
Hidden size (Definition) Hidden size is how wide each hidden layer is: the number of neurons per layer. (Analogy) Wider hallways fit more people; wider layers can carry more features. (Technical) Larger widths store richer representations but increase parameters and compute. (Why it matters) Too small can't capture patterns; too large can overfit and slow training. (Example) Choosing 512 vs. 1024 hidden units changes both capacity and GPU memory use.
19
Dropout (Definition) Dropout randomly turns off some neurons during training to prevent overfitting. (Analogy) It's like practicing a song with some instruments muted so everyone learns to carry the tune. (Technical) Each unit is dropped with a fixed probability, forcing the network to build redundant, robust features. (Why it matters) It improves generalization to unseen data. (Example) A dropout rate of 0.1-0.3 is often used to regularize medium-sized models.
20
Validation set (Definition) A validation set is held-out data used to choose hyperparameters and avoid overfitting. (Analogy) It's like testing your speech with a small audience before the big event. (Technical) You train on the training set, tune using validation performance, and only evaluate final results on a separate test set. (Why it matters) It prevents accidentally picking hyperparameters that only work on training data. (Example) Try learning rates 3e-4 and 1e-4 and keep the one with better validation loss.
21
Hyperparameter search (Definition) These are strategies to explore and pick good hyperparameters. (Analogy) It's like trying different recipes to find the tastiest cake. (Technical) Grid search tests a fixed set of values, random search samples values, and Bayesian optimization uses previous results to guide the next trials. (Why it matters) Good hyperparameters can dramatically improve performance. (Example) Random search over learning rate and dropout often beats a tiny hand-picked grid.
22
Choosing architectures (Definition) Selecting the model family that fits your data and goals. (Analogy) It's like choosing the right vehicle: a bike, a car, or a train depending on the trip. (Technical) Feedforward is simple but limited; RNN/LSTM/GRU handle sequences; Transformers scale and excel on complex tasks. (Why it matters) The wrong choice wastes time and compute and may underperform. (Example) For large text corpora and long contexts, a Transformer is usually best.
23
Attention weights as probabilities (Definition) Attention weights show how much each token focuses on others. (Analogy) It's like distributing your attention points across words in a sentence. (Technical) Scores are normalized by softmax to sum to 1, forming a weighted sum of value vectors. (Why it matters) This explicit weighting helps interpret what the model considers relevant. (Example) In a question, the attention may peak on the noun phrase that contains the answer.
24
Trade-offs in depth and width (Definition) Balancing how many layers (depth) and how many units per layer (width). (Analogy) It's like choosing between a tall building or a wider one, both offering more space but with different costs. (Technical) Depth adds sequential processing steps; width enriches each step's capacity. (Why it matters) The best mix depends on your data and compute. (Example) If training is unstable with very deep models, slightly reduce depth and increase hidden size instead.
03 Technical Details
Overall Architecture/Structure
The Language Modeling Objective
Goal: Estimate p(x_t | x_1, ..., x_{t-1}) for each position t in a sequence of tokens. A token is a small chunk of text like a word or subword.
Data flow (inference): Start with a prompt (context), compute the next-token probability distribution, sample a token, append it, and repeat.
Data flow (training): For each position t, the model predicts x_t from x_{<t}; compare the predicted distribution to the true token and update weights to improve future predictions.
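The inference loop can be sketched end to end. Here `next_token_probs` is a hypothetical stand-in for a trained model: it reads probabilities from a hand-written table instead of computing them with embeddings, hidden layers, and a softmax.

```python
import random

# Toy stand-in for a trained model: a hand-written table of
# next-token probabilities, keyed by the most recent token only.
TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def next_token_probs(context):
    """p(next token | previous tokens); this toy only looks at the last token."""
    return TABLE.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_new_tokens=5, seed=0):
    """Sample one token at a time, append it to the history, and repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights)[0]  # sample from the distribution
        if token == "<eos>":
            break
        tokens.append(token)  # the new token joins the context
    return tokens
```

A real model replaces `next_token_probs` with a forward pass; the sampling loop itself is unchanged.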
Core Building Blocks
Embedding Layer: A large lookup table that maps token IDs to dense vectors. If the vocabulary has V tokens and the embedding size is d, the table is V by d. When you input a token ID, you retrieve a d-dimensional vector representing that token's meaning.
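As a rough sketch of that lookup, with toy sizes and random vectors standing in for learned ones:

```python
import numpy as np

V, d = 5, 3  # tiny vocabulary of 5 tokens, 3-dimensional embeddings
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(V, d))  # V-by-d matrix, learned during training

token_id = 2
vector = embedding_table[token_id]  # "lookup" is just selecting row `token_id`
```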
Hidden Layers: These layers transform embeddings into richer features. They combine signals so the model can form patterns like syntax, topic, or semantics.
Softmax Output: Converts final scores (logits) for all V tokens into probabilities. The highest-probability token is most likely to be the next token.
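A minimal softmax implementation, assuming the standard exponentiate-and-normalize form (the max subtraction is a common numerical-stability trick, not something the lecture specifies):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities: nonnegative, summing to 1."""
    z = logits - np.max(logits)   # shift for numerical stability; result is unchanged
    exp = np.exp(z)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, -1.0]))  # highest logit gets highest probability
```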
Three Architectural Families
Feedforward LM: Uses a fixed window of previous tokens.
Recurrent LMs (RNN, LSTM, GRU): Use a hidden state passed through time.
Transformer: Uses self-attention to mix information across positions.
Feedforward Language Model: Structure and Data Flow
Inputs: The last n tokens (the context window). Each token goes through the embedding layer to become a vector.
Combine: Concatenate or otherwise combine n embeddings into one representation. Pass through one or more fully connected layers with nonlinearities.
Output: Final hidden layer outputs logits for each vocabulary token; softmax turns them into probabilities for the next token.
Parameter growth: If you increase n (the window size), the input dimension grows roughly linearly, increasing parameter count in the first hidden layer. This leads to heavier compute and memory usage, and still only allows n tokens of context.
Training: For each position t, use the previous n tokens as input and the actual token x_t as target. Update weights to improve the predicted distribution.
Inference: Slide the window forward one token at a time. For long text, the modelâs view is limited to the fixed window; it does not remember beyond it.
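The whole feedforward pipeline above can be sketched with toy sizes and random, untrained weights; `feedforward_lm` and all the dimensions here are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, H = 50, 8, 3, 16            # vocab size, embedding dim, window size, hidden size

E = rng.normal(size=(V, d))          # embedding table
W1 = rng.normal(size=(n * d, H))     # first hidden layer: input dim grows with n
b1 = np.zeros(H)
W2 = rng.normal(size=(H, V))         # output layer: one logit per vocabulary token
b2 = np.zeros(V)

def feedforward_lm(window):
    """Predict next-token probabilities from the last n token IDs."""
    x = np.concatenate([E[t] for t in window])   # concatenate n embeddings -> (n*d,)
    h = np.tanh(x @ W1 + b1)                     # hidden layer with nonlinearity
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                       # softmax over the vocabulary

probs = feedforward_lm([4, 17, 2])               # any n previous token IDs
```

Note that `W1` holds `n * d * H` parameters, so widening the window grows the first layer linearly, matching the parameter-growth point above.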
RNN Language Model: Structure and Data Flow
Hidden State as Memory: Maintain a vector h_t that summarizes all tokens up to time t. At each step: h_t = f(x_t, h_{t-1}). Here, x_t is the embedding of the current token.
Prediction: Use h_t to compute logits and apply softmax to get p(x_{t+1} | x_{<=t}). The model processes tokens sequentially.
Training: Unroll the network through time over a sequence. Compute the loss at each step and backpropagate gradients through the chain of time steps (backpropagation through time). Long sequences can cause gradients to vanish as they pass through many steps.
Vanishing Gradients: If derivatives at each step are small, multiplying many such terms causes the gradient to approach zero, weakening learning signals for early positions.
Practical Considerations: RNNs can, in theory, leverage long context. In practice, training stability and optimization challenges often limit effective memory, hence the development of gated variants.
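A minimal forward pass, assuming the common tanh update h_t = tanh(x_t Wx + h_{t-1} Wh + b); the weights are random stand-ins for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, H = 4, 6                          # embedding dim, hidden size
Wx = rng.normal(size=(d, H)) * 0.5   # input-to-hidden weights
Wh = rng.normal(size=(H, H)) * 0.5   # hidden-to-hidden weights
b = np.zeros(H)

def rnn_forward(embeddings):
    """Process a sequence one step at a time: h_t = f(x_t, h_{t-1})."""
    h = np.zeros(H)                  # initial hidden state (often zeros)
    states = []
    for x in embeddings:             # strictly sequential: no parallelism over time
        h = np.tanh(x @ Wx + h @ Wh + b)
        states.append(h)
    return states

sequence = rng.normal(size=(5, d))   # 5 token embeddings
states = rnn_forward(sequence)       # one hidden state per position
```

Each `states[t]` would feed a softmax layer to predict token t+1.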
LSTM: Gated Recurrence with Cell and Hidden States
States: Cell state c_t (long-term memory) and hidden state h_t (short-term output-like state). The design separates storage from exposure.
Gates (conceptually): Input gate decides how much new information to write to c_t. Forget gate decides how much of old c_{t-1} to erase. Output gate decides how much of c_t to reveal as h_t.
Update Flow (high-level):
Compute gate activations from x_t and h_{t-1}.
Propose new content (a candidate update) from x_t and h_{t-1}.
Form new cell state c_t by mixing old c_{t-1} (through forget gate) and candidate content (through input gate).
Form new hidden state h_t by filtering c_t through the output gate.
Why it helps: The cell state provides a more direct path for gradients to flow across many steps, reducing vanishing gradients and enabling learning of long-range dependencies.
Prediction: h_t feeds into a softmax layer to predict the next token.
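The update flow above, sketched with random weights; bias terms are omitted for brevity, and the gate equations follow the standard LSTM formulation rather than anything lecture-specific:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, H = 4, 5
# One weight matrix per gate plus the candidate, each acting on [x_t, h_{t-1}].
Wi, Wf, Wo, Wc = (rng.normal(size=(d + H, H)) * 0.5 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(z @ Wi)              # input gate: how much new info to write
    f = sigmoid(z @ Wf)              # forget gate: how much old memory to keep
    o = sigmoid(z @ Wo)              # output gate: how much of the cell to reveal
    c_tilde = np.tanh(z @ Wc)        # candidate content
    c = f * c_prev + i * c_tilde     # new cell state (long-term memory)
    h = o * np.tanh(c)               # new hidden state (short-term, fed to softmax)
    return h, c

h, c = lstm_step(rng.normal(size=d), np.zeros(H), np.zeros(H))
```

The additive `f * c_prev` term is the "more direct path" for gradients mentioned above.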
GRU: Simpler Gated Recurrent Unit
States: Typically a single hidden state h_t, no separate cell state.
Gates: Update and reset gates control how much of the previous state to keep and how much to mix with new input.
Update Flow (high-level):
Compute gates from x_t and h_{t-1}.
Build a candidate new state using x_t and a gated version of h_{t-1}.
Mix candidate and old state according to the update gate to get h_t.
Benefits: Fewer parameters than LSTM; often faster to train while remaining effective against vanishing gradients.
Prediction: h_t goes through softmax to predict the next token.
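A matching sketch for the GRU step, again with random weights and biases omitted; this follows the common update/reset-gate formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, H = 4, 5
Wz, Wr, Wn = (rng.normal(size=(d + H, H)) * 0.5 for _ in range(3))

def gru_step(x, h_prev):
    z_in = np.concatenate([x, h_prev])
    z = sigmoid(z_in @ Wz)                             # update gate: keep old vs take new
    r = sigmoid(z_in @ Wr)                             # reset gate: how much past to use
    n = np.tanh(np.concatenate([x, r * h_prev]) @ Wn)  # candidate new state
    return (1 - z) * n + z * h_prev                    # mix candidate and old state

h = gru_step(rng.normal(size=d), np.zeros(H))
```

Three weight matrices versus the LSTM's four (and no separate cell state) is where the parameter savings come from.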
Transformer: Attention-Centric Architecture
Self-Attention: Each token position computes query, key, and value vectors from its embedding. Attention scores between a query and all keys determine weights; the weighted sum of values becomes the new representation for that position.
Stacked Layers: Alternating self-attention and feedforward sublayers refine representations. Layer normalization and residual (skip) connections help training stability.
Encoder-Decoder (conceptual): Encoders read inputs; decoders generate outputs one token at a time while attending to encoder outputs (for tasks like translation). For pure language modeling over a single stream of text, the model focuses on previous tokens to predict the next.
Long-Range Dependencies: Self-attention provides a direct path between any pair of positions. This often captures distant relationships more easily than passing information through many recurrent steps.
Output: A final linear layer and softmax convert the last-layer representations into next-token probabilities.
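A single self-attention layer (one head, no causal mask or positional encoding, random weights) can be sketched as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
T, d_model, d_k = 4, 8, 8            # sequence length, model dim, key dim
X = rng.normal(size=(T, d_model))    # one row per token position

Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values for every position

scores = Q @ K.T / np.sqrt(d_k)      # relevance of every position to every other
weights = softmax(scores)            # each row is a probability distribution
output = weights @ V                 # weighted blend of values -> new representations
```

For language modeling, a causal mask would additionally set scores for future positions to a large negative value before the softmax, so each token attends only to previous tokens.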
Comparing Architectures
Feedforward LM: Simple and fast but limited to fixed windows. Parameter count and compute rise with window size.
RNN: Variable-length context via hidden state; sequential processing limits parallelism; training can be harder due to vanishing gradients.
Transformer: Parallelizable over sequence positions (self-attention can process tokens together during training) and strong at long-range relationships; foundation of modern large language models.
Hyperparameters: Roles and Effects
Learning Rate: Scales gradient steps. Too high can cause divergence or oscillation; too low makes learning slow. Schedules (like decaying the rate over time) and adaptive methods (that adjust per-parameter or over time) can help find a good path.
Batch Size: Larger batches yield smoother gradients but fewer updates per epoch; smaller batches provide more updates and noise that sometimes helps generalization. Often constrained by memory.
Number of Layers: Controls depth and capacity. More layers can learn more complex patterns but increase overfitting risk and training difficulty.
Hidden Size: Width of hidden representations. Larger width increases representational capacity and cost.
Dropout Rate: Probability of dropping units during training. Helps regularize and improve generalization.
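To make the learning-rate effect concrete, here is a minimal sketch that minimizes the toy function f(x) = x² by gradient descent. The specific step sizes are invented for illustration; the point is the qualitative behavior:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    # Minimize f(x) = x**2; its gradient is 2*x.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

# A moderate rate steadily approaches the minimum at x = 0 ...
good = gradient_descent(lr=0.1)
# ... while too large a rate overshoots further on every step and diverges.
bad = gradient_descent(lr=1.1)
```

The same pattern appears in neural network training: a diverging or oscillating loss often signals that the learning rate is too high.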
Selecting Hyperparameters
Validation Set: Split data into training and validation sets. Train on training, pick hyperparameters that score best on validation, and keep test data separate for final evaluation.
Search Methods: Grid search (systematic combinations), random search (sampled combinations), and Bayesian optimization (guided by past evaluations). Random search often explores more diverse settings for the same budget than tiny grids.
Practical Strategy: Start with sensible defaults, run small pilot experiments to check learning, then refine choices. Monitor training and validation metrics to avoid overfitting.
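Random search is easy to sketch. In this toy example, a made-up scoring function stands in for a real train-and-validate run (in practice you would train a model at each sampled setting and record its validation score):

```python
import math
import random

def validation_score(lr, dropout):
    # Stand-in for a real train-and-evaluate run: an invented surface
    # that peaks near lr = 0.01 and dropout = 0.2.
    return -(math.log10(lr) + 2) ** 2 - (dropout - 0.2) ** 2

random.seed(0)
best = None
for _ in range(20):
    # Sample the learning rate on a log scale and dropout uniformly.
    lr = 10 ** random.uniform(-4, 0)
    dropout = random.uniform(0.0, 0.5)
    score = validation_score(lr, dropout)
    if best is None or score > best[0]:
        best = (score, lr, dropout)
best_score, best_lr, best_dropout = best
```

Sampling the learning rate on a log scale is a common choice, since plausible values span several orders of magnitude.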
Step-by-Step Implementation Guide (Conceptual)
A) Feedforward LM
Tokenization and Vocabulary: Map text to token IDs; decide on a window size n.
Embeddings: Build an embedding matrix; convert each of the last n tokens to vectors.
Combine and Transform: Concatenate embeddings (or average) and pass through one or more fully connected layers with nonlinearities.
Output: Use a final linear layer to produce logits for all vocabulary tokens; apply softmax to get probabilities.
Training Loop: For each training batch, compute loss comparing the softmax distribution to the true next token; update parameters.
Inference: Slide the window; sample the next token; append and repeat.
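The steps above can be sketched as a single forward pass in plain Python. The tiny vocabulary, dimensions, and random weights are invented for illustration; a real model would learn the weights during training:

```python
import math
import random

random.seed(0)
vocab = ["<s>", "the", "cat", "sat"]        # toy vocabulary
V, D, N, H = len(vocab), 4, 2, 8            # vocab, embed dim, window, hidden

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

E = rand_matrix(V, D)        # embedding matrix (one row per token)
W1 = rand_matrix(N * D, H)   # concatenated window -> hidden layer
W2 = rand_matrix(H, V)       # hidden layer -> vocabulary logits

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(window_ids):
    # 1) Look up and concatenate the embeddings of the last N tokens.
    x = [value for t in window_ids for value in E[t]]
    # 2) One fully connected hidden layer with a tanh nonlinearity.
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(N * D)))
         for j in range(H)]
    # 3) Linear projection to logits, then softmax over the vocabulary.
    logits = [sum(h[j] * W2[j][k] for j in range(H)) for k in range(V)]
    return softmax(logits)

probs = forward([vocab.index("the"), vocab.index("cat")])
```

Note how the first-layer weight matrix has N * D rows: this is why parameter count grows directly with the window size.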
B) RNN/LSTM/GRU LM
Tokenization and Vocabulary: As above; no fixed window is required.
Embeddings: Convert each token to its vector representation.
Recurrent Step: For each time step, update hidden state (and cell state for LSTM) using current embedding and previous state.
Output: From h_t, compute logits and softmax for the next token.
Training Loop: Unroll across time for each sequence segment; compute losses; backpropagate through time; update parameters.
Inference: Start with an initial hidden state (often zero); after each sampled token, feed it back and continue.
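The recurrent update h_t = tanh(W_xh x_t + W_hh h_{t-1}) can be sketched as follows, with made-up dimensions and untrained random weights for illustration:

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

def rnn_step(x, h_prev, Wxh, Whh):
    # h_t = tanh(Wxh . x_t + Whh . h_{t-1}): the new state mixes the
    # current input with the previous memory.
    return [math.tanh(sum(x[i] * Wxh[i][j] for i in range(len(x)))
                      + sum(h_prev[i] * Whh[i][j] for i in range(len(h_prev))))
            for j in range(len(h_prev))]

Wxh = rand_matrix(2, 3)   # input-to-hidden weights (2-dim input, 3 hidden units)
Whh = rand_matrix(3, 3)   # hidden-to-hidden weights
h = [0.0, 0.0, 0.0]       # start from a zero hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:   # a toy 2-step input sequence
    h = rnn_step(x, h, Wxh, Whh)
```

The same fixed-size state vector is reused at every step, which is how the model handles variable-length context without a fixed window.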
C) Transformer LM (conceptual for language modeling)
Tokenization and Vocabulary: Map text to token IDs; add positional information so the model knows order.
Embeddings + Position Information: Sum token embeddings with position encodings.
Self-Attention Layers: For each layer, compute queries, keys, values; form attention weights; mix values to produce new representations.
Feedforward Sublayer: Apply a position-wise feedforward network to each position; use residual connections and normalization.
Output: A final projection to vocabulary size followed by softmax returns the next-token distribution.
Training Loop: Compute loss over all positions where the target is the next token; update parameters.
Inference: Generate tokens autoregressively; at each step, use the previously generated tokens to predict the next.
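One detail worth seeing concretely is the causal mask that keeps a Transformer language model autoregressive. In this toy sketch, scores for future positions are set to negative infinity before the softmax, so they receive exactly zero attention weight:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores, position):
    # Scores for positions after `position` are masked with -inf, so the
    # softmax assigns them exactly zero weight: a token may attend only
    # to itself and to earlier tokens.
    masked = [s if j <= position else float("-inf")
              for j, s in enumerate(scores)]
    return softmax(masked)

# Position 1 may see positions 0 and 1, but not the future position 2.
w = causal_attention_weights([0.5, 1.0, 2.0], position=1)
```

Without this mask, the model could "peek" at the token it is supposed to predict, and training would not match autoregressive inference.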
Tips and Warnings
Context vs. Capacity: Feedforward models can't grow context without rapidly growing parameters. If you need long context, prefer RNN/LSTM/GRU or a Transformer.
Training Stability: RNNs may suffer from vanishing gradients. Gated variants help; monitor losses and consider careful initialization and learning rate choices.
Parallelism: Transformers allow more parallel computation during training than RNNs, often speeding up large-scale training.
Overfitting: Large depth/width can memorize training data. Use dropout and monitor validation performance.
Hyperparameter Sensitivity: Learning rate is often the most sensitive; tune it first. Batch size influences stability and speed, but pick one that fits memory.
Experiment Discipline: Always evaluate on validation data when adjusting hyperparameters; avoid peeking at the test set.
Tools/Libraries (General Guidance)
While no specific tools are required here, practical implementations typically use deep learning frameworks like PyTorch or TensorFlow. They provide embedding layers, RNN/LSTM/GRU modules, and attention layers out of the box.
Basic usage involves defining layers, writing a forward pass to compute logits, and a training loop that computes loss and updates parameters with an optimizer.
In Summary
The feedforward LM is simple and limited by a fixed context window. RNNs replace that window with a hidden state but face vanishing gradients; LSTMs/GRUs add gates to better manage memory. Transformers use self-attention to directly connect distant tokens and now dominate language modeling. Hyperparameters (especially learning rate, batch size, depth, hidden size, and dropout) strongly shape training outcomes. Use validation sets and systematic search to choose them well.
Conclusion
This material brings together the essential model choices for language modeling and the hyperparameters that make them work well. A feedforward language model reads a fixed set of previous tokens, runs them through hidden layers, and predicts the next token via softmax. Its main weakness is the fixed window, which grows parameters and still cannot remember beyond its size. Recurrent models address that with a hidden state that carries memory through time, but plain RNNs often struggle with vanishing gradients and have trouble learning long-range dependencies reliably.
LSTMs and GRUs introduce gates that decide what to remember, what to forget, and what to show the next step, greatly improving the ability to learn long-term patterns. Transformers replace recurrence with self-attention, allowing each position to directly focus on any other part of the sequence. This makes it easier to capture distant relationships and scales well, which is why Transformers power many of today's leading language models.
Hyperparameters such as learning rate, batch size, number of layers, hidden size, and dropout determine how smoothly and effectively training progresses. There is no single perfect setting; tuning with a validation set and searches like grid, random, or Bayesian optimization often yields the best configuration. A good workflow is to start with reasonable defaults, run small experiments, watch training and validation behavior, and refine the choices step by step.
The core message is to match the architecture to the task and data, then tune hyperparameters with discipline. Feedforward models are simple; RNNs capture sequences; LSTMs/GRUs manage long-term memory; Transformers excel at long-range attention and scale. With careful selection and tuning, you can build language models that predict the next token accurately and generate coherent text.
• Tune learning rate first; it's the most sensitive hyperparameter. If training diverges, lower it; if learning is too slow, raise it. Consider using a learning rate schedule to improve stability over time. Keep careful notes on what settings work.
• Pick a batch size that fits your hardware while maintaining stable updates. Larger batches give smoother gradients but fewer updates per epoch; smaller batches add noise that can sometimes help generalization. Try a few sizes and watch validation metrics for guidance. Don't sacrifice too much stability just to use a larger batch.
• Balance the number of layers and hidden size to match your dataset and compute. More layers and width increase capacity but risk overfitting. If training becomes unstable or too slow, scale back and try modest increases. Always check validation performance before scaling further.
• Use dropout to fight overfitting, especially as models get larger. Start with a small rate (like 0.1–0.3) and adjust based on validation performance. Remember that too much dropout can slow learning. Find the sweet spot where the model generalizes well.
• Always use a validation set to choose hyperparameters and avoid overfitting to the training set. Do not tune on the test set. Track validation loss or accuracy to compare runs fairly. Save the best model according to validation metrics, not training metrics.
• Try simple hyperparameter searches before complex ones. Random search over a few key parameters can outperform small grids and is easy to run. If you have more time or budget, consider guided methods like Bayesian optimization. Keep experiments organized and reproducible.
• Match the architecture to the problem's context requirements. Short sequences with limited context may work with feedforward models. Longer or more complex dependencies call for LSTM/GRU or Transformers. Consider your compute budget and time constraints as part of the decision.
• Observe training curves to diagnose issues early. Spiking loss may signal a too-high learning rate. Flat curves may mean learning rate is too low or the model is underpowered. Use these signals to adjust hyperparameters methodically.
• Design experiments to change one major factor at a time. If you alter many settings at once, it's hard to know what helped. Start with learning rate, then adjust batch size, depth, width, and dropout. Keep a log and name runs clearly.
• Ensure inference mirrors training assumptions. If you trained to predict the next token given previous tokens, generate autoregressively in the same way. Feed each sampled token back into the model for the next prediction. Differences between train and test procedures can hurt results.
Hidden layer
A layer that transforms embeddings into richer features using learned weights and nonlinear functions. It compresses and mixes signals to find useful patterns. Multiple hidden layers can capture complex structures. They sit between input embeddings and the output.
Softmax
A function that turns scores into probabilities that add to 1. It highlights big scores and shrinks small ones. This lets the model produce a true probability distribution over the vocabulary. It's used at the last step before sampling.
Probability distribution
A list of probabilities for all possible outcomes that sum to 1. In language modeling, it assigns a probability to each possible next token. Higher probabilities mean the model believes that token fits better. Sampling uses this distribution to pick the next token.
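Both definitions can be seen in a few lines of Python: softmax turns raw scores into a distribution that sums to 1, and sampling picks the next token in proportion to it (the token names here are invented for illustration):

```python
import math
import random

def softmax(scores):
    # Exponentiate (after subtracting the max for stability) and normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # the highest score gets the largest share

random.seed(0)
# random.choices picks an element in proportion to the given weights.
next_token = random.choices(["cat", "dog", "sat"], weights=probs)[0]
```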