•This session explains how to use a trained language model to produce outputs, a phase called inference. It covers three task types—conditional generation, open-ended generation, and classification—each with different input/output shapes that affect decoding choices. The lecture then dives into decoding methods, which are strategies to choose the next token step by step. Finally, it discusses how to evaluate generated text using human judgments and automatic metrics, along with their trade-offs.
•Conditional generation means producing an output sequence based on an input sequence, like translating English to French or summarizing an article. Open-ended generation has minimal or no input and focuses on creative, diverse outputs, such as stories or dialogue. Classification uses a language model to output a single label, such as sentiment or topic. Each of these requires different inference settings.
•Decoding is the search for the most likely output sequence given an input and a model. The exact best solution is usually impossible to compute because the number of possible sequences grows exponentially with length. Therefore, we rely on approximate search algorithms. The lecture explains when to use deterministic versus stochastic (randomized) decoding.
•Greedy decoding picks the highest-probability next token at every step. It is simple and fast but can get stuck in local optima, making early mistakes that the model cannot fix later. The lecture uses a simple sentence completion example to show how a plausible but suboptimal choice can derail the rest of the generation. Greedy decoding is best when speed matters more than global optimality.
•Beam search keeps several top candidate sequences at each step, not just one. It explores more options to avoid local traps but costs more computation, scaling with the beam size. The lecture presents length normalization to avoid favoring short outputs. In practice, beam sizes around 5–10 are common starting points.
Why This Lecture Matters
Inference is where language models deliver real value—turning training into practical outputs for translation, summarization, chat, and classification. The decoding method you choose can transform the same model into a precise translator, a creative storyteller, or a fast classifier. Engineers, data scientists, product managers, and researchers all benefit from understanding these levers because they determine quality, latency, and user experience. In production, the right pairing of task and decoder can boost accuracy, reduce hallucinations, and control costs; the wrong pairing can do the opposite.
Evaluation knowledge is equally critical. Teams need to justify model choices with trustworthy evidence. Automatic metrics are fast but partial, and human evaluation, while costly, provides the necessary reality check. By combining both, you can iterate quickly yet avoid shipping systems that look good on paper but fail users.
Career-wise, the ability to configure decoding and design smart evaluations is a core skill for modern AI roles. It shows you can move beyond training to delivering products that work under constraints. As generative AI becomes standard across industries—from customer support to content creation—the importance of inference choices and meaningful evaluation only grows. Mastering these tools positions you to build systems that are not only powerful in theory but reliable, efficient, and aligned with user needs in practice.
Lecture Summary
01 Overview
This lecture focuses on inference—the stage where a trained language model is used to produce outputs for real tasks. It starts by defining three broad task families: conditional generation (produce an output sequence based on an input sequence), open-ended generation (produce text with minimal or no input), and classification (output a single discrete label). Understanding the shape of inputs and outputs helps you choose an appropriate decoding method—how the model selects tokens step by step to form sequences. The lecture explains that exact search for the single most probable sequence is infeasible because the number of possibilities grows exponentially with sequence length and vocabulary size, so we use approximate algorithms.
The core of the session introduces four decoding strategies. Greedy decoding chooses the highest-probability next token at each step, making it fast but prone to getting stuck in local optima. Beam search keeps several candidate sequences, improving quality at the cost of more computation, and it often benefits from length normalization to avoid favoring short outputs. Sampling draws from the probability distribution rather than always taking the top choice, producing more diverse outputs; temperature controls how random or conservative sampling becomes. Top-k sampling restricts choices to the k most likely tokens, reducing incoherence from rare words, while nucleus (top-p) sampling adapts the candidate set to the distribution’s shape, usually improving coherence and diversity balance.
The lecture then turns to evaluation of generation quality, a notoriously difficult problem because language has many valid outputs and complex qualities that are hard to measure. Human evaluation is presented as the gold standard, using criteria such as fluency, coherence, relevance, and interestingness. However, it is resource-intensive, so automatic metrics are used in practice. Perplexity measures predictive fit to data but not generation quality; BLEU and ROUGE measure n-gram overlap with reference texts and are common in translation and summarization respectively; METEOR extends BLEU with stemming and synonym matching; newer metrics like BERTScore use embeddings to assess semantic similarity; and FID, originally from images, can be adapted to compare distributions of generated and real texts. Each metric brings partial insight but also clear limitations.
The session concludes with practical guidance on decoder selection by task type and a short Q&A. For conditional tasks with a clear target (translation, summarization), beam search is often preferred for higher quality and consistency. For open-ended tasks (creative writing, chat), sampling—especially nucleus sampling—is commonly better for diversity and naturalness. Beam size should be chosen by balancing quality improvements against rapidly increasing compute, often starting with a beam size between 5 and 10 and adjusting based on task and output length. Automatic metrics cannot replace humans because they capture surface-level similarity and may miss meaning; they are best used with caution and alongside human judgments.
By the end, you should understand the main task shapes for language model use, how decoding algorithms work and when to apply them, and how to think critically about evaluation. You will be able to configure decoding strategies, explain their trade-offs, and design a sensible evaluation plan that mixes automatic metrics with human review. The lecture is structured from problem framing (tasks) to algorithmic tools (decoding methods) to measurement (evaluation), with simple examples grounding each concept.
Key Takeaways
✓Match decoding to task goals. Use beam search for tasks with clear targets and correctness, like translation and summarization, and sampling (especially top-p) for creative, open-ended tasks. This alignment avoids optimizing for the wrong qualities. It ensures your outputs reflect what matters for the application.
✓Start beam search with small beam sizes. Try k=5, then 8 or 10, and stop when improvements plateau relative to compute cost. Longer outputs multiply cost, so smaller beams may be necessary. Monitor both quality and latency to pick a practical setting.
✓Apply length normalization for sequence tasks that risk short outputs. Without it, beam search tends to favor short sequences due to probability multiplication. Tune alpha between 0.6 and 1.0 to balance brevity and completeness. Validate with human judgments for readability.
✓Control sampling with temperature. Lower temperatures (e.g., 0.7) preserve coherence; higher temperatures add creativity but risk derailment. For safety-critical tasks, prefer lower temperatures. Adjust only as needed to counter dullness.
✓Prefer nucleus sampling over plain sampling for open-ended tasks. Top-p adapts to the model’s certainty, giving a better balance of diversity and coherence. Start with top_p≈0.9 and adjust slowly. Combine with temperature for fine control.
✓Use top-k as a simple guardrail. If outputs are too random, set k=50 (or similar) to remove unlikely tokens. This can stabilize results with minimal complexity. If the distribution varies widely, consider switching to top-p.
✓Set clear stop conditions. Always define EOS tokens and max lengths to prevent run-on text and excessive costs. For structured tasks, add task-specific stopping rules. This keeps outputs tidy and predictable.
Glossary
Inference
Using a trained model to produce outputs on new inputs. It is the phase after training where the model is actually applied. The model predicts the next token or the class label based on learned patterns. In language models, inference builds text step by step. The choices made during inference strongly affect the final quality.
Conditional Generation
Producing an output sequence that depends on an input sequence. The input guides the output, like translating one language into another or summarizing a document. The model computes probabilities for each next token using both the input and the already generated output. This tight linkage makes outputs faithful to inputs. It is a common production use of language models.
Open-Ended Generation
Generating text with minimal or no input prompt. The model creates content mostly from its learned knowledge and patterns. Many different outputs can be valid. This favors creativity and variety over exact correctness. It is used in chat, brainstorming, and storytelling.
Classification
Predicting a single label for an input. Instead of writing a long output, the model decides among categories. Labels might be sentiment, topic, or intent. This often uses a small label vocabulary or a classifier head. It is a fast and common NLP task.
•Sampling draws the next token from the model’s probability distribution rather than always taking the maximum. This introduces diversity and can produce multiple different but valid outputs. However, pure sampling can wander into incoherent text if it picks low-probability tokens. Temperature controls randomness by sharpening or flattening the distribution.
•Top-k sampling limits choices to the k most probable tokens and samples among them. This avoids extremely unlikely tokens while keeping diversity. The value of k controls the strength of the filter; too small can be dull, too large can be noisy. It is simpler but not adaptive to how peaked the distribution is.
•Nucleus (top-p) sampling adapts to the distribution by choosing the smallest set of tokens whose total probability mass exceeds p. When probabilities are very peaked, the set is small; when flatter, the set grows. This makes outputs more coherent than basic sampling and more adaptive than top-k. Many open-ended generation systems prefer top-p because of this flexibility.
•Task-decoder pairing matters: beam search for conditional tasks like translation and summarization, sampling (especially top-p) for creative/open-ended tasks. Deterministic methods favor consistency and quality on tasks with clear references. Stochastic methods favor diversity and exploration where many answers are acceptable. Choosing the right decoding aligns outputs with task goals.
•Evaluating generated text is hard because there is often no single right answer. Human evaluation remains the gold standard, with criteria like fluency, coherence, relevance, and interestingness. But it is costly and slow, so we also use automatic metrics. Each metric captures a narrow slice of quality and has limitations.
•Perplexity measures how well a model predicts data and is tied to its training loss. It is easy to compute but does not reflect real-world generation quality. BLEU and ROUGE compare overlap with reference texts and are useful for translation and summarization benchmarks. METEOR, BERTScore, and FID offer different trade-offs but still cannot fully replace humans.
•The lecture closes with two practical Q&As: why automatic metrics like BLEU cannot replace human evaluation, and how to choose beam size. Automatic metrics measure surface similarity and can be sensitive to specific references, missing deeper meaning and coherence. Beam size is a trade-off between quality and computational cost; start small (5–10) and scale until gains flatten. Longer target lengths push costs up fast, often requiring smaller beams.
02 Key Concepts
01
Conditional Generation: Definition: Producing an output sequence based on an input sequence, like translation, summarization, or question answering. Analogy: It’s like answering a question on a test using the information given in the question. Technical: We model p(y|x) and decode y token by token conditioned on both x and previously generated tokens. Why it matters: Many real tasks demand the output be tightly tied to the input, so decoding must prioritize accuracy and consistency. Example: Translating “The cat sat on the mat” into French as “Le chat est assis sur le tapis.”
02
Open-Ended Generation: Definition: Generating text with minimal or no input prompt. Analogy: It’s like freestyle storytelling that starts from “Once upon a time” or even from nothing. Technical: We model p(y) or p(y|short prompt) and decode without a strong conditioning x. Why it matters: Some applications value creativity and variety over a single correct answer. Example: Continuing “Once upon a time,” the model writes “there was a princess who lived in a faraway land.”
03
Classification with LMs: Definition: Predicting a single label (e.g., positive/negative) for an input using a language model. Analogy: It’s like sorting mail into bins labeled “Sports,” “Politics,” or “Tech.” Technical: Instead of forming a long sequence, we map input x to a discrete class y, often via token-based logits or a small label vocabulary. Why it matters: Many practical tasks (sentiment, topic) need compact decisions rather than long outputs. Example: Labeling “This movie was amazing!” as Positive.
04
Decoding: Definition: The process of choosing a sequence y that maximizes p(y|x) under a language model. Analogy: It’s like finding the best path through a maze by looking at hints at each junction. Technical: Exact argmax over all sequences is intractable (V^n growth), so we use approximate search methods. Why it matters: The decoding strategy heavily shapes output quality, diversity, and speed. Example: Completing “The cat sat on the …” into a fluent sentence with a chosen algorithm.
05
Greedy Decoding: Definition: At each step, choose the single highest-probability next token. Analogy: It’s like always taking the biggest cookie in a jar without checking what comes next. Technical: y_t = argmax_v p(v|x, y_<t), one forward pass per token; simple and fast. Why it matters: Greedy can get stuck in local optima and early mistakes can’t be fixed later. Example: “The weather is nice, let’s go to the …” picking “restaurant” over “park,” which shifts the entire sentence direction.
06
Local Optimum (in Decoding): Definition: A choice that looks best right now but leads to a worse final sequence. Analogy: Climbing a hill that seems tallest nearby but isn’t the highest mountain overall. Technical: Greedy maximizes immediate token probability, not total sequence likelihood, so it may block better future continuations. Why it matters: Early missteps can lock outputs into suboptimal narratives. Example: Choosing “restaurant” prevents later generating “…park,” even if the overall sentence would be better.
07
Beam Search: Definition: Keep the top-k partial sequences at each step, expanding and pruning iteratively. Analogy: It’s like exploring several promising paths in a maze at once instead of just one. Technical: Maintain a beam of size k; at each step, extend each sequence with possible next tokens, retaining the k highest-scoring sequences. Why it matters: It reduces the risk of bad early decisions and improves quality at higher compute cost. Example: With k=2, keep both “…mat” and “…sofa,” then continue from each and choose the best overall.
08
Beam Size: Definition: The number k of parallel candidates maintained in beam search. Analogy: It’s like the number of lanes you keep open when merging traffic—more lanes, more options. Technical: Larger k increases search coverage and quality but scales compute roughly linearly per step. Why it matters: Too small misses good paths; too big is slow and offers diminishing returns. Example: Starting with k=5 and increasing to k=10 until further improvements flatten out.
09
Length Normalization: Definition: Adjusting scores so beam search doesn’t over-prefer shorter sequences. Analogy: It’s like giving long essays a fair shot even though they have more words that could go wrong. Technical: Divide log-probability by length^alpha (alpha controls strength), balancing short vs. long sequences. Why it matters: Without it, the product of many probabilities favors shorter outputs unfairly. Example: Summaries stop being clipped too early when alpha is set around 0.6–1.0.
10
Sampling: Definition: Randomly drawing the next token according to the model’s probability distribution. Analogy: It’s like rolling a weighted die where heavier sides represent more likely words. Technical: y_t ~ Multinomial(p(v|x,y_<t)); introduces diversity across runs. Why it matters: It can produce varied, creative outputs but risks incoherence from low-probability picks. Example: Different story continuations from the same prompt on repeated runs.
11
Temperature: Definition: A control knob to make sampling more random or more conservative. Analogy: It’s like turning a spice dial—higher heat spreads choices, lower heat focuses them. Technical: Adjust logits by 1/T before softmax; high T flattens, low T sharpens the distribution. Why it matters: Lets you balance creativity and reliability for your task. Example: T=0.01 approximates greedy; T=100 makes choices nearly random.
12
Top-k Sampling: Definition: Restrict sampling to the k most probable tokens, renormalize, then sample. Analogy: It’s like picking a snack from the top five favorites instead of the entire pantry. Technical: Zero out probabilities beyond rank k; sample within the truncated set. Why it matters: Cuts out unlikely words that derail coherence, while keeping diversity. Example: With k=5, only the five most likely tokens are considered for the next step.
13
Nucleus (Top-p) Sampling: Definition: Sample from the smallest set of tokens whose cumulative probability exceeds p. Analogy: It’s like filling a bag with candies until you reach 90% of total sweetness, then picking from that bag. Technical: Sort by probability, include tokens until sum ≥ p, renormalize, then sample. Why it matters: Adapts to peaked vs. flat distributions for better balance of coherence and variety. Example: With p=0.9, sometimes 3 tokens suffice; other times, 10 are needed.
14
Choosing Decoding by Task: Definition: Matching the decoding method to the problem’s needs. Analogy: It’s like choosing shoes—running shoes for sprints, hiking boots for trails. Technical: Use beam search for tasks with clear references and correctness (translation/summarization); use sampling (top-p) for creative tasks needing diversity. Why it matters: The same model behaves very differently under different decoders, affecting output quality. Example: A summarizer with beam search is concise and stable; a chatbot with top-p feels more natural.
15
Evaluation Problem: Definition: Measuring generation quality when many valid answers exist. Analogy: It’s like grading a poem—there’s no single right way to write it. Technical: Human evaluation is most reliable but costly; automatic metrics provide proxies with limitations. Why it matters: Poor evaluation misleads model selection and deployment decisions. Example: A text with high n-gram overlap may still be incoherent or off-topic.
16
Human Evaluation: Definition: People rate outputs on fluency, coherence, relevance, and interestingness. Analogy: It’s like a panel of judges scoring performances on multiple criteria. Technical: Design rubrics, collect ratings, and compute inter-rater agreement; expensive but captures meaning. Why it matters: Humans understand nuance that metrics miss, making this the gold standard. Example: Reviewers flag a high-BLEU translation that is fluent but subtly wrong in meaning.
17
Perplexity: Definition: A measure of how well the model predicts data, tied to cross-entropy loss. Analogy: It’s like how surprised you are by each next word—the less surprised, the better. Technical: Perplexity = exp(average negative log-likelihood); lower is better for modeling. Why it matters: Good for model training and validation, but not sufficient for generation quality. Example: A model with low perplexity can still produce dull or inconsistent outputs when decoding poorly.
18
BLEU: Definition: An automatic metric for translation based on n-gram precision against references. Analogy: It’s like counting how many small word chunks match a known good answer. Technical: Computes clipped n-gram overlap with a brevity penalty; higher is better. Why it matters: Standardized and easy to compute, but measures surface similarity, not meaning. Example: Two different valid translations may score differently depending on reference phrasing.
19
ROUGE: Definition: An automatic metric emphasizing recall of n-grams against references, common in summarization. Analogy: It’s like checking how much of the important stuff you remembered from the article. Technical: ROUGE-N, ROUGE-L measure overlap and longest common subsequence; higher recall means more coverage. Why it matters: Encourages capturing key points but may reward verbosity or redundancy. Example: A summary that includes many reference phrases scores high even if repetitive.
20
METEOR: Definition: A translation metric that accounts for stemming and synonyms beyond exact matches. Analogy: It’s like recognizing that “run” and “running” are the same idea, or “car” and “automobile” are related. Technical: Aligns hypothesis to reference using exact, stem, synonym matches; computes precision/recall with penalties. Why it matters: More semantically forgiving than BLEU, often correlating better with humans. Example: A translation using a synonym gets credit that BLEU might miss.
21
BERTScore: Definition: A semantic similarity metric using contextual embeddings to compare candidate and reference. Analogy: It’s like comparing meanings rather than exact words by looking at context. Technical: Uses pretrained models (e.g., BERT) to compute token-level similarity and F1-like scores. Why it matters: Captures meaning beyond surface overlap, helpful when many paraphrases are valid. Example: Two differently worded but semantically equivalent sentences score high.
22
FID (Fréchet Inception Distance) for Text: Definition: A distributional distance adapted from image generation evaluation. Analogy: It’s like comparing the shapes of two clouds rather than matching droplets. Technical: Embed real and generated samples, fit Gaussians, and compute the Fréchet distance; lower is better. Why it matters: Evaluates overall distribution similarity, not just one-to-one matches. Example: A generator whose outputs resemble the corpus statistics achieves lower FID.
23
Limits of Automatic Metrics: Definition: Reasons metrics can mislead and why humans are still needed. Analogy: It’s like judging a story just by word overlap, missing plot and tone. Technical: Metrics often measure surface forms, depend on specific references, and miss coherence and factuality. Why it matters: Over-optimizing for a metric can worsen real quality. Example: A high-BLEU output that is fluent but semantically wrong.
24
Choosing Beam Size: Definition: Practical guidance for setting k in beam search. Analogy: It’s like opening more lanes on a highway—traffic flows better but costs more to build. Technical: Larger k improves search but increases compute per step; returns diminish beyond a point. Why it matters: Balancing quality and speed is essential in production. Example: Start at k=5, increase to k=10, and stop if metrics and human ratings plateau.
03 Technical Details
Overall Architecture/Structure
Problem Setup
We have a trained language model that predicts the probability of the next token given prior context (and possibly an input x). In conditional generation, we model p(y|x) as a product over steps: p(y|x) = Π_t p(y_t | x, y_<t). In open-ended generation, the conditioning x is minimal or absent (perhaps a short prompt). In classification, we map x to a discrete label y, often using a small output label vocabulary or mapping logits to classes. In inference, our job is to decode: construct y one token at a time using these probabilities.
Why Exact Search Is Infeasible
To find the most probable sequence, one might try y* = argmax_y p(y|x). But enumerating all sequences is impossible in practice: with vocabulary size V and desired length n, there are V^n sequences. Even for moderate V (e.g., 30k) and n (e.g., 20), this number is astronomical. Hence, we use approximate algorithms that build sequences incrementally and prune aggressively.
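To make that scale concrete, a quick back-of-the-envelope calculation using the illustrative figures from the text (V = 30k, n = 20):

```python
# Number of candidate sequences an exhaustive search would have to score: V^n
V = 30_000   # vocabulary size (illustrative figure from the text)
n = 20       # target sequence length (illustrative)
num_sequences = V ** n

# Roughly 10^89 sequences -- far beyond anything that could be enumerated.
print(f"~10^{len(str(num_sequences)) - 1} candidate sequences")
```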
Decoding as Incremental Decision-Making
Each decoding step uses the model to compute a probability distribution over the vocabulary given current context. We then choose the next token via a strategy: deterministic (greedy or beam) or stochastic (sampling variants). Data flow: input x + generated prefix → model forward pass → logits → probabilities (softmax) → next token decision → append token → repeat until stop (EOS token, max length, or task-specific stop).
Code/Implementation Details
There is no single code listing, but the algorithms follow standard skeletons. Below are clear step-by-step descriptions you can directly translate into code in any language.
Greedy Decoding
Inputs: model, input x (optional), max_length, end-of-sequence token (EOS).
State: generated sequence y (initially empty or prompt), step t = 1.
Loop:
Compute probabilities p(v | x, y_<t) with a forward pass.
Choose v* = argmax_v p(v | ...).
Append v* to y.
If v* = EOS or length hits max_length, stop; else t = t+1.
Output: y. Pros: minimal compute (one forward pass per token). Cons: vulnerable to local optima, lacks diversity.
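The loop above translates directly into Python. The `next_token_probs` function stands in for a real model forward pass; the table below is a hypothetical toy "model" over a tiny vocabulary, used only to make the sketch runnable:

```python
def greedy_decode(next_token_probs, prompt, eos, max_length):
    """Greedy decoding: pick the argmax token at every step."""
    y = list(prompt)
    while len(y) < max_length:
        probs = next_token_probs(y)          # model forward pass (stub)
        v_star = max(probs, key=probs.get)   # highest-probability next token
        y.append(v_star)
        if v_star == eos:
            break
    return y

# Hypothetical toy "model": next-token distribution keyed by the last token.
TABLE = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "cat":   {"<eos>": 0.9, "sat": 0.1},
    "dog":   {"<eos>": 1.0},
    "a":     {"dog": 0.7, "<eos>": 0.3},
}

def toy_model(prefix):
    return TABLE[prefix[-1]]

print(greedy_decode(toy_model, ["<bos>"], "<eos>", 10))
# -> ['<bos>', 'the', 'cat', '<eos>']
```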
Beam Search
Inputs: model, input x (optional), max_length, EOS, beam size k.
State: a list (beam) of k partial sequences with scores. Scores are typically cumulative log-probabilities.
Initialization: beam = [{seq: prompt, score: 0}].
Loop each step:
For each sequence in beam, run a forward pass to get p(v) for next tokens.
For each sequence, generate candidate extensions with each token v and new_score = score + log p(v).
Pool all candidates (size up to k * V) and select the top k by adjusted score:
If using length normalization, adjust candidate score: score’ = new_score / (length^alpha).
Replace beam with k best candidates.
If all candidates end with EOS or max_length reached, stop.
Output: Best sequence in beam by (possibly length-normalized) score. Pros: explores multiple paths; better global quality. Cons: k more forward passes per step; still approximate.
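A minimal sketch of the beam loop, again using a hypothetical toy next-token table in place of a model forward pass. The table is constructed so that greedy's first pick ("a") leads to a worse overall sentence, which a beam of size 2 escapes:

```python
import math

def beam_search(next_token_probs, prompt, eos, k=2, max_length=10, alpha=0.0):
    """Beam search: keep the k best partial sequences by cumulative log-prob."""
    beam = [(list(prompt), 0.0)]              # (sequence, sum of log-probs)
    for _ in range(max_length):
        candidates = []
        for seq, score in beam:
            if seq[-1] == eos:                # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for v, p in next_token_probs(seq).items():
                candidates.append((seq + [v], score + math.log(p)))
        # Optional length normalization: divide by len^alpha (alpha=0 disables).
        def adjusted(cand):
            seq, score = cand
            return score / (len(seq) ** alpha) if alpha > 0 else score
        beam = sorted(candidates, key=adjusted, reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beam):
            break
    return beam[0][0]

# Hypothetical toy "model" where the locally best first token is a trap.
TABLE = {
    "<bos>": {"a": 0.55, "the": 0.45},
    "a":     {"dog": 0.7, "<eos>": 0.3},
    "the":   {"cat": 1.0},
    "cat":   {"<eos>": 1.0},
    "dog":   {"<eos>": 1.0},
}

def toy_model(prefix):
    return TABLE[prefix[-1]]

print(beam_search(toy_model, ["<bos>"], "<eos>", k=1))  # greedy-like: "a dog"
print(beam_search(toy_model, ["<bos>"], "<eos>", k=2))  # better path: "the cat"
```

With k=1 the search collapses to greedy and commits to "a"; with k=2 the hypothesis starting with "the" survives long enough to win on total log-probability.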
Length Normalization Details
Without normalization, longer sequences multiply more probabilities (or add more negative log-probabilities), biasing toward shorter outputs. Length normalization adjusts by dividing by length^alpha (or using alternative normalizers), with alpha ∈ [0,1] common. Alpha = 0 means no normalization; alpha = 1 applies full normalization. In practice, tuning alpha prevents overly short generations in tasks like summarization.
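A quick numeric illustration of the bias and the fix, using made-up cumulative log-probabilities for a short and a long hypothesis:

```python
# Made-up cumulative log-probabilities (sums of per-token log-probs).
short_logp, short_len = -4.0, 4    # e.g., a clipped summary
long_logp,  long_len  = -7.0, 10   # a complete summary

# Raw scores favor the short sequence simply because it sums fewer terms.
assert short_logp > long_logp

# Length-normalized scores (divide by length^alpha) can reverse the ranking.
alpha = 0.8
short_norm = short_logp / short_len ** alpha
long_norm  = long_logp / long_len ** alpha
print(short_norm, long_norm)   # the long sequence now scores higher
```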
Sampling (Basic)
Inputs: model, input x, max_length, EOS, temperature T.
Loop each step:
Forward pass to get logits over vocabulary.
Adjust logits by temperature: logits’ = logits / T. Lower T (<1) sharpens; higher T (>1) flattens.
Convert to probabilities with softmax.
Sample next token from Multinomial(probabilities).
Append token; stop at EOS or max_length.
Pros: generates diverse outputs; multiple unique samples possible. Cons: may sample unlikely tokens and drift into incoherence when T is high.
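The temperature adjustment and the sampling step can be sketched with the standard library alone (the logits here are made up):

```python
import math
import random

def sample_with_temperature(logits, T, rng=random):
    """Scale logits by 1/T, softmax, then draw one index from the result."""
    scaled = [l / T for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs

logits = [2.0, 1.0, 0.1]                      # made-up logits for 3 tokens
_, sharp = sample_with_temperature(logits, T=0.5)
_, flat  = sample_with_temperature(logits, T=2.0)
print(sharp)  # low T concentrates probability mass on the top token
print(flat)   # high T spreads it out across the vocabulary
```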
Top-k Sampling
Steps are the same as basic sampling, except before sampling:
Sort tokens by probability.
Keep only the top k; set others to zero.
Renormalize the remaining probabilities.
Sample from the truncated set.
Trade-off: With k small, outputs become conservative; with k large, risk of noise returns. Useful to avoid very unlikely tokens entirely.
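The truncation step as a pure-Python sketch (the distribution values are made up):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens, drop the rest, renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

probs = {"park": 0.40, "store": 0.30, "beach": 0.15,
         "moon": 0.10, "xylophone": 0.05}
filtered = top_k_filter(probs, k=3)
print(filtered)   # only the 3 likeliest tokens remain, renormalized to sum to 1

# Sample the next token from the truncated, renormalized set.
token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```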
Nucleus (Top-p) Sampling
Differs in how the candidate set is chosen:
Sort tokens by probability (descending).
Accumulate probabilities until the sum ≥ p (e.g., 0.9).
Keep just that minimal set; zero out the rest.
Renormalize and sample.
Advantage: The size of the candidate set automatically adapts to distribution shape—few tokens if peaked, many if flat. Often produces coherent yet varied text.
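The adaptive candidate set is easy to see in code. With made-up peaked and flat distributions, the same p=0.9 threshold keeps 2 tokens in one case and 8 in the other:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# A peaked distribution reaches p=0.9 with few tokens ...
peaked = {"the": 0.85, "a": 0.08, "this": 0.04, "that": 0.03}
# ... a flat one needs many.
flat = {t: 0.125 for t in "abcdefgh"}

print(len(top_p_filter(peaked, 0.9)))  # 2 tokens kept
print(len(top_p_filter(flat, 0.9)))    # 8 tokens kept
```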
Classification with Language Models
While generation outputs sequences, classification outputs a single class. Implementation options include:
Constrain decoding to a small label vocabulary (e.g., tokens for “Positive”/“Negative”) and pick the highest probability label token.
Map the model’s internal representation to a classifier head (common in fine-tuned setups) and choose the argmax class.
In prompt-based setups, one can compute p(label token | prompt + input) and choose the maximum.
Tools/Libraries Used
The lecture stays framework-agnostic, but typical implementations use PyTorch or TensorFlow for model inference, and Hugging Face Transformers for ready-made decoding utilities. These frameworks expose parameters like temperature, top_k, top_p, num_beams, length_penalty (alpha), max_length, and early_stopping. Install via package managers (pip install transformers torch). The core ideas apply regardless of framework.
Step-by-Step Implementation Guide
A) Choosing a Decoder for Your Task
Step 1: Identify task type.
• Conditional with clear references (translation/summarization): prefer beam search.
• Open-ended/creative (story, chat): prefer top-p sampling (optionally with temperature and top-k).
• Classification: constrain to label tokens or use a classifier head.
Step 3: Iterate.
• Adjust beam size until quality plateaus and compute is acceptable.
• Tune top_p and temperature to balance coherence and creativity.
B) Implementing Beam Search (Pseudo-Workflow)
Input: source text x; model; tokenizer; num_beams=5; length_penalty=0.7; max_length=128.
Encode x; initialize beam with [([BOS], score=0)].
For t in 1..max_length:
• For each sequence in beam, compute next-token logits given x and sequence.
• Convert logits to log-probabilities; form candidates by appending each token.
• Compute adjusted scores = (sum log-probs) / (length^alpha) if using normalization.
• Keep top num_beams candidates overall.
• If all candidates end with EOS, break.
Decode the best (length-normalized) sequence to text.
C) Implementing Top-p Sampling (Pseudo-Workflow)
Encode prompt; set generated sequence to prompt tokens.
For t in 1..max_length:
• Forward pass to obtain logits.
• logits’ = logits / temperature; probs = softmax(logits’).
• Sort tokens by probs; accumulate until sum ≥ top_p; truncate; renormalize.
• Sample one token; append.
• If EOS, break.
Decode tokens to text.
D) Classification via Label Tokens
Define a small label vocabulary (e.g., {“Positive”, “Negative”}).
Build a prompt: "Review: <text>\nSentiment:".
Compute p(label_token | prompt) for each label token.
Choose the label with the highest probability.
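The scoring step can be sketched as follows. The `label_token_probs` callable is a hypothetical stand-in for querying the model with the prompt; a real system would read these probabilities off the model's logits for each label token:

```python
def classify(label_token_probs, labels):
    """Pick the label whose token the model assigns the highest probability."""
    scores = {label: label_token_probs(label) for label in labels}
    return max(scores, key=scores.get), scores

# Hypothetical model output for the prompt
# "Review: This movie was amazing!\nSentiment:"
FAKE_PROBS = {"Positive": 0.91, "Negative": 0.09}

label, scores = classify(FAKE_PROBS.get, ["Positive", "Negative"])
print(label)   # Positive
```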
Tips and Warnings
Beware of Length Bias
In beam search without length normalization, shorter sequences often win. Always consider a length penalty for tasks where complete answers are needed (e.g., summaries). Too high a penalty can inflate length and cause verbosity—tune alpha.
Compute Cost Scales Fast
Beam search scales with num_beams per decoding step; long outputs multiply costs. For long-form generation, limit beam size or consider sampling to keep latency manageable.
Diversity vs. Coherence Trade-off
High temperature or large top_p encourages creativity but risks derailment. For production systems, start conservative (e.g., top_p=0.9, temperature=0.7) and relax only if outputs feel repetitive.
Early Mistakes Snowball
Greedy decisions are irreversible. If your task punishes early errors (e.g., translation), avoid greedy unless speed is the top priority.
Stop Criteria Matter
Always set a sensible max_length and an EOS token. For conditional tasks, add task-specific stop conditions (e.g., end of sentence) to prevent run-on outputs.
Metric Blind Spots
Perplexity correlates with training fit, not necessarily with generation usefulness. BLEU/ROUGE can reward overlap and miss meaning; complement them with human evaluation.
Beam Size Tuning
Start with 5; compare with 8 and 10; stop if quality gains flatten. For very long outputs, prefer smaller beams to control cost.
Sampling Guardrails
Combine temperature with top-k or top-p to avoid very low-probability tokens. If outputs become too random, lower temperature or reduce top_p.
Reproducibility
Sampling-based methods need fixed random seeds for reproducible testing. Deterministic decoders produce the same output for identical inputs and settings.
Task-Decoder Alignment
If a single best answer exists and references are available, deterministic methods likely score better on benchmarks. If many answers are acceptable (dialogue, brainstorming), sampling gives a better user experience.
Evaluation Details
Human Evaluation
Design rubrics covering fluency (natural wording), coherence (logical flow), relevance (on-topic), and interestingness (engagement). Use multiple raters and compute agreement (e.g., Cohen’s kappa) to ensure reliability. Consider pairwise comparisons (A/B testing) when relative judgments are easier than absolute scores.
Perplexity
Computed as exp of average negative log-likelihood across tokens in a corpus. Lower perplexity suggests the model assigns higher probabilities to ground-truth data. However, decoding choices can still yield poor generations even from a low-perplexity model. Use perplexity primarily for training monitoring and model selection, not as a sole measure of user-facing quality.
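The definition in one function, applied to hypothetical per-token probabilities. The result equals the inverse geometric mean of the probabilities the model assigned to the ground-truth tokens.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up model probabilities for each ground-truth token in a sentence.
probs = [0.5, 0.25, 0.8, 0.1]
print(round(perplexity(probs), 3))  # → 3.162
```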
BLEU
Calculates n-gram precision with clipping and applies a brevity penalty to avoid too-short outputs. Works best with multiple reference translations to mitigate sensitivity to phrasing. Known limitations include lack of semantic understanding and poor handling of synonyms and paraphrases.
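A simplified single-reference BLEU (up to bigrams) showing clipping and the brevity penalty. Production implementations (e.g., sacreBLEU) add smoothing, standardized tokenization, and use up to 4-grams.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Clipped n-gram precisions, geometric mean, brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipping
        precisions.append(overlap / max(sum(cand.values()), 1))
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    if not all(precisions):
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(round(bleu(cand, ref), 3))  # → 0.707
```

Clipping stops a candidate from earning credit for repeating a reference word more often than the reference contains it; the brevity penalty counteracts precision's bias toward very short outputs.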
ROUGE
Common variants: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence). Emphasizes recall, rewarding coverage of reference content—which can encourage overly long outputs if unchecked. Combine with precision-oriented checks and human review.
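ROUGE-L reduces to a longest-common-subsequence computation; a compact sketch reporting both recall (against the reference) and precision (against the candidate):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else \
                max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L recall and precision based on LCS with the reference."""
    lcs = lcs_length(candidate, reference)
    return {"recall": lcs / len(reference), "precision": lcs / len(candidate)}

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(rouge_l(cand, ref))
```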
METEOR, BERTScore, FID
METEOR aligns words using stems and synonyms, often correlating better with human judgments than BLEU. BERTScore uses contextual embeddings to score semantic similarity and is better at recognizing paraphrases. FID (Fréchet Inception Distance) compares distributions of embeddings rather than one-to-one matches; it originated in image generation, requires a robust embedding space, and is more computationally demanding.
Putting It Together
For conditional tasks with references, report BLEU/ROUGE/METEOR and perform a targeted human evaluation sample. For open-ended tasks, rely more on human evaluation and consider semantic metrics like BERTScore; distributional checks like FID can complement but are not standard. Always interpret metrics within their limits and validate conclusions with human judgments before deployment.
04 Examples
💡
Translation (Conditional Generation): Input: “The cat sat on the mat.” Output: “Le chat est assis sur le tapis.” Processing: The model conditions on the English sentence, computes p(y|x), and decodes token by token to form the French sentence. Key point: Output depends directly on the input; beam search often improves accuracy and fluency.
💡
Summarization (Conditional Generation): Input: A long news article. Output: A short summary that captures the main points. Processing: The model reads the document, predicts summary tokens step by step, and ideally benefits from beam search with length normalization to avoid overly short summaries. Key point: Conditional generation with a clear reference summary is well-suited to deterministic decoding.
💡
Question Answering (Conditional Generation): Input: “What is the capital of France?” Output: “Paris.” Processing: The model conditions on the question and produces a concise, single-token (or few-token) answer with high probability. Key point: Even though the output can be short, it is still decoded via sequence prediction tied to the input.
💡
Open-Ended Story Start: Input: Prompt “Once upon a time”. Output: A creative continuation like “there was a princess who lived in a faraway land.” Processing: The model samples the next tokens, potentially using nucleus sampling with top_p=0.9 and temperature=0.8 to balance novelty and coherence. Key point: Many valid continuations exist—diversity matters more than a single best answer.
💡
Greedy Next-Word Example: Input prefix: “The cat sat on the”. Processing: The model computes probabilities over the vocabulary; greedy picks the most likely next token, e.g., “mat”. Output: “The cat sat on the mat.” Key point: Greedy is simple and fast but may fail in more ambiguous contexts.
💡
Greedy Pitfall (Local Optimum): Input prefix: “The weather is nice, let’s go to the”. Processing: Greedy chooses “restaurant” because it is slightly more common in data than “park” at that moment. Output drifts: “The weather is nice, let’s go to the restaurant for dinner.” Key point: A locally high-probability choice can misdirect the entire sentence.
💡
Beam Search with k=2: Input prefix: “The cat sat on the”. Step 1: Keep both “…mat” and “…sofa” candidates. Step 2: Extend each; e.g., “…mat .” and “…sofa again”. Select top 2 by cumulative (possibly length-normalized) scores. Key point: Maintaining multiple candidates avoids committing too early and can find better overall sequences.
💡
Length Normalization Example: Task: Summarization where the model tends to output very short summaries. Processing: Apply length penalty with alpha around 0.7 to adjust scores for longer candidates. Result: The beam search no longer over-prefers ultra-short outputs; summaries include enough detail. Key point: Fixes systematic shortness bias in beam search.
💡
Temperature Extremes: Setup: Sampling with different temperatures. High T=100 flattens probabilities so choices are almost random; Low T=0.01 makes choices near-deterministic. Output: High T produces diverse but often incoherent text; low T yields safe, repetitive text. Key point: Temperature is a dial between creativity and control.
💡
Top-k Sampling with k=5: Setup: Sort tokens by probability and keep the top five. Processing: Renormalize those five probabilities and sample next token. Output: Avoids rare, off-topic tokens while retaining some randomness. Key point: Simple control over diversity with a fixed-size candidate set.
💡
Nucleus Sampling with p=0.9: Setup: Sort tokens, accumulate probabilities until reaching 0.9, then sample from this set. Case 1: If the top three tokens already sum to 0.92, only three tokens are considered. Case 2: If probabilities are flat, many tokens are included. Key point: Adaptive candidate sets tailor diversity to the model’s current confidence.
💡
Evaluation with BLEU in Translation: Input: A set of English sentences, model-generated French outputs, and human reference translations. Processing: Compute n-gram overlaps, apply brevity penalty, and average BLEU across the dataset. Output: A BLEU score that reflects surface-level similarity to references. Key point: Useful for benchmarks but cannot by itself ensure semantic correctness.
💡
ROUGE for Summarization: Input: Articles, model-generated summaries, and human-written reference summaries. Processing: Compute ROUGE-N and ROUGE-L to measure recall and LCS overlap. Output: Scores indicating how much of the reference content appears in the system summary. Key point: Encourages coverage but may reward redundant phrasing.
💡
Human Evaluation Panel: Setup: Judges rate fluency, coherence, relevance, and interestingness on a Likert scale. Processing: Collect multiple ratings per sample and compute average scores and inter-rater agreement. Output: Reliable judgments of real quality aspects that metrics miss. Key point: The most trustworthy evaluation, though expensive to run.
💡
Beam Size Tuning Experiment: Setup: Run beam search with k=3, 5, 8, and 10 on a validation set. Processing: Observe improvements in BLEU/ROUGE and human ratings versus increased latency. Output: Gains diminish after k≈8 for this task; choose k=8 as a balance. Key point: Quality vs. compute is a practical trade-off that depends on task length and constraints.
05 Conclusion
Inference is the phase where a trained language model becomes useful—by generating sequences or producing labels for real tasks. The lecture began by framing three task types—conditional generation, open-ended generation, and classification—each with distinct needs and evaluation styles. Because exact search for the most likely sequence is intractable, we rely on approximate decoding. Greedy decoding is fastest but risks local optima; beam search explores multiple candidates and benefits from length normalization; sampling methods inject diversity, with temperature controlling randomness; top-k trims unlikely tokens, while nucleus (top-p) adapts to the distribution and often yields the best balance for creative tasks.
Evaluation remains challenging: no single metric captures all qualities of good text. Human evaluation is the gold standard, scoring fluency, coherence, relevance, and interestingness, but it is costly. Automatic metrics provide useful signals within limits: perplexity reflects predictive fit, BLEU and ROUGE measure n-gram overlap for translation and summarization, METEOR and BERTScore account more for meaning, and FID compares distributions. Use metrics judiciously and always validate with human judgments for critical applications.
To practice, pair decoders with tasks: try beam search (k≈5–10) and length normalization for translation and summarization; try nucleus sampling (p≈0.9) with moderate temperature (≈0.7–1.0) for chat or storytelling. Run small ablations to tune beam size, top-p, temperature, and max lengths, and track both automatic scores and a small, well-designed human evaluation. A good next step is to implement each decoding method on a single prompt and compare outputs and latency, then scale to a dataset and analyze metric trends.
Going forward, deepen your understanding of advanced decoding controls (e.g., better length penalties), richer human evaluation protocols, and semantic metrics that align more closely with human judgments. Remember the core message: the decoder you choose and how you evaluate it fundamentally shape the behavior and perceived quality of your model. Always align decoding with task goals and interpret metrics with care. With these tools, you can confidently deploy language models that are both effective and appropriate for their intended use.
✓Beware of early mistakes. Greedy or high-temperature sampling can make irreversible choices that snowball. If early accuracy is crucial, use beam search or reduce randomness. Check first-token choices carefully in analysis.
✓Measure with multiple metrics and humans. Perplexity is not a generation-quality metric; BLEU/ROUGE capture overlap but not meaning. Combine automatic metrics with human evaluation to avoid metric gaming. Use small but well-designed human studies often.
✓Tune hyperparameters systematically. Change one setting at a time (beam size, top_p, temperature) and log effects on quality and speed. Use validation sets and A/B testing for realistic comparisons. This makes improvements trustworthy and reproducible.
✓Balance coherence and diversity consciously. More randomness is not always better; too little can be dull. Set a target style for your application, then tune toward it. Keep user experience as the guiding principle.
✓Consider cost from the start. Beam search increases compute per token, and long outputs amplify this. For latency-sensitive apps, prefer smaller beams or sampling setups. Profile and budget before scaling.
✓Design evaluation rubrics that mirror user needs. Rate fluency, coherence, relevance, and interestingness according to the task. Calibrate raters and measure agreement. This ensures reliable signals for iteration.
✓Sanity-check automatic metrics. Look at examples that score high but read poorly, and vice versa. Understand your metric’s blind spots (e.g., paraphrasing issues for BLEU). Prevent overfitting to any single score.
✓Document decoding settings for reproducibility. Record temperature, top_k, top_p, beam size, length penalty, and seeds. This allows consistent comparisons across experiments. It also eases debugging and handoffs.
✓Use short validation prompts to prototype quickly. Compare decoders on the same small set and inspect outputs qualitatively. Once you see promising settings, scale to a larger test set. This saves time while steering you toward good defaults.
Decoding
The strategy for choosing each next token to form an output sequence. Because searching all possibilities is impossible, decoding uses heuristics or sampling. It can be deterministic (always the same output) or stochastic (varied outputs). The chosen method shapes coherence and diversity. Good decoding aligns with task goals.
Vocabulary (V)
The set of all tokens the model can output. Each token can be a word, subword, or character, depending on tokenization. A larger vocabulary offers more expression but makes each step’s choice larger. The size V affects computational cost. It also influences how many sequences are possible.
Token
A basic unit of text used by the model, like a word or subword. Text is split into tokens by a tokenizer before being processed. The model predicts one token at a time during decoding. Tokens are mapped to numeric IDs. Sequences are token lists.
Argmax
The choice with the highest value in a set. In decoding, argmax picks the token with the highest probability. This is used by greedy decoding at each step. It is simple and fast. But it can ignore future consequences.