•This session explains how to use a trained language model to produce outputs, a phase called inference. It covers three task types—conditional generation, open-ended generation, and classification—each with different input/output shapes that affect decoding choices. The lecture then dives into decoding methods, which are strategies to choose the next token step by step. Finally, it discusses how to evaluate generated text using human judgments and automatic metrics, along with their trade-offs.
•Conditional generation means producing an output sequence based on an input sequence, like translating English to French or summarizing an article. Open-ended generation has minimal or no input and focuses on creative, diverse outputs, such as stories or dialogue. Classification uses a language model to output a single label, such as sentiment or topic. Each of these requires different inference settings.
•Decoding is the search for the most likely output sequence given an input and a model. The exact best solution is usually impossible to compute because the number of possible sequences grows exponentially with length. Therefore, we rely on approximate search algorithms. The lecture explains when to use deterministic versus stochastic (randomized) decoding.
•Greedy decoding picks the highest-probability next token at every step. It is simple and fast but can get stuck in local optima, making early mistakes that the model cannot fix later. The lecture uses a simple sentence completion example to show how a plausible but suboptimal choice can derail the rest of the generation. Greedy decoding is best when speed matters more than global optimality.
•Beam search keeps several top candidate sequences at each step, not just one. It explores more options to avoid local traps but costs more computation, scaling with the beam size. The lecture presents length normalization to avoid favoring short outputs. In practice, beam sizes around 5–10 are common starting points.
Why This Lecture Matters
Inference is where language models deliver real value—turning training into practical outputs for translation, summarization, chat, and classification. The decoding method you choose can transform the same model into a precise translator, a creative storyteller, or a fast classifier. Engineers, data scientists, product managers, and researchers all benefit from understanding these levers because they determine quality, latency, and user experience. In production, the right pairing of task and decoder can boost accuracy, reduce hallucinations, and control costs; the wrong pairing can do the opposite.
Evaluation knowledge is equally critical. Teams need to justify model choices with trustworthy evidence. Automatic metrics are fast but partial, and human evaluation, while costly, provides the necessary reality check. By combining both, you can iterate quickly yet avoid shipping systems that look good on paper but fail users.
Career-wise, the ability to configure decoding and design smart evaluations is a core skill for modern AI roles. It shows you can move beyond training to delivering products that work under constraints. As generative AI becomes standard across industries—from customer support to content creation—the importance of inference choices and meaningful evaluation only grows. Mastering these tools positions you to build systems that are not only powerful in theory but reliable, efficient, and aligned with user needs in practice.
Lecture Summary
01 Overview
This lecture focuses on inference—the stage where a trained language model is used to produce outputs for real tasks. It starts by defining three broad task families: conditional generation (produce an output sequence based on an input sequence), open-ended generation (produce text with minimal or no input), and classification (output a single discrete label). Understanding the shape of inputs and outputs helps you choose an appropriate decoding method—how the model selects tokens step by step to form sequences. The lecture explains that exact search for the single most probable sequence is infeasible because the number of possibilities grows exponentially with sequence length and vocabulary size, so we use approximate algorithms.
The core of the session introduces four decoding strategies. Greedy decoding chooses the highest-probability next token at each step, making it fast but prone to getting stuck in local optima. Beam search keeps several candidate sequences, improving quality at the cost of more computation, and it often benefits from length normalization to avoid favoring short outputs. Sampling draws from the probability distribution rather than always taking the top choice, producing more diverse outputs; temperature controls how random or conservative sampling becomes. Top-k sampling restricts choices to the k most likely tokens, reducing incoherence from rare words, while nucleus (top-p) sampling adapts the candidate set to the distribution’s shape, usually improving coherence and diversity balance.
The lecture then turns to evaluation of generation quality, a notoriously difficult problem because language has many valid outputs and complex qualities that are hard to measure. Human evaluation is presented as the gold standard, using criteria such as fluency, coherence, relevance, and interestingness. However, it is resource-intensive, so automatic metrics are used in practice. Perplexity measures predictive fit to data but not generation quality; BLEU and ROUGE measure n-gram overlap with reference texts and are common in translation and summarization respectively; METEOR extends BLEU with stemming and synonym matching; newer metrics like BERTScore use embeddings to assess semantic similarity; and FID, originally from images, can be adapted to compare distributions of generated and real texts. Each metric brings partial insight but also clear limitations.
The session concludes with practical guidance on decoder selection by task type and a short Q&A. For conditional tasks with a clear target (translation, summarization), beam search is often preferred for higher quality and consistency. For open-ended tasks (creative writing, chat), sampling—especially nucleus sampling—is commonly better for diversity and naturalness. Beam size should be chosen by balancing quality improvements against rapidly increasing compute, often starting with a beam size between 5 and 10 and adjusting based on task and output length. Automatic metrics cannot replace humans because they capture surface-level similarity and may miss meaning; they are best used with caution and alongside human judgments.
By the end, you should understand the main task shapes for language model use, how decoding algorithms work and when to apply them, and how to think critically about evaluation. You will be able to configure decoding strategies, explain their trade-offs, and design a sensible evaluation plan that mixes automatic metrics with human review. The lecture is structured from problem framing (tasks) to algorithmic tools (decoding methods) to measurement (evaluation), with simple examples grounding each concept.
Key Takeaways
✓Match decoding to task goals. Use beam search for tasks with clear targets and correctness, like translation and summarization, and sampling (especially top-p) for creative, open-ended tasks. This alignment avoids optimizing for the wrong qualities. It ensures your outputs reflect what matters for the application.
✓Start beam search with small beam sizes. Try k=5, then 8 or 10, and stop when improvements plateau relative to compute cost. Longer outputs multiply cost, so smaller beams may be necessary. Monitor both quality and latency to pick a practical setting.
✓Apply length normalization for sequence tasks that risk short outputs. Without it, beam search tends to favor short sequences due to probability multiplication. Tune alpha between 0.6 and 1.0 to balance brevity and completeness. Validate with human judgments for readability.
✓Control sampling with temperature. Lower temperatures (e.g., 0.7) preserve coherence; higher temperatures add creativity but risk derailment. For safety-critical tasks, prefer lower temperatures. Adjust only as needed to counter dullness.
✓Prefer nucleus sampling over plain sampling for open-ended tasks. Top-p adapts to the model’s certainty, giving a better balance of diversity and coherence. Start with top_p≈0.9 and adjust slowly. Combine with temperature for fine control.
✓Use top-k as a simple guardrail. If outputs are too random, set k=50 (or similar) to remove unlikely tokens. This can stabilize results with minimal complexity. If the distribution varies widely, consider switching to top-p.
✓Set clear stop conditions. Always define EOS tokens and max lengths to prevent run-on text and excessive costs. For structured tasks, add task-specific stopping rules. This keeps outputs tidy and predictable.
Glossary
Inference
Using a trained model to produce outputs on new inputs. It is the phase after training where the model is actually applied. The model predicts the next token or the class label based on learned patterns. In language models, inference builds text step by step. The choices made during inference strongly affect the final quality.
Conditional Generation
Producing an output sequence that depends on an input sequence. The input guides the output, like translating one language into another or summarizing a document. The model computes probabilities for each next token using both the input and the already generated output. This tight linkage makes outputs faithful to inputs. It is a common production use of language models.
Open-Ended Generation
Generating text with minimal or no input prompt. The model creates content mostly from its learned knowledge and patterns. Many different outputs can be valid. This favors creativity and variety over exact correctness. It is used in chat, brainstorming, and storytelling.
Classification
Predicting a single label for an input. Instead of writing a long output, the model decides among categories. Labels might be sentiment, topic, or intent. This often uses a small label vocabulary or a classifier head. It is a fast and common NLP task.
•Sampling draws the next token from the model’s probability distribution rather than always taking the maximum. This introduces diversity and can produce multiple different but valid outputs. However, pure sampling can wander into incoherent text if it picks low-probability tokens. Temperature controls randomness by sharpening or flattening the distribution.
•Top-k sampling limits choices to the k most probable tokens and samples among them. This avoids extremely unlikely tokens while keeping diversity. The value of k controls the strength of the filter; too small can be dull, too large can be noisy. It is simpler but not adaptive to how peaked the distribution is.
•Nucleus (top-p) sampling adapts to the distribution by choosing the smallest set of tokens whose total probability mass exceeds p. When probabilities are very peaked, the set is small; when flatter, the set grows. This makes outputs more coherent than basic sampling and more adaptive than top-k. Many open-ended generation systems prefer top-p because of this flexibility.
•Task-decoder pairing matters: beam search for conditional tasks like translation and summarization, sampling (especially top-p) for creative/open-ended tasks. Deterministic methods favor consistency and quality on tasks with clear references. Stochastic methods favor diversity and exploration where many answers are acceptable. Choosing the right decoding aligns outputs with task goals.
•Evaluating generated text is hard because there is often no single right answer. Human evaluation remains the gold standard, with criteria like fluency, coherence, relevance, and interestingness. But it is costly and slow, so we also use automatic metrics. Each metric captures a narrow slice of quality and has limitations.
•Perplexity measures how well a model predicts data and is tied to its training loss. It is easy to compute but does not reflect real-world generation quality. BLEU and ROUGE compare overlap with reference texts and are useful for translation and summarization benchmarks. METEOR, BERTScore, and FID offer different trade-offs but still cannot fully replace humans.
•The lecture closes with two practical Q&As: why automatic metrics like BLEU cannot replace human evaluation, and how to choose beam size. Automatic metrics measure surface similarity and can be sensitive to specific references, missing deeper meaning and coherence. Beam size is a trade-off between quality and computational cost; start small (5–10) and scale until gains flatten. Longer target lengths push costs up fast, often requiring smaller beams.
02 Key Concepts
01
Conditional Generation: Definition: Producing an output sequence based on an input sequence, like translation, summarization, or question answering. Analogy: It’s like answering a question on a test using the information given in the question. Technical: We model p(y|x) and decode y token by token conditioned on both x and previously generated tokens. Why it matters: Many real tasks demand the output be tightly tied to the input, so decoding must prioritize accuracy and consistency. Example: Translating “The cat sat on the mat” into French as “Le chat est assis sur le tapis.”
02
Open-Ended Generation: Definition: Generating text with minimal or no input prompt. Analogy: It’s like freestyle storytelling that starts from “Once upon a time” or even from nothing. Technical: We model p(y) or p(y|short prompt) and decode without a strong conditioning x. Why it matters: Some applications value creativity and variety over a single correct answer. Example: Continuing “Once upon a time,” the model writes “there was a princess who lived in a faraway land.”
03
Classification with LMs: Definition: Predicting a single label (e.g., positive/negative) for an input using a language model. Analogy: It’s like sorting mail into bins labeled “Sports,” “Politics,” or “Tech.” Technical: Instead of forming a long sequence, we map input x to a discrete class y, often via token-based logits or a small label vocabulary. Why it matters: Many practical tasks (sentiment, topic) need compact decisions rather than long outputs. Example: Labeling “This movie was amazing!” as Positive.
04
Decoding: Definition: The process of choosing a sequence y that maximizes p(y|x) under a language model. Analogy: It’s like finding the best path through a maze by looking at hints at each junction. Technical: Exact argmax over all sequences is intractable (V^n growth), so we use approximate search methods. Why it matters: The decoding strategy heavily shapes output quality, diversity, and speed. Example: Completing “The cat sat on the …” into a fluent sentence with a chosen algorithm.
05
Greedy Decoding: Definition: At each step, choose the single highest-probability next token. Analogy: It’s like always taking the biggest cookie in a jar without checking what comes next. Technical: y_t = argmax_v p(v|x, y_<t), one forward pass per token; simple and fast. Why it matters: Greedy can get stuck in local optima and early mistakes can’t be fixed later. Example: “The weather is nice, let’s go to the …” picking “restaurant” over “park,” which shifts the entire sentence direction.
06
Local Optimum (in Decoding): Definition: A choice that looks best right now but leads to a worse final sequence. Analogy: Climbing a hill that seems tallest nearby but isn’t the highest mountain overall. Technical: Greedy maximizes immediate token probability, not total sequence likelihood, so it may block better future continuations. Why it matters: Early missteps can lock outputs into suboptimal narratives. Example: Choosing “restaurant” prevents later generating “…park,” even if the overall sentence would be better.
07
Beam Search: Definition: Keep the top-k partial sequences at each step, expanding and pruning iteratively. Analogy: It’s like exploring several promising paths in a maze at once instead of just one. Technical: Maintain a beam of size k; at each step, extend each sequence with possible next tokens, retaining the k highest-scoring sequences. Why it matters: It reduces the risk of bad early decisions and improves quality at higher compute cost. Example: With k=2, keep both “…mat” and “…sofa,” then continue from each and choose the best overall.
08
Beam Size: Definition: The number k of parallel candidates maintained in beam search. Analogy: It’s like the number of lanes you keep open when merging traffic—more lanes, more options. Technical: Larger k increases search coverage and quality but scales compute roughly linearly per step. Why it matters: Too small misses good paths; too big is slow and offers diminishing returns. Example: Starting with k=5 and increasing to k=10 until further improvements flatten out.
09
Length Normalization: Definition: Adjusting scores so beam search doesn’t over-prefer shorter sequences. Analogy: It’s like giving long essays a fair shot even though they have more words that could go wrong. Technical: Divide log-probability by length^alpha (alpha controls strength), balancing short vs. long sequences. Why it matters: Without it, the product of many probabilities favors shorter outputs unfairly. Example: Summaries stop being clipped too early when alpha is set around 0.6–1.0.
10
Sampling: Definition: Randomly drawing the next token according to the model’s probability distribution. Analogy: It’s like rolling a weighted die where heavier sides represent more likely words. Technical: y_t ~ Multinomial(p(v|x,y_<t)); introduces diversity across runs. Why it matters: It can produce varied, creative outputs but risks incoherence from low-probability picks. Example: Different story continuations from the same prompt on repeated runs.
11
Temperature: Definition: A control knob to make sampling more random or more conservative. Analogy: It’s like turning a spice dial—higher heat spreads choices, lower heat focuses them. Technical: Adjust logits by 1/T before softmax; high T flattens, low T sharpens the distribution. Why it matters: Lets you balance creativity and reliability for your task. Example: T=0.01 approximates greedy; T=100 makes choices nearly random.
12
Top-k Sampling: Definition: Restrict sampling to the k most probable tokens, renormalize, then sample. Analogy: It’s like picking a snack from the top five favorites instead of the entire pantry. Technical: Zero out probabilities beyond rank k; sample within the truncated set. Why it matters: Cuts out unlikely words that derail coherence, while keeping diversity. Example: With k=5, only the five most likely tokens are considered for the next step.
13
Nucleus (Top-p) Sampling: Definition: Sample from the smallest set of tokens whose cumulative probability exceeds p. Analogy: It’s like filling a bag with candies until you reach 90% of total sweetness, then picking from that bag. Technical: Sort by probability, include tokens until sum ≥ p, renormalize, then sample. Why it matters: Adapts to peaked vs. flat distributions for better balance of coherence and variety. Example: With p=0.9, sometimes 3 tokens suffice; other times, 10 are needed.
14
Choosing Decoding by Task: Definition: Matching the decoding method to the problem’s needs. Analogy: It’s like choosing shoes—running shoes for sprints, hiking boots for trails. Technical: Use beam search for tasks with clear references and correctness (translation/summarization); use sampling (top-p) for creative tasks needing diversity. Why it matters: The same model behaves very differently under different decoders, affecting output quality. Example: A summarizer with beam search is concise and stable; a chatbot with top-p feels more natural.
15
Evaluation Problem: Definition: Measuring generation quality when many valid answers exist. Analogy: It’s like grading a poem—there’s no single right way to write it. Technical: Human evaluation is most reliable but costly; automatic metrics provide proxies with limitations. Why it matters: Poor evaluation misleads model selection and deployment decisions. Example: A text with high n-gram overlap may still be incoherent or off-topic.
16
Human Evaluation: Definition: People rate outputs on fluency, coherence, relevance, and interestingness. Analogy: It’s like a panel of judges scoring performances on multiple criteria. Technical: Design rubrics, collect ratings, and compute inter-rater agreement; expensive but captures meaning. Why it matters: Humans understand nuance that metrics miss, making this the gold standard. Example: Reviewers flag a high-BLEU translation that is fluent but subtly wrong in meaning.
17
Perplexity: Definition: A measure of how well the model predicts data, tied to cross-entropy loss. Analogy: It’s like how surprised you are by each next word—the less surprised, the better. Technical: Perplexity = exp(average negative log-likelihood); lower is better for modeling. Why it matters: Good for model training and validation, but not sufficient for generation quality. Example: A model with low perplexity can still produce dull or inconsistent outputs when decoding poorly.
18
BLEU: Definition: An automatic metric for translation based on n-gram precision against references. Analogy: It’s like counting how many small word chunks match a known good answer. Technical: Computes clipped n-gram overlap with a brevity penalty; higher is better. Why it matters: Standardized and easy to compute, but measures surface similarity, not meaning. Example: Two different valid translations may score differently depending on reference phrasing.
19
ROUGE: Definition: An automatic metric emphasizing recall of n-grams against references, common in summarization. Analogy: It’s like checking how much of the important stuff you remembered from the article. Technical: ROUGE-N, ROUGE-L measure overlap and longest common subsequence; higher recall means more coverage. Why it matters: Encourages capturing key points but may reward verbosity or redundancy. Example: A summary that includes many reference phrases scores high even if repetitive.
20
METEOR: Definition: A translation metric that accounts for stemming and synonyms beyond exact matches. Analogy: It’s like recognizing that “run” and “running” are the same idea, or “car” and “automobile” are related. Technical: Aligns hypothesis to reference using exact, stem, synonym matches; computes precision/recall with penalties. Why it matters: More semantically forgiving than BLEU, often correlating better with humans. Example: A translation using a synonym gets credit that BLEU might miss.
21
BERTScore: Definition: A semantic similarity metric using contextual embeddings to compare candidate and reference. Analogy: It’s like comparing meanings rather than exact words by looking at context. Technical: Uses pretrained models (e.g., BERT) to compute token-level similarity and F1-like scores. Why it matters: Captures meaning beyond surface overlap, helpful when many paraphrases are valid. Example: Two differently worded but semantically equivalent sentences score high.
22
FID (Fréchet Inception Distance) for Text: Definition: A distributional distance adapted from image generation evaluation. Analogy: It’s like comparing the shapes of two clouds rather than matching droplets. Technical: Embed real and generated samples, fit Gaussians, and compute the Fréchet distance; lower is better. Why it matters: Evaluates overall distribution similarity, not just one-to-one matches. Example: A generator whose outputs resemble the corpus statistics achieves lower FID.
23
Limits of Automatic Metrics: Definition: Reasons metrics can mislead and why humans are still needed. Analogy: It’s like judging a story just by word overlap, missing plot and tone. Technical: Metrics often measure surface forms, depend on specific references, and miss coherence and factuality. Why it matters: Over-optimizing for a metric can worsen real quality. Example: A high-BLEU output that is fluent but semantically wrong.
24
Choosing Beam Size: Definition: Practical guidance for setting k in beam search. Analogy: It’s like opening more lanes on a highway—traffic flows better but costs more to build. Technical: Larger k improves search but increases compute per step; returns diminish beyond a point. Why it matters: Balancing quality and speed is essential in production. Example: Start at k=5, increase to k=10, and stop if metrics and human ratings plateau.
03 Technical Details
Overall Architecture/Structure
Problem Setup
We have a trained language model that predicts the probability of the next token given prior context (and possibly an input x). In conditional generation, we model p(y|x) as a product over steps: p(y|x) = Π_t p(y_t | x, y_<t). In open-ended generation, the conditioning x is minimal or absent (perhaps a short prompt). In classification, we map x to a discrete label y, often using a small output label vocabulary or mapping logits to classes. In inference, our job is to decode: construct y one token at a time using these probabilities.
Why Exact Search Is Infeasible
To find the most probable sequence, one might try y* = argmax_y p(y|x). But enumerating all sequences is impossible in practice: with vocabulary size V and desired length n, there are V^n sequences. Even for moderate V (e.g., 30k) and n (e.g., 20), this number is astronomical. Hence, we use approximate algorithms that build sequences incrementally and prune aggressively.
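To make that scale concrete, a quick back-of-the-envelope calculation using the illustrative figures from the text (V = 30k, n = 20):

```python
# Number of candidate sequences an exhaustive search would have to score: V^n
V = 30_000   # vocabulary size (illustrative figure from the text)
n = 20       # target sequence length (illustrative)
num_sequences = V ** n

# Roughly 10^89 sequences -- far beyond anything that could be enumerated.
print(f"~10^{len(str(num_sequences)) - 1} candidate sequences")
```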
Decoding as Incremental Decision-Making
Each decoding step uses the model to compute a probability distribution over the vocabulary given current context. We then choose the next token via a strategy: deterministic (greedy or beam) or stochastic (sampling variants). Data flow: input x + generated prefix → model forward pass → logits → probabilities (softmax) → next token decision → append token → repeat until stop (EOS token, max length, or task-specific stop).
Code/Implementation Details
There is no single code listing, but the algorithms follow standard skeletons. Below are clear step-by-step descriptions you can directly translate into code in any language.
Greedy Decoding
Inputs: model, input x (optional), max_length, end-of-sequence token (EOS).
State: generated sequence y (initially empty or prompt), step t = 1.
Loop:
Compute probabilities p(v | x, y_<t) with a forward pass.
Choose v* = argmax_v p(v | ...).
Append v* to y.
If v* = EOS or length hits max_length, stop; else t = t+1.
Output: y. Pros: minimal compute (one forward pass per token). Cons: vulnerable to local optima, lacks diversity.
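The loop above translates directly into Python. The `next_token_probs` function stands in for a real model forward pass; the table below is a hypothetical toy "model" over a tiny vocabulary, used only to make the sketch runnable:

```python
def greedy_decode(next_token_probs, prompt, eos, max_length):
    """Greedy decoding: pick the argmax token at every step."""
    y = list(prompt)
    while len(y) < max_length:
        probs = next_token_probs(y)          # model forward pass (stub)
        v_star = max(probs, key=probs.get)   # highest-probability next token
        y.append(v_star)
        if v_star == eos:
            break
    return y

# Hypothetical toy "model": next-token distribution keyed by the last token.
TABLE = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "cat":   {"<eos>": 0.9, "sat": 0.1},
    "dog":   {"<eos>": 1.0},
    "a":     {"dog": 0.7, "<eos>": 0.3},
}

def toy_model(prefix):
    return TABLE[prefix[-1]]

print(greedy_decode(toy_model, ["<bos>"], "<eos>", 10))
# -> ['<bos>', 'the', 'cat', '<eos>']
```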
Beam Search
Inputs: model, input x (optional), max_length, EOS, beam size k.
State: a list (beam) of k partial sequences with scores. Scores are typically cumulative log-probabilities.
Initialization: beam = [{seq: prompt, score: 0}].
Loop each step:
For each sequence in beam, run a forward pass to get p(v) for next tokens.
For each sequence, generate candidate extensions with each token v and new_score = score + log p(v).
Pool all candidates (size up to k * V) and select the top k by adjusted score:
If using length normalization, adjust candidate score: score’ = new_score / (length^alpha).
Replace beam with k best candidates.
If all candidates end with EOS or max_length reached, stop.
Output: Best sequence in beam by (possibly length-normalized) score. Pros: explores multiple paths; better global quality. Cons: k more forward passes per step; still approximate.
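A minimal sketch of the beam loop, again using a hypothetical toy next-token table in place of a model forward pass. The table is constructed so that greedy's first pick ("a") leads to a worse overall sentence, which a beam of size 2 escapes:

```python
import math

def beam_search(next_token_probs, prompt, eos, k=2, max_length=10, alpha=0.0):
    """Beam search: keep the k best partial sequences by cumulative log-prob."""
    beam = [(list(prompt), 0.0)]              # (sequence, sum of log-probs)
    for _ in range(max_length):
        candidates = []
        for seq, score in beam:
            if seq[-1] == eos:                # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for v, p in next_token_probs(seq).items():
                candidates.append((seq + [v], score + math.log(p)))
        # Optional length normalization: divide by len^alpha (alpha=0 disables).
        def adjusted(cand):
            seq, score = cand
            return score / (len(seq) ** alpha) if alpha > 0 else score
        beam = sorted(candidates, key=adjusted, reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beam):
            break
    return beam[0][0]

# Hypothetical toy "model" where the locally best first token is a trap.
TABLE = {
    "<bos>": {"a": 0.55, "the": 0.45},
    "a":     {"dog": 0.7, "<eos>": 0.3},
    "the":   {"cat": 1.0},
    "cat":   {"<eos>": 1.0},
    "dog":   {"<eos>": 1.0},
}

def toy_model(prefix):
    return TABLE[prefix[-1]]

print(beam_search(toy_model, ["<bos>"], "<eos>", k=1))  # greedy-like: "a dog"
print(beam_search(toy_model, ["<bos>"], "<eos>", k=2))  # better path: "the cat"
```

With k=1 the search collapses to greedy and commits to "a"; with k=2 the hypothesis starting with "the" survives long enough to win on total log-probability.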
Length Normalization Details
Without normalization, longer sequences multiply more probabilities (or add more negative log-probabilities), biasing toward shorter outputs. Length normalization adjusts by dividing by length^alpha (or using alternative normalizers), with alpha ∈ [0,1] common. Alpha = 0 means no normalization; alpha = 1 applies full normalization. In practice, tuning alpha prevents overly short generations in tasks like summarization.
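A quick numeric illustration of the bias and the fix, using made-up cumulative log-probabilities for a short and a long hypothesis:

```python
# Made-up cumulative log-probabilities (sums of per-token log-probs).
short_logp, short_len = -4.0, 4    # e.g., a clipped summary
long_logp,  long_len  = -7.0, 10   # a complete summary

# Raw scores favor the short sequence simply because it sums fewer terms.
assert short_logp > long_logp

# Length-normalized scores (divide by length^alpha) can reverse the ranking.
alpha = 0.8
short_norm = short_logp / short_len ** alpha
long_norm  = long_logp / long_len ** alpha
print(short_norm, long_norm)   # the long sequence now scores higher
```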
Sampling (Basic)
Inputs: model, input x, max_length, EOS, temperature T.
Loop each step:
Forward pass to get logits over vocabulary.
Adjust logits by temperature: logits’ = logits / T. Lower T (<1) sharpens; higher T (>1) flattens.
Convert to probabilities with softmax.
Sample next token from Multinomial(probabilities).
Append token; stop at EOS or max_length.
Pros: generates diverse outputs; multiple unique samples possible. Cons: may sample unlikely tokens and drift into incoherence when T is high.
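The temperature adjustment and the sampling step can be sketched with the standard library alone (the logits here are made up):

```python
import math
import random

def sample_with_temperature(logits, T, rng=random):
    """Scale logits by 1/T, softmax, then draw one index from the result."""
    scaled = [l / T for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs

logits = [2.0, 1.0, 0.1]                      # made-up logits for 3 tokens
_, sharp = sample_with_temperature(logits, T=0.5)
_, flat  = sample_with_temperature(logits, T=2.0)
print(sharp)  # low T concentrates probability mass on the top token
print(flat)   # high T spreads it out across the vocabulary
```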
Top-k Sampling
Steps are the same as basic sampling, except before sampling:
Sort tokens by probability.
Keep only the top k; set others to zero.
Renormalize the remaining probabilities.
Sample from the truncated set.
Trade-off: With k small, outputs become conservative; with k large, risk of noise returns. Useful to avoid very unlikely tokens entirely.
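The truncation step as a pure-Python sketch (the distribution values are made up):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens, drop the rest, renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

probs = {"park": 0.40, "store": 0.30, "beach": 0.15,
         "moon": 0.10, "xylophone": 0.05}
filtered = top_k_filter(probs, k=3)
print(filtered)   # only the 3 likeliest tokens remain, renormalized to sum to 1

# Sample the next token from the truncated, renormalized set.
token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```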
Nucleus (Top-p) Sampling
Differs in how the candidate set is chosen:
Sort tokens by probability (descending).
Accumulate probabilities until the sum ≥ p (e.g., 0.9).
Keep just that minimal set; zero out the rest.
Renormalize and sample.
Advantage: The size of the candidate set automatically adapts to distribution shape—few tokens if peaked, many if flat. Often produces coherent yet varied text.
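The adaptive candidate set is easy to see in code. With made-up peaked and flat distributions, the same p=0.9 threshold keeps 2 tokens in one case and 8 in the other:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# A peaked distribution reaches p=0.9 with few tokens ...
peaked = {"the": 0.85, "a": 0.08, "this": 0.04, "that": 0.03}
# ... a flat one needs many.
flat = {t: 0.125 for t in "abcdefgh"}

print(len(top_p_filter(peaked, 0.9)))  # 2 tokens kept
print(len(top_p_filter(flat, 0.9)))    # 8 tokens kept
```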
Classification with Language Models
While generation outputs sequences, classification outputs a single class. Implementation options include:
Constrain decoding to a small label vocabulary (e.g., tokens for “Positive”/“Negative”) and pick the highest probability label token.
Map the model’s internal representation to a classifier head (common in fine-tuned setups) and choose the argmax class.
In prompt-based setups, one can compute p(label token | prompt + input) and choose the maximum.
Tools/Libraries Used
The lecture stays framework-agnostic, but typical implementations use PyTorch or TensorFlow for model inference, and Hugging Face Transformers for ready-made decoding utilities. These frameworks expose parameters like temperature, top_k, top_p, num_beams, length_penalty (alpha), max_length, and early_stopping. Install via package managers (pip install transformers torch). The core ideas apply regardless of framework.
Step-by-Step Implementation Guide
A) Choosing a Decoder for Your Task
Step 1: Identify task type.
• Conditional with clear references (translation/summarization): prefer beam search.
• Open-ended/creative (story, chat): prefer top-p sampling (optionally with temperature and top-k).
• Classification: constrain to label tokens or use a classifier head.
Step 3: Iterate.
• Adjust beam size until quality plateaus and compute is acceptable.
• Tune top_p and temperature to balance coherence and creativity.
B) Implementing Beam Search (Pseudo-Workflow)
Input: source text x; model; tokenizer; num_beams=5; length_penalty=0.7; max_length=128.
Encode x; initialize beam with [([BOS], score=0)].
For t in 1..max_length:
• For each sequence in beam, compute next-token logits given x and sequence.
• Convert logits to log-probabilities; form candidates by appending each token.
• Compute adjusted scores = (sum log-probs) / (length^alpha) if using normalization.
• Keep top num_beams candidates overall.
• If all candidates end with EOS, break.
Decode the best (length-normalized) sequence to text.
C) Implementing Top-p Sampling (Pseudo-Workflow)
Encode prompt; set generated sequence to prompt tokens.
For t in 1..max_length:
• Forward pass to obtain logits.
• logits’ = logits / temperature; probs = softmax(logits’).
• Sort tokens by probs; accumulate until sum ≥ top_p; truncate; renormalize.
• Sample one token; append.
• If EOS, break.
Decode tokens to text.
D) Classification via Label Tokens
Define a small label vocabulary (e.g., {“Positive”, “Negative”}).
Build a prompt: "Review: <text>\nSentiment:".
Compute p(label_token | prompt) for each label token.
Choose the label with the highest probability.
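The scoring step can be sketched as follows. The `label_token_probs` callable is a hypothetical stand-in for querying the model with the prompt; a real system would read these probabilities off the model's logits for each label token:

```python
def classify(label_token_probs, labels):
    """Pick the label whose token the model assigns the highest probability."""
    scores = {label: label_token_probs(label) for label in labels}
    return max(scores, key=scores.get), scores

# Hypothetical model output for the prompt
# "Review: This movie was amazing!\nSentiment:"
FAKE_PROBS = {"Positive": 0.91, "Negative": 0.09}

label, scores = classify(FAKE_PROBS.get, ["Positive", "Negative"])
print(label)   # Positive
```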
Tips and Warnings
Beware of Length Bias
In beam search without length normalization, shorter sequences often win. Always consider a length penalty for tasks where complete answers are needed (e.g., summaries). Too high a penalty can inflate length and cause verbosity—tune alpha.
Compute Cost Scales Fast
Beam search scales with num_beams per decoding step; long outputs multiply costs. For long-form generation, limit beam size or consider sampling to keep latency manageable.
Diversity vs. Coherence Trade-off
High temperature or large top_p encourages creativity but risks derailment. For production systems, start conservative (e.g., top_p=0.9, temperature=0.7) and relax only if outputs feel repetitive.
Early Mistakes Snowball
Greedy decisions are irreversible. If your task punishes early errors (e.g., translation), avoid greedy unless speed is the top priority.
Stop Criteria Matter
Always set a sensible max_length and an EOS token. For conditional tasks, add task-specific stop conditions (e.g., end of sentence) to prevent run-on outputs.
Metric Blind Spots
Perplexity correlates with training fit, not necessarily with generation usefulness. BLEU/ROUGE can reward overlap and miss meaning; complement them with human evaluation.
Beam Size Tuning
Start with 5; compare with 8 and 10; stop if quality gains flatten. For very long outputs, prefer smaller beams to control cost.
Sampling Guardrails
Combine temperature with top-k or top-p to avoid very low-probability tokens. If outputs become too random, lower temperature or reduce top_p.
Reproducibility
Sampling-based methods need fixed random seeds for reproducible testing. Deterministic decoders produce the same output for identical inputs and settings.
Task-Decoder Alignment
If a single best answer exists and references are available, deterministic methods likely score better on benchmarks. If many answers are acceptable (dialogue, brainstorming), sampling gives a better user experience.
Evaluation Details
Human Evaluation
Design rubrics covering fluency (natural wording), coherence (logical flow), relevance (on-topic), and interestingness (engagement). Use multiple raters and compute agreement (e.g., Cohen’s kappa) to ensure reliability. Consider pairwise comparisons (A/B testing) when relative judgments are easier than absolute scores.
Perplexity
Computed as exp of average negative log-likelihood across tokens in a corpus. Lower perplexity suggests the model assigns higher probabilities to ground-truth data. However, decoding choices can still yield poor generations even from a low-perplexity model. Use perplexity primarily for training monitoring and model selection, not as a sole measure of user-facing quality.
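The definition in one function, applied to hypothetical per-token probabilities. The result equals the inverse geometric mean of the probabilities the model assigned to the ground-truth tokens.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up model probabilities for each ground-truth token in a sentence.
probs = [0.5, 0.25, 0.8, 0.1]
print(round(perplexity(probs), 3))  # → 3.162
```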
BLEU
Calculates n-gram precision with clipping and applies a brevity penalty to avoid too-short outputs. Works best with multiple reference translations to mitigate sensitivity to phrasing. Known limitations include lack of semantic understanding and poor handling of synonyms and paraphrases.
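A simplified single-reference BLEU (up to bigrams) showing clipping and the brevity penalty. Production implementations (e.g., sacreBLEU) add smoothing, standardized tokenization, and use up to 4-grams.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Clipped n-gram precisions, geometric mean, brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipping
        precisions.append(overlap / max(sum(cand.values()), 1))
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    if not all(precisions):
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(round(bleu(cand, ref), 3))  # → 0.707
```

Clipping stops a candidate from earning credit for repeating a reference word more often than the reference contains it; the brevity penalty counteracts precision's bias toward very short outputs.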
ROUGE
Common variants: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence). Emphasizes recall, rewarding coverage of reference content—which can encourage overly long outputs if unchecked. Combine with precision-oriented checks and human review.
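ROUGE-L reduces to a longest-common-subsequence computation; a compact sketch reporting both recall (against the reference) and precision (against the candidate):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else \
                max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L recall and precision based on LCS with the reference."""
    lcs = lcs_length(candidate, reference)
    return {"recall": lcs / len(reference), "precision": lcs / len(candidate)}

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(rouge_l(cand, ref))
```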
METEOR, BERTScore, FID
METEOR aligns words using stems and synonyms, often correlating better with human judgments than BLEU. BERTScore uses contextual embeddings to score semantic similarity and is better at recognizing paraphrases. FID (Fréchet Inception Distance) compares distributions of embeddings rather than one-to-one matches; it originated in image generation, requires a robust embedding space, and is more computationally demanding.
Putting It Together
For conditional tasks with references, report BLEU/ROUGE/METEOR and perform a targeted human evaluation sample. For open-ended tasks, rely more on human evaluation and consider semantic metrics like BERTScore; distributional checks like FID can complement but are not standard. Always interpret metrics within their limits and validate conclusions with human judgments before deployment.
04 Examples
💡
Translation (Conditional Generation): Input: “The cat sat on the mat.” Output: “Le chat est assis sur le tapis.” Processing: The model conditions on the English sentence, computes p(y|x), and decodes token by token to form the French sentence. Key point: Output depends directly on the input; beam search often improves accuracy and fluency.
💡
Summarization (Conditional Generation): Input: A long news article. Output: A short summary that captures the main points. Processing: The model reads the document, predicts summary tokens step by step, and ideally benefits from beam search with length normalization to avoid overly short summaries. Key point: Conditional generation with a clear reference summary is well-suited to deterministic decoding.
💡
Question Answering (Conditional Generation): Input: “What is the capital of France?” Output: “Paris.” Processing: The model conditions on the question and produces a concise, single-token (or few-token) answer with high probability. Key point: Even though the output can be short, it is still decoded via sequence prediction tied to the input.
💡
Open-Ended Story Start: Input: Prompt “Once upon a time”. Output: A creative continuation like “there was a princess who lived in a faraway land.” Processing: The model samples the next tokens, potentially using nucleus sampling with top_p=0.9 and temperature=0.8 to balance novelty and coherence. Key point: Many valid continuations exist—diversity matters more than a single best answer.
💡
Greedy Next-Word Example: Input prefix: “The cat sat on the”. Processing: The model computes probabilities over the vocabulary; greedy picks the most likely next token, e.g., “mat”. Output: “The cat sat on the mat.” Key point: Greedy is simple and fast but may fail in more ambiguous contexts.
💡
Greedy Pitfall (Local Optimum): Input prefix: “The weather is nice, let’s go to the”. Processing: Greedy chooses “restaurant” because it is slightly more common in data than “park” at that moment. Output drifts: “The weather is nice, let’s go to the restaurant for dinner.” Key point: A locally high-probability choice can misdirect the entire sentence.
💡
Beam Search with k=2: Input prefix: “The cat sat on the”. Step 1: Keep both “…mat” and “…sofa” candidates. Step 2: Extend each; e.g., “…mat .” and “…sofa again”. Select top 2 by cumulative (possibly length-normalized) scores. Key point: Maintaining multiple candidates avoids committing too early and can find better overall sequences.
💡
Length Normalization Example: Task: Summarization where the model tends to output very short summaries. Processing: Apply length penalty with alpha around 0.7 to adjust scores for longer candidates. Result: The beam search no longer over-prefers ultra-short outputs; summaries include enough detail. Key point: Fixes systematic shortness bias in beam search.
💡
Temperature Extremes: Setup: Sampling with different temperatures. High T=100 flattens probabilities so choices are almost random; Low T=0.01 makes choices near-deterministic. Output: High T produces diverse but often incoherent text; low T yields safe, repetitive text. Key point: Temperature is a dial between creativity and control.
💡
Top-k Sampling with k=5: Setup: Sort tokens by probability and keep the top five. Processing: Renormalize those five probabilities and sample next token. Output: Avoids rare, off-topic tokens while retaining some randomness. Key point: Simple control over diversity with a fixed-size candidate set.
💡
Nucleus Sampling with p=0.9: Setup: Sort tokens, accumulate probabilities until reaching 0.9, then sample from this set. Case 1: If the top three tokens already sum to 0.92, only three tokens are considered. Case 2: If probabilities are flat, many tokens are included. Key point: Adaptive candidate sets tailor diversity to the model’s current confidence.
💡
Evaluation with BLEU in Translation: Input: A set of English sentences, model-generated French outputs, and human reference translations. Processing: Compute n-gram overlaps, apply brevity penalty, and average BLEU across the dataset. Output: A BLEU score that reflects surface-level similarity to references. Key point: Useful for benchmarks but cannot by itself ensure semantic correctness.
💡
ROUGE for Summarization: Input: Articles, model-generated summaries, and human-written reference summaries. Processing: Compute ROUGE-N and ROUGE-L to measure recall and LCS overlap. Output: Scores indicating how much of the reference content appears in the system summary. Key point: Encourages coverage but may reward redundant phrasing.
💡
Human Evaluation Panel: Setup: Judges rate fluency, coherence, relevance, and interestingness on a Likert scale. Processing: Collect multiple ratings per sample and compute average scores and inter-rater agreement. Output: Reliable judgments of real quality aspects that metrics miss. Key point: The most trustworthy evaluation, though expensive to run.
💡
Beam Size Tuning Experiment: Setup: Run beam search with k=3, 5, 8, and 10 on a validation set. Processing: Observe improvements in BLEU/ROUGE and human ratings versus increased latency. Output: Gains diminish after k≈8 for this task; choose k=8 as a balance. Key point: Quality vs. compute is a practical trade-off that depends on task length and constraints.
05 Conclusion
Inference is the phase where a trained language model becomes useful—by generating sequences or producing labels for real tasks. The lecture began by framing three task types—conditional generation, open-ended generation, and classification—each with distinct needs and evaluation styles. Because exact search for the most likely sequence is intractable, we rely on approximate decoding. Greedy decoding is fastest but risks local optima; beam search explores multiple candidates and benefits from length normalization; sampling methods inject diversity, with temperature controlling randomness; top-k trims unlikely tokens, while nucleus (top-p) adapts to the distribution and often yields the best balance for creative tasks.
Evaluation remains challenging: no single metric captures all qualities of good text. Human evaluation is the gold standard, scoring fluency, coherence, relevance, and interestingness, but it is costly. Automatic metrics provide useful signals within limits: perplexity reflects predictive fit, BLEU and ROUGE measure n-gram overlap for translation and summarization, METEOR and BERTScore account more for meaning, and FID compares distributions. Use metrics judiciously and always validate with human judgments for critical applications.
To practice, pair decoders with tasks: try beam search (k≈5–10) and length normalization for translation and summarization; try nucleus sampling (p≈0.9) with moderate temperature (≈0.7–1.0) for chat or storytelling. Run small ablations to tune beam size, top-p, temperature, and max lengths, and track both automatic scores and a small, well-designed human evaluation. A good next step is to implement each decoding method on a single prompt and compare outputs and latency, then scale to a dataset and analyze metric trends.
Going forward, deepen your understanding of advanced decoding controls (e.g., better length penalties), richer human evaluation protocols, and semantic metrics that align more closely with human judgments. Remember the core message: the decoder you choose and how you evaluate it fundamentally shape the behavior and perceived quality of your model. Always align decoding with task goals and interpret metrics with care. With these tools, you can confidently deploy language models that are both effective and appropriate for their intended use.
✓Beware of early mistakes. Greedy or high-temperature sampling can make irreversible choices that snowball. If early accuracy is crucial, use beam search or reduce randomness. Check first-token choices carefully in analysis.
✓Measure with multiple metrics and humans. Perplexity is not a generation-quality metric; BLEU/ROUGE capture overlap but not meaning. Combine automatic metrics with human evaluation to avoid metric gaming. Use small but well-designed human studies often.
✓Tune hyperparameters systematically. Change one setting at a time (beam size, top_p, temperature) and log effects on quality and speed. Use validation sets and A/B testing for realistic comparisons. This makes improvements trustworthy and reproducible.
✓Balance coherence and diversity consciously. More randomness is not always better; too little can be dull. Set a target style for your application, then tune toward it. Keep user experience as the guiding principle.
✓Consider cost from the start. Beam search increases compute per token, and long outputs amplify this. For latency-sensitive apps, prefer smaller beams or sampling setups. Profile and budget before scaling.
✓Design evaluation rubrics that mirror user needs. Rate fluency, coherence, relevance, and interestingness according to the task. Calibrate raters and measure agreement. This ensures reliable signals for iteration.
✓Sanity-check automatic metrics. Look at examples that score high but read poorly, and vice versa. Understand your metric’s blind spots (e.g., paraphrasing issues for BLEU). Prevent overfitting to any single score.
✓Document decoding settings for reproducibility. Record temperature, top_k, top_p, beam size, length penalty, and seeds. This allows consistent comparisons across experiments. It also eases debugging and handoffs.
✓Use short validation prompts to prototype quickly. Compare decoders on the same small set and inspect outputs qualitatively. Once you see promising settings, scale to a larger test set. This saves time while steering you toward good defaults.
Decoding
The strategy for choosing each next token to form an output sequence. Because searching all possibilities is impossible, decoding uses heuristics or sampling. It can be deterministic (always the same output) or stochastic (varied outputs). The chosen method shapes coherence and diversity. Good decoding aligns with task goals.
Vocabulary (V)
The set of all tokens the model can output. Each token can be a word, subword, or character, depending on tokenization. A larger vocabulary offers more expression but makes each step’s choice larger. The size V affects computational cost. It also influences how many sequences are possible.
Token
A basic unit of text used by the model, like a word or subword. Text is split into tokens by a tokenizer before being processed. The model predicts one token at a time during decoding. Tokens are mapped to numeric IDs. Sequences are token lists.
Argmax
The choice with the highest value in a set. In decoding, argmax picks the token with the highest probability. This is used by greedy decoding at each step. It is simple and fast. But it can ignore future consequences.