Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Preference Models
Key Summary
- Decision trees are models that turn feature values into decisions like Yes/No. You follow a path from the root question down through branches until you reach a leaf decision. They are great for problems where the target is discrete, like classifying outcomes. They also tolerate noisy data and can work even when some feature values are missing.
- The lecture uses a classic golf example with four features: Outlook, Temperature, Humidity, and Wind. By looking at 14 past days labeled Yes/No for playing golf, the goal is to learn rules that predict the decision for a new day. The learned tree first splits on Outlook, then uses Humidity for Sunny days and Wind for Rainy days. Overcast days always predict Yes.
- Learning a tree follows a simple recursive algorithm. If all examples in the current set are positive (all Yes), make a leaf labeled Yes. If all are negative, make a leaf labeled No. Otherwise, choose the best attribute to split on and repeat for each branch.
- The key question is how to choose the best attribute at each step. The lecture uses Information Gain, a measure from information theory. The attribute with the highest Information Gain separates the labels (Yes/No) the best. This makes the data in the children more "pure."
- Entropy measures the impurity of a set of examples. If a set is all one class, entropy is 0 (no uncertainty). If it is half Yes and half No, entropy is 1 (maximum uncertainty for two classes). Information Gain is the parent's entropy minus the weighted average entropy of the child splits.
- In the golf example, the overall entropy of the 14 examples is 0.94. Splitting on Outlook produces three groups: Sunny, Overcast, and Rainy. Their entropies are 0.971, 0, and 0.971, respectively. This yields an Information Gain of 0.246 for Outlook.
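The entropy and gain numbers above can be checked with a few lines of Python. This is a minimal sketch (the function name is ours); the Yes/No counts per Outlook value come from the lecture's figures:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with pos/neg examples; 0*log2(0) is treated as 0."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # skip p = 0 to avoid log2(0)
            h -= p * log2(p)
    return h

parent = entropy(9, 5)                 # 14 days: 9 Yes, 5 No -> ~0.940
# Outlook groups: Sunny (2 Yes, 3 No), Overcast (4, 0), Rainy (3, 2)
children = [(2, 3), (4, 0), (3, 2)]
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in children)
gain = parent - weighted               # ~0.2467; reported as 0.246 in the lecture
print(round(parent, 3), round(gain, 3))
```

The tiny discrepancy (0.247 vs. 0.246) comes only from rounding intermediate entropies.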
Why This Lecture Matters
Understanding decision tree learning is valuable because it blends strong predictive power with human-friendly explanations. For product managers, analysts, and domain experts, trees turn messy data into clear if-then rules you can discuss with stakeholders. Engineers and data scientists gain a reproducible, principled method (entropy and information gain) to choose the best splits, along with practical tools to control overfitting and handle continuous attributes. These skills translate directly to building real systems that must operate under noisy data, missing values, and limited measurement budgets. In many industries, including healthcare, finance, operations, and education, interpretability is as important as accuracy. Decision trees shine here because each root-to-leaf path is a readable policy. As regulations tighten and model transparency becomes essential, trees (and their ensembles) offer a strong balance between clarity and performance. The lecture's hands-on, numeric walk-through demystifies the math, so you can calculate gains by hand, verify choices, and debug models logically. Practically, the lecture's method lets you start from a small labeled dataset, compute entropy and information gain, and iteratively grow a tree that handles both discrete and continuous features. With pruning guided by a validation set, you can reduce complexity while improving performance on new data. This knowledge prepares you to build robust classifiers, present understandable rules to teams, and make cost-aware decisions when feature measurement is expensive. In a landscape enamored with black-box models, mastering decision trees equips you with a trustworthy, interpretable baseline and a gateway to more advanced methods like random forests and gradient boosting.
Lecture Summary
Overview
This lecture teaches the complete, practical method for learning decision trees: models that transform feature values into clear, discrete decisions like Yes/No. You will learn how a decision tree grows by asking the best question at each step and how to measure what "best" means using information theory. The lecture explains entropy as a measure of uncertainty in your labels and shows how information gain tells you which attribute separates the data most cleanly. A classic, concrete example, deciding whether to play golf, grounds the math with real counts and exact numerical values.
The session begins with what decision trees are, when to use them, and why they are particularly good for discrete target functions, conjunctions (ANDs) and disjunctions (ORs) of attributes, and even noisy or partially missing data. You then see a full end-to-end decision-making path: Outlook at the root; if Sunny, check Humidity; if Rainy, check Wind; and Overcast always leads to Yes. From there, the lecture steps back and asks the crucial question: how did we learn that tree from examples? The answer is a greedy recursive algorithm that stops when a node becomes all positive or all negative, and otherwise picks the attribute with the largest information gain to split the data.
Next, the lecture dives into the core math. Entropy is introduced as a formal way to measure impurity: zero if all labels in a set are the same, and maximal (for two classes) when the set is half positive and half negative. Using this, information gain is defined as the parent node's entropy minus the weighted average entropy of the children after a split. You compute this for each attribute and choose the one with the largest gain. The golf dataset's 14-day history is used to show exact entropy values and the resulting gains: Outlook's 0.246 beats Temperature's 0.029, Humidity's 0.151, and Wind's 0.048, so Outlook is placed at the root.
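These per-attribute gains can be reproduced directly. The rows below are the standard 14-day golf table as it usually appears in textbooks; treat that as an assumption (the lecture's table may list rows differently), but it yields the same gains up to rounding:

```python
from collections import defaultdict
from math import log2

# (Outlook, Temperature, Humidity, Wind, Play) -- standard 14-day golf table
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, col):
    """Parent entropy minus the weighted entropy of the children."""
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for col, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(data, col), 3))  # Outlook wins with ~0.247
```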
With the splitting rule established, the lecture explains that the process repeats inside each branch until the tree is complete. It emphasizes that the method is greedy: it optimizes the immediate step rather than planning the whole tree in advance. While this makes the algorithm fast and effective, it can also lead to overfitting, where the tree gets too specific to the training data. To control this, two main strategies are offered: early stopping (stopping when data is pure or splits no longer help) and post-pruning (first grow, then remove branches that hurt validation performance). The lecture notes that post-pruning often works well because it uses a separate validation set to judge which branches are worth keeping.
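The greedy recursion can be sketched in a few functions. This is an illustrative ID3-style sketch, assuming rows are tuples with the label last; pruning is left out:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[-1])
    rem = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - rem

def grow(rows, attrs):
    """Greedy recursion: leaf on a pure node, else split on the best gain."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:              # all Yes or all No -> leaf
        return labels[0]
    if not attrs:                          # nothing left to ask -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    groups = defaultdict(list)
    for r in rows:
        groups[r[best]].append(r)
    rest = [a for a in attrs if a != best]
    return (best, {v: grow(subset, rest) for v, subset in groups.items()})

# Toy rows: (Outlook, Humidity, Play)
rows = [("Sunny", "High", "No"), ("Sunny", "Normal", "Yes"),
        ("Overcast", "High", "Yes"), ("Rainy", "High", "Yes")]
tree = grow(rows, [0, 1])                  # splits on Outlook (index 0) first
```

Each internal node is a `(attribute_index, branches)` tuple and each leaf a label string, so a root-to-leaf path reads off as an if-then rule.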
The lecture then addresses practical complications you will meet in real data. Many attributes are continuous (like a temperature in degrees), but decision trees need clear branching rules. Two solutions are given: discretize values into bins (e.g., cool/mild/hot) or automatically learn a cut-point by sorting values, finding places where labels change, and testing midpoints as thresholds. The best threshold is the one that gives the highest information gain. Another real-world factor is measurement cost. If some attributes are expensive to gather, you can still rely on information gain to score splits, but then bias your choice toward cheaper attributes when performance is similar.
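The threshold procedure for continuous attributes can be sketched as follows. The temperature values are the standard textbook illustration and are an assumption here; the lecture's exact numbers may differ:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Sort by value, test midpoints where the label changes,
    and keep the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = entropy([lab for _, lab in pairs])
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        if pairs[i][1] != pairs[i + 1][1]:           # label changes here
            t = (pairs[i][0] + pairs[i + 1][0]) / 2  # candidate midpoint
            left = [lab for v, lab in pairs if v <= t]
            right = [lab for v, lab in pairs if v > t]
            g = parent - (len(left) / n * entropy(left)
                          + len(right) / n * entropy(right))
            if g > best_gain:
                best_gain, best_t = g, t
    return best_gain, best_t

temps = [40, 48, 60, 72, 80, 90]
plays = ["No", "No", "Yes", "Yes", "Yes", "No"]
g, t = best_threshold(temps, plays)   # splits at 54.0 with gain ~0.459
```

Only midpoints at label changes need testing, because a threshold between two same-label neighbors can never give a higher gain.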
By the end, you will be able to: compute entropy and information gain; pick the best attribute to split; grow a decision tree recursively; recognize and reduce overfitting with pruning; and handle continuous attributes and attribute costs in a principled way. This makes you capable of building practical, interpretable classifiers that work well on a wide range of tasks. The overall structure starts with an intuitive introduction and example, moves into the math of entropy and information gain, continues through the recursive algorithm, and closes with advanced considerations like overfitting, pruning, continuous splits, and cost-sensitive decisions. The lecture is self-contained and equips you to implement the method and understand the trade-offs involved.
Key Takeaways
- Start with entropy to measure uncertainty and purity. Compute it carefully, treating 0 log 0 as 0 to avoid numerical issues. Use entropy values to build intuition: 0 for pure, near 1 for very mixed (binary case). This sets the stage for choosing splits that truly simplify the problem.
- Use information gain to pick the best split at each node. Calculate weighted child entropies and subtract from the parent's entropy. Compare gains across attributes and select the highest. This consistent rule drives the entire learning process.
- Work the example numbers end-to-end when learning: it builds trust. In the golf dataset, confirm Entropy(S)=0.94 and Outlook's Gain=0.246. Check the other attributes' gains to see why they lose. Doing the math helps cement understanding and catches mistakes.
- Embrace the greedy nature: it is a feature, not a bug. The algorithm is fast because it avoids planning entire trees at once. Greedy splits usually work well in practice. Reserve complexity for pruning decisions, not split selection.
- Stop splitting when nodes are pure or gains are tiny. Small, impure leaves often lead to overfitting. Consider minimum samples per leaf to avoid fragile rules. Stopping rules keep trees interpretable.
- Prefer post-pruning over only early stopping when possible. Grow a near-full tree and prune bottom-up using a validation set. Keep prunes that improve or maintain validation accuracy. This typically yields better generalization.
- Handle continuous features with thresholds at split time for precision. Sort values, find label-change points, and try midpoints as candidates. Pick the threshold with the highest gain. This often outperforms coarse binning.
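The post-pruning takeaway can be sketched as reduced-error pruning. This is a minimal illustration, assuming trees are nested `(attribute_index, {value: subtree})` tuples with string leaves; the toy data and names are ours:

```python
from collections import Counter

def predict(tree, row):
    """Leaves are label strings; internal nodes are (attr_index, branches)."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, r) == r[-1] for r in rows) / len(rows)

def prune(tree, train_rows, val_rows):
    """Bottom-up: replace a subtree with its majority-label leaf whenever
    the leaf does at least as well on the validation rows reaching it."""
    if not isinstance(tree, tuple):
        return tree
    attr, branches = tree
    new_branches = {
        v: prune(sub,
                 [r for r in train_rows if r[attr] == v],
                 [r for r in val_rows if r[attr] == v])
        for v, sub in branches.items()
    }
    pruned = (attr, new_branches)
    if not val_rows:
        return pruned
    majority = Counter(r[-1] for r in train_rows).most_common(1)[0][0]
    leaf_acc = sum(r[-1] == majority for r in val_rows) / len(val_rows)
    if leaf_acc >= accuracy(pruned, val_rows):   # leaf no worse -> prune
        return majority
    return pruned

# Hypothetical overfit tree: Outlook (index 0), then Humidity (index 1)
tree = (0, {"Sunny": (1, {"High": "No", "Normal": "Yes"}), "Rainy": "No"})
train = [("Sunny", "High", "No"), ("Sunny", "Normal", "Yes"),
         ("Sunny", "Normal", "Yes"), ("Rainy", "High", "No"),
         ("Rainy", "Normal", "No")]
val = [("Sunny", "High", "Yes"), ("Sunny", "Normal", "Yes"),
       ("Rainy", "High", "No"), ("Rainy", "Normal", "No")]
simpler = prune(tree, train, val)   # the Humidity split under Sunny goes away
```

On this toy validation set, the Humidity test under Sunny turns out to be noise, so pruning it improves validation accuracy while shrinking the tree.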
Glossary
Decision tree
A decision tree is a model that makes choices by asking a series of questions about the data's features. Each question splits the data into smaller groups. When you reach the end of a path, you get a final answer like Yes or No. It is simple to understand and explain because it looks like a flowchart. It works best when the goal is to pick from a few categories.
Attribute (Feature)
An attribute is a property or characteristic you can measure about an example. It can be a word like Sunny or a number like 75. The tree's questions test these attributes to split the data. Good attributes make the groups more pure. Without helpful attributes, the tree cannot separate the classes well.
Label (Target)
A label is the final answer we want the model to predict. For classification, it comes from a small set like Yes/No. The tree learns rules that map attributes to labels. Accurate labels during training are crucial for learning good splits. If labels are wrong, the tree can be misled.
Discrete target function
This is a rule that outputs one of a few categories instead of a number. Decision trees are designed to learn such rules. They split the data until each leaf mostly contains one category. This makes the final prediction a simple category choice. It's different from predicting a continuous number.
