Stanford CS230 | Autumn 2025 | Lecture 6: AI Project Strategy
Key Summary
- This lecture explains decision trees as simple, rule-based models for classification and regression. A decision tree splits data by asking yes/no questions about features until the remaining data in a group is mostly one label (pure). Each leaf makes a final prediction by majority vote for classification or by average value for regression. Trees are powerful, easy to understand, and highly interpretable.
- Building a decision tree means choosing the best feature and split point at each step to reduce impurity. Impurity measures how mixed the classes are inside a node; two common measures are Gini impurity and entropy. The best split is the one with the highest information gain, which is the parent impurity minus the weighted impurity of the children. Splitting continues until stopping rules say to stop to avoid overfitting.
- Gini impurity is 1 minus the sum of squared class probabilities, and it is zero when all samples in the node belong to one class. Entropy is the negative sum of p times log2 p for each class, and it also equals zero when a node is perfectly pure. When a node has a 50-50 class mix, both Gini and entropy are high, signaling disorder. Lower impurity means the model is more confident about labels in that node.
- Information gain measures how much impurity drops after a split. You compute the parent's impurity, compute child impurities, take their size-weighted average, and subtract that from the parent. The larger the information gain, the better the split. This process is repeated recursively to grow the tree.
- Decision trees can easily overfit because they keep splitting to fit every tiny pattern, including noise. To prevent this, you limit depth, require a minimum number of samples to split, or require a minimum number of samples per leaf. These are called stopping criteria and act like brakes for tree growth. Good stopping rules make trees generalize better to new data.
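The impurity measures and information gain described above fit in a few lines of Python. This is a minimal sketch with illustrative function names (not from any particular library), matching the formulas in the summary:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: negative sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted average of child impurities."""
    n = len(parent)
    weighted = sum(len(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

# A pure node has zero impurity; a 50-50 mix maximizes it.
print(gini(["a", "a", "a", "a"]))     # → 0.0
print(gini(["a", "a", "b", "b"]))     # → 0.5
print(entropy(["a", "a", "b", "b"]))  # → 1.0
# A split that produces two pure children recovers all of the parent impurity.
print(information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))  # → 0.5
```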
Why This Lecture Matters
Decision trees and random forests are core tools for anyone working with structured, tabular data—data scientists, ML engineers, analysts, and product teams. They offer a rare mix of power and practicality: trees provide clear, explainable rules, and forests deliver robust, high-accuracy predictions without heavy feature engineering. Knowing how impurity, information gain, and stopping criteria work lets you diagnose overfitting, bias toward high-cardinality features, and instability. Understanding bootstrapping and feature subsampling explains why random forests outperform single trees and guides you in tuning the number of trees and feature subset sizes.

This knowledge directly solves common problems: choosing a baseline model that’s fast to build and explain; improving performance with ensembles; and handling missing data in a way that preserves predictive power. It helps you ship real models quickly—trees for transparent decisions (like policy or compliance) and forests for production accuracy (like risk scoring or churn prediction).

Mastery of these methods strengthens your ML toolkit and prepares you to evaluate and select among other ensemble approaches like boosting and stacking. In today’s industry, random forests remain competitive on many tabular tasks and are strong baselines even when deep learning is popular elsewhere. They scale reasonably, work with mixed data types, and provide feature importance for insight. Learning these fundamentals sets you up for better model selection, clearer communication with stakeholders, and stronger, more trustworthy deployments.
Lecture Summary
Overview
This lecture teaches the fundamentals of decision trees and random forests, two of the most widely used supervised learning methods. Supervised learning means we train a model on labeled examples so that it can predict labels for new inputs. Decision trees build a set of simple if-then rules to split data into smaller and purer groups, eventually ending at leaves that make predictions. Random forests take this a step further by combining many trees trained on different slices of data and features to produce a stronger, more stable model.
The lecture starts with the big picture: most supervised learning algorithms aim to minimize a loss function, which measures how wrong predictions are. While algorithms differ in how they search for the best model, their core goal is the same. Decision trees are highlighted as being interpretable, easy to visualize, and usable for both classification (predicting categories) and regression (predicting numbers). Their structure is intuitive: nodes ask questions about feature values, edges route data according to answers, and leaves store final predictions.
You learn how to construct a decision tree by repeatedly choosing the best split. The best split is defined using impurity measures like Gini impurity or entropy, which quantify how mixed the classes are in a node. The split that maximizes information gain (the drop in impurity from parent to children) is chosen. This greedy process is applied recursively until stopping criteria signal that the model should stop growing to avoid overfitting. Stopping can be enforced by limits on tree depth or minimum samples required for splits and leaves.
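The greedy split search described above can be sketched directly: try every candidate (feature, threshold) pair and keep the one with the largest information gain. This is an illustrative, from-scratch sketch for numeric features (function names are ours, not a library's):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Greedy search: return (feature index, threshold, gain) for the split
    with the largest drop from the parent's impurity."""
    parent = gini(y)
    best = (None, None, 0.0)
    n = len(y)
    for j in range(len(X[0])):                 # every feature
        for t in sorted({row[j] for row in X}):  # every observed threshold
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue  # degenerate split, skip
            gain = parent - (len(left) / n * gini(left)
                             + len(right) / n * gini(right))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Feature 0 separates the classes perfectly at threshold 1; feature 1 is useless.
X = [[1, 5], [1, 7], [3, 5], [3, 7]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))  # → (0, 1, 0.5)
```

A full tree would call this recursively on the left and right subsets, stopping when max depth is reached, a node is too small, or no split improves impurity.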
Random forests are introduced as an ensemble method—an approach that combines multiple models to improve performance. Each tree in a random forest is trained on a bootstrapped sample (sampling with replacement) of the training data. Additionally, at each node, a random subset of features is considered for splitting. These two sources of randomness decorrelate trees so that their errors don’t line up. Final predictions are aggregated: majority vote for classification or averaging for regression.
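The two sources of randomness in a forest are easy to make concrete. Below is a minimal sketch (our own helper names, not a library API) of bootstrapping, per-split feature subsampling with the common sqrt(M) heuristic, and majority-vote aggregation:

```python
import random
from collections import Counter
from math import isqrt

random.seed(0)  # fixed seed so the sketch is reproducible

def bootstrap_sample(X, y):
    """Sample n rows with replacement: the 'bagging' part of a forest."""
    n = len(y)
    idx = [random.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features):
    """At each split, consider only ~sqrt(M) of the M features."""
    k = max(1, isqrt(n_features))
    return random.sample(range(n_features), k)

def majority_vote(predictions):
    """Aggregate the trees' class votes into one final prediction."""
    return Counter(predictions).most_common(1)[0][0]

print(len(random_feature_subset(16)))          # → 4 (sqrt of 16 features)
print(majority_vote(["spam", "ham", "spam"]))  # → spam
```

Because each tree sees a different bootstrap sample and a different feature subset at every split, the trees make different mistakes, and the vote (or average, for regression) cancels much of that noise.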
The lecture explains why random forests are so effective: they reduce variance, which is the tendency of a model to change a lot in response to small changes in the training data. A single tree is a high-variance model and can be unstable. Averaging many high-variance models dramatically stabilizes predictions without increasing bias too much. As a result, random forests usually generalize better than a single tree and are more robust to overfitting.
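The variance-reduction effect is easy to simulate. In this toy sketch each "tree" is just the true value plus independent noise; averaging 100 of them shrinks variance by roughly a factor of 100. (Real trees share training data, so their errors are correlated and the reduction is smaller — which is exactly why forests work to decorrelate them.)

```python
import random

random.seed(42)

# Each simulated "tree" predicts the true value 10.0 plus independent noise.
def noisy_prediction():
    return 10.0 + random.gauss(0, 2.0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Variance of a single high-variance model vs. an average of 100 of them.
single = [noisy_prediction() for _ in range(2000)]
ensemble = [sum(noisy_prediction() for _ in range(100)) / 100
            for _ in range(2000)]

print(variance(single))    # ≈ 4    (sigma^2)
print(variance(ensemble))  # ≈ 0.04 (sigma^2 / 100 for independent errors)
```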
You also learn pros and cons. Decision trees are interpretable, handle both numeric and categorical features, and are easy to implement. But they can overfit, be biased toward features with many levels, and be sensitive to small data changes. Random forests are more accurate and robust and offer feature importance measures, but they are less interpretable, more computationally expensive, and require careful hyperparameter tuning, such as selecting the number of trees.
The lecture covers practical questions: how to pick the number of trees (start with hundreds and grow until validation performance plateaus), other ensemble methods (gradient boosting, AdaBoost, bagging without feature subsampling, and stacking across different model types), and how to handle missing data (impute, add a missing-value branch, or skip the feature, depending on implementation). These details anchor the theory in everyday modeling choices.
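Of the missing-data strategies mentioned, imputation is the most common and the simplest to show. Here is a minimal sketch of mean imputation for one numeric column, with `None` standing in for a missing entry (the helper name is ours):

```python
def impute_column(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_column([1.0, None, 3.0, None]))  # → [1.0, 2.0, 3.0, 2.0]
```

Some tree implementations instead route missing values down a dedicated branch or use surrogate splits; which option is available depends on the library.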
By the end, you understand the mechanics and motivations behind decision trees and random forests. You can explain how impurity, information gain, and stopping rules guide tree growth; how bootstrapping and feature subsampling build diverse forests; and how averaging reduces variance. You also gain practical wisdom about interpretability trade-offs, hyperparameters, and data issues like missing values. The lecture equips you to confidently build, tune, and evaluate trees and forests for both classification and regression tasks.
Key Takeaways
- ✓ Start with a decision tree to get a transparent baseline. Keep max_depth modest and set min_samples_split and min_samples_leaf to prevent overfitting. Inspect the learned splits to ensure they make domain sense. Use this interpretability to spot data issues or leakage early.
- ✓ Choose a split criterion that fits your needs: Gini impurity is fast and common; entropy offers an information-theoretic view. Compute information gain as parent impurity minus the weighted child impurities. Always prefer the split with the largest gain. Validate that splits generalize by checking performance on held-out data.
- ✓ Apply stopping criteria aggressively to avoid overfitting. Limit max_depth to a sensible range and ensure leaves have enough samples. Stop splitting when impurity cannot be meaningfully reduced. This keeps trees simple, stable, and faster to train and predict.
- ✓ Use random forests when accuracy and robustness are more important than a single-model explanation. Train with bootstrapping and random feature subsampling to create diverse trees. Aggregate predictions by voting or averaging to reduce variance. Expect better generalization than a single tree.
- ✓ Tune the number of trees by watching for performance plateaus. Start with a few hundred trees and increase until validation metrics stabilize. Stop adding trees when improvements flatten to save compute. Record accuracy vs. n_estimators to justify your choice.
- ✓ Set max_features thoughtfully to balance diversity and strength. Smaller subsets increase tree diversity but might miss strong features; larger subsets can reduce ensemble gains. Use common defaults like sqrt(M) for classification as a starting point. Adjust based on validation results.
Glossary
Supervised learning
A type of machine learning where the model learns from examples that include inputs and the correct outputs (labels). The goal is to map from inputs to outputs so the model can predict well on new data. It’s like practicing with answer keys to learn how to answer similar questions later. Decision trees and random forests are both supervised learners.
Decision tree
A model that makes predictions by asking a series of yes/no questions about feature values. Each question splits the data into smaller groups that are more uniform. You keep splitting until you reach a point where a simple prediction can be made. The final answers live at the leaf nodes.
Node
A point in the decision tree where a question is asked or a prediction is made. Internal nodes ask a question to split data, while leaf nodes give the final prediction. Nodes help organize the decision-making process in steps. The root node is the very first node at the top of the tree.
Edge
A connector between nodes that represents the result of a question. Following an edge means taking the path that matches the yes/no (or condition) answer. Edges guide the flow from one question to the next. They eventually lead to a leaf node.
