Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning
Key Summary
- Decision trees are models that make predictions by asking a series of yes/no questions about features, like a flowchart. You start at a root question, follow branches based on answers, and end at a leaf that gives the prediction. This simple structure makes them easy to read and explain to anyone.
- A tree chooses which question to ask by using information gain, which measures how much a split reduces confusion (entropy) in the data. Entropy represents disorder: mixed labels are high entropy, and pure groups are low entropy. The best split is the one that makes the child groups as pure as possible.
- There are two main types of trees: classification trees for categories (like click vs. no click) and regression trees for numbers (like house price). Classification leaves output a class, while regression leaves output a number. The same splitting idea applies, but the prediction type differs.
- Trees learn by recursive partitioning: they keep splitting data into smaller groups that are more and more uniform. At each step, the algorithm tests all features and chooses the split that gives the most information gain. This repeats until stopping rules say to stop.
- Key parts of a tree include the root node (first question), decision nodes (middle questions), branches (answers), and leaf nodes (final decisions). Each path from root to leaf is like a rule you can read, such as 'If vision is not 20/20, predict glasses = yes.' This makes trees naturally interpretable.
- Trees can handle both numbers and categories, and they do not require scaling features like some other models. They can also model non-linear relationships because different regions of the feature space can follow different rules. This flexibility often makes trees strong baseline models.
Why This Lecture Matters
Decision trees matter because they combine strong practical performance with rare clarity. For product managers, analysts, and data scientists who must explain model behavior, trees provide human-readable rules that build trust with stakeholders and regulators. In healthcare and finance, where accountability and auditability are crucial, being able to show the exact path to a decision supports compliance and ethical review. Trees reduce engineering effort by handling both numeric and categorical data without heavy preprocessing, which speeds up iteration cycles and delivery.

This knowledge solves real problems like early model prototyping, quick diagnostics of data issues, and clear communication with non-technical teams. It helps you choose between a single interpretable tree and more powerful ensembles when accuracy and robustness are required. Random forests and gradient boosting, built on tree foundations, often lead industry benchmarks across many tasks, so understanding core tree logic prepares you to use these advanced methods well. For career development, mastering decision trees gives you a reliable, explainable tool you can apply on day one, and it opens the door to the leading ensemble approaches used in top machine learning solutions today.

In today's industry, where transparency and risk management matter as much as raw accuracy, trees offer a balanced path. They deliver actionable insights, support fair and accountable AI, and scale up through ensembles for competitive performance. Whether you are building a simple rule-based system or a state-of-the-art boosted model, the decision tree is a foundational skill you will use again and again.
Lecture Summary
Overview
This lecture teaches decision trees, one of the most practical and understandable models in machine learning. A decision tree predicts an outcome by asking a sequence of simple questions about input features, like walking through a flowchart. You start at the root node, follow branches based on yes/no answers or threshold checks, and finish at a leaf node that outputs the prediction. The big idea is to choose questions that best separate the data into groups that are as pure as possible. Purity means that the examples in a group mostly share the same label, which makes the final decision more certain.
The lecture explains how trees choose the best questions using a concept called information gain, which is based on entropy. Entropy measures disorder: a mixed group of labels has high entropy, while a group where nearly all labels agree has low entropy. Information gain is the reduction in entropy after splitting on a feature. The model compares all possible splits and chooses the one that most reduces entropy, leading to more homogeneous child groups. A clear classroom example shows that splitting students by 'submitting assignments' better separates pass vs. fail than splitting by 'attending lectures.'
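The entropy and information-gain arithmetic above can be sketched directly. The following is a minimal illustration with hypothetical classroom numbers (the lecture's example names the features but not the data, so the pass/fail counts here are invented for demonstration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children` subsets."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical class of 8 students: 1 = pass, 0 = fail (maximally mixed).
students = [1, 1, 1, 1, 0, 0, 0, 0]

# Split by "submitted assignments": nearly pure children -> high gain.
submitted     = [1, 1, 1, 1, 0]
not_submitted = [0, 0, 0]
gain_submit = information_gain(students, [submitted, not_submitted])

# Split by "attended lectures": children still 50/50 mixed -> zero gain.
attended     = [1, 1, 0, 0]
not_attended = [1, 1, 0, 0]
gain_attend = information_gain(students, [attended, not_attended])

print(f"gain(submit) = {gain_submit:.3f}")  # clearly larger
print(f"gain(attend) = {gain_attend:.3f}")  # 0.000
```

With these made-up counts, splitting on assignment submission yields a large entropy reduction while splitting on attendance yields none, which is exactly why the tree would ask about assignments first.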
Two kinds of trees are covered: classification trees for categorical outcomes (like yes/no or types) and regression trees for continuous outcomes (like prices). The learning process, called recursive partitioning, keeps splitting the data into smaller subsets that are more uniform. This continues until stopping criteria are met, such as a maximum depth or a minimum number of samples in a leaf. These rules help control the complexity of the tree and reduce overfitting.
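Recursive partitioning with stopping criteria looks like this in practice. A minimal sketch using scikit-learn (the lecture does not prescribe a library; the dataset and hyperparameter values here are illustrative choices, not from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stopping criteria keep the recursion from growing an overly deep tree.
tree = DecisionTreeClassifier(
    criterion="entropy",   # choose splits by information gain
    max_depth=3,           # stop after three levels of questions
    min_samples_leaf=5,    # no leaf may hold fewer than 5 samples
    random_state=0,
)
tree.fit(X_train, y_train)

print(export_text(tree))  # the learned if-else rules, human-readable
print("test accuracy:", tree.score(X_test, y_test))
```

`export_text` prints the tree as indented rules, which is the interpretability benefit the lecture emphasizes: every prediction can be traced along one root-to-leaf path.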
The lecture highlights strong advantages of trees: they are highly interpretable, handle both numerical and categorical data, and naturally capture non-linear relationships without feature scaling. However, it also addresses notable weaknesses: trees can overfit, can become complex and deep, and can be sensitive to small changes in data, producing different structures each time. To address these, pruning and constraints are recommended. Pruning removes weak branches that do not improve accuracy, and constraints limit depth or leaf size to keep the model simpler and more generalizable.
Beyond single trees, the lecture introduces ensemble methods that build on them. Random forests train many trees on different random samples and features, then average their predictions, making the final model more stable and accurate. Gradient boosting builds trees one after another, with each new tree focusing on fixing the errors the earlier trees made. Both methods often outperform a single tree when accuracy and robustness are priorities.
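The single-tree vs. ensemble comparison can be demonstrated in a few lines. A sketch using scikit-learn on synthetic data (library, dataset, and estimator counts are illustrative choices, not specified by the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One tree vs. many trees averaged vs. trees built sequentially on errors.
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
boost  = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

for name, model in [("single tree", single),
                    ("random forest", forest),
                    ("gradient boosting", boost)]:
    print(f"{name:18s} test accuracy: {model.score(X_te, y_te):.3f}")
```

On most runs of data like this, both ensembles beat the lone tree, matching the lecture's claim that averaging (forests) and sequential error-fixing (boosting) improve stability and accuracy.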
The target audience includes beginners and practitioners who want a clear and practical model they can explain to teammates or stakeholders. You should be comfortable with basic machine learning ideas like features and labels, but you do not need to know advanced math. After this lecture, you will be able to describe the structure of a decision tree, explain entropy and information gain in simple terms, understand the training process, and apply methods to prevent overfitting. You will also know when to choose a decision tree and when to use ensemble variants such as random forests or gradient boosting.
The lecture is structured from basic definitions and parts of a tree, to how trees choose splits, to learning via recursive partitioning, to advantages and disadvantages, and finally to practical fixes and ensemble improvements. Real-world examples (vision and glasses, student performance, ad clicks, house prices, healthcare, loans, and marketing) ground the theory in familiar contexts. The focus throughout is on clarity, interpretability, and practical decision-making about when and how to use tree-based models.
Key Takeaways
- Start with a single decision tree to learn your data. Its structure reveals which features matter and how they interact. Use the readable paths to spot odd thresholds or mislabeled samples. This early visibility speeds up both modeling and data cleaning.
- Always compare intuitive splits with measured information gain. What feels important may not reduce entropy much. Let the gain guide your choices to create purer child nodes. This discipline produces stronger trees.
- Control complexity from the start with constraints. Set a reasonable max depth and minimum samples per leaf to avoid tiny, brittle leaves. Adjust based on validation results, not just training accuracy. Guardrails prevent overfitting surprises.
- Use pruning to simplify after training. Cut branches that do not help on validation data. A simpler tree often performs better on new data. Clarity and generalization tend to rise together.
- Expect single trees to be unstable. Small data changes can flip top splits and alter many branches. Do not rely on a lone tree when decisions must be consistent. Consider ensembles for stability.
- Choose random forests for a strong, robust default. They reduce variance by averaging many diverse trees. You get better accuracy without heavy tuning. Feature importance still gives interpretability at a high level.
- Choose gradient boosting for maximum accuracy with care. It improves predictions step by step by fixing residual errors. Tune depth, learning rate, and number of trees to avoid overfitting. Monitor validation metrics closely.
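The last takeaway, tuning boosting while watching validation metrics, can be sketched with scikit-learn's built-in early stopping (the specific `n_iter_no_change` mechanism and all hyperparameter values here are illustrative assumptions, not lecture-prescribed settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# The key knobs: tree depth, learning rate, and number of boosting rounds.
# Early stopping halts training when an internal validation split plateaus.
gb = GradientBoostingClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=500,        # upper bound on boosting rounds
    validation_fraction=0.2, # held out internally for monitoring
    n_iter_no_change=10,     # stop after 10 rounds without improvement
    random_state=1,
)
gb.fit(X_tr, y_tr)

print("boosting rounds actually used:", gb.n_estimators_)
print("test accuracy:", round(gb.score(X_te, y_te), 3))
```

Letting the validation curve, rather than a fixed round count, decide when to stop is one practical way to follow the takeaway's advice of monitoring validation metrics closely.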
Glossary
Decision Tree
A model that predicts outcomes by asking a series of simple questions about input features. You start at the top, follow branches based on answers, and end at a leaf that gives the final prediction. It works like a flowchart made of if-else rules. It is easy to read and explain to others. Trees can handle both categories and numbers.
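The "flowchart made of if-else rules" can be written out literally. A toy, hand-written tree for the lecture's glasses example (the features, thresholds, and labels below are invented for illustration, not learned from data):

```python
def predict_glasses(vision_score: float, age: int) -> str:
    """A hand-built decision tree: each if is a node, each return a leaf."""
    if vision_score < 1.0:           # root node: "Is vision below 20/20?"
        return "glasses"             # leaf
    if age > 45:                     # decision node on a second feature
        return "reading glasses"     # leaf
    return "no glasses"              # leaf

print(predict_glasses(0.8, 30))
print(predict_glasses(1.0, 50))
print(predict_glasses(1.2, 20))
```

A trained tree is the same structure, except the algorithm, not a human, chooses which questions to ask and where to place the thresholds.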
Root Node
The very first question asked by the tree. It is chosen because it best separates the data into clearer groups. All training examples start here before being split. The root strongly shapes the rest of the tree. A good root leads to simpler downstream decisions.
Decision Node
A node inside the tree where a question is asked about a feature. Based on the answer, you move to one of the child nodes. Each decision node reduces confusion step by step. These nodes form the internal structure of the model. They are repeated until a leaf is reached.
Branch
A path that connects nodes and represents the outcome of a question. For yes/no questions, there are usually two branches. Following branches is how the model narrows down the prediction. Each branch leads to a region with more similar examples. Together, they form the tree shape.
