Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Introduction
Key Summary (Beginner)
- •This lecture introduces decision trees, a classic machine learning method that makes choices by asking a series of questions, each with a yes/no answer or a small set of possible values. Each internal node is a test on a feature (like 'Is it Friday or Saturday?'), each branch is one outcome of that test, and each leaf is a predicted label. You classify by starting at the root, following the test outcomes, and stopping at a leaf.
- •The instructor uses a restaurant example: deciding whether to wait for a table. Features include alternates, bar, Friday/Saturday, hungry, patrons (none/some/full), price, rain, reservation, and type. A hand-designed tree shows how to use tests in order to reach a wait/don't-wait decision.
- •Learning a decision tree from data means choosing the 'best' feature to split on at each step and recursing. 'Best' is defined using Information Gain, a measure of how much splitting reduces uncertainty (entropy). We repeat splitting until data are pure, attributes run out, or a stopping rule is hit.
- •Entropy measures how 'mixed up' the labels are. If all examples have the same label, entropy is 0 (no uncertainty). If the labels are evenly split, entropy is high (for binary, it’s 1 bit). Entropy is computed as -Σ p_i log2 p_i over class probabilities p_i.
- •Information Gain compares the dataset's entropy before a split with the weighted average entropy after splitting on a feature. The higher the gain, the better that feature separates the classes. We pick the feature with the highest gain to split next.
- •In the restaurant dataset with 12 examples (6 wait, 6 don't wait), the overall entropy is 1. Splitting on 'patrons' yields subgroups with entropies 0, 0, and about 0.918, weighted by their sizes. The information gain for 'patrons' is about 0.541, making it the best root split.
- •We keep splitting: when a subset is perfectly pure (entropy 0), we stop and assign that class. If we run out of features or hit a stopping rule, we also stop and assign the majority class there. This recursive process builds the full tree.
- •Continuous attributes like weight or temperature can be handled by discretizing into bins or by splitting at a threshold. With thresholds, we try possible cut points (often midpoints between sorted unique values) and pick the one with the best information gain. This strategy turns a continuous value into a binary decision at that node.
- •Overfitting happens when the tree becomes too complex and memorizes noise in the training set. Overfit trees perform poorly on new data. Simpler trees usually generalize better, so we use pruning to reduce unnecessary branches.
- •Pre-pruning stops growth early, using rules like maximum depth or minimum samples at a leaf. Post-pruning grows the full tree, then removes branches that don’t help on a validation set. Reduced error pruning is one post-pruning approach.
- •Decision trees are easy to interpret: you can read the path of decisions to see why a prediction was made. They are powerful enough for many tasks and can handle both categorical and numerical features. They can be used for both classification and regression (predicting numbers instead of categories).
- •To implement from scratch: compute entropy of the dataset, evaluate information gain for each feature, pick the best, split the data, and recurse. Use stopping rules to avoid infinite growth, and then prune to improve generalization. Handle continuous features via discretization or best-threshold search.
Why This Lecture Matters
Decision trees are a cornerstone of machine learning because they balance accuracy with interpretability. For data analysts and product teams, trees provide transparent logic that can be communicated to non-technical stakeholders, making them ideal for decisions that must be explained or audited. In scenarios like customer support triage, credit risk screening, and simple medical rule systems, understanding why the model chose an outcome is as important as the outcome itself. Trees help identify which features drive predictions, exposing data quality issues, bias in features, or opportunities for new data collection. Technically, learning trees teaches you core machine learning ideas that transfer to many models: impurity measures, greedy splitting, validation, and overfitting control. You practice computing entropy and information gain, evaluating candidate splits, and designing stopping and pruning strategies. These skills translate into building more complex models later and diagnosing why they behave as they do. Even when you later use ensembles or other advanced methods, the decision logic you learn here remains relevant. In real projects, you can start with a decision tree to get a fast, interpretable baseline. It supports rapid iteration: you can quickly see which features matter and refine data collection or feature engineering. The knowledge from this lecture helps career development by giving you a practical, widely applicable tool and a solid understanding of model evaluation and simplification. In today’s industry where accountability and transparency are increasingly required, decision trees are especially valuable because they are inherently explainable.
Lecture Summary
01 Overview
This lecture teaches the fundamentals of decision trees, a widely used and very interpretable machine learning algorithm. A decision tree makes predictions by asking a sequence of feature-based questions at its internal nodes and using the answers to follow paths down to leaves, where class labels (for classification) or values (for regression) are assigned. The lecture walks through how trees are structured, how to classify an instance by traversing the tree, and—most importantly—how to learn a decision tree automatically from data rather than hand-designing it. The key mathematical ideas come from information theory: entropy is used to measure uncertainty in class labels, and information gain measures how much a feature split reduces that uncertainty. The lecture anchors these ideas in a concrete example: deciding whether to wait for a restaurant table based on features like whether there are alternates, whether it's Friday or Saturday, whether we're hungry, and especially the number of patrons present (none, some, or full).
The talk is aimed at learners who have basic familiarity with supervised learning—knowing that we have input features and a target label—and a willingness to learn a little bit of math notation. No deep prerequisites are required; the lecture defines entropy and information gain from scratch and shows exactly how to compute them. Even if you’ve never built a model before, the step-by-step approach is clear: choose the best feature using information gain, split the data, and recurse until you should stop. The lecture briefly mentions that trees can also do regression, but the focus is on classification and the entropy-based learning process.
By the end, you will be able to explain what a decision tree is, read and interpret a decision path, compute entropy for a dataset, compute information gain for potential splits, and decide the best feature to split on at each step. You’ll also understand how to handle continuous attributes—either by discretizing into bins or by choosing an optimal threshold—and how to control tree complexity to avoid overfitting through pruning. You’ll be able to follow a complete example end-to-end, including a worked calculation showing why splitting on 'patrons' is the best first move in the restaurant dataset.
The lecture is structured as follows. It starts with the intuition and structure of decision trees, using the restaurant example to show how a manually crafted tree works in practice. It then moves to learning trees automatically using entropy and information gain, defining these concepts clearly and giving numeric examples. Next, it shows how to choose splits for continuous features using discretization or thresholding—often by sorting values and testing midpoints. Finally, it explains overfitting and two main strategies to prevent it: pre-pruning (stopping early) and post-pruning (growing fully, then cutting back), including reduced error pruning on a validation set. The session closes by summarizing why decision trees remain useful: they are simple, powerful, and highly interpretable.
02 Key Concepts
- 01
🎯 Decision tree fundamentals: A decision tree is a flowchart-like model that makes predictions by testing features at internal nodes and outputting a label at leaves. 🏠 It’s like playing 20 Questions, where each answer leads you to the next question until you can decide. 🔧 Technically, each node applies a test on an attribute (e.g., 'patrons = full?'), each branch is a test outcome, and each leaf holds a class label or distribution. 💡 Without this structure, decisions would be opaque and hard to trace; the tree shows exactly how predictions are formed. 📝 In the restaurant example, you start at the root and follow tests like 'alternates?', 'hungry?', and 'Friday/Saturday?' to reach 'wait' or 'don’t wait'.
- 02
🎯 Classification with a tree: Classifying a new instance means walking from the root down following test results to a leaf. 🏠 Imagine a choose-your-own-adventure book where choices guide you to different endings. 🔧 For each node, you evaluate the node’s condition on the instance (e.g., 'bar = yes'), choose the matching branch, and continue until a leaf is reached. 💡 This step-by-step traversal makes predictions deterministic and reproducible. 📝 For a restaurant with alternates = no, you may immediately reach a 'don’t wait' leaf per the handcrafted tree.
- 03
🎯 Learning trees from data: Instead of writing the tree by hand, we can learn it automatically. 🏠 It’s like sorting your closet by the rule that best reduces mess each time. 🔧 At each step, compute a score (information gain) for each feature, pick the best one to split the dataset, and repeat the process recursively on each subset. 💡 Without a systematic criterion, splits might be arbitrary and lead to poor generalization. 📝 In the dataset of 12 restaurant visits, evaluating all features shows 'patrons' gives the largest gain, so it becomes the root split.
- 04
🎯 Entropy (uncertainty measure): Entropy quantifies how 'mixed' the labels are in a dataset. 🏠 It’s like a jar of marbles—if they’re all the same color, there’s no surprise; if colors are evenly mixed, you’re unsure what you’ll pull next. 🔧 Formally, entropy(S) = -Σ p_i log2 p_i over classes i, where p_i is the fraction of S in class i. 💡 With no uncertainty (all one class), entropy is 0; with maximum uncertainty in binary cases (50/50), entropy is 1. 📝 For six 'wait' and six 'don’t wait', entropy = -0.5 log2 0.5 - 0.5 log2 0.5 = 1.
- 05
🎯 Information gain (how much a split helps): Information gain measures how much a feature split reduces entropy. 🏠 Think of cleaning a mixed box of socks by separating them—if one divider sorts most socks correctly, it’s very helpful. 🔧 Gain(S, A) = entropy(S) − Σ_v (|S_v|/|S|) entropy(S_v), where S_v are subsets after splitting on A’s values v. 💡 We choose the feature with the highest gain to make the data in child nodes as pure as possible. 📝 In the restaurant dataset, 'patrons' yields gain ≈ 0.541, higher than the others, so it is selected first.
- 06
🎯 Restaurant dataset example: The dataset has 12 examples with features like alternates, bar, Friday/Saturday, hungry, patrons, price, rain, reservation, and type. 🏠 It’s a realistic dining decision with familiar cues like crowds and hunger. 🔧 Total entropy is 1 because of a 6/6 label split; splitting by patrons yields subsets: none (2 all negative, entropy 0), some (4 all positive, entropy 0), full (6 with 2 positive, 4 negative, entropy ≈ 0.918). 💡 The large drop in uncertainty shows patrons is the most informative first test. 📝 After splitting, the pure subsets become leaves; only the 'full' branch needs further splitting.
- 07
🎯 Recursive tree growth: After choosing the best split, repeat the same process on each child subset. 🏠 It’s like repeatedly sorting your closet sections until each shelf holds a single type of item. 🔧 Compute entropy and information gain within each subset, split again if needed, and stop when pure or when stopping conditions are triggered. 💡 Recursion ensures local choices keep improving label purity. 📝 In the 'patrons = full' subset, you test other features to continue reducing entropy.
- 08
🎯 Stopping criteria: We can stop splitting when a node is pure, when no features remain, or when we hit rules like max depth or minimum samples. 🏠 It’s like stopping a game of 20 Questions when you already know the answer or when no further useful questions remain. 🔧 Practically, stopping prevents infinite growth and controls complexity. 💡 Without stopping rules, the tree could grow unnecessarily and overfit noise. 📝 In our example, 'patrons = some' is immediately pure, so it becomes a leaf without further splits.
- 09
🎯 Handling continuous attributes: Some features are numeric and continuous, such as weight or temperature. 🏠 Picture drawing a line on a number line to separate 'low' from 'high'. 🔧 Two main strategies: discretize into bins (e.g., <100, 100–200, >200) or choose a threshold and split into less-than vs greater-than. 💡 Without this step, trees couldn’t use rich numerical information effectively. 📝 A threshold at 150 pounds creates two groups; we try possible thresholds and choose the one with highest information gain.
- 10
🎯 Threshold selection for continuous features: The goal is to find a cut that maximizes information gain. 🏠 It’s like trying different spots to place a fence so your two fields are as uniform as possible. 🔧 A simple approach is to sort the data by the feature and test midpoints between adjacent unique values as candidate thresholds. 💡 This ensures we only consider meaningful splits where labels could change. 📝 We compute gain for each candidate threshold, selecting the one that yields the largest gain.
- 11
🎯 Discretization alternative: Instead of thresholds, we can predefine intervals and treat each interval as a category. 🏠 Imagine labeling shelves Small, Medium, and Large rather than measuring exact length. 🔧 Choose ranges that make sense (equal width, domain knowledge, or data-driven) and split on the resulting categorical values. 💡 Discretization can simplify trees but may lose fine-grained information. 📝 Weight bins like <100, 100–200, >200 allow a 3-way split similar to 'patrons' values.
- 12
🎯 Overfitting in decision trees: Overfitting occurs when the tree memorizes the training data and doesn’t generalize to new cases. 🏠 It’s like learning every tiny detail of a single test rather than the main ideas. 🔧 Deep trees with many branches can fit noise; performance drops on unseen data. 💡 We need controls like stopping rules and pruning to keep the model simple enough. 📝 A branch that only matches a rare training quirk won’t help future predictions and should be removed.
- 13
🎯 Pruning to prevent overfitting: Pruning removes branches that don’t help accuracy on new data. 🏠 Think of trimming a bush so it grows healthier and not too wild. 🔧 Pre-pruning stops growth early using rules like max depth; post-pruning grows a full tree, then cuts back, often guided by validation performance. 💡 Pruning balances fit to training data with generalization. 📝 Reduced error pruning is one post-pruning method that deletes branches if accuracy on a validation set doesn’t drop.
- 14
🎯 Interpretability of trees: Trees are transparent—each decision path shows exactly why the model predicted a label. 🏠 It’s like a visible checklist rather than a black-box verdict. 🔧 Each path corresponds to a set of feature conditions, making it easy to audit, explain, and debug. 💡 This clarity builds trust and helps identify data or feature issues. 📝 In the restaurant case, if 'patrons = full' leads to 'don’t wait' unless other conditions hold, you can see and critique that logic.
- 15
🎯 Classification vs regression trees: Trees can predict classes or continuous values. 🏠 For regression, it’s like estimating a number (e.g., wait time) rather than a yes/no decision. 🔧 For classification, leaves hold labels or class distributions; for regression, leaves hold numeric predictions (e.g., average of training targets in that leaf). 💡 The same splitting concept applies, though the impurity measure differs for regression (the lecture focuses on classification and entropy). 📝 The restaurant example is classification: predict 'wait' or 'don’t wait'.
- 16
🎯 Using validation for pruning decisions: A held-out validation set can guide pruning. 🏠 It’s like testing your trimmed tree in a practice game before the real match. 🔧 If removing a branch doesn’t harm validation accuracy, keep it pruned; otherwise, restore it. 💡 This avoids trusting training accuracy alone, which can be misleading when overfitting. 📝 Reduced error pruning follows this idea by evaluating accuracy changes after each candidate cut.
- 17
🎯 Ties and practicalities in splitting: Sometimes multiple features have very similar information gain. 🏠 It’s like choosing between two almost-equal sorting rules for your closet. 🔧 In practice, you can break ties consistently (e.g., by feature order) or choose either since downstream recursion will further refine purity. 💡 Consistent tie-breaking keeps results reproducible. 📝 If 'bar' and 'rain' give nearly identical gains after 'patrons = full', picking either is acceptable and the recursion will continue.
- 18
🎯 Final tree usage: Once trained, the tree is used to classify new cases quickly. 🏠 Like scanning a short checklist, the evaluation is fast and straightforward. 🔧 Evaluate node conditions top-down until reaching a leaf; output the leaf’s label or probability. 💡 This makes trees efficient at inference and simple to deploy. 📝 Given a new restaurant scenario, the model quickly predicts whether to wait by following the learned path.
03 Technical Details
- Overall Architecture/Structure
- Model structure: A decision tree is a rooted tree where each internal node tests a feature (attribute) and each outgoing branch corresponds to an outcome of that test. Leaves store predictions: either a class label or a class probability distribution for classification, and a numeric value for regression (though this lecture focuses on classification).
- Data flow during prediction: To classify an instance, start at the root. Evaluate the node’s condition on the instance (e.g., patrons = full?). Follow the matching branch to the next node. Repeat until reaching a leaf. Output the leaf’s label. This deterministic path is the model’s explanation for its prediction. A short code sketch of this traversal appears just below.
- Data flow during training: Given a labeled dataset S, choose the feature A that best separates classes according to information gain. Split S into subsets S_v by A’s values v (for continuous features, the values correspond to whether the feature is below or above a chosen threshold). Recurse on each subset, choosing the next best feature within that subset. Stop when nodes are pure, no features remain, or a stopping rule triggers. Optionally, prune the tree to improve generalization.
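To make the prediction flow concrete, here is a minimal Python sketch (not from the lecture): internal nodes are nested dicts and leaves are label strings, which is an assumed representation, and the 'hungry' test under the 'full' branch is a made-up placeholder rather than the lecture's actual subtree.

```python
# Minimal sketch of top-down classification (illustrative only).
# Internal nodes are dicts: {"feature": name, "children": {value: subtree}};
# leaves are plain label strings. Both the structure and the 'hungry' test
# under the 'full' branch are assumptions, not the lecture's actual tree.

def predict(node, instance):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, dict):           # internal node: keep testing
        value = instance[node["feature"]]   # evaluate the node's test
        node = node["children"][value]      # follow the matching branch
    return node                             # leaf: the predicted label

tree = {
    "feature": "patrons",
    "children": {
        "none": "don't wait",
        "some": "wait",
        "full": {"feature": "hungry",
                 "children": {"yes": "wait", "no": "don't wait"}},
    },
}

print(predict(tree, {"patrons": "some"}))                   # -> wait
print(predict(tree, {"patrons": "full", "hungry": "no"}))   # -> don't wait
```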
- Entropy and Information Gain
- Entropy (uncertainty): For a dataset S with class proportions p1, p2, ..., pC, entropy(S) = -Σ_i p_i log2 p_i. Intuition: entropy is 0 when there is no uncertainty (all examples share the same label). For binary classification, entropy is 1 when p1 = p2 = 0.5 (maximum uncertainty). Entropy smoothly increases as labels become more mixed.
- Computing entropy: Suppose S has 12 examples: 6 positive (wait) and 6 negative (don’t wait). Then p_pos = 0.5 and p_neg = 0.5. Entropy(S) = - (0.5 log2 0.5 + 0.5 log2 0.5) = - (-0.5 - 0.5) = 1.
- Information gain (IG): For feature A with possible outcomes v in Values(A), let S_v be the subset where A = v. Then IG(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) Entropy(S_v). Interpretation: IG is how much the split reduces uncertainty, on average weighted by subset sizes. Choose the feature with the highest IG at each node.
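Both quantities can be coded directly from these formulas. Below is a minimal sketch; the function names and the list-of-dicts data layout are assumptions made here, not the lecture's code.

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """IG(S, A) = entropy(S) - sum_v (|S_v|/|S|) * entropy(S_v)."""
    total = len(labels)
    weighted_after = 0.0
    for v in {ex[feature] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == v]
        weighted_after += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted_after

# Sanity checks: a 6/6 binary split has entropy 1 bit, and a feature that
# separates a 2/2 set perfectly has information gain 1.
print(entropy(["wait"] * 6 + ["don't wait"] * 6))                        # 1.0
X = [{"bar": "yes"}, {"bar": "yes"}, {"bar": "no"}, {"bar": "no"}]
y = ["wait", "wait", "don't wait", "don't wait"]
print(information_gain(X, y, "bar"))                                     # 1.0
```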
- Worked Example: Restaurant Dataset
- Dataset summary: 12 examples with features: alternates, bar, Friday/Saturday, hungry, patrons (none/some/full), price, rain, reservation, type. Target: wait (yes/no). Overall entropy is 1 as labels are evenly split.
- Split on 'patrons': Values are none, some, full. Compute entropies per value: • patrons = none: 2 examples, both negative → p_neg = 1, entropy = 0. • patrons = some: 4 examples, all positive → p_pos = 1, entropy = 0. • patrons = full: 6 examples, with 2 positive and 4 negative → p_pos = 2/6 = 1/3, p_neg = 4/6 = 2/3. Entropy = - (1/3 log2 (1/3) + 2/3 log2 (2/3)) ≈ 0.918.
- Weighted post-split entropy: (2/12)*0 + (4/12)*0 + (6/12)*0.918 = 0.5 * 0.918 = 0.459.
- Information gain: IG(S, patrons) = 1 − 0.459 = 0.541. This is the largest among tested features, so patrons is chosen as the root.
- Recursion after first split: For patrons = none and patrons = some, entropy is 0 → make leaves (don’t wait, wait). For patrons = full, continue splitting using the same IG calculation on remaining features within that subset.
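As a quick sanity check (a few lines of my own, not lecture code), the 'patrons' numbers above can be reproduced directly:

```python
import math

def entropy2(p):
    """Binary entropy (in bits) of a positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

parent = entropy2(6 / 12)    # 6 wait vs 6 don't wait -> 1.0
e_none = entropy2(0 / 2)     # patrons = none: 2 examples, all negative -> 0.0
e_some = entropy2(4 / 4)     # patrons = some: 4 examples, all positive -> 0.0
e_full = entropy2(2 / 6)     # patrons = full: 2 positive, 4 negative -> ~0.918

weighted = (2/12) * e_none + (4/12) * e_some + (6/12) * e_full   # ~0.459
gain = parent - weighted                                         # ~0.541
print(round(e_full, 3), round(weighted, 3), round(gain, 3))      # 0.918 0.459 0.541
```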
- Building a Tree: Step-by-Step Algorithm (conceptually ID3-style, though the lecture does not name the algorithm)
- Input: Training set S of examples with features X and class labels Y.
- Step A: Compute entropy(S) using class proportions. If entropy(S) = 0, return a leaf with that class.
- Step B: For each candidate feature A, split the dataset and compute the weighted average entropy after the split. Compute IG(S, A) = Entropy(S) − weighted entropy.
- Step C: Choose the feature A* with the highest information gain. Create an internal node that tests A*.
- Step D: For each outcome v of A*, form subset S_v. If S_v is empty, create a leaf with the parent’s majority label. Else, recursively call the algorithm on S_v using the remaining features.
- Step E: Apply stopping rules: stop when entropy is 0 (pure), when no features remain (assign majority class), or when other limits (max depth, minimum samples per leaf) are reached. Optionally, after the full tree is built, apply pruning.
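Steps A–E map onto a short recursive function. The sketch below is illustrative only: the function and node names are assumptions, the gain computation is folded inline rather than kept as separate helpers, and only pure nodes, exhausted features, and a maximum depth are handled.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, features, depth=0, max_depth=5):
    """Grow a tree recursively: pick the highest-gain feature, split, recurse."""
    # Steps A and E: stop on a pure node, no remaining features, or depth limit.
    if entropy(labels) == 0.0 or not features or depth >= max_depth:
        return majority(labels)                      # leaf with (majority) label

    # Steps B and C: choose the feature with the highest information gain.
    def gain(f):
        after = 0.0
        for v in {ex[f] for ex in examples}:
            sub = [lab for ex, lab in zip(examples, labels) if ex[f] == v]
            after += (len(sub) / len(labels)) * entropy(sub)
        return entropy(labels) - after
    best = max(features, key=gain)

    # Step D: split on the best feature and recurse on each non-empty subset.
    node = {"feature": best, "children": {}}
    remaining = [f for f in features if f != best]
    for v in {ex[best] for ex in examples}:
        sub_x = [ex for ex in examples if ex[best] == v]
        sub_y = [lab for ex, lab in zip(examples, labels) if ex[best] == v]
        node["children"][v] = build_tree(sub_x, sub_y, remaining,
                                         depth + 1, max_depth)
    return node

# Tiny illustrative call (the five examples and their labels are made up).
X = [{"patrons": "none"}, {"patrons": "some"}, {"patrons": "some"},
     {"patrons": "full"}, {"patrons": "full"}]
y = ["don't wait", "wait", "wait", "don't wait", "wait"]
print(build_tree(X, y, ["patrons"]))   # prints a small nested-dict tree
```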
- Handling Continuous Attributes
- Discretization approach: Define intervals (bins) over the numeric range—e.g., <100, 100–200, >200—and treat the attribute as categorical with those interval labels. Then compute IG as with any categorical feature. Pros: simple, yields multi-way splits; cons: may be coarse and not adaptive.
- Threshold (binary split) approach: Convert the numeric feature to a binary question at the node: is value ≤ t (left) or > t (right)? To select t, sort training examples by the feature and consider candidate thresholds at midpoints between adjacent unique values. For each candidate t, split the data and compute IG; choose t that maximizes IG. Pros: data-driven and fine-grained; cons: requires scanning multiple thresholds per node.
- Practical tip: When labels do not change across adjacent values, midpoints between those values need not be tested; evaluate only where label identity changes to reduce computation.
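A minimal sketch of this midpoint search (illustrative; `best_threshold` and the toy weights are my own, and for simplicity it tests every midpoint rather than only those where labels change):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Test midpoints between adjacent sorted unique values and return the
    (threshold, information gain) pair with the highest gain."""
    parent, n = entropy(labels), len(labels)
    pairs = list(zip(values, labels))
    uniq = sorted(set(values))
    best_t, best_gain = None, -1.0
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2.0                              # candidate cut point
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        g = parent - (len(left) / n) * entropy(left) \
                   - (len(right) / n) * entropy(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

# Toy weights with a clean separation around 150 (made-up numbers).
w = [120, 130, 145, 155, 170, 190]
y = ["don't wait", "don't wait", "don't wait", "wait", "wait", "wait"]
print(best_threshold(w, y))   # -> (150.0, 1.0): a perfect binary split
```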
- Overfitting and Pruning
- Overfitting definition: A tree that keeps splitting to fit idiosyncrasies in the training set can achieve very low training error but poor test performance. Symptoms include deep trees with leaves that cover very few examples and rules that mirror noise rather than meaningful patterns.
- Pre-pruning (early stopping): Impose constraints during training to limit complexity. Common constraints include a maximum depth, a minimum number of examples required to split a node further, or a minimum information gain threshold to justify a split. If a split fails these checks, make the node a leaf with the majority class.
- Post-pruning (simplification after growth): Grow the full tree (or a large tree), then consider removing subtrees and replacing them with leaves. Use a validation set to test if pruning a subtree reduces accuracy; if not, keep it pruned. Reduced error pruning is a straightforward method: prune any subtree whose removal does not decrease validation accuracy.
- Trade-off: Pre-pruning may stop too early and miss useful splits; post-pruning may initially overfit but can find better simplifications guided by validation performance. In practice, even simple pruning heuristics often yield much better generalization than unpruned trees.
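One way to sketch reduced error pruning (again illustrative, not the lecture's code): here nodes are dicts where leaves carry a 'label' key, and internal nodes carry 'feature', 'children', and, as an extra assumption, the majority training label under 'majority' so a pruned node knows which leaf to become.

```python
# Reduced error pruning sketch (illustrative). Leaves are {"label": ...};
# internal nodes are {"feature": ..., "children": {...}, "majority": ...},
# where "majority" is the majority training label at that node -- an extra
# field assumed here, not part of the lecture's description.

def predict(node, instance):
    while "label" not in node:                       # walk down to a leaf
        node = node["children"][instance[node["feature"]]]
    return node["label"]

def accuracy(root, examples, labels):
    hits = sum(predict(root, ex) == lab for ex, lab in zip(examples, labels))
    return hits / len(labels)

def reduced_error_prune(node, root, val_x, val_y):
    """Bottom-up: collapse a subtree into its majority-label leaf whenever
    validation accuracy does not drop; otherwise restore the subtree."""
    if "label" in node:
        return
    for child in node["children"].values():
        reduced_error_prune(child, root, val_x, val_y)
    before = accuracy(root, val_x, val_y)
    saved = {"feature": node.pop("feature"), "children": node.pop("children")}
    node["label"] = node["majority"]                 # candidate pruning
    if accuracy(root, val_x, val_y) < before:        # pruning hurt: undo it
        del node["label"]
        node.update(saved)

# Tiny usage example with a hand-built tree and a two-example validation set.
tree = {
    "feature": "patrons", "majority": "don't wait",
    "children": {
        "none": {"label": "don't wait"},
        "some": {"label": "wait"},
        "full": {"feature": "rain", "majority": "don't wait",
                 "children": {"yes": {"label": "wait"},
                              "no": {"label": "don't wait"}}},
    },
}
val_x = [{"patrons": "some"}, {"patrons": "full", "rain": "yes"}]
val_y = ["wait", "don't wait"]
reduced_error_prune(tree, tree, val_x, val_y)
print(tree["children"]["full"])   # the 'full' subtree collapses to a leaf
```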
- Interpreting and Using Trees
- Interpretability: Every root-to-leaf path is a human-readable rule, composed of feature tests. For auditing, you can list these paths and inspect where certain decisions come from.
- Determinism and speed: Classification involves a small number of feature checks along a path, making inference fast and predictable.
- Dealing with ties in IG: If multiple features share nearly equal IG, pick any consistently (e.g., by a fixed feature order). Subsequent recursion often resolves remaining uncertainty.
- Implementation Guide (no code required)
- Preparation: Ensure features are in a suitable format. Categorical features should have well-defined values; continuous features must be either binned (discretized) or prepared for threshold search.
- Core functions: • Entropy(S): Count labels in S, convert counts to probabilities p_i, compute -Σ p_i log2 p_i. • Split(S, A): For categorical A, group examples by each value v. For continuous A with threshold t, split into ≤t and >t. Return subsets and their sizes. • InformationGain(S, A): Compute entropy(S), compute weighted entropy after splitting on A, return the difference. • BestSplit(S): For each candidate A (and for continuous A, each candidate t), compute IG and return the best A (and t, if applicable). • BuildTree(S, features, depth): If stop condition met, return leaf with majority label (and optionally class distribution). Else choose best split, create node, and recurse on child subsets.
- Stopping conditions to implement: • Node is pure (entropy = 0). • No remaining features. • Max depth reached. • Node size below a minimum threshold. • Information gain below a small threshold (split not worth it).
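Although the guide above is code-free, the stopping checks can be collected into one small helper as a sketch; the parameter names and defaults (`max_depth`, `min_samples`, `min_gain`) are illustrative, not values from the lecture.

```python
def should_stop(labels, features, depth, best_gain,
                max_depth=5, min_samples=2, min_gain=1e-3):
    """Return True if any stopping rule fires for the current node."""
    pure = len(set(labels)) <= 1          # entropy 0: all labels identical
    return (pure
            or not features               # no remaining features to test
            or depth >= max_depth         # depth limit reached
            or len(labels) < min_samples  # too few examples to split further
            or best_gain < min_gain)      # split not worth it

# Example: a pure node stops immediately, regardless of other limits.
print(should_stop(["wait", "wait"], ["bar", "rain"], depth=1, best_gain=0.4))  # True
```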
- Post-processing: • If using post-pruning, hold out a validation set. Consider replacing subtrees with leaves if validation accuracy doesn’t drop (reduced error pruning). Iterate until no further beneficial pruning is possible.
- Tips and Warnings
- Data fragmentation: Each split reduces the number of examples per child. Be cautious of deep trees where many leaves have very few examples; this is a red flag for overfitting.
- Feature engineering: Categorical values with many rare categories can lead to brittle splits. Consider grouping rare categories if appropriate.
- Continuous thresholds: Always consider candidate thresholds derived from sorted data; testing arbitrary thresholds (e.g., every 10 units) is less effective and can miss better cut points.
- Class imbalance: If one class dominates, entropy will be low even without good separation. Be mindful when interpreting IG and consider whether features meaningfully reduce uncertainty for minority classes.
- Validation strategy: Keep a dedicated validation set or use cross-validation to evaluate pruning decisions; relying solely on training metrics will not reveal overfitting.
- Reproducibility: Fix the order of features for tie-breaking and document stopping/pruning rules so others can replicate your tree.
- Putting It All Together on the Restaurant Data
- Start: S has 12 examples, 6 positive and 6 negative, so entropy(S) = 1.
- Evaluate features: Compute IG for alternates, bar, Friday/Saturday, hungry, patrons, price, rain, reservation, type. The lecture shows 'patrons' has the highest IG ≈ 0.541.
- First split: Create a root node testing patrons with branches none, some, full. • none: 2 examples, both negative → make a 'don’t wait' leaf. • some: 4 examples, all positive → make a 'wait' leaf. • full: 6 examples with 2 positive, 4 negative → continue splitting.
- Recursion: Within 'full', choose the next feature by recomputing IG with only these 6 examples and the remaining features. Continue until subsets are pure or a stopping rule applies.
- Result: A readable, rule-based model that explains 'wait' vs 'don’t wait' decisions using intuitive restaurant cues. The dominant early split on 'patrons' captures the key intuition that crowd level strongly predicts waiting.
In summary, decision trees combine a clear structure (feature tests leading to labeled leaves) with a principled learning method based on entropy and information gain. They handle both categorical and continuous features (via discretization or thresholding), and they require safeguards—stopping rules and pruning—to avoid overfitting. The restaurant example shows exactly how to compute entropies, pick the best split, and grow a tree that is both effective and interpretable.
04 Examples
- 💡
Restaurant path traversal: Given alternates = no at the root of the handcrafted tree, the decision immediately goes to 'don’t wait'. The path is short because that feature was used early and was decisive. This demonstrates classification as a top-down sequence of tests until a leaf is reached. The key point is that using a tree, you can trace exactly why a decision was made.
- 💡
Entropy of the full dataset: With 12 examples split 6 'wait' and 6 'don’t wait', the entropy is 1. We compute -0.5 log2 0.5 - 0.5 log2 0.5 = 1. This shows maximum uncertainty for a binary case. It sets the baseline before any splits.
- 💡
Entropy for 'patrons = none': In this subset there are 2 examples and both are negative. The class proportion is 0% positive, 100% negative, so entropy = 0. This shows a perfectly pure node that needs no further splitting. Purity leads directly to a leaf.
- 💡
Entropy for 'patrons = some': In this subset there are 4 examples and all are positive. The class proportion is 100% positive, 0% negative, so entropy = 0. Again, purity is achieved and a leaf is created. This demonstrates why 'patrons' is a powerful first split.
- 💡
Entropy for 'patrons = full': In this subset there are 6 examples with 2 positive and 4 negative. The entropy is -1/3 log2(1/3) - 2/3 log2(2/3) ≈ 0.918. This is a mixed node that still contains uncertainty. It requires further splitting to improve purity.
- 💡
Information gain for 'patrons': The post-split weighted entropy is (2/12)*0 + (4/12)*0 + (6/12)*0.918 = 0.459. Information gain is 1 - 0.459 = 0.541. This is the largest among tested features, so 'patrons' is chosen as the root split. It illustrates the gain formula in action.
- 💡
Discretization example (weight bins): Suppose weight is continuous; we define bins <100, 100–200, >200. We treat these as three categories and compute IG as for a categorical split. If many positives cluster in 100–200, that bin may have low entropy. This shows how continuous variables can be turned into multi-way splits.
- 💡
Threshold split example (weight ≤ 150): We test a binary question: is weight ≤ 150? The dataset is split into two groups; we compute the weighted entropy and IG. If the ≤150 group is much purer, IG will be high. We try multiple candidate thresholds and keep the best.
- 💡
Threshold selection via sorted midpoints: Sort examples by weight and look at adjacent unique values. For each adjacent pair, test the midpoint as a threshold and compute IG. Only positions where labels differ across the pair can change IG. This efficient method finds the best cut without guessing.
- 💡
Pre-pruning by max depth: Set a maximum depth (e.g., 3). When depth reaches 3, stop splitting and assign the majority label at that node. This prevents deep, highly specific branches that could overfit. It shows how early stopping controls complexity.
- 💡
Post-pruning with validation (reduced error): Build the full tree, then consider pruning each subtree. If removing a subtree and making it a leaf does not reduce validation accuracy, keep it pruned. Repeat until no improvement is possible. This demonstrates post-hoc simplification guided by generalization.
- 💡
Majority assignment when no features remain: If a node is still impure but all features have been exhausted, assign the majority class in that node. This ensures the algorithm terminates even without perfect purity. It’s a practical fallback behavior. It keeps the tree usable when data cannot be fully separated.
05 Conclusion
This lecture presented decision trees as a simple yet powerful model for classification, grounded in clear, step-by-step logic. The core ideas are that each internal node tests a feature, each branch represents a test outcome, and each leaf produces a prediction. We learned how to grow trees from data using entropy to measure uncertainty and information gain to pick the most helpful features to split on. A detailed, numeric walk-through with the restaurant dataset demonstrated that 'patrons' provides the greatest reduction in uncertainty at the root, leading to immediate pure leaves for 'none' and 'some' and a recursive split for 'full'.
Continuous attributes were addressed with two mainstream strategies: discretization into bins and threshold-based binary splits. Thresholds can be selected efficiently by sorting the values and evaluating midpoints between adjacent unique values, choosing the one that maximizes information gain. The lecture stressed the danger of overfitting—trees that are too complex and overly tailored to training noise—and introduced pruning as the antidote. Pre-pruning stops early using rules like maximum depth or minimum samples per leaf, while post-pruning grows a fuller tree and then removes branches that don’t help on a validation set, with reduced error pruning as a straightforward method.
Practically, decision trees are easy to interpret, debug, and deploy. Their decisions are transparent: each path reveals exactly which conditions led to a prediction. After this lecture, you should be able to compute entropy and information gain, select splits, handle continuous features, and implement stopping and pruning strategies. To deepen your understanding, practice by building a small tree on a toy dataset, compute all entropies and gains by hand once, and then implement a simple recursive learner that includes stopping rules and a basic pruning pass.
The key message to remember is balance: use information-theoretic splitting to carve out structure, but apply sensible stopping and pruning to avoid overfitting. With these principles, decision trees deliver both clarity and solid predictive performance on many real-world problems.
Key Takeaways
- ✓Start with entropy to understand label uncertainty. Compute class proportions and plug them into -Σ p_i log2 p_i. Aim for splits that produce child nodes with lower entropy. This tells you which features most reduce uncertainty.
- ✓Use information gain to pick features. For each feature, compute the post-split weighted entropy and subtract from the parent entropy. The feature with the largest gain is the best split at that node. Repeat this process recursively.
- ✓Work through at least one example by hand. Calculate entropy and information gain step-by-step for a small dataset, like the restaurant example. Seeing the numbers clarifies why one feature is better than another. It also helps you trust the algorithm.
- ✓Handle continuous features deliberately. Either discretize into a few meaningful bins or search for the best threshold by sorting and testing midpoints. Choose the approach that makes your splits purer. Don’t guess thresholds randomly.
- ✓Define clear stopping rules. Set maximum depth, minimum samples per split or leaf, or a minimum information gain threshold. These prevent the tree from growing too complex. Stopping early reduces overfitting risk.
- ✓Prune to improve generalization. After building a larger tree, use a validation set to test whether removing subtrees hurts accuracy. If not, keep them pruned. This leads to simpler, more reliable models.
- ✓Favor interpretability in early modeling. Trees are easy to read: each path is a rule. Use them to explain predictions to stakeholders and to debug data issues. Transparency builds trust in your model.
- ✓Be careful with small leaves. Very small leaves often indicate overfitting to noise. Consider increasing minimum samples per leaf or pruning those branches. Keep enough data in each leaf to make stable predictions.
- ✓Use consistent tie-breaking. When two features yield similar gains, pick one using a fixed rule, like feature order. This ensures reproducibility across runs. Document the rule in your implementation.
- ✓Monitor class imbalance. If one class dominates, entropy can be low even without meaningful separation. Check whether splits help minority classes too. Consider collecting more data or adjusting sampling if needed.
- ✓Store class distributions at leaves. Even when predicting a label, having probabilities is useful for decision thresholds. For example, only act if the probability of 'wait' is above a certain level. This gives flexibility in downstream use.
- ✓Validate early and often. Keep a validation set to evaluate pruning and to compare different stopping rules. Do not rely solely on training accuracy. Validation performance reflects real-world behavior.
- ✓Record split rationales. Log each node’s chosen feature, information gain, and stopping reason. This helps you audit and refine the model later. It also supports reproducibility and learning from past iterations.
- ✓Prefer data-driven thresholds. For continuous features, sort values and evaluate midpoints where labels change. This systematically finds strong splits. It’s more effective than using arbitrary cuts.
- ✓Keep the tree as simple as possible while accurate. Simplicity reduces overfitting and increases interpretability. If two trees perform similarly, choose the smaller one. Clarity is a competitive advantage.
- ✓Use trees as a baseline. They are quick to train and easy to understand, making them ideal first models. The insights guide feature engineering and data collection. Even if you switch models later, you’ll keep the lessons learned.
Glossary
Decision tree
A predictive model that asks a sequence of questions about features to make a decision. Each question splits the data into groups based on the answer. The process continues until a final decision is reached at a leaf. Trees are easy to read and explain because you can follow the path of questions. They work for both classification (categories) and regression (numbers).
Node
A point in the tree where something happens: either a question is asked (internal node) or a decision is made (leaf). Internal nodes split data using a feature test. Leaves store the final predicted label or value. Every path from the root to a leaf goes through several nodes. Nodes organize the logic of the decision.
Root
The top node of the decision tree where the first question is asked. It is chosen to be the most informative split on the entire training set. All classifications start from the root. Its choice greatly affects the size and accuracy of the tree. A good root simplifies later decisions.
Leaf (terminal node)
The final node in a path where the model outputs a prediction. In classification, the leaf may hold a label like 'wait' or 'don’t wait'. It can also store class probabilities based on training examples reaching that leaf. Leaves are created when data are pure or when stopping rules say to stop. They represent the model’s conclusions.
Branch
A connection between nodes that represents the outcome of a test at an internal node. Each branch carries a label like 'yes' or 'no', or a specific feature value. Following branches is how the tree navigates from root to leaf. Branches partition the dataset into subsets. They encode the flow of decisions.
Feature (attribute)
An input variable used by the model to make decisions. Features can be categorical (like 'bar: yes/no') or continuous (like weight). Each test in the tree examines one feature. Good features help separate classes clearly. Poor features add little information and lead to weaker splits.
Classification
A machine learning task where the goal is to assign items to categories. Decision trees perform classification by splitting data until leaves mostly contain one class. Resulting leaves can output a class label or a class probability. The training data provide the correct labels to learn from. The quality of splits affects accuracy.
Regression
A task where the goal is to predict a continuous number instead of a label. A regression tree splits the data similarly but stores numeric predictions at leaves. The splitting criterion changes for regression (the lecture focuses on classification). Still, the idea of recursive partitioning remains the same. It estimates values based on similar cases grouped together.
