Stanford CS329H: ML from Human Preferences | Autumn 2024 | Model-based Preference Optimization
Key Summary
- Decision trees are flowchart-like models used to predict a class (like yes/no) by asking a series of questions about features. You start at the root and follow branches based on answers until you reach a leaf with a class label. Each internal node tests one attribute, each branch is an outcome of that test, and each leaf gives the prediction.
- Learning a decision tree is a top-down process that begins with all training examples at the root. At each step, you choose the "best" attribute to split the data into purer groups, then recurse on each group. This continues until stopping rules are met, like when a node is pure or there are no attributes left.
- Purity means a group of examples mostly (or entirely) belongs to a single class. The goal of splitting is to make child groups as pure as possible. Pure nodes are easier to label confidently because there is less uncertainty.
- Entropy measures uncertainty in a group: it is zero when all examples are the same class and highest when classes are evenly mixed. Information gain tells us how much uncertainty drops after splitting on an attribute. The attribute with the highest information gain is chosen to split next because it makes the data more orderly for classification.
- Technically, entropy is computed as the sum over classes of -p_i * log2(p_i), where p_i is the fraction of class i in the set. Information gain is the parent entropy minus the weighted sum of child entropies after a split. A larger information gain means a more helpful split.
- The Gini index is a different way to measure impurity: it is 1 minus the sum of squared class probabilities. It equals zero for a pure node and is larger when classes are more mixed. You choose the split that reduces the weighted Gini the most (equivalently, maximizes Gini reduction).
- Stopping criteria include: all examples at a node share the same class, no attributes remain to split, too few examples remain (below a threshold), or no split improves purity beyond a threshold. These rules prevent the tree from growing forever. They also help avoid overfitting by stopping weak or meaningless splits.
- If a split produces an empty child (no examples go to a branch), you assign that leaf the most common class from the parent. This provides a sensible default when data is missing for that branch. It prevents the algorithm from failing on unseen combinations of attribute values.
- Pruning reduces tree size by removing branches that don't help accuracy. Pre-pruning stops early when splits look weak; post-pruning grows a full tree then trims bad branches afterward. Pruning improves generalization to new data and makes the model simpler and faster.
- Missing values can be handled in several ways. You can drop examples with missing values (simple but wasteful), fill in (impute) missing values with the most common value for that attribute, or use a model to predict the missing entries. Better handling of missing data keeps more information and often improves accuracy.
- Continuous attributes like temperature can be handled by turning them into ranges (discretization), such as cold, warm, and hot. The algorithm can then treat these ranges as categorical values and split accordingly. This lets decision trees use both categorical and numeric inputs.
- Decision trees are easy to understand and explain because they look like a set of human-readable rules. They work for classification and regression, handle mixed data types, and need little data preprocessing. But without careful stopping and pruning, they can overfit noisy data.
- Choosing between entropy/information gain and Gini is often a practical preference, since both aim to create purer child nodes. Entropy is more information-theoretic and sometimes slightly slower due to logs. Gini is often favored in CART-style trees for speed and similar performance.
- When multiple attributes tie as "best," you can break ties by picking the one with fewer unique values or by a fixed order. This keeps the algorithm deterministic and avoids random trees unless intentional. Clear tie-breaking makes experiments repeatable.
- In practice, you will compute class proportions at a node, compute impurity, evaluate each attribute's split, and pick the best one. You then repeat this for every child node until a stop rule triggers. Finally, you label leaves with the majority class if needed.
- Decision trees naturally capture interactions by splitting on different attributes at different depths. This is like asking a new question depending on previous answers, which makes trees flexible. The path structure encodes conditional rules that are easy to audit.
- Model performance depends on good split choices, reasonable minimal sample thresholds, and smart pruning. Simple heuristics like minimum samples per leaf and maximum depth prevent extreme overfitting. Validation accuracy guides post-pruning decisions to keep only helpful branches.
Why This Lecture Matters
Decision trees are a cornerstone of practical machine learning because they turn complex, mixed data into clear, step-by-step rules that people can read and trust. Product managers, data scientists, analysts, and engineers benefit because trees work well with small to medium datasets, require minimal preprocessing, and handle both categorical and continuous features. The knowledge in this lecture solves common real-world problems: choosing meaningful splits, avoiding overfitting with sensible stopping and pruning, filling in missing values without throwing away data, and converting numeric signals into interpretable ranges. These skills apply directly to tasks like churn prediction, fraud screening, medical triage, credit approval, and operations decision support, where transparent, auditable logic is valuable. From a career standpoint, mastering decision trees equips you with a reliable baseline model and builds intuition for impurity, gain, and generalization, concepts that also power advanced methods. Trees are the building blocks of random forests and gradient boosting, two of the most widely used techniques in industry. Understanding when and why a split is good trains your judgment for feature engineering and model selection. In a tech landscape that values explainability and fairness, decision trees offer interpretable decisions, making them suitable for regulated domains. This lecture's methods let you design, implement, and tune trees end-to-end, so you can deliver accurate, understandable models that stakeholders can trust and adopt quickly.
Lecture Summary
01 Overview
This lecture teaches the core ideas behind decision tree learning for supervised classification. A decision tree is a flowchart-like structure that predicts a target label (such as yes/no) by asking a sequence of questions about input attributes. Each internal node asks a question (a test) about one attribute, each branch corresponds to one possible outcome of that test, and each leaf node provides the final predicted class. By following the path from the root to a leaf, the model makes a prediction for any new example. The main goal of the learning algorithm is to grow a tree that separates the training data into increasingly pure groups, where each group mostly contains a single class.
The lecture focuses on how to construct such a tree from labeled data, how to choose which attribute to split on at each step, and when to stop splitting. The process starts at the root with all training examples and proceeds top-down. At every node, the algorithm evaluates each candidate attribute and chooses the one that produces the purest child groups. Two common measures of purity are introduced: entropy (from information theory) and the Gini index. The attribute that maximizes information gain (the drop in entropy) or equivalently reduces Gini impurity the most is selected for splitting.
You will also learn practical decisions every tree builder must make: when to stop growing the tree, how to handle empty branches after a split, how to prune a tree to avoid overfitting, how to deal with missing values, and how to handle continuous (numeric) attributes by discretizing them into ranges. Stopping rules include: all examples in a node share a class, no attributes remain, too few examples remain, or further splits provide negligible improvement. If a branch ends up with no training samples, you label it with the parent node's most common class. Pruning comes in two flavors, pre-pruning (stop early) and post-pruning (trim after fully growing), and it helps create a smaller, more generalizable model.
This lecture is for beginners who have basic familiarity with supervised learning terms like features (attributes), labels (classes), and training sets. No advanced math is required beyond understanding averages and logarithms, though the idea of log base 2 appears in the entropy formula. If you know how to count class frequencies in a group and compute proportions, you can follow the impurity calculations. The instructor keeps the concepts grounded with everyday analogies and small, concrete examples, like predicting whether someone will play tennis based on weather conditions such as outlook (sunny, overcast, rainy).
After completing this lecture, you will be able to: describe what a decision tree is; trace how it predicts a class label by following attribute-based questions; compute entropy, information gain, and the Gini index for a node; choose the best attribute to split; apply reasonable stopping criteria; decide what to do with empty branches; explain and apply pre-pruning and post-pruning; handle missing values through simple or model-based imputation; and discretize continuous attributes into useful ranges. These skills transfer directly to practical machine learning projects, where decision trees often serve as strong, interpretable baselines and are central to powerful ensembles like random forests and gradient-boosted trees (even though the lecture focuses on single trees).
The lecture is structured in a clean progression. It begins with the conceptual picture of a decision tree: nodes as questions, branches as answers, and leaves as predictions. Then it presents the top-down training algorithm and the key question: how to pick the best split. Two impurity measures, entropy (with information gain) and the Gini index, are introduced along with their formulas and intuitive meaning. Next, the lecture addresses when to stop splitting, how to handle empty child sets, and how to label leaves. Finally, it covers practical refinements: pruning, strategies for missing values, and discretizing continuous attributes into categories like cold/warm/hot so the same splitting logic applies. By the end, you have a complete, end-to-end understanding of how to build and refine a decision tree classifier from real data.
02 Key Concepts
- 01
Decision Tree (What is it?): A decision tree is a flowchart-like model that predicts a class by asking a sequence of questions about features. It's like playing 20 Questions, where each yes/no or multi-way answer sends you down a new branch until you reach a final guess. Technically, internal nodes test attributes, branches are outcomes, and leaves carry class labels. It matters because it turns messy, mixed data into simple, human-readable rules for prediction. Example: Predicting whether to play tennis using weather attributes like Outlook (sunny, overcast, rainy).
- 02
Top-Down Learning (How the tree is built): The learning algorithm starts at the root with all training examples and chooses the best attribute to split them. It then sends examples to child nodes according to their attribute values and repeats the process recursively. This approach pushes uncertainty downward, making nodes purer as we go. Without it, you'd have a flat, confusing rule that treats all examples the same. Example: At the root, split by Outlook; at the next level, split rainy days by Windy = yes/no.
- 03
Purity/Impurity (Measuring how mixed a node is): Purity means examples at a node mostly share the same class, while impurity means they are mixed. Imagine sorting socks: a drawer with only white socks is pure; a drawer with half white, half black is mixed. Technically, purity is quantified by entropy or Gini index, based on class proportions. It matters because we want child nodes to be easy to label. Example: A node with 10 yes and 0 no is pure; a node with 5 yes and 5 no is highly impure.
- 04
Entropy (Uncertainty measure): Entropy is a number that tells how uncertain a node's class distribution is, computed as −Σ p_i log2 p_i. It's like how surprised you are when drawing a random candy from a jar; if all candies are the same, there's no surprise. Technically, entropy is zero for a pure node and maximal when classes are evenly split. Without entropy, we can't quantify how much a split helps. Example: With two classes at 0.5 and 0.5, entropy is 1; with 1.0 and 0.0, entropy is 0.
- 05
Information Gain (Choosing the best split): Information gain is the drop in entropy after splitting by an attribute. It's like cleaning a messy room into separate bins so each bin is more uniform. Technically, IG(A) = Entropy(parent) − Σ_v (|S_v|/|S|)·Entropy(S_v). It matters because we pick the attribute with the highest information gain at each node. Example: If splitting by Outlook makes sunny/overcast/rainy groups much purer, Outlook has high information gain.
- 06
Gini Index (Alternative impurity measure): Gini index is 1 − Σ p_i^2 and also measures how mixed a node is. Think of it as the chance you'd pick two examples at random and they'd be of different classes. Lower Gini means purer nodes; we choose the split that most reduces weighted Gini. It matters because it usually performs similarly to entropy but is faster to compute. Example: A node with classes 0.7 and 0.3 has Gini 1 − (0.49 + 0.09) = 0.42.
- 07
Branching and Leaves (Route to prediction): Each split creates branches for each attribute value, sending examples down matching paths. It's like a road map: each turn depends on your last turn and the sign you see. Technically, a leaf's prediction is the majority class of examples landing there. This matters because the path captures conditional logic: different rules apply under different conditions. Example: If Outlook = overcast, predict yes immediately because that leaf is pure.
- 08
Stopping Criteria (When to stop splitting): We stop when a node is pure, when no attributes remain, when the node is too small, or when no split significantly improves purity. It's like stopping a game of 20 Questions when you already know the answer or have run out of good questions. Technically, thresholds on minimum samples per node and minimum gain prevent tiny, noisy splits. Without stopping rules, trees can grow too complex and overfit. Example: If a node has only 3 samples and the rule says the minimum is 5, stop and label by majority.
- 09
Empty Child Sets (Handling missing branches): Sometimes a split creates a branch with zero examples. That's like making a folder that no document belongs to. Technically, we label that leaf using the parent node's most common class. It matters because we need a sensible default for unseen combinations. Example: If no training records had Outlook = foggy, we'd still assign foggy a label by copying the parent's majority class.
- 10
Pruning (Simplifying the tree): Pruning removes branches that don't help accuracy on new data. It's like trimming a plant so the healthy parts get more strength. Pre-pruning stops early based on thresholds; post-pruning grows a full tree and then removes weak subtrees. It matters because pruning reduces overfitting and improves generalization. Example: If splitting on Windy barely changes accuracy, prune that branch.
- 11
Pre-Pruning vs. Post-Pruning (Two strategies): Pre-pruning uses limits like max depth or minimum gain to avoid weak splits. Post-pruning evaluates the grown tree on validation data and cuts back branches that don't help. Pre-pruning is faster and simpler; post-pruning can fix overgrowth after the fact. Without either, trees can memorize noise. Example: Set max depth to 4 (pre-pruning) or prune subtrees that don't improve validation accuracy (post-pruning).
- 12
Handling Missing Values (Keeping data usable): Missing values are common; dropping rows wastes information. A simple fix is imputation by the most common value (mode) for that attribute; a stronger fix is model-based prediction for missing entries. Technically, imputation lets all examples participate in splits. This matters because better data use often means better models. Example: If Windy is missing, fill with the most frequent Windy value in the data.
- 13
Continuous Attributes (Turning numbers into categories): Decision trees expect discrete branches, so continuous features need thresholds or bins. It's like labeling temperatures as cold, warm, or hot instead of every possible degree. Technically, discretization creates intervals that the tree can split on as if they were categories. Without this, continuous values would create too many tiny branches. Example: Split Temperature at 20°C and 30°C to create three bins.
- 14
Majority Class at Leaves (Final labeling): When the data at a leaf isn't pure, predict the class that appears most often there. It's like voting: pick the label with the most votes. Technically, this is the empirical risk minimizer for 0-1 loss on that node. It matters because it gives a clear, consistent prediction rule. Example: If a leaf has 7 yes and 3 no, predict yes.
- 15
Tie-Breaking (When two attributes look equally good): Sometimes multiple attributes give the same gain. You can break ties by choosing the one with fewer categories, the first in a fixed order, or randomly if you want diversity. Having a rule avoids indecision and keeps training deterministic. It matters for reproducibility and fairness. Example: Between Outlook and Humidity with equal gain, pick Outlook due to fewer values.
- 16
Overfitting and Generalization (Core risk): Trees can grow very deep and memorize noise. Overfitting hurts performance on new data even if training accuracy is perfect. Pruning and stopping criteria guard against this, and validation data helps check generalization. It matters because real-world success depends on new, unseen data. Example: A tree that splits on rare patterns might ace training but fail in production.
- 17
Interpretability (Why trees are loved): Trees read like if-then rules that people understand. Each path explains a particular decision, which helps trust and debugging. Technically, this interpretability is a key advantage over many black-box models. It matters in regulated or high-stakes settings. Example: A doctor can read a path: if Age > 60 and Symptoms = X, then test further.
- 18
Entropy vs. Gini (Practical choice): Both aim to make child nodes purer; differences in performance are often small. Entropy uses logs and has a clear information-theory meaning; Gini is simpler and faster. Choose one and be consistent; test both if you're unsure. What matters is that you apply a solid stopping/pruning strategy. Example: Many libraries default to Gini for speed, but entropy is also common.
- 19
Correctness Note on Gini Reduction (Be careful): The best split is the one that most reduces impurity. If you define "Gini gain" as reduction, higher is better; if you define it as weighted child Gini, lower is better (these are equivalent views). Be consistent with your definition to avoid mistakes. This matters to implement the selection logic correctly. Example: Implement either "maximize reduction" or "minimize weighted child impurity," but not a mix.
03 Technical Details
Overall Architecture/Structure
- Problem framing
Input: A labeled training set S of examples, each described by attributes (features) and a class label. Attributes may be categorical (e.g., Outlook ∈ {sunny, overcast, rainy}) or continuous (e.g., Temperature in °C). The goal is to learn a mapping from attributes to a class label.
- Output: A decision tree T whose internal nodes test attributes, branches represent attribute values (or intervals), and leaves output predicted classes.
- Data flow: Start with S at the root. At each node, evaluate candidate attributes to split S into subsets S_v. Choose the attribute that yields the purest children. Recurse on each child with its subset. Stop when a stopping rule applies. For any leaf, predict the majority class of the subset assigned to it.
- Node anatomy
- Internal node: Stores which attribute is tested and how many branches (one per discrete value or interval). Also stores counts of class labels for impurity and prediction backup.
Branch: Labeled by one attribute value (for categorical) or by a condition like Temperature ≤ 20°C (for continuous after discretization or thresholding).
Leaf: Stores the predicted class (usually the majority class) and optionally the class distribution for confidence estimation. A minimal code sketch of this node anatomy appears after this outline.
- Training loop (top-down induction)
- Start at the root with all examples.
- Compute impurity at the node (entropy or Gini) from class proportions.
- For each candidate attribute A:
Partition examples by A's values (categorical) or by chosen thresholds/bins (continuous).
- Compute the weighted average impurity of child nodes.
- Compute the improvement (e.g., information gain or Gini reduction) relative to the parent.
- Select the attribute that yields the greatest improvement (equivalently, the smallest weighted child impurity).
- Create child nodes for each attribute value (or interval) and assign the corresponding subsets.
- Recurse on each child until a stop rule triggers.
If a child subset is empty, assign the leaf the parent's majority class.
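To make the node anatomy above concrete, here is a minimal Python sketch (not shown in the lecture; the field names, such as fallback for the parent's majority class, are illustrative assumptions):

    from dataclasses import dataclass, field
    from typing import Any, Dict
    from collections import Counter

    @dataclass
    class Leaf:
        label: Any                       # predicted class (majority of examples reaching the leaf)
        distribution: Counter = field(default_factory=Counter)  # optional class counts for confidence

    @dataclass
    class DecisionNode:
        attribute: str                   # attribute tested at this internal node
        children: Dict[Any, Any] = field(default_factory=dict)  # attribute value (or interval) -> child node
        fallback: Any = None             # majority class at this node, used for empty or unseen branches

A Leaf carries the prediction and optional counts; a DecisionNode carries the test, one child per branch, and a default label for empty or unseen values.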
Impurity Measures and Split Selection
- Entropy (uncertainty of a node)
Definition: Entropy(S) = −Σ_i p_i log2(p_i), where p_i is the fraction of examples in S of class i. For binary classes, entropy ranges from 0 (all one class) to 1 (50/50 split). For more classes, the maximum is log2(K) when all K classes are equally likely.
- Interpretation: Higher entropy = more mixed classes (more uncertainty). Lower entropy = purer node (more certainty).
Edge cases: If p_i = 0 for any class, define 0·log2(0) = 0 (limit case) to avoid numerical issues.
- Information Gain (how much entropy is reduced by a split)
IG(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v), where v ranges over A's values.
- Weighted averaging reflects the size of each child node. Larger, impure children penalize the gain more than small, impure children.
- Selection: Choose A that maximizes IG(S, A). This prioritizes splits that create children with low entropy, especially if those children hold many samples.
- Gini Index (alternative impurity)
Gini(S) = 1 − Σ_i p_i^2. It is 0 if the node is pure and increases as classes mix.
Gini reduction for attribute A: ΔGini(S, A) = Gini(S) − Σ_v (|S_v|/|S|) · Gini(S_v).
Equivalent selection rules: either maximize ΔGini(S, A) or minimize the weighted child Gini Σ_v (|S_v|/|S|) · Gini(S_v). Be consistent in code; a short sketch of these calculations follows this list.
- Multiway vs. binary splits
Categorical attributes naturally create a branch per category (sunny/overcast/rainy). Continuous attributes are often split by thresholds (e.g., Temperature ≤ 20 vs. > 20) or discretized into bins (e.g., cold/warm/hot). Both approaches reduce the problem to discrete branch choices for the tree.
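The helper functions below are a minimal Python sketch of the formulas above; the function names (entropy, gini, split_gain) are illustrative, not taken from any particular library.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum_i p_i * log2(p_i); absent classes simply do not appear in the sum.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # Gini(S) = 1 - sum_i p_i^2
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_gain(parent_labels, child_label_lists, impurity=entropy):
        # Parent impurity minus size-weighted child impurity: information gain when
        # impurity=entropy, Gini reduction when impurity=gini.
        n = len(parent_labels)
        weighted = sum(len(child) / n * impurity(child) for child in child_label_lists)
        return impurity(parent_labels) - weighted

For example, split_gain(['yes']*6 + ['no']*2, [['yes']*6, ['no']*2]) returns about 0.811, the full parent entropy, because that split separates the classes perfectly.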
Stopping Rules and Leaf Labeling
- Stopping criteria
- Purity: If all examples at a node belong to the same class, stop and make a leaf.
- No attributes: If there are no remaining attributes to split on, stop.
- Minimum samples: If the number of examples is below a threshold (e.g., fewer than 5), stop to avoid noisy splits.
- Minimum improvement: If the best candidate split yields information gain or Gini reduction below a small threshold, stop.
- Leaf labeling
Majority class rule: The leaf's predicted label is the class with the highest count at that node. This minimizes classification error on the node's training samples (see the short sketch after this list).
- Confidence/backup: Optionally store the class distribution to estimate confidence (e.g., 80% yes, 20% no) or to break ties.
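As a concrete illustration (a sketch with assumed threshold values, not the lecture's code), the stopping checks and the majority rule can be written as:

    from collections import Counter

    def majority_class(labels):
        # Most frequent class at a node; Counter breaks ties by first-seen order.
        return Counter(labels).most_common(1)[0][0]

    def should_stop(labels, attributes, best_gain=None, min_samples=5, min_gain=1e-3):
        # Combines the four stopping criteria listed above; thresholds are illustrative.
        if len(set(labels)) <= 1:        # purity: all examples share one class
            return True
        if not attributes:               # no attributes left to test
            return True
        if len(labels) < min_samples:    # too few examples to split reliably
            return True
        if best_gain is not None and best_gain < min_gain:  # negligible improvement
            return True
        return False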
Handling Empty Children and Missing Values
- Empty child sets after a split
Sometimes no training example exhibits a certain attribute value (e.g., no day was foggy). Create the leaf and label it with the parent's majority class. This gives a reasonable default for unseen values during prediction.
- Missing attribute values in the data
- Drop rows: Simple but discards data; use only if missingness is rare and random.
- Mode imputation: Replace missing categorical values with the most common value for that attribute. For numeric, replace with a typical value or bin.
- Model-based imputation: Use a predictive model (e.g., another tree or simple classifier) to infer missing values based on other attributes.
Rationale: Better imputation preserves dataset size and often improves predictive power compared to dropping rows. A minimal mode-imputation sketch follows this list.
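Here is a small, hedged sketch of mode imputation for one categorical attribute, assuming each example is a dict and missing values are stored as None (scikit-learn's SimpleImputer with strategy="most_frequent" implements the same idea for array data):

    from collections import Counter

    def impute_mode(rows, attribute):
        # Replace missing (None) values of one categorical attribute with its most common value.
        observed = [r[attribute] for r in rows if r[attribute] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        return [
            {**r, attribute: mode} if r[attribute] is None else r
            for r in rows
        ]

For example, impute_mode(rows, "Windy") fills every missing Windy entry with the most frequent observed Windy value, so all rows can participate in splits on Windy.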
Continuous Attributes and Discretization
- Why discretize?
- Trees branch on discrete outcomes. Continuous values can be converted to a small number of intervals (bins) like cold/warm/hot so the tree can branch cleanly.
- How to create intervals
- Simple thresholds: Use domain knowledge or scan candidate thresholds (e.g., midpoints between sorted unique values) to choose splits that maximize gain or Gini reduction.
Fixed bins: Choose ranges like < 20°C, 20–30°C, > 30°C. Treat the resulting bins as categories for the tree.
- Trade-offs
Too many bins create many branches and can overfit. Too few bins may hide useful distinctions. Use validation to pick a sensible binning strategy; a short binning and threshold-scan sketch follows this list.
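The sketch below illustrates both approaches: fixed cold/warm/hot bins (the 20°C and 30°C boundaries mirror the lecture's example but are otherwise arbitrary) and a simple scan over candidate thresholds. It assumes an impurity function such as the entropy helper sketched earlier.

    def to_bin(temperature_c):
        # Fixed, non-overlapping bins; boundary values are illustrative.
        if temperature_c < 20:
            return "cold"
        if temperature_c <= 30:
            return "warm"
        return "hot"

    def best_threshold(values, labels, impurity):
        # Scan midpoints between sorted unique values and return the binary
        # (<= t vs. > t) split with the largest impurity reduction.
        n = len(labels)
        uniq = sorted(set(values))
        best_t, best_gain = None, -1.0
        for lo, hi in zip(uniq, uniq[1:]):
            t = (lo + hi) / 2
            left = [y for v, y in zip(values, labels) if v <= t]
            right = [y for v, y in zip(values, labels) if v > t]
            gain = impurity(labels) - (len(left) / n * impurity(left) + len(right) / n * impurity(right))
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain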
Pruning to Improve Generalization
- Why prune?
- Fully grown trees can memorize noise. Pruning simplifies the tree, usually improving performance on new data and making the model faster and easier to interpret.
- Pre-pruning (early stopping)
- Set limits: max depth, min samples per split/leaf, or a minimum gain threshold. If a candidate split fails these rules, stop splitting.
- Pros: Faster training, simpler trees, less overfitting from the start.
- Cons: Risk of stopping too early, missing helpful structure that would appear deeper in the tree.
- Post-pruning (cut after growing)
Process: Grow a full tree, then evaluate subtrees on validation data. Replace a subtree with a leaf if it does not improve validation performance (a reduced-error pruning sketch appears at the end of this section).
- Pros: Often finds a better trade-off between fit and simplicity than pre-pruning alone.
- Cons: Requires extra data or cross-validation; more computation.
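One common post-pruning variant is reduced-error pruning. The sketch below is an assumption-heavy illustration, not the lecture's procedure: it builds on the Leaf/DecisionNode dataclasses sketched earlier, assumes node.fallback holds the node's training majority class, and takes a predict(node, row) function that traverses the tree as in the pseudocode later in this section.

    def prune(node, val_rows, target, predict):
        # Reduced-error pruning, bottom-up: keep a subtree only if it beats a single
        # leaf predicting the node's majority class on the validation rows routed here.
        if isinstance(node, Leaf) or not val_rows:
            return node
        for value, child in list(node.children.items()):
            routed = [r for r in val_rows if r.get(node.attribute) == value]
            node.children[value] = prune(child, routed, target, predict)
        subtree_correct = sum(predict(node, r) == r[target] for r in val_rows)
        leaf_correct = sum(r[target] == node.fallback for r in val_rows)
        return Leaf(label=node.fallback) if leaf_correct >= subtree_correct else node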
Implementation Guide (step-by-step)
Step 1: Prepare the data
- Ensure each example has attribute values and a class label. Decide which attributes are categorical vs. continuous. Decide how to handle missing values (drop, mode imputation, or model-based imputation). For continuous features, decide whether to discretize into bins or consider simple thresholds.
Step 2: Choose impurity measure
- Pick entropy (with information gain) or Gini. Entropy has a clear information-theory meaning; Gini is often faster. Be consistent across the entire training process.
Step 3: Build the tree (recursive function)
- Base cases: If node is pure, or no attributes remain, or node size < min_samples, or best gain < min_gain, return a leaf labeled by majority class.
- Otherwise: For each candidate attribute:
Partition S into subsets S_v according to the attribute's values (or intervals).
- Compute weighted child impurity and the improvement (gain or reduction).
- Pick the attribute with the best improvement.
Create child nodes for each value/interval. For any empty child, create a leaf labeled with the parent's majority class.
- Recurse on each non-empty child with its subset and the remaining attributes.
Step 4: (Optional) Post-pruning
- Hold out a validation set. For each internal node, consider replacing its subtree with a leaf labeled by its majority class, and measure validation accuracy. If accuracy does not drop (or improves), keep the prune. Repeat bottom-up.
Tools/Libraries (conceptual)
- Although no specific code was shown, most ML libraries (e.g., scikit-learn, R's rpart) implement decision trees with Gini or entropy, and support pre-pruning via parameters like max_depth, min_samples_split, min_samples_leaf, and min_impurity_decrease. Missing value handling and discretization are usually performed before fitting.
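As a brief, hedged example of those scikit-learn parameters in use (the dataset and hyperparameter values are illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # Pre-pruning through the hyperparameters mentioned above.
    clf = DecisionTreeClassifier(
        criterion="entropy",          # or "gini"
        max_depth=4,
        min_samples_split=5,
        min_samples_leaf=2,
        min_impurity_decrease=0.01,
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print("validation accuracy:", clf.score(X_val, y_val))

Recent scikit-learn versions also expose ccp_alpha for cost-complexity post-pruning, which complements the pre-pruning limits shown here.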
Tips and Warnings
Be consistent with definitions: If you define "gain" as reduction in impurity, higher is better; if you minimize weighted child impurity directly, lower is better. These are equivalent criteria when implemented correctly, but do not mix them.
- Avoid too many categories in a single split: Very high-cardinality categorical features can create many branches and overfit. Consider grouping rare categories or using thresholds/encoding.
- Watch for tiny leaves: Enforce min_samples_per_leaf to avoid rules learned from very few examples.
- Validate stopping and pruning thresholds: Use a validation set or cross-validation to choose min_gain, max_depth, and other hyperparameters.
Handle missingness thoughtfully: If missingness is systematic (not random), simple imputation may bias the model; consider model-based imputation or explicit "missing" categories when appropriate.
- Continuous variables: Try several candidate thresholds and pick the one that yields the best impurity reduction, or discretize into a small number of meaningful bins.
- Interpretability: Keep trees reasonably shallow to maintain readability; pruning helps. Document leaf distributions for transparency.
Code/Execution Flow (concept in pseudocode)
- fit(S, Attributes):
- If stop_condition(S, Attributes): return Leaf(label=majority(S))
- For A in Attributes:
- Partition S into {S_v}
score[A] = impurity(S) − Σ_v (|S_v|/|S|)*impurity(S_v)
- A* = argmax_A score[A]
- Node = DecisionNode(test=A*)
- For each value v of A*:
- If S_v is empty: Node.child[v] = Leaf(label=majority(S))
Else: Node.child[v] = fit(S_v, Attributes − {A*})
- Return Node
- predict(x, Node): Traverse from root; at each DecisionNode, read attribute A* from x, follow the branch matching x[A*], continue until a Leaf; return Leaf.label.
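The following is one possible runnable Python version of this pseudocode, using entropy and dict-based nodes; the tiny play-tennis-style rows at the bottom are made up for illustration, not taken from the lecture.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def majority(labels):
        return Counter(labels).most_common(1)[0][0]

    def fit(rows, target, attributes, min_samples=1, min_gain=1e-9):
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1 or not attributes or len(rows) < min_samples:
            return {"label": majority(labels)}                 # leaf
        parent = entropy(labels)

        def gain(attr):
            groups = {}
            for r in rows:
                groups.setdefault(r[attr], []).append(r[target])
            return parent - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

        best = max(attributes, key=gain)
        if gain(best) < min_gain:
            return {"label": majority(labels)}                 # no split worth making
        node = {"test": best, "default": majority(labels), "children": {}}
        for value in {r[best] for r in rows}:
            subset = [r for r in rows if r[best] == value]
            remaining = [a for a in attributes if a != best]
            node["children"][value] = fit(subset, target, remaining, min_samples, min_gain)
        return node

    def predict(node, x):
        # Unseen values fall back to the node's stored majority class (empty-branch rule).
        while "test" in node:
            node = node["children"].get(x.get(node["test"]), {"label": node["default"]})
        return node["label"]

    rows = [
        {"Outlook": "sunny", "Windy": "no", "Play": "no"},
        {"Outlook": "sunny", "Windy": "yes", "Play": "no"},
        {"Outlook": "overcast", "Windy": "no", "Play": "yes"},
        {"Outlook": "rainy", "Windy": "no", "Play": "yes"},
        {"Outlook": "rainy", "Windy": "yes", "Play": "no"},
    ]
    tree = fit(rows, "Play", ["Outlook", "Windy"])
    print(predict(tree, {"Outlook": "overcast", "Windy": "yes"}))  # prints: yes

Storing the node's own majority class in "default" mirrors the rule for empty or unseen branches described above.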
By following these steps and principles (impurity-based selection, sound stopping rules, careful handling of empties and missing values, and prudent pruning), you can build decision trees that are both accurate and easy to explain.
04 Examples
- 💡
Play Tennis basic traversal: Input attributes include Outlook ∈ {sunny, overcast, rainy}. At prediction time, if Outlook = overcast, follow the overcast branch and land at a leaf labeled yes. Output is yes. The key point is how a single attribute test routes an example to a direct prediction.
- 💡
Entropy calculation at a mixed node: Suppose a node has 6 yes and 2 no. Class proportions are p_yes = 6/8 = 0.75 and p_no = 0.25. Entropy = −(0.75·log2 0.75 + 0.25·log2 0.25) ≈ 0.811. This shows how uncertainty is quantified before testing a split.
- 💡
Information gain for attribute Outlook: Assume splitting by Outlook yields three children with entropies 0.0, 0.5, and 0.918 and relative sizes 0.25, 0.25, and 0.5. The weighted child entropy is 0.25·0.0 + 0.25·0.5 + 0.5·0.918 = 0.584. If parent entropy was 0.811, then IG = 0.811 − 0.584 = 0.227. This demonstrates how we decide if Outlook is a helpful split. (A short script after this examples list reproduces these numbers.)
- 💡
Gini index at a node: With the same 6 yes and 2 no, p_yes = 0.75 and p_no = 0.25. Gini = 1 − (0.75² + 0.25²) = 1 − (0.5625 + 0.0625) = 0.375. Lower Gini means purer nodes; a perfect node (8 yes, 0 no) would have Gini = 0. This illustrates Gini as an alternative impurity measure to entropy.
- 💡
Stopping due to purity: If a node contains 10 yes and 0 no, entropy and Gini are both 0. The node is pure, so we stop splitting and make it a leaf. Predict yes for any example that reaches this node. This prevents unnecessary questions once the answer is certain.
- 💡
Stopping due to minimum samples: Suppose a node has only 4 examples and the min_samples threshold is 5. Even if a split is possible, we stop to avoid learning from too few examples. The leaf is labeled by majority class in those 4 samples. This guards against overfitting tiny, noisy subsets.
- 💡
Empty child handling: After splitting on Outlook, no training samples happen to be Outlook = foggy. We still create a foggy branch because the attribute allows this value. Since it's empty, we label the foggy leaf with the parent's majority class, giving a sensible default for predictions.
- 💡
Pre-pruning with minimum gain: If the best attribute at a node yields information gain of 0.01 and the min_gain threshold is 0.02, we do not split. We convert the node into a leaf labeled by majority class. This keeps the tree from adding weak, low-value branches. It improves generalization and keeps the tree small.
- 💡
Post-pruning with validation: Grow a full tree first. Then evaluate a candidate subtree and a replacement leaf (majority class) on validation data. If the leaf performs as well or better, prune the subtree. This shows how post-pruning removes over-specific structure that does not help new data.
- 💡
Missing value imputation (mode): Suppose Windy is missing for some examples. Compute the most common Windy value (e.g., no), and fill missing entries with no. Now all examples can be used in splits that reference Windy. This simple approach preserves sample size with minimal effort.
- 💡
Model-based imputation: Use a simple classifier (e.g., a small decision tree) to predict missing Windy from other attributes like Outlook and Humidity. Fill in predicted values and proceed with the main tree learning. This can outperform mode imputation when missingness is systematic. It demonstrates a smarter way to preserve information.
- 💡
Discretizing Temperature: Convert a continuous Temperature into bins: cold (< 20°C), warm (20–30°C), hot (> 30°C). Now the tree can split on Temperature with three clean branches. This avoids dealing with every unique number and keeps the tree comprehensible. It mirrors the lecture's cold/warm/hot example.
- 💡
Tie-breaking between equal gains: Two attributes, Outlook and Humidity, both yield information gain 0.15. A tie-break rule picks the attribute with fewer values (e.g., Outlook) to reduce branching. This maintains determinism and simplicity. It's a practical detail that keeps training predictable.
- 💡
Majority label at a mixed leaf: A leaf ends up with 7 yes and 3 no after pruning removes further splits. Predict yes because it's the majority class at that node. This yields the lowest training error at that leaf. It illustrates how leaves produce outputs even when not perfectly pure.
- 💡
Predict workflow on a new sample: Given x = {Outlook = rainy, Windy = yes, Temperature = warm}, start at the root. Test Outlook, go to rainy branch; test Windy, go to yes branch; reach a leaf labeled no. Output no. This shows end-to-end prediction by following attribute-based decisions.
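For readers who want to check the arithmetic in the entropy, information-gain, and Gini examples above, here is a tiny verification script; the child entropies and sizes are taken from the Outlook example as given.

    import math

    def binary_entropy(p):
        # Entropy of a two-class node with proportions p and 1-p; 0*log2(0) treated as 0.
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    parent = binary_entropy(6 / 8)                       # 6 yes, 2 no  -> ~0.811
    weighted = 0.25 * 0.0 + 0.25 * 0.5 + 0.5 * 0.918     # weighted child entropy -> 0.584
    info_gain = parent - weighted                        # ~0.227
    gini = 1 - (0.75 ** 2 + 0.25 ** 2)                   # ~0.375
    print(round(parent, 3), round(weighted, 3), round(info_gain, 3), round(gini, 3))
    # 0.811 0.584 0.227 0.375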
05 Conclusion
This lecture built a complete, practical understanding of decision trees for classification. You learned that a decision tree is a series of attribute-based questions leading from a root to a leaf, where the leaf provides the predicted class. Training proceeds top-down: at each node, choose the attribute that best improves purity, measured by entropy (with information gain) or the Gini index. You saw stopping rules that prevent endless growth and reduce overfitting, rules for labeling leaves and handling empty child sets, and how pruning, either before or after full growth, simplifies trees for better generalization. Real-world issues like missing values and continuous attributes were addressed through imputation and discretization into ranges such as cold/warm/hot.
To practice, start with a small dataset like the play-tennis example. Compute node entropies and Gini values by hand, evaluate information gain for a couple of attributes, and choose the best split. Implement a simple recursive trainer that stops on purity or minimum samples, and add a majority-class label at leaves. Then experiment with pre-pruning settings like max depth and min gain, and try a post-pruning step using a validation set. Test discrete binning on a numeric feature and compare results.
Next steps include exploring decision trees for regression, where splits aim to reduce variance instead of class impurity, and learning about ensembles like random forests and gradient-boosted trees that combine many trees for stronger performance. You might also compare entropy and Gini on your data to see if one works better, and investigate more advanced handling of missing data. Studying feature engineering and category encoding for high-cardinality attributes will further improve tree performance.
The core message is simple and powerful: pick the question that most reduces confusion, stop before noise takes over, and trim branches that don't help. Decision trees excel because they are both effective and interpretable, turning complex datasets into clear, conditional rules. With careful split selection, sensible stopping, and pruning, you can build trees that perform well on new data and are easy to explain, audit, and trust.
Key Takeaways
- Start with a clean training set and decide how to treat missing values before training. Use simple mode imputation for categorical attributes if missingness is light, and consider model-based imputation when patterns in missingness exist. Avoid dropping rows unless missingness is rare and random. Keeping more data usually improves your tree.
- Pick one impurity measure, entropy (with information gain) or Gini, and stick with it. Entropy has a clear information-theory meaning; Gini is often faster to compute. Both aim to create purer children, and performance differences are usually small. Consistency avoids logical mistakes in split selection.
- Implement clear stopping rules to prevent runaway growth. Use conditions like max depth, minimum samples per split and per leaf, and minimum information gain (or Gini reduction). These limits cut off weak, noisy branches. Well-chosen thresholds boost generalization and training speed.
- When evaluating a split, always consider weighted child impurity. Large impure children should count more than small impure children. Compute either information gain (parent entropy minus weighted child entropy) or Gini reduction. Pick the split that most reduces impurity overall.
- Handle empty child branches by assigning the parent's majority class. This ensures the model can still predict for unseen attribute values. It avoids crashes or undefined behavior during prediction. It also provides a reasonable default label when data is absent.
- Use pre-pruning for speed and simplicity, and post-pruning for final polishing. Pre-pruning via max depth and min gain prevents weak splits from ever forming. Post-pruning trims back overgrown parts of the tree after full training using validation. Combining both yields a compact, strong model.
- Discretize continuous attributes into a small number of meaningful bins or use simple thresholds. Too many bins can overfit; too few can hide patterns. Validate your binning choices to ensure they improve split quality and interpretability. Keep ranges non-overlapping and complete.
- Use deterministic tie-breaking to keep results reproducible. When two attributes score the same, pick the one with fewer values or a fixed order. This avoids randomness in structure that can confuse analysis. Consistency helps with debugging and collaboration.
- Guard against tiny leaves that come from over-splitting. Set a minimum number of samples per leaf to avoid rules based on very few cases. Tiny leaves are unstable and often memorize noise. Larger leaves give more reliable, generalizable predictions.
- Document class distributions at leaves to aid interpretation. Stakeholders can see not only the predicted class but also how confident the leaf is. This transparency boosts trust and helps spot risky, borderline rules. It can also guide where to prune or collect more data.
- Regularly validate your model on held-out data while tuning thresholds. Track validation accuracy or loss to decide pruning and stopping settings. If validation stops improving, you're likely overfitting. Use early stopping and pruning to keep only helpful structure.
- When in doubt between entropy and Gini, benchmark both on your dataset. Choose the one that gives simpler trees with equal or better validation performance. Favor interpretability and stability alongside raw accuracy. Small computational savings from Gini can add up on large datasets.
Glossary
Decision Tree
A decision tree is a model that predicts a class by asking a series of questions about features. Each question splits the data into parts. You follow branches like in a flowchart until you reach a leaf with the answer. It is easy to read and explain. It works for both classification and regression.
Node
A node is a point in the tree where something happens. Internal nodes ask a question about a feature. Leaf nodes give the final class label. Nodes store data counts to help decide what to do. They link together to form the tree.
Root Node
The root node is the very top node of the tree. It holds all the training examples at the start. The first split happens here. The choice made at the root often has a big effect on the rest of the tree. It's the starting point for every prediction path.
Leaf Node
A leaf node is an end point that outputs a prediction. It does not split further. It usually holds the majority class of the examples that reach it. Leaves are where the model makes final decisions. They are the answers to the tree's questions.
Attribute (Feature)
An attribute is a property used to describe an example. It can be categorical, like Outlook, or continuous, like Temperature. Attributes are the things the tree asks about. They help the model decide which way to go at each node. Good attributes create purer splits.
Split
A split is when a node divides its examples into groups based on an attribute. Each group becomes a branch to a child node. Splits try to make child nodes as pure as possible. The choice of split decides how the tree grows. Good splits reduce confusion about the class.
Branch
A branch is a path from a node to a child node. Each branch matches one outcome of the node's test. Following branches leads you from the root to a leaf. Branches represent different conditions in the data. They form the structure of the tree.
Class (Label)
A class is the category the model predicts. In binary classification there are two classes, like yes or no. The label is what we want to guess from features. Leaves contain the class to predict. Classes define what purity means at a node.
