Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer
Key Summary
- •Decision trees are simple, powerful models that make predictions by asking a sequence of yes/no questions about input features. They work for both classification (like spam vs. not spam) and regression (like house price prediction). The tree’s structure—what to split on and when—comes directly from the data, which is why we call them non-parametric.
- •Building a decision tree starts with all data at the root and repeatedly splits it into smaller groups that are as pure as possible. Purity means each group mostly contains one class. This recursive splitting continues until a stopping rule is met, like maximum depth, minimum samples per node, or no further gain in purity.
- •Impurity measures how mixed a node is; common metrics are Gini impurity and entropy. Gini impurity is 1 minus the sum of class probabilities squared; entropy is the negative sum of p times log2(p) across classes. Both usually give similar rankings for splits; Gini is often preferred for speed.
- •Information gain tells us how much a split improves purity. It equals parent impurity minus the weighted average of child impurities. We compute information gain for every candidate split and choose the feature/threshold with the highest gain.
- •An example tree predicting online purchase used features age, student status, and credit rating. A top split on age routes young users left and older users right. Leaves (terminal nodes) contain final decisions like “Buy: Yes” or “Buy: No.”
- •Terminology: the root is the top node; internal nodes are where splits happen; leaves assign the prediction. Internal nodes test a feature (e.g., “age ≤ 30?”). Leaves store the resulting class or value once splitting stops.
- •Decision trees have strengths: easy to understand and explain, little feature scaling needed, and they handle both numbers and categories. They work well out of the box and are often competitive on many tasks. They are also the building blocks of ensembles like random forests and gradient boosting.
- •They also have weaknesses: they can overfit by memorizing noise in the training data. Small data changes can produce a very different tree (instability). Trees may be biased toward features with many categories if you aren’t careful with how you evaluate splits.
- •To combat overfitting, we use pruning and constraints. Pre-pruning limits growth with max depth, min samples per leaf, or min impurity decrease; post-pruning removes weak branches after the tree is grown. Cross-validation helps tune these hyperparameters to generalize better.
- •Handling missing data depends on the implementation. Some tree algorithms learn how to split with missing values directly (e.g., surrogate splits). Others require imputing missing values with means, medians, or special categories before training.
- •The choice of split metric (Gini vs. entropy) is often not critical. Gini tends to be faster; entropy is information-theoretic and sometimes yields more balanced trees. In practice, both typically end up producing similar accuracy.
- •A worked Gini example: in a node with 60% class A and 40% class B, Gini is 1 − (0.6² + 0.4²) = 0.48. For the same node, entropy is −(0.6 log2 0.6 + 0.4 log2 0.4) ≈ 0.97. These numbers quantify how mixed the node is; lower is purer.
- •A worked information gain example: if a parent node’s impurity is 0.5 and it splits into two children with impurities 0.3 (60% of data) and 0.4 (40% of data), the information gain is 0.5 − (0.6×0.3 + 0.4×0.4) = 0.16 (rounded to 0.2 in the lecture for illustration). Higher information gain means a better split at that step.
- •Decision trees are sensitive to data—adding or removing a few points can change the chosen splits. This instability is a reason ensembles like random forests train many trees on different data and features. Their averaged prediction reduces variance and overfitting.
- •Decision trees can be used for regression by predicting averages at leaves and minimizing variance (often using mean squared error). The intuition is the same: split to make leaves as homogeneous as possible, but now “homogeneous” means similar numeric values. The recursive strategy and stopping rules carry over unchanged.
Why This Lecture Matters
Decision trees matter because they combine accuracy, speed, and interpretability in a single package—a rare trio in machine learning. For product managers, analysts, and data scientists, they offer a fast, reasonable baseline that can quickly surface which features drive outcomes, all in a form stakeholders can understand: simple if-then rules. In operations, trees enable quick decisions with minimal compute, suitable for real-time classification or risk scoring. For teams with messy, mixed-type data, trees are forgiving—they do not require heavy scaling, can handle categories, and often cope with missingness through simple strategies. The knowledge in this lecture solves common problems like overfitting and instability by teaching you to measure impurity, compute information gain, and set effective stopping and pruning strategies. These tools allow you to build models that generalize well and to explain your decisions clearly, which is critical in regulated settings (finance, healthcare, lending) or customer-facing applications (recommendations, support triage). In real projects, you can apply these concepts to select meaningful splits, tune hyperparameters with cross-validation, and communicate model logic to non-technical audiences using decision paths. From a career perspective, mastering decision trees equips you to use and reason about more advanced ensemble methods like random forests and gradient boosting, which are industry standards on tabular data. Understanding the building block—how trees split, measure impurity, and overfit—makes you far more effective at diagnosing and improving ensemble models. In the current industry landscape, where explainability and fairness are increasingly important, decision trees remain a highly relevant, practical, and teachable model that bridges theory and application.
Lecture Summary
01 Overview
This lecture teaches the core ideas behind decision trees, one of the most widely used and practical machine learning models. A decision tree predicts a target by asking a series of simple questions about the input’s features—like “Is age ≤ 30?” or “Is credit rating fair?”—and following branches down the tree until it reaches a leaf that contains the final prediction. The model is non-parametric, which means it does not assume a fixed form (like a line or a curve) in advance. Instead, the tree’s structure—what feature to split on and where—is discovered from the data itself. Decision trees work for both classification (choosing a category) and regression (predicting a number), and they form the backbone of powerful ensemble models like random forests and gradient boosting.
The lecture is aimed at beginners to intermediate learners who are familiar with basic supervised learning ideas: features, labels, classification vs. regression. If you know what it means to train a model on labeled data and evaluate it on a test set, you’re ready. No advanced math is required beyond understanding proportions, logs (for entropy), and averages. The lecture goes step-by-step from definitions and intuition to calculating impurity, information gain, and choosing splits, and ends with practical tips like pruning and handling missing values.
After finishing this lecture, you will be able to: explain what a decision tree is and how it makes predictions, compute Gini impurity and entropy for any node, calculate information gain for a proposed split, and describe common stopping criteria. You will also be able to articulate the main advantages (interpretability, mixed data types, low prep) and disadvantages (overfitting, instability, category bias). You’ll understand how pruning and hyperparameter constraints help prevent overfitting. Lastly, you’ll grasp how trees fit into ensembles, especially how random forests reduce overfitting by averaging many trees.
The lecture is structured in a practical flow. It starts with the high-level definition of a decision tree and a relatable example predicting whether someone will buy online based on age, student status, and credit rating. It introduces key terms—root node, internal node, leaf—and explains how the tree grows by recursively splitting the data to create increasingly homogeneous nodes. It then formalizes impurity with two metrics, Gini impurity and entropy, and shows how to compute each with a concrete numerical example (0.6/0.4 split). With impurity in hand, it defines information gain as the reduction in impurity after a split and walks through a worked calculation. Next, it covers when to stop splitting: pure leaves, maximum depth, minimum samples, and minimal information gain. The lecture closes with a balanced discussion of pros and cons, how to avoid overfitting with pruning and hyperparameters, handling missing data, model instability, and a brief comparison to random forests as an ensemble of many trees trained on different subsets of data and features.
Overall, the lesson combines intuition (trees ask simple questions to sort data), precise math (measures of impurity and gains), and practical guidance (stopping rules, pruning, cross-validation) so that you can not only understand the model but also apply it responsibly. You will leave with a complete mental model of how decision trees are constructed, evaluated, constrained, and used in real projects, as well as how they serve as the foundation for more advanced ensemble methods.
02 Key Concepts
- 01
Definition of a Decision Tree (🎯, 🏠, 🔧, 💡, 📝): A decision tree is a model that predicts a target by asking a series of simple questions about input features. It’s like a flowchart where each question routes you left or right until you reach an answer. Technically, it’s a non-parametric supervised learning method for both classification and regression. Without this model, we’d miss an interpretable, flexible baseline that often performs well quickly. Example: Predicting if a user will buy online using age, student status, and credit rating.
- 02
Non-Parametric Nature (🎯, 🏠, 🔧, 💡, 📝): Non-parametric means the model’s shape isn’t fixed ahead of time; it adapts to data. Imagine molding clay into whatever shape best fits; the data determines where the bumps and edges go. In trees, both the features chosen and the thresholds are learned from data, not pre-specified. This matters because it allows modeling complex, nonlinear patterns. Example: Instead of fitting a single straight line, a tree can split the space into multiple regions based on feature questions.
- 03
Supervised Learning Context (🎯, 🏠, 🔧, 💡, 📝): Supervised learning trains on inputs with known answers (labels) to make predictions on new inputs. It’s like studying with an answer key so you can learn how to get those answers next time. Decision trees use training data with features and target labels to learn split rules. This matters because labels guide which splits make nodes purer. Example: Emails labeled spam/not-spam help the tree learn which words or features best separate the classes.
- 04
Classification vs. Regression (🎯, 🏠, 🔧, 💡, 📝): Classification assigns categories; regression predicts numbers. Think of choosing a team (cat vs. dog) versus guessing a player’s score (a number). Trees use class impurity (Gini/entropy) for classification and variance/MSE for regression to make leaves homogeneous. This matters because the split criterion changes depending on task type. Example: Classifying customer churn vs. predicting monthly spending.
- 05
Tree Structure: Root, Internal Nodes, Leaves (🎯, 🏠, 🔧, 💡, 📝): The root is the top node; internal nodes ask feature-based questions; leaves give final predictions. Picture a 20-questions game where each answer narrows possibilities until you decide. Technically, each internal node splits data into groups based on a rule (e.g., age ≤ 30). This structure matters for clarity and interpretability. Example: A path “age ≤ 30 → student = yes → Buy: Yes” is a clear, auditable rule.
- 06
Recursive Partitioning (🎯, 🏠, 🔧, 💡, 📝): Trees grow by repeatedly splitting data to make child nodes purer. It’s like sorting socks by color, then by size, until each pile is neat. At each step, we try all candidate splits and pick the one with the highest information gain. Without this, the tree wouldn’t progressively simplify the classification problem. Example: First split by age, then within the young group, split by student status.
- 07
Impurity (🎯, 🏠, 🔧, 💡, 📝): Impurity measures how mixed a node’s classes are. Think of a jar of red and blue marbles—the more mixed, the higher the impurity. Formally, impurity can be Gini (1 − Σ p²) or entropy (−Σ p log2 p). This matters because splits aim to lower impurity to make decisions clearer. Example: A node with 60% A and 40% B has Gini 0.48 and entropy about 0.97.
- 08
Gini Impurity (🎯, 🏠, 🔧, 💡, 📝): Gini impurity quantifies class mix as 1 minus the sum of squared class probabilities. It’s like measuring how likely you’d pick two different-colored marbles if you draw twice. We compute it from the class distribution in the node; lower is better. It matters because it’s fast and commonly used. Example: With p(A)=0.6, p(B)=0.4, Gini = 1 − (0.6²+0.4²) = 0.48.
- 09
Entropy (🎯, 🏠, 🔧, 💡, 📝): Entropy measures uncertainty using information theory; higher means more disorder. Imagine how “surprised” you’d be by the next marble’s color—the more mixed, the more surprise. It’s computed as −Σ p log2 p and reaches 0 when a node is pure. It matters because it provides a principled uncertainty measure and underlies information gain. Example: With 60/40 classes, entropy ≈ 0.97 bits.
- 10
Information Gain (🎯, 🏠, 🔧, 💡, 📝): Information gain is how much impurity drops after a split. It’s like how much clearer your sorting gets after dividing socks into two piles first. Technically, gain = impurity(parent) − Σ (child_size/parent_size × impurity(child)). This matters because we pick the split with the highest gain at each step. Example: With parent 0.5, children 0.3 (60%) and 0.4 (40%), gain = 0.5 − 0.34 = 0.16 (rounded to 0.2 in the lecture).
- 11
Choosing Splits (🎯, 🏠, 🔧, 💡, 📝): At each node, we try splitting on each feature (and thresholds for numeric features) and compute information gain. It’s like testing different first questions to see which separates possibilities best. We select the feature/threshold giving the biggest gain. This matters to ensure steady progress toward pure leaves. Example: Comparing “age ≤ 30?” vs. “student = yes?” and choosing whichever yields higher gain.
- 12
Stopping Criteria (🎯, 🏠, 🔧, 💡, 📝): We stop growing when nodes are pure, depth is at a limit, samples per node are too small, or gain is too tiny. It’s like knowing when your piles are sorted enough and further sorting won’t help. Technically, these are hyperparameters like max_depth, min_samples_leaf, and min_impurity_decrease. This matters to prevent overfitting and keep the model practical. Example: Stop splitting a node that has only 3 samples or is already 100% one class.
- 13
Advantages (🎯, 🏠, 🔧, 💡, 📝): Trees are easy to interpret; you can trace a prediction path. Like reading a simple flowchart, each step is explainable. They handle mixed data types and require little scaling or normalization. This matters for quick, clear baselines and stakeholder trust. Example: Explaining to a manager, “We predicted ‘Buy’ because age ≤ 30 and student = yes.”
- 14
Disadvantages (🎯, 🏠, 🔧, 💡, 📝): Trees tend to overfit, can be unstable, and may favor features with many categories. It’s like memorizing a study guide too literally—small changes break your answers. Overfitting happens because trees can grow deep and chase noise. This matters because it hurts performance on new data. Example: A tree perfectly fits training data but performs poorly on a fresh test set.
- 15
Pruning (🎯, 🏠, 🔧, 💡, 📝): Pruning removes branches that add little value to accuracy to simplify the tree. Think of trimming a bush so it’s healthy and not messy. Pre-pruning uses constraints during growth; post-pruning grows a full tree then cuts weak parts. This matters to reduce overfitting and improve generalization. Example: Removing a branch that only fits two rare training points.
- 16
Handling Missing Data (🎯, 🏠, 🔧, 💡, 📝): Missing value handling depends on the implementation. It’s like guessing a blank on a form using context. Some trees can route missing values using learned surrogate rules; others require imputing with mean/median or a special category. This matters because real datasets often have gaps. Example: If age is missing, a surrogate split might use income to decide the branch.
- 17
Instability of Trees (🎯, 🏠, 🔧, 💡, 📝): Instability means small data changes can lead to very different trees. It’s like changing one puzzle piece and suddenly reassembling the whole picture differently. Since splits are chosen greedily, early choices can cascade. This matters because it affects reproducibility and confidence in a single tree. Example: Adding a few new records shifts the top split from “age” to “credit rating.”
- 18
Random Forests vs. Single Trees (🎯, 🏠, 🔧, 💡, 📝): A random forest is an ensemble of many decision trees trained on different data and feature subsets. It’s like asking many experts and averaging their opinions. Technically, it reduces variance by averaging predictions, often improving accuracy and robustness. This matters when single trees overfit and vary too much. Example: Training dozens of trees and predicting the class by majority vote.
03 Technical Details
Overall Architecture/Structure
-
Data flow: A decision tree takes an input vector of features (e.g., age, student status, credit rating) and predicts a target (class label or numeric value). Training begins with all training samples at the root node. The algorithm evaluates candidate splits on each feature to find the split that most reduces impurity (i.e., maximizes information gain). The dataset is partitioned into child nodes based on that split. The process repeats recursively for each child until stopping criteria are met. Leaves store the final prediction—either a class (or class probabilities) for classification or an average value for regression.
-
Roles of components:
- Root node: Entry point containing all training samples.
- Internal nodes: Decision points defined by a rule (e.g., feature ≤ threshold for numeric, feature ∈ subset for categorical).
- Edges/branches: Outcomes of the rule (true/false or category branches) that route samples to children.
- Leaf nodes: Terminal nodes that output predictions (most frequent class for classification, mean for regression).
-
Prediction path: To predict for a new sample, start at the root and apply the node’s rule. If true, go left; if false, go right (for binary trees). Repeat until reaching a leaf, then return the leaf’s stored prediction (and class probabilities if needed by using the class frequencies at that leaf).
Core Computations
-
Node impurity (classification):
- Gini impurity: G = 1 − Σ_i p_i^2, where p_i is the proportion of class i in the node. G ∈ [0, 1 − 1/C], where C is the number of classes. Pure nodes have G = 0.
- Entropy: H = −Σ_i p_i log2 p_i. H ∈ [0, log2 C]. Pure nodes have H = 0; maximum H occurs when classes are equally likely.
-
Information gain (classification):
- Gain(Split) = Imp(parent) − [ (N_L/N) Imp(left) + (N_R/N) Imp(right) ] for binary splits, where Imp is Gini or entropy, N is parent sample count, and N_L/N_R are child sample counts. For multiway splits (e.g., categorical with many values), sum over all children.
-
Impurity (regression):
- Commonly variance or mean squared error (MSE). For a node with values y_1..y_n and mean μ, variance = (1/n) Σ_i (y_i − μ)^2. Split quality is measured by reduction in variance (or MSE): Var(parent) − [ (N_L/N) Var(left) + (N_R/N) Var(right) ]. Leaves predict μ of their samples.
-
Candidate splits:
- Numeric features: Try thresholds between sorted unique values (midpoints). For feature x with sorted values v1 ≤ v2 ≤ …, evaluate splits x ≤ t for t between adjacent unique values.
- Categorical features: Depending on implementation, either try multiway splits (one branch per category) or binary splits using subsets (e.g., x ∈ S vs. x ∉ S). Some libraries binarize categories or require one-hot encoding; others natively handle categories.
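To make split enumeration concrete, here is a minimal Python sketch (assuming NumPy; the function names are illustrative, not from the lecture) that lists midpoint thresholds for a numeric feature and binary category partitions for a small categorical feature.

```python
import numpy as np
from itertools import combinations

def numeric_candidate_thresholds(values):
    """Midpoints between adjacent sorted unique values of a numeric feature."""
    v = np.unique(values)              # sorted unique values
    return (v[:-1] + v[1:]) / 2.0      # candidate thresholds t for the rule "x <= t"

def categorical_candidate_subsets(categories):
    """Non-trivial binary partitions {x in S} vs {x not in S} of a small category set.
    Note: complementary subsets describe the same split, so half are redundant."""
    cats = sorted(set(categories))
    return [set(s) for r in range(1, len(cats)) for s in combinations(cats, r)]

print(numeric_candidate_thresholds([22, 25, 31, 40]))          # [23.5 28.  35.5]
print(categorical_candidate_subsets(["red", "blue", "green"]))
```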
Worked Numerical Examples from the Lecture
- Gini example: Node with 60% class A and 40% class B: Gini = 1 − (0.6^2 + 0.4^2) = 1 − (0.36 + 0.16) = 1 − 0.52 = 0.48.
- Entropy example: Same node: H = − [0.6 log2 0.6 + 0.4 log2 0.4] ≈ − (0.6 × −0.737 + 0.4 × −1.322) ≈ 0.971.
- Information gain example: Parent impurity = 0.5. Children impurities: left 0.3 with 60% of samples; right 0.4 with 40% of samples. Weighted child impurity = 0.6 × 0.3 + 0.4 × 0.4 = 0.18 + 0.16 = 0.34. Gain = 0.5 − 0.34 = 0.16. The lecture rounded to 0.2 for illustration; the key idea is the parent minus weighted children impurities.
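These worked numbers can be reproduced in a few lines of Python; the helper functions below are illustrative and simply apply the formulas above.

```python
import math

def gini(p):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits: -sum of p * log2(p), skipping zero-probability classes."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def information_gain(parent_impurity, children):
    """children: list of (weight, impurity) pairs, weight = child size / parent size."""
    return parent_impurity - sum(w * imp for w, imp in children)

print(round(gini([0.6, 0.4]), 3))                                  # 0.48
print(round(entropy([0.6, 0.4]), 3))                               # 0.971
print(round(information_gain(0.5, [(0.6, 0.3), (0.4, 0.4)]), 3))   # 0.16
```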
Greedy, Recursive Tree-Building Algorithm
-
High-level steps:
- Start with all training samples at the root.
- If the node meets a stopping criterion (pure, too small, too deep, or no gain), make it a leaf and set its prediction (majority class or mean value).
- Otherwise, for each feature, consider candidate splits and compute information gain (classification) or variance reduction (regression).
- Choose the split with the highest improvement.
- Partition data into children according to the chosen split and recurse on each child.
-
Why greedy? The algorithm chooses the best split at each node locally rather than searching all possible trees globally (which is computationally infeasible). This local optimization typically works well in practice and is the standard approach.
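As a sketch of this greedy, recursive procedure, the toy classifier below (pure NumPy, numeric features only, Gini impurity, midpoint thresholds) follows the high-level steps above. It is a teaching aid under simplifying assumptions, not a production implementation, and the tiny dataset at the end is made up for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and midpoint threshold; return (gain, feature, threshold)."""
    n, d = X.shape
    parent = gini(y)
    best = (0.0, None, None)
    for j in range(d):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:      # midpoint thresholds
            left = X[:, j] <= t
            if left.sum() == 0 or left.sum() == n:
                continue
            weighted = (left.sum() / n) * gini(y[left]) + ((~left).sum() / n) * gini(y[~left])
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, j, t)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples_leaf=1):
    """Recursively grow a binary classification tree; leaves store the majority class."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    # Stopping criteria: pure node, depth limit, no useful split, or children too small.
    if len(classes) == 1 or depth >= max_depth:
        return {"leaf": majority}
    gain, j, t = best_split(X, y)
    if j is None or gain <= 0:
        return {"leaf": majority}
    left = X[:, j] <= t
    if left.sum() < min_samples_leaf or (~left).sum() < min_samples_leaf:
        return {"leaf": majority}
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, max_depth, min_samples_leaf),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples_leaf)}

def predict_one(tree, x):
    """Follow decision rules from the root until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Made-up data: columns are age and a 0/1 student flag; target is buy (1) vs. not (0).
X = np.array([[22, 1], [25, 0], [28, 1], [35, 0], [42, 0], [50, 1]], dtype=float)
y = np.array([1, 0, 1, 1, 0, 0])
tree = build_tree(X, y)
print(tree)
print([predict_one(tree, x) for x in X])
```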
Stopping Criteria (Pre-Pruning)
- Purity stop: If all samples at a node belong to the same class, it is pure—stop splitting.
- Max depth: Limit the depth of the tree (e.g., depth ≤ 5). Deeper trees can capture more patterns but risk overfitting.
- Min samples per split/leaf: Require a minimum number of samples to attempt a split or to be at a leaf (e.g., min_samples_split = 10, min_samples_leaf = 5). This prevents brittle rules based on tiny sample counts.
- Min impurity decrease (or min information gain): Only split if the impurity reduction is above a threshold. This avoids pointless splits that do not help generalization.
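If you use scikit-learn (one common implementation, shown here only as an example), these pre-pruning rules map directly onto constructor arguments; the synthetic dataset below is a stand-in for real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real classification problem.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Each argument corresponds to one of the stopping criteria listed above.
clf = DecisionTreeClassifier(
    criterion="gini",            # or "entropy" for information gain
    max_depth=5,                 # depth limit
    min_samples_split=10,        # minimum samples required to attempt a split
    min_samples_leaf=5,          # minimum samples allowed in a leaf
    min_impurity_decrease=0.01,  # only split if impurity drops by at least this much
    random_state=0,
)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train), "test accuracy:", clf.score(X_test, y_test))
```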
Post-Pruning
- Concept: Grow a large tree (possibly to pure leaves) and then cut back branches that do not help validation performance. This is often guided by a validation set or cross-validation.
- Benefit: Can simplify the model after fully exploring the structure, improving generalization and interpretability. Although the lecture emphasized pruning in general, practical implementations include cost-complexity pruning that balances accuracy vs. tree size.
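In scikit-learn, for instance, cost-complexity pruning exposes a path of candidate pruning strengths (ccp_alpha values) that you can search with a validation set; the sketch below uses synthetic data and an illustrative split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Grow a full tree internally and enumerate the effective alphas along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each pruning strength and keep the alpha with the best validation accuracy.
scored = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scored.append((pruned.score(X_val, y_val), alpha))
best_score, best_alpha = max(scored)
print(f"best ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")
```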
Advantages of Decision Trees
- Interpretability: Paths are human-readable rules. Stakeholders can understand and audit decisions.
- Minimal feature engineering: No need for scaling/normalization; trees can naturally handle mixed feature types.
- Flexibility: Applicable to classification and regression; can model nonlinear boundaries and interactions.
- Speed at prediction time: Traversing a tree is fast (cost proportional to tree depth, roughly logarithmic in the number of leaves for a balanced tree), suitable for real-time decisions.
Disadvantages and Caveats
- Overfitting: Deep trees can memorize noise. Constraining growth and pruning are essential.
- Instability (high variance): Small changes in data can alter splits, leading to very different trees.
- Bias toward high-cardinality categorical features: Features with many categories can create many partitions that look purer by chance. Mitigations include regularization, careful encoding, and using validation-based criteria.
Handling Missing Data
- Implementation-dependent options:
- Imputation: Fill missing values with mean/median (numeric) or mode/special category (categorical) before training.
- Surrogate splits: Learn backup splits that mimic the primary split; if the primary feature is missing for a sample, use the surrogate to route it.
- Missing-as-category: Treat “missing” as its own category for categorical features.
- Importance: Real-world datasets often contain missing entries; robust handling improves performance and usability.
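As a minimal sketch of the imputation option above (scikit-learn's SimpleImputer on made-up numbers; median is a simple, outlier-robust default for numeric features):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix (age, income) with np.nan marking missing entries.
X = np.array([[25.0, 40000.0],
              [np.nan, 52000.0],
              [31.0, np.nan],
              [45.0, 61000.0]])

imputer = SimpleImputer(strategy="median")   # fill each column's gaps with its median
print(imputer.fit_transform(X))
```

The same fitted imputer must be applied at inference time so training and prediction see consistent values.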
Practical Tips for Choosing Gini vs. Entropy
- Gini: Slightly faster to compute and usually selects similar splits; it tends to isolate the most frequent class in its own branch.
- Entropy: Information-theoretic; can sometimes produce slightly more balanced trees. In practice, accuracy differences are usually minor.
Using Decision Trees in Practice
-
Training pipeline:
- Clean data: Handle missing values (as per implementation), unify data types, check for obvious errors.
- Split into train/validation/test (or use cross-validation).
- Train a baseline tree with default hyperparameters.
- Evaluate and then tune max_depth, min_samples_split, min_samples_leaf, and min_impurity_decrease.
- Consider post-pruning if available.
- Assess stability and consider ensembles (random forests) if variance is high.
-
Evaluation metrics:
- Classification: Accuracy, precision, recall, F1, ROC-AUC; analyze confusion matrix.
- Regression: Mean squared error (MSE), mean absolute error (MAE), R².
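A compact sketch of this pipeline using scikit-learn and synthetic data follows; the grid values are reasonable starting points rather than recommendations from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; in practice, substitute your cleaned feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune the main pre-pruning hyperparameters with 5-fold cross-validation.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
    "min_impurity_decrease": [0.0, 0.001, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the selected model on the held-out test set.
print("best params:", search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```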
-
Feature importance:
- Many implementations compute feature importance as the total impurity reduction contributed by splits on that feature, normalized across the tree. While informative, it can be biased toward high-cardinality features; use with caution and validate with permutation importance when possible.
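The sketch below contrasts the two importance flavors in scikit-learn on synthetic data; the exact numbers are not meaningful here, only the habit of cross-checking impurity-based scores against permutation importance on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Impurity-based importance (computed during training) vs. permutation importance (held-out data).
print("impurity-based:", np.round(clf.feature_importances_, 3))
perm = permutation_importance(clf, X_test, y_test, n_repeats=20, random_state=0)
print("permutation:   ", np.round(perm.importances_mean, 3))
```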
Hands-On Implementation Guide (Conceptual)
-
Step 1: Prepare data.
- Ensure features and target are correctly typed (numeric/categorical). Decide how to handle missing values (impute or use implementation that supports missing). For categorical variables, check if your library needs one-hot encoding or can split on categories directly.
-
Step 2: Choose split criterion.
- Classification: Gini or entropy (information gain).
- Regression: Variance reduction (often MSE).
-
Step 3: Fit the tree.
- At the root, compute impurity. For each feature, enumerate candidate splits (thresholds for numeric; subsets or categories for categorical) and compute information gain. Select the highest-gain split.
- Partition the data into left/right child nodes (binary trees) or multiple children (multiway, if supported). Recurse on each child, applying stopping criteria.
-
Step 4: Set stopping rules.
- Use sensible defaults (e.g., max_depth 5–10 for small/medium datasets; min_samples_leaf 5–20). Monitor validation performance as depth increases; stop when adding depth does not improve validation score.
-
Step 5: Prune (optional / if supported).
- If your library supports post-pruning (e.g., cost-complexity pruning), search over pruning strengths with cross-validation and choose the best generalization score.
-
Step 6: Evaluate and interpret.
- Report metrics on a held-out test set. Visualize the tree to verify sanity (are the first splits plausible?). Extract and discuss the top decision paths for stakeholder clarity.
-
Step 7: Consider ensembles.
- If the single tree overfits or is unstable, use random forests (bagging with feature subsampling) to reduce variance and improve accuracy.
Debugging and Common Pitfalls
- Poor generalization (overfitting): Tree too deep or leaves too small. Fix by lowering max_depth, increasing min_samples_leaf, or pruning. Validate changes with cross-validation.
- Underfitting: Tree too shallow or constraints too strict. Slightly increase max_depth or decrease min_samples_leaf; verify improvements.
- High-cardinality categorical features dominate: Try encoding strategies, regularization, or switch to an implementation with balanced split handling. Validate with permutation importance.
- Missing data mishandled: Ensure consistent imputation strategy or leverage surrogate splits. Inspect distributions before and after imputation to avoid bias.
- Instability: Use ensembles or set random seeds for reproducibility. Report variability across folds.
Tools/Libraries (General Guidance)
- Many practitioners use standard libraries that implement efficient, battle-tested tree learners. For Python, scikit-learn’s DecisionTreeClassifier and DecisionTreeRegressor are common: they support Gini/entropy, max_depth, min_samples_leaf, and more. For R, rpart and partykit are popular. While the lecture did not depend on a specific tool, these libraries embody the same principles discussed: greedy splitting to maximize impurity reduction, with user-controlled stopping criteria and pruning in some cases.
Data and Model Interpretability
- Rule extraction: Each path to a leaf is a human-readable rule (e.g., if age ≤ 30 and student = yes then Buy). Listing top rules can help explain model behavior.
- Leaf probabilities: For classification, a leaf can store class counts, allowing predictions with probabilities (counts normalized by leaf size). These probabilities can be calibrated further if needed.
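For instance, scikit-learn can print each root-to-leaf path as a text rule and expose leaf class frequencies as probabilities; the Iris dataset here is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each printed path is a human-readable if-then rule ending in a leaf prediction.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Leaf class frequencies give probability estimates, not just hard labels.
print(clf.predict_proba(iris.data[:2]))
```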
From Single Trees to Ensembles
- Random forests: Train many trees on bootstrap samples (bagging) and consider random subsets of features at each split. Average their predictions (majority vote for classification, mean for regression). This reduces variance and usually improves generalization.
- Gradient boosting: Sequentially build trees that correct the errors of previous ones. Though beyond the lecture’s main scope, it relies on the same tree building blocks.
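Returning to random forests, a quick illustrative comparison (scikit-learn, synthetic data; exact scores will vary with the data and seed) shows how averaging many randomized trees typically beats a single tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=1)

tree = DecisionTreeClassifier(random_state=1)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)

# Cross-validated accuracy: bagging plus feature subsampling reduces variance.
print("single tree:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```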
Ethical and Practical Considerations
- Interpretability aids accountability: Because trees are readable, they’re suitable where explanations are required.
- Beware data leakage: Ensure that features used in splits are available at prediction time and do not leak future information.
- Fairness: Splits can inadvertently encode biases present in data. Monitor disparate impact across groups and consider constraints or preprocessing to mitigate harms.
In summary, decision trees operate by greedy, recursive partitioning of data to create pure, interpretable decision rules. Their key computations—impurity (Gini/entropy) and information gain—provide a principled way to choose splits. With sensible stopping criteria and pruning, trees become robust, understandable models and serve as the cornerstone for powerful ensembles like random forests.
04 Examples
- 💡
Online Purchase Tree Walkthrough: Input features were age, student status, and credit rating, and the target was “Buy?” yes/no. The tree first split on age ≤ 30; if true, it then checked if the person is a student: if yes → Buy; if no → Not Buy. If age > 30, it checked credit rating: fair → Buy, excellent → Not Buy. This example shows how a simple sequence of questions leads to a clear prediction path.
- 💡
Gini Calculation Example: Consider a node with 60% class A and 40% class B. Compute Gini as 1 − (0.6² + 0.4²) = 0.48. The output tells us the node is moderately impure: not all one class, but also not evenly mixed. The key point is learning to quantify “mixedness” with a single number, where lower means purer.
- 💡
Entropy Calculation Example: Using the same node (0.6/0.4), entropy is −[0.6 log2 0.6 + 0.4 log2 0.4] ≈ 0.97. This number reflects uncertainty in bits; 0 would mean a pure node. The example emphasizes that Gini and entropy both capture impurity and often rank splits similarly. It also shows how logs appear naturally in information-theoretic metrics.
- 💡
Information Gain Example: Suppose a parent node has impurity 0.5. A split produces two children: left impurity 0.3 with 60% of samples, right impurity 0.4 with 40% of samples. The weighted average child impurity is 0.34, giving information gain 0.5 − 0.34 = 0.16 (conceptually rounded to 0.2 in the lecture). This demonstrates how to measure a split’s usefulness.
- 💡
Numeric Threshold Split Example: For a feature like age with distinct values, we sort ages and try midpoints as thresholds (e.g., 29.5, 30.5). For each threshold, we compute information gain and pick the best. The output is a rule like “age ≤ 30?” that directs younger users left. The focus is on how numeric features lead to binary threshold rules.
- 💡
Categorical Split Example: For student status (yes/no), the natural split is “student = yes?” which yields two clean branches. For a multi-category feature like color {red, blue, green}, an implementation may test binary partitions (e.g., {red, blue} vs. {green}). The output is the partition with highest gain. This highlights how categories are handled differently from numbers.
- 💡
Stopping by Purity Example: Imagine a child node where all samples are “Buy.” Its impurity is 0, so no further splitting is needed. The node becomes a leaf predicting “Buy.” The key lesson is recognizing when a node is pure and should stop growing.
- 💡
Stopping by Max Depth Example: Suppose we set max_depth = 3; once a node reaches depth 3, it cannot split further. Even if a small gain is available, we enforce the limit. The output leaf gives the majority class or mean value at that node. This demonstrates how depth constraints prevent over-complex trees.
- 💡
Overfitting and Pruning Example: A fully grown tree perfectly classifies the training data but fails on validation data. By pruning away branches that only fit a few rare cases, validation accuracy improves. The output is a smaller tree that generalizes better. The emphasis is on controlling complexity to avoid memorizing noise.
- 💡
Missing Data Handling Example: If some entries for credit rating are missing, one approach is to impute them with the most common rating (e.g., “fair”). Another approach is to use a tree implementation that learns surrogate splits to route missing cases via a correlated feature like income. The output is consistent routing despite missing fields. The point is that you must decide how to handle missingness before or during training.
- 💡
Instability Illustration Example: Start with a dataset where age is the best top split. Add a few older students with excellent credit who buy; now credit rating becomes the top split. The entire tree shape changes downstream. The point is that small data edits can cascade into very different trees.
- 💡
Regression Tree Example: Predicting house price, each leaf stores the average price of houses routed there. Splits are chosen to minimize variance (MSE) in the children, making price values within each leaf similar. A node with very mixed prices gets split by features like square footage or location to reduce spread. The output is a piecewise-constant predictor that changes by region of feature space.
- 💡
Random Forest Comparison Example: Train many trees on bootstrapped samples and random feature subsets. To predict, each tree votes, and the majority (or mean) is taken as the final answer. This averaging reduces variance and typically outperforms a single tree on test data. The example underscores how ensembles tame instability and overfitting.
05 Conclusion
Decision trees learn to predict by asking a sequence of simple, human-readable questions about input features and following the answers down to a leaf. They are non-parametric models whose structure is discovered from data, making them flexible for both classification and regression. The heart of tree building is choosing splits that make nodes purer: we measure impurity with Gini or entropy (for classification) and variance or MSE (for regression), and we select the split that maximizes information gain. This process repeats recursively until stopping rules say “enough,” producing a model that is easy to trace and explain.
Despite their strengths—interpretability, minimal data prep, and broad applicability—single decision trees can overfit, become unstable, and sometimes prefer features with many categories. Practical usage, therefore, depends on wise regularization: set maximum depth, require minimum samples per leaf, apply minimum impurity decrease, and use pruning when available. Cross-validation should guide these choices so the model generalizes well beyond the training data. Handling missing values is another key step, whether by imputing or using implementations that support missing-aware splits.
In the real world, trees are not just end models but also fundamental building blocks of ensembles. Random forests, for example, train many slightly different trees and average their predictions to reduce variance and improve accuracy. Understanding single trees makes it much easier to grasp why and how these ensembles work. After mastering impurity, information gain, and stopping/pruning strategies, you can confidently deploy trees as clear baselines, interpret their decisions for stakeholders, and scale up to ensembles when you need extra performance.
The core message to remember is simple: a decision tree breaks a hard problem into a set of small, easy choices. If you make each choice reduce uncertainty (maximize information gain) and stop before the model memorizes noise (prune and constrain), you’ll build robust, understandable predictors. From there, ensembles like random forests offer a path to greater stability and accuracy while keeping the intuitive building blocks you now understand.
Key Takeaways
- ✓Start with interpretability: use a single decision tree to gain clear, rule-based insight into your data. Trace a few prediction paths to verify that top splits make domain sense. This early check often reveals data issues or feature leakage. Share these paths to build stakeholder trust.
- ✓Compute impurity correctly to pick good splits: use Gini or entropy for classification and variance/MSE for regression. Lower impurity is better, and the best split maximizes information gain. Always weight child impurities by their sample sizes. Sanity-check extreme cases (pure nodes should have impurity 0).
- ✓Try all reasonable thresholds on numeric features: sort unique values and test midpoints. This ensures you don’t miss a strong split due to a bad threshold guess. For speed, consider quantile-based candidates on very large datasets. Verify the chosen threshold improves validation performance.
- ✓Handle categories thoughtfully: if your tool doesn’t support native categorical splits, encode them (e.g., one-hot) and watch for high-cardinality bias. Group rare categories into “other” to stabilize splits. Validate that categorical splits generalize across folds. Consider target encoding only with strict leakage controls.
- ✓Control depth early to reduce overfitting: set max_depth and min_samples_leaf to reasonable values based on dataset size. Deep trees with tiny leaves are brittle and memorize noise. Use cross-validation curves to find the sweet spot. Prefer simpler trees when validation scores are similar.
- ✓Prune weak branches: after growing a tree, remove parts that don’t help on validation data. This usually shrinks the model with little or no accuracy loss. It also improves interpretability by focusing on strong rules. Use pruning strength selected via cross-validation when possible.
- ✓Use cross-validation for tuning: don’t trust a single train/test split to pick hyperparameters. Evaluate a grid (e.g., max_depth, min_samples_leaf, min_impurity_decrease) across folds. Select the simplest model within one standard error of the best score to avoid overfitting. Refit on full training data and confirm on a held-out test set.
- ✓Plan for missing values: decide on imputation or use an implementation that supports missing-aware routing. Keep the imputation strategy consistent between training and inference. If imputing, compare mean vs. median for robustness to outliers. Consider adding a “missing” indicator feature when appropriate.
- ✓Watch for instability: small data changes can reshuffle splits, especially near the top. Set random seeds for reproducibility and report variability across folds. If predictions vary too much, move to ensembles like random forests. Communicate that a single tree’s exact structure is not always stable.
- ✓Measure and communicate uncertainty: for classification, use leaf class probabilities rather than only hard labels. Calibrate if necessary using validation data. For regression, report prediction intervals if your pipeline supports them. Stakeholders appreciate quantified confidence.
- ✓Guard against high-cardinality traps: features with many categories can look deceptively good. Use validation-based selection to avoid overfitting to rare categories. Consider combining rare categories or using domain knowledge to group them. Cross-check feature importance with permutation tests.
- ✓Baseline then iterate: start with a single tree to understand signal and quick wins. If performance plateaus or variance is high, graduate to a random forest. Keep monitoring interpretability by examining representative trees or using global explanations. This staged approach balances clarity and accuracy.
- ✓Document decision paths and constraints: record max_depth, min_samples settings, and key splits that drive decisions. Clear documentation helps audits and future maintenance. It also aids reproducibility when datasets evolve. Treat model changes like code changes with versioning.
- ✓Validate with proper metrics: choose accuracy/ROC-AUC for balanced classes, and precision/recall for imbalanced ones. For regression, check MSE/MAE and R², and inspect residuals by leaf. Don’t rely on training accuracy—use cross-validation and a held-out test. Align metrics with business goals.
- ✓Know when to switch to ensembles: if a single tree overfits or is unstable, random forests or gradient boosting can help. They average many trees to reduce variance and often boost accuracy. Keep the core tree principles in mind to debug these models. Ensembles build on everything you learned here.
Glossary
Decision Tree
A model that predicts by asking a series of questions about the input and following branches to a final answer. It is non-parametric, meaning its shape is learned from data. It works for both classification and regression tasks. Each internal node asks a question, and each leaf gives a prediction.
Root Node
The very top node of a decision tree that contains all the training data at the start. It is where the first split is chosen. The quality of this first split often strongly affects the whole tree. From here, data flows down based on the decision rule.
Internal Node
A node that applies a decision rule to split the data into children. It is not a final prediction; it just asks a question to sort data better. It appears between the root and the leaves. Many internal nodes together form the decision path.
Leaf (Terminal Node)
The final node where the model stops splitting and issues a prediction. It can store a class label or a numeric value, and often class probabilities. The path to a leaf is an if-then rule. Leaves aim to be as pure as possible.
Classification
A task where the goal is to assign an input to one of several categories. Decision trees for classification use impurity measures suited to classes. They output the most frequent class at a leaf, sometimes with probabilities. The splits aim to reduce class mixing.
Regression
A task where the goal is to predict a number. Decision trees for regression try to make leaves where numbers are close together. Leaves output the average of the numbers they contain. Splits try to reduce variance (spread) of the target values.
Non-Parametric
A model type that doesn’t assume a fixed shape (like a line) before seeing data. Its structure grows and adapts based on the data patterns. Decision trees are non-parametric because their splits and depth are learned. This makes them flexible for many shapes of data.
Feature
An input variable used by the model to make a decision. Features can be numbers (age) or categories (credit rating). Trees test features at internal nodes to route data. Good features help create pure leaves.
