Stanford CS230 | Autumn 2025 | Lecture 6: AI Project Strategy

Key Summary

  • This lecture explains decision trees as simple, rule-based models for classification and regression. A decision tree splits data by asking yes/no questions about features until the remaining data in a group is mostly one label (pure). Each leaf makes a final prediction by majority vote for classification or by average value for regression. Trees are powerful, easy to understand, and highly interpretable.
  • Building a decision tree means choosing the best feature and split point at each step to reduce impurity. Impurity measures how mixed the classes are inside a node; two common measures are Gini impurity and entropy. The best split is the one with the highest information gain, which is the parent impurity minus the weighted impurity of the children. Splitting continues until stopping rules say to stop to avoid overfitting.
  • Gini impurity is 1 minus the sum of squared class probabilities, and it is zero when all samples in the node belong to one class. Entropy is the negative sum of p times log2 p for each class, and it also equals zero when a node is perfectly pure. When a node has a 50-50 class mix, both Gini and entropy are high, signaling disorder. Lower impurity means the model is more confident about labels in that node.
  • Information gain measures how much impurity drops after a split. You compute the parent's impurity, compute child impurities, take their size-weighted average, and subtract that from the parent. The larger the information gain, the better the split. This process is repeated recursively to grow the tree.
  • Decision trees can easily overfit because they keep splitting to fit every tiny pattern, including noise. To prevent this, you limit depth, require a minimum number of samples to split, or require a minimum number of samples per leaf. These are called stopping criteria and act like brakes for tree growth. Good stopping rules make trees generalize better to new data.
  • Random forests are ensembles of many decision trees trained on different bootstrapped samples and random subsets of features. Bootstrapping means sampling the training data with replacement to form a new training set for each tree. At each split, a random subset of features is considered instead of all features, which decorrelates trees. Final predictions are made by averaging (regression) or majority vote (classification).
  • Random forests work well mainly because they reduce variance. A single tree is high-variance: tiny data changes can build a very different tree. Averaging many such trees cancels out individual quirks but keeps shared signal. This makes random forests accurate and robust to overfitting compared to a single tree.
  • Decision trees are very interpretable: you can read their if-then rules and understand decisions. They handle both numeric and categorical features and work for classification and regression. But they can be unstable, biased toward features with many levels (because more splits are possible), and prone to overfitting. Random forests trade interpretability for higher accuracy and robustness.
  • Choosing the number of trees in a random forest is a practical tuning decision. Generally, more trees improve performance but with diminishing returns and higher computational cost. A good approach is to start with a few hundred trees and increase until validation performance plateaus. Cross-validation helps find the sweet spot efficiently.
  • Other ensemble methods include gradient boosting and AdaBoost, which build trees sequentially, each new tree correcting previous errors. Bagging builds multiple models on bootstrapped data without random feature subsampling. Stacking combines different model types by training a meta-model on their outputs. These methods all aim to improve accuracy and generalization by combining models.
  • Handling missing data in trees and forests can be done in multiple ways. You can impute missing values with mean, median, or mode; create a special 'missing' branch; or skip using a feature if it’s missing. Different libraries implement different strategies automatically, so check documentation. Proper handling keeps splits meaningful and maintains performance.
  • Overall, the lecture frames both trees and forests inside supervised learning and loss minimization. Trees build structure by greedily choosing splits that most reduce impurity. Forests amplify stability and accuracy by aggregating many diverse trees. The result is a practical toolkit that balances simplicity, interpretability, and real-world performance.

Why This Lecture Matters

Decision trees and random forests are core tools for anyone working with structured, tabular data—data scientists, ML engineers, analysts, and product teams. They offer a rare mix of power and practicality: trees provide clear, explainable rules, and forests deliver robust, high-accuracy predictions without heavy feature engineering. Knowing how impurity, information gain, and stopping criteria work lets you diagnose overfitting, bias toward high-cardinality features, and instability. Understanding bootstrapping and feature subsampling explains why random forests outperform single trees and guides you in tuning the number of trees and feature subset sizes.

This knowledge directly solves common problems: choosing a baseline model that’s fast to build and explain; improving performance with ensembles; and handling missing data in a way that preserves predictive power. It helps you ship real models quickly—trees for transparent decisions (like policy or compliance) and forests for production accuracy (like risk scoring or churn prediction). Mastery of these methods strengthens your ML toolkit and prepares you to evaluate and select among other ensemble approaches like boosting and stacking.

In today’s industry, random forests remain competitive on many tabular tasks and are strong baselines even when deep learning is popular elsewhere. They scale reasonably, work with mixed data types, and provide feature importance for insight. Learning these fundamentals sets you up for better model selection, clearer communication with stakeholders, and stronger, more trustworthy deployments.

Lecture Summary

01 Overview

This lecture teaches the fundamentals of decision trees and random forests, two of the most widely used supervised learning methods. Supervised learning means we train a model on labeled examples so that it can predict labels for new inputs. Decision trees build a set of simple if-then rules to split data into smaller and purer groups, eventually ending at leaves that make predictions. Random forests take this a step further by combining many trees trained on different slices of data and features to produce a stronger, more stable model.

The lecture starts with the big picture: most supervised learning algorithms aim to minimize a loss function, which measures how wrong predictions are. While algorithms differ in how they search for the best model, their core goal is the same. Decision trees are highlighted as being interpretable, easy to visualize, and usable for both classification (predicting categories) and regression (predicting numbers). Their structure is intuitive: nodes ask questions about feature values, edges route data according to answers, and leaves store final predictions.

You learn how to construct a decision tree by repeatedly choosing the best split. The best split is defined using impurity measures like Gini impurity or entropy, which quantify how mixed the classes are in a node. The split that maximizes information gain (the drop in impurity from parent to children) is chosen. This greedy process is applied recursively until stopping criteria signal that the model should stop growing to avoid overfitting. Stopping can be enforced by limits on tree depth or minimum samples required for splits and leaves.

Random forests are introduced as an ensemble method—an approach that combines multiple models to improve performance. Each tree in a random forest is trained on a bootstrapped sample (sampling with replacement) of the training data. Additionally, at each node, a random subset of features is considered for splitting. These two sources of randomness decorrelate trees so that their errors don’t line up. Final predictions are aggregated: majority vote for classification or averaging for regression.

The lecture explains why random forests are so effective: they reduce variance, which is the tendency of a model to change a lot in response to small changes in the training data. A single tree is a high-variance model and can be unstable. Averaging many high-variance models dramatically stabilizes predictions without increasing bias too much. As a result, random forests usually generalize better than a single tree and are more robust to overfitting.

You also learn pros and cons. Decision trees are interpretable, handle both numeric and categorical features, and are easy to implement. But they can overfit, be biased toward features with many levels, and be sensitive to small data changes. Random forests are more accurate and robust, offer feature importance measures, but are less interpretable, more computationally expensive, and require careful hyperparameter tuning like selecting the number of trees.

The lecture covers practical questions: how to pick the number of trees (start with hundreds and grow until validation performance plateaus), other ensemble methods (gradient boosting, AdaBoost, bagging without feature subsampling, and stacking across different model types), and how to handle missing data (impute, add a missing-value branch, or skip the feature, depending on implementation). These details anchor the theory in everyday modeling choices.

By the end, you understand the mechanics and motivations behind decision trees and random forests. You can explain how impurity, information gain, and stopping rules guide tree growth; how bootstrapping and feature subsampling build diverse forests; and how averaging reduces variance. You also gain practical wisdom about interpretability trade-offs, hyperparameters, and data issues like missing values. The lecture equips you to confidently build, tune, and evaluate trees and forests for both classification and regression tasks.

02 Key Concepts

  • 01

    Decision Trees (Definition): A decision tree is a model that predicts outcomes by asking a sequence of feature-based questions. It’s like following a flowchart of yes/no checks to reach a final answer. Technically, it organizes data into nodes (questions) and leaves (predictions) by splitting on feature values. This matters because it turns complex patterns into simple rules that are easy to explain. Example: Deciding whether to go surfing by checking weather conditions like 'sunny,' 'humidity ≤ 80%,' and 'windy.'

  • 02

    If-Then Rules (Interpretation): Decision trees can be read as a stack of if-else rules. It’s like reading code that says, 'if condition1 then A, else if condition2 then B.' Technically, each path from root to leaf represents a conjunction of conditions that leads to a prediction. This matters because it gives transparency—users can see exactly why a prediction was made. Example: 'If weather is sunny and humidity ≤ 80%, then go surfing; else if overcast, go surfing; else if raining and windy, don’t go.'

  • 03

    Nodes, Edges, and Leaves (Structure): Nodes hold the question, edges hold the answer direction, and leaves hold the prediction. It’s like a quiz where each question sends you to the next question or to an answer page. Technically, internal nodes split data by a threshold on a feature, and leaves store the label distribution or average. This matters because organized structure makes prediction fast and logic clear. Example: A leaf might contain 90% class A and 10% class B, so it predicts class A.

  • 04

    Supervised Learning Framing (Context): Trees and forests are supervised learners that aim to minimize prediction error. It’s like training for a test by checking mistakes and improving. Technically, at each split, they greedily reduce impurity, indirectly minimizing loss on classification tasks. This matters because it connects trees with the broader family of models driven by optimizing a target. Example: A tree that reduces node impurity typically lowers misclassification on validation data.

  • 05

    Impurity (What and Why): Impurity measures how mixed labels are inside a node. It’s like checking how messy a drawer is: all socks is tidy (pure), socks plus many other items is messy (impure). Technically, common metrics are Gini impurity and entropy; both are zero when all samples are the same class. This matters because splitting should make groups purer so predictions are more confident. Example: A node with 50% cats and 50% dogs is very impure and worth splitting.

  • 06

    Gini Impurity (Metric): Gini is defined as 1 minus the sum of squared class probabilities. It’s like asking, 'What is the chance two random picks from this node are of different classes?' Technically, Gini is minimal at zero when one class has probability 1 and maximal around balanced splits. This matters because it’s fast to compute and guides which splits are best. Example: With two classes at p = 0.5 each, Gini = 1 − (0.25 + 0.25) = 0.5.

  • 07

    Entropy (Metric): Entropy is the negative sum over classes of p times log2 p. It’s like measuring disorder: a mixed bag is high-entropy, a uniform bag is low-entropy. Technically, entropy peaks when classes are equally likely and is zero when one class dominates. This matters because it drives information gain, a core splitting criterion. Example: For a 50-50 split in two classes, entropy equals 1 bit.

  • 08

    Information Gain (Criterion): Information gain is the impurity reduction from splitting a node. It’s like tidying a big messy drawer into two smaller, tidier drawers and asking how much neater things got. Technically, IG = impurity(parent) − weighted average impurity(children). This matters because the split with the highest information gain is chosen. Example: If a split drops entropy from 0.9 to a weighted 0.3, the information gain is 0.6 and likely preferred.

  • 09

    Recursive Partitioning (Process): Trees keep splitting nodes into smaller groups until they’re pure enough or limits are reached. It’s like slicing a cake again and again where each cut tries to separate flavors. Technically, the algorithm greedily searches over features and thresholds to maximize information gain at each step. This matters because it builds structure that captures complex decision boundaries. Example: Weather → Humidity → Wind creates clear regions for 'go' or 'don’t go' surfing.

  • 10

    Stopping Criteria (Regularization): Stopping rules prevent trees from growing too deep and memorizing noise. It’s like setting a bedtime so you don’t overdo it. Technically, you can cap max depth, require a minimum number of samples to split, or require a minimum per leaf. This matters because without it, trees overfit and fail on new data. Example: Stop splitting if a node has fewer than 10 samples or depth exceeds 5.

  • 11

    Tree Predictions (Leaves): Final predictions come from leaves: majority class for classification or average value for regression. It’s like taking a vote or calculating the mean of numbers in a group. Technically, the leaf stores either a class distribution or a mean target value. This matters because it turns structure into usable outputs. Example: A leaf with targets [12, 11, 13] predicts 12 for regression.

  • 12

    Random Forests (Ensemble): A random forest combines many decision trees to improve accuracy and stability. It’s like asking many friends and taking a vote instead of trusting one person. Technically, each tree trains on a bootstrapped dataset, and each split considers a random subset of features. This matters because averaging decorrelates errors and reduces variance. Example: 300 trees each trained on different samples and feature subsets, then vote on class.

  • 13

    Bootstrapping (Data Sampling): Each tree is trained on a sample drawn with replacement from the original data. It’s like drawing marbles from a bag and putting each one back before the next draw. Technically, this creates overlapping but distinct training sets, encouraging tree diversity. This matters because diversity among models is key to strong ensemble performance. Example: A dataset of 10,000 rows might yield bootstraps of 10,000 rows with duplicates.

  • 14

    Feature Subsampling (Split-Time Randomness): At each node, a random subset of features is considered for splitting. It’s like telling each tree to look at a few clues instead of all clues. Technically, this reduces correlation between trees because they don’t all chase the same dominant feature. This matters because decorrelated trees average out noise better. Example: With 50 features, a tree might consider only 7 randomly chosen features at a split.

  • 15

    Aggregation (Voting/Averaging): Forest predictions aggregate over all trees. It’s like a democratic vote for classification and an average for regression. Technically, classification uses majority vote; regression uses mean of outputs. This matters because aggregation smooths out instability of single trees. Example: If 180 trees say class A and 120 say class B, final prediction is class A.

  • 16

    Variance Reduction (Why Forests Work): Single trees have high variance; small data changes can produce very different trees. It’s like a weather vane that swings wildly with gusts. Technically, averaging many high-variance estimators reduces variance while keeping bias manageable. This matters because lower variance means better generalization to new data. Example: The average of many trees has far steadier accuracy across test sets.

  • 17

    Pros and Cons (Trade-offs): Trees are interpretable, flexible, and easy to visualize, but can overfit, be unstable, and bias toward features with many levels. It’s like a clear but sometimes too-eager problem solver. Forests are accurate and robust with feature importance, but less interpretable and more computationally heavy. This matters because choosing a model means balancing clarity, speed, and performance. Example: Use a single tree to explain policy decisions; use a forest when accuracy is paramount.

  • 18

    Choosing Number of Trees (Tuning): More trees often improve performance but with diminishing returns. It’s like adding more judges to a panel: beyond a point, verdicts barely change but meetings take longer. Technically, validation curves help find where adding trees stops improving accuracy. This matters to control compute cost while keeping accuracy high. Example: Start at 200 trees and raise to 800 until cross-validation accuracy plateaus.

  • 19

    Other Ensembles (Landscape): Boosting methods like gradient boosting and AdaBoost add trees sequentially, each fixing previous errors. Bagging trains multiple models on bootstraps without random feature subsampling. Stacking blends different model types by training a meta-model on their outputs. This matters because ensembles provide multiple ways to improve generalization. Example: Combine a random forest, a boosted tree model, and a neural net in a stacked system.

  • 20

    Missing Data (Practical Handling): Trees and forests can handle missing values in several ways. You can impute with mean/median/mode, create a dedicated 'missing' branch, or skip that feature when missing. Technically, libraries differ in default behavior, so confirm how your tool works. This matters because mishandling missing values can break good splits. Example: If 'humidity' is missing, route to a 'missing' branch rather than guessing incorrectly.

03 Technical Details

Overall Architecture/Structure

  • Problem framing: Decision trees and random forests are supervised learning methods for classification and regression. Training data consists of feature vectors X and labels y. The learning goal is to produce a function f(x) that predicts y with minimal error on new, unseen data.
  • Decision tree structure: A binary decision tree consists of internal nodes (decision points), edges (routing according to a condition), and leaves (final predictions). At each internal node, a split rule tests a single feature against a threshold (for numeric features) or a subset of categories (for categorical features). Data reaching a node is partitioned into left and right child nodes according to the rule. This recursive partitioning continues until stopping criteria are met, at which point leaves store the prediction for any future sample routed there.
  • Random forest structure: A random forest is an ensemble of many decision trees. Each tree is trained on a different bootstrapped sample (sampling with replacement) of the original training set, introducing diversity in training data. During training of each tree, at every node, only a random subset of features is considered for splitting; this further diversifies trees. At inference, the forest aggregates predictions across trees—majority vote for classification or mean for regression—to form the final output.

Data Flow

  1. Input data: Start with a dataset containing rows (examples) and columns (features), plus labels. Preprocess as needed (e.g., encode categories numerically if required by your library; many tree implementations handle categorical splits directly or via encoding).
  2. Training decision tree: Route all data to the root. Evaluate candidate splits (feature, threshold) by measuring impurity reduction (information gain). Choose the best split and partition data into left/right child nodes. Repeat recursively for each child until stopping rules apply. Assign predictions to leaves (majority class or average value).
  3. Training random forest: For each tree, sample a bootstrap dataset from the training data (same size as the original but drawn with replacement). Train a decision tree on this bootstrap, but at each split consider a random subset of features (e.g., sqrt(M) features for classification, where M is total features). Repeat for T trees. Store all trained trees.
  4. Inference: For a new input, push it down each tree to reach a leaf prediction. Aggregate across trees: majority vote (classification) or average (regression). Return the aggregated prediction.

Split Criteria and Impurity

  • Impurity motivation: A useful split creates child nodes where labels are more homogeneous than in the parent. Homogeneity means the node contains mostly a single class. Impurity quantifies the lack of homogeneity.
  • Gini impurity: Gini(node) = 1 − Σ_i p_i^2, where p_i is the proportion of class i in the node. Properties: 0 when a single class has probability 1; maximum occurs for balanced classes (e.g., two classes at 0.5 each give Gini = 0.5). Computationally efficient and commonly used in practice.
  • Entropy: Entropy(node) = − Σ_i p_i log2 p_i. Properties: 0 for pure nodes; higher values for more uniform class mixes (two classes at 0.5 each give entropy = 1 bit). Has information-theoretic interpretation; often used with information gain.
  • Information gain (IG): IG = Impurity(parent) − [ (N_L/N) * Impurity(left) + (N_R/N) * Impurity(right) ], where N is parent size, N_L and N_R are child sizes. The best split maximizes IG. For multiway splits (e.g., category branches), apply the same weighting across all children.
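
The formulas above can be checked directly in code. Below is a minimal NumPy sketch (an illustration, not code from the lecture; the function names are ours) that computes Gini impurity, entropy, and the information gain of a candidate split from raw label arrays.

```python
import numpy as np

def gini(labels):
    """Gini(node) = 1 - sum_i p_i^2; zero for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy(node) = -sum_i p_i * log2(p_i); zero for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# A 50-50 node is maximally impure; a clean split removes all of the impurity.
parent = np.array(["cat"] * 20 + ["dog"] * 20)
left, right = parent[:20], parent[20:]
print(gini(parent))                            # 0.5
print(entropy(parent))                         # 1.0 bit
print(information_gain(parent, left, right))   # 1.0
```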

Algorithm for Building a Decision Tree (Greedy, Top-Down)

  1. Initialization: Start with all training samples at the root.
  2. Stopping check: If stopping conditions hold (max depth reached, samples fewer than min_samples_split, or node is already pure), make this node a leaf and assign prediction (majority class or average target).
  3. Candidate splits: For each feature, enumerate possible thresholds. For numeric features, potential thresholds often come from midpoints between sorted unique values. For categorical features, consider binary partitions of categories (implementation-dependent; often restricted for efficiency).
  4. Evaluate splits: For each candidate, compute child impurities and information gain. Choose the feature and threshold with the highest IG.
  5. Split and recurse: Partition data accordingly and create left/right child nodes. Recursively apply steps 2–5 on each child.
  6. Leaf assignment: For classification, store class proportions (for probabilities) and majority class (for hard prediction). For regression, store average target value.
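
To make the greedy recipe concrete, here is a compact from-scratch sketch under simplifying assumptions (binary classification, numeric features only, Gini impurity, dictionary-based nodes). It illustrates the steps above rather than serving as a reference implementation.

```python
import numpy as np
from collections import Counter

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature, threshold, gain) for the split with the highest information gain."""
    parent_impurity, n = gini(y), len(y)
    best = (None, None, 0.0)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2:    # midpoints between sorted unique values
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            gain = parent_impurity - weighted
            if gain > best[2]:
                best = (f, t, gain)
    return best

def build_tree(X, y, depth=0, max_depth=5, min_samples_split=2):
    # Stopping check: pure node, too few samples, or max depth reached -> make a leaf.
    if len(set(y)) == 1 or len(y) < min_samples_split or depth >= max_depth:
        return {"leaf": True, "prediction": Counter(y).most_common(1)[0][0]}
    feature, threshold, gain = best_split(X, y)
    if feature is None or gain <= 0.0:              # no split reduces impurity
        return {"leaf": True, "prediction": Counter(y).most_common(1)[0][0]}
    mask = X[:, feature] <= threshold
    return {"leaf": False, "feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_split),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_split)}

def predict_one(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]

# Tiny demo: the label is 1 exactly when the first feature exceeds 0.5.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
tree = build_tree(X, y)
print(predict_one(tree, np.array([0.9, 0.2, 0.4])))  # expected: 1
```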

Stopping Criteria (Regularization within Trees)

  • Max depth (max_depth): Limit how many levels the tree can grow. Prevents overly complex, deep trees prone to overfitting.
  • Minimum samples to split (min_samples_split): Do not split nodes with fewer than this number of samples. Forces splits to be statistically meaningful.
  • Minimum samples per leaf (min_samples_leaf): Ensure each leaf holds at least this number of samples. Avoids tiny leaves that memorize noise.
  • Pure node or no improvement: Stop if all samples are the same class or if no split reduces impurity.

Prediction Mechanics

  • Routing: A test sample traverses the tree by evaluating split rules at each node until reaching a leaf.
  • Classification output: Hard label via majority class. Probabilities via class distribution stored at the leaf.
  • Regression output: Predicted value equals the mean of training targets in that leaf.

Random Forest Training Details

  • Bootstrapping: For each tree, sample N examples with replacement from the original N. On average, about 63.2% of unique examples appear at least once in a bootstrap sample. Duplicates are allowed, which gives each bootstrap its own emphasis and omissions.
  • Feature subsampling at splits: At each node, consider only m randomly chosen features (m < M). Typical defaults: m = sqrt(M) for classification. This prevents dominant features from being chosen by all trees, increasing diversity.
  • Diversity and decorrelation: Two trees trained on the exact same data and features tend to look similar and make similar errors. Bootstrapping and feature subsampling force trees to explore different partitions, so their errors are less correlated. Lower error correlation makes averaging more effective at variance reduction.
  • Aggregation: After training T trees, prediction for a new sample aggregates across trees. For classification, the predicted class is the mode of tree votes. For regression, the predicted value is the mean of tree outputs.
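
As a sketch of how these pieces fit together, the loop below hand-rolls a small forest on synthetic data (an illustration under assumed tooling: NumPy plus scikit-learn; the helper names are ours). It bootstraps rows explicitly and delegates per-split feature subsampling to `DecisionTreeClassifier(max_features="sqrt")`, then aggregates by majority vote. In practice you would normally reach for `RandomForestClassifier`, which packages exactly this recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)     # bootstrap: n rows drawn with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",          # sqrt(M) features per split
                                      random_state=int(rng.integers(2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Majority vote across trees (for regression, average the outputs instead).
    votes = np.stack([t.predict(X) for t in trees])     # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = fit_forest(X, y, n_trees=100)
print("ensemble training accuracy:", (predict_forest(forest, X) == y).mean())
```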

Why Random Forests Reduce Variance

  • Single decision trees are high-variance estimators; small perturbations in data (e.g., adding/removing a few samples) can significantly change splits and structure. This instability leads to fluctuating test performance. Ensembles average across many such unstable learners, canceling out idiosyncratic errors while preserving common signal. The result is a model that generalizes better than any single tree.
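
A quick numerical illustration of this argument (an assumed setup, not an experiment from the lecture): fit many unpruned regression trees on bootstrap resamples of one noisy dataset, then compare how much a single tree's predictions vary with how much a 30-tree average varies.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# One fixed, noisy training set and a grid of test points.
X = rng.random((200, 1))
y = np.sin(4 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)
x_grid = np.linspace(0, 1, 200).reshape(-1, 1)

# 300 unpruned (high-variance) trees, each fit on its own bootstrap resample.
preds = []
for _ in range(300):
    idx = rng.integers(0, len(X), size=len(X))
    preds.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]).predict(x_grid))
preds = np.array(preds)                                   # shape (300, 200)

# Spread of individual trees vs. spread of 30-tree averages at the same test points.
var_single = preds.var(axis=0).mean()
var_forest = preds.reshape(10, 30, -1).mean(axis=1).var(axis=0).mean()
print(f"avg prediction variance, single tree:  {var_single:.4f}")
print(f"avg prediction variance, 30-tree mean: {var_forest:.4f}")  # roughly 30x smaller
```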

Advantages and Disadvantages

  • Decision trees advantages: Interpretability (readable if-then rules), support for categorical and numeric features, applicability to classification and regression, fast inference, easy visualization.
  • Decision trees disadvantages: Overfitting risk without strong stopping or pruning, instability (sensitive to small data changes), bias toward features with many levels (more potential split points lead to higher apparent information gain), and potentially suboptimal global structure due to greedy splitting.
  • Random forests advantages: Higher accuracy and better generalization than single trees, robustness to overfitting through averaging, built-in feature importance via split usage or impurity reduction, and strong performance with minimal preprocessing.
  • Random forests disadvantages: Reduced interpretability (must inspect many trees), higher computational and memory cost, sensitivity to hyperparameters like number of trees and feature subsampling size, and longer training time as trees increase.

Hyperparameter Tuning (Random Forest Focus)

  • Number of trees (n_estimators): Increasing T usually improves performance up to a plateau; beyond that, gains vanish but compute cost rises. Practical approach: start with a few hundred and grow until validation accuracy stabilizes.
  • Feature subset size (max_features): Controls m, the number of features considered at each split. Smaller m increases diversity but may miss strong features; larger m makes trees more similar and can reduce ensemble benefit.
  • Tree depth and leaf sizes: Even in forests, excessively deep trees can overfit bootstraps; gentle constraints (max_depth, min_samples_leaf) can improve stability and speed.
  • Evaluation strategy: Use cross-validation to compare settings and pick the sweet spot between accuracy and cost.
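
A minimal sketch of this tuning loop, assuming scikit-learn and a synthetic dataset standing in for real data: grow `n_estimators` stepwise and stop once cross-validated accuracy stops improving meaningfully (the 0.001 threshold is an arbitrary choice for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

prev_score = 0.0
for n_trees in [100, 200, 400, 800]:
    model = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                   n_jobs=-1, random_state=0)
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{n_trees:4d} trees: cross-validated accuracy = {score:.4f}")
    if score - prev_score < 0.001:       # plateau: gains are now negligible
        print("Improvement has flattened; stopping here.")
        break
    prev_score = score
```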

Handling Missing Data

  • Imputation: Replace missing numeric values with mean or median and categorical values with mode. Simple and widely supported.
  • Missing branch: Create a special route when the splitting feature is missing, keeping that uncertainty explicit. Some implementations support this natively.
  • Skip feature if missing: When evaluating a sample, if a feature needed for a node is missing, some libraries can default to other behavior; check documentation.
  • Library behavior: Different libraries vary; always confirm defaults and options to avoid unintended routing of missing values.
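
One common route is imputation before training. The sketch below (assuming scikit-learn; the column layout is purely illustrative) fills numeric gaps with the median and categorical gaps with the mode inside a pipeline, so the same rule is applied at both training and prediction time.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Illustrative layout: columns 0-2 are numeric, column 3 is a label-encoded category.
numeric_cols, categorical_cols = [0, 1, 2], [3]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),             # median for numeric gaps
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),  # mode for categorical gaps
])

model = Pipeline([
    ("impute", preprocess),
    ("forest", RandomForestClassifier(n_estimators=300, random_state=0)),
])

# Tiny demo with missing entries marked as np.nan.
X = np.array([[1.0, 2.0, np.nan, 0],
              [np.nan, 1.5, 3.0, 1],
              [0.5, np.nan, 2.5, 1],
              [1.2, 2.2, 2.8, 0]])
y = np.array([0, 1, 1, 0])
model.fit(X, y)
print(model.predict(X))
```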

Bias Toward Features with Many Levels

  • Phenomenon: Features with many distinct values offer many possible split points, making it easier (by chance) to achieve large apparent impurity reductions. The greedy algorithm may overuse such features even if they aren’t truly most predictive.
  • Mitigation: Use strong validation, constrain tree depth, and consider feature engineering or grouping categories. Random forests reduce the impact by feature subsampling, which limits repeated selection of such features across trees.

Computation and Performance Considerations

  • Training complexity (trees): Evaluating all candidate splits can be costly, especially with many features and large datasets. Efficient implementations sort values once per feature per node and reuse scan-based computations.
  • Training complexity (forests): Scales roughly linearly with the number of trees; parallel training helps because trees are independent. Memory usage rises with more trees and deeper structures.
  • Inference: Prediction is fast: routing down O(depth) nodes per tree and aggregating across T trees. Depth control keeps inference predictable.

Step-by-Step Implementation Guide

Step 1: Prepare data

  • Gather training features X and labels y. Handle categorical data as required by your chosen library (some support categorical splits directly; others require encoding). Decide how to handle missing values: impute, special branch (if supported), or leave to library defaults.

Step 2: Train a decision tree (baseline)

  • Choose impurity metric: Gini or entropy for classification. Set stopping criteria: max_depth (e.g., 5–15), min_samples_split (e.g., 2–20), min_samples_leaf (e.g., 1–10). Fit the tree on (X, y). Inspect the learned structure and evaluate on validation data.
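
A scikit-learn sketch of this baseline step, with synthetic data standing in for your dataset and hyperparameter values drawn from the ranges above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # or "entropy"
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=0,
)
tree.fit(X_train, y_train)

print(export_text(tree, max_depth=2))     # inspect the learned if-then rules
print("train accuracy:", tree.score(X_train, y_train))
print("val accuracy:  ", tree.score(X_val, y_val))
```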

Step 3: Train a random forest

  • Choose number of trees (start with 200–500). Choose max_features (e.g., sqrt(M) for classification). Optionally set gentle depth/leaf constraints for stability. Fit the forest on (X, y). Evaluate validation performance and adjust n_estimators upward until performance plateaus.
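
A matching sketch for the forest step, again on placeholder data: fit a few hundred trees with sqrt(M) features per split, then add trees and watch validation accuracy. Setting `warm_start=True` lets scikit-learn reuse already-grown trees when `n_estimators` is raised, so the loop stays cheap; the specific sizes are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",     # sqrt(M) features considered at each split
    min_samples_leaf=2,      # gentle constraint for stability
    warm_start=True,         # keep existing trees and only add new ones on refit
    n_jobs=-1,
    random_state=0,
)

for n_trees in [200, 400, 800]:
    forest.n_estimators = n_trees
    forest.fit(X_train, y_train)
    print(f"{n_trees:3d} trees: validation accuracy = {forest.score(X_val, y_val):.4f}")

# Feature importances feed the interpretation step that follows.
top = forest.feature_importances_.argsort()[::-1][:5]
print("top 5 features by importance:", top)
```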

Step 4: Interpret and diagnose

  • Feature importance: Review which features are used often or reduce impurity the most.
  • Stability: Compare validation to training performance to check overfitting.
  • Speed: Measure training and inference time as trees increase.

Step 5: Handle practical issues

  • Missing data: Apply imputation or choose library options for missing split handling.
  • Class imbalance: Although not a primary topic here, ensure evaluation metrics reflect class ratios.
  • Reproducibility: Set random seeds for bootstrapping and feature subsampling.

Tips and Warnings

  • Don’t overgrow trees: Deep, unconstrained trees memorize noise; set reasonable depth and leaf sizes.
  • Prefer validation curves over guessing: Increase n_estimators until improvements flatten, then stop to save compute.
  • Beware feature leakage: If a feature encodes target information improperly (e.g., post-outcome data), trees will exploit it aggressively; clean data carefully.
  • Understand defaults: Libraries differ on impurity metrics, handling categorical splits, and missing values; always check documentation.
  • Expect interpretability trade-offs: If explainability is key, a single, well-regularized tree is easier to communicate than a forest; if accuracy is key, prefer the forest.

By following these technical steps and considerations, you can build decision trees that are transparent and random forests that are robust, tuning them to your data’s size, complexity, and accuracy needs.

04 Examples

  • 💡

    Surfing Decision Tree: Input features are weather conditions: 'sunny', 'overcast', 'rainy', humidity percentage, and a 'windy' flag. The tree splits first on 'weather': if overcast → go surfing; if sunny → check if humidity ≤ 80% → go, else don’t; if rainy → check 'windy' → if windy don’t go, else go. The output is a binary decision: go or don’t go. This example shows how real-world decisions become simple if-then rules.

  • 💡

    Binary Classification Purity Check: Suppose a node contains 40 samples: 20 cats, 20 dogs. Gini = 1 − [(0.5)^2 + (0.5)^2] = 0.5; entropy = 1 bit, both signaling high impurity. A split that sends 18 cats and 2 dogs to the left child, and 2 cats and 18 dogs to the right child, would lower impurity in both children. The key point is that good splits create purer groups than the parent.

  • 💡

    Information Gain Calculation: Parent node has 100 samples with entropy 0.9. A split yields left child 60 samples (entropy 0.3) and right child 40 samples (entropy 0.4). Weighted child entropy = (60/100)*0.3 + (40/100)*0.4 = 0.34; information gain = 0.9 − 0.34 = 0.56. The instructor emphasized choosing the split with the highest information gain.

  • 💡

    Stopping by Max Depth: Train a tree with max_depth = 5. Even if more splits are possible, the algorithm stops at depth 5 and turns nodes into leaves. This prevents modeling tiny fluctuations that don’t generalize. The key point is regularization through structural limits.

  • 💡

    Stopping by Minimum Samples: Set min_samples_split = 20 and min_samples_leaf = 10. Nodes with fewer than 20 samples won’t be split, and any split must leave at least 10 samples in each leaf. This guards against creating narrow, overfitted leaves. The example shows how sample thresholds stabilize trees.

  • 💡

    Leaf Predictions for Regression: At a leaf with numeric targets [10, 12, 11, 13], the regression output is the average, 11.5. If another leaf holds [30, 31, 29], it predicts 30.0. Averaging within leaves smooths noise and yields a simple numeric prediction. This demonstrates how trees handle regression tasks.

  • 💡

    Bootstrapped Training Sets: For a forest of 300 trees on a dataset with 50,000 rows, each tree samples 50,000 rows with replacement. Some rows repeat, some are absent, creating varied training views. This diversity makes each tree different. The example underlines how bootstraps power ensemble strength.

  • 💡

    Random Feature Subsets at Splits: With 60 features, set max_features ≈ sqrt(60) ≈ 8 for classification. At each node, the tree randomly picks 8 features to consider for splitting, rather than all 60. Different nodes and trees see different subsets, decorrelating the forest. This shows why forests don’t over-focus on a single strong feature.

  • 💡

    Majority Vote Aggregation: In a 500-tree forest, 310 trees vote for class A, and 190 vote for class B. The final prediction is class A by majority vote. If it were regression, the outputs would be averaged to get a final number. This example highlights ensemble aggregation mechanics.

  • 💡

    Variance Reduction Intuition: Train two trees on slightly different splits of the data; their predictions change notably on some points. Train 300 such trees and average them; the prediction curve becomes smoother and more stable. The fluctuations cancel out while the main pattern remains. The key point is that averaging reduces variance.

  • 💡

    Bias Toward Many-Level Features: A categorical feature with 30 unique values yields many potential splits, often showing large impurity drops by chance. A tree might prefer this feature repeatedly, even if it’s not truly predictive. In a forest, feature subsampling reduces repeated selection of this feature. The example warns about split-count bias.

  • 💡

    Choosing Number of Trees via Plateau: Start with 200 trees and measure validation accuracy: 85%. Increase to 400 trees: 86.2%. Increase to 800 trees: 86.3%, and to 1200: 86.3%. The performance plateau at 800 signals a good stopping point to save compute.

  • 💡

    Handling Missing Values with Imputation: For a humidity feature with some missing entries, fill missing with median humidity. Train the tree/forest normally. Predictions are more stable because splits on humidity are now always evaluable. This example shows a simple, effective missing-data strategy.

  • 💡

    Missing-as-a-Branch Strategy: Suppose 'windy' is occasionally missing. During splitting on 'windy', create three routes: windy=true, windy=false, and windy=missing (if the library supports it). Samples with missing 'windy' follow the missing branch and get predictions consistent with similar missing cases. This preserves information about missingness.

  • 💡

    Ignoring Feature When Missing: If an implementation skips a feature when it’s missing, the sample may be routed based on other splits down the tree. The model still reaches a leaf using available features. This approach avoids guessing but relies on alternative signals. The example stresses checking library behavior for missing values.

05 Conclusion

This lecture built a complete understanding of decision trees and random forests by grounding them in supervised learning and loss minimization. Decision trees split data into purer groups using impurity measures (Gini or entropy) and choose splits that maximize information gain. The tree grows recursively until stopping rules like max depth or minimum samples prevent overfitting, and leaves make predictions by majority class or average value. Random forests then combine many such trees trained on bootstrapped data and random feature subsets, aggregating predictions to reduce variance and improve generalization.

The main trade-offs are clear: decision trees offer interpretability and simplicity but can overfit and be unstable; random forests deliver higher accuracy and robustness but at the cost of interpretability and greater computational load. Practical tuning focuses on the number of trees—growing until validation performance plateaus—and sensible constraints on tree depth and leaf sizes. The lecture also connected these models to the broader ensemble landscape (bagging, boosting, stacking) and addressed everyday issues like handling missing data via imputation, missing branches, or skipping features depending on library support.

To practice, start by fitting a single decision tree with Gini impurity and experiment with max_depth, min_samples_split, and min_samples_leaf to observe overfitting and generalization. Next, train a random forest, increase n_estimators stepwise (e.g., 200 → 400 → 800), and track validation accuracy to find the plateau. Inspect feature importance to understand which inputs drive predictions, and test different missing-data strategies to see their effects. A small project could compare a single tree and a forest on the same dataset, documenting interpretability vs. accuracy and computing cost.

Your next steps include exploring other ensembles like gradient boosting and AdaBoost, which build trees sequentially and often reach state-of-the-art accuracy on tabular tasks. Deepen your understanding of cross-validation for reliable model selection and practice fair comparisons across hyperparameters. The core message to remember is this: trees convert complex patterns into simple rules, and forests turn many such rules into a stable, accurate predictor. Use trees when clarity is essential, forests when performance is paramount, and always validate your choices with data.

Key Takeaways

  • Start with a decision tree to get a transparent baseline. Keep max_depth modest and set min_samples_split and min_samples_leaf to prevent overfitting. Inspect the learned splits to ensure they make domain sense. Use this interpretability to spot data issues or leakage early.
  • Choose a split criterion that fits your needs: Gini impurity is fast and common; entropy offers an information-theoretic view. Compute information gain as parent impurity minus the weighted child impurities. Always prefer the split with the largest gain. Validate that splits generalize by checking performance on held-out data.
  • Apply stopping criteria aggressively to avoid overfitting. Limit max_depth to a sensible range and ensure leaves have enough samples. Stop splitting when impurity cannot be meaningfully reduced. This keeps trees simple, stable, and faster to train and predict.
  • Use random forests when accuracy and robustness are more important than a single-model explanation. Train with bootstrapping and random feature subsampling to create diverse trees. Aggregate predictions by voting or averaging to reduce variance. Expect better generalization than a single tree.
  • Tune the number of trees by watching for performance plateaus. Start with a few hundred trees and increase until validation metrics stabilize. Stop adding trees when improvements flatten to save compute. Record accuracy vs. n_estimators to justify your choice.
  • Set max_features thoughtfully to balance diversity and strength. Smaller subsets increase tree diversity but might miss strong features; larger subsets can reduce ensemble gains. Use common defaults like sqrt(M) for classification as a starting point. Adjust based on validation results.
  • Handle missing data before or during training. If your library doesn’t handle missing splits, impute with median (numeric) or mode (categorical). If supported, consider a dedicated 'missing' branch to preserve missingness information. Document your choice and its effect on validation performance.
  • Watch out for features with many levels grabbing too many splits. High-cardinality features offer many thresholds and can look falsely attractive to the greedy algorithm. Use feature subsampling in forests and validate importance carefully. Consider grouping categories or encoding strategies if needed.
  • Measure both training and validation performance to detect overfitting. A big gap signals the tree is memorizing noise. Tighten stopping criteria or move to a random forest. Keep a validation log as you change hyperparameters.
  • Use feature importance from forests to guide iteration. High-importance features can be audited for quality and leakage. Low-importance features might be candidates for removal to simplify models. Always confirm importance aligns with domain knowledge.
  • Balance interpretability and performance based on the use case. For decisions that must be explained precisely, a small tree may be best. For competitive accuracy in production, a random forest is usually stronger. Communicate this trade-off to stakeholders clearly.
  • Leverage cross-validation to select hyperparameters reliably. Single train/validation splits can mislead due to randomness. Use k-fold CV to average results and reduce variance in estimates. Base your final settings on stable cross-validated performance.
  • Plan for computational cost as you scale forests. Training time and memory grow with the number and depth of trees. Parallelize training if possible and cap depth where sensible. Monitor resource use to maintain practical deployment targets.
  • Use trees and forests as strong baselines for tabular problems. They often beat or match more complex models without heavy preprocessing. If more accuracy is required, compare against boosting or stacking. Keep the tree/forest as a reference point for fairness.
  • Keep randomness reproducible during experiments. Set random seeds for bootstrapping and feature subsampling. This ensures you can replicate and compare results meaningfully. It also helps debug changes across runs.
  • Document your split criteria, stopping rules, and missing data strategy. These choices strongly shape model behavior and results. Clear documentation speeds up reviews and future iterations. It also aids compliance and auditability in regulated contexts.

Glossary

Supervised learning

A type of machine learning where the model learns from examples that include inputs and the correct outputs (labels). The goal is to map from inputs to outputs so the model can predict well on new data. It’s like practicing with answer keys to learn how to answer similar questions later. Decision trees and random forests are both supervised learners.

Decision tree

A model that makes predictions by asking a series of yes/no questions about feature values. Each question splits the data into smaller groups that are more uniform. You keep splitting until you reach a point where a simple prediction can be made. The final answers live at the leaf nodes.

Node

A point in the decision tree where something happens. Internal nodes ask a question to split data, while leaf nodes give the final prediction. Nodes help organize the decision-making process in steps. The root node is the very first node at the top of the tree.

Edge

A connector between nodes that represents the result of a question. Following an edge means taking the path that matches the yes/no (or condition) answer. Edges guide the flow from one question to the next. They eventually lead to a leaf node.

Leaf (leaf node)

The end point of a path in a decision tree where the final prediction is made. It stores either a class label distribution (for classification) or an average value (for regression). Leaves summarize what the data in that branch looks like. Predictions come directly from leaves.

Root node

The topmost node in a decision tree where all data starts. The root asks the first, most informative question to split the data. Choosing a good root split sets up the rest of the tree for success. It is the entry point for all predictions.

Feature

An input variable used to make decisions in the model. Features can be numeric (like humidity) or categorical (like weather type). Splits test a feature against a threshold or category set. Good features help the model predict accurately.

Label (target)

The correct output we want the model to learn to predict. In classification, labels are categories; in regression, labels are numbers. The model trains to map features to labels with minimal error. Labels guide learning and evaluation.

Tags: decision tree, random forest, gini impurity, entropy, information gain, supervised learning, bootstrapping, feature subsampling, ensemble learning, majority vote, variance reduction, overfitting, stopping criteria, feature importance, cross-validation, hyperparameters, classification, regression, missing data, bagging