Stanford CS230 | Autumn 2025 | Lecture 9: Career Advice in AI
Key Summary
- •This lecture explains how to choose and fairly evaluate machine learning models so they work well on new, unseen data. The main goal is generalization, which means the model should not only do well on the training set but also on future examples. Overfitting (too complex) and underfitting (too simple) are the two big mistakes to avoid.
- •You learn to split data into three parts: training (to learn), validation (to tune), and test (to judge at the end). A common split is 70%/15%/15%, but it depends on how much data you have. The test set must be kept untouched until the final check to get an unbiased estimate.
- •Cross-validation helps when data is limited by rotating which part of data is used for validation. In k-fold cross-validation, you train on k-1 folds and validate on the remaining fold, repeating k times and averaging results. Common k values are 5 or 10, while leave-one-out uses every single point as its own validation fold.
- •Bias and variance describe different types of model errors. High bias means the model is too simple and misses patterns (underfitting), while high variance means it’s too sensitive to the training data (overfitting). Good model selection finds a balance, like aiming for the middle curve that fits the data just right.
- •For classification tasks, key metrics are accuracy, precision, recall, F1 score, and AUC (area under the ROC curve). Accuracy can be misleading when classes are imbalanced, like 90% vs 10%. Precision and recall focus on how well positives are predicted, and F1 balances them.
- •Precision measures how many predicted positives are correct, while recall measures how many actual positives are found. In a disease test, high precision means few healthy people are incorrectly told they are sick, and high recall means most sick people are found. You often trade one for the other by changing the decision threshold.
- •F1 score is the harmonic mean of precision and recall and rewards models that do both well. It is helpful when you need a single metric that captures performance for imbalanced data. The formula is F1 = 2 * (precision * recall) / (precision + recall).
- •The ROC curve shows the trade-off between true positive rate and false positive rate across thresholds. AUC summarizes the ROC curve into one number between 0.5 (random) and 1.0 (perfect). A higher AUC means the model is better at ranking positives above negatives.
- •For regression, common metrics include mean squared error (MSE), root mean squared error (RMSE), and R-squared (R²). MSE averages squared prediction errors and is sensitive to outliers. RMSE is the square root of MSE, making it easier to interpret in the same units as the target, and R² measures how much variance in the target is explained by the model.
- •Hyperparameters are the settings you choose before training, like learning rate or tree depth. Tuning them well is crucial for good performance. Techniques include grid search (try all combinations), random search (sample combinations), and Bayesian optimization (smartly pick next trials based on past results).
- •When data is too small to hold out a separate validation set, use cross-validation to tune hyperparameters. This way, every data point helps with both training and validation at different steps. Only at the end should you use the test set once, after all decisions are finalized.
- •To know which hyperparameters to tune, start with the ones known to matter for your model type from docs or papers. Focus on a few highly impactful ones rather than many at once. Experiment carefully, measure with the right metrics, and keep a clean separation between training, validation, and final testing.
- •Model selection means comparing different algorithms, features, and hyperparameter settings on a validation process. Always use the same evaluation protocol across candidates so the comparison is fair. Pick the model that best serves your problem’s goals, not just the one with the highest generic score.
- •Class imbalance can make accuracy look high even if the model ignores the rare class. Use precision, recall, F1, and AUC to see the true picture. Adjust thresholds or re-balance training to handle skewed data better.
- •The lecture’s visual examples include three fits on the same data: underfit (too straight), good fit (follows the curve), and overfit (wiggles too much). You also see five-fold cross-validation as a rotating validation scheme. These pictures help you feel the bias-variance balance and the idea of fair evaluation.
- •In practice, always record your splits, random seeds, and metrics to make your results reproducible. If results swing wildly with small changes, you may be overfitting or not using enough data. Aim for stable performance across folds and guard the test set for the very end.
Why This Lecture Matters
This lecture matters to anyone building or deploying machine learning models, such as data scientists, ML engineers, analysts, and product managers. It solves the real-world problem of models that look good in the lab but fail in production because they were overfit, poorly tuned, or measured with the wrong metric. By using proper data splits, cross-validation, and the right metrics for the job, you create systems that perform reliably on new data and withstand minor shifts over time. In practice, these ideas guide everyday work: choosing precision/recall trade-offs for risk scoring, picking AUC for model comparison before thresholding, and using F1 on imbalanced datasets. Hyperparameter tuning methods like grid, random, and Bayesian optimization help you get strong performance without wasting compute. Clear separation between training, validation, and testing protects you from data leakage and inflated claims—crucial in regulated industries like healthcare and finance. Logging your pipeline and metrics supports reproducibility, audits, and team collaboration. Mastering these fundamentals also boosts your career. Hiring teams value candidates who can design honest, rigorous experiments, interpret metrics correctly, and avoid common pitfalls. These skills travel across tools and model types, from linear models to deep learning. In today’s industry, where decisions and safety often hinge on model predictions, disciplined model selection and evaluation are not optional—they are the backbone of trustworthy AI.
Lecture Summary
01 Overview
This lecture teaches you how to choose the best machine learning model and how to evaluate it fairly so it performs well on new, unseen data. The core idea is generalization: we don’t only want a model that is good at memorizing examples it has already seen; we want one that can handle fresh cases tomorrow. You learn the twin pitfalls to avoid: underfitting (the model is too simple and misses patterns) and overfitting (the model is too complex and chases noise). To make selection and evaluation trustworthy, the lecture walks through splitting your dataset into training, validation, and test sets, and explains how cross-validation helps when you have limited data.
The scope covers both classification and regression evaluation metrics, since different problems need different scorecards. For classification, you get accuracy, precision, recall, F1 score, and AUC (area under the ROC curve). For regression, you learn mean squared error (MSE), root mean squared error (RMSE), and R-squared (R²). The lecture also addresses hyperparameter tuning, the process of choosing good values for the settings that are fixed before training, covering grid search, random search, and Bayesian optimization at a conceptual level.
This material is well-suited for beginners who have basic machine learning familiarity: you should know what a dataset is, what a model does, and the meaning of training. If you’ve seen simple models like linear regression or logistic regression, you’ll be comfortable. No advanced math is required beyond understanding that we compare predictions against real labels. If you have tried training a model once and wondered how to judge it properly, this lecture gives you a full, practical framework.
After this lecture, you will be able to: (1) clearly explain underfitting and overfitting, and recognize them in plots or results; (2) set up a clean train/validation/test workflow; (3) run k-fold cross-validation when data is small and average results properly; (4) pick and interpret the right metrics for your problem; and (5) tune hyperparameters using grid search, random search, or Bayesian ideas. You’ll also have the judgment to avoid common traps like peeking at the test set or over-trusting accuracy on imbalanced data, and you’ll know when to adjust thresholds to meet real-world needs, such as catching more true positives in a medical test.
The lecture is structured from first principles to application. It starts with the goal of machine learning—learning a function f(x) that predicts y for new x—and explains why testing only on training data is misleading. It then introduces underfitting and overfitting, with a simple plot showing a linear underfit, a balanced good fit, and an overfit that wiggles through every point. From there, it moves to model selection via the three-way data split, including a typical 70%/15%/15% ratio and when to adjust it. Next, cross-validation is introduced, including common k values like 5 and 10, plus leave-one-out for extreme cases.
With the data-splitting foundation set, the lecture explores evaluation metrics in detail. It starts with accuracy but warns that it can be misleading under class imbalance (e.g., 90% majority class). It then explains precision, recall, their trade-offs, and the F1 score as a harmonic mean. The ROC curve and AUC show how a model ranks positives vs. negatives across thresholds. Finally, the lecture shifts to regression metrics: MSE, RMSE, and R², explaining what each tells you and when they are useful.
The lecture closes with hyperparameter tuning strategies. Grid search tries all combinations in a preset grid (expensive but thorough), random search samples combinations (often faster and effective), and Bayesian optimization uses a probabilistic model to choose the next best set of hyperparameters to test (more sample-efficient). In Q&A, the instructor notes that which hyperparameters matter depends on the model; check documentation and research, and combine knowledge with experimentation. For tiny datasets without a separate validation set, cross-validation is the recommended path. The final message: use solid splits, the right metrics, and thoughtful tuning to pick a model that truly generalizes.
02 Key Concepts
- 01
Definition: Generalization means performing well on new, unseen data, not just the training examples. Analogy: It’s like learning the idea of riding a bike so you can ride any bike, not just the one in your driveway. Technical: We learn a function f(x) from pairs (xi, yi) and judge it on data it hasn’t seen to estimate future performance. Why it matters: A model that only memorizes will fail when the data changes even a little. Example: A spam filter trained on last month’s emails must still catch spam in next month’s new messages.
- 02
Definition: Underfitting is when a model is too simple and misses the true pattern. Analogy: Using a straight ruler to trace a curvy road—it won’t match the bends. Technical: High bias models make strong assumptions (like linearity) that don’t fit the data structure, causing high error on both training and test sets. Why it matters: You’ll never get good predictions because the model can’t capture key relationships. Example: Fitting a linear line to clearly curved data leaves large mistakes everywhere.
- 03
Definition: Overfitting is when a model is too complex and learns noise as if it were pattern. Analogy: Connecting every dot with a squiggly line so it passes through all points but looks wild. Technical: High variance models are very sensitive to small training set changes and often have very low training error but high test error. Why it matters: The model appears great during training but fails in real use. Example: A decision tree that splits until each leaf has one point memorizes the data and generalizes poorly.
- 04
Definition: The bias-variance trade-off is the balance between underfitting (high bias) and overfitting (high variance). Analogy: Tuning a guitar string—not too loose (flat) and not too tight (sharp). Technical: As model complexity increases, bias typically decreases and variance increases; the goal is the sweet spot with lowest expected test error. Why it matters: Choosing model complexity wisely improves real-world accuracy. Example: A moderate-degree polynomial might fit data well without wiggling excessively.
- 05
Definition: Model selection is choosing the best model, features, and hyperparameters from candidates. Analogy: Trying on different shoes to see which pair fits best for a long walk. Technical: You compare candidates using a consistent validation procedure and pick the one that optimizes the chosen metric. Why it matters: A systematic approach beats guessing and prevents bias toward a favorite model. Example: Comparing logistic regression, SVM, and a small neural network under the same cross-validation routine.
- 06
Definition: The training set is the data used to fit the model’s parameters. Analogy: It’s like practice drills before a sports game. Technical: The model updates weights to reduce a loss on training examples. Why it matters: Without training data, the model can’t learn patterns. Example: Training a house price model on past home sales with features like size and location.
- 07
Definition: The validation set is used to tune hyperparameters and choose among models. Analogy: It’s the scrimmage before the real game to see what strategy works. Technical: You evaluate different settings (e.g., learning rate, tree depth) on validation performance and pick the best. Why it matters: This prevents overfitting to the training set while guiding model choices. Example: Trying several regularization strengths and selecting the one with the highest validation F1 score.
- 08
Definition: The test set is only used at the end to estimate final performance. Analogy: It’s the championship game score—you can’t replay it after changing your strategy. Technical: The test set remains untouched during training and tuning to provide an unbiased estimate. Why it matters: Peeking at or tuning on the test set inflates reported performance and misleads. Example: After all choices are fixed, you run the final model once on the test set to report accuracy.
- 09
Definition: A common split is 70% train, 15% validation, and 15% test, but it’s flexible. Analogy: Dividing a pie into big slices for practice, a medium slice for rehearsal, and a final slice for the show. Technical: Larger datasets can afford smaller validation/test portions, while smaller datasets may need cross-validation. Why it matters: Right-sizing splits helps balance reliable estimates with enough training data. Example: With 100,000 rows, you might use 80/10/10 since even 10% is a lot.
- 10
Definition: Cross-validation is rotating which part of data is used for validation to better estimate performance. Analogy: Taking turns so that everyone gets to be the referee once. Technical: In k-fold CV, split data into k folds, train on k-1 folds, validate on the 1 remaining, and average the k results. Why it matters: It uses data efficiently and reduces variance in performance estimates. Example: In 5-fold CV, each fifth of the dataset is the validation set once, and the final score is the average.
- 11
Definition: Accuracy is the proportion of correct predictions. Analogy: It’s your test score: how many answers you got right out of all questions. Technical: Accuracy = (TP + TN) / (TP + TN + FP + FN). Why it matters: It’s simple but can mislead on imbalanced data where one class dominates. Example: Predicting the majority class in a 90/10 split yields 90% accuracy while missing all minority cases.
- 12
Definition: Precision is how many predicted positives are truly positive. Analogy: When you say someone is sick, how often are you right? Technical: Precision = TP / (TP + FP), focusing on false positives. Why it matters: High precision avoids wrongly flagging negatives as positives. Example: A medical test with high precision rarely alarms healthy people.
- 13
Definition: Recall is how many actual positives are correctly found. Analogy: Of all sick people, how many did you correctly identify as sick? Technical: Recall = TP / (TP + FN), focusing on false negatives. Why it matters: High recall avoids missing true cases that matter. Example: In disease screening, high recall means you catch most patients who truly have the disease.
- 14
Definition: F1 score is the harmonic mean of precision and recall. Analogy: It’s like asking a student to be good at both math and reading, not just one. Technical: F1 = 2 * (precision * recall) / (precision + recall), rewarding balance. Why it matters: Especially useful with class imbalance when you need one summary number. Example: A fraud detector with moderate precision and recall can still have a strong F1 if both are solid.
- 15
Definition: The ROC curve plots true positive rate vs. false positive rate across thresholds; AUC summarizes it. Analogy: It’s a score of how well you can rank real positives above negatives. Technical: AUC ranges from 0.5 (random) to 1.0 (perfect); higher is better at discrimination. Why it matters: It compares models independent of a single threshold. Example: Choosing the model with AUC 0.92 over one with 0.85 because it better separates classes overall.
- 16
Definition: Mean squared error (MSE) is the average of squared prediction errors. Analogy: It’s like squaring the distance of every missed dart from the bullseye and then averaging. Technical: MSE = (1/n) Σ (yi − ŷi)^2, sensitive to large errors. Why it matters: Encourages models that avoid big mistakes. Example: A house price model with an MSE of 10,000 measures error in squared price units; the corresponding RMSE of 100 is the typical error in actual price units.
- 17
Definition: Root mean squared error (RMSE) is the square root of MSE. Analogy: Turning the squared distances back into normal distance units you understand. Technical: RMSE = sqrt(MSE), making the metric in the same units as the target. Why it matters: Easier to interpret and communicate. Example: If predicting temperature in °C, an RMSE of 2.0 means typical errors are about 2°C.
- 18
Definition: R-squared (R²) measures how much variance in the target is explained by the model. Analogy: It’s the percentage of the story your model can tell about the outcome. Technical: R² = 1 − (SSres / SStot); values closer to 1 indicate a better fit, and a very poor model can even score below 0. Why it matters: It shows explanatory power, though it can be misleading if used alone. Example: An R² of 0.85 means 85% of the variation in prices is captured by the model inputs.
- 19
Definition: Hyperparameters are model settings chosen before training. Analogy: Oven temperature and bake time you set before making cookies. Technical: Examples include learning rate, tree depth, and regularization strength; they are not learned from data directly. Why it matters: Good hyperparameters often make the difference between mediocre and strong results. Example: Lowering the learning rate can stabilize training and improve final accuracy.
- 20
Definition: Grid search tries all combinations of selected hyperparameter values. Analogy: Checking every square on a chessboard to find a hidden coin. Technical: You define a grid (e.g., learning rate {0.01, 0.001} x depth {3, 5, 7}) and evaluate each pair. Why it matters: It’s thorough but can be expensive. Example: Testing 60 combinations across 5-fold CV to pick the best validation F1.
- 21
Definition: Random search samples hyperparameter combinations randomly from ranges. Analogy: Throwing darts at a board to explore space quickly. Technical: It often finds good settings faster because not all hyperparameters matter equally. Why it matters: More efficient than grid search for high-dimensional spaces. Example: Sampling 30 random trials can outperform a 100-point grid in practice.
- 22
Definition: Bayesian optimization uses a probabilistic model to guide which hyperparameters to try next. Analogy: Asking a smart friend where to look next based on where you already searched. Technical: It builds a surrogate model of performance and picks promising points to evaluate. Why it matters: It can find strong hyperparameters with fewer runs. Example: After a few trials, it focuses around learning rates that looked best so far.
- 23
Definition: Class imbalance happens when one class has far more examples than the other. Analogy: A classroom with 9 kids wearing blue shirts and 1 wearing red—guessing blue is right most of the time. Technical: Accuracy can be misleading; precision, recall, F1, and AUC give a truer picture. Why it matters: You may miss the rare but important cases. Example: Fraud detection datasets often have less than 1% fraud cases.
- 24
Definition: Leave-one-out cross-validation (LOOCV) uses each single point once as its own validation set. Analogy: Every player gets a personal practice session with the coach. Technical: Train on n−1 points and validate on 1, repeating n times and averaging. Why it matters: Maximizes training data but can be computationally heavy and high-variance per fold. Example: With 1,000 points, you train 1,000 times, each time holding out one point.
03 Technical Details
- Overall Architecture/Structure
Goal and Data Flow
- Input: A dataset of examples, each with features x and a target y (labels for classification or numeric values for regression).
- Training: Fit candidate models on the training set to learn parameters that minimize a loss (e.g., cross-entropy for classification, squared error for regression).
- Validation: Use a validation set, or cross-validation, to evaluate different models and hyperparameters without touching the test set.
- Selection: Choose the best-performing model and hyperparameter setting based on the validation process and chosen metric(s).
- Final Evaluation: Keep the test set strictly separate and run it once at the end to report an unbiased estimate of generalization performance.
Key Roles of Splits
- Training set: Teaches the model by adjusting internal parameters (weights) to reduce error.
- Validation set: Guides decisions among models and settings, giving a less biased view than training error.
- Test set: Acts as the final judge; it must not influence training or tuning decisions to prevent optimistic bias (data leakage).
Bias-Variance Framing
- Underfitting (high bias): The model’s assumptions are too restrictive; it fails to capture patterns, leading to high errors on both training and test data.
- Overfitting (high variance): The model is too sensitive to the specifics of the training set; it learns noise, achieving low training error but high test error.
- Target: Choose model complexity and regularization so that expected test error is minimized.
- Code/Implementation Details (Conceptual, with examples)
Data Splits
- Standard split: 70% train, 15% validation, 15% test. Adjust as needed based on dataset size.
- Practical tip: Use stratification for classification to keep class ratios similar across splits.
- Random seeds: Fix seeds (e.g., 42) so splits are reproducible.
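As a concrete illustration, here is a minimal sketch of a 70/15/15 stratified split with a fixed seed. It assumes scikit-learn (the lecture does not name a library) and uses a synthetic dataset purely as a placeholder.

```python
# Hypothetical 70/15/15 split: stratified and seeded for reproducibility.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Carve off 30% for validation + test, preserving class ratios (stratify=y).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Split that 30% in half: 15% validation, 15% test (test stays locked until the end).
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```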
Cross-Validation
- K-fold CV: Shuffle and partition data into k roughly equal folds (commonly k=5 or k=10). For each fold i from 1 to k, train on folds not equal to i and validate on fold i. Average the metric across k folds to estimate performance.
- Leave-One-Out CV (LOOCV): For each data point, train on all others and validate on that one point. Provides almost full-data training but is computationally heavy and can have high variance per fold.
- When to use: Prefer k-fold CV for small-to-medium datasets; use a single validation split for very large datasets where variance is naturally low.
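A minimal k-fold sketch, assuming scikit-learn and synthetic data; the lecture describes the procedure conceptually, so the library and model choices here are illustrative.

```python
# 5-fold cross-validation: train on 4 folds, validate on 1, rotate, then average.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # stratified folds
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print(f"mean ± std : {scores.mean():.3f} ± {scores.std():.3f}")
```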
Choosing Metrics
- Classification: • Accuracy: Fraction correct; simple but can mislead with imbalance. • Precision: TP / (TP + FP); focuses on correctness of positive predictions. • Recall: TP / (TP + FN); focuses on completeness of catching positives. • F1 Score: Harmonic mean of precision/recall; balances the two, especially valuable under imbalance. • ROC and AUC: Plot TPR vs. FPR over thresholds; AUC summarizes ranking ability.
- Regression: • MSE: Average squared error; penalizes large errors strongly. • RMSE: Square root of MSE; interpretable in target units. • R²: Proportion of variance explained; intuitive but be cautious if used alone.
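The sketch below computes the metrics listed above on toy labels and scores; sklearn.metrics is an assumed library choice and all values are made up for illustration.

```python
# Toy values only; a real project would use validation/test predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_squared_error, r2_score)

# Classification: hard labels plus a score/probability per example.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1])

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))   # uses scores, not hard labels

# Regression: MSE, RMSE (same units as the target), and R².
y_reg_true = np.array([3.0, 5.0, 2.5, 7.0])
y_reg_pred = np.array([2.8, 5.4, 2.0, 6.5])
mse = mean_squared_error(y_reg_true, y_reg_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R²  :", r2_score(y_reg_true, y_reg_pred))
```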
Thresholds and Trade-offs
- Many classifiers output scores or probabilities. Picking a threshold (e.g., 0.5) turns scores into class labels.
- Raising the threshold often increases precision (fewer false positives) but lowers recall (more false negatives), and vice versa.
- Use precision-recall curves and ROC curves to choose thresholds that best match your application’s needs.
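A small sketch of a threshold sweep on validation probabilities; the names y_val and proba_val are hypothetical placeholders, and scikit-learn's precision_score/recall_score are an assumed convenience.

```python
# y_val / proba_val stand in for validation labels and predicted probabilities.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_val     = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
proba_val = np.array([0.95, 0.40, 0.62, 0.81, 0.35, 0.55, 0.48, 0.20, 0.70, 0.10])

for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    y_hat = (proba_val >= t).astype(int)  # turn scores into class labels
    p = precision_score(y_val, y_hat, zero_division=0)
    r = recall_score(y_val, y_hat, zero_division=0)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises, fewer examples are flagged positive, which typically raises precision and lowers recall.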
Hyperparameters
- Definition: Model settings fixed before training (e.g., learning rate, regularization strength, number of layers, tree depth, kernel choice).
- Examples: • Linear models: Regularization strength (L2 penalty λ), choice of penalty (L1 vs. L2). • Trees/Forests: Max depth, min samples per split, number of trees, feature subsampling. • SVM: C (regularization), kernel type (linear, RBF), kernel parameters (gamma for RBF). • Neural nets: Learning rate, batch size, number of layers/units, dropout rate.
Hyperparameter Tuning Methods
- Grid Search: • Process: Define a discrete grid of values for each hyperparameter and evaluate all combinations. • Pros: Systematic and thorough within the grid; easy to parallelize. • Cons: Scales poorly with more parameters; wastes time on unimportant dimensions. • When to use: Small search spaces, when you want guaranteed coverage.
- Random Search: • Process: Define distributions/ranges and sample random combinations for trial. • Pros: More efficient in high-dimensional spaces; often finds strong settings quickly. • Cons: Results depend on random samples; may miss narrow, optimal regions. • When to use: Larger search spaces; early-stage exploration.
- Bayesian Optimization: • Process: Build a surrogate model (e.g., Gaussian process) of the score as a function of hyperparameters; choose next trials by balancing exploration and exploitation (e.g., via expected improvement). • Pros: Can reach strong results with fewer evaluations; adapts as it learns. • Cons: More complex to implement; overhead may not pay off for tiny problems. • When to use: Expensive training runs where each evaluation costs a lot.
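The sketch below shows grid search and random search side by side, assuming scikit-learn and SciPy are available; Bayesian optimization typically relies on a separate library and is omitted here.

```python
# Both searches tune an SVM with 5-fold CV on training data only.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=400, n_features=20, random_state=0)

# Grid search: every combination of a small, discrete grid
# (3 x 3 = 9 combinations, each evaluated with 5-fold CV).
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10],
                                       "gamma": [0.01, 0.1, 1.0]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

# Random search: 20 samples from log-uniform ranges spanning orders of magnitude.
rand = RandomizedSearchCV(SVC(),
                          param_distributions={"C": loguniform(1e-3, 1e2),
                                               "gamma": loguniform(1e-4, 1e1)},
                          n_iter=20, scoring="f1", cv=5, random_state=42)
rand.fit(X_train, y_train)

print("grid best  :", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```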
Avoiding Data Leakage
- Never tune hyperparameters on the test set.
- If you try many models and report the best test score, you are implicitly tuning on test; use a separate final holdout or nested cross-validation.
- Carefully separate preprocessing that learns from data (e.g., scaling mean/std) so it is fit only on the training portion within each fold.
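One way to enforce this, assuming scikit-learn, is to wrap preprocessing and the model in a single Pipeline so that scaling is re-fit on the training portion of every fold, as sketched below.

```python
# The scaler lives inside the pipeline, so each CV fold fits it on that fold's
# training portion only; the fold's validation data never leaks into scaling.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # learns mean/std from training part only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```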
Workflow Example (Classification)
- Prepare data: Clean, handle missing values, encode categorical variables (fit encoders on train only), and stratify splits.
- Split data:
- Option A: Train/validation/test (e.g., 70/15/15). Keep test locked.
- Option B: Train/test plus k-fold CV inside training for tuning.
- Choose baseline models: For instance, logistic regression and a small decision tree.
- Define metrics: If classes are balanced and costs symmetric, use accuracy; otherwise, consider F1 and AUC.
- Tune hyperparameters:
- For logistic regression: Try C values (the inverse of regularization strength) like {0.01, 0.1, 1, 10}.
- For decision tree: Try max_depth in {3, 5, 7, 9} and min_samples_split in {2, 10, 20}.
- Evaluate via validation or cross-validation: Record mean and standard deviation of metrics.
- Select the best candidate: Consider both average performance and stability across folds.
- Finalize model: Retrain the chosen configuration on the full training+validation data if using a hold-out validation scheme.
- Test once: Run on the held-out test set for the final, unbiased estimate.
- Report results: Include primary metric and supporting metrics (e.g., accuracy, precision, recall, F1, AUC), and document the split protocol.
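A condensed sketch of this workflow using Option B (a locked test set plus cross-validation inside training); scikit-learn, the synthetic dataset, and the specific parameter values are illustrative assumptions, not from the lecture.

```python
# Condensed classification workflow: split, tune with CV on training, test once.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (about 90% negatives, 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Hold out the test set once; keep it locked until the very end.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          stratify=y, random_state=42)

# Tune max_depth and min_samples_split with 5-fold CV on the training set,
# scored by F1 because of the class imbalance.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [3, 5, 7, 9],
                       "min_samples_split": [2, 10, 20]},
                      scoring="f1", cv=5)
search.fit(X_tr, y_tr)   # refit=True retrains the best setting on all of X_tr

# Single, final evaluation on the untouched test set.
best = search.best_estimator_
print(classification_report(y_te, best.predict(X_te)))
print("test AUC:", roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))
```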
Workflow Example (Regression)
- Prepare features: Scale numeric features if needed; be careful to fit scalers on training data only.
- Split into train/validation/test or set up k-fold CV.
- Choose models: Linear regression with regularization (Ridge/Lasso) and a random forest regressor.
- Metrics: Use RMSE for interpretability; report R² to show variance explained.
- Tuning:
- Ridge/Lasso: Regularization strength α across a log-scale grid.
- Random forest: Number of trees, max depth, min samples per leaf.
- Validate: Use CV to average RMSE across folds.
- Select and finalize: Pick the model with best RMSE and stable fold performance, retrain on combined training data, then test once.
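A brief sketch of the regression tuning step, assuming scikit-learn: Ridge regression with a log-scale grid of regularization strengths, scored by cross-validated RMSE on synthetic data.

```python
# Ridge regression tuned over a log-scale alpha grid, scored by CV RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
alphas = np.logspace(-3, 3, 7)  # 0.001 ... 1000 on a log scale

search = GridSearchCV(pipe, {"model__alpha": alphas},
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)

print("best alpha:", search.best_params_["model__alpha"])
print("CV RMSE   :", -search.best_score_)  # negate: scikit-learn maximizes scores
```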
Evaluation Metric Nuances
- Accuracy Pitfall: In imbalanced datasets (e.g., 99% negatives, 1% positives), predicting all negatives gives 99% accuracy but zero recall on positives. Prefer precision/recall/F1 and AUC.
- Precision vs. Recall: Choose based on costs. In medical screening, missing a sick patient (false negative) may be worse than alarming a healthy person (false positive), so target higher recall. In spam filtering, too many false positives can hide important emails, so precision matters.
- F1 in Practice: Good when you need one number to summarize balance under imbalance. If you value recall more, consider Fβ scores (β>1 weights recall higher); if precision more, use β<1.
- ROC/AUC: Useful when class distributions shift or thresholds vary by use case. For highly imbalanced data, also inspect precision-recall curves because ROC can look optimistic when negatives dominate.
- Regression Choices: MSE punishes large misses strongly; RMSE is interpretable in target units. R² complements them but doesn’t convey error magnitude.
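For the Fβ idea mentioned above, a tiny sketch with toy labels, using sklearn.metrics as an assumed library, shows how β shifts the weighting between precision and recall.

```python
# Toy labels: precision 2/3, recall 1/2, so F2 < F1 < F0.5 in this example.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print("F1  :", round(f1_score(y_true, y_pred), 3))
print("F2  :", round(fbeta_score(y_true, y_pred, beta=2), 3))    # weights recall higher
print("F0.5:", round(fbeta_score(y_true, y_pred, beta=0.5), 3))  # weights precision higher
```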
Cross-Validation Details
- Fold construction: Ensure folds are representative; for classification, stratify folds to maintain class ratios.
- Variance estimation: Report mean ± standard deviation across folds to show stability.
- Model refit: After picking the best hyperparameters via CV, refit on full training data (all folds combined) before testing.
- Computational cost: k-fold multiplies training time by k. Balance k with available compute; k=5 or 10 is common.
Hyperparameter Prioritization
- Which to tune: Focus on a few high-impact hyperparameters (e.g., learning rate for neural nets, C and gamma for SVM with RBF, depth for trees). Check documentation and papers to identify these.
- Search ranges: Use log scales for parameters that span orders of magnitude (e.g., 1e-4 to 1e0). Start broad, then narrow.
- Budgeting: Decide on a maximum number of trials or total compute time. Early stopping can save time for models that support it.
Documentation and Reproducibility
- Record: Data split ratios, random seeds, versions of libraries, metric definitions, and chosen thresholds.
- Store: Validation scores for each hyperparameter setting and the final test result. Keep confusion matrices and curves (ROC, precision-recall) where applicable.
- Prevent leakage: Ensure any operation that learns from data (imputation, scaling, feature selection) is applied only using training data within each fold.
Practical Tips and Warnings
- Don’t peek at the test set: Even checking it repeatedly for curiosity introduces bias. Treat test performance as a one-time report.
- Monitor learning curves: Plot training vs. validation error as you vary training set size or epochs. Diverging curves indicate overfitting; high errors on both suggest underfitting.
- Handle imbalance: Try class weighting, resampling (over/under-sampling), or focal loss (in deep learning) in addition to using better metrics.
- Calibrate probabilities: If you care about probability estimates (not just rankings), consider calibration methods like Platt scaling or isotonic regression on validation data.
- Nested CV (advanced): If you must use all data for tuning and testing without a dedicated test set, use nested CV to avoid bias (inner loop for tuning, outer loop for evaluation).
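For the nested CV option, a short sketch under the same scikit-learn assumption: the inner loop tunes hyperparameters and the outer loop estimates generalization.

```python
# Inner loop (GridSearchCV) tunes hyperparameters; outer loop (cross_val_score)
# evaluates on folds that never influenced the tuning decisions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
                     scoring="roc_auc", cv=3)          # inner loop: tuning
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="roc_auc")      # outer loop: evaluation
print(f"nested CV AUC: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```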
End-to-End Example Summary
- Suppose you have 10,000 email samples with labels spam/ham. You reserve 1,500 for validation and 1,500 for testing, training on 7,000.
- You compare logistic regression vs. a small gradient-boosted tree model. You define F1 as the primary metric due to class imbalance.
- You run random search over 30 trials for each model’s hyperparameters, using 5-fold CV inside the training set to estimate performance, log all results, and pick the best.
- You retrain the chosen configuration on training+validation data, apply a threshold tuned for desired precision/recall balance, and finally evaluate once on the test set to report F1, precision, recall, and AUC alongside a confusion matrix.
04 Examples
- 💡
Underfit vs. Good Fit vs. Overfit Plot: Imagine scattered points that follow a gentle curve. A straight line (underfit) misses the curve badly, a moderate polynomial (good fit) follows the pattern, and a high-degree curve (overfit) twists through every point. Input is the same dataset; outputs are three different fitted functions. The point is to visualize bias (too simple), balance (just right), and variance (too complex).
- 💡
Class Imbalance Accuracy Trap: Suppose 90% of samples are class A and 10% are class B. A model that always predicts A gets 90% accuracy but identifies zero B cases. Input is an imbalanced dataset; processing is training a naive majority-class predictor; output is seemingly high accuracy but zero recall for B. The key lesson is accuracy alone can be misleading.
- 💡
Precision in Disease Testing: If a model flags people as sick, precision measures how many flagged people are truly sick. High precision means few healthy people are told they are sick (few false positives). The input is people’s health features; the output is a positive/negative test result. The emphasis is on the cost of false alarms.
- 💡
Recall in Disease Testing: Recall measures how many of the actually sick people are correctly flagged. High recall means the system rarely misses a sick patient (few false negatives). The input is the same health dataset; the output is the count of true positives over all actual positives. The focus is on catching as many true cases as possible.
- 💡
F1 Score Calculation Example: Suppose precision is 0.80 and recall is 0.60. F1 = 2 * (0.80 * 0.60) / (0.80 + 0.60) = 0.69 (rounded). The input metrics are precision and recall; the processing is applying the harmonic mean; the output is a single balanced score. The emphasis is that F1 rewards balance rather than extremes.
- 💡
Five-Fold Cross-Validation: Split the dataset into 5 equal folds. For each of the 5 runs, hold out one fold for validation and train on the remaining four, then record the chosen metric. Input is the full dataset and k=5; processing is rotating validation; output is the average score and variance. The key is fair, efficient use of limited data.
- 💡
Leave-One-Out CV (LOOCV): For a dataset of 100 points, you perform 100 training runs. Each run holds out one point as the validation set and trains on the other 99. Input is the dataset and LOOCV scheme; output is the mean validation metric over 100 runs. The lesson is maximal data use for training with high computational cost.
- 💡
ROC and AUC Comparison: Two classifiers produce probability scores for positives. Plot ROC curves for both; one curve consistently lies above the other. Input is predicted scores and true labels; processing is computing TPR/FPR across thresholds; output is two ROC curves with AUCs like 0.85 vs. 0.92. The key is choosing the model with better ranking ability.
- 💡
Regression with MSE and RMSE: A house price model predicts values in dollars. Compute MSE to see average squared error and RMSE to interpret typical error in dollars. Input is predicted and actual prices; processing is MSE and RMSE formulas; output is numeric error values. The lesson is the trade-off between sensitivity to large errors and interpretability.
- 💡
R² Interpretation: A regression model achieves R² = 0.78 on validation. This means 78% of the variance in the target is explained by the model’s features and structure. Input is predictions and actuals; processing is computing SSres and SStot; output is the R² statistic. The point is understanding explanatory power, not just error size.
- 💡
Hyperparameter Grid for Trees: You try max_depth in {3, 5, 7} and min_samples_split in {2, 10, 20}. That’s 3 × 3 = 9 combinations to evaluate with 5-fold CV. Input is the parameter grid; processing is exhaustive evaluation; output is the best setting by validation F1. The key is systematic search within a limited space.
- 💡
Random Search for SVM: Sample 25 combinations of C from log-uniform [1e-3, 1e2] and gamma from log-uniform [1e-4, 1e1]. Evaluate each with 5-fold CV and record AUC. Input is distributions and sample size; processing is random sampling and evaluation; output is the best combination after 25 tries. The emphasis is efficiency in larger spaces.
- 💡
Threshold Tuning for Precision vs. Recall: A classifier outputs probabilities; you adjust the threshold from 0.5 down to 0.3. Recall increases because more positives are captured, but precision may drop due to more false positives. Input is the validation set with predicted probabilities; processing is threshold sweeping; output is a curve showing precision and recall at each threshold. The key is aligning the threshold with real-world costs.
- 💡
Stratified Splits for Classification: When splitting data, you ensure each split has similar class proportions. This avoids a fold getting too few minority class examples. Input is labeled data; processing is stratified sampling; output is balanced train/validation/test splits. The lesson is fairer, more stable evaluation.
- 💡
Reporting Final Results: After tuning, retrain the chosen model on all training+validation data, then evaluate once on the test set. Report the main metric plus supporting ones and include notes on the split and CV protocol. Input is the finalized model and the untouched test set; processing is a single evaluation; output is the final performance report. The key is honesty, reproducibility, and avoiding leakage.
05 Conclusion
This lecture gives you a complete, practical framework for picking and fairly testing machine learning models so they truly generalize. It begins with the core goal—performing well on new data—and uses the twin ideas of underfitting (too simple, high bias) and overfitting (too complex, high variance) to frame the challenge. You learned to structure your process with training, validation, and test sets, including when to adjust the common 70/15/15 split and how cross-validation lets you reuse limited data to get steady estimates. For classification, you now know why accuracy can mislead under imbalance and how precision, recall, F1, and AUC paint a clearer picture. For regression, MSE, RMSE, and R² give complementary views of error size and explanatory power.
On the tuning side, hyperparameters like learning rate and tree depth are key levers for performance. You saw three mainstream strategies: grid search for exhaustive, small spaces; random search for efficient coverage in bigger ones; and Bayesian optimization for sample-efficient improvement guided by a surrogate model. A unifying message is to keep the test set locked away until the very end to avoid data leakage and inflated scores. Also, match your metric and threshold to the real-world costs of mistakes—false positives and false negatives matter differently in different domains.
To practice, start by training two or three simple baseline models on a dataset you know, using a clear train/validation/test split. Try k-fold cross-validation when you have limited data, define one primary metric and a couple of supporting ones, and log results for each hyperparameter trial. Tune a small number of high-impact hyperparameters first, and visualize ROC or precision-recall curves to choose thresholds. Then retrain the selected model on all non-test data and run a single, final test evaluation.
For next steps, learn more about calibration for probabilities, resampling and class weighting for imbalanced data, and nested cross-validation for fully rigorous comparison. Explore advanced model families and regularization techniques, and consider automated hyperparameter tuning libraries. The core message to remember is this: structure your evaluation carefully, choose metrics that match your goals, and separate training, tuning, and testing. If you do, you’ll build models that don’t just look good on paper—they actually work when it counts.
Key Takeaways
- ✓Always separate training, validation, and testing to keep evaluations honest. Use validation (or cross-validation) to tune choices and reserve the test set for the final, single check. Peeking at the test set during tuning leads to inflated scores and bad surprises in production. Treat the test set as a locked box until the end.
- ✓Choose metrics that match your business costs, not just what’s common. Accuracy can mislead when classes are imbalanced; prefer precision, recall, F1, and AUC. Decide whether false positives or false negatives hurt more and tune thresholds accordingly. Document why you chose your metric and threshold.
- ✓Start with simple baselines before complex models. Baselines reveal whether fancy methods add real value. If a simple logistic regression matches a deep model, prefer the simpler one for speed and interpretability. This saves time and reduces overfitting risk.
- ✓Use cross-validation when data is limited. K-fold CV gives a steadier estimate than a single split and uses data efficiently. Report mean and standard deviation across folds to show stability. Stratify folds for classification to preserve class ratios.
- ✓Tune a small number of high-impact hyperparameters first. For SVMs with RBF, focus on C and gamma; for trees, depth and min samples per split; for neural nets, learning rate. Search ranges on log scales for parameters spanning orders of magnitude. Start broad, then narrow based on results.
- ✓Prefer random search over large, high-dimensional grids. Random search often finds strong settings faster by sampling broadly. Use grid search only for small, carefully chosen spaces. When training is very costly, consider Bayesian optimization for sample efficiency.
- ✓Align thresholds to your use case after choosing a model. Adjusting the decision threshold trades precision for recall. Use validation curves (precision-recall and ROC) to pick a value that balances your costs. Revisit thresholds when data drift changes conditions.
- ✓Watch for signs of underfitting and overfitting in learning curves. High training and validation error together suggests underfitting—add complexity or features. Low training error but high validation error suggests overfitting—simplify or regularize. Keep monitoring as you iterate.
- ✓Prevent data leakage at every step. Fit preprocessing (scaling, imputation, encoding) on training data only and apply to validation/test. Don’t tune hyperparameters based on test results. Keep pipelines modular to enforce these boundaries.
- ✓Use multiple supporting metrics to get a full picture. For classification, alongside accuracy, report precision, recall, F1, and AUC. For regression, pair RMSE with R² to show both error size and explained variance. This helps stakeholders understand trade-offs.
- ✓Document everything for reproducibility. Record data splits, random seeds, metric definitions, and the final configuration. Keep trial logs for each hyperparameter setting and cross-validation fold. This enables audits, debugging, and fair comparison later.
- ✓Handle imbalanced data deliberately. Use class weighting, resampling, or specialized losses, and evaluate with precision/recall and F1. Examine confusion matrices to see mistake types. Don’t rely on accuracy as your north star in skewed datasets.
- ✓Right-size your splits for your dataset and compute budget. With larger datasets, smaller validation and test slices still yield stable estimates. With small datasets, prefer cross-validation over a large holdout. Balance evaluation reliability with enough training data.
- ✓Refit your final model on all non-test data after tuning. This lets the model learn from as much data as possible before the final test. Then evaluate once on the untouched test set and report the result. Avoid further changes after seeing test performance.
- ✓Communicate results with clarity and context. Don’t just give a single number—explain metric choice, threshold, and data conditions. Include variability (e.g., standard deviation across folds) to show stability. This builds trust with teammates and stakeholders.
Glossary
Generalization
Generalization means a model performs well on new data it has never seen before. It’s not just memorizing examples; it understands patterns that apply to fresh cases. Good generalization is the main goal of machine learning. Without it, your model is just overfitting the past. A generalizing model stays useful when the world changes slightly.
Overfitting
Overfitting happens when a model is too complex and learns noise along with real patterns. It fits the training data very closely, maybe perfectly, but fails on new data. This often shows up as low training error and high test error. Overfitting means the model is too sensitive to small changes in data. It usually needs regularization or simpler structure.
Underfitting
Underfitting happens when a model is too simple to capture the real patterns in data. It makes strong assumptions (like everything is a straight line), which are not true. This leads to high errors on both training and test sets. The model needs more complexity or better features. Without fixing it, you won’t get good predictions.
Bias
Bias is error from overly simple assumptions about the data. High bias means the model is too rigid and misses patterns. It often causes underfitting, with poor performance even on training data. Lowering bias usually involves making the model more flexible. The trick is not to swing too far and cause overfitting.
Variance
Variance is error from being too sensitive to the training data. High variance means small changes in data produce very different models. This often causes overfitting: great training scores but bad test scores. Reducing variance often means simplifying the model or adding regularization. Balancing variance with bias is crucial.
Model selection
Model selection is the process of choosing the best model and settings from a set of candidates. It compares algorithms, features, and hyperparameters fairly. You use validation or cross-validation to guide the choice. The goal is the best generalization for your problem. This is not guessing; it’s measured decision-making.
Training set
The training set is the portion of data used to fit the model’s parameters. The model learns by reducing error on these examples. It should be large enough to capture patterns. You never evaluate final performance on the training set. Training is your practice before the real test.
Validation set
The validation set is used to tune hyperparameters and compare models. It is not directly used to fit the model’s parameters. You measure performance here to guide choices. This helps avoid overfitting to the training data. It’s like rehearsal before the final show.
