Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation

Intermediate | Stanford | Machine Learning | YouTube

Key Summary

  • Kernel methods turn simple linear algorithms into powerful non-linear ones. Instead of drawing only straight lines to separate data, they let us curve and bend the boundary by working in a higher-dimensional feature space. This keeps training simple while unlocking complex patterns.
  • A feature map, often written as φ(x), transforms the original input into new features. A linear model in this new space becomes a non-linear model back in the original space. This trick helps solve problems where straight-line boundaries fail.
  • In a 2D example, a straight line could not separate blue and orange points, but a circle could. By mapping (x1, x2) to features like (x1^2, x2^2, √2 x1 x2), we can create models that behave like circles or other curves in the original space. This shows how feature maps enable non-linear decision boundaries.
  • The kernel trick avoids computing φ(x) explicitly. A kernel function k(x, x') equals the dot product of φ(x) and φ(x') in feature space. This lets algorithms use rich features without the heavy cost of building them.
  • The polynomial kernel k(x, x') = (x^T x' + c)^d represents all polynomial combinations up to degree d. It saves the cost of explicitly enumerating all polynomial features, which can explode as d grows. With it, we get non-linear power using only the kernel function.
  • A worked example: with c=0 and d=2, (x^T x')^2 equals x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2'. This matches the dot product of feature vectors [x1^2, x2^2, √2 x1 x2]. So a degree-2 polynomial kernel implicitly computes the same as a specific quadratic feature map.
  • Choosing a feature map is part art and part science. Common, well-tested choices include the polynomial and Gaussian (RBF) kernel families. Domain knowledge and experimentation guide which kernel and parameters to use.
  • Mercer’s theorem tells us how to recognize valid kernels. A function k is a valid kernel if it is symmetric and produces a positive semi-definite (PSD) kernel matrix on any dataset. This guarantees an underlying feature map exists.
  • To check whether a matrix is PSD, ensure all eigenvalues are non-negative. For a symmetric matrix, determinant tests also exist: strictly positive leading principal minors certify positive definiteness (Sylvester's criterion), while semi-definiteness requires all principal minors, not just the leading ones, to be non-negative. These conditions ensure the matrix corresponds to some inner product.
  • The Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)) depends only on distance. It corresponds to an infinite-dimensional feature space and can fit very flexible, smooth decision boundaries. The width σ controls how quickly similarity decays with distance.
  • The sigmoid kernel k(x, x') = tanh(α x^T x' + c) behaves somewhat like a two-layer neural network. While not always PSD for all parameter settings, it often works in practice with proper choices. It connects kernel methods with neural net intuitions.
  • The representer theorem is the backbone of kernel learning. For many regularized learning problems, the solution has the form f(x) = Σ α_i k(x, x_i). This reduces infinite-dimensional problems to choosing a finite set of coefficients tied to training points.
  • Using kernels, we can keep the simplicity of linear algorithms (like linear SVMs) but gain non-linear decision boundaries. We only replace dot products with kernels in training and prediction. This unifies many algorithms under one flexible approach.
  • A cautionary example showed that not all feature maps are equally useful. An initial quadratic feature map yielded degenerate decision boundaries (only the origin). Expanding the feature set to include the original linear terms created a circular boundary that cleanly separated classes.
  • Kernel choice and parameter tuning (like degree d or RBF width σ) strongly affect model behavior. Too small a σ in RBF can overfit; too large can underfit. Similarly, polynomial degree controls flexibility and the risk of feature explosion.
  • Kernel matrices (Gram matrices) store pairwise similarities between all training points. Many kernel algorithms work entirely through this matrix. Its PSD property ensures stable optimization and meaningful geometry in feature space.

Why This Lecture Matters

Kernel methods let you keep the comfort and stability of linear algorithms while unlocking the power of non-linear modeling. This is valuable for data scientists and ML engineers who need accuracy on complex patterns but want reliable, well-understood optimization. Instead of inventing a brand-new non-linear method, you can often kernelize an existing linear one by swapping dot products for kernel functions. This approach reduces engineering overhead while achieving sophisticated decision boundaries. In real projects, kernel methods help when classes are not linearly separable or when relationships involve interactions and curved structures. The polynomial kernel captures feature interactions without enumerating them. The RBF kernel captures smooth, local similarities in an intuitive way: nearby points act alike. The representer theorem guarantees that solutions reduce to learning a finite set of coefficients, which is practical and interpretable in terms of similarity to training data. Professionally, knowing kernels strengthens your toolbox beyond plain linear models and deep networks. You can quickly prototype powerful models and benchmark performance before moving to heavier solutions. Kernels also connect to advanced topics, such as kernel PCA for non-linear dimensionality reduction and Gaussian processes for probabilistic modeling, broadening your capabilities. In the modern ML landscape, where flexibility and reliability both matter, kernel methods remain a cornerstone technique worth mastering.

Lecture Summary


01Overview

This lecture explains kernel methods, a family of techniques that transform simple linear learning algorithms into powerful non-linear ones. The central idea is to map inputs into a higher-dimensional feature space where linear models can create curved decision boundaries when viewed back in the original space. This mapping is called a feature map and is typically written as φ(x). Although feature maps can be explicit, the lecture’s core insight is the kernel trick: rather than compute φ(x) directly, we use a kernel function k(x, x') that equals the dot product of φ(x) and φ(x') in the feature space. With this substitution, many linear algorithms can operate as if they had rich non-linear features, without ever building them explicitly.
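
To make the claim k(x, x') = φ(x)^T φ(x') concrete, here is a minimal numpy sketch (not from the lecture; the two test points are arbitrary) that checks it for the 2D degree-2 feature map used later in the session:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map from the 2D example: [x1^2, x2^2, sqrt(2)*x1*x2]
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly2(x, xp):
    # Degree-2 polynomial kernel with c = 0: k(x, x') = (x^T x')^2
    return float(np.dot(x, xp)) ** 2

x = np.array([1.0, 2.0])    # illustrative points
xp = np.array([3.0, -1.0])

lhs = float(phi(x) @ phi(xp))   # dot product computed in feature space
rhs = k_poly2(x, xp)            # kernel computed in input space
print(lhs, rhs)                 # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
assert np.isclose(lhs, rhs)
```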

The lecture starts with motivation: many real-world relationships are non-linear, so linear models are often too simple. Non-linear models can be powerful but are sometimes hard to train and can involve many parameters. Kernel methods bridge this gap: they allow the use of well-understood linear optimization techniques while achieving non-linear behavior through feature maps and kernels. The instructor demonstrates this with a 2D classification example: a straight line cannot separate two classes, but a circle can. By mapping (x1, x2) to a higher-dimensional set of features, and then learning a linear separator in that space, the model can produce a circular boundary in the original space.

Next, the instructor shows specific feature maps and their properties. One quadratic map φ(x) = [x1^2, x2^2, √2 x1 x2] has a special property: the dot product of feature vectors equals (x^T x')^2. This illustrates how some feature maps correspond to simple kernel functions like the degree-2 polynomial kernel. However, not all feature maps are equally helpful: with certain weights, the first map produced degenerate boundaries. Adding original linear terms to the map (φ(x) = [x1^2, x2^2, x1, x2]) allowed a circular decision boundary centered at (1/2, 1/2), showing the importance of expressiveness.
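
As a quick numerical companion to that example (the weights, feature map, and sampled points below simply mirror the algebra described later in the lecture), one can confirm that every point on the circle centered at (1/2, 1/2) sits on the decision boundary w^T φ(x) = 0 of the augmented map:

```python
import numpy as np

def phi_aug(x):
    # Feature map with quadratic AND linear terms: [x1^2, x2^2, x1, x2]
    x1, x2 = x
    return np.array([x1**2, x2**2, x1, x2])

w = np.array([-1.0, -1.0, 1.0, 1.0])

# Points on the circle centered at (1/2, 1/2) with radius sqrt(1/2)
center, radius = np.array([0.5, 0.5]), np.sqrt(0.5)
for theta in np.linspace(0.0, 2 * np.pi, 8, endpoint=False):
    x = center + radius * np.array([np.cos(theta), np.sin(theta)])
    # w^T phi(x) should vanish (up to rounding) everywhere on the circle
    assert abs(w @ phi_aug(x)) < 1e-9
print("all sampled circle points lie on the decision boundary")
```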

The lecture then formalizes the kernel trick. A kernel is any function k(x, x') that acts as an inner product in some (possibly infinite) feature space: k(x, x') = φ(x)^T φ(x'). By replacing every occurrence of a dot product in a linear algorithm with k(x, x'), we can run that algorithm in the implicit feature space. The polynomial kernel k(x, x') = (x^T x' + c)^d is a standard example that corresponds to including all polynomial terms up to degree d without explicitly constructing them. This avoids the computational blow-up that happens when enumerating all polynomial features at high degrees.
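
To get a feel for that blow-up, the short sketch below (the input dimension and degrees are arbitrary choices, not values from the lecture) counts the monomials of degree at most d in D variables and contrasts this with the single O(D) dot product the polynomial kernel needs:

```python
import numpy as np
from math import comb

def poly_kernel(x, xp, c=1.0, d=3):
    # k(x, x') = (x^T x' + c)^d, computed in O(D) time regardless of d
    return (float(np.dot(x, xp)) + c) ** d

def n_poly_features(D, d):
    # Number of monomials of degree <= d in D variables: C(D + d, d)
    return comb(D + d, d)

D = 100                                 # illustrative input dimension
for d in (2, 3, 5):
    print(d, n_poly_features(D, d))     # 5151, 176851, 96560646 explicit features

rng = np.random.default_rng(0)
x, xp = rng.normal(size=D), rng.normal(size=D)
print(poly_kernel(x, xp, c=1.0, d=5))   # one cheap dot product plus a power
```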

Next, the lecture covers when a function is a valid kernel. Mercer’s theorem provides the condition: k must be symmetric and positive semi-definite (PSD). In practical terms, given any set of inputs, the resulting kernel (Gram) matrix K with entries K_ij = k(x_i, x_j) must be PSD. The instructor also explains how to check PSD: ensure all eigenvalues are non-negative, or use determinant tests on a symmetric matrix; note that strictly positive leading principal minors (determinants of upper-left submatrices) certify positive definiteness via Sylvester's criterion, while semi-definiteness requires all principal minors, not just the leading ones, to be non-negative.
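
In practice the eigenvalue test is the easiest one to run. A minimal sketch, assuming scikit-learn's rbf_kernel and random data (both illustrative choices), builds a Gram matrix and verifies its spectrum is non-negative up to rounding:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # any dataset works for this check

K = rbf_kernel(X, X, gamma=0.5)     # Gram matrix K_ij = k(x_i, x_j); gamma ~ 1/(2*sigma^2)
assert np.allclose(K, K.T)          # symmetry

eigvals = np.linalg.eigvalsh(K)     # eigenvalues of a symmetric matrix
print(eigvals.min())                # should be >= 0 up to rounding error
assert eigvals.min() > -1e-10       # tolerate tiny negative round-off
```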

The lecture then surveys common kernels. The Gaussian or radial basis function (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)) depends only on the distance between points and corresponds to an infinite-dimensional feature map. The sigmoid kernel k(x, x') = tanh(α x^T x' + c) behaves like a two-layer neural network. Each kernel has parameters (e.g., σ for RBF, degree d for polynomial) that control model flexibility. Choice of kernel and parameters is guided by domain knowledge and experimentation.
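
Both kernels are one-liners to implement. The sketch below (parameter values are illustrative) also prints how RBF similarity decays with distance for a narrow versus a wide σ:

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    # Gaussian / RBF kernel: similarity decays with squared distance
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xp, alpha=0.1, c=0.0):
    # tanh(alpha * x^T x' + c); not guaranteed PSD for every (alpha, c)
    return np.tanh(alpha * np.dot(x, xp) + c)

x = np.zeros(2)
for dist in (0.1, 1.0, 3.0):
    xp = np.array([dist, 0.0])
    # Narrow sigma: similarity collapses quickly; wide sigma: decays slowly
    print(dist, rbf(x, xp, sigma=0.5), rbf(x, xp, sigma=5.0))
```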

Finally, the lecture introduces the representer theorem, a foundational result that explains why kernel methods work so broadly. For a large class of regularized learning problems, the solution can be written as f(x) = Σ_i α_i k(x, x_i). This reduces the problem of searching over an enormous space of functions to solving for a finite set of coefficients attached to training examples. It also justifies the kernel trick and shows how learning reduces to working with the kernel matrix.
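
Kernel ridge regression is the simplest place to see this form in action. A minimal sketch, assuming synthetic 1-D data and an RBF kernel (the data, λ, and gamma are illustrative), solves for the coefficients α in closed form and predicts through the representer form Σ_i α_i k(x, x_i):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))                  # synthetic 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)       # noisy targets

lam = 0.1                                             # regularization strength
K = rbf_kernel(X, X, gamma=0.5)                       # Gram matrix on training data
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # kernel ridge closed form

def f(x_new):
    # Representer form: prediction is a weighted sum of kernel similarities
    k_vec = rbf_kernel(x_new.reshape(1, -1), X)[0]
    return float(k_vec @ alpha)

print(f(np.array([1.0])), np.sin(1.0))                # prediction vs noise-free value
```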

The lecture targets students with some basic machine learning background: familiarity with linear models, dot products, and classification. By the end, learners should understand how to construct non-linear models by swapping dot products for kernels, how to choose and validate kernels, and how the representer theorem structures solutions. The session lays the groundwork for upcoming topics like kernel PCA and Gaussian processes, which build directly on the ideas introduced here.

02Key Concepts

  • 01

    Kernel methods: 🎯 A way to make linear algorithms handle non-linear patterns. 🏠 Like wearing special glasses that make straight lines look curved where needed. 🔧 They map inputs to a higher-dimensional feature space and run a linear model there; the result is non-linear back in the original space. 💡 Without them, linear models would miss curved or complex boundaries. 📝 Example: A circle separating classes in 2D can be created by a linear boundary in feature space.

  • 02

    Feature map (φ): 🎯 A function that turns original inputs into new features. 🏠 Like turning a simple Lego block into a multi-piece set with more shapes to build complex models. 🔧 Mathematically, φ: X → H, where H is a (possibly high- or infinite-dimensional) feature space. 💡 It increases expressiveness so linear models can fit curves and shapes. 📝 Example: φ(x) = [x1^2, x2^2, √2 x1 x2] adds quadratic interactions to 2D inputs.

  • 03

    Kernel trick: 🎯 Replace dot products with a kernel function that equals an inner product in feature space. 🏠 Like ordering a pre-made cake instead of baking every layer yourself: you get the final taste without mixing ingredients. 🔧 If an algorithm uses only dot products, swap x^T x' with k(x, x'); computation happens implicitly in feature space. 💡 This avoids building huge feature vectors and saves time and memory. 📝 Example: Using k(x, x') = (x^T x' + c)^d instead of listing all polynomial features.

  • 04

    Linear vs non-linear boundaries: 🎯 Linear models draw straight lines (or planes), non-linear models draw curves. 🏠 Like using a ruler versus a flexible string to separate colored marbles on a table. 🔧 In input space, a line may fail; after φ mapping, a linear separator becomes a curve back in input space. 💡 Non-linear boundaries capture complex class shapes. 📝 Example: A circle separating blue and orange points that a line cannot.

  • 05

    Quadratic feature map property: 🎯 Some φ maps produce simple kernel forms. 🏠 Like a clever shortcut that turns a long multiplication into a square of a sum. 🔧 φ(x) = [x1^2, x2^2, √2 x1 x2] satisfies φ(x)^T φ(x') = (x^T x')^2. 💡 This connects explicit features to the degree-2 polynomial kernel. 📝 Example: It matches the expansion x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2'.

  • 06

    Expressiveness of feature maps: 🎯 Not all feature maps are equally powerful. 🏠 Like having a toolbox: a tiny set of tools may not handle all jobs. 🔧 An initial quadratic map led to degenerate decision boundaries for some weights. 💡 Adding linear terms (e.g., x1 and x2) expanded expressiveness and allowed useful curves. 📝 Example: φ(x) = [x1^2, x2^2, x1, x2] produced a circle centered at (1/2, 1/2).

  • 07

    Polynomial kernel: 🎯 A kernel that models polynomial interactions up to degree d. 🏠 Like stacking building blocks into towers of height up to d. 🔧 k(x, x') = (x^T x' + c)^d implicitly includes all monomials up to degree d. 💡 It avoids the combinatorial explosion of explicit polynomial features. 📝 Example: For d=2, c=0, it equals (x^T x')^2 and matches a specific quadratic φ.

  • 08

    Computational savings: 🎯 Kernels avoid constructing large feature vectors. 🏠 Like using a shortcut path instead of climbing every step of a staircase. 🔧 Algorithms that depend only on dot products can work with k(x, x') directly. 💡 This makes high-degree or infinite-dimensional features feasible. 📝 Example: RBF kernels create infinite-dimensional embeddings but are still easy to compute.

  • 09

    Mercer’s theorem: 🎯 A rule to test if a function is a valid kernel. 🏠 Like a quality checklist to make sure a tool fits a machine. 🔧 k must be symmetric and produce a PSD Gram matrix for any dataset. 💡 This ensures there exists some φ with k(x, x') = φ(x)^T φ(x'). 📝 Example: If all eigenvalues of K are non-negative, k passes the test.

  • 10

    Positive semi-definite (PSD) matrices: 🎯 Matrices with non-negative eigenvalues. 🏠 Like a bowl that never flips inside-out; it can be flat but not concave down. 🔧 For a kernel matrix K, PSD ensures valid inner products in some space. 💡 Without PSD, optimization can break or geometry becomes invalid. 📝 Example: Check PSD via eigenvalues ≥ 0; for symmetric K, determinant tests need all principal minors (not only the leading ones) to be non-negative.

  • 11

    Gaussian (RBF) kernel: 🎯 A kernel that measures similarity by distance, with smooth decay. 🏠 Like saying two points are friends if they are close on a map, less so if far. 🔧 k(x, x') = exp(-||x - x'||^2 / (2σ^2)); σ controls the width of similarity. 💡 It provides very flexible, smooth boundaries and corresponds to infinite features. 📝 Example: Small σ focuses on very local patterns; large σ smooths broadly.

  • 12

    Sigmoid kernel: 🎯 A kernel shaped like a two-layer neural net’s activation. 🏠 Like pressing inputs through a squashing gate that bends responses non-linearly. 🔧 k(x, x') = tanh(α x^T x' + c), for suitable α and c. 💡 Bridges kernel methods and neural network intuition; care needed to ensure PSD in practice. 📝 Example: With tuned α and c, it can model S-shaped relationships.

  • 13

    Kernel matrix (Gram matrix): 🎯 A table of all pairwise similarities using k. 🏠 Like a friendship chart where each entry shows how close two people are. 🔧 K_ij = k(x_i, x_j); many kernel algorithms use only K. 💡 It centralizes computation and encodes the geometry of data in feature space. 📝 Example: SVM training in dual form uses K instead of explicit features.

  • 14

    Choosing a kernel/feature map: 🎯 Selection depends on data patterns and goals. 🏠 Like picking the right lens for a camera based on the scene. 🔧 Common choices are polynomial and RBF; parameters (degree, σ) adjust flexibility. 💡 Good choices capture structure without overfitting. 📝 Example: Use RBF for smooth, distance-based similarity; polynomial for global interactions.

  • 15

    Representer theorem: 🎯 Solutions to many regularized problems lie in the span of training-point kernels. 🏠 Like building a song from a fixed set of notes (kernels at data points). 🔧 If we minimize a norm-based regularized loss, f(x) = Σ α_i k(x, x_i) is optimal. 💡 This turns infinite-dimensional search into solving for finite coefficients. 📝 Example: Support vector machines and kernel ridge regression fit this form.

  • 16

    Regularization and norms: 🎯 Penalizing model complexity to prevent overfitting. 🏠 Like adding a gentle brake to keep a bike from wobbling wildly. 🔧 The objective includes ||w||^2 plus data loss; in kernels, this controls function smoothness. 💡 Without it, highly flexible kernels could fit noise. 📝 Example: The λ parameter balances data fit versus simplicity.

  • 17

    Why non-linearities matter: 🎯 Real data often needs curves, not just lines. 🏠 Like drawing a fence around animals that gather in a ring. 🔧 Kernels produce these curves by linear separation in feature space. 💡 This boosts accuracy on complex tasks. 📝 Example: Circular boundaries catch points that cluster around a center.

  • 18

    Checking PSD in practice: 🎯 Verify kernel validity on your dataset. 🏠 Like testing a bridge before letting cars drive over it. 🔧 Compute eigenvalues (or principal minors) to ensure non-negativity. 💡 Prevents broken optimization and invalid geometry. 📝 Example: If K has a small negative eigenvalue from rounding, adjust parameters or regularize.

  • 19

    Trade-offs in kernel parameters: 🎯 Settings change model bias and variance. 🏠 Like tuning a radio: too sharp and you catch static; too broad and stations blend. 🔧 Large σ (RBF) or low degree (polynomial) smooths; small σ or high degree increases complexity. 💡 Proper tuning reduces over- or underfitting. 📝 Example: Cross-validate σ or degree to find a sweet spot.

  • 20

    Unifying view across algorithms: 🎯 Many linear algorithms can be kernelized. 🏠 Like swapping a tool head onto the same power handle. 🔧 Wherever dot products appear, replace them with k(x, x') to get a non-linear counterpart. 💡 This gives a consistent, powerful approach without inventing new learners. 📝 Example: Linear SVM → kernel SVM by replacing x^T x' with k(x, x'); a short scikit-learn sketch of exactly this swap follows this list.
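
A minimal sketch of the swap described in concept 20, assuming scikit-learn and a synthetic ring-shaped dataset (both illustrative choices, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data where one class forms a ring around the other
X, y = make_circles(n_samples=400, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same algorithm, two kernels: the only change is how "dot products" are computed
linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr)

print("linear SVM accuracy:", linear_svm.score(X_te, y_te))  # typically near chance
print("RBF SVM accuracy:", rbf_svm.score(X_te, y_te))        # typically close to 1.0
```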

03Technical Details

  1. Overall Architecture/Structure
  • Goal: Convert a linear algorithm into a non-linear one by operating in an implicit feature space. The standard linear algorithm (e.g., a linear classifier) computes predictions like sign(w^T x). In kernel methods, we first map x to φ(x) in a higher-dimensional space H; then use a linear model there: sign(w^T φ(x)). When viewed in input space, this linear boundary becomes a non-linear boundary.
  • Kernel Trick: Many linear algorithms depend on dot products between examples (x_i^T x_j). If we can compute an inner product in H without constructing φ(x), we can use a kernel function k(x, x') = φ(x)^T φ(x'). This replacement allows training and prediction using only kernel evaluations.
  • Data Flow: Inputs x_1,...,x_n → compute K_ij = k(x_i, x_j) → solve the linear algorithm’s dual or kernelized form using K → obtain coefficients α (and possibly bias b) → predict on a new x by computing f(x) = Σ_i α_i k(x, x_i) (+ b if present). No explicit φ(x) is required.
  2. Concrete Example: From Lines to Circles in 2D
  • Linear failure: In 2D (x1, x2), a straight line might not separate classes (blue vs orange). Non-linear decision boundaries like circles succeed.
  • First feature map: φ(x) = [x1^2, x2^2, √2 x1 x2]. Property: φ(x)^T φ(x') = x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2' = (x^T x')^2. This map corresponds to the degree-2 polynomial kernel with c=0.
  • Degenerate weights: With w = [1, 1, -1] or w = [1, 1, 1], the decision boundary derived from w^T φ(x) = 0 simplifies such that only (x1, x2) = (0, 0) satisfies it, giving a trivial boundary at the origin. This shows that even with a non-linear φ, some weight choices can be unhelpful.
  • Improved feature map: φ(x) = [x1^2, x2^2, x1, x2]. With w = [-1, -1, 1, 1], w^T φ(x) = -x1^2 - x2^2 + x1 + x2 = 0, which rearranges to (x1 - 1/2)^2 + (x2 - 1/2)^2 = 1/2. This is a circle centered at (1/2, 1/2) with radius √(1/2). The added linear terms increase expressiveness so that a useful non-linear boundary emerges.
  3. Kernel Trick in Detail
  • Core substitution: If an algorithm uses dot products, replace x^T x' with k(x, x') = φ(x)^T φ(x'). Training and prediction now rely on kernel evaluations. Example: In the SVM dual, the objective involves x_i^T x_j; swapping in k(x_i, x_j) yields the kernel SVM.
  • Why it helps: Explicitly constructing polynomial features up to degree d in D dimensions leads to O(D^d) features. This can be huge. The kernel trick computes the same effect with one function call to k, keeping complexity manageable.
  • Polynomial kernel: k(x, x') = (x^T x' + c)^d. Interpretation: It includes all monomials of x and x' up to degree d with specific weighting determined by the binomial/multinomial expansions. Special case: c=0, d=2 gives (x^T x')^2, which corresponds to φ(x) = [x1^2, x2^2, √2 x1 x2] in 2D (and analogous terms in higher dimensions). Adding c introduces lower-order terms, effectively augmenting features with biases.
  4. Valid Kernels and Mercer’s Theorem
  • Mercer’s condition: A function k is a valid kernel if it is symmetric (k(x, x') = k(x', x)) and for any dataset {x_1,...,x_n}, the Gram matrix K with K_ij = k(x_i, x_j) is positive semi-definite (PSD). PSD means v^T K v ≥ 0 for all vectors v (equivalently all eigenvalues of K are ≥ 0).
  • Intuition: PSD ensures k behaves like an inner product in some space H. Thus there exists a (possibly infinite-dimensional) φ such that k(x, x') = φ(x)^T φ(x').
  • Checking PSD: (1) Eigenvalue test: compute eigenvalues of K; all must be ≥ 0 (numerically small negatives may appear due to rounding and can often be corrected by slight regularization). (2) Principal minors: for a symmetric K, strictly positive determinants of the top-left k×k submatrices certify positive definiteness (Sylvester's criterion); semi-definiteness requires all principal minors, not just the leading ones, to be non-negative, so the eigenvalue test is usually the more practical check.
  5. Common Kernels and Their Behavior
  • Polynomial kernel: k(x, x') = (x^T x' + c)^d. Parameters: degree d (controls complexity) and constant c (shifts/augments features). Higher d increases flexibility but risks overfitting and numerical instability. Useful when interactions among features matter globally.
  • Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2σ^2)). Parameter σ (width) controls locality. Small σ makes similarity drop off quickly, focusing on very local patterns (risk: overfitting). Large σ smooths similarity broadly, capturing global trends (risk: underfitting). RBF corresponds to an infinite-dimensional φ and typically yields smooth decision boundaries.
  • Sigmoid kernel: k(x, x') = tanh(α x^T x' + c). Parameters α and c influence the shape and PSD behavior. It resembles the activation of a two-layer neural network. While not universally PSD for all α, c, it can be effective with proper tuning.
  6. Representer Theorem: Structure of Solutions
  • Problem setup: Consider minimizing an objective of the form ||w||^2 + λ Σ_i L(y_i, f(x_i)), where ||w||^2 is a regularization term (model complexity penalty), L is a loss function comparing predictions to targets, and λ controls the strength of regularization. Many standard problems fit this pattern (e.g., ridge regression, SVM-like losses, logistic-like losses with appropriate regularization forms).
  • The theorem: The minimizer f* can be written as f*(x) = Σ_i α_i k(x, x_i). That is, the optimal function lies in the span of kernel functions centered on the training points. This collapses the search from an infinite function class to solving for a finite number of coefficients α.
  • Implications: Training reduces to working with the kernel matrix K and solving for α (and possibly a bias term). Prediction for a new x requires computing k(x, x_i) for all training points used in the expansion (often many α_i will be zero or small, depending on the problem and regularization). This representation justifies why kernelized versions of many algorithms are both feasible and powerful.
  7. Step-by-Step Implementation Guide (Conceptual)
  • Step 1: Choose a kernel family and parameters (e.g., polynomial with degree d, or RBF with width σ). Use domain knowledge or simple cross-validation to guide choices.
  • Step 2: Build the kernel matrix K where K_ij = k(x_i, x_j). Ensure K is symmetric; check PSD if needed (eigenvalues ≥ 0). If numerical issues arise, add a small diagonal jitter (εI) to K.
  • Step 3: Formulate the learning objective in terms of K. For example, in an SVM dual or kernel ridge regression, the objective depends on K and labels y. Regularization (e.g., λ) controls complexity.
  • Step 4: Solve for coefficients α (and bias b if present) using the chosen algorithm’s kernelized form. Numerical solvers or quadratic programming tools are commonly used for SVM-like problems; closed forms exist for kernel ridge regression.
  • Step 5: Predict for new inputs x_new by computing f(x_new) = Σ_i α_i k(x_new, x_i) (+ b). Decision rules (e.g., sign for classification) turn scores into labels.
  • Step 6: Tune parameters (σ, d, c, regularization λ) via validation. Monitor under/overfitting behavior. A minimal from-scratch sketch of Steps 1-6 appears at the end of this section.
  8. Tips and Warnings
  • Feature map pitfalls: An explicit φ that seems reasonable may still yield poor or degenerate decision boundaries for some weights. Evaluate expressiveness and consider adding lower-order terms (e.g., include linear terms alongside higher-order ones) as shown by the circle example.
  • Kernel choice: RBF is a strong default for many problems; polynomial kernels can be better when interactions among features are inherently polynomial-like. The sigmoid kernel needs parameter care to remain effective and PSD in practice.
  • Parameter tuning: For RBF, σ too small → overfit; too large → underfit. For polynomial, high degree → risk of overfitting and numerical instability; c adjusts the balance between higher- and lower-order contributions.
  • PSD checks: Numerical round-off can produce tiny negative eigenvalues even for valid kernels. Adding a small εI to K can stabilize computations. If PSD violations are large, reconsider kernel parameters or the kernel itself.
  • Scalability: Kernel methods often require storing and manipulating the n×n Gram matrix, which can be expensive for very large n. For big datasets, consider approximate methods (e.g., random features) or sparse solutions (like SVMs) to reduce computational load.
  9. How the 2D Circle Emerges (Algebra Details)
  • Improved φ: Let φ(x) = [x1^2, x2^2, x1, x2], and choose w = [-1, -1, 1, 1]. The decision boundary is w^T φ(x) = -x1^2 - x2^2 + x1 + x2 = 0. Rearranging: x1^2 - x1 + x2^2 - x2 = 0. Completing the square: (x1 - 1/2)^2 - 1/4 + (x2 - 1/2)^2 - 1/4 = 0 ⇒ (x1 - 1/2)^2 + (x2 - 1/2)^2 = 1/2. This is a circle centered at (1/2, 1/2) with radius √(1/2), clearly non-linear in input space.
  10. Why the Initial Quadratic Map Gave Degenerate Boundaries
  • With φ(x) = [x1^2, x2^2, √2 x1 x2] and w chosen as [1, 1, -1] or [1, 1, 1], algebraic manipulation collapses the boundary to only the origin. This highlights that specific weight vectors, combined with a given φ, can fail to carve a meaningful separation. The lesson: expressiveness is not just about degree; coverage of the right terms matters (e.g., including linear components). Choosing or learning weights that reflect the target geometry is key.
  11. Practical Q&A Insights
  • How to choose a feature map? In practice, start with standard kernel families like polynomial and RBF, then tune parameters. Use validation to compare performance. Domain knowledge (e.g., expecting smoothness or periodicity) points to certain kernels and parameter ranges.
  • How to check PSD? Compute the eigenvalues of K and ensure they are non-negative; for symmetric matrices, remember that non-negative leading principal minors alone do not certify semi-definiteness (all principal minors must be non-negative). Persistent violations mean reconsidering the kernel or its parameters. Regularization and numeric stabilization (adding εI) can help.
  12. Putting It All Together
  • Kernel methods let you keep the training simplicity of linear algorithms while capturing non-linear patterns. The kernel trick sidesteps explicit feature construction. Mercer’s theorem ensures validity of the kernel. The representer theorem structures solutions as sums of kernel evaluations at training points. With careful kernel and parameter choices, you can build accurate, flexible models without complex non-linear learners.
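
The following is a minimal from-scratch sketch of the six-step guide above, assuming an RBF kernel and kernel ridge regression on synthetic data (the helper names, dataset, and parameter grids are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, split into training and validation parts
X = rng.uniform(-3, 3, size=(120, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

def rbf_gram(A, B, sigma):
    # Step 2: Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fit_kernel_ridge(K, y, lam, jitter=1e-8):
    # Steps 3-4: solve (K + lam*I) alpha = y; jitter guards against round-off
    return np.linalg.solve(K + (lam + jitter) * np.eye(len(y)), y)

def predict(alpha, X_train, X_new, sigma):
    # Step 5: f(x) = sum_i alpha_i k(x, x_i)
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Steps 1 and 6: pick the kernel family, then tune sigma and lambda on validation data
best = None
for sigma in (0.3, 1.0, 3.0):
    for lam in (1e-3, 1e-1, 1.0):
        alpha = fit_kernel_ridge(rbf_gram(X_tr, X_tr, sigma), y_tr, lam)
        mse = np.mean((predict(alpha, X_tr, X_va, sigma) - y_va) ** 2)
        if best is None or mse < best[0]:
            best = (mse, sigma, lam)
print("best validation MSE, sigma, lambda:", best)
```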

04Examples

  • 💡

    Linear vs circle decision boundary: Input points in 2D are colored blue and orange. A straight line fails to separate them, but a circle can. This demonstrates that a linear model in input space is insufficient. Kernel methods create such curved boundaries by learning a linear separator after a feature map.

  • 💡

    Quadratic feature map property check: Use φ(x) = [x1^2, x2^2, √2 x1 x2]. Compute φ(x)^T φ(x') = x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2', which equals (x^T x')^2. This confirms the map matches the degree-2 polynomial kernel with c=0. It shows how an explicit φ can correspond to a simple kernel expression.

  • 💡

    Degenerate boundary with w = [1, 1, -1]: The decision boundary w^T φ(x) = 0 simplifies to an equation with only the origin as a solution. Thus, the classifier is effectively trivial and unhelpful. This example warns that certain φ and w choices can produce poor separations. It motivates seeking more expressive features.

  • 💡

    Degenerate boundary with w = [1, 1, 1]: Repeating the process again yields only the origin as the boundary. This reinforces that merely changing weights may not fix expressiveness issues. The chosen φ can constrain what shapes are possible. A better φ is needed to obtain meaningful non-linear boundaries.

  • 💡

    Improved feature map with added linear terms: Choose φ(x) = [x1^2, x2^2, x1, x2] and w = [-1, -1, 1, 1]. The boundary becomes -x1^2 - x2^2 + x1 + x2 = 0, which simplifies to a circle centered at (1/2, 1/2) with radius √(1/2). This yields a useful non-linear separator that cleanly captures the class structure. It shows how adding features increases model flexibility.

  • 💡

    Polynomial features explosion: For two inputs x1, x2, degree-2 polynomial features include x1^2, x2^2, x1 x2, x1, x2, and 1 (six features). As degree grows, the number of combinations increases rapidly. Explicitly computing them becomes expensive. The polynomial kernel avoids this by computing (x^T x' + c)^d directly.

  • 💡

    Polynomial kernel equivalence demonstration: For c=0, d=2, (x^T x')^2 expands to x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2'. This matches the dot product of [x1^2, x2^2, √2 x1 x2] with [x1'^2, x2'^2, √2 x1' x2']. Therefore, we get the same result without explicitly building the feature vectors. This shows the time-saving power of the kernel trick.

  • 💡

    Gaussian (RBF) kernel intuition: k(x, x') = exp(-||x - x'||^2 / (2σ^2)) depends only on distance. Points near each other have k close to 1; far points have k near 0. This builds smooth, locality-sensitive models. It captures complex boundaries by combining many local influences.

  • 💡

    Checking PSD by eigenvalues: Given a kernel matrix K from data, compute its eigenvalues. If any are negative (beyond tiny numerical errors), K is not PSD for these settings. You would then adjust kernel parameters or choose a different kernel. This ensures valid geometry in feature space.

  • 💡

    Leading principal minors method: For a symmetric K, compute determinants of its upper-left 1×1, 2×2, ..., n×n blocks. If any of these is negative, K is not PSD; however, non-negative leading minors alone do not certify semi-definiteness (that requires all principal minors, not just the leading ones), so pair this test with the eigenvalue check. A small numerical illustration of this caveat appears after these examples.

  • 💡

    Representer theorem in practice: Consider minimizing ||w||^2 + λ Σ_i L(y_i, f(x_i)). The theorem says f(x) = Σ α_i k(x, x_i) is optimal within this setting. So you solve for α_i instead of searching over all possible functions. This reduces complexity and ties solutions to training points.

  • 💡

    Choosing a kernel via intuition: If you expect smoothness and local similarity, start with RBF and tune σ. If global polynomial interactions matter, try a polynomial kernel and choose degree d carefully. Use validation to compare performance. This links problem structure to kernel choice.
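
As a caveat to the leading-minors example above: non-negative leading principal minors alone do not guarantee PSD. A tiny numpy sketch (the matrix is a standard textbook counterexample, not from the lecture) makes the point and shows why the eigenvalue test is the safer default:

```python
import numpy as np

def leading_principal_minors(M):
    # Determinants of the top-left 1x1, 2x2, ..., nxn blocks
    return [np.linalg.det(M[:k, :k]) for k in range(1, len(M) + 1)]

# Symmetric matrix whose leading minors are all zero (hence non-negative)...
M = np.array([[0.0, 0.0],
              [0.0, -1.0]])
print(leading_principal_minors(M))   # both determinants are 0

# ...yet it is NOT PSD: the eigenvalue test catches the negative direction
print(np.linalg.eigvalsh(M))         # [-1.  0.]
assert np.linalg.eigvalsh(M).min() < 0
```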

05Conclusion

This lecture established kernel methods as a powerful way to make linear algorithms solve non-linear problems. The key mechanism is the feature map φ(x), which lifts inputs into a (possibly huge) feature space where a simple linear boundary becomes a curve back in the original space. The kernel trick avoids explicitly constructing φ(x) by substituting dot products with a kernel function k(x, x') = φ(x)^T φ(x'), enabling efficient computation even for high-degree or infinite-dimensional feature spaces.

We explored how particular feature maps behave, using 2D toy examples to show successes and pitfalls. A quadratic map that seemed promising ended up producing degenerate boundaries for some weights, while adding linear terms yielded a clean circular boundary. This highlighted the importance of expressiveness in feature design. The polynomial kernel provided a canonical example of turning combinatorially many polynomial features into a single, fast kernel computation. Mercer’s theorem then gave us a formal test for valid kernels (symmetry and positive semi-definiteness), along with practical PSD checks based on eigenvalues or principal minors.

The lecture also introduced common kernels: polynomial (controlling global interactions via degree), Gaussian/RBF (capturing smooth, locality-based similarity with a width σ), and sigmoid (connecting to two-layer neural network behavior with parameters α and c). Finally, the representer theorem revealed why kernel methods are so broadly applicable: for many norm-regularized losses, the solution lies in the span of training-point kernels, f(x) = Σ α_i k(x, x_i). This reduces an infinite-dimensional search to solving for a finite set of coefficients and justifies the kernel-based formulation.

To practice, start by choosing a kernel family (RBF or polynomial), build the Gram matrix, and solve a kernelized version of a familiar linear algorithm (like SVM or ridge regression). Tune parameters such as σ, degree d, and regularization λ via validation. As a next step, dive into kernel PCA and Gaussian processes, which apply the same kernel machinery to unsupervised learning and probabilistic modeling. The core message to carry forward is simple but profound: by replacing dot products with valid kernels, you can keep the convenience of linear methods while capturing the rich, curved structures found in real-world data.
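
One way to run that validation loop with off-the-shelf tools is sketched below, assuming scikit-learn's KernelRidge (where gamma corresponds to 1/(2σ^2) and its alpha parameter plays the role of the regularization strength λ) on synthetic data:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))              # synthetic inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # noisy targets

# Grid over kernel width (via gamma) and regularization strength
param_grid = {"gamma": [0.1, 0.5, 2.0], "alpha": [1e-3, 1e-1, 1.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)     # best settings and CV R^2
```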

Key Takeaways

  • ✓ Start with the kernel trick: if your linear algorithm uses dot products, try replacing them with a kernel. This often gives you a non-linear version with almost no extra code. It’s a quick win to test whether non-linear structure improves accuracy. Keep your original training pipeline and just switch in the kernel function.
  • ✓ Use RBF as a strong default when you expect smooth, local patterns. Tune σ: small σ fits tightly (risking overfit), large σ smooths broadly (risking underfit). Perform cross-validation to pick a good σ. Watch learning curves to diagnose when to increase or decrease σ.
  • ✓ Consider polynomial kernels when feature interactions matter globally. Choose degree d carefully: higher d adds flexibility but increases overfitting risk. Use the constant c to balance lower- and higher-order effects. Validate multiple (d, c) settings to find the right complexity.
  • ✓ Always check or ensure PSD of your kernel matrix for stability. If eigenvalues show small negative values from rounding, add a small εI to K. Large negative values suggest the kernel or parameters are unsuitable. Stable PSD matrices lead to well-behaved optimization.
  • ✓ Leverage the representer theorem to simplify thinking about solutions. Remember that you only need to learn coefficients tied to training points. This turns infinite-dimensional function fitting into finite parameter learning. It also clarifies why prediction reduces to summing kernel similarities.
  • ✓ Add lower-order terms to increase feature map expressiveness when needed. If a non-linear map gives degenerate or weak boundaries, include linear terms as well. This can unlock useful shapes like circles or ellipses. Don’t assume higher degree alone fixes expressiveness gaps.
  • ✓ Control complexity with regularization (λ) to prevent overfitting. Larger λ smooths the function; smaller λ allows more flexibility. Combine λ tuning with kernel parameter tuning for best results. Use validation scores to balance fit and generalization.
  • ✓ Compute and use the Gram matrix efficiently. Ensure symmetry and consider numeric stabilization if needed. For larger datasets, monitor memory usage, since K is n×n. If needed, switch to approximations or sparsity-promoting methods.
  • ✓ Adopt a simple workflow: choose a kernel family, build K, train, validate, then iterate. Keep notes on parameter settings and results for reproducibility. Start with reasonable defaults (e.g., RBF with a median-distance σ; a short sketch for computing it follows this list). Systematically refine based on validation performance.
  • ✓ Interpret kernel values as similarities in feature space. High k(x, x') means strong similarity; low values mean dissimilar points. This builds intuition about why certain predictions occur. Use this to debug and explain model behavior.
  • ✓ Beware of feature explosion with explicit polynomials; prefer kernels instead. The kernel computes the same effect much faster and more simply. This is crucial for high degrees or many input dimensions. Keep implementation lean by avoiding unnecessary explicit maps.
  • ✓ Use Q&A-style checks: can you justify your kernel choice and parameter settings? Can you verify PSD of K? Do your validation results support the configuration? Build these checks into your modeling routine.
  • ✓ Align kernel choice with domain knowledge: smooth physical processes often suit RBF, while symbol- or interaction-heavy domains may suit polynomial. Let problem structure guide initial choices. Then refine using data-driven validation. This balances theory and evidence.
  • ✓ When in doubt, visualize decision boundaries in 2D toy examples. Test how σ or d changes the boundary shape. Use these insights to inform settings in higher-dimensional problems. Simple plots can save hours of blind tuning.
  • ✓ Remember the unifying principle: many linear methods become non-linear by swapping x^T x' for k(x, x'). This means one kernel library can power multiple algorithms. Reuse tools and intuition across tasks. It’s an efficient way to scale your ML practice.
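
The "median-distance σ" default mentioned above is easy to compute; here is a minimal sketch of one common variant of the heuristic, with illustrative data:

```python
import numpy as np

def median_heuristic_sigma(X):
    # Set sigma to the median pairwise Euclidean distance between points
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    return np.median(dists[np.triu_indices(len(X), k=1)])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(median_heuristic_sigma(X))   # starting sigma for an RBF kernel
```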

Glossary

Kernel methods

A set of techniques that let linear algorithms learn non-linear patterns. They work by pretending data lives in a higher-dimensional space where a straight line can separate it. You don’t need to build these extra features yourself because the kernel trick does it for you. This keeps models simpler to train while becoming more powerful.

Feature map (φ)

A function that converts the original input into new, richer features. In the new space, a straight line can become a curve in the original space. Feature maps can be very high-dimensional or even infinite. They make simple models act complex without changing the learning rules much.

Kernel function

A function that gives the dot product of two inputs in the feature space without building the features. It is written as k(x, x') = φ(x)^T φ(x'). If k is valid, there exists some feature map that makes this true. This lets algorithms work with similarities directly.

Kernel trick

The idea of replacing dot products with a kernel function in algorithms. It saves you from computing huge feature vectors. If the algorithm only needs dot products, it can be run in the new space with no extra code for features. This enables non-linear learning with linear tools.

Dot product

A way to multiply two vectors that measures how aligned they are. In 2D, it’s x1*x1' + x2*x2'. It equals the product of the vectors' lengths times the cosine of the angle between them. In kernels, dot products in feature space are computed by k(x, x').

Decision boundary

The surface that separates different predicted classes. In 2D, it’s a line or curve; in higher dimensions, a plane or surface. Linear models make straight boundaries, while kernels can produce curves. It shows where the classifier switches labels.

Polynomial kernel

A kernel that models polynomial relationships of features up to a certain degree. It’s defined as k(x, x') = (x^T x' + c)^d. As degree d increases, the model becomes more flexible. It captures interactions like x1*x2 and higher-order terms.

Gaussian (RBF) kernel

A kernel measuring similarity by distance, using a bell-shaped formula. It’s k(x, x') = exp(-||x - x'||^2 / (2σ^2)). Points close together are very similar; far points are not. It corresponds to infinite features and makes smooth boundaries.


#kernel methods #kernel trick #feature map #polynomial kernel #gaussian kernel #rbf kernel #sigmoid kernel #mercer theorem #positive semi-definite #eigenvalues #leading principal minors #representer theorem #gram matrix #svm #non-linear classification #regularization #degree #sigma #inner product #feature space
Version: 1