Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 2 - Transformer-Based Models & Tricks

Key Summary

  • Principal Component Analysis (PCA) is a method to turn high-dimensional data into a smaller set of numbers while keeping as much useful information as possible. The lecture explains three equivalent views of PCA: best low-dimensional representation, directions of maximum variance, and best reconstruction after projection. These views all lead to the same solution using eigenvectors and eigenvalues of a certain matrix built from the data.
  • To project data onto one dimension, you choose a direction u (a unit vector) and map each point x to a single number u^T x. The reconstruction of the original point from that single number is (u^T x)u, which lies on the line in direction u. The goal is to pick u that makes the total squared distance between x and (u^T x)u as small as possible.
  • Expanding the squared error ||x - (u^T x)u||^2 and simplifying shows that minimizing reconstruction error is the same as maximizing the sum of squared projections ∑(u^T x_i)^2. This can be written as u^T C u where C = ∑ x_i x_i^T. Choosing u that maximizes u^T C u, with u having length 1, solves the 1D PCA problem.
  • The optimization u^T C u subject to u^T u = 1 is a constrained optimization problem. Using Lagrange multipliers leads to C u = λ u, which is the eigenvector equation. The best u is the eigenvector of C with the largest eigenvalue λ.
  • Seeing PCA as maximizing variance: when you project data onto u, the variance of the projected numbers u^T x_i equals u^T C u (when data are zero-mean). So the direction that maximizes variance is exactly the eigenvector with the largest eigenvalue. Thus, maximizing variance and minimizing reconstruction error are the same goal.
  • For k-dimensional PCA, pick the top k eigenvectors of C corresponding to the largest eigenvalues. The low-dimensional code for x is [u1^T x, u2^T x, ..., uk^T x]. These directions define the best k-dimensional subspace for both high variance and low reconstruction error.
  • If the data are zero-mean, C is proportional to the covariance matrix. Centering (subtracting the mean) makes the math match the standard covariance definition and often improves numerical stability. The lecture keeps C as ∑ x_i x_i^T to avoid assuming zero-mean but notes the connection.
  • Geometrically, projecting is like dropping a perpendicular from a point to a line or plane. The point on the line is (u^T x)u, and the leftover error is the perpendicular distance. PCA picks the line or plane that makes those perpendicular distances small for all points together.
  • The eigenvalue λ you get for a chosen eigenvector u equals the value of u^T C u at that u. Because u^T u = 1, the objective equals λ, so picking the largest eigenvalue automatically maximizes the objective. This gives a clean rule: largest eigenvalue → first principal component.
  • The principal components are the chosen eigenvectors that define the new coordinate axes. The first component captures the most variance, the second captures the next most but at right angles to the first, and so on. This creates an ordered set of directions ranked by how much structure they explain.
  • The reconstruction from k components is the sum over components of (u_j^T x) u_j. This gives the closest point to x in the chosen k-dimensional subspace. The total reconstruction error equals the sum of the eigenvalues not included, which shows why the top-k rule is optimal.
  • PCA is unsupervised, meaning it does not need labels. It looks only at the structure of the inputs to find patterns. This makes it useful for compression, visualization, noise reduction, and as a preprocessing step for other models.
  • The optimization uses standard linear algebra tools. The matrix C is symmetric and positive semidefinite, so it has real eigenvalues and orthogonal eigenvectors. This ensures that the solution is well-defined and that different principal components are perpendicular.
  • Variance view and reconstruction view are two sides of the same coin. The direction that captures the most spread (variance) also minimizes the squared distances after projecting and reconstructing. These equivalent goals make PCA conceptually simple and mathematically elegant.
  • To apply PCA in practice, build C by summing x_i x_i^T over data (or use the covariance after centering). Then compute eigenvalues and eigenvectors, sort by eigenvalue size, and pick the top k. Project data by taking dot products with those eigenvectors.
  • If you project to 1D, you replace each point x with one number u^T x. If you project to kD, you replace it with k numbers [u1^T x, ..., uk^T x]. These numbers are your compressed representation, which keeps the most important information according to PCA.

Why This Lecture Matters

PCA matters because modern datasets often have many features, making them hard to store, visualize, and model efficiently. For data scientists and machine learning engineers, PCA offers a principled way to compress data while preserving the most important structure. Analysts can use it to discover dominant patterns, visualize high-dimensional data in 2D or 3D, and remove noise by dropping small-variance directions. In real projects, PCA serves as a common first step: it speeds up training, reduces overfitting risk by eliminating weak directions, and creates cleaner inputs for downstream models. PCA also helps domain experts who need to understand their data. For example, in manufacturing, PCA can reveal which combinations of sensor readings vary most during a process. In healthcare, it can summarize many lab values into a few composite indicators. In computer vision, it compresses images while keeping key shapes. Because PCA is unsupervised, it works even when labeled data is scarce, making it useful early in exploratory analysis. From a career perspective, mastering PCA builds strong intuition in linear algebra and optimization—skills that translate to many other techniques. Understanding the equivalence between maximizing variance and minimizing reconstruction error strengthens mathematical thinking and problem-solving. In industry, being able to implement PCA, choose k wisely, and explain the results clearly is highly valued. Overall, PCA remains a core tool in the toolkit because it is simple, fast, and effective across countless domains.

Lecture Summary

01 Overview

This lecture teaches Principal Component Analysis (PCA), a fundamental technique in machine learning for reducing the number of features while keeping as much useful information as possible. The instructor presents PCA from three equivalent perspectives: finding the best low-dimensional representation that minimizes reconstruction error, finding directions in which the data varies the most (maximal variance), and finding directions that lead to the best reconstruction after projecting down and back up. Although these perspectives sound different, the lecture shows they lead to the exact same solution using linear algebra, specifically eigenvalues and eigenvectors of a matrix built from the data.

The lecture starts with a simple setup: you have n data points x1 through xn in d-dimensional space (for example, each xi could be a 100-dimensional vector). The goal is to find a lower-dimensional space, such as a line (1D) or, more generally, a k-dimensional subspace, that captures the essence of the data. The instructor first handles the 1D case to build intuition. You pick a direction u with unit length and project each data point onto that direction, producing a single number u^T xi. Then you reconstruct back to the original space by multiplying that number by u, giving (u^T xi)u, which lies on the line defined by u. The key question is: which u gives the smallest total squared reconstruction error across all points?

By expanding the squared error ||xi − (u^T xi)u||^2, simplifying, and focusing on terms that involve u, the lecture shows that minimizing the reconstruction error is equivalent to maximizing the sum of squared projections ∑(u^T xi)^2. This objective can be written compactly as u^T C u, where C = ∑ xi xi^T. The problem becomes a constrained maximization: maximize u^T C u subject to u^T u = 1. Using Lagrange multipliers, you take the derivative, set it to zero, and find C u = λ u. This is the eigenvector equation, so the best u is an eigenvector of C, and the best one is the eigenvector with the largest eigenvalue.
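
To make this derivation concrete, here is a minimal NumPy sketch (not from the lecture; the data and variable names are made up) that builds C from synthetic points, eigendecomposes it, and checks that the top eigenvector attains a larger value of u^T C u than randomly drawn unit directions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: d = 3 features, n = 200 points, with different spreads per feature
X = rng.normal(size=(3, 200)) * np.array([[3.0], [1.0], [0.3]])

C = X @ X.T                                # C = sum_i x_i x_i^T (3x3, symmetric PSD)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: ascending eigenvalues for a symmetric matrix
u_star = eigvecs[:, -1]                    # unit eigenvector with the largest eigenvalue

def objective(u):
    return float(u @ C @ u)                # u^T C u, the quantity 1D PCA maximizes

best_random = max(objective(v / np.linalg.norm(v)) for v in rng.normal(size=(1000, 3)))
print(objective(u_star) >= best_random)            # True: the top eigenvector wins
print(np.isclose(objective(u_star), eigvals[-1]))  # objective value equals the top eigenvalue
```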

Next, the instructor connects this to the variance view. If you project data onto u, the variance of the projected values u^T xi (assuming the data are zero-mean for simplicity) equals u^T C u. Therefore, the direction that maximizes variance is exactly the same direction that minimizes reconstruction error. In simple words, the line along which the data spreads out the most is also the line that gives you the least total squared distance when you drop perpendiculars from points to that line. This equality of goals makes PCA easy to reason about from different angles.

The lecture then generalizes from one dimension to k dimensions. Instead of one direction u, you now select k orthonormal directions u1, ..., uk. These are the eigenvectors of C associated with the k largest eigenvalues. Your low-dimensional code for a point x becomes [u1^T x, u2^T x, ..., uk^T x], and the reconstruction is the sum over j of (u_j^T x) u_j. This gives the best k-dimensional subspace in the sense of both variance captured and reconstruction accuracy.
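
The encode/decode step described here can be sketched in a few lines of NumPy (illustrative data and names, following the formulas above): stack the top k eigenvectors as columns of U, code a point as U^T x, reconstruct as U(U^T x), and check that the residual is orthogonal to the chosen subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 5, 100, 2
X = rng.normal(size=(d, n))              # columns are data points x_i (illustrative data)

C = X @ X.T
eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalue order
U = eigvecs[:, -k:]                      # top-k eigenvectors as columns (d x k)

x = X[:, 0]                              # one data point
z = U.T @ x                              # k-dimensional code [u1^T x, ..., uk^T x]
x_hat = U @ z                            # reconstruction sum_j (u_j^T x) u_j
residual = x - x_hat

print(np.allclose(U.T @ residual, 0.0))  # residual is orthogonal to the chosen subspace
```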

Throughout, the instructor keeps the math clean and geometric: projection is like dropping a perpendicular to a line or plane; reconstruction is the foot of that perpendicular; and the best directions are those where the data varies most. The matrix C is like a covariance matrix when the data are zero-mean (the lecture uses C = ∑ xi xi^T without assuming zero mean to simplify setup). C is symmetric, so it has real eigenvalues and orthogonal eigenvectors, which guarantees a stable solution. The top eigenvector gives the first principal component; the next eigenvector gives the second principal component, and so on.

By the end, you understand how to compute PCA: build the matrix C, find its eigenvectors and eigenvalues, sort by eigenvalue size, and choose the top k eigenvectors as your principal components. You also understand why this works: these directions maximize variance, minimize reconstruction error, and provide the best possible reconstruction after projecting down and back up. The lecture is suitable for learners who know basic linear algebra (vectors, dot products, matrices, eigenvectors, eigenvalues) and basic optimization ideas (like Lagrange multipliers). After this lecture, you will be able to apply PCA to compress data, visualize high-dimensional datasets, denoise signals by keeping the most important directions, and create strong features for downstream machine learning tasks.

02 Key Concepts

  • 01

    PCA definition and goal: PCA is a technique to reduce the number of features by projecting data onto fewer directions while keeping as much information as possible. Think of it like summarizing a long book into a few chapters that still tell the full story. Technically, it finds directions in space along which the data vary the most. This matters because high-dimensional data can be hard to store, visualize, and learn from. Example: Turning 100-pixel-wide images into just a few key numbers that capture most of the image’s structure.

  • 02

    Projection onto a direction (1D case): Projection turns a vector x into one number by taking the dot product with a unit vector u, written u^T x. It’s like shining a flashlight along direction u and measuring the shadow length of x on that line. Mathematically, u must have length 1 so the scale of the projection is correct. Without a clear projection rule, we cannot compare different directions fairly. Example: In 2D, projecting a point onto a line through the origin gives the closest point on that line.

  • 03

    Reconstruction after projection: After projecting x to u^T x, you can reconstruct back into the original space as (u^T x)u. It’s like taking the shadow length and placing a point on the line at that distance. Technically, this is the orthogonal projection of x onto the subspace spanned by u. This matters because we judge how much information we lost by how far the reconstruction is from x. Example: The perpendicular drop from a point to a line shows the error we made by using only that line.

  • 04

    Reconstruction error objective: The total error is the sum over data of ||x_i − (u^T x_i)u||^2. Imagine adding up all the squared distances from each point to the line. Expanding the square and simplifying reduces the problem to maximizing ∑(u^T x_i)^2. This shift is key because it turns a minimization of error into a maximization of a simpler quantity. Example: The best line is the one that makes the shadows, squared and summed, as big as possible.

  • 05

    Matrix form with C = ∑ x_i x_i^T: The sum ∑(u^T x_i)^2 can be written as u^T C u. This is like collecting all the data’s spread information into a single matrix. C is symmetric and positive semidefinite, so it has real eigenvalues and orthogonal eigenvectors. This structure lets us solve the optimization neatly. Example: For centered data, C is proportional to the covariance matrix.

  • 06

    Constrained optimization and Lagrange multipliers: We must keep u as a unit vector, so we maximize u^T C u subject to u^T u = 1. The Lagrangian adds a penalty term: L(u, λ) = u^T C u − λ(u^T u − 1). Taking the derivative and setting it to zero yields C u = λ u. This is exactly the eigenvector equation. Example: The best u is the eigenvector of C with the largest eigenvalue λ.

  • 07

    Eigenvectors and eigenvalues in PCA: An eigenvector is a special direction that a matrix stretches without rotating; its eigenvalue tells how much it’s stretched. Think of pushing a spring in certain directions—some directions stretch more than others. For PCA, eigenvectors of C are candidate directions, and eigenvalues measure the variance captured. Picking the largest eigenvalue picks the direction of greatest variance. Example: The first principal component points along the longest axis of an elliptical cloud of points.

  • 08

    Variance maximization viewpoint: The variance of projections u^T x_i (for zero-mean data) equals u^T C u. So maximizing variance is the same as maximizing u^T C u. This viewpoint says: choose the line where the data spread out the most. It matters because high variance often means more signal and structure. Example: If points are stretched more along one direction, that’s the direction PCA prefers.

  • 09

    Equivalence of reconstruction and variance views: The lecture proves minimizing reconstruction error equals maximizing projected variance. Geometrically, the line with the largest spread also has the smallest sum of perpendicular distances. This unifies two goals that might seem different at first. Knowing this helps you trust PCA’s solution. Example: Whether you care about shadows being long or perpendicular distances being short, you get the same line.

  • 10

    Extending to k dimensions: For k > 1, pick k orthonormal eigenvectors u1,…,uk of C with the largest eigenvalues. These form a k-dimensional subspace capturing the most variance. Your code for x is [u1^T x, …, uk^T x], and the reconstruction is ∑(u_j^T x) u_j. This gives the best k-dimensional approximation in least-squares sense. Example: A plane that hugs a curved sheet of data better than any single line could.

  • 11

    Principal components and ordering: The selected eigenvectors are called principal components, ordered by descending eigenvalues. Each next component is orthogonal to the previous ones and captures the next biggest chunk of variance. This creates a ranked list of directions of importance. It matters for choosing how many components to keep. Example: Keep the first few components until the remaining variance seems small.

  • 12

    Centering and covariance connection: If the data are zero-mean, C equals n times the covariance matrix. Centering is subtracting the mean from each point so the cloud is around the origin. This simplifies formulas and makes u^T C u exactly proportional to variance. Without centering, mean shifts can affect C. Example: Moving a picture so its center is at the origin before measuring its spread.

  • 13

    Geometric picture of projection: Projection is dropping a perpendicular from a point to a line or plane. The image on the line is the closest point in that subspace to the original. This idea extends to higher dimensions easily. It matters because closeness is measured with Euclidean distance. Example: The foot of the perpendicular from a point to a floor tile is the projection onto that tile.

  • 14

    Orthogonality of components: Different principal components are perpendicular (orthogonal). This means they capture independent directions of variance. Orthogonality makes the math clean and avoids double-counting variance. It matters for stable, interpretable decompositions. Example: Two axes at right angles chart different independent movements of the data.

  • 15

    Objective value equals eigenvalue: If u is an eigenvector with eigenvalue λ, then u^T C u = λ when u is unit length. So choosing the largest eigenvalue directly maximizes the objective. This gives a crisp selection rule. It also anchors the amount of variance captured. Example: The height of the top bar in an eigenvalue plot equals the best objective value.

  • 16

    Low-dimensional codes and reconstruction: The coordinates [u1^T x, …, uk^T x] are the compressed representation. Reconstruction adds these coordinates times their directions to get back an approximation. This keeps the parts of x that lie in the chosen subspace. The leftover is the error orthogonal to that subspace. Example: Keeping only the loudest notes in a chord and dropping faint background sounds.

  • 17

    PCA as unsupervised learning: PCA uses only the inputs and no labels to find structure. It discovers the main patterns of variance in the data. This is useful for preprocessing and understanding datasets. Without labels, it still finds compact summaries. Example: Summarizing a set of photos without knowing who is in them.

  • 18

    Practical computation steps: Build C = ∑ x_i x_i^T (or the covariance after centering). Compute eigenvalues and eigenvectors of C, sort by eigenvalue, and pick top k. Project data by dotting with each chosen eigenvector. This pipeline is straightforward and repeatable. Example: After sorting eigenvalues [5, 3, 1, 0.2], pick the first two for a 2D summary.
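
As a small, illustrative check of the orthogonality and objective-equals-eigenvalue concepts above (not lecture code), the following NumPy snippet verifies that a symmetric eigen-solver returns orthonormal eigenvectors and that u^T C u equals the corresponding eigenvalue for each unit eigenvector:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))                 # illustrative data: d = 4, n = 50 columns
C = X @ X.T                                  # C = sum_i x_i x_i^T

eigvals, eigvecs = np.linalg.eigh(C)         # symmetric solver: real eigenvalues, orthonormal eigenvectors
print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))           # orthonormality: U^T U = I
print(np.allclose([u @ C @ u for u in eigvecs.T], eigvals))  # u^T C u = lambda for each eigenvector
```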

03 Technical Details

  1. Overall architecture/structure of the PCA problem

Goal and data flow:

  • Input: n data points x1, x2, …, xn in R^d (each is a column vector with d numbers).
  • Choose a target dimension k (1 ≤ k ≤ d). For intuition, start with k = 1.
  • Pick direction(s) u (for 1D) or U = [u1, …, uk] (for kD), where each u_j is a unit vector and the set is orthonormal (perpendicular and length 1).
  • Project each x_i onto these directions to get low-dimensional coordinates: for 1D, z_i = u^T x_i; for kD, z_i = U^T x_i (a k-dimensional vector).
  • Optionally reconstruct back to the original space: for 1D, x̂_i = (u^T x_i)u; for kD, x̂_i = U(U^T x_i).
  • Objective: choose u or U that best preserves information. In this lecture, “best” is shown to be equivalent across three views: (a) minimum reconstruction error, (b) maximum variance, and (c) best reconstruction after down-and-up mapping.

Key matrix:

  • Define C = ∑_{i=1}^n x_i x_i^T. If data are centered (mean zero), C is proportional to the covariance matrix, which directly measures spread.
  • C is symmetric (C^T = C) and positive semidefinite (v^T C v ≥ 0 for any vector v). Thus, its eigenvalues are real and nonnegative, and it has an orthonormal basis of eigenvectors.

What PCA computes:

  • PCA computes eigenvectors and eigenvalues of C and selects the top k eigenvectors (those with the largest eigenvalues). These eigenvectors are the principal components.
  • The projection coordinates are obtained by dot products (u_j^T x for each component j). The reconstruction from k components is the sum of component contributions: x̂ = ∑_{j=1}^k (u_j^T x) u_j.
  2. Derivation and implementation details for the 1D case

Step A: Define the reconstruction error

  • For a single direction u with ||u|| = 1, the reconstructed version of x_i is x̂_i = (u^T x_i)u. The squared reconstruction error for x_i is ||x_i − x̂_i||^2.
  • Total error across the dataset: E(u) = ∑_{i=1}^n ||x_i − (u^T x_i)u||^2.

Step B: Expand the squared error

  • Expand ||x_i − (u^T x_i)u||^2 = (x_i − (u^T x_i)u)^T (x_i − (u^T x_i)u).
  • Multiply out: x_i^T x_i + (u^T x_i)^2 u^T u − 2 x_i^T ((u^T x_i)u).
  • Since ||u|| = 1, u^T u = 1, and u^T x_i is a scalar equal to its transpose. The middle term simplifies to (u^T x_i)^2. The third term simplifies to 2(u^T x_i)^2.
  • Thus, per point: ||x_i − (u^T x_i)u||^2 = x_i^T x_i − (u^T x_i)^2.
  • Summing over i: E(u) = ∑ x_i^T x_i − ∑ (u^T x_i)^2.

Step C: Turn minimization into maximization

  • The term ∑ x_i^T x_i does not depend on u, so minimizing E(u) is equivalent to maximizing ∑ (u^T x_i)^2.
  • Write ∑ (u^T x_i)^2 as u^T [∑ x_i x_i^T] u = u^T C u.
  • The 1D PCA problem is: maximize u^T C u subject to u^T u = 1.
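
A quick numerical sanity check of the identities in Steps B and C (a sketch with made-up data, not part of the lecture): the per-point expansion ||x − (u^T x)u||^2 = ||x||^2 − (u^T x)^2 and the rewrite ∑(u^T x_i)^2 = u^T C u.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 20))                 # 20 points in R^3 (made-up data)
u = rng.normal(size=3)
u = u / np.linalg.norm(u)                    # unit-length direction

# Step B identity, checked per point: ||x - (u^T x) u||^2 == ||x||^2 - (u^T x)^2
lhs = [np.sum((x - (u @ x) * u) ** 2) for x in X.T]
rhs = [x @ x - (u @ x) ** 2 for x in X.T]
print(np.allclose(lhs, rhs))

# Step C identity: sum_i (u^T x_i)^2 == u^T C u with C = sum_i x_i x_i^T
C = X @ X.T
print(np.isclose(np.sum((u @ X) ** 2), u @ C @ u))
```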

Step D: Solve with Lagrange multipliers

  • Form the Lagrangian: L(u, λ) = u^T C u − λ(u^T u − 1).
  • Take gradient with respect to u: ∇_u L = 2 C u − 2 λ u. Set to zero: C u = λ u.
  • This is the eigenvalue equation. Valid solutions are eigenvectors u of C; λ is the corresponding eigenvalue.
  • Evaluate objective at an eigenvector u with unit norm: u^T C u = u^T (λ u) = λ (u^T u) = λ. So to maximize u^T C u, pick the eigenvector with the largest eigenvalue.

Interpretation:

  • The top eigenvector points along the direction where the dataset’s projected values have the largest sum of squares, i.e., the largest variance if the data are centered. Geometrically, it is the line that yields the smallest total squared perpendicular distances from points to that line (best 1D reconstruction).
  3. Variance perspective in detail

Definition of variance of projections:

  • For centered data (mean of x_i is zero), the variance of the projected values z_i = u^T x_i is Var(z) = (1/n) ∑ (u^T x_i)^2.
  • Up to a constant factor 1/n, this is exactly u^T C u. Therefore, maximizing variance is equivalent to maximizing u^T C u.

Equivalence of goals:

  • Because the reconstruction-error objective reduces (modulo a constant) to −u^T C u, minimizing error is equivalent to maximizing variance. Hence, both views choose the same direction u: the top eigenvector of C.
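
The equivalence above can be checked numerically; this is an illustrative sketch (synthetic data) confirming that, for centered data, the variance of the projected values u^T x_i equals u^T C u / n:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)        # center: subtract the mean of each feature

u = rng.normal(size=3)
u = u / np.linalg.norm(u)

z = u @ X                                    # projected values u^T x_i
C = X @ X.T
n = X.shape[1]
print(np.isclose(z.var(), (u @ C @ u) / n))  # variance of projections = u^T C u / n
```
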
  4. Extension to k dimensions

Set up:

  • Choose k orthonormal vectors U = [u1, u2, …, uk] in R^d with U^T U = I_k. Project each x_i to z_i = U^T x_i (a k×1 vector). Reconstruct back as x̂_i = U z_i = U(U^T x_i).

Reconstruction error:

  • Total error is E(U) = ∑ ||x_i − U(U^T x_i)||^2. This equals the sum of squared distances from points to the k-dimensional subspace spanned by the columns of U.

Optimal U:

  • The optimal U that minimizes E(U) (equivalently, maximizes the sum of variances captured) is given by choosing u1,…,uk as the eigenvectors of C with the largest eigenvalues λ1 ≥ λ2 ≥ … ≥ λk.
  • These eigenvectors are mutually orthogonal and define the principal subspace.

Variance captured and ordering:

  • The value u_j^T C u_j equals λ_j. The total variance captured by the k-dimensional subspace is the sum of the top k eigenvalues ∑_{j=1}^k λ_j (up to constant scaling if using covariance). The reconstruction error equals the sum of the remaining eigenvalues ∑_{j=k+1}^d λ_j.
  5. Tools/libraries and computation notes

Mathematical tools:

  • Eigen-decomposition: For a symmetric matrix C, we compute eigenpairs (λ_j, u_j) satisfying C u_j = λ_j u_j, sorted by λ_j.
  • Orthonormality: The eigenvectors of a symmetric C can be chosen orthonormal. This gives U^T U = I_k and simplifies projection and reconstruction.

Implementation approach (language-agnostic):

  • Step 1: Arrange your data points x_i as column vectors in a matrix X of size d×n.
  • Step 2: (Optional but common) Center the data by subtracting the mean vector μ = (1/n) ∑ x_i from each x_i to get X_centered.
  • Step 3: Build C = ∑ x_i x_i^T. If using centered data, C = X_centered X_centered^T.
  • Step 4: Compute the eigenvalues and eigenvectors of C. Many numerical libraries do this (e.g., NumPy’s linalg.eigh for symmetric matrices).
  • Step 5: Sort eigenvalues in descending order and take the top k eigenvectors U_k = [u1,…,uk].
  • Step 6: To project a new point x, compute z = U_k^T x. To reconstruct, compute x̂ = U_k z.

Important parameters and meanings:

  • k: the target dimension. Larger k captures more variance but uses more numbers per data point. k = 1 gives a single best direction; k = d recovers the original data exactly.
  • Unit norm constraints: Each u_j must have length 1 to make projections comparable and to ensure U^T U = I.
  6. Step-by-step implementation guide (follow-along)
  • Step 1: Collect data as a list of vectors x1,…,xn in R^d.
  • Step 2: Compute the mean μ = (1/n) ∑ x_i (optional but recommended). Form centered data y_i = x_i − μ.
  • Step 3: Build C = ∑ y_i y_i^T (or if skipping centering, use x_i instead, as in the lecture’s notation). In matrix form with centered data: C = Y Y^T, where Y = [y1 … yn].
  • Step 4: Solve the eigenvalue problem C u = λ u. Use a symmetric eigen-solver. You get eigenvalues λ1 ≥ λ2 ≥ … ≥ λd and corresponding unit eigenvectors u1,…,ud.
  • Step 5: Choose k. Pick principal components U_k = [u1,…,uk].
  • Step 6: Encode each data point with z_i = U_k^T y_i (or U_k^T x_i if not centered). This gives k numbers per point.
  • Step 7: Reconstruct if desired: ŷ_i = U_k z_i; add back μ to get x̂_i = ŷ_i + μ if you centered.
  • Step 8: Evaluate quality: the total variance captured is ∑_{j=1}^k λ_j, and the leftover variance is ∑_{j=k+1}^d λ_j. A NumPy sketch of these steps appears below.
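
Here is one way the eight steps above might look in NumPy. It is a sketch under the lecture's formulas, with illustrative data, dimensions, and variable names, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 6, 200, 2
# Columns of X are the data points x_i (made-up data with decreasing spread per feature)
X = rng.normal(size=(d, n)) * np.linspace(3.0, 0.2, d)[:, None]

# Steps 1-2: collect data, compute the mean, and center
mu = X.mean(axis=1, keepdims=True)
Y = X - mu

# Step 3: spread matrix C = sum_i y_i y_i^T = Y Y^T
C = Y @ Y.T

# Step 4: symmetric eigen-solver; reorder so eigenvalues are descending
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the top-k principal components
U_k = eigvecs[:, :k]

# Step 6: encode every point with k numbers
Z = U_k.T @ Y                                   # shape (k, n)

# Step 7: reconstruct and add the mean back
X_hat = U_k @ Z + mu

# Step 8: captured vs. leftover spread (sums of squares, matching C = Y Y^T)
captured, leftover = eigvals[:k].sum(), eigvals[k:].sum()
recon_error = np.sum((X - X_hat) ** 2)
print(np.isclose(recon_error, leftover))        # total squared error = sum of dropped eigenvalues
print(captured, leftover)
```

The final check mirrors the claim from the lecture: the total squared reconstruction error equals the sum of the eigenvalues that were dropped.
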
  7. Tips and warnings
  • Centering: While the lecture defined C without requiring zero mean, in practice center your data so that C describes pure variance. If you don’t, mean shifts can inflate directions unfairly.
  • Scaling: If features have very different scales (like centimeters vs. kilometers), consider normalizing before PCA so one feature doesn’t dominate variance just due to units.
  • Numerical stability: Use symmetric eigen-solvers (like eigh) on C. For very high dimensions with far fewer samples (d >> n), computing eigenvectors of the n×n Gram matrix X^T X and mapping them back can be much faster; a sketch of this shortcut follows this list.
  • Choosing k: Look at eigenvalues; keep components until the marginal gain becomes small. This balances compression and information retained.
  • Interpretation: PCA finds directions of maximum variance, not necessarily directions that best separate classes. Use it as a preprocessing or visualization tool, not a classifier by itself.
  • Orthogonality: Ensure the chosen eigenvectors are orthonormal. Most libraries return normalized, orthogonal vectors for symmetric matrices, but verify U^T U ≈ I numerically.
  • Reconstruction check: After projection and reconstruction, errors should be perpendicular to the chosen subspace. If not, re-check your projection formula.
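
Here is a hedged sketch of the d >> n shortcut mentioned in the numerical-stability tip (illustrative sizes and data): eigendecompose the small n×n Gram matrix X^T X and map its eigenvectors back through X, normalizing by √λ, to recover unit-length principal directions in R^d.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 10_000, 50                              # many features, few samples (illustrative sizes)
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)          # center

# Small n x n Gram matrix instead of the huge d x d matrix C = X X^T
G = X.T @ X
lam, V = np.linalg.eigh(G)                     # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                 # reorder to descending

k = 3
# Map Gram eigenvectors back to R^d: if G v = lam v, then C (X v) = lam (X v)
U_k = X @ V[:, :k] / np.sqrt(lam[:k])          # columns become unit-length principal directions

print(np.allclose(U_k.T @ U_k, np.eye(k)))              # orthonormal directions
print(np.allclose(X @ (X.T @ U_k), U_k * lam[:k]))      # eigenvectors of C with the same eigenvalues
```
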
  8. Why the approaches are equivalent (intuitive recap)
  • Minimizing reconstruction error: You want the line/plane that hugs the data cloud most closely, so the perpendicular distances are as small as possible on average.
  • Maximizing variance: You want the direction where the shadows of points (their projections) are as long as possible overall. Longer shadows imply the line runs along the cloud’s long axis.
  • Algebra ties them together: Expanding the squared distance cancels constants and reveals the same u^T C u structure as the variance. The Lagrange multiplier trick lands on the eigenvector condition, sealing the equivalence.
  9. Geometric properties and outcomes
  • First principal component (PC1): Points along the axis of maximum spread. Captures the largest portion of total variance.
  • Second principal component (PC2): Orthogonal to PC1 and captures the next most spread. In 2D, PC1 and PC2 form a rotated coordinate system aligned with the data’s ellipse.
  • Higher components: Each new component is perpendicular to all previous ones and explains the next slice of variance. Stopping after k gives a k-dimensional subspace.
  • Reconstruction: The PCA reconstruction is always the orthogonal projection of x onto the chosen subspace. The error x − x̂ is orthogonal to that subspace.
  10. Practical example scaffolding (no code needed)
  • Imagine data that form a cigar-shaped cloud in 3D. PC1 runs along the cigar’s length. Projecting onto PC1 collapses each point to a single coordinate along that length. Adding PC2 captures the cigar’s minor width, and reconstruction with PC1+PC2 gives a flat, near-cigar sheet approximation in 3D.
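
The lecture's cigar example needs no code, but a quick synthetic check (made-up parameters) shows PC1 recovering the cigar's long axis:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
axis = np.array([1.0, 1.0, 0.2])
axis = axis / np.linalg.norm(axis)                 # the "cigar" direction (chosen arbitrarily)

# Points spread widely along `axis`, narrowly in the other directions
X = 5.0 * axis[:, None] * rng.normal(size=(1, n)) + 0.3 * rng.normal(size=(3, n))
X = X - X.mean(axis=1, keepdims=True)

C = X @ X.T
eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, -1]                               # top eigenvector = PC1

print(abs(pc1 @ axis))                             # close to 1: PC1 aligns with the cigar's length
print(eigvals[::-1] / eigvals.sum())               # most of the spread sits in the first component
```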

This technical walkthrough mirrors the lecture’s derivations: starting with reconstruction error, simplifying to a matrix quadratic form, solving with eigen-decomposition under a unit-length constraint, understanding projections as variance, and generalizing from 1D to kD. It also adds clear, stepwise guidance so you can implement PCA on your own data using the exact formulas discussed.

04 Examples

  • 💡

    1D projection and reconstruction of a single point: Input is a point x in 3D and a unit direction u. Compute the projection z = u^T x (a single number). Reconstruct as x̂ = z u, which lies on the line through the origin in direction u. The key point: reconstruction is the orthogonal projection of x onto the line.

  • 💡

    Sum of squared reconstruction errors over a dataset: Given points x1,…,xn and direction u, compute E(u) = ∑ ||x_i − (u^T x_i)u||^2. Expanding shows E(u) = ∑ x_i^T x_i − ∑ (u^T x_i)^2. Since the first sum is constant in u, minimizing E(u) is the same as maximizing ∑ (u^T x_i)^2. The instructor emphasized this flip from minimization to maximization.

  • 💡

    Matrix rewrite with C: For the same dataset, define C = ∑ x_i x_i^T. Then ∑ (u^T x_i)^2 = u^T C u. The optimization becomes maximize u^T C u subject to u^T u = 1. This compact form makes it solvable by eigen-decomposition.

  • 💡

    Lagrange multiplier solution: Set L(u, λ) = u^T C u − λ(u^T u − 1). Differentiate: ∇_u L = 2Cu − 2λu = 0, giving Cu = λu. Therefore the candidate solutions are eigenvectors of C, with objective value equal to λ. Pick the eigenvector with the largest eigenvalue to maximize the objective.

  • 💡

    Variance of projections: Assume centered data, and compute the projected values z_i = u^T x_i. The variance is (1/n) ∑ z_i^2, which equals (1/n) u^T C u. Thus, the direction that maximizes variance also maximizes u^T C u. This confirms equivalence with the reconstruction-error view.

  • 💡

    2D elongated cloud example: Imagine 2D points forming a stretched ellipse tilted 30 degrees from the x-axis. PC1 lies along the long axis of the ellipse, PC2 along the short axis, and they are perpendicular. Projecting onto PC1 alone gives 1D coordinates that spread widely; reconstruction uses that single axis and drops the short-axis detail. This demonstrates that PCA aligns axes to the data’s natural spread.

  • 💡

    k-dimensional projection and reconstruction: Choose k eigenvectors u1,…,uk with largest eigenvalues and form U. For a point x, compute z = U^T x (k numbers) and reconstruct x̂ = U z. This keeps the part of x that lies in the subspace spanned by U and drops the orthogonal remainder. It shows how PCA extends naturally beyond 1D.

  • 💡

    Centering and covariance connection: With mean μ, form centered points y_i = x_i − μ and build C = ∑ y_i y_i^T. Now u^T C u equals n times the variance of projections. This ties PCA directly to the classical covariance matrix. It explains why subtracting the mean is commonly done.

  • 💡

    Error as perpendicular distance: Draw a point x and a line in direction u. The projection x̂ = (u^T x)u is the foot of the perpendicular from x to the line. The vector x − x̂ is perpendicular to u. Summing squared lengths of these perpendiculars across points gives the total reconstruction error.

  • 💡

    Ordering by eigenvalues: Suppose eigenvalues are λ1 = 5, λ2 = 2, λ3 = 0.5. Keeping k = 2 components captures variance 5 + 2 = 7 units, while the leftover 0.5 is dropped. This numeric example shows how eigenvalues quantify captured vs. lost information. It helps decide how many components to keep.

  • 💡

    Projection as dot products: Take a data point x = [7, 4]^T and a unit vector u = [0.6, 0.8]^T. The projection is z = u^T x = 0.6×7 + 0.8×4 = 4.2 + 3.2 = 7.4. Reconstruction is x̂ = 7.4 u = [4.44, 5.92]^T. The difference x − x̂ is perpendicular to u, confirming correct projection (this check appears in code after the examples).

  • 💡

    Orthogonality and no double-counting: In 3D, let u1 and u2 be top two principal components. Because u1 ⊥ u2, variance captured by u1 does not reappear in u2. The second component explains what the first did not. This example highlights why orthogonality is crucial for clean, additive variance explanation.
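
The projection example with x = [7, 4]^T and the eigenvalue bookkeeping example above can be checked directly; this is an illustrative snippet, not lecture code:

```python
import numpy as np

# Worked projection example: x = [7, 4]^T, u = [0.6, 0.8]^T (unit length)
x = np.array([7.0, 4.0])
u = np.array([0.6, 0.8])
z = u @ x                          # 0.6*7 + 0.8*4 = 7.4
x_hat = z * u                      # [4.44, 5.92]
print(z, x_hat, np.isclose(u @ (x - x_hat), 0.0))   # residual is perpendicular to u

# Eigenvalue bookkeeping example: keep k = 2 of eigenvalues [5, 2, 0.5]
eigvals = np.array([5.0, 2.0, 0.5])
k = 2
print(eigvals[:k].sum(), eigvals[k:].sum())          # captured 7.0, dropped 0.5
```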

05 Conclusion

This lecture built a complete, unified understanding of Principal Component Analysis (PCA) by showing three perspectives that lead to the same answer. First, PCA as minimizing reconstruction error: choose a line or k-dimensional subspace so that the total squared perpendicular distance from points to that subspace is as small as possible. Second, PCA as maximizing variance: pick directions where the projections of points spread out the most. Third, PCA as best reconstruction after down-and-up mapping: the same directions also give the most faithful reconstructions when you compress to k numbers and then expand back.

The heart of the math uses a single matrix C = ∑ x_i x_i^T. The 1D problem reduces to maximizing u^T C u with u being a unit vector, which is solved by the eigenvector of C with the largest eigenvalue. The lecture’s derivation via Lagrange multipliers gives the eigenvector equation C u = λ u, and evaluating the objective at an eigenvector shows it equals λ, making the selection rule obvious. Extending to k dimensions, you take the top k eigenvectors, which are orthonormal, and use them to define the principal subspace. Your compressed code for a point is its dot products with those eigenvectors, and reconstruction is their weighted sum.

Practically, PCA is straightforward to implement and broadly useful. Arrange your data, center it if desired, build C, compute eigenpairs, sort, and select k. The principal components are the eigenvectors with the largest eigenvalues, and they capture the most structure in the data. The method is unsupervised and serves as a powerful tool for compression, denoising, visualization, and preprocessing. Orthogonality guarantees clean, non-overlapping contributions from each component, and the eigenvalues quantify how much information each component holds.

To practice, try PCA on a small 2D dataset and visualize the principal axes, or on image patches to see how a few components can describe complex patterns. Next steps include exploring how to choose k by examining eigenvalues and understanding extensions like whitening or probabilistic PCA. The core message to remember is simple and strong: the same directions that minimize reconstruction error also maximize variance, and you find them by taking the top eigenvectors of ∑ x_i x_i^T. With this, you can confidently apply PCA to real-world datasets and use it as a foundational building block in your machine learning toolkit.
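
As a starting point for that practice suggestion, here is a small, self-contained sketch (synthetic 2D data, illustrative parameters) that builds a cloud stretched along a 30-degree direction and recovers that angle from the first principal component:

```python
import numpy as np

rng = np.random.default_rng(8)
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # rotate by 30 degrees

# Axis-aligned ellipse (std 3 along x, 0.5 along y), then rotated
n = 5000
X = R @ (np.array([[3.0], [0.5]]) * rng.normal(size=(2, n)))
X = X - X.mean(axis=1, keepdims=True)

C = X @ X.T
eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, -1]                                # first principal component

angle = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180.0
print(angle)                                        # should be close to 30 degrees
```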

Key Takeaways

  • Start with centering your data unless you have a strong reason not to. Subtract the mean so that your spread matrix directly reflects variance and not location. This makes u^T C u exactly match projected variance up to a constant. Centering usually improves both numerical stability and interpretability.
  • Build the spread matrix as C = ∑ x_i x_i^T (or X_centered X_centered^T for centered data). Ensure C is symmetric, which it naturally is from construction. Use a symmetric eigen-solver to find eigenvalues and eigenvectors. Sorting eigenvalues in descending order prepares you to choose top components.
  • Use the largest eigenvalues to pick principal components. The eigenvector with the largest eigenvalue is the first principal component; the next largest gives the second, and so on. These are orthogonal directions capturing decreasing amounts of variance. This ranking guides how much to keep.
  • Encode points by dotting with principal components. For 1D, use z = u^T x; for kD, z = U^T x. These coordinates are your compressed representation. They often simplify later tasks like clustering or regression.
  • Reconstruct points by summing component contributions. For 1D, x̂ = (u^T x)u; for kD, x̂ = U(U^T x). The reconstruction is the orthogonal projection onto the principal subspace. Differences x − x̂ are perpendicular to the subspace and represent lost detail.
  • Interpret PCA results through eigenvalues. The sum of top-k eigenvalues tells how much variance is captured. The leftover eigenvalues show what you are discarding. Plotting them helps you decide a good k.
  • Remember the equivalence: maximum variance equals minimum reconstruction error. Don’t worry about choosing between these goals—they produce the same principal components. Use the viewpoint that makes more sense for your task. This unity is the core beauty of PCA.
  • Keep features on comparable scales when needed. If one feature has much larger units than others, it might dominate variance unfairly. Consider standardization so each feature contributes fairly. This can change principal components meaningfully.
  • Check orthonormality of components. Numerically, tiny errors can creep in, so verify U^T U ≈ I. Orthonormal components ensure clean encoding and reconstruction. If not orthonormal, fix by re-normalizing or re-computing.
  • Use PCA for visualization and compression wisely. For visualization, pick k = 2 or 3; for compression, choose k that balances file size with accuracy. Always measure reconstruction error or variance captured. These checks prevent over- or under-reduction.
  • Apply PCA as a preprocessing step to speed up models. Fewer dimensions mean faster training and often better generalization by removing noise. But remember PCA is unsupervised, so it may not align with class boundaries. Combine it with supervised learning when needed.
  • Be careful with interpretation: PCA axes are linear combinations of original features. They may not map cleanly to single, simple meanings. Use loadings (component entries) to understand contributions. Communicate that components are rotated axes capturing spread.
  • For small sample sizes, consider computational shortcuts. When d is large and n is small, compute eigenvectors via the n×n Gram matrix X^T X and map them back to directions in R^d. This avoids heavy computation on very large matrices. It gives the same principal directions in the data span.
  • Validate your PCA pipeline end-to-end. Test projection, reconstruction, and error properties on a toy dataset first. Ensure errors are orthogonal and that captured variance matches the sum of chosen eigenvalues. This builds confidence before applying PCA to large, real datasets.

Glossary

Principal Component Analysis (PCA)

A method to reduce many features into fewer while keeping most of the important variation. It finds special directions in the data where the data spreads out the most. These directions are called principal components. Using them lets us compress data and still keep its shape. It is unsupervised, meaning it uses no labels.

Projection

The act of dropping a vector onto a line or plane to find the closest point within that line or plane. It is like shining a light and looking at the shadow on the wall. The projected value onto a unit vector u is u^T x. This keeps the part of x that lies along u and throws away the rest. Projection is key for dimensionality reduction.

Reconstruction

Building an approximation of the original data from its low-dimensional code. For one direction u, the reconstruction is (u^T x)u. It lies on the chosen line or subspace. The difference between the original and reconstructed point is the error we made. Good reconstructions mean we kept the important parts.

Reconstruction error

How far the reconstructed point is from the original point. It is measured as a squared distance and summed over all points. Small error means the low-dimensional subspace fits the data well. Large error means the subspace misses important structure. PCA chooses directions that make this error small.

Variance

A number that shows how spread out values are. High variance means values are very different from each other; low variance means they are close. In PCA, we look at how spread the projections are along a direction. The direction with the biggest spread is the most informative. This guides which directions to keep.

Eigenvector

A special direction that a matrix scales without rotating. When you apply the matrix to this vector, the result points in the same direction. Eigenvectors of a symmetric matrix are orthogonal to each other. In PCA they serve as principal components. They define the axes of the new low-dimensional space.

Eigenvalue

A number that shows how much an eigenvector is stretched by a matrix. It pairs with an eigenvector in the equation C u = λ u. In PCA, bigger eigenvalues mean more variance is captured along that direction. Sorting eigenvalues helps pick the top components. Each eigenvalue tells the importance of its direction.

Covariance matrix

A matrix that measures how features vary together. The diagonal entries show variance of each feature; off-diagonals show how two features change together. For zero-mean data, PCA’s matrix C is proportional to the covariance matrix. This makes u^T C u directly tied to variance. It’s the core of the variance view.
