Principal Component Analysis (PCA)
Key Points
- Principal Component Analysis (PCA) finds new orthogonal axes (principal components) that capture the maximum variance in your data.
- You must first center the data by subtracting the mean of each feature; otherwise PCA gives wrong directions.
- Computationally, PCA is an eigendecomposition of the covariance matrix or, equivalently, an SVD of the centered data matrix.
- Projecting data onto the top k principal components reduces dimensionality while preserving most variance and often denoises.
- Explained variance ratios from eigenvalues help you choose k to meet a variance retention target (e.g., 95%).
- For large datasets, avoid forming the covariance explicitly; use power iteration with matrix–vector products via X_c and X_c^{\top}.
- Store the fitted mean and components to transform future data consistently and to reconstruct approximations if needed.
- PCA is linear; for curved manifolds or nonlinear patterns consider kernel PCA or nonlinear methods.
Prerequisites
- Vectors and matrices — PCA manipulates matrices, performs matrix–vector products, and relies on orthogonality and norms.
- Variance and covariance — Understanding covariance clarifies why PCA maximizes variance along components.
- Eigenvalues and eigenvectors — PCA is defined by the eigendecomposition of the covariance matrix.
- Iterative eigenmethods (power iteration) — Efficient top-k PCA often uses power/orthogonal iteration instead of full eigendecomposition.
- Feature scaling and standardization — When features have different units or scales, standardization affects PCA results.
Detailed Explanation
01 Overview
Principal Component Analysis (PCA) is a linear technique for reducing the dimensionality of a dataset while preserving as much variability (information) as possible. It re-expresses the data in terms of new, orthogonal axes called principal components. Each component is a direction in feature space along which the data varies the most, with the first component capturing the maximum variance, the second capturing the next largest variance subject to being orthogonal to the first, and so on. Practically, PCA helps compress data, visualize high-dimensional patterns, remove noise, and decorrelate features. Mathematically, PCA is performed by computing the covariance matrix of mean-centered data and then finding its eigenvalues and eigenvectors. The eigenvectors define the principal directions, and the eigenvalues quantify how much variance each direction captures. Alternatively, PCA can be obtained via the Singular Value Decomposition (SVD) of the centered data matrix, which is often numerically robust. After fitting PCA, we can project original samples onto the top k components to obtain a lower-dimensional representation, and we can approximately reconstruct original data by reversing the projection. Choosing k usually involves examining cumulative explained variance. Because PCA is linear and sensitive to scale and outliers, thoughtful preprocessing (centering, optionally standardization) is essential.
02 Intuition & Analogies
Imagine a cloud of points representing your data as a flock of birds in 3D space. If you want to understand the flock’s main flight direction, you’d look for the longest axis of the cloud. That is the first principal component: the direction where the flock spreads out the most. Next, you’d look for the second longest axis, as long as it’s perpendicular to the first. Together, these axes describe the main shape of the cloud. If someone told you to summarize the flock with a simpler description—say, a 2D shadow—you would choose the best viewing plane to capture as much of the flock’s shape as possible. Projecting onto the first two principal components is exactly picking that best plane. In photography, this is like finding the angle where the object appears most detailed; in audio, it’s like separating loud, informative tones from faint background noise. PCA automates this by calculating the directions of maximum spread (variance) in your data. Each principal component is a recipe (a weighted mixture) of your original features that points along these informative directions. By keeping only the top few components, you keep the essence of the data while discarding directions that look mostly like noise. That’s why PCA is great for compression and denoising. But because it is a straight-line (linear) summary, it can miss curved patterns—like trying to approximate a spiral with straight rulers.
03 Formal Definition
Let X be an n × d data matrix with n samples as rows and d features as columns. PCA first centers the data, X_c = X - \mathbf{1}\mu^{\top}, where \mu = \frac{1}{n} X^{\top}\mathbf{1} is the vector of feature means. The sample covariance is C = \frac{1}{n-1} X_c^{\top} X_c. PCA seeks unit vectors that successively maximize variance: v_1 = \arg\max_{\|v\|=1} v^{\top} C v, and each subsequent v_i maximizes v^{\top} C v subject to \|v\|=1 and orthogonality to v_1, \ldots, v_{i-1}. By the spectral theorem, these maximizers are the eigenvectors of C with eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0, and the variance captured along v_i is \lambda_i. Stacking the top k eigenvectors as columns of V_k (d × k), the scores are Y = X_c V_k and the rank-k reconstruction is \hat{X} = Y V_k^{\top} + \mathbf{1}\mu^{\top}.
04 When to Use
Use PCA when you need to reduce dimensionality while retaining most variance: for visualization (projecting to 2D/3D), compressing features before downstream tasks (e.g., clustering, regression), denoising by discarding low-variance directions, and decorrelating features to simplify models. It is well-suited when relationships are approximately linear, noise is roughly isotropic, and features are commensurate (or standardized if on different scales). PCA is also helpful when d is large and you want to mitigate the curse of dimensionality by capturing structure with far fewer components. Consider PCA before k-means clustering to improve cluster separation, or before linear regression to reduce multicollinearity (principal component regression). Use SVD-based approaches or covariance-free matrix–vector products for large n or d to avoid materializing huge covariance matrices. Do not rely on PCA when patterns are strongly nonlinear, when interpretability in original feature space is critical, or when outliers dominate variance. For heterogeneous units (e.g., kilograms vs meters), standardize features first. In supervised settings, PCA ignores labels; confirm that variance correlates with predictive signal or prefer supervised dimensionality reduction.
⚠️Common Mistakes
- Skipping centering: Not subtracting feature means causes the first component to align with the mean offset instead of true variation. Always center; standardize if units differ.
- Forgetting to sort eigenpairs: Some libraries return unsorted eigenvalues. Ensure components are ordered by decreasing eigenvalue.
- Mixing up shapes: Confusing whether samples are rows or columns leads to wrong covariance. In our convention, samples are rows and features are columns.
- Data leakage: Fitting PCA (means, components) on the full dataset before splitting leaks test information. Fit on training data; apply to validation/test with the stored mean and components.
- Keeping too many/few components: Arbitrary k may underfit or include mostly noise. Use explained variance plots or thresholds (e.g., 95%).
- Numerical instability: Explicit covariance of very high-dimensional or ill-conditioned data can be unstable. Prefer SVD or covariance-free multiplies X_c^{\top}(X_c v).
- Outliers: PCA is variance-driven and can chase outliers. Consider robust scaling or outlier removal.
- Misinterpretation: Loadings (eigenvectors) are linear combinations of features; large coefficients do not always imply causality.
Key Formulas
Feature mean
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
Explanation: The mean vector averages each feature across all samples. Subtracting this vector centers the data for PCA.
Mean-centering
X_c = X - \mathbf{1}\mu^{\top}
Explanation: Center the data matrix by subtracting the feature means from every row. PCA requires centered data to measure variance about zero.
Sample covariance
C = \frac{1}{n-1} X_c^{\top} X_c
Explanation: This symmetric matrix captures pairwise covariances between features. PCA performs an eigendecomposition of C.
Eigendecomposition
C = V \Lambda V^{\top}, \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d), \quad \lambda_1 \ge \cdots \ge \lambda_d \ge 0
Explanation: The covariance decomposes into orthonormal eigenvectors and nonnegative eigenvalues. Eigenvectors define principal directions; eigenvalues give variances along them.
SVD relation
X_c = U \Sigma V^{\top}, \quad \lambda_i = \frac{\sigma_i^2}{n-1}
Explanation: The right singular vectors of centered data equal PCA loadings. Squared singular values (scaled by 1/(n−1)) are the eigenvalues of the covariance.
Projection (scores)
Y = X_c V_k
Explanation: Multiplying centered data by the top k eigenvectors yields k-dimensional coordinates that preserve maximal variance.
Reconstruction
\hat{X} = Y V_k^{\top} + \mathbf{1}\mu^{\top}
Explanation: Map low-dimensional scores back to the original feature space using the top components, then add back the mean.
Explained variance
r_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}, \quad R_k = \sum_{i=1}^{k} r_i
Explanation: r_i is the fraction of variance captured by component i; R_k is the cumulative fraction for the first k components. Use these to choose k.
Optimal reconstruction error
\frac{1}{n-1} \lVert X_c - X_c V_k V_k^{\top} \rVert_F^2 = \sum_{i=k+1}^{d} \lambda_i
Explanation: The squared Frobenius error of the best rank-k PCA reconstruction (scaled by 1/(n−1)) equals the sum of discarded eigenvalues. PCA is the optimal linear rank-k approximation.
Power iteration
v^{(t+1)} = \frac{C v^{(t)}}{\lVert C v^{(t)} \rVert}
Explanation: Iteratively multiplying by C and normalizing converges to the dominant eigenvector. Orthogonalize to obtain subsequent components.
Whitening transform
Y_w = X_c V_k \Lambda_k^{-1/2}
Explanation: Scaling scores by inverse square roots of eigenvalues yields features with approximately identity covariance (unit variance and zero covariance).
Dual (n \ll d) trick
G = \frac{1}{n-1} X_c X_c^{\top}, \quad G u_i = \lambda_i u_i, \quad v_i = \frac{X_c^{\top} u_i}{\lVert X_c^{\top} u_i \rVert}
Explanation: When features far outnumber samples, compute eigenpairs of the n × n sample Gram matrix G. Map left singular vectors back to feature space to obtain principal directions.
Complexity Analysis
Forming the covariance explicitly costs O(n d^2) time and O(d^2) memory, and a full eigendecomposition adds O(d^3). A thin SVD of the centered data costs O(n d \min(n, d)). Power/orthogonal iteration for the top k components needs only matrix–vector products: O(n d k) per sweep, with O(n d + d k) memory and no d × d matrix, which is why it scales to large d. The dual trick replaces the d × d eigenproblem with an n × n one at cost O(n^2 d + n^3), a win when n \ll d.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Helper to index row-major matrices stored in a flat vector
inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

// Compute column means (size d) and center X in place (n x d)
vector<double> center_columns(vector<double>& X, size_t n, size_t d) {
    vector<double> mu(d, 0.0);
    for (size_t j = 0; j < d; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += X[idx(i, j, d)];
        mu[j] = s / static_cast<double>(n);
        for (size_t i = 0; i < n; ++i) X[idx(i, j, d)] -= mu[j];
    }
    return mu;
}

// y = X * v where X is (n x d), v is (d)
void matVec_X_v(const vector<double>& X, const vector<double>& v, vector<double>& y, size_t n, size_t d) {
    fill(y.begin(), y.end(), 0.0);
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        const double* row = &X[idx(i, 0, d)];
        for (size_t j = 0; j < d; ++j) s += row[j] * v[j];
        y[i] = s;
    }
}

// z = X^T * u where X is (n x d), u is (n)
void matTVec_XT_u(const vector<double>& X, const vector<double>& u, vector<double>& z, size_t n, size_t d) {
    fill(z.begin(), z.end(), 0.0);
    for (size_t j = 0; j < d; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += X[idx(i, j, d)] * u[i];
        z[j] = s;
    }
}

// w = (1/(n-1)) * X^T * (X * v)  (covariance times vector) without forming C
void covMatVec_via_data(const vector<double>& Xc, const vector<double>& v, vector<double>& w, size_t n, size_t d) {
    static thread_local vector<double> tmp; // reusable buffer for Xc * v
    tmp.assign(n, 0.0);
    matVec_X_v(Xc, v, tmp, n, d);    // tmp = Xc * v      (n)
    matTVec_XT_u(Xc, tmp, w, n, d);  // w   = Xc^T * tmp  (d)
    double scale = 1.0 / static_cast<double>(n > 1 ? (n - 1) : 1);
    for (size_t j = 0; j < d; ++j) w[j] *= scale;
}

// Dot product and L2 norm
inline double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s;
}
inline double norm2(const vector<double>& a) { return sqrt(max(0.0, dot(a, a))); }

// Orthogonalize v against the first m columns of W (d x k, row-major)
// using Modified Gram-Schmidt. Note column c of W has stride k, not m.
void orthogonalize(vector<double>& v, const vector<double>& W, size_t d, size_t m, size_t k) {
    for (size_t c = 0; c < m; ++c) {
        double proj = 0.0;
        for (size_t j = 0; j < d; ++j) proj += v[j] * W[idx(j, c, k)];
        for (size_t j = 0; j < d; ++j) v[j] -= proj * W[idx(j, c, k)];
    }
}

// Power iteration with orthogonalization to find top-k eigenvectors of covariance
struct PCAResult { vector<double> mean; vector<double> components; vector<double> eigenvalues; size_t n, d, k; };

PCAResult pca_power_iteration(const vector<double>& X_in, size_t n, size_t d, size_t k,
                              size_t max_iter = 500, double tol = 1e-6, uint64_t seed = 42) {
    vector<double> X = X_in; // copy to center in place
    vector<double> mean = center_columns(X, n, d);

    // Storage for components (d x k, one component per column) and eigenvalues
    vector<double> W(d * k, 0.0);
    vector<double> lambdas(k, 0.0);

    mt19937_64 rng(seed);
    normal_distribution<double> dist(0.0, 1.0);

    vector<double> v(d), w(d);

    for (size_t comp = 0; comp < k; ++comp) {
        // Random initialization
        for (size_t j = 0; j < d; ++j) v[j] = dist(rng);
        // Orthogonalize against previous components
        if (comp > 0) orthogonalize(v, W, d, comp, k);
        double nv = norm2(v); if (nv > 0) for (size_t j = 0; j < d; ++j) v[j] /= nv;

        double prev_val = 0.0;
        for (size_t it = 0; it < max_iter; ++it) {
            // Multiply by covariance via data and orthogonalize
            covMatVec_via_data(X, v, w, n, d);
            if (comp > 0) orthogonalize(w, W, d, comp, k);
            double nw = norm2(w);
            if (nw == 0.0) break; // degenerate
            for (size_t j = 0; j < d; ++j) v[j] = w[j] / nw;
            // Rayleigh quotient as eigenvalue estimate
            covMatVec_via_data(X, v, w, n, d);
            double val = dot(v, w);
            if (fabs(val - prev_val) < tol * max(1.0, fabs(prev_val))) { prev_val = val; break; }
            prev_val = val;
        }
        // Save component v into W(:, comp)
        for (size_t j = 0; j < d; ++j) W[idx(j, comp, k)] = v[j];
        // Final eigenvalue estimate
        covMatVec_via_data(X, v, w, n, d);
        lambdas[comp] = dot(v, w);
    }

    return {mean, W, lambdas, n, d, k};
}

// Project data onto components: Y = Xc * W_k  (n x k)
vector<double> project_scores(const vector<double>& X, const vector<double>& mean,
                              const vector<double>& W, size_t n, size_t d, size_t k) {
    vector<double> Y(n * k, 0.0);
    for (size_t i = 0; i < n; ++i) {
        // Center each row on the fly while projecting
        for (size_t c = 0; c < k; ++c) {
            double s = 0.0;
            for (size_t j = 0; j < d; ++j) s += (X[idx(i, j, d)] - mean[j]) * W[idx(j, c, k)];
            Y[idx(i, c, k)] = s;
        }
    }
    return Y;
}

// Reconstruct: X_hat = Y * W^T + 1 * mean^T
vector<double> reconstruct(const vector<double>& Y, const vector<double>& W, const vector<double>& mean,
                           size_t n, size_t d, size_t k) {
    vector<double> Xhat(n * d, 0.0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) {
            double s = 0.0;
            for (size_t c = 0; c < k; ++c) s += Y[idx(i, c, k)] * W[idx(j, c, k)];
            Xhat[idx(i, j, d)] = s + mean[j];
        }
    }
    return Xhat;
}

int main() {
    // Create synthetic data: a latent 2D signal with correlated features embedded in 3D
    size_t n = 600, d = 3, k = 2;
    vector<double> X(n * d);
    mt19937_64 rng(123);
    normal_distribution<double> g1(0.0, 3.0);   // main variance
    normal_distribution<double> g2(0.0, 1.0);   // secondary variance
    normal_distribution<double> noise(0.0, 0.2);

    // True latent 2D, then mix into 3D via a loading matrix A (3x2)
    array<array<double,2>,3> A = {{{0.8, 0.1}, {0.6, -0.2}, {0.0, 0.98}}};
    for (size_t i = 0; i < n; ++i) {
        double z1 = g1(rng), z2 = g2(rng);
        double x = A[0][0]*z1 + A[0][1]*z2 + noise(rng);
        double y = A[1][0]*z1 + A[1][1]*z2 + noise(rng);
        double z = A[2][0]*z1 + A[2][1]*z2 + noise(rng);
        X[idx(i,0,d)] = x + 5.0; // add mean shift to test centering
        X[idx(i,1,d)] = y - 2.0;
        X[idx(i,2,d)] = z + 1.0;
    }

    auto res = pca_power_iteration(X, n, d, k, 300, 1e-7);

    // Report eigenvalues; for the true total variance one would sum all d eigenvalues
    double topk_var = 0.0; for (double l : res.eigenvalues) topk_var += l;

    cout << fixed << setprecision(6);
    cout << "Top-" << k << " eigenvalues (variances):\n";
    for (size_t i = 0; i < k; ++i) cout << "  lambda_" << (i+1) << " = " << res.eigenvalues[i] << "\n";
    cout << "Sum of top-" << k << " eigenvalues = " << topk_var << "\n";

    // Project and reconstruct a few samples
    auto Y = project_scores(X, res.mean, res.components, n, d, k);
    auto Xhat = reconstruct(Y, res.components, res.mean, n, d, k);

    cout << "\nOriginal vs Reconstructed (first 3 samples):\n";
    for (size_t i = 0; i < 3; ++i) {
        cout << "Sample " << i << ":\n  orig:  ";
        for (size_t j = 0; j < d; ++j) cout << setw(9) << X[idx(i,j,d)] << ' ';
        cout << "\n  recon: ";
        for (size_t j = 0; j < d; ++j) cout << setw(9) << Xhat[idx(i,j,d)] << ' ';
        cout << "\n";
    }

    // Show first component loadings
    cout << "\nFirst component loadings (v1): ";
    for (size_t j = 0; j < d; ++j) cout << res.components[idx(j,0,k)] << ' ';
    cout << "\n";
    return 0;
}
```
This program implements PCA using orthogonal power iteration without explicitly forming the covariance matrix. It centers the data, repeatedly multiplies a vector by the covariance via X_c^T(X_c v)/(n−1), orthogonalizes against previously found components, and normalizes. The Rayleigh quotient estimates each eigenvalue (variance). It then projects samples onto the top k components and reconstructs approximate originals. This approach scales to larger problems by avoiding O(d^2) covariance storage and uses only matrix–vector products on the centered data.
```cpp
#include <bits/stdc++.h>
using namespace std;

inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

struct OnlineCov {
    size_t d = 0;         // dimensionality
    size_t n = 0;         // count
    vector<double> mean;  // running mean (d)
    vector<double> M2;    // sum of outer products of deviations (d x d)

    explicit OnlineCov(size_t dim) : d(dim), n(0), mean(d, 0.0), M2(d * d, 0.0) {}

    void update(const vector<double>& x) {
        if (x.size() != d) throw runtime_error("dimension mismatch");
        n++;
        vector<double> delta(d), delta2(d);
        for (size_t j = 0; j < d; ++j) delta[j] = x[j] - mean[j];
        // update mean
        for (size_t j = 0; j < d; ++j) mean[j] += delta[j] / static_cast<double>(n);
        for (size_t j = 0; j < d; ++j) delta2[j] = x[j] - mean[j];
        // M2 += outer(delta, delta2)
        for (size_t r = 0; r < d; ++r) {
            double dr = delta[r];
            for (size_t c = 0; c < d; ++c) M2[idx(r,c,d)] += dr * delta2[c];
        }
    }

    vector<double> covariance() const {
        vector<double> C = M2;
        double denom = (n > 1) ? (static_cast<double>(n) - 1.0) : 1.0;
        for (double& v : C) v /= denom;
        return C; // d x d
    }
};

inline double dot(const vector<double>& a, const vector<double>& b) { double s = 0; for (size_t i = 0; i < a.size(); ++i) s += a[i]*b[i]; return s; }
inline double norm2(const vector<double>& a) { return sqrt(max(0.0, dot(a,a))); }

// w = C * v for symmetric C (d x d)
void symMatVec(const vector<double>& C, const vector<double>& v, vector<double>& w, size_t d) {
    fill(w.begin(), w.end(), 0.0);
    for (size_t r = 0; r < d; ++r) {
        const double* row = &C[idx(r,0,d)];
        double s = 0.0; for (size_t c = 0; c < d; ++c) s += row[c] * v[c];
        w[r] = s;
    }
}

// Orthogonalize v against the first m columns of W (d x k, row-major);
// column c of W has stride k
void orthogonalize(vector<double>& v, const vector<double>& W, size_t d, size_t m, size_t k) {
    for (size_t c = 0; c < m; ++c) {
        double proj = 0.0; for (size_t j = 0; j < d; ++j) proj += v[j] * W[idx(j, c, k)];
        for (size_t j = 0; j < d; ++j) v[j] -= proj * W[idx(j, c, k)];
    }
}

struct EigenResult { vector<double> vecs; vector<double> vals; size_t d, k; };

EigenResult topk_eigs_power(const vector<double>& C, size_t d, size_t k, size_t max_iter = 500, double tol = 1e-6, uint64_t seed = 7) {
    vector<double> W(d * k, 0.0), vals(k, 0.0);
    mt19937_64 rng(seed); normal_distribution<double> nd(0.0, 1.0);
    vector<double> v(d), w(d);
    for (size_t comp = 0; comp < k; ++comp) {
        for (size_t j = 0; j < d; ++j) v[j] = nd(rng);
        if (comp > 0) orthogonalize(v, W, d, comp, k);
        double nv = norm2(v); if (nv > 0) for (size_t j = 0; j < d; ++j) v[j] /= nv;
        double prev = 0.0;
        for (size_t it = 0; it < max_iter; ++it) {
            symMatVec(C, v, w, d);
            if (comp > 0) orthogonalize(w, W, d, comp, k);
            double nw = norm2(w); if (nw == 0) break;
            for (size_t j = 0; j < d; ++j) v[j] = w[j] / nw;
            symMatVec(C, v, w, d); double val = dot(v, w);
            if (fabs(val - prev) < tol * max(1.0, fabs(prev))) { prev = val; break; }
            prev = val;
        }
        for (size_t j = 0; j < d; ++j) W[idx(j, comp, k)] = v[j];
        symMatVec(C, v, w, d); vals[comp] = dot(v, w);
    }
    return {W, vals, d, k};
}

int main() {
    // Stream synthetic 5D data (two strong directions + noise)
    size_t d = 5; size_t n = 2000;
    OnlineCov oc(d);
    mt19937_64 rng(2024);
    normal_distribution<double> z1(0.0, 4.0), z2(0.0, 2.0), noise(0.0, 0.4);
    // Loading matrix A (5 x 2)
    array<array<double,2>,5> A = {{{0.7, 0.2},{0.5, -0.3},{0.3, 0.8},{0.0, 0.4},{0.1, -0.1}}};

    for (size_t i = 0; i < n; ++i) {
        double u1 = z1(rng), u2 = z2(rng);
        vector<double> x(d);
        for (size_t r = 0; r < d; ++r) x[r] = A[r][0]*u1 + A[r][1]*u2 + noise(rng);
        // Add drift to test centering
        if (i % 2 == 0) x[0] += 5.0;
        oc.update(x);
    }

    auto C = oc.covariance();
    size_t k = 2;
    auto er = topk_eigs_power(C, d, k, 300, 1e-7);

    // Report eigenvalues and explained variance (trace of C = total variance)
    double total_var = 0.0; for (size_t j = 0; j < d; ++j) total_var += C[idx(j,j,d)];
    cout << fixed << setprecision(6);
    cout << "Top-" << k << " eigenvalues:\n";
    for (size_t i = 0; i < k; ++i) cout << "  lambda_" << (i+1) << " = " << er.vals[i] << "\n";
    double cum = 0.0; for (size_t i = 0; i < k; ++i) cum += er.vals[i];
    cout << "Cumulative explained variance (top-" << k << ") = " << (cum / total_var) * 100.0 << "%\n";

    // Print first component loadings
    cout << "First component loadings: ";
    for (size_t j = 0; j < d; ++j) cout << er.vecs[idx(j,0,k)] << ' ';
    cout << "\n";
    return 0;
}
```
This program computes a covariance matrix online using Welford’s algorithm, which updates the mean and the sum of outer-product deviations per arriving sample. After streaming all samples, it forms the sample covariance, then finds the top k eigenpairs using power iteration on the explicit symmetric covariance. It reports eigenvalues and cumulative explained variance, demonstrating how to handle data that do not fit in memory at once while still performing PCA at the end.