
Low-Rank Approximation

Key Points

  • Low-rank approximation replaces a big matrix with one that has far fewer degrees of freedom while preserving most of its action.
  • The Eckart–Young–Mirsky theorem says the truncated SVD gives the best rank-k approximation under both Frobenius and spectral norms.
  • Keeping only the top k singular values/vectors captures most of the matrix’s “energy,” often enabling huge compression and denoising.
  • Choosing k is guided by the singular value decay, target error tolerance, or a desired energy-retention threshold.
  • Full SVD is expensive for large matrices, but randomized algorithms compute high-quality rank-k approximations much faster.
  • Low-rank ideas underpin PCA, image compression, recommender systems, and fast linear algebra.
  • Errors are predictable: in Frobenius norm, the squared error equals the sum of squares of discarded singular values.
  • In C++, Eigen’s SVD or a randomized range finder makes low-rank approximation straightforward and efficient.

Prerequisites

  • Matrices and Linear Mappings — Understanding matrices as operators and notions like rank and subspaces is foundational.
  • Matrix and Vector Norms — Eckart–Young’s optimality depends on Frobenius and spectral norms and their properties.
  • Singular Value Decomposition — Low-rank approximation is built directly from the SVD and its singular values/vectors.
  • QR Factorization and Orthonormalization — Randomized SVD and many algorithms require computing orthonormal bases via QR.
  • Floating-Point Stability — SVD computations and truncation involve numerical sensitivity and conditioning.
  • Random Projections and Concentration — Randomized algorithms rely on properties of Gaussian (or subgaussian) test matrices.
  • Principal Component Analysis (PCA) — PCA is an application of low-rank approximation on centered data matrices.
  • Convex Optimization Basics — Nuclear norm regularization and related relaxations appear in low-rank modeling.
  • Sparse Linear Algebra — Efficient methods for large sparse matrices avoid dense SVD and exploit sparsity.

Detailed Explanation


01 Overview

Low-rank approximation is the idea of summarizing a large matrix A by another matrix of lower rank that is simpler to store and compute with, yet remains close to A in a chosen norm. At the heart of this concept is the Singular Value Decomposition (SVD), which factors A into UΣV^T where Σ lists singular values in descending order. Truncating this factorization to the top k singular values and vectors produces a rank-k matrix A_k that balances accuracy and simplicity. The Eckart–Young–Mirsky theorem guarantees that this truncated SVD is the best possible rank-k approximation under both the Frobenius norm and the spectral (operator 2-) norm. Practically, this means we can compress data, filter noise, and accelerate computations by working with much smaller factors. Applications abound: reducing image storage while maintaining visual quality, performing Principal Component Analysis (PCA) for dimensionality reduction, computing latent factors in recommendation systems, and approximating large kernel matrices. While full SVD can be computationally heavy for huge matrices, modern randomized algorithms enable fast, high-quality approximations. The key tradeoff is between the approximation rank k (smaller means lighter and faster) and the approximation error (smaller k usually means larger error).

02 Intuition & Analogies

Imagine taking a high-resolution photograph (a big matrix of pixel intensities). Most of what makes the photo recognizable—broad shapes and gradients—can be captured by a few dominant patterns, while tiny details and noise contribute less. Low-rank approximation keeps just those dominant patterns. Concretely, SVD discovers a set of “basis images” (columns of U and rows of V^T) and importance weights (singular values). Keeping the top k weights and their basis images yields a compressed photo that still looks good. Another analogy: think of a song represented as many instrument tracks. If only a few instruments carry the melody and harmony strongly, you can reconstruct a convincing version of the song using just those tracks, discarding faint background sounds. Likewise, in a spreadsheet of student grades across many assignments, overall performance might be explained by a few latent factors like mastery and consistency. The low-rank model captures these few underlying drivers, ignoring minor idiosyncrasies. Geometrically, a matrix maps input vectors to outputs. Many real datasets lie close to a low-dimensional subspace; the matrix mainly stretches and rotates within a small number of directions and does little in the remaining ones. The SVD orders these directions by strength. Truncation keeps the most influential directions and sets the rest to zero, which explains the predictable error: it’s exactly what you lose by removing those weaker directions. This is why low-rank approximations both compress (fewer parameters) and denoise (discard weak, often noisy components).

03 Formal Definition

Given A ∈ R^{m×n} with singular value decomposition A = UΣV^T, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal and Σ = diag(σ_1, σ_2, …, σ_r, 0, …) with σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_r > 0 and r = rank(A), the best rank-k approximation of A is the truncated SVD A_k = U_{:,1:k} Σ_{1:k,1:k} V_{:,1:k}^T. The Eckart–Young–Mirsky theorem states that for any X with rank(X) ≤ k, we have ‖A − A_k‖_2 ≤ ‖A − X‖_2 and ‖A − A_k‖_F ≤ ‖A − X‖_F, where ‖·‖_2 is the spectral norm and ‖·‖_F is the Frobenius norm. Moreover, the minimal errors are ‖A − A_k‖_2 = σ_{k+1} and ‖A − A_k‖_F = (∑_{i=k+1}^{r} σ_i^2)^{1/2}. Uniqueness holds if σ_k > σ_{k+1} (up to signs within singular vectors). The low-rank approximation problem can also be posed as min_{rank(X) ≤ k} ‖A − X‖ where the norm is unitarily invariant (e.g., Schatten p-norms); the truncated SVD solves all such problems simultaneously. The approximation compresses A by storing only U_{:,1:k}, Σ_{1:k,1:k}, and V_{:,1:k}, requiring O((m+n)k) memory versus mn for the full matrix.

04 When to Use

Use low-rank approximation when your matrix exhibits redundancy or when you can tolerate small errors for big gains in speed or storage. Typical cases include: (1) Compression: images, videos, or scientific data where singular values decay quickly; store U_k, Σ_k, V_k instead of A. (2) Denoising: signals or datasets with noise largely confined to small singular directions; truncation removes noisy components. (3) Dimensionality reduction: PCA is precisely low-rank approximation of a centered data matrix, enabling visualization, clustering, and regression in lower dimensions. (4) Recommender systems: implicit low-rank structure in user–item matrices yields latent-factor models. (5) Fast linear algebra: approximate preconditioners, fast matrix–vector products, and reduced storage in iterative solvers. (6) Kernel and graph methods: approximate large kernel/graph Laplacian matrices for scalable learning. Prefer truncated or randomized SVD when m and n are large but the target rank k is modest (k ≪ min(m,n)). Use full SVD for small to medium matrices or when exact singular values are needed. If the matrix is sparse or structured, choose algorithms that exploit sparsity (Lanczos/IRL, randomized methods) instead of dense SVD.

⚠Common Mistakes

  • Confusing PCA with SVD without centering: PCA requires centering columns of the data matrix. Skipping centering changes the subspace and can mislead conclusions.
  • Choosing k blindly: Picking k without inspecting the singular value decay or setting a target energy threshold can underfit (too small k) or waste resources (too large k). Plot or examine σ_i and cumulative energy.
  • Ignoring the norm: Eckart–Young guarantees optimality for Frobenius and spectral norms (and more generally unitarily invariant norms). Minimizing entrywise absolute error or other norms may not be solved by truncated SVD.
  • Recomputing full SVD unnecessarily: For very large matrices and small k, full SVD is overkill. Use truncated (Lanczos/IRL) or randomized SVD for big speedups.
  • Numerical issues: Not sorting singular values, forgetting to request thin factors, or forming products like UΣV^T with poor scaling can degrade accuracy or performance.
  • Mishandling sparsity: Converting a large sparse matrix to dense for SVD can exceed memory. Prefer sparse-aware methods.
  • Overfitting noise in denoising tasks: Setting k too large can reintroduce noise. Use validation, cross-validation, or heuristics (e.g., scree plot) to select k.
  • Misinterpreting uniqueness: If σ_k = σ_{k+1}, the optimal subspace is not unique; algorithms may return different but equally optimal bases.

Key Formulas

SVD

A = UΣV^T,  U^T U = I,  V^T V = I,  Σ = diag(σ_1, …, σ_r)

Explanation: Any real matrix factors into orthogonal U and V and a diagonal matrix of singular values in descending order. This decomposition reveals intrinsic directions and strengths.

Truncated SVD

A_k = ∑_{i=1}^{k} σ_i u_i v_i^T = U_{:,1:k} Σ_{1:k,1:k} V_{:,1:k}^T

Explanation: The best rank-k approximation keeps the top k singular triplets. It is compact and captures the dominant action of A.

Eckart–Young (Frobenius)

min_{rank(X) ≤ k} ‖A − X‖_F = ‖A − A_k‖_F = (∑_{i=k+1}^{r} σ_i^2)^{1/2}

Explanation: Among all rank-k matrices, the truncated SVD minimizes the Frobenius error (uniquely when σ_k > σ_{k+1}). The squared error equals the sum of squares of the discarded singular values.

Eckart–Young (Spectral)

min_{rank(X) ≤ k} ‖A − X‖_2 = ‖A − A_k‖_2 = σ_{k+1}

Explanation: The worst-case operator error of A_k is exactly the next singular value. No other rank-k matrix can do better in the 2-norm.

Frobenius norm via singular values

‖A‖_F^2 = ∑_{i=1}^{r} σ_i^2

Explanation: The total energy of a matrix equals the sum of squared singular values. This connects SVD to energy retention criteria.

Best rank-k projector

proj_{rank k}(A) = argmin_{rank(X) ≤ k} ‖A − X‖_F

Explanation: Conceptually, truncating the SVD projects A onto the set of matrices of rank at most k under the Frobenius norm.

Randomized SVD pipeline

Y = AΩ,  Q = orth(Y),  B = Q^T A,  B = Ũ Σ V^T,  U = Q Ũ

Explanation: A random test matrix Ω captures A’s range. Orthonormalizing yields Q; SVD on the small B gives an approximate SVD of A.

Power iteration in randomized SVD

Y = (A A^T)^q A Ω

Explanation: Applying q power iterations amplifies the spectral gap, improving approximation accuracy when singular values decay slowly.

Projection error (intuition)

‖A − Q Q^T A‖_2 ≈ σ_{k+1}

Explanation: If Q spans (nearly) the top-k left singular subspace, projecting A onto Q captures all but the next singular value’s contribution.

Energy fraction

Energy retained(k) = (∑_{i=1}^{k} σ_i^2) / (∑_{i=1}^{r} σ_i^2)

Explanation: This fraction quantifies how much of the matrix’s total energy (Frobenius norm squared) is preserved by rank k.

Eckart–Young–Mirsky (general)

min_{rank(X) ≤ k} ‖A − X‖ = ‖A − A_k‖  for any unitarily invariant norm ‖·‖

Explanation: Truncated SVD minimizes all norms that depend only on singular values, not on singular vectors, e.g., Schatten p-norms.

Nuclear norm regularization

argmin_X (1/2)‖A − X‖_F^2 + λ‖X‖_*

Explanation: Convex proxy for promoting low rank, useful when rank constraints are hard; solutions shrink singular values by soft-thresholding.

Complexity Analysis

Let A be m×n with m ≥ n without loss of generality. Computing a full dense SVD via standard algorithms (e.g., bidiagonalization + QR) costs O(mn^2) time and O(mn) space to store A, plus O((m+n)·min(m,n)) for the factors. This is expensive when both dimensions are large. However, many applications only need a rank-k approximation with k ≪ min(m,n). Truncating after computing the full SVD still costs O(mn^2), but specialized methods reduce the cost dramatically.

Truncated Krylov methods (Lanczos and implicitly restarted Lanczos, IRL) for the top k singular triplets on sparse or structured matrices often run in O(nnz(A)·k) time plus orthogonalization overhead O(k^2(m+n)), with memory proportional to O(nnz(A) + (m+n)k). For dense matrices, randomized SVD achieves O(mn(k+p) + (m+n)(k+p)^2) time, where p is a small oversampling parameter (e.g., 5–20), and O((m+n)(k+p)) extra space. With q power iterations, the leading term scales like (2q+1)·mn(k+p), improving accuracy when the spectral gap is small. The SVD of the small matrix B ∈ R^{(k+p)×n} costs O(n(k+p)^2).

Forming the approximation A_k = U_k Σ_k V_k^T and multiplying by vectors thereafter costs O((m+n)k) storage and O(mk + nk) time per multiply, versus O(mn) storage and O(mn) time for dense A. Error guarantees are crisp: for exact truncated SVD, the spectral error equals σ_{k+1} and the Frobenius error equals (∑_{i>k} σ_i^2)^{1/2}. Randomized methods achieve comparable bounds up to small factors depending on p and q, with high probability. In practice, the dominant costs are the matrix–matrix multiplies AΩ and repeated orthonormalizations; careful BLAS-level implementations attain high performance.

Code Examples

Best rank-k approximation via truncated SVD (Eigen)
// Compile with: g++ -O2 -std=c++17 -I /usr/include/eigen3 svd_rankk.cpp -o svd_rankk
#include <Eigen/Dense>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
using namespace Eigen;

// Utility: compute Frobenius norm of a matrix
double frobNorm(const MatrixXd &M) {
    return M.norm(); // Eigen's .norm() is Frobenius for matrices
}

int main() {
    // Example: build a matrix that is approximately rank-2 plus small noise
    int m = 8, n = 6, k = 2; // target rank k
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);

    MatrixXd U0 = MatrixXd::NullaryExpr(m, k, [&]() { return gauss(rng); });
    HouseholderQR<MatrixXd> qrU(U0);
    MatrixXd U = qrU.householderQ() * MatrixXd::Identity(m, k); // orthonormal m x k

    MatrixXd V0 = MatrixXd::NullaryExpr(n, k, [&]() { return gauss(rng); });
    HouseholderQR<MatrixXd> qrV(V0);
    MatrixXd V = qrV.householderQ() * MatrixXd::Identity(n, k); // orthonormal n x k

    Vector2d s; s << 10.0, 3.0; // singular values
    MatrixXd A = U * s.asDiagonal() * V.transpose();

    // Add small Gaussian noise
    MatrixXd Noise = MatrixXd::NullaryExpr(m, n, [&]() { return 0.05 * gauss(rng); });
    A += Noise;

    // Compute thin SVD of A
    JacobiSVD<MatrixXd> svd(A, ComputeThinU | ComputeThinV);
    VectorXd sigma = svd.singularValues();
    MatrixXd Ufull = svd.matrixU();
    MatrixXd Vfull = svd.matrixV();

    // Truncate to rank k
    MatrixXd Uk = Ufull.leftCols(k);
    MatrixXd Vk = Vfull.leftCols(k);
    MatrixXd Sk = sigma.head(k).asDiagonal();

    // Form A_k = U_k S_k V_k^T
    MatrixXd Ak = Uk * Sk * Vk.transpose();

    // Report errors and energy retention
    double errF = frobNorm(A - Ak);
    double totalEnergy = sigma.array().square().sum();
    double keptEnergy = sigma.head(k).array().square().sum();
    double energyFrac = keptEnergy / totalEnergy;

    cout << "Matrix size: " << m << "x" << n << "\n";
    cout << "Target rank k: " << k << "\n";
    cout << "Top singular values: " << sigma.head(min<int>(5, (int)sigma.size())).transpose() << "\n";
    cout << "Frobenius error ||A - A_k||_F: " << errF << "\n";
    cout << "Energy retained: " << 100.0 * energyFrac << "%\n";

    // Optional: show A and Ak for inspection on small matrices
    // cout << "A:\n" << A << "\n";
    // cout << "A_k:\n" << Ak << "\n";

    return 0;
}

We synthesize a near rank-2 matrix A by multiplying random orthonormal U and V with chosen singular values, then add small noise. We compute the (thin) SVD of A using Eigen, keep only the top k singular values/vectors, and reconstruct the best rank-k approximation A_k. We then compute the Frobenius error and the fraction of energy retained to illustrate Eckart–Young’s guarantees and the compression tradeoff.

Time: O(m n min(m, n)) for full dense SVD; reconstruction and reporting are O(mn + (m+n)k).
Space: O(mn) to store A plus O((m+n)k) for truncated factors.
Randomized SVD (range finder with power iterations)
// Compile with: g++ -O2 -std=c++17 -I /usr/include/eigen3 randomized_svd.cpp -o randomized_svd
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <random>
using namespace std;
using namespace Eigen;

// Generate a standard normal random matrix of size rows x cols
MatrixXd gaussianMatrix(int rows, int cols, std::mt19937 &rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    MatrixXd G(rows, cols);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            G(i, j) = gauss(rng);
    return G;
}

// Orthonormalize columns using QR and return thin Q
MatrixXd orth(const MatrixXd &Y) {
    HouseholderQR<MatrixXd> qr(Y);
    int k = (int)min(Y.rows(), Y.cols());
    return qr.householderQ() * MatrixXd::Identity(Y.rows(), k);
}

int main() {
    int m = 100, n = 80;  // size of A
    int true_rank = 5;    // build a low-rank ground-truth matrix
    int k = 5;            // target rank
    int p = 10;           // oversampling
    int q = 1;            // power iterations

    std::mt19937 rng(123);

    // Construct A = U S V^T with decaying singular values plus small noise
    MatrixXd U0 = gaussianMatrix(m, true_rank, rng);
    MatrixXd U = orth(U0);
    MatrixXd V0 = gaussianMatrix(n, true_rank, rng);
    MatrixXd V = orth(V0);

    VectorXd s(true_rank);
    for (int i = 0; i < true_rank; ++i) s(i) = pow(0.6, i) * 20.0; // geometric decay
    MatrixXd A = U * s.asDiagonal() * V.transpose();
    A += 0.01 * gaussianMatrix(m, n, rng); // small noise

    // Randomized range finder
    MatrixXd Omega = gaussianMatrix(n, k + p, rng); // test matrix
    MatrixXd Y = A * Omega;                         // sample the range of A

    // Power iterations to improve separation
    for (int i = 0; i < q; ++i) {
        MatrixXd Z = A.transpose() * Y; // go to co-range
        Y = A * Z;                      // return to range
    }

    MatrixXd Q = orth(Y);           // orthonormal basis for range(A)
    MatrixXd B = Q.transpose() * A; // small (k+p) x n matrix

    // SVD on small B
    JacobiSVD<MatrixXd> svdB(B, ComputeThinU | ComputeThinV);
    VectorXd sigma = svdB.singularValues();
    MatrixXd Utilde = svdB.matrixU();
    MatrixXd Vhat = svdB.matrixV();

    // Form approximate rank-k SVD of A
    MatrixXd Uk = (Q * Utilde).leftCols(k);
    MatrixXd Vk = Vhat.leftCols(k);
    MatrixXd Sk = sigma.head(k).asDiagonal();

    MatrixXd Ak = Uk * Sk * Vk.transpose();

    double errF = (A - Ak).norm();
    double totalEnergy = A.squaredNorm();   // equals ||A||_F^2
    double approxEnergy = Ak.squaredNorm();

    cout << "Randomized SVD approximation (m=" << m << ", n=" << n << ")\n";
    cout << "Target rank k=" << k << ", oversampling p=" << p << ", power iters q=" << q << "\n";
    cout << "Frobenius error ||A - A_k||_F: " << errF << "\n";
    cout << "Energy retained (approx): " << 100.0 * (approxEnergy / totalEnergy) << "%\n";

    return 0;
}

This implements the standard randomized SVD pipeline: draw a random test matrix, sample A’s range via Y = AΩ, optionally apply q power iterations to enhance separation, orthonormalize to get Q, compute a small SVD of B = Q^T A, and lift back to form an approximate rank-k SVD of A. It achieves high-quality approximations at a fraction of the cost of full SVD when k â‰Ș min(m,n).

Time: O((2q+1) m n (k+p) + (m+n)(k+p)^2) for dense A; the small SVD on B is O(n (k+p)^2).
Space: O((m+n)(k+p)) extra beyond storing A.
Tags: low-rank approximation, Eckart–Young theorem, SVD, truncated SVD, Frobenius norm, spectral norm, PCA, randomized SVD, matrix compression, image compression, Ky Fan norm, nuclear norm, rank-k approximation, QR factorization, power iteration