Linear Algebra Theory
Key Points
- Linear algebra studies vectors, linear combinations, and transformations that preserve addition and scalar multiplication.
- Matrices represent linear transformations; their columns tell you where basis vectors go.
- Eigenvalue decomposition factors a matrix into directions that are only stretched, not rotated, by the matrix.
- The spectral theorem says every real symmetric matrix has orthonormal eigenvectors and can be diagonalized by an orthogonal matrix.
- Singular Value Decomposition (SVD) factors any matrix into rotations/reflections and axis-aligned scalings; it always exists.
- Rank equals the number of non-zero singular values, and key norms relate to singular values: the spectral norm is the largest singular value, and the squared Frobenius norm is the sum of the squared singular values.
- Low-rank approximation via truncated SVD gives the best rank-k approximation in Frobenius norm (Eckart–Young theorem).
- In AI/ML, linear algebra powers PCA, embeddings, optimization steps, and attention; understanding norms and condition numbers guides stability and generalization.
Prerequisites
- Basic linear algebra notation (vectors, matrices) — To read and write formulas, index entries, and understand matrix–vector products.
- Calculus (multivariable basics) — To interpret gradients, Hessians, and how eigenvalues relate to curvature in optimization.
- Numerical computing and floating-point — To understand rounding errors, stability, and the need for pivoting/orthonormalization.
- Probability and statistics — To connect SVD with PCA and covariance matrices in ML applications.
Detailed Explanation
01 Overview
Linear algebra is the language of vectors and the transformations that act on them. It studies vector spaces (sets of objects you can add and scale) and linear transformations that preserve these operations. In computation, we represent linear transformations as matrices, so manipulating matrices is central. Concepts like linear independence, basis, and dimension tell us how many directions of freedom a vector space has, and which minimal set of vectors can generate all others. Eigenvalues and eigenvectors reveal special directions where a transformation only scales vectors, not rotates them. For real symmetric matrices, the spectral theorem guarantees an orthonormal basis of eigenvectors, making such matrices particularly well-behaved. The Singular Value Decomposition (SVD) extends these ideas to any rectangular matrix, factoring it into orthogonal matrices and a diagonal matrix of nonnegative singular values. These singular values connect directly to matrix properties like rank, operator norms, and conditioning. In modern AI/ML, these tools are indispensable: PCA (an SVD of the data matrix) finds principal components, neural network layers multiply matrices by activations, attention is batched matrix multiplication, and low-rank approximations compress and speed up models while controlling error via norms.
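To ground the "columns are where basis vectors go" picture in code, here is a minimal sketch (an added illustration, not part of the original material) that multiplies a small 2×3 matrix by a standard basis vector and by a general vector: the first product reproduces a column of A, and the second is the corresponding linear combination of columns.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Multiply an m x n matrix by an n-vector.
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    int m = (int)A.size(), n = (int)A[0].size();
    vector<double> y(m, 0.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

int main() {
    // A 2x3 example matrix: its columns are the images of e1, e2, e3.
    vector<vector<double>> A = {
        {2, 1, 0},
        {0, 3, -1}
    };

    // A * e2 should equal the second column (1, 3).
    vector<double> e2 = {0, 1, 0};
    vector<double> col2 = matvec(A, e2);
    cout << "A*e2 = (" << col2[0] << ", " << col2[1] << ")\n";

    // A * x is the combination x1*col1 + x2*col2 + x3*col3.
    vector<double> x = {1, 2, 3};
    vector<double> Ax = matvec(A, x);
    // Expected: 1*(2,0) + 2*(1,3) + 3*(0,-1) = (4, 3)
    cout << "A*x  = (" << Ax[0] << ", " << Ax[1] << ")\n";
    return 0;
}
```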
02 Intuition & Analogies
Imagine vectors as arrows in space and linear transformations as machines that stretch, shrink, rotate, and shear arrows but always preserve the origin and straight lines. If you draw a grid on paper and apply a linear transformation, the grid lines remain straight and parallel, though the grid might tilt and stretch. A basis is like choosing the two (in 2D) or three (in 3D) most convenient grid directions so that every point can be described by coordinates along those directions. Linear independence means your chosen directions aren’t redundant; you can’t recreate one arrow by mixing the others. Eigenvectors are the magical directions the machine doesn’t rotate—only stretches or compresses them—like rubber bands aligned with the machine’s main pull directions. For symmetric machines (imagine a perfectly even stretch in mirror-like ways), these special directions are perpendicular to each other and form a clean coordinate system. SVD says any messy transformation can be decomposed into three simpler steps: rotate (Vᵀ), stretch along coordinate axes (Σ), then rotate again (U). This is like turning a crumpled ellipsoid so that you can measure its principal radii and orientations. Rank measures how many independent directions survive the transformation; if rank is low, most directions are collapsed, which is why low-rank approximations capture the most important structure with fewer numbers. Norms measure how big the machine’s effect is: the spectral norm is the maximum stretch of any unit vector; the Frobenius norm measures total energy of all stretches. The condition number compares the largest and smallest stretches and warns how sensitive solutions are to small changes.
03 Formal Definition
A vector space over ℝ is a set of objects closed under vector addition and scalar multiplication. A map T between vector spaces is linear if T(αu + βv) = αT(u) + βT(v) for all vectors u, v and scalars α, β; once bases are fixed, T is represented by a matrix A whose j-th column is the image of the j-th basis vector. A set of vectors is linearly independent if the only linear combination equal to zero has all coefficients zero; a basis is a linearly independent spanning set, and its size is the dimension. A scalar λ and a nonzero vector v form an eigenpair of a square matrix A if Av = λv. The spectral theorem states that every real symmetric matrix can be written A = QΛQᵀ with Q orthogonal and Λ real diagonal. The singular value decomposition writes any m×n matrix as A = UΣVᵀ with U and V orthogonal and Σ diagonal with nonnegative entries σ₁ ≥ σ₂ ≥ ⋯ ≥ 0.
04 When to Use
Use bases and dimension when you need minimal coordinate representations (e.g., compressing data or verifying feature redundancy). Apply linear independence tests via rank to detect multicollinearity in regression or to ensure a set of features provides unique information. Use eigenvalues/eigenvectors for analyzing stability (e.g., Jacobian linearization), graph clustering (Laplacian eigenvectors), and for iterative solvers and PageRank (power iteration). When matrices are symmetric (covariances, Laplacians, Hessians), use the spectral theorem to diagonalize with an orthogonal matrix—this simplifies computations, ensures real eigenvalues, and supports orthogonal projections. Use SVD for any rectangular data matrix: PCA (via SVD of centered data), least-squares with pseudoinverse, noise filtering, and low-rank compression of embeddings or weight matrices. The Eckart–Young theorem guides optimal rank-k truncation when you want the best approximation under the Frobenius norm. Rely on the spectral norm to bound worst-case amplification (robustness) and the Frobenius norm as an energy measure across all directions. Monitor the condition number to assess numerical stability of linear solves and to decide whether to regularize (ridge regression) or precondition.
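To illustrate the conditioning advice above, the sketch below (an added example with made-up numbers; the helper name solve2x2 is ad hoc) solves a nearly singular 2×2 system with Cramer's rule, computes κ₂ from the eigenvalues of AᵀA in closed form, and shows that a tiny perturbation of b produces a large change in the solution — exactly the sensitivity the condition number warns about.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Solve a 2x2 system A x = b via Cramer's rule (fine for illustration only).
vector<double> solve2x2(const vector<vector<double>>& A, const vector<double>& b) {
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    return { (b[0]*A[1][1] - A[0][1]*b[1]) / det,
             (A[0][0]*b[1] - b[0]*A[1][0]) / det };
}

double norm2(const vector<double>& v) { return sqrt(v[0]*v[0] + v[1]*v[1]); }

int main() {
    // Nearly singular matrix: rows are almost parallel, so kappa_2 is large.
    vector<vector<double>> A = {{1.0, 1.0}, {1.0, 1.0001}};

    // Condition number kappa_2(A) = sqrt(lambda_max / lambda_min) of A^T A (2x2 closed form).
    double a  = A[0][0]*A[0][0] + A[1][0]*A[1][0];   // (A^T A)_{11}
    double bq = A[0][0]*A[0][1] + A[1][0]*A[1][1];   // (A^T A)_{12}
    double c  = A[0][1]*A[0][1] + A[1][1]*A[1][1];   // (A^T A)_{22}
    double disc = sqrt((a - c)*(a - c) + 4.0*bq*bq);
    double lmax = 0.5*((a + c) + disc), lmin = 0.5*((a + c) - disc);
    cout << "kappa_2(A) ~ " << sqrt(lmax / lmin) << "\n";

    vector<double> b1 = {2.0, 2.0001};   // exact solution x = (1, 1)
    vector<double> b2 = {2.0, 2.0002};   // tiny perturbation of b
    vector<double> x1 = solve2x2(A, b1); // (1, 1)
    vector<double> x2 = solve2x2(A, b2); // (0, 2): a large change

    vector<double> db = {b2[0]-b1[0], b2[1]-b1[1]};
    vector<double> dx = {x2[0]-x1[0], x2[1]-x1[1]};
    cout << "relative change in b: " << norm2(db) / norm2(b1) << "\n";
    cout << "relative change in x: " << norm2(dx) / norm2(x1) << "\n";
    return 0;
}
```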
⚠️Common Mistakes
- Confusing linear independence with orthogonality: independent vectors need not be perpendicular; always check rank or attempt to express one as a combination of others.
- Assuming every matrix is diagonalizable by eigenvectors: non-symmetric or defective matrices may not have a full eigenbasis; use SVD or Jordan form theory when needed.
- Mixing up eigenvalues and singular values: for non-symmetric matrices, eigenvalues can be complex or negative, while singular values are always real and nonnegative; norms and rank are tied to singular values, not eigenvalues (see the sketch after this list).
- Ignoring scaling of eigenvectors: eigenvectors are not unique in length; normalize them when comparing or computing Rayleigh quotients to avoid misleading magnitudes.
- Using plain Gaussian elimination without pivoting: this can be numerically unstable; prefer partial pivoting or QR for solving and rank.
- Misinterpreting condition number: a large κ means small input errors can cause large output errors; it is not a direct error but a sensitivity bound.
- Believing Frobenius and spectral norms behave the same: ‖A‖_F measures overall energy, while ‖A‖₂ captures the maximum stretch; choose the norm that matches your guarantee.
- Assuming randomized low-rank projections always give the best rank-k approximation: they approximate the top singular subspace; only truncated SVD is provably optimal under Eckart–Young.
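As a concrete check of the eigenvalue-versus-singular-value pitfall, the following sketch (illustrative only; the 2×2 closed-form helpers eig2x2 and sv2x2 are ad hoc for this example) compares the two for a shear matrix: its eigenvalues are both 1, yet its largest singular value — and hence its spectral norm — is about 1.618.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Eigenvalues of a 2x2 matrix from the characteristic polynomial
// lambda^2 - trace*lambda + det = 0 (this example has real roots).
pair<double,double> eig2x2(const vector<vector<double>>& A) {
    double tr = A[0][0] + A[1][1];
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    double disc = sqrt(tr*tr - 4.0*det);
    return {0.5*(tr + disc), 0.5*(tr - disc)};
}

// Singular values of a 2x2 matrix: square roots of the eigenvalues of A^T A.
pair<double,double> sv2x2(const vector<vector<double>>& A) {
    double a = A[0][0]*A[0][0] + A[1][0]*A[1][0];
    double b = A[0][0]*A[0][1] + A[1][0]*A[1][1];
    double c = A[0][1]*A[0][1] + A[1][1]*A[1][1];
    double disc = sqrt((a - c)*(a - c) + 4.0*b*b);
    return {sqrt(0.5*((a + c) + disc)), sqrt(max(0.0, 0.5*((a + c) - disc)))};
}

int main() {
    // Shear matrix: not symmetric, so eigenvalues and singular values differ.
    vector<vector<double>> A = {{1.0, 1.0}, {0.0, 1.0}};
    auto [l1, l2] = eig2x2(A);   // both equal to 1
    auto [s1, s2] = sv2x2(A);    // ~1.618 and ~0.618

    cout << fixed << setprecision(6);
    cout << "eigenvalues:     " << l1 << ", " << l2 << "\n";
    cout << "singular values: " << s1 << ", " << s2 << "\n";
    cout << "spectral norm ||A||_2 = " << s1
         << " (not max |eigenvalue| = " << max(fabs(l1), fabs(l2)) << ")\n";
    return 0;
}
```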
Key Formulas
Span: span(S) = { c₁v₁ + c₂v₂ + ⋯ + cₖvₖ : cᵢ scalars, vᵢ ∈ S }
Explanation: The span of a set S is all vectors you can reach by linear combinations of S. It defines the subspace generated by S.
Rank: rank(A) = dim(Col(A))
Explanation: The rank equals the dimension of the column space (image) of A. It counts independent output directions of the linear transformation.
Eigenvalue equation: Av = λv, v ≠ 0
Explanation: An eigenvector v is only scaled (by λ) when multiplied by A. This reveals invariant directions of the transformation.
Eigen-decomposition: A = VΛV⁻¹
Explanation: A diagonalizable matrix can be factored into eigenvectors and eigenvalues. Computations simplify in this eigenbasis.
Spectral theorem: A = QΛQᵀ with Q orthogonal (QᵀQ = I) and Λ real diagonal
Explanation: Real symmetric matrices are orthogonally diagonalizable with real eigenvalues and orthonormal eigenvectors. This ensures numerical stability and simplicity.
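As a small worked example (added for illustration): the symmetric matrix A = [[4, 1], [1, 4]] has eigenvalues 5 and 3 with orthonormal eigenvectors q₁ = (1, 1)/√2 and q₂ = (1, −1)/√2, so A = QΛQᵀ with Q = [q₁ q₂] and Λ = diag(5, 3); indeed Aq₁ = 5q₁ and Aq₂ = 3q₂.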
SVD: A = UΣVᵀ, with U, V orthogonal and Σ diagonal with entries σ₁ ≥ σ₂ ≥ ⋯ ≥ 0
Explanation: Any matrix factors into left/right orthogonal matrices and nonnegative singular values. It generalizes eigen-decomposition to rectangular and non-normal matrices.
Norms via singular values: ‖A‖₂ = σ₁ and ‖A‖_F = √(σ₁² + σ₂² + ⋯)
Explanation: The spectral norm is the largest singular value; the Frobenius norm is the square root of the sum of squares of all singular values. These connect geometry to computation. (A numerical check of these identities appears after this formula list.)
Rank via SVD: rank(A) = number of nonzero σᵢ
Explanation: The number of nonzero singular values equals the rank. This is robust to choice of basis and works for rectangular matrices.
Truncated SVD: A_k = σ₁u₁v₁ᵀ + ⋯ + σₖuₖvₖᵀ
Explanation: The best rank-k approximation keeps only the top k singular triplets. It minimizes error among all rank-k matrices.
Eckart–Young (Frobenius): ‖A − A_k‖_F² = σₖ₊₁² + σₖ₊₂² + ⋯
Explanation: The squared Frobenius error of the best rank-k approximation equals the sum of squared discarded singular values. This quantifies information loss.
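As a quick illustration (added example): if A = diag(5, 3, 1), then σ₁ = 5, σ₂ = 3, σ₃ = 1; the best rank-2 approximation is A₂ = diag(5, 3, 0), and ‖A − A₂‖_F = √(σ₃²) = 1 — exactly the discarded singular value.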
Condition number: κ₂(A) = σ_max / σ_min
Explanation: The 2-norm condition number measures sensitivity of solving Ax = b. Larger values imply solutions can change a lot under small perturbations.
Rayleigh quotient: R(x) = xᵀAx / xᵀx
Explanation: For symmetric A, the Rayleigh quotient of a nonzero x lies between the smallest and largest eigenvalues. It reaches extremes at eigenvectors.
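For instance (added example), with A = diag(5, 2): R(e₁) = 5, R(e₂) = 2, and R(x) = 3.5 for x = (1, 1)/√2 — always within [2, 5], hitting the endpoints only at the eigenvectors.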
Gaussian elimination cost: ≈ (2/3)n³ floating-point operations
Explanation: The arithmetic cost for solving a dense n-by-n system via elimination is about (2/3)n³ operations. This dominates for large n.
Power iteration step: bₖ₊₁ = A bₖ / ‖A bₖ‖₂
Explanation: Repeatedly multiplying and normalizing converges to the dominant eigenvector if the dominant eigenvalue is strictly largest in magnitude and the start vector has a component in that direction.
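The short program below (an added numerical check, using a 2×2 matrix built from rotations so its singular values are 3 and 1 by construction) verifies the norm formulas above: the Frobenius norm computed entrywise matches √(σ₁² + σ₂²) = √10, and a few power-iteration steps on AᵀA recover the spectral norm σ₁ = 3.

```cpp
#include <bits/stdc++.h>
using namespace std;

using Mat = vector<vector<double>>;

// C = A * B for small dense matrices
Mat matmul(const Mat& A, const Mat& B) {
    size_t m = A.size(), p = A[0].size(), n = B[0].size();
    Mat C(m, vector<double>(n, 0.0));
    for (size_t i = 0; i < m; ++i)
        for (size_t k = 0; k < p; ++k)
            for (size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Mat transpose(const Mat& A) {
    Mat T(A[0].size(), vector<double>(A.size()));
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j) T[j][i] = A[i][j];
    return T;
}

Mat rotation(double theta) {
    return {{cos(theta), -sin(theta)}, {sin(theta), cos(theta)}};
}

int main() {
    const double PI = acos(-1.0);
    // A = U * diag(3, 1) * V^T with rotations U, V, so sigma_1 = 3 and sigma_2 = 1 by construction.
    Mat U = rotation(PI / 6.0), V = rotation(PI / 3.0);
    Mat S = {{3.0, 0.0}, {0.0, 1.0}};
    Mat A = matmul(matmul(U, S), transpose(V));

    // Frobenius norm from the entries; should equal sqrt(3^2 + 1^2) = sqrt(10).
    double fro = 0.0;
    for (auto& row : A) for (double x : row) fro += x * x;
    fro = sqrt(fro);

    // Spectral norm: sqrt of the dominant eigenvalue of A^T A, estimated by power iteration.
    Mat G = matmul(transpose(A), A);
    vector<double> b = {1.0, 0.7};   // arbitrary start vector
    for (int it = 0; it < 200; ++it) {
        vector<double> Gb = {G[0][0]*b[0] + G[0][1]*b[1], G[1][0]*b[0] + G[1][1]*b[1]};
        double nb = sqrt(Gb[0]*Gb[0] + Gb[1]*Gb[1]);
        b = {Gb[0] / nb, Gb[1] / nb};
    }
    vector<double> Gb = {G[0][0]*b[0] + G[0][1]*b[1], G[1][0]*b[0] + G[1][1]*b[1]};
    double sigma1 = sqrt(b[0]*Gb[0] + b[1]*Gb[1]);   // sqrt of Rayleigh quotient of A^T A

    cout << fixed << setprecision(6);
    cout << "Frobenius norm = " << fro << "  (expected sqrt(10) ~ 3.162278)\n";
    cout << "Spectral norm  = " << sigma1 << "  (expected sigma_1 = 3)\n";
    return 0;
}
```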
Complexity Analysis
Gaussian elimination on a dense n×n system costs about (2/3)n³ floating-point operations, and multiplying an m×p matrix by a p×n matrix costs O(mpn). Each power-iteration step is one matrix–vector product, O(n²) for a dense matrix, with the number of steps governed by the ratio of the two largest eigenvalue magnitudes. A full SVD of an m×n matrix costs roughly O(mn·min(m, n)), whereas the randomized rank-k approximation in the code examples below needs only O(mnk) work for the matrix products plus O(mk²) for orthonormalization, which is why low-rank methods scale to large matrices.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Numerical threshold for treating values as zero
const double EPS = 1e-9;

// Perform Gaussian elimination with partial pivoting to row-echelon form.
// Returns {rank, pivot column indices} and transforms A in-place.
pair<int, vector<int>> rowEchelon(vector<vector<double>>& A) {
    int m = (int)A.size();
    int n = (int)A[0].size();
    int row = 0;
    vector<int> pivotCol;
    for (int col = 0; col < n && row < m; ++col) {
        // Find pivot in this column at or below the current row
        int sel = row;
        for (int i = row + 1; i < m; ++i)
            if (fabs(A[i][col]) > fabs(A[sel][col])) sel = i;
        if (fabs(A[sel][col]) < EPS) continue;   // no pivot in this column
        swap(A[sel], A[row]);
        pivotCol.push_back(col);
        // Normalize pivot row (optional for rank, helpful for an RREF-like form)
        double piv = A[row][col];
        for (int j = col; j < n; ++j) A[row][j] /= piv;
        // Eliminate below the pivot
        for (int i = row + 1; i < m; ++i) {
            double factor = A[i][col];
            if (fabs(factor) < EPS) continue;
            for (int j = col; j < n; ++j) A[i][j] -= factor * A[row][j];
        }
        ++row;
    }
    int rank = (int)pivotCol.size();
    return {rank, pivotCol};
}

// Check if a set of column vectors is linearly independent:
// form a matrix with the vectors as columns and compare rank to the count.
bool isIndependent(const vector<vector<double>>& cols) {
    int n = (int)cols.size();        // number of vectors
    int m = (int)cols[0].size();     // dimension of each vector
    vector<vector<double>> A(m, vector<double>(n));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) A[i][j] = cols[j][i];
    auto res = rowEchelon(A);
    return res.first == n;           // rank equals number of columns
}

int main() {
    // Example: matrix A with dependent columns
    vector<vector<double>> A = {
        {1, 2, 3, 6},
        {0, 1, 1, 2},
        {2, 5, 7, 14}
    }; // Note: col3 = col1 + col2 and col4 = 2*col3, so the rank is 2

    auto A_copy = A;
    auto [rankA, pivots] = rowEchelon(A_copy);
    cout << fixed << setprecision(6);
    cout << "Rank(A) = " << rankA << "\n";
    cout << "Pivot columns (0-based): ";
    for (int c : pivots) cout << c << ' ';
    cout << "\n";

    // Build columns as vectors to test independence of the first 3 columns
    vector<vector<double>> cols(3, vector<double>(3));
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i)
            cols[j][i] = A[i][j];

    // Expected "no": the third column equals the sum of the first two
    cout << "First 3 columns independent? " << (isIndependent(cols) ? "yes" : "no") << "\n";

    return 0;
}
```
We implement Gaussian elimination with partial pivoting to transform a matrix to row echelon form. The number of pivot columns equals the rank, and these pivot columns identify a basis for the column space. We also test linear independence by checking whether the rank equals the number of vectors. Partial pivoting chooses the largest absolute value in a column to improve numerical stability.
```cpp
#include <bits/stdc++.h>
using namespace std;

const double EPS = 1e-10;

vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    int n = (int)A.size();
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) y[i] += A[i][j] * x[j];
    return y;
}

double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const vector<double>& x) {
    return sqrt(max(0.0, dot(x, x)));
}

pair<double, vector<double>> power_iteration(const vector<vector<double>>& A,
                                             int max_iters = 1000, double tol = 1e-9) {
    int n = (int)A.size();
    // Random initial vector (fixed seed for reproducibility)
    mt19937 rng(42);
    uniform_real_distribution<double> dist(-1.0, 1.0);
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = dist(rng);
    double lambda_old = 0.0;

    for (int it = 0; it < max_iters; ++it) {
        // Multiply and normalize
        vector<double> Ab = matvec(A, b);
        double nb = norm2(Ab);
        if (nb < EPS) break;   // A maps b to (nearly) zero
        for (int i = 0; i < n; ++i) b[i] = Ab[i] / nb;
        // Rayleigh quotient estimate of the eigenvalue
        vector<double> Ab_norm = matvec(A, b);
        double lambda = dot(b, Ab_norm);
        if (fabs(lambda - lambda_old) < tol * max(1.0, fabs(lambda_old))) {
            return {lambda, b};
        }
        lambda_old = lambda;
    }
    // Final estimate if the loop exits without meeting the tolerance
    vector<double> Ab = matvec(A, b);
    double lambda = dot(b, Ab);
    return {lambda, b};
}

int main() {
    // Symmetric matrix with known eigenvalues
    vector<vector<double>> A = {
        {4, 1, 0},
        {1, 4, 0},
        {0, 0, 2}
    }; // eigenvalues: 5, 3, 2 (dominant = 5)

    auto [lambda, v] = power_iteration(A, 10000, 1e-12);

    cout << fixed << setprecision(8);
    cout << "Dominant eigenvalue ~ " << lambda << "\n";
    cout << "Eigenvector (normalized): ";
    for (double vi : v) cout << vi << ' ';
    cout << "\n";

    // Residual norm ||Av - lambda*v|| to check accuracy
    vector<double> Av = matvec(A, v);
    for (int i = 0; i < (int)v.size(); ++i) Av[i] -= lambda * v[i];
    double res = 0;
    for (double t : Av) res += t * t;
    res = sqrt(res);
    cout << "Residual norm = " << res << "\n";

    return 0;
}
```
Power iteration repeatedly multiplies a vector by A and normalizes it. For symmetric matrices with a unique largest eigenvalue in magnitude, this process converges to the dominant eigenvector, and the Rayleigh quotient gives the eigenvalue estimate. The small residual verifies Av ≈ λv.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Utility: matrix multiply C = A * B (A: m x p, B: p x n)
vector<vector<double>> matmul(const vector<vector<double>>& A, const vector<vector<double>>& B) {
    int m = (int)A.size(), p = (int)A[0].size(), n = (int)B[0].size();
    vector<vector<double>> C(m, vector<double>(n, 0.0));
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < p; ++k)
            if (A[i][k] != 0.0)
                for (int j = 0; j < n; ++j)
                    C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Transpose
vector<vector<double>> transpose(const vector<vector<double>>& A) {
    int m = (int)A.size(), n = (int)A[0].size();
    vector<vector<double>> AT(n, vector<double>(m));
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) AT[j][i] = A[i][j];
    return AT;
}

// Modified Gram-Schmidt to orthonormalize the columns of Y (m x k)
vector<vector<double>> mgs(const vector<vector<double>>& Y) {
    int m = (int)Y.size(), k = (int)Y[0].size();
    vector<vector<double>> Q = Y;
    for (int j = 0; j < k; ++j) {
        // Subtract projections onto previously computed q's
        for (int i2 = 0; i2 < j; ++i2) {
            double r = 0.0;
            for (int i = 0; i < m; ++i) r += Q[i][i2] * Q[i][j];
            for (int i = 0; i < m; ++i) Q[i][j] -= r * Q[i][i2];
        }
        // Normalize
        double nrm = 0.0;
        for (int i = 0; i < m; ++i) nrm += Q[i][j] * Q[i][j];
        nrm = sqrt(max(1e-18, nrm));
        for (int i = 0; i < m; ++i) Q[i][j] /= nrm;
    }
    return Q;
}

// Frobenius norm
double frob(const vector<vector<double>>& A) {
    double s = 0.0;
    for (auto& r : A) for (double x : r) s += x * x;
    return sqrt(s);
}

// Randomized range finder: returns Q (m x k) with orthonormal columns approximating range(A).
// Keeping only the first k of the l = k + oversample orthonormalized columns is a simplification;
// a full randomized SVD would keep all l columns and truncate after factoring B = Q^T A.
vector<vector<double>> randomized_range_finder(const vector<vector<double>>& A, int k, int oversample = 5) {
    int m = (int)A.size(), n = (int)A[0].size();
    int l = min(n, k + oversample);
    // Omega: n x l Gaussian random test matrix
    mt19937 rng(42);
    normal_distribution<double> dist(0.0, 1.0);
    vector<vector<double>> Omega(n, vector<double>(l));
    for (int i = 0; i < n; ++i) for (int j = 0; j < l; ++j) Omega[i][j] = dist(rng);
    // Y = A * Omega (m x l) samples the range of A
    vector<vector<double>> Y = matmul(A, Omega);
    // Orthonormalize columns to get Q (m x l), then keep the first k columns
    vector<vector<double>> Q = mgs(Y);
    for (int i = 0; i < m; ++i) Q[i].resize(k);
    return Q;
}

int main() {
    // Build a low-rank-ish matrix A = U * S * V^T synthetically
    int m = 50, n = 40, r = 3; // true rank r
    mt19937 rng(0);
    normal_distribution<double> dist(0.0, 1.0);
    vector<vector<double>> U(m, vector<double>(r)), V(n, vector<double>(r));
    for (int i = 0; i < m; ++i) for (int j = 0; j < r; ++j) U[i][j] = dist(rng);
    for (int i = 0; i < n; ++i) for (int j = 0; j < r; ++j) V[i][j] = dist(rng);
    // Orthonormalize the columns of U and V via MGS
    U = mgs(U);
    V = mgs(V);
    vector<double> s = {10.0, 3.0, 1.0};
    vector<vector<double>> S(r, vector<double>(r, 0.0));
    for (int i = 0; i < r; ++i) S[i][i] = s[i];
    vector<vector<double>> A = matmul(matmul(U, S), transpose(V));

    // Add small noise so A is full rank but still dominated by the rank-3 part
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) A[i][j] += 0.01 * dist(rng);

    // Randomized rank-k approximation: A ~ Q * (Q^T A)
    int k = 3; // target rank
    vector<vector<double>> Q = randomized_range_finder(A, k, 5); // m x k
    vector<vector<double>> B = matmul(transpose(Q), A);          // k x n
    vector<vector<double>> A_approx = matmul(Q, B);              // m x n

    // Relative Frobenius error ||A - A_approx||_F / ||A||_F
    vector<vector<double>> Diff = A;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) Diff[i][j] -= A_approx[i][j];
    double relF = frob(Diff) / max(1e-18, frob(A));

    cout << fixed << setprecision(6);
    cout << "Relative Frobenius error (rank-" << k << ") = " << relF << "\n";

    return 0;
}
```
This code computes a randomized low-rank approximation by finding an orthonormal basis Q for the range of A via Y = AΩ and Modified Gram–Schmidt, then projecting A onto that subspace: Â = Q(QᵀA). When k matches the number of dominant singular directions, the approximation is close to the optimal truncated SVD (Eckart–Young). The example constructs a synthetic low-rank matrix with small noise and reports the relative Frobenius error.