📚 Theory · Intermediate

Linear Algebra Theory

Key Points

  • Linear algebra studies vectors, linear combinations, and transformations that preserve addition and scalar multiplication.
  • Matrices represent linear transformations; their columns tell you where basis vectors go.
  • Eigenvalue decomposition factors a matrix into directions that are only stretched, not rotated, by the matrix.
  • The spectral theorem says every real symmetric matrix has orthonormal eigenvectors and can be diagonalized by an orthogonal matrix.
  • Singular Value Decomposition (SVD) factors any matrix into rotations/reflections and axis-aligned scalings; it always exists.
  • Rank equals the number of non-zero singular values, and key norms relate to singular values: the spectral norm is the largest singular value, and the squared Frobenius norm is the sum of the squared singular values.
  • Low-rank approximation via truncated SVD gives the best rank-k approximation in Frobenius norm (Eckart–Young theorem).
  • In AI/ML, linear algebra powers PCA, embeddings, optimization steps, and attention; understanding norms and condition numbers guides stability and generalization.

Prerequisites

  • Basic linear algebra notation (vectors, matrices): to read and write formulas, index entries, and understand matrix–vector products.
  • Calculus (multivariable basics): to interpret gradients, Hessians, and how eigenvalues relate to curvature in optimization.
  • Numerical computing and floating-point arithmetic: to understand rounding errors, stability, and the need for pivoting/orthonormalization.
  • Probability and statistics: to connect SVD with PCA and covariance matrices in ML applications.

Detailed Explanation


01 Overview

Linear algebra is the language of vectors and the transformations that act on them. It studies vector spaces (sets of objects you can add and scale) and linear transformations that preserve these operations. In computation, we represent linear transformations as matrices, so manipulating matrices is central. Concepts like linear independence, basis, and dimension tell us how many directions of freedom a vector space has, and which minimal set of vectors can generate all others. Eigenvalues and eigenvectors reveal special directions where a transformation only scales vectors rather than rotating them. For real symmetric matrices, the spectral theorem guarantees an orthonormal basis of eigenvectors, making such matrices particularly well-behaved. The Singular Value Decomposition (SVD) extends these ideas to any rectangular matrix, factoring it into orthogonal matrices and a diagonal matrix of nonnegative singular values. These singular values connect directly to matrix properties like rank, operator norms, and conditioning. In modern AI/ML, these tools are indispensable: PCA (an SVD of the centered data matrix) finds principal components, neural network layers multiply matrices by activations, attention is batched matrix multiplication, and low-rank approximations compress and speed up models while controlling error via norms.

02 Intuition & Analogies

Imagine vectors as arrows in space and linear transformations as machines that stretch, shrink, rotate, and shear arrows but always preserve the origin and straight lines. If you draw a grid on paper and apply a linear transformation, the grid lines remain straight and parallel, though the grid might tilt and stretch. A basis is like choosing the two (in 2D) or three (in 3D) most convenient grid directions so that every point can be described by coordinates along those directions. Linear independence means your chosen directions aren’t redundant; you can’t recreate one arrow by mixing the others. Eigenvectors are the magical directions the machine doesn’t rotate—only stretches or compresses them—like rubber bands aligned with the machine’s main pull directions. For symmetric machines (imagine a perfectly even stretch in mirror-like ways), these special directions are perpendicular to each other and form a clean coordinate system. SVD says any messy transformation can be decomposed into three simpler steps: rotate (Vᵀ), stretch along coordinate axes (Σ), then rotate again (U). This is like turning a crumpled ellipsoid so that you can measure its principal radii and orientations. Rank measures how many independent directions survive the transformation; if rank is low, most directions are collapsed, which is why low-rank approximations capture the most important structure with fewer numbers. Norms measure how big the machine’s effect is: the spectral norm is the maximum stretch of any unit vector; the Frobenius norm measures total energy of all stretches. The condition number compares the largest and smallest stretches and warns how sensitive solutions are to small changes.

03 Formal Definition

A vector space V over a field F (e.g., F = ℝ or ℂ) is a set with addition and scalar multiplication satisfying the usual axioms (associativity, commutativity of addition, distributivity, etc.). A list of vectors v₁, …, vₖ ∈ V is linearly independent if α₁v₁ + ⋯ + αₖvₖ = 0 implies all αᵢ = 0. The span of a set S ⊆ V is the set of all finite linear combinations of elements of S. A basis is a linearly independent set whose span equals V. The dimension of V, dim(V), is the size of any basis (well-defined by the Steinitz exchange lemma). A linear transformation T: V → W satisfies T(αu + βv) = αT(u) + βT(v). Relative to chosen bases, T is represented by a matrix A such that [T(x)] = A[x]. An eigenvalue–eigenvector pair (λ, v ≠ 0) of A satisfies Av = λv. For diagonalizable A, there exists an invertible V with A = VΛV⁻¹, where Λ is diagonal with the eigenvalues of A. For real symmetric A = Aᵀ, there exists an orthogonal Q with A = QΛQᵀ. For any A ∈ ℝ^(m×n), the SVD is A = UΣVᵀ, with U ∈ ℝ^(m×m) and V ∈ ℝ^(n×n) orthogonal, and Σ ∈ ℝ^(m×n) diagonal with nonnegative singular values σ₁ ≥ σ₂ ≥ ⋯ ≥ 0. The rank of A equals the number of nonzero singular values. Matrix norms include the spectral norm ‖A‖₂ = σ_max(A) and the Frobenius norm ‖A‖_F = √(Σᵢ σᵢ²). The condition number in the 2-norm is κ₂(A) = σ_max(A) / σ_min(A).
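
As a concrete worked example of these definitions (the numbers are chosen here for illustration and do not come from the text above), the symmetric matrix below satisfies the spectral theorem, and because it is symmetric positive definite its singular values coincide with its eigenvalues:

\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
  = Q \Lambda Q^{\top}, \qquad
Q = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix},
\]
\[
\|A\|_2 = 3, \qquad \|A\|_F = \sqrt{3^2 + 1^2} = \sqrt{10}, \qquad \kappa_2(A) = \frac{3}{1} = 3.
\]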

04 When to Use

Use bases and dimension when you need minimal coordinate representations (e.g., compressing data or verifying feature redundancy). Apply linear independence tests via rank to detect multicollinearity in regression or to ensure a set of features provides unique information. Use eigenvalues/eigenvectors for analyzing stability (e.g., Jacobian linearization), graph clustering (Laplacian eigenvectors), and for iterative solvers and PageRank (power iteration). When matrices are symmetric (covariances, Laplacians, Hessians), use the spectral theorem to diagonalize with an orthogonal matrix—this simplifies computations, ensures real eigenvalues, and supports orthogonal projections. Use SVD for any rectangular data matrix: PCA (via SVD of centered data), least-squares with pseudoinverse, noise filtering, and low-rank compression of embeddings or weight matrices. The Eckart–Young theorem guides optimal rank-k truncation when you want the best approximation under the Frobenius norm. Rely on the spectral norm to bound worst-case amplification (robustness) and the Frobenius norm as an energy measure across all directions. Monitor the condition number to assess numerical stability of linear solves and to decide whether to regularize (ridge regression) or precondition.
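
To make the last point concrete, here is a minimal sketch (hypothetical, nearly collinear features and a 2×2 Gram matrix with closed-form eigenvalues; not part of the examples later in this article) showing how a ridge term λI caps the condition number of AᵀA:

#include <bits/stdc++.h>
using namespace std;

// Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] via the quadratic formula.
pair<double, double> sym2x2_eigs(double a, double b, double c) {
    double tr = a + c, det = a * c - b * b;
    double disc = sqrt(max(0.0, tr * tr / 4.0 - det));
    return {tr / 2.0 + disc, tr / 2.0 - disc}; // {lambda_max, lambda_min}
}

int main() {
    // Hypothetical Gram matrix G = A^T A for two nearly collinear (standardized) features.
    double g11 = 1.0, g12 = 0.999, g22 = 1.0;
    auto [lmax, lmin] = sym2x2_eigs(g11, g12, g22);
    cout << "cond(A^T A)            ~ " << lmax / lmin << "\n";                        // ~ 1999

    // A ridge term shifts every eigenvalue up by lambda, which caps the condition number.
    double lambda = 1e-2;
    cout << "cond(A^T A + lambda I) ~ " << (lmax + lambda) / (lmin + lambda) << "\n";  // ~ 183
    return 0;
}

With these numbers the condition number drops from about 2·10³ to under 200, which is exactly the effect ridge regularization relies on.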

⚠️ Common Mistakes

  • Confusing linear independence with orthogonality: independent vectors need not be perpendicular; always check rank or attempt to express one as a combination of others.
  • Assuming every matrix is diagonalizable by eigenvectors: non-symmetric or defective matrices may not have a full eigenbasis; use SVD or Jordan form theory when needed.
  • Mixing up eigenvalues and singular values: for non-symmetric matrices, eigenvalues can be complex or negative, while singular values are always real and nonnegative; norms and rank are tied to singular values, not eigenvalues.
  • Ignoring scaling of eigenvectors: eigenvectors are not unique in length; normalize them when comparing or computing Rayleigh quotients to avoid misleading magnitudes.
  • Using plain Gaussian elimination without pivoting: this can be numerically unstable; prefer partial pivoting or QR for solving and rank.
  • Misinterpreting condition number: a large κ means small input errors can cause large output errors; it is not a direct error but a sensitivity bound.
  • Believing Frobenius and spectral norms behave the same: ‖A‖_F measures overall energy, while ‖A‖₂ captures the maximum stretch; choose the norm that matches your guarantee.
  • Assuming randomized low-rank projections always give the best rank-k approximation: they approximate the top singular subspace; only truncated SVD is provably optimal under Eckart–Young.

Key Formulas

Span

span(S) = { α₁v₁ + α₂v₂ + ⋯ + αₖvₖ : vᵢ ∈ S, αᵢ ∈ F }

Explanation: The span of a set S is all vectors you can reach by linear combinations of S. It defines the subspace generated by S.

Rank

rank(A) = dim(col(A))

Explanation: The rank equals the dimension of the column space (image) of A. It counts independent output directions of the linear transformation.

Eigenvalue equation

Av = λv, v ≠ 0

Explanation: An eigenvector v is only scaled (by λ) when multiplied by A. This reveals invariant directions of the transformation.

Eigen-decomposition

A = VΛV⁻¹, Λ = diag(λ₁, …, λₙ)

Explanation: A diagonalizable matrix can be factored into eigenvectors and eigenvalues. Computations simplify in this eigenbasis.

Spectral theorem

A = Aᵀ ⇒ A = QΛQᵀ with QᵀQ = I and Λ real diagonal

Explanation: Real symmetric matrices are orthogonally diagonalizable with real eigenvalues and orthonormal eigenvectors. This ensures numerical stability and simplicity.

SVD

A = UΣVᵀ, with U and V orthogonal and Σ diagonal, σ₁ ≥ σ₂ ≥ ⋯ ≥ 0

Explanation: Any matrix factors into left/right orthogonal matrices and nonnegative singular values. It generalizes eigen-decomposition to rectangular and non-normal matrices.

Norms via singular values

‖A‖₂ = σ₁,  ‖A‖_F = √(σ₁² + σ₂² + ⋯ + σᵣ²)

Explanation: The spectral norm is the largest singular value; the Frobenius norm is the square root of the sum of squares of all singular values. These connect geometry to computation.

Rank via SVD

rank(A) = #{ i : σᵢ > 0 }

Explanation: The number of nonzero singular values equals the rank. This is robust to choice of basis and works for rectangular matrices.

Truncated SVD

Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ⋯ + σₖuₖvₖᵀ

Explanation: The best rank-k approximation keeps only the top k singular triplets. It minimizes error among all rank-k matrices.

Eckart–Young (Frobenius)

‖A − Aₖ‖_F² = σₖ₊₁² + σₖ₊₂² + ⋯ + σᵣ²

Explanation: The squared Frobenius error of the best rank-k approximation equals the sum of squared discarded singular values. This quantifies information loss.
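
As a quick numeric illustration with made-up singular values (the same spectrum the randomized low-rank code example below constructs synthetically):

\[
\sigma = (10,\, 3,\, 1): \qquad
\|A - A_2\|_F = \sqrt{\sigma_3^2} = 1, \qquad
\frac{\|A - A_2\|_F}{\|A\|_F} = \frac{1}{\sqrt{10^2 + 3^2 + 1^2}} = \frac{1}{\sqrt{110}} \approx 0.095.
\]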

Condition number

κ₂(A) = σ_max(A) / σ_min(A)

Explanation: The 2-norm condition number measures sensitivity of solving Ax = b. Larger values imply solutions can change a lot under small perturbations.
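
The standard way to read κ₂ is through the perturbation bound for Ax = b; the diagonal matrix below is a hypothetical example chosen so the numbers are easy to check by hand:

\[
\frac{\|\delta x\|_2}{\|x\|_2} \;\le\; \kappa_2(A)\, \frac{\|\delta b\|_2}{\|b\|_2},
\qquad
A = \begin{pmatrix} 100 & 0 \\ 0 & 0.01 \end{pmatrix}
\;\Rightarrow\;
\kappa_2(A) = \frac{100}{0.01} = 10^{4},
\]
so a 0.1% relative change in b can, in the worst case, produce a 1000% relative change in the solution x.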

Rayleigh quotient

R(A, x) = (xᵀAx) / (xᵀx), x ≠ 0

Explanation: For symmetric A, the Rayleigh quotient of a nonzero x lies between the smallest and largest eigenvalues. It reaches extremes at eigenvectors.

Gaussian elimination cost

≈ (2/3)n³ floating-point operations for a dense n×n system

Explanation: The arithmetic cost for solving a dense n-by-n system via elimination is about (2/3)n³ operations. This dominates for large n.

Power iteration step

bₖ₊₁ = A bₖ / ‖A bₖ‖₂

Explanation: Repeatedly multiplying and normalizing converges to the dominant eigenvector if it is unique and the start vector has a component in that direction.

Complexity Analysis

Core dense linear algebra has cubic-time bottlenecks. Forming a row echelon form via Gaussian elimination with partial pivoting costs about (2/3)n³ arithmetic operations for an n-by-n matrix and O(n²) space to store the matrix; counting nonzero rows yields the rank. For rectangular m-by-n matrices (m ≥ n), elimination takes O(mn²). The eigenvalue problem in full generality requires O(n³) time using QR iterations or divide-and-conquer SVD methods, with O(n²) space. However, many ML tasks only need the top singular vectors or eigenpairs: power iteration or Lanczos reduces per-iteration cost to O(n²) (for dense n-by-n) and O(nnz(A)) for sparse matrices, with O(n) additional space; convergence typically requires O(log(1/ε)) iterations, with a rate governed by the spectral gap. Randomized low-rank approximation using a range finder has complexity O(mnk) for computing Y = AΩ, plus O(mk²) for orthonormalizing Y and O(kmn) for forming the projection Q(QᵀA); when k ≪ min(m,n), this is far cheaper than a full SVD, and memory drops to O((m+n)k). Norm computations vary: the Frobenius norm costs O(mn) time and O(1) extra space, while the spectral norm generally requires iterative methods. Condition number estimation in the 2-norm is hard to do exactly but can be bounded via iterative norm estimates or computed through the SVD when available. In practice, numerical stability affects effective complexity: partial pivoting adds negligible overhead yet improves robustness, while orthonormalizations (Modified Gram–Schmidt or Householder) maintain accuracy in low-rank methods at O(mk²) cost. Overall, careful algorithm choice (dense vs. sparse, full vs. truncated) and data structure selection dominate performance in ML pipelines.
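
To make the norm-cost remarks concrete, here is a minimal sketch (dense row-major storage as nested vectors, matching the examples below; the 2×3 matrix in main is hypothetical): the Frobenius norm is one O(mn) pass over the entries, while the spectral norm is estimated iteratively by running power iteration on AᵀA, whose dominant eigenvalue is σ₁².

#include <bits/stdc++.h>
using namespace std;

// O(mn): one pass over the entries.
double frobenius_norm(const vector<vector<double>>& A) {
    double s = 0.0;
    for (const auto& row : A) for (double x : row) s += x * x;
    return sqrt(s);
}

// Estimate the spectral norm sigma_1 via power iteration on A^T A:
// each iteration costs two matrix-vector products, i.e., O(mn).
double spectral_norm_estimate(const vector<vector<double>>& A, int iters = 100) {
    int m = (int)A.size(), n = (int)A[0].size();
    vector<double> v(n);
    mt19937 rng(123);
    uniform_real_distribution<double> dist(-1.0, 1.0);
    for (double& vi : v) vi = dist(rng);
    double sigma = 0.0;
    for (int it = 0; it < iters; ++it) {
        // u = A v (m-vector), then w = A^T u (n-vector)
        vector<double> u(m, 0.0), w(n, 0.0);
        for (int i = 0; i < m; ++i) for (int j = 0; j < n; ++j) u[i] += A[i][j] * v[j];
        for (int i = 0; i < m; ++i) for (int j = 0; j < n; ++j) w[j] += A[i][j] * u[i];
        double nw = sqrt(inner_product(w.begin(), w.end(), w.begin(), 0.0));
        if (nw == 0.0) break;
        for (int j = 0; j < n; ++j) v[j] = w[j] / nw;
        sigma = sqrt(nw); // ||A^T A v|| -> sigma_1^2 as v aligns with the top right singular vector
    }
    return sigma;
}

int main() {
    // Hypothetical 2x3 example; its singular values are sqrt(10), 2, 0.
    vector<vector<double>> A = {{3.0, 0.0, 1.0},
                                {0.0, 2.0, 0.0}};
    cout << fixed << setprecision(6);
    cout << "Frobenius norm  = " << frobenius_norm(A) << "\n";         // sqrt(9 + 4 + 1) = sqrt(14)
    cout << "Spectral norm  ~= " << spectral_norm_estimate(A) << "\n"; // sqrt(10) ~ 3.162278
    return 0;
}

For this small matrix the exact values are ‖A‖_F = √14 ≈ 3.741657 and ‖A‖₂ = √10 ≈ 3.162278, so the printed estimate should agree to the displayed precision after a handful of iterations.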

Code Examples

Gaussian elimination with partial pivoting: rank, linear independence, and pivot columns (basis)
#include <bits/stdc++.h>
using namespace std;

// Numerical threshold for treating values as zero
const double EPS = 1e-9;

// Perform Gaussian elimination with partial pivoting to row-echelon form.
// Returns {rank, pivot column indices} and transforms A in-place to an upper-triangular-like form.
pair<int, vector<int>> rowEchelon(vector<vector<double>>& A) {
    int m = (int)A.size();
    int n = (int)A[0].size();
    int row = 0;
    vector<int> pivotCol;
    for (int col = 0; col < n && row < m; ++col) {
        // Find pivot in this column at or below the current row
        int sel = row;
        for (int i = row + 1; i < m; ++i)
            if (fabs(A[i][col]) > fabs(A[sel][col])) sel = i;
        if (fabs(A[sel][col]) < EPS) continue; // no pivot in this column
        swap(A[sel], A[row]);
        pivotCol.push_back(col);
        // Normalize pivot row (optional for rank, helpful for an RREF-like form)
        double piv = A[row][col];
        for (int j = col; j < n; ++j) A[row][j] /= piv;
        // Eliminate below the pivot
        for (int i = row + 1; i < m; ++i) {
            double factor = A[i][col];
            if (fabs(factor) < EPS) continue;
            for (int j = col; j < n; ++j) A[i][j] -= factor * A[row][j];
        }
        ++row;
    }
    int rank = (int)pivotCol.size();
    return {rank, pivotCol};
}

// Check if a set of column vectors is linearly independent: form a matrix with the vectors as columns.
bool isIndependent(const vector<vector<double>>& cols) {
    int n = (int)cols.size();
    int m = (int)cols[0].size();
    vector<vector<double>> A(m, vector<double>(n));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) A[i][j] = cols[j][i];
    auto res = rowEchelon(A);
    return res.first == n; // rank equals number of columns
}

int main() {
    // Example: matrix A with dependent columns
    vector<vector<double>> A = {
        {1, 2, 3, 6},
        {0, 1, 1, 2},
        {2, 5, 7, 14}
    }; // Note: col3 = col1 + col2 and col4 = 2*col3, so the rank should be 2

    auto A_copy = A;
    auto [rankA, pivots] = rowEchelon(A_copy);
    cout << fixed << setprecision(6);
    cout << "Rank(A) = " << rankA << "\n";
    cout << "Pivot columns (0-based): ";
    for (int c : pivots) cout << c << ' ';
    cout << "\n";

    // Build columns as vectors to test independence of the first 3 columns
    vector<vector<double>> cols(3, vector<double>(3));
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i)
            cols[j][i] = A[i][j];

    cout << "First 3 columns independent? " << (isIndependent(cols) ? "yes" : "no") << "\n";

    return 0;
}

We implement Gaussian elimination with partial pivoting to transform a matrix to row echelon form. The number of pivot columns equals the rank, and these pivot columns identify a basis for the column space. We also test linear independence by checking whether the rank equals the number of vectors. Partial pivoting chooses the largest absolute value in a column to improve numerical stability.

Time: O(m n^2) for an m-by-n matrix (O(n^3) if square). Space: O(m n) to store the matrix; O(1) extra beyond input storage.
Power iteration for dominant eigenvalue and eigenvector (symmetric A)
#include <bits/stdc++.h>
using namespace std;

const double EPS = 1e-10;

vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    int n = (int)A.size();
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) y[i] += A[i][j] * x[j];
    return y;
}

double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const vector<double>& x) {
    return sqrt(max(0.0, dot(x, x)));
}

pair<double, vector<double>> power_iteration(const vector<vector<double>>& A, int max_iters = 1000, double tol = 1e-9) {
    int n = (int)A.size();
    // Random initial vector
    mt19937 rng(42);
    uniform_real_distribution<double> dist(-1.0, 1.0);
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = dist(rng);
    double lambda_old = 0.0;

    for (int it = 0; it < max_iters; ++it) {
        // Multiply and normalize
        vector<double> Ab = matvec(A, b);
        double nb = norm2(Ab);
        if (nb < EPS) break; // A maps b to (nearly) zero
        for (int i = 0; i < n; ++i) b[i] = Ab[i] / nb;
        // Rayleigh quotient estimate
        vector<double> Ab_norm = matvec(A, b);
        double lambda = dot(b, Ab_norm);
        if (fabs(lambda - lambda_old) < tol * max(1.0, fabs(lambda_old))) {
            return {lambda, b};
        }
        lambda_old = lambda;
    }
    // Final estimate
    vector<double> Ab = matvec(A, b);
    double lambda = dot(b, Ab);
    return {lambda, b};
}

int main() {
    // Symmetric matrix with known eigenvalues
    vector<vector<double>> A = {
        {4, 1, 0},
        {1, 3, 0},
        {0, 0, 2}
    }; // eigenvalues: (7 + sqrt(5))/2 ~ 4.618, (7 - sqrt(5))/2 ~ 2.382, and 2 (dominant ~ 4.618)

    auto [lambda, v] = power_iteration(A, 10000, 1e-12);

    cout << fixed << setprecision(8);
    cout << "Dominant eigenvalue ~ " << lambda << "\n";
    cout << "Eigenvector (normalized): ";
    for (double vi : v) cout << vi << ' ';
    cout << "\n";

    // Residual norm ||Av - lambda v|| to check accuracy
    vector<double> Av = matvec(A, v);
    for (int i = 0; i < (int)v.size(); ++i) Av[i] -= lambda * v[i];
    double res = 0;
    for (double t : Av) res += t * t;
    res = sqrt(res);
    cout << "Residual norm = " << res << "\n";

    return 0;
}

Power iteration repeatedly multiplies a vector by A and normalizes it. For symmetric matrices with a unique largest eigenvalue in magnitude, this process converges to the dominant eigenvector, and the Rayleigh quotient gives the eigenvalue estimate. The small residual verifies Av ≈ λv.

Time: O(n^2 * iters) for dense n-by-n matrices (O(nnz(A) * iters) for sparse). Space: O(n^2) to store A (dense), O(n) extra for vectors.
Randomized low-rank approximation A ≈ Q(QᵀA) via range finder (Eckart–Young-inspired)
#include <bits/stdc++.h>
using namespace std;

// Utility: matrix multiply C = A * B (A: m x p, B: p x n)
vector<vector<double>> matmul(const vector<vector<double>>& A, const vector<vector<double>>& B) {
    int m = (int)A.size(), p = (int)A[0].size(), n = (int)B[0].size();
    vector<vector<double>> C(m, vector<double>(n, 0.0));
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < p; ++k)
            if (A[i][k] != 0.0)
                for (int j = 0; j < n; ++j)
                    C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Transpose
vector<vector<double>> transpose(const vector<vector<double>>& A) {
    int m = (int)A.size(), n = (int)A[0].size();
    vector<vector<double>> AT(n, vector<double>(m));
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) AT[j][i] = A[i][j];
    return AT;
}

// Modified Gram-Schmidt to orthonormalize the columns of Y (m x k)
vector<vector<double>> mgs(const vector<vector<double>>& Y) {
    int m = (int)Y.size(), k = (int)Y[0].size();
    vector<vector<double>> Q = Y;
    for (int j = 0; j < k; ++j) {
        // Subtract projections onto previous q's
        for (int i2 = 0; i2 < j; ++i2) {
            double r = 0.0;
            for (int i = 0; i < m; ++i) r += Q[i][i2] * Q[i][j];
            for (int i = 0; i < m; ++i) Q[i][j] -= r * Q[i][i2];
        }
        // Normalize
        double nrm = 0.0;
        for (int i = 0; i < m; ++i) nrm += Q[i][j] * Q[i][j];
        nrm = sqrt(max(1e-18, nrm));
        for (int i = 0; i < m; ++i) Q[i][j] /= nrm;
    }
    return Q;
}

// Frobenius norm
double frob(const vector<vector<double>>& A) {
    double s = 0.0;
    for (const auto& r : A) for (double x : r) s += x * x;
    return sqrt(s);
}

// Randomized range finder: returns Q (m x k) with orthonormal columns approximating range(A)
vector<vector<double>> randomized_range_finder(const vector<vector<double>>& A, int k, int oversample = 5) {
    int m = (int)A.size(), n = (int)A[0].size();
    int l = min(n, k + oversample);
    // Omega: n x l random Gaussian test matrix
    mt19937 rng(42);
    normal_distribution<double> dist(0.0, 1.0);
    vector<vector<double>> Omega(n, vector<double>(l));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < l; ++j) Omega[i][j] = dist(rng);
    // Y = A * Omega (m x l)
    vector<vector<double>> Y = matmul(A, Omega);
    // Orthonormalize columns to get Q (m x l), then keep the first k columns
    vector<vector<double>> Q = mgs(Y);
    for (int i = 0; i < m; ++i) Q[i].resize(k);
    return Q;
}

int main() {
    // Build a low-rank-ish matrix A = U * S * V^T synthetically
    int m = 50, n = 40, r = 3; // true rank r
    mt19937 rng(0);
    normal_distribution<double> dist(0.0, 1.0);
    vector<vector<double>> U(m, vector<double>(r)), V(n, vector<double>(r));
    for (int i = 0; i < m; ++i) for (int j = 0; j < r; ++j) U[i][j] = dist(rng);
    for (int i = 0; i < n; ++i) for (int j = 0; j < r; ++j) V[i][j] = dist(rng);
    // Orthonormalize U, V columns quickly via MGS
    U = mgs(U);
    V = mgs(V);
    vector<double> s = {10.0, 3.0, 1.0};
    vector<vector<double>> S(r, vector<double>(r, 0.0));
    for (int i = 0; i < r; ++i) S[i][i] = s[i];
    vector<vector<double>> A = matmul(matmul(U, S), transpose(V));

    // Add small noise to make it full rank but still low-rank dominant
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) A[i][j] += 0.01 * dist(rng);

    // Randomized rank-k approximation
    int k = 3; // target rank
    vector<vector<double>> Q = randomized_range_finder(A, k, 5); // m x k
    vector<vector<double>> B = matmul(transpose(Q), A);          // k x n
    vector<vector<double>> A_approx = matmul(Q, B);              // m x n

    // Relative Frobenius error ||A - A_approx||_F / ||A||_F
    vector<vector<double>> Diff = A;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) Diff[i][j] -= A_approx[i][j];
    double relF = frob(Diff) / max(1e-18, frob(A));

    cout << fixed << setprecision(6);
    cout << "Relative Frobenius error (rank-" << k << ") = " << relF << "\n";

    return 0;
}

This code computes a randomized low-rank approximation by finding an orthonormal basis Q for the range of A via Y = AΩ and Modified Gram–Schmidt, then projecting A onto that subspace: Â = Q(QᵀA). When k matches the number of dominant singular directions, the approximation is close to the optimal truncated SVD (Eckart–Young). The example constructs a synthetic low-rank matrix with small noise and reports the relative Frobenius error.

Time: O(m n k) for forming Y and the projection, plus O(m k^2) for orthonormalization. Space: O((m+n)k) extra beyond storing A.