Implicit Bias of Gradient Descent
Key Points
- In underdetermined linear systems (more variables than equations), gradient descent started at zero converges to the minimum Euclidean norm solution without any explicit regularizer.
- This happens because the gradient lies in the row space of A, so gradient descent never changes components in the null space of A.
- The minimum-norm solution (for full row rank A and consistent b) is x* = A^T(AA^T)^{-1}b, the Moore-Penrose pseudoinverse solution.
- With a nonzero initialization that has a null-space component, gradient descent converges to the minimum-norm solution plus that unchanged null-space component.
- A stable learning rate must satisfy 0 < η < 2/σ_max^2, where σ_max is the largest singular value of A.
- For inconsistent systems, gradient descent converges to the minimum-norm least-squares solution, still given by the pseudoinverse.
- In classification with separable data (logistic loss), gradient descent's direction converges to the L2 maximum-margin separator, another example of implicit bias.
- You can verify this phenomenon numerically in C++ by comparing gradient descent iterates to the closed-form pseudoinverse solution.
Prerequisites
- Linear algebra: vector spaces, subspaces, orthogonality. Understanding row space, null space, and projections is essential for why GD preserves null-space components.
- Least squares and normal equations. The objective f(x) = 1/2||Ax - b||^2 and its gradient structure underpin the GD updates.
- SVD and pseudoinverse. Relates to the closed-form minimum-norm solution x* = A^+ b and informs numerical stability considerations.
- Gradient descent fundamentals. Knowing convergence conditions and step-size selection is critical for the implicit bias result.
- Eigenvalues and singular values. The learning-rate bound 0 < η < 2/σ_max^2 and conditioning depend on spectral properties.
Detailed Explanation
01 Overview
Hook: Imagine you have infinitely many ways to tie a rope between two posts because the rope is much longer than the straight-line distance. Which tying pattern will you pick if no one tells you explicitly? Concept: In underdetermined linear systems (more parameters than constraints), there are infinitely many solutions to Ax = b. Yet, when we run plain gradient descent on the squared loss starting from zero, it systematically converges to the one with the smallest Euclidean length (L2 norm). This behavior is called the implicit bias (or implicit regularization) of gradient descent. Example: For a 2×4 matrix A with full row rank and a consistent b, the set of solutions forms a flat plane in 4D. Gradient descent started at the origin moves only in a special subspace (the row space) and thus ends up at the unique point on that plane closest to the origin: the minimum-norm solution x* = A^T(AA^T)^{-1}b. The punchline is that even without adding an explicit penalty like ||x||^2, the algorithm's dynamics favor a simple solution.
02 Intuition & Analogies
Hook: Think of standing in a wide, flat valley with a gentle slope pointing east-west but perfectly flat north-south. If you roll a ball from the origin, gravity pulls it only along the east-west slope; it never moves north-south because there is no slope in that direction. Concept: For the least-squares objective f(x) = 1/2 ||Ax - b||^2, the gradient ∇f(x) = A^T(Ax - b) always lies in the row space of A. Directions orthogonal to this space, the null space of A, are perfectly flat: moving along them doesn't change Ax and hence doesn't change the loss. Example: Start at x0 = 0. Every update x_{t+1} = x_t - ηA^T(Ax_t - b) is a step within the row space. Because you never pick up any null-space component, you end at the unique point where the plane of solutions intersects the row space, which is exactly the minimum-norm solution. If instead you start with a nonzero vector in the null space (like a sideways push along the flat direction), gradient descent can't remove it because the gradient never points that way. You converge to the same minimum-norm point in the row space plus this unchanged null-space component. In this sense, the algorithm's geometry, meaning what directions the gradient can and cannot affect, implicitly biases the final answer, even without explicit regularization.
03 Formal Definition
Let A be an m×n matrix with m < n and full row rank, and suppose Ax = b is consistent. Run gradient descent on f(x) = 1/2||Ax - b||^2 with initialization x_0 = 0 and step size 0 < η < 2/σ_max(A)^2. Then the iterates x_{t+1} = x_t - ηA^T(Ax_t - b) converge to x* = A^T(AA^T)^{-1}b, which is exactly the minimizer of ||x||_2 over all solutions of Ax = b, even though no norm penalty appears in the objective.
04 When to Use
Hook: Why care if you can just compute the pseudoinverse once and be done? Concept: In large-scale or streaming problems, forming A^+ explicitly (via SVD or matrix inversion) can be too costly in time or memory, while gradient descent scales with matrix-vector products and can be stopped early. Use Cases: 1) High-dimensional linear regression where n ≫ m (more features than samples). Setting x0 = 0 and using a stable learning rate yields the minimum-norm interpolator, which often generalizes better. 2) Overparameterized models (e.g., wide neural networks). Even without explicit L2 regularization, gradient-based training picks low-complexity solutions (e.g., max-margin in separable classification), which helps explain generalization. 3) Signal processing and inverse problems, where the physics yields underdetermined linear maps A; gradient methods naturally recover the least-energy solution. 4) Distributed or online settings where storing or factorizing A is expensive; you can stream A and b, applying iterative updates. Example: In practical C++, computing x* via x* = A^T(AA^T)^{-1}b requires inverting an m×m matrix, which is fine for small m but expensive for large m. Gradient descent instead uses repeated A and A^T multiplies, which are often cache-friendly and parallelizable.
⚠️ Common Mistakes
Hook: The theory is clean, but implementations can quietly violate assumptions. Concept: 1) Step size too large: If η ≥ 2/σ_max^2, gradient descent can diverge or oscillate. Always estimate σ_max or use backtracking to find a safe η. 2) Assuming any initialization gives the minimum-norm solution: Not true if x0 has a null-space component; that part remains forever. Start from x0 = 0 (or project x0 onto the row space) to get the pure minimum-norm solution. 3) Ignoring inconsistency: If b ∉ Col(A), there is no exact solution; gradient descent converges to the minimum-norm least-squares solution instead, so don't misinterpret residuals. 4) Numerical instability: Forming AA^T and inverting it can be ill-conditioned; prefer SVD for A^+ in production or add Tikhonov regularization. 5) Confusing implicit bias with explicit regularization: Weight decay (L2) changes the objective; implicit bias arises from algorithm dynamics even without penalties. 6) Stopping too early or too late: Early stopping can act like regularization; know whether you want the exact min-norm interpolator or a partially fitted model. Example: If you warm-start from a previous solution containing null-space drift, gradient descent won't remove it; explicitly project onto the row space if you want the pure min-norm solution.
Key Formulas
Gradient Descent Update (Least Squares)
x_{t+1} = x_t - η A^T(A x_t - b)
Explanation: Each step moves opposite to the gradient of f(x) = 1/2||Ax - b||^2. The gradient term A^T(Ax - b) lies in the row space of A, so updates never change null-space components.
Minimum-Norm Solution (Full Row Rank)
x* = A^T(AA^T)^{-1}b = A^+ b
Explanation: When A has full row rank and Ax = b is consistent, the unique minimum-norm solution is given by the Moore-Penrose pseudoinverse. This is the point of smallest Euclidean norm among all exact solutions.
Step Size Stability
0 < η < 2/σ_max(A)^2
Explanation: For the quadratic least-squares objective, the gradient is Lipschitz with constant L = σ_max(A)^2. Choosing η in this range guarantees convergence of gradient descent.
Decomposition and Projections
x = P_row x + P_null x, where P_row = A^T(AA^T)^{-1}A and P_null = I - P_row
Explanation: Any vector splits uniquely into row-space and null-space parts. These projection matrices extract those components when A has full row rank.
Solution Set Structure
{x : Ax = b} = {x* + z : z ∈ N(A)}
Explanation: All solutions form an affine set: the minimum-norm solution plus any vector in the null space. This shows why there are infinitely many solutions when m < n.
GD Limit with Arbitrary Initialization
lim_{t→∞} x_t = P_null x_0 + A^T(AA^T)^{-1}b
Explanation: Gradient descent preserves the null-space component of the initialization and converges to the minimum-norm point in the row space. Starting at zero removes the first term.
Gradient Lipschitz Constant
L = ||A^T A||_2 = σ_max(A)^2
Explanation: For least squares, the gradient's Lipschitz constant equals the squared spectral norm of A. This value controls step-size choices for convergence.
Hard-Margin SVM
min_w 1/2||w||^2 subject to y_i w^T x_i ≥ 1 for all i
Explanation: In separable classification, gradient descent on logistic loss converges in direction to this maximum-margin separator, showing implicit bias beyond linear regression.
Complexity Analysis
Each gradient descent iteration costs O(mn) time for the two matrix-vector products (A x and A^T r) plus O(n) for the update, with O(n) extra memory. The closed-form route x* = A^T(AA^T)^{-1}b costs O(m^2 n) to form AA^T plus O(m^3) to invert it, so iterative updates win when m is large or when A is available only through matrix-vector products.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Basic linear algebra helpers for small dense matrices
using Mat = vector<vector<double>>;
using Vec = vector<double>;

Mat transpose(const Mat &A){
    size_t m=A.size(), n=A[0].size();
    Mat AT(n, vector<double>(m,0.0));
    for(size_t i=0;i<m;++i) for(size_t j=0;j<n;++j) AT[j][i]=A[i][j];
    return AT;
}

Vec matVec(const Mat &A, const Vec &x){
    size_t m=A.size(), n=A[0].size();
    Vec y(m,0.0);
    for(size_t i=0;i<m;++i){
        double s=0.0; for(size_t j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

Mat matMul(const Mat &A, const Mat &B){
    size_t m=A.size(), k=A[0].size(), n=B[0].size();
    Mat C(m, vector<double>(n,0.0));
    for(size_t i=0;i<m;++i){
        for(size_t t=0;t<k;++t){ double a=A[i][t]; if(a==0) continue; for(size_t j=0;j<n;++j) C[i][j]+=a*B[t][j]; }
    }
    return C;
}

Vec vecAdd(const Vec &a, const Vec &b, double alpha=1.0){
    size_t n=a.size(); Vec c(n); for(size_t i=0;i<n;++i) c[i]=a[i]+alpha*b[i]; return c;
}

double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }

double norm2(const Vec &a){ return sqrt(max(0.0, dot(a,a))); }

// Gauss-Jordan inversion for a small square matrix (with partial pivoting)
Mat inverse(Mat A){
    size_t n=A.size();
    Mat I(n, vector<double>(n,0.0));
    for(size_t i=0;i<n;++i) I[i][i]=1.0;
    for(size_t col=0; col<n; ++col){
        // pivot
        size_t piv=col; double best=fabs(A[col][col]);
        for(size_t r=col+1;r<n;++r){ double v=fabs(A[r][col]); if(v>best){best=v;piv=r;} }
        if(best < 1e-12) throw runtime_error("Matrix is singular or ill-conditioned");
        if(piv!=col){ swap(A[piv],A[col]); swap(I[piv],I[col]); }
        // normalize row
        double diag=A[col][col]; double invd=1.0/diag;
        for(size_t j=0;j<n;++j){ A[col][j]*=invd; I[col][j]*=invd; }
        // eliminate other rows
        for(size_t r=0;r<n;++r){ if(r==col) continue; double f=A[r][col]; if(f==0) continue; for(size_t j=0;j<n;++j){ A[r][j]-=f*A[col][j]; I[r][j]-=f*I[col][j]; } }
    }
    return I;
}

int main(){
    // Construct a 2x4 full-row-rank underdetermined system A x = b
    // Rows are linearly independent; null space has dimension 2.
    Mat A = {
        {1, 0, 1, 0},
        {0, 1, 0, 1}
    }; // m=2, n=4
    size_t m=2, n=4;

    // Create a true x with a nonzero null-space component so min-norm != x_true
    Vec x_true = {1, 2, -1, 0}; // contains [1,0,-1,0] component (in null space)
    Vec b = matVec(A, x_true); // consistent system by construction

    // Compute the minimum-norm solution x_min = A^T (A A^T)^{-1} b
    Mat AT = transpose(A);
    Mat M = matMul(A, AT); // 2x2
    Mat Minv = inverse(M); // invert AA^T
    // Compute y = Minv * b
    Vec y(m,0.0); for(size_t i=0;i<m;++i){ for(size_t j=0;j<m;++j) y[i]+=Minv[i][j]*b[j]; }
    // x_min = A^T * y
    Vec x_min(n,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x_min[i]+=AT[i][j]*y[j]; }

    // Gradient descent from zero initialization
    Vec x(n, 0.0);
    double eta = 0.25; // stable step size for this small example
    int T = 2000;
    for(int t=0; t<T; ++t){
        Vec r = vecAdd(matVec(A, x), b, -1.0); // r = A x - b
        // g = A^T r
        Vec g(n,0.0);
        for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s += AT[i][j]*r[j]; g[i]=s; }
        // x <- x - eta * g
        for(size_t i=0;i<n;++i) x[i] -= eta * g[i];
    }

    // Report distances
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "||A x - b|| (GD) = " << norm2(vecAdd(matVec(A,x), b, -1.0)) << "\n";
    cout << "||x_GD - x_min|| = " << norm2(vecAdd(x, x_min, -1.0)) << "\n";
    cout << "x_min = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x_min[i]; cout << "]\n";
    cout << "x_GD  = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x[i]; cout << "]\n";
    return 0;
}
We build a 2×4 underdetermined, consistent system and compute the minimum-norm solution via x* = A^T(AA^T)^{-1}b. Starting gradient descent at zero with a stable step size ensures updates stay in the row space and converge to x*. The output shows a small residual and that ||x_GD - x_min|| is near zero, confirming the implicit bias.
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>; using Vec = vector<double>;

// Helpers
Mat transpose(const Mat &A){ size_t m=A.size(), n=A[0].size(); Mat AT(n, vector<double>(m)); for(size_t i=0;i<m;++i) for(size_t j=0;j<n;++j) AT[j][i]=A[i][j]; return AT; }
Vec matVec(const Mat &A, const Vec &x){ size_t m=A.size(), n=A[0].size(); Vec y(m,0); for(size_t i=0;i<m;++i){ double s=0; for(size_t j=0;j<n;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
double norm2(const Vec &a){ return sqrt(max(0.0, dot(a,a))); }

int main(){
    // Same A as before
    Mat A = {
        {1, 0, 1, 0},
        {0, 1, 0, 1}
    }; size_t m=2, n=4;
    Mat AT = transpose(A);

    // Build a consistent right-hand side
    Vec x_row_only = {0.5, 1.0, 0.5, 1.0}; // lies in Row(A): it's A^T y for y=[0.5, 1]
    Vec b = matVec(A, x_row_only);

    // The minimum-norm solution equals x_row_only here, because x_row_only
    // has zero null-space component and satisfies Ax=b.
    // We'll just store it as x_min for comparison.
    Vec x_min = x_row_only;

    // Choose a nonzero null-space vector z with A z = 0
    Vec z_null = {1, 0, -1, 0};

    // Initialize GD at x0 = z_null (pure null-space component)
    Vec x = z_null;

    // Run GD: x_{t+1} = x_t - eta * A^T (A x_t - b)
    double eta = 0.25; int T=2000;
    for(int t=0;t<T;++t){
        Vec r = add(matVec(A, x), b, -1.0); // A x - b
        Vec g(n,0.0);
        for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=AT[i][j]*r[j]; g[i]=s; }
        for(size_t i=0;i<n;++i) x[i] -= eta * g[i];
    }

    // Theory predicts: limit = z_null + x_min
    Vec predicted = add(x_min, z_null, 1.0);

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "||A x - b|| (GD) = " << norm2(add(matVec(A,x), b, -1.0)) << "\n";
    cout << "||x_GD - (x_min + z_null)|| = " << norm2(add(x, predicted, -1.0)) << "\n";
    cout << "x_GD = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x[i]; cout << "]\n";
    cout << "x_min + z_null = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << predicted[i]; cout << "]\n";
    return 0;
}
We craft a consistent system with a known row-space solution x_min and start gradient descent from a vector in the null space. Because the gradient lies in the row space, the null-space component cannot be removed by updates. The final iterate equals x_min plus the initial null-space vector, confirming the persistence of null-space components and the role of initialization.