
ADMM (Alternating Direction Method of Multipliers)

Key Points

  • ADMM splits a hard optimization problem into two easier subproblems that communicate through simple averaging-like steps.
  • It alternates minimizing with respect to each block of variables and then updates a running “penalty-adjusted” dual variable.
  • The method is especially effective for large, structured, and distributed convex problems, such as L1-regularized learning (LASSO) and consensus averaging.
  • ADMM’s core engine is the augmented Lagrangian, which stabilizes constraints with a quadratic penalty controlled by a parameter \(ρ\).
  • Stopping is based on primal and dual residuals that measure constraint satisfaction and update stability.
  • Per-iteration cost is usually dominated by linear algebra in each subproblem; warm starts and precomputed factorizations make ADMM fast in practice.
  • ADMM is robust and easy to implement with proximal operators (like soft-thresholding) but typically converges sublinearly (often \(O(1/k)\)).
  • Choosing \(ρ\) and good scaling strongly affect speed; heuristic adjustments of \(ρ\) can help convergence.

Prerequisites

  • →Convex optimization basics — ADMM assumes convexity for convergence guarantees and uses concepts like convex functions and optimality.
  • →Linear algebra (matrix factorizations) — Efficient x-updates rely on solving linear systems using Cholesky/QR and understanding A^T A.
  • →Lagrangian duality — The augmented Lagrangian and dual variables are central to ADMM’s design and convergence.
  • →Proximal operators — Many ADMM subproblems reduce to evaluating proximal mappings like soft-thresholding.
  • →Numerical stability and conditioning — Choosing formulations and factorizations that are stable is key for performance and accuracy.

Detailed Explanation


01 Overview

The Alternating Direction Method of Multipliers (ADMM) is a powerful algorithm for solving optimization problems that can be decomposed into parts. Think of problems where an objective naturally splits into two terms, each depending on a different block of variables, tied together by linear constraints. ADMM leverages this structure by alternating between optimizing each block while keeping the other fixed, and then enforcing agreement through a dual update. This combines the decomposability of dual ascent with the stability of the augmented Lagrangian. In practice, ADMM shines in large-scale convex optimization, especially when the cost function involves non-smooth penalties like the L1 norm. Examples include LASSO for sparse regression, total variation denoising, and distributed consensus problems. Because each subproblem can often be solved with simple operations (like soft-thresholding) or fast linear algebra (like solving a system with a precomputed factorization), ADMM can handle very large datasets efficiently and can be distributed across machines. While its theoretical convergence rate is typically sublinear, its robustness, simplicity, and ability to exploit problem structure make it a go-to method in modern optimization and machine learning.

02 Intuition & Analogies

Imagine two teams trying to write a joint report. Team X writes the content (x), Team Z designs the layout (z). They must agree on a final version (the constraint ties x and z). Working together on every sentence is slow, so they adopt a routine: Team X drafts content given the current layout, Team Z refines the layout given the new draft, and a coordinator checks how different their versions are, adding a “penalty note” if they disagree. Over time, the fear of accumulating penalties pushes both teams toward agreement while each focuses on their specialty. That is ADMM. More concretely, ADMM introduces a soft “rubber band” between the two variable blocks: if x and z drift apart (violating the constraint), the rubber band (quadratic penalty) pulls them back. The dual variable u is like an accountant tracking the disagreement; when disagreement persists, u tightens the rubber band by increasing the effective cost of misalignment. Because each team only solves their own subproblem (often easy), the overall method scales well. In distributed settings, think of multiple sensors estimating a common quantity: each computes a local update using its data, then everyone averages their answers and adjusts their internal bias (u). Repeat this, and the network reaches consensus without any single node solving the entire problem. The alternating, penalty-aided negotiation is what makes ADMM both stable and modular.

03 Formal Definition

Consider convex functions f and g, matrices A and B, and a vector c. The problem form is: minimize f(x) + g(z) subject to Ax + Bz = c. The augmented Lagrangian with penalty parameter \(ρ > 0\) is \(L_ρ(x, z, y) = f(x) + g(z) + y^⊤(Ax + Bz - c) + \frac{ρ}{2}\|Ax + Bz - c\|_2^2\). In the scaled form with u = y/ρ, ADMM performs the iterations \(x^{k+1} = \arg\min_x f(x) + \frac{ρ}{2}\|Ax + Bz^k - c + u^k\|_2^2\); \(z^{k+1} = \arg\min_z g(z) + \frac{ρ}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2\); \(u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c\). Under standard assumptions (closed, proper, convex f and g; existence of a saddle point; mild qualification conditions), convergence guarantees that the primal residual \(r^k = Ax^k + Bz^k - c\) approaches zero and that the objective values converge to the optimal value. Stopping criteria are typically based on the norms of the primal residual and the dual residual \(s^k\), compared against absolute and relative tolerances that scale with problem size and variable magnitudes.

04 When to Use

Use ADMM when your problem naturally splits into two (or more) parts connected by simple linear constraints, and each part is easy to optimize on its own. Classic scenarios include: (1) sparse learning, like LASSO, where the smooth least-squares term couples with a non-smooth L1 penalty; (2) total variation and fused-lasso problems where proximal operators are simple; (3) distributed or federated optimization where data is partitioned across machines, since ADMM allows local updates with lightweight global coordination; (4) consensus and sharing problems where many local variables must agree on a common value; (5) large-scale problems where precomputations (like a Cholesky factor) can be reused across many iterations. ADMM is less ideal when the subproblems are themselves expensive or ill-conditioned without a good preconditioner, when high-accuracy solutions are needed quickly (second-order interior-point methods may win there), or when the problem does not decompose in a way that yields simple proximal steps. It is a strong default for convex, structured, and separable objectives where modest accuracy suffices and scalability matters.

⚠️Common Mistakes

  • Ignoring scaling and normalization: Poorly scaled A, variables, or penalties can slow convergence dramatically. Normalize features and tune \(ρ\) (or adapt it during iterations).
  • Not precomputing reusable factorizations: Re-solving the same linear system from scratch each iteration wastes time. Factorize \(A^⊤A + ρI\) once and reuse triangular solves.
  • Loose or incorrect stopping criteria: Failing to monitor both primal and dual residuals can cause premature stopping or endless looping. Use standard absolute/relative tolerances that scale with problem size.
  • Using too small or too large \(ρ\): Too small makes constraints weakly enforced (slow primal convergence); too large causes tiny steps (slow dual convergence). Implement heuristic \(ρ\) adjustment based on residual magnitudes.
  • Forgetting warm starts: ADMM benefits from warm-starting x, z, and u, especially when solving a sequence of related problems (e.g., along a regularization path).
  • Overlooking numerical stability: For ill-conditioned matrices, prefer numerically stable factorizations (Cholesky for SPD matrices, QR otherwise) and avoid forming \(A^⊤A\) explicitly when possible (use CG, or the normal equations with care).

Key Formulas

ADMM canonical form

\(\min_{x,z}\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c\)

Explanation: This is the standard problem structure ADMM targets: a sum of two functions with a linear constraint tying variables together. Many practical problems can be written this way.

Augmented Lagrangian

\(L_ρ(x, z, y) = f(x) + g(z) + y^⊤(Ax + Bz - c) + \frac{ρ}{2}\|Ax + Bz - c\|_2^2\)

Explanation: Adds a quadratic penalty to the classical Lagrangian to stabilize dual ascent. The parameter \(ρ\) controls how strongly constraint violations are penalized.

Scaled ADMM updates

\(x^{k+1} = \arg\min_x\; f(x) + \frac{ρ}{2}\|Ax + Bz^k - c + u^k\|_2^2\)
\(z^{k+1} = \arg\min_z\; g(z) + \frac{ρ}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2\)
\(u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c\)

Explanation: These are the core iteration steps using the scaled dual variable u = y/ρ. Each step solves a simpler subproblem focusing on one variable block while the other is fixed.

Residuals (general form)

\(r^{k+1} = Ax^{k+1} + Bz^{k+1} - c, \quad s^{k+1} = ρA^⊤B(z^{k+1} - z^k)\)

Explanation: The primal residual r measures constraint violation; the dual residual s measures dual feasibility progress. They are used to decide when to stop.

Residuals (x = z splitting)

\(r^{k+1} = x^{k+1} - z^{k+1}, \quad s^{k+1} = ρ(z^{k+1} - z^k)\)

Explanation: For problems written with A = I, B = -I, c = 0, the residuals simplify. This is the common form used in LASSO and many proximal splitting problems.

Soft-thresholding

\(S_κ(v_i) = \mathrm{sign}(v_i)\max(|v_i| - κ, 0)\)

Explanation: The proximal operator for the L1 norm. Apply it elementwise with threshold \(κ = λ/ρ\) in LASSO-like problems.

Stopping tolerances

\(ε^{\mathrm{pri}} = \sqrt{n}\,ε^{\mathrm{abs}} + ε^{\mathrm{rel}}\max\{\|Ax\|_2, \|Bz\|_2, \|c\|_2\}, \quad ε^{\mathrm{dual}} = \sqrt{n}\,ε^{\mathrm{abs}} + ε^{\mathrm{rel}}\|ρA^⊤u\|_2\)

Explanation: Absolute and relative tolerances scale with dimension to make stopping criteria robust. Stop when the residual norms are below these thresholds.

LASSO x-update normal equations

\((A^⊤A + ρI)x = A^⊤b + ρ(z - u)\)

Explanation: For the LASSO splitting with x = z, the x-subproblem reduces to solving a linear system. Precomputing a factorization makes each iteration cheap.

Consensus z-update

\(z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^{k+1} + u_i^k\right)\)

Explanation: In consensus ADMM, the global variable is the average of local primal-plus-dual terms. This step communicates minimal information across workers.

Ergodic convergence rate

O(1/k)

Explanation: Under convexity, averages of iterates often converge at a rate proportional to 1/k. This describes typical sublinear convergence for first-order splitting methods.

Complexity Analysis

ADMM’s per-iteration cost depends on the two subproblems. If each subproblem has a cheap proximal operator (e.g., L1 soft-thresholding), per-iteration time is dominated by linear algebra for the smooth term. For LASSO with m samples and n features in the splitting x = z, the x-update solves \((A^⊤A + ρI)x = A^⊤b + ρ(z - u)\). Precomputing \(A^⊤A\) and its Cholesky factor costs \(O(mn^2 + n^3)\) once. Each iteration then performs two triangular solves in \(O(n^2)\), plus \(O(n)\) vector operations and an \(O(n)\) soft-thresholding; total \(O(n^2)\) per iteration after setup. If m ≫ n, forming \(A^⊤A\) may be acceptable; if n is huge, prefer iterative solvers (e.g., conjugate gradient) with matrix-vector costs \(O(mn)\) per CG iteration. For consensus ADMM with N workers and d-dimensional variables, local x-updates are \(O(d)\) per worker (closed-form), the global z-update is an \(O(Nd)\) average, and dual updates are \(O(d)\) per worker; total \(O(Nd)\) per iteration with minimal communication (one vector per worker). Space complexity typically covers the problem data (e.g., A in \(O(mn)\)), the auxiliary variables x, z, u (\(O(n)\) or \(O(Nd)\)), and any factorization (e.g., Cholesky factors in \(O(n^2)\)). ADMM’s memory footprint is therefore dominated by data matrices and cached factorizations. Convergence is usually sublinear; the number of iterations varies with conditioning and \(ρ\). Adaptive \(ρ\), preconditioning, and good scaling can substantially reduce iterations in practice.

Code Examples

ADMM for LASSO: min 1/2 ||A x - b||_2^2 + λ ||x||_1
#include <bits/stdc++.h>
using namespace std;

// 2-norm of a vector
static double norm2(const vector<double>& v){
    double s = 0.0;
    for(double x : v) s += x * x;
    return sqrt(s);
}

// Compute AtA = A^T A (n x n) and Atb = A^T b (n)
static void AtA_Atb(const vector<vector<double>>& A, const vector<double>& b,
                    vector<vector<double>>& AtA, vector<double>& Atb){
    int m = (int)A.size(), n = (int)A[0].size();
    AtA.assign(n, vector<double>(n, 0.0));
    Atb.assign(n, 0.0);
    for(int i = 0; i < m; ++i){
        for(int j = 0; j < n; ++j){
            Atb[j] += A[i][j] * b[i];
            for(int k = 0; k < n; ++k){
                AtA[j][k] += A[i][j] * A[i][k];
            }
        }
    }
}

// Cholesky decomposition for SPD matrix M (n x n): M = L L^T
static bool cholesky(vector<vector<double>> M, vector<vector<double>>& L){
    int n = (int)M.size();
    L.assign(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; ++i){
        for(int j = 0; j <= i; ++j){
            double s = M[i][j];
            for(int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if(i == j){
                if(s <= 0.0) return false; // not SPD
                L[i][i] = sqrt(s);
            }else{
                L[i][j] = s / L[j][j];
            }
        }
    }
    return true;
}

// Solve (L L^T) x = b via forward and backward substitution
static vector<double> chol_solve(const vector<vector<double>>& L, const vector<double>& b){
    int n = (int)L.size();
    vector<double> y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for(int i = 0; i < n; ++i){
        double s = b[i];
        for(int k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    // Backward solve: L^T x = y
    for(int i = n - 1; i >= 0; --i){
        double s = y[i];
        for(int k = i + 1; k < n; ++k) s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }
    return x;
}

// Soft-thresholding (prox of L1): v_i = sign(v_i) * max(|v_i| - kappa, 0)
static void soft_threshold(vector<double>& v, double kappa){
    for(double& x : v){
        double a = fabs(x) - kappa;
        x = (a > 0.0) ? copysign(a, x) : 0.0;
    }
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example data: small LASSO instance (6 samples, n = 4 features)
    int n = 4;
    vector<vector<double>> A = {
        { 0.50, -0.10,  0.30,  0.00},
        { 0.20,  0.40, -0.20,  0.10},
        { 0.00,  0.10,  0.50, -0.30},
        { 0.40, -0.30,  0.10,  0.20},
        {-0.10,  0.20,  0.00,  0.50},
        { 0.30,  0.10, -0.40,  0.10}
    };
    vector<double> b = { 0.9, 0.1, 0.4, 0.7, -0.2, 0.3 };

    // ADMM parameters
    double lambda = 0.2;  // L1 regularization strength
    double rho = 1.0;     // augmented Lagrangian penalty
    double abs_tol = 1e-4, rel_tol = 1e-3;
    int max_iters = 200;

    // Precompute normal equations matrix (A^T A + rho I) and A^T b
    vector<vector<double>> AtA;
    vector<double> Atb;
    AtA_Atb(A, b, AtA, Atb);
    for(int i = 0; i < n; ++i) AtA[i][i] += rho; // add rho I

    // Cholesky factor of (A^T A + rho I), reused by every iteration
    vector<vector<double>> L;
    if(!cholesky(AtA, L)){
        cerr << "Matrix not SPD; try increasing rho or using QR/CG.\n";
        return 1;
    }

    // Initialize x, z, u with zeros (warm-start here when available)
    vector<double> x(n, 0.0), z(n, 0.0), u(n, 0.0), z_old(n, 0.0);

    cout << fixed << setprecision(6);
    for(int k = 0; k < max_iters; ++k){
        // x-update: solve (A^T A + rho I) x = A^T b + rho (z - u)
        vector<double> rhs(n, 0.0);
        for(int i = 0; i < n; ++i) rhs[i] = Atb[i] + rho * (z[i] - u[i]);
        x = chol_solve(L, rhs);

        // z-update: soft-threshold x + u with kappa = lambda / rho
        z_old = z;
        for(int i = 0; i < n; ++i) z[i] = x[i] + u[i];
        soft_threshold(z, lambda / rho);

        // u-update: u = u + x - z
        for(int i = 0; i < n; ++i) u[i] += x[i] - z[i];

        // Residuals: primal r = x - z; dual s = rho * (z - z_old)
        vector<double> r(n, 0.0), s(n, 0.0);
        for(int i = 0; i < n; ++i){ r[i] = x[i] - z[i]; s[i] = rho * (z[i] - z_old[i]); }
        double r_norm = norm2(r);
        double s_norm = norm2(s);

        // Tolerances; for this splitting ||A^T y||_2 = rho ||u||_2
        double eps_pri  = sqrt((double)n) * abs_tol + rel_tol * max(norm2(x), norm2(z));
        double eps_dual = sqrt((double)n) * abs_tol + rel_tol * rho * norm2(u);

        if(k % 10 == 0){
            cout << "iter=" << setw(3) << k
                 << " r_norm=" << setw(10) << r_norm
                 << " s_norm=" << setw(10) << s_norm
                 << " eps_pri=" << setw(10) << eps_pri
                 << " eps_dual=" << setw(10) << eps_dual << "\n";
        }

        if(r_norm <= eps_pri && s_norm <= eps_dual){
            cout << "Converged at iter " << k << "\n";
            break;
        }
    }

    // Output solution
    cout << "x (sparse): ";
    for(double xi : x) cout << xi << ' ';
    cout << "\n";
    cout << "z (prox result): ";
    for(double zi : z) cout << zi << ' ';
    cout << "\n";

    return 0;
}

Solves the LASSO problem by splitting x and z with the constraint x = z. The x-update solves a linear system with a precomputed Cholesky factor of (A^T A + ρ I); the z-update applies elementwise soft-thresholding (the proximal operator of the L1 norm). The dual variable u accumulates constraint disagreement. Stopping uses primal and dual residual norms relative to absolute/relative tolerances. Precomputing the factorization makes each iteration fast and numerically stable for moderately sized n.

Time: Setup is O(m n^2 + n^3) to form A^T A and compute its Cholesky factor. Each ADMM iteration is O(n^2) for two triangular solves plus O(n) vector operations and soft-thresholding. Space: O(m n) to store A; O(n^2) for A^T A and its Cholesky factor; O(n) for the vectors x, z, u.
Consensus ADMM for distributed averaging
#include <bits/stdc++.h>
using namespace std;

// 2-norm of a vector
static double norm2(const vector<double>& v){
    double s = 0.0;
    for(double x : v) s += x * x;
    return sqrt(s);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Suppose N workers each hold a d-dimensional observation a_i.
    // Worker i solves min 0.5 ||x_i - a_i||^2 subject to x_i = z.
    int N = 5, d = 3;
    vector<vector<double>> a = {
        { 1.0,  2.0,  0.0},
        { 0.5,  1.5, -1.0},
        {-0.5,  2.5,  1.0},
        { 2.0, -1.0,  0.5},
        { 1.5,  0.0, -0.5}
    };

    // ADMM variables: local x_i and u_i per worker, global consensus z
    vector<vector<double>> x(N, vector<double>(d, 0.0));
    vector<vector<double>> u(N, vector<double>(d, 0.0));
    vector<double> z(d, 0.0), z_prev(d, 0.0);

    double rho = 1.0; // penalty parameter
    double abs_tol = 1e-5, rel_tol = 1e-3;
    int max_iters = 200;

    for(int k = 0; k < max_iters; ++k){
        // x-update (local, closed-form): x_i = (a_i + rho z - u_i) / (1 + rho)
        for(int i = 0; i < N; ++i)
            for(int j = 0; j < d; ++j)
                x[i][j] = (a[i][j] + rho * z[j] - u[i][j]) / (1.0 + rho);

        // z-update (global averaging): z = (1/N) * sum_i (x_i + u_i)
        vector<double> sum(d, 0.0);
        for(int i = 0; i < N; ++i)
            for(int j = 0; j < d; ++j) sum[j] += x[i][j] + u[i][j];
        for(int j = 0; j < d; ++j) z[j] = sum[j] / (double)N;

        // u-update: u_i = u_i + x_i - z
        for(int i = 0; i < N; ++i)
            for(int j = 0; j < d; ++j) u[i][j] += x[i][j] - z[j];

        // Residuals: primal stacks all x_i - z; dual s = rho * (z - z_prev)
        vector<double> r_flat, x_flat, u_flat;
        r_flat.reserve(N * d); x_flat.reserve(N * d); u_flat.reserve(N * d);
        for(int i = 0; i < N; ++i){
            for(int j = 0; j < d; ++j){
                r_flat.push_back(x[i][j] - z[j]);
                x_flat.push_back(x[i][j]);
                u_flat.push_back(u[i][j]);
            }
        }
        vector<double> s_vec(d, 0.0);
        for(int j = 0; j < d; ++j) s_vec[j] = rho * (z[j] - z_prev[j]);

        double r_norm = norm2(r_flat);
        double s_norm = norm2(s_vec);

        // Tolerances over the stacked variables (z appears in N constraints)
        double eps_pri  = sqrt((double)(N * d)) * abs_tol
                        + rel_tol * max(norm2(x_flat), sqrt((double)N) * norm2(z));
        double eps_dual = sqrt((double)(N * d)) * abs_tol + rel_tol * rho * norm2(u_flat);

        if(k % 10 == 0){
            cout << "iter=" << setw(3) << k
                 << " r_norm=" << r_norm
                 << " s_norm=" << s_norm << "\n";
        }

        if(r_norm <= eps_pri && s_norm <= eps_dual){
            cout << "Converged at iter " << k << "\n";
            break;
        }
        z_prev = z;
    }

    // Report result and compare to the direct average of a_i
    vector<double> avg(d, 0.0);
    for(int i = 0; i < N; ++i)
        for(int j = 0; j < d; ++j) avg[j] += a[i][j];
    for(int j = 0; j < d; ++j) avg[j] /= (double)N;

    cout << fixed << setprecision(6);
    cout << "Consensus z: "; for(double v : z) cout << v << ' '; cout << "\n";
    cout << "Direct avg : "; for(double v : avg) cout << v << ' '; cout << "\n";

    return 0;
}

Solves a consensus problem where each worker i wants x_i close to its local vector a_i, with the constraint that all x_i equal a global consensus z. ADMM yields simple closed-form local updates, a global averaging step, and dual accumulation. The final z matches the direct average of the local data, illustrating distributed averaging with lightweight communication.

Time: O(N d) per iteration (local updates and one global average). Space: O(N d) for the local variables x_i and u_i, plus O(d) for z.