📚 Theory · Intermediate

Lagrangian Duality

Key Points

  • Lagrangian duality turns a constrained minimization problem into a related maximization problem that provides lower bounds on the original objective.
  • The Lagrangian adds weighted penalties (Lagrange multipliers) for constraint violations to the original objective and then minimizes over the primal variables.
  • The dual function is concave and its maximization over nonnegative multipliers (and free multipliers for equalities) defines the dual problem.
  • Weak duality always holds: the best dual value is never greater than the best primal value, so the dual is a certified lower bound.
  • Under conditions like Slater's condition for convex problems, strong duality holds and the optimal primal and dual values are equal.
  • The Karush–Kuhn–Tucker (KKT) conditions characterize optimality; in convex problems they are necessary and sufficient.
  • Complementary slackness links constraints and multipliers: each inequality is either tight or has zero multiplier at optimum.
  • Duality is fundamental in ML: the SVM training problem is often solved in dual form, enabling kernels and efficient algorithms.

Prerequisites

  • Multivariable calculus: Gradients and stationary points are essential to form and interpret KKT stationarity.
  • Linear algebra: Vectors, matrices, and solving linear systems are needed to derive and compute x*(lambda) and dual functions.
  • Convex analysis: Convexity, Slater’s condition, and properties of convex functions underlie strong duality and KKT sufficiency.
  • Optimization basics: Understanding objectives/constraints, feasibility, and gradient-based methods helps in applying dual ascent.
  • C++ programming: Implementing optimization routines, numerical linear algebra, and careful floating-point computations.

Detailed Explanation

01 Overview

Imagine trying to minimize a cost while obeying rules. If you break a rule, you pay a penalty. Lagrangian duality formalizes this idea: we attach a nonnegative penalty (Lagrange multiplier) to each inequality and a signed multiplier to each equality, add these penalties to the objective, and then ask, "What is the best cost we could get if we were allowed to adjust the decision and pay penalties?" The minimum over the decision variables yields the dual function, which is always a lower bound on the true best cost. Then we flip the perspective and choose penalties to make that bound as tight as possible—this is the dual problem. The power of duality is twofold: it provides certified lower bounds (useful to prove optimality or stop early) and it often simplifies or decomposes the original problem. In convex optimization, and under mild regularity conditions (like Slater’s condition), there is no gap between the best primal and dual values, so solving the dual recovers the primal solution. This perspective underpins many algorithms and ML models, notably the support vector machine (SVM), where the dual exposes the kernel trick and sparsity in support vectors.

02 Intuition & Analogies

Hook: Think of planning a road trip with a budget and time limits. You try to pick routes (decisions) to minimize fuel cost while obeying constraints like total time and toll limits. Now suppose a friend says, "If you exceed the time limit, I’ll fine you $10 per extra hour; if you exceed tolls, I’ll fine you $2 per extra dollar." You now minimize fuel plus fines. If the fines are large, you will naturally obey the rules; if they’re small, you might break them but pay. Concept: Those fine rates are the Lagrange multipliers. The Lagrangian is fuel cost plus the fines for each rule violation. For any fixed set of fine rates, you can pick a best route (minimize over routes). That minimum is a lower bound on the true best possible cost with rules enforced, because when penalties are high enough, violating becomes too expensive. Example: If the time fine is huge, any plan that violates time will be worse than staying within time—so the best penalized plan equals the best feasible plan. Duality says, "Choose fine rates to make this bound as tight as possible." When chosen just right in convex problems (with some slack feasibility), the penalty scheme pins the best route exactly—you recover the true constrained optimum. Complementary slackness matches intuition: a rule that’s not tight has zero fine at optimum (no reason to price it), and a tight rule has a positive fine (its price reflects its scarcity).

03 Formal Definition

Primal problem: minimize f(x) subject to inequality constraints g_i(x) ≤ 0 and equality constraints h_j(x) = 0. The Lagrangian is L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x), with multipliers λ ≥ 0 and ν free. The dual function is g(λ, ν) = inf_x L(x, λ, ν). Because it is a pointwise infimum of functions affine in (λ, ν), g is concave, even if f, g_i, h_j are not. The dual problem maximizes this concave function over the feasible multipliers: maximize g(λ, ν) subject to λ ≥ 0. Let p* be the optimal primal value and d* the optimal dual value. Weak duality states d* ≤ p* always. In convex problems (f and g_i convex, h_j affine) and under a constraint qualification such as Slater’s condition (existence of a strictly feasible point), strong duality holds: d* = p*, and optimality is characterized by the KKT conditions: stationarity, primal feasibility, dual feasibility, and complementary slackness. When strong duality holds, any saddle point (x*, λ*, ν*) satisfying L(x*, λ, ν) ≤ L(x*, λ*, ν*) ≤ L(x, λ*, ν*) for all x, λ ≥ 0, and ν certifies optimality.
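
To make these definitions concrete, here is a minimal worked instance (a one-variable example added for illustration, not from the original text):

% Worked instance (illustrative): minimize x^2 subject to 1 - x <= 0
\begin{aligned}
L(x, \lambda) &= x^2 + \lambda (1 - x), \qquad \lambda \ge 0 \\
\partial L / \partial x = 2x - \lambda = 0 \quad &\Longrightarrow \quad x^*(\lambda) = \lambda / 2 \\
g(\lambda) = L(x^*(\lambda), \lambda) &= \lambda - \lambda^2 / 4 \quad \text{(concave)} \\
\max_{\lambda \ge 0} g(\lambda) \quad &\Longrightarrow \quad \lambda^* = 2, \ x^* = 1, \ d^* = p^* = 1
\end{aligned}

The constraint is active at x* = 1 with λ* = 2 > 0, so complementary slackness holds with a tight constraint and a positive multiplier.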

04 When to Use

Use Lagrangian duality when you need certified lower bounds, decomposition, or a simpler equivalent problem. Typical contexts include: (1) Convex optimization with inequality/equality constraints, where strong duality often holds and KKT conditions enable efficient algorithms. (2) Quadratic programs (QPs), where the dual may be lower dimensional or more structured (e.g., box constraints), enabling faster solvers. (3) Machine learning models like SVMs: the dual transforms constraints into a kernelized quadratic form with simple box constraints and one equality, making kernels and sparsity natural. (4) Resource allocation and networks: dual variables act as prices coordinating decentralized decisions (e.g., in networking or distributed control). (5) Algorithm design: dual ascent, projected gradient on the dual, and ADMM use dual variables to enforce constraints iteratively. (6) Certification: if you have a candidate feasible primal solution, a dual solution provides a gap certificate f(x) - g(\lambda, \nu) to measure suboptimality and stop early. Especially apply it when Slater’s condition is easy to verify, the dual is easier to optimize, or decomposition across constraints/blocks yields parallelism.
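
As a concrete instance of point (6), here is a minimal sketch of a gap-certificate stopping test (the helper name is invented for illustration; it assumes p_val comes from a feasible x and d_val from dual-feasible multipliers):

#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical helper: stop an iterative solver once the certified duality
// gap is small. Assumes p_val = f(x) for a FEASIBLE x and d_val = g(lambda, nu)
// for dual-feasible multipliers, so p_val - d_val >= 0 by weak duality.
bool gap_certificate_says_stop(double p_val, double d_val, double tol = 1e-6) {
    double gap = p_val - d_val;                          // certified suboptimality bound
    double rel = gap / std::max(1.0, std::fabs(p_val));  // relative gap
    return rel <= tol;
}

int main() {
    // Example numbers only: a feasible primal value and a dual lower bound.
    double p_val = 4.000002, d_val = 3.999999;
    std::printf("stop = %s\n", gap_certificate_says_stop(p_val, d_val) ? "yes" : "no");
    return 0;
}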

⚠️ Common Mistakes

• Assuming strong duality without checking qualifications: for nonconvex problems, or convex problems without Slater’s condition, there may be a duality gap d^{*} < p^{*}. Always verify convexity and feasibility conditions.
• Sign and direction errors: inequalities must be of the form g_{i}(x) \le 0 with \lambda_{i} \ge 0. If your constraints are of the form a^{T}x \ge b, rewrite as b - a^{T}x \le 0 (see the sketch after this list).
• Ignoring dual feasibility: optimizing g(\lambda, \nu) with negative \lambda can yield invalid lower bounds. Keep \lambda \ge 0 via projection.
• Misusing KKT: in nonconvex problems, KKT conditions are necessary (under constraint qualifications) but not sufficient; they may identify saddle points. In convex problems, they are necessary and sufficient.
• Forgetting complementary slackness: at optimum, each inequality constraint either binds (g_{i}(x^{*}) = 0, \lambda_{i}^{*} > 0) or is slack (g_{i}(x^{*}) < 0, \lambda_{i}^{*} = 0). Violations indicate numerical issues or non-optimality.
• Numerical instability: computing x^{*}(\lambda) may require solving linear systems; poor conditioning can stall dual ascent. Use robust solvers (e.g., Cholesky for SPD Q) and step-size schedules.
• Confusing primal and dual variables: dual gradients are constraint residuals at x^{*}(\lambda); mixing them with gradients of f leads to incorrect updates. For inequality constraints, always use \nabla_{\lambda_{i}} g(\lambda) = g_{i}(x^{*}(\lambda)).
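
The sign-convention and projection pitfalls above are mechanical to guard against in code. A minimal C++ sketch (helper names are illustrative, not from any library):

#include <algorithm>
#include <cstdio>
#include <vector>

// Convert a row constraint a^T x >= b into standard form b - a^T x <= 0,
// i.e., negate both sides so it reads (-a)^T x <= -b.
void toStandardForm(std::vector<double>& a_row, double& b_rhs) {
    for (double& v : a_row) v = -v;
    b_rhs = -b_rhs;
}

// Enforce dual feasibility after each ascent step: project onto lambda >= 0.
void projectNonnegative(std::vector<double>& lambda) {
    for (double& l : lambda) l = std::max(0.0, l);
}

int main() {
    std::vector<double> a = {1.0, -2.0};
    double b = 3.0;
    toStandardForm(a, b); // a = {-1, 2}, b = -3 now encodes -x1 + 2*x2 <= -3
    std::vector<double> lambda = {0.5, -0.1};
    projectNonnegative(lambda); // lambda = {0.5, 0.0}
    std::printf("b' = %.1f, lambda2 = %.1f\n", b, lambda[1]);
    return 0;
}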

Key Formulas

Lagrangian

L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x), with λ ≥ 0 and ν free

Explanation: The Lagrangian augments the objective with penalties for constraint violations. Inequality multipliers must be nonnegative; equality multipliers are free.

Dual Function

g(λ, ν) = inf_x L(x, λ, ν)

Explanation: For fixed multipliers, the best (lowest) value of the Lagrangian over x defines a concave function. It is a lower bound on the optimal primal value.

Dual Problem

maximize g(λ, ν) subject to λ ≥ 0 (ν free)

Explanation: The dual problem chooses multipliers to maximize the lower bound provided by the dual function, subject to dual feasibility.

Weak Duality

d* ≤ p*

Explanation: The optimal dual value never exceeds the optimal primal value. This guarantees any dual solution provides a certified lower bound.

Strong Duality

d* = p*

Explanation: Under conditions such as convexity and Slater's condition, the duality gap closes and optimal values match, allowing recovery of primal solutions from dual ones.

KKT Stationarity

∇f(x*) + Σ_i λ_i* ∇g_i(x*) + Σ_j ν_j* ∇h_j(x*) = 0

Explanation: At optimum, the gradient of the Lagrangian with respect to primal variables must vanish, balancing objective and constraint gradients.

KKT Feasibility

g_i(x*) ≤ 0, h_j(x*) = 0, and λ_i* ≥ 0 for all i, j

Explanation: Primal constraints are satisfied and inequality multipliers are nonnegative at the optimal solution.

Complementary Slackness

λ_i* g_i(x*) = 0 for all i

Explanation: Each inequality is either active (equals zero with positive multiplier) or inactive (strictly negative with zero multiplier) at optimum.

Saddle-Point Condition

L(x*, λ, ν) ≤ L(x*, λ*, ν*) ≤ L(x, λ*, ν*) for all x, all λ ≥ 0, and all ν

Explanation: A saddle point of the Lagrangian certifies optimality when strong duality holds, simultaneously minimizing over x and maximizing over multipliers.

QP Dual Function

g(λ) = -(1/2) (c + Aᵀλ)ᵀ Q⁻¹ (c + Aᵀλ) - bᵀλ

Explanation: For a convex QP (minimize (1/2) xᵀQx + cᵀx subject to Ax ≤ b) with positive definite Q, the dual function has this closed form. It is concave in λ.
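
The closed form follows from a short derivation, consistent with the definitions above:

\begin{aligned}
L(x, \lambda) &= \tfrac{1}{2} x^T Q x + c^T x + \lambda^T (A x - b) \\
\nabla_x L = Q x + c + A^T \lambda = 0 \quad &\Longrightarrow \quad x^*(\lambda) = -Q^{-1} (c + A^T \lambda) \\
g(\lambda) = L(x^*(\lambda), \lambda) &= -\tfrac{1}{2} (c + A^T \lambda)^T Q^{-1} (c + A^T \lambda) - b^T \lambda
\end{aligned}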

Dual Gradient (Inequalities)

∂g(λ)/∂λ_i = g_i(x*(λ)); for constraints Ax ≤ b this is (A x*(λ) - b)_i

Explanation: The gradient of the dual with respect to an inequality multiplier equals the corresponding constraint residual evaluated at the minimizer x*(λ).

SVM Dual

maximize Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0

Explanation: The dual of soft-margin SVM is a QP with simple box constraints and one equality, enabling kernel methods and sparse solutions.

Duality Gap

f(x) - g(λ, ν) ≥ 0 for any primal-feasible x and any λ ≥ 0, ν

Explanation: The difference between a feasible primal value and a dual value is always nonnegative. A zero gap indicates optimality under strong duality.

Slater's Condition

∃ x̄ with g_i(x̄) < 0 for all i and h_j(x̄) = 0 for all j ⇒ d* = p*

Explanation: If a strictly feasible point exists for convex problems with affine equalities, then strong duality holds and KKT conditions are sufficient.

Complexity Analysis

Optimizing the dual typically involves repeatedly evaluating the dual function and its gradient (or subgradient) and enforcing dual feasibility. For problems where the Lagrangian minimizer x*(λ, ν) has a closed form or can be computed by solving a linear system, each dual iteration's cost is dominated by that inner solve. For a convex quadratic program with n variables and m linear inequality constraints, evaluating x*(λ) requires solving Q x = -(c + Aᵀλ). Using a dense direct solver is O(n^3) time and O(n^2) space; computing the gradient ∇g(λ) = A x*(λ) - b is O(mn) time and O(m) space. Projected gradient ascent on the dual typically needs O(1/ε) iterations to reach ε-suboptimality without acceleration (subgradient variants need O(1/ε^2)), yielding O((n^3 + mn)/ε) time. If Q is fixed, a factorization (e.g., Cholesky) can be precomputed once in O(n^3) time and reused, reducing per-iteration time to O(n^2 + mn). Memory usage is dominated by storing Q (O(n^2)), A (O(mn)), and vectors (O(n + m)). In SVM duals with a dense kernel matrix K of size n × n, a naive gradient step costs O(n^2) time and O(n^2) space; specialized methods like SMO exploit structure to achieve practical efficiency and low memory via working sets. When constraints decompose or the problem is sparse, dual ascent can be implemented with near-linear per-iteration cost using iterative solvers (e.g., conjugate gradients) and sparse algebra, significantly improving scalability while maintaining the duality-gap certificate for stopping.
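
To make the precomputed-factorization point concrete, here is a minimal Cholesky sketch (function names are illustrative, not from the original text; it assumes SPD Q as stated above). Factor once in O(n^3), then reuse each O(n^2) solve inside the dual loop:

#include <cmath>
#include <cstdio>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Cholesky factorization Q = L L^T for symmetric positive definite Q.
// Computed once in O(n^3); each later solve is O(n^2).
Matrix cholesky(const Matrix& Q) {
    int n = (int)Q.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double s = Q[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                if (s <= 0.0) throw std::runtime_error("Q is not positive definite");
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    }
    return L;
}

// Solve Q x = rhs using the precomputed factor: forward then backward substitution.
std::vector<double> choleskySolve(const Matrix& L, std::vector<double> rhs) {
    int n = (int)L.size();
    for (int i = 0; i < n; ++i) {            // forward: L y = rhs
        for (int k = 0; k < i; ++k) rhs[i] -= L[i][k] * rhs[k];
        rhs[i] /= L[i][i];
    }
    for (int i = n - 1; i >= 0; --i) {       // backward: L^T x = y
        for (int k = i + 1; k < n; ++k) rhs[i] -= L[k][i] * rhs[k];
        rhs[i] /= L[i][i];
    }
    return rhs;
}

int main() {
    Matrix Q = {{4.0, 1.0}, {1.0, 3.0}};
    Matrix L = cholesky(Q);                              // factor once
    std::vector<double> x = choleskySolve(L, {1.0, 2.0}); // reuse per iteration
    std::printf("x = (%.6f, %.6f)\n", x[0], x[1]);
    return 0;
}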

Code Examples

KKT verification via dual ascent on a 1D convex problem (active constraint)
#include <bits/stdc++.h>
using namespace std;

// Primal: minimize f(x) = (x - 3)^2 subject to x >= 5.
// Inequality in standard form: g(x) = 5 - x <= 0.
// Lagrangian: L(x, lambda) = (x-3)^2 + lambda * (5 - x), lambda >= 0.
// For fixed lambda, minimizer over x: d/dx L = 2(x-3) - lambda = 0 => x*(lambda) = 3 + lambda/2.
// Dual function: g(lambda) = L(x*(lambda), lambda) = -lambda^2/4 + 2*lambda (concave parabola).
// Optimal dual: lambda* = 4, x* = 5, p* = d* = 4.

struct Result {
    double lambda_opt;
    double x_opt;
    double primal_value;
    double dual_value;
};

// Compute x*(lambda) and dual quantities
static inline double x_star(double lambda) {
    return 3.0 + 0.5 * lambda; // x*(lambda) = 3 + lambda/2
}

static inline double Lagrangian(double x, double lambda) {
    double f = (x - 3.0) * (x - 3.0);
    double g = 5.0 - x; // inequality value
    return f + lambda * g;
}

static inline double dual_value(double lambda) {
    double x = x_star(lambda);
    return Lagrangian(x, lambda);
}

static inline double dual_gradient(double lambda) {
    // For inequality constraints, the dual gradient is g(x*(lambda))
    double x = x_star(lambda);
    return 5.0 - x; // = 5 - (3 + lambda/2) = 2 - lambda/2
}

Result solve_dual_projected_gradient(int iters = 200, double alpha = 0.5) {
    double lambda = 0.0; // start dual-feasible (lambda >= 0)
    for (int t = 0; t < iters; ++t) {
        double grad = dual_gradient(lambda);
        // The dual is a smooth concave quadratic with curvature 1/2, so any
        // fixed step below 2 / (1/2) = 4 converges geometrically; a diminishing
        // schedule also works but needs far more iterations.
        lambda += alpha * grad;         // gradient ascent step on concave dual
        if (lambda < 0.0) lambda = 0.0; // projection onto lambda >= 0
    }
    double xopt = x_star(lambda);
    // Primal and dual values
    double p = (xopt - 3.0) * (xopt - 3.0); // since x* is feasible at optimum
    double d = dual_value(lambda);
    return {lambda, xopt, p, d};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Result r = solve_dual_projected_gradient();

    cout << fixed << setprecision(6);
    cout << "Approx dual lambda*: " << r.lambda_opt << "\n";
    cout << "Primal x*(lambda):   " << r.x_opt << "\n";
    cout << "Primal value p(x):   " << r.primal_value << "\n";
    cout << "Dual value g(lambda): " << r.dual_value << "\n";
    cout << "Duality gap p - d:   " << (r.primal_value - r.dual_value) << "\n\n";

    // Verify KKT conditions numerically
    double x = r.x_opt;
    double lambda = r.lambda_opt;
    double stationarity = fabs(2.0 * (x - 3.0) - lambda); // |f'(x) + lambda * g'(x)| with g'(x) = -1
    double primal_feas = 5.0 - x; // should be <= 0
    double dual_feas = lambda;    // should be >= 0
    double comp_slack = lambda * (5.0 - x);

    cout << "KKT checks:" << "\n";
    cout << "  Stationarity |2(x-3) - lambda| = " << stationarity << "\n";
    cout << "  Primal feas g(x)=5-x = " << primal_feas << " (<= 0)\n";
    cout << "  Dual feas lambda = " << dual_feas << " (>= 0)\n";
    cout << "  Complementary slackness = " << comp_slack << " (≈ 0)\n";

    // Closed-form optimum for reference
    cout << "\nClosed-form: lambda*=4, x*=5, p*=4, d*=4\n";
    return 0;
}

This program maximizes the dual function for a 1D convex problem using projected gradient ascent on the nonnegative multiplier. It computes x*(lambda) in closed form, uses the property that the dual gradient equals the constraint residual, and verifies KKT conditions and the duality gap. The optimal solution has an active constraint (x* = 5) and positive multiplier (lambda* = 4).

Time: O(T), where T is the number of gradient steps (all operations are O(1) per step). Space: O(1).

Dual projected gradient for a small convex QP: min 0.5 x^T Q x + c^T x s.t. A x <= b
#include <bits/stdc++.h>
using namespace std;

// Solve a small convex QP via dual ascent.
// Primal: minimize 0.5 x^T Q x + c^T x subject to A x <= b (inequalities only).
// Assumptions: Q is symmetric positive definite (SPD) to ensure unique x*(lambda).
// Dual ascent uses: x*(lambda) = argmin_x L = solve(Q x = -(c + A^T lambda)).
// Dual gradient: grad(lambda) = A x*(lambda) - b. Project onto lambda >= 0.

// Utility: Solve linear system M x = rhs with Gaussian elimination and partial pivoting.
// Works for small dense matrices; for SPD Q, a Cholesky would be better in practice.
vector<double> solveLinearSystem(vector<vector<double>> M, vector<double> rhs) {
    int n = (int)M.size();
    for (int i = 0; i < n; ++i) M[i].push_back(rhs[i]); // augment
    // Forward elimination with partial pivoting
    for (int col = 0; col < n; ++col) {
        int piv = col;
        for (int r = col + 1; r < n; ++r)
            if (fabs(M[r][col]) > fabs(M[piv][col])) piv = r;
        if (fabs(M[piv][col]) < 1e-12) throw runtime_error("Singular matrix detected");
        if (piv != col) swap(M[piv], M[col]);
        double diag = M[col][col];
        for (int j = col; j <= n; ++j) M[col][j] /= diag;
        for (int r = col + 1; r < n; ++r) {
            double f = M[r][col];
            if (fabs(f) < 1e-18) continue;
            for (int j = col; j <= n; ++j) M[r][j] -= f * M[col][j];
        }
    }
    // Back substitution
    vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double s = M[i][n];
        for (int j = i + 1; j < n; ++j) s -= M[i][j] * x[j];
        x[i] = s; // since diagonal is 1 after normalization
    }
    return x;
}

struct QP {
    vector<vector<double>> Q; // n x n SPD
    vector<double> c;         // n
    vector<vector<double>> A; // m x n
    vector<double> b;         // m
};

vector<double> matVec(const vector<vector<double>>& M, const vector<double>& x) {
    int r = (int)M.size();
    int c = (int)M[0].size();
    vector<double> y(r, 0.0);
    for (int i = 0; i < r; ++i) {
        double s = 0.0;
        for (int j = 0; j < c; ++j) s += M[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

vector<double> vecAdd(const vector<double>& a, const vector<double>& b, double beta = 1.0) {
    int n = (int)a.size();
    vector<double> r(n);
    for (int i = 0; i < n; ++i) r[i] = a[i] + beta * b[i];
    return r;
}

vector<vector<double>> transpose(const vector<vector<double>>& M) {
    int r = (int)M.size(), c = (int)M[0].size();
    vector<vector<double>> T(c, vector<double>(r));
    for (int i = 0; i < r; ++i)
        for (int j = 0; j < c; ++j)
            T[j][i] = M[i][j];
    return T;
}

// Compute x*(lambda): solve Q x = -(c + A^T lambda)
vector<double> x_star(const QP& qp, const vector<double>& lambda) {
    auto AT = transpose(qp.A);
    vector<double> ATlam = matVec(AT, lambda);
    vector<double> rhs = vecAdd(qp.c, ATlam, 1.0); // c + A^T lambda
    for (double& v : rhs) v = -v; // -(c + A^T lambda)
    return solveLinearSystem(qp.Q, rhs);
}

// Lagrangian at (x, lambda): 0.5 x^T Q x + c^T x + lambda^T (A x - b)
double Lagrangian(const QP& qp, const vector<double>& x, const vector<double>& lambda) {
    int n = (int)x.size();
    // 0.5 x^T Q x
    double quad = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            quad += 0.5 * x[i] * qp.Q[i][j] * x[j];
    // c^T x
    double lin = 0.0;
    for (int i = 0; i < n; ++i) lin += qp.c[i] * x[i];
    // lambda^T (A x - b)
    vector<double> Ax = matVec(qp.A, x);
    double pen = 0.0;
    for (int i = 0; i < (int)lambda.size(); ++i) pen += lambda[i] * (Ax[i] - qp.b[i]);
    return quad + lin + pen;
}

struct DualResult {
    vector<double> lambda;
    vector<double> x;
    double primal_value;
    double dual_value;
};

DualResult dual_projected_gradient(const QP& qp, int iters = 500, double alpha0 = 0.2) {
    int m = (int)qp.A.size();
    vector<double> lambda(m, 0.0);
    vector<double> x(qp.c.size(), 0.0);
    for (int t = 0; t < iters; ++t) {
        x = x_star(qp, lambda);
        vector<double> Ax = matVec(qp.A, x);
        // gradient = A x - b
        vector<double> grad(m);
        for (int i = 0; i < m; ++i) grad[i] = Ax[i] - qp.b[i];
        double alpha = alpha0 / sqrt(1.0 + t); // diminishing step size
        for (int i = 0; i < m; ++i) {
            lambda[i] = lambda[i] + alpha * grad[i]; // ascent on concave dual
            if (lambda[i] < 0.0) lambda[i] = 0.0;    // project onto lambda >= 0
        }
    }
    x = x_star(qp, lambda);
    // primal value: 0.5 x^T Q x + c^T x
    double quad = 0.0;
    int n = (int)x.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            quad += 0.5 * x[i] * qp.Q[i][j] * x[j];
    double lin = 0.0;
    for (int i = 0; i < n; ++i) lin += qp.c[i] * x[i];
    double p = quad + lin;
    double d = Lagrangian(qp, x, lambda); // = g(lambda) since x = argmin L(., lambda)
    return {lambda, x, p, d};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example QP (n=2, m=2) with Slater's condition satisfied.
    QP qp;
    qp.Q = {{2.0, 0.0}, {0.0, 1.0}};  // SPD
    qp.c = {-2.0, -5.0};              // linear term
    qp.A = {{1.0, 2.0}, {-1.0, 2.0}}; // constraints A x <= b
    qp.b = {4.0, 2.0};

    auto res = dual_projected_gradient(qp, 600, 0.25);

    cout << fixed << setprecision(6);
    cout << "lambda*: [" << res.lambda[0] << ", " << res.lambda[1] << "]\n";
    cout << "x*: [" << res.x[0] << ", " << res.x[1] << "]\n";
    cout << "Primal value p*: " << res.primal_value << "\n";
    cout << "Dual value d*: " << res.dual_value << "\n";
    cout << "Duality gap : " << (res.primal_value - res.dual_value) << "\n\n";

    // Check feasibility and complementary slackness
    vector<double> Ax = matVec(qp.A, res.x);
    cout << "A x - b = [" << (Ax[0] - qp.b[0]) << ", " << (Ax[1] - qp.b[1]) << "] (should be <= 0)\n";
    double cs = 0.0;
    for (int i = 0; i < (int)res.lambda.size(); ++i) cs += res.lambda[i] * (Ax[i] - qp.b[i]);
    cout << "Sum lambda_i * (A x - b)_i = " << cs << " (≈ 0 at optimum)\n";

    return 0;
}

This program solves the dual of a small convex quadratic program using projected gradient ascent on the dual variables. Each iteration solves a linear system for x*(lambda), computes the dual gradient as the constraint violations, takes an ascent step, and projects onto lambda >= 0. At convergence (under Slater’s condition), the primal solution x*(lambda*) is optimal and the duality gap is near zero, validating strong duality and KKT.

Time: O(T (n^3 + m n)) for T iterations, using a dense direct solver (Gaussian elimination) per iteration. Space: O(n^2 + m n) to store Q, A, and auxiliary vectors.
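
Building on the SVM Dual formula above, the sketch below is an added illustration (the toy data and all names are invented here, not from the original text). It uses the common no-bias simplification: dropping the bias term removes the equality constraint sum_i alpha_i y_i = 0, leaving only the box constraints 0 <= alpha_i <= C, so projection is a coordinate-wise clip. Production solvers such as SMO keep the equality constraint and update pairs of multipliers instead.

SVM dual (no-bias simplification): projected gradient ascent with box constraints
#include <bits/stdc++.h>
using namespace std;

// Soft-margin SVM dual WITHOUT a bias term (a common simplification):
//   maximize sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)
//   subject to 0 <= alpha_i <= C.
// Dropping the bias removes the equality constraint sum_i alpha_i y_i = 0,
// so the feasible set is a box and projection is a simple clip.

int main() {
    // Tiny illustrative 1D dataset with a linear kernel K(u, v) = u * v.
    vector<double> X = {-2.0, -1.0, 1.0, 2.0};
    vector<double> y = {-1.0, -1.0, 1.0, 1.0};
    int n = (int)X.size();
    double C = 1.0;

    // Precompute Q_ij = y_i y_j K(x_i, x_j).
    vector<vector<double>> Q(n, vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Q[i][j] = y[i] * y[j] * X[i] * X[j];

    vector<double> alpha(n, 0.0);
    double step = 0.05; // below 2/||Q|| for this data, so ascent converges
    for (int t = 0; t < 2000; ++t) {
        for (int i = 0; i < n; ++i) {
            // Gradient of the dual objective: 1 - (Q alpha)_i
            double g = 1.0;
            for (int j = 0; j < n; ++j) g -= Q[i][j] * alpha[j];
            alpha[i] = clamp(alpha[i] + step * g, 0.0, C); // ascend, then clip to the box
        }
    }

    // Recover the (bias-free) primal weight w = sum_i alpha_i y_i x_i.
    double w = 0.0;
    for (int i = 0; i < n; ++i) w += alpha[i] * y[i] * X[i];

    cout << fixed << setprecision(4) << "alpha: ";
    for (double a : alpha) cout << a << " ";
    cout << "\nw = " << w << " (decision rule: sign(w * x))\n";
    return 0;
}

This sketch maximizes the simplified SVM dual by coordinate-wise projected gradient ascent: each update ascends along the dual gradient 1 - (Q alpha)_i and clips back to [0, C]. On this toy data the outer points receive zero multipliers while the two points nearest the boundary become support vectors, illustrating the sparsity the dual exposes.

Time: O(T n^2) for T sweeps over a dense Q. Space: O(n^2) to store Q.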