Surrogate Loss Theory
Key Points
- 0-1 loss directly measures classification error but is discontinuous and non-convex, making optimization computationally hard.
- Convex surrogate losses (like hinge, logistic, or exponential) replace the 0-1 loss with a smooth or piecewise-linear upper bound that is easy to optimize.
- Minimizing a well-chosen convex surrogate can guarantee that you also minimize classification error asymptotically; this property is called classification calibration.
- Margin-based surrogates operate on the margin y f(x), where y ∈ {−1,1} and f(x) is a real-valued score; larger margins imply lower loss.
- Logistic loss supports probability estimation via the sigmoid link, while hinge loss emphasizes a large margin without calibrated probabilities.
- Empirical Risk Minimization with convex surrogates enables gradient-based algorithms like SGD to train linear classifiers efficiently.
- Regularization (e.g., L2) is typically combined with surrogate losses to control overfitting and improve generalization.
- Choosing the right surrogate depends on your goals: calibrated probabilities (logistic), robust large-margin classification (hinge), or fast boosting (exponential).
Prerequisites
- Convex functions and optimization — Surrogate losses rely on convexity to guarantee tractable optimization and global minima.
- Probability and expectation — Risk and surrogate risk are expectations; calibration and Bayes rules use conditional probabilities.
- Linear models and dot products — Most surrogate examples use linear scores f(x) = w^T x + b.
- Gradients and subgradients — Training with convex surrogates uses (sub)gradient-based methods like SGD.
- Regularization (L2/L1) — Pairing surrogates with regularization improves generalization and sometimes adds strong convexity.
Detailed Explanation
01 Overview
Surrogate Loss Theory studies how to replace hard-to-optimize objective functions—especially the 0-1 classification loss—with easier, convex alternatives that preserve the goal of accurate classification. The 0-1 loss simply asks whether a prediction is right or wrong, assigning 0 to correct decisions and 1 to mistakes. While ideal for measuring accuracy, it creates a jagged optimization landscape with flat regions and discontinuous jumps. As a result, directly minimizing 0-1 loss is generally NP-hard even for linear classifiers. Convex surrogate losses, such as hinge, logistic, and exponential losses, provide smooth or piecewise-linear upper bounds to the 0-1 loss. They transform the problem into convex optimization, where global minima can be found efficiently with gradient methods. The central theoretical question is calibration: does minimizing the surrogate risk also minimize the true classification error? Surrogate Loss Theory answers this by characterizing conditions—like margin-based convexity and proper composite structures—under which consistency holds and by quantifying how fast excess surrogate risk translates into excess 0-1 risk. In practice, these surrogates underpin SVMs (hinge), logistic regression and cross-entropy training (logistic), and boosting algorithms (exponential), making large-scale classification feasible.
02 Intuition & Analogies
Imagine teaching a robot to sort apples and oranges by color using a dial that controls a threshold. The 0-1 loss is like a buzzer that screams only when the robot misclassifies—silent otherwise. If you try to tune the dial using just the buzzer, you’ll hear silence most of the time and sudden blasts when you cross unfortunate thresholds. There is no gentle feedback telling you which way to turn the dial to improve. Surrogate losses replace the buzzer with a dimmer light that brightens as mistakes become more likely. For margin-based losses, the light depends on how confidently the robot is classifying: if an apple is far on the “apple” side, the light is dim (low loss); near the boundary, it’s brighter (higher loss); on the wrong side, it’s very bright (large loss). This graded feedback enables smooth adjustments via gradient methods. Another analogy: hiking down a mountain to reach the lowest valley. The 0-1 landscape is like a plateaued staircase with cliffs—you can walk for miles on a flat ledge (no change in loss) and suddenly hit a drop (a misclassification flips). You can’t tell which direction descends because the surface is flat almost everywhere. Convex surrogates reshape the terrain into a gentle bowl leading to the valley, so following the slope (the gradient) reliably takes you to the bottom. Different surrogates sculpt different bowls: hinge creates a V-shaped ramp with a flat bottom (margin satisfied), logistic makes a smooth bowl with calibrated probability outputs, and exponential is steeper, which boosting exploits to focus quickly on hard examples.
03 Formal Definition
Let (X, Y) be drawn from a distribution over features and labels Y ∈ {−1, 1}, and let f be a real-valued score function with prediction sign(f(x)). The 0-1 risk is R(f) = E[1[Y f(X) ≤ 0]], with Bayes risk R* = inf_f R(f). A margin-based surrogate is a function φ: ℝ → [0, ∞) applied to the margin z = y f(x), giving the surrogate risk R_φ(f) = E[φ(Y f(X))] with optimal value R_φ*. The surrogate φ is classification-calibrated if, for every x with η(x) = P(Y = 1 | X = x) ≠ 1/2, any minimizer of the conditional surrogate risk η(x) φ(t) + (1 − η(x)) φ(−t) has the same sign as 2η(x) − 1, i.e., agrees with the Bayes decision. For convex φ, a classical result states that φ is classification-calibrated if and only if it is differentiable at 0 with φ′(0) < 0; hinge, logistic, and exponential losses all satisfy this condition.
04 When to Use
Use convex surrogate losses whenever you need to train classifiers efficiently, especially with linear or deep models at scale. If you want calibrated probabilities for decisions, thresholds, or interpretability, prefer logistic loss (cross-entropy) because it aligns with the sigmoid/softmax link and supports probability estimation. If you want a large-margin classifier that emphasizes separation and can be robust to label noise near the boundary, hinge loss (SVM-style) is a good match. If your workflow involves boosting (e.g., AdaBoost), exponential loss is standard because it amplifies misclassified points aggressively. Choose surrogates when 0-1 loss is too hard to optimize (which is typical) and you require gradient-based methods like SGD or Adam. They also pair naturally with regularization (L2/L1) to control complexity. For imbalanced data, you can use cost-sensitive or class-weighted versions of these surrogates to shift the decision boundary. In multi-class settings, use the softmax cross-entropy (a convex proper composite surrogate) or multi-class hinge. Avoid direct 0-1 optimization unless problem sizes are tiny or specialized combinatorial methods are available. When strict probability calibration is critical (risk-sensitive decisions, medical diagnosis), prefer logistic and perform post-hoc calibration checks (e.g., reliability diagrams).
⚠️Common Mistakes
- Assuming any surrogate guarantees minimal 0-1 error: only classification-calibrated surrogates provide this consistency. Verify calibration and, if needed, use proper composite losses.
- Confusing score sign and class prediction: with labels in {−1,1}, predict y = sign(f(x)), but with {0,1}, you must map consistently or use a link function (e.g., sigmoid) and threshold at 0.5.
- Ignoring regularization: convex surrogates optimize easily, but without L2/L1 regularization you can overfit, especially with linearly separable data (hinge may push margins indefinitely).
- Expecting hinge loss to yield calibrated probabilities: hinge is margin-oriented; its scores are not probabilities. If you need probabilities, use logistic or apply post-hoc calibration (Platt scaling).
- Mishandling class imbalance: unweighted surrogates mimic average error; use class weights or focal/weighted variants to reflect asymmetric costs.
- Using a single fixed learning rate for all surrogates: logistic (smooth) may allow larger steps than hinge (nonsmooth); tune optimizers accordingly and consider using subgradients for hinge.
- Mixing feature scaling: convex optimization benefits from normalized features; poor scaling slows convergence and destabilizes SGD.
- Evaluating only surrogate loss: always report 0-1 accuracy and, if relevant, calibration metrics; a small surrogate risk does not always imply optimal finite-sample accuracy.
Key Formulas
0-1 Loss
ℓ₀₋₁(y, f(x)) = 1[y f(x) ≤ 0]
Explanation: This counts a mistake whenever the signed margin y f(x) is non-positive. It directly measures classification error but is discontinuous and non-convex.
Margin-Based Surrogate
ℓ(y, f(x)) = φ(y f(x)) = φ(z), where z = y f(x)
Explanation: The loss depends only on the margin z = y f(x). Larger positive margins reduce loss, while negative margins incur large penalties, guiding optimization smoothly.
Hinge Loss
φ(z) = max(0, 1 − z)
Explanation: This convex, piecewise-linear loss is zero once the margin exceeds 1, encouraging a large safety margin as in SVMs.
Logistic Loss
φ(z) = log(1 + exp(−z))
Explanation: A smooth convex loss that connects to probability estimation through the sigmoid link. It penalizes large negative margins approximately linearly (far more gently than the exponential loss) while being differentiable everywhere.
Exponential Loss
φ(z) = exp(−z)
Explanation: A convex loss used in boosting that strongly emphasizes misclassified examples with negative margins.
Risk and Surrogate Risk
R(f) = E[1[Y f(X) ≤ 0]],  R_φ(f) = E[φ(Y f(X))]
Explanation: These are expected losses over the data distribution. Minimizing the surrogate risk is a tractable proxy for minimizing the 0-1 risk.
Conditional Surrogate Minimizer
f*_φ(x) = argmin_t [ η(x) φ(t) + (1 − η(x)) φ(−t) ], where η(x) = P(Y = 1 | X = x)
Explanation: For each feature vector x with class probability η(x), this is the score t that minimizes the expected surrogate loss at x. Its sign should match the Bayes decision if the surrogate is calibrated.
Calibration (Excess Risk) Inequality
ψ(R(f) − R*) ≤ R_φ(f) − R_φ*
Explanation: This inequality bounds the excess 0-1 risk by a nondecreasing function ψ (with ψ(0) = 0) of the excess surrogate risk, formalizing how surrogate optimization translates to real classification accuracy.
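For concreteness, the ψ-transforms worked out in Bartlett, Jordan, and McAuliffe's analysis give familiar special cases; these are standard results summarized here (constants hedged rather than claimed exactly):

```latex
\psi\bigl(R(f) - R^*\bigr) \;\le\; R_\phi(f) - R_\phi^*,
\qquad
\psi_{\text{hinge}}(\theta) = |\theta|,
\qquad
\psi_{\exp}(\theta) = 1 - \sqrt{1 - \theta^2}.
```

Inverting these transforms, the hinge loss bounds the excess 0-1 risk linearly by the excess surrogate risk, while for exponential (and similarly logistic) loss the rate is a square root, R(f) − R* ≤ C √(R_φ(f) − R_φ*) for some constant C.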
Regularized ERM with Surrogates
min_{w,b} (1/n) Σᵢ φ(yᵢ (w^T xᵢ + b)) + λ‖w‖²
Explanation: A standard convex objective for linear classifiers combines a surrogate loss with L2 regularization to control complexity and improve generalization.
Logistic Gradient
∇_w ℓ = −y x / (1 + exp(y (w^T x + b)))
Explanation: The logistic loss has a smooth gradient, enabling efficient optimization with gradient descent or SGD.
SGD/Subgradient Update
w ← w − η (φ′(z) y x + 2λ w),  b ← b − η φ′(z) y,  where z = y (w^T x + b)
Explanation: Generic per-example update for convex surrogate losses with L2 regularization. For hinge loss, use a subgradient; for logistic, use the gradient.
Complexity Analysis
For a linear model with d features trained on n examples, evaluating a surrogate loss and its (sub)gradient on one example takes O(d) time, so a full SGD epoch costs O(nd) time and the model needs only O(d) memory. Batch gradient descent likewise costs O(nd) per iteration. Smooth surrogates such as logistic loss admit standard convex-optimization convergence rates, and adding L2 regularization makes the objective strongly convex, which further speeds convergence. By contrast, exactly minimizing the 0-1 loss is NP-hard in general, even for linear classifiers, which is precisely why convex surrogates are used.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Utility: sign function returning {-1, 1}
int sign01(double s) { return (s >= 0.0) ? 1 : -1; }

// Surrogate interface via function objects
struct Surrogate {
    virtual ~Surrogate() = default;
    // loss on margin z = y * (w^T x + b)
    virtual double loss(double z) const = 0;
    // derivative w.r.t. z (subgradient for hinge)
    virtual double dloss(double z) const = 0;
    virtual const char* name() const = 0;
};

struct HingeLoss : Surrogate {
    double loss(double z) const override { return max(0.0, 1.0 - z); }
    double dloss(double z) const override {
        if (z > 1.0) return 0.0;   // margin satisfied
        if (z < 1.0) return -1.0;  // subgradient anywhere z < 1
        return -0.5;               // arbitrary subgradient at the kink
    }
    const char* name() const override { return "hinge"; }
};

struct LogisticLoss : Surrogate {
    double loss(double z) const override { return log1p(exp(-z)); }
    double dloss(double z) const override { return -1.0 / (1.0 + exp(z)); }
    const char* name() const override { return "logistic"; }
};

struct Example { vector<double> x; int y; };

struct SGDConfig {
    int epochs = 10;
    double eta0 = 0.1;     // initial learning rate
    double lambda = 1e-4;  // L2 regularization strength
    int seed = 42;
    bool shuffle_each_epoch = true;
};

struct LinearModel {
    vector<double> w; double b = 0.0;
    explicit LinearModel(int d = 0) : w(d, 0.0), b(0.0) {}
    double score(const vector<double>& x) const {
        double s = b;
        for (size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
        return s;
    }
    int predict(const vector<double>& x) const { return sign01(score(x)); }
};

// Train with SGD on the chosen surrogate
void train_sgd(LinearModel& model, const vector<Example>& data,
               const Surrogate& phi, const SGDConfig& cfg) {
    int n = (int)data.size(); int d = (int)model.w.size();
    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
    mt19937 rng(cfg.seed);
    for (int e = 0; e < cfg.epochs; ++e) {
        if (cfg.shuffle_each_epoch) shuffle(idx.begin(), idx.end(), rng);
        double eta = cfg.eta0 / (1.0 + 0.1 * e);  // simple decay
        for (int ii = 0; ii < n; ++ii) {
            const Example& ex = data[idx[ii]];
            // Compute margin z = y * (w^T x + b)
            double s = model.score(ex.x);
            double z = ex.y * s;
            // Sub/gradient of loss w.r.t. z
            double g = phi.dloss(z);
            // Chain rule: dL/dw = g * d(z)/dw = g * y * x
            double factor = g * ex.y;
            // Apply L2 regularization gradient 2*lambda*w
            for (int j = 0; j < d; ++j)
                model.w[j] -= eta * (factor * ex.x[j] + 2.0 * cfg.lambda * model.w[j]);
            model.b -= eta * factor;  // bias not regularized here
        }
        // Optional: report epoch stats
        double avg_loss = 0.0; int correct = 0;
        for (const auto& ex : data) {
            double z = ex.y * model.score(ex.x);
            avg_loss += phi.loss(z);
            correct += (sign01(model.score(ex.x)) == ex.y);
        }
        avg_loss /= n;
        cout << "Epoch " << e + 1 << ": surrogate=" << phi.name()
             << ", avg_loss=" << avg_loss
             << ", acc=" << (100.0 * correct / n) << "%\n";
    }
}

// Simple synthetic separable dataset in 2D
vector<Example> make_toy(int n_per_class = 100, int seed = 0) {
    mt19937 rng(seed); normal_distribution<double> N(0.0, 1.0);
    vector<Example> data; data.reserve(2 * n_per_class);
    // Class +1 centered at (2,2); class -1 centered at (-2,-2)
    for (int i = 0; i < n_per_class; ++i) {
        vector<double> x = {2.0 + N(rng), 2.0 + N(rng)}; data.push_back({x, +1});
    }
    for (int i = 0; i < n_per_class; ++i) {
        vector<double> x = {-2.0 + N(rng), -2.0 + N(rng)}; data.push_back({x, -1});
    }
    return data;
}

int main() {
    auto data = make_toy();
    int d = 2;
    LinearModel model_h(d), model_l(d);
    HingeLoss hinge; LogisticLoss logi;
    SGDConfig cfg; cfg.epochs = 10; cfg.eta0 = 0.1;

    cout << "Training with hinge loss\n";
    train_sgd(model_h, data, hinge, cfg);

    cout << "\nTraining with logistic loss\n";
    train_sgd(model_l, data, logi, cfg);

    return 0;
}
```
This program trains a linear classifier using SGD with either hinge or logistic surrogate loss. The loss is applied to the margin z = y (w^T x + b). For hinge, we use a subgradient; for logistic, a smooth gradient. A small synthetic dataset illustrates learning progress through epoch-wise average surrogate loss and accuracy. L2 regularization stabilizes training and encourages generalization.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sigmoid function for probability mapping
static inline double sigmoid(double t) { return 1.0 / (1.0 + exp(-t)); }

struct Example { vector<double> x; int y; };

struct LinearModel {
    vector<double> w; double b = 0.0;
    double score(const vector<double>& x) const {
        double s = b;
        for (size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
        return s;
    }
};

// Compute 0-1 error, average logistic loss, and Brier score (calibration proxy)
struct Metrics { double acc, zero_one, logloss, brier; };

Metrics evaluate(const LinearModel& m, const vector<Example>& data) {
    double correct = 0, logloss = 0, brier = 0; int n = (int)data.size();
    for (const auto& ex : data) {
        double s = m.score(ex.x);
        double p = sigmoid(s);
        int y01 = (ex.y == 1) ? 1 : 0;   // {0,1} label for probability metrics
        int yhat = (p >= 0.5) ? 1 : -1;  // decision threshold
        correct += (yhat == ex.y);
        // logistic (cross-entropy) on {0,1}, clamped for numerical safety
        p = min(max(p, 1e-12), 1.0 - 1e-12);
        logloss += -(y01 * log(p) + (1 - y01) * log(1 - p));
        // Brier score: squared error of probabilities
        brier += (p - y01) * (p - y01);
    }
    Metrics M;
    M.acc = correct / n; M.zero_one = 1.0 - M.acc;
    M.logloss = logloss / n; M.brier = brier / n;
    return M;
}

int main() {
    // A tiny linearly separable dataset in 1D for demonstration
    vector<Example> data;
    for (int i = 0; i < 10; ++i) data.push_back({{-2.0 + 0.1 * i}, -1});
    for (int i = 0; i < 10; ++i) data.push_back({{1.0 + 0.1 * i}, 1});

    // A hand-crafted linear model
    LinearModel m; m.w = {2.0}; m.b = -0.5;  // score = 2x - 0.5

    auto M = evaluate(m, data);
    cout << fixed << setprecision(4);
    cout << "Accuracy: " << M.acc << "\n";
    cout << "0-1 loss: " << M.zero_one << "\n";
    cout << "Average logistic (cross-entropy) loss: " << M.logloss << "\n";
    cout << "Brier score (probability calibration): " << M.brier << "\n";

    // Show probabilities for a few points
    for (double x : {-1.5, -0.5, 0.0, 0.5, 1.5}) {
        double s = m.w[0] * x + m.b;
        double p = sigmoid(s);
        cout << "x=" << setw(5) << x << ", score=" << setw(6) << s
             << ", p(y=1|x)≈" << p
             << ", decision=" << ((p >= 0.5) ? 1 : -1) << "\n";
    }
    return 0;
}
```
This code demonstrates how logistic loss naturally yields probabilities via the sigmoid link: p = σ(f(x)). It evaluates a linear model with cross-entropy (logistic) loss and Brier score, illustrating the difference between calibrated probability metrics and the 0-1 error rate used for decisions.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Ex { double x; int y; };

double zero_one(int y, double s) { return (y * s <= 0.0) ? 1.0 : 0.0; }
double hinge(int y, double s)    { double z = y * s; return max(0.0, 1.0 - z); }
double logistic(int y, double s) { double z = y * s; return log1p(exp(-z)); }

int main() {
    // Tiny dataset, not perfectly separable
    vector<Ex> D = { {-2, -1}, {-1, -1}, {-0.5, -1}, {0.2, 1}, {0.5, 1}, {1.5, 1} };

    // Model: f(x) = w * x + b; fix b = 0 for simplicity and scan w
    double b = 0.0;
    double best_w_01 = 0, best_L_01 = 1e9;
    double best_w_h  = 0, best_L_h  = 1e9;
    double best_w_l  = 0, best_L_l  = 1e9;

    for (double w = -5.0; w <= 5.0; w += 0.01) {
        double L01 = 0, Lh = 0, Ll = 0;
        for (auto [x, y] : D) {
            double s = w * x + b;
            L01 += zero_one(y, s);
            Lh  += hinge(y, s);
            Ll  += logistic(y, s);
        }
        L01 /= D.size(); Lh /= D.size(); Ll /= D.size();
        if (L01 < best_L_01) { best_L_01 = L01; best_w_01 = w; }
        if (Lh  < best_L_h ) { best_L_h  = Lh;  best_w_h  = w; }
        if (Ll  < best_L_l ) { best_L_l  = Ll;  best_w_l  = w; }
    }

    cout << fixed << setprecision(4);
    cout << "Best w by 0-1 loss:      " << best_w_01 << ", loss=" << best_L_01 << "\n";
    cout << "Best w by hinge loss:    " << best_w_h  << ", loss=" << best_L_h  << "\n";
    cout << "Best w by logistic loss: " << best_w_l  << ", loss=" << best_L_l  << "\n";

    // The 0-1 landscape is piecewise-constant with jumps (implicit here),
    // while hinge is piecewise-linear and logistic is smooth, enabling gradients.
    return 0;
}
```
By scanning a single parameter w, this program compares the empirical landscapes of 0-1, hinge, and logistic losses. The 0-1 risk is flat with sudden jumps, which defeats gradient-based methods, whereas hinge is piecewise-linear and logistic is smooth with a unique basin, explaining why convex surrogates are easy to optimize.