Universal Approximation Theorem
Key Points
- The Universal Approximation Theorem (UAT) says a feedforward neural network with one hidden layer and a non-polynomial activation (such as sigmoid or ReLU) can approximate any continuous function on a compact set as closely as we want.
- The result is existential: it guarantees that suitable weights exist but does not tell us how to find them or how many neurons are needed.
- Depth matters: while one hidden layer suffices in principle, shallow networks may need exponentially many neurons compared to deeper ones for the same function.
- The theorem uses the uniform norm on compact domains, meaning the approximation error is small everywhere on the input region.
- ReLU networks are especially powerful because any 1D piecewise-linear function can be written exactly as a sum of shifted ReLUs.
- Random-feature methods (fixing hidden weights, training only output weights) can already approximate many functions well, hinting at the breadth of the UAT.
- The UAT does not address training dynamics, data efficiency, or generalization; it is about representational capacity only.
- In practice, depth leverages compositionality, reusing features across layers to represent complex structure with fewer neurons.
Prerequisites
- Linear algebra — Underpins vector spaces, dot products w^T x, matrix equations in training, and feature mappings.
- Calculus and real analysis — Continuity, compactness, and limits are central to the theorem's statement and meaning.
- Probability and statistics — Needed for understanding random features, generalization, and error measurement.
- Optimization and gradient descent — Practical training of networks relies on these methods even though the UAT is existential.
- Neural network basics — Architecture, activations, and forward/backward passes are required to implement models.
- Numerical linear algebra — Solving normal equations and handling conditioning in random-feature training.
- Function approximation theory — Provides context for universality, rates, and the relation to splines and kernels.
Detailed Explanation
01 Overview
The Universal Approximation Theorem (UAT) is a cornerstone result in neural network theory. It states that a feedforward neural network with just a single hidden layer and a suitable non-polynomial activation function (e.g., sigmoid, tanh, ReLU) can approximate any continuous function defined on a compact set (such as a closed and bounded interval or box in Euclidean space) to arbitrary precision. More precisely, for any continuous target function and any tolerance level, there exists a network width (number of hidden neurons) and a set of weights such that the network’s output is uniformly close to the target function everywhere on the domain. This result, proved in various forms by Cybenko (1989) and Hornik (1991), provides a theoretical justification for using neural networks as flexible function approximators. However, it is crucially an existence theorem: it does not specify how many neurons are needed for a given accuracy, nor does it provide an algorithm to find the requisite weights efficiently. It also says nothing about learnability, data requirements, or generalization. Modern practice emphasizes depth: while a single hidden layer suffices in principle, deeper networks can represent many functions with exponentially fewer parameters by exploiting compositional structure. This explains why deep learning is effective in high-dimensional tasks, even though shallow networks are theoretically universal. The UAT thus reassures us that neural networks are expressive enough, while follow-up results help explain why certain architectures and depths are practical.
02 Intuition & Analogies
Imagine building shapes out of Lego bricks. If you have enough small bricks, you can approximate any curved shape by stacking them cleverly. In the same spirit, neural networks combine many simple building blocks—"bumps" or "hinges" created by activations—so that their weighted sums trace out the desired function’s shape. A sigmoid unit behaves like a smooth step; a ReLU behaves like a hinge that is flat on one side and linear on the other. By adding many such steps or hinges at the right positions and scales, you can sculpt almost any curve or surface. Another analogy is mixing paints to get a target color: each hidden neuron contributes a particular tint (a ridge-shaped feature along some direction), and the output layer mixes these tints to match the final hue (the target function). If you allow enough ingredients (neurons), you can match subtle shades (fine function details) more closely. Depth is like building with pre-assembled Lego substructures. You could approximate a castle wall brick by brick (a wide, shallow network), but it’s more efficient to first assemble windows, arches, and towers (features) and then combine them (a deep network). Each layer composes and refines features, capturing structure in fewer pieces. This is the essence of compositionality: many real-world functions can be described as compositions of simpler parts (edges → textures → objects), and deep networks naturally mirror this hierarchy. Thus, while the UAT says shallow networks can eventually do the job, adding depth lets us get there with far fewer bricks.
03 Formal Definition
Let σ be a continuous, non-polynomial activation function and let K ⊂ R^n be a compact set. For every continuous function f : K → R and every ε > 0, there exist a width H, weights w_i ∈ R^n, biases b_i ∈ R, output weights a_i ∈ R, and an output bias c such that the network N(x) = c + Σ_{i=1}^{H} a_i σ(w_i^T x + b_i) satisfies sup_{x ∈ K} |f(x) − N(x)| < ε. Equivalently, the set of single-hidden-layer networks with activation σ is dense in C(K) under the uniform norm. Cybenko (1989) proved this for sigmoidal activations, Hornik (1991) broadened the activation class, and Leshno et al. (1993) showed that being non-polynomial is exactly what is required.
04 When to Use
Use the UAT as a conceptual guarantee that your network architecture is expressive enough to represent your target mapping—especially when selecting activation functions and considering whether a single hidden layer could, in principle, do the job. It’s particularly relevant in function approximation scenarios (regression, system identification, control, PDE surrogates) on bounded input domains. When you adopt ReLU or sigmoid activations, the UAT reassures you that adding enough neurons can approximate the desired behavior. However, in practice you should lean on depth when you suspect the target function has compositional or hierarchical structure (vision, language, audio). Deeper models often achieve the same approximation with dramatically fewer neurons and parameters, improving statistical and computational efficiency. The UAT also motivates random-feature methods (fix hidden weights, learn output layer) or kernel approximations as practical ways to get good approximations quickly. Do not use the UAT to justify training feasibility, sample complexity, or generalization: it does not address optimization landscapes, data noise, or overfitting. Pair it with regularization, validation, and architectural bias (convolutions, attention, residual connections) to exploit structure and achieve learnability. Finally, remember the compact-domain assumption: rescale inputs or constrain operating ranges so the theoretical guarantee applies cleanly.
⚠️ Common Mistakes
- Equating existence with learnability: The UAT does not say gradient descent will find the approximating weights, nor that it will do so quickly. Remedy: separate expressivity claims from optimization guarantees and monitor training with validation curves.
- Ignoring domain compactness: The theorem assumes a compact domain (e.g., [0,1]^{n}). Remedy: normalize or bound inputs; avoid relying on the UAT to justify extrapolation.
- Choosing polynomial or linear activations: Purely linear or polynomial activations do not yield universality in the stated sense. Remedy: use non-polynomial activations like ReLU, sigmoid, tanh, or GELU.
- Overestimating shallow networks: While a single hidden layer is universal, width requirements can be exponential in input dimension for certain functions. Remedy: prefer deeper architectures for compositional tasks.
- Expecting rates from the UAT: The theorem is non-quantitative about approximation error versus width. Remedy: consult rate results (e.g., Barron bounds) or empirical validation when sizing models.
- Conflating approximation and generalization: A perfect fit on the training set does not imply good test performance. Remedy: use regularization, early stopping, and appropriate data splits.
- Neglecting parameter scaling: Very large weights can cause numerical instability. Remedy: standardize features and initialize weights carefully.
- Misapplying classification logic: The UAT for continuous functions differs from separability results in classification (e.g., universal classifiers under different settings). Remedy: be precise about the function class and norm (uniform vs. L^{p}).
Key Formulas
Universal Approximation (Uniform Norm)
For every f ∈ C(K) with K ⊂ R^n compact and every ε > 0, there exists a width H and parameters such that sup_{x ∈ K} |f(x) − N_H(x)| < ε.
Explanation: For any continuous function f on a compact set K and any tolerance ε > 0, there exists a single-hidden-layer network N_H whose output stays within ε of f everywhere on K.
Uniform Norm
||f||_∞ = sup_{x ∈ K} |f(x)|
Explanation: The infinity norm measures the maximum absolute value of a function over a set K. The UAT uses this to quantify approximation error uniformly over the domain.
Single-Hidden-Layer Model
N(x) = c + sum_{i=1}^{H} a_i σ(w_i^T x + b_i)
Explanation: This is the function class considered by the UAT: linear combinations of nonlinear ridge functions plus a bias term.
ReLU Spline Representation
g(x) = c_0 + c_1 x + sum_{j=1}^{K−1} a_j ReLU(x − t_j)
Explanation: Any 1D piecewise-linear function with breakpoints t_1 < … < t_{K−1} can be represented exactly by a sum of shifted ReLUs, showing constructive universality in 1D.
Ridge Regression for Random Features
(Z^T Z + λI) a = Z^T y
Explanation: Given hidden-layer features Z and targets y, the optimal output weights a in L2-regularized least squares are obtained by solving this linear system.
Barron Rate (Random Features)
E[ ||f − f_m||_{L^2}^2 ] = O(C_f^2 / m)
Explanation: For functions f in Barron space with Barron constant C_f, the expected squared L2 approximation error using m random features decays as O(C_f^2/m), i.e., the L2 error is O(C_f/√m) on average. This provides a quantitative rate absent in the UAT.
Parameter Count (Single Output)
P = H(n + 1) + H + 1 = H(n + 2) + 1
Explanation: A single-hidden-layer network with H neurons in n dimensions has H input-weight vectors (n entries each) with H biases, H output weights, and one output bias.
Forward-Pass Cost
O(Hn)
Explanation: Evaluating a single-hidden-layer network requires computing H dot products in n dimensions plus H activations and a final sum.
Linear Regions (Single ReLU Layer)
R(H, n) ≤ sum_{k=0}^{n} C(H, k)
Explanation: A layer of H ReLU units in n dimensions can partition input space into at most this many linear regions. Depth multiplies regions across layers, increasing expressivity.
Curse of Dimensionality (Grid Pieces)
k^n
Explanation: Approximating a function on a grid with k pieces per dimension needs k^n cells, which grows exponentially with the dimension n. Depth can mitigate this by compositional reuse.
Complexity Analysis
A forward pass of a single-hidden-layer network with H hidden units and n inputs costs O(Hn) operations. Full-batch gradient descent, as in the first example below, costs O(epochs · N · H) for N training samples in 1D. The random-feature approach in the second example costs O(N H^2) to form the normal equations and O(H^3) to solve them by Gaussian elimination, while the ReLU-spline construction in the third example takes O(K) to build and O(K) per evaluation with K segments.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple 1D single-hidden-layer neural network with sigmoid activation.
// We train it by vanilla gradient descent to approximate f(x) = sin(2*pi*x) on [0,1].

struct Net1D {
    int H;            // number of hidden units
    double c;         // output bias
    vector<double> w; // input weights (size H)
    vector<double> b; // hidden biases (size H)
    vector<double> a; // output weights (size H)

    Net1D(int H_, unsigned seed=42): H(H_), c(0.0), w(H_), b(H_), a(H_) {
        mt19937 rng(seed);
        normal_distribution<double> nd(0.0, 0.5); // small random init
        for (int i = 0; i < H; ++i) {
            w[i] = nd(rng);
            b[i] = nd(rng);
            a[i] = nd(rng) * 0.1; // smaller output weights
        }
    }

    static inline double sigmoid(double z) {
        if (z >= 0) {
            double ez = exp(-z);
            return 1.0 / (1.0 + ez);
        } else {
            double ez = exp(z);
            return ez / (1.0 + ez);
        }
    }

    // Forward pass for a single scalar x
    double forward(double x, vector<double>* h_out=nullptr) const {
        vector<double> h(H);
        for (int i = 0; i < H; ++i) {
            h[i] = sigmoid(w[i]*x + b[i]);
        }
        double y = c;
        for (int i = 0; i < H; ++i) y += a[i]*h[i];
        if (h_out) *h_out = move(h);
        return y;
    }
};

static inline double target(double x) {
    const double PI = acos(-1.0);
    return sin(2.0 * PI * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int H = 50;        // hidden units
    int N = 256;       // training samples
    int epochs = 2000; // training epochs
    double lr = 0.05;  // learning rate

    // Generate training data uniformly on [0,1]
    vector<double> xs(N), ys(N);
    for (int i = 0; i < N; ++i) {
        xs[i] = (i + 0.5) / N;
        ys[i] = target(xs[i]);
    }

    Net1D net(H, 123);

    // Training loop: mean squared error with full-batch gradient descent
    for (int epoch = 0; epoch < epochs; ++epoch) {
        // Accumulate gradients
        double gc = 0.0;           // grad wrt c
        vector<double> ga(H, 0.0); // grad wrt a
        vector<double> gw(H, 0.0); // grad wrt w
        vector<double> gb(H, 0.0); // grad wrt b

        double mse = 0.0;
        for (int i = 0; i < N; ++i) {
            vector<double> h;
            double yhat = net.forward(xs[i], &h);
            double err = yhat - ys[i];
            mse += err * err;

            // dL/dc = 2/N * err
            gc += 2.0 * err / N;
            // dL/da_j = 2/N * err * h_j
            for (int j = 0; j < net.H; ++j) ga[j] += 2.0 * err * h[j] / N;
            // For sigmoid, dh_j/dz = h_j*(1-h_j); z = w_j*x + b_j
            for (int j = 0; j < net.H; ++j) {
                double dh_dz = h[j] * (1.0 - h[j]);
                double dz_dw = xs[i];
                double dz_db = 1.0;
                double dL_dz = 2.0 * err * net.a[j] * dh_dz / N;
                gw[j] += dL_dz * dz_dw;
                gb[j] += dL_dz * dz_db;
            }
        }
        mse /= N;

        // Gradient step
        net.c -= lr * gc;
        for (int j = 0; j < net.H; ++j) {
            net.a[j] -= lr * ga[j];
            net.w[j] -= lr * gw[j];
            net.b[j] -= lr * gb[j];
        }

        if (epoch % 200 == 0) {
            cout << "Epoch " << epoch << ", MSE = " << mse << "\n";
        }
    }

    // Report a few predictions
    cout << fixed << setprecision(6);
    for (double x : {0.0, 0.125, 0.25, 0.5, 0.75, 1.0}) {
        double yhat = net.forward(x);
        cout << "x=" << setw(6) << x << " f(x)=" << setw(9) << target(x)
             << " yhat=" << setw(10) << yhat << "\n";
    }

    return 0;
}
```
This program trains a single-hidden-layer sigmoid network to approximate f(x) = sin(2πx) on [0,1]. It uses full-batch gradient descent on mean squared error, updating input weights, hidden biases, output weights, and output bias. The decreasing MSE and accurate sample predictions illustrate the UAT in practice: with enough neurons and training, a shallow network can approximate a smooth target function closely.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Approximate a 2D function using a single hidden layer with fixed random ReLU features
// and solve the output weights via ridge regression: (Z^T Z + λI)a = Z^T y.

struct RandomFeatures2D {
    int H;                     // number of hidden units
    vector<array<double,2>> W; // hidden weights w_i in R^2
    vector<double> b;          // hidden biases
    RandomFeatures2D(int H_, unsigned seed=42): H(H_), W(H_), b(H_) {
        mt19937 rng(seed);
        normal_distribution<double> nd(0.0, 2.0); // somewhat wide to cover [0,1]^2
        uniform_real_distribution<double> ub(-1.0, 1.0);
        for (int i = 0; i < H; ++i) {
            W[i] = {nd(rng), nd(rng)};
            b[i] = ub(rng);
        }
    }
    static inline double relu(double z) { return z > 0 ? z : 0; }
    // Compute hidden features for a single (x,y)
    vector<double> phi(double x, double y) const {
        vector<double> h(H);
        for (int i = 0; i < H; ++i) {
            double z = W[i][0]*x + W[i][1]*y + b[i];
            h[i] = relu(z);
        }
        return h;
    }
};

// Solve (A)x = b by Gaussian elimination with partial pivoting (small H only)
bool gaussian_solve(vector<vector<double>>& A, vector<double>& b, vector<double>& x) {
    int n = (int)A.size();
    x.assign(n, 0.0);
    // Augment A with b
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]);
    // Elimination
    for (int col = 0; col < n; ++col) {
        // Pivot
        int piv = col;
        for (int r = col+1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[piv][col])) piv = r;
        if (fabs(A[piv][col]) < 1e-12) return false; // singular
        swap(A[piv], A[col]);
        // Normalize row
        double div = A[col][col];
        for (int j = col; j <= n; ++j) A[col][j] /= div;
        // Eliminate others
        for (int r = 0; r < n; ++r) if (r != col) {
            double factor = A[r][col];
            for (int j = col; j <= n; ++j) A[r][j] -= factor * A[col][j];
        }
    }
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return true;
}

static inline double ftarget(double x, double y) {
    // A smooth 2D bump centered at (0.5, 0.5)
    double dx = x - 0.5, dy = y - 0.5;
    return exp(-(dx*dx + dy*dy) / 0.02);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int H = 200;          // random ReLU features
    double lambda = 1e-3; // ridge regularization

    // Build training grid on [0,1]^2
    int G = 20; // 20x20 grid => N=400
    int N = G * G;
    vector<array<double,2>> X(N);
    vector<double> y(N);
    int idx = 0;
    for (int i = 0; i < G; ++i) {
        for (int j = 0; j < G; ++j) {
            double x = (i + 0.5) / G;
            double yy = (j + 0.5) / G;
            X[idx] = {x, yy};
            y[idx] = ftarget(x, yy);
            ++idx;
        }
    }

    RandomFeatures2D rf(H, 7);

    // Compute feature matrix Z (N x H)
    vector<vector<double>> Z(N, vector<double>(H));
    for (int i = 0; i < N; ++i) {
        auto h = rf.phi(X[i][0], X[i][1]);
        for (int j = 0; j < H; ++j) Z[i][j] = h[j];
    }

    // Form normal equations: (Z^T Z + λI)a = Z^T y
    vector<vector<double>> A(H, vector<double>(H, 0.0));
    vector<double> By(H, 0.0);
    for (int i = 0; i < H; ++i) {
        for (int j = 0; j < H; ++j) {
            double s = 0.0;
            for (int k = 0; k < N; ++k) s += Z[k][i] * Z[k][j];
            A[i][j] = s + (i == j ? lambda : 0.0);
        }
        double t = 0.0;
        for (int k = 0; k < N; ++k) t += Z[k][i] * y[k];
        By[i] = t;
    }

    // Solve for output weights a
    vector<double> a;
    bool ok = gaussian_solve(A, By, a);
    if (!ok) {
        cerr << "Failed to solve normal equations (possibly ill-conditioned).\n";
        return 0;
    }

    // Evaluate training MSE
    double mse = 0.0;
    for (int i = 0; i < N; ++i) {
        double pred = 0.0;
        for (int j = 0; j < H; ++j) pred += Z[i][j] * a[j];
        double err = pred - y[i];
        mse += err * err;
    }
    mse /= N;
    cout << fixed << setprecision(6);
    cout << "Training MSE: " << mse << "\n";

    // Test a few points
    for (auto P : vector<array<double,2>>{{0.1,0.1},{0.5,0.5},{0.8,0.2}}) {
        auto h = rf.phi(P[0], P[1]);
        double pred = 0.0;
        for (int j = 0; j < H; ++j) pred += h[j] * a[j];
        cout << "(x,y)=(" << P[0] << "," << P[1] << ") f=" << ftarget(P[0],P[1])
             << " yhat=" << pred << "\n";
    }

    return 0;
}
```
This program implements a random-feature ReLU network in 2D. Hidden weights and biases are fixed randomly, the feature matrix is computed, and output weights are learned in closed form using ridge regression via the normal equations. Good accuracy on a smooth 2D bump demonstrates practical universality: even with fixed hidden layers, linear combinations of many nonlinear features can approximate the target well.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Approximate a 1D continuous function on [0,1] by linear splines and
// represent the spline exactly as a ReLU network: g(x) = c0 + c1*x + sum a_j * ReLU(x - t_j).
// This demonstrates a constructive instance of the UAT in 1D.

static inline double relu(double z) { return z > 0 ? z : 0; }

// Build spline coefficients from sampled points (t_0=0 < t_1 < ... < t_K=1)
struct ReLUSpline {
    vector<double> t; // breakpoints t_j, size K+1
    vector<double> a; // hinge weights a_j, size K-1 (for j=1..K-1)
    double c0;        // intercept
    double c1;        // slope of first segment

    // Given samples (t_j, f_j), produce exact PWL interpolant and convert to ReLU form
    ReLUSpline(const vector<double>& T, const vector<double>& F) {
        int K = (int)T.size() - 1; // number of segments
        t = T;
        // Compute segment slopes s_j on [t_j, t_{j+1}]
        vector<double> s(K);
        for (int j = 0; j < K; ++j) {
            s[j] = (F[j+1] - F[j]) / (T[j+1] - T[j]);
        }
        c0 = F[0];
        c1 = s[0];
        // a_j are slope jumps: a_j = s_j - s_{j-1} for j=1..K-1
        a.assign(K-1, 0.0);
        for (int j = 1; j < K; ++j) a[j-1] = s[j] - s[j-1];
    }

    double eval(double x) const {
        // g(x) = c0 + c1*x + sum_{j=1}^{K-1} a_j * ReLU(x - t_j)
        double y = c0 + c1 * x;
        for (int j = 0; j < (int)a.size(); ++j) y += a[j] * relu(x - t[j+1]);
        return y;
    }
};

static inline double ftrue(double x) {
    // target function to approximate
    const double PI = acos(-1.0);
    return 0.5 * sin(2.0 * PI * x) + 0.3 * x * (1.0 - x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Choose K segments for the spline
    int K = 16; // more segments => better uniform approximation
    vector<double> T(K+1), F(K+1);
    for (int j = 0; j <= K; ++j) {
        T[j] = (double)j / K;
        F[j] = ftrue(T[j]);
    }

    ReLUSpline spline(T, F);

    // Evaluate error on a fine grid
    int M = 1000;
    double maxerr = 0.0;
    for (int i = 0; i <= M; ++i) {
        double x = (double)i / M;
        double err = fabs(spline.eval(x) - ftrue(x));
        if (err > maxerr) maxerr = err;
    }

    cout << fixed << setprecision(6);
    cout << "Uniform error with K=" << K << " segments: " << maxerr << "\n";

    // Show a few sample evaluations
    for (double x : {0.0, 0.1, 0.25, 0.5, 0.75, 1.0}) {
        cout << "x=" << x << " f(x)=" << ftrue(x) << " spline(x)=" << spline.eval(x) << "\n";
    }

    return 0;
}
```
This example constructs a linear-spline interpolant of a 1D function and rewrites it exactly as a ReLU network with K−1 hidden units: g(x) = c0 + c1 x + Σ a_j ReLU(x − t_j). As K increases, the spline approximates the continuous function uniformly, so the corresponding ReLU network achieves arbitrarily small error—providing a constructive intuition for the UAT in 1D.