📚 Theory · Intermediate

Universal Approximation Theorem

Key Points

  • The Universal Approximation Theorem (UAT) says a feedforward neural network with one hidden layer and a non-polynomial activation (like sigmoid or ReLU) can approximate any continuous function on a compact set as closely as we want.
  • The result is existential: it guarantees that some weights exist but does not tell us how to find them or how many neurons are needed.
  • Depth matters: while one hidden layer suffices in principle, shallow networks may need exponentially many neurons compared to deeper ones for the same function.
  • The theorem uses the uniform norm on compact domains, meaning the approximation error is small everywhere on the input region.
  • ReLU networks are especially powerful because any 1D piecewise linear function can be written exactly as an affine term plus a sum of shifted ReLUs.
  • Random-feature methods (fixing hidden weights, training only output weights) can already approximate many functions well, hinting at the breadth of the UAT.
  • The UAT does not address training dynamics, data efficiency, or generalization; it is about representational capacity only.
  • In practice, using depth leverages compositionality, reusing features across layers to represent complex structure with fewer neurons.

Prerequisites

  • Linear algebra: Underpins vector spaces, dot products w^T x, matrix equations in training, and feature mappings.
  • Calculus and real analysis: Continuity, compactness, and limits are central to the theorem’s statement and meaning.
  • Probability and statistics: Needed for understanding random features, generalization, and error measurement.
  • Optimization and gradient descent: Practical training of networks relies on these methods even though the UAT is existential.
  • Neural network basics: Architecture, activations, and forward/backward passes are required to implement models.
  • Numerical linear algebra: Solving normal equations and handling conditioning in random-feature training.
  • Function approximation theory: Provides context for universality, rates, and the relation to splines and kernels.

Detailed Explanation


01 Overview

The Universal Approximation Theorem (UAT) is a cornerstone result in neural network theory. It states that a feedforward neural network with just a single hidden layer and a suitable non-polynomial activation function (e.g., sigmoid, tanh, ReLU) can approximate any continuous function defined on a compact set (such as a closed and bounded interval or box in Euclidean space) to arbitrary precision. More precisely, for any continuous target function and any tolerance level, there exists a network width (number of hidden neurons) and a set of weights such that the network’s output is uniformly close to the target function everywhere on the domain. This result, proved in various forms by Cybenko (1989) and Hornik (1991), provides a theoretical justification for using neural networks as flexible function approximators.

However, it is crucially an existence theorem: it does not specify how many neurons are needed for a given accuracy, nor does it provide an algorithm to find the requisite weights efficiently. It also says nothing about learnability, data requirements, or generalization.

Modern practice emphasizes depth: while a single hidden layer suffices in principle, deeper networks can represent many functions with exponentially fewer parameters by exploiting compositional structure. This explains why deep learning is effective in high-dimensional tasks, even though shallow networks are theoretically universal. The UAT thus reassures us that neural networks are expressive enough, while follow-up results help explain why certain architectures and depths are practical.

02 Intuition & Analogies

Imagine building shapes out of Lego bricks. If you have enough small bricks, you can approximate any curved shape by stacking them cleverly. In the same spirit, neural networks combine many simple building blocks ("bumps" or "hinges" created by activations) so that their weighted sums trace out the desired function’s shape. A sigmoid unit behaves like a smooth step; a ReLU behaves like a hinge that is flat on one side and linear on the other. By adding many such steps or hinges at the right positions and scales, you can sculpt almost any curve or surface.

Another analogy is mixing paints to get a target color: each hidden neuron contributes a particular tint (a ridge-shaped feature along some direction), and the output layer mixes these tints to match the final hue (the target function). If you allow enough ingredients (neurons), you can match subtle shades (fine function details) more closely.

Depth is like building with pre-assembled Lego substructures. You could approximate a castle wall brick by brick (a wide, shallow network), but it’s more efficient to first assemble windows, arches, and towers (features) and then combine them (a deep network). Each layer composes and refines features, capturing structure in fewer pieces. This is the essence of compositionality: many real-world functions can be described as compositions of simpler parts (edges → textures → objects), and deep networks naturally mirror this hierarchy. Thus, while the UAT says shallow networks can eventually do the job, adding depth lets us get there with far fewer bricks.

03 Formal Definition

Let K ⊂ ℝ^n be compact and let σ: ℝ → ℝ be a continuous activation that is not a polynomial (e.g., sigmoid, tanh, ReLU). Consider the class of single-hidden-layer neural networks F = { x ↦ Σ_{i=1}^{N} a_i σ(w_i^T x + b_i) + c : N ∈ ℕ, a_i ∈ ℝ, w_i ∈ ℝ^n, b_i ∈ ℝ, c ∈ ℝ }. The Universal Approximation Theorem states that F is dense in C(K) (the space of continuous functions on K) under the uniform norm ‖g‖_∞ = sup_{x ∈ K} |g(x)|. That is, for every f ∈ C(K) and every ε > 0, there exist N ∈ ℕ and parameters (a_i, w_i, b_i, c) such that ‖f − g‖_∞ < ε. Cybenko (1989) proved this for sigmoidal activations on [0,1]^n; Hornik (1991) extended it to broader activation classes and norms (including L^p). ReLU, though unbounded, also yields density on compact sets. The non-polynomial condition excludes activations that can only generate finite-dimensional polynomial spaces, which would not be dense in C(K). The theorem makes no claim about the minimal N for a given ε, nor about the computational complexity of finding such parameters.

04 When to Use

Use the UAT as a conceptual guarantee that your network architecture is expressive enough to represent your target mapping, especially when selecting activation functions and considering whether a single hidden layer could, in principle, do the job. It’s particularly relevant in function approximation scenarios (regression, system identification, control, PDE surrogates) on bounded input domains. When you adopt ReLU or sigmoid activations, the UAT reassures you that adding enough neurons can approximate the desired behavior.

However, in practice you should lean on depth when you suspect the target function has compositional or hierarchical structure (vision, language, audio). Deeper models often achieve the same approximation with dramatically fewer neurons and parameters, improving statistical and computational efficiency. The UAT also motivates random-feature methods (fix hidden weights, learn the output layer) and kernel approximations as practical ways to get good approximations quickly.

Do not use the UAT to justify training feasibility, sample complexity, or generalization: it does not address optimization landscapes, data noise, or overfitting. Pair it with regularization, validation, and architectural bias (convolutions, attention, residual connections) to exploit structure and achieve learnability. Finally, remember the compact-domain assumption: rescale inputs or constrain operating ranges so the theoretical guarantee applies cleanly.

⚠️Common Mistakes

  • Equating existence with learnability: The UAT does not say gradient descent will find the approximating weights, nor that it will do so quickly. Remedy: separate expressivity claims from optimization guarantees and monitor training with validation curves.
  • Ignoring domain compactness: The theorem assumes a compact domain (e.g., [0,1]^{n}). Remedy: normalize or bound inputs; avoid relying on the UAT to justify extrapolation.
  • Choosing polynomial or linear activations: Purely linear or polynomial activations do not yield universality in the stated sense. Remedy: use non-polynomial activations like ReLU, sigmoid, tanh, or GELU.
  • Overestimating shallow networks: While a single hidden layer is universal, width requirements can be exponential in the input dimension for certain functions. Remedy: prefer deeper architectures for compositional tasks.
  • Expecting rates from the UAT: The theorem is non-quantitative about approximation error vs. width. Remedy: consult rate results (e.g., Barron bounds) or empirical validation when sizing models.
  • Conflating approximation and generalization: A perfect training fit does not imply good test performance. Remedy: use regularization, early stopping, and appropriate data splits.
  • Neglecting parameter scaling: Very large weights can cause numerical instability. Remedy: standardize features and initialize weights carefully.
  • Misapplying classification logic: The UAT for continuous functions differs from separability results in classification (e.g., universal classifiers under different settings). Remedy: be precise about the function class and norm (uniform vs. L^{p}).

Key Formulas

Universal Approximation (Uniform Norm)

∀ f ∈ C(K), ∀ ε > 0, ∃ g ∈ F : ‖f − g‖_∞ < ε

Explanation: For any continuous function f on a compact set K and any tolerance ε > 0, there exists a single-hidden-layer network whose output stays within ε of f everywhere on K.

Uniform Norm

‖g‖_∞ = sup_{x ∈ K} |g(x)|

Explanation: The infinity norm measures the maximum absolute value of a function over a set K. The UAT uses this norm to quantify approximation error uniformly over the domain.

Single-Hidden-Layer Model

g(x) = Σ_{i=1}^{N} a_i σ(w_i^T x + b_i) + c

Explanation: This is the function class considered by the UAT: linear combinations of nonlinear ridge functions plus a bias term.

ReLU Spline Representation

g(x) = c_0 + c_1 x + Σ_{j=1}^{K−1} a_j ReLU(x − t_j)

Explanation: Any 1D piecewise-linear function with K segments (interior breakpoints t_1, …, t_{K−1}) can be represented exactly in this form, showing constructive universality in 1D.

Ridge Regression for Random Features

(Z^T Z + λI) a = Z^T y

Explanation: Given hidden-layer features Z and targets y, the optimal output weights a in L2-regularized least squares are obtained by solving this linear system.

Barron Rate (Random Features)

E ‖f − f_m‖_{L2} = O(1/√m)

Explanation: For functions in Barron space, the L2 approximation error using m random features decays on average as O(1/√m). This provides a quantitative rate absent in the UAT.

Parameter Count (Single Output)

P = H(n + 1) + H + 1 = H(n + 2) + 1

Explanation: A single-hidden-layer network with H neurons in n dimensions has H input-weight vectors (n entries each) and H biases, H output weights, and one output bias.

Forward-Pass Cost

T_forward = O(nH)

Explanation: Evaluating a single-hidden-layer network requires computing H dot products in n dimensions plus activations and a final sum.

Linear Regions (Single ReLU Layer)

R(H, n) ≤ Σ_{k=0}^{n} C(H, k)

Explanation: A layer of H ReLU units in n dimensions can partition space into at most this many linear regions. Depth multiplies regions across layers, increasing expressivity.

Curse of Dimensionality (Grid Pieces)

N_cells = k^n

Explanation: Approximating a function on a grid with k pieces per dimension needs k^n cells, which grows exponentially with the dimension n. Depth can mitigate this by compositional reuse.

Complexity Analysis

For a single-hidden-layer network with H hidden units and input dimension n, a forward pass costs O(nH) time: H dot products in ℝ^n (each O(n)), H activations, and one weighted sum. Memory is O(nH) for input weights, plus O(H) for biases and output weights, totaling O(nH) parameters. For batch inference on B inputs, time is O(B n H) with trivial parallelism across both B and H.

Training complexity depends on the method. Full-batch gradient descent on mean squared error with E epochs and N training points costs O(E N n H) in this shallow setting (assuming one output). Mini-batching with batch size B costs O(B n H) per step, which is still O(E N n H) arithmetic overall, but with better cache behavior and possible GPU speedups. Backpropagation doubles the cost of a forward pass up to constant factors but remains O(nH) per example.

If only the output layer is trained (random features), we first compute the feature matrix Z ∈ ℝ^{N×H} in O(N n H) time, then solve the ridge regression normal equations (Z^T Z + λI) a = Z^T y. Forming Z^T Z costs O(N H^2) and Z^T y costs O(N H). Solving the resulting H×H linear system via Gaussian elimination is O(H^3). Thus, random-feature training is O(N n H + N H^2 + H^3) time and O(N H + H^2) space, which is attractive when H is moderate.

Approximation complexity (how H must grow with the error tolerance ε) is not provided by the UAT. For certain smooth functions in Barron space, rates like O(1/√H) in L2 are known for random features, but worst-case functions in high dimensions may require exponentially many units for shallow networks. Depth can reduce parameter complexity dramatically by reusing features compositionally, so deep architectures often achieve better accuracy with smaller width per layer than a comparable shallow model.

Code Examples

Train a 1D Single-Hidden-Layer Network (Sigmoid) to Approximate sin(2πx)
#include <bits/stdc++.h>
using namespace std;

// Simple 1D single-hidden-layer neural network with sigmoid activation.
// We train it by vanilla gradient descent to approximate f(x) = sin(2*pi*x) on [0,1].

struct Net1D {
    int H;             // number of hidden units
    double c;          // output bias
    vector<double> w;  // input weights (size H)
    vector<double> b;  // hidden biases (size H)
    vector<double> a;  // output weights (size H)

    Net1D(int H_, unsigned seed = 42) : H(H_), c(0.0), w(H_), b(H_), a(H_) {
        mt19937 rng(seed);
        normal_distribution<double> nd(0.0, 0.5); // small random init
        for (int i = 0; i < H; ++i) {
            w[i] = nd(rng);
            b[i] = nd(rng);
            a[i] = nd(rng) * 0.1; // smaller output weights
        }
    }

    // Numerically stable sigmoid
    static inline double sigmoid(double z) {
        if (z >= 0) {
            double ez = exp(-z);
            return 1.0 / (1.0 + ez);
        } else {
            double ez = exp(z);
            return ez / (1.0 + ez);
        }
    }

    // Forward pass for a single scalar x
    double forward(double x, vector<double>* h_out = nullptr) const {
        vector<double> h(H);
        for (int i = 0; i < H; ++i) {
            h[i] = sigmoid(w[i]*x + b[i]);
        }
        double y = c;
        for (int i = 0; i < H; ++i) y += a[i]*h[i];
        if (h_out) *h_out = move(h);
        return y;
    }
};

static inline double target(double x) {
    const double PI = acos(-1.0);
    return sin(2.0 * PI * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int H = 50;        // hidden units
    int N = 256;       // training samples
    int epochs = 2000; // training epochs
    double lr = 0.05;  // learning rate

    // Generate training data uniformly on [0,1]
    vector<double> xs(N), ys(N);
    for (int i = 0; i < N; ++i) {
        xs[i] = (i + 0.5) / N;
        ys[i] = target(xs[i]);
    }

    Net1D net(H, 123);

    // Training loop: mean squared error with full-batch gradient descent
    for (int epoch = 0; epoch < epochs; ++epoch) {
        // Accumulate gradients
        double gc = 0.0;           // grad wrt c
        vector<double> ga(H, 0.0); // grad wrt a
        vector<double> gw(H, 0.0); // grad wrt w
        vector<double> gb(H, 0.0); // grad wrt b

        double mse = 0.0;
        for (int i = 0; i < N; ++i) {
            vector<double> h;
            double yhat = net.forward(xs[i], &h);
            double err = yhat - ys[i];
            mse += err * err;

            // dL/dc = 2/N * err
            gc += 2.0 * err / N;
            // dL/da_j = 2/N * err * h_j
            for (int j = 0; j < net.H; ++j) ga[j] += 2.0 * err * h[j] / N;
            // For sigmoid, dh_j/dz = h_j*(1-h_j); z = w_j*x + b_j
            for (int j = 0; j < net.H; ++j) {
                double dh_dz = h[j] * (1.0 - h[j]);
                double dz_dw = xs[i];
                double dz_db = 1.0;
                double dL_dz = 2.0 * err * net.a[j] * dh_dz / N;
                gw[j] += dL_dz * dz_dw;
                gb[j] += dL_dz * dz_db;
            }
        }
        mse /= N;

        // Gradient step
        net.c -= lr * gc;
        for (int j = 0; j < net.H; ++j) {
            net.a[j] -= lr * ga[j];
            net.w[j] -= lr * gw[j];
            net.b[j] -= lr * gb[j];
        }

        if (epoch % 200 == 0) {
            cout << "Epoch " << epoch << ", MSE = " << mse << "\n";
        }
    }

    // Report a few predictions
    cout << fixed << setprecision(6);
    for (double x : {0.0, 0.125, 0.25, 0.5, 0.75, 1.0}) {
        double yhat = net.forward(x);
        cout << "x=" << setw(6) << x << " f(x)=" << setw(9) << target(x)
             << " yhat=" << setw(10) << yhat << "\n";
    }

    return 0;
}

This program trains a single-hidden-layer sigmoid network to approximate f(x) = sin(2πx) on [0,1]. It uses full-batch gradient descent on mean squared error, updating input weights, hidden biases, output weights, and output bias. The decreasing MSE and accurate sample predictions illustrate the UAT in practice: with enough neurons and training, a shallow network can approximate a smooth target function closely.

Time: Training O(epochs × N × H); inference per point O(H). Space: O(H) parameters for 1D input (weights, biases, and output weights).
Random-Feature (ReLU) Regressor in 2D with Ridge Regression (Extreme Learning Machine)
#include <bits/stdc++.h>
using namespace std;

// Approximate a 2D function using a single hidden layer with fixed random ReLU features
// and solve the output weights via ridge regression: (Z^T Z + λI)a = Z^T y.

struct RandomFeatures2D {
    int H;                     // number of hidden units
    vector<array<double,2>> W; // hidden weights w_i in R^2
    vector<double> b;          // hidden biases

    RandomFeatures2D(int H_, unsigned seed = 42) : H(H_), W(H_), b(H_) {
        mt19937 rng(seed);
        normal_distribution<double> nd(0.0, 2.0); // somewhat wide to cover [0,1]^2
        uniform_real_distribution<double> ub(-1.0, 1.0);
        for (int i = 0; i < H; ++i) {
            W[i] = {nd(rng), nd(rng)};
            b[i] = ub(rng);
        }
    }

    static inline double relu(double z) { return z > 0 ? z : 0; }

    // Compute hidden features for a single (x,y)
    vector<double> phi(double x, double y) const {
        vector<double> h(H);
        for (int i = 0; i < H; ++i) {
            double z = W[i][0]*x + W[i][1]*y + b[i];
            h[i] = relu(z);
        }
        return h;
    }
};

// Solve A x = b by Gaussian elimination with partial pivoting (small H only)
bool gaussian_solve(vector<vector<double>>& A, vector<double>& b, vector<double>& x) {
    int n = (int)A.size();
    x.assign(n, 0.0);
    // Augment A with b
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]);
    // Elimination
    for (int col = 0; col < n; ++col) {
        // Pivot
        int piv = col;
        for (int r = col+1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[piv][col])) piv = r;
        if (fabs(A[piv][col]) < 1e-12) return false; // singular
        swap(A[piv], A[col]);
        // Normalize row
        double div = A[col][col];
        for (int j = col; j <= n; ++j) A[col][j] /= div;
        // Eliminate others
        for (int r = 0; r < n; ++r) if (r != col) {
            double factor = A[r][col];
            for (int j = col; j <= n; ++j) A[r][j] -= factor * A[col][j];
        }
    }
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return true;
}

static inline double ftarget(double x, double y) {
    // A smooth 2D bump centered at (0.5, 0.5)
    double dx = x - 0.5, dy = y - 0.5;
    return exp(-(dx*dx + dy*dy) / 0.02);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int H = 200;          // random ReLU features
    double lambda = 1e-3; // ridge regularization

    // Build training grid on [0,1]^2
    int G = 20; // 20x20 grid => N = 400
    int N = G * G;
    vector<array<double,2>> X(N);
    vector<double> y(N);
    int idx = 0;
    for (int i = 0; i < G; ++i) {
        for (int j = 0; j < G; ++j) {
            double x = (i + 0.5) / G;
            double yy = (j + 0.5) / G;
            X[idx] = {x, yy};
            y[idx] = ftarget(x, yy);
            ++idx;
        }
    }

    RandomFeatures2D rf(H, 7);

    // Compute feature matrix Z (N x H)
    vector<vector<double>> Z(N, vector<double>(H));
    for (int i = 0; i < N; ++i) {
        auto h = rf.phi(X[i][0], X[i][1]);
        for (int j = 0; j < H; ++j) Z[i][j] = h[j];
    }

    // Form normal equations: (Z^T Z + λI)a = Z^T y
    vector<vector<double>> A(H, vector<double>(H, 0.0));
    vector<double> By(H, 0.0);
    for (int i = 0; i < H; ++i) {
        for (int j = 0; j < H; ++j) {
            double s = 0.0;
            for (int k = 0; k < N; ++k) s += Z[k][i] * Z[k][j];
            A[i][j] = s + (i == j ? lambda : 0.0);
        }
        double t = 0.0;
        for (int k = 0; k < N; ++k) t += Z[k][i] * y[k];
        By[i] = t;
    }

    // Solve for output weights a
    vector<double> a;
    bool ok = gaussian_solve(A, By, a);
    if (!ok) {
        cerr << "Failed to solve normal equations (possibly ill-conditioned).\n";
        return 0;
    }

    // Evaluate training MSE
    double mse = 0.0;
    for (int i = 0; i < N; ++i) {
        double pred = 0.0;
        for (int j = 0; j < H; ++j) pred += Z[i][j] * a[j];
        double err = pred - y[i];
        mse += err * err;
    }
    mse /= N;
    cout << fixed << setprecision(6);
    cout << "Training MSE: " << mse << "\n";

    // Test a few points
    for (auto P : vector<array<double,2>>{{0.1,0.1},{0.5,0.5},{0.8,0.2}}) {
        auto h = rf.phi(P[0], P[1]);
        double pred = 0.0;
        for (int j = 0; j < H; ++j) pred += h[j] * a[j];
        cout << "(x,y)=(" << P[0] << "," << P[1] << ") f=" << ftarget(P[0],P[1])
             << " yhat=" << pred << "\n";
    }

    return 0;
}

This program implements a random-feature ReLU network in 2D. Hidden weights and biases are fixed randomly, the feature matrix is computed, and output weights are learned in closed form using ridge regression via the normal equations. Good accuracy on a smooth 2D bump demonstrates practical universality: even with fixed hidden layers, linear combinations of many nonlinear features can approximate the target well.

Time: O(N n H + N H^2 + H^3) for building features and solving the normal equations (n = 2 here). Space: O(N H + H^2) to store the feature matrix and normal-equation matrices.
Constructive 1D ReLU Spline Approximator via Hinges
#include <bits/stdc++.h>
using namespace std;

// Approximate a 1D continuous function on [0,1] by linear splines and
// represent the spline exactly as a ReLU network: g(x) = c0 + c1*x + sum a_j * ReLU(x - t_j).
// This demonstrates a constructive instance of the UAT in 1D.

static inline double relu(double z) { return z > 0 ? z : 0; }

// Build spline coefficients from sampled points (t_0=0 < t_1 < ... < t_K=1)
struct ReLUSpline {
    vector<double> t; // breakpoints t_j, size K+1
    vector<double> a; // hinge weights a_j, size K-1 (for j=1..K-1)
    double c0;        // intercept
    double c1;        // slope of first segment

    // Given samples (t_j, f_j), produce the exact PWL interpolant and convert to ReLU form
    ReLUSpline(const vector<double>& T, const vector<double>& F) {
        int K = (int)T.size() - 1; // number of segments
        t = T;
        // Compute segment slopes s_j on [t_j, t_{j+1}]
        vector<double> s(K);
        for (int j = 0; j < K; ++j) {
            s[j] = (F[j+1] - F[j]) / (T[j+1] - T[j]);
        }
        c0 = F[0];
        c1 = s[0];
        // a_j are slope jumps: a_j = s_j - s_{j-1} for j=1..K-1
        a.assign(K-1, 0.0);
        for (int j = 1; j < K; ++j) a[j-1] = s[j] - s[j-1];
    }

    double eval(double x) const {
        // g(x) = c0 + c1*x + sum_{j=1}^{K-1} a_j * ReLU(x - t_j)
        double y = c0 + c1 * x;
        for (int j = 0; j < (int)a.size(); ++j) y += a[j] * relu(x - t[j+1]);
        return y;
    }
};

static inline double ftrue(double x) {
    // target function to approximate
    const double PI = acos(-1.0);
    return 0.5 * sin(2.0 * PI * x) + 0.3 * x * (1.0 - x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Choose K segments for the spline
    int K = 16; // more segments => better uniform approximation
    vector<double> T(K+1), F(K+1);
    for (int j = 0; j <= K; ++j) {
        T[j] = (double)j / K;
        F[j] = ftrue(T[j]);
    }

    ReLUSpline spline(T, F);

    // Evaluate error on a fine grid
    int M = 1000;
    double maxerr = 0.0;
    for (int i = 0; i <= M; ++i) {
        double x = (double)i / M;
        double err = fabs(spline.eval(x) - ftrue(x));
        if (err > maxerr) maxerr = err;
    }

    cout << fixed << setprecision(6);
    cout << "Uniform error with K=" << K << " segments: " << maxerr << "\n";

    // Show a few sample evaluations
    for (double x : {0.0, 0.1, 0.25, 0.5, 0.75, 1.0}) {
        cout << "x=" << x << " f(x)=" << ftrue(x) << " spline(x)=" << spline.eval(x) << "\n";
    }

    return 0;
}

This example constructs a linear-spline interpolant of a 1D function and rewrites it exactly as a ReLU network with K−1 hidden units: g(x) = c0 + c1 x + Σ a_j ReLU(x − t_j). As K increases, the spline approximates the continuous function uniformly, so the corresponding ReLU network achieves arbitrarily small error—providing a constructive intuition for the UAT in 1D.

Time: Building coefficients O(K); evaluating at one x O(K). Space: O(K) to store breakpoints and hinge weights.