Neural Network Expressivity

Key Points

  • Neural network expressivity studies what kinds of functions different network architectures can represent and how efficiently they can do so.
  • Depth brings powerful compounding effects: deep ReLU networks can create exponentially many linear regions using only polynomially many parameters.
  • There are depth-separation theorems: some functions computed by depth-d networks require exponentially larger width if the depth is reduced to d−1.
  • For Boolean functions, depth hierarchies of threshold circuits show exponential size blowups when you reduce depth, mirroring neural network depth benefits.
  • From a tensor viewpoint, deep networks implement hierarchical tensor factorizations that can represent high-order interactions much more compactly than shallow ones.
  • The universal approximation theorem guarantees shallow networks can approximate any continuous function but may need an impractically large width.
  • The lottery ticket hypothesis suggests that large dense networks contain sparse subnetworks that can achieve similar performance, indicating redundancy in parameterization but not necessarily in expressivity.
  • In practice, the right depth/width tradeoff depends on the task’s compositional structure, desired sample efficiency, and compute/memory constraints.

Prerequisites

  • Linear algebra (vectors, matrices, rank, eigenvalues): Network layers are affine maps; tensor and capacity arguments rely on linear algebra concepts.
  • Calculus and optimization basics: Understanding training, loss minimization, and function approximation error requires derivatives and optimization.
  • Complexity theory foundations: Depth separation connects to circuit complexity and lower bounds on size vs depth.
  • Piecewise-linear functions and geometry: ReLU networks are CPWL; linear region counts and hyperplane arrangements are central.
  • Probability and statistics: Generalization, VC dimension, and approximation error bounds are probabilistic/statistical.
  • Fourier analysis (Barron space intuition): Rates for two-layer networks depend on spectral properties of the target function.
  • Tensor decompositions: Understanding hierarchical expressivity uses CP/Tucker and tensor-network analogies.
  • Neural network basics (MLP, activation functions): Expressivity is defined relative to standard architectures and activations.
  • Numerical linear algebra: Fitting shallow models via least squares requires stable solvers like Cholesky.

Detailed Explanation

01 Overview

Neural network expressivity asks: given an architecture (depth, width, activation functions), what is the set of functions it can represent, and how many parameters are required to represent functions of a given complexity? The field seeks principled reasons why deep networks outperform shallow ones, and it quantifies how architectural choices shape the hypothesis class. Classic results like the universal approximation theorem ensure that even shallow networks (one hidden layer) can approximate any continuous function on a compact set. However, the key distinction is efficiency: some functions are representable with a small, deep network but require exponentially many neurons in a shallow network. For ReLU networks, a core insight is that they define piecewise-linear functions; depth lets these pieces multiply through compositions, creating a large number of linear regions. In Boolean settings, related circuit complexity results show exponential lower bounds when reducing depth for threshold circuits. Complementary views come from tensors: deep architectures can be seen as hierarchical tensor decompositions that capture high-order interactions with far fewer parameters than flat decompositions. Modern perspectives like the lottery ticket hypothesis further suggest overparameterized networks contain simpler subnetworks capable of strong performance, reshaping how we think about capacity and trainability versus representational power. Understanding expressivity guides architecture design, capacity control, and theoretical limits.

02 Intuition & Analogies

Imagine folding a strip of paper. Each fold adds a kink. If you just add creases one by one along a flat strip (like a very wide, single-layer network), you get a linear sequence of kinks—useful but limited. Now imagine you repeatedly fold, then fold the folded paper again, and again. Each new fold multiplies the number of segments and creases because previous folds are carried forward and get folded over themselves. This is depth: compositions of simple transformations create exponentially rich structure compared to stacking many independent creases side-by-side. Another analogy: Think of a kitchen assembly line. A shallow network is like one long station with many workers doing tasks in parallel—each worker adds one simple feature. A deep network is multiple stations in sequence, where outputs from one stage are refined by the next. With the same total number of workers, the sequential pipeline can build far more intricate products through composition. For ReLU networks, each neuron adds a hinge (a piecewise-linear kink) along some direction. With a single hidden layer, all hinges are directly applied to the input, so you only get as many kinks as you have neurons. With multiple layers, hinges are applied to previous hinge-made surfaces, recursively subdividing space. A 1D demonstration shows this vividly: by composing a simple triangular wave map a few times, you can create an oscillatory function with 2^k linear pieces using a network whose size grows only linearly with k. In Boolean circuits, depth is like the number of logic levels. Some functions (e.g., parity) are easy if you can stack enough layers of gates, but if you restrict to too few layers, you need an explosion in the number of gates. Deep learning inherits this principle: certain compositions are compact only when you allow enough layers.

03 Formal Definition

Fix an activation function \(\sigma\) (e.g., ReLU \(\sigma(z)=\max\{0,z\}\)). A feedforward network with input dimension \(n_0\), depth \(L\), and widths \((n_1,\dots,n_L)\) represents functions \(f:\mathbb{R}^{n_0}\to\mathbb{R}^{n_L}\) of the form \[ f(x) = W_L\,\sigma\big( W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x + b_1)\,\cdots) + b_{L-1} \big) + b_L. \] The hypothesis class \(\mathcal{H}(\sigma; L, n_1,\dots,n_L)\) is the set of all such functions as weights/biases vary. Expressivity can be measured by: (1) what functions are in \(\mathcal{H}\), (2) how many parameters are needed to represent a target function up to error \(\varepsilon\), and (3) structural complexity measures, such as the number of linear regions for ReLU networks. For ReLU nets, each layer applies an affine map followed by an elementwise max, yielding a continuous piecewise-linear (CPWL) function. One quantitative measure is the maximum number of affine-linear regions that the input space can be partitioned into, often growing exponentially with depth under suitable width. For Boolean inputs \(\{0,1\}^n\), threshold circuits (gates compute \(\mathrm{sign}(w^\top x - \theta)\)) define classes analogously parameterized by depth and size. Depth hierarchy theorems assert that for every \(d\), there exist functions computable by polynomial-size depth-\(d\) circuits that require exponential size at depth \(d-1\). Tensor network views formalize deep networks as hierarchical factorizations of high-order tensors implementing functions; ranks multiply through layers, explaining succinct representations of complex interactions.

04 When to Use

Use depth when your target mapping has a compositional or hierarchical structure: language (characters→tokens→phrases→sentences), images (edges→motifs→parts→objects), audio (waveforms→phonemes→words). Deep networks excel at progressively refining features, allowing exponential growth of representable patterns with modest parameter growth. If you need sharp local variations (many kinks/oscillations), deep ReLU networks can realize them compactly by composing simple fold-like operations. Use wider, shallower networks when: (1) the function is smooth and global with limited hierarchical structure (Barron-type functions), (2) you prioritize faster inference with fewer sequential steps (latency-constrained settings), or (3) you rely on convex or nearly convex optimization landscapes in two-layer settings. Architectural principles guided by expressivity include: residual connections (enable very deep compositions without vanishing gradients), convolution (parameter sharing exploits locality; depth compounds receptive fields), normalization (stabilizes deep compositions), and attention (flexible, depth-amplified interactions). If memory is tight or you fear overfitting, consider depth with bottlenecks and sparsity; the lottery ticket hypothesis indicates that overparameterized models may contain efficient sparse subnetworks you can prune to. In theory-informed design, if reducing depth is unavoidable (e.g., hardware constraints), be ready to scale width substantially to maintain expressivity for certain tasks. Conversely, modest depth increases can dramatically reduce the required width for compositional problems.

⚠️ Common Mistakes

  1. Assuming universal approximation means depth is irrelevant. While a single hidden layer can approximate any continuous function, the number of neurons needed can be exponentially larger than for a deep model; this impacts trainability, generalization, and compute.
  2. Equating parameter count with expressivity. Two networks with the same number of parameters can have very different function classes; arrangement across depth/width matters.
  3. Ignoring activation choice. ReLU networks are CPWL; they excel at piecewise-linear structure and can approximate smooth functions via many small pieces. Using saturating or polynomial activations changes expressivity and optimization behavior.
  4. Misreading linear-region counts. The number of regions along a particular line can be far smaller than the global maximum; empirical region counts are lower bounds. Also, more regions do not automatically mean better accuracy.
  5. Overgeneralizing Boolean circuit results. Depth separations in threshold circuits inform intuition but live in a discrete setting with different noise and optimization characteristics; treat analogies carefully.
  6. Confusing trainability with representability. A function might be representable by a compact deep net, yet gradient-based training might still struggle without proper initialization, normalization, or data.
  7. Overlooking tensor perspectives. Failing to exploit hierarchical structure (e.g., via convolution or attention) can force shallow networks to learn enormous flat decompositions, wasting parameters.
  8. Treating lottery tickets as a silver bullet. Sparse subnetworks exist, but finding them reliably and efficiently is nontrivial; pruning strategies and retraining details matter.

Key Formulas

Universal Approximation (Two-Layer)

Explanation: Any continuous function on a compact set can be approximated uniformly by a single-hidden-layer neural network with enough neurons. This shows possibility, not efficiency.
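
A standard form of the statement, sketched under the usual assumptions (continuous non-polynomial or sigmoidal activation \(\sigma\), compact domain \(K\subset\mathbb{R}^{n}\)): for every continuous \(f\) and every \(\varepsilon>0\) there exist a width \(m\) and parameters \(a_i,b_i\in\mathbb{R}\), \(w_i\in\mathbb{R}^{n}\) such that \[ \sup_{x\in K}\Big|\,f(x)-\sum_{i=1}^{m} a_i\,\sigma(w_i^\top x+b_i)\,\Big| < \varepsilon. \]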

Montúfar Lower Bound on Linear Regions

Explanation: A deep ReLU network with input dimension n0 and widths can realize at least this many linear regions. It formalizes that depth can multiply regions.
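
A commonly cited form of the bound, assuming a ReLU network with input dimension \(n_0\) and hidden widths \(n_1,\dots,n_L\ge n_0\): \[ \#\text{linear regions} \;\ge\; \left(\prod_{l=1}^{L-1}\left\lfloor \frac{n_l}{n_0}\right\rfloor^{n_0}\right)\sum_{j=0}^{n_0}\binom{n_L}{j}. \]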

Global Upper Bound on Regions

Explanation: With N total ReLU neurons and input dimension n0, the maximum number of linear regions is at most this combinatorial quantity. It constrains how complex the partition can get.
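
A sketch of the usual combinatorial bound: for a single hidden layer of \(N\) ReLU units, whose breakpoints form a hyperplane arrangement in \(\mathbb{R}^{n_0}\), \[ \#\text{linear regions} \;\le\; \sum_{j=0}^{n_0}\binom{N}{j}; \] for deeper networks the standard upper bounds are layerwise products of sums of this form (with \(2^N\) as a crude fallback).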

Telgarsky Triangle Composition

Explanation: Composing a simple triangular map k times yields a 1D function with 2^k linear pieces. A ReLU network with depth O(k) and small width can implement this.
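
A minimal sketch of the construction, using the tent map on \([0,1]\): \[ \tau(x) = 2\min\{x,\,1-x\} = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}\big(x-\tfrac12\big) + 2\,\mathrm{ReLU}(x-1), \qquad f_k = \underbrace{\tau\circ\cdots\circ\tau}_{k\ \text{times}}, \] so that \(f_k\) has \(2^k\) linear pieces while being computable by a ReLU network of depth \(O(k)\) and constant width.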

Capacity vs Parameters

Explanation: The VC dimension of many neural network classes scales on the order of the number of parameters times a logarithmic factor. It links learnability to model size.
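
Schematically, with \(W\) the number of parameters, \[ \mathrm{VCdim} \;=\; O(W\log W), \] and for deep ReLU networks with \(L\) layers, bounds of order \(W L \log W\) are known; exact constants and conditions depend on the activation and architecture.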

Depth Hierarchy (Threshold Circuits)

Explanation: For each depth d, there exists a Boolean function computed by a polynomial-size depth-d threshold circuit but requiring exponential size at depth d−1. This shows exponential depth separation.
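
Schematically, the separation stated here has the form: there exists \(f_d:\{0,1\}^n\to\{0,1\}\) with \[ \mathrm{size}_{\text{depth }d}(f_d) = \mathrm{poly}(n) \quad\text{while}\quad \mathrm{size}_{\text{depth }d-1}(f_d) \;\ge\; 2^{\,n^{\Omega(1)}}. \]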

Barron Approximation Rate

Explanation: For functions with finite Barron norm B, two-layer networks achieve mean-squared error decreasing like 1/m in the number of hidden units. This identifies when shallow nets are efficient.
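
Up to constants, the classical bound reads: for a target \(f\) with Barron norm \(B\) and a two-layer network \(\hat f_m\) with \(m\) hidden units, \[ \inf_{\hat f_m}\ \mathbb{E}\big[(f(x)-\hat f_m(x))^2\big] \;\le\; \frac{C\,B^2}{m}, \] with a rate in \(m\) that does not degrade exponentially with the input dimension.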

Parameter Count (Fully Connected)

Explanation: The total number of parameters (weights and biases) in an L-layer fully connected network. It is important for capacity and memory considerations.
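
Counting weights and biases layer by layer for widths \(n_0\) (input) through \(n_L\) (output) gives \[ P \;=\; \sum_{l=1}^{L} (n_{l-1}+1)\,n_l. \]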

Hierarchical Rank Growth (Tensor View)

Explanation: In tensor-network analogies for deep nets, representational rank can grow multiplicatively with depth. This explains compact representations of complex interactions.
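
A schematic version of the claim: if level \(l\) of the hierarchy contributes a factor of rank \(r_l\), the effective rank of the hierarchical factorization can grow multiplicatively, \[ r_{\text{eff}} \;\sim\; \prod_{l=1}^{L} r_l, \] whereas a flat (CP-style) factorization must supply a comparable rank in a single level; treat this as an intuition-level sketch rather than a precise theorem.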

Forward Pass Cost

Explanation: Time per forward pass is linear in the number of multiplications/additions across layers; space is dominated by activations and parameters. This affects sampling-based expressivity probes.
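
For a fully connected network with widths \(n_0,\dots,n_L\) and \(P\) parameters, a minimal accounting gives \[ T_{\text{forward}} = O\Big(\sum_{l=1}^{L} n_{l-1}\,n_l\Big) \ \text{per sample}, \qquad \text{memory} = O\Big(P + \sum_{l=0}^{L} n_l\Big) \] when activations are kept (e.g., for backpropagation); pure inference needs only \(O(\max_l n_l)\) activation memory.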

Complexity Analysis

Our C++ examples focus on evaluating and probing 1D expressivity and fitting a shallow network to a deep function. For the Telgarsky composition example, evaluating f_k(x) requires k compositions of a small fixed-width ReLU block, so a single evaluation costs O(k). Sampling G points on an interval and estimating the number of linear pieces via finite differences costs O(kG) time and O(1) extra space beyond storing a few running values; storing the sampled outputs costs O(G) if retained. This demonstrates how evaluation time scales linearly in depth while the number of linear regions (pieces) can grow exponentially (2^k). For the shallow least-squares fitting example, we fix ReLU features at predetermined knots and solve for output weights to approximate a target deep function. Building the design matrix implicitly and accumulating the normal equations AᵀA and Aᵀy over N samples and M features costs O(N M^2) time (the feature evaluations themselves cost only O(NM)), with O(M^2) space for AᵀA plus O(M) for vectors. Solving the M×M linear system via Cholesky takes O(M^3) time and O(M^2) space. In practice, with N ≫ M, the dominant cost is O(N M^2). We also compute the mean-squared error over N points in O(NM) time if recomputing features, or O(N) if we cache predictions. Forward evaluation of a general fully connected ReLU network with widths (n_0, …, n_L) has time O(∑_l n_{l−1} n_l) per sample and space O(max_l n_l) for activations (excluding parameter storage). Counting linear regions along a line by dense sampling costs O(G ∑_l n_{l−1} n_l) time for G points. These complexities highlight that inexpensive evaluations can still reveal exponentially rich expressivity as depth increases, while approximating such deep functions with a shallow model tends to require width (M) that grows rapidly, stressing both compute and memory.

Code Examples

Exponential linear pieces from composing a simple ReLU triangle map (Telgarsky construction)
#include <bits/stdc++.h>
using namespace std;

// ReLU activation
inline double relu(double z) { return z > 0.0 ? z : 0.0; }

// One-layer triangle ("tent") map tau(x) on [0,1]:
// tau(x) = 2*ReLU(x) - 4*ReLU(x-0.5) + 2*ReLU(x-1)
// This is a piecewise-linear tent with 2 linear pieces (2x on [0,0.5], 2-2x on [0.5,1])
// that maps [0,1] onto [0,1], so composing it keeps folding the interval.
inline double triangle_once(double x) {
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0);
}

// Compose the triangle map k times: f_k = tau o tau o ... o tau (k times).
// This can be implemented by feeding the output back into tau repeatedly.
double triangle_composed(double x, int k) {
    double y = x;
    for (int i = 0; i < k; ++i) y = triangle_once(y);
    return y;
}

// Estimate the number of linear pieces by sampling and counting slope changes.
// For a piecewise-linear function, the discrete slope is constant within a piece.
// A kink that falls strictly inside a sampling interval yields one interval with an
// intermediate slope, so consecutive slope changes are grouped and counted as one kink.
int count_linear_pieces(function<double(double)> f, double a, double b, int G, double tol = 1e-3) {
    vector<double> ys(G + 1);
    for (int i = 0; i <= G; ++i) ys[i] = f(a + (b - a) * (double)i / G);

    double dx = (b - a) / G;
    vector<double> slopes(G);
    for (int i = 0; i < G; ++i) slopes[i] = (ys[i + 1] - ys[i]) / dx;

    int pieces = 1;            // at least one piece
    bool prev_change = false;
    for (int i = 1; i < G; ++i) {
        bool change = fabs(slopes[i] - slopes[i - 1]) > tol;
        if (change && !prev_change) pieces++;   // one new piece per run of slope changes
        prev_change = change;
    }
    return pieces;
}

int main() {
    cout << fixed << setprecision(6);

    // Demonstrate exponential growth of linear pieces with depth k.
    // We evaluate f_k on [0,1] and count the number of linear segments.
    vector<int> ks = {1, 2, 3, 4, 5, 6, 7, 8};
    int G = 20000; // fine sampling for robust counting

    cout << "k, estimated_pieces, expected_pieces (2^k)\n";
    for (int k : ks) {
        auto fk = [k](double x) { return triangle_composed(x, k); };
        int pieces = count_linear_pieces(fk, 0.0, 1.0, G);
        int expected = 1 << k; // 2^k
        cout << k << ", " << pieces << ", " << expected << "\n";
    }

    // Print a small sample of values for visualization
    int kvis = 5;
    auto fvis = [kvis](double x) { return triangle_composed(x, kvis); };
    cout << "\nSample values for k=" << kvis << " on [0,1]:\n";
    for (int i = 0; i <= 10; ++i) {
        double x = i / 10.0;
        cout << "x=" << x << ", f_k(x)=" << fvis(x) << "\n";
    }

    return 0;
}

We implement a simple 1D ReLU network block (the triangle map) and compose it k times. The composition can be realized by a deep network of width 3 per triangle layer (or width 2 with minor variations) and depth proportional to k. Sampling the resulting function on [0,1] and counting slope changes shows that the number of linear pieces grows as approximately 2^k despite evaluation time growing only linearly in k. This empirically demonstrates depth-induced exponential expressivity for ReLU networks in 1D.

Time: O(kG) for sampling G points and composing k times
Space: O(G) to store sampled values (O(1) if streamed)

Shallow ReLU approximation of a deep function: width growth vs. error (least squares fit)
#include <bits/stdc++.h>
using namespace std;

// Target: deep function via k-fold composition of a tent map (same as Example 1)
inline double relu(double z) { return z > 0.0 ? z : 0.0; }
inline double triangle_once(double x) { return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0); }
double triangle_composed(double x, int k) { double y = x; for (int i = 0; i < k; ++i) y = triangle_once(y); return y; }

// Build fixed ReLU features for a one-hidden-layer model on [0,1]:
// phi_0(x)=1 (bias), phi_1(x)=x (optional), and for i=0..(K-1): phi_{i+2}(x) = max(0, x - t_i),
// where t_i are uniformly spaced knots in [0,1]. We learn only the output weights w by least squares.
struct ShallowReLU {
    vector<double> knots; // t_i
    bool use_x = true;    // include linear term x

    explicit ShallowReLU(int K, bool use_x_term = true) : use_x(use_x_term) {
        knots.resize(K);
        for (int i = 0; i < K; ++i) knots[i] = (i + 1.0) / (K + 1.0); // avoid endpoints
    }

    int feature_dim() const { return 1 + (use_x ? 1 : 0) + (int)knots.size(); }

    // Evaluate feature vector phi(x)
    void features(double x, vector<double>& phi) const {
        int M = feature_dim();
        phi.assign(M, 0.0);
        int idx = 0;
        phi[idx++] = 1.0; // bias
        if (use_x) phi[idx++] = x;
        for (double t : knots) phi[idx++] = relu(x - t);
    }
};

// Solve (A^T A + lambda I) w = A^T y via Cholesky, given streaming access to features.
// We accumulate AtA (M x M) and Aty (M) over N samples.
struct RidgeSolver {
    int M; double lambda;
    vector<double> AtA; // row-major M x M
    vector<double> Aty; // size M
    explicit RidgeSolver(int M, double lambda = 1e-6) : M(M), lambda(lambda), AtA(M * M, 0.0), Aty(M, 0.0) {}

    void add_sample(const vector<double>& phi, double y) {
        // AtA += phi * phi^T; Aty += phi * y
        for (int i = 0; i < M; ++i) {
            Aty[i] += phi[i] * y;
            const double pi = phi[i];
            const double* pj = phi.data();
            double* row = &AtA[i * M];
            for (int j = 0; j < M; ++j) row[j] += pi * pj[j];
        }
    }

    // Cholesky factorization of the SPD matrix (AtA + lambda I), then solve for w
    vector<double> solve() {
        // Add ridge term to the diagonal
        for (int i = 0; i < M; ++i) AtA[i * M + i] += lambda;

        // Cholesky decomposition: L is lower-triangular with L L^T = AtA + lambda I
        vector<double> L = AtA;
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j <= i; ++j) {
                double sum = L[i * M + j];
                for (int k = 0; k < j; ++k) sum -= L[i * M + k] * L[j * M + k];
                if (i == j) {
                    if (sum <= 0) sum = 1e-12; // numerical guard
                    L[i * M + j] = sqrt(sum);
                } else {
                    L[i * M + j] = sum / L[j * M + j];
                }
            }
            // zero out the upper triangle for clarity
            for (int j = i + 1; j < M; ++j) L[i * M + j] = 0.0;
        }

        // Solve L z = Aty (forward substitution)
        vector<double> z(M, 0.0);
        for (int i = 0; i < M; ++i) {
            double sum = Aty[i];
            for (int k = 0; k < i; ++k) sum -= L[i * M + k] * z[k];
            z[i] = sum / L[i * M + i];
        }
        // Solve L^T w = z (back substitution)
        vector<double> w(M, 0.0);
        for (int i = M - 1; i >= 0; --i) {
            double sum = z[i];
            for (int k = i + 1; k < M; ++k) sum -= L[k * M + i] * w[k];
            w[i] = sum / L[i * M + i];
        }
        return w;
    }
};

// Predict using learned weights
double predict(const ShallowReLU& model, const vector<double>& w, double x) {
    vector<double> phi;
    model.features(x, phi);
    double y = 0.0;
    for (int i = 0; i < (int)w.size(); ++i) y += w[i] * phi[i];
    return y;
}

int main() {
    cout << fixed << setprecision(6);

    // Target is the deep function f_k for moderate k, to induce many pieces
    int k = 6; // 2^k = 64 pieces

    // Training data: N samples on [0,1]
    const int N = 4096;
    vector<double> xs(N), ys(N);
    for (int i = 0; i < N; ++i) {
        xs[i] = (i + 0.5) / N;
        ys[i] = triangle_composed(xs[i], k);
    }

    // Try increasing widths (number of knots) for the shallow model
    vector<int> widths = {8, 16, 32, 64, 128};

    cout << "Shallow ReLU approximation of deep f_k (k=" << k << ")\n";
    cout << "knots, features, train_MSE\n";

    for (int K : widths) {
        ShallowReLU model(K, true);
        int M = model.feature_dim();
        RidgeSolver solver(M, 1e-6);
        vector<double> phi;
        for (int i = 0; i < N; ++i) {
            model.features(xs[i], phi);
            solver.add_sample(phi, ys[i]);
        }
        vector<double> w = solver.solve();

        // Evaluate training MSE
        double mse = 0.0;
        for (int i = 0; i < N; ++i) {
            double e = predict(model, w, xs[i]) - ys[i];
            mse += e * e;
        }
        mse /= N;
        cout << K << ", " << M << ", " << mse << "\n";
    }

    return 0;
}

We approximate the deep triangle composition f_k (with 2^k linear pieces) using a single-hidden-layer ReLU model with fixed hinge locations and learned output weights (ridge regression). As we increase the number of knots (width), the training MSE decreases. To closely match f_k, the shallow model typically needs width on the order of the number of pieces (≈ 2^k), illustrating width blow-up for shallow approximation of a deep compositional function.

Time: O(N M^2 + M^3) for building the normal equations and the Cholesky solve (N samples, M features)
Space: O(M^2) for AtA and O(M) for vectors