Neural Network Expressivity
Key Points
- Neural network expressivity studies what kinds of functions different network architectures can represent and how efficiently they can do so.
- Depth brings powerful compounding effects: deep ReLU networks can create exponentially many linear regions using only polynomially many parameters.
- There are depth-separation theorems: some functions computed by depth-d networks require exponentially larger width if the depth is reduced to d−1.
- For Boolean functions, depth hierarchies of threshold circuits show exponential size blowups when you reduce depth, mirroring neural network depth benefits.
- From a tensor viewpoint, deep networks implement hierarchical tensor factorizations that can represent high-order interactions much more compactly than shallow ones.
- The universal approximation theorem guarantees shallow networks can approximate any continuous function but may need an impractically large width.
- The lottery ticket hypothesis suggests that large dense networks contain sparse subnetworks that can achieve similar performance, indicating redundancy in parameterization but not necessarily in expressivity.
- In practice, the right depth/width tradeoff depends on the task’s compositional structure, desired sample efficiency, and compute/memory constraints.
Prerequisites
- Linear algebra (vectors, matrices, rank, eigenvalues) — Network layers are affine maps; tensor and capacity arguments rely on linear algebra concepts.
- Calculus and optimization basics — Understanding training, loss minimization, and function approximation error requires derivatives and optimization.
- Complexity theory foundations — Depth separation connects to circuit complexity and lower bounds on size vs depth.
- Piecewise-linear functions and geometry — ReLU networks are CPWL; linear region counts and hyperplane arrangements are central.
- Probability and statistics — Generalization, VC dimension, and approximation error bounds are probabilistic/statistical.
- Fourier analysis (Barron space intuition) — Rates for two-layer networks depend on spectral properties of the target function.
- Tensor decompositions — Understanding hierarchical expressivity uses CP/Tucker and tensor-network analogies.
- Neural network basics (MLP, activation functions) — Expressivity is defined relative to standard architectures and activations.
- Numerical linear algebra — Fitting shallow models via least squares requires stable solvers like Cholesky.
Detailed Explanation
01 Overview
Neural network expressivity asks: given an architecture (depth, width, activation functions), what is the set of functions it can represent, and how many parameters are required to represent functions of a given complexity? The field seeks principled reasons why deep networks outperform shallow ones, and it quantifies how architectural choices shape the hypothesis class. Classic results like the universal approximation theorem ensure that even shallow networks (one hidden layer) can approximate any continuous function on a compact set. However, the key distinction is efficiency: some functions are representable with a small, deep network but require exponentially many neurons in a shallow network.
For ReLU networks, a core insight is that they define piecewise-linear functions; depth lets these pieces multiply through composition, creating a large number of linear regions. In Boolean settings, related circuit complexity results show exponential lower bounds when reducing depth for threshold circuits. Complementary views come from tensors: deep architectures can be seen as hierarchical tensor decompositions that capture high-order interactions with far fewer parameters than flat decompositions.
Modern perspectives like the lottery ticket hypothesis further suggest that overparameterized networks contain simpler subnetworks capable of strong performance, reshaping how we think about capacity and trainability versus representational power. Understanding expressivity guides architecture design, capacity control, and theoretical limits.
02 Intuition & Analogies
Imagine folding a strip of paper. Each fold adds a kink. If you just add creases one by one along a flat strip (like a very wide, single-layer network), you get a linear sequence of kinks—useful but limited. Now imagine you repeatedly fold, then fold the folded paper again, and again. Each new fold multiplies the number of segments and creases because previous folds are carried forward and get folded over themselves. This is depth: compositions of simple transformations create exponentially rich structure compared to stacking many independent creases side by side.
Another analogy: think of a kitchen assembly line. A shallow network is like one long station with many workers doing tasks in parallel—each worker adds one simple feature. A deep network is multiple stations in sequence, where outputs from one stage are refined by the next. With the same total number of workers, the sequential pipeline can build far more intricate products through composition.
For ReLU networks, each neuron adds a hinge (a piecewise-linear kink) along some direction. With a single hidden layer, all hinges are applied directly to the input, so you only get about as many kinks as you have neurons. With multiple layers, hinges are applied to previously hinge-made surfaces, recursively subdividing space. A 1D demonstration shows this vividly: by composing a simple triangular wave map k times, you can create an oscillatory function with 2^k linear pieces using a network whose size grows only linearly with k.
In Boolean circuits, depth is like the number of logic levels. Some functions (e.g., parity) are easy if you can stack enough layers of gates, but if you restrict to too few layers, you need an explosion in the number of gates. Deep learning inherits this principle: certain compositions are compact only when you allow enough layers.
03 Formal Definition
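One standard way to set up the objects involved (a sketch in our notation, not the only formalization): fix an activation σ, depth L, and widths n_0, …, n_L, and define the function class realized by that architecture as

\mathcal{F}_{\sigma}(n_0,\dots,n_L) = \Big\{\, x \mapsto W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L \;:\; W_l \in \mathbb{R}^{n_l \times n_{l-1}},\; b_l \in \mathbb{R}^{n_l} \,\Big\}.

Expressivity results then compare such classes: which target functions lie in (or within ε of) \mathcal{F}_\sigma, and how the required depth L and widths n_l must scale with the complexity of the target.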
04 When to Use
Use depth when your target mapping has a compositional or hierarchical structure: language (characters→tokens→phrases→sentences), images (edges→motifs→parts→objects), audio (waveforms→phonemes→words). Deep networks excel at progressively refining features, allowing exponential growth of representable patterns with modest parameter growth. If you need sharp local variations (many kinks/oscillations), deep ReLU networks can realize them compactly by composing simple fold-like operations.
Use wider, shallower networks when: (1) the function is smooth and global with limited hierarchical structure (Barron-type functions), (2) you prioritize faster inference with fewer sequential steps (latency-constrained settings), or (3) you rely on convex or nearly convex optimization landscapes in two-layer settings.
Architectural principles guided by expressivity include: residual connections (enable very deep compositions without vanishing gradients), convolution (parameter sharing exploits locality; depth compounds receptive fields), normalization (stabilizes deep compositions), and attention (flexible, depth-amplified interactions). If memory is tight or you fear overfitting, consider depth with bottlenecks and sparsity; the lottery ticket hypothesis indicates that overparameterized models may contain efficient sparse subnetworks you can prune to.
In theory-informed design, if reducing depth is unavoidable (e.g., hardware constraints), be ready to scale width substantially to maintain expressivity for certain tasks. Conversely, modest depth increases can dramatically reduce the required width for compositional problems.
⚠️ Common Mistakes
- Assuming universal approximation means depth is irrelevant. While a single hidden layer can approximate any continuous function, the number of neurons needed can be exponentially larger than for a deep model; this impacts trainability, generalization, and compute.
- Equating parameter count with expressivity. Two networks with the same number of parameters can have very different function classes; arrangement across depth/width matters.
- Ignoring activation choice. ReLU networks are CPWL; they excel at piecewise-linear structure and can approximate smooth functions via many small pieces. Using saturating or polynomial activations changes expressivity and optimization behavior.
- Misreading linear-region counts. The number of regions along a particular line can be far smaller than the global maximum; empirical region counts are lower bounds. Also, more regions do not automatically mean better accuracy.
- Overgeneralizing Boolean circuit results. Depth separations in threshold circuits inform intuition but live in a discrete setting with different noise and optimization characteristics; treat analogies carefully.
- Confusing trainability with representability. A function might be representable by a compact deep net, yet gradient-based training might still struggle without proper initialization, normalization, or data.
- Overlooking tensor perspectives. Failing to exploit hierarchical structure (e.g., via convolution or attention) can force shallow networks to learn enormous flat decompositions, wasting parameters.
- Treating lottery tickets as a silver bullet. Sparse subnetworks exist, but finding them reliably and efficiently is nontrivial; pruning strategies and retraining details matter.
Key Formulas
Universal Approximation (Two-Layer)
Explanation: Any continuous function on a compact set can be approximated uniformly by a single-hidden-layer neural network with enough neurons. This shows possibility, not efficiency.
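A standard statement of the theorem (the precise hypotheses on the activation σ vary across versions, e.g., sigmoidal or non-polynomial; this is a sketch of the usual form): for any continuous f on a compact set K ⊂ R^{n_0} and any ε > 0, there exist a width m and parameters a_i, b_i ∈ R, w_i ∈ R^{n_0} with

\sup_{x \in K} \Big| f(x) - \sum_{i=1}^{m} a_i\, \sigma(w_i^{\top} x + b_i) \Big| < \varepsilon.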
Montúfar Lower Bound on Linear Regions
Explanation: A deep ReLU network with input dimension n0 and widths can realize at least this many linear regions. It formalizes that depth can multiply regions.
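One common form of the bound (following Montúfar et al., 2014, reproduced here as a sketch; check the paper for the exact hypotheses), for a ReLU network with input dimension n_0 and hidden widths n_1, …, n_L, each at least n_0:

\#\text{linear regions} \;\ge\; \Bigg( \prod_{l=1}^{L-1} \Big\lfloor \frac{n_l}{n_0} \Big\rfloor^{\,n_0} \Bigg) \sum_{j=0}^{n_0} \binom{n_L}{j}.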
Global Upper Bound on Regions
Explanation: With N total ReLU neurons and input dimension n0, the maximum number of linear regions is at most this combinatorial quantity. It constrains how complex the partition can get.
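Two commonly cited bounds of this combinatorial type (the choice of which to display is ours): the hyperplane-arrangement bound for one hidden layer of N ReLUs, and a layer-wise product bound for depth L with widths n_1, …, n_L:

\#\text{regions} \;\le\; \sum_{j=0}^{n_0} \binom{N}{j} \quad \text{(one hidden layer)}, \qquad \#\text{regions} \;\le\; \prod_{l=1}^{L} \sum_{j=0}^{n_0} \binom{n_l}{j} \quad \text{(depth } L\text{)}.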
Telgarsky Triangle Composition
Explanation: Composing a simple triangular map k times yields a 1D function with 2^k linear pieces. A ReLU network with depth O(k) and small width can implement this.
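Concretely, with the tent map Δ(x) = 2 min(x, 1−x) on [0,1], which one ReLU layer can write as Δ(x) = 2 ReLU(x) − 4 ReLU(x − 1/2) + 2 ReLU(x − 1) (the form used in the code examples below), the k-fold composition

\Delta^{\circ k} = \underbrace{\Delta \circ \cdots \circ \Delta}_{k\ \text{times}}

has exactly 2^k linear pieces on [0,1], yet is computed by a ReLU network of depth O(k) and constant width.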
Capacity vs Parameters
Explanation: The VC dimension of many neural network classes scales on the order of the number of parameters times a logarithmic factor. It links learnability to model size.
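In symbols, with W the number of parameters (the precise bound depends on the unit type; for ReLU networks the known bounds carry an extra depth factor, roughly Θ(WL log W), so treat this as the coarse scaling the explanation refers to):

\mathrm{VCdim}(\mathcal{F}) = O\big(W \log W\big).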
Depth Hierarchy (Threshold Circuits)
Explanation: For each depth d, there exists a Boolean function computed by a polynomial-size depth-d threshold circuit but requiring exponential size at depth d−1. This shows exponential depth separation.
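A symbolic restatement of the claim above (notation ours, not an independently verified statement of the constants): for each depth d there is a family f_d : \{0,1\}^n \to \{0,1\} computable by threshold circuits of depth d and size n^{O(1)}, while every depth-(d−1) threshold circuit computing f_d has size 2^{n^{\Omega(1)}}.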
Barron Approximation Rate
Explanation: For functions with finite Barron norm B, two-layer networks achieve mean-squared error decreasing like 1/m in the number of hidden units. This identifies when shallow nets are efficient.
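Barron's rate (1993): if f has finite Barron norm B (the first moment of its Fourier magnitude spectrum), then among two-layer networks with m hidden units there is an approximant f_m with

\| f - f_m \|_{L^2(\mu)}^2 \;\le\; \frac{C\, B^2}{m},

where μ is a probability measure on a bounded domain and C depends only on the domain's radius; notably, the rate in m does not degrade with the input dimension.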
Parameter Count (Fully Connected)
Explanation: The total number of parameters (weights and biases) in an L-layer fully connected network. It is important for capacity and memory considerations.
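For a fully connected network with layer widths n_0 (input), n_1, …, n_L and one bias per unit:

P \;=\; \sum_{l=1}^{L} \big( n_{l-1}\, n_l + n_l \big) \;=\; \sum_{l=1}^{L} (n_{l-1} + 1)\, n_l.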
Hierarchical Rank Growth (Tensor View)
Explanation: In tensor-network analogies for deep nets, representational rank can grow multiplicatively with depth. This explains compact representations of complex interactions.
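Schematically (a restatement of the sentence above rather than a precise theorem; the rigorous statements live in the hierarchical-Tucker / tensor-network literature): ranks can compound across levels as roughly \prod_{l=1}^{L} r_l for a depth-L hierarchy, whereas flattening the same tensor into a shallow CP-style decomposition can require rank exponential in the order of the interaction.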
Forward Pass Cost
Explanation: Time per forward pass is linear in the number of multiplications/additions across layers; space is dominated by activations and parameters. This affects sampling-based expressivity probes.
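A minimal accounting for a fully connected network with widths n_0, …, n_L (ignoring the cost of the activation function itself and any batching):

\mathrm{Time} = O\!\Big(\sum_{l=1}^{L} n_{l-1}\, n_l\Big) \text{ per input}, \qquad \mathrm{Space} = O\!\Big(P + \sum_{l=0}^{L} n_l\Big),

with P the parameter count above; the activation term matters when intermediate activations must be stored (e.g., for backpropagation or for region-counting probes).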
Complexity Analysis
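A brief sketch based on the code examples below (these costs refer to those specific programs, not to expressivity in general): Example 1 evaluates the k-fold tent-map composition at G + 1 grid points, costing O(Gk) time and O(G) memory even though the represented function has 2^k linear pieces, and the region count itself is a single O(G) pass over discrete slopes. Example 2 builds the normal equations for M = K + 2 features over N samples and solves them by Cholesky, costing O(N M^2) to accumulate A^T A, O(M^3) to factor, and O(M^2) memory.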
Code Examples
#include <bits/stdc++.h>
using namespace std;

// ReLU activation
inline double relu(double z) { return z > 0.0 ? z : 0.0; }

// One-layer triangle (tent) map tau(x) on [0,1], peaking at tau(0.5) = 1:
//   tau(x) = 2*ReLU(x) - 4*ReLU(x - 0.5) + 2*ReLU(x - 1)
// This is a piecewise-linear 'tent' with 2 linear pieces that maps [0,1] onto [0,1],
// so repeated composition keeps folding the interval.
inline double triangle_once(double x) {
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0);
}

// Compose the triangle map k times: f_k = tau o tau o ... o tau (k times).
// This can be implemented by feeding the output back into tau repeatedly.
double triangle_composed(double x, int k) {
    double y = x;
    for (int i = 0; i < k; ++i) y = triangle_once(y);
    return y;
}

// Estimate the number of linear pieces by sampling and counting slope changes.
// For a piecewise-linear function f, the discrete slope is constant within a piece.
int count_linear_pieces(function<double(double)> f, double a, double b, int G, double tol = 1e-3) {
    vector<double> xs(G + 1), ys(G + 1);
    for (int i = 0; i <= G; ++i) {
        xs[i] = a + (b - a) * (double)i / G;
        ys[i] = f(xs[i]);
    }
    double dx = (b - a) / G;
    // compute discrete slopes
    vector<double> slopes(G);
    for (int i = 0; i < G; ++i) slopes[i] = (ys[i + 1] - ys[i]) / dx;
    // count significant slope changes
    int pieces = 1; // at least one piece
    for (int i = 1; i < G; ++i) {
        if (fabs(slopes[i] - slopes[i - 1]) > tol) pieces++;
    }
    return pieces;
}

int main() {
    cout.setf(std::ios::fixed); cout << setprecision(6);

    // Demonstrate exponential growth of linear pieces with depth k.
    // We evaluate f_k on [0,1] and count the number of linear segments.
    vector<int> ks = {1, 2, 3, 4, 5, 6, 7, 8};
    // Use a power-of-two grid so every breakpoint j/2^k of f_k lands exactly on a sample
    // point; otherwise each breakpoint straddles a grid cell and gets counted twice.
    int G = 1 << 15;

    cout << "k, estimated_pieces, expected_pieces (2^k)\n";
    for (int k : ks) {
        auto fk = [k](double x) { return triangle_composed(x, k); };
        int pieces = count_linear_pieces(fk, 0.0, 1.0, G);
        int expected = 1 << k; // 2^k
        cout << k << ", " << pieces << ", " << expected << "\n";
    }

    // Print a small sample of values for visualization
    int kvis = 5;
    auto fvis = [kvis](double x) { return triangle_composed(x, kvis); };
    cout << "\nSample values for k=" << kvis << " on [0,1]:\n";
    for (int i = 0; i <= 10; ++i) {
        double x = i / 10.0;
        cout << "x=" << x << ", f_k(x)=" << fvis(x) << "\n";
    }

    return 0;
}
We implement a simple 1D ReLU network block (the tent map) and compose it k times. The composition can be realized by a deep network of width 3 per triangle layer (or width 2 with minor variations) and depth proportional to k. Sampling the resulting function on [0,1] and counting slope changes shows that the number of linear pieces grows as 2^k even though evaluation time grows only linearly in k; the sampling grid is chosen as a power of two so that the breakpoints of f_k fall exactly on grid points and each is counted once. This empirically demonstrates depth-induced exponential expressivity for ReLU networks in 1D.
#include <bits/stdc++.h>
using namespace std;

// Target: deep function via k-fold composition of the triangle (tent) map (same as Example 1).
inline double relu(double z) { return z > 0.0 ? z : 0.0; }
inline double triangle_once(double x) { return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0); }
double triangle_composed(double x, int k) { double y = x; for (int i = 0; i < k; ++i) y = triangle_once(y); return y; }

// Build fixed ReLU features for a one-hidden-layer model on [0,1]:
//   phi_0(x) = 1 (bias), phi_1(x) = x (optional), and for i = 0..K-1: phi_{i+2}(x) = max(0, x - t_i),
// where the t_i are uniformly spaced knots in (0,1). We learn only the output weights w by least squares.
struct ShallowReLU {
    vector<double> knots; // t_i
    bool use_x = true;    // include linear term x

    explicit ShallowReLU(int K, bool use_x_term = true) : use_x(use_x_term) {
        knots.resize(K);
        for (int i = 0; i < K; ++i) knots[i] = (i + 1.0) / (K + 1.0); // avoid endpoints
    }

    int feature_dim() const { return 1 + (use_x ? 1 : 0) + (int)knots.size(); }

    // Evaluate feature vector phi(x)
    void features(double x, vector<double>& phi) const {
        phi.assign(feature_dim(), 0.0);
        int idx = 0;
        phi[idx++] = 1.0;           // bias
        if (use_x) phi[idx++] = x;  // linear term
        for (double t : knots) phi[idx++] = relu(x - t);
    }
};

// Solve (A^T A + lambda I) w = A^T y via Cholesky, given streaming access to features.
// We accumulate AtA (M x M) and Aty (M) over N samples.
struct RidgeSolver {
    int M; double lambda;
    vector<double> AtA; // row-major M x M
    vector<double> Aty; // size M
    explicit RidgeSolver(int M, double lambda = 1e-6) : M(M), lambda(lambda), AtA(M * M, 0.0), Aty(M, 0.0) {}

    void add_sample(const vector<double>& phi, double y) {
        // AtA += phi * phi^T;  Aty += phi * y
        for (int i = 0; i < M; ++i) {
            Aty[i] += phi[i] * y;
            const double pi = phi[i];
            double* row = &AtA[i * M];
            for (int j = 0; j < M; ++j) row[j] += pi * phi[j];
        }
    }

    // Cholesky factorization of the SPD matrix (AtA + lambda I), then solve for w.
    vector<double> solve() {
        for (int i = 0; i < M; ++i) AtA[i * M + i] += lambda; // add ridge

        vector<double> L = AtA; // will hold the lower-triangular Cholesky factor in place
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j <= i; ++j) {
                double sum = L[i * M + j];
                for (int k = 0; k < j; ++k) sum -= L[i * M + k] * L[j * M + k];
                if (i == j) {
                    if (sum <= 0) sum = 1e-12; // numerical guard
                    L[i * M + j] = sqrt(sum);
                } else {
                    L[i * M + j] = sum / L[j * M + j];
                }
            }
            for (int j = i + 1; j < M; ++j) L[i * M + j] = 0.0; // zero the upper triangle for clarity
        }

        // Solve L z = Aty (forward substitution)
        vector<double> z(M, 0.0);
        for (int i = 0; i < M; ++i) {
            double sum = Aty[i];
            for (int k = 0; k < i; ++k) sum -= L[i * M + k] * z[k];
            z[i] = sum / L[i * M + i];
        }
        // Solve L^T w = z (back substitution)
        vector<double> w(M, 0.0);
        for (int i = M - 1; i >= 0; --i) {
            double sum = z[i];
            for (int k = i + 1; k < M; ++k) sum -= L[k * M + i] * w[k];
            w[i] = sum / L[i * M + i];
        }
        return w;
    }
};

// Predict using learned weights
double predict(const ShallowReLU& model, const vector<double>& w, double x) {
    vector<double> phi; model.features(x, phi);
    double y = 0.0;
    for (int i = 0; i < (int)w.size(); ++i) y += w[i] * phi[i];
    return y;
}

int main() {
    cout.setf(std::ios::fixed); cout << setprecision(6);

    // Target is the deep function f_k for moderate k, to induce many pieces.
    int k = 6; // 2^k = 64 pieces

    // Training data: N samples on [0,1]
    const int N = 4096;
    vector<double> xs(N), ys(N);
    for (int i = 0; i < N; ++i) {
        xs[i] = (i + 0.5) / N;
        ys[i] = triangle_composed(xs[i], k);
    }

    // Try increasing widths (number of knots) for the shallow model.
    vector<int> widths = {8, 16, 32, 64, 128};

    cout << "Shallow ReLU approximation of deep f_k (k=" << k << ")\n";
    cout << "knots, features, train_MSE\n";

    for (int K : widths) {
        ShallowReLU model(K, true);
        int M = model.feature_dim();
        RidgeSolver solver(M, 1e-6);
        vector<double> phi;
        for (int i = 0; i < N; ++i) {
            model.features(xs[i], phi);
            solver.add_sample(phi, ys[i]);
        }
        vector<double> w = solver.solve();

        // Evaluate training MSE
        double mse = 0.0;
        for (int i = 0; i < N; ++i) {
            double e = predict(model, w, xs[i]) - ys[i];
            mse += e * e;
        }
        mse /= N;
        cout << K << ", " << M << ", " << mse << "\n";
    }

    return 0;
}
We approximate the deep triangle composition f_k (with 2^k linear pieces) using a single-hidden-layer ReLU model with fixed hinge locations and learned output weights (ridge regression). As we increase the number of knots (width), the training MSE decreases. To closely match f_k, the shallow model typically needs width on the order of the number of pieces (≈ 2^k), illustrating width blow-up for shallow approximation of a deep compositional function.