Neural Tangent Kernel (NTK) Theory
Key Points
- The Neural Tangent Kernel (NTK) connects very wide neural networks to classical kernel methods, letting us study training as if it were kernel regression.
- At infinite width and small learning rate, the network’s Jacobian features stay nearly constant, so gradient descent becomes linear in function space.
- The NTK is defined as K(x, x') = E[∇_θ f(x; θ)ᵀ ∇_θ f(x'; θ)] at random initialization, where f is the network output and θ are parameters.
- Under squared loss and gradient flow, predictions follow df/dt = −K (f − y), yielding exponential convergence when K is positive definite on the data.
- In the infinite-width limit, the NTK does not change during training, implying global convergence for sufficiently overparameterized models.
- NTK explains behavior at initialization and early training but does not capture feature learning in practical, finite-width networks.
- NTK links deep learning to Gaussian processes and Reproducing Kernel Hilbert Spaces (RKHS), giving theoretical tools for generalization and optimization.
- Empirical NTKs for finite networks can be computed by taking parameter gradients and forming their dot products, enabling NTK-based kernel regression baselines.
Prerequisites
- Multivariable calculus and gradients — NTK is defined using parameter gradients (Jacobians) and relies on chain rule and differential calculus.
- Linear algebra and eigen-decomposition — Training dynamics and kernel regression solutions use Gram matrices, positive definiteness, and eigen-analysis.
- Kernel methods and RKHS — NTK training equivalence maps neural networks to kernel regression in an RKHS defined by the NTK.
- Gradient descent and gradient flow — The linear dynamics under NTK are derived from gradient flow and relate to step-size choices in GD.
- Gaussian processes — Random wide networks at initialization converge to GPs (NNGP), closely related to NTK analysis.
- Feedforward neural networks and backpropagation — Computing tangent features requires understanding forward and backward passes through an MLP.
- Numerically stable linear solves — Kernel ridge regression requires stable solutions to linear systems with potentially ill-conditioned matrices.
Detailed Explanation
01 Overview
The Neural Tangent Kernel (NTK) is a theoretical framework that explains how very wide neural networks train under gradient descent. It shows that when a network’s width tends to infinity and we use small learning rates, the network’s behavior becomes linear in function space: predictions evolve according to a fixed kernel, the NTK. Concretely, the NTK captures how small parameter changes affect outputs by measuring the inner products of parameter gradients at different inputs. In this regime, training a neural network is equivalent to performing kernel regression with the NTK as the kernel. This connection bridges deep learning and classical kernel methods, allowing the use of tools from Gaussian processes (GP) and Reproducing Kernel Hilbert Spaces (RKHS) to analyze optimization and generalization. The key payoff is simplicity: training dynamics are governed by a linear differential equation whose solution is explicit and whose convergence is guaranteed under mild conditions. However, the NTK’s most precise statements hold in the infinite-width limit, whereas practical networks have finite width and learn features over time, meaning their NTK changes during training. Still, the NTK provides accurate predictions for very wide networks and at early training, and it offers principled baselines and insights into how architecture and initialization influence learning.
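To make the "linear differential equation" claim concrete, here is a minimal, self-contained C++ sketch (an illustration added here, with made-up kernel values and step size) that discretizes the function-space dynamics df/dt = −K (f − y) with forward Euler and prints the residual norm shrinking step by step.

#include <cstdio>
#include <cmath>

// Toy demonstration (illustrative values): simulate df/dt = -K (f - y)
// by forward Euler, f <- f - eta * K (f - y), for a fixed 2x2
// positive-definite "kernel" K and targets y.
int main() {
    double K[2][2] = {{2.0, 0.5}, {0.5, 1.0}}; // symmetric positive definite
    double y[2] = {1.0, -1.0};                 // targets
    double f[2] = {0.0, 0.0};                  // initial predictions
    double eta = 0.1;                          // small step size (discretized gradient flow)

    for (int t = 0; t <= 50; ++t) {
        double r[2] = {f[0] - y[0], f[1] - y[1]};          // residual f - y
        if (t % 10 == 0)
            printf("step %2d  residual norm = %.6f\n", t, std::sqrt(r[0]*r[0] + r[1]*r[1]));
        // Euler step: f <- f - eta * K r
        double step0 = eta * (K[0][0]*r[0] + K[0][1]*r[1]);
        double step1 = eta * (K[1][0]*r[0] + K[1][1]*r[1]);
        f[0] -= step0;
        f[1] -= step1;
    }
    return 0;
}

Because K never changes, every step contracts the residual by the same linear map; this is the discrete-time counterpart of the exponential convergence described above.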
02 Intuition & Analogies
Imagine tuning a complex mixing console with thousands of knobs (the network’s parameters). Turning any one knob slightly changes the sound at the speakers. If you map how sensitive the output is to each knob for a particular song (input), you get a long vector of sensitivities. Two songs are considered similar by this console if their sensitivity vectors point in similar directions; then, nudging the same knobs will affect both songs similarly. The NTK is exactly this similarity: it measures how similarly two inputs respond to infinitesimal parameter tweaks by taking the dot product of their sensitivity vectors. Now, picture training as repeatedly making tiny adjustments to the knobs to reduce the difference between the sound and a target. If the sensitivity vectors don’t change much as you train (because there are so many knobs that each moves very little), then the way changes propagate from knob-space to output-space stays essentially fixed. This means the entire training process acts like a simple, linear system in output-space governed by a fixed matrix—the NTK Gram matrix on your dataset. In that case, you can predict how the outputs evolve over time using the same math you’d use for a damped spring or a resistor-capacitor circuit: exponential convergence toward the targets along directions defined by the kernel’s eigenvectors. As networks become extremely wide, each individual weight update becomes vanishingly small and the sensitivity vectors barely change; the kernel effectively freezes. That’s why infinite-width networks follow neat, predictable, kernel-like dynamics. In contrast, in regular-width networks, sensitivity vectors do change (features evolve), so the effective kernel drifts during training—this is real feature learning, which NTK only partially captures.
03 Formal Definition
For a scalar-output network f(x; θ) with parameter vector θ, the empirical NTK at parameters θ is the inner product of parameter gradients, K̂_θ(x, x') = ∇_θ f(x; θ)ᵀ ∇_θ f(x'; θ). Equivalently, with the tangent feature map φ_θ(x) = ∇_θ f(x; θ), we have K̂_θ(x, x') = ⟨φ_θ(x), φ_θ(x')⟩. Under gradient flow on the squared loss L(θ) = ½ Σ_i (f(x_i; θ) − y_i)², the chain rule gives the function-space dynamics df(X)/dt = −K̂_θ(X, X)(f(X) − y), which hold exactly for any finite network. In the infinite-width limit with NTK parameterization, K̂_θ converges at initialization to a deterministic kernel K(x, x') = E_θ[∇_θ f(x; θ)ᵀ ∇_θ f(x'; θ)] and stays constant throughout training, so the dynamics become linear with this fixed kernel. The formulas in the Key Formulas section state each of these objects individually.
04 When to Use
Use NTK theory when analyzing optimization and generalization of very wide networks, especially to understand early training behavior. It is particularly helpful for: 1) proving global convergence of gradient descent in overparameterized settings where K(X, X) is positive definite; 2) comparing architectures and initializations by their induced kernels, predicting which ones yield faster convergence or better inductive biases; 3) building kernel regression baselines using empirical NTKs on small datasets; 4) sanity-checking learning-rate and scaling choices that aim to keep the NTK stable (e.g., NTK parameterization); and 5) connecting neural network predictions to Gaussian process priors and RKHS norms for interpretability and generalization bounds. Avoid relying solely on NTK predictions when feature learning is crucial (e.g., narrow or moderately wide networks, long training, large learning rates, batch normalization with non-NTK scalings), because the empirical NTK then drifts during training. In those cases, NTK still offers a first-order approximation or early-time description but not a full account of representation learning. Practically, if you need a fast, theoretically grounded baseline on small to medium datasets, computing the empirical NTK and running kernel ridge regression can provide a strong and stable reference without training a full neural network.
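As a rough sanity check for point 4, the sketch below estimates the largest eigenvalue λ_max of an NTK Gram matrix by power iteration and reports the standard stability bound η < 2/λ_max for gradient descent on the linearized squared-loss dynamics. This is an added illustration; the 3×3 matrix is a made-up stand-in for an empirical NTK such as the one built in the first code example below.

#include <cstdio>
#include <cmath>
#include <vector>
using namespace std;

// Estimate lambda_max of a symmetric PSD matrix by power iteration, then report
// the step-size bound eta < 2 / lambda_max for the linearized dynamics
// f <- f - eta * K (f - y). The matrix below is an illustrative stand-in for an
// empirical NTK Gram matrix.
int main() {
    vector<vector<double>> K = {
        {3.0, 0.8, 0.2},
        {0.8, 2.0, 0.5},
        {0.2, 0.5, 1.0}
    };
    int n = (int)K.size();
    vector<double> v(n, 1.0);          // arbitrary nonzero start vector
    double lambda = 0.0;
    for (int it = 0; it < 200; ++it) { // power iteration
        vector<double> w(n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) w[i] += K[i][j] * v[j];
        double norm = 0.0;
        for (int i = 0; i < n; ++i) norm += w[i] * w[i];
        norm = sqrt(norm);
        for (int i = 0; i < n; ++i) v[i] = w[i] / norm;
        lambda = norm;                 // ||K v|| with unit v converges to lambda_max
    }
    printf("estimated lambda_max = %.6f\n", lambda);
    printf("stable step sizes: eta < %.6f\n", 2.0 / lambda);
    return 0;
}

In practice you would run the same loop on the empirical K(X, X) built from tangent features, as in the code examples further down.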
⚠️ Common Mistakes
Common pitfalls include: 1) Assuming the NTK is constant for any finite-width network; in practice, the empirical NTK changes during training, especially with large learning rates or normalization layers; 2) Confusing the NNGP kernel (the covariance of random outputs at initialization) with the NTK (the covariance of gradients); they coincide only in special cases; 3) Ignoring discretization: the continuous-time gradient flow analysis translates to gradient descent only for sufficiently small learning rates; otherwise, stability and convergence rates differ; 4) Believing NTK implies no feature learning at all; it only describes the infinite-width limit—finite models do learn features as their tangent features evolve; 5) Numerical issues when using NTK in practice: the Gram matrix K(X, X) can be ill-conditioned, so kernel ridge regression should include regularization (λI) and use stable solvers (Cholesky with jitter); 6) Overinterpreting early-time matches between NTK predictions and network training as proof of full equivalence; beyond the “lazy” regime, nonlinear effects dominate; 7) Ignoring parameterization: different scalings (NTK parameterization vs. standard) lead to different infinite-width limits and different kernel dynamics; and 8) Forgetting multi-output settings: for vector-valued outputs, NTK is matrix-valued, and care is needed to aggregate across outputs.
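The following is a minimal sketch of the "Cholesky with jitter" remedy from pitfall 5, written for this article rather than taken from any library: if the factorization of the Gram matrix fails because of near-singularity, the diagonal jitter is increased geometrically until it succeeds. The starting jitter, growth factor, and the rank-deficient 2×2 matrix are illustrative choices.

#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// Attempt a Cholesky factorization on a copy; return false if the matrix
// is not numerically positive definite.
static bool try_cholesky(vector<vector<double>> A) {
    int n = (int)A.size();
    for (int k = 0; k < n; ++k) {
        double s = A[k][k];
        for (int j = 0; j < k; ++j) s -= A[k][j] * A[k][j];
        if (s <= 0.0) return false;
        A[k][k] = sqrt(s);
        for (int i = k + 1; i < n; ++i) {
            double t = A[i][k];
            for (int j = 0; j < k; ++j) t -= A[i][j] * A[k][j];
            A[i][k] = t / A[k][k];
        }
    }
    return true;
}

// Increase the diagonal jitter until K + jitter*I factorizes; returns the jitter used.
static double stabilize_with_jitter(const vector<vector<double>>& K) {
    double jitter = 0.0;
    for (int attempt = 0; attempt < 30; ++attempt) {
        vector<vector<double>> A = K;
        for (size_t i = 0; i < A.size(); ++i) A[i][i] += jitter;
        if (try_cholesky(A)) return jitter;
        jitter = (jitter == 0.0 ? 1e-10 : jitter * 10.0); // grow until it succeeds
    }
    return -1.0; // give up: matrix is badly broken
}

int main() {
    // Rank-deficient 2x2 Gram matrix (two identical inputs give identical tangent features).
    vector<vector<double>> K = {{1.0, 1.0}, {1.0, 1.0}};
    printf("jitter needed for a stable Cholesky: %g\n", stabilize_with_jitter(K));
    return 0;
}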
Key Formulas
Empirical NTK
K̂_θ(x, x') = ∇_θ f(x; θ)ᵀ ∇_θ f(x'; θ)
Explanation: The NTK at parameters θ is the inner product of parameter gradients for two inputs. It measures how similarly small parameter changes affect both inputs.
Infinite-Width NTK
K(x, x') = lim_{width→∞} K̂_θ(x, x') = E_θ[∇_θ f(x; θ)ᵀ ∇_θ f(x'; θ)]
Explanation: In the infinite-width limit, randomness averages out and the NTK becomes a deterministic kernel equal to the expectation over random initializations.
Neural Tangent Feature Map
φ_θ(x) = ∇_θ f(x; θ), so that K̂_θ(x, x') = ⟨φ_θ(x), φ_θ(x')⟩
Explanation: This is the Jacobian of the output with respect to parameters. NTK is the inner product of these features for two inputs.
Gradient Flow for Squared Loss
dθ/dt = −∇_θ L(θ), with L(θ) = ½ Σ_i (f(x_i; θ) − y_i)²
Explanation: Continuous-time training under squared error. It is used to derive linear dynamics for predictions under NTK.
Function-Space Dynamics
df(X)/dt = −K̂_θ(X, X) (f(X) − y)
Explanation: The time derivative of predictions equals minus the NTK Gram matrix times the residual. This is exact for any finite network under gradient flow.
Closed-Form Solution (Infinite-Width)
f_t(X) = y + e^{−K(X, X) t} (f_0(X) − y)
Explanation: When the NTK is constant, predictions decay exponentially toward labels along the kernel’s eigen-directions.
Kernel Ridge Regression Predictor
f̂(x) = k(x)ᵀ (K(X, X) + λI)⁻¹ y, where k(x) = [K(x, x_1), …, K(x, x_n)]ᵀ
Explanation: The fitted function minimizing squared loss plus λ times the squared RKHS norm has this form, where k(x) collects the kernel evaluations with the training points.
First-Order Linearization
f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀)
Explanation: Near initialization, the network behaves like a linear model in the tangent features. This approximation becomes exact at infinite width.
Eigen-Dynamics
K = Σ_i λ_i v_i v_iᵀ  ⇒  v_iᵀ (f_t − y) = e^{−λ_i t} v_iᵀ (f_0 − y)
Explanation: Decomposing K reveals independent exponential decay along eigenvectors with rates given by eigenvalues.
NNGP Kernel
K_NNGP(x, x') = E_θ[f(x; θ) f(x'; θ)]
Explanation: The covariance of outputs at initialization forms a Gaussian process prior. It differs from the NTK, which uses parameter gradients.
Condition Number
κ(K) = λ_max(K) / λ_min(K)
Explanation: The ratio of largest to smallest eigenvalue controls convergence speed and numerical stability when solving linear systems with K.
Kernel Regression Complexity
O(n³) to factor and solve (K + λI) α = y; O(m·n) to predict on m test points given α
Explanation: Solving (K + λI) α = y by Cholesky takes cubic time in n; predicting on m test points with precomputed α is linear in n per test point.
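To tie the eigen-dynamics and condition-number formulas together, here is a small illustrative C++ sketch (a toy example with hand-picked kernel values, not from the original text): it eigen-decomposes a 2×2 symmetric positive-definite matrix in closed form and prints the per-mode decay factors e^{−λ_i t} predicted by the constant-NTK solution.

#include <cstdio>
#include <cmath>

// Toy 2x2 example: eigen-decompose K analytically, then show per-mode decay
// rates e^{-lambda_i t} and the condition number lambda_max / lambda_min.
int main() {
    // Symmetric positive-definite "kernel" (illustrative values): K = [[a, b], [b, c]]
    double a = 2.0, b = 0.5, c = 1.0;
    double tr = a + c, det = a * c - b * b;
    double disc = sqrt(tr * tr / 4.0 - det);
    double lam1 = tr / 2.0 + disc;             // largest eigenvalue
    double lam2 = tr / 2.0 - disc;             // smallest eigenvalue

    printf("eigenvalues: %.4f, %.4f   condition number: %.4f\n",
           lam1, lam2, lam1 / lam2);

    // Under constant-NTK gradient flow, the residual component along eigenvector i
    // decays like e^{-lambda_i t}: the large-eigenvalue direction converges fast,
    // the small-eigenvalue direction converges slowly.
    for (double t = 0.0; t <= 4.0; t += 1.0) {
        printf("t = %.1f   e^{-lam1 t} = %.6f   e^{-lam2 t} = %.6f\n",
               t, exp(-lam1 * t), exp(-lam2 * t));
    }
    return 0;
}

The gap between the two decay curves is exactly what a large condition number means in practice: some directions of the residual are fit almost immediately while others linger.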
Complexity Analysis
For n training points and P parameters, building the empirical NTK Gram matrix requires n gradient evaluations (one tangent-feature vector per input) and O(n²P) work for the pairwise dot products, with O(n²) memory to store the matrix. Kernel ridge regression then needs O(n³) time for the Cholesky solve of (K + λI)α = y, and each test prediction needs one gradient evaluation plus O(nP) work for the dot products against the stored training gradients. These costs appear directly in the two code examples below.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// A simple 2-layer MLP: f(x) = a^T tanh(W x + b) + c
// We compute the gradient of f w.r.t. all parameters to build the empirical NTK.

struct MLP {
    int d; // input dimension
    int h; // hidden width
    // Parameters
    vector<double> W; // size h*d, row-major: W[i*d + j]
    vector<double> b; // size h
    vector<double> a; // size h
    double c;         // scalar bias

    MLP(int d_, int h_, unsigned seed=42) : d(d_), h(h_), W(h_*d_), b(h_), a(h_), c(0.0) {
        mt19937 rng(seed);
        normal_distribution<double> nd(0.0, 1.0);
        // NTK-friendly scaling for tanh: weights ~ N(0, 1/d)
        double scale_in = 1.0 / sqrt((double)d);
        for (int i = 0; i < h*d; ++i) W[i] = nd(rng) * scale_in;
        for (int i = 0; i < h; ++i) b[i] = 0.0; // zero biases common at init
        // Output weights ~ N(0, 1/h)
        double scale_out = 1.0 / sqrt((double)h);
        for (int i = 0; i < h; ++i) a[i] = nd(rng) * scale_out;
        c = 0.0;
    }

    // Forward pass: returns f(x) and caches z and h for gradient
    double forward(const vector<double>& x, vector<double>& z, vector<double>& h_act) const {
        z.assign(h, 0.0);
        for (int i = 0; i < h; ++i) {
            double s = b[i];
            int base = i * d;
            for (int j = 0; j < d; ++j) s += W[base + j] * x[j];
            z[i] = s;
        }
        h_act.resize(h);
        for (int i = 0; i < h; ++i) h_act[i] = tanh(z[i]);
        double y = c;
        for (int i = 0; i < h; ++i) y += a[i] * h_act[i];
        return y;
    }

    // Compute gradient of f(x) w.r.t. all parameters, flattened as:
    // [W (h*d elems), b (h), a (h), c (1)]
    vector<double> grad(const vector<double>& x) const {
        vector<double> z, h_act;
        double y = forward(x, z, h_act);
        (void)y; // unused
        vector<double> g;
        g.reserve(h*d + 2*h + 1);
        // df/dc = 1
        // For tanh: dh/dz = 1 - tanh(z)^2
        vector<double> dhdz(h);
        for (int i = 0; i < h; ++i) dhdz[i] = 1.0 - h_act[i]*h_act[i];
        // df/dz_i = a_i * dhdz_i
        vector<double> df_dz(h);
        for (int i = 0; i < h; ++i) df_dz[i] = a[i] * dhdz[i];
        // Gradients for W: df/dW_{i,j} = df/dz_i * x_j
        for (int i = 0; i < h; ++i) {
            for (int j = 0; j < d; ++j) g.push_back(df_dz[i] * x[j]);
        }
        // Gradients for b: df/db_i = df/dz_i
        for (int i = 0; i < h; ++i) g.push_back(df_dz[i]);
        // Gradients for a: df/da_i = h_i
        for (int i = 0; i < h; ++i) g.push_back(h_act[i]);
        // Gradient for c
        g.push_back(1.0);
        return g;
    }
};

// Compute dot product of two vectors
static double dot(const vector<double>& u, const vector<double>& v) {
    double s = 0.0; int n = (int)u.size();
    for (int i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: small synthetic dataset in R^2
    int d = 2;  // input dimension
    int h = 64; // hidden width (moderately wide)
    int n = 6;  // number of samples

    vector<vector<double>> X = {
        {0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}, {0.5, 0.2}, {0.2, 0.7}
    };

    MLP net(d, h, /*seed=*/123);

    // Compute tangent features for each sample
    vector<vector<double>> G(n);
    for (int i = 0; i < n; ++i) G[i] = net.grad(X[i]);

    // Build NTK Gram matrix K = G * G^T
    vector<vector<double>> K(n, vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        K[i][i] = dot(G[i], G[i]);
        for (int j = i+1; j < n; ++j) {
            double kij = dot(G[i], G[j]);
            K[i][j] = kij;
            K[j][i] = kij;
        }
    }

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Empirical NTK Gram matrix (" << n << "x" << n << ") at initialization:\n";
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) cout << K[i][j] << (j+1==n?'\n':' ');
    }

    // Note: To approximate the infinite-width NTK, increase h and average over seeds.
    return 0;
}
This program defines a simple 2-layer MLP with tanh activation, computes the gradient of its scalar output with respect to all parameters for each input, and forms the empirical NTK Gram matrix by taking dot products of these gradients. This demonstrates how the NTK measures similarity between inputs in terms of parameter-sensitivity. Increasing the hidden width and averaging over initializations approximates the infinite-width NTK.
#include <bits/stdc++.h>
using namespace std;

// Reuse the same 2-layer MLP structure as before to compute gradients
struct MLP {
    int d, h; vector<double> W, b, a; double c;
    MLP(int d_, int h_, unsigned seed=42): d(d_), h(h_), W(h_*d_), b(h_), a(h_), c(0.0) {
        mt19937 rng(seed); normal_distribution<double> nd(0.0,1.0);
        double scale_in = 1.0/sqrt((double)d);
        for (int i=0;i<h*d;++i) W[i]=nd(rng)*scale_in;
        for (int i=0;i<h;++i) b[i]=0.0;
        double scale_out = 1.0/sqrt((double)h);
        for (int i=0;i<h;++i) a[i]=nd(rng)*scale_out;
        c=0.0;
    }
    double forward(const vector<double>& x, vector<double>& z, vector<double>& h_act) const {
        z.assign(h,0.0);
        for(int i=0;i<h;++i){ double s=b[i]; int base=i*d; for(int j=0;j<d;++j) s+=W[base+j]*x[j]; z[i]=s; }
        h_act.resize(h);
        for(int i=0;i<h;++i) h_act[i]=tanh(z[i]);
        double y=c;
        for(int i=0;i<h;++i) y+=a[i]*h_act[i];
        return y;
    }
    vector<double> grad(const vector<double>& x) const {
        vector<double> z, h_act; (void)forward(x, z, h_act);
        vector<double> dhdz(h), df_dz(h);
        for(int i=0;i<h;++i){ dhdz[i]=1.0-h_act[i]*h_act[i]; df_dz[i]=a[i]*dhdz[i]; }
        vector<double> g; g.reserve(h*d+2*h+1);
        for(int i=0;i<h;++i) for(int j=0;j<d;++j) g.push_back(df_dz[i]*x[j]);
        for(int i=0;i<h;++i) g.push_back(df_dz[i]);
        for(int i=0;i<h;++i) g.push_back(h_act[i]);
        g.push_back(1.0);
        return g;
    }
};

static double dot(const vector<double>& u, const vector<double>& v){ double s=0.0; for(size_t i=0;i<u.size();++i) s+=u[i]*v[i]; return s; }

// Solve A x = b by Cholesky (A must be SPD). No pivoting; for small n only.
bool cholesky_solve(vector<vector<double>>& A, vector<double>& b) {
    int n = (int)A.size();
    // Cholesky decomposition: A = L L^T
    for (int k = 0; k < n; ++k) {
        double sum = A[k][k];
        for (int j = 0; j < k; ++j) sum -= A[k][j]*A[k][j];
        if (sum <= 0) return false; // not SPD
        A[k][k] = sqrt(sum);
        for (int i = k+1; i < n; ++i) {
            double s = A[i][k];
            for (int j = 0; j < k; ++j) s -= A[i][j]*A[k][j];
            A[i][k] = s / A[k][k];
        }
        for (int j = k+1; j < n; ++j) A[k][j] = 0.0; // store L in lower triangle
    }
    // Forward solve: L y = b
    vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < i; ++j) s -= A[i][j]*y[j];
        y[i] = s / A[i][i];
    }
    // Backward solve: L^T x = y
    vector<double> x(n);
    for (int i = n-1; i >= 0; --i) {
        double s = y[i];
        for (int j = i+1; j < n; ++j) s -= A[j][i]*x[j];
        x[i] = s / A[i][i];
    }
    b = x; // overwrite b with solution
    return true;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1D regression: y = sin(2πx) + noise, train on few points, predict densely
    int d = 1, h = 128; unsigned seed = 7; MLP net(d, h, seed);

    int n = 16; vector<vector<double>> X(n, vector<double>(d)); vector<double> y(n);
    mt19937 rng(123); uniform_real_distribution<double> ud(0.0, 1.0); normal_distribution<double> nd(0.0, 0.05);
    for(int i=0;i<n;++i){ X[i][0] = (i+0.5)/n; y[i] = sin(2*M_PI*X[i][0]) + nd(rng); }

    // Compute gradients for train points
    vector<vector<double>> G(n);
    for (int i=0;i<n;++i) G[i] = net.grad(X[i]);

    // Build K = G G^T and add ridge λI (jitter)
    double lambda = 1e-3; vector<vector<double>> K(n, vector<double>(n, 0.0));
    for(int i=0;i<n;++i){ for(int j=i;j<n;++j){ double val = dot(G[i], G[j]); K[i][j]=K[j][i]=val; } K[i][i]+=lambda; }

    // Solve (K + λI) alpha = y
    vector<double> alpha = y; // copy
    if(!cholesky_solve(K, alpha)){
        cerr << "Cholesky failed: K not SPD. Try larger lambda.\n";
        return 1;
    }

    // Predict on test grid using k(x_*) = [grad(x_*)·grad(x_i)]_i
    int m = 100; cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "x true pred\n";
    for(int t=0;t<m;++t){
        double xval = (t+0.5)/m; vector<double> xvec = {xval};
        vector<double> gstar = net.grad(xvec);
        double fhat = 0.0; for(int i=0;i<n;++i) fhat += dot(gstar, G[i]) * alpha[i];
        double ftrue = sin(2*M_PI*xval);
        cout << xval << " " << ftrue << " " << fhat << "\n";
    }
    return 0;
}
This program fits a small 1D regression problem using kernel ridge regression with the empirical NTK computed from a randomly initialized 2-layer MLP. It builds the NTK Gram matrix on training points, adds a small ridge term for numerical stability, solves for α via Cholesky, and predicts on a test grid by computing the NTK vector k(x*). This illustrates the NTK-as-kernel view: without training parameters, we already obtain a reasonable predictor purely from initialization geometry.
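As a third, complementary example (a sketch added here, not part of the original pair), the program below trains the same kind of small MLP for a number of full-batch gradient-descent steps and measures how much the empirical NTK Gram matrix drifts from its value at initialization, via the relative Frobenius norm ||K_T − K_0||_F / ||K_0||_F. The widths, step count, and the conservative trace-based step size are assumptions chosen for illustration; the expectation from the theory above is that the relative drift shrinks as the width grows (the "lazy" regime) and is clearly nonzero at small width, which is the finite-width feature learning the NTK limit does not capture.

#include <bits/stdc++.h>
using namespace std;

// Same 2-layer tanh MLP as above, plus a full-batch gradient-descent step on
// squared loss so we can watch the empirical NTK drift during training.
struct MLP {
    int d, h; vector<double> W, b, a; double c;
    MLP(int d_, int h_, unsigned seed=42): d(d_), h(h_), W(h_*d_), b(h_), a(h_), c(0.0) {
        mt19937 rng(seed); normal_distribution<double> nd(0.0,1.0);
        for (int i=0;i<h*d;++i) W[i]=nd(rng)/sqrt((double)d);
        for (int i=0;i<h;++i) a[i]=nd(rng)/sqrt((double)h);
    }
    double forward(const vector<double>& x, vector<double>& h_act) const {
        h_act.resize(h); double y=c;
        for (int i=0;i<h;++i){ double s=b[i]; for (int j=0;j<d;++j) s+=W[i*d+j]*x[j]; h_act[i]=tanh(s); y+=a[i]*h_act[i]; }
        return y;
    }
    // Gradient of f(x) w.r.t. parameters, flattened as [W, b, a, c]
    vector<double> grad(const vector<double>& x) const {
        vector<double> h_act; (void)forward(x, h_act);
        vector<double> g; g.reserve(h*d+2*h+1);
        vector<double> df_dz(h);
        for (int i=0;i<h;++i) df_dz[i]=a[i]*(1.0-h_act[i]*h_act[i]);
        for (int i=0;i<h;++i) for (int j=0;j<d;++j) g.push_back(df_dz[i]*x[j]);
        for (int i=0;i<h;++i) g.push_back(df_dz[i]);
        for (int i=0;i<h;++i) g.push_back(h_act[i]);
        g.push_back(1.0); return g;
    }
    // One full-batch GD step on L = 0.5 * sum_i (f(x_i) - y_i)^2
    void gd_step(const vector<vector<double>>& X, const vector<double>& y, double eta) {
        int P = h*d + 2*h + 1; vector<double> dL(P, 0.0);
        for (size_t i=0;i<X.size();++i) {
            vector<double> h_act; double r = forward(X[i], h_act) - y[i];
            vector<double> g = grad(X[i]);
            for (int p=0;p<P;++p) dL[p] += r * g[p];
        }
        int p = 0;
        for (int i=0;i<h*d;++i) W[i] -= eta*dL[p++];
        for (int i=0;i<h;++i)   b[i] -= eta*dL[p++];
        for (int i=0;i<h;++i)   a[i] -= eta*dL[p++];
        c -= eta*dL[p];
    }
};

static double dot(const vector<double>& u, const vector<double>& v){ double s=0.0; for(size_t i=0;i<u.size();++i) s+=u[i]*v[i]; return s; }

// Empirical NTK Gram matrix at the current parameters
static vector<vector<double>> gram(const MLP& net, const vector<vector<double>>& X){
    int n=(int)X.size();
    vector<vector<double>> G(n), K(n, vector<double>(n, 0.0));
    for (int i=0;i<n;++i) G[i]=net.grad(X[i]);
    for (int i=0;i<n;++i) for (int j=0;j<n;++j) K[i][j]=dot(G[i],G[j]);
    return K;
}

static double rel_frobenius_diff(const vector<vector<double>>& A, const vector<vector<double>>& B){
    double num=0.0, den=0.0;
    for (size_t i=0;i<A.size();++i) for (size_t j=0;j<A.size();++j){
        double dd=A[i][j]-B[i][j]; num+=dd*dd; den+=A[i][j]*A[i][j];
    }
    return sqrt(num/den);
}

int main(){
    int d=1, n=8; vector<vector<double>> X(n, vector<double>(d)); vector<double> y(n);
    for (int i=0;i<n;++i){ X[i][0]=(i+0.5)/n; y[i]=sin(2*M_PI*X[i][0]); }

    for (int h : {16, 128, 1024}) {                  // compare widths
        MLP net(d, h, /*seed=*/3);
        vector<vector<double>> K0 = gram(net, X);    // NTK at initialization
        double trace = 0.0; for (int i=0;i<n;++i) trace += K0[i][i];
        double eta = 1.0 / trace;                    // conservative: trace >= lambda_max
        for (int step=0; step<500; ++step) net.gd_step(X, y, eta);
        vector<vector<double>> KT = gram(net, X);    // NTK after training
        printf("width %4d   relative NTK drift ||K_T - K_0||_F / ||K_0||_F = %.4f\n",
               h, rel_frobenius_diff(K0, KT));
    }
    return 0;
}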