Reproducing Kernel Hilbert Spaces (RKHS)
Key Points
- An RKHS is a space of functions where evaluating a function at a point equals taking an inner product with a kernel section, which enables the “kernel trick.”
- A positive definite kernel k(x, y) uniquely defines an RKHS and its inner product via the Moore–Aronszajn theorem.
- Learning in RKHS often reduces to working with the n×n Gram matrix K, where K_{ij} = k(x_i, x_j); this enables non-linear learning with linear algebra.
- The Representer Theorem guarantees many regularized learning problems have solutions of the form f(·) = \sum_i \alpha_i k(x_i, ·).
- Kernel Ridge Regression (KRR) solves (K + \lambda I)\alpha = y and predicts by f(x) = k(x, X)^T \alpha.
- Using Cholesky instead of matrix inversion yields numerically stable and typically faster solutions for kernel methods.
- Random Fourier Features approximate shift-invariant kernels like the RBF, reducing O(n^2) memory to O(nD), where D is the feature dimension.
- Choosing kernel hyperparameters (e.g., RBF bandwidth \gamma) and regularization \lambda critically affects generalization and conditioning.
Prerequisites
- →Linear Algebra (vectors, inner products, PSD matrices) — RKHS relies on inner products, Gram matrices, and factorizations like Cholesky.
- →Calculus and Real Analysis (basic) — Understanding function spaces, continuity, and norms benefits from calculus foundations.
- →Probability (Gaussians, expectations) — Bochner’s theorem, Random Fourier Features, and Gaussian Processes use probabilistic tools.
- →Optimization (convexity, regularization) — RKHS learning problems are regularized ERM with convex objectives.
- →Numerical Linear Algebra — Efficient and stable solutions require solving linear systems without explicit inversion.
Detailed Explanation
01 Overview
Hook: Imagine plotting data points in 2D that are not linearly separable. You wish you could lift them into some higher dimension where a straight line would work. Reproducing Kernel Hilbert Spaces (RKHS) let you do exactly that—without ever computing the lift directly.

Concept: An RKHS is a Hilbert space of functions tied to a positive definite kernel k(x, y). The kernel plays double duty: it defines an implicit feature map \phi and the inner product in the function space through k(x, y) = \langle \phi(x), \phi(y) \rangle. Because we only need k to compute inner products, we can learn flexible, non-linear models using simple linear algebra on the n×n Gram matrix.

Example: In kernel ridge regression, the best-fitting function under a smoothness penalty has the form f(·) = \sum_i \alpha_i k(x_i, ·). The coefficients \alpha come from solving one linear system involving the Gram matrix, avoiding explicit high-dimensional features entirely.
02 Intuition & Analogies
Think of a music equalizer. Raw sound is complicated, but sliding the equalizer knobs emphasizes certain frequencies so the music sounds better. Similarly, many datasets are tangled in their original space. A kernel is like a smart equalizer that implicitly maps inputs into a (possibly infinite-dimensional) space where the data becomes simpler—often linearly separable or easier to approximate. You never see the knobs (the feature map \phi); you just hear the result (the kernel value k(x, y) that acts as an inner product there).

Another analogy: Suppose you want to measure how similar two recipes are, but comparing ingredients one by one is tedious. Instead, you could ask a chef to give you a single similarity score that encodes their expert judgment of overlap in flavor, technique, and style. That chef’s score is your kernel. It saves you from manually enumerating and aligning features.

The ‘reproducing’ part means evaluating a function at x is the same as taking an inner product with a special function anchored at x, k(·, x). So to know f(x), you just correlate f with a kernel bump centered at x. This viewpoint explains why many regularized learning solutions boil down to weighted sums of such bumps around training points, letting us sculpt complex functions by adding or reducing the influence of each point.
03 Formal Definition

Let X be a set. A reproducing kernel Hilbert space on X is a Hilbert space H of functions f: X → R in which every evaluation functional f ↦ f(x) is bounded. Equivalently (by the Riesz representation theorem), there exists a kernel k: X × X → R such that k(·, x) ∈ H for every x ∈ X and the reproducing property f(x) = \langle f, k(·, x) \rangle_H holds for all f ∈ H. Such a kernel is symmetric and positive semidefinite, and by the Moore–Aronszajn theorem every symmetric positive semidefinite kernel arises in this way from a unique RKHS.
04 When to Use

- Nonlinear patterns with small to medium-sized datasets (e.g., n up to a few tens of thousands), where flexibility matters but you can still form an n×n Gram matrix.
- When you need theoretically grounded similarity functions and guarantees (positive definiteness), e.g., SVMs, Gaussian Processes, Kernel Ridge Regression, Kernel PCA, and MMD-based two-sample testing.
- When features are hard to engineer but a good similarity can be defined (strings, graphs, distributions) via custom kernels.
- When you want convex training with global optima (e.g., KRR or SVMs) instead of deep non-convex optimization.
- When interpretability in terms of influential training points is useful; solutions are sums of kernel sections k(·, x_i).

Avoid pure kernel methods when n is huge and the memory/time for O(n^2) or O(n^3) operations is prohibitive; consider approximations like Random Fourier Features or Nyström in that case.
⚠️ Common Mistakes

- Using kernels that are not positive semidefinite, yielding invalid Gram matrices and unstable training. Always test PSD via Cholesky factorization; add small jitter if needed.
- Explicitly inverting (K + \lambda I). Numerical inversion is unstable and slower; solve linear systems via Cholesky or CG.
- Mis-tuning hyperparameters: RBF bandwidth \gamma (or \sigma) and regularization \lambda. Too large \gamma overfits with spiky, overly local kernels; too small \gamma underfits. Cross-validate on a log-scale grid.
- Ignoring feature scaling. RBF kernels are very sensitive to input scales; standardize features.
- Forgetting regularization (\lambda = 0) with noisy data, which leads to ill-conditioning and overfitting.
- Confusing the kernel trick with arbitrary similarity. Not every similarity function is a valid kernel; it must be PSD.
- For kernel PCA, forgetting to center the Gram matrix in feature space causes incorrect components.
- Using too few Random Fourier Features (D too small), leading to high-variance approximations; increase D until the kernel approximation error is acceptable.
Key Formulas
Reproducing Property
f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}
Explanation: In an RKHS, evaluating a function at x equals its inner product with the kernel section centered at x. This ties evaluation to geometry in the function space.
Gram Matrix
K_{ij} = k(x_i, x_j), \quad i, j = 1, \dots, n
Explanation: The Gram matrix collects all pairwise kernel evaluations among training points. It is the main object used by kernel algorithms.
Positive Definiteness of Kernel
\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \ge 0 \quad \text{for all } c \in \mathbb{R}^n
Explanation: A valid kernel yields nonnegative quadratic forms for any coefficients c. Equivalently, every Gram matrix built from the kernel is positive semidefinite.
Representer Theorem Form
f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)
Explanation: For many regularized learning problems in an RKHS, the optimal solution is a finite linear combination of kernel sections at training points.
KRR Dual Solution
(K + \lambda I)\alpha = y
Explanation: Kernel ridge regression solves a linear system in the Gram matrix to find the coefficients \alpha. In practice, avoid forming the inverse; solve via Cholesky.
KRR Prediction
f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) = k(x, X)^T \alpha
Explanation: Once \alpha is known, predictions are kernel-weighted sums over training points. Complexity is linear in the number of training samples.
RBF (Gaussian) Kernel
k(x, y) = \exp(-\gamma \|x - y\|^2)
Explanation: A popular, smooth, shift-invariant kernel with bandwidth parameter \gamma. Small \gamma spreads influence broadly; large \gamma makes it very local.
Polynomial Kernel
k(x, y) = (x^T y + c)^p
Explanation: Encodes all monomials up to degree p (and interaction terms) implicitly. The offset c shifts the feature space.
Mercer Expansion
k(x, y) = \sum_{i} \lambda_i \phi_i(x) \phi_i(y), \quad \lambda_i \ge 0
Explanation: A PSD kernel can be expanded in orthonormal eigenfunctions with nonnegative weights. This provides a spectral view of the RKHS.
Bochner Representation
k(x - y) = \int_{\mathbb{R}^d} e^{i w^T (x - y)} \, p(w) \, dw, \quad p(w) \ge 0
Explanation: Any continuous, shift-invariant PSD kernel is the Fourier transform of a nonnegative measure. Sampling w from p(w) leads to Random Fourier Features.
KRR Objective
\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2
Explanation: Balances data fit (least squares) against the RKHS norm \|f\|_{\mathcal{H}}^2. The solution is unique and given by the dual coefficients \alpha.
Complexity Analysis
- Gram matrix construction: O(n^2 d) time and O(n^2) memory for n points in d dimensions.
- KRR training via Cholesky: O(n^3) time for the factorization, plus O(n^2) per triangular solve.
- KRR prediction: O(nd) kernel evaluations per test point (linear in the number of training samples).
- Random Fourier Features: O(nDd) to transform the data and O(nD) memory, after which training is a linear model in D dimensions.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

// Basic helpers
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}
static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i]-b[i]; s += d*d; }
    return s;
}

// Common kernels
Kernel linearKernel = [](const Vec& x, const Vec& y){ return dot(x,y); };
Kernel polynomialKernel(double c, int p) {
    return [c,p](const Vec& x, const Vec& y){ return pow(dot(x,y) + c, p); };
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma * sqL2(x,y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        K[i][i] = k(X[i], X[i]);
        for (size_t j = i+1; j < n; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    }
    return K;
}

// Cholesky factorization for SPD matrices: A = L L^T
bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve (L L^T) x = b with forward/back substitution
Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward: L^T x = y
    for (int i = (int)n-1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i+1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// PSD test by trying Cholesky with small jitter if necessary
bool isPSD(Mat K) {
    size_t n = K.size();
    Mat L;
    if (cholesky(K, L)) return true;
    // Try adding jitter progressively
    double eps = 1e-12;
    for (int t = 0; t < 6; ++t) {
        for (size_t i = 0; i < n; ++i) K[i][i] += eps;
        if (cholesky(K, L)) return true;
        eps *= 10.0;
    }
    return false;
}

int main(){
    // Toy 2D dataset
    vector<Vec> X = { {0.0, 0.0}, {1.0, 0.5}, {0.5, 1.5}, {2.0, 2.0} };

    // Define kernels
    auto lin  = linearKernel;
    auto poly = polynomialKernel(1.0, 3); // (x^T y + 1)^3
    auto rbf  = rbfKernel(0.5);           // gamma = 0.5

    // Build Gram matrices
    Mat K_lin  = gramMatrix(X, lin);
    Mat K_poly = gramMatrix(X, poly);
    Mat K_rbf  = gramMatrix(X, rbf);

    cout << boolalpha;
    cout << "Linear kernel PSD? " << isPSD(K_lin) << "\n";
    cout << "Polynomial kernel PSD? " << isPSD(K_poly) << "\n";
    cout << "RBF kernel PSD? " << isPSD(K_rbf) << "\n";

    // Print a Gram matrix
    cout << "RBF Gram matrix:\n";
    cout.setf(std::ios::fixed); cout << setprecision(4);
    for (auto& row : K_rbf) {
        for (double v : row) cout << setw(9) << v << ' ';
        cout << '\n';
    }
    return 0;
}
```
This program defines common kernels (linear, polynomial, RBF), builds the Gram matrix for a small dataset, and checks positive semidefiniteness by attempting a Cholesky factorization (with diagonal jitter for numerical robustness). A successful Cholesky factorization is a practical PSD certificate for Gram matrices.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i]-b[i]; s += d*d; }
    return s;
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma*sqL2(x,y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    return K;
}

bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k]*L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false;
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    return true;
}

Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    for (size_t i = 0; i < n; ++i) {      // Forward: L y = b
        double s = b[i];
        for (size_t k = 0; k < i; ++k) s -= L[i][k]*y[k];
        y[i] = s / L[i][i];
    }
    for (int i = (int)n-1; i >= 0; --i) { // Backward: L^T x = y
        double s = y[i];
        for (size_t k = i+1; k < n; ++k) s -= L[k][i]*x[k];
        x[i] = s / L[i][i];
    }
    return x;
}

struct KernelRidgeRegression {
    Kernel k; double lambda;
    vector<Vec> Xtrain; Vec alpha; bool fitted = false;

    KernelRidgeRegression(Kernel kernel, double lam): k(kernel), lambda(lam) {}

    void fit(const vector<Vec>& X, const Vec& y) {
        Xtrain = X;
        size_t n = X.size();
        // Build K and add regularization to the diagonal
        Mat K = gramMatrix(X, k);
        for (size_t i = 0; i < n; ++i) K[i][i] += lambda;
        // Cholesky and solve
        Mat L;
        bool ok = cholesky(K, L);
        if (!ok) { // add jitter if needed
            for (size_t i = 0; i < n; ++i) K[i][i] += 1e-10;
            ok = cholesky(K, L);
            if (!ok) throw runtime_error("Cholesky failed: kernel may be non-PSD or ill-conditioned");
        }
        alpha = solveWithCholesky(L, y);
        fitted = true;
    }

    double predict_one(const Vec& x) const {
        if (!fitted) throw runtime_error("Model not fitted");
        double s = 0.0;
        for (size_t i = 0; i < Xtrain.size(); ++i) s += alpha[i]*k(x, Xtrain[i]);
        return s;
    }

    vector<double> predict(const vector<Vec>& X) const {
        vector<double> out; out.reserve(X.size());
        for (const auto& x : X) out.push_back(predict_one(x));
        return out;
    }
};

int main(){
    // Generate 1D regression data: y = sin(2*pi*x) + noise
    int n = 80;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    std::normal_distribution<double> N(0.0, 0.1);
    vector<Vec> X; Vec y; X.reserve(n); y.reserve(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        X.push_back({xi});
        y.push_back(sin(2*M_PI*xi) + N(rng));
    }

    auto k = rbfKernel(50.0); // gamma; sensitive to input scale—here x in [0,1]
    KernelRidgeRegression krr(k, 1e-2);
    krr.fit(X, y);

    // Compute MSE on training data
    double mse = 0.0;
    for (int i = 0; i < n; ++i) {
        double e = krr.predict_one(X[i]) - y[i];
        mse += e*e;
    }
    mse /= n;
    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Training MSE: " << mse << "\n";

    // Show a few predictions
    for (double xq : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        double pred = krr.predict_one(Vec{xq});
        cout << "x=" << setw(5) << xq << " f(x)≈ " << setw(9) << pred << '\n';
    }
    return 0;
}
```
This implements Kernel Ridge Regression using the RBF kernel. Training builds the Gram matrix, adds λ to the diagonal, performs a Cholesky factorization, and solves for α. Predictions are kernel-weighted sums over training points. The example fits a noisy sin curve, showing the power of RKHS-based smooth regression.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;

// RFF transformer for RBF kernel: z(x) in R^D with z(x)^T z(y) ≈ k(x,y)
struct RFF {
    int D; int d; double gamma;  // RBF: k(x,y)=exp(-gamma||x-y||^2)
    vector<Vec> W;               // D x d Gaussian rows
    vector<double> b;            // D phases in [0, 2pi]
    std::mt19937 rng;

    RFF(int D_, int d_, double gamma_, uint32_t seed=42): D(D_), d(d_), gamma(gamma_), rng(seed) {
        // For RBF with parameter gamma, the spectral density is Gaussian with variance 2*gamma
        double sigma2 = 2.0 * gamma; // w ~ N(0, 2*gamma I)
        std::normal_distribution<double> N(0.0, sqrt(sigma2));
        std::uniform_real_distribution<double> U(0.0, 2.0*M_PI);
        W.assign(D, Vec(d, 0.0));
        b.assign(D, 0.0);
        for (int i = 0; i < D; ++i) {
            for (int j = 0; j < d; ++j) W[i][j] = N(rng);
            b[i] = U(rng);
        }
    }

    Vec transform(const Vec& x) const {
        Vec z(D, 0.0);
        double scale = sqrt(2.0 / D);
        for (int i = 0; i < D; ++i) {
            double proj = 0.0;
            for (int j = 0; j < d; ++j) proj += W[i][j] * x[j];
            z[i] = scale * cos(proj + b[i]);
        }
        return z;
    }
};

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i]-b[i]; s += d*d; }
    return s;
}
static double rbf_exact(const Vec& x, const Vec& y, double gamma) {
    return exp(-gamma * sqL2(x,y));
}

int main(){
    // 3D random points
    int n = 1000; int d = 3; double gamma = 1.0;
    std::mt19937 rng(123);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<Vec> X(n, Vec(d, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < d; ++j) X[i][j] = N(rng);

    // Build RFF transformers with varying D
    for (int D : {100, 500, 2000}) {
        RFF rff(D, d, gamma, 7);
        // Estimate kernel approximation error over random pairs
        std::uniform_int_distribution<int> UI(0, n-1);
        double mae = 0.0; int trials = 2000;
        for (int t = 0; t < trials; ++t) {
            int i = UI(rng), j = UI(rng);
            Vec zx = rff.transform(X[i]);
            Vec zy = rff.transform(X[j]);
            double approx = inner_product(zx.begin(), zx.end(), zy.begin(), 0.0);
            double exact  = rbf_exact(X[i], X[j], gamma);
            mae += fabs(approx - exact);
        }
        mae /= trials;
        cout.setf(std::ios::fixed); cout << setprecision(6);
        cout << "D=" << setw(5) << D << " Mean |approx-exact| = " << mae << "\n";
    }
    return 0;
}
```
This demonstrates Random Fourier Features for the RBF kernel. It samples Gaussian frequencies according to Bochner’s theorem and maps x to z(x) so that z(x)^T z(y) approximates k(x,y). Increasing D improves the approximation while keeping memory linear in D instead of n.