
Graph Neural Network Theory

Key Points

  • Graph Neural Networks (GNNs) learn on graphs by repeatedly letting each node aggregate messages from its neighbors and update its representation.
  • Standard message passing GNNs are at most as expressive as the 1-Weisfeiler–Leman (1-WL) graph isomorphism test for distinguishing non-isomorphic graphs.
  • Over-smoothing occurs in deep GNNs when repeated averaging makes node embeddings converge to nearly identical vectors.
  • Over-squashing happens when long-range information from exponentially many nodes must pass through few edges, creating a bottleneck.
  • Permutation equivariance (for node-level outputs) and permutation invariance (for graph-level outputs) are core symmetry properties any valid GNN should satisfy.
  • Positional encodings (e.g., Laplacian eigenvectors or random features) inject global structure that message passing alone may miss.
  • Higher-order GNNs (e.g., based on k-WL) are more expressive but usually have higher computational cost.
  • Practical GNN design balances expressivity, depth, normalization, and positional features to mitigate smoothing and squashing.

Prerequisites

  • →Linear Algebra — Matrix multiplication, eigenvectors, and normalization underpin GCNs and Laplacian encodings.
  • →Graph Theory Basics — Understanding nodes, edges, adjacency, degrees, and neighborhoods is essential for message passing.
  • →Neural Networks and Backpropagation — Weights, activations, and training dynamics translate directly to GNN parameter learning.
  • →Probability and Statistics — Random features, attention softmax, and generalization analysis rely on probabilistic thinking.
  • →Algorithmic Complexity — Evaluating scalability on large sparse graphs requires Big-O reasoning for N, E, and d.
  • →Spectral Graph Theory (optional) — Laplacian eigenvectors and over-smoothing analysis use spectral properties of graphs.
  • →Hashing and Maps — 1-WL refinement relies on hashing neighbor color multisets to new labels.
  • →Sparse Data Structures — Efficient message passing uses adjacency lists and edge-wise accumulations.

Detailed Explanation


01 Overview

Graph Neural Network (GNN) theory studies how neural models learn from graph-structured data such as molecules, social networks, and knowledge graphs. The central paradigm is message passing: at each layer, a node gathers (aggregates) information from its neighbors and updates its hidden state. This iterative process blends local structure into node embeddings that can be used for downstream tasks like node classification, link prediction, and graph classification. A key theoretical insight is expressivity: standard message passing GNNs are bounded by the 1-Weisfeiler–Leman (1-WL) graph isomorphism test, meaning they cannot distinguish any pair of non-isomorphic graphs that 1-WL fails to distinguish. While deeper networks can propagate information farther, they face two structural issues: over-smoothing, where representations become indistinguishable, and over-squashing, where long-range dependencies are compressed through few edges. Symmetry is fundamental: valid GNNs must be permutation equivariant (node outputs permute with node reordering), and graph-level readouts must be permutation invariant. To capture global structure that neighbors may not reveal, positional encodings (e.g., Laplacian eigenvectors or random features) are often added. Higher-order architectures inspired by k-WL increase distinguishing power at increased computational cost. Understanding these principles guides the design of robust, efficient, and theoretically sound GNNs.

02 Intuition & Analogies

Imagine a community where each person (node) writes a short summary about themselves. Every hour (layer), each person reads their neighbors’ notes, averages the ideas, and then rewrites their own summary incorporating what they learned. After a few rounds, everyone’s notes reflect local community structure—interests, roles, or affiliations. This is message passing. The way each person combines neighbors’ notes must be order-agnostic: you should get the same result no matter how you shuffle the neighbors. That’s why we use operations like sum/mean/max that don’t depend on order (permutation invariance within neighborhoods). Now picture two different communities; can we tell them apart by only running this note-sharing process? The 1-WL test is a formalized version of this note-refinement game. If 1-WL can’t tell two communities apart, a standard message passing GNN can’t either. Make the process very long and people’s notes begin to sound the same—unique voices fade. That’s over-smoothing. On the other hand, think of trying to pass news from one corner of a vast city to another using only a few narrow bridges; too many messages get jammed at the bottlenecks. That’s over-squashing. Finally, suppose we add GPS coordinates or district IDs to people’s notes—suddenly, global geography helps disambiguate neighborhoods. These are positional encodings. Whatever we do, renaming people without changing who is connected to whom shouldn’t change the conclusions; that’s permutation equivariance/invariance, the symmetry rule that keeps GNNs honest.

03 Formal Definition

A message passing GNN updates node embeddings via functions that are invariant to neighbor ordering. Let G = (V, E) be a graph with initial node features h_v^{(0)} = x_v. At layer k, a broad template is: h_v^{(k)} = \operatorname{UPDATE}^{(k)}\big(h_v^{(k-1)},\ \operatorname{AGGREGATE}^{(k)}\{(h_u^{(k-1)}, h_v^{(k-1)}, e_{uv}) : u \in N(v)\}\big). The AGGREGATE function is a permutation-invariant operator such as sum, mean, or max. A well-known special case is the GCN layer: H^{(k)} = \sigma\big(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(k-1)} W^{(k)}\big), where \hat{A} = A + I adds self-loops, \hat{D} is the diagonal degree matrix of \hat{A}, W^{(k)} is a trainable weight matrix, and \sigma is a pointwise nonlinearity. Permutation equivariance means that for any permutation matrix P (reindexing nodes), a node-wise GNN satisfies f(PX, PAP^\top) = P\, f(X, A). For graph-level outputs with a readout R that aggregates node embeddings, permutation invariance requires R(PH) = R(H). Expressivity results show that message passing GNNs are at most as powerful as the 1-Weisfeiler–Leman color refinement: if 1-WL cannot distinguish two graphs, then standard MPNNs cannot either. Increasing expressivity typically involves higher-order architectures (e.g., operating on k-tuples), which relate to k-WL but incur higher computational complexity.
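
To make the symmetry conditions concrete, here is a minimal sketch (illustrative only; the helpers sum_aggregate_layer and sum_readout are not from any library) that applies an unweighted sum-aggregation layer to a small graph, relabels the nodes with a permutation, and numerically checks that node outputs permute accordingly while a sum readout stays unchanged.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Illustrative sketch: check permutation equivariance of a sum-aggregation layer
// and permutation invariance of a sum readout on a tiny graph. The layer here is
// h'_v = h_v + sum_{u in N(v)} h_u (no weights), which suffices to test the symmetry.

using Feats = std::vector<std::vector<double>>;
using Adj   = std::vector<std::vector<int>>;

Feats sum_aggregate_layer(const Adj& adj, const Feats& X) {
    int n = (int)adj.size(), d = (int)X[0].size();
    Feats out(n, std::vector<double>(d, 0.0));
    for (int v = 0; v < n; ++v) {
        for (int j = 0; j < d; ++j) out[v][j] = X[v][j];          // self term
        for (int u : adj[v])
            for (int j = 0; j < d; ++j) out[v][j] += X[u][j];     // neighbor sum
    }
    return out;
}

std::vector<double> sum_readout(const Feats& H) {
    std::vector<double> g(H[0].size(), 0.0);
    for (const auto& row : H)
        for (size_t j = 0; j < row.size(); ++j) g[j] += row[j];
    return g;
}

int main() {
    // Path graph 0-1-2-3 with 2-dimensional features
    Adj adj = {{1}, {0,2}, {1,3}, {2}};
    Feats X = {{1,0}, {0,1}, {2,-1}, {-1,3}};

    // Permutation p: node v is relabeled to p[v]
    std::vector<int> p = {2, 0, 3, 1};
    int n = (int)adj.size();
    Adj adjP(n); Feats XP(n);
    for (int v = 0; v < n; ++v) {
        XP[p[v]] = X[v];
        for (int u : adj[v]) adjP[p[v]].push_back(p[u]);
    }

    Feats H  = sum_aggregate_layer(adj,  X);
    Feats HP = sum_aggregate_layer(adjP, XP);

    // Equivariance: HP[p[v]] should equal H[v] for every node v
    double maxDiff = 0.0;
    for (int v = 0; v < n; ++v)
        for (size_t j = 0; j < H[v].size(); ++j)
            maxDiff = std::max(maxDiff, std::fabs(HP[p[v]][j] - H[v][j]));
    std::cout << "Max equivariance error: " << maxDiff << "\n";

    // Invariance: the sum readout should be identical for both orderings
    auto g = sum_readout(H), gP = sum_readout(HP);
    double readoutDiff = 0.0;
    for (size_t j = 0; j < g.size(); ++j) readoutDiff = std::max(readoutDiff, std::fabs(g[j] - gP[j]));
    std::cout << "Readout invariance error: " << readoutDiff << "\n";
    return 0;
}

Any layer built from per-node transforms plus a permutation-invariant neighbor aggregation passes the same check.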

04 When to Use

Use message passing GNNs when relational structure is central: molecular property prediction (atoms and bonds), fraud detection in transaction networks, recommendation on user–item graphs, citation or social networks, and knowledge graphs. For node classification, a few layers (2–4) often suffice to capture local context without severe over-smoothing. For link prediction, use architectures with edge modeling (e.g., edge features, bilinear decoders) and sometimes positional encodings to capture nonlocal cues. For graph classification, pair a node encoder (MPNN) with a permutation-invariant readout (sum/mean/max or attention pooling); a pooling sketch follows below. When you expect long-range dependencies (e.g., functional groups far apart in a molecule, circuitry with long paths), consider mitigations for over-squashing: skip connections, positional encodings (Laplacian eigenvectors or random features), graph rewiring, or attention mechanisms that create longer-range jumps. If you need to distinguish graphs that 1-WL cannot, consider higher-order GNNs or additional features that break symmetries. For tasks where node identity should not leak (e.g., anonymized networks), ensure your design is permutation equivariant and does not rely on raw indices, unless carefully encoded as positional features with proper regularization.
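
As a concrete illustration of the readout options mentioned above, the following sketch (function names are illustrative, not from a specific GNN library) pools a set of final-layer node embeddings into a graph-level vector with sum, mean, and max; reordering the nodes leaves all three outputs unchanged.

#include <iostream>
#include <vector>
#include <algorithm>
#include <limits>

// Illustrative permutation-invariant readouts for graph classification.
// Each maps a set of node embeddings (N x d) to a single graph vector (d).

using Feats = std::vector<std::vector<double>>;

std::vector<double> sum_pool(const Feats& H) {
    std::vector<double> g(H[0].size(), 0.0);
    for (const auto& h : H)
        for (size_t j = 0; j < h.size(); ++j) g[j] += h[j];
    return g;
}

std::vector<double> mean_pool(const Feats& H) {
    std::vector<double> g = sum_pool(H);
    for (double& x : g) x /= (double)H.size();
    return g;
}

std::vector<double> max_pool(const Feats& H) {
    std::vector<double> g(H[0].size(), -std::numeric_limits<double>::infinity());
    for (const auto& h : H)
        for (size_t j = 0; j < h.size(); ++j) g[j] = std::max(g[j], h[j]);
    return g;
}

int main() {
    // Final-layer node embeddings of a 4-node graph (d = 3)
    Feats H = {{1.0, 0.0, 2.0}, {0.5, -1.0, 0.0}, {2.0, 1.0, -1.0}, {-0.5, 0.5, 1.0}};
    auto print = [](const char* name, const std::vector<double>& g) {
        std::cout << name << ":";
        for (double x : g) std::cout << ' ' << x;
        std::cout << '\n';
    };
    print("sum ", sum_pool(H));
    print("mean", mean_pool(H));
    print("max ", max_pool(H));
    // Reordering the rows of H leaves all three outputs unchanged (permutation invariance).
    return 0;
}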

⚠️ Common Mistakes

• Ignoring permutation symmetry: using order-sensitive aggregations or concatenating neighbors in arbitrary order leaks node ordering and breaks validity. Always use permutation-invariant AGGREGATE functions.
• Forgetting self-loops or proper normalization: omitting self-loops can erase a node’s own features; poor normalization (e.g., an unscaled sum) can cause exploding or vanishing signals. Use \hat{A} = A + I and symmetric normalization \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}.
• Too much depth without mitigation: stacking many message-passing layers often leads to over-smoothing. Prefer shallow models with residual/skip connections, normalization, and diverse channels (e.g., attention, jumping knowledge).
• Overlooking over-squashing: even shallow models can fail on long-range dependencies. Consider rewiring, global attention edges, or positional encodings to increase the effective receptive field.
• Misusing positional encodings: treating Laplacian eigenvectors as absolute coordinates can break transfer; handle sign ambiguity and scale properly (see the sketch after this list), and avoid leaking node IDs if invariance is required.
• Underestimating expressivity limits: assuming a standard MPNN can distinguish any pair of graphs leads to brittle solutions. Know when you need higher-order variants or task-specific features.
• Inefficient implementations: recomputing normalizations at every step or using dense matrices for sparse graphs can blow up runtime and memory. Prefer sparse accumulations over edges.
• Data leakage: including temporal or label information from the future, or encoding raw node indices as features, can falsely inflate performance. Enforce strict training/validation splits and sanity checks.
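
One of the pitfalls above, the sign ambiguity of Laplacian eigenvector encodings, admits a simple mitigation: fix a canonical sign per eigenvector (or randomly flip signs during training so the model becomes robust to them). Below is a minimal sketch of the canonical-sign convention, assuming the eigenvectors have already been computed by an external eigensolver; fix_signs is an illustrative helper, not a library function.

#include <iostream>
#include <vector>
#include <cmath>

// Minimal sketch: canonicalize the sign of precomputed Laplacian eigenvectors
// used as positional encodings. Each eigenvector is flipped so that its entry
// of largest absolute value is positive; +v and -v then map to the same encoding.
// (Assumes eigenvectors come from an external eigensolver, stored one per inner vector;
//  exact ties in magnitude would need an extra tie-breaking rule.)

void fix_signs(std::vector<std::vector<double>>& eigvecs) {
    for (auto& v : eigvecs) {
        size_t arg = 0;
        for (size_t i = 1; i < v.size(); ++i)
            if (std::fabs(v[i]) > std::fabs(v[arg])) arg = i;
        if (v[arg] < 0.0)
            for (double& x : v) x = -x;  // flip the whole eigenvector
    }
}

int main() {
    // Two toy "eigenvectors" over 4 nodes (values are illustrative, not from a real Laplacian)
    std::vector<std::vector<double>> pe = {
        {-0.5,  0.5, -0.5,  0.5},
        { 0.1, -0.7,  0.7, -0.1}
    };
    fix_signs(pe);
    for (const auto& v : pe) {
        for (double x : v) std::cout << x << ' ';
        std::cout << '\n';
    }
    return 0;
}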

Key Formulas

Message Passing Template

h_v^{(k)} = \operatorname{UPDATE}^{(k)}\big(h_v^{(k-1)},\ \operatorname{AGGREGATE}^{(k)}\{\, h_u^{(k-1)} : u \in N(v) \,\}\big)

Explanation: At layer k, a node updates its state using its previous state and an order-invariant aggregation of messages from neighbors. This captures the core structure of MPNNs.

GCN Layer

H^{(k)} = \sigma\big(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(k-1)} W^{(k)}\big)

Explanation: Node features are first linearly transformed, then mixed by a symmetrically normalized adjacency, followed by a nonlinearity. The normalization stabilizes training and preserves scale.

Self-loop Augmentation

\hat{A} = A + I, \qquad \hat{D}_{ii} = \sum_j \hat{A}_{ij}

Explanation: Adding the identity to A includes each node’s own features during aggregation. The diagonal degree matrix counts neighbors including self.

Permutation Equivariance

f(PX, PAP^\top) = P\, f(X, A)

Explanation: Reindexing nodes with permutation matrix P permutes the outputs accordingly. This ensures node predictions do not depend on arbitrary node order.

Permutation Invariance

R(PH) = R(H)

Explanation: Graph-level summaries must be invariant to node reordering. Sum/mean/max pooling satisfy this property.

Expressivity Bound

\text{1-WL}(G_1) = \text{1-WL}(G_2) \;\Rightarrow\; \text{MPNN}(G_1) = \text{MPNN}(G_2)

Explanation: Standard message passing GNNs cannot distinguish more graphs than the 1-WL test can. If 1-WL fails on two graphs, so will such GNNs.

Over-smoothing Limit

\lim_{k \to \infty} \big(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}\big)^{k} H = \hat{D}^{1/2} \mathbf{1}\, c^{\top} \text{ for some vector } c \text{ (per connected component)}

Explanation: Repeated normalized averaging drives features toward the principal eigenspace, making node embeddings similar. This explains feature collapse in deep GCNs.

Neighborhood Growth

|N_r(v)| = O(d^{\,r})

Explanation: The number of nodes within r hops grows roughly exponentially with degree d, causing long-range information to be squashed through few edges.
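
To see this growth concretely, the short sketch below (a toy illustration, not tied to any formula above beyond the O(d^r) bound) runs a BFS on a complete binary tree and prints how many nodes lie within r hops of the root; the count roughly doubles per hop until it saturates at N.

#include <iostream>
#include <vector>
#include <queue>

// Illustrative sketch: count how many nodes lie within r hops of a source node.
// On a complete binary tree the count roughly doubles with each extra hop,
// matching the O(d^r) neighborhood-growth bound (until it saturates at N).

int main() {
    // Complete binary tree with n nodes: node i has children 2i+1 and 2i+2
    int n = 1023; // depth-9 tree
    std::vector<std::vector<int>> adj(n);
    for (int i = 0; i < n; ++i) {
        for (int c : {2*i + 1, 2*i + 2}) {
            if (c < n) { adj[i].push_back(c); adj[c].push_back(i); }
        }
    }

    // BFS from the root, recording hop distance to every node
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    dist[0] = 0; q.push(0);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int u : adj[v]) if (dist[u] < 0) { dist[u] = dist[v] + 1; q.push(u); }
    }

    // |N_r(root)| = number of nodes with dist <= r
    for (int r = 0; r <= 9; ++r) {
        int count = 0;
        for (int v = 0; v < n; ++v) if (dist[v] >= 0 && dist[v] <= r) ++count;
        std::cout << "r = " << r << "  |N_r| = " << count << "\n";
    }
    return 0;
}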

Graph Laplacians

L = D - A, \qquad L_{\text{sym}} = I - D^{-1/2} A D^{-1/2}

Explanation: Unnormalized and symmetric normalized Laplacians encode smoothness on graphs. Their eigenvectors serve as positional encodings.
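
The following toy sketch builds L = D - A and L_sym = I - D^{-1/2} A D^{-1/2} for a small cycle as dense matrices; a real positional-encoding pipeline would keep them sparse and hand them to an eigensolver (not shown here), and the example assumes every node has degree greater than zero.

#include <iostream>
#include <vector>
#include <cmath>

// Toy sketch: build L = D - A and L_sym = I - D^{-1/2} A D^{-1/2} for a small graph.
// In practice these are kept sparse and passed to an eigensolver to obtain the
// eigenvectors used as positional encodings. Assumes no isolated nodes.

int main() {
    // 4-node cycle: 0-1-2-3-0
    int n = 4;
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));
    auto add_edge = [&](int u, int v){ A[u][v] = A[v][u] = 1.0; };
    add_edge(0,1); add_edge(1,2); add_edge(2,3); add_edge(3,0);

    std::vector<double> deg(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) deg[i] += A[i][j];

    std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
    std::vector<std::vector<double>> Lsym(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            L[i][j]    = (i == j ? deg[i] : 0.0) - A[i][j];
            Lsym[i][j] = (i == j ? 1.0    : 0.0) - A[i][j] / std::sqrt(deg[i] * deg[j]);
        }
    }

    auto print = [&](const char* name, const std::vector<std::vector<double>>& M) {
        std::cout << name << ":\n";
        for (const auto& row : M) {
            for (double x : row) std::cout << x << '\t';
            std::cout << '\n';
        }
    };
    print("L = D - A", L);
    print("L_sym = I - D^{-1/2} A D^{-1/2}", Lsym);
    return 0;
}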

Graph-level Embedding

h_G = R\big(\{\, h_v^{(K)} : v \in V \,\}\big), \qquad \text{e.g. } h_G = \sum_{v \in V} h_v^{(K)}

Explanation: A graph representation is obtained by aggregating the final-layer node embeddings with an invariant operator such as sum or mean.

Message Passing Complexity

T_{\text{layer}} = O(E\,d + N\,d\,d')

Explanation: Aggregating d-dimensional features over all edges requires time proportional to the number of edges times feature size; the N d d' term accounts for the per-node linear transform. This guides scalability analysis.

1-WL Color Refinement

c_v^{(t+1)} = \operatorname{HASH}\big(c_v^{(t)},\ \{\!\{\, c_u^{(t)} : u \in N(v) \,\}\!\}\big)

Explanation: Node colors are iteratively updated from neighbor color multisets. Stabilization implies a partition that can (sometimes) distinguish non-isomorphic graphs.

Complexity Analysis

A single message passing layer on a sparse graph with N nodes, E edges, and feature dimension d typically costs O(E d) time to aggregate neighbor features, plus O(N d d') to apply a linear transform from d to d' (often fused with, or applied before, aggregation). For common settings with d' ≈ d, this is O(E d + N d^2). When using attention (e.g., GAT), computing attention coefficients per edge adds O(E d) operations and increases constants due to softmax normalizations; multi-head attention multiplies costs by the number of heads. Memory usage is O(N d) for node embeddings plus O(E) for the sparse structure; storing per-edge messages or attention weights can increase memory to O(E d) or O(E h) for h heads. For K layers, forward complexity scales roughly linearly: O(K (E d + N d^2)). Backpropagation roughly doubles the cost due to gradient computations and stored intermediate activations during training. Higher-order GNNs significantly increase costs: a 2-WL-like model that reasons over node pairs can require O(N^2) states and O(E N) aggregation work, quickly becoming prohibitive on large graphs. 1-WL color refinement as an algorithm runs in roughly O(E log N) or O(E + N log N) per iteration, over T iterations, depending on the data structures used to sort or hash neighbor color multisets. Positional encodings vary: computing top-k Laplacian eigenvectors exactly may require O(N^3) for a dense eigendecomposition, while iterative methods (Lanczos) cost O(k (E + N)) per iteration; random features cost O(N k). Overall, scalable GNN implementations rely on sparse operations, minibatch sampling (e.g., neighborhood sampling), and careful memory management to keep O(E d) as the dominant term.
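
As a back-of-the-envelope check of these bounds, the sketch below plugs assumed sizes into the dominant terms and prints rough operation and memory estimates; N, E, d, K, and the head count are made-up example values, not measurements from any system.

#include <iostream>
#include <cstdint>

// Back-of-the-envelope cost model for K message passing layers.
// All sizes below are assumed example values, not benchmarks.

int main() {
    const std::uint64_t N = 1'000'000;   // nodes (assumed)
    const std::uint64_t E = 10'000'000;  // undirected edges (assumed)
    const std::uint64_t d = 128;         // feature dimension (assumed, d' = d)
    const std::uint64_t K = 3;           // layers
    const std::uint64_t heads = 4;       // attention heads (if GAT-style)

    // Per-layer costs (multiply-add operations, constants ignored)
    std::uint64_t aggregate = E * d;          // O(E d) sparse aggregation
    std::uint64_t transform = N * d * d;      // O(N d^2) dense linear transform
    std::uint64_t attention = E * d * heads;  // extra O(E d h) for per-edge attention

    std::cout << "Per-layer aggregate ops: " << aggregate << "\n";
    std::cout << "Per-layer transform ops: " << transform << "\n";
    std::cout << "Per-layer attention ops (optional): " << attention << "\n";
    std::cout << "Forward ops for K layers (no attention): " << K * (aggregate + transform) << "\n";

    // Memory: node embeddings (double precision) plus edge structure (two 4-byte ids per edge)
    double embed_mem_gb = (double)(N * d * sizeof(double)) / (1ULL << 30);
    double edge_mem_gb  = (double)(E * 2 * sizeof(std::uint32_t)) / (1ULL << 30);
    std::cout << "Embedding memory per layer: ~" << embed_mem_gb << " GiB\n";
    std::cout << "Edge structure memory: ~" << edge_mem_gb << " GiB\n";
    return 0;
}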

Code Examples

A minimal GCN-style message passing layer (inference only)
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <numeric>

// This example shows a single GCN-like layer with symmetric normalization.
// It performs: H' = ReLU( D_hat^{-1/2} * A_hat * D_hat^{-1/2} * H * W )
// where A_hat = A + I (self-loops) and D_hat is its degree matrix.

struct Graph {
    int n;                              // number of nodes
    std::vector<std::vector<int>> adj;  // adjacency list (undirected)
};

// Multiply features by weight matrix W (d_in x d_out)
std::vector<std::vector<double>> linear(const std::vector<std::vector<double>>& H,
                                        const std::vector<std::vector<double>>& W) {
    int n = (int)H.size();
    int din = (int)H[0].size();
    int dout = (int)W[0].size();
    std::vector<std::vector<double>> out(n, std::vector<double>(dout, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < dout; ++j) {
            double s = 0.0;
            for (int k = 0; k < din; ++k) s += H[i][k] * W[k][j];
            out[i][j] = s;
        }
    }
    return out;
}

// Apply ReLU nonlinearity
void relu(std::vector<std::vector<double>>& H) {
    for (auto& row : H) for (auto& x : row) x = std::max(0.0, x);
}

// One GCN-like propagation step with symmetric normalization
std::vector<std::vector<double>> gcn_propagate(const Graph& G,
                                               const std::vector<std::vector<double>>& Hlin) {
    int n = G.n;
    int d = (int)Hlin[0].size();

    // Compute degrees with self-loops: deg_hat(v) = deg(v) + 1
    std::vector<int> deg_hat(n, 1); // start with self-loop
    for (int v = 0; v < n; ++v) deg_hat[v] += (int)G.adj[v].size();

    std::vector<std::vector<double>> out(n, std::vector<double>(d, 0.0));

    // Accumulate self-loop contributions: out[v] += Hlin[v] / sqrt(deg_hat[v]*deg_hat[v])
    for (int v = 0; v < n; ++v) {
        double norm = 1.0 / std::sqrt((double)deg_hat[v] * (double)deg_hat[v]);
        for (int j = 0; j < d; ++j) out[v][j] += Hlin[v][j] * norm;
    }

    // Accumulate neighbor contributions with symmetric normalization
    for (int v = 0; v < n; ++v) {
        for (int u : G.adj[v]) {
            double norm = 1.0 / std::sqrt((double)deg_hat[v] * (double)deg_hat[u]);
            for (int j = 0; j < d; ++j) out[v][j] += Hlin[u][j] * norm;
        }
    }

    return out;
}

// Utility to create a random weight matrix
std::vector<std::vector<double>> randn_matrix(int din, int dout, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> nd(0.0, 1.0 / std::sqrt((double)din));
    std::vector<std::vector<double>> W(din, std::vector<double>(dout));
    for (int i = 0; i < din; ++i) for (int j = 0; j < dout; ++j) W[i][j] = nd(gen);
    return W;
}

int main() {
    // Build a small undirected graph (triangle with a tail)
    Graph G; G.n = 4; G.adj.assign(G.n, {});
    auto add_edge = [&](int u, int v){ G.adj[u].push_back(v); G.adj[v].push_back(u); };
    add_edge(0,1); add_edge(1,2); add_edge(2,0); // triangle 0-1-2-0
    add_edge(2,3);                               // tail 2-3

    // Node features: 4 nodes, 3-dimensional features
    std::vector<std::vector<double>> H = {
        { 1.0,  0.5, -0.5},
        { 0.7, -1.2,  0.3},
        { 0.1,  0.0,  1.0},
        {-0.3,  0.9,  0.2}
    };

    // Weight matrix 3 -> 2
    auto W = randn_matrix(3, 2, 123);

    // Linear transform
    auto Hlin = linear(H, W);

    // Propagate via normalized adjacency (A+I)
    auto Hprop = gcn_propagate(G, Hlin);

    // Nonlinearity
    relu(Hprop);

    // Print result
    std::cout << "Node embeddings after one GCN layer:\n";
    for (int v = 0; v < G.n; ++v) {
        std::cout << v << ": ";
        for (double x : Hprop[v]) std::cout << x << ' ';
        std::cout << '\n';
    }
    return 0;
}

This program implements one GCN-like layer. It first applies a linear transform to node features, then aggregates neighbor features using the symmetrically normalized adjacency with self-loops, and finally applies ReLU. The edge-wise accumulation implements multiplication by \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}, ensuring permutation equivariance and stabilizing feature scales.

Time: O(E d + N d^2) for one layer (sparse aggregation plus dense linear transform).
Space: O(N d) to store features plus O(E) for the graph structure.

1-WL color refinement (expressivity baseline)
#include <bits/stdc++.h>
using namespace std;

struct Graph { int n; vector<vector<int>> adj; };

// Perform 1-WL color refinement until convergence.
// Returns the color class of each node and the number of refinement iterations.
pair<vector<int>, int> wl_color_refinement(const Graph& G) {
    int n = G.n;
    vector<int> color(n, 0);      // initial uniform color
    vector<int> new_color(n, 0);
    bool changed = true; int it = 0;
    while (changed) {
        ++it; changed = false;
        // Signature: (own color, sorted multiset of neighbor colors)
        vector<pair<int, vector<int>>> signatures(n);
        for (int v = 0; v < n; ++v) {
            vector<int> neigh;
            neigh.reserve(G.adj[v].size());
            for (int u : G.adj[v]) neigh.push_back(color[u]);
            sort(neigh.begin(), neigh.end());
            signatures[v] = {color[v], move(neigh)};
        }
        // Map unique signatures to new compact colors
        map<pair<int, vector<int>>, int> dict; // deterministic ordering
        int nxt = 0;
        for (int v = 0; v < n; ++v) {
            if (dict.find(signatures[v]) == dict.end()) dict[signatures[v]] = nxt++;
        }
        for (int v = 0; v < n; ++v) new_color[v] = dict[signatures[v]];
        if (new_color != color) { color.swap(new_color); changed = true; }
    }
    return {color, it - 1};
}

int main() {
    // Two small graphs to compare
    Graph G1; G1.n = 6; G1.adj.assign(6, {});
    auto add_edge = [&](Graph& G, int u, int v){ G.adj[u].push_back(v); G.adj[v].push_back(u); };
    // G1: 6-cycle (0-1-2-3-4-5-0)
    add_edge(G1,0,1); add_edge(G1,1,2); add_edge(G1,2,3);
    add_edge(G1,3,4); add_edge(G1,4,5); add_edge(G1,5,0);

    Graph G2; G2.n = 6; G2.adj.assign(6, {});
    // G2: two triangles connected by a bridge (0-1-2-0, 3-4-5-3, plus 2-3)
    add_edge(G2,0,1); add_edge(G2,1,2); add_edge(G2,2,0);
    add_edge(G2,3,4); add_edge(G2,4,5); add_edge(G2,5,3);
    add_edge(G2,2,3);

    auto [c1, it1] = wl_color_refinement(G1);
    auto [c2, it2] = wl_color_refinement(G2);

    // Summarize color histograms (class-size profiles)
    auto hist = [&](const vector<int>& c){
        map<int,int> m; for (int x : c) m[x]++;
        string s; for (auto& kv : m) s += to_string(kv.first) + ":" + to_string(kv.second) + " ";
        return s;
    };

    cout << "G1 colors after " << it1 << " iters: ";
    for (int i = 0; i < (int)c1.size(); ++i) cout << c1[i] << (i+1 == (int)c1.size() ? '\n' : ' ');
    cout << "G1 histogram: " << hist(c1) << "\n";

    cout << "G2 colors after " << it2 << " iters: ";
    for (int i = 0; i < (int)c2.size(); ++i) cout << c2[i] << (i+1 == (int)c2.size() ? '\n' : ' ');
    cout << "G2 histogram: " << hist(c2) << "\n";

    // If the histograms differ, 1-WL distinguishes the graphs.
    // (Equal histograms are inconclusive here; a full test would refine the disjoint union.)
    cout << ((hist(c1) == hist(c2)) ? "1-WL: inconclusive (equal histograms)\n" : "1-WL: distinguishes\n");
    return 0;
}

This program runs 1-WL color refinement: it iteratively hashes each node’s color together with the multiset of neighbor colors until the coloring stabilizes. If the two graphs end up with different stabilized color histograms, 1-WL distinguishes them (and a maximally expressive MPNN, e.g., a GIN-style network, can as well); equal histograms are inconclusive in this per-graph setup. This illustrates the expressivity link.

Time: O(T (E log N)), where T is the number of refinement iterations (often small).
Space: O(N + E) for the graph plus O(N + E) for colors and per-iteration signatures.

Over-smoothing demo by repeated normalized averaging
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <random>

struct Graph { int n; std::vector<std::vector<int>> adj; };

// Apply H <- D_hat^{-1/2} (A + I) D_hat^{-1/2} H for all feature dimensions
std::vector<std::vector<double>> normalized_step(const Graph& G,
                                                 const std::vector<std::vector<double>>& H) {
    int n = G.n, d = (int)H[0].size();
    std::vector<int> deg_hat(n, 1);
    for (int v = 0; v < n; ++v) deg_hat[v] += (int)G.adj[v].size();
    std::vector<std::vector<double>> out(n, std::vector<double>(d, 0.0));
    for (int v = 0; v < n; ++v) {
        double selfw = 1.0 / std::sqrt((double)deg_hat[v] * (double)deg_hat[v]);
        for (int j = 0; j < d; ++j) out[v][j] += H[v][j] * selfw;
        for (int u : G.adj[v]) {
            double w = 1.0 / std::sqrt((double)deg_hat[v] * (double)deg_hat[u]);
            for (int j = 0; j < d; ++j) out[v][j] += H[u][j] * w;
        }
    }
    return out;
}

// Compute average pairwise cosine similarity across full node embedding vectors
double avg_pairwise_cosine(const std::vector<std::vector<double>>& H) {
    int n = (int)H.size();
    int d = (int)H[0].size();
    auto dot  = [&](int a, int b){ double s = 0; for (int j = 0; j < d; ++j) s += H[a][j] * H[b][j]; return s; };
    auto norm = [&](int a){ return std::sqrt(std::max(1e-12, dot(a, a))); };
    double sum = 0.0; int cnt = 0;
    for (int i = 0; i < n; ++i) for (int j = i + 1; j < n; ++j) {
        sum += dot(i, j) / (norm(i) * norm(j));
        ++cnt;
    }
    return (cnt ? sum / cnt : 0.0);
}

int main(){
    // Build a 10-node path graph (long, thin structure)
    Graph G; G.n = 10; G.adj.assign(G.n, {});
    auto add_edge = [&](int u, int v){ G.adj[u].push_back(v); G.adj[v].push_back(u); };
    for (int i = 0; i + 1 < G.n; ++i) add_edge(i, i + 1);

    // Random initial features (d = 4)
    int d = 4; std::mt19937 gen(0); std::normal_distribution<double> nd(0.0, 1.0);
    std::vector<std::vector<double>> H(G.n, std::vector<double>(d));
    for (int v = 0; v < G.n; ++v) for (int j = 0; j < d; ++j) H[v][j] = nd(gen);

    std::cout << "Iter\tAvgPairwiseCosine\n";
    for (int t = 0; t <= 10; ++t) {
        double c = avg_pairwise_cosine(H);
        std::cout << t << "\t" << c << "\n";
        // One normalized averaging step
        H = normalized_step(G, H);
    }
    std::cout << "Note: cosine similarity increases toward 1 as vectors become similar (over-smoothing).\n";
    return 0;
}

This program repeatedly applies normalized neighbor averaging (with self-loops) to random features on a path graph. It prints the average pairwise cosine similarity across nodes per iteration. As iterations increase, similarities rise toward 1, empirically showing over-smoothing.

Time: O(T E d) for T iterations over a graph with E edges and d feature dimensions.
Space: O(N d) to store features plus O(E) for the graph.