Streaming Algorithm Theory
Key Points
- Streaming algorithms process massive data in a single pass using sublinear—often polylogarithmic—memory.
- Approximate summaries like Count-Min Sketch, Misra–Gries, HyperLogLog, and AMS sketches trade tiny errors for huge memory and speed savings.
- Count-Min Sketch estimates item frequencies with additive error at most \(\varepsilon N\) with probability at least \(1-\delta\) using \(O(\frac{1}{\varepsilon}\log\frac{1}{\delta})\) counters.
- Misra–Gries finds heavy hitters with at most \(k-1\) counters and guarantees any item with frequency greater than \(n/k\) appears among the candidates.
- HyperLogLog estimates the number of distinct elements with relative error about \(1.04/\sqrt{m}\) using \(m\) tiny registers.
- AMS sketches estimate frequency moments (e.g., \(F_2\)) in the turnstile model with \(O(\varepsilon^{-2}\log(1/\delta))\) space.
- Lower bounds from communication complexity show many of these space trade-offs are essentially optimal.
- Sliding-window variants (e.g., exponential histograms) let you track only the most recent \(W\) items with provable error guarantees.
Prerequisites
- Hash functions and universal hashing — Sketches rely on pairwise or limited independence to control collision probabilities and variance.
- Probability basics and concentration bounds — Understanding ε-δ guarantees, variance reduction, and why median-of-means works requires probabilistic reasoning.
- Big-O notation and asymptotic analysis — Space–time trade-offs are expressed in O(·) terms like O(ε^{-2}) and O(log 1/δ).
- Modular arithmetic and bit operations — Implementations use modular hashing, bit masking, and leading-zero counts (e.g., in HyperLogLog).
- Maps and arrays in C++ (STL) — Sketches often combine hash maps for candidates and flat arrays for counters.
- Discrete distributions and random number generation — Simulations and randomized algorithms require proper RNG seeding and usage.
- Communication complexity (high level) — Explains why certain memory lower bounds are unavoidable in streaming.
- Numerical stability and precision — Estimators like AMS require care with squaring large sums and aggregating across repetitions.
Detailed Explanation
01 Overview
Streaming algorithm theory studies how to compute useful statistics from data that arrives as a high-speed stream, when you are allowed to scan it only once (or a few times) and have very limited memory. Instead of storing the entire dataset, you keep a compact sketch—small data structures that summarize key properties. These sketches support updates and queries in constant or logarithmic time and use memory that is sublinear in the input size, often polylogarithmic in the stream length n.

Core problems include frequency estimation (how many times did each item appear?), finding heavy hitters (which items occur most often?), counting distinct elements, and estimating higher-order moments (like the second moment F2 that reflects collision probability or variance-like behavior). There are multiple models: insertion-only streams (items only arrive), turnstile streams (items can be inserted and deleted), and sliding-window streams (only the last W items matter). Each model demands specific algorithmic techniques.

Well-known sketches include Count-Min Sketch for approximate frequency queries, Misra–Gries for heavy hitters, HyperLogLog for distinct counts, and AMS sketches for frequency moments such as F2. Error bounds are probabilistic and tunable via parameters ε (accuracy) and δ (failure probability). Importantly, communication complexity provides lower bounds showing that, for many tasks, these memory-accuracy trade-offs are essentially optimal, guiding both theoretical understanding and practical design.
02 Intuition & Analogies
Imagine trying to count the colors of cars passing a highway camera all day using just a few sticky notes. You cannot store every car, but you still want to answer questions like “How many red cars passed?” or “How many distinct colors did I see?” Streaming algorithms are those sticky notes: compact summaries that are constantly updated as new cars pass by.

Count-Min Sketch works like having several parking lots (rows), each with a different quirky attendant (hash function). Every time a car arrives, you ask each attendant to tally in a slot determined by the car’s license plate. Collisions inflate counts but never deflate them—so the estimate is an overcount, bounded by how busy the lot is.

Misra–Gries is like keeping k−1 name tags for possible VIPs. When a new car color arrives: if it’s already a VIP, increase its count; if there’s an empty tag, assign it; otherwise, reduce everyone’s counts. Only colors with truly high presence can survive persistent reductions, so VIPs at the end are your heavy hitters.

HyperLogLog estimates distinct counts the way you’d estimate how many coins fell on a floor by how far the furthest coin rolled. It hashes each item to a random bitstring and records the position of the first 1 bit; very long runs of leading zeros are rare unless you’ve seen many distinct items. Aggregating tiny registers gives an accurate population estimate.

AMS (for F2) treats each item’s identity as a coin flip that determines +1 or −1. You keep a running signed sum weighted by item frequencies. Squaring this aggregate cleverly captures the sum of squared frequencies; averaging independent copies reduces variance. These techniques feel counterintuitive at first, but the probabilistic structure ensures reliable answers with tiny memory.
03 Formal Definition
A data stream is a sequence \(a_1, a_2, \ldots, a_n\) of items drawn from a universe \(\{1, \ldots, U\}\). The frequency vector \(f = (f_1, \ldots, f_U)\) records how many times each item appears; in the turnstile model, an update \((i, \Delta)\) adds \(\Delta\) (possibly negative) to \(f_i\). The \(k\)-th frequency moment is \(F_k = \sum_i f_i^k\), so \(F_0\) is the number of distinct items and \(F_1 = n\) is the stream length. An algorithm \((\varepsilon, \delta)\)-approximates a quantity \(Q\) if its output \(\hat{Q}\) satisfies \(|\hat{Q} - Q| \le \varepsilon Q\) (relative error) or \(|\hat{Q} - Q| \le \varepsilon N\) (additive error) with probability at least \(1-\delta\), using space sublinear in \(n\) and \(U\).
04 When to Use
Use streaming sketches when data is too large or too fast to store or scan multiple times. Typical scenarios include network traffic monitoring (e.g., top talkers, flow sizes), logs and telemetry (e.g., unique users, event rates), large-scale machine learning pipelines (feature frequency tracking, approximate deduplication), real-time fraud/abuse detection (heavy hitters), and database systems (approximate query processing).
- Count-Min Sketch: when you need per-key frequency estimates, point queries, and quick updates under insert-only workloads. It is great for approximate histograms, popularity ranking, and as a building block to verify Misra–Gries candidates or to compress counters.
- Misra–Gries: when your goal is to find heavy hitters with small k (e.g., top-100). It avoids hashing collisions in estimates and uses deterministic logic with strong guarantees for insertion-only streams.
- HyperLogLog: when counting distinct elements at huge scale (billions of IDs) with tight memory budgets and quick merges across distributed workers. Its mergeability and tiny per-register footprint make it standard in analytics systems.
- AMS F2 sketch: when you need second-moment estimates in turnstile streams (insertions and deletions), e.g., to estimate self-join size in databases, detect DDoS-like variance spikes, or feed algorithms that depend on \(F_2\).
- Sliding-window variants (e.g., exponential histograms): when only the last \(W\) arrivals matter, as in real-time dashboards and alerting where stale data must decay out automatically.
⚠️ Common Mistakes
- Treating Count-Min estimates as unbiased: CMS overestimates due to collisions. Do not subtract the estimated background noise naively; prefer conservative updates or combine with a second pass (if possible) to validate heavy hitters.
- Using CMS in adversarial turnstile streams: negative updates can cancel collisions unpredictably. For general turnstile guarantees consider Count Sketch or AMS-style signs; for nonnegative frequencies CMS is ideal.
- Misinterpreting Misra–Gries counts: the counters are lower bounds; an item’s true count can exceed its stored counter by up to \(n/k\). You still need either a second pass or a secondary structure (like CMS) to refine counts if exactness matters.
- Overclaiming HyperLogLog memory as O(log log n) total: each register is O(log log n) bits, but total memory is O(m) registers where \(m \approx 1/\varepsilon^2\). Choose m to meet error targets and ensure good hashing.
- Poor hash choices: using low-quality or correlated hashes breaks independence assumptions and inflates error. Use 2-universal or stronger families; seed each row independently.
- Ignoring failure probability δ: set depth/replications via \(\log(1/\delta)\). Forgetting this may yield unstable estimates in production.
- Forgetting the model: some sketches assume insertion-only, others support turnstile or sliding windows. Applying the wrong sketch to deletions or time-decayed data can silently invalidate guarantees.
- Not planning validation: for heavy hitters, you often need a verification step (e.g., a compact exact map for top candidates, sampling, or a bounded-size cache) to minimize false positives.
Key Formulas
Frequency moment: \(F_k = \sum_i f_i^k\)
Explanation: The k-th moment of the frequency vector summarizes distribution shape. F0 counts distinct items, F1 is the stream length, and F2 reflects concentration of mass.
Distinct count: \(F_0 = |\{\, i : f_i > 0 \,\}|\)
Explanation: F0 counts how many unique items appear. Distinct counting sketches like HyperLogLog approximate this value.
CMS dimensions: \(w = \lceil e/\varepsilon \rceil\), \(d = \lceil \ln(1/\delta) \rceil\)
Explanation: Choosing width w and depth d this way yields additive error at most \(\varepsilon N\) with probability at least \(1-\delta\). Larger w and d improve accuracy and reliability.
CMS point query bound: \(f_i \le \hat{f}_i \le f_i + \varepsilon N\) with probability at least \(1-\delta\)
Explanation: The CMS estimate overcounts due to collisions. With appropriate w and d, the error is at most \(\varepsilon\) times the total count N with high probability.
Misra–Gries heavy hitter guarantee: \(f_i > n/k \;\Rightarrow\; i\) is among the stored candidates
Explanation: Any item heavier than n/k is guaranteed to survive to the end as a candidate. Its counter underestimates by at most n/k.
Misra–Gries error: \(f_i - n/k \le \hat{f}_i \le f_i\)
Explanation: The final counter for any item is within n/k of its true frequency. This bounds false negatives when detecting heavy hitters.
AMS F2 estimator: \(Z_j = \sum_i g_j(i)\, f_i\) with \(g_j(i) \in \{-1,+1\}\), and \(\mathbb{E}[Z_j^2] = F_2\)
Explanation: Using random signs, the squared sum is an unbiased estimator of F2. Averaging or median-of-means over repetitions reduces variance.
HyperLogLog estimator: \(\hat{n} = \alpha_m m^2 \big(\sum_{j=1}^{m} 2^{-M[j]}\big)^{-1}\)
Explanation: HLL uses m registers M[j] holding leading-zero runs of hashed items. The harmonic mean of 2^{M[j]} scaled by \(\alpha_m\) yields the distinct count; error is about \(1.04/\sqrt{m}\).
AMS accuracy: \(O(\varepsilon^{-2}\log(1/\delta))\) repetitions give \(|\hat{F}_2 - F_2| \le \varepsilon F_2\) with probability \(1-\delta\)
Explanation: Repeating independent AMS estimators and aggregating appropriately achieves a fully polynomial accuracy-reliability trade-off.
Exponential histogram space: \(O\big(\frac{1}{\varepsilon}\log^2 W\big)\) bits
Explanation: To maintain approximate counts over the last W items within \(\varepsilon W\), exponential histograms use logarithmically many buckets per scale.
Communication lower bound for F2: \(\Omega(1/\varepsilon^2)\) bits
Explanation: Reductions from communication complexity imply that any one-pass algorithm approximating F2 needs at least on the order of \(1/\varepsilon^2\) space.
Lower bound for distinct counting: \(\Omega(1/\varepsilon^2 + \log n)\) bits
Explanation: Approximating distinct elements within relative error \(\varepsilon\) requires at least this much memory in one pass.
Complexity Analysis
- Count-Min Sketch: \(O(\frac{1}{\varepsilon}\log\frac{1}{\delta})\) counters; each update and point query touches \(d = O(\log\frac{1}{\delta})\) cells, one hash evaluation per row.
- Misra–Gries: \(O(k)\) counters; amortized \(O(1)\) time per update, since every decrement is paid for by an earlier increment.
- HyperLogLog: \(m \approx (1.04/\varepsilon)^2\) registers of \(O(\log\log n)\) bits each; \(O(1)\) time per update and mergeable across workers.
- AMS F2: \(O(\varepsilon^{-2}\log\frac{1}{\delta})\) accumulators; each turnstile update performs one sign hash and one addition per accumulator.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// A simple, practical Count-Min Sketch implementation for insertion-only streams.
// Guarantees: with width w = ceil(e/eps) and depth d = ceil(log(1/delta)),
// estimate(key) <= true_count(key) + eps * N with probability >= 1 - delta,
// where N is the total number of insertions.

struct CountMinSketch {
    int w, d;                        // width and depth
    vector<vector<long long>> tbl;   // counters (signed here for simplicity)
    vector<uint64_t> seeds;          // independent seeds per row

    // SplitMix64: fast 64-bit hash mixer
    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Construct from (eps, delta)
    CountMinSketch(double eps = 1e-3, double delta = 1e-6) {
        if (eps <= 0) eps = 1e-9;
        if (delta <= 0) delta = 1e-12;
        w = (int)ceil(exp(1.0) / eps);
        d = (int)ceil(log(1.0 / delta));
        tbl.assign(d, vector<long long>(w, 0));
        seeds.resize(d);
        std::mt19937_64 rng(123456789);
        for (int i = 0; i < d; ++i) seeds[i] = rng();
    }

    // Hash (key, row) -> [0, w)
    inline uint32_t index_hash(uint64_t key, int row) const {
        uint64_t h = splitmix64(key ^ seeds[row]);
        return (uint32_t)(h % (uint64_t)w);
    }

    // Update with increment delta >= 0 (insertion-only). For generality we allow signed.
    void update(uint64_t key, long long delta = 1) {
        for (int r = 0; r < d; ++r) {
            uint32_t c = index_hash(key, r);
            tbl[r][c] += delta;  // CMS works best with nonnegative totals
        }
    }

    // Point query: min across rows
    long long estimate(uint64_t key) const {
        long long ans = LLONG_MAX;
        for (int r = 0; r < d; ++r) {
            uint32_t c = index_hash(key, r);
            ans = min(ans, tbl[r][c]);
        }
        return ans;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: track per-key frequencies with tight memory.
    CountMinSketch cms(1e-3, 1e-6);  // eps=0.001, delta=1e-6

    // Simulate a skewed stream: key 42 is heavy.
    const int N = 500000;
    std::mt19937 rng(42);
    std::discrete_distribution<int> dist({1, 100});  // 2 keys: 7 is light, 42 is heavy

    long long true_7 = 0, true_42 = 0;
    for (int i = 0; i < N; ++i) {
        int pick = dist(rng);
        uint64_t key = (pick == 0 ? 7ULL : 42ULL);
        if (key == 7ULL) ++true_7; else ++true_42;
        cms.update(key, 1);
    }

    long long est7 = cms.estimate(7);
    long long est42 = cms.estimate(42);

    cout << "True count(7)   = " << true_7 << "\n";
    cout << "CMS estimate(7) = " << est7 << "\n";
    cout << "True count(42)  = " << true_42 << "\n";
    cout << "CMS estimate(42)= " << est42 << "\n";

    // The overestimation should be bounded by eps * N (about 500 here) with high probability.
    return 0;
}
This program implements Count-Min Sketch with d independent hash rows and width w. Each update increments one counter per row, and a point query returns the minimum across rows. The example simulates a skewed stream where key 42 is heavy; the CMS estimates should slightly overcount but remain within the additive bound εN with high probability.
#include <bits/stdc++.h>
using namespace std;

// Misra–Gries keeps at most k-1 counters and finds all items with frequency > n/k.
// The stored counts are LOWER bounds; each candidate's true count can be up to n/k higher.

template <class Key>
struct MisraGries {
    int k;                              // we store at most k-1 keys
    unordered_map<Key, long long> cnt;  // candidate counters
    long long n_seen = 0;               // total processed items

    explicit MisraGries(int k_) : k(max(2, k_)) {}

    void update(const Key& x, long long times = 1) {
        // Standard MG is for insertion-only (times >= 0). Apply 'times' updates one by one if small.
        // For simplicity, assume times >= 0 in this demo.
        for (long long t = 0; t < times; ++t) {
            ++n_seen;
            auto it = cnt.find(x);
            if (it != cnt.end()) {
                ++(it->second);
            } else if ((int)cnt.size() < k - 1) {
                cnt[x] = 1;
            } else {
                // Decrement all counters; erase those that reach zero.
                vector<Key> to_erase;
                for (auto& p : cnt) {
                    if (--p.second == 0) to_erase.push_back(p.first);
                }
                for (auto& z : to_erase) cnt.erase(z);
            }
        }
    }

    // Return current candidates and their lower-bound counts
    unordered_map<Key, long long> candidates() const { return cnt; }

    // Error bound: for any key i, true f_i - stored[i] <= n_seen / k
    pair<long long, long long> bounds_for(const Key& x) const {
        long long lb = 0;
        auto it = cnt.find(x);
        if (it != cnt.end()) lb = it->second;
        long long ub = lb + n_seen / k;  // MG bound
        return {lb, ub};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    MisraGries<int> mg(101);  // find all items with freq > n/101 (~top-1%)

    // Simulate a stream with a few heavy hitters.
    std::mt19937 rng(123);
    vector<int> keys = {1,2,3,4,5,6,7,8,9,10};
    std::discrete_distribution<int> dist({1,1,1,1,1,1,1,1,40,50});  // 9 and 10 are heavy

    const int N = 200000;
    for (int i = 0; i < N; ++i) {
        int x = keys[dist(rng)];
        mg.update(x);
    }

    auto cand = mg.candidates();
    cout << "Number of candidates <= k-1 = " << cand.size() << "\n";
    // Report bounds for some keys
    for (int q : {9, 10, 1, 2}) {
        auto [lb, ub] = mg.bounds_for(q);
        cout << "Key " << q << ": lower_bound=" << lb << ", upper_bound<=" << ub << "\n";
    }

    // Note: To get exact frequencies of these candidates, perform a second pass or
    // maintain an auxiliary exact map only for candidates.
    return 0;
}
This code implements the Misra–Gries algorithm to track heavy hitters with at most k−1 counters. Each new item increments its counter, fills an empty slot, or decrements all counters when full. Any item whose frequency exceeds n/k is guaranteed to be in candidates(). The bounds_for method reports the Misra–Gries theoretical bounds: stored count is a lower bound and the error is at most n/k.
#include <bits/stdc++.h>
using namespace std;

// AMS F2 sketch: maintains R independent signed sums Z_j = sum_i g_j(i) * f_i.
// Query returns a robust aggregate (median) of Z_j^2 to estimate F2 = sum_i f_i^2.
// This demo uses a pairwise-universal hash mod 2^61-1 to generate g_j(i) in {-1, +1}.
// Note: pairwise independence already makes E[Z^2] = F2 (unbiased); the textbook
// variance bound additionally assumes 4-wise independent signs.

struct PairHash61 {
    // Hash function: h(x) = (a*x + b) mod (2^61 - 1). We use its lowest bit for sign.
    static constexpr unsigned __int128 MOD = ((unsigned __int128)1 << 61) - 1;  // 2^61 - 1
    unsigned long long a, b;  // parameters in [1, MOD-1]

    explicit PairHash61(uint64_t seed) {
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<unsigned long long> dist(1ULL, (1ULL << 61) - 2ULL);
        a = dist(rng);
        b = dist(rng);
    }

    static inline unsigned __int128 mul_mod(unsigned __int128 x, unsigned __int128 y) {
        unsigned __int128 z = x * y;
        // reduce z mod (2^61 - 1)
        unsigned __int128 lo = z & ((((unsigned __int128)1) << 61) - 1);
        unsigned __int128 hi = z >> 61;
        unsigned __int128 r = lo + hi;
        while (r >= MOD) r -= MOD;
        return r;
    }

    static inline unsigned __int128 add_mod(unsigned __int128 x, unsigned __int128 y) {
        unsigned __int128 r = x + y;
        if (r >= MOD) r -= MOD;
        return r;
    }

    inline uint64_t hash64(uint64_t x) const {
        unsigned __int128 X = x;
        unsigned __int128 A = a, B = b;
        unsigned __int128 t = add_mod(mul_mod(A, X), B);
        // return as 64-bit value (we only need LSB for sign)
        return (uint64_t)t;
    }

    inline int sign(uint64_t x) const {
        // map to +1/-1 using lowest bit
        return (hash64(x) & 1ULL) ? +1 : -1;
    }
};

struct AMSSketchF2 {
    int R;                  // number of repetitions
    vector<PairHash61> h;   // independent hashers for signs
    vector<long double> Z;  // accumulators (can be large in magnitude)

    explicit AMSSketchF2(int R_, uint64_t base_seed = 2024) : R(max(1, R_)) {
        h.reserve(R);
        Z.assign(R, 0.0L);
        std::mt19937_64 rng(base_seed);
        for (int j = 0; j < R; ++j) h.emplace_back(rng());
    }

    // Turnstile update: (key, delta), delta can be negative.
    void update(uint64_t key, long long delta) {
        for (int j = 0; j < R; ++j) {
            int s = h[j].sign(key);
            Z[j] += (long double)s * (long double)delta;
        }
    }

    // Estimate F2 via robust aggregation of Z_j^2. Median is simple and effective.
    long double estimate_F2() const {
        vector<long double> vals(R);
        for (int j = 0; j < R; ++j) vals[j] = Z[j] * Z[j];
        nth_element(vals.begin(), vals.begin() + R / 2, vals.end());
        return vals[R / 2];
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build an AMS sketch with R repetitions. For (eps, delta) one often sets
    // R = c * eps^{-2} * log(1/delta).
    int R = 81;  // eps ~ 1/sqrt(R) ~ 0.11 (rough intuition); use larger R for tighter error
    AMSSketchF2 ams(R);

    // We'll also keep an exact map for validation in this demo (not part of the sketch itself).
    unordered_map<uint64_t, long long> exact;

    // Simulate a turnstile stream with insertions and deletions
    std::mt19937_64 rng(7);
    uniform_int_distribution<uint64_t> keydist(1, 1'000'000);
    for (int t = 0; t < 200000; ++t) {
        uint64_t k = keydist(rng);
        int delta = (rng() & 1) ? +1 : -1;  // random +/- 1
        ams.update(k, delta);
        exact[k] += delta;
        if (exact[k] == 0) exact.erase(k);
    }

    // Exact F2 for comparison (requires memory and is only for demonstration)
    long double F2_exact = 0.0L;
    for (auto& p : exact) F2_exact += (long double)p.second * (long double)p.second;

    long double F2_est = ams.estimate_F2();

    cout.setf(std::ios::fixed);
    cout << setprecision(2);
    cout << "Exact F2     = " << (double)F2_exact << "\n";
    cout << "AMS estimate = " << (double)F2_est << "\n";
    cout << "Rel. error   = " << (double)(fabsl(F2_est - F2_exact) / max(1.0L, F2_exact)) << "\n";
    return 0;
}
This program implements an AMS sketch to estimate the second frequency moment F2 in the turnstile model. Each repetition uses a pairwise-universal hash to generate a random sign for each key; the accumulator Z stores the signed sum of updates. Squaring Z gives an unbiased estimator for F2 (pairwise independence suffices for unbiasedness, though the textbook variance analysis assumes 4-wise independent signs); taking the median across repetitions improves concentration. The demo validates the estimate against an exact map.