
Streaming Algorithm Theory

Key Points

  • Streaming algorithms process massive data in a single pass (or a few passes) using sublinear—often polylogarithmic—memory.
  • Approximate summaries like Count-Min Sketch, Misra–Gries, HyperLogLog, and AMS sketches trade tiny errors for huge memory and speed savings.
  • Count-Min Sketch estimates item frequencies with additive error at most \(\varepsilon N\) with probability at least \(1-\delta\) using \(O(\tfrac{1}{\varepsilon}\log\tfrac{1}{\delta})\) counters.
  • Misra–Gries finds heavy hitters with at most \(k-1\) counters and guarantees any item with frequency greater than \(n/k\) appears among the candidates.
  • HyperLogLog estimates the number of distinct elements with relative error about \(1.04/\sqrt{m}\) using \(m\) tiny registers.
  • AMS sketches estimate frequency moments (e.g., \(F_2\)) in the turnstile model with \(O(\varepsilon^{-2}\log\tfrac{1}{\delta})\) space.
  • Lower bounds from communication complexity show many of these space trade-offs are essentially optimal.
  • Sliding-window variants (e.g., exponential histograms) let you track only the most recent \(W\) items with provable error guarantees.

Prerequisites

  • Hash functions and universal hashing: sketches rely on pairwise or limited independence to control collision probabilities and variance.
  • Probability basics and concentration bounds: understanding ε-δ guarantees, variance reduction, and why median-of-means works requires probabilistic reasoning.
  • Big-O notation and asymptotic analysis: space–time trade-offs are expressed in O(·) terms like O(ε^{-2}) and O(log 1/δ).
  • Modular arithmetic and bit operations: implementations use modular hashing, bit masking, and leading-zero counts (e.g., in HyperLogLog).
  • Maps and arrays in C++ (STL): sketches often combine hash maps for candidates and flat arrays for counters.
  • Discrete distributions and random number generation: simulations and randomized algorithms require proper RNG seeding and usage.
  • Communication complexity (high level): explains why certain memory lower bounds are unavoidable in streaming.
  • Numerical stability and precision: estimators like AMS require care with squaring large sums and aggregating across repetitions.

Detailed Explanation


01 Overview

Streaming algorithm theory studies how to compute useful statistics from data that arrives as a high-speed stream, when you are allowed to scan it only once (or a few times) and have very limited memory. Instead of storing the entire dataset, you keep compact sketches—small data structures that summarize key properties. These sketches support updates and queries in constant or logarithmic time and use memory that is sublinear in the input size, often polylogarithmic in the stream length n.

Core problems include frequency estimation (how many times did each item appear?), finding heavy hitters (which items occur most often?), counting distinct elements, and estimating higher-order moments (like the second moment F2, which reflects collision probability or variance-like behavior). There are multiple models: insertion-only streams (items only arrive), turnstile streams (items can be inserted and deleted), and sliding-window streams (only the last W items matter). Each model demands specific algorithmic techniques.

Well-known sketches include Count-Min Sketch for approximate frequency queries, Misra–Gries for heavy hitters, HyperLogLog for distinct counts, and AMS sketches for frequency moments such as F2. Error bounds are probabilistic and tunable via parameters ε (accuracy) and δ (failure probability). Importantly, communication complexity provides lower bounds showing that, for many tasks, these memory-accuracy trade-offs are essentially optimal, guiding both theoretical understanding and practical design.

02 Intuition & Analogies

Imagine trying to count the colors of cars passing a highway camera all day using just a few sticky notes. You cannot store every car, but you still want to answer questions like “How many red cars passed?” or “How many distinct colors did I see?” Streaming algorithms are those sticky notes: compact summaries that are constantly updated as new cars pass by.

Count-Min Sketch works like having several parking lots (rows), each with a different quirky attendant (hash function). Every time a car arrives, you ask each attendant to tally in a slot determined by the car’s license plate. Collisions inflate counts but never deflate them—so the estimate is an overcount, bounded by how busy the lot is.

Misra–Gries is like keeping k−1 name tags for possible VIPs. When a new car color arrives: if it’s already a VIP, increase its count; if there’s an empty tag, assign it; otherwise, reduce everyone’s counts. Only colors with truly high presence can survive persistent reductions, so the VIPs at the end are your heavy hitters.

HyperLogLog estimates distinct counts the way you’d estimate how many coins fell on a floor by how far the furthest coin rolled. It hashes each item to a random bitstring and records the position of the first 1 bit; very long runs of leading zeros are rare unless you’ve seen many distinct items. Aggregating tiny registers gives an accurate population estimate.

AMS (for F2) treats each item’s identity as a coin flip that determines +1 or −1. You keep a running signed sum weighted by item frequencies. Squaring this aggregate cleverly captures the sum of squared frequencies; averaging independent copies reduces variance. These techniques feel counterintuitive at first, but the probabilistic structure ensures reliable answers with tiny memory.
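
Since HyperLogLog has no full listing in the Code Examples below, here is a minimal, illustrative C++ sketch of the leading-zeros idea just described. It is a sketch under stated assumptions, not a production implementation: the splitmix64 mixer stands in for a good 64-bit hash, the bias constant \(\alpha_m \approx 0.7213/(1+1.079/m)\) is the standard choice for \(m \ge 128\), and only the small-range (linear counting) correction is included.

#include <bits/stdc++.h>
using namespace std;

// Minimal HyperLogLog sketch (illustrative): m = 2^p registers, each storing
// the maximum "rank" (position of the first 1 bit) among items routed to it.
struct HyperLogLog {
    int p, m;
    vector<uint8_t> M;

    explicit HyperLogLog(int p_ = 12) : p(p_), m(1 << p_), M(1 << p_, 0) {}

    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    void add(uint64_t key) {
        uint64_t h = splitmix64(key);
        uint32_t j = (uint32_t)(h >> (64 - p));   // first p bits pick a register
        uint64_t rest = h << p;                   // remaining bits decide the rank
        uint8_t rank = (uint8_t)(rest ? __builtin_clzll(rest) + 1 : 64 - p + 1);
        M[j] = max(M[j], rank);
    }

    double estimate() const {
        double alpha = 0.7213 / (1.0 + 1.079 / m); // standard bias constant for m >= 128
        double sum = 0.0;
        int zeros = 0;
        for (int j = 0; j < m; ++j) {
            sum += ldexp(1.0, -M[j]);              // 2^{-M[j]}
            if (M[j] == 0) ++zeros;
        }
        double est = alpha * m * (double)m / sum;
        // Small-range correction (linear counting) when many registers are empty.
        if (est <= 2.5 * m && zeros > 0) est = m * log((double)m / zeros);
        return est;
    }
};

int main() {
    HyperLogLog hll(12);                           // m = 4096 registers
    const int DISTINCT = 100000;
    for (int rep = 0; rep < 3; ++rep)              // duplicates don't change the estimate
        for (int i = 1; i <= DISTINCT; ++i) hll.add((uint64_t)i);
    cout << "True distinct = " << DISTINCT << "\n";
    cout << "HLL estimate  = " << (long long)hll.estimate() << "\n";
}

With \(p = 12\) (4096 registers, a few kilobytes), the expected relative error is roughly \(1.04/\sqrt{4096} \approx 1.6\%\).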

03 Formal Definition

A data stream is a sequence \(a_1, a_2, \ldots, a_n\) over a universe \([U]\). In the insertion-only model, items arrive and update frequency counts \(f_i \leftarrow f_i + 1\). In the turnstile model, updates are pairs \((i, \Delta)\) with \(\Delta \in \mathbb{Z}\), so \(f_i \leftarrow f_i + \Delta\). A streaming algorithm maintains a state \(S\) of size \(O(\mathrm{polylog}\, n)\) (or otherwise sublinear in \(n\)) and supports queries about the frequency vector \(f\). Frequency moments are \(F_k = \sum_i f_i^k\). Key special cases: \(F_0\) (distinct count), \(F_1 = \sum_i f_i = n\) (stream length), and \(F_2 = \sum_i f_i^2\) (second moment). An \((\varepsilon, \delta)\)-approximation outputs \(\hat v\) such that, with probability at least \(1-\delta\), either \((1-\varepsilon)v \le \hat v \le (1+\varepsilon)v\) (multiplicative) or \(|\hat v - v| \le \varepsilon V\) (additive relative to a scale \(V\)).

Count-Min Sketch uses \(d\) pairwise-independent hash functions \(h_j : [U] \to [w]\) and a \(d \times w\) table \(C\). Update \((i, \Delta)\) adds \(\Delta\) to the counters \(C[j, h_j(i)]\) for every row \(j\). The estimate is \(\hat f(i) = \min_j C[j, h_j(i)]\), yielding additive error \(\varepsilon N\) with probability at least \(1-\delta\) for \(w = \lceil e/\varepsilon \rceil\), \(d = \lceil \ln(1/\delta) \rceil\). Misra–Gries keeps at most \(k-1\) counters; any item with \(f_i > n/k\) is guaranteed to be in the final candidate set, and its stored count underestimates \(f_i\) by at most \(n/k\). The AMS \(F_2\) sketch draws 4-wise independent signs \(g(i) \in \{-1,+1\}\) (pairwise independence already gives unbiasedness; 4-wise is what the standard variance bound requires) and maintains \(Z = \sum_i g(i)\, f_i\); \(Z^2\) is an unbiased estimator of \(F_2\). Repeating \(O(\varepsilon^{-2}\log(1/\delta))\) times and aggregating (e.g., median of means) gives a \((1 \pm \varepsilon)\)-approximation with failure probability at most \(\delta\).
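
As a concrete instance of the Count-Min sizing formulas (the 8-byte counter width below is only an assumption for the arithmetic): for \(\varepsilon = 10^{-3}\) and \(\delta = 10^{-6}\), \(w = \lceil e/\varepsilon \rceil = \lceil 2718.3 \rceil = 2719\) and \(d = \lceil \ln(1/\delta) \rceil = \lceil 13.82 \rceil = 14\), so the table holds \(2719 \times 14 = 38{,}066\) counters, about 300 KB with 64-bit counters, independent of how long the stream runs. Any point query then errs by at most \(\varepsilon N = N/1000\) with probability at least \(1 - 10^{-6}\).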

04 When to Use

Use streaming sketches when data is too large or too fast to store or scan multiple times. Typical scenarios include network traffic monitoring (e.g., top talkers, flow sizes), logs and telemetry (e.g., unique users, event rates), large-scale machine learning pipelines (feature frequency tracking, approximate deduplication), real-time fraud/abuse detection (heavy hitters), and database systems (approximate query processing).

  • Count-Min Sketch: when you need per-key frequency estimates, point queries, and quick updates under insert-only workloads. It is great for approximate histograms, popularity ranking, and as a building block to verify Misra–Gries candidates or to compress counters.
  • Misra–Gries: when your goal is to find heavy hitters with small k (e.g., top-100). It avoids hashing collisions in estimates and uses deterministic logic with strong guarantees for insertion-only streams.
  • HyperLogLog: when counting distinct elements at huge scale (billions of IDs) with tight memory budgets and quick merges across distributed workers. Its mergeability and tiny per-register footprint make it standard in analytics systems.
  • AMS F2 sketch: when you need second-moment estimates in turnstile streams (insertions and deletions), e.g., to estimate self-join size in databases, detect DDoS-like variance spikes, or feed algorithms that depend on \(F_2\).
  • Sliding-window variants (e.g., exponential histograms): when only the last \(W\) arrivals matter, as in real-time dashboards and alerting where stale data must decay out automatically.

⚠️ Common Mistakes

  • Treating Count-Min estimates as unbiased: CMS overestimates due to collisions. Do not subtract the estimated background noise naively; prefer conservative updates (see the sketch after this list) or combine with a second pass (if possible) to validate heavy hitters.
  • Using CMS in adversarial turnstile streams: negative updates can cancel collisions unpredictably. For general turnstile guarantees consider Count Sketch or AMS-style signs; for nonnegative frequencies CMS is ideal.
  • Misinterpreting Misra–Gries counts: the counters are lower bounds; an item’s true count can exceed its stored counter by up to \(n/k\). You still need either a second pass or a secondary structure (like CMS) to refine counts if exactness matters.
  • Overclaiming HyperLogLog memory as O(log log n) total: each register is O(log log n) bits, but there are \(m \approx 1/\varepsilon^2\) registers in total. Choose m to meet error targets and ensure good hashing.
  • Poor hash choices: using low-quality or correlated hashes breaks independence assumptions and inflates error. Use 2-universal or stronger families; seed each row independently.
  • Ignoring failure probability δ: set depth/replications via \(\log(1/\delta)\). Forgetting this may yield unstable estimates in production.
  • Forgetting the model: some sketches assume insertion-only, others support turnstile or sliding windows. Applying the wrong sketch to deletions or time-decayed data can silently invalidate guarantees.
  • Not planning validation: for heavy hitters, you often need a verification step (e.g., a compact exact map for top candidates, sampling, or a bounded-size cache) to minimize false positives.
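
As referenced in the first bullet above, here is a minimal sketch of the conservative-update rule for a Count-Min table. The ConservativeCMS name and layout are illustrative (they mirror the CountMinSketch example further below); the only change from plain CMS is the update rule, and it applies to insertion-only (nonnegative) streams.

#include <bits/stdc++.h>
using namespace std;

// Conservative update for a Count-Min table (illustrative sketch).
// Idea: on an insert, raise only the counters that currently equal the
// minimum estimate; counters that are already larger stay put. This never
// breaks the overestimate property but reduces collision noise.
struct ConservativeCMS {
    int w, d;
    vector<vector<long long>> tbl;
    vector<uint64_t> seeds;

    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    ConservativeCMS(double eps, double delta) {
        w = (int)ceil(exp(1.0) / eps);
        d = (int)ceil(log(1.0 / delta));
        tbl.assign(d, vector<long long>(w, 0));
        seeds.resize(d);
        mt19937_64 rng(99);
        for (int r = 0; r < d; ++r) seeds[r] = rng();
    }

    inline uint32_t idx(uint64_t key, int r) const {
        return (uint32_t)(splitmix64(key ^ seeds[r]) % (uint64_t)w);
    }

    // Insertion-only conservative update: new value = max(counter, min + delta).
    void update(uint64_t key, long long delta = 1) {
        long long mn = LLONG_MAX;
        for (int r = 0; r < d; ++r) mn = min(mn, tbl[r][idx(key, r)]);
        for (int r = 0; r < d; ++r) {
            long long &c = tbl[r][idx(key, r)];
            c = max(c, mn + delta);   // lift counters only up to the new minimum
        }
    }

    long long estimate(uint64_t key) const {
        long long mn = LLONG_MAX;
        for (int r = 0; r < d; ++r) mn = min(mn, tbl[r][idx(key, r)]);
        return mn;
    }
};

int main() {
    ConservativeCMS cms(1e-3, 1e-6);
    for (int i = 0; i < 1000; ++i) cms.update(42);
    cms.update(7);
    cout << "est(42) = " << cms.estimate(42) << ", est(7) = " << cms.estimate(7) << "\n";
}

Because counters are raised only as far as needed to preserve the minimum, estimates can only shrink relative to plain CMS while the one-sided (overcount) guarantee is preserved.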

Key Formulas

Frequency moment

\( F_k = \sum_i f_i^k \)

Explanation: The k-th moment of the frequency vector summarizes distribution shape. F0 counts distinct items, F1 is the stream length, and F2 reflects concentration of mass.
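
A quick worked instance on a hypothetical six-item stream: for (a, b, a, c, a, b) the frequency vector is \(f = (3, 2, 1)\), so \(F_0 = 3\), \(F_1 = 3 + 2 + 1 = 6\) (the stream length), and \(F_2 = 3^2 + 2^2 + 1^2 = 14\). A uniform stream of the same length over six distinct items would instead give \(F_2 = 6\), illustrating how \(F_2\) grows as mass concentrates.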

Distinct count

\( F_0 = \left|\{\, i : f_i > 0 \,\}\right| \)

Explanation: F0 counts how many unique items appear. Distinct counting sketches like HyperLogLog approximate this value.

CMS dimensions

\( w = \lceil e/\varepsilon \rceil, \qquad d = \lceil \ln(1/\delta) \rceil \)

Explanation: Choosing width w and depth d this way yields additive error at most \(\varepsilon N\) with probability at least \(1-\delta\). Larger w and d improve accuracy and reliability.

CMS point query bound

\( f_i \;\le\; \hat f_i \;\le\; f_i + \varepsilon N \quad \text{with probability at least } 1-\delta \)

Explanation: The CMS estimate overcounts due to collisions. With appropriate w and d, the error is at most \(\varepsilon\) times the total count N with high probability.

Misra–Gries heavy hitter guarantee

\( f_i > n/k \;\Longrightarrow\; i \text{ is in the final candidate set} \)

Explanation: Any item heavier than n/k is guaranteed to survive to the end as a candidate. Its counter underestimates by at most n/k.

Misra–Gries error

\( f_i - \tfrac{n}{k} \;\le\; \hat f_i \;\le\; f_i \)

Explanation: The final counter for any item is within n/k of its true frequency. This bounds false negatives when detecting heavy hitters.

AMS F2 estimator

\( Z = \sum_i g(i)\, f_i, \qquad \mathbb{E}\left[Z^2\right] = F_2 \)

Explanation: Using random signs, the squared sum is an unbiased estimator of F2. Averaging or median-of-means over repetitions reduces variance.

HyperLogLog estimator

\( \hat F_0 = \alpha_m\, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1} \)

Explanation: HLL uses m registers M[j] holding leading-zero runs of hashed items. The harmonic mean of \(2^{M[j]}\) scaled by \(\alpha_m m^2\) yields the distinct count; error is about \(1.04/\sqrt{m}\).

AMS accuracy

\( R = O\!\left(\varepsilon^{-2}\log(1/\delta)\right) \text{ repetitions give a } (1\pm\varepsilon)\text{-approximation with probability } \ge 1-\delta \)

Explanation: Repeating independent AMS estimators and aggregating appropriately (e.g., median of means, sketched below) achieves a fully polynomial accuracy-reliability trade-off.
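
To make the aggregation step concrete, here is a small, self-contained median-of-means helper. It is illustrative: the median_of_means name, the group count B, and the toy Gaussian noise in the demo are assumptions, not part of AMS itself.

#include <bits/stdc++.h>
using namespace std;

// Median-of-means aggregation: split R estimates into groups, average within
// each group (reduces variance), then take the median across groups (boosts
// the success probability; B ~ log(1/delta) groups suffice).
double median_of_means(vector<double> vals, int B) {
    int R = (int)vals.size(), per = max(1, R / B);
    vector<double> means;
    for (int g = 0; g + per <= R; g += per) {
        double s = 0;
        for (int i = g; i < g + per; ++i) s += vals[i];
        means.push_back(s / per);
    }
    nth_element(means.begin(), means.begin() + means.size() / 2, means.end());
    return means[means.size() / 2];
}

int main() {
    // Toy demo: 90 noisy unbiased estimates of the true value 100.
    mt19937 rng(5);
    normal_distribution<double> noise(0.0, 30.0);
    vector<double> est(90);
    for (auto& v : est) v = 100.0 + noise(rng);
    cout << "median-of-means (B=9): " << median_of_means(est, 9) << "\n";
}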

Exponential histogram space

\( O\!\left(\tfrac{1}{\varepsilon}\log W\right) \text{ buckets, i.e. } O\!\left(\tfrac{1}{\varepsilon}\log^2 W\right) \text{ bits} \)

Explanation: To maintain counts over the last W items within relative error \(\varepsilon\), exponential histograms keep \(O(1/\varepsilon)\) buckets at each of \(O(\log W)\) scales; a simplified implementation sketch follows.
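
Since exponential histograms have no listing elsewhere in this section, here is a simplified DGIM-style sketch for counting ones among the last \(W\) bits. It is illustrative rather than textbook-optimal: the ExpHistogram name is ours, the per-size bucket cap of \(\lceil 1/\varepsilon \rceil\) is one standard parameterization, and merges use a linear scan for clarity.

#include <bits/stdc++.h>
using namespace std;

// Simplified DGIM-style exponential histogram: approximate count of 1-bits
// among the last W positions of a bit stream. Buckets store (timestamp of
// newest 1, power-of-two size), newest first; at most cap+1 buckets per size.
struct ExpHistogram {
    long long W, t = 0;                      // window length, current time
    int cap;                                 // ~ceil(1/eps) buckets allowed per size
    deque<pair<long long, long long>> b;     // (timestamp, size), newest at front

    ExpHistogram(long long W_, double eps)
        : W(W_), cap(max(1, (int)ceil(1.0 / eps))) {}

    void add(int bit) {
        ++t;
        // Expire buckets whose newest 1 has slid out of the window.
        while (!b.empty() && b.back().first <= t - W) b.pop_back();
        if (!bit) return;
        b.emplace_front(t, 1);
        long long size = 1;
        while (true) {
            // Locate the two oldest buckets of this size (linear scan for clarity).
            int oldest = -1, second = -1, cnt = 0;
            for (int i = (int)b.size() - 1; i >= 0; --i) {
                if (b[i].second == size) {
                    ++cnt;
                    if (oldest < 0) oldest = i;
                    else if (second < 0) second = i;
                }
            }
            if (cnt <= cap + 1) break;
            b[second].second *= 2;           // merged bucket keeps the newer timestamp
            b.erase(b.begin() + oldest);
            size *= 2;                       // the merge may overflow the next size
        }
    }

    long long estimate() const {
        if (b.empty()) return 0;
        long long total = 0;
        for (auto &p : b) total += p.second;
        return total - b.back().second / 2;  // oldest bucket may straddle the window
    }
};

int main() {
    const long long W = 10000;
    ExpHistogram eh(W, 0.1);
    deque<int> window;                       // exact window kept only for validation
    long long exact = 0;
    mt19937 rng(1);
    for (int i = 0; i < 100000; ++i) {
        int bit = (rng() % 10 == 0);         // ~10% ones
        eh.add(bit);
        window.push_back(bit);
        exact += bit;
        if ((long long)window.size() > W) { exact -= window.front(); window.pop_front(); }
    }
    cout << "Exact ones in window = " << exact << "\n";
    cout << "EH estimate          = " << eh.estimate() << "\n";
}

The estimate subtracts half of the oldest bucket because that bucket may straddle the window boundary; the resulting error is roughly proportional to 1/cap.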

Communication lower bound for F2

\( \Omega\!\left(\varepsilon^{-2}\right) \text{ bits for any one-pass } (1\pm\varepsilon)\text{-approximation of } F_2 \)

Explanation: Reductions from communication complexity imply that any one-pass algorithm approximating F2 needs space at least on the order of \(1/\varepsilon^2\).

Lower bound for distinct counting

\( \Omega\!\left(\varepsilon^{-2} + \log U\right) \text{ bits} \)

Explanation: Approximating the number of distinct elements within relative error \(\varepsilon\) requires at least this much memory in one pass.

Complexity Analysis

For Count-Min Sketch with width \(w = O(1/\varepsilon)\) and depth \(d = O(\log(1/\delta))\), each update touches d counters and each point query reads d counters. Therefore, update and query time is \(O(d) = O(\log(1/\delta))\), typically a small constant. Space is \(O(wd) = O(\tfrac{1}{\varepsilon}\log\tfrac{1}{\delta})\) counters; with 4-8-bit or 32-bit counters this is compact and cache-friendly. CMS is mergeable by component-wise addition, as sketched below.

Misra–Gries with parameter k maintains at most k−1 counters, giving O(k) worst-case time for a decrement step (when full); with hash maps it is amortized O(1) per update and O(k) space. Reporting candidates is O(k). To get exact counts for candidates you may need either a second pass or an auxiliary sketch/map, adding O(k) extra memory plus whatever the verifier requires.

The AMS F2 sketch maintains R independent estimators, \(R = O(\varepsilon^{-2}\log(1/\delta))\). Each update touches all R copies with a constant-time random sign computation, yielding O(R) update time. Space is O(R), storing one accumulator per copy (or O(RB) with median-of-means in B buckets). Query time is O(R) to aggregate. AMS supports the turnstile model naturally.

Sliding-window techniques like exponential histograms require \(O(\tfrac{1}{\varepsilon}\log W)\) buckets and \(O(\log W)\) amortized update time, with constant-factor overhead from bucket merges. HyperLogLog uses \(m\) registers, each a few bits; updates and merges are O(1), and space is O(m).

Across these sketches, the key trade-off is a tunable accuracy parameter \(\varepsilon\) that dictates memory and time via \(1/\varepsilon^2\) (variance-limited estimators) or \(1/\varepsilon\) (collision-limited counters).
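
To see the mergeability claim concretely, here is a minimal, self-contained sketch of component-wise CMS merging. The merge_cms helper and the tiny hand-written tables are illustrative stand-ins for two workers' sketches; in real use both sketches must share dimensions and per-row hash seeds.

#include <bits/stdc++.h>
using namespace std;

// Entry-wise merge of two Count-Min tables (illustrative sketch).
// Preconditions: both tables have the same width/depth and were filled using
// the same per-row hash seeds; the merged table then answers point queries
// as if one sketch had processed both streams.
vector<vector<long long>> merge_cms(const vector<vector<long long>>& A,
                                    const vector<vector<long long>>& B) {
    assert(A.size() == B.size() && !A.empty() && A[0].size() == B[0].size());
    vector<vector<long long>> C = A;
    for (size_t r = 0; r < B.size(); ++r)
        for (size_t c = 0; c < B[r].size(); ++c)
            C[r][c] += B[r][c];
    return C;
}

int main() {
    // Toy 2x4 tables standing in for two workers' sketches with shared seeds.
    vector<vector<long long>> A = {{1, 0, 2, 0}, {0, 3, 0, 0}};
    vector<vector<long long>> B = {{0, 1, 1, 0}, {2, 0, 0, 1}};
    auto C = merge_cms(A, B);
    for (auto& row : C) {
        for (auto v : row) cout << v << ' ';
        cout << '\n';
    }
    // Each merged counter is the sum of the two inputs, so the min-over-rows
    // estimate on C upper-bounds the combined frequency exactly as before.
}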

Code Examples

Count-Min Sketch for Frequency Estimation (Insertion-Only)
#include <bits/stdc++.h>
using namespace std;

// A simple, practical Count-Min Sketch implementation for insertion-only streams.
// Guarantees: with width w = ceil(e/eps) and depth d = ceil(log(1/delta)),
// estimate(key) <= true_count(key) + eps * N with probability >= 1 - delta,
// where N is the total number of insertions.

struct CountMinSketch {
    int w, d;                          // width and depth
    vector<vector<long long>> tbl;     // counters (signed here for simplicity)
    vector<uint64_t> seeds;            // independent seeds per row

    // SplitMix64: fast 64-bit hash mixer
    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Construct from (eps, delta)
    CountMinSketch(double eps = 1e-3, double delta = 1e-6) {
        if (eps <= 0) eps = 1e-9;
        if (delta <= 0) delta = 1e-12;
        w = (int)ceil(exp(1.0) / eps);
        d = (int)ceil(log(1.0 / delta));
        tbl.assign(d, vector<long long>(w, 0));
        seeds.resize(d);
        std::mt19937_64 rng(123456789);
        for (int i = 0; i < d; ++i) seeds[i] = rng();
    }

    // Hash (key, row) -> [0, w)
    inline uint32_t index_hash(uint64_t key, int row) const {
        uint64_t h = splitmix64(key ^ seeds[row]);
        return (uint32_t)(h % (uint64_t)w);
    }

    // Update with increment delta >= 0 (insertion-only). For generality we allow signed.
    void update(uint64_t key, long long delta = 1) {
        for (int r = 0; r < d; ++r) {
            uint32_t c = index_hash(key, r);
            tbl[r][c] += delta; // CMS works best with nonnegative totals
        }
    }

    // Point query: min across rows
    long long estimate(uint64_t key) const {
        long long ans = LLONG_MAX;
        for (int r = 0; r < d; ++r) {
            uint32_t c = index_hash(key, r);
            ans = min(ans, tbl[r][c]);
        }
        return ans;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: track frequencies of numbers 0..999999 with tight memory.
    CountMinSketch cms(1e-3, 1e-6); // eps=0.001, delta=1e-6

    // Simulate a skewed stream: key 42 is heavy.
    const int N = 500000;
    std::mt19937 rng(42);
    std::discrete_distribution<int> dist({1, 100}); // 2 keys: 7 is light, 42 is heavy

    long long true_7 = 0, true_42 = 0;
    for (int i = 0; i < N; ++i) {
        int pick = dist(rng);
        uint64_t key = (pick == 0 ? 7ULL : 42ULL);
        if (key == 7ULL) ++true_7; else ++true_42;
        cms.update(key, 1);
    }

    long long est7 = cms.estimate(7);
    long long est42 = cms.estimate(42);

    cout << "True count(7) = " << true_7 << "\n";
    cout << "CMS estimate(7)= " << est7 << "\n";
    cout << "True count(42) = " << true_42 << "\n";
    cout << "CMS estimate(42)= " << est42 << "\n";

    // The overestimation should be bounded by eps * N (about 500 here) with high probability.
    return 0;
}

This program implements Count-Min Sketch with d independent hash rows and width w. Each update increments one counter per row, and a point query returns the minimum across rows. The example simulates a skewed stream where key 42 is heavy; the CMS estimates should slightly overcount but remain within the additive bound εN with high probability.

Time: O(d) per update and per query (d = O(log 1/δ))
Space: O(wd) counters (w = O(1/ε), d = O(log 1/δ))
Misra–Gries Heavy Hitters (Top-1/φ Approximation, One Pass)
#include <bits/stdc++.h>
using namespace std;

// Misra–Gries keeps at most k-1 counters and finds all items with frequency > n/k.
// The stored counts are LOWER bounds; each candidate's true count can be up to n/k higher.

template <class Key>
struct MisraGries {
    int k;                                // we store at most k-1 keys
    unordered_map<Key, long long> cnt;    // candidate counters
    long long n_seen = 0;                 // total processed items

    explicit MisraGries(int k_) : k(max(2, k_)) {}

    void update(const Key& x, long long times = 1) {
        // Standard MG is for insertion-only (times >= 0). Apply 'times' updates one by one if small.
        // For simplicity, assume times >= 0 in this demo.
        for (long long t = 0; t < times; ++t) {
            ++n_seen;
            auto it = cnt.find(x);
            if (it != cnt.end()) {
                ++(it->second);
            } else if ((int)cnt.size() < k - 1) {
                cnt[x] = 1;
            } else {
                // Decrement all
                vector<Key> to_erase;
                for (auto &p : cnt) {
                    if (--p.second == 0) to_erase.push_back(p.first);
                }
                for (auto &z : to_erase) cnt.erase(z);
            }
        }
    }

    // Return current candidates and their lower-bound counts
    unordered_map<Key, long long> candidates() const { return cnt; }

    // Error bound: for any key i, true f_i - stored[i] <= n_seen / k
    pair<long long, long long> bounds_for(const Key& x) const {
        long long lb = 0;
        auto it = cnt.find(x);
        if (it != cnt.end()) lb = it->second;
        long long ub = lb + n_seen / k; // MG bound
        return {lb, ub};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    MisraGries<int> mg(101); // find all items with freq > n/101 (~top-1%)

    // Simulate a stream with a few heavy hitters.
    std::mt19937 rng(123);
    vector<int> keys = {1,2,3,4,5,6,7,8,9,10};
    std::discrete_distribution<int> dist({1,1,1,1,1,1,1,1,40,50}); // 9 and 10 are heavy

    const int N = 200000;
    for (int i = 0; i < N; ++i) {
        int x = keys[dist(rng)];
        mg.update(x);
    }

    auto cand = mg.candidates();
    cout << "Number of candidates <= k-1 = " << cand.size() << "\n";
    // Report bounds for some keys
    for (int q : {9, 10, 1, 2}) {
        auto [lb, ub] = mg.bounds_for(q);
        cout << "Key " << q << ": lower_bound=" << lb << ", upper_bound<=" << ub << "\n";
    }

    // Note: To get exact frequencies of these candidates, perform a second pass or
    // maintain an auxiliary exact map only for candidates.
    return 0;
}

This code implements the Misra–Gries algorithm to track heavy hitters with at most k−1 counters. Each new item increments its counter, fills an empty slot, or decrements all counters when full. Any item whose frequency exceeds n/k is guaranteed to be in candidates(). The bounds_for method reports the Misra–Gries theoretical bounds: stored count is a lower bound and the error is at most n/k.

Time: Amortized O(1) per update; worst-case O(k) when all counters decrement
Space: O(k)
AMS Sketch for Second Moment F2 in the Turnstile Model
#include <bits/stdc++.h>
using namespace std;

// AMS F2 sketch: maintains R independent signed sums Z_j = sum_i g_j(i) * f_i.
// Query returns a robust aggregate (median) of Z_j^2 to estimate F2 = sum_i f_i^2.
// This demo uses a pairwise-universal hash mod 2^61-1 to generate g_j(i) in {-1, +1}.
// Note: the textbook AMS variance bound assumes 4-wise independent signs; the
// pairwise family here is a common simplification that works well in practice.

struct PairHash61 {
    // Hash function: h(x) = (a*x + b) mod (2^61 - 1). We use its lowest bit for sign.
    static constexpr unsigned __int128 MOD = ((unsigned __int128)1 << 61) - 1; // 2^61 - 1
    unsigned long long a, b; // parameters in [1, MOD-1]

    explicit PairHash61(uint64_t seed) {
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<unsigned long long> dist(1ULL, (1ULL<<61) - 2ULL);
        a = dist(rng);
        b = dist(rng);
    }

    static inline unsigned __int128 mul_mod(unsigned __int128 x, unsigned __int128 y) {
        unsigned __int128 z = x * y;
        // reduce z mod (2^61 - 1): z = hi*2^61 + lo and 2^61 == 1 (mod MOD)
        unsigned __int128 lo = (z & ((((unsigned __int128)1)<<61) - 1));
        unsigned __int128 hi = (z >> 61);
        unsigned __int128 r = lo + hi;
        while (r >= MOD) r -= MOD;
        return r;
    }

    static inline unsigned __int128 add_mod(unsigned __int128 x, unsigned __int128 y) {
        unsigned __int128 r = x + y;
        if (r >= MOD) r -= MOD;
        return r;
    }

    inline uint64_t hash64(uint64_t x) const {
        unsigned __int128 X = x;
        unsigned __int128 A = a, B = b;
        unsigned __int128 t = add_mod(mul_mod(A, X), B);
        // return as 64-bit value (we only need LSB for sign)
        return (uint64_t)t;
    }

    inline int sign(uint64_t x) const {
        // map to +1/-1 using lowest bit
        return (hash64(x) & 1ULL) ? +1 : -1;
    }
};

struct AMSSketchF2 {
    int R;                        // number of repetitions
    vector<PairHash61> h;         // independent hashers for signs
    vector<long double> Z;        // accumulators (can be large in magnitude)

    explicit AMSSketchF2(int R_, uint64_t base_seed = 2024) : R(max(1, R_)) {
        h.reserve(R);
        Z.assign(R, 0.0L);
        std::mt19937_64 rng(base_seed);
        for (int j = 0; j < R; ++j) h.emplace_back(rng());
    }

    // Turnstile update: (key, delta), delta can be negative.
    void update(uint64_t key, long long delta) {
        for (int j = 0; j < R; ++j) {
            int s = h[j].sign(key);
            Z[j] += (long double)s * (long double)delta;
        }
    }

    // Estimate F2 via robust aggregation of Z_j^2. Median is simple and effective.
    long double estimate_F2() const {
        vector<long double> vals(R);
        for (int j = 0; j < R; ++j) vals[j] = Z[j] * Z[j];
        nth_element(vals.begin(), vals.begin() + R/2, vals.end());
        return vals[R/2];
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build an AMS sketch with R repetitions. For (eps, delta) one often sets R = c * eps^{-2} * log(1/delta).
    int R = 81; // e.g., eps ~ 1/sqrt(R) ~ 0.11 (rough intuition); use larger R for tighter error
    AMSSketchF2 ams(R);

    // We'll also keep an exact map for validation in this demo (not part of the sketch itself).
    unordered_map<uint64_t, long long> exact;

    // Simulate a turnstile stream with insertions and deletions
    std::mt19937_64 rng(7);
    uniform_int_distribution<uint64_t> keydist(1, 1'000'000);
    for (int t = 0; t < 200000; ++t) {
        uint64_t k = keydist(rng);
        int delta = (rng() & 1) ? +1 : -1; // random +/- 1
        ams.update(k, delta);
        exact[k] += delta;
        if (exact[k] == 0) exact.erase(k);
    }

    // Exact F2 for comparison (requires memory and is only for demonstration)
    long double F2_exact = 0.0L;
    for (auto &p : exact) F2_exact += (long double)p.second * (long double)p.second;

    long double F2_est = ams.estimate_F2();

    cout.setf(std::ios::fixed); cout << setprecision(2);
    cout << "Exact F2 = " << (double)F2_exact << "\n";
    cout << "AMS estimate = " << (double)F2_est << "\n";
    cout << "Rel. error = " << (double)(fabsl(F2_est - F2_exact) / max(1.0L, F2_exact)) << "\n";
    return 0;
}

This program implements an AMS sketch to estimate the second frequency moment F2 in the turnstile model. Each repetition uses a pairwise-universal hash to generate a random sign for each key; the accumulator Z stores the signed sum of updates. Squaring Z gives an unbiased estimator for F2; taking the median across repetitions improves concentration. The demo validates the estimate against an exact map.

Time: O(R) per update and per query (R = O(ε^{-2} log 1/δ) for theoretical guarantees)
Space: O(R)