String Hashing (Polynomial Hash)
Key Points
- Polynomial string hashing encodes a string as a base-p number modulo a large prime, letting us compare substrings in O(1) after O(n) preprocessing.
- Using prefix hashes and precomputed powers, a substring hash h(l, r) can be extracted as (H[r+1] − H[l]·p^{r−l+1}) mod m.
- Normalizing hashes (multiplying by the modular inverse power p^{−l}, or using the start-at-zero extraction above) makes equal substrings compare equal regardless of position.
- To reduce collision probability, use double hashing with two independent moduli and the same base.
- Choose a random base p and large prime moduli; avoid tiny bases or non-prime moduli to prevent patterns and high collision rates.
- All operations fit in O(n) space and O(n) build time, and support O(1) substring queries and O(log n) LCP via binary search.
- Rabin–Karp pattern matching with polynomial hashes is simple and practical, but verify matches to be fully safe against rare collisions.
- Careful implementation details include 64-bit arithmetic, modular subtraction fixes, consistent indexing, and enough precomputed powers.
Prerequisites
- Modular arithmetic — Hash computations are performed modulo large primes, requiring understanding of addition, subtraction, multiplication, and modular inverses.
- Asymptotic complexity — To appreciate O(n) preprocessing and O(1) queries, and to plan algorithms like binary search on top of hashing.
- Binary search — Used with substring-equality checks to compute LCPs or solve longest-repeated-substring problems efficiently.
- Basic probability — To understand collision probabilities and why double hashing reduces risk.
- String indexing and slicing — Avoid off-by-one mistakes when computing substring hashes with inclusive/exclusive bounds.
Detailed Explanation
01 Overview
String hashing turns a string into a number so we can compare pieces quickly. The polynomial (rolling) hash treats the string as digits in base p (like writing numbers in base 10), then takes the result modulo a large prime m. With this approach, we can precompute a prefix hash array and a power array so any substring’s hash can be read in constant time. Equal substrings yield equal hashes with very high probability, especially if we use two moduli. This makes hashing a powerful building block for fast string algorithms such as substring equality checks, Rabin–Karp pattern matching, palindrome checks (with a reverse hash), and binary search for longest common prefixes. The core trade-off is collisions: two different substrings might share a hash. We mitigate this by choosing strong parameters (random base, large primes) and by using double hashing. In practice, with good parameters, collisions are astronomically unlikely for competitive programming scales.
02 Intuition & Analogies
Imagine each character is a bead with a small number on it. You string the beads together on a rod. Now, think of sliding a window across the rod and asking, “Do these two windows look the same?” You could compare bead-by-bead, but that’s slow. Instead, you give each position a weight that grows by a fixed factor p as you move right. The weight of a window is the sum of bead_value × position_weight. If two windows have the same beads in the same order, their weighted sums match. However, these sums can be huge, so we reduce them modulo a big prime m, like reading a clock with many hours; it wraps around but keeps enough information to distinguish different windows most of the time. Prefix sums make things fast: the total weight from the start to r is easy, and the weight from l to r is just a subtraction plus a known power p^{length}. Because positions matter through powers of p, two different arrangements produce different results with overwhelming probability. To be extra safe, we can read two clocks at once (two moduli) and only believe equality if both match. That makes accidental matches fantastically rare, while still being blazing fast.
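To make the arithmetic concrete, here is a tiny worked example with deliberately small illustrative parameters (p = 31, m = 109, codes v(a) = 1, v(b) = 2); a real implementation uses a large prime and a random base as discussed below. It builds the prefix hashes of "aba" and extracts the hash of the single character at index 2, which comes out equal to the hash of "a" at index 0:

```cpp
#include <cstdio>

// Tiny worked example: prefix hashes of "aba" with illustrative parameters
// p = 31, m = 109, v('a') = 1, v('b') = 2 (chosen small only so the numbers are easy to follow).
int main() {
    const long long p = 31, m = 109;
    long long v[] = {1, 2, 1};                    // codes for "aba"
    long long H[4] = {0, 0, 0, 0};                // prefix hashes, H[0] = 0
    for (int i = 0; i < 3; ++i)
        H[i + 1] = (H[i] * p + v[i]) % m;         // H becomes {0, 1, 33, 43}

    // Extract the hash of s[2..2] = "a": (H[3] - H[2] * p^1) mod m
    long long pw = p % m;                         // p^1
    long long sub = ((H[3] - H[2] * pw) % m + m) % m;
    printf("%lld %lld %lld\n", H[1], H[2], H[3]); // 1 33 43
    printf("%lld\n", sub);                        // 1, the same as H[1], the hash of "a"
    return 0;
}
```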
03 Formal Definition
Fix an integer base p ≥ 2, a large prime modulus m, and a mapping of characters to positive codes v(c) ≥ 1. The polynomial hash of a string s of length n is h(s) = (v(s[0])·p^{n−1} + v(s[1])·p^{n−2} + ⋯ + v(s[n−1])) mod m, built incrementally by the prefix recurrence H[0] = 0, H[i+1] = (H[i]·p + v(s[i])) mod m, where H[i] is the hash of the prefix s[0..i−1]. The hash of a substring s[l..r] is then extracted as (H[r+1] − H[l]·p^{r−l+1}) mod m.
04 When to Use
Use polynomial string hashing when you need fast substring equality checks, especially many queries after a single preprocessing pass. Classic problems include: Rabin–Karp pattern matching to find all occurrences of a pattern in a text; checking if two substrings are equal or computing the longest common prefix (LCP) between suffixes with binary search; testing if a substring is a palindrome by comparing its forward and reversed hashes; deduplication tasks like finding the longest repeated substring via binary search on length and hashing; and building hash-based string sets/maps for large immutable strings. It’s particularly effective in competitive programming where constraints make O(n) preprocessing plus O(1) queries ideal. Avoid it if you need absolute certainty with no risk of collision and cannot verify matches (e.g., cryptographic settings—use cryptographic hashes there). For dynamic editing of strings, more advanced structures (e.g., Fenwick/segment trees with rolling hashes or suffix automata) may be more suitable.
⚠️ Common Mistakes
Common pitfalls include:
- Choosing weak parameters (a small base p or a non-prime modulus), which increases collisions.
- Forgetting to fix negatives after the subtraction: (H[r+1] − H[l]·p^{r−l+1}) mod m should be brought back into [0, m−1] by adding m (see the helper sketch after this list).
- Not precomputing enough powers p^{i}, up to the maximum length needed.
- Mixing indexing conventions, causing off-by-one errors between inclusive and exclusive ranges.
- Mapping a character to code 0, which blurs information for leading characters; prefer v(s[i]) ≥ 1.
- Overflowing on multiplication before taking the modulo when using 32-bit integers; use 64-bit arithmetic with __int128 for safe multiplication.
- Comparing unnormalized hashes from different positions as if they were directly comparable.
- Assuming no collisions and skipping verification in Rabin–Karp, even though re-checking a candidate match is cheap.
- Reusing the same seed/base across multiple test files, which can in theory be exploited; randomize the base at runtime for safety.
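A minimal sketch of the two most error-prone helpers mentioned above (the names sub_mod and mul_mod are illustrative; the full examples below use the same pattern):

```cpp
#include <cstdio>

// Modular subtraction that never goes negative: (a - b) mod m, assuming 0 <= a, b < m.
static long long sub_mod(long long a, long long b, long long m) {
    a -= b;
    if (a < 0) a += m;              // bring the result back into [0, m-1]
    return a;
}

// Multiplication reduced modulo m. With ~1e9 moduli the product of two reduced values
// already fits in 64 bits, but the __int128 intermediate is safe for any 64-bit inputs.
static long long mul_mod(long long a, long long b, long long m) {
    return (long long)((__int128)a * b % m);
}

int main() {
    const long long m = 1'000'000'007LL;
    printf("%lld\n", sub_mod(3, 5, m));                  // 1000000005, not -2
    printf("%lld\n", mul_mod(999999999, 999999999, m));  // reduced via the 128-bit intermediate
    return 0;
}
```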
Key Formulas
Prefix hash recurrence
H[0] = 0,  H[i+1] = (H[i]·p + v(s[i])) mod m
Explanation: Build the running hash by multiplying the previous hash by the base p and adding the new character's code, then reduce modulo m. Earlier characters implicitly receive higher powers of p as positional weights.
Substring hash extraction
h(l, r) = (H[r+1] − H[l]·p^{r−l+1}) mod m
Explanation: The hash of s[l..r] is obtained by removing the contribution of the first l characters from the prefix hash at r and fixing the positional shift using p^{r−l+1}.
Normalized substring hash
h_norm(l, r) = ((H[r+1] − H[l])·p^{−l}) mod m   (for the low-to-high convention H[i+1] = H[i] + v(s[i])·p^{i})
Explanation: Multiplying by p^{−l} (the modular inverse power) removes the dependence on the starting index, so equal substrings have equal normalized hashes. With the high-to-low recurrence above, h(l, r) is already normalized and no inverse is needed, as the sketch below shows.
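A minimal single-modulus sketch of that alternative low-to-high convention, included only to show where the p^{−l} factor comes from; the helper names (mod_pow, norm) are illustrative, and the full examples later on this page use the high-to-low recurrence and never need inverses:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Low-to-high weighting: H[i+1] = H[i] + v(s[i]) * p^i (mod m).
// The raw substring difference then carries a factor p^l, which we cancel
// by multiplying with the precomputed inverse power p^{-l}.
int main(){
    const long long MOD = 1'000'000'007LL, B = 911382323LL;   // illustrative fixed base
    string s = "abracadabra";
    int n = (int)s.size();

    auto mul = [&](long long a, long long b){ return (long long)((__int128)a * b % MOD); };
    auto mod_pow = [&](long long b, long long e){              // fast exponentiation
        long long r = 1; b %= MOD;
        for (; e > 0; e >>= 1){ if (e & 1) r = mul(r, b); b = mul(b, b); }
        return r;
    };

    vector<long long> H(n + 1, 0), P(n + 1, 1), Pinv(n + 1, 1);
    long long Binv = mod_pow(B, MOD - 2);                      // p^{-1} by Fermat's little theorem
    for (int i = 0; i < n; ++i){
        long long v = (long long)(unsigned char)s[i] + 1;      // keep codes >= 1
        P[i+1]    = mul(P[i], B);
        Pinv[i+1] = mul(Pinv[i], Binv);
        H[i+1]    = (H[i] + mul(v, P[i])) % MOD;               // v(s[i]) * p^i
    }

    // Raw difference carries p^l; multiplying by p^{-l} makes it start-at-zero.
    auto norm = [&](int l, int r){
        long long raw = (H[r+1] - H[l] + MOD) % MOD;
        return mul(raw, Pinv[l]);
    };

    // "abr" at positions 0 and 7 compare equal after normalization.
    cout << boolalpha << (norm(0, 2) == norm(7, 9)) << "\n";   // true
    return 0;
}
```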
Fermat inverse
p^{−1} ≡ p^{m−2} (mod m)
Explanation: When m is prime and p is not divisible by m, the modular inverse of p is p^{m−2} modulo m. Use fast exponentiation to compute it (as the mod_pow helper in the sketch above does).
Time complexity
O(n) preprocessing, O(1) per substring hash
Explanation: It takes linear time to compute all prefix hashes and powers, and constant time to get any substring hash afterward.
Space complexity
O(n)
Explanation: We store O(n) prefix hashes and power values (per modulus) to answer queries quickly.
Single-mod collision rate
Pr[collision] ≈ 1/m
Explanation: Under a random-hash heuristic, two random different strings collide with probability about 1/m, which is tiny for large primes.
Double-hash collision rate
Pr[collision] ≈ 1/(m1·m2)
Explanation: Using two independent primes makes the chance of the same collision in both moduli roughly the product of the two probabilities.
Birthday bound
Pr[any collision among k substrings] ≈ k^{2}/(2m)
Explanation: When comparing k substrings, the chance that any two collide is roughly k^{2}/(2m) for small k, guiding how large m should be.
Parameter constraints
v(c) ≥ 1,  p ≥ 2 with p mod m ≠ 0,  m a large prime (e.g., 10^9+7 or 10^9+9)
Explanation: Character codes should be positive, base p should be an integer not divisible by m, and m is typically a large prime for good behavior.
Complexity Analysis
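Building the prefix-hash and power arrays takes O(n) time and O(n) space (per modulus). After that, every substring hash or substring-equality check is O(1), and an LCP query implemented as a binary search over hash comparisons costs O(log n).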
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct StringHasher {
    // Two large primes for double hashing
    static constexpr long long M1 = 1'000'000'007LL;
    static constexpr long long M2 = 1'000'000'009LL;

    long long base;               // shared base for both moduli
    vector<long long> p1, p2;     // powers of base modulo M1 and M2
    vector<long long> h1, h2;     // prefix hashes modulo M1 and M2

    static inline long long add_mod(long long a, long long b, long long mod){
        a += b; if (a >= mod) a -= mod; return a;
    }
    static inline long long sub_mod(long long a, long long b, long long mod){
        a -= b; if (a < 0) a += mod; return a;
    }
    static inline long long mul_mod(long long a, long long b, long long mod){
        // Safe 128-bit multiplication reduced modulo mod
        __int128 t = (__int128)a * b;
        t %= mod;
        return (long long)t;
    }

    StringHasher() : base(911382323) {}

    explicit StringHasher(const string &s) { init(s); }

    void init(const string &s) {
        // Randomize base in a safe range [256, 1e6]
        mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
        base = uniform_int_distribution<long long>(256, 1'000'000)(rng);
        int n = (int)s.size();
        p1.assign(n + 1, 1); p2.assign(n + 1, 1);
        h1.assign(n + 1, 0); h2.assign(n + 1, 0);
        for (int i = 0; i < n; ++i) {
            unsigned char c = (unsigned char)s[i];
            long long v = (long long)c + 1;   // ensure v >= 1
            p1[i+1] = mul_mod(p1[i], base, M1);
            p2[i+1] = mul_mod(p2[i], base, M2);
            h1[i+1] = add_mod(mul_mod(h1[i], base, M1), v % M1, M1);
            h2[i+1] = add_mod(mul_mod(h2[i], base, M2), v % M2, M2);
        }
    }

    // Return pair of hashes for substring s[l..r] inclusive (0-indexed)
    pair<long long,long long> hash_sub(int l, int r) const {
        int len = r - l + 1;
        long long x1 = sub_mod(h1[r+1], mul_mod(h1[l], p1[len], M1), M1);
        long long x2 = sub_mod(h2[r+1], mul_mod(h2[l], p2[len], M2), M2);
        return {x1, x2};
    }

    // Normalized hash so position doesn't matter (start-at-zero form)
    pair<long long,long long> hash_norm(int l, int r) const {
        // Because we used H[i+1] = H[i]*p + v[i], hash_sub is already start-at-zero; no extra divide needed.
        // If you had a different definition, you'd multiply by p^{-l}. Here hash_sub == normalized hash.
        return hash_sub(l, r);
    }

    // Equality check between s[l1..r1] and s[l2..r2]
    bool equals(int l1, int r1, int l2, int r2) const {
        if (r1 - l1 != r2 - l2) return false;
        return hash_sub(l1, r1) == hash_sub(l2, r2);
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s = "abracadabra";
    StringHasher H(s);

    // Compare substrings: s[0..2] = "abr" vs s[7..9] = "abr"
    cout << boolalpha << H.equals(0, 2, 7, 9) << "\n"; // true

    // Print the double hash of s[3..5] = "aca"
    auto h = H.hash_sub(3, 5);
    cout << h.first << ' ' << h.second << "\n";

    // Check unequal substrings quickly
    cout << H.equals(0, 2, 1, 3) << "\n"; // false ("abr" vs "bra")

    return 0;
}
```
This defines a reusable double-hash class that precomputes powers and prefix hashes under two prime moduli using a shared random base. It exposes O(1) substring hashing and equality checks. The hash_sub function implements the standard formula H[r+1] − H[l]·p^{len} modulo each prime. Using two moduli makes accidental collisions vanishingly unlikely for typical contest sizes.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct DoubleHasher {
    static constexpr long long M1 = 1'000'000'007LL;
    static constexpr long long M2 = 1'000'000'009LL;
    long long base;
    vector<long long> p1, p2, h1, h2;
    static inline long long add_mod(long long a, long long b, long long mod){ a += b; if (a >= mod) a -= mod; return a; }
    static inline long long sub_mod(long long a, long long b, long long mod){ a -= b; if (a < 0) a += mod; return a; }
    static inline long long mul_mod(long long a, long long b, long long mod){ __int128 t = (__int128)a * b % mod; return (long long)t; }

    // Pass forced_base != 0 to reuse another hasher's base (required so hashes stay comparable).
    void build(const string &s, long long forced_base = 0){
        if (forced_base) base = forced_base;
        else {
            mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
            base = uniform_int_distribution<long long>(256, 1'000'000)(rng);
        }
        int n = (int)s.size();
        p1.assign(n+1, 1); p2.assign(n+1, 1);
        h1.assign(n+1, 0); h2.assign(n+1, 0);
        for (int i = 0; i < n; ++i){
            unsigned char c = (unsigned char)s[i];
            long long v = (long long)c + 1;
            p1[i+1] = mul_mod(p1[i], base, M1);
            p2[i+1] = mul_mod(p2[i], base, M2);
            h1[i+1] = add_mod(mul_mod(h1[i], base, M1), v % M1, M1);
            h2[i+1] = add_mod(mul_mod(h2[i], base, M2), v % M2, M2);
        }
    }
    pair<long long,long long> subhash(int l, int r){
        int len = r - l + 1;
        long long a = sub_mod(h1[r+1], mul_mod(h1[l], p1[len], M1), M1);
        long long b = sub_mod(h2[r+1], mul_mod(h2[l], p2[len], M2), M2);
        return {a, b};
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string T = "aaaaabaaaabaaaa";
    string P = "aaab";

    // Build hashers for T and P using the SAME base so their hashes are comparable
    DoubleHasher HT, HP;
    HT.build(T);
    HP.build(P, HT.base);   // reuse HT's base instead of drawing a new random one

    // Pattern hash
    auto pat = HP.subhash(0, (int)P.size() - 1);

    vector<int> occ;
    for (int i = 0; i + (int)P.size() <= (int)T.size(); ++i){
        auto h = HT.subhash(i, i + (int)P.size() - 1);
        if (h == pat){
            // Optional verification to guard against collisions
            if (T.compare(i, P.size(), P) == 0) occ.push_back(i);
        }
    }

    for (int i : occ) cout << i << ' ';
    cout << "\n";
    return 0;
}
```
This is a classic Rabin–Karp implementation using double hashing. We compute the pattern’s hash and compare it against each window hash in the text. When hashes match, we optionally verify with a direct substring comparison to avoid any chance of a false positive. Note we synchronize the base between the text and pattern hashers.
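For reference, the name "rolling hash" comes from the classic variant that slides a single window across the text and updates its hash in O(1) per step, without storing prefix arrays. A minimal single-modulus sketch of that update (the fixed base here is illustrative, and verification is still recommended):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Rolling-window Rabin-Karp: maintain the hash of the current length-|P| window of T,
// removing the outgoing character and appending the incoming one in O(1).
// A single modulus is used only for brevity; pair it with verification (below) or a second modulus.
int main(){
    const long long MOD = 1'000'000'007LL, B = 911382323LL;   // illustrative fixed base
    string T = "aaaaabaaaabaaaa", P = "aaab";
    int n = (int)T.size(), m = (int)P.size();
    if (m > n) return 0;

    auto mul = [&](long long a, long long b){ return (long long)((__int128)a * b % MOD); };
    auto val = [&](char c){ return (long long)(unsigned char)c + 1; };  // keep codes >= 1

    // Highest positional weight B^{m-1}, needed to remove the outgoing character.
    long long top = 1;
    for (int i = 0; i < m - 1; ++i) top = mul(top, B);

    long long hp = 0, hw = 0;                                 // pattern hash, current window hash
    for (int i = 0; i < m; ++i) hp = (mul(hp, B) + val(P[i])) % MOD;
    for (int i = 0; i < m; ++i) hw = (mul(hw, B) + val(T[i])) % MOD;

    for (int i = 0; ; ++i){                                   // window is T[i..i+m-1]
        if (hw == hp && T.compare(i, m, P) == 0) cout << i << ' ';
        if (i + m >= n) break;
        // Remove T[i]'s contribution, shift the window by one base power, append T[i+m].
        hw = (hw - mul(val(T[i]), top) + MOD) % MOD;
        hw = (mul(hw, B) + val(T[i + m])) % MOD;
    }
    cout << "\n";
    return 0;
}
```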
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Hasher {
    static constexpr long long M1 = 1'000'000'007LL;
    static constexpr long long M2 = 1'000'000'009LL;
    long long base;
    vector<long long> p1, p2, h1, h2;
    static inline long long addm(long long a, long long b, long long m){ a += b; if (a >= m) a -= m; return a; }
    static inline long long subm(long long a, long long b, long long m){ a -= b; if (a < 0) a += m; return a; }
    static inline long long mulm(long long a, long long b, long long m){ __int128 t = (__int128)a * b % m; return (long long)t; }

    void build(const string &s, long long forced_base = 0){
        if (forced_base) base = forced_base;
        else {
            mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
            base = uniform_int_distribution<long long>(256, 1'000'000)(rng);
        }
        int n = (int)s.size();
        p1.assign(n+1, 1); p2.assign(n+1, 1);
        h1.assign(n+1, 0); h2.assign(n+1, 0);
        for (int i = 0; i < n; ++i){
            unsigned char c = (unsigned char)s[i];
            long long v = (long long)c + 1;
            p1[i+1] = mulm(p1[i], base, M1);
            p2[i+1] = mulm(p2[i], base, M2);
            h1[i+1] = addm(mulm(h1[i], base, M1), v % M1, M1);
            h2[i+1] = addm(mulm(h2[i], base, M2), v % M2, M2);
        }
    }
    pair<long long,long long> get(int l, int r){
        int len = r - l + 1;
        long long a = subm(h1[r+1], mulm(h1[l], p1[len], M1), M1);
        long long b = subm(h2[r+1], mulm(h2[l], p2[len], M2), M2);
        return {a, b};
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s = "mississippi";
    Hasher H; H.build(s);

    auto lcp_between = [&](int i, int j){
        int n = (int)s.size();
        int lo = 0, hi = n - max(i, j);   // maximum possible LCP length
        while (lo < hi){
            int mid = (lo + hi + 1) / 2;
            if (H.get(i, i + mid - 1) == H.get(j, j + mid - 1)) lo = mid; else hi = mid - 1;
        }
        return lo;
    };

    cout << lcp_between(1, 4) << "\n"; // LCP of suffixes starting at 1 and 4 ("ississippi" vs "issippi")
    cout << lcp_between(0, 7) << "\n"; // another example

    return 0;
}
```
The function lcp_between(i, j) returns the longest common prefix of suffixes s[i..] and s[j..]. It uses binary search on the length and O(1) hash equality checks per step, for O(log n) time per query. This is a standard technique in problems asking for many LCPs without building suffix arrays.
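The same binary-search-on-length idea handles the longest-repeated-substring task mentioned in the When to Use section: for a candidate length, collect the hashes of every window of that length in a set and check for a duplicate. A minimal sketch, using one modulus for brevity (a contest-grade version would reuse the double hashing above):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Longest repeated substring via binary search on the length plus hashing.
// A single 64-bit modulus is used only for brevity; prefer double hashing for adversarial inputs.
int main(){
    const long long MOD = 1'000'000'007LL, B = 911382323LL;   // illustrative fixed base
    string s = "mississippi";
    int n = (int)s.size();

    auto mul = [&](long long a, long long b){ return (long long)((__int128)a * b % MOD); };

    // Prefix hashes H and powers of the base.
    vector<long long> H(n + 1, 0), P(n + 1, 1);
    for (int i = 0; i < n; ++i){
        long long v = (long long)(unsigned char)s[i] + 1;      // keep codes >= 1
        P[i+1] = mul(P[i], B);
        H[i+1] = (mul(H[i], B) + v) % MOD;
    }
    auto sub = [&](int l, int len){                            // hash of s[l..l+len-1]
        return (H[l+len] - mul(H[l], P[len]) + MOD) % MOD;
    };

    // Returns the start of a second occurrence of some length-len substring, or -1 if none.
    auto repeated = [&](int len) -> int {
        unordered_set<long long> seen;
        for (int l = 0; l + len <= n; ++l)
            if (!seen.insert(sub(l, len)).second) return l;    // this window's hash was seen before
        return -1;
    };

    int lo = 0, hi = n - 1, start = -1;                        // binary search on the answer length
    while (lo < hi){
        int mid = (lo + hi + 1) / 2;
        int at = repeated(mid);
        if (at != -1){ lo = mid; start = at; } else hi = mid - 1;
    }
    if (lo > 0) cout << s.substr(start, lo) << "\n";           // "issi" for "mississippi"
    else cout << "(no repeated substring)\n";
    return 0;
}
```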
```cpp
#include <bits/stdc++.h>
using namespace std;

struct DH {
    static constexpr long long M1 = 1'000'000'007LL, M2 = 1'000'000'009LL;
    long long base;
    vector<long long> p1, p2, h1, h2;
    static inline long long subm(long long a, long long b, long long m){ a -= b; if (a < 0) a += m; return a; }
    static inline long long mulm(long long a, long long b, long long m){ __int128 t = (__int128)a * b % m; return (long long)t; }
    static inline long long addm(long long a, long long b, long long m){ a += b; if (a >= m) a -= m; return a; }

    void build(const string &s, long long forced_base = 0){
        if (!forced_base){
            mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
            base = uniform_int_distribution<long long>(256, 1'000'000)(rng);
        } else base = forced_base;
        int n = (int)s.size();
        p1.assign(n+1, 1); p2.assign(n+1, 1);
        h1.assign(n+1, 0); h2.assign(n+1, 0);
        for (int i = 0; i < n; ++i){
            unsigned char c = (unsigned char)s[i];
            long long v = (long long)c + 1;
            p1[i+1] = mulm(p1[i], base, M1);
            p2[i+1] = mulm(p2[i], base, M2);
            h1[i+1] = addm(mulm(h1[i], base, M1), v % M1, M1);
            h2[i+1] = addm(mulm(h2[i], base, M2), v % M2, M2);
        }
    }
    pair<long long,long long> sub(int l, int r){
        int len = r - l + 1;
        return { subm(h1[r+1], mulm(h1[l], p1[len], M1), M1),
                 subm(h2[r+1], mulm(h2[l], p2[len], M2), M2) };
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s = "abaxyzzyxf"; // contains palindrome "xyzzyx"
    string rs = s; reverse(rs.begin(), rs.end());

    DH HF, HR;
    HF.build(s);
    HR.build(rs, HF.base); // use same base for consistency

    auto is_pal = [&](int l, int r){
        // Map substring s[l..r] to the corresponding range of the reversed string
        int n = (int)s.size();
        int rl = n - 1 - r;
        int rr = n - 1 - l;
        return HF.sub(l, r) == HR.sub(rl, rr);
    };

    cout << boolalpha;
    cout << is_pal(3, 8) << "\n"; // true: s[3..8] = "xyzzyx"
    cout << is_pal(0, 2) << "\n"; // true: s[0..2] = "aba"

    return 0;
}
```
We build a hasher for s and for reversed(s) using the same base. A substring s[l..r] is a palindrome if its forward hash equals the hash of the corresponding substring in the reversed string. This yields O(1) palindrome checks after O(n) preprocessing.