Hash Table
Key Points
- A hash table stores key–value pairs and finds items in expected O(1) time using a hash function to map keys to buckets.
- Collisions are inevitable; the two main ways to handle them are chaining (linked buckets) and open addressing (probing empty slots).
- In C++, unordered_map and unordered_set are standard hash tables that use chaining and can be tuned with reserve and max_load_factor.
- Designing a good hash function matters; poor hashing or adversarial inputs can degrade performance to O(n) per operation.
- Use randomized hashing (e.g., splitmix64 with a random seed) to mitigate anti-hash attacks on competitive programming platforms.
- For pairs, vectors, and custom structs, you must provide custom hash functors to use them as unordered_map/unordered_set keys.
- Open addressing behaves differently: probe counts grow sharply as the load factor approaches 1, so rehashing is crucial.
- The gp_hash_table from pb_ds often performs faster than unordered_map in practice but still needs careful hashing and seeding.
Prerequisites
- Arrays and indexing — Hash tables map hash values to array indices; understanding array operations is fundamental.
- Modular arithmetic — Hash functions commonly use modulo operations to fit values into a fixed bucket range.
- Basic probability — Expected-time analysis assumes uniform hashing and uses probability for collision reasoning.
- Big-O notation and amortized analysis — Understanding expected O(1) operations and amortized costs is essential for performance guarantees.
- C++ templates and functors — Custom hashers are implemented as templated functors passed to unordered_map/set or gp_hash_table.
- Balanced BSTs vs hash tables — Knowing when to choose ordered structures (map/set) versus hash-based ones prevents misuse.
Detailed Explanation
Overview
A hash table is a data structure that lets you store and retrieve items by key in expected constant time. The key idea is to apply a hash function to a key to produce an integer, then use that integer to choose a bucket (index) in an array. Because many different keys can map to the same bucket (a collision), a hash table must have a collision resolution strategy. The two common strategies are chaining (each bucket stores a small list of items that hashed there) and open addressing (probe other slots in the array until you find an empty one). In C++, unordered_map and unordered_set implement hash tables with chaining and typically give O(1) expected time for insert, find, and erase. However, in the worst case—like when an adversary crafts inputs that collide—operations can degrade to O(n). Competitive programmers often protect against this by using randomized hash functions (e.g., splitmix64 with a random seed) or third-party structures like gp_hash_table from the GNU pb_ds library, which are often faster in practice. Choosing a good load factor (ratio of items to buckets) and rehashing when necessary are important for keeping performance predictable.
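To make the chaining idea concrete, here is a minimal sketch of a chained hash set (illustrative only, not how the standard library implements unordered_set); the ChainedSet name and the grow-at-load-factor-1 policy are arbitrary choices for this example.

#include <bits/stdc++.h>
using namespace std;

// Minimal chained hash set for long long keys (a sketch, not std::unordered_set).
struct ChainedSet {
    vector<vector<long long>> buckets; // each bucket holds the keys that hashed there
    size_t n = 0;                      // number of stored keys

    explicit ChainedSet(size_t m = 16) : buckets(m) {}

    size_t index(long long key) const {
        return std::hash<long long>{}(key) % buckets.size();
    }

    bool insert(long long key) {
        auto &b = buckets[index(key)];
        for (long long x : b) if (x == key) return false;   // already present
        b.push_back(key);
        ++n;
        if (n > buckets.size()) rehash(2 * buckets.size()); // keep load factor <= 1
        return true;
    }

    bool contains(long long key) const {
        const auto &b = buckets[index(key)];
        for (long long x : b) if (x == key) return true;
        return false;
    }

    void rehash(size_t m) {
        vector<vector<long long>> old = move(buckets);
        buckets.assign(m, {});
        for (auto &b : old)
            for (long long x : b) buckets[index(x)].push_back(x); // reassign by new index
    }
};

int main() {
    ChainedSet s;
    for (long long x : {3LL, 17LL, 3LL, 42LL}) s.insert(x);
    cout << s.contains(17) << " " << s.contains(5) << "\n"; // 1 0
}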
Intuition & Analogies
Imagine a row of lockers (the array of buckets). Every student (key) is assigned a locker number by a simple rule: take their student ID, do some arithmetic, and get a locker index. Most of the time, each locker holds exactly the right student’s stuff. But sometimes two students are assigned the same locker—that’s a collision. With chaining, we allow multiple students to share a locker by keeping a small list inside it; when a student comes to retrieve items, they check the list quickly. With open addressing, if a student finds their locker busy, they try the next locker, and the next, following a deterministic pattern (linear probing, quadratic probing, or double hashing) until they find an empty one. The key to speed is making sure the lockers aren’t too crowded (low load factor), and that our rule (hash function) spreads students out uniformly. If a prankster gives many students IDs that always map to the same locker, the line at that locker grows long, and everyone slows down. Randomizing our rule (adding a secret seed) makes the prankster’s job hard because, without knowing the secret, they can’t easily force all students into one locker. In practice, you also pick a bigger hallway of lockers when the school gets more students: that’s rehashing—build a bigger array and reassign everyone quickly using the same rule.
Formal Definition
A hash table stores a dynamic set of keys drawn from a universe U in an array of m buckets, using a hash function h : U → {0, 1, …, m − 1}. A key k is stored at (or near) index h(k); keys with the same hash index collide and are resolved either by chaining (bucket i holds a list of all keys with h(k) = i) or by open addressing (a probe sequence starting at h(k) examines further slots until an empty one is found). With n stored keys, the load factor is α = n / m; under the simple uniform hashing assumption (each key equally likely to land in any bucket), this yields the expected-time bounds given in the formulas below.
When to Use
Use a hash table whenever you need fast expected O(1) insert, find, and erase for keys without ordering requirements. Typical tasks include frequency counting (word counts, value frequencies), membership queries (set lookups), building adjacency maps in graphs, memoization/cache maps for dynamic programming, and mapping composite keys (like pairs or tuples) to values. In competitive programming, unordered_map and unordered_set handle most needs quickly, especially with reserve to reduce rehashing. When input size is large and performance is tight, gp_hash_table can be faster due to lower constant factors. If you require ordered iteration or operations like predecessor/successor, prefer balanced BSTs (map/set). If you work in extremely memory-constrained environments or need very predictable probe counts, consider open-addressing variants (like linear probing with careful load-factor control) or specialized hash tables (Robin Hood hashing, hopscotch hashing). For collision resistance against untrusted or adversarial inputs, do not rely on the default standard hash functions; instead, use randomized hashers (splitmix64 with a secret seed) or a universal hashing scheme, and consider salting keys to mitigate collision-based denial-of-service.
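As a small sketch of the memoization use case above (the recurrence f, the cache name memo, and the reserve size are hypothetical, chosen only for illustration):

#include <bits/stdc++.h>
using namespace std;

// Memoize a recursive function with unordered_map: cache results keyed by argument.
unordered_map<long long, long long> memo;

long long f(long long n) {                 // hypothetical recurrence, for demonstration
    if (n <= 1) return n;
    auto it = memo.find(n);                // find() looks up without inserting
    if (it != memo.end()) return it->second;
    long long res = f(n / 2) + f(n / 3);   // arbitrary sub-problems
    memo[n] = res;                         // store the computed answer
    return res;
}

int main() {
    memo.reserve(1 << 12);                 // pre-size the cache to limit rehashing
    cout << f(1'000'000'000LL) << "\n";
}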
⚠️ Common Mistakes
- Forgetting to call reserve on unordered_map/unordered_set before bulk inserts, causing many rehashes and performance drops. Fix: call umap.reserve(expected_size) and optionally set max_load_factor.
- Using composite keys (pair, vector, struct) without providing a proper hash function, leading to compilation errors or poor hashing. Fix: write a custom hash functor that also defines equality consistently.
- Assuming O(1) worst-case time and ignoring adversarial inputs. Fix: know that worst-case is O(n) per operation; mitigate with randomized hashing and reasonable load factors.
- Writing weak hashes (e.g., returning key % m directly or combining fields with naive XOR) that cause patterns and clustering. Fix: use well-mixed 64-bit hashers like splitmix64, and combine fields with robust mixing and a random seed.
- Misusing operator[] on unordered_map when you only want to test membership; it inserts a default value and changes size. Fix: use find or contains (C++20) for lookup without insertion (see the sketch after this list).
- Iteration invalidation surprises after rehash. Fix: know that rehash invalidates iterators and references; minimize holding long-lived iterators across inserts/rehashes.
- High load factor in open addressing leading to skyrocketing probe counts. Fix: trigger rehash around 0.5–0.75 and ensure good probing strategies.
- Comparing pointers as keys when objects move or addresses are unstable. Fix: hash the content or use stable IDs instead of raw addresses.
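Below is a tiny sketch of the operator[] lookup-versus-insertion pitfall from the list above; the keys and values are arbitrary:

#include <bits/stdc++.h>
using namespace std;

int main() {
    unordered_map<string, int> m;
    m["a"] = 1;

    // Lookup with operator[]: inserts a default-constructed value for a missing key.
    int v = m["missing"];                            // creates {"missing", 0}
    cout << v << " size=" << m.size() << "\n";       // 0 size=2

    // Lookup with find(): leaves the container unchanged.
    bool has_b = (m.find("b") != m.end());
    cout << has_b << " size=" << m.size() << "\n";   // 0 size=2
}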
Key Formulas
Load Factor
α = n / m
Explanation: The load factor α is the average number of elements per bucket when n items are stored in m buckets. Keeping α small (e.g., around 0.5–1.0) helps ensure short chains and fast operations.
Chaining Expected Chain Length
E[chain length] = α = n / m
Explanation: Under uniform hashing, the expected number of elements in any bucket is the load factor. This implies average search and insert time is proportional to 1 + α.
Expected Time with Chaining
T(search/insert/delete) = Θ(1 + α)
Explanation: Search/insert/delete in a chained hash table take constant time on average plus a term linear in the load factor. When α is bounded by a constant, this becomes expected O(1).
Linear Probing Successful Search
E[probes] ≈ (1/2)·(1 + 1/(1 − α))
Explanation: For open addressing with linear probing under uniform hashing, the expected number of probes for a successful search grows as the table fills. It diverges as α → 1.
Linear Probing Unsuccessful Search
E[probes] ≈ (1/2)·(1 + 1/(1 − α)²)
Explanation: The expected probes for an unsuccessful search grow faster than for successful searches as the load factor increases, highlighting the need to rehash before the table is too full.
Universal Hashing Collision Bound
Pr[h(x) = h(y)] ≤ 1/m for distinct keys x ≠ y
Explanation: For a universal hash family, the probability that two distinct keys collide is at most 1/m. This provides expected guarantees even against adversarial inputs.
Birthday Collision Approximation
P(collision) ≈ 1 − e^(−n(n−1)/(2m))
Explanation: When hashing n random keys into m buckets, the chance of any collision follows a birthday-paradox-like curve. Collisions become likely once n is about √m.
Affine Universal Hash
h_{a,b}(k) = ((a·k + b) mod p) mod m
Explanation: Choosing random a, b with 1 ≤ a ≤ p − 1 and 0 ≤ b ≤ p − 1 for prime p creates a universal family. It is simple and provides low collision probability for arbitrary inputs.
Amortized Insert
Amortized cost per insert = O(1)
Explanation: Although occasional rehashing costs O(n), spreading that cost over many inserts yields constant amortized time per insertion.
Space Complexity
S = O(n + m)
Explanation: A chained table stores m buckets plus n elements; an open-address table stores about m slots. Space is linear in the number of elements.
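As a worked example of the linear-probing estimates: at α = 0.5 a successful search takes about (1/2)·(1 + 1/(1 − 0.5)) = 1.5 probes and an unsuccessful one about (1/2)·(1 + 1/(1 − 0.5)²) = 2.5 probes, but at α = 0.9 these grow to roughly 5.5 and 50.5 probes, which is why open-addressing tables rehash well before filling. Likewise, for the amortized bound with table doubling, the total relocation work over n inserts is at most n + n/2 + n/4 + ⋯ < 2n, i.e., O(1) per insert on average.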
Complexity Analysis
Under uniform hashing with a bounded load factor, insert, find, and erase run in expected O(1) time; in the worst case (many colliding keys), each operation degrades to O(n). Insertion remains amortized O(1) even counting occasional O(n) rehashes, and space usage is O(n + m) for n elements in m buckets.
Code Examples
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: frequency counting with unordered_map
    unordered_map<string, int> freq;

    // Tuning: keep chains short and reserve enough buckets to avoid many rehashes.
    // Set max_load_factor first so that reserve() sizes the bucket array accordingly.
    // Suppose we expect up to 100000 distinct words.
    freq.max_load_factor(0.7f);
    freq.reserve(100000);

    vector<string> words = {"apple", "banana", "apple", "pear", "banana", "apple"};
    for (const auto &w : words) {
        // operator[] inserts if missing, then increments
        ++freq[w];
    }

    // Query: check membership without inserting using find/contains
#if __cplusplus >= 202002L
    bool has_orange = freq.contains("orange");
#else
    bool has_orange = (freq.find("orange") != freq.end());
#endif

    cout << "has_orange? " << (has_orange ? "yes" : "no") << "\n";

    // Iterate and print frequencies
    for (auto &kv : freq) {
        cout << kv.first << ": " << kv.second << "\n";
    }

    // unordered_set example: membership only
    unordered_set<long long> seen;
    seen.reserve(1000);
    seen.insert(10);
    seen.insert(20);
    cout << (seen.count(10) ? "present" : "absent") << "\n";

    // Introspection: bucket_count and load_factor
    cout << "bucket_count: " << freq.bucket_count()
         << " load_factor: " << freq.load_factor() << "\n";

    return 0;
}
This program demonstrates typical operations: insert via operator[] for unordered_map, membership queries via find/contains, and iteration. It also shows performance tuning with reserve and max_load_factor to reduce rehashing and keep expected O(1) operations. The unordered_set example shows pure membership use cases.
#include <bits/stdc++.h>
using namespace std;

// A strong 64-bit mixer: splitmix64
static uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

struct custom_hash {
    // Random seed to defeat adversarial inputs
    const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count();

    // Hash for 64-bit integers
    size_t operator()(uint64_t x) const noexcept {
        return (size_t)splitmix64(x + FIXED_RANDOM);
    }

    // Overload for signed long long
    size_t operator()(long long x) const noexcept {
        return (*this)((uint64_t)x);
    }

    // Hash combine helper
    size_t combine(size_t a, size_t b) const noexcept {
        // Similar to boost::hash_combine but with extra mixing
        uint64_t res = a;
        res ^= b + 0x9e3779b97f4a7c15ULL + (res << 6) + (res >> 2);
        return (size_t)splitmix64(res + FIXED_RANDOM);
    }

    // Hash for pair<T,U>
    template<class T, class U>
    size_t operator()(const pair<T,U>& p) const noexcept {
        size_t h1 = (*this)((uint64_t)std::hash<T>{}(p.first));
        size_t h2 = (*this)((uint64_t)std::hash<U>{}(p.second));
        return combine(h1, h2);
    }

    // Hash for vector<int> (illustrative; for other types, generalize)
    size_t operator()(const vector<int>& v) const noexcept {
        size_t h = 0;
        for (int x : v) {
            size_t hx = (*this)((uint64_t)(uint32_t)x);
            h = combine(h, hx);
        }
        return h;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Using pair<int,int> as a key
    unordered_map<pair<int,int>, int, custom_hash> grid_cost;
    grid_cost.reserve(1 << 16);

    grid_cost[{1,2}] = 5;
    grid_cost[{2,3}] = 7;
    cout << grid_cost[{1,2}] << "\n"; // 5

    // Using vector<int> as a key
    unordered_map<vector<int>, string, custom_hash> pattern_name;
    pattern_name.reserve(1024);
    pattern_name[{1,2,3}] = "seq123";
    pattern_name[{2,3,5,7}] = "primes";
    cout << pattern_name[{2,3,5,7}] << "\n";

    // Anti-hash: show that many equal hashes won't easily occur with random seed
    unordered_set<long long, custom_hash> us;
    us.reserve(100000);
    for (long long i = 0; i < 100000; ++i) us.insert(i * 1000003LL);
    cout << "size: " << us.size() << "\n"; // Should be 100000

    return 0;
}
We define a custom_hash using splitmix64 with a random per-process seed to complicate adversarial collision attempts. The functor supports 64-bit integers, pairs, and vectors by robustly combining component hashes. This enables unordered containers to accept composite keys while maintaining good performance.
#include <bits/stdc++.h>
using namespace std;

class LinearProbingHash {
    // States for each slot
    enum State : uint8_t { EMPTY = 0, FILLED = 1, DELETED = 2 };

    struct Slot {
        long long key = 0;
        State state = EMPTY;
    };

    vector<Slot> table;
    size_t n = 0;            // number of FILLED keys
    size_t tomb = 0;         // number of DELETED tombstones
    double max_alpha = 0.7;  // rehash threshold on occupied (FILLED + DELETED) slots

    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    size_t H(long long x) const {
        // Simple 64-bit mixing, then map into table size
        return (size_t)(splitmix64((uint64_t)x)) & (table.size() - 1);
    }

    void rehash(size_t new_m) {
        // new_m must be a power of two for &-masking to work
        vector<Slot> old = move(table);
        table.assign(new_m, {});
        n = 0;
        tomb = 0; // tombstones are discarded on rehash
        for (const auto &s : old) if (s.state == FILLED) insert(s.key);
    }

    void maybe_grow() {
        // Count tombstones as occupied so every probe chain still reaches an EMPTY slot
        if ((double)(n + tomb) / (double)table.size() > max_alpha) {
            rehash(table.size() * 2);
        }
    }

public:
    explicit LinearProbingHash(size_t m = 8) {
        // Round up m to power of two
        size_t pw = 1;
        while (pw < m) pw <<= 1;
        table.assign(pw, {});
    }

    bool insert(long long key) {
        maybe_grow();
        size_t m = table.size();
        size_t i = H(key);
        size_t first_deleted = SIZE_MAX;
        while (true) {
            if (table[i].state == EMPTY) {
                size_t pos = (first_deleted != SIZE_MAX ? first_deleted : i);
                if (first_deleted != SIZE_MAX) --tomb; // reusing a tombstone slot
                table[pos].key = key;
                table[pos].state = FILLED;
                ++n;
                return true; // inserted new
            } else if (table[i].state == DELETED) {
                if (first_deleted == SIZE_MAX) first_deleted = i; // remember first tombstone
            } else if (table[i].state == FILLED && table[i].key == key) {
                return false; // already present
            }
            i = (i + 1) & (m - 1); // linear probing with wrap-around
        }
    }

    bool contains(long long key) const {
        size_t m = table.size();
        size_t i = H(key);
        while (true) {
            if (table[i].state == EMPTY) return false; // stop at empty slot
            if (table[i].state == FILLED && table[i].key == key) return true;
            i = (i + 1) & (m - 1);
        }
    }

    bool erase(long long key) {
        size_t m = table.size();
        size_t i = H(key);
        while (true) {
            if (table[i].state == EMPTY) return false;
            if (table[i].state == FILLED && table[i].key == key) {
                table[i].state = DELETED; // tombstone to preserve probe chains
                --n;
                ++tomb;
                return true;
            }
            i = (i + 1) & (m - 1);
        }
    }

    size_t size() const { return n; }
    size_t capacity() const { return table.size(); }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    LinearProbingHash hs;
    for (int i = 0; i < 10; ++i) hs.insert(i * 3);

    cout << "contains 9? " << (hs.contains(9) ? "yes" : "no") << "\n";
    cout << "contains 12? " << (hs.contains(12) ? "yes" : "no") << "\n";
    cout << "size: " << hs.size() << " capacity: " << hs.capacity() << "\n";

    hs.erase(9);
    cout << "contains 9 after erase? " << (hs.contains(9) ? "yes" : "no") << "\n";

    return 0;
}
This is a minimal open-addressing hash table for integers using linear probing with tombstones and power-of-two capacity for fast modulo via bit-masking. It demonstrates hashing, probing, insertion, lookup, deletion, and dynamic growth via rehashing at a target load factor.
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;

static uint64_t splitmix64_u(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

struct fast_hash {
    const uint64_t SEED = chrono::steady_clock::now().time_since_epoch().count();
    size_t operator()(uint64_t x) const { return (size_t)splitmix64_u(x + SEED); }
    size_t operator()(long long x) const { return operator()((uint64_t)x); }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // gp_hash_table<Key, Mapped, Hash>
    // Note: with the default template parameters gp_hash_table does not expose resize();
    // pre-allocating buckets requires a resize policy with external size access.
    gp_hash_table<long long, int, fast_hash> mp;

    // Example: count occurrences
    vector<long long> a = {1, 2, 3, 2, 1, 4, 5, 2, 1};
    for (auto x : a) mp[x]++;

    cout << "count(2) = " << mp[2] << "\n";

    // Membership-style container: value-less map (mapped type = null_type)
    gp_hash_table<long long, null_type, fast_hash> st;
    st.insert(10);
    st.insert(20);
    cout << (st.find(10) != st.end() ? "present" : "absent") << "\n";

    return 0;
}
The GNU pb_ds gp_hash_table is often faster than unordered_map due to lower constant factors and cache-friendly design. We use a randomized fast_hash to reduce collision risks. The example shows both key→value and key-only (set-like) usage by using null_type for the mapped value.