🗂️ Data Structure · Intermediate

Hash Table

Key Points

  • A hash table stores key–value pairs and finds items in expected O(1) time using a hash function to map keys to buckets.
  • Collisions are inevitable; the two main ways to handle them are chaining (linked buckets) and open addressing (probing empty slots).
  • In C++, unordered_map and unordered_set are standard hash tables that use chaining and can be tuned with reserve and max_load_factor.
  • Designing a good hash function matters; poor hashing or adversarial inputs can degrade performance to O(n) per operation.
  • Use randomized hashing (e.g., splitmix64 with a random seed) to mitigate anti-hash attacks on competitive programming platforms.
  • For pairs, vectors, and custom structs, you must provide custom hash functors to use them as unordered_map/unordered_set keys.
  • Open addressing has different performance behavior; probe counts grow as the load factor approaches 1, so rehashing is crucial.
  • The gp_hash_table from pb_ds often performs faster than unordered_map in practice but still needs careful hashing and seeding.

Prerequisites

  • Arrays and indexing: Hash tables map hash values to array indices; understanding array operations is fundamental.
  • Modular arithmetic: Hash functions commonly use modulo operations to fit values into a fixed bucket range.
  • Basic probability: Expected-time analysis assumes uniform hashing and uses probability for collision reasoning.
  • Big-O notation and amortized analysis: Understanding expected O(1) operations and amortized costs is essential for performance guarantees.
  • C++ templates and functors: Custom hashers are implemented as templated functors passed to unordered_map/set or gp_hash_table.
  • Balanced BSTs vs hash tables: Knowing when to choose ordered structures (map/set) versus hash-based ones prevents misuse.

Detailed Explanation


01 Overview

A hash table is a data structure that lets you store and retrieve items by key in expected constant time. The key idea is to apply a hash function to a key to produce an integer, then use that integer to choose a bucket (index) in an array. Because many different keys can map to the same bucket (a collision), a hash table must have a collision resolution strategy. The two common strategies are chaining (each bucket stores a small list of items that hashed there) and open addressing (probe other slots in the array until you find an empty one). In C++, unordered_map and unordered_set implement hash tables with chaining and typically give O(1) expected time for insert, find, and erase. However, in the worst case—like when an adversary crafts inputs that collide—operations can degrade to O(n). Competitive programmers often protect against this by using randomized hash functions (e.g., splitmix64 with a random seed) or third-party structures like gp_hash_table from the GNU pb_ds library, which are often faster in practice. Choosing a good load factor (ratio of items to buckets) and rehashing when necessary are important for keeping performance predictable.
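
A quick way to see the key-to-bucket mapping in practice is the bucket introspection API of unordered_map (a minimal sketch; the exact bucket indices and counts are implementation-defined):

#include <bits/stdc++.h>
using namespace std;

int main() {
    unordered_map<string, int> table;
    table.reserve(16);
    table["apple"] = 1;
    table["banana"] = 2;
    table["pear"] = 3;

    // Each key is hashed and then mapped to one of bucket_count() buckets.
    cout << "bucket_count: " << table.bucket_count() << "\n";
    for (const auto &kv : table) {
        cout << kv.first << " -> bucket " << table.bucket(kv.first) << "\n";
    }

    // bucket_size(i) reveals chains: more than one key in a bucket means a collision.
    for (size_t i = 0; i < table.bucket_count(); ++i) {
        if (table.bucket_size(i) > 1)
            cout << "bucket " << i << " holds " << table.bucket_size(i) << " keys\n";
    }
    return 0;
}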

02 Intuition & Analogies

Imagine a row of lockers (the array of buckets). Every student (key) is assigned a locker number by a simple rule: take their student ID, do some arithmetic, and get a locker index. Most of the time, each locker holds exactly the right student’s stuff. But sometimes two students are assigned the same locker—that’s a collision. With chaining, we allow multiple students to share a locker by keeping a small list inside it; when a student comes to retrieve items, they check the list quickly. With open addressing, if a student finds their locker busy, they try the next locker, and the next, following a deterministic pattern (linear probing, quadratic probing, or double hashing) until they find an empty one. The key to speed is making sure the lockers aren’t too crowded (low load factor), and that our rule (hash function) spreads students out uniformly. If a prankster gives many students IDs that always map to the same locker, the line at that locker grows long, and everyone slows down. Randomizing our rule (adding a secret seed) makes the prankster’s job hard because, without knowing the secret, they can’t easily force all students into one locker. In practice, you also pick a bigger hallway of lockers when the school gets more students: that’s rehashing—build a bigger array and reassign everyone quickly using the same rule.

03 Formal Definition

Let U be the universe of keys and H: U → {0, 1, ..., m−1} be a hash function that maps each key to one of m buckets. For a set S ⊆ U with |S| = n, the load factor is α = n/m. A chained hash table stores, for each bucket i, a small list (typically a linked list or vector) of elements whose keys hash to i. Under the Simple Uniform Hashing Assumption (SUHA), where each key is equally likely to hash to any bucket independently, the expected bucket size is α, and search/insert/delete run in expected O(1 + α) time. Open addressing stores at most one item per bucket and uses a probe sequence p(k,0), p(k,1), ... to find a slot for key k; examples include linear probing, quadratic probing, and double hashing. For linear probing under SUHA with load factor α < 1, an unsuccessful search takes about \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right) probes in expectation, while a successful search takes about \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) probes. In the worst case (adversarial keys or degenerate hashing), any standard hash table operation can take Θ(n). Randomized hashing and universal hash families reduce adversarial collision probability, offering expected guarantees even against adaptive inputs.
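
Plugging a few load factors into the linear-probing formulas above shows how quickly probe counts blow up near α = 1 (a small sketch that just evaluates the formulas, not a benchmark):

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Expected probes under SUHA for linear probing:
    //   successful:   0.5 * (1 + 1/(1 - a))
    //   unsuccessful: 0.5 * (1 + 1/(1 - a)^2)
    for (double a : {0.25, 0.5, 0.75, 0.9, 0.99}) {
        double succ = 0.5 * (1.0 + 1.0 / (1.0 - a));
        double fail = 0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a)));
        cout << "alpha=" << a << "  successful~" << succ
             << "  unsuccessful~" << fail << "\n";
    }
    return 0;
}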

04 When to Use

Use a hash table whenever you need fast expected O(1) insert, find, and erase for keys without ordering requirements. Typical tasks include frequency counting (word counts, value frequencies), membership queries (set lookups), building adjacency maps in graphs, memoization/cache maps for dynamic programming, and mapping composite keys (like pairs or tuples) to values. In competitive programming, unordered_map and unordered_set handle most needs quickly, especially with reserve to reduce rehashing. When input size is large and performance is tight, gp_hash_table can be faster due to lower constant factors. If you require ordered iteration or operations like predecessor/successor, prefer balanced BSTs (map/set). If you need to handle extremely memory-constrained environments or require very predictable probe counts, consider open addressing variants (like linear probing with careful load-factor control) or specialized hash tables (robin hood hashing, hopscotch). For cryptographic collision resistance or untrusted inputs, do not rely on standard hash functions; instead, use randomized hashers (splitmix64 with a secret seed) or a universal hashing scheme, and consider salting keys to mitigate collision-based denial-of-service.
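
As one concrete example of a memoization/cache map mentioned above, here is a minimal sketch with unordered_map (the Fibonacci recurrence just stands in for a real DP state):

#include <bits/stdc++.h>
using namespace std;

unordered_map<long long, long long> memo;  // state -> cached result

long long fib(long long n) {
    if (n <= 1) return n;
    auto it = memo.find(n);          // lookup without inserting
    if (it != memo.end()) return it->second;
    long long res = fib(n - 1) + fib(n - 2);
    memo[n] = res;                   // cache the computed value
    return res;
}

int main() {
    memo.reserve(1 << 10);           // avoid rehashing during recursion
    cout << fib(80) << "\n";         // fast thanks to expected O(1) cache hits
    return 0;
}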

⚠️Common Mistakes

  • Forgetting to call reserve on unordered_map/unordered_set before bulk inserts, causing many rehashes and performance drops. Fix: call umap.reserve(expected_size) and optionally set max_load_factor.
  • Using composite keys (pair, vector, struct) without providing a proper hash function, leading to compilation errors or poor hashing. Fix: write a custom hash functor that also defines equality consistently.
  • Assuming O(1) worst-case time and ignoring adversarial inputs. Fix: know that worst-case is O(n) per operation; mitigate with randomized hashing and reasonable load factors.
  • Writing weak hashes (e.g., returning key % m directly or combining fields with naive XOR) that cause patterns and clustering. Fix: use well-mixed 64-bit hashers like splitmix64, and combine fields with robust mixing and a random seed.
  • Misusing operator[] on unordered_map when you only want to test membership; it inserts a default value and changes size. Fix: use find or contains (C++20) for lookup without insertion (see the sketch after this list).
  • Iteration invalidation surprises after rehash. Fix: know that rehash invalidates iterators and references; minimize holding long-lived iterators across inserts/rehashes.
  • High load factor in open addressing leading to skyrocketing probe counts. Fix: trigger a rehash once the load factor reaches about 0.5–0.75 and ensure good probing strategies.
  • Comparing pointers as keys when objects move or addresses are unstable. Fix: hash the content or use stable IDs instead of raw addresses.
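
The operator[] pitfall from the list above is easy to reproduce (a minimal sketch):

#include <bits/stdc++.h>
using namespace std;

int main() {
    unordered_map<string, int> freq;
    freq["apple"] = 3;

    // Pitfall: operator[] inserts a default-constructed value for missing keys.
    int looked_up = freq["orange"];   // looks like a read, but inserts {"orange", 0}
    cout << "looked_up = " << looked_up
         << ", size after operator[]: " << freq.size() << "\n";  // size is now 2

    // Fix: use find (or contains in C++20) for a pure membership test.
    bool has_pear = (freq.find("pear") != freq.end());
    cout << "has pear? " << has_pear << ", size: " << freq.size() << "\n";  // size unchanged
    return 0;
}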

Key Formulas

Load Factor

\alpha = \frac{n}{m}

Explanation: Alpha (α) is the average number of elements per bucket when n items are stored in m buckets. Keeping α small (e.g., around 0.5–1.0) helps ensure short chains and fast operations.

Chaining Expected Chain Length

E[\text{chain length}] = \alpha

Explanation: Under uniform hashing, the expected number of elements in any bucket is the load factor. This implies average search and insert time is proportional to 1 + α.

Expected Time with Chaining

T_{\text{search/insert/delete}} = O(1 + \alpha)

Explanation: Search/insert/delete in a chained hash table take constant time on average plus a term linear in the load factor. When α is bounded by a constant, this becomes expected O(1).

Linear Probing Successful Search

E[\text{probes}] \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)

Explanation: For open addressing with linear probing under uniform hashing, the expected number of probes for a successful search grows as the table fills. It diverges as α → 1.

Linear Probing Unsuccessful Search

E[\text{probes}] \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)

Explanation: The expected probes for an unsuccessful search grow faster than for successful searches as the load factor increases, highlighting the need to rehash before the table is too full.

Universal Hashing Collision Bound

\Pr[h(x) = h(y)] \le \frac{1}{m} \quad \text{for } x \ne y

Explanation: For a universal hash family, the probability that two distinct keys collide is at most 1/m. This provides expected guarantees even against adversarial inputs.

Birthday Collision Approximation

\Pr[\text{any collision}] \approx 1 - e^{-n(n-1)/(2m)}

Explanation: When hashing n random keys into m buckets, the chance of any collision follows a birthday-paradox-like curve. Collisions become likely once n is about \sqrt{m}.

Affine Universal Hash

h_{a,b}(k) = ((a k + b) \bmod p) \bmod m

Explanation: Choosing random a, b with 1 ≤ a ≤ p−1 and 0 ≤ b ≤ p−1 for a prime p larger than any key creates a universal family. It is simple and provides low collision probability for arbitrary inputs.
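
A minimal sketch of drawing one member of this affine family at random; the prime p, the bucket count m, and the 128-bit arithmetic are illustrative choices (assuming all keys are smaller than p):

#include <bits/stdc++.h>
using namespace std;

int main() {
    const uint64_t p = 2305843009213693951ULL;  // the Mersenne prime 2^61 - 1; must exceed every key
    const uint64_t m = 1 << 16;                 // number of buckets

    mt19937_64 rng(random_device{}());
    uint64_t a = rng() % (p - 1) + 1;           // 1 <= a <= p-1
    uint64_t b = rng() % p;                     // 0 <= b <= p-1

    auto h = [&](uint64_t k) {
        // h_{a,b}(k) = ((a*k + b) mod p) mod m, computed in 128 bits to avoid overflow
        __uint128_t t = ((__uint128_t)a * k + b) % p;
        return (uint64_t)(t % m);
    };

    cout << h(42) << " " << h(43) << "\n";      // distinct keys collide with probability <= 1/m
    return 0;
}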

Amortized Insert

T_{\text{insert, amortized}} = O(1)

Explanation: Although occasional rehashing costs O(n), spreading that cost over many inserts yields constant amortized time per insertion.

Space Complexity

S = O(n + m)

Explanation: A chained table stores m buckets plus n elements; an open-address table stores about m slots. Space is linear in the number of elements.

Complexity Analysis

Under the uniform hashing assumption, a chained hash table with m buckets and n elements has load factor α = n/m. Each operation examines O(1 + α) elements on average because the expected chain length in any bucket is α. If α is kept bounded by rehashing (increasing m as n grows), search, insert, and erase run in expected O(1) time. Worst-case behavior occurs when many keys collide into the same bucket; then operations degrade to Θ(n) for that bucket. Rehashing occasionally requires O(n) time to move all elements into a larger table, but since it happens infrequently (once per growth step), the amortized insertion time remains O(1). For open addressing, there is only one item per slot, so probe sequences are used to find positions. As the load factor approaches 1, expected probes grow rapidly. For linear probing, expected successful search probes are about ½(1 + 1/(1−α)), and unsuccessful about ½(1 + 1/(1−α)²). Hence, keeping α around 0.5–0.75 is crucial. Open addressing uses contiguous arrays, which improves cache locality and constants, but deletions require tombstones or rehashing. Space usage is O(m + n) for chaining (m buckets plus nodes storing keys/values and pointers). For open addressing, space is O(m), where m ≈ n/α. In practice, constant factors differ: chaining stores pointers and nodes (overhead), while open addressing stores keys/values inline (denser). Finally, adversarial inputs can force Θ(n) per operation; randomized hashing (e.g., splitmix64 with a random seed) reduces the chance of concentrated collisions and recovers expected O(1) bounds against adaptive inputs.
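
The amortized argument can be observed directly: the following sketch watches bucket_count() jump during a bulk insert (exact growth points are implementation-defined):

#include <bits/stdc++.h>
using namespace std;

int main() {
    unordered_map<int, int> mp;
    size_t buckets = mp.bucket_count();
    cout << "start buckets: " << buckets << "\n";

    for (int i = 0; i < 100000; ++i) {
        mp[i] = i;
        if (mp.bucket_count() != buckets) {
            // A rehash just happened: every element was moved, costing O(n) once.
            buckets = mp.bucket_count();
            cout << "after " << (i + 1) << " inserts, buckets grew to " << buckets << "\n";
        }
    }
    // Growth events are exponentially spaced, so total rehash work is O(n) overall
    // and each individual insert is O(1) amortized.
    return 0;
}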

Code Examples

Basics of unordered_map and unordered_set with tuning (reserve, max_load_factor)
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: frequency counting with unordered_map
    unordered_map<string, int> freq;

    // Tuning: set max_load_factor first so that reserve accounts for it,
    // then reserve enough buckets to avoid many rehashes.
    // Suppose we expect up to 100000 distinct words.
    freq.max_load_factor(0.7f); // keep chains short
    freq.reserve(100000);

    vector<string> words = {"apple","banana","apple","pear","banana","apple"};
    for (const auto &w : words) {
        // operator[] inserts if missing, then increments
        ++freq[w];
    }

    // Query: check membership without inserting using find/contains
#if __cplusplus >= 202002L
    bool has_orange = freq.contains("orange");
#else
    bool has_orange = (freq.find("orange") != freq.end());
#endif

    cout << "has_orange? " << (has_orange ? "yes" : "no") << "\n";

    // Iterate and print frequencies
    for (auto &kv : freq) {
        cout << kv.first << ": " << kv.second << "\n";
    }

    // unordered_set example: membership only
    unordered_set<long long> seen;
    seen.reserve(1000);
    seen.insert(10);
    seen.insert(20);
    cout << (seen.count(10) ? "present" : "absent") << "\n";

    // Introspection: bucket_count and load_factor
    cout << "bucket_count: " << freq.bucket_count() << " load_factor: " << freq.load_factor() << "\n";

    return 0;
}

This program demonstrates typical operations: insert via operator[] for unordered_map, membership queries via find/contains, and iteration. It also shows performance tuning with reserve and max_load_factor to reduce rehashing and keep expected O(1) operations. The unordered_set example shows pure membership use cases.

Time: Expected O(1) per insert/lookup/erase; rehash is O(n) when it occurs.
Space: O(n) for elements plus O(m) buckets; overall linear.
Custom randomized hash for pairs and vectors to resist anti-hash attacks
#include <bits/stdc++.h>
using namespace std;

// A strong 64-bit mixer: splitmix64
static uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

struct custom_hash {
    // Random seed to defeat adversarial inputs
    const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count();

    // Hash for 64-bit integers
    size_t operator()(uint64_t x) const noexcept {
        return (size_t)splitmix64(x + FIXED_RANDOM);
    }

    // Overload for signed long long
    size_t operator()(long long x) const noexcept {
        return (*this)((uint64_t)x);
    }

    // Hash combine helper
    size_t combine(size_t a, size_t b) const noexcept {
        // Similar to boost::hash_combine but with extra mixing
        uint64_t res = a;
        res ^= b + 0x9e3779b97f4a7c15ULL + (res << 6) + (res >> 2);
        return (size_t)splitmix64(res + FIXED_RANDOM);
    }

    // Hash for pair<T,U>
    template<class T, class U>
    size_t operator()(const pair<T,U>& p) const noexcept {
        size_t h1 = (*this)((uint64_t)std::hash<T>{}(p.first));
        size_t h2 = (*this)((uint64_t)std::hash<U>{}(p.second));
        return combine(h1, h2);
    }

    // Hash for vector<int> (illustrative; for other types, generalize)
    size_t operator()(const vector<int>& v) const noexcept {
        size_t h = 0;
        for (int x : v) {
            size_t hx = (*this)((uint64_t)(uint32_t)x);
            h = combine(h, hx);
        }
        return h;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Using pair<int,int> as a key
    unordered_map<pair<int,int>, int, custom_hash> grid_cost;
    grid_cost.reserve(1 << 16);

    grid_cost[{1,2}] = 5;
    grid_cost[{2,3}] = 7;
    cout << grid_cost[{1,2}] << "\n"; // 5

    // Using vector<int> as a key
    unordered_map<vector<int>, string, custom_hash> pattern_name;
    pattern_name.reserve(1024);
    pattern_name[{1,2,3}] = "seq123";
    pattern_name[{2,3,5,7}] = "primes";
    cout << pattern_name[{2,3,5,7}] << "\n";

    // Anti-hash: show that many equal hashes won't easily occur with random seed
    unordered_set<long long, custom_hash> us;
    us.reserve(100000);
    for (long long i = 0; i < 100000; ++i) us.insert(i * 1000003LL);
    cout << "size: " << us.size() << "\n"; // Should be 100000

    return 0;
}

We define a custom_hash using splitmix64 with a random per-process seed to complicate adversarial collision attempts. The functor supports 64-bit integers, pairs, and vectors by robustly combining component hashes. This enables unordered containers to accept composite keys while maintaining good performance.

Time: Expected O(1) per operation; hashing a key is O(k) where k is the key’s length (e.g., vector size).
Space: O(n) for stored elements; hash functor has negligible overhead.
Implementing a simple open-addressing hash table (linear probing) for integers
#include <bits/stdc++.h>
using namespace std;

class LinearProbingHash {
    // States for each slot
    enum State : uint8_t { EMPTY = 0, FILLED = 1, DELETED = 2 };

    struct Slot {
        long long key = 0;
        State state = EMPTY;
    };

    vector<Slot> table;
    size_t n = 0;            // number of FILLED keys
    double max_alpha = 0.7;  // rehash threshold

    static inline uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    size_t H(long long x) const {
        // Simple 64-bit mixing, then map into table size
        return (size_t)(splitmix64((uint64_t)x)) & (table.size() - 1);
    }

    void rehash(size_t new_m) {
        // new_m must be a power of two for &-masking to work
        vector<Slot> old = move(table);
        table.assign(new_m, {});
        n = 0;
        for (const auto &s : old) if (s.state == FILLED) insert(s.key);
    }

    void maybe_grow() {
        if ((double)n / (double)table.size() > max_alpha) {
            rehash(table.size() * 2);
        }
    }

public:
    explicit LinearProbingHash(size_t m = 8) {
        // Round up m to power of two
        size_t pw = 1;
        while (pw < m) pw <<= 1;
        table.assign(pw, {});
    }

    bool insert(long long key) {
        maybe_grow();
        size_t m = table.size();
        size_t i = H(key);
        size_t first_deleted = SIZE_MAX;
        while (true) {
            if (table[i].state == EMPTY) {
                size_t pos = (first_deleted != SIZE_MAX ? first_deleted : i);
                table[pos].key = key;
                table[pos].state = FILLED;
                ++n;
                return true; // inserted new
            } else if (table[i].state == DELETED) {
                if (first_deleted == SIZE_MAX) first_deleted = i; // remember first tombstone
            } else if (table[i].state == FILLED && table[i].key == key) {
                return false; // already present
            }
            i = (i + 1) & (m - 1); // linear probing with wrap-around
        }
    }

    bool contains(long long key) const {
        size_t m = table.size();
        size_t i = H(key);
        while (true) {
            if (table[i].state == EMPTY) return false; // stop at empty slot
            if (table[i].state == FILLED && table[i].key == key) return true;
            i = (i + 1) & (m - 1);
        }
    }

    bool erase(long long key) {
        size_t m = table.size();
        size_t i = H(key);
        while (true) {
            if (table[i].state == EMPTY) return false;
            if (table[i].state == FILLED && table[i].key == key) {
                table[i].state = DELETED; // tombstone to preserve probe chains
                --n;
                return true;
            }
            i = (i + 1) & (m - 1);
        }
    }

    size_t size() const { return n; }
    size_t capacity() const { return table.size(); }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    LinearProbingHash hs;
    for (int i = 0; i < 10; ++i) hs.insert(i * 3);

    cout << "contains 9? " << (hs.contains(9) ? "yes" : "no") << "\n";
    cout << "contains 12? " << (hs.contains(12) ? "yes" : "no") << "\n";
    cout << "size: " << hs.size() << " capacity: " << hs.capacity() << "\n";

    hs.erase(9);
    cout << "contains 9 after erase? " << (hs.contains(9) ? "yes" : "no") << "\n";

    return 0;
}

This is a minimal open-addressing hash table for integers using linear probing with tombstones and power-of-two capacity for fast modulo via bit-masking. It demonstrates hashing, probing, insertion, lookup, deletion, and dynamic growth via rehashing at a target load factor. Note that tombstones are not counted toward the load factor here, so long sequences of interleaved inserts and erases can clutter the table with DELETED slots and lengthen probes; a production version would also rehash when filled-plus-deleted slots exceed the threshold.

Time: Expected O(1) per operation when load factor is bounded; worst-case O(n).
Space: O(m) where m is the current table capacity.
High-performance gp_hash_table (pb_ds) with randomized hasher
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
using namespace std;
using namespace __gnu_pbds;

static uint64_t splitmix64_u(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

struct fast_hash {
    const uint64_t SEED = chrono::steady_clock::now().time_since_epoch().count();
    size_t operator()(uint64_t x) const { return (size_t)splitmix64_u(x + SEED); }
    size_t operator()(long long x) const { return operator()((uint64_t)x); }
};

// A resize policy with external size access is required so that resize() can pre-allocate buckets.
template<class K, class V>
using fast_table = gp_hash_table<K, V, fast_hash, equal_to<K>,
    direct_mask_range_hashing<>, linear_probe_fn<>,
    hash_standard_resize_policy<hash_exponential_size_policy<>,
        hash_load_check_resize_trigger<>, true>>;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // gp_hash_table<Key, Mapped, Hash, ...>
    fast_table<long long, int> mp;
    mp.resize(1 << 16); // pre-allocate buckets

    // Example: count occurrences
    vector<long long> a = {1,2,3,2,1,4,5,2,1};
    for (auto x : a) mp[x]++;

    cout << "count(2) = " << mp[2] << "\n";

    // Membership-style container: value-less map (mapped type = null_type)
    fast_table<long long, null_type> st;
    st.resize(1 << 16);
    st.insert(10);
    st.insert(20);
    cout << (st.find(10) != st.end() ? "present" : "absent") << "\n";

    return 0;
}

The GNU pb_ds gp_hash_table is often faster than unordered_map due to lower constant factors and cache-friendly design. We use a randomized fast_hash to reduce collision risks, and pass a hash_standard_resize_policy with external size access so that resize can pre-allocate buckets. The example shows both key→value and key-only (set-like) usage by using null_type for the mapped value.

Time: Expected O(1) per operation; rehash/resizes are O(n).
Space: O(n) with implementation-dependent overhead.