Bitset Optimization
Key Points
- Bitset optimization exploits word-level parallelism so one CPU instruction processes 64 bits at once on typical 64-bit machines.
- Representing sets as bitmasks lets union, intersection, and shifts run in O(n/64) instead of O(n), giving large constant-factor speedups.
- Classic applications include transitive closure (Floyd–Warshall rows as bitsets), subset-sum/0-1 knapsack via shifts, and fast set intersections in graph algorithms.
- std::bitset<N> is very fast but has compile-time fixed size; for variable sizes use vector<unsigned long long> or a custom dynamic bitset.
- Shifts (<<, >>) implement “add/remove weight” or “move frontier” operations; &, |, ^ implement set intersection/union/XOR.
- Bit operations pair naturally with popcount to count set bits quickly using CPU instructions, enabling O(n/64) counting.
- Time reductions like O(n²) to O(n²/64) or O(nW) to O(nW/64) are common and often decisive for n up to ~40,000.
- Beware off-by-one indices, fixed-size limits of std::bitset, and the cost of branching; operate on whole words and mask trailing bits when needed.
Prerequisites
- Bitwise operations — Understanding &, |, ^, ~, <<, >> is essential to reason about set operations and shifts.
- Asymptotic complexity — To appreciate and analyze the O(n/64) savings and overall algorithmic trade-offs.
- Boolean dynamic programming — Many bitset applications (subset-sum/knapsack) are DP patterns realized by bit shifts and ORs.
- Graph fundamentals — Transitive closure, reachability, and adjacency representations are key bitset use cases.
- Computer architecture basics — Knowing word size, caches, and popcount helps understand real-world performance limits.
Detailed Explanation
01 Overview
Bitset optimization is a technique where you represent sets, masks, or boolean arrays as sequences of machine words (usually 64-bit). Instead of processing elements one-by-one, you operate on entire words at once using bitwise instructions (AND, OR, XOR, SHIFT). Because the CPU handles these operations on 64 bits per instruction on most platforms, you get a speedup of roughly 64× in the inner loops compared to scalar boolean operations. This is called word-level parallelism and sits below SIMD but above naive per-element loops.
In algorithmic problems, many tasks boil down to operations on large boolean vectors: merging reachability information in graphs, checking subset sums, intersecting adjacency lists, or maintaining dynamic sets. If you encode these as bit vectors, you can replace O(n) loops with O(n/64) word-wise operations. This doesn’t change asymptotic complexity classes in big-O terms when written as functions of n, but it dramatically improves constants and often changes what input sizes are solvable in practice.
C++ offers std::bitset<N> for fixed-size bitsets with compile-time N and fast operators, and you can roll your own dynamic bitset with vector<unsigned long long> for variable sizes. Common patterns include using << and >> to shift states (e.g., knapsack), using & to compute intersections (e.g., common neighbors), and using | to propagate reachability (e.g., transitive closure). With careful implementation and attention to memory layout, bitset optimization can be the difference between TLE and AC in competitive programming.
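As a minimal sketch of the word-at-a-time idea (the sizes and values here are illustrative, not from any specific problem), the two versions below compute the same union of two boolean vectors; the scalar loop touches one element per iteration, while the bitset version handles roughly 64 bits per instruction:

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    const int N = 100000; // illustrative universe size

    // Scalar version: one boolean per loop iteration -> ~N operations
    vector<bool> a(N), b(N), u(N);
    a[3] = true;
    b[70000] = true;
    for (int i = 0; i < N; ++i) u[i] = a[i] || b[i];

    // Word-level version: one OR handles 64 bits -> ~N/64 word operations
    bitset<N> A, B;
    A.set(3);
    B.set(70000);
    bitset<N> U = A | B;

    cout << count(u.begin(), u.end(), true) << " " << U.count() << "\n"; // prints: 2 2
    return 0;
}
```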
02 Intuition & Analogies
Imagine sorting coins on a table: moving one coin at a time is slow, but pushing stacks of 64 coins together in one sweep is much faster. Bitset optimization does exactly that for boolean data. Each coin is a bit, and a 64-bit word is a neat stack. When you “AND” two stacks, you keep coins that are present in both. When you “OR,” you take coins from either stack. When you “SHIFT,” you slide the whole stack left or right, moving many coins in one motion.
Another analogy: think of road networks. If you track, for each city, all cities it can reach, the naive approach checks one destination at a time. With bitsets, you keep a whole city’s reachability as a single long strip of lights—each light is a city, on if reachable. To update after opening a new hub, you copy (OR) the entire light strip of the hub onto every city that can reach the hub. One operation toggles 64 lights at once.
For packing problems like 0-1 knapsack, envision a ruler with marks at all achievable weights. Adding an item shifts all marks by the item’s weight. Instead of re-marking each position individually, you slide the whole ruler in one move and overlay it (OR) with the original. Again, this moves 64 positions per CPU instruction.
This word-at-once philosophy is why bitset code, though not changing worst-case asymptotics in terms of n alone, can make an O(nW) knapsack or O(n^3) closure run fast enough in practice by effectively dividing the work by about 64.
03 Formal Definition
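A bitset represents a set S ⊆ {0, 1, …, m − 1} (equivalently, a boolean vector of length m) as L = ⌈m / w⌉ machine words, where w is the word size (typically 64). Bit i is stored in word ⌊i / w⌋ at bit position i mod w. Set operations are defined word-wise: union is the word-wise OR, intersection the word-wise AND, symmetric difference the word-wise XOR, and complement the word-wise NOT with the unused high bits of the last word masked off. A shift by k bits decomposes into a shift by ⌊k / w⌋ whole words plus a shift by k mod w bits with carries between neighboring words. Each such operation costs O(L) = O(m / w) word instructions, and |S| is obtained by summing popcounts of the words.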
04 When to Use
Use bitset optimization when your core computation repeatedly performs boolean operations over large, dense sets or state vectors:
- Graph algorithms: transitive closure, counting triangles via common-neighbor intersections (see the sketch at the end of this subsection), fast BFS frontiers on dense graphs, or accelerating matching heuristics by bitwise intersections of adjacency.
- Dynamic programming on boolean states: subset-sum and 0-1 knapsack via shifts; some subset DP (e.g., SOS-like transitions) where operations can be modeled as bitwise OR over masks.
- Batch queries over sets: computing many intersections/unions between large sets (document retrieval, recommendation candidate filtering).
- Dense reachability or compatibility matrices: where full rows/columns are touched repeatedly and density justifies O(n/64) per operation.
Bitsets are most effective when: (1) the sets are dense (many 1s), (2) the universe size is large (thousands to tens of thousands), (3) operations are repetitive in tight loops, and (4) memory fits in cache (contiguous words help). For sparse sets with very few 1s, inverted indices or sorted vectors with two-pointer intersections may outperform bitsets. Prefer std::bitset<N> when N is known at compile time; otherwise, use a custom dynamic bitset or libraries like boost::dynamic_bitset. Consider the CPU's word size (64 bits) and avoid unnecessary branching inside word-wise loops.
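As a concrete instance of the common-neighbor intersection mentioned above, here is a small sketch (the example graph and the MAXV bound are illustrative assumptions): each vertex keeps its adjacency row as a bitset, and the triangles through an edge (u, v) are counted with one AND plus a popcount.

```cpp
#include <bits/stdc++.h>
using namespace std;

static const int MAXV = 1024; // compile-time bound on vertex count (assumption)

int main() {
    // Small undirected example graph
    int n = 5;
    vector<pair<int,int>> edges = {{0,1},{0,2},{1,2},{1,3},{2,3},{3,4}};

    vector<bitset<MAXV>> adj(n);
    for (auto [u, v] : edges) { adj[u].set(v); adj[v].set(u); }

    // Each triangle {u,v,x} is counted once per edge, i.e. three times in total
    long long triple_counted = 0;
    for (auto [u, v] : edges) {
        triple_counted += (adj[u] & adj[v]).count(); // O(n/64) per edge
    }
    cout << "triangles = " << triple_counted / 3 << "\n"; // prints: triangles = 2
    return 0;
}
```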
⚠️Common Mistakes
- Using std::bitset when size is not known at compile time. std::bitset<N> requires N to be a constant expression; otherwise, use vector<unsigned long long> or boost::dynamic_bitset.
- Forgetting that shifts drop bits. dp << w discards bits that overflow the top; mask the range if you must preserve only up to a given capacity (e.g., limit to W in knapsack).
- Off-by-one and indexing errors. Bit i lives in word i/64 at offset i%64; ensure you don’t access out of bounds and that you clear unused high bits in the last word when comparing (see the sketch after this list).
- Assuming O(1) for operations like count(). std::bitset::count() is O(n/64) but very fast thanks to hardware popcount; still, don’t call it in inner loops unless necessary.
- Ignoring memory layout. Non-contiguous or scattered allocations kill cache locality; store bitsets contiguously and iterate linearly over words.
- Overusing branches. Per-bit conditionals defeat the point; batch logic into word-wise operations. Prefer branchless masks and word loops.
- Mixing signed/unsigned shifts. Always use unsigned long long for word storage to avoid implementation-defined behavior on right shifts.
- Expecting exact 64× speedup. Real speedups depend on cache, memory bandwidth, and instruction throughput; 5–50× is common, 64× is an ideal upper bound.
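A minimal sketch of the last-word masking mentioned above, assuming the vector<unsigned long long> word layout used in the code examples below (the sizes are illustrative):

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int nbits = 100;                 // logical size (not a multiple of 64)
    int L = (nbits + 63) / 64;       // number of 64-bit words
    vector<unsigned long long> w(L, 0ULL);

    // Complement of the empty set: naively flip every word...
    for (auto &x : w) x = ~x;

    // ...then mask off the unused high bits of the last word,
    // otherwise popcount would report 128 instead of 100.
    int r = nbits & 63;              // bits actually used in the last word
    if (r) w.back() &= (~0ULL) >> (64 - r);

    long long c = 0;
    for (auto x : w) c += __builtin_popcountll(x);
    cout << c << "\n";               // prints: 100
    return 0;
}
```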
Key Formulas
Number of words
L = ⌈m / w⌉
Explanation: A bitset of m bits needs L machine words when each word stores w bits. This determines loop bounds for word-wise operations.
Bit location
bit i ↦ word ⌊i / w⌋, position i mod w
Explanation: Bit i resides in word index ⌊i / w⌋ at bit position i mod w. This guides set/test operations for dynamic bitsets.
Intersection by words
(A ∩ B)_j = A_j & B_j for j = 0, …, L − 1
Explanation: Set intersection reduces to AND on corresponding words. The total time is proportional to the number of words L (i.e., O(m/w)).
Left shift across words
For k = q·w + r with 0 ≤ r < w: (A << k)_j = (A_{j−q} << r) | (A_{j−q−1} >> (w − r)), where the second term is omitted when r = 0
Explanation: To shift by k = qw + r bits, shift by whole words q and then by r bits, combining carries from neighboring words. Out-of-range terms are treated as zero.
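A sketch of this shift rule for a raw word array (the shift_left helper is my own illustration, assuming bit i lives in word i/64):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Shift a word array (bit i stored in word i/64) left by k bits.
// Words that would come from below index 0 are treated as zero.
vector<unsigned long long> shift_left(const vector<unsigned long long>& a, int k) {
    int L = (int)a.size();
    int q = k / 64, r = k % 64;
    vector<unsigned long long> out(L, 0ULL);
    for (int j = L - 1; j >= q; --j) {
        unsigned long long hi = a[j - q] << r;          // main part
        unsigned long long lo = (r && j - q - 1 >= 0)   // carry from the word below
                                    ? a[j - q - 1] >> (64 - r)
                                    : 0ULL;
        out[j] = hi | lo;
    }
    return out;
}

int main() {
    vector<unsigned long long> a(2, 0ULL);
    a[0] = 1ULL << 63;                     // bit 63 set
    auto b = shift_left(a, 3);             // bit 66 should now be set
    cout << ((b[1] >> 2) & 1ULL) << "\n";  // prints: 1
    return 0;
}
```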
Bitset knapsack complexity
O(nW / w)
Explanation: Each item performs a shift and OR over W bits, executed word-wise in chunks of w bits, reducing the factor by roughly w.
Bitset transitive closure complexity
O(n³ / w)
Explanation: Floyd–Warshall with bitset rows performs an OR over n bits per (i,k) pair in word chunks of size w, dividing the cubic work by w.
Intersection size via popcount
|A ∩ B| = Σ_{j=0}^{L−1} popcount(A_j & B_j)
Explanation: The size of the intersection equals the sum of popcounts of word-wise ANDs. This is efficient using hardware popcount.
Practical speedup bound
speedup ≤ w (w = 64 on typical machines)
Explanation: Ideal speedup is up to w, but memory bandwidth and caches limit practical gains. This reminds us speedup is bounded by hardware.
Complexity Analysis
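In the terms above (w = 64 on typical hardware): a single union, intersection, or shift of an m-bit bitset costs O(m/64) word operations and occupies m/8 bytes. The transitive-closure example below therefore runs in O(n³/64) time with O(n²/8) bytes for the row bitsets, and the knapsack example runs in O(nW/64) time with O(W/8) bytes. Asymptotically these are the same classes as the scalar versions; the division by the word size is only a constant factor, but it is usually the factor that decides whether a solution fits in the time limit.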
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Adjust MAXN to the maximum n you expect at compile time for std::bitset
static const int MAXN = 2048;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example graph with n nodes
    int n = 6;
    vector<bitset<MAXN>> reach(n);

    // Sample directed edges
    vector<pair<int,int>> edges = {
        {0,1},{1,2},{2,3},{3,4},{1,4},{4,5}
    };

    // Initialize reachability: self loops and edges
    for (int i = 0; i < n; ++i) reach[i].set(i);
    for (auto [u,v] : edges) reach[u].set(v);

    // Bitset-accelerated transitive closure
    // For each intermediate k, propagate its reachability row to any i that can reach k
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            if (reach[i].test(k)) {   // O(1) test
                reach[i] |= reach[k]; // O(n/64) word-wise OR
            }
        }
    }

    // Print reachability matrix
    cout << "Reachability matrix (1 if i reaches j):\n";
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) cout << (reach[i].test(j) ? '1' : '0');
        cout << '\n';
    }

    // Example query: can 0 reach 5?
    cout << "0 reaches 5? " << (reach[0].test(5) ? "Yes" : "No") << '\n';
    return 0;
}
```
We store each row of the reachability matrix as a bitset. For each intermediate node k, if i can reach k, then all nodes k can reach are also reachable from i, so we OR row k into row i. This replaces an O(n) per-update merge with an O(n/64) word-wise OR, reducing the cubic algorithm to O(n^3/64).
```cpp
#include <bits/stdc++.h>
using namespace std;

// Max capacity; adjust as needed
static const int MAXW = 100000; // supports capacity up to 100000

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: given item weights, check achievable sums up to W
    int n = 6;
    int W = 50;
    vector<int> w = {3, 7, 12, 19, 5, 9};

    bitset<MAXW + 1> dp; // dp[s] == 1 iff sum s is achievable
    dp.set(0);           // sum 0 is always achievable

    for (int x : w) {
        // Shift achievable sums by x and OR to add new achievable sums
        dp |= (dp << x); // O(W/64) per item
    }

    // Mask out bits beyond capacity W to be safe (optional if MAXW == W)
    for (int s = W + 1; s <= MAXW; ++s) dp.reset(s);

    // Output: maximum achievable sum <= W and whether exact W is achievable
    int best = -1;
    for (int s = W; s >= 0; --s) if (dp.test(s)) { best = s; break; }

    cout << "Exact W achievable? " << (dp.test(W) ? "Yes" : "No") << '\n';
    cout << "Max achievable <= W: " << best << '\n';

    return 0;
}
```
We represent achievable sums as a bitset where bit s is 1 if sum s is reachable. For each item of weight x, we shift the bitset left by x (representing adding x to all existing sums) and OR it with the original. This runs in O(W/64) per item and needs only O(W) memory.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct DynBitset {
    int nbits;                    // logical number of bits
    vector<unsigned long long> w; // words storing the bits

    explicit DynBitset(int n = 0) { reset_size(n); }

    void reset_size(int n) {
        nbits = n;
        int L = (n + 63) / 64;
        w.assign(L, 0ULL);
    }

    void set(int i) {
        int j = i >> 6; int r = i & 63;
        w[j] |= (1ULL << r);
    }
    void reset(int i) {
        int j = i >> 6; int r = i & 63;
        w[j] &= ~(1ULL << r);
    }
    bool test(int i) const {
        int j = i >> 6; int r = i & 63;
        return (w[j] >> r) & 1ULL;
    }

    // In-place OR
    void ior(const DynBitset& other) {
        int L = (int)w.size();
        for (int j = 0; j < L; ++j) w[j] |= other.w[j];
    }
    // In-place AND
    void iand(const DynBitset& other) {
        int L = (int)w.size();
        for (int j = 0; j < L; ++j) w[j] &= other.w[j];
    }

    // Count set bits using hardware popcount
    long long count() const {
        long long c = 0;
        for (auto x : w) c += __builtin_popcountll(x);
        return c;
    }

    // Compute |A ∩ B| without allocating a new bitset
    static long long intersection_count(const DynBitset& A, const DynBitset& B) {
        int L = (int)A.w.size();
        long long c = 0;
        for (int j = 0; j < L; ++j) c += __builtin_popcountll(A.w[j] & B.w[j]);
        return c;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int M = 1000; // universe size
    DynBitset A(M), B(M);

    // Build two sets
    vector<int> elemsA = {1,3,5,7,10,42,128,256,512,700,900};
    vector<int> elemsB = {3,7,11,42,129,256,513,700,901};
    for (int x : elemsA) A.set(x);
    for (int x : elemsB) B.set(x);

    long long inter = DynBitset::intersection_count(A, B);

    cout << "|A ∩ B| = " << inter << "\n";

    // Example: union count = |A| + |B| - |A ∩ B|
    long long union_count = A.count() + B.count() - inter;
    cout << "|A ∪ B| = " << union_count << "\n";

    return 0;
}
```
This implements a minimal dynamic bitset using vector<unsigned long long>. It demonstrates O(n/64) set intersection via word-wise AND and popcount. This approach generalizes std::bitset to variable sizes while keeping cache-friendly, branchless inner loops.