⚙️ Algorithm · Intermediate

Bitset Optimization

Key Points

  • Bitset optimization exploits word-level parallelism so one CPU instruction processes 64 bits at once on typical 64-bit machines.
  • Representing sets as bitmasks lets union, intersection, and shifts run in O(n/64) instead of O(n), giving large constant-factor speedups.
  • Classic applications include transitive closure (Floyd–Warshall rows as bitsets), subset-sum/0-1 knapsack via shifts, and fast set intersections in graph algorithms.
  • std::bitset<N> is very fast but has compile-time fixed size; for variable sizes use vector<unsigned long long> or a custom dynamic bitset.
  • Shifts (<<, >>) implement “add/remove weight” or “move frontier” operations; &, |, ^ implement set intersection/union/xor.
  • Bit operations pair naturally with popcount to count set bits quickly using CPU instructions, enabling O(n/64) counting.
  • Time reductions like O(n^2) to O(n^2/64) or O(nW) to O(nW/64) are common and often decisive for n up to ~40,000.
  • Beware off-by-one indices, fixed-size limits of std::bitset, and the cost of branching; operate on whole words and mask trailing bits when needed.

Prerequisites

  • Bitwise operations: Understanding &, |, ^, ~, <<, >> is essential to reason about set operations and shifts.
  • Asymptotic complexity: To appreciate and analyze the O(n/64) savings and overall algorithmic trade-offs.
  • Boolean dynamic programming: Many bitset applications (subset-sum/knapsack) are DP patterns realized by bit shifts and ORs.
  • Graph fundamentals: Transitive closure, reachability, and adjacency representations are key bitset use cases.
  • Computer architecture basics: Knowing word size, caches, and popcount helps understand real-world performance limits.

Detailed Explanation


01 Overview

Bitset optimization is a technique where you represent sets, masks, or boolean arrays as sequences of machine words (usually 64-bit). Instead of processing elements one-by-one, you operate on entire words at once using bitwise instructions (AND, OR, XOR, SHIFT). Because the CPU handles these operations on 64 bits per instruction on most platforms, you get a speedup of roughly 64× in the inner loops compared to scalar boolean operations. This is called word-level parallelism and sits below SIMD but above naive per-element loops.

In algorithmic problems, many tasks boil down to operations on large boolean vectors: merging reachability information in graphs, checking subset sums, intersecting adjacency lists, or maintaining dynamic sets. If you encode these as bit vectors, you can replace O(n) loops with O(n/64) word-wise operations. This doesn’t change asymptotic complexity classes in big-O terms when written as functions of n, but it dramatically improves constants and often changes what input sizes are solvable in practice.

C++ offers std::bitset<N> for fixed-size bitsets with compile-time N and fast operators, and you can roll your own dynamic bitset with vector<unsigned long long> for variable sizes. Common patterns include using << and >> to shift states (e.g., knapsack), using & to compute intersections (e.g., common neighbors), and using | to propagate reachability (e.g., transitive closure). With careful implementation and attention to memory layout, bitset optimization can be the difference between TLE and AC in competitive programming.

02 Intuition & Analogies

Imagine sorting coins on a table: moving one coin at a time is slow, but pushing stacks of 64 coins together in one sweep is much faster. Bitset optimization does exactly that for boolean data. Each coin is a bit, and a 64-bit word is a neat stack. When you “AND” two stacks, you keep coins that are present in both. When you “OR,” you take coins from either stack. When you “SHIFT,” you slide the whole stack left or right, moving many coins in one motion.

Another analogy: think of road networks. If you track, for each city, all cities it can reach, the naive approach checks one destination at a time. With bitsets, you keep a whole city’s reachability as a single long strip of lights—each light is a city, on if reachable. To update after opening a new hub, you copy (OR) the entire light strip of the hub onto every city that can reach the hub. One operation toggles 64 lights at once.

For packing problems like 0-1 knapsack, envision a ruler with marks at all achievable weights. Adding an item shifts all marks by the item’s weight. Instead of re-marking each position individually, you slide the whole ruler in one move and overlay it (OR) with the original. Again, this moves 64 positions per CPU instruction.

This word-at-once philosophy is why bitset code, though not changing worst-case asymptotics in terms of n alone, can make an O(nW) knapsack or O(n^3) closure run fast enough in practice by effectively dividing the work by about 64.

03 Formal Definition

Let w denote the machine word size (typically w = 64). A bitset of length m is a binary vector b ∈ {0,1}^m, stored as an array of L = ⌈m/w⌉ words B[0..L-1], where bit i resides at word index ⌊i/w⌋ and bit offset i mod w. Word-wise operations implement vector boolean algebra:

  • Intersection: C = A ∩ B is computed as C[j] = A[j] & B[j] for all j in [0, L-1].
  • Union: C = A ∪ B is computed as C[j] = A[j] | B[j] for all j in [0, L-1].
  • Left shift by k = q·w + r bits: C[j] = (A[j - q] << r) | (A[j - q - 1] >> (w - r)), with out-of-range indices treated as 0.

The computational cost of each operation is proportional to L rather than m, i.e., O(m/w) word operations. When used within algorithms whose inner loops are dominated by set operations of size m, replacing element-wise loops with word-wise operations yields multiplicative speedups of roughly w. Examples include:

  • Subset/knapsack DP: O(nm) becomes O(nm/w).
  • Transitive closure (Floyd–Warshall rows as bitsets): O(n^3) becomes O(n^3/w).
  • Graph neighborhood intersections: O(m) per intersection becomes O(m/w).

04 When to Use

Use bitset optimization when your core computation repeatedly performs boolean operations over large, dense sets or state vectors:

  • Graph algorithms: transitive closure, counting triangles via common-neighbor intersections, fast BFS frontiers on dense graphs, or accelerating matching heuristics by bitwise intersections of adjacency.
  • Dynamic programming on boolean states: subset-sum and 0-1 knapsack via shifts; some subset DP (e.g., SOS-like transitions) where operations can be modeled as bitwise OR over masks.
  • Batch queries over sets: computing many intersections/unions between large sets (document retrieval, recommendation candidate filtering).
  • Dense reachability or compatibility matrices: where full rows/columns are touched repeatedly and density justifies O(n/64) per operation.

Bitsets are most effective when: (1) the sets are dense (many 1s), (2) the universe size is large (thousands to tens of thousands), (3) operations are repetitive in tight loops, and (4) memory fits in cache (contiguous words help). For sparse sets with very few 1s, inverted indices or sorted vectors with two-pointer intersections may outperform bitsets. Prefer std::bitset<N> when N is known at compile time; otherwise, use a custom dynamic bitset or libraries like boost::dynamic_bitset. Consider the CPU's word size (typically 64 bits), and avoid unnecessary branching inside word-wise loops.

⚠️Common Mistakes

  • Using std::bitset when size is not known at compile time. std::bitset<N> requires N to be a constant expression; otherwise, use vector<unsigned long long> or boost::dynamic_bitset.
  • Forgetting that shifts drop bits. dp << w discards bits that overflow the top; mask the range if you must preserve only up to a given capacity (e.g., limit to W in knapsack).
  • Off-by-one and indexing errors. Bit i lives in word i/64 at offset i%64; ensure you don’t access out of bounds and that you clear unused high bits in the last word when comparing.
  • Assuming O(1) for operations like count(). std::bitset::count() is O(n/64) but very fast thanks to hardware popcount; still, don’t call it in inner loops unless necessary.
  • Ignoring memory layout. Non-contiguous or scattered allocations kill cache locality; store bitsets contiguously and iterate linearly over words.
  • Overusing branches. Per-bit conditionals defeat the point; batch logic into word-wise operations. Prefer branchless masks and word loops.
  • Mixing signed/unsigned shifts. Always use unsigned long long for word storage to avoid implementation-defined behavior on right shifts.
  • Expecting exact 64× speedup. Real speedups depend on cache, memory bandwidth, and instruction throughput; 5–50× is common, 64× is an ideal upper bound.

Key Formulas

Number of words

L = ⌈m / w⌉

Explanation: A bitset of m bits needs L machine words when each word stores w bits. This determines loop bounds for word-wise operations.

Bit location

word(i) = ⌊i / w⌋,  offset(i) = i mod w

Explanation: Bit i resides in word index ⌊i/w⌋ at bit position i mod w. This guides set/test operations for dynamic bitsets.

Intersection by words

C[j] = A[j] & B[j]  for j = 0, …, L-1

Explanation: Set intersection reduces to AND on corresponding words. The total time is proportional to the number of words L (i.e., O(m/w)).

Left shift across words

C[j] = (A[j - q] << r) | (A[j - q - 1] >> (w - r)),  where k = q·w + r

Explanation: To shift by k = qw + r bits, shift by whole words q and then by r bits, combining carries from neighboring words. Out-of-range terms are treated as zero.

Bitset knapsack complexity

O(nW) → O(nW / w)

Explanation: Each item performs a shift and OR over W bits, executed word-wise in chunks of w bits, reducing the factor by roughly w.

Bitset transitive closure complexity

O(n^3) → O(n^3 / w)

Explanation: Floyd–Warshall with bitset rows performs an OR over n bits per (i,k) pair in word chunks of size w, dividing the cubic work by w.

Intersection size via popcount

|A ∩ B| = Σ_j popcount(A[j] & B[j])

Explanation: The size of the intersection equals the sum of popcounts of word-wise ANDs. This is efficient using hardware popcount.

Practical speedup bound

speedup ≤ w  (typically 64)

Explanation: Ideal speedup is up to w, but memory bandwidth and caches limit practical gains. This reminds us speedup is bounded by hardware.

Complexity Analysis

Bitset optimization changes per-operation costs from per-element to per-word. For a bitset of length m and machine word size w (usually 64), element-wise O(m) loops become O(m/w) loops over words. Each word operation (AND, OR, XOR, SHIFT) is constant time, and modern CPUs provide single-instruction popcount, making count operations fast.

Transitive closure using bitset rows: For each k in 0..n-1 and each i in 0..n-1, we first test reach[i][k] in O(1), and when true, we perform reach[i] |= reach[k], an O(n/w) word-wise OR. The total time is O(n^3/w) with O(n^2) bits of space.

Knapsack feasibility: For each item of weight x, we perform dp |= (dp << x). A shift and an OR both cost O(W/w), hence total time O(nW/w) and O(W) space. If W is large but sparse solutions are required, consider alternative DP forms; otherwise, bitsets excel for dense feasibility DPs.

Set intersection/union across many queries: Intersecting two size-m sets via word-wise AND is O(m/w). Counting the intersection uses popcount per word, still O(m/w). For Q intersections, total time is O(Qm/w). If sets are very sparse, two-pointer merge on sorted lists may be O(|A| + |B|) and faster.

Space overhead: Bitsets use exactly m bits plus alignment to the next word; a dynamic bitset uses L = ⌈m/64⌉ 64-bit words, i.e., 8L bytes. The compactness improves cache behavior. However, copying large bitsets is O(m/w) and should be minimized (pass by reference, reuse buffers).

Code Examples

Transitive Closure (Floyd–Warshall) with Bitset Rows: O(n^3/64)
#include <bits/stdc++.h>
using namespace std;

// Adjust MAXN to the maximum n you expect at compile time for std::bitset
static const int MAXN = 2048;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example graph with n nodes
    int n = 6;
    vector<bitset<MAXN>> reach(n);

    // Sample directed edges
    vector<pair<int,int>> edges = {
        {0,1},{1,2},{2,3},{3,4},{1,4},{4,5}
    };

    // Initialize reachability: self loops and edges
    for (int i = 0; i < n; ++i) reach[i].set(i);
    for (auto [u,v] : edges) reach[u].set(v);

    // Bitset-accelerated transitive closure:
    // for each intermediate k, propagate its reachability row to any i that can reach k
    for (int k = 0; k < n; ++k) {
        for (int i = 0; i < n; ++i) {
            if (reach[i].test(k)) {   // O(1) test
                reach[i] |= reach[k]; // O(n/64) word-wise OR
            }
        }
    }

    // Print reachability matrix
    cout << "Reachability matrix (1 if i reaches j):\n";
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) cout << (reach[i].test(j) ? '1' : '0');
        cout << '\n';
    }

    // Example query: can 0 reach 5?
    cout << "0 reaches 5? " << (reach[0].test(5) ? "Yes" : "No") << '\n';
    return 0;
}

We store each row of the reachability matrix as a bitset. For each intermediate node k, if i can reach k, then all nodes k can reach are also reachable from i, so we OR row k into row i. This replaces an O(n) per-update merge with an O(n/64) word-wise OR, reducing the cubic algorithm to O(n^3/64).

Time: O(n^3/64) in the dense worst case. Space: O(n^2) bits for the reachability matrix (plus per-row overhead).
0-1 Subset Sum / Knapsack Feasibility with Bitset Shifts: O(nW/64)
#include <bits/stdc++.h>
using namespace std;

// Max capacity; adjust as needed
static const int MAXW = 100000; // supports capacity up to 100000

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: given item weights, check achievable sums up to W
    int W = 50;
    vector<int> w = {3, 7, 12, 19, 5, 9};

    bitset<MAXW + 1> dp; // dp[s] == 1 iff sum s is achievable
    dp.set(0);           // sum 0 is always achievable

    for (int x : w) {
        // Shift achievable sums by x and OR to add new achievable sums
        dp |= (dp << x); // O(W/64) per item
    }

    // Mask out bits beyond capacity W to be safe (optional if MAXW == W)
    for (int s = W + 1; s <= MAXW; ++s) dp.reset(s);

    // Output: maximum achievable sum <= W and whether exact W is achievable
    int best = -1;
    for (int s = W; s >= 0; --s) if (dp.test(s)) { best = s; break; }

    cout << "Exact W achievable? " << (dp.test(W) ? "Yes" : "No") << '\n';
    cout << "Max achievable <= W: " << best << '\n';

    return 0;
}

We represent achievable sums as a bitset where bit s is 1 if sum s is reachable. For each item of weight x, we shift the bitset left by x (representing adding x to all existing sums) and OR it with the original. This runs in O(W/64) per item and needs only O(W) memory.

Time: O(nW/64). Space: O(W) bits.
Custom Dynamic Bitset for Fast Set Intersection (O(n/64))
#include <bits/stdc++.h>
using namespace std;

struct DynBitset {
    int nbits;                    // logical number of bits
    vector<unsigned long long> w; // words storing the bits

    explicit DynBitset(int n = 0) { reset_size(n); }

    void reset_size(int n) {
        nbits = n;
        int L = (n + 63) / 64;
        w.assign(L, 0ULL);
    }

    void set(int i) {
        int j = i >> 6; int r = i & 63;
        w[j] |= (1ULL << r);
    }
    void reset(int i) {
        int j = i >> 6; int r = i & 63;
        w[j] &= ~(1ULL << r);
    }
    bool test(int i) const {
        int j = i >> 6; int r = i & 63;
        return (w[j] >> r) & 1ULL;
    }

    // In-place OR
    void ior(const DynBitset& other) {
        int L = (int)w.size();
        for (int j = 0; j < L; ++j) w[j] |= other.w[j];
    }
    // In-place AND
    void iand(const DynBitset& other) {
        int L = (int)w.size();
        for (int j = 0; j < L; ++j) w[j] &= other.w[j];
    }

    // Count set bits using hardware popcount
    long long count() const {
        long long c = 0;
        for (auto x : w) c += __builtin_popcountll(x);
        return c;
    }

    // Compute |A ∩ B| without allocating a new bitset
    static long long intersection_count(const DynBitset& A, const DynBitset& B) {
        int L = (int)A.w.size();
        long long c = 0;
        for (int j = 0; j < L; ++j) c += __builtin_popcountll(A.w[j] & B.w[j]);
        return c;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int M = 1000; // universe size
    DynBitset A(M), B(M);

    // Build two sets
    vector<int> elemsA = {1,3,5,7,10,42,128,256,512,700,900};
    vector<int> elemsB = {3,7,11,42,129,256,513,700,901};
    for (int x : elemsA) A.set(x);
    for (int x : elemsB) B.set(x);

    long long inter = DynBitset::intersection_count(A, B);

    cout << "|A ∩ B| = " << inter << "\n";

    // Example: union count = |A| + |B| - |A ∩ B|
    long long union_count = A.count() + B.count() - inter;
    cout << "|A ∪ B| = " << union_count << "\n";

    return 0;
}

This implements a minimal dynamic bitset using vector<unsigned long long>. It demonstrates O(n/64) set intersection via word-wise AND and popcount. This approach generalizes std::bitset to variable sizes while keeping cache-friendly, branchless inner loops.

Time: O(n/64) per intersection/count. Space: O(n) bits = n/8 bytes per bitset.