⚙️ Algorithm · Advanced

Subset Sum Convolution

Key Points

  • Subset Sum Convolution (often called Subset Convolution) computes C[S] by summing A[T]×B[U] over all disjoint pairs T and U whose union is S.
  • It can be reduced to polynomial multiplication in the popcount (subset-size) dimension for every fixed S.
  • The standard implementation uses SOS DP (subset zeta transform) per popcount layer, yielding O(n^2 2^n) time and O(n 2^n) space.
  • Möbius inversion over the subset lattice undoes the zeta transform to recover the final answer.
  • This technique is crucial for subset-partition DPs where transitions select a disjoint sub-subset at each step.
  • Be careful to handle the empty subset correctly; setting weights for empty blocks often needs special treatment.
  • Modulo arithmetic is recommended to avoid overflow, and careful rank bookkeeping by popcount is essential.
  • Asymptotically faster subset-convolution variants have been studied but are more complex; the O(n^2 2^n) approach is simpler and sufficient for most CF tasks.

Prerequisites

  • Bitmask DP: Subset convolution is defined over bitmask-indexed arrays and frequently replaces naive subset-iteration transitions.
  • SOS DP (Subset Zeta/Möbius Transforms): The algorithm relies on zeta transforms over subsets and their Möbius inverses.
  • Popcount and Binary Representation: Layering by subset size (popcount) is central to turning the convolution into polynomial-like multiplication.
  • Modular Arithmetic: Needed to avoid overflow and maintain correctness when summing and multiplying large numbers.
  • Polynomial Convolution Basics: Explains why combining layers is equivalent to multiplying polynomials in the rank dimension.
  • Inclusion–Exclusion Principle: Möbius inversion on the subset lattice is a direct application of inclusion–exclusion.

Detailed Explanation


01. Overview

Subset Sum Convolution (commonly known as Subset Convolution) is an operation on functions defined over all subsets of an n-element ground set. Given two arrays A and B indexed by bitmasks (subsets), the goal is to compute an array C where each entry C[S] sums products A[T]×B[U] over all ways to split S into two disjoint parts T and U whose union is S. This captures 'build S by choosing a disjoint piece for A and the remaining piece for B'. A naive implementation would enumerate all T ⊆ S for every S, leading to O(3^n) work. The key insight is to reorganize the computation by subset size (popcount) and apply the SOS (Sum Over Subsets) transform. By grouping values by the number of bits set and running a zeta transform per size layer, we convert the convolution over disjoint unions into pointwise polynomial multiplications in the popcount dimension for each fixed S. Finally, a Möbius inversion undoes the zeta transform to retrieve C. The standard approach runs in O(n^2 2^n) time and O(n 2^n) space, which is practical up to n≈20–22 in competitive programming when implemented carefully with iterative bit-DP loops and modulo arithmetic.

02. Intuition & Analogies

Imagine you are assembling a team S from a pool of n people. You want to split this team into two disjoint subteams T and U (their union is the whole team S) and then score the split as A[T]×B[U]. The final score for S, C[S], is the sum of scores over all possible disjoint splits. Naively, you would try every subset T of S and take the complement U = S \ T, which is too slow. Instead, think of every fixed S as a shelf containing all its subteams T, organized by size (how many people are in T). For this shelf S, make two 'catalogs': one that, for each size r, lists the cumulative total of A over all r-sized subteams of S, and similarly for B. These are the zeta-transformed layers. Now the beautiful trick: picking a split of S into T and U with sizes p and q is the same as picking degrees p and q that add to |S| in a product of two polynomials. So for each S, you multiply the two size-indexed 'polynomials' (the catalogs) and take coefficients that correspond to valid splits. This converts an exponential search into a structured multiplication across sizes. When you do this for all S simultaneously using SOS DP, you massively reduce the cost. However, the catalogs we built were cumulative over all sub-subsets; to extract the exact split contributions, we need to 'un-cumulate' the values, exactly what Möbius inversion over subsets does. The process is just like performing a subset-version of FFT: transform (zeta), multiply per S in the degree dimension, and inverse transform (Möbius).

03. Formal Definition

Let [n] = {0,1,...,n-1}, and let functions A,B : 2^{[n]} → R be arrays indexed by subsets (bitmasks). The subset convolution is defined by C[S] = Σ_{T⊆S} A[T]·B[S\T]. Equivalently, C[k] = Σ_{i|j=k, i&j=0} A[i]·B[j], where i, j, k are bitmasks, "|" is bitwise OR, and "&" is bitwise AND. Define the rank-layered functions f_r[S] = A[S] if the popcount |S| = r (otherwise 0), and g_r[S] similarly for B. The subset zeta transform is F_r[S] = Σ_{T⊆S} f_r[T] (and analogously G_r for g_r). Then, for each S and each r, define H_r[S] = Σ_{p+q=r} F_p[S]·G_q[S]. Finally, apply Möbius inversion over subsets to each r-layer to obtain h_r from H_r, and the desired result is C[S] = h_{|S|}[S]. This layered transform turns the convolution over disjoint unions into polynomial multiplication in the size dimension for each S, while the zeta/Möbius pair manages inclusion–exclusion over subsets.

04. When to Use

Use Subset Sum Convolution when your DP transition builds a set S by selecting a disjoint piece T and combining it with S\T in a multiplicative way. Typical scenarios include: (1) Partition DPs: you want the number or total weight of ways to partition S into blocks where each block contributes a weight w[block]; transitions then look like F_k = w (*) F_{k-1} (subset convolution), iterated over the number of blocks. (2) Cover DPs: counting covers or decompositions where you repeatedly pick a sub-subset T and combine it with the remainder disjointly. (3) Problems that ask for counts over all ordered/unordered ways to split S into two labeled parts with weights A and B. (4) Accelerating O(3^n) subset transitions that enumerate T ⊆ S by replacing them with O(n^2 2^n) zeta-based convolutions. If your combining rule does not require disjointness (e.g., OR convolution without the i & j = 0 constraint), consider other transforms (OR/AND/XOR FWT). If your n is very small (n ≤ 18) and constants are tight, a direct O(3^n) may suffice; otherwise, for n ≈ 20–22, the O(n^2 2^n) subset convolution is often the sweet spot in CF 2200–2600 problems.

⚠️Common Mistakes

  • Confusing subset convolution with OR/AND convolutions: subset convolution requires i & j = 0 and i | j = k, i.e., a disjoint union forming exactly k. OR/AND FWT do not enforce disjointness. Double-check the combining rule.
  • Mishandling the empty set: In many partition DPs, the block weight for the empty set must be zero (w[∅] = 0); otherwise, you overcount by repeatedly choosing ∅. Ensure base cases like F_0[∅] = 1 and F_0[S≠∅] = 0 are correct.
  • Forgetting Möbius inversion or applying the wrong transform direction: The forward zeta sums over subsets; the inverse must subtract to undo the accumulation. Using the superset variant by mistake yields wrong answers.
  • Off-by-one in rank bookkeeping: Layers are indexed by popcount r ∈ [0..n]. When combining layers, only sum pairs with p + q = r, and read the final C[S] from layer r = |S|.
  • Integer overflow and modulo bugs: Intermediate products can exceed 64 bits for large moduli or values. Use modulo arithmetic consistently, and normalize negatives after Möbius inversion (add MOD back).
  • Memory-layout inefficiency: Vectors of vectors with frequent reallocation and poor cache locality can TLE. Prefer flat arrays with shape [(n+1) × 2^n] and in-place bit DP.
  • Complexity underestimation: The transforms cost O(n^2 2^n). For n ≈ 22, use fast IO, tight loops, and avoid unnecessary conditionals inside the hottest loops.

Key Formulas

Subset Convolution Definition

C[S] = Σ_{T⊆S} A[T] · B[S\T]

Explanation: For each set S, sum over all ways to split S into a subset T and its complement within S. This is the natural 'disjoint union' convolution on the subset lattice.

Bitmask Form

C[k] = Σ_{i | j = k, i & j = 0} A[i] · B[j]

Explanation: In bitwise terms, we only allow pairs (i, j) that are disjoint (i & j = 0) and whose union equals k (i | j = k). This is equivalent to summing over T ⊆ S.

Rank-Restricted Zeta

F_r[S] = Σ_{T⊆S} f_r[T], where f_r[T] = A[T] if |T| = r and 0 otherwise

Explanation: We group values by subset size r and zeta-transform each layer over subsets. This prepares for polynomial-like multiplication in the size dimension.

Convolution in Popcount

H_r[S] = Σ_{p=0}^{r} F_p[S] · G_{r-p}[S]

Explanation: For each fixed S, the vectors (F_0[S], ..., F_n[S]) and (G_0[S], ..., G_n[S]) act as coefficients of polynomials in the rank variable; their product's degree-r coefficient is H_r[S].

Subset Möbius Inversion

h_r[S] = Σ_{T⊆S} (-1)^{|S\T|} H_r[T]

Explanation: This inverts the subset zeta transform. After accumulating over all submasks, we use this formula (implemented iteratively) to recover exact, non-accumulated values.

Final Extraction

C[S] = h_{|S|}[S]

Explanation: After inverse-transforming the H-layers, the answer for mask S lives in the layer equal to its popcount r = |S|.

Layer Sizes

Σ_{r=0}^{n} C(n, r) = 2^n

Explanation: Masks are partitioned by popcount into n+1 layers whose sizes sum to 2^n. This underlies the layered algorithm and its memory footprint.

Time and Space Complexity

Time: O(n^2 2^n), Space: O(n 2^n)

Explanation: One n factor comes from iterating bits in the zeta/Möbius transforms, and another from convolving across ranks. Space stores n+1 layers of size 2^n.

Complexity Analysis

Let N = 2^n be the number of masks. The layered subset convolution stores arrays F_r and G_r for r = 0..n, each of length N, so the space is O(nN) = O(n 2^n). The forward subset zeta transform for one layer costs O(nN) via the standard bit-iteration DP (for each bit b, update states with that bit set). We perform this transform for all n+1 layers of f and g, giving O(n · nN) = O(n^2 2^n). Next, for each mask S, we convolve along the rank dimension: for each r in [0..n], compute H_r[S] = Σ_{p=0}^{r} F_p[S] · G_{r-p}[S]. This is O(n^2) per S, yielding O(n^2 N) = O(n^2 2^n). Finally, we apply the inverse subset transform (Möbius inversion) to each layer, again O(n^2 2^n). Thus, overall time is O(n^2 2^n), dominated by the layered transforms and rank-wise convolutions; hidden constants are modest since the inner loops are simple adds and multiplies. Space-wise, we need three (n+1) × N buffers (for F, G, H), which can be reduced by reusing memory and doing transforms in place, but the asymptotic remains O(n 2^n). In contrast, a naive DP that for every S iterates over all T ⊆ S costs Σ_S 2^{|S|} = 3^n, which is infeasible for n ≳ 20. More sophisticated subset-convolution variants with better bounds have been studied, but they are substantially more complex to implement and typically unnecessary for constraints encountered in most competitive programming problems.

Code Examples

Core O(n^2 2^n) Subset Convolution with Modulo
#include <bits/stdc++.h>
using namespace std;

// Computes C[S] = sum_{T subset S} A[T] * B[S \ T] (subset convolution)
// Complexity: O(n^2 * 2^n) time, O(n * 2^n) space
// MOD must fit in 64-bit and be positive; operations are done modulo MOD.

static inline int popcnt_u32(unsigned x) { return __builtin_popcount(x); }

vector<long long> subset_convolution(const vector<long long>& A,
                                     const vector<long long>& B,
                                     int n,
                                     long long MOD) {
    const int N = 1 << n;
    // Layered buffers: index [r][S] flattened as r*N + S for cache efficiency
    vector<long long> F((n + 1) * N, 0), G((n + 1) * N, 0), H((n + 1) * N, 0);

    // Initialize layers by popcount
    for (int S = 0; S < N; ++S) {
        int r = popcnt_u32((unsigned)S);
        F[r * N + S] = (A[S] % MOD + MOD) % MOD;
        G[r * N + S] = (B[S] % MOD + MOD) % MOD;
    }

    // Subset zeta transform over subsets for each rank layer
    auto zeta_subsets = [&](vector<long long>& X) {
        for (int b = 0; b < n; ++b) {
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = 0; S < N; ++S) {
                    if (S & (1 << b)) {
                        long long v = base[S - (1 << b)];
                        long long& cur = base[S];
                        cur += v;
                        if (cur >= MOD) cur -= MOD;
                    }
                }
            }
        }
    };

    zeta_subsets(F);
    zeta_subsets(G);

    // Convolution in the popcount dimension: H_r[S] = sum_{p=0..r} F_p[S] * G_{r-p}[S]
    for (int S = 0; S < N; ++S) {
        for (int r = 0; r <= n; ++r) {
            __int128 acc = 0; // avoid overflow before taking the modulo
            for (int p = 0; p <= r; ++p) {
                long long fp = F[p * N + S];
                long long gq = G[(r - p) * N + S];
                acc += (__int128)fp * gq;
            }
            H[r * N + S] = (long long)(acc % MOD);
        }
    }

    // Inverse subset zeta (Möbius inversion) over subsets for each rank layer
    auto mobius_subsets = [&](vector<long long>& X) {
        for (int b = 0; b < n; ++b) {
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = 0; S < N; ++S) {
                    if (S & (1 << b)) {
                        long long& cur = base[S];
                        long long v = base[S - (1 << b)];
                        cur -= v;
                        if (cur < 0) cur += MOD;
                    }
                }
            }
        }
    };

    mobius_subsets(H);

    // Extract final answer: C[S] lives in H[|S|][S]
    vector<long long> C(N, 0);
    for (int S = 0; S < N; ++S) {
        int r = popcnt_u32((unsigned)S);
        C[S] = H[r * N + S];
    }
    return C;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Input: n, then arrays A and B of length 2^n
    int n; long long MOD = 1000000007LL;
    if (!(cin >> n)) return 0;
    int N = 1 << n;
    vector<long long> A(N), B(N);
    for (int i = 0; i < N; ++i) cin >> A[i];
    for (int i = 0; i < N; ++i) cin >> B[i];

    auto C = subset_convolution(A, B, n, MOD);

    for (int i = 0; i < N; ++i) {
        if (i) cout << ' ';
        cout << C[i];
    }
    cout << '\n';
    return 0;
}

This program implements the classic layered subset convolution. It groups A and B by popcount into (n+1) layers, applies the subset zeta transform per layer, multiplies the layers per mask as a polynomial in the rank dimension, and finally applies Möbius inversion per layer. The result C[S] equals Σ_{T⊆S} A[T]·B[S\T] modulo MOD.

Time: O(n^2 2^n), Space: O(n 2^n)
Partition DP via Repeated Subset Convolution (Ordered Partitions)
#include <bits/stdc++.h>
using namespace std;

// We model a DP counting ordered partitions of a set S into k nonempty blocks.
// Let w[T] be the weight of choosing block T (w[0] = 0). Then
//   F_0[empty] = 1, F_0[S != empty] = 0, and for k >= 1: F_k = w (*) F_{k-1},
// where (*) is subset convolution: (w (*) F_{k-1})[S] = sum_{T subset S} w[T] * F_{k-1}[S\T].
// Summing F_k over k = 0..n gives ordered-partition counts when w[T] = 1 for |T| > 0.

static inline int popcnt_u32(unsigned x) { return __builtin_popcount(x); }

vector<long long> subset_convolution(const vector<long long>& A,
                                     const vector<long long>& B,
                                     int n,
                                     long long MOD) {
    const int N = 1 << n;
    vector<long long> F((n + 1) * N, 0), G((n + 1) * N, 0), H((n + 1) * N, 0);
    for (int S = 0; S < N; ++S) {
        int r = popcnt_u32((unsigned)S);
        F[r * N + S] = (A[S] % MOD + MOD) % MOD;
        G[r * N + S] = (B[S] % MOD + MOD) % MOD;
    }
    auto zeta = [&](vector<long long>& X) {
        for (int b = 0; b < n; ++b)
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = 0; S < N; ++S) if (S & (1 << b)) {
                    long long v = base[S - (1 << b)];
                    long long& cur = base[S];
                    cur += v; if (cur >= MOD) cur -= MOD;
                }
            }
    };
    zeta(F); zeta(G);

    for (int S = 0; S < N; ++S)
        for (int r = 0; r <= n; ++r) {
            __int128 acc = 0;
            for (int p = 0; p <= r; ++p) acc += (__int128)F[p * N + S] * G[(r - p) * N + S];
            H[r * N + S] = (long long)(acc % MOD);
        }

    auto mobius = [&](vector<long long>& X) {
        for (int b = 0; b < n; ++b)
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = 0; S < N; ++S) if (S & (1 << b)) {
                    long long& cur = base[S];
                    long long v = base[S - (1 << b)];
                    cur -= v; if (cur < 0) cur += MOD;
                }
            }
    };
    mobius(H);

    vector<long long> C(N);
    for (int S = 0; S < N; ++S) C[S] = H[popcnt_u32((unsigned)S) * N + S];
    return C;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 3; // ground set size
    long long MOD = 1'000'000'007LL;
    int N = 1 << n;

    // Define w[T] = 1 for nonempty T, 0 for empty T
    vector<long long> w(N, 0);
    for (int S = 1; S < N; ++S) w[S] = 1;

    // F_0: only the empty set has one way (choose no blocks)
    vector<long long> F_prev(N, 0); F_prev[0] = 1;

    // Compute F_k for k = 1..n via repeated subset convolution
    vector<long long> total(N, 0); // sum over k of F_k (ordered partition counts when w = 1)
    for (int k = 1; k <= n; ++k) {
        vector<long long> Fk = subset_convolution(w, F_prev, n, MOD);
        for (int S = 0; S < N; ++S) {
            total[S] += Fk[S];
            if (total[S] >= MOD) total[S] -= MOD;
        }
        F_prev.swap(Fk);
    }

    // For nonempty S with |S| = m, total[S] equals the number of ordered
    // partitions of an m-set. For m = 3, this is 13 (the ordered Bell number).
    for (int S = 0; S < N; ++S) {
        cout << "S=" << S << " popcount=" << __builtin_popcount((unsigned)S)
             << " ordered_partitions=" << total[S] << '\n';
    }
    return 0;
}

This demonstrates how subset convolution naturally powers a partition DP. With w[T] = 1 for nonempty T, F_k = w (*) F_{k-1} counts ordered partitions into exactly k blocks; summing over k yields the ordered Bell numbers per subset size. The same pattern extends to weighted blocks (use any w[T]) and to bounded k.

Time: O(n^3 2^n) for k up to n (n convolutions, each costing O(n^2 2^n)), Space: O(n 2^n)
Memory-Optimized Layered Implementation (Flat Buffers + Precomputed Popcount)
#include <bits/stdc++.h>
using namespace std;

// Flat, cache-friendly buffers and precomputed popcounts to reduce overhead.
// Same asymptotics, better constants in practice.

struct SubsetConvolver {
    int n, N;
    long long MOD;
    vector<int> pc;            // popcount for 0..N-1
    vector<long long> F, G, H; // (n+1) * N each

    SubsetConvolver(int n_, long long MOD_) : n(n_), N(1 << n_), MOD(MOD_) {
        pc.resize(N);
        for (int i = 0; i < N; ++i) pc[i] = __builtin_popcount((unsigned)i);
        F.assign((n + 1) * N, 0);
        G.assign((n + 1) * N, 0);
        H.assign((n + 1) * N, 0);
    }

    inline long long& at(vector<long long>& X, int r, int S) { return X[r * N + S]; }

    void zeta(vector<long long>& X) {
        for (int b = 0; b < n; ++b) {
            const int step = 1 << b;
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = step; S < N; ++S) {
                    if (S & step) {
                        long long add = base[S - step];
                        long long& cur = base[S];
                        cur += add; if (cur >= MOD) cur -= MOD;
                    }
                }
            }
        }
    }

    void mobius(vector<long long>& X) {
        for (int b = 0; b < n; ++b) {
            const int step = 1 << b;
            for (int r = 0; r <= n; ++r) {
                long long* base = X.data() + r * N;
                for (int S = step; S < N; ++S) {
                    if (S & step) {
                        long long& cur = base[S];
                        long long sub = base[S - step];
                        cur -= sub; if (cur < 0) cur += MOD;
                    }
                }
            }
        }
    }

    vector<long long> convolve(const vector<long long>& A, const vector<long long>& B) {
        fill(F.begin(), F.end(), 0);
        fill(G.begin(), G.end(), 0);
        fill(H.begin(), H.end(), 0);

        for (int S = 0; S < N; ++S) {
            int r = pc[S];
            at(F, r, S) = (A[S] % MOD + MOD) % MOD;
            at(G, r, S) = (B[S] % MOD + MOD) % MOD;
        }

        zeta(F); zeta(G);

        // Loop order chosen so each inner pass streams three flat arrays,
        // improving memory locality.
        for (int r = 0; r <= n; ++r) {
            for (int p = 0; p <= r; ++p) {
                long long* Hr = H.data() + r * N;
                long long* Fp = F.data() + p * N;
                long long* Gq = G.data() + (r - p) * N;
                for (int S = 0; S < N; ++S) {
                    Hr[S] = (long long)((Hr[S] + (__int128)Fp[S] * Gq[S]) % MOD);
                }
            }
        }

        mobius(H);

        vector<long long> C(N);
        for (int S = 0; S < N; ++S) C[S] = at(H, pc[S], S);
        return C;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 4; long long MOD = 998244353LL; // small n for a quick demo
    int N = 1 << n;
    vector<long long> A(N), B(N);

    // Example: random small values
    mt19937_64 rng(712367);
    uniform_int_distribution<long long> dist(0, MOD - 1);
    for (int i = 0; i < N; ++i) { A[i] = dist(rng); B[i] = dist(rng); }

    SubsetConvolver conv(n, MOD);
    vector<long long> C = conv.convolve(A, B);

    // Print the first few values
    for (int i = 0; i < min(N, 16); ++i) {
        cout << i << ": " << C[i] << '\n';
    }
    return 0;
}

This is the same O(n^2 2^n) algorithm but written with a flat-memory layout and fused loops to improve cache locality and reduce loop overhead. This version is more likely to pass tighter time limits at n≈22.

Time: O(n^2 2^n), Space: O(n 2^n)