⚙️AlgorithmAdvanced

NTT (Number Theoretic Transform)

Key Points

•
The Number Theoretic Transform (NTT) is an FFT-like algorithm that performs discrete convolutions exactly using modular arithmetic instead of floating-point numbers.
•
NTT requires a prime modulus p with a large power-of-two factor in p−1 so that high-order roots of unity exist (e.g., 998244353 = 119×2^{23}+1 with primitive root g=3).
•
Convolution via NTT runs in O(n $lo g$ n) time and avoids rounding errors, making it ideal for competitive programming and exact polynomial operations.
•
NTT computes transforms using butterfly operations, bit-reversal permutations, and modular powers of a primitive root of unity.
•
To multiply under a non-NTT-friendly modulus (like 10^9+7) or with large integers, combine results from multiple NTT-friendly primes using the Chinese Remainder Theorem (CRT).
•
Always pad vectors to a power-of-two size at least the sum of lengths minus one to match the needed convolution length.
•
Remember to multiply by the modular inverse of n when doing the inverse transform to correctly scale results.
•
Careful implementation details—like precomputing roots, handling negatives modulo p, and avoiding 32-bit overflow—are crucial for correctness and speed.

Prerequisites

→Discrete Fourier Transform and FFT — NTT mirrors FFT structure; understanding DFT/FFT illuminates why convolution becomes pointwise multiplication.
→Modular Arithmetic — NTT operates entirely modulo a prime; you must be comfortable with modular addition, multiplication, and reduction.
→Fast Modular Exponentiation — Computing powers of primitive roots and modular inverses via Fermat’s little theorem is core to NTT.
→Primitive Roots and Orders — Existence of n-th roots of unity in Z_p requires that n divides p−1 and knowledge of primitive roots.
→Chinese Remainder Theorem (CRT) — Combining results from multiple NTT primes to reach arbitrary moduli or exact integers depends on CRT/Garner.
→Asymptotic Complexity — Understanding O(n \log n) informs when NTT beats naive O(n^2) multiplication.
→64-bit Arithmetic and Overflow — NTT and CRT multiplications must use 64-bit or 128-bit intermediates to avoid overflow.

Detailed Explanation

Tap terms for definitions

01Overview

The Number Theoretic Transform (NTT) is the integer-arithmetic twin of the Fast Fourier Transform (FFT). Where the FFT evaluates a polynomial at complex roots of unity and risks floating-point rounding errors, the NTT evaluates at modular roots of unity inside a finite field (integers modulo a prime p). This allows exact computations with integers, which is essential in competitive programming and cryptographic protocols. The most common practical use is fast polynomial (or integer) multiplication via convolution: given sequences a and b, NTT computes their convolution c where c[k] = sum_i a[i] * b[k-i]. Like FFT, it achieves O(n log n) performance by recursively (or iteratively) structuring the calculations into butterfly steps and exploiting roots of unity.

However, NTT requires special primes p such that p−1 is divisible by a large power of two: this ensures that high-order roots of unity exist. A canonical example is p = 998244353 with primitive root g = 3; it supports length up to 2^{23} transforms. When the target modulus (e.g., 10^9+7) is not NTT-friendly, one can perform NTT under multiple friendly moduli and then reconstruct the exact integer result via the Chinese Remainder Theorem (CRT), finally reducing modulo the target. This approach retains exactness while keeping the asymptotic complexity.

In practice, NTT implementations hinge on efficient modular exponentiation, precomputation of twiddle factors (powers of primitive roots), in-place bit-reversal permutation, and careful scaling in the inverse transform. With these, NTT provides fast, deterministic, and exact convolution for large inputs.

02Intuition & Analogies

Imagine you want to know how two sound waves combine. The FFT lets you switch to a frequency view, multiply the corresponding frequencies, and switch back—far faster than directly summing overlaps in time. NTT does the same trick but replaces continuous waves with exact, discrete arithmetic in a finite world where numbers wrap around modulo a prime p.

Think of evaluating a polynomial at special points that "cycle" perfectly after a fixed number of steps—these are roots of unity. In the real/complex world, you would use angles on the unit circle (cosines and sines). In the NTT world, you pick a special number g (a primitive root modulo p) whose powers loop back to 1 after p−1 steps. By choosing the right step size, you get points that loop every n steps—these are the n-th roots of unity modulo p.

Why is this powerful? Because evaluating a polynomial at these structured points lets you turn convolution (which is like sliding one sequence across another and summing overlaps) into pointwise multiplication. After multiplying the values at those points, you transform back to get the exact convolution. No rounding, no floating errors, just pure modular arithmetic. The only caveat: your finite world must contain those special points. So you pick a prime p whose p−1 has a big power of 2 factor, ensuring many roots of unity exist. This is why 998244353 is beloved: it’s big enough and has 2^{23} dividing p−1, so you can handle sequences up to 8 million length with power-of-two padding. When your end goal is a different modulus (like 1e9+7) or even raw integer results, you can combine multiple NTT computations using CRT to stitch together the exact answer.

03Formal Definition

Let p be a prime and \(

Z_{p}

\) the finite field of integers modulo p. Suppose n divides p−1 so that \(

Z_{p}

\) contains a primitive n-th root of unity \(

ω

\) (i.e., \(

ω^{n}

\equiv

(mod p)

\) and \(

ω^{k}

\neq \equiv

1\) for 1 ≤

k < n

). Given a vector (polynomial coefficients) \(a = (

a_{0}

a_{1}

\dots

a_{n - 1}

)\) over \(

Z_{p}

\), the Number Theoretic Transform of length n is the mapping \[

A_{k}

\sum_{j = 0}^{n - 1}

a_{j}

ω^{jk}

(mod p)

k = 0

,1,

\dots

,n-1. \] The inverse transform uses \(

ω^{- 1}

\) and division by n (multiplication by \(

n^{- 1}

mod

p\)): \[

a_{j} = n^{- 1}

\sum_{k = 0}^{n - 1}

A_{k}

ω^{- jk}

(mod p)

. \] Polynomial convolution follows from the convolution theorem: if \(

c = a * b\

) denotes circular convolution of length n, then \(

NTT

NTT

(a)

⊙

NTT

(b)\), where \(

⊙

\) is componentwise multiplication. For linear convolution of two sequences of lengths

n_{a}

and

n_{b}

, pad both to length \(n

\geq

n_{a}

n_{b}

- 1\) that divides p−1 (typically a power of two) before applying NTT. In implementations, an iterative Cooley–Tukey style algorithm with bit-reversal ordering applies successive butterfly layers of size 2,4,8,...,n, multiplying by appropriate twiddle factors \(

ω^{\cdot}

\). Correctness requires that n divides p−1, and the chosen generator g produces \(

ω

g^{(p - 1) / n}

\) of exact order n.

04When to Use

Use NTT when you need exact integer polynomial or sequence convolution at large scale: multiplying polynomials, computing power series, combinatorics (e.g., counting paths, subset convolutions when adapted), and big integer multiplication. It shines when rounding errors from FFT would be problematic, such as in competitive programming problems that demand exact answers or modular arithmetic.

Convolution under an NTT-friendly modulus: If your target modulus is 998244353, 1004535809, 167772161, etc., NTT is a direct drop-in with minimal overhead.
Convolution under arbitrary prime/composite modulus (e.g., 10^9+7): Perform multiple NTTs under different NTT-friendly primes and reconstruct via CRT (or Garner) modulo your target.
Big integer multiplication: Split the numbers into base-B digits, convolve using NTT under one or more primes, and carry-compact to obtain exact integers.
Power series operations: Inversion, logarithm, exponentiation, and division of polynomials over (\mathbb{Z}_p) rely heavily on fast convolution.

Avoid NTT when sequences are very small (naive O(n^2) might be faster) or when you cannot ensure that required transform lengths divide p−1. For non power-of-two lengths, you may either pad to the next power of two or use Bluestein’s algorithm variant, though padding is typically simpler and faster under common NTT primes.

⚠️Common Mistakes

Using a modulus without sufficient 2-adic length: If n does not divide p−1, the n-th roots of unity do not exist, and the transform will be incorrect. Always ensure the padded length n is a power of two that divides p−1.
Forgetting the inverse scaling: The inverse NTT must multiply by n^{-1} modulo p. Omitting this leaves results off by a factor of n.
Wrong primitive root or power: Picking (\omega = g^{(p-1)/n}) where g is not a primitive root—or where n does not exactly divide p−1—breaks correctness. Verify g is a primitive root and n | (p−1).
Integer overflow before reduction: Multiplying two 32-bit ints can overflow 32-bit. Use 64-bit (int64_t or __int128) temporaries before reducing modulo p.
Mishandling negatives: After subtraction in butterflies, values may become negative. Normalize by adding p before a modulo reduction to keep values in [0, p−1].
Incorrect padding length: For linear convolution of sizes n_a and n_b, pad to n ≥ n_a + n_b − 1, usually rounding up to the next power of two supported by p.
Mixing moduli in CRT improperly: When reconstructing under a target modulus, ensure moduli are pairwise coprime, precompute modular inverses correctly, and use 128-bit intermediates to avoid overflow.
Reusing precomputed roots across different n incorrectly: Precomputed twiddle tables depend on the stage length. Either precompute per power-of-two or compute wlen freshly at each stage.

Key Formulas

NTT (forward)

A_{k} = j = 0 \sum n - 1 a_{j} ω^{jk} (mod p)

Explanation: This is the discrete transform over a finite field: evaluate the polynomial at powers of a primitive n-th root of unity modulo p. It maps time-domain coefficients to frequency-domain values.

Inverse NTT

a_{j} = n^{- 1} k = 0 \sum n - 1 A_{k} ω^{- jk} (mod p)

Explanation: To invert, use the inverse root of unity and multiply by the modular inverse of n. This perfectly recovers the original coefficients.

Linear convolution

c_{k} = i = 0 \sum k a_{i} b_{k - i}

Explanation: The k-th coefficient of the product polynomial equals the sum over all index pairs adding to k. Computing this naively is O( $n^{2}$ ), which NTT reduces to O(n log n).

Primitive n-th root of unity

ω = g^{\frac{p - 1}{n}} (mod p), ω^{n} \equiv 1, ω^{k} \neq \equiv 1 (1 \leq k < n)

Explanation: Given a primitive root g modulo p and a transform length n dividing p−1, this constructs an element of exact order n. Its powers are the evaluation points of the NTT.

Time Complexity

T (n) = O (n lo g n)

Explanation: The NTT, like FFT, uses divide-and-conquer butterfly stages. There are log n stages, each doing O(n) work, leading to O(n log n) total operations.

CRT for two moduli

x \equiv r_{0} (mod m_{0}), x \equiv r_{1} (mod m_{1}) \Rightarrow x = r_{0} + m_{0} \cdot ((r_{1} - r_{0}) \cdot m_{0}^{- 1} mod m_{1})

Explanation: Combines residues r0 and r1 modulo coprime m0 and m1 into a unique solution modulo m0·m1. This is used to merge NTT results from two primes.

Garner lifting step

x_{k + 1} = x_{k} + M_{k} \cdot t_{k}, t_{k} \equiv (r_{k + 1} - x_{k}) \cdot M_{k}^{- 1} mod m_{k + 1}

Explanation: Garner's algorithm incrementally reconstructs a residue modulo a target without forming huge products. $M_{k}$ is the current combined modulus product.

Padding length

n_{p a d} = 2^{⌈ l o g_{2} (n_{a} + n_{b} - 1)⌉}

Explanation: For linear convolution of sizes $n_{a}$ and $n_{b}$ , pad to the next power of two at least as large as $n_{a}$ + $n_{b}$ −1. This ensures the transform length is valid.

Fermat inverse

a^{- 1} \equiv a^{p - 2} (mod p)

Explanation: For prime p and nonzero a, the modular inverse can be computed via fast exponentiation using Fermat’s little theorem.

Complexity Analysis

An NTT of length n performs O(n log n) modular additions and multiplications. In an iterative Cooley–Tukey implementation, there are lo

g_{2}

n stages. At stage s (butterfly size

l e n = 2^{s}

), we process n elements in pairs, doing O(1) modular operations per pair, for O(n) work per stage and O(n log n) total. The constant factors include modular multiplications by precomputed twiddle factors and bit-reversal permutations, both linear in n. Precomputing twiddle tables for all stages also costs O(n) time and space, though many implementations compute a wlen per stage using fast exponentiation of cost O(log p) per stage; amortized, precomputation is faster. Space complexity is O(n) beyond the input and output arrays: in-place transforms store intermediate results in the same array, plus O(n) for twiddle tables and O(n) for the bit-reversal index array if cached. If temporary buffers are reused across calls, overall auxiliary memory remains linear. For convolution, padding two vectors to length N = 2^{

⌈

lo g_{2}

(

n_{a}

n_{b}

-1)

⌉

} yields two forward transforms O(N log N), one pointwise multiplication O(N), and one inverse transform O(N log N). Thus, total time is O(N log N) and space O(N). When using multiple NTT primes and CRT (k primes), the time multiplies by k for the transforms plus linear-time reconstruction, producing O(k N log N) time and O(k N) space (if transforms are not reused). In practice, two or three primes suffice to reconstruct 64-bit exact results. Using 64-bit or 128-bit intermediates avoids overflow in modular multiplications and CRT recombination.

Code Examples

Basic NTT-based convolution modulo 998244353

1 #include <bits/stdc++.h>
2 using namespace std;
3 
4 // NTT under modulus 998244353 with primitive root g = 3
5 // Supports power-of-two lengths up to 2^23.
6 
7 static const uint32_t MOD = 998244353; // 119 * 2^23 + 1
8 static const uint32_t G   = 3;         // primitive root modulo MOD
9 
10 uint32_t addmod(uint32_t a, uint32_t b) {
11     uint32_t s = a + b;
12     if (s >= MOD) s -= MOD;
13     return s;
14 }
15 uint32_t submod(uint32_t a, uint32_t b) {
16     return a >= b ? a - b : a + MOD - b;
17 }
18 uint32_t powmod(uint32_t a, uint32_t e) {
19     uint64_t r = 1, x = a;
20     while (e) {
21         if (e & 1) r = (r * x) % MOD;
22         x = (x * x) % MOD;
23         e >>= 1;
24     }
25     return (uint32_t)r;
26 }
27 
28 void ntt(vector<uint32_t> &a, bool invert) {
29     int n = (int)a.size();
30     // Bit-reversal permutation
31     for (int i = 1, j = 0; i < n; i++) {
32         int bit = n >> 1;
33         for (; j & bit; bit >>= 1) j ^= bit;
34         j ^= bit;
35         if (i < j) swap(a[i], a[j]);
36     }
37 
38     for (int len = 2; len <= n; len <<= 1) {
39         // wlen = g^{(MOD-1)/len}
40         uint32_t wlen = powmod(G, (MOD - 1) / len);
41         if (invert) {
42             // use inverse root for inverse NTT
43             wlen = powmod(wlen, MOD - 2);
44         }
45         for (int i = 0; i < n; i += len) {
46             uint64_t w = 1;
47             for (int j = 0; j < len / 2; j++) {
48                 uint32_t u = a[i + j];
49                 uint32_t v = (uint32_t)((w * a[i + j + len / 2]) % MOD);
50                 a[i + j] = addmod(u, v);
51                 a[i + j + len / 2] = submod(u, v);
52                 w = (w * wlen) % MOD;
53             }
54         }
55     }
56 
57     if (invert) {
58         uint32_t inv_n = powmod(n, MOD - 2);
59         for (int i = 0; i < n; i++) {
60             a[i] = (uint32_t)((uint64_t)a[i] * inv_n % MOD);
61         }
62     }
63 }
64 
65 vector<uint32_t> convolution_mod(const vector<uint32_t>& A, const vector<uint32_t>& B) {
66     if (A.empty() || B.empty()) return {};
67     int nA = (int)A.size(), nB = (int)B.size();
68     int need = nA + nB - 1;
69     int n = 1; while (n < need) n <<= 1;
70     vector<uint32_t> fa(A.begin(), A.end()), fb(B.begin(), B.end());
71     fa.resize(n); fb.resize(n);
72 
73     ntt(fa, false);
74     ntt(fb, false);
75     for (int i = 0; i < n; i++) fa[i] = (uint32_t)((uint64_t)fa[i] * fb[i] % MOD);
76     ntt(fa, true);
77 
78     fa.resize(need);
79     return fa;
80 }
81 
82 int main() {
83     ios::sync_with_stdio(false);
84     cin.tie(nullptr);
85 
86     // Example: multiply polynomials (2 + 3x + 4x^2) * (1 + 2x)
87     vector<uint32_t> A = {2, 3, 4};
88     vector<uint32_t> B = {1, 2};
89 
90     vector<uint32_t> C = convolution_mod(A, B);
91 
92     // Expected: [2*1, 2*2 + 3*1, 3*2 + 4*1, 4*2] = [2, 7, 10, 8]
93     for (size_t i = 0; i < C.size(); i++) {
94         cout << C[i] << (i + 1 == C.size() ? '\n' : ' ');
95     }
96     return 0;
97 }
98

Implements an iterative Cooley–Tukey NTT under 998244353 with primitive root 3. The convolution pads inputs to power-of-two length, performs two forward transforms, multiplies pointwise, and applies the inverse transform with scaling by n^{-1}.

Time: O(n log n)Space: O(n)

Convolution modulo 1e9+7 using three NTT primes and CRT (Garner-style)

1 #include <bits/stdc++.h>
2 using namespace std;
3 
4 // We will convolve two sequences and return results modulo 1e9+7 (not NTT-friendly)
5 // by combining three NTT-friendly primes via CRT.
6 
7 struct ModNTT {
8     uint32_t MOD, G;
9     ModNTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
10     inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
11     inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
12     uint32_t powmod(uint32_t a, uint32_t e) const {
13         uint64_t r=1, x=a;
14         while(e){ if(e&1) r=(r*x)%MOD; x=(x*x)%MOD; e>>=1; }
15         return (uint32_t)r;
16     }
17     void ntt(vector<uint32_t>& a, bool inv) const {
18         int n=(int)a.size();
19         for(int i=1,j=0;i<n;i++){
20             int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
21         }
22         for(int len=2; len<=n; len<<=1){
23             uint32_t wlen = powmod(G, (MOD-1)/len);
24             if(inv) wlen = powmod(wlen, MOD-2);
25             for(int i=0;i<n;i+=len){
26                 uint64_t w=1;
27                 for(int j=0;j<len/2;j++){
28                     uint32_t u=a[i+j];
29                     uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
30                     a[i+j] = add(u,v);
31                     a[i+j+len/2] = sub(u,v);
32                     w = (w*wlen)%MOD;
33                 }
34             }
35         }
36         if(inv){
37             uint32_t inv_n = powmod(n, MOD-2);
38             for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
39         }
40     }
41 
42     vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43         if(A.empty() || B.empty()) return {};
44         int need = (int)A.size() + (int)B.size() - 1;
45         int n=1; while(n<need) n<<=1;
46         vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47         fa.resize(n); fb.resize(n);
48         ntt(fa,false); ntt(fb,false);
49         for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50         ntt(fa,true);
51         fa.resize(need);
52         return fa;
53     }
54 };
55 
56 static const uint32_t MOD_TARGET = 1000000007u; // 1e9+7
57 
58 // Three NTT-friendly primes (all have primitive root 3)
59 static const uint32_t P[3] = {998244353u, 1004535809u, 167772161u};
60 static const uint32_t G[3] = {3u, 3u, 3u};
61 
62 // Modular exponentiation for arbitrary modulus (used to compute inverses mod primes)
63 uint64_t mod_pow64(uint64_t a, uint64_t e, uint64_t m){
64     __int128 r=1, x=a % m;
65     while(e){ if(e&1) r=(r*x)%m; x=(x*x)%m; e>>=1; }
66     return (uint64_t)r;
67 }
68 
69 // Reconstruct a single coefficient modulo MOD_TARGET using Garner-like steps
70 uint32_t crt_to_target(uint32_t r0, uint32_t r1, uint32_t r2){
71     // x ≡ r0 (mod P0)
72     uint64_t x0 = r0 % MOD_TARGET;
73 
74     // Lift to include P1
75     uint64_t m0 = P[0];
76     uint64_t t1 = r1 >= r0 ? (r1 - r0) : (uint64_t)r1 + P[1] - r0; // in [0, P1)
77     uint64_t inv_m0_mod_p1 = mod_pow64(m0 % P[1], P[1]-2, P[1]);
78     t1 = (t1 * inv_m0_mod_p1) % P[1];
79     __int128 M0_mod_T = m0 % MOD_TARGET;
80     x0 = (uint64_t)(((__int128)x0 + (__int128)t1 * M0_mod_T) % MOD_TARGET);
81 
82     // Lift to include P2
83     uint64_t m01_mod_p2 = (uint64_t)(((__int128)P[0] % P[2]) * (P[1] % P[2]) % P[2]);
84     uint64_t t2_rhs;
85     // Compute current x modulo P2 without forming huge product: x = r0 + m0*t1 (mod P2)
86     uint64_t r0_plus = (r0 + (uint64_t)(((__int128)(P[0]%P[2]) * (t1 % P[2])) % P[2])) % P[2];
87     t2_rhs = r2 >= r0_plus ? (r2 - r0_plus) : (uint64_t)r2 + P[2] - r0_plus;
88     uint64_t inv_m01_mod_p2 = mod_pow64(m01_mod_p2, P[2]-2, P[2]);
89     uint64_t t2 = (t2_rhs * inv_m01_mod_p2) % P[2];
90 
91     // Update x0 modulo target: multiply by (P0*P1) mod MOD_TARGET
92     uint64_t M01_mod_T = (uint64_t)(((__int128)(P[0]%MOD_TARGET) * (P[1]%MOD_TARGET)) % MOD_TARGET);
93     x0 = (uint64_t)(((__int128)x0 + (__int128)t2 * M01_mod_T) % MOD_TARGET);
94     return (uint32_t)x0;
95 }
96 
97 vector<uint32_t> convolution_mod_1e9p7(const vector<uint32_t>& A, const vector<uint32_t>& B){
98     ModNTT N0(P[0], G[0]);
99     ModNTT N1(P[1], G[1]);
100     ModNTT N2(P[2], G[2]);
101     vector<uint32_t> c0 = N0.convolution(A,B);
102     vector<uint32_t> c1 = N1.convolution(A,B);
103     vector<uint32_t> c2 = N2.convolution(A,B);
104     size_t n = c0.size();
105     vector<uint32_t> res(n);
106     for(size_t i=0;i<n;i++) res[i] = crt_to_target(c0[i], c1[i], c2[i]);
107     return res;
108 }
109 
110 int main(){
111     ios::sync_with_stdio(false);
112     cin.tie(nullptr);
113 
114     // Example: multiply under 1e9+7 target modulus
115     vector<uint32_t> A = {1000000000u, 2u, 3u}; // large values mod 1e9+7 are okay as residues
116     vector<uint32_t> B = {7u, 8u, 9u, 10u};
117     // Reduce inputs modulo target? Not necessary; we want exact integer conv then mod 1e9+7.
118     // Values are treated as integers less than primes; safe as we use 64-bit temporaries.
119 
120     vector<uint32_t> C = convolution_mod_1e9p7(A, B);
121     for (size_t i = 0; i < C.size(); i++) cout << C[i] << (i+1==C.size()?'\n':' ');
122     return 0;
123 }
124

Computes convolution modulo 1e9+7 by performing three NTT convolutions under different primes, then reconstructing each coefficient modulo 1e9+7 using a Garner-style CRT combination. It avoids large intermediate products by reducing at each step and uses 128-bit temporaries for safety.

Time: O(k n log n) with k=3Space: O(k n)

Big integer multiplication using two NTT primes and CRT (base 10^4)

1 #include <bits/stdc++.h>
2 using namespace std;
3 
4 // Multiply big nonnegative integers given as decimal strings using NTT + CRT.
5 // We split digits in base B=10^4, convolve, then carry-compact to get the result.
6 // We use two NTT-friendly primes whose product fits in 64 bits for exact reconstruction.
7 
8 struct NTT {
9     uint32_t MOD, G;
10     NTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
11     inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
12     inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
13     uint32_t powmod(uint32_t a, uint32_t e) const {
14         uint64_t r=1, x=a;
15         while(e){ if(e&1) r=(r*x)%MOD; x=(x*x)%MOD; e>>=1; }
16         return (uint32_t)r;
17     }
18     void ntt(vector<uint32_t>& a, bool inv) const {
19         int n=(int)a.size();
20         for(int i=1,j=0;i<n;i++){
21             int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
22         }
23         for(int len=2; len<=n; len<<=1){
24             uint32_t wlen = powmod(G, (MOD-1)/len);
25             if(inv) wlen = powmod(wlen, MOD-2);
26             for(int i=0;i<n;i+=len){
27                 uint64_t w=1;
28                 for(int j=0;j<len/2;j++){
29                     uint32_t u=a[i+j];
30                     uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
31                     a[i+j] = add(u,v);
32                     a[i+j+len/2] = sub(u,v);
33                     w = (w*wlen)%MOD;
34                 }
35             }
36         }
37         if(inv){
38             uint32_t inv_n = powmod(n, MOD-2);
39             for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
40         }
41     }
42     vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43         if(A.empty() || B.empty()) return {};
44         int need = (int)A.size() + (int)B.size() - 1;
45         int n=1; while(n<need) n<<=1;
46         vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47         fa.resize(n); fb.resize(n);
48         ntt(fa,false); ntt(fb,false);
49         for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50         ntt(fa,true);
51         fa.resize(need);
52         return fa;
53     }
54 };
55 
56 // Two primes (primitive root 3 for both)
57 static const uint32_t P0 = 998244353u; // ~1e9
58 static const uint32_t P1 = 1004535809u; // ~1e9
59 static const uint32_t G0 = 3u, G1 = 3u;
60 
61 // Combine residues r0 (mod P0) and r1 (mod P1) into x in [0, P0*P1)
62 static inline unsigned long long crt2(uint32_t r0, uint32_t r1){
63     unsigned long long m0 = P0, m1 = P1;
64     // t ≡ (r1 - r0) * inv(m0, m1) (mod m1)
65     unsigned long long diff = (r1 >= r0) ? (r1 - r0) : (unsigned long long)r1 + m1 - r0;
66     // inv_m0_mod_m1 via Fermat since m1 is prime
67     auto modpow = [](unsigned long long a, unsigned long long e, unsigned long long m){ __int128 r=1,x=a%m; while(e){ if(e&1) r=(r*x)%m; x=(x*x)%m; e>>=1;} return (unsigned long long)r; };
68     unsigned long long inv_m0 = modpow(m0 % m1, m1 - 2, m1);
69     unsigned long long t = ( (__int128)diff * inv_m0 ) % m1;
70     unsigned long long x = (unsigned long long)r0 + (unsigned long long)((__int128)t * m0);
71     return x; // fits in < 1e18
72 }
73 
74 vector<int> to_base(const string& s, int B){
75     // Parse decimal string into base-B little-endian digits
76     vector<int> a; a.reserve((int)s.size()/4+2);
77     int n=(int)s.size();
78     int i=n;
79     while(i>0){
80         int l=max(0,i-4); // chunk 4 decimal digits
81         int val=0;
82         for(int k=l;k<i;k++) val = val*10 + (s[k]-'0');
83         a.push_back(val);
84         i=l;
85     }
86     // Now a holds base-10^4 digits little-endian
87     return a;
88 }
89 
90 string from_base(vector<unsigned long long>& a, int B){
91     // Remove leading zeros
92     while(!a.empty() && a.back()==0) a.pop_back();
93     if(a.empty()) return string("0");
94     // Convert back to decimal string
95     const int WIDTH = 4; // since B=10^4
96     stringstream ss;
97     ss << a.back();
98     char buf[8];
99     for(int i=(int)a.size()-2;i>=0;i--){
100         unsigned long long x = a[i];
101         // print with leading zeros to WIDTH digits
102         snprintf(buf, sizeof(buf), "%04llu", x);
103         ss << buf;
104     }
105     return ss.str();
106 }
107 
108 string multiply_big(const string& A, const string& B){
109     if (A=="0" || B=="0") return "0";
110     const int BASE = 10000; // 1e4
111 
112     vector<int> a = to_base(A, BASE);
113     vector<int> b = to_base(B, BASE);
114 
115     NTT N0(P0, G0), N1(P1, G1);
116     vector<uint32_t> a0(a.size()), b0(b.size()), a1(a.size()), b1(b.size());
117     for(size_t i=0;i<a.size();i++){ a0[i]=a[i]%P0; a1[i]=a[i]%P1; }
118     for(size_t i=0;i<b.size();i++){ b0[i]=b[i]%P0; b1[i]=b[i]%P1; }
119 
120     vector<uint32_t> c0 = N0.convolution(a0,b0);
121     vector<uint32_t> c1 = N1.convolution(a1,b1);
122 
123     size_t n = c0.size();
124     vector<unsigned long long> coeff(n);
125     for(size_t i=0;i<n;i++) coeff[i] = crt2(c0[i], c1[i]); // exact coefficient value (< P0*P1)
126 
127     // Carry handling in base BASE
128     unsigned long long carry=0;
129     for(size_t i=0;i<n;i++){
130         unsigned __int128 v = (unsigned __int128)coeff[i] + carry;
131         coeff[i] = (unsigned long long)(v % BASE);
132         carry = (unsigned long long)(v / BASE);
133     }
134     while(carry){ coeff.push_back(carry % BASE); carry /= BASE; }
135 
136     return from_base(coeff, BASE);
137 }
138 
139 int main(){
140     ios::sync_with_stdio(false);
141     cin.tie(nullptr);
142 
143     string A, B;
144     // Demo: multiply two large numbers
145     A = "123456789012345678901234567890";
146     B = "98765432109876543210";
147     string C = multiply_big(A, B);
148     cout << C << "\n";
149     return 0;
150 }
151

Parses decimal strings into base 10^4 digits, performs two NTT convolutions under 998244353 and 1004535809, combines coefficients exactly via CRT (product < 2^64), and then performs carry propagation to reconstruct the decimal result. This achieves exact big-integer multiplication in O(n log n).

Time: O(n log n)Space: O(n)

1	#include <bits/stdc++.h>
2	using namespace std;
3
4	// NTT under modulus 998244353 with primitive root g = 3
5	// Supports power-of-two lengths up to 2^23.
6
7	static const uint32_t MOD = 998244353; // 119 * 2^23 + 1
8	static const uint32_t G = 3; // primitive root modulo MOD
9
10	uint32_t addmod(uint32_t a, uint32_t b) {
11	uint32_t s = a + b;
12	if (s >= MOD) s -= MOD;
13	return s;
14	}
15	uint32_t submod(uint32_t a, uint32_t b) {
16	return a >= b ? a - b : a + MOD - b;
17	}
18	uint32_t powmod(uint32_t a, uint32_t e) {
19	uint64_t r = 1, x = a;
20	while (e) {
21	if (e & 1) r = (r * x) % MOD;
22	x = (x * x) % MOD;
23	e >>= 1;
24	}
25	return (uint32_t)r;
26	}
27
28	void ntt(vector<uint32_t> &a, bool invert) {
29	int n = (int)a.size();
30	// Bit-reversal permutation
31	for (int i = 1, j = 0; i < n; i++) {
32	int bit = n >> 1;
33	for (; j & bit; bit >>= 1) j ^= bit;
34	j ^= bit;
35	if (i < j) swap(a[i], a[j]);
36	}
37
38	for (int len = 2; len <= n; len <<= 1) {
39	// wlen = g^{(MOD-1)/len}
40	uint32_t wlen = powmod(G, (MOD - 1) / len);
41	if (invert) {
42	// use inverse root for inverse NTT
43	wlen = powmod(wlen, MOD - 2);
44	}
45	for (int i = 0; i < n; i += len) {
46	uint64_t w = 1;
47	for (int j = 0; j < len / 2; j++) {
48	uint32_t u = a[i + j];
49	uint32_t v = (uint32_t)((w * a[i + j + len / 2]) % MOD);
50	a[i + j] = addmod(u, v);
51	a[i + j + len / 2] = submod(u, v);
52	w = (w * wlen) % MOD;
53	}
54	}
55	}
56
57	if (invert) {
58	uint32_t inv_n = powmod(n, MOD - 2);
59	for (int i = 0; i < n; i++) {
60	a[i] = (uint32_t)((uint64_t)a[i] * inv_n % MOD);
61	}
62	}
63	}
64
65	vector<uint32_t> convolution_mod(const vector<uint32_t>& A, const vector<uint32_t>& B) {
66	if (A.empty() \|\| B.empty()) return {};
67	int nA = (int)A.size(), nB = (int)B.size();
68	int need = nA + nB - 1;
69	int n = 1; while (n < need) n <<= 1;
70	vector<uint32_t> fa(A.begin(), A.end()), fb(B.begin(), B.end());
71	fa.resize(n); fb.resize(n);
72
73	ntt(fa, false);
74	ntt(fb, false);
75	for (int i = 0; i < n; i++) fa[i] = (uint32_t)((uint64_t)fa[i] * fb[i] % MOD);
76	ntt(fa, true);
77
78	fa.resize(need);
79	return fa;
80	}
81
82	int main() {
83	ios::sync_with_stdio(false);
84	cin.tie(nullptr);
85
86	// Example: multiply polynomials (2 + 3x + 4x^2) * (1 + 2x)
87	vector<uint32_t> A = {2, 3, 4};
88	vector<uint32_t> B = {1, 2};
89
90	vector<uint32_t> C = convolution_mod(A, B);
91
92	// Expected: [21, 22 + 31, 32 + 41, 42] = [2, 7, 10, 8]
93	for (size_t i = 0; i < C.size(); i++) {
94	cout << C[i] << (i + 1 == C.size() ? '\n' : ' ');
95	}
96	return 0;
97	}
98

1	#include <bits/stdc++.h>
2	using namespace std;
3
4	// We will convolve two sequences and return results modulo 1e9+7 (not NTT-friendly)
5	// by combining three NTT-friendly primes via CRT.
6
7	struct ModNTT {
8	uint32_t MOD, G;
9	ModNTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
10	inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
11	inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
12	uint32_t powmod(uint32_t a, uint32_t e) const {
13	uint64_t r=1, x=a;
14	while(e){ if(e&1) r=(rx)%MOD; x=(xx)%MOD; e>>=1; }
15	return (uint32_t)r;
16	}
17	void ntt(vector<uint32_t>& a, bool inv) const {
18	int n=(int)a.size();
19	for(int i=1,j=0;i<n;i++){
20	int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
21	}
22	for(int len=2; len<=n; len<<=1){
23	uint32_t wlen = powmod(G, (MOD-1)/len);
24	if(inv) wlen = powmod(wlen, MOD-2);
25	for(int i=0;i<n;i+=len){
26	uint64_t w=1;
27	for(int j=0;j<len/2;j++){
28	uint32_t u=a[i+j];
29	uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
30	a[i+j] = add(u,v);
31	a[i+j+len/2] = sub(u,v);
32	w = (w*wlen)%MOD;
33	}
34	}
35	}
36	if(inv){
37	uint32_t inv_n = powmod(n, MOD-2);
38	for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
39	}
40	}
41
42	vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43	if(A.empty() \|\| B.empty()) return {};
44	int need = (int)A.size() + (int)B.size() - 1;
45	int n=1; while(n<need) n<<=1;
46	vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47	fa.resize(n); fb.resize(n);
48	ntt(fa,false); ntt(fb,false);
49	for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50	ntt(fa,true);
51	fa.resize(need);
52	return fa;
53	}
54	};
55
56	static const uint32_t MOD_TARGET = 1000000007u; // 1e9+7
57
58	// Three NTT-friendly primes (all have primitive root 3)
59	static const uint32_t P[3] = {998244353u, 1004535809u, 167772161u};
60	static const uint32_t G[3] = {3u, 3u, 3u};
61
62	// Modular exponentiation for arbitrary modulus (used to compute inverses mod primes)
63	uint64_t mod_pow64(uint64_t a, uint64_t e, uint64_t m){
64	__int128 r=1, x=a % m;
65	while(e){ if(e&1) r=(rx)%m; x=(xx)%m; e>>=1; }
66	return (uint64_t)r;
67	}
68
69	// Reconstruct a single coefficient modulo MOD_TARGET using Garner-like steps
70	uint32_t crt_to_target(uint32_t r0, uint32_t r1, uint32_t r2){
71	// x ≡ r0 (mod P0)
72	uint64_t x0 = r0 % MOD_TARGET;
73
74	// Lift to include P1
75	uint64_t m0 = P[0];
76	uint64_t t1 = r1 >= r0 ? (r1 - r0) : (uint64_t)r1 + P[1] - r0; // in [0, P1)
77	uint64_t inv_m0_mod_p1 = mod_pow64(m0 % P[1], P[1]-2, P[1]);
78	t1 = (t1 * inv_m0_mod_p1) % P[1];
79	__int128 M0_mod_T = m0 % MOD_TARGET;
80	x0 = (uint64_t)(((__int128)x0 + (__int128)t1 * M0_mod_T) % MOD_TARGET);
81
82	// Lift to include P2
83	uint64_t m01_mod_p2 = (uint64_t)(((__int128)P[0] % P[2]) * (P[1] % P[2]) % P[2]);
84	uint64_t t2_rhs;
85	// Compute current x modulo P2 without forming huge product: x = r0 + m0*t1 (mod P2)
86	uint64_t r0_plus = (r0 + (uint64_t)(((__int128)(P[0]%P[2]) * (t1 % P[2])) % P[2])) % P[2];
87	t2_rhs = r2 >= r0_plus ? (r2 - r0_plus) : (uint64_t)r2 + P[2] - r0_plus;
88	uint64_t inv_m01_mod_p2 = mod_pow64(m01_mod_p2, P[2]-2, P[2]);
89	uint64_t t2 = (t2_rhs * inv_m01_mod_p2) % P[2];
90
91	// Update x0 modulo target: multiply by (P0*P1) mod MOD_TARGET
92	uint64_t M01_mod_T = (uint64_t)(((__int128)(P[0]%MOD_TARGET) * (P[1]%MOD_TARGET)) % MOD_TARGET);
93	x0 = (uint64_t)(((__int128)x0 + (__int128)t2 * M01_mod_T) % MOD_TARGET);
94	return (uint32_t)x0;
95	}
96
97	vector<uint32_t> convolution_mod_1e9p7(const vector<uint32_t>& A, const vector<uint32_t>& B){
98	ModNTT N0(P[0], G[0]);
99	ModNTT N1(P[1], G[1]);
100	ModNTT N2(P[2], G[2]);
101	vector<uint32_t> c0 = N0.convolution(A,B);
102	vector<uint32_t> c1 = N1.convolution(A,B);
103	vector<uint32_t> c2 = N2.convolution(A,B);
104	size_t n = c0.size();
105	vector<uint32_t> res(n);
106	for(size_t i=0;i<n;i++) res[i] = crt_to_target(c0[i], c1[i], c2[i]);
107	return res;
108	}
109
110	int main(){
111	ios::sync_with_stdio(false);
112	cin.tie(nullptr);
113
114	// Example: multiply under 1e9+7 target modulus
115	vector<uint32_t> A = {1000000000u, 2u, 3u}; // large values mod 1e9+7 are okay as residues
116	vector<uint32_t> B = {7u, 8u, 9u, 10u};
117	// Reduce inputs modulo target? Not necessary; we want exact integer conv then mod 1e9+7.
118	// Values are treated as integers less than primes; safe as we use 64-bit temporaries.
119
120	vector<uint32_t> C = convolution_mod_1e9p7(A, B);
121	for (size_t i = 0; i < C.size(); i++) cout << C[i] << (i+1==C.size()?'\n':' ');
122	return 0;
123	}
124

1	#include <bits/stdc++.h>
2	using namespace std;
3
4	// Multiply big nonnegative integers given as decimal strings using NTT + CRT.
5	// We split digits in base B=10^4, convolve, then carry-compact to get the result.
6	// We use two NTT-friendly primes whose product fits in 64 bits for exact reconstruction.
7
8	struct NTT {
9	uint32_t MOD, G;
10	NTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
11	inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
12	inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
13	uint32_t powmod(uint32_t a, uint32_t e) const {
14	uint64_t r=1, x=a;
15	while(e){ if(e&1) r=(rx)%MOD; x=(xx)%MOD; e>>=1; }
16	return (uint32_t)r;
17	}
18	void ntt(vector<uint32_t>& a, bool inv) const {
19	int n=(int)a.size();
20	for(int i=1,j=0;i<n;i++){
21	int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
22	}
23	for(int len=2; len<=n; len<<=1){
24	uint32_t wlen = powmod(G, (MOD-1)/len);
25	if(inv) wlen = powmod(wlen, MOD-2);
26	for(int i=0;i<n;i+=len){
27	uint64_t w=1;
28	for(int j=0;j<len/2;j++){
29	uint32_t u=a[i+j];
30	uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
31	a[i+j] = add(u,v);
32	a[i+j+len/2] = sub(u,v);
33	w = (w*wlen)%MOD;
34	}
35	}
36	}
37	if(inv){
38	uint32_t inv_n = powmod(n, MOD-2);
39	for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
40	}
41	}
42	vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43	if(A.empty() \|\| B.empty()) return {};
44	int need = (int)A.size() + (int)B.size() - 1;
45	int n=1; while(n<need) n<<=1;
46	vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47	fa.resize(n); fb.resize(n);
48	ntt(fa,false); ntt(fb,false);
49	for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50	ntt(fa,true);
51	fa.resize(need);
52	return fa;
53	}
54	};
55
56	// Two primes (primitive root 3 for both)
57	static const uint32_t P0 = 998244353u; // ~1e9
58	static const uint32_t P1 = 1004535809u; // ~1e9
59	static const uint32_t G0 = 3u, G1 = 3u;
60
61	// Combine residues r0 (mod P0) and r1 (mod P1) into x in [0, P0*P1)
62	static inline unsigned long long crt2(uint32_t r0, uint32_t r1){
63	unsigned long long m0 = P0, m1 = P1;
64	// t ≡ (r1 - r0) * inv(m0, m1) (mod m1)
65	unsigned long long diff = (r1 >= r0) ? (r1 - r0) : (unsigned long long)r1 + m1 - r0;
66	// inv_m0_mod_m1 via Fermat since m1 is prime
67	auto modpow = [](unsigned long long a, unsigned long long e, unsigned long long m){ __int128 r=1,x=a%m; while(e){ if(e&1) r=(rx)%m; x=(xx)%m; e>>=1;} return (unsigned long long)r; };
68	unsigned long long inv_m0 = modpow(m0 % m1, m1 - 2, m1);
69	unsigned long long t = ( (__int128)diff * inv_m0 ) % m1;
70	unsigned long long x = (unsigned long long)r0 + (unsigned long long)((__int128)t * m0);
71	return x; // fits in < 1e18
72	}
73
74	vector<int> to_base(const string& s, int B){
75	// Parse decimal string into base-B little-endian digits
76	vector<int> a; a.reserve((int)s.size()/4+2);
77	int n=(int)s.size();
78	int i=n;
79	while(i>0){
80	int l=max(0,i-4); // chunk 4 decimal digits
81	int val=0;
82	for(int k=l;k<i;k++) val = val*10 + (s[k]-'0');
83	a.push_back(val);
84	i=l;
85	}
86	// Now a holds base-10^4 digits little-endian
87	return a;
88	}
89
90	string from_base(vector<unsigned long long>& a, int B){
91	// Remove leading zeros
92	while(!a.empty() && a.back()==0) a.pop_back();
93	if(a.empty()) return string("0");
94	// Convert back to decimal string
95	const int WIDTH = 4; // since B=10^4
96	stringstream ss;
97	ss << a.back();
98	char buf[8];
99	for(int i=(int)a.size()-2;i>=0;i--){
100	unsigned long long x = a[i];
101	// print with leading zeros to WIDTH digits
102	snprintf(buf, sizeof(buf), "%04llu", x);
103	ss << buf;
104	}
105	return ss.str();
106	}
107
108	string multiply_big(const string& A, const string& B){
109	if (A=="0" \|\| B=="0") return "0";
110	const int BASE = 10000; // 1e4
111
112	vector<int> a = to_base(A, BASE);
113	vector<int> b = to_base(B, BASE);
114
115	NTT N0(P0, G0), N1(P1, G1);
116	vector<uint32_t> a0(a.size()), b0(b.size()), a1(a.size()), b1(b.size());
117	for(size_t i=0;i<a.size();i++){ a0[i]=a[i]%P0; a1[i]=a[i]%P1; }
118	for(size_t i=0;i<b.size();i++){ b0[i]=b[i]%P0; b1[i]=b[i]%P1; }
119
120	vector<uint32_t> c0 = N0.convolution(a0,b0);
121	vector<uint32_t> c1 = N1.convolution(a1,b1);
122
123	size_t n = c0.size();
124	vector<unsigned long long> coeff(n);
125	for(size_t i=0;i<n;i++) coeff[i] = crt2(c0[i], c1[i]); // exact coefficient value (< P0*P1)
126
127	// Carry handling in base BASE
128	unsigned long long carry=0;
129	for(size_t i=0;i<n;i++){
130	unsigned __int128 v = (unsigned __int128)coeff[i] + carry;
131	coeff[i] = (unsigned long long)(v % BASE);
132	carry = (unsigned long long)(v / BASE);
133	}
134	while(carry){ coeff.push_back(carry % BASE); carry /= BASE; }
135
136	return from_base(coeff, BASE);
137	}
138
139	int main(){
140	ios::sync_with_stdio(false);
141	cin.tie(nullptr);
142
143	string A, B;
144	// Demo: multiply two large numbers
145	A = "123456789012345678901234567890";
146	B = "98765432109876543210";
147	string C = multiply_big(A, B);
148	cout << C << "\n";
149	return 0;
150	}
151