⚙️AlgorithmAdvanced

NTT (Number Theoretic Transform)

Key Points

  • The Number Theoretic Transform (NTT) is an FFT-like algorithm that performs discrete convolutions exactly using modular arithmetic instead of floating-point numbers.
  • NTT requires a prime modulus p with a large power-of-two factor in p−1 so that high-order roots of unity exist (e.g., 998244353 = 119×2^{23}+1 with primitive root g=3).
  • Convolution via NTT runs in O(n n) time and avoids rounding errors, making it ideal for competitive programming and exact polynomial operations.
  • NTT computes transforms using butterfly operations, bit-reversal permutations, and modular powers of a primitive root of unity.
  • To multiply under a non-NTT-friendly modulus (like 10^9+7) or with large integers, combine results from multiple NTT-friendly primes using the Chinese Remainder Theorem (CRT).
  • Always pad vectors to a power-of-two size at least the sum of lengths minus one to match the needed convolution length.
  • Remember to multiply by the modular inverse of n when doing the inverse transform to correctly scale results.
  • Careful implementation details—like precomputing roots, handling negatives modulo p, and avoiding 32-bit overflow—are crucial for correctness and speed.

Prerequisites

  • Discrete Fourier Transform and FFTNTT mirrors FFT structure; understanding DFT/FFT illuminates why convolution becomes pointwise multiplication.
  • Modular ArithmeticNTT operates entirely modulo a prime; you must be comfortable with modular addition, multiplication, and reduction.
  • Fast Modular ExponentiationComputing powers of primitive roots and modular inverses via Fermat’s little theorem is core to NTT.
  • Primitive Roots and OrdersExistence of n-th roots of unity in Z_p requires that n divides p−1 and knowledge of primitive roots.
  • Chinese Remainder Theorem (CRT)Combining results from multiple NTT primes to reach arbitrary moduli or exact integers depends on CRT/Garner.
  • Asymptotic ComplexityUnderstanding O(n \log n) informs when NTT beats naive O(n^2) multiplication.
  • 64-bit Arithmetic and OverflowNTT and CRT multiplications must use 64-bit or 128-bit intermediates to avoid overflow.

Detailed Explanation

Tap terms for definitions

01Overview

The Number Theoretic Transform (NTT) is the integer-arithmetic twin of the Fast Fourier Transform (FFT). Where the FFT evaluates a polynomial at complex roots of unity and risks floating-point rounding errors, the NTT evaluates at modular roots of unity inside a finite field (integers modulo a prime p). This allows exact computations with integers, which is essential in competitive programming and cryptographic protocols. The most common practical use is fast polynomial (or integer) multiplication via convolution: given sequences a and b, NTT computes their convolution c where c[k] = sum_i a[i] * b[k-i]. Like FFT, it achieves O(n log n) performance by recursively (or iteratively) structuring the calculations into butterfly steps and exploiting roots of unity.

However, NTT requires special primes p such that p−1 is divisible by a large power of two: this ensures that high-order roots of unity exist. A canonical example is p = 998244353 with primitive root g = 3; it supports length up to 2^{23} transforms. When the target modulus (e.g., 10^9+7) is not NTT-friendly, one can perform NTT under multiple friendly moduli and then reconstruct the exact integer result via the Chinese Remainder Theorem (CRT), finally reducing modulo the target. This approach retains exactness while keeping the asymptotic complexity.

In practice, NTT implementations hinge on efficient modular exponentiation, precomputation of twiddle factors (powers of primitive roots), in-place bit-reversal permutation, and careful scaling in the inverse transform. With these, NTT provides fast, deterministic, and exact convolution for large inputs.

02Intuition & Analogies

Imagine you want to know how two sound waves combine. The FFT lets you switch to a frequency view, multiply the corresponding frequencies, and switch back—far faster than directly summing overlaps in time. NTT does the same trick but replaces continuous waves with exact, discrete arithmetic in a finite world where numbers wrap around modulo a prime p.

Think of evaluating a polynomial at special points that "cycle" perfectly after a fixed number of steps—these are roots of unity. In the real/complex world, you would use angles on the unit circle (cosines and sines). In the NTT world, you pick a special number g (a primitive root modulo p) whose powers loop back to 1 after p−1 steps. By choosing the right step size, you get points that loop every n steps—these are the n-th roots of unity modulo p.

Why is this powerful? Because evaluating a polynomial at these structured points lets you turn convolution (which is like sliding one sequence across another and summing overlaps) into pointwise multiplication. After multiplying the values at those points, you transform back to get the exact convolution. No rounding, no floating errors, just pure modular arithmetic. The only caveat: your finite world must contain those special points. So you pick a prime p whose p−1 has a big power of 2 factor, ensuring many roots of unity exist. This is why 998244353 is beloved: it’s big enough and has 2^{23} dividing p−1, so you can handle sequences up to 8 million length with power-of-two padding. When your end goal is a different modulus (like 1e9+7) or even raw integer results, you can combine multiple NTT computations using CRT to stitch together the exact answer.

03Formal Definition

Let p be a prime and \(\) the finite field of integers modulo p. Suppose n divides p−1 so that \(\) contains a primitive n-th root of unity \(\) (i.e., \( 1 \) and \( 1\) for 1 ≤ ). Given a vector (polynomial coefficients) \(a = (, , , )\) over \(\), the Number Theoretic Transform of length n is the mapping \[ = , ,1,,n-1. \] The inverse transform uses \(\) and division by n (multiplication by \( p\)): \[ . \] Polynomial convolution follows from the convolution theorem: if \(c = a * b\) denotes circular convolution of length n, then \((c) = (a) (b)\), where \(\) is componentwise multiplication. For linear convolution of two sequences of lengths and , pad both to length \(n + - 1\) that divides p−1 (typically a power of two) before applying NTT. In implementations, an iterative Cooley–Tukey style algorithm with bit-reversal ordering applies successive butterfly layers of size 2,4,8,...,n, multiplying by appropriate twiddle factors \(\). Correctness requires that n divides p−1, and the chosen generator g produces \( = \) of exact order n.

04When to Use

Use NTT when you need exact integer polynomial or sequence convolution at large scale: multiplying polynomials, computing power series, combinatorics (e.g., counting paths, subset convolutions when adapted), and big integer multiplication. It shines when rounding errors from FFT would be problematic, such as in competitive programming problems that demand exact answers or modular arithmetic.

  • Convolution under an NTT-friendly modulus: If your target modulus is 998244353, 1004535809, 167772161, etc., NTT is a direct drop-in with minimal overhead.
  • Convolution under arbitrary prime/composite modulus (e.g., 10^9+7): Perform multiple NTTs under different NTT-friendly primes and reconstruct via CRT (or Garner) modulo your target.
  • Big integer multiplication: Split the numbers into base-B digits, convolve using NTT under one or more primes, and carry-compact to obtain exact integers.
  • Power series operations: Inversion, logarithm, exponentiation, and division of polynomials over (\mathbb{Z}_p) rely heavily on fast convolution.

Avoid NTT when sequences are very small (naive O(n^2) might be faster) or when you cannot ensure that required transform lengths divide p−1. For non power-of-two lengths, you may either pad to the next power of two or use Bluestein’s algorithm variant, though padding is typically simpler and faster under common NTT primes.

⚠️Common Mistakes

  • Using a modulus without sufficient 2-adic length: If n does not divide p−1, the n-th roots of unity do not exist, and the transform will be incorrect. Always ensure the padded length n is a power of two that divides p−1.
  • Forgetting the inverse scaling: The inverse NTT must multiply by n^{-1} modulo p. Omitting this leaves results off by a factor of n.
  • Wrong primitive root or power: Picking (\omega = g^{(p-1)/n}) where g is not a primitive root—or where n does not exactly divide p−1—breaks correctness. Verify g is a primitive root and n | (p−1).
  • Integer overflow before reduction: Multiplying two 32-bit ints can overflow 32-bit. Use 64-bit (int64_t or __int128) temporaries before reducing modulo p.
  • Mishandling negatives: After subtraction in butterflies, values may become negative. Normalize by adding p before a modulo reduction to keep values in [0, p−1].
  • Incorrect padding length: For linear convolution of sizes n_a and n_b, pad to n ≥ n_a + n_b − 1, usually rounding up to the next power of two supported by p.
  • Mixing moduli in CRT improperly: When reconstructing under a target modulus, ensure moduli are pairwise coprime, precompute modular inverses correctly, and use 128-bit intermediates to avoid overflow.
  • Reusing precomputed roots across different n incorrectly: Precomputed twiddle tables depend on the stage length. Either precompute per power-of-two or compute wlen freshly at each stage.

Key Formulas

NTT (forward)

Explanation: This is the discrete transform over a finite field: evaluate the polynomial at powers of a primitive n-th root of unity modulo p. It maps time-domain coefficients to frequency-domain values.

Inverse NTT

Explanation: To invert, use the inverse root of unity and multiply by the modular inverse of n. This perfectly recovers the original coefficients.

Linear convolution

Explanation: The k-th coefficient of the product polynomial equals the sum over all index pairs adding to k. Computing this naively is O(), which NTT reduces to O(n log n).

Primitive n-th root of unity

Explanation: Given a primitive root g modulo p and a transform length n dividing p−1, this constructs an element of exact order n. Its powers are the evaluation points of the NTT.

Time Complexity

Explanation: The NTT, like FFT, uses divide-and-conquer butterfly stages. There are log n stages, each doing O(n) work, leading to O(n log n) total operations.

CRT for two moduli

Explanation: Combines residues r0 and r1 modulo coprime m0 and m1 into a unique solution modulo m0·m1. This is used to merge NTT results from two primes.

Garner lifting step

Explanation: Garner's algorithm incrementally reconstructs a residue modulo a target without forming huge products. is the current combined modulus product.

Padding length

Explanation: For linear convolution of sizes and , pad to the next power of two at least as large as +−1. This ensures the transform length is valid.

Fermat inverse

Explanation: For prime p and nonzero a, the modular inverse can be computed via fast exponentiation using Fermat’s little theorem.

Complexity Analysis

An NTT of length n performs O(n log n) modular additions and multiplications. In an iterative Cooley–Tukey implementation, there are lo n stages. At stage s (butterfly size ), we process n elements in pairs, doing O(1) modular operations per pair, for O(n) work per stage and O(n log n) total. The constant factors include modular multiplications by precomputed twiddle factors and bit-reversal permutations, both linear in n. Precomputing twiddle tables for all stages also costs O(n) time and space, though many implementations compute a wlen per stage using fast exponentiation of cost O(log p) per stage; amortized, precomputation is faster. Space complexity is O(n) beyond the input and output arrays: in-place transforms store intermediate results in the same array, plus O(n) for twiddle tables and O(n) for the bit-reversal index array if cached. If temporary buffers are reused across calls, overall auxiliary memory remains linear. For convolution, padding two vectors to length N = 2^{ (+-1) } yields two forward transforms O(N log N), one pointwise multiplication O(N), and one inverse transform O(N log N). Thus, total time is O(N log N) and space O(N). When using multiple NTT primes and CRT (k primes), the time multiplies by k for the transforms plus linear-time reconstruction, producing O(k N log N) time and O(k N) space (if transforms are not reused). In practice, two or three primes suffice to reconstruct 64-bit exact results. Using 64-bit or 128-bit intermediates avoids overflow in modular multiplications and CRT recombination.

Code Examples

Basic NTT-based convolution modulo 998244353
1#include <bits/stdc++.h>
2using namespace std;
3
4// NTT under modulus 998244353 with primitive root g = 3
5// Supports power-of-two lengths up to 2^23.
6
7static const uint32_t MOD = 998244353; // 119 * 2^23 + 1
8static const uint32_t G = 3; // primitive root modulo MOD
9
10uint32_t addmod(uint32_t a, uint32_t b) {
11 uint32_t s = a + b;
12 if (s >= MOD) s -= MOD;
13 return s;
14}
15uint32_t submod(uint32_t a, uint32_t b) {
16 return a >= b ? a - b : a + MOD - b;
17}
18uint32_t powmod(uint32_t a, uint32_t e) {
19 uint64_t r = 1, x = a;
20 while (e) {
21 if (e & 1) r = (r * x) % MOD;
22 x = (x * x) % MOD;
23 e >>= 1;
24 }
25 return (uint32_t)r;
26}
27
28void ntt(vector<uint32_t> &a, bool invert) {
29 int n = (int)a.size();
30 // Bit-reversal permutation
31 for (int i = 1, j = 0; i < n; i++) {
32 int bit = n >> 1;
33 for (; j & bit; bit >>= 1) j ^= bit;
34 j ^= bit;
35 if (i < j) swap(a[i], a[j]);
36 }
37
38 for (int len = 2; len <= n; len <<= 1) {
39 // wlen = g^{(MOD-1)/len}
40 uint32_t wlen = powmod(G, (MOD - 1) / len);
41 if (invert) {
42 // use inverse root for inverse NTT
43 wlen = powmod(wlen, MOD - 2);
44 }
45 for (int i = 0; i < n; i += len) {
46 uint64_t w = 1;
47 for (int j = 0; j < len / 2; j++) {
48 uint32_t u = a[i + j];
49 uint32_t v = (uint32_t)((w * a[i + j + len / 2]) % MOD);
50 a[i + j] = addmod(u, v);
51 a[i + j + len / 2] = submod(u, v);
52 w = (w * wlen) % MOD;
53 }
54 }
55 }
56
57 if (invert) {
58 uint32_t inv_n = powmod(n, MOD - 2);
59 for (int i = 0; i < n; i++) {
60 a[i] = (uint32_t)((uint64_t)a[i] * inv_n % MOD);
61 }
62 }
63}
64
65vector<uint32_t> convolution_mod(const vector<uint32_t>& A, const vector<uint32_t>& B) {
66 if (A.empty() || B.empty()) return {};
67 int nA = (int)A.size(), nB = (int)B.size();
68 int need = nA + nB - 1;
69 int n = 1; while (n < need) n <<= 1;
70 vector<uint32_t> fa(A.begin(), A.end()), fb(B.begin(), B.end());
71 fa.resize(n); fb.resize(n);
72
73 ntt(fa, false);
74 ntt(fb, false);
75 for (int i = 0; i < n; i++) fa[i] = (uint32_t)((uint64_t)fa[i] * fb[i] % MOD);
76 ntt(fa, true);
77
78 fa.resize(need);
79 return fa;
80}
81
82int main() {
83 ios::sync_with_stdio(false);
84 cin.tie(nullptr);
85
86 // Example: multiply polynomials (2 + 3x + 4x^2) * (1 + 2x)
87 vector<uint32_t> A = {2, 3, 4};
88 vector<uint32_t> B = {1, 2};
89
90 vector<uint32_t> C = convolution_mod(A, B);
91
92 // Expected: [2*1, 2*2 + 3*1, 3*2 + 4*1, 4*2] = [2, 7, 10, 8]
93 for (size_t i = 0; i < C.size(); i++) {
94 cout << C[i] << (i + 1 == C.size() ? '\n' : ' ');
95 }
96 return 0;
97}
98

Implements an iterative Cooley–Tukey NTT under 998244353 with primitive root 3. The convolution pads inputs to power-of-two length, performs two forward transforms, multiplies pointwise, and applies the inverse transform with scaling by n^{-1}.

Time: O(n log n)Space: O(n)
Convolution modulo 1e9+7 using three NTT primes and CRT (Garner-style)
1#include <bits/stdc++.h>
2using namespace std;
3
4// We will convolve two sequences and return results modulo 1e9+7 (not NTT-friendly)
5// by combining three NTT-friendly primes via CRT.
6
7struct ModNTT {
8 uint32_t MOD, G;
9 ModNTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
10 inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
11 inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
12 uint32_t powmod(uint32_t a, uint32_t e) const {
13 uint64_t r=1, x=a;
14 while(e){ if(e&1) r=(r*x)%MOD; x=(x*x)%MOD; e>>=1; }
15 return (uint32_t)r;
16 }
17 void ntt(vector<uint32_t>& a, bool inv) const {
18 int n=(int)a.size();
19 for(int i=1,j=0;i<n;i++){
20 int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
21 }
22 for(int len=2; len<=n; len<<=1){
23 uint32_t wlen = powmod(G, (MOD-1)/len);
24 if(inv) wlen = powmod(wlen, MOD-2);
25 for(int i=0;i<n;i+=len){
26 uint64_t w=1;
27 for(int j=0;j<len/2;j++){
28 uint32_t u=a[i+j];
29 uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
30 a[i+j] = add(u,v);
31 a[i+j+len/2] = sub(u,v);
32 w = (w*wlen)%MOD;
33 }
34 }
35 }
36 if(inv){
37 uint32_t inv_n = powmod(n, MOD-2);
38 for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
39 }
40 }
41
42 vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43 if(A.empty() || B.empty()) return {};
44 int need = (int)A.size() + (int)B.size() - 1;
45 int n=1; while(n<need) n<<=1;
46 vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47 fa.resize(n); fb.resize(n);
48 ntt(fa,false); ntt(fb,false);
49 for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50 ntt(fa,true);
51 fa.resize(need);
52 return fa;
53 }
54};
55
56static const uint32_t MOD_TARGET = 1000000007u; // 1e9+7
57
58// Three NTT-friendly primes (all have primitive root 3)
59static const uint32_t P[3] = {998244353u, 1004535809u, 167772161u};
60static const uint32_t G[3] = {3u, 3u, 3u};
61
62// Modular exponentiation for arbitrary modulus (used to compute inverses mod primes)
63uint64_t mod_pow64(uint64_t a, uint64_t e, uint64_t m){
64 __int128 r=1, x=a % m;
65 while(e){ if(e&1) r=(r*x)%m; x=(x*x)%m; e>>=1; }
66 return (uint64_t)r;
67}
68
69// Reconstruct a single coefficient modulo MOD_TARGET using Garner-like steps
70uint32_t crt_to_target(uint32_t r0, uint32_t r1, uint32_t r2){
71 // x ≡ r0 (mod P0)
72 uint64_t x0 = r0 % MOD_TARGET;
73
74 // Lift to include P1
75 uint64_t m0 = P[0];
76 uint64_t t1 = r1 >= r0 ? (r1 - r0) : (uint64_t)r1 + P[1] - r0; // in [0, P1)
77 uint64_t inv_m0_mod_p1 = mod_pow64(m0 % P[1], P[1]-2, P[1]);
78 t1 = (t1 * inv_m0_mod_p1) % P[1];
79 __int128 M0_mod_T = m0 % MOD_TARGET;
80 x0 = (uint64_t)(((__int128)x0 + (__int128)t1 * M0_mod_T) % MOD_TARGET);
81
82 // Lift to include P2
83 uint64_t m01_mod_p2 = (uint64_t)(((__int128)P[0] % P[2]) * (P[1] % P[2]) % P[2]);
84 uint64_t t2_rhs;
85 // Compute current x modulo P2 without forming huge product: x = r0 + m0*t1 (mod P2)
86 uint64_t r0_plus = (r0 + (uint64_t)(((__int128)(P[0]%P[2]) * (t1 % P[2])) % P[2])) % P[2];
87 t2_rhs = r2 >= r0_plus ? (r2 - r0_plus) : (uint64_t)r2 + P[2] - r0_plus;
88 uint64_t inv_m01_mod_p2 = mod_pow64(m01_mod_p2, P[2]-2, P[2]);
89 uint64_t t2 = (t2_rhs * inv_m01_mod_p2) % P[2];
90
91 // Update x0 modulo target: multiply by (P0*P1) mod MOD_TARGET
92 uint64_t M01_mod_T = (uint64_t)(((__int128)(P[0]%MOD_TARGET) * (P[1]%MOD_TARGET)) % MOD_TARGET);
93 x0 = (uint64_t)(((__int128)x0 + (__int128)t2 * M01_mod_T) % MOD_TARGET);
94 return (uint32_t)x0;
95}
96
97vector<uint32_t> convolution_mod_1e9p7(const vector<uint32_t>& A, const vector<uint32_t>& B){
98 ModNTT N0(P[0], G[0]);
99 ModNTT N1(P[1], G[1]);
100 ModNTT N2(P[2], G[2]);
101 vector<uint32_t> c0 = N0.convolution(A,B);
102 vector<uint32_t> c1 = N1.convolution(A,B);
103 vector<uint32_t> c2 = N2.convolution(A,B);
104 size_t n = c0.size();
105 vector<uint32_t> res(n);
106 for(size_t i=0;i<n;i++) res[i] = crt_to_target(c0[i], c1[i], c2[i]);
107 return res;
108}
109
110int main(){
111 ios::sync_with_stdio(false);
112 cin.tie(nullptr);
113
114 // Example: multiply under 1e9+7 target modulus
115 vector<uint32_t> A = {1000000000u, 2u, 3u}; // large values mod 1e9+7 are okay as residues
116 vector<uint32_t> B = {7u, 8u, 9u, 10u};
117 // Reduce inputs modulo target? Not necessary; we want exact integer conv then mod 1e9+7.
118 // Values are treated as integers less than primes; safe as we use 64-bit temporaries.
119
120 vector<uint32_t> C = convolution_mod_1e9p7(A, B);
121 for (size_t i = 0; i < C.size(); i++) cout << C[i] << (i+1==C.size()?'\n':' ');
122 return 0;
123}
124

Computes convolution modulo 1e9+7 by performing three NTT convolutions under different primes, then reconstructing each coefficient modulo 1e9+7 using a Garner-style CRT combination. It avoids large intermediate products by reducing at each step and uses 128-bit temporaries for safety.

Time: O(k n log n) with k=3Space: O(k n)
Big integer multiplication using two NTT primes and CRT (base 10^4)
1#include <bits/stdc++.h>
2using namespace std;
3
4// Multiply big nonnegative integers given as decimal strings using NTT + CRT.
5// We split digits in base B=10^4, convolve, then carry-compact to get the result.
6// We use two NTT-friendly primes whose product fits in 64 bits for exact reconstruction.
7
8struct NTT {
9 uint32_t MOD, G;
10 NTT(uint32_t mod, uint32_t g): MOD(mod), G(g) {}
11 inline uint32_t add(uint32_t a, uint32_t b) const { uint32_t s=a+b; return s>=MOD? s-MOD : s; }
12 inline uint32_t sub(uint32_t a, uint32_t b) const { return a>=b? a-b : a+MOD-b; }
13 uint32_t powmod(uint32_t a, uint32_t e) const {
14 uint64_t r=1, x=a;
15 while(e){ if(e&1) r=(r*x)%MOD; x=(x*x)%MOD; e>>=1; }
16 return (uint32_t)r;
17 }
18 void ntt(vector<uint32_t>& a, bool inv) const {
19 int n=(int)a.size();
20 for(int i=1,j=0;i<n;i++){
21 int bit=n>>1; for(; j & bit; bit>>=1) j^=bit; j^=bit; if(i<j) swap(a[i],a[j]);
22 }
23 for(int len=2; len<=n; len<<=1){
24 uint32_t wlen = powmod(G, (MOD-1)/len);
25 if(inv) wlen = powmod(wlen, MOD-2);
26 for(int i=0;i<n;i+=len){
27 uint64_t w=1;
28 for(int j=0;j<len/2;j++){
29 uint32_t u=a[i+j];
30 uint32_t v=(uint32_t)((w*a[i+j+len/2])%MOD);
31 a[i+j] = add(u,v);
32 a[i+j+len/2] = sub(u,v);
33 w = (w*wlen)%MOD;
34 }
35 }
36 }
37 if(inv){
38 uint32_t inv_n = powmod(n, MOD-2);
39 for(int i=0;i<n;i++) a[i] = (uint32_t)((uint64_t)a[i]*inv_n % MOD);
40 }
41 }
42 vector<uint32_t> convolution(const vector<uint32_t>& A, const vector<uint32_t>& B) const {
43 if(A.empty() || B.empty()) return {};
44 int need = (int)A.size() + (int)B.size() - 1;
45 int n=1; while(n<need) n<<=1;
46 vector<uint32_t> fa(A.begin(),A.end()), fb(B.begin(),B.end());
47 fa.resize(n); fb.resize(n);
48 ntt(fa,false); ntt(fb,false);
49 for(int i=0;i<n;i++) fa[i] = (uint32_t)((uint64_t)fa[i]*fb[i]%MOD);
50 ntt(fa,true);
51 fa.resize(need);
52 return fa;
53 }
54};
55
56// Two primes (primitive root 3 for both)
57static const uint32_t P0 = 998244353u; // ~1e9
58static const uint32_t P1 = 1004535809u; // ~1e9
59static const uint32_t G0 = 3u, G1 = 3u;
60
61// Combine residues r0 (mod P0) and r1 (mod P1) into x in [0, P0*P1)
62static inline unsigned long long crt2(uint32_t r0, uint32_t r1){
63 unsigned long long m0 = P0, m1 = P1;
64 // t ≡ (r1 - r0) * inv(m0, m1) (mod m1)
65 unsigned long long diff = (r1 >= r0) ? (r1 - r0) : (unsigned long long)r1 + m1 - r0;
66 // inv_m0_mod_m1 via Fermat since m1 is prime
67 auto modpow = [](unsigned long long a, unsigned long long e, unsigned long long m){ __int128 r=1,x=a%m; while(e){ if(e&1) r=(r*x)%m; x=(x*x)%m; e>>=1;} return (unsigned long long)r; };
68 unsigned long long inv_m0 = modpow(m0 % m1, m1 - 2, m1);
69 unsigned long long t = ( (__int128)diff * inv_m0 ) % m1;
70 unsigned long long x = (unsigned long long)r0 + (unsigned long long)((__int128)t * m0);
71 return x; // fits in < 1e18
72}
73
74vector<int> to_base(const string& s, int B){
75 // Parse decimal string into base-B little-endian digits
76 vector<int> a; a.reserve((int)s.size()/4+2);
77 int n=(int)s.size();
78 int i=n;
79 while(i>0){
80 int l=max(0,i-4); // chunk 4 decimal digits
81 int val=0;
82 for(int k=l;k<i;k++) val = val*10 + (s[k]-'0');
83 a.push_back(val);
84 i=l;
85 }
86 // Now a holds base-10^4 digits little-endian
87 return a;
88}
89
90string from_base(vector<unsigned long long>& a, int B){
91 // Remove leading zeros
92 while(!a.empty() && a.back()==0) a.pop_back();
93 if(a.empty()) return string("0");
94 // Convert back to decimal string
95 const int WIDTH = 4; // since B=10^4
96 stringstream ss;
97 ss << a.back();
98 char buf[8];
99 for(int i=(int)a.size()-2;i>=0;i--){
100 unsigned long long x = a[i];
101 // print with leading zeros to WIDTH digits
102 snprintf(buf, sizeof(buf), "%04llu", x);
103 ss << buf;
104 }
105 return ss.str();
106}
107
108string multiply_big(const string& A, const string& B){
109 if (A=="0" || B=="0") return "0";
110 const int BASE = 10000; // 1e4
111
112 vector<int> a = to_base(A, BASE);
113 vector<int> b = to_base(B, BASE);
114
115 NTT N0(P0, G0), N1(P1, G1);
116 vector<uint32_t> a0(a.size()), b0(b.size()), a1(a.size()), b1(b.size());
117 for(size_t i=0;i<a.size();i++){ a0[i]=a[i]%P0; a1[i]=a[i]%P1; }
118 for(size_t i=0;i<b.size();i++){ b0[i]=b[i]%P0; b1[i]=b[i]%P1; }
119
120 vector<uint32_t> c0 = N0.convolution(a0,b0);
121 vector<uint32_t> c1 = N1.convolution(a1,b1);
122
123 size_t n = c0.size();
124 vector<unsigned long long> coeff(n);
125 for(size_t i=0;i<n;i++) coeff[i] = crt2(c0[i], c1[i]); // exact coefficient value (< P0*P1)
126
127 // Carry handling in base BASE
128 unsigned long long carry=0;
129 for(size_t i=0;i<n;i++){
130 unsigned __int128 v = (unsigned __int128)coeff[i] + carry;
131 coeff[i] = (unsigned long long)(v % BASE);
132 carry = (unsigned long long)(v / BASE);
133 }
134 while(carry){ coeff.push_back(carry % BASE); carry /= BASE; }
135
136 return from_base(coeff, BASE);
137}
138
139int main(){
140 ios::sync_with_stdio(false);
141 cin.tie(nullptr);
142
143 string A, B;
144 // Demo: multiply two large numbers
145 A = "123456789012345678901234567890";
146 B = "98765432109876543210";
147 string C = multiply_big(A, B);
148 cout << C << "\n";
149 return 0;
150}
151

Parses decimal strings into base 10^4 digits, performs two NTT convolutions under 998244353 and 1004535809, combines coefficients exactly via CRT (product < 2^64), and then performs carry propagation to reconstruct the decimal result. This achieves exact big-integer multiplication in O(n log n).

Time: O(n log n)Space: O(n)