Suffix Array
Key Points
- •A suffix array stores the starting indices of all suffixes of a string in lexicographic order.
- •The classic O(n n) construction uses the doubling method and radix sort on pairs of ranks (rank[i], rank[i+2^k]).
- •The LCP array stores the longest common prefix lengths between adjacent suffixes in the suffix array and can be built in O(n) with Kasai's algorithm.
- •Pattern matching can be done in O(m n) by binary searching over the suffix array, comparing the pattern with selected suffixes.
- •The number of distinct substrings equals n(n+1)/2 minus the sum of all LCP values.
- •You can answer LCP(i, j) of any two suffixes using an RMQ (e.g., Sparse Table) over the LCP array in O(1) time after O(n n) preprocessing.
- •Suffix arrays are memory-light alternatives to suffix trees and are competitive for many string problems on large inputs.
- •Careful handling of ranks, stability in counting sort, and off-by-one indices are crucial for correctness and performance.
Prerequisites
- →String basics and indexing — Suffixes, substrings, and lexicographic comparisons rely on understanding how strings are indexed and compared.
- →Asymptotic analysis (Big-O) — To analyze O(n log n) construction, O(n) LCP, and O(m log n) queries.
- →Sorting algorithms — Suffix array construction depends on stable sorting by integer keys; understanding stability is critical.
- →Counting sort and radix sort — Efficient O(n) per round sorting of rank pairs requires integer sorting with stability.
- →Binary search — Pattern matching on SA uses lower_bound/upper_bound logic on sorted suffixes.
- →Range Minimum Query (RMQ) / Sparse Table — To compute LCP of arbitrary suffix pairs from the LCP array in O(1) queries.
- →Basic discrete math — Counting total substrings n(n+1)/2 and reasoning about sums like Σ LCP.
- →Memory management with vectors/arrays in C++ — Implementations use multiple linear-size arrays and temporary buffers efficiently.
Detailed Explanation
Tap terms for definitions01Overview
A suffix array is a compact data structure for strings that lists the starting positions of all suffixes in lexicographic (dictionary) order. Imagine all endings of a word, sorted alphabetically: this ordering encodes powerful information about repeated patterns and how substrings relate. The suffix array alone enables binary search over suffixes, which lets us find patterns in O(m log n). When paired with the LCP (Longest Common Prefix) array, it supports deeper analyses such as counting distinct substrings, identifying repeating patterns, and computing common prefixes quickly. Construction can be done via the doubling algorithm in O(n log n) time by sorting indices based on rank pairs (current 2^k prefix and the next 2^k block) using radix/counting sort. After building the suffix array, Kasai’s algorithm computes the LCP array in O(n) time by reusing information from previous comparisons. Together, the suffix array and LCP array offer a practical and efficient alternative to heavier structures like suffix trees, with simpler implementation and lower memory usage, making them popular in competitive programming and text processing applications.
02Intuition & Analogies
Think of a book index that lists words alphabetically and tells you where they appear. A suffix array plays a similar role for all endings of a string: for a string s, it sorts s[0..], s[1..], s[2..], …, s[n-1..]. Why suffixes? Any substring is a prefix of some suffix, so if we can binary search over suffixes, we can locate any substring. For example, in the word "banana", the suffixes are banana, anana, nana, ana, na, a. Sorting them gives a, ana, anana, banana, na, nana, and the suffix array is their starting positions [5, 3, 1, 0, 4, 2]. Once sorted, if you want to find "ana", you compare it with the middle suffix, decide whether to go left or right, and keep narrowing, just like looking up a word in a dictionary. The LCP array is like the shared prefix between neighboring dictionary entries—if you know that two adjacent entries share many starting letters, you can avoid re-comparing those letters repeatedly. The doubling construction builds this ordering layer by layer: first by 1-character ranks, then by 2-character ranks, then 4, 8, and so on. At each step, suffixes that are equal on the first 2^k characters get the same rank; then we refine with the next 2^k block. Using counting sort to sort by these small integer ranks makes each layer linear, and since there are O(log n) layers, the total time is O(n log n).
03Formal Definition
04When to Use
Use a suffix array when you need fast, memory-efficient substring operations on a static string. Typical use cases include: exact pattern matching across a large text with many queries (O(m log n) per query); counting or enumerating distinct substrings (using the relation with the LCP sum); finding the longest repeated substring (max of LCP); computing similarity between regions via LCP queries; lexicographic tasks such as finding the k-th lexicographic suffix or substring; and serving as a base for text indexing structures like the Burrows–Wheeler Transform and FM-index. If you have many online updates to the string, suffix arrays are not ideal because rebuilding is costly; consider dynamic structures instead. If you need the absolute fastest single-query matching and can afford more preprocessing and memory, suffix trees or suffix automata may be more suitable. If queries require LCP(i, j) across arbitrary positions, augment the LCP array with RMQ (e.g., Sparse Table) to get O(1) query time after O(n log n) preprocessing.
⚠️Common Mistakes
Common pitfalls include: 1) Incorrect stability in the radix/counting sort passes when sorting by pairs—always sort by the second key first, then by the first, both stably; otherwise the order is wrong. 2) Mishandling ranks for out-of-bounds offsets during doubling—use a sentinel rank like -1 and shift by +1 when indexing counts so that out-of-range compares as smaller than any valid rank. 3) Off-by-one errors in LCP: note that LCP has length n-1, and LCP[i] compares SA[i] with SA[i+1]. 4) Forgetting to decrement the running match length h in Kasai’s algorithm; this breaks the linear-time guarantee. 5) Using std::string comparisons or substrings directly inside binary search, which can introduce extra O(log n) factors or hidden copies—compare character-by-character without constructing substrings. 6) Mixing 0-based and 1-based indices in outputs, especially when translating from textbook definitions using a sentinel; stay consistent. 7) Overflow when counting substrings—use 64-bit integers because n(n+1)/2 can exceed 32-bit range. 8) Assuming Unicode code-point order equals lexicographic order for natural language; for competitive programming, stick to byte/ASCII order unless specified.
Key Formulas
Suffix Array Ordering
Explanation: The suffix array is a permutation of all starting positions that sorts the corresponding suffixes lexicographically. This formalizes the definition of SA.
LCP Definition
Explanation: LCP[i] is the length of the longest common prefix between adjacent suffixes in the suffix array. It measures how similar neighboring suffixes are at the start.
SA Doubling Time
Explanation: The doubling construction does O( n) rounds, each taking O(n) with radix sort on integer ranks. Hence the total time is O(n log n).
Kasai's Time
Explanation: Kasai's algorithm computes the entire LCP array in linear time by amortizing character comparisons across the string.
Total Substrings
Explanation: There are n choices for the end and up to n for the start, which simplifies to n(n+1)/2 total substrings counting duplicates.
Distinct Substrings via LCP
Explanation: The overlaps captured by LCP subtract out duplicated prefixes when counting unique substrings. Summing LCP removes exactly the shared prefixes of adjacent suffixes.
LCP of Two Suffixes via RMQ
Explanation: The LCP between any two suffixes equals the minimum LCP across the interval between their positions in SA. This reduces to a range minimum query on the LCP array.
Pattern Match Time
Explanation: Binary searching over n suffixes makes O(log n) comparisons, each comparing up to m characters, giving O(m log n) per query.
Space for SA+LCP
Explanation: Both SA and LCP are arrays of length n (LCP is n-1), and ranks are also O(n). Additional buffers for counting sort are linear.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Stable counting sort helper: sorts sa by key rnk[sa[i]+k] (with -1 mapped to 0) 5 static void counting_sort_by_key(vector<int>& sa, const vector<int>& rnk, int k) { 6 int n = (int)sa.size(); 7 int maxKey = max(256, n) + 1; // +1 for the sentinel bucket 8 vector<int> cnt(maxKey, 0), sa2(n); 9 // Count occurrences 10 for (int i = 0; i < n; ++i) { 11 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; // -1 -> 0 12 ++cnt[key]; 13 } 14 // Prefix sums for stable placement 15 for (int i = 1; i < maxKey; ++i) cnt[i] += cnt[i - 1]; 16 // Build output (iterate backward to be stable) 17 for (int i = n - 1; i >= 0; --i) { 18 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 19 sa2[--cnt[key]] = sa[i]; 20 } 21 sa.swap(sa2); 22 } 23 24 // Build suffix array in O(n log n) using doubling and radix sort on pairs 25 vector<int> build_suffix_array(const string& s) { 26 int n = (int)s.size(); 27 vector<int> sa(n), rnk(n); 28 if (n == 0) return sa; 29 // Initial ranks by single characters (byte values) 30 for (int i = 0; i < n; ++i) { 31 sa[i] = i; 32 rnk[i] = (unsigned char)s[i]; 33 } 34 for (int k = 1; k < n; k <<= 1) { 35 // Sort by second key then first key (both stably) 36 counting_sort_by_key(sa, rnk, k); 37 counting_sort_by_key(sa, rnk, 0); 38 // Recompute ranks 39 vector<int> newr(n); 40 newr[sa[0]] = 0; 41 int classes = 1; 42 for (int i = 1; i < n; ++i) { 43 int a = sa[i - 1], b = sa[i]; 44 int a1 = rnk[a], b1 = rnk[b]; 45 int a2 = (a + k < n) ? rnk[a + k] : -1; 46 int b2 = (b + k < n) ? rnk[b + k] : -1; 47 if (a1 != b1 || a2 != b2) ++classes; 48 newr[b] = classes - 1; 49 } 50 rnk.swap(newr); 51 if (classes == n) break; // All ranks unique 52 } 53 return sa; 54 } 55 56 // Kasai's algorithm: LCP[i] = lcp(s[sa[i]..], s[sa[i+1]..]) for i in [0..n-2] 57 vector<int> build_lcp(const string& s, const vector<int>& sa) { 58 int n = (int)s.size(); 59 vector<int> lcp(max(0, n - 1)), rank(n); 60 if (n == 0) return lcp; 61 for (int i = 0; i < n; ++i) rank[sa[i]] = i; 62 int h = 0; 63 for (int i = 0; i < n; ++i) { 64 int r = rank[i]; 65 if (r == n - 1) { // last suffix; no next neighbor 66 h = 0; 67 continue; 68 } 69 int j = sa[r + 1]; 70 // Extend current LCP by reusing previous value h - 1 71 while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h; 72 lcp[r] = h; 73 if (h > 0) --h; // Prepare for next i 74 } 75 return lcp; 76 } 77 78 int main() { 79 ios::sync_with_stdio(false); 80 cin.tie(nullptr); 81 82 string s; 83 if (!(cin >> s)) return 0; 84 85 auto sa = build_suffix_array(s); 86 auto lcp = build_lcp(s, sa); 87 88 cout << "Suffix Array (0-based):\n"; 89 for (int i = 0; i < (int)sa.size(); ++i) { 90 cout << i << ": " << sa[i] << " -> " << s.substr(sa[i]) << '\n'; 91 } 92 93 cout << "\nLCP array (between SA[i] and SA[i+1]):\n"; 94 for (int i = 0; i < (int)lcp.size(); ++i) { 95 cout << "LCP[" << i << "] = " << lcp[i] << '\n'; 96 } 97 return 0; 98 } 99
This program constructs the suffix array using the O(n log n) doubling algorithm with two-pass stable counting sorts (radix on pairs), and then computes the LCP array using Kasai’s O(n) algorithm. It prints each suffix in sorted order and the LCP values between adjacent suffixes.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 static void counting_sort_by_key(vector<int>& sa, const vector<int>& rnk, int k) { 5 int n = (int)sa.size(); 6 int maxKey = max(256, n) + 1; 7 vector<int> cnt(maxKey, 0), sa2(n); 8 for (int i = 0; i < n; ++i) { 9 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 10 ++cnt[key]; 11 } 12 for (int i = 1; i < maxKey; ++i) cnt[i] += cnt[i - 1]; 13 for (int i = n - 1; i >= 0; --i) { 14 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 15 sa2[--cnt[key]] = sa[i]; 16 } 17 sa.swap(sa2); 18 } 19 20 vector<int> build_suffix_array(const string& s) { 21 int n = (int)s.size(); 22 vector<int> sa(n), rnk(n); 23 if (n == 0) return sa; 24 for (int i = 0; i < n; ++i) sa[i] = i, rnk[i] = (unsigned char)s[i]; 25 for (int k = 1; k < n; k <<= 1) { 26 counting_sort_by_key(sa, rnk, k); 27 counting_sort_by_key(sa, rnk, 0); 28 vector<int> newr(n); 29 newr[sa[0]] = 0; 30 int classes = 1; 31 for (int i = 1; i < n; ++i) { 32 int a = sa[i - 1], b = sa[i]; 33 int a1 = rnk[a], b1 = rnk[b]; 34 int a2 = (a + k < n) ? rnk[a + k] : -1; 35 int b2 = (b + k < n) ? rnk[b + k] : -1; 36 if (a1 != b1 || a2 != b2) ++classes; 37 newr[b] = classes - 1; 38 } 39 rnk.swap(newr); 40 if (classes == n) break; 41 } 42 return sa; 43 } 44 45 // Compare suffix s[pos..] with pattern p. Return -1 if suffix < p, 0 if suffix starts with p, +1 if suffix > p. 46 int compare_suffix_with_pattern(const string& s, int pos, const string& p) { 47 int n = (int)s.size(), m = (int)p.size(); 48 for (int i = 0; i < m; ++i) { 49 if (pos + i >= n) return -1; // suffix exhausted first => suffix < pattern 50 unsigned char sc = (unsigned char)s[pos + i]; 51 unsigned char pc = (unsigned char)p[i]; 52 if (sc < pc) return -1; 53 if (sc > pc) return +1; 54 } 55 return 0; // pattern fully matched as a prefix of suffix 56 } 57 58 // Find the range [L, R) in SA such that all suffixes start with pattern p. 59 pair<int,int> find_occurrence_range(const string& s, const vector<int>& sa, const string& p) { 60 int n = (int)sa.size(); 61 int lo = 0, hi = n; // lower_bound for first suffix >= p 62 while (lo < hi) { 63 int mid = (lo + hi) / 2; 64 int cmp = compare_suffix_with_pattern(s, sa[mid], p); 65 if (cmp >= 0) hi = mid; else lo = mid + 1; 66 } 67 int L = lo; 68 lo = 0; hi = n; // upper_bound for first suffix > p as prefix (i.e., not starting with p) 69 while (lo < hi) { 70 int mid = (lo + hi) / 2; 71 int cmp = compare_suffix_with_pattern(s, sa[mid], p); 72 if (cmp > 0) hi = mid; else lo = mid + 1; 73 } 74 int R = lo; 75 // Check if any match exists 76 if (L >= R) return {-1, -1}; 77 return {L, R}; 78 } 79 80 int main() { 81 ios::sync_with_stdio(false); 82 cin.tie(nullptr); 83 84 string text, pattern; 85 if (!(cin >> text >> pattern)) return 0; 86 auto sa = build_suffix_array(text); 87 auto range = find_occurrence_range(text, sa, pattern); 88 89 if (range.first == -1) { 90 cout << "No occurrences found\n"; 91 } else { 92 cout << "Occurrences at positions (0-based):\n"; 93 for (int i = range.first; i < range.second; ++i) { 94 cout << sa[i] << (i + 1 < range.second ? ' ' : '\n'); 95 } 96 } 97 return 0; 98 } 99
This program builds the suffix array for a given text and finds all occurrences of a pattern by binary searching the lexicographic range of suffixes that have the pattern as a prefix. Comparisons are done character-by-character without creating substrings, giving O(m log n) time per query.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 static void counting_sort_by_key(vector<int>& sa, const vector<int>& rnk, int k) { 5 int n = (int)sa.size(); 6 int maxKey = max(256, n) + 1; 7 vector<int> cnt(maxKey, 0), sa2(n); 8 for (int i = 0; i < n; ++i) { 9 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 10 ++cnt[key]; 11 } 12 for (int i = 1; i < maxKey; ++i) cnt[i] += cnt[i - 1]; 13 for (int i = n - 1; i >= 0; --i) { 14 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 15 sa2[--cnt[key]] = sa[i]; 16 } 17 sa.swap(sa2); 18 } 19 20 vector<int> build_suffix_array(const string& s) { 21 int n = (int)s.size(); 22 vector<int> sa(n), rnk(n); 23 if (n == 0) return sa; 24 for (int i = 0; i < n; ++i) sa[i] = i, rnk[i] = (unsigned char)s[i]; 25 for (int k = 1; k < n; k <<= 1) { 26 counting_sort_by_key(sa, rnk, k); 27 counting_sort_by_key(sa, rnk, 0); 28 vector<int> newr(n); 29 newr[sa[0]] = 0; 30 int classes = 1; 31 for (int i = 1; i < n; ++i) { 32 int a = sa[i - 1], b = sa[i]; 33 int a1 = rnk[a], b1 = rnk[b]; 34 int a2 = (a + k < n) ? rnk[a + k] : -1; 35 int b2 = (b + k < n) ? rnk[b + k] : -1; 36 if (a1 != b1 || a2 != b2) ++classes; 37 newr[b] = classes - 1; 38 } 39 rnk.swap(newr); 40 if (classes == n) break; 41 } 42 return sa; 43 } 44 45 vector<int> build_lcp(const string& s, const vector<int>& sa) { 46 int n = (int)s.size(); 47 vector<int> lcp(max(0, n - 1)), rank(n); 48 if (n == 0) return lcp; 49 for (int i = 0; i < n; ++i) rank[sa[i]] = i; 50 int h = 0; 51 for (int i = 0; i < n; ++i) { 52 int r = rank[i]; 53 if (r == n - 1) { h = 0; continue; } 54 int j = sa[r + 1]; 55 while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h; 56 lcp[r] = h; 57 if (h) --h; 58 } 59 return lcp; 60 } 61 62 int main() { 63 ios::sync_with_stdio(false); 64 cin.tie(nullptr); 65 66 string s; cin >> s; 67 auto sa = build_suffix_array(s); 68 auto lcp = build_lcp(s, sa); 69 70 long long n = (long long)s.size(); 71 long long total = n * (n + 1) / 2; // total substrings (with duplicates) 72 long long overlap = 0; 73 for (int x : lcp) overlap += x; 74 long long distinct = total - overlap; 75 76 cout << distinct << '\n'; 77 return 0; 78 } 79
The number of distinct substrings equals total substrings minus duplicated prefixes captured by LCP. After building SA and LCP, we compute n(n+1)/2 − sum(LCP) using 64-bit integers to avoid overflow.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 static void counting_sort_by_key(vector<int>& sa, const vector<int>& rnk, int k) { 5 int n = (int)sa.size(); 6 int maxKey = max(256, n) + 1; 7 vector<int> cnt(maxKey, 0), sa2(n); 8 for (int i = 0; i < n; ++i) { 9 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 10 ++cnt[key]; 11 } 12 for (int i = 1; i < maxKey; ++i) cnt[i] += cnt[i - 1]; 13 for (int i = n - 1; i >= 0; --i) { 14 int key = (sa[i] + k < n) ? (rnk[sa[i] + k] + 1) : 0; 15 sa2[--cnt[key]] = sa[i]; 16 } 17 sa.swap(sa2); 18 } 19 20 vector<int> build_suffix_array(const string& s) { 21 int n = (int)s.size(); 22 vector<int> sa(n), rnk(n); 23 if (n == 0) return sa; 24 for (int i = 0; i < n; ++i) sa[i] = i, rnk[i] = (unsigned char)s[i]; 25 for (int k = 1; k < n; k <<= 1) { 26 counting_sort_by_key(sa, rnk, k); 27 counting_sort_by_key(sa, rnk, 0); 28 vector<int> newr(n); 29 newr[sa[0]] = 0; 30 int classes = 1; 31 for (int i = 1; i < n; ++i) { 32 int a = sa[i - 1], b = sa[i]; 33 int a1 = rnk[a], b1 = rnk[b]; 34 int a2 = (a + k < n) ? rnk[a + k] : -1; 35 int b2 = (b + k < n) ? rnk[b + k] : -1; 36 if (a1 != b1 || a2 != b2) ++classes; 37 newr[b] = classes - 1; 38 } 39 rnk.swap(newr); 40 if (classes == n) break; 41 } 42 return sa; 43 } 44 45 vector<int> build_lcp(const string& s, const vector<int>& sa) { 46 int n = (int)s.size(); 47 vector<int> lcp(max(0, n - 1)), rank(n); 48 if (n == 0) return lcp; 49 for (int i = 0; i < n; ++i) rank[sa[i]] = i; 50 int h = 0; 51 for (int i = 0; i < n; ++i) { 52 int r = rank[i]; 53 if (r == n - 1) { h = 0; continue; } 54 int j = sa[r + 1]; 55 while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h; 56 lcp[r] = h; 57 if (h) --h; 58 } 59 return lcp; 60 } 61 62 struct SparseTable { 63 // RMQ for min on LCP array 64 int n, K; 65 vector<int> lg; 66 vector<vector<int>> st; 67 SparseTable() {} 68 explicit SparseTable(const vector<int>& a) { build(a); } 69 70 void build(const vector<int>& a) { 71 n = (int)a.size(); 72 if (n == 0) return; 73 K = 32 - __builtin_clz(n); // ceil(log2(n)) 74 lg.assign(n + 1, 0); 75 for (int i = 2; i <= n; ++i) lg[i] = lg[i/2] + 1; 76 st.assign(K, vector<int>(n)); 77 st[0] = a; 78 for (int k = 1; k < K; ++k) { 79 int len = 1 << k; 80 int half = len >> 1; 81 for (int i = 0; i + len <= n; ++i) { 82 st[k][i] = min(st[k-1][i], st[k-1][i + half]); 83 } 84 } 85 } 86 87 // Query min on [l, r] inclusive 88 int query(int l, int r) const { 89 if (l > r) swap(l, r); 90 int len = r - l + 1; 91 int k = 31 - __builtin_clz(len); 92 return min(st[k][l], st[k][r - (1 << k) + 1]); 93 } 94 }; 95 96 int main() { 97 ios::sync_with_stdio(false); 98 cin.tie(nullptr); 99 100 string s; cin >> s; 101 auto sa = build_suffix_array(s); 102 auto lcp = build_lcp(s, sa); 103 104 int n = (int)s.size(); 105 vector<int> rank(n); 106 for (int i = 0; i < n; ++i) rank[sa[i]] = i; 107 108 SparseTable st(lcp); // RMQ over LCP 109 110 int q; cin >> q; 111 // Each query: positions i, j (0-based). Output LCP of suffixes s[i..] and s[j..]. 112 while (q--) { 113 int i, j; cin >> i >> j; 114 if (i == j) { 115 cout << (n - i) << '\n'; 116 continue; 117 } 118 int ri = rank[i], rj = rank[j]; 119 int L = min(ri, rj), R = max(ri, rj) - 1; // LCP indices are between SA positions 120 if ((int)lcp.size() == 0) { cout << 0 << '\n'; continue; } 121 int ans = st.query(L, R); 122 cout << ans << '\n'; 123 } 124 return 0; 125 } 126
This program augments SA + LCP with a Sparse Table over the LCP array to answer LCP(i, j) for any two suffixes in O(1). The RMQ is performed on the interval between the ranks of i and j in SA.