Suffix Automaton - Advanced Usage
Key Points
- •A suffix automaton (SAM) is a compact DFA that captures all distinct substrings of a string and supports many advanced queries in linear time.
- •Occurrences of a pattern in the text equal the size of its end-position set, which you can compute as a frequency propagated along suffix links.
- •The number of distinct substrings equals the sum over states of len[v] - len[link[v]], which counts how many new lengths each state contributes.
- •The K-th lexicographically smallest substring can be found by DP on the DAG of SAM transitions and guided traversal from the initial state.
- •The longest common substring between strings is found by walking the second (or many) strings over the SAM of the first and tracking matched lengths.
- •DP on the SAM (often in decreasing order of len) is the key to attach counts, maximize/minimize scores, or filter by frequency constraints.
- •SAM builds in O(n) time and space, has at most 2n-1 states, and answers pattern queries in O(m) where m is pattern length.
- •Implementation pitfalls include forgetting to propagate counts in topological order, off-by-one errors around len and suffix links, and unsorted transitions for lexicographic tasks.
Prerequisites
- →Deterministic finite automata (DFA) — SAM is a minimal DFA; understanding states, transitions, and determinism clarifies why each substring maps to a unique path.
- →Suffix link/suffix tree intuition — Suffix links encode the relation between contexts; they are crucial for construction and DP propagation.
- →Big-O notation and amortized analysis — SAM’s linear bounds rely on amortized arguments about cloning and transitions.
- →Topological order and DP on DAGs — Counting paths and propagating counts along suffix links require DAG DP in length order.
- →Lexicographic order — K-th substring selection depends on sorting transitions and reasoning about lexicographic skips.
- →String processing basics — Indexing, substrings, prefixes/suffixes, and character encodings are used throughout.
- →Hash maps vs fixed arrays — Choosing transition storage affects performance; arrays are best for small alphabets.
- →Proof by invariants — LCS walking logic maintains invariants on matched length when climbing suffix links.
Detailed Explanation
Tap terms for definitions01Overview
A suffix automaton (SAM) is a deterministic finite automaton built from a string S that recognizes exactly the set of substrings of S. It compresses all substrings into a minimal structure where each state represents an equivalence class of end positions (endpos) of substrings. The SAM has three main components per state: len[v] (the length of the longest string in the class), link[v] (a suffix link pointing to a state representing the longest proper suffix class), and next[v] (transitions labeled by characters). Remarkably, a SAM for a string of length n has at most 2n-1 states, so it can be built and stored in linear time and space. This compactness makes it a powerful backbone for string problems that need to count, order, or compare substrings. Advanced usages include counting occurrences of any pattern P in O(|P|), computing the number of distinct substrings, finding the K-th lexicographically smallest substring, computing the longest common substring (LCS) of multiple strings via a generalized SAM, and dynamic programming over the SAM’s DAG to aggregate counts and optimize over constrained substrings. Because transitions are deterministic, each distinct substring corresponds to exactly one path from the initial state; this fact underlies many DP solutions. The combination of suffix links and length-bounded contributions provides natural formulas to count or filter substrings by frequency.
02Intuition & Analogies
Imagine writing every substring of a text on sticky notes and grouping notes that end at the same positions in the text. Many notes are duplicates (the same substring appears in different places), and many are suffixes of one another. A suffix automaton is like a smart filing system that merges these notes into folders (states) where each folder represents all substrings that end at the same set of positions. Each folder knows the length of the longest note it contains (len), and it has a pointer (suffix link) to the folder that would contain the note if you chop off one character from the front as far as possible while staying in the same group structure. Now picture a subway map. The starting station is the empty string. Each labeled rail (transition) takes you to the next folder by appending a character. Any substring is a unique route starting at the empty station, moving along rails labeled by its characters. Because the map is deterministic, there is exactly one way to traverse a given substring, which is why counting distinct substrings becomes counting distinct paths. The suffix links are like emergency staircases that let you jump back to a simpler folder representing shorter contexts when a track you want isn’t available—this is the trick used when matching patterns or other strings quickly against the SAM. For lexicographic tasks (like the K-th substring), if you sort the rails leaving each station alphabetically and know how many paths lie beyond each rail, you can skip entire blocks of substrings without enumerating them. For counting occurrences, you can imagine each time a folder is reached during the text’s construction you drop a marble there; after building, you tilt the map along suffix links so marbles roll backward, adding up counts to represent how many end positions each class owns.
03Formal Definition
04When to Use
Use a suffix automaton when you need fast, repeated substring queries on a fixed text S: pattern occurrence counts in O(|P|), checking membership of a substring, or answering how many distinct substrings exist. It shines in competitive programming for problems like: computing the K-th lexicographically smallest substring without enumerating all substrings; longest common substring (2 or many strings) by scanning other strings against the SAM of the first and tracking matched lengths; counting substrings that occur at least K times by combining endpos sizes with length gaps; and accumulating statistics (e.g., number of substrings starting with a given prefix) via DP on the DAG. If you have multiple queries on one text, SAM typically beats suffix arrays or trees in coding complexity while retaining linear performance. When memory must be strictly linear in |S| and the alphabet is modest (e.g., lowercase letters), SAM is a robust choice. For very large alphabets, maps may slow transitions but correctness remains. For tasks that need lexicographic ranks or suffix order explicitly, a suffix array might be more direct; for substring automata with all substrings and on-the-fly minimality, SAM is ideal.
⚠️Common Mistakes
- Forgetting to propagate occurrence counts in decreasing order of len. Raw cnt[v] after building only counts times state v was the active end; you must bucket-sort states by len and do cnt[link[v]] += cnt[v] to get |endpos(v)|.
- Confusing len[v] with the number of substrings at state v. The contribution of state v to distinct substrings is len[v] - len[link[v]], not len[v] itself.
- Off-by-one in K-th substring DP: decide whether to count the empty substring. The standard approach excludes the empty substring, so the path contribution for an edge is 1 + dp[next]. Sort outgoing edges lexicographically.
- Mishandling transitions when matching another string for LCS: after failing a transition, while-climb via suffix links and shrink the matched length to len[cur]; do not let the length exceed len[cur] after jumps.
- Using unordered_map transitions for very tight time limits without reserving buckets or switching to fixed arrays when the alphabet is small; this can cause TLE due to hashing overhead.
- Forgetting to clone during construction or copying cnt/len incorrectly when cloning; clones must copy transitions, inherit link, set len to len[p] + 1, and never receive initial cnt=1 (clones get cnt=0 initially).
- Ignoring multiple test cases: always reinitialize the SAM and auxiliary arrays per test case.
- Recursion depth in DFS DP for large n: prefer iterative/topological DP or ensure the environment can handle deep recursion.
Key Formulas
State Bound
Explanation: A suffix automaton for a string of length n has at most 2n-1 states. This guarantees linear space and underpins O(n) construction time.
Transition Bound
Explanation: The total number of transitions is linear in n. This ensures that DP or propagation steps over all edges remain linear-time.
Distinct Substrings
Explanation: Each state v contributes all lengths between len[link[v]]+1 and len[v], which are new compared to its suffix link. Summing these gaps counts all distinct substrings exactly once.
Endpos Propagation
Explanation: After construction, cnt[v] initially counts how many times v was last during extension. Propagating counts from longer to shorter via suffix links yields |endpos(v)|, the number of occurrences of substrings in v’s class.
Pattern Occurrences
Explanation: Following P’s characters from the initial state yields a state that represents P. Its propagated count equals the number of times P appears in the text.
Path Counting for Substrings
Explanation: Counting paths starting at v (excluding the empty path) yields the number of substrings represented by paths from v. This supports K-th lexicographic substring selection.
LCS Matching Invariant
Explanation: When walking another string over the SAM, we extend matches on existing transitions and, on failure, climb suffix links while shrinking the current matched length to remain valid. The maximum L over the scan is the LCS length.
Frequent Substrings
Explanation: All substrings represented by state v (the length range it contributes) share the same endpos count. If cnt[v] K, all those substrings occur at least K times, contributing the full gap.
Construction Complexity
Explanation: SAM builds in linear time and space with respect to the string length n. Each extension creates at most one new state and possibly one clone.
Pattern Query Cost
Explanation: Checking a pattern or counting its occurrences is linear in the pattern length, since it follows at most one deterministic transition per character.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Suffix Automaton for lowercase 'a'..'z'. 5 // Supports occurrence counting via endpos sizes. 6 struct SAM { 7 struct State { 8 int next[26]; // transitions 9 int link; // suffix link 10 int len; // length of the longest string in this state 11 long long cnt; // number of end positions (after propagation) 12 State() : link(-1), len(0), cnt(0) { fill(begin(next), end(next), -1); } 13 }; 14 15 vector<State> st; 16 int last; // index of the state representing the whole current string 17 18 SAM(int maxlen = 0) { st.reserve(2 * maxlen); st.push_back(State()); last = 0; } 19 20 void extend(char ch) { 21 int c = ch - 'a'; 22 int cur = (int)st.size(); 23 st.push_back(State()); 24 st[cur].len = st[last].len + 1; 25 st[cur].cnt = 1; // Each new end contributes 1 occurrence initially 26 27 int p = last; 28 while (p != -1 && st[p].next[c] == -1) { 29 st[p].next[c] = cur; 30 p = st[p].link; 31 } 32 if (p == -1) { 33 st[cur].link = 0; 34 } else { 35 int q = st[p].next[c]; 36 if (st[p].len + 1 == st[q].len) { 37 st[cur].link = q; 38 } else { 39 int clone = (int)st.size(); 40 st.push_back(st[q]); // copy transitions, link, len, cnt (cnt will be 0 for clone) 41 st[clone].len = st[p].len + 1; 42 st[clone].cnt = 0; // clone does not represent new endpos directly 43 while (p != -1 && st[p].next[c] == q) { 44 st[p].next[c] = clone; 45 p = st[p].link; 46 } 47 st[q].link = st[cur].link = clone; 48 } 49 } 50 last = cur; 51 } 52 53 // Propagate cnt along suffix links in decreasing order of len to get |endpos| 54 void propagate_counts() { 55 int maxLen = 0; 56 for (auto &s : st) maxLen = max(maxLen, s.len); 57 vector<int> cntLen(maxLen + 1, 0); 58 for (auto &s : st) cntLen[s.len]++; 59 for (int i = 1; i <= maxLen; ++i) cntLen[i] += cntLen[i-1]; 60 vector<int> order(st.size()); 61 for (int i = (int)st.size() - 1; i >= 0; --i) { 62 order[--cntLen[st[i].len]] = i; 63 } 64 for (int i = (int)order.size() - 1; i >= 0; --i) { 65 int v = order[i]; 66 int p = st[v].link; 67 if (p != -1) st[p].cnt += st[v].cnt; 68 } 69 } 70 71 // Count occurrences of pattern P in the text used to build the SAM. 72 long long count_occurrences(const string &P) const { 73 int v = 0; 74 for (char ch : P) { 75 int c = ch - 'a'; 76 if (c < 0 || c >= 26) return 0; // out of alphabet 77 v = st[v].next[c]; 78 if (v == -1) return 0; 79 } 80 return st[v].cnt; // |endpos(P)| 81 } 82 }; 83 84 int main() { 85 ios::sync_with_stdio(false); 86 cin.tie(nullptr); 87 88 string S; int Q; 89 if (!(cin >> S >> Q)) return 0; 90 SAM sam((int)S.size()); 91 for (char ch : S) sam.extend(ch); 92 sam.propagate_counts(); 93 94 while (Q--) { 95 string P; cin >> P; 96 cout << sam.count_occurrences(P) << '\n'; 97 } 98 return 0; 99 } 100
We build a SAM for S (lowercase alphabet). Each extension sets cnt=1 for the newly created terminal state, then we propagate counts along suffix links in decreasing order of len to compute |endpos(v)| for all states. For a query pattern P, we follow transitions from the initial state; if traversal succeeds, the state’s cnt gives the total number of occurrences of P in S.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct SAM { 5 struct State { 6 int next[26]; 7 int link; 8 int len; 9 State() : link(-1), len(0) { fill(begin(next), end(next), -1); } 10 }; 11 vector<State> st; 12 int last; 13 14 SAM(int maxlen = 0) { st.reserve(2 * maxlen); st.push_back(State()); last = 0; } 15 16 void extend(char ch) { 17 int c = ch - 'a'; 18 int cur = (int)st.size(); st.push_back(State()); 19 st[cur].len = st[last].len + 1; 20 int p = last; 21 while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; } 22 if (p == -1) st[cur].link = 0; 23 else { 24 int q = st[p].next[c]; 25 if (st[p].len + 1 == st[q].len) st[cur].link = q; 26 else { 27 int clone = (int)st.size(); st.push_back(st[q]); 28 st[clone].len = st[p].len + 1; 29 // keep link and transitions; no cnt needed here 30 while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; } 31 st[q].link = st[cur].link = clone; 32 } 33 } 34 last = cur; 35 } 36 37 long long count_distinct_substrings() const { 38 long long ans = 0; 39 for (size_t v = 1; v < st.size(); ++v) ans += (st[v].len - st[st[v].link].len); 40 return ans; 41 } 42 43 // DP to count the number of distinct substrings starting from each state. 44 // dp[v] = sum over edges (1 + dp[to]). 1 counts the single-edge path. 45 vector<long long> dp; 46 long long dfs_count_paths(int v) { 47 long long &res = dp[v]; 48 if (res != -1) return res; 49 res = 0; 50 for (int c = 0; c < 26; ++c) if (st[v].next[c] != -1) { 51 int u = st[v].next[c]; 52 res += 1; // the substring formed by taking this edge only 53 res += dfs_count_paths(u); // plus longer substrings continuing from u 54 if (res < 0) { /* overflow guard not needed for 64-bit up to ~1e10 */ } 55 } 56 return res; 57 } 58 59 // Return the K-th lexicographically smallest substring (1-indexed). If K is invalid, return "". 60 string kth_substring(long long K) { 61 dp.assign(st.size(), -1); 62 dfs_count_paths(0); // compute dp[0] = total number of distinct substrings 63 if (K <= 0 || K > dp[0]) return string(); 64 string ans; 65 int v = 0; 66 while (K > 0) { 67 for (int c = 0; c < 26; ++c) if (st[v].next[c] != -1) { 68 int u = st[v].next[c]; 69 long long block = 1 + (dp[u] == -1 ? 0 : dp[u]); 70 if (K > block) { 71 K -= block; // skip all substrings starting with this edge 72 } else { 73 ans.push_back(char('a' + c)); 74 K -= 1; // we used the single-edge path 75 v = u; 76 break; // continue to refine longer substrings if K > 0 77 } 78 } 79 if ((int)ans.size() == 0 && v == 0 && K > 0) { 80 // No more edges (should not happen if K was valid) 81 return string(); 82 } 83 if (K == 0) break; 84 } 85 return ans; 86 } 87 }; 88 89 int main() { 90 ios::sync_with_stdio(false); 91 cin.tie(nullptr); 92 93 string S; long long K; if (!(cin >> S >> K)) return 0; 94 SAM sam((int)S.size()); 95 for (char ch : S) sam.extend(ch); 96 97 cout << sam.count_distinct_substrings() << '\n'; 98 string kth = sam.kth_substring(K); 99 if (kth.empty()) cout << "-1\n"; else cout << kth << '\n'; 100 return 0; 101 } 102
We build a SAM and compute the number of distinct substrings via the standard sum of length gaps. To get the K-th lexicographically smallest substring, we do a DP that counts how many substrings start from each state (number of paths) and then traverse from the initial state, exploring outgoing edges in 'a'..'z' order while skipping entire blocks using the DP counts. We exclude the empty substring (paths must take at least one edge).
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct SAM { 5 struct State { 6 int next[26]; 7 int link; 8 int len; 9 State() : link(-1), len(0) { fill(begin(next), end(next), -1); } 10 }; 11 vector<State> st; 12 int last; 13 14 SAM(int maxlen = 0) { st.reserve(2 * maxlen); st.push_back(State()); last = 0; } 15 16 void extend(char ch) { 17 int c = ch - 'a'; 18 int cur = (int)st.size(); st.push_back(State()); 19 st[cur].len = st[last].len + 1; 20 int p = last; 21 while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; } 22 if (p == -1) st[cur].link = 0; 23 else { 24 int q = st[p].next[c]; 25 if (st[p].len + 1 == st[q].len) st[cur].link = q; 26 else { 27 int clone = (int)st.size(); st.push_back(st[q]); 28 st[clone].len = st[p].len + 1; 29 while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; } 30 st[q].link = st[cur].link = clone; 31 } 32 } 33 last = cur; 34 } 35 36 // LCS length with another string T by walking over the SAM of S. 37 int lcs_two(const string &T) const { 38 int v = 0, L = 0, best = 0; 39 for (char ch : T) { 40 int c = ch - 'a'; 41 if (c < 0 || c >= 26) { v = 0; L = 0; continue; } 42 if (st[v].next[c] != -1) { 43 v = st[v].next[c]; 44 L++; 45 } else { 46 while (v != -1 && st[v].next[c] == -1) v = st[v].link; 47 if (v == -1) { v = 0; L = 0; continue; } 48 L = st[v].len + 1; 49 v = st[v].next[c]; 50 } 51 best = max(best, L); 52 } 53 return best; 54 } 55 56 // Generalized LCS length with multiple strings: intersect per-state maximum matched lengths. 57 int lcs_multi(const vector<string> &others) const { 58 int nStates = (int)st.size(); 59 vector<int> g(nStates); // global minimum over strings of their per-state max match length 60 for (int i = 0; i < nStates; ++i) g[i] = st[i].len; // initialize to len[v] 61 62 // Build order by len for suffix-link propagation 63 int maxLen = 0; for (auto &s: st) maxLen = max(maxLen, s.len); 64 vector<int> cntLen(maxLen+1), order(nStates); 65 for (auto &s: st) cntLen[s.len]++; 66 for (int i = 1; i <= maxLen; ++i) cntLen[i] += cntLen[i-1]; 67 for (int i = nStates - 1; i >= 0; --i) order[--cntLen[st[i].len]] = i; 68 69 for (const string &T : others) { 70 vector<int> mx(nStates, 0); 71 int v = 0, L = 0; 72 for (char ch : T) { 73 int c = ch - 'a'; 74 if (c < 0 || c >= 26) { v = 0; L = 0; continue; } 75 if (st[v].next[c] != -1) { 76 v = st[v].next[c]; 77 L++; 78 } else { 79 while (v != -1 && st[v].next[c] == -1) v = st[v].link; 80 if (v == -1) { v = 0; L = 0; continue; } 81 L = st[v].len + 1; 82 v = st[v].next[c]; 83 } 84 mx[v] = max(mx[v], L); 85 } 86 // Propagate mx to suffix links (decreasing len) 87 for (int i = nStates - 1; i >= 1; --i) { 88 int u = order[i]; 89 int p = st[u].link; 90 if (p != -1) mx[p] = max(mx[p], min(mx[u], st[p].len)); 91 } 92 // Intersect: global min with per-string mx 93 for (int vtx = 0; vtx < nStates; ++vtx) g[vtx] = min(g[vtx], mx[vtx]); 94 } 95 int ans = 0; 96 for (int v = 0; v < nStates; ++v) ans = max(ans, g[v]); 97 return ans; 98 } 99 }; 100 101 int main() { 102 ios::sync_with_stdio(false); 103 cin.tie(nullptr); 104 105 string S; int m; 106 if (!(cin >> S >> m)) return 0; 107 SAM sam((int)S.size()); 108 for (char ch : S) sam.extend(ch); 109 110 if (m == 1) { 111 string T; cin >> T; 112 cout << sam.lcs_two(T) << '\n'; 113 } else { 114 vector<string> others(m); 115 for (int i = 0; i < m; ++i) cin >> others[i]; 116 cout << sam.lcs_multi(others) << '\n'; 117 } 118 return 0; 119 } 120
We build a SAM for S, then compute LCS. For two strings, we walk T over the SAM: on success, extend the matched length; on failure, climb suffix links and shrink L to len[cur] before resuming. For multiple strings, for each T we record per-state maximum matched length during the scan (mx), propagate mx to suffix links in decreasing len order (capping by len[link]), and intersect across all strings by taking the minimum in g. The final answer is the maximum g[v].
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct SAM { 5 struct State { 6 int next[26]; 7 int link; 8 int len; 9 long long cnt; // |endpos| after propagation 10 State() : link(-1), len(0), cnt(0) { fill(begin(next), end(next), -1); } 11 }; 12 vector<State> st; 13 int last; 14 15 SAM(int maxlen = 0) { st.reserve(2 * maxlen); st.push_back(State()); last = 0; } 16 17 void extend(char ch) { 18 int c = ch - 'a'; 19 int cur = (int)st.size(); st.push_back(State()); 20 st[cur].len = st[last].len + 1; 21 st[cur].cnt = 1; // new end position 22 int p = last; 23 while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; } 24 if (p == -1) st[cur].link = 0; 25 else { 26 int q = st[p].next[c]; 27 if (st[p].len + 1 == st[q].len) st[cur].link = q; 28 else { 29 int clone = (int)st.size(); st.push_back(st[q]); 30 st[clone].len = st[p].len + 1; 31 st[clone].cnt = 0; // clone doesn't add new endpos 32 while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; } 33 st[q].link = st[cur].link = clone; 34 } 35 } 36 last = cur; 37 } 38 39 long long count_at_least_K(long long K) { 40 int maxLen = 0; for (auto &s : st) maxLen = max(maxLen, s.len); 41 vector<int> cntLen(maxLen+1), order(st.size()); 42 for (auto &s : st) cntLen[s.len]++; 43 for (int i = 1; i <= maxLen; ++i) cntLen[i] += cntLen[i-1]; 44 for (int i = (int)st.size() - 1; i >= 0; --i) order[--cntLen[st[i].len]] = i; 45 for (int i = (int)order.size() - 1; i >= 0; --i) { 46 int v = order[i]; 47 int p = st[v].link; 48 if (p != -1) st[p].cnt += st[v].cnt; // propagate endpos sizes 49 } 50 long long ans = 0; 51 for (size_t v = 1; v < st.size(); ++v) if (st[v].cnt >= K) { 52 ans += (st[v].len - st[st[v].link].len); 53 } 54 return ans; 55 } 56 }; 57 58 int main() { 59 ios::sync_with_stdio(false); 60 cin.tie(nullptr); 61 62 string S; long long K; 63 if (!(cin >> S >> K)) return 0; 64 SAM sam((int)S.size()); 65 for (char ch : S) sam.extend(ch); 66 cout << sam.count_at_least_K(K) << '\n'; 67 return 0; 68 } 69
We build a SAM with initial cnt=1 for each new last state. After bucket-sorting by len, we propagate cnt from longer to shorter states along suffix links to get |endpos|. The number of substrings that occur at least K times is the sum over states with cnt[v] ≥ K of (len[v] − len[link[v]]), since all substrings represented by the length gap share the same frequency.