Suffix Automaton
Key Points
- •A suffix automaton (SAM) is the minimal deterministic finite automaton that recognizes all substrings of a string, built online in O(n) time and space.
- •Each SAM state compactly represents an equivalence class of end positions (endpos) of substrings that share the same set of right contexts.
- •Suffix links connect states to the longest proper suffix class and form a tree when oriented from longer to shorter lengths.
- •The number of states is at most 2n−1 and the number of transitions is O(n), which makes the structure extremely memory-efficient.
- •Counting distinct substrings is a one-line sum over states: sum(len[v] − len[link[v]]).
- •Substring occurrence counts (frequencies) are obtained by a single topological propagation along suffix links.
- •Pattern membership and frequency queries run in O(m) where m is the pattern length.
- •SAM supports advanced tasks such as longest common substring, k-th lexicographic substring, and many combinatorial substring counts.
Prerequisites
- →Basic string operations and substrings — Understanding substrings, indices, and contiguous segments is essential for interpreting what SAM recognizes.
- →Finite automata (DFA/NFA) — SAM is a minimal DFA; familiarity with states, transitions, and determinism clarifies the structure.
- →Failure links (e.g., KMP) — Suffix links in SAM conceptually resemble failure links and help build intuition about fallbacks.
- →Topological ordering — Occurrence propagation requires processing states in decreasing length, akin to a topological order.
- →Data structure trade-offs (arrays vs hash maps) — Choosing the right transition representation impacts performance and memory.
- →DP on DAGs — Lexicographic counting (k-th substring) uses DP over the acyclic transition graph.
Detailed Explanation
Tap terms for definitions01Overview
Imagine compressing all substrings of a string S into a single automaton that you can scan like a small map: every path from the start spells a substring, and every state summarizes many similar substrings at once. That is what the suffix automaton (SAM) offers. It is a deterministic finite automaton (DFA) that recognizes exactly the set of all substrings of S, and among all such DFAs it has the fewest states. The construction is online: you append characters one by one and update only a small part of the structure each time. This makes SAM practical for streaming data and incremental algorithms. The power of SAM comes from two facts. First, it merges substrings that behave the same with respect to future continuations (same right contexts), dramatically reducing size compared to naive tries. Second, it equips every state with a suffix link to the next-best match (the longest proper suffix class), which acts as a fast fallback pointer similar to KMP's failure link but for whole equivalence classes. As a result, many classic string problems become linear-time routines: counting distinct substrings, answering existence and frequency of a pattern, computing the longest common substring with another string, and extracting lexicographic statistics. Even though it works like a DFA, SAM is built using a clever clone-and-link strategy that preserves minimality without expensive global minimization. The result is a compact, elegant, and battle-tested tool in competitive programming and string algorithms.
02Intuition & Analogies
Think of writing a long word on a strip of paper. All its substrings are every contiguous piece you could tear from it. Listing them all is huge—there are O(n^2) of them. But many pieces behave the same way if you consider what letters can follow them inside the original word. For example, in "banana", the substrings "a" occurring at positions 2, 4, and 6 are different occurrences but share similar continuation opportunities. The suffix automaton groups substrings by this behavior: it creates a state for each group of substrings that can be extended by the same set of letters within S. Now imagine navigating a subway map. Stations are states, and colored tracks (edges) labeled with letters take you to new stations. Starting at the depot (initial state), any sequence of tracks you follow spells out a substring of S. If a track you want doesn't exist, you jump along a special backward corridor—the suffix link—to a nearby station that represents “drop the first character and try again.” This is your emergency route that keeps you moving efficiently without backtracking the whole trip. The online construction is like expanding the subway as new neighborhoods (characters) are added at the end of S. When a new letter arrives, we lay a small number of new tracks and sometimes copy an old station (a clone) if we must split two previously indistinguishable neighborhoods. Cloning might feel odd at first, but it preserves the rule that each station uniquely captures a behavior of right extensions. Despite the complexity under the hood, the passenger experience stays simple: paths are substrings, and the map remains compact—only O(n) stations and tracks for a length-n city.
03Formal Definition
04When to Use
Use a suffix automaton when you need fast substring queries and statistics on a single text S with many patterns:
- Pattern existence and frequency: For each query pattern P, walk the automaton in O(|P|). If you also propagate occurrence counts, you can return how many times P appears in S.
- Distinct substrings: Compute in O(n) using the suffix-link lengths formula. This avoids O(n^2) enumeration.
- Longest common substring (LCS): Build a SAM for A, then scan B once to track the longest path that stays in the automaton, yielding LCS in O(|A| + |B|).
- Lexicographic tasks: Count or enumerate distinct substrings in lexicographic order, or find the k-th lexicographic substring using DP over the DAG.
- Frequency-constrained statistics: With propagated endpos sizes, count how many distinct substrings occur at least k times. Choose SAM over suffix arrays/trees when you prefer online construction, need many membership/frequency queries, or want minimal DFA semantics. Prefer suffix arrays for global ordering of suffixes or tasks like full-text index with large alphabets under tight memory where compressed indexes shine. For lexicographically smallest rotation, specialized algorithms like Booth are simpler, though SAM can help with related substring lexicographic queries.
⚠️Common Mistakes
- Confusing substrings with suffixes: Despite the name, a suffix automaton recognizes all substrings of S, not only suffixes. Suffix links relate classes by longest proper suffix, but language acceptance is all substrings.
- Forgetting to clone: During extension, if you would redirect transitions that break existing path lengths, you must create a clone state with copied transitions and adjusted len/link. Skipping cloning yields incorrect minimality and wrong answers.
- Wrong occurrence counts: To count how many times a substring appears, you must 1) set cnt[cur]++ for each newly added character's terminal state during construction, and 2) propagate counts from longer to shorter states in decreasing len order via cnt[link[v]] += cnt[v]. Using raw cnt without propagation undercounts.
- Misusing state counts for arbitrary prefixes: Occurrence count of a specific string P is cnt[v] at the state v reached after scanning P. Do not look at cnt of a suffix-link ancestor or of the end state of the whole text.
- Transition map pitfalls: For large or sparse alphabets, using fixed-size arrays wastes memory; for unordered_map, remember to reserve and be mindful of constant factors. For tiny alphabets, arrays are faster.
- Off-by-one in len and link usage: Distinct substring formula uses len[v] − len[link[v]]. Ensure link[initial] = −1 and treat len[−1] = 0 conceptually when summing from v ≠ initial.
- Mixing 0-based/1-based positions when storing first/last occurrence positions (firstpos). Be consistent if you need positions for reconstruction or advanced filters.
- Expecting lexicographic order “for free”: SAM is not a suffix array; lexicographic tasks need additional DP or ordered traversal of transitions.
Key Formulas
State Bound
Explanation: For a string of length n, the suffix automaton has at most 2n−1 states. This linear bound guarantees compactness compared to the O() number of substrings.
Transition Bound
Explanation: The total number of transitions in the SAM is linear in n. A classical tight bound is 3n−4 for alphabets of size at least 2, ensuring linear memory.
Distinct Substrings
Explanation: Each state v contributes the number of new lengths it introduces beyond its suffix link ancestor. Summing over all non-initial states counts each distinct substring exactly once.
Occurrence Propagation
Explanation: Initialize cnt at terminal states during construction, then propagate counts along suffix links from longer to shorter states. After this, cnt[v] equals the number of occurrences of any string ending at v.
Build Complexity
Explanation: The online SAM construction runs in linear time and space in the length n of S. Each character causes a constant expected number of state/edge updates and occasional cloning.
LCS via Traversal
Explanation: Scanning the second string, we extend matches when transitions exist; otherwise, we follow suffix links to shorten the current match. The maximum length L reached is the LCS length.
Count of Distinct Substrings from a State
Explanation: Each outgoing edge contributes one new substring (adding c) plus all longer substrings reachable thereafter. This DP enables lexicographic ranking like finding the k-th substring.
Total Substrings (for reference)
Explanation: A length-n string has n(n+1)/2 substrings counting duplicates. SAM compresses them into O(n) states for efficient distinct-substring operations.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Suffix Automaton for lowercase 'a'..'z' 5 struct SAM { 6 struct State { 7 int len = 0; // max length in this class 8 int link = -1; // suffix link 9 int next[26]; // transitions 10 long long cnt = 0; // occurrence count (endpos size after propagation) 11 int firstpos = -1; // one end position of the longest string in this state 12 State() { fill(begin(next), end(next), -1); } 13 }; 14 15 vector<State> st; 16 int last; // state representing the entire string processed so far 17 18 SAM(int maxlen = 0) { st.reserve(2 * maxlen); init(); } 19 20 void init() { 21 st.clear(); st.push_back(State()); 22 st[0].len = 0; st[0].link = -1; st[0].firstpos = -1; 23 last = 0; 24 } 25 26 static int id(char c) { return c - 'a'; } 27 28 void extend(char cc, int pos) { 29 int c = id(cc); 30 int cur = (int)st.size(); 31 st.push_back(State()); 32 st[cur].len = st[last].len + 1; 33 st[cur].firstpos = pos; // end position of the current full prefix 34 st[cur].cnt = 1; // new end position contributes 1 occurrence 35 36 int p = last; 37 while (p != -1 && st[p].next[c] == -1) { 38 st[p].next[c] = cur; 39 p = st[p].link; 40 } 41 if (p == -1) { 42 st[cur].link = 0; 43 } else { 44 int q = st[p].next[c]; 45 if (st[p].len + 1 == st[q].len) { 46 st[cur].link = q; 47 } else { 48 // clone q 49 int clone = (int)st.size(); 50 st.push_back(st[q]); // copy transitions, len, link, firstpos, cnt (cnt will be fixed by propagation) 51 st[clone].len = st[p].len + 1; // restrict max length 52 // firstpos of clone equals firstpos of q (both represent same end positions) 53 while (p != -1 && st[p].next[c] == q) { 54 st[p].next[c] = clone; 55 p = st[p].link; 56 } 57 st[q].link = st[cur].link = clone; 58 } 59 } 60 last = cur; 61 } 62 63 // Propagate occurrence counts along suffix links in decreasing len order 64 void propagate_counts() { 65 int maxLen = 0; for (auto &s : st) maxLen = max(maxLen, s.len); 66 vector<int> cntLen(maxLen + 1, 0); 67 for (auto &s : st) cntLen[s.len]++; 68 for (int i = 1; i <= maxLen; ++i) cntLen[i] += cntLen[i - 1]; 69 vector<int> order(st.size()); 70 for (int v = (int)st.size() - 1; v >= 0; --v) order[--cntLen[st[v].len]] = v; 71 // order: increasing len; process in reverse for decreasing len 72 for (int i = (int)order.size() - 1; i > 0; --i) { 73 int v = order[i]; 74 int p = st[v].link; 75 if (p != -1) st[p].cnt += st[v].cnt; 76 } 77 } 78 79 // Number of distinct substrings using len - len[link] 80 long long distinct_substrings() const { 81 long long ans = 0; 82 for (int v = 1; v < (int)st.size(); ++v) ans += st[v].len - st[st[v].link].len; 83 return ans; 84 } 85 86 // Occurrence count of a pattern P (assumes propagate_counts has been called) 87 long long count_occurrences(const string &P) const { 88 int v = 0; 89 for (char cc : P) { 90 int c = id(cc); 91 if (st[v].next[c] == -1) return 0; 92 v = st[v].next[c]; 93 } 94 return st[v].cnt; // occurrences of P in the text 95 } 96 }; 97 98 int main() { 99 ios::sync_with_stdio(false); 100 cin.tie(nullptr); 101 102 string s; cin >> s; // lowercase 103 SAM sam((int)s.size()); 104 for (int i = 0; i < (int)s.size(); ++i) sam.extend(s[i], i); 105 sam.propagate_counts(); 106 107 cout << "Distinct substrings: " << sam.distinct_substrings() << "\n"; 108 109 int q; cin >> q; 110 while (q--) { 111 string p; cin >> p; 112 cout << sam.count_occurrences(p) << "\n"; 113 } 114 return 0; 115 } 116
We build a SAM for a lowercase string, then compute distinct substrings using the classic sum over len[v] − len[link[v]]. We also propagate end-position counts (cnt) by processing states in decreasing length order. After propagation, the count at the state reached by a pattern equals its number of occurrences in the text. Queries then run in O(|P|).
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct SAM { 5 struct S { int len=0, link=-1; unordered_map<char,int> next; }; 6 vector<S> st; int last; 7 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); } 8 void init(){ st.clear(); st.push_back(S()); last=0; } 9 void extend(char c){ 10 int cur = (int)st.size(); st.push_back(S()); 11 st[cur].len = st[last].len + 1; 12 int p = last; 13 while(p!=-1 && !st[p].next.count(c)){ st[p].next[c]=cur; p=st[p].link; } 14 if(p==-1){ st[cur].link=0; } 15 else{ 16 int q = st[p].next[c]; 17 if(st[p].len + 1 == st[q].len){ st[cur].link = q; } 18 else{ 19 int clone = (int)st.size(); st.push_back(st[q]); 20 st[clone].len = st[p].len + 1; 21 while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link; } 22 st[q].link = st[cur].link = clone; 23 } 24 } 25 last = cur; 26 } 27 }; 28 29 int main(){ 30 ios::sync_with_stdio(false); cin.tie(nullptr); 31 string A, B; cin >> A >> B; 32 SAM sam((int)A.size()); for(char c: A) sam.extend(c); 33 34 int v = 0; int l = 0; // current state and match length 35 int best = 0; int endB = -1; 36 for(int i=0;i<(int)B.size();++i){ 37 char c = B[i]; 38 // Follow transitions if possible; otherwise follow suffix links 39 while(v!= -1 && !sam.st[v].next.count(c)){ 40 v = sam.st[v].link; 41 if(v!=-1) l = sam.st[v].len; 42 } 43 if(v==-1){ v=0; l=0; continue; } 44 v = sam.st[v].next[c]; 45 l++; 46 if(l>best){ best=l; endB=i; } 47 } 48 cout << best << "\n"; 49 if(best>0){ 50 cout << B.substr(endB - best + 1, best) << "\n"; 51 } 52 return 0; 53 } 54
Build a SAM for A. Then scan B once: extend the current match if there is a transition by the next character; if not, follow suffix links to find the longest suffix that can be extended. Track the longest length reached and its position to output both the length and an example substring.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // SAM with unordered_map transitions for general alphabets (e.g., bytes) 5 struct SAM { 6 struct State { int len=0, link=-1; unordered_map<int,int> next; long long cnt=0; }; 7 vector<State> st; int last; 8 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); } 9 void init(){ st.clear(); st.push_back(State()); last=0; } 10 void extend(int c){ 11 int cur=(int)st.size(); st.push_back(State()); 12 st[cur].len = st[last].len + 1; st[cur].cnt = 1; 13 int p = last; 14 while(p!=-1 && !st[p].next.count(c)) { st[p].next[c]=cur; p=st[p].link; } 15 if(p==-1){ st[cur].link=0; } 16 else{ 17 int q = st[p].next[c]; 18 if(st[p].len + 1 == st[q].len){ st[cur].link=q; } 19 else{ 20 int clone=(int)st.size(); st.push_back(st[q]); 21 st[clone].len = st[p].len + 1; st[clone].cnt = 0; // will be fixed by propagation 22 while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link; } 23 st[q].link = st[cur].link = clone; 24 } 25 } 26 last=cur; 27 } 28 void build_from_string(const string &s){ 29 for(unsigned char uc : s) extend((int)uc); 30 } 31 void propagate_counts(){ 32 int maxLen=0; for(auto &x:st) maxLen=max(maxLen,x.len); 33 vector<int> cntLen(maxLen+1,0); for(auto &x:st) cntLen[x.len]++; 34 for(int i=1;i<=maxLen;++i) cntLen[i]+=cntLen[i-1]; 35 vector<int> order(st.size()); 36 for(int v=(int)st.size()-1; v>=0; --v) order[--cntLen[st[v].len]]=v; 37 for(int i=(int)order.size()-1; i>0; --i){ int v=order[i]; int p=st[v].link; if(p!=-1) st[p].cnt += st[v].cnt; } 38 } 39 // Return {exists, occurrences} 40 pair<bool,long long> query(const string &p) const{ 41 int v=0; 42 for(unsigned char uc: p){ int c=(int)uc; auto it = st[v].next.find(c); if(it==st[v].next.end()) return {false,0}; v=it->second; } 43 return {true, st[v].cnt}; 44 } 45 }; 46 47 int main(){ 48 ios::sync_with_stdio(false); cin.tie(nullptr); 49 string text; getline(cin, text); // allow spaces and general bytes 50 SAM sam((int)text.size()); sam.build_from_string(text); sam.propagate_counts(); 51 52 int q; cin >> q; string pattern; string dummy; getline(cin,dummy); 53 while(q--){ getline(cin, pattern); auto [ok, occ]=sam.query(pattern); cout << (ok?"YES":"NO") << ' ' << occ << '\n'; } 54 return 0; 55 } 56
This variant uses unordered_map<int,int> transitions to support arbitrary byte values or large alphabets. After building and propagating occurrence counts, each query returns whether the pattern occurs and how many times. This demonstrates SAM’s flexibility beyond small fixed alphabets.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct SAM { 5 struct S { int len=0, link=-1; array<int,26> next; S(){ next.fill(-1);} }; 6 vector<S> st; int last; 7 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); } 8 void init(){ st.clear(); st.push_back(S()); last=0; } 9 void extend(char cc){ int c=cc-'a'; int cur=(int)st.size(); st.push_back(S()); st[cur].len=st[last].len+1; int p=last; while(p!=-1 && st[p].next[c]==-1){ st[p].next[c]=cur; p=st[p].link;} if(p==-1){ st[cur].link=0; } else { int q=st[p].next[c]; if(st[p].len+1==st[q].len){ st[cur].link=q; } else { int clone=(int)st.size(); st.push_back(st[q]); st[clone].len=st[p].len+1; while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link;} st[q].link=st[cur].link=clone; } } last=cur; } 10 }; 11 12 // DP to count number of distinct substrings reachable from a state 13 struct KthSubstring { 14 SAM sam; 15 vector<long long> dp; // dp[v] = number of distinct substrings starting from state v 16 17 KthSubstring(const string &s): sam((int)s.size()){ 18 for(char c: s) sam.extend(c); 19 dp.assign(sam.st.size(), -1); 20 // ensure no overflow in typical constraints: cap at 9e18 if needed 21 } 22 23 long long solve_dp(int v){ 24 if(dp[v] != -1) return dp[v]; 25 long long res = 0; 26 for(int c=0;c<26;++c){ 27 int u = sam.st[v].next[c]; 28 if(u==-1) continue; 29 // one substring by taking this edge, plus all longer ones from u 30 long long sub = 1 + solve_dp(u); 31 // cap to avoid overflow if using very large k 32 const long long CAP = (long long)9e18; 33 res = min(CAP, res + sub); 34 } 35 return dp[v] = res; 36 } 37 38 // Return the k-th lexicographic distinct substring (1-indexed). Assumes 1 <= k <= total. 39 string kth(long long k){ 40 string ans; 41 int v = 0; // start state 42 while(k>0){ 43 for(int c=0;c<26;++c){ 44 int u = sam.st[v].next[c]; 45 if(u==-1) continue; 46 long long here = 1 + solve_dp(u); 47 if(k <= here){ 48 // This edge is inside the k-th block 49 ans.push_back(char('a'+c)); 50 k--; // consume the substring that ends here 51 if(k==0) break; // exactly ended at this prefix 52 v = u; // continue to longer substrings 53 // now choose next edge to account for longer ones from u 54 // do not reset v here if k>0 55 goto next_step; 56 } else { 57 k -= here; // skip all substrings starting with this edge 58 } 59 } 60 // If we exit the loop without finding, k was invalid 61 break; 62 next_step: ; 63 } 64 return ans; 65 } 66 }; 67 68 int main(){ 69 ios::sync_with_stdio(false); cin.tie(nullptr); 70 string s; cin >> s; 71 KthSubstring solver(s); 72 long long total = solver.solve_dp(0); // total number of distinct substrings 73 int q; cin >> q; 74 while(q--){ 75 long long k; cin >> k; 76 if(k<1 || k>total){ cout << "-1\n"; } 77 else { cout << solver.kth(k) << "\n"; } 78 } 79 return 0; 80 } 81
We build a SAM and compute dp[v] = sum over edges of (1 + dp[next]) to count how many distinct substrings start at each state. Then we walk edges in lexicographic order, subtracting blocks until we land on the k-th block. Each chosen edge contributes one substring (ending now) plus a tail of longer ones (continuations).