🗂️Data StructureAdvanced

Suffix Automaton

Key Points

  • A suffix automaton (SAM) is the minimal deterministic finite automaton that recognizes all substrings of a string, built online in O(n) time and space.
  • Each SAM state compactly represents an equivalence class of end positions (endpos) of substrings that share the same set of right contexts.
  • Suffix links connect states to the longest proper suffix class and form a tree when oriented from longer to shorter lengths.
  • The number of states is at most 2n−1 and the number of transitions is O(n), which makes the structure extremely memory-efficient.
  • Counting distinct substrings is a one-line sum over states: sum(len[v] − len[link[v]]).
  • Substring occurrence counts (frequencies) are obtained by a single topological propagation along suffix links.
  • Pattern membership and frequency queries run in O(m) where m is the pattern length.
  • SAM supports advanced tasks such as longest common substring, k-th lexicographic substring, and many combinatorial substring counts.

Prerequisites

  • Basic string operations and substringsUnderstanding substrings, indices, and contiguous segments is essential for interpreting what SAM recognizes.
  • Finite automata (DFA/NFA)SAM is a minimal DFA; familiarity with states, transitions, and determinism clarifies the structure.
  • Failure links (e.g., KMP)Suffix links in SAM conceptually resemble failure links and help build intuition about fallbacks.
  • Topological orderingOccurrence propagation requires processing states in decreasing length, akin to a topological order.
  • Data structure trade-offs (arrays vs hash maps)Choosing the right transition representation impacts performance and memory.
  • DP on DAGsLexicographic counting (k-th substring) uses DP over the acyclic transition graph.

Detailed Explanation

Tap terms for definitions

01Overview

Imagine compressing all substrings of a string S into a single automaton that you can scan like a small map: every path from the start spells a substring, and every state summarizes many similar substrings at once. That is what the suffix automaton (SAM) offers. It is a deterministic finite automaton (DFA) that recognizes exactly the set of all substrings of S, and among all such DFAs it has the fewest states. The construction is online: you append characters one by one and update only a small part of the structure each time. This makes SAM practical for streaming data and incremental algorithms. The power of SAM comes from two facts. First, it merges substrings that behave the same with respect to future continuations (same right contexts), dramatically reducing size compared to naive tries. Second, it equips every state with a suffix link to the next-best match (the longest proper suffix class), which acts as a fast fallback pointer similar to KMP's failure link but for whole equivalence classes. As a result, many classic string problems become linear-time routines: counting distinct substrings, answering existence and frequency of a pattern, computing the longest common substring with another string, and extracting lexicographic statistics. Even though it works like a DFA, SAM is built using a clever clone-and-link strategy that preserves minimality without expensive global minimization. The result is a compact, elegant, and battle-tested tool in competitive programming and string algorithms.

02Intuition & Analogies

Think of writing a long word on a strip of paper. All its substrings are every contiguous piece you could tear from it. Listing them all is huge—there are O(n^2) of them. But many pieces behave the same way if you consider what letters can follow them inside the original word. For example, in "banana", the substrings "a" occurring at positions 2, 4, and 6 are different occurrences but share similar continuation opportunities. The suffix automaton groups substrings by this behavior: it creates a state for each group of substrings that can be extended by the same set of letters within S. Now imagine navigating a subway map. Stations are states, and colored tracks (edges) labeled with letters take you to new stations. Starting at the depot (initial state), any sequence of tracks you follow spells out a substring of S. If a track you want doesn't exist, you jump along a special backward corridor—the suffix link—to a nearby station that represents “drop the first character and try again.” This is your emergency route that keeps you moving efficiently without backtracking the whole trip. The online construction is like expanding the subway as new neighborhoods (characters) are added at the end of S. When a new letter arrives, we lay a small number of new tracks and sometimes copy an old station (a clone) if we must split two previously indistinguishable neighborhoods. Cloning might feel odd at first, but it preserves the rule that each station uniquely captures a behavior of right extensions. Despite the complexity under the hood, the passenger experience stays simple: paths are substrings, and the map remains compact—only O(n) stations and tracks for a length-n city.

03Formal Definition

Given a string S over an alphabet , the suffix automaton of S is the minimal DFA that recognizes the language L consisting of all substrings of S. Minimal here means it has the fewest states among all DFAs that recognize L. Two strings x and y are equivalent (belong to the same state) if they have the same set of right contexts in S, i.e., \{ c : xc S \} = \{ c : yc S \} and, recursively, this holds for all longer extensions. Formally, each state v has: - len[v]: the length of the longest string in its equivalence class. - link[v]: a suffix link to the state representing the longest proper suffix of any string in v's class (or -1 for the initial state). - next[v][c]: a transition by character c to another state. The set of states is closed under taking suffixes via suffix links. Each state corresponds to an endpos equivalence class: the set of positions where strings from the class end in S (end positions). The automaton is constructed online by iteratively extending with each character of S. When a new character c is appended, if it would break minimality (two classes need to be distinguished), a clone state is created that copies transitions and adjusts links to maintain both determinism and minimality. The total number of states is bounded by 2−1 and transitions by O().

04When to Use

Use a suffix automaton when you need fast substring queries and statistics on a single text S with many patterns:

  • Pattern existence and frequency: For each query pattern P, walk the automaton in O(|P|). If you also propagate occurrence counts, you can return how many times P appears in S.
  • Distinct substrings: Compute in O(n) using the suffix-link lengths formula. This avoids O(n^2) enumeration.
  • Longest common substring (LCS): Build a SAM for A, then scan B once to track the longest path that stays in the automaton, yielding LCS in O(|A| + |B|).
  • Lexicographic tasks: Count or enumerate distinct substrings in lexicographic order, or find the k-th lexicographic substring using DP over the DAG.
  • Frequency-constrained statistics: With propagated endpos sizes, count how many distinct substrings occur at least k times. Choose SAM over suffix arrays/trees when you prefer online construction, need many membership/frequency queries, or want minimal DFA semantics. Prefer suffix arrays for global ordering of suffixes or tasks like full-text index with large alphabets under tight memory where compressed indexes shine. For lexicographically smallest rotation, specialized algorithms like Booth are simpler, though SAM can help with related substring lexicographic queries.

⚠️Common Mistakes

  • Confusing substrings with suffixes: Despite the name, a suffix automaton recognizes all substrings of S, not only suffixes. Suffix links relate classes by longest proper suffix, but language acceptance is all substrings.
  • Forgetting to clone: During extension, if you would redirect transitions that break existing path lengths, you must create a clone state with copied transitions and adjusted len/link. Skipping cloning yields incorrect minimality and wrong answers.
  • Wrong occurrence counts: To count how many times a substring appears, you must 1) set cnt[cur]++ for each newly added character's terminal state during construction, and 2) propagate counts from longer to shorter states in decreasing len order via cnt[link[v]] += cnt[v]. Using raw cnt without propagation undercounts.
  • Misusing state counts for arbitrary prefixes: Occurrence count of a specific string P is cnt[v] at the state v reached after scanning P. Do not look at cnt of a suffix-link ancestor or of the end state of the whole text.
  • Transition map pitfalls: For large or sparse alphabets, using fixed-size arrays wastes memory; for unordered_map, remember to reserve and be mindful of constant factors. For tiny alphabets, arrays are faster.
  • Off-by-one in len and link usage: Distinct substring formula uses len[v] − len[link[v]]. Ensure link[initial] = −1 and treat len[−1] = 0 conceptually when summing from v ≠ initial.
  • Mixing 0-based/1-based positions when storing first/last occurrence positions (firstpos). Be consistent if you need positions for reconstruction or advanced filters.
  • Expecting lexicographic order “for free”: SAM is not a suffix array; lexicographic tasks need additional DP or ordered traversal of transitions.

Key Formulas

State Bound

Explanation: For a string of length n, the suffix automaton has at most 2n−1 states. This linear bound guarantees compactness compared to the O() number of substrings.

Transition Bound

Explanation: The total number of transitions in the SAM is linear in n. A classical tight bound is 3n−4 for alphabets of size at least 2, ensuring linear memory.

Distinct Substrings

Explanation: Each state v contributes the number of new lengths it introduces beyond its suffix link ancestor. Summing over all non-initial states counts each distinct substring exactly once.

Occurrence Propagation

Explanation: Initialize cnt at terminal states during construction, then propagate counts along suffix links from longer to shorter states. After this, cnt[v] equals the number of occurrences of any string ending at v.

Build Complexity

Explanation: The online SAM construction runs in linear time and space in the length n of S. Each character causes a constant expected number of state/edge updates and occasional cloning.

LCS via Traversal

Explanation: Scanning the second string, we extend matches when transitions exist; otherwise, we follow suffix links to shorten the current match. The maximum length L reached is the LCS length.

Count of Distinct Substrings from a State

Explanation: Each outgoing edge contributes one new substring (adding c) plus all longer substrings reachable thereafter. This DP enables lexicographic ranking like finding the k-th substring.

Total Substrings (for reference)

Explanation: A length-n string has n(n+1)/2 substrings counting duplicates. SAM compresses them into O(n) states for efficient distinct-substring operations.

Complexity Analysis

Suffix automaton construction is linear in the length n of the input string. Each new character creates at most one new state and, at most once per iteration, one clone state. Transitions are added or redirected a constant number of times on average, and following suffix links (while loops) amortizes to O(1) per insertion. Thus the total time is O(n). The number of states is bounded by 2n−1 and the number of transitions is O(n) (with a classical tight bound of 3n−4), so memory is linear in n. Pattern queries of length m run in O(m), as each character triggers a single transition or a bounded number of suffix-link fallbacks (in the LCS scan). Counting occurrences requires a one-time propagation step: sort or bucket states by len and accumulate counts cnt[link[v]] += cnt[v] in decreasing len order. This is O(n) time and O(n) extra memory for the order. Distinct substring counting uses a simple sum over states and runs in O(n). For lexicographic tasks, computing dp[v] = sum(1 + dp[u]) over outgoing edges takes O() = O(n). Finding the k-th substring then walks at most O(n) transitions in lexicographic order, each step making O( checks if edges are scanned in order; with small alphabets this is effectively linear in answer length. Implementation constants depend on alphabet representation: fixed arrays are faster and smaller for small while hash maps are flexible for large σ but add overhead. Overall, SAM provides O(n) preprocessing and O(m) per-query performance with O(n) memory—ideal for many heavy-duty substring analytics.

Code Examples

Build SAM, count distinct substrings, and substring frequencies
1#include <bits/stdc++.h>
2using namespace std;
3
4// Suffix Automaton for lowercase 'a'..'z'
5struct SAM {
6 struct State {
7 int len = 0; // max length in this class
8 int link = -1; // suffix link
9 int next[26]; // transitions
10 long long cnt = 0; // occurrence count (endpos size after propagation)
11 int firstpos = -1; // one end position of the longest string in this state
12 State() { fill(begin(next), end(next), -1); }
13 };
14
15 vector<State> st;
16 int last; // state representing the entire string processed so far
17
18 SAM(int maxlen = 0) { st.reserve(2 * maxlen); init(); }
19
20 void init() {
21 st.clear(); st.push_back(State());
22 st[0].len = 0; st[0].link = -1; st[0].firstpos = -1;
23 last = 0;
24 }
25
26 static int id(char c) { return c - 'a'; }
27
28 void extend(char cc, int pos) {
29 int c = id(cc);
30 int cur = (int)st.size();
31 st.push_back(State());
32 st[cur].len = st[last].len + 1;
33 st[cur].firstpos = pos; // end position of the current full prefix
34 st[cur].cnt = 1; // new end position contributes 1 occurrence
35
36 int p = last;
37 while (p != -1 && st[p].next[c] == -1) {
38 st[p].next[c] = cur;
39 p = st[p].link;
40 }
41 if (p == -1) {
42 st[cur].link = 0;
43 } else {
44 int q = st[p].next[c];
45 if (st[p].len + 1 == st[q].len) {
46 st[cur].link = q;
47 } else {
48 // clone q
49 int clone = (int)st.size();
50 st.push_back(st[q]); // copy transitions, len, link, firstpos, cnt (cnt will be fixed by propagation)
51 st[clone].len = st[p].len + 1; // restrict max length
52 // firstpos of clone equals firstpos of q (both represent same end positions)
53 while (p != -1 && st[p].next[c] == q) {
54 st[p].next[c] = clone;
55 p = st[p].link;
56 }
57 st[q].link = st[cur].link = clone;
58 }
59 }
60 last = cur;
61 }
62
63 // Propagate occurrence counts along suffix links in decreasing len order
64 void propagate_counts() {
65 int maxLen = 0; for (auto &s : st) maxLen = max(maxLen, s.len);
66 vector<int> cntLen(maxLen + 1, 0);
67 for (auto &s : st) cntLen[s.len]++;
68 for (int i = 1; i <= maxLen; ++i) cntLen[i] += cntLen[i - 1];
69 vector<int> order(st.size());
70 for (int v = (int)st.size() - 1; v >= 0; --v) order[--cntLen[st[v].len]] = v;
71 // order: increasing len; process in reverse for decreasing len
72 for (int i = (int)order.size() - 1; i > 0; --i) {
73 int v = order[i];
74 int p = st[v].link;
75 if (p != -1) st[p].cnt += st[v].cnt;
76 }
77 }
78
79 // Number of distinct substrings using len - len[link]
80 long long distinct_substrings() const {
81 long long ans = 0;
82 for (int v = 1; v < (int)st.size(); ++v) ans += st[v].len - st[st[v].link].len;
83 return ans;
84 }
85
86 // Occurrence count of a pattern P (assumes propagate_counts has been called)
87 long long count_occurrences(const string &P) const {
88 int v = 0;
89 for (char cc : P) {
90 int c = id(cc);
91 if (st[v].next[c] == -1) return 0;
92 v = st[v].next[c];
93 }
94 return st[v].cnt; // occurrences of P in the text
95 }
96};
97
98int main() {
99 ios::sync_with_stdio(false);
100 cin.tie(nullptr);
101
102 string s; cin >> s; // lowercase
103 SAM sam((int)s.size());
104 for (int i = 0; i < (int)s.size(); ++i) sam.extend(s[i], i);
105 sam.propagate_counts();
106
107 cout << "Distinct substrings: " << sam.distinct_substrings() << "\n";
108
109 int q; cin >> q;
110 while (q--) {
111 string p; cin >> p;
112 cout << sam.count_occurrences(p) << "\n";
113 }
114 return 0;
115}
116

We build a SAM for a lowercase string, then compute distinct substrings using the classic sum over len[v] − len[link[v]]. We also propagate end-position counts (cnt) by processing states in decreasing length order. After propagation, the count at the state reached by a pattern equals its number of occurrences in the text. Queries then run in O(|P|).

Time: O(n) build + O(n) propagation + O(m) per querySpace: O(n)
Longest Common Substring (LCS) using a SAM
1#include <bits/stdc++.h>
2using namespace std;
3
4struct SAM {
5 struct S { int len=0, link=-1; unordered_map<char,int> next; };
6 vector<S> st; int last;
7 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); }
8 void init(){ st.clear(); st.push_back(S()); last=0; }
9 void extend(char c){
10 int cur = (int)st.size(); st.push_back(S());
11 st[cur].len = st[last].len + 1;
12 int p = last;
13 while(p!=-1 && !st[p].next.count(c)){ st[p].next[c]=cur; p=st[p].link; }
14 if(p==-1){ st[cur].link=0; }
15 else{
16 int q = st[p].next[c];
17 if(st[p].len + 1 == st[q].len){ st[cur].link = q; }
18 else{
19 int clone = (int)st.size(); st.push_back(st[q]);
20 st[clone].len = st[p].len + 1;
21 while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link; }
22 st[q].link = st[cur].link = clone;
23 }
24 }
25 last = cur;
26 }
27};
28
29int main(){
30 ios::sync_with_stdio(false); cin.tie(nullptr);
31 string A, B; cin >> A >> B;
32 SAM sam((int)A.size()); for(char c: A) sam.extend(c);
33
34 int v = 0; int l = 0; // current state and match length
35 int best = 0; int endB = -1;
36 for(int i=0;i<(int)B.size();++i){
37 char c = B[i];
38 // Follow transitions if possible; otherwise follow suffix links
39 while(v!= -1 && !sam.st[v].next.count(c)){
40 v = sam.st[v].link;
41 if(v!=-1) l = sam.st[v].len;
42 }
43 if(v==-1){ v=0; l=0; continue; }
44 v = sam.st[v].next[c];
45 l++;
46 if(l>best){ best=l; endB=i; }
47 }
48 cout << best << "\n";
49 if(best>0){
50 cout << B.substr(endB - best + 1, best) << "\n";
51 }
52 return 0;
53}
54

Build a SAM for A. Then scan B once: extend the current match if there is a transition by the next character; if not, follow suffix links to find the longest suffix that can be extended. Track the longest length reached and its position to output both the length and an example substring.

Time: O(|A| + |B|)Space: O(|A|)
Membership and frequency queries over general alphabets
1#include <bits/stdc++.h>
2using namespace std;
3
4// SAM with unordered_map transitions for general alphabets (e.g., bytes)
5struct SAM {
6 struct State { int len=0, link=-1; unordered_map<int,int> next; long long cnt=0; };
7 vector<State> st; int last;
8 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); }
9 void init(){ st.clear(); st.push_back(State()); last=0; }
10 void extend(int c){
11 int cur=(int)st.size(); st.push_back(State());
12 st[cur].len = st[last].len + 1; st[cur].cnt = 1;
13 int p = last;
14 while(p!=-1 && !st[p].next.count(c)) { st[p].next[c]=cur; p=st[p].link; }
15 if(p==-1){ st[cur].link=0; }
16 else{
17 int q = st[p].next[c];
18 if(st[p].len + 1 == st[q].len){ st[cur].link=q; }
19 else{
20 int clone=(int)st.size(); st.push_back(st[q]);
21 st[clone].len = st[p].len + 1; st[clone].cnt = 0; // will be fixed by propagation
22 while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link; }
23 st[q].link = st[cur].link = clone;
24 }
25 }
26 last=cur;
27 }
28 void build_from_string(const string &s){
29 for(unsigned char uc : s) extend((int)uc);
30 }
31 void propagate_counts(){
32 int maxLen=0; for(auto &x:st) maxLen=max(maxLen,x.len);
33 vector<int> cntLen(maxLen+1,0); for(auto &x:st) cntLen[x.len]++;
34 for(int i=1;i<=maxLen;++i) cntLen[i]+=cntLen[i-1];
35 vector<int> order(st.size());
36 for(int v=(int)st.size()-1; v>=0; --v) order[--cntLen[st[v].len]]=v;
37 for(int i=(int)order.size()-1; i>0; --i){ int v=order[i]; int p=st[v].link; if(p!=-1) st[p].cnt += st[v].cnt; }
38 }
39 // Return {exists, occurrences}
40 pair<bool,long long> query(const string &p) const{
41 int v=0;
42 for(unsigned char uc: p){ int c=(int)uc; auto it = st[v].next.find(c); if(it==st[v].next.end()) return {false,0}; v=it->second; }
43 return {true, st[v].cnt};
44 }
45};
46
47int main(){
48 ios::sync_with_stdio(false); cin.tie(nullptr);
49 string text; getline(cin, text); // allow spaces and general bytes
50 SAM sam((int)text.size()); sam.build_from_string(text); sam.propagate_counts();
51
52 int q; cin >> q; string pattern; string dummy; getline(cin,dummy);
53 while(q--){ getline(cin, pattern); auto [ok, occ]=sam.query(pattern); cout << (ok?"YES":"NO") << ' ' << occ << '\n'; }
54 return 0;
55}
56

This variant uses unordered_map<int,int> transitions to support arbitrary byte values or large alphabets. After building and propagating occurrence counts, each query returns whether the pattern occurs and how many times. This demonstrates SAM’s flexibility beyond small fixed alphabets.

Time: O(n) expected build + O(m) expected per querySpace: O(n)
K-th lexicographic distinct substring via DP on the SAM
1#include <bits/stdc++.h>
2using namespace std;
3
4struct SAM {
5 struct S { int len=0, link=-1; array<int,26> next; S(){ next.fill(-1);} };
6 vector<S> st; int last;
7 SAM(int maxlen=0){ st.reserve(2*maxlen); init(); }
8 void init(){ st.clear(); st.push_back(S()); last=0; }
9 void extend(char cc){ int c=cc-'a'; int cur=(int)st.size(); st.push_back(S()); st[cur].len=st[last].len+1; int p=last; while(p!=-1 && st[p].next[c]==-1){ st[p].next[c]=cur; p=st[p].link;} if(p==-1){ st[cur].link=0; } else { int q=st[p].next[c]; if(st[p].len+1==st[q].len){ st[cur].link=q; } else { int clone=(int)st.size(); st.push_back(st[q]); st[clone].len=st[p].len+1; while(p!=-1 && st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link;} st[q].link=st[cur].link=clone; } } last=cur; }
10};
11
12// DP to count number of distinct substrings reachable from a state
13struct KthSubstring {
14 SAM sam;
15 vector<long long> dp; // dp[v] = number of distinct substrings starting from state v
16
17 KthSubstring(const string &s): sam((int)s.size()){
18 for(char c: s) sam.extend(c);
19 dp.assign(sam.st.size(), -1);
20 // ensure no overflow in typical constraints: cap at 9e18 if needed
21 }
22
23 long long solve_dp(int v){
24 if(dp[v] != -1) return dp[v];
25 long long res = 0;
26 for(int c=0;c<26;++c){
27 int u = sam.st[v].next[c];
28 if(u==-1) continue;
29 // one substring by taking this edge, plus all longer ones from u
30 long long sub = 1 + solve_dp(u);
31 // cap to avoid overflow if using very large k
32 const long long CAP = (long long)9e18;
33 res = min(CAP, res + sub);
34 }
35 return dp[v] = res;
36 }
37
38 // Return the k-th lexicographic distinct substring (1-indexed). Assumes 1 <= k <= total.
39 string kth(long long k){
40 string ans;
41 int v = 0; // start state
42 while(k>0){
43 for(int c=0;c<26;++c){
44 int u = sam.st[v].next[c];
45 if(u==-1) continue;
46 long long here = 1 + solve_dp(u);
47 if(k <= here){
48 // This edge is inside the k-th block
49 ans.push_back(char('a'+c));
50 k--; // consume the substring that ends here
51 if(k==0) break; // exactly ended at this prefix
52 v = u; // continue to longer substrings
53 // now choose next edge to account for longer ones from u
54 // do not reset v here if k>0
55 goto next_step;
56 } else {
57 k -= here; // skip all substrings starting with this edge
58 }
59 }
60 // If we exit the loop without finding, k was invalid
61 break;
62 next_step: ;
63 }
64 return ans;
65 }
66};
67
68int main(){
69 ios::sync_with_stdio(false); cin.tie(nullptr);
70 string s; cin >> s;
71 KthSubstring solver(s);
72 long long total = solver.solve_dp(0); // total number of distinct substrings
73 int q; cin >> q;
74 while(q--){
75 long long k; cin >> k;
76 if(k<1 || k>total){ cout << "-1\n"; }
77 else { cout << solver.kth(k) << "\n"; }
78 }
79 return 0;
80}
81

We build a SAM and compute dp[v] = sum over edges of (1 + dp[next]) to count how many distinct substrings start at each state. Then we walk edges in lexicographic order, subtracting blocks until we land on the k-th block. Each chosen edge contributes one substring (ending now) plus a tail of longer ones (continuations).

Time: O(n) to build + O(n) to compute dp + O(L * σ) per querySpace: O(n)