KMP - Prefix Function Applications
Key Points
- •The prefix function π of a string tells, for every position, the length of the longest proper prefix that is also a suffix of the prefix ending there.
- •You can find the minimal period of a string in O(n) using - and checking if n is divisible by T.
- •KMP uses π to search a pattern in a text in O(n + m) time while allowing overlapping matches.
- •The failure (border) links of π form a tree on prefix lengths that can be used to aggregate counts of prefix occurrences.
- •Building a KMP automaton from π gives O(1) per character transitions and is handy for processing many texts against the same pattern.
- •Prefix-function-based techniques help in periodicity analysis, repetition detection, and substring statistics.
- •Common pitfalls include off-by-one mistakes and confusing the prefix function with the Z-function.
- •For multiple patterns, extend the idea to the Aho–Corasick automaton; KMP automaton is the single-pattern building block.
Prerequisites
- →String indexing and substrings — Understanding how prefixes and suffixes are defined and accessed is essential for π and borders.
- →Big-O notation — To analyze and compare the time and space complexity of KMP and related constructions.
- →Finite automata basics — Automaton construction for KMP uses DFA states and transitions over an alphabet.
- →Vectors and arrays in C++ — Implementing π, KMP, and automata requires comfortable use of dynamic arrays.
- →0-based vs 1-based indexing — Correctly converting between prefix lengths (states) and character indices avoids off-by-one errors.
Detailed Explanation
Tap terms for definitions01Overview
The KMP (Knuth–Morris–Pratt) algorithm and its prefix function π are fundamental tools in string processing. The prefix function π[i] for a string s measures, for each position i, the length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i]. This information lets us efficiently skip characters when searching for a pattern in a text, yielding O(n + m) time for text length n and pattern length m. Beyond searching, π exposes deep structure of a string: its borders (prefixes that are also suffixes), its minimal period, and relationships between all prefixes via so-called failure links. These links form a tree over prefix lengths that can be used to accumulate how often each prefix appears as a substring. From π we can also build a deterministic finite automaton (the KMP automaton) with m + 1 states that processes a text one character at a time with O(1) transitions. This automaton is especially useful when the same pattern is used repeatedly against many texts. In competitive programming (CF 1500–1900), problems often require computing minimal periods, counting occurrences including overlaps, or computing statistics for all prefixes—tasks naturally expressed using π, its tree, and the automaton.
02Intuition & Analogies
Imagine you’re assembling a jigsaw puzzle strip-by-strip. Each time you place a new piece (a character), you try to see how much of the front of the puzzle matches its own tail. The prefix function π[i] is the length of the longest such front–tail overlap after placing piece i. When you search for a pattern in a long text, instead of restarting from scratch on a mismatch, you slide the pattern just enough so that the known overlapping part still matches—like aligning the puzzle’s tail with the largest confirmed head that still fits. That overlap length is exactly what π tells you. For periodicity, think of a song loop: if the ending of the song already matches its beginning by L seconds, then the loop length is the song length minus L. If the song’s length is an exact multiple of that loop, the song is built by repeating that loop. The failure links (π values viewed as arrows from a longer overlap to a shorter one) are like backup plans: if your best overlap fails, you immediately try the next best candidate without re-checking from zero. Building an automaton is like precomputing how you’d react to every possible next note (character) at every point in the song (state), so during performance you never pause to think—you just follow the next transition instantly.
03Formal Definition
04When to Use
Use prefix-function-based methods when you need: (1) Fast pattern matching allowing overlaps: KMP finds all occurrences in O(n + m). (2) Periodicity and repetition detection: compute minimal period and the number of repeats in O(n). This appears in strings made of repeated blocks, rope folding, or circular shift checks. (3) Counting how many times each prefix appears within a string: compute π on the whole string and aggregate along failure links. (4) Many queries with the same pattern: prebuild the KMP automaton and process multiple texts with O(1) per character. (5) Constructive algorithms on prefixes/borders: solve tasks about the longest border of each prefix, chain of borders, or find the number of border occurrences. For multiple patterns in one pass over a text, use Aho–Corasick, which generalizes KMP to a trie with failure links; however, understanding π, failure links, and the KMP automaton provides the conceptual foundation.
⚠️Common Mistakes
Common pitfalls include: (1) Off-by-one errors: π[i] refers to s[0..i], and when converting a border length k to an index, the last index is k-1. (2) Confusing the prefix function with the Z-function: π deals with prefix–suffix overlaps of prefixes, while Z compares the prefix with substrings starting at each position. (3) Incorrect period check: computing T = n - π[n-1] is necessary but not sufficient; you must also verify n mod T = 0 to claim periodicity. (4) Forgetting overlaps when counting matches: naive counting can skip overlapping occurrences; KMP naturally counts overlaps. (5) Mishandling automaton transitions: when building next(q, c), if q=0 and pattern[0] != c, the next state is 0, not -1; for q>0, follow failure links using π until a match or zero. (6) Using the wrong alphabet size or mapping in the automaton, causing out-of-bounds or incorrect transitions. (7) Not handling empty strings or degenerate cases: an empty pattern conventionally matches at every position, but many implementations assume m > 0; define behavior explicitly. (8) Mixing 1-based and 0-based indexing when converting between lengths (states) and character indices; states are lengths 0..m, while indices are 0..m-1.
Key Formulas
Prefix Function Definition
Explanation: For each position i, is the largest size of a border of the prefix ending at i. It measures how much the string matches itself at both ends up to i.
Minimal Period Candidate
Explanation: The minimal period candidate equals the length minus the longest border of the whole string. It is a true period if and only if n is divisible by this value.
Periodicity Test
Explanation: A string is made of repeated copies of its first block of length n - exactly when the length is a multiple of that block length.
KMP Automaton Transition
Explanation: From a state q (matched prefix length), if the next pattern character matches c, advance; otherwise follow failure links until a match or zero, where mismatches loop at 0.
Prefix Occurrence Aggregation
Explanation: The count of occurrences of prefix of length v equals its own occurrence as a prefix plus the occurrences contributed by longer prefixes that fall back to v via
KMP Amortized Fallbacks
Explanation: Across the scan, each mismatch leads to a fallback, but total fallbacks are bounded by the number of character advances, keeping runtime linear.
Number of Repeats
Explanation: If the string is periodic, the number of times the minimal block repeats is the total length divided by the minimal period candidate.
KMP Time and Space
Explanation: Searching takes time linear in the text and pattern sizes; space is linear in the pattern due to the π array or automaton.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Compute prefix function (pi) for string s in O(n) 5 vector<int> prefix_function(const string &s) { 6 int n = (int)s.size(); 7 vector<int> pi(n, 0); 8 for (int i = 1; i < n; ++i) { 9 int j = pi[i - 1]; 10 // Fallback along failure links until we can extend or reach 0 11 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 12 if (s[i] == s[j]) ++j; 13 pi[i] = j; 14 } 15 return pi; 16 } 17 18 int main() { 19 ios::sync_with_stdio(false); 20 cin.tie(nullptr); 21 22 string s; 23 // Example input: ababab 24 if (!(cin >> s)) return 0; 25 26 vector<int> pi = prefix_function(s); 27 int n = (int)s.size(); 28 int T = n - (n ? pi[n - 1] : 0); // minimal period candidate 29 30 // Check if s is periodic with period T 31 bool periodic = (n > 0 && n % T == 0); 32 33 cout << "String: " << s << "\n"; 34 cout << "pi: "; 35 for (int x : pi) cout << x << ' '; 36 cout << "\n"; 37 38 cout << "Minimal period length candidate T = n - pi[n-1] = " << T << "\n"; 39 if (periodic) { 40 cout << "String is periodic. Minimal period = " << T << ", repeats = " << n / T << "\n"; 41 cout << "Period block: '" << s.substr(0, T) << "'\n"; 42 } else { 43 cout << "String is not a repetition of a smaller block. Minimal period is the string itself (" << n << ")\n"; 44 } 45 46 return 0; 47 } 48
This program computes the prefix function and uses it to test periodicity via T = n - π[n-1]. If n is divisible by T, the string is composed of n/T repeats of the first T characters. The code prints π, the period candidate, and whether the string is periodic.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 vector<int> prefix_function(const string &s) { 5 int n = (int)s.size(); 6 vector<int> pi(n, 0); 7 for (int i = 1; i < n; ++i) { 8 int j = pi[i - 1]; 9 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 10 if (s[i] == s[j]) ++j; 11 pi[i] = j; 12 } 13 return pi; 14 } 15 16 // KMP search: return starting indices where pattern occurs in text (including overlaps) 17 vector<int> kmp_search(const string &text, const string &pattern) { 18 vector<int> res; 19 int n = (int)text.size(); 20 int m = (int)pattern.size(); 21 if (m == 0) { // define: empty pattern matches at every position 22 for (int i = 0; i <= n; ++i) res.push_back(i); 23 return res; 24 } 25 vector<int> pi = prefix_function(pattern); 26 int j = 0; // number of characters matched in pattern 27 for (int i = 0; i < n; ++i) { 28 while (j > 0 && text[i] != pattern[j]) j = pi[j - 1]; // fallback 29 if (text[i] == pattern[j]) ++j; // extend 30 if (j == m) { 31 res.push_back(i - m + 1); // match ends at i 32 j = pi[j - 1]; // allow overlaps by falling back 33 } 34 } 35 return res; 36 } 37 38 int main() { 39 ios::sync_with_stdio(false); 40 cin.tie(nullptr); 41 42 string text, pattern; 43 // Example: text = ababababa, pattern = aba -> matches at 0,2,4,6 44 if (!(cin >> text >> pattern)) return 0; 45 vector<int> occ = kmp_search(text, pattern); 46 47 cout << "Text: " << text << "\n"; 48 cout << "Pattern: " << pattern << "\n"; 49 cout << "Count: " << occ.size() << "\n"; 50 cout << "Positions: "; 51 for (int i = 0; i < (int)occ.size(); ++i) cout << occ[i] << (i + 1 == (int)occ.size() ? '\n' : ' '); 52 53 return 0; 54 } 55
This code searches for all occurrences of a pattern in a text using KMP and lists their starting indices, including overlapping matches. It maintains a pattern index j and uses π to jump on mismatches without re-examining characters.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 vector<int> prefix_function(const string &s) { 5 int n = (int)s.size(); 6 vector<int> pi(n, 0); 7 for (int i = 1; i < n; ++i) { 8 int j = pi[i - 1]; 9 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 10 if (s[i] == s[j]) ++j; 11 pi[i] = j; 12 } 13 return pi; 14 } 15 16 // Given s, compute how many times each prefix s[0..v-1] appears as a substring of s. 17 // Returns vector occ of size n+1, occ[v] = count for prefix length v (occ[0] is unused = 0). 18 vector<long long> count_prefix_occurrences(const string &s) { 19 int n = (int)s.size(); 20 vector<int> pi = prefix_function(s); 21 vector<long long> occ(n + 1, 0); 22 23 // Every position i contributes one occurrence to the border length pi[i] 24 for (int i = 0; i < n; ++i) occ[pi[i]]++; 25 26 // Propagate counts from longer borders to shorter via failure links 27 // For v from n down to 1: add children counts to parent (pi[v-1]) 28 for (int v = n; v >= 1; --v) { 29 if (v - 1 >= 0) occ[ (v - 1 >= 0 ? (v - 1 >= 0 ? (pi[v - 1]) : 0) : 0) ] += occ[v]; 30 } 31 32 // Each prefix appears at least once as itself (ending at its own end) 33 for (int v = 1; v <= n; ++v) occ[v]++; 34 35 return occ; 36 } 37 38 int main() { 39 ios::sync_with_stdio(false); 40 cin.tie(nullptr); 41 42 string s; 43 // Example: s = ababa 44 if (!(cin >> s)) return 0; 45 46 auto occ = count_prefix_occurrences(s); 47 cout << "String: " << s << "\n"; 48 cout << "Prefix occurrence counts (length -> count):\n"; 49 for (int v = 1; v <= (int)s.size(); ++v) { 50 cout << v << " -> " << occ[v] << " (\"" << s.substr(0, v) << "\")\n"; 51 } 52 53 return 0; 54 } 55
This program counts how many times each prefix of s appears as a substring of s. The trick is to count contributions at each position into occ[π[i]] and then accumulate along failure links from longer to shorter borders. Finally, each prefix occurs at least once as itself, so add one. This is a classic π-tree aggregation.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 vector<int> prefix_function(const string &s) { 5 int n = (int)s.size(); 6 vector<int> pi(n, 0); 7 for (int i = 1; i < n; ++i) { 8 int j = pi[i - 1]; 9 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 10 if (s[i] == s[j]) ++j; 11 pi[i] = j; 12 } 13 return pi; 14 } 15 16 // Build KMP automaton for lowercase 'a'..'z'. 17 // go[q][c] = next state from state q on character 'a'+c, states are 0..m. 18 vector<array<int,26>> build_kmp_automaton(const string &pat) { 19 int m = (int)pat.size(); 20 vector<int> pi = prefix_function(pat); 21 vector<array<int,26>> go(m + 1); 22 23 for (int c = 0; c < 26; ++c) go[0][c] = 0; 24 if (m > 0) go[0][pat[0]-'a'] = 1; 25 26 for (int q = 1; q <= m; ++q) { 27 for (int c = 0; c < 26; ++c) { 28 if (q < m && pat[q] - 'a' == c) { 29 go[q][c] = q + 1; 30 } else { 31 go[q][c] = go[ (q == 0 ? 0 : pi[q - 1]) ][c]; 32 } 33 } 34 } 35 return go; 36 } 37 38 // Use automaton to count occurrences in a text quickly 39 int count_occurrences_automaton(const string &text, const string &pat, const vector<array<int,26>> &go) { 40 int m = (int)pat.size(); 41 int state = 0, ans = 0; 42 for (char ch : text) { 43 if ('a' <= ch && ch <= 'z') { 44 state = go[state][ch - 'a']; 45 if (state == m) { 46 ++ans; 47 // Follow failure link implicitly via transition from state m 48 // No manual fallback needed since go captures it 49 } 50 } else { 51 // If the input can contain other chars, either extend alphabet or reset appropriately 52 state = 0; // simple policy: reset on non-lowercase 53 } 54 } 55 return ans; 56 } 57 58 int main() { 59 ios::sync_with_stdio(false); 60 cin.tie(nullptr); 61 62 string pat; 63 int T; 64 // Example: pattern "aba", then multiple texts 65 if (!(cin >> pat >> T)) return 0; 66 auto go = build_kmp_automaton(pat); 67 68 while (T--) { 69 string text; cin >> text; 70 int cnt = count_occurrences_automaton(text, pat, go); 71 cout << cnt << "\n"; 72 } 73 74 return 0; 75 } 76
This builds the KMP automaton for a lowercase alphabet and then reuses it to process multiple texts quickly. Each character transition is O(1). If you need multi-pattern matching in a single pass, switch to Aho–Corasick; the KMP automaton is the single-pattern case.