KMP Algorithm
Key Points
- •The KMP algorithm finds all occurrences of a pattern in a text in O(n + m) time by never re-checking characters that are already known to match or mismatch.
- •It precomputes a prefix function π for the pattern, where is the length of the longest proper prefix that is also a suffix of pattern[0..i].
- •On a mismatch during the scan, KMP jumps within the pattern using π instead of moving back in the text, saving time.
- •The prefix function also reveals structural properties like string borders and the minimal period of a string.
- •KMP handles overlapping matches naturally (e.g., counting 'ana' in 'banana').
- •You can implement KMP to work on streaming input by keeping only the current matched length and the π array.
- •The algorithm uses O(m) extra space for the prefix function and works well in competitive programming constraints.
- •Common pitfalls include off-by-one indexing and confusing π with the Z-function; careful 0-based indexing avoids errors.
Prerequisites
- →Big-O notation — To understand linear-time O(n + m) complexity and amortized arguments.
- →Strings and indexing — KMP manipulates substrings and relies on careful 0-based indexing.
- →Basic arrays/vectors in C++ — The prefix function π is stored and used as an array.
- →Loop invariants — Understanding why the while loop j := π[j − 1] maintains correctness helps avoid bugs.
- →Amortized analysis — Explains why the number of comparisons is linear overall.
Detailed Explanation
Tap terms for definitions01Overview
The Knuth–Morris–Pratt (KMP) algorithm is a classic method for fast pattern matching: given a text T of length n and a pattern P of length m, it finds all positions where P occurs in T in linear time O(n + m). Instead of naively restarting from the next text character on a mismatch, KMP preprocesses the pattern to learn how the pattern matches against itself. This knowledge is stored in the prefix function π, where π[i] is the length of the longest proper prefix of P that is also a suffix of P[0..i]. During the scan, when a mismatch happens after matching j characters, KMP does not re-check text characters; it jumps the pattern index j to π[j − 1], which is the size of the next best candidate prefix that could continue the match. This guarantees that each text character is examined only a constant number of times overall. Besides searching, the prefix function reveals useful combinatorial information about strings, such as borders (prefixes that are also suffixes) and minimal periods (the smallest substring that repeats to form the whole string). These properties lead to many problems solvable with KMP in competitive programming, including counting overlapping matches, detecting repetitions, and processing strings online.
02Intuition & Analogies
Imagine you are reading a book and trying to find a specific sentence P inside the book T. If you start comparing letter by letter and quit at the first mismatch, the naive approach would make you slide the book’s page by just one character and start over, re-reading many of the same letters. KMP is like keeping a smart bookmark inside the sentence P. This bookmark tells you, after a partial match breaks, which shorter prefix of P still fits with what you just saw. So instead of restarting from scratch, you reuse what you already know matches. Another analogy is fitting puzzle pieces: suppose you align P against a window of T and some prefix of P lines up perfectly with the end of the window, but then the next piece doesn’t fit. Rather than discarding everything, you slide P so that the longest prefix of P that is also a suffix of the well-aligned portion remains aligned. The prefix function π is your precomputed map of these reusable portions—borders of prefixes of P. When scanning T, you only move forward in T, and in P you either move forward (when letters match) or jump back using π (when they don’t). Because each forward move in T is never undone and each jump in P can only reduce how far into P you are, the total number of comparisons stays linear. This reuse of partial knowledge is the heart of KMP’s speed and why it excels at handling overlaps, like finding all occurrences of 'aba' in 'ababa', where matches share characters.
03Formal Definition
04When to Use
Use KMP when you need to find a pattern in a long text with guaranteed linear time and small extra memory, especially when overlaps matter. Typical tasks include: finding all occurrences of a pattern (including overlapping ones), counting them, or checking if the pattern exists at all. The prefix function is also powerful for analyzing a single string: compute borders, get the minimal period k = n − π[n − 1] when k divides n, and enumerate all border lengths by following π links from π[n − 1] down to 0. In competitive programming (CF 1400–1800), KMP helps with problems like detecting whether a string is made of repeats of a smaller substring, testing if one string is a rotation of another by searching in S+S, or computing how many times each prefix of a string appears as a substring by aggregating counts along the π-tree. KMP also adapts to streaming: you can maintain the current j across chunks, enabling online search in logs or network packets without storing the entire text. While KMP is optimal for single-pattern exact matches, for multiple patterns consider Aho–Corasick, and for simpler code with similar uses consider the Z-function when appropriate.
⚠️Common Mistakes
Common pitfalls include: (1) Off-by-one errors with π indexing—π[i] refers to P[0..i], and on mismatch you set j to π[j−1], not π[j]. (2) Forgetting to allow overlaps—after finding a full match, you must set j := π[m−1], not 0, or you’ll miss overlapping occurrences. (3) Mixing 1-based and 0-based indices; in C++, keeping everything 0-based is simpler and avoids confusion. (4) Confusing the prefix function with the Z-function; they measure different things, so don’t swap implementations. (5) Building π incorrectly by failing to loop while j > 0 and P[i] \neq P[j]; using if instead of while causes wrong π values on patterns with nested borders. (6) Mishandling empty strings; define behavior clearly—if the pattern is empty, either return all positions or treat as invalid input. (7) Using char types incorrectly for non-ASCII text; KMP works on sequences, but if your input is UTF-8, one char is a byte, not a Unicode codepoint. (8) Forgetting to reset j between independent searches or, conversely, resetting j when doing a streaming search. (9) Not considering memory for very large patterns; π uses O(m) ints—use int32 unless lengths can exceed 2e9. (10) Misinterpreting period formula—k = n − π[n − 1] is the minimal period only if k divides n; otherwise, the string has no nontrivial period.
Key Formulas
Prefix Function
Explanation: For each position i in the pattern P, is the length of the longest proper prefix that is also a suffix of P[0..i]. This precomputed array drives KMP's jumps on mismatches.
Failure Transition
Explanation: On a mismatch after j matched characters, jump to the next best border length instead of restarting from 0. This preserves linear time by reusing partial matches.
Match and Report Rule
Explanation: Advance j on matches and report an occurrence when j reaches the pattern length m; then fall back to allow overlapping matches.
Time Complexity
Explanation: KMP preprocesses the pattern in O(m) and scans the text in O(n). Total time grows linearly with the sum of text and pattern lengths.
Minimal Period
Explanation: Let n be the string length. If n minus the last π value divides n, that difference k is the smallest repeating unit length; otherwise, the string has no nontrivial period.
Concatenation Detection
Explanation: Compute π on the concatenated string P + '#' + T (with '#' not in the alphabet). Wherever π equals , a pattern occurrence ends there in T.
Comparison Bound
Explanation: Each text character advances once; each mismatch causes j to decrease, which can happen only as many times as j has previously increased. Hence a linear number of comparisons overall.
Rotation Test
Explanation: A string B is a rotation of A if it appears as a substring of A concatenated with itself. Use KMP to test this in linear time.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Compute prefix function (pi) for string s 5 // pi[i] = length of the longest proper prefix of s 6 // that is also a suffix of s[0..i] 7 vector<int> prefix_function(const string &s) { 8 int n = (int)s.size(); 9 vector<int> pi(n, 0); 10 for (int i = 1; i < n; ++i) { 11 int j = pi[i - 1]; 12 // Fallback using failure links until we can extend or reach 0 13 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 14 if (s[i] == s[j]) ++j; 15 pi[i] = j; 16 } 17 return pi; 18 } 19 20 // KMP search: return starting indices where pattern occurs in text 21 vector<int> kmp_search(const string &text, const string &pattern) { 22 vector<int> res; 23 if (pattern.empty()) return res; // define behavior: no matches for empty pattern 24 vector<int> pi = prefix_function(pattern); 25 int n = (int)text.size(); 26 int m = (int)pattern.size(); 27 int j = 0; // number of matched chars in pattern 28 for (int i = 0; i < n; ++i) { 29 while (j > 0 && text[i] != pattern[j]) j = pi[j - 1]; 30 if (text[i] == pattern[j]) ++j; 31 if (j == m) { 32 res.push_back(i - m + 1); // match ends at i 33 j = pi[j - 1]; // allow overlapping matches 34 } 35 } 36 return res; 37 } 38 39 int main() { 40 ios::sync_with_stdio(false); 41 cin.tie(nullptr); 42 43 string text, pattern; 44 // Example input: first the text line, then the pattern line 45 if (!getline(cin, text)) return 0; 46 if (!getline(cin, pattern)) return 0; 47 48 vector<int> occ = kmp_search(text, pattern); 49 50 cout << occ.size() << "\n"; 51 for (int i = 0; i < (int)occ.size(); ++i) { 52 if (i) cout << ' '; 53 cout << occ[i]; 54 } 55 cout << "\n"; 56 return 0; 57 } 58
We first compute the prefix function π for the pattern. While scanning the text, we maintain j, the length of the matched prefix of the pattern. On mismatch, we fall back to π[j−1], and on match we advance j. When j reaches m, we report an occurrence and reset j to π[m−1] to allow overlaps.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 vector<int> prefix_function(const string &s) { 5 int n = (int)s.size(); 6 vector<int> pi(n, 0); 7 for (int i = 1; i < n; ++i) { 8 int j = pi[i - 1]; 9 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 10 if (s[i] == s[j]) ++j; 11 pi[i] = j; 12 } 13 return pi; 14 } 15 16 int main() { 17 ios::sync_with_stdio(false); 18 cin.tie(nullptr); 19 20 string s; 21 if (!(cin >> s)) return 0; 22 int n = (int)s.size(); 23 vector<int> pi = prefix_function(s); 24 int k = n - pi[n - 1]; // candidate minimal period 25 26 if (n % k == 0) { 27 cout << "minimal_period_length=" << k << ", repetitions=" << (n / k) << "\n"; 28 } else { 29 cout << "minimal_period_length=" << n << ", repetitions=1\n"; 30 } 31 32 // Bonus: list all border lengths by following pi links 33 vector<int> borders; 34 for (int L = pi[n - 1]; L > 0; L = pi[L - 1]) borders.push_back(L); 35 sort(borders.begin(), borders.end()); 36 cout << "borders:"; 37 for (int L : borders) cout << ' ' << L; 38 cout << "\n"; 39 return 0; 40 } 41
The minimal period of s is k = n − π[n − 1] when k divides n; otherwise s has no shorter period. We also enumerate all border lengths by following π links from π[n − 1] down to 0.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 vector<int> prefix_function(const string &s) { 5 int n = (int)s.size(); 6 vector<int> pi(n, 0); 7 for (int i = 1; i < n; ++i) { 8 int j = pi[i - 1]; 9 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 10 if (s[i] == s[j]) ++j; 11 pi[i] = j; 12 } 13 return pi; 14 } 15 16 int main() { 17 ios::sync_with_stdio(false); 18 cin.tie(nullptr); 19 20 string s; 21 if (!(cin >> s)) return 0; 22 int n = (int)s.size(); 23 vector<int> pi = prefix_function(s); 24 25 // occ[len] = number of occurrences of prefix of length 'len' as a substring 26 vector<int> occ(n + 1, 0); 27 28 // Each prefix-function value pi[i] contributes one occurrence of that prefix as a suffix ending at i 29 for (int i = 0; i < n; ++i) occ[pi[i]]++; 30 31 // Propagate counts up the border chain: a longer prefix occurrence implies one occurrence of its borders 32 for (int len = n; len > 0; --len) { 33 if (len - 1 >= 0) occ[pi[len - 1]] += occ[len]; 34 } 35 36 // Each prefix occurs at least once as itself (ending at its own index) 37 for (int len = 1; len <= n; ++len) occ[len]++; 38 39 // Output: for each prefix length, print its count 40 for (int len = 1; len <= n; ++len) { 41 cout << len << ": " << occ[len] << "\n"; 42 } 43 return 0; 44 } 45
Occurrences of longer prefixes imply occurrences of their borders. We first count how many times each prefix length appears as a suffix of prefixes (using π values), then propagate counts along failure links, and finally add 1 to count the prefix itself. This yields, for every L, how many times s[0..L−1] appears anywhere in s.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 struct KMPStream { 5 string p; 6 vector<int> pi; 7 int j = 0; // current matched length in pattern 8 long long pos = 0; // number of processed characters so far in the stream 9 10 explicit KMPStream(string pattern) : p(std::move(pattern)) { 11 int m = (int)p.size(); 12 pi.assign(m, 0); 13 for (int i = 1; i < m; ++i) { 14 int k = pi[i - 1]; 15 while (k > 0 && p[i] != p[k]) k = pi[k - 1]; 16 if (p[i] == p[k]) ++k; 17 pi[i] = k; 18 } 19 } 20 21 // Process a chunk and return starting indices (0-based) of matches in the global stream 22 vector<long long> pushChunk(const string &chunk) { 23 vector<long long> out; 24 if (p.empty()) return out; // define: empty pattern -> no reports 25 for (char c : chunk) { 26 while (j > 0 && c != p[j]) j = pi[j - 1]; 27 if (c == p[j]) ++j; 28 if (j == (int)p.size()) { 29 long long start = (pos) - (long long)p.size() + 1; // match ends at current pos 30 out.push_back(start); 31 j = pi[j - 1]; 32 } 33 ++pos; 34 } 35 return out; 36 } 37 }; 38 39 int main() { 40 ios::sync_with_stdio(false); 41 cin.tie(nullptr); 42 43 string pattern; cin >> pattern; 44 KMPStream kmp(pattern); 45 46 int q; cin >> q; // number of chunks 47 long long total = 0; 48 while (q--) { 49 string chunk; cin >> chunk; 50 vector<long long> occ = kmp.pushChunk(chunk); 51 total += (long long)occ.size(); 52 for (auto idx : occ) cout << idx << ' '; 53 } 54 cout << "\nTotal matches: " << total << "\n"; 55 return 0; 56 } 57
This class builds π once and then processes arbitrary text chunks while maintaining the current match length j and the global position. It reports match start indices relative to the entire stream, demonstrating KMP's online nature.