⚙️AlgorithmIntermediate

KMP Algorithm

Key Points

  • The KMP algorithm finds all occurrences of a pattern in a text in O(n + m) time by never re-checking characters that are already known to match or mismatch.
  • It precomputes a prefix function π for the pattern, where is the length of the longest proper prefix that is also a suffix of pattern[0..i].
  • On a mismatch during the scan, KMP jumps within the pattern using π instead of moving back in the text, saving time.
  • The prefix function also reveals structural properties like string borders and the minimal period of a string.
  • KMP handles overlapping matches naturally (e.g., counting 'ana' in 'banana').
  • You can implement KMP to work on streaming input by keeping only the current matched length and the π array.
  • The algorithm uses O(m) extra space for the prefix function and works well in competitive programming constraints.
  • Common pitfalls include off-by-one indexing and confusing π with the Z-function; careful 0-based indexing avoids errors.

Prerequisites

  • Big-O notationTo understand linear-time O(n + m) complexity and amortized arguments.
  • Strings and indexingKMP manipulates substrings and relies on careful 0-based indexing.
  • Basic arrays/vectors in C++The prefix function π is stored and used as an array.
  • Loop invariantsUnderstanding why the while loop j := π[j − 1] maintains correctness helps avoid bugs.
  • Amortized analysisExplains why the number of comparisons is linear overall.

Detailed Explanation

Tap terms for definitions

01Overview

The Knuth–Morris–Pratt (KMP) algorithm is a classic method for fast pattern matching: given a text T of length n and a pattern P of length m, it finds all positions where P occurs in T in linear time O(n + m). Instead of naively restarting from the next text character on a mismatch, KMP preprocesses the pattern to learn how the pattern matches against itself. This knowledge is stored in the prefix function π, where π[i] is the length of the longest proper prefix of P that is also a suffix of P[0..i]. During the scan, when a mismatch happens after matching j characters, KMP does not re-check text characters; it jumps the pattern index j to π[j − 1], which is the size of the next best candidate prefix that could continue the match. This guarantees that each text character is examined only a constant number of times overall. Besides searching, the prefix function reveals useful combinatorial information about strings, such as borders (prefixes that are also suffixes) and minimal periods (the smallest substring that repeats to form the whole string). These properties lead to many problems solvable with KMP in competitive programming, including counting overlapping matches, detecting repetitions, and processing strings online.

02Intuition & Analogies

Imagine you are reading a book and trying to find a specific sentence P inside the book T. If you start comparing letter by letter and quit at the first mismatch, the naive approach would make you slide the book’s page by just one character and start over, re-reading many of the same letters. KMP is like keeping a smart bookmark inside the sentence P. This bookmark tells you, after a partial match breaks, which shorter prefix of P still fits with what you just saw. So instead of restarting from scratch, you reuse what you already know matches. Another analogy is fitting puzzle pieces: suppose you align P against a window of T and some prefix of P lines up perfectly with the end of the window, but then the next piece doesn’t fit. Rather than discarding everything, you slide P so that the longest prefix of P that is also a suffix of the well-aligned portion remains aligned. The prefix function π is your precomputed map of these reusable portions—borders of prefixes of P. When scanning T, you only move forward in T, and in P you either move forward (when letters match) or jump back using π (when they don’t). Because each forward move in T is never undone and each jump in P can only reduce how far into P you are, the total number of comparisons stays linear. This reuse of partial knowledge is the heart of KMP’s speed and why it excels at handling overlaps, like finding all occurrences of 'aba' in 'ababa', where matches share characters.

03Formal Definition

Given a pattern string P[0..m−1], define the prefix function π by: for each i [0, m−1], \([i] = \{ k [0, i) : P[0..k-1] = P[i-k+1..i] \}\). That is, \([i]\) is the length of the longest proper prefix of P that is also a suffix of P[0..i]. A border of a string S is any string that is both a proper prefix and a suffix of S; \([i]\) equals the length of the longest border of the prefix P[0..i]. The KMP search algorithm maintains an index j indicating how many characters of P have matched the current suffix of the scanned prefix of T. For each text index t advancing from 0 to n−1, while and T[t] P[j], set j := \([j-1]\). If T[t] = P[j], increment j. If , then an occurrence ends at t (starting at t − m + 1); after reporting, set j := \([m-1]\) to allow overlapping matches. The preprocessing of π takes O(m), and the scan of T takes O(n) because each time T[t] is compared unsuccessfully, j decreases to \([j-1]\), and the sum of all such decreases over the run is O(m + n).

04When to Use

Use KMP when you need to find a pattern in a long text with guaranteed linear time and small extra memory, especially when overlaps matter. Typical tasks include: finding all occurrences of a pattern (including overlapping ones), counting them, or checking if the pattern exists at all. The prefix function is also powerful for analyzing a single string: compute borders, get the minimal period k = n − π[n − 1] when k divides n, and enumerate all border lengths by following π links from π[n − 1] down to 0. In competitive programming (CF 1400–1800), KMP helps with problems like detecting whether a string is made of repeats of a smaller substring, testing if one string is a rotation of another by searching in S+S, or computing how many times each prefix of a string appears as a substring by aggregating counts along the π-tree. KMP also adapts to streaming: you can maintain the current j across chunks, enabling online search in logs or network packets without storing the entire text. While KMP is optimal for single-pattern exact matches, for multiple patterns consider Aho–Corasick, and for simpler code with similar uses consider the Z-function when appropriate.

⚠️Common Mistakes

Common pitfalls include: (1) Off-by-one errors with π indexing—π[i] refers to P[0..i], and on mismatch you set j to π[j−1], not π[j]. (2) Forgetting to allow overlaps—after finding a full match, you must set j := π[m−1], not 0, or you’ll miss overlapping occurrences. (3) Mixing 1-based and 0-based indices; in C++, keeping everything 0-based is simpler and avoids confusion. (4) Confusing the prefix function with the Z-function; they measure different things, so don’t swap implementations. (5) Building π incorrectly by failing to loop while j > 0 and P[i] \neq P[j]; using if instead of while causes wrong π values on patterns with nested borders. (6) Mishandling empty strings; define behavior clearly—if the pattern is empty, either return all positions or treat as invalid input. (7) Using char types incorrectly for non-ASCII text; KMP works on sequences, but if your input is UTF-8, one char is a byte, not a Unicode codepoint. (8) Forgetting to reset j between independent searches or, conversely, resetting j when doing a streaming search. (9) Not considering memory for very large patterns; π uses O(m) ints—use int32 unless lengths can exceed 2e9. (10) Misinterpreting period formula—k = n − π[n − 1] is the minimal period only if k divides n; otherwise, the string has no nontrivial period.

Key Formulas

Prefix Function

Explanation: For each position i in the pattern P, is the length of the longest proper prefix that is also a suffix of P[0..i]. This precomputed array drives KMP's jumps on mismatches.

Failure Transition

Explanation: On a mismatch after j matched characters, jump to the next best border length instead of restarting from 0. This preserves linear time by reusing partial matches.

Match and Report Rule

Explanation: Advance j on matches and report an occurrence when j reaches the pattern length m; then fall back to allow overlapping matches.

Time Complexity

Explanation: KMP preprocesses the pattern in O(m) and scans the text in O(n). Total time grows linearly with the sum of text and pattern lengths.

Minimal Period

Explanation: Let n be the string length. If n minus the last π value divides n, that difference k is the smallest repeating unit length; otherwise, the string has no nontrivial period.

Concatenation Detection

Explanation: Compute π on the concatenated string P + '#' + T (with '#' not in the alphabet). Wherever π equals , a pattern occurrence ends there in T.

Comparison Bound

Explanation: Each text character advances once; each mismatch causes j to decrease, which can happen only as many times as j has previously increased. Hence a linear number of comparisons overall.

Rotation Test

Explanation: A string B is a rotation of A if it appears as a substring of A concatenated with itself. Use KMP to test this in linear time.

Complexity Analysis

KMP runs in linear time with respect to the combined lengths of the text and pattern. Preprocessing the pattern to build the prefix function π takes O(m) time and O(m) space, where m is the pattern length. The search phase runs in O(n) time over a text of length n. The key to the linear bound is amortized analysis of the two indices: the text index t only increases from 0 to n − 1, and the pattern index j increases when characters match and decreases via j := − 1] on mismatches. Each decrement of j effectively undoes a previous increment, and because j is nonnegative, the total number of decrements over the entire scan is at most the total number of increments, which is O(n). Consequently, the total number of character comparisons is bounded by a constant factor times n + m (often stated )). Space complexity is dominated by storing requiring O(m) integers. The algorithm is cache-friendly and streaming-friendly: you can process the text in chunks while maintaining only j and π in memory. In typical competitive programming contexts, the constants are small, making KMP practical for inputs with up to 10^6–10^7 characters. If patterns are extremely large or multiple patterns must be matched simultaneously, alternative methods (e.g., Aho–Corasick for many patterns) may be more appropriate, but for single-pattern exact matching KMP is asymptotically optimal and simple to implement correctly.

Code Examples

Basic KMP: Find all (possibly overlapping) occurrences of a pattern in a text
1#include <bits/stdc++.h>
2using namespace std;
3
4// Compute prefix function (pi) for string s
5// pi[i] = length of the longest proper prefix of s
6// that is also a suffix of s[0..i]
7vector<int> prefix_function(const string &s) {
8 int n = (int)s.size();
9 vector<int> pi(n, 0);
10 for (int i = 1; i < n; ++i) {
11 int j = pi[i - 1];
12 // Fallback using failure links until we can extend or reach 0
13 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
14 if (s[i] == s[j]) ++j;
15 pi[i] = j;
16 }
17 return pi;
18}
19
20// KMP search: return starting indices where pattern occurs in text
21vector<int> kmp_search(const string &text, const string &pattern) {
22 vector<int> res;
23 if (pattern.empty()) return res; // define behavior: no matches for empty pattern
24 vector<int> pi = prefix_function(pattern);
25 int n = (int)text.size();
26 int m = (int)pattern.size();
27 int j = 0; // number of matched chars in pattern
28 for (int i = 0; i < n; ++i) {
29 while (j > 0 && text[i] != pattern[j]) j = pi[j - 1];
30 if (text[i] == pattern[j]) ++j;
31 if (j == m) {
32 res.push_back(i - m + 1); // match ends at i
33 j = pi[j - 1]; // allow overlapping matches
34 }
35 }
36 return res;
37}
38
39int main() {
40 ios::sync_with_stdio(false);
41 cin.tie(nullptr);
42
43 string text, pattern;
44 // Example input: first the text line, then the pattern line
45 if (!getline(cin, text)) return 0;
46 if (!getline(cin, pattern)) return 0;
47
48 vector<int> occ = kmp_search(text, pattern);
49
50 cout << occ.size() << "\n";
51 for (int i = 0; i < (int)occ.size(); ++i) {
52 if (i) cout << ' ';
53 cout << occ[i];
54 }
55 cout << "\n";
56 return 0;
57}
58

We first compute the prefix function π for the pattern. While scanning the text, we maintain j, the length of the matched prefix of the pattern. On mismatch, we fall back to π[j−1], and on match we advance j. When j reaches m, we report an occurrence and reset j to π[m−1] to allow overlaps.

Time: O(n + m)Space: O(m)
Minimal period and repetition count of a string using π
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> prefix_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> pi(n, 0);
7 for (int i = 1; i < n; ++i) {
8 int j = pi[i - 1];
9 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
10 if (s[i] == s[j]) ++j;
11 pi[i] = j;
12 }
13 return pi;
14}
15
16int main() {
17 ios::sync_with_stdio(false);
18 cin.tie(nullptr);
19
20 string s;
21 if (!(cin >> s)) return 0;
22 int n = (int)s.size();
23 vector<int> pi = prefix_function(s);
24 int k = n - pi[n - 1]; // candidate minimal period
25
26 if (n % k == 0) {
27 cout << "minimal_period_length=" << k << ", repetitions=" << (n / k) << "\n";
28 } else {
29 cout << "minimal_period_length=" << n << ", repetitions=1\n";
30 }
31
32 // Bonus: list all border lengths by following pi links
33 vector<int> borders;
34 for (int L = pi[n - 1]; L > 0; L = pi[L - 1]) borders.push_back(L);
35 sort(borders.begin(), borders.end());
36 cout << "borders:";
37 for (int L : borders) cout << ' ' << L;
38 cout << "\n";
39 return 0;
40}
41

The minimal period of s is k = n − π[n − 1] when k divides n; otherwise s has no shorter period. We also enumerate all border lengths by following π links from π[n − 1] down to 0.

Time: O(n)Space: O(n)
Count how many times each prefix of a string appears as a substring
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> prefix_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> pi(n, 0);
7 for (int i = 1; i < n; ++i) {
8 int j = pi[i - 1];
9 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
10 if (s[i] == s[j]) ++j;
11 pi[i] = j;
12 }
13 return pi;
14}
15
16int main() {
17 ios::sync_with_stdio(false);
18 cin.tie(nullptr);
19
20 string s;
21 if (!(cin >> s)) return 0;
22 int n = (int)s.size();
23 vector<int> pi = prefix_function(s);
24
25 // occ[len] = number of occurrences of prefix of length 'len' as a substring
26 vector<int> occ(n + 1, 0);
27
28 // Each prefix-function value pi[i] contributes one occurrence of that prefix as a suffix ending at i
29 for (int i = 0; i < n; ++i) occ[pi[i]]++;
30
31 // Propagate counts up the border chain: a longer prefix occurrence implies one occurrence of its borders
32 for (int len = n; len > 0; --len) {
33 if (len - 1 >= 0) occ[pi[len - 1]] += occ[len];
34 }
35
36 // Each prefix occurs at least once as itself (ending at its own index)
37 for (int len = 1; len <= n; ++len) occ[len]++;
38
39 // Output: for each prefix length, print its count
40 for (int len = 1; len <= n; ++len) {
41 cout << len << ": " << occ[len] << "\n";
42 }
43 return 0;
44}
45

Occurrences of longer prefixes imply occurrences of their borders. We first count how many times each prefix length appears as a suffix of prefixes (using π values), then propagate counts along failure links, and finally add 1 to count the prefix itself. This yields, for every L, how many times s[0..L−1] appears anywhere in s.

Time: O(n)Space: O(n)
Streaming KMP: match across chunks of input
1#include <bits/stdc++.h>
2using namespace std;
3
4struct KMPStream {
5 string p;
6 vector<int> pi;
7 int j = 0; // current matched length in pattern
8 long long pos = 0; // number of processed characters so far in the stream
9
10 explicit KMPStream(string pattern) : p(std::move(pattern)) {
11 int m = (int)p.size();
12 pi.assign(m, 0);
13 for (int i = 1; i < m; ++i) {
14 int k = pi[i - 1];
15 while (k > 0 && p[i] != p[k]) k = pi[k - 1];
16 if (p[i] == p[k]) ++k;
17 pi[i] = k;
18 }
19 }
20
21 // Process a chunk and return starting indices (0-based) of matches in the global stream
22 vector<long long> pushChunk(const string &chunk) {
23 vector<long long> out;
24 if (p.empty()) return out; // define: empty pattern -> no reports
25 for (char c : chunk) {
26 while (j > 0 && c != p[j]) j = pi[j - 1];
27 if (c == p[j]) ++j;
28 if (j == (int)p.size()) {
29 long long start = (pos) - (long long)p.size() + 1; // match ends at current pos
30 out.push_back(start);
31 j = pi[j - 1];
32 }
33 ++pos;
34 }
35 return out;
36 }
37};
38
39int main() {
40 ios::sync_with_stdio(false);
41 cin.tie(nullptr);
42
43 string pattern; cin >> pattern;
44 KMPStream kmp(pattern);
45
46 int q; cin >> q; // number of chunks
47 long long total = 0;
48 while (q--) {
49 string chunk; cin >> chunk;
50 vector<long long> occ = kmp.pushChunk(chunk);
51 total += (long long)occ.size();
52 for (auto idx : occ) cout << idx << ' ';
53 }
54 cout << "\nTotal matches: " << total << "\n";
55 return 0;
56}
57

This class builds π once and then processes arbitrary text chunks while maintaining the current match length j and the global position. It reports match start indices relative to the entire stream, demonstrating KMP's online nature.

Time: O(total_chars + m) over all chunksSpace: O(m)