⚙️AlgorithmIntermediate

KMP - Prefix Function Applications

Key Points

  • The prefix function π of a string tells, for every position, the length of the longest proper prefix that is also a suffix of the prefix ending there.
  • You can find the minimal period of a string in O(n) using - and checking if n is divisible by T.
  • KMP uses π to search a pattern in a text in O(n + m) time while allowing overlapping matches.
  • The failure (border) links of π form a tree on prefix lengths that can be used to aggregate counts of prefix occurrences.
  • Building a KMP automaton from π gives O(1) per character transitions and is handy for processing many texts against the same pattern.
  • Prefix-function-based techniques help in periodicity analysis, repetition detection, and substring statistics.
  • Common pitfalls include off-by-one mistakes and confusing the prefix function with the Z-function.
  • For multiple patterns, extend the idea to the Aho–Corasick automaton; KMP automaton is the single-pattern building block.

Prerequisites

  • String indexing and substringsUnderstanding how prefixes and suffixes are defined and accessed is essential for π and borders.
  • Big-O notationTo analyze and compare the time and space complexity of KMP and related constructions.
  • Finite automata basicsAutomaton construction for KMP uses DFA states and transitions over an alphabet.
  • Vectors and arrays in C++Implementing π, KMP, and automata requires comfortable use of dynamic arrays.
  • 0-based vs 1-based indexingCorrectly converting between prefix lengths (states) and character indices avoids off-by-one errors.

Detailed Explanation

Tap terms for definitions

01Overview

The KMP (Knuth–Morris–Pratt) algorithm and its prefix function π are fundamental tools in string processing. The prefix function π[i] for a string s measures, for each position i, the length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i]. This information lets us efficiently skip characters when searching for a pattern in a text, yielding O(n + m) time for text length n and pattern length m. Beyond searching, π exposes deep structure of a string: its borders (prefixes that are also suffixes), its minimal period, and relationships between all prefixes via so-called failure links. These links form a tree over prefix lengths that can be used to accumulate how often each prefix appears as a substring. From π we can also build a deterministic finite automaton (the KMP automaton) with m + 1 states that processes a text one character at a time with O(1) transitions. This automaton is especially useful when the same pattern is used repeatedly against many texts. In competitive programming (CF 1500–1900), problems often require computing minimal periods, counting occurrences including overlaps, or computing statistics for all prefixes—tasks naturally expressed using π, its tree, and the automaton.

02Intuition & Analogies

Imagine you’re assembling a jigsaw puzzle strip-by-strip. Each time you place a new piece (a character), you try to see how much of the front of the puzzle matches its own tail. The prefix function π[i] is the length of the longest such front–tail overlap after placing piece i. When you search for a pattern in a long text, instead of restarting from scratch on a mismatch, you slide the pattern just enough so that the known overlapping part still matches—like aligning the puzzle’s tail with the largest confirmed head that still fits. That overlap length is exactly what π tells you. For periodicity, think of a song loop: if the ending of the song already matches its beginning by L seconds, then the loop length is the song length minus L. If the song’s length is an exact multiple of that loop, the song is built by repeating that loop. The failure links (π values viewed as arrows from a longer overlap to a shorter one) are like backup plans: if your best overlap fails, you immediately try the next best candidate without re-checking from zero. Building an automaton is like precomputing how you’d react to every possible next note (character) at every point in the song (state), so during performance you never pause to think—you just follow the next transition instantly.

03Formal Definition

Given a string s of length n indexed from 0, the prefix function \(\) is an array where \([i]\) is the maximum integer \(k\) with \(0 k < i+1\) such that \(s[0..k-1] = s[i-k+1..i]\). Here, prefixes and suffixes are taken of the prefix \(s[0..i]\), and the prefix must be proper (strictly shorter than \(i+1\)). A border of a string is any non-empty proper prefix that is also a suffix. The sequence \([i]\) induces a chain of borders: if \(k = [i]\), then the next shorter border has length \([k-1]\), and so on, until zero. A period \(p\) of a string s of length n is an integer such that \(s[i] = s[i-p]\) for all \(i p\). The minimal period is \( - [n-1]\); it is a true period if and only if \(n\) is divisible by \(\). The KMP automaton is a DFA with states \(0,1,,m\) (prefix lengths of a pattern of length m). For a character c, the transition \((q,c)\) is the length of the longest prefix that matches the current suffix followed by c, computed using failure links from \(q\) (with the base that if \(q<m\) and pattern[q]=c then \(q+1\)).

04When to Use

Use prefix-function-based methods when you need: (1) Fast pattern matching allowing overlaps: KMP finds all occurrences in O(n + m). (2) Periodicity and repetition detection: compute minimal period and the number of repeats in O(n). This appears in strings made of repeated blocks, rope folding, or circular shift checks. (3) Counting how many times each prefix appears within a string: compute π on the whole string and aggregate along failure links. (4) Many queries with the same pattern: prebuild the KMP automaton and process multiple texts with O(1) per character. (5) Constructive algorithms on prefixes/borders: solve tasks about the longest border of each prefix, chain of borders, or find the number of border occurrences. For multiple patterns in one pass over a text, use Aho–Corasick, which generalizes KMP to a trie with failure links; however, understanding π, failure links, and the KMP automaton provides the conceptual foundation.

⚠️Common Mistakes

Common pitfalls include: (1) Off-by-one errors: π[i] refers to s[0..i], and when converting a border length k to an index, the last index is k-1. (2) Confusing the prefix function with the Z-function: π deals with prefix–suffix overlaps of prefixes, while Z compares the prefix with substrings starting at each position. (3) Incorrect period check: computing T = n - π[n-1] is necessary but not sufficient; you must also verify n mod T = 0 to claim periodicity. (4) Forgetting overlaps when counting matches: naive counting can skip overlapping occurrences; KMP naturally counts overlaps. (5) Mishandling automaton transitions: when building next(q, c), if q=0 and pattern[0] != c, the next state is 0, not -1; for q>0, follow failure links using π until a match or zero. (6) Using the wrong alphabet size or mapping in the automaton, causing out-of-bounds or incorrect transitions. (7) Not handling empty strings or degenerate cases: an empty pattern conventionally matches at every position, but many implementations assume m > 0; define behavior explicitly. (8) Mixing 1-based and 0-based indexing when converting between lengths (states) and character indices; states are lengths 0..m, while indices are 0..m-1.

Key Formulas

Prefix Function Definition

Explanation: For each position i, is the largest size of a border of the prefix ending at i. It measures how much the string matches itself at both ends up to i.

Minimal Period Candidate

Explanation: The minimal period candidate equals the length minus the longest border of the whole string. It is a true period if and only if n is divisible by this value.

Periodicity Test

Explanation: A string is made of repeated copies of its first block of length n - exactly when the length is a multiple of that block length.

KMP Automaton Transition

Explanation: From a state q (matched prefix length), if the next pattern character matches c, advance; otherwise follow failure links until a match or zero, where mismatches loop at 0.

Prefix Occurrence Aggregation

Explanation: The count of occurrences of prefix of length v equals its own occurrence as a prefix plus the occurrences contributed by longer prefixes that fall back to v via

KMP Amortized Fallbacks

Explanation: Across the scan, each mismatch leads to a fallback, but total fallbacks are bounded by the number of character advances, keeping runtime linear.

Number of Repeats

Explanation: If the string is periodic, the number of times the minimal block repeats is the total length divided by the minimal period candidate.

KMP Time and Space

Explanation: Searching takes time linear in the text and pattern sizes; space is linear in the pattern due to the π array or automaton.

Complexity Analysis

Computing the prefix function π for a string of length n runs in O(n) time and O(n) space. The loop maintains a pointer k to the current border length; each character increases k at most once and decreases it along failure links. Because k never increases past n and each decrease strictly reduces k, total increments and decrements are O(n), yielding linear time. KMP search on text length N and pattern length M also runs in O(N + M) time: one pass to compute in O(M), then a single scan of the text. In the scan, the text index advances at most N times, and the pattern index advances at most N times; fallbacks do not cause re-reading text, so the total steps are linear. Space for KMP search is O(M) for plus O(1) extra variables. The minimal period check uses and is O(1) after computing Building the KMP automaton for an alphabet of size Σ creates a table of size (M+1)×Σ; the straightforward construction is O(M·Σ) time and space. Running the automaton on a text is O(N) time and O(1) extra space beyond the state. The prefix-function tree on prefix lengths 0..n has n edges and can be built implicitly from π in O(n). Aggregating counts over the tree (e.g., via reverse propagation from longer to shorter borders) is O(n) time and O(n) space. For multiple texts against the same pattern, the automaton amortizes the pattern preprocessing to O(M·Σ) once, with each new text processed in O(N).

Code Examples

Compute Prefix Function and Minimal Period of a String
1#include <bits/stdc++.h>
2using namespace std;
3
4// Compute prefix function (pi) for string s in O(n)
5vector<int> prefix_function(const string &s) {
6 int n = (int)s.size();
7 vector<int> pi(n, 0);
8 for (int i = 1; i < n; ++i) {
9 int j = pi[i - 1];
10 // Fallback along failure links until we can extend or reach 0
11 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
12 if (s[i] == s[j]) ++j;
13 pi[i] = j;
14 }
15 return pi;
16}
17
18int main() {
19 ios::sync_with_stdio(false);
20 cin.tie(nullptr);
21
22 string s;
23 // Example input: ababab
24 if (!(cin >> s)) return 0;
25
26 vector<int> pi = prefix_function(s);
27 int n = (int)s.size();
28 int T = n - (n ? pi[n - 1] : 0); // minimal period candidate
29
30 // Check if s is periodic with period T
31 bool periodic = (n > 0 && n % T == 0);
32
33 cout << "String: " << s << "\n";
34 cout << "pi: ";
35 for (int x : pi) cout << x << ' ';
36 cout << "\n";
37
38 cout << "Minimal period length candidate T = n - pi[n-1] = " << T << "\n";
39 if (periodic) {
40 cout << "String is periodic. Minimal period = " << T << ", repeats = " << n / T << "\n";
41 cout << "Period block: '" << s.substr(0, T) << "'\n";
42 } else {
43 cout << "String is not a repetition of a smaller block. Minimal period is the string itself (" << n << ")\n";
44 }
45
46 return 0;
47}
48

This program computes the prefix function and uses it to test periodicity via T = n - π[n-1]. If n is divisible by T, the string is composed of n/T repeats of the first T characters. The code prints π, the period candidate, and whether the string is periodic.

Time: O(n)Space: O(n)
KMP Search: Count and List All (Overlapping) Occurrences
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> prefix_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> pi(n, 0);
7 for (int i = 1; i < n; ++i) {
8 int j = pi[i - 1];
9 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
10 if (s[i] == s[j]) ++j;
11 pi[i] = j;
12 }
13 return pi;
14}
15
16// KMP search: return starting indices where pattern occurs in text (including overlaps)
17vector<int> kmp_search(const string &text, const string &pattern) {
18 vector<int> res;
19 int n = (int)text.size();
20 int m = (int)pattern.size();
21 if (m == 0) { // define: empty pattern matches at every position
22 for (int i = 0; i <= n; ++i) res.push_back(i);
23 return res;
24 }
25 vector<int> pi = prefix_function(pattern);
26 int j = 0; // number of characters matched in pattern
27 for (int i = 0; i < n; ++i) {
28 while (j > 0 && text[i] != pattern[j]) j = pi[j - 1]; // fallback
29 if (text[i] == pattern[j]) ++j; // extend
30 if (j == m) {
31 res.push_back(i - m + 1); // match ends at i
32 j = pi[j - 1]; // allow overlaps by falling back
33 }
34 }
35 return res;
36}
37
38int main() {
39 ios::sync_with_stdio(false);
40 cin.tie(nullptr);
41
42 string text, pattern;
43 // Example: text = ababababa, pattern = aba -> matches at 0,2,4,6
44 if (!(cin >> text >> pattern)) return 0;
45 vector<int> occ = kmp_search(text, pattern);
46
47 cout << "Text: " << text << "\n";
48 cout << "Pattern: " << pattern << "\n";
49 cout << "Count: " << occ.size() << "\n";
50 cout << "Positions: ";
51 for (int i = 0; i < (int)occ.size(); ++i) cout << occ[i] << (i + 1 == (int)occ.size() ? '\n' : ' ');
52
53 return 0;
54}
55

This code searches for all occurrences of a pattern in a text using KMP and lists their starting indices, including overlapping matches. It maintains a pattern index j and uses π to jump on mismatches without re-examining characters.

Time: O(n + m)Space: O(m)
Prefix-Function Tree Aggregation: Count Occurrences of Every Prefix in the String
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> prefix_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> pi(n, 0);
7 for (int i = 1; i < n; ++i) {
8 int j = pi[i - 1];
9 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
10 if (s[i] == s[j]) ++j;
11 pi[i] = j;
12 }
13 return pi;
14}
15
16// Given s, compute how many times each prefix s[0..v-1] appears as a substring of s.
17// Returns vector occ of size n+1, occ[v] = count for prefix length v (occ[0] is unused = 0).
18vector<long long> count_prefix_occurrences(const string &s) {
19 int n = (int)s.size();
20 vector<int> pi = prefix_function(s);
21 vector<long long> occ(n + 1, 0);
22
23 // Every position i contributes one occurrence to the border length pi[i]
24 for (int i = 0; i < n; ++i) occ[pi[i]]++;
25
26 // Propagate counts from longer borders to shorter via failure links
27 // For v from n down to 1: add children counts to parent (pi[v-1])
28 for (int v = n; v >= 1; --v) {
29 if (v - 1 >= 0) occ[ (v - 1 >= 0 ? (v - 1 >= 0 ? (pi[v - 1]) : 0) : 0) ] += occ[v];
30 }
31
32 // Each prefix appears at least once as itself (ending at its own end)
33 for (int v = 1; v <= n; ++v) occ[v]++;
34
35 return occ;
36}
37
38int main() {
39 ios::sync_with_stdio(false);
40 cin.tie(nullptr);
41
42 string s;
43 // Example: s = ababa
44 if (!(cin >> s)) return 0;
45
46 auto occ = count_prefix_occurrences(s);
47 cout << "String: " << s << "\n";
48 cout << "Prefix occurrence counts (length -> count):\n";
49 for (int v = 1; v <= (int)s.size(); ++v) {
50 cout << v << " -> " << occ[v] << " (\"" << s.substr(0, v) << "\")\n";
51 }
52
53 return 0;
54}
55

This program counts how many times each prefix of s appears as a substring of s. The trick is to count contributions at each position into occ[π[i]] and then accumulate along failure links from longer to shorter borders. Finally, each prefix occurs at least once as itself, so add one. This is a classic π-tree aggregation.

Time: O(n)Space: O(n)
Build and Use the KMP Automaton for Fast Multiple-Text Processing
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> prefix_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> pi(n, 0);
7 for (int i = 1; i < n; ++i) {
8 int j = pi[i - 1];
9 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
10 if (s[i] == s[j]) ++j;
11 pi[i] = j;
12 }
13 return pi;
14}
15
16// Build KMP automaton for lowercase 'a'..'z'.
17// go[q][c] = next state from state q on character 'a'+c, states are 0..m.
18vector<array<int,26>> build_kmp_automaton(const string &pat) {
19 int m = (int)pat.size();
20 vector<int> pi = prefix_function(pat);
21 vector<array<int,26>> go(m + 1);
22
23 for (int c = 0; c < 26; ++c) go[0][c] = 0;
24 if (m > 0) go[0][pat[0]-'a'] = 1;
25
26 for (int q = 1; q <= m; ++q) {
27 for (int c = 0; c < 26; ++c) {
28 if (q < m && pat[q] - 'a' == c) {
29 go[q][c] = q + 1;
30 } else {
31 go[q][c] = go[ (q == 0 ? 0 : pi[q - 1]) ][c];
32 }
33 }
34 }
35 return go;
36}
37
38// Use automaton to count occurrences in a text quickly
39int count_occurrences_automaton(const string &text, const string &pat, const vector<array<int,26>> &go) {
40 int m = (int)pat.size();
41 int state = 0, ans = 0;
42 for (char ch : text) {
43 if ('a' <= ch && ch <= 'z') {
44 state = go[state][ch - 'a'];
45 if (state == m) {
46 ++ans;
47 // Follow failure link implicitly via transition from state m
48 // No manual fallback needed since go captures it
49 }
50 } else {
51 // If the input can contain other chars, either extend alphabet or reset appropriately
52 state = 0; // simple policy: reset on non-lowercase
53 }
54 }
55 return ans;
56}
57
58int main() {
59 ios::sync_with_stdio(false);
60 cin.tie(nullptr);
61
62 string pat;
63 int T;
64 // Example: pattern "aba", then multiple texts
65 if (!(cin >> pat >> T)) return 0;
66 auto go = build_kmp_automaton(pat);
67
68 while (T--) {
69 string text; cin >> text;
70 int cnt = count_occurrences_automaton(text, pat, go);
71 cout << cnt << "\n";
72 }
73
74 return 0;
75}
76

This builds the KMP automaton for a lowercase alphabet and then reuses it to process multiple texts quickly. Each character transition is O(1). If you need multi-pattern matching in a single pass, switch to Aho–Corasick; the KMP automaton is the single-pattern case.

Time: O(m * Σ) build, O(n) per textSpace: O(m * Σ)