⚙️AlgorithmIntermediate

Z-Function

Key Points

  • The Z-function of a string S computes for each position i the length of the longest substring starting at i that matches the prefix of S.
  • It can be computed in O(n) time using the Z-box optimization that maintains the furthest matched segment [l, r].
  • For pattern matching, build T = pattern + '$' + text and every index with z[i] ≥ |pattern| marks an occurrence of the pattern in the text.
  • The Z-function is often simpler to implement than KMP while providing similar capabilities for prefix-suffix comparisons.
  • You can detect borders and the smallest period of a string with simple checks on the Z-array (e.g., i + z[i] = n).
  • Z-values compactly encode LCPs (longest common prefixes) between the whole string and each of its suffixes.
  • Counting how many times each prefix appears in the string can be done in O(n) using a difference array built from Z-values.
  • Be careful to choose a sentinel character that does not appear in the inputs when concatenating pattern and text.

Prerequisites

  • Basic string indexing and substring notationUnderstanding how substrings, prefixes, and suffixes are represented is fundamental to interpreting Z-values.
  • Big-O time and space complexityThe Z-algorithm’s appeal is its O(n) performance; analyzing it requires basic asymptotic reasoning.
  • Prefix-function (KMP) basicsZ is often compared with KMP and solves similar problems; knowing KMP helps contextualize Z’s strengths.
  • Amortized analysisA clear argument for Z’s linear time hinges on amortized reasoning about how often the right boundary moves.

Detailed Explanation

Tap terms for definitions

01Overview

The Z-function (or Z-array) is a linear-time string algorithm that, for each position i in a string S, tells you how many characters from S[i..] match the prefix S[0..]. In other words, z[i] = LCP(S, S[i..]). With a clever sliding-window idea called the Z-box, we compute all these longest-common-prefix values in O(n) time, where n is the length of the string. This ability to compare the prefix against every suffix is surprisingly powerful: pattern matching, detecting borders (strings that are both prefix and suffix), finding the smallest period, and counting occurrences of prefixes are all straightforward with Z. In competitive programming, Z is a popular alternative to KMP because it’s short to code and easy to reason about. A classic trick for pattern matching uses concatenation: build T = P + '$' + S, compute Z on T, and any index i with z[i] ≥ |P| corresponds to an occurrence of P in S. Beyond matching, Z provides fast prefix-equality checks that often simplify problems involving repetitions and string compression.

02Intuition & Analogies

Hook: Imagine you have a long necklace of colored beads (the string). You keep a short template of beads—the beginning of the necklace (the prefix)—and you slide it along the necklace to see how many beads match at each position. The question at each position i is: how many beads from the template match the beads starting at i before the colors differ? That count is z[i]. Concept: Naively, you could compare bead-by-bead from scratch at each position, but that would be slow (quadratic). The Z-box trick says: remember the furthest you have already matched—a window [l, r] that is known to match the prefix. If your current position i falls inside this window, then part of the answer is already known by mirroring positions back to the prefix. You only need to compare new beads beyond r to finish the answer at i. This keeps the total number of comparisons linear because every time you extend r, you move it to the right, and it can pass each position only once. Example: Consider S = "abacaba". Starting at i = 3 (0-based), the substring is "caba". Compare with the prefix "abacaba": the first letters 'c' vs 'a' differ, so z[3] = 0. At i = 4, substring is "aba" and the prefix is "aba..."; they match for 3 characters, so z[4] = 3. If earlier we had a Z-box [l, r] covering i = 4, we’d reuse part of that information instead of re-checking from scratch. Across the string, you get values like [0, 0, 1, 0, 3, 0, 1], which tell you exactly where the prefix shows up again and how far it matches.

03Formal Definition

Given a string S of length n, the Z-function z is an integer array where z[0] is conventionally 0 (or sometimes n), and for i in {1, 2, ..., n-1}, z[i] = max { | S[0..k-1] = S[i..i+k-1] }. Equivalently, z[i] is the length of the longest common prefix between S and the suffix of S starting at i. The Z-box technique maintains indices (l, r) such that S[l..r] matches S[0..r-l]. When processing position i, if , we initialize z[i] as min(r - i + 1, z[i - l]) and then extend it by direct character comparisons while S[z[i]] = S[i + z[i]]. If , we start with z[i] = 0 and extend. After computing z[i], we update (l, r) to (i, i + z[i] - 1) if we reached further. The overall time is O(n) because each character comparison extends r at most once, and comparisons inside a Z-box use previously computed values without re-checking characters already known to match.

04When to Use

Use the Z-function when problems revolve around comparing prefixes with shifted versions of the string or when you need fast pattern matching without building more complex structures. Common use cases include:

  • Exact pattern matching: Build T = P + '$' + S and read matches where z[i] ≥ |P|. This is a clean alternative to KMP for exact matching.
  • Borders and smallest period: If i + z[i] = n, then the suffix starting at i equals the prefix of length n - i; collecting these yields all border lengths. If n % p = 0 and z[p] = n - p, then p is a period; the smallest such p is the minimal period.
  • Counting prefix occurrences: For each i, every length L ≤ z[i] indicates that the prefix of length L occurs at position i. A difference-array trick aggregates counts of all prefix lengths in O(n).
  • String compression and repetition checks: Determine if a string is k repeats of a smaller block by checking a candidate period via Z. If you need more general longest common prefix queries between arbitrary suffixes, consider suffix arrays/trees. For palindromes, Manacher’s algorithm is more appropriate. For approximate matching, hashing or specialized algorithms might suit better.

⚠️Common Mistakes

  1. Forgetting a unique sentinel in pattern matching: In T = P + '+S,the' + S, the '' must not appear in P or S; otherwise, false matches can occur across the boundary. Pick a character outside the alphabet or use a delimiter guaranteed absent.
  2. Misinterpreting z[0]: Some sources set z[0] = n, others set z[0] = 0. Be consistent within your code and downstream logic. In competitive programming, z[0] = 0 is common.
  3. Off-by-one errors in reading matches: For T = P + '$' + S, the position in S for a match at T-index i is i - (|P| + 1). Forgetting the +1 for the sentinel is a classic bug.
  4. Ignoring empty or length-1 strings: Handle n = 0 or n = 1 carefully to avoid out-of-bounds access. The loops should naturally skip comparisons in these cases.
  5. Assuming Z is a substring search tool directly: Raw Z on S compares only with the prefix of S. For searching P in S, you must concatenate P, a sentinel, and S first.
  6. Misusing Z for tasks better served by other tools: Palindromes need Manacher; general distinct substring counting needs suffix structures; rolling hash may be better for certain equality checks across arbitrary substrings.
  7. Not resetting or updating the Z-box correctly: The condition to update (l, r) is if i + z[i] - 1 > r. Small mistakes here can silently corrupt many values.

Key Formulas

Z-function definition

Explanation: For each i, z[i] is the length of the longest common prefix between the whole string S and its suffix starting at i. It measures how far the prefix matches when starting at i.

Pattern matching concatenation

Explanation: Build a new string T by concatenating the pattern P, a sentinel '$', and the text S. Any index i in T with z[i] \ge |P| corresponds to an occurrence of P in S.

Counting matches

Explanation: After computing Z on T = P + '$' + S, count indices where the Z-value reaches at least the pattern length. Each such index signals a full match.

Suffix equals prefix (border check)

Explanation: If the z-value at i extends exactly to the end, then the suffix starting at i matches the prefix of length n - i. This identifies border lengths.

Minimal period test

Explanation: A candidate period p must tile the string evenly and the prefix of length n - p must match the suffix starting at p. The smallest p satisfying this is the minimal period.

Prefix occurrence counts

Explanation: The prefix of length L appears once at position 0, plus once for each shift i where the Z-value is at least L. A difference array lets us compute all in O(n).

Amortized O(n) bound

Explanation: Each position is involved in at most a constant number of character comparisons: either it extends the right boundary r or is reused via Z-box mirroring. Hence total work is linear.

Time and space complexity

Explanation: Computing the Z-array scans the string once with amortized constant work per index, and stores one integer per position.

Z as LCP with prefix

Explanation: Z-values are exactly the longest common prefix lengths between S and each suffix S[i..]. This interpretation guides applications that rely on prefix-suffix equality.

Complexity Analysis

The Z-algorithm runs in linear time O(n) with O(n) extra space. The core idea is the Z-box [l, r]: while iterating i from left to right, we either start a new match from scratch (if i > r) or reuse previously computed information by mirroring into the prefix (if i ≤ r). In the reuse case, we initialize z[i] = min(r - i + 1, z[i - l]) without extra character checks inside the current box. Only when we try to extend beyond r do we perform explicit character comparisons, increasing r. Since r begins at -1 and can move right at most n - 1 times across the entire run, the total number of character comparisons that advance r is at most n - 1. Comparisons inside the box do not re-check matched characters and take constant time per i for initialization. Thus, summing over all i, the total work is bounded by a constant times n. Space usage is O(n) for the Z-array. We also store a few indices (l, r), which are O(1). For pattern matching using T = P + '$' + S, the time becomes O(|P| + |S|) and space O(|P| + |S|) to hold T and its Z-array. Additional applications like counting prefix occurrences or finding borders work within the same O(n) time and space bounds, often using small auxiliary arrays (e.g., a difference array for counts).

Code Examples

Compute the Z-array with the Z-box optimization
1#include <bits/stdc++.h>
2using namespace std;
3
4// Computes Z-function of string s.
5// z[i] = length of longest substring starting at i that matches the prefix.
6// Convention: z[0] = 0.
7vector<int> z_function(const string &s) {
8 int n = (int)s.size();
9 vector<int> z(n, 0);
10 int l = 0, r = 0; // current Z-box is [l, r]
11 for (int i = 1; i < n; ++i) {
12 if (i <= r) {
13 // We are inside the current Z-box; mirror to the prefix
14 z[i] = min(r - i + 1, z[i - l]);
15 }
16 // Try to extend the match beyond r
17 while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
18 ++z[i];
19 }
20 // Update the Z-box if we extended beyond r
21 if (i + z[i] - 1 > r) {
22 l = i;
23 r = i + z[i] - 1;
24 }
25 }
26 return z;
27}
28
29int main() {
30 vector<string> tests = {"abacaba", "aaaaa", "abcabcd"};
31 for (const auto &s : tests) {
32 auto z = z_function(s);
33 cout << "S = '" << s << "'\nZ : ";
34 for (int i = 0; i < (int)z.size(); ++i) cout << z[i] << (i + 1 == (int)z.size() ? '\n' : ' ');
35 cout << "---\n";
36 }
37 return 0;
38}
39

This program implements the linear-time Z-algorithm. It maintains a Z-box [l, r] that mirrors known matches to initialize z[i]. Only when extending beyond r do we compare new characters, ensuring O(n) total time. The main function runs the algorithm on a few test strings and prints the resulting Z-arrays.

Time: O(n)Space: O(n)
Exact pattern matching via Z on concatenated string T = P + '$' + S
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> z_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> z(n, 0);
7 int l = 0, r = 0;
8 for (int i = 1; i < n; ++i) {
9 if (i <= r) z[i] = min(r - i + 1, z[i - l]);
10 while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
11 if (i + z[i] - 1 > r) l = i, r = i + z[i] - 1;
12 }
13 return z;
14}
15
16int main() {
17 string pattern = "aba";
18 string text = "abacabaaba";
19
20 // Choose a sentinel not present in pattern or text. If '$' can appear,
21 // pick a different character that is guaranteed absent.
22 char sentinel = '$';
23 string T = pattern + sentinel + text;
24
25 vector<int> z = z_function(T);
26
27 int m = (int)pattern.size();
28 int sep = m + 1; // index after sentinel in T
29
30 vector<int> positions; // starting indices in 'text'
31 for (int i = sep; i < (int)T.size(); ++i) {
32 if (z[i] >= m) {
33 int pos_in_text = i - sep;
34 positions.push_back(pos_in_text);
35 }
36 }
37
38 cout << "Pattern: '" << pattern << "'\nText : '" << text << "'\nMatches at positions: ";
39 for (int i = 0; i < (int)positions.size(); ++i) {
40 cout << positions[i] << (i + 1 == (int)positions.size() ? '\n' : ' ');
41 }
42 if (positions.empty()) cout << "(none)\n";
43 return 0;
44}
45

We build T = P + '$' + S and compute Z on T. Every index i with z[i] ≥ |P| indicates a full match of P starting at position i - (|P| + 1) in S. The sentinel prevents false cross-boundary matches. This approach is a compact alternative to KMP for exact matching.

Time: O(|P| + |S|)Space: O(|P| + |S|)
Find all borders and the smallest period of a string using Z
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> z_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> z(n, 0);
7 int l = 0, r = 0;
8 for (int i = 1; i < n; ++i) {
9 if (i <= r) z[i] = min(r - i + 1, z[i - l]);
10 while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
11 if (i + z[i] - 1 > r) l = i, r = i + z[i] - 1;
12 }
13 return z;
14}
15
16int main() {
17 string s = "abcabcabc"; // try also: "aaaa", "abacaba", "abababa"
18 int n = (int)s.size();
19 vector<int> z = z_function(s);
20
21 // Borders: lengths L such that suffix of length L equals prefix of length L.
22 // Using i + z[i] == n, then L = n - i is a border.
23 vector<int> borders;
24 for (int i = 1; i < n; ++i) {
25 if (i + z[i] == n) {
26 int L = n - i;
27 borders.push_back(L);
28 }
29 }
30 sort(borders.begin(), borders.end());
31
32 // Smallest period p: smallest i > 0 such that n % i == 0 and z[i] == n - i.
33 int period = n;
34 for (int i = 1; i < n; ++i) {
35 if (n % i == 0 && z[i] == n - i) {
36 period = i;
37 break;
38 }
39 }
40
41 cout << "S = '" << s << "'\n";
42 cout << "Borders (lengths): ";
43 if (borders.empty()) cout << "(none)";
44 for (int i = 0; i < (int)borders.size(); ++i) {
45 cout << borders[i] << (i + 1 == (int)borders.size() ? '\n' : ' ');
46 }
47 if (!borders.empty()) cout << '\n';
48 cout << "Smallest period: " << period << "\n";
49
50 if (period < n) {
51 cout << "Compressed form: repeat block '" << s.substr(0, period) << "' x " << (n / period) << "\n";
52 } else {
53 cout << "No non-trivial period (string is aperiodic).\n";
54 }
55 return 0;
56}
57

This program computes all border lengths using the condition i + z[i] = n. The smallest period is the smallest i such that n is divisible by i and z[i] = n - i. If such an i exists, the string is repetitions of its first i characters; otherwise it has no non-trivial period.

Time: O(n)Space: O(n)
Count how many times each prefix occurs in the string (range trick on Z)
1#include <bits/stdc++.h>
2using namespace std;
3
4vector<int> z_function(const string &s) {
5 int n = (int)s.size();
6 vector<int> z(n, 0);
7 int l = 0, r = 0;
8 for (int i = 1; i < n; ++i) {
9 if (i <= r) z[i] = min(r - i + 1, z[i - l]);
10 while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
11 if (i + z[i] - 1 > r) l = i, r = i + z[i] - 1;
12 }
13 return z;
14}
15
16int main() {
17 string s = "aaaa"; // Try: "ababa", "abcabcd"
18 int n = (int)s.size();
19 vector<int> z = z_function(s);
20
21 // occ[L] will store how many times the prefix of length L occurs in s.
22 // For each i>0, the prefix of every length 1..z[i] occurs at position i.
23 vector<long long> diff(n + 2, 0); // difference array for range adds on [1, z[i]]
24 for (int i = 1; i < n; ++i) {
25 if (z[i] > 0) {
26 diff[1] += 1;
27 diff[z[i] + 1] -= 1;
28 }
29 }
30 vector<long long> occ(n + 1, 0);
31 long long run = 0;
32 for (int L = 1; L <= n; ++L) {
33 run += diff[L];
34 occ[L] = run; // occurrences from positions i>0
35 }
36 // Add the occurrence at position 0 for every prefix length
37 for (int L = 1; L <= n; ++L) occ[L] += 1;
38
39 cout << "S = '" << s << "'\n";
40 for (int L = 1; L <= n; ++L) {
41 cout << "Prefix length " << L << " occurs " << occ[L] << " time(s)\n";
42 }
43
44 // Optionally, list which prefixes are also suffixes with their counts
45 cout << "\nBorders (lengths) with counts: \n";
46 for (int i = 1; i < n; ++i) {
47 if (i + z[i] == n) {
48 int L = n - i;
49 cout << "L = " << L << ", count = " << occ[L] << "\n";
50 }
51 }
52 return 0;
53}
54

For each shift i, any L ≤ z[i] indicates that the prefix of length L appears at position i. Using a difference array, we add +1 to the range [1, z[i]] for all i and then take prefix sums to get counts for every L. Finally, we add 1 for the trivial occurrence at position 0 for each prefix.

Time: O(n)Space: O(n)