⚙️AlgorithmAdvanced

Lyndon Factorization

Key Points

  • A Lyndon word is a string that is strictly smaller (lexicographically) than all of its nontrivial rotations.
  • Every string can be uniquely decomposed into a nonincreasing sequence of Lyndon words, known as the Chen–Fox–Lyndon (CFL) factorization.
  • Duval's algorithm computes the Lyndon factorization in O(n) time and O(1) extra space by scanning each character a constant number of times.
  • The lexicographically smallest rotation of a string can be found in O(n) time using a Duval/Booth-style two-pointer algorithm.
  • A string is a Lyndon word if and only if its smallest rotation is itself and the string is primitive (not a power of a shorter string).
  • Lyndon factorization helps in string comparison, minimal suffix/rotation problems, de Bruijn sequence construction, and has connections to suffix arrays.
  • The number of factors in the CFL decomposition is at most the string length, and the factors appear in nonincreasing lexicographic order.
  • These algorithms are cache-friendly, simple to implement in C++, and work for any ordered alphabet.

Prerequisites

  • Lexicographic orderLyndon words and factorization are defined using lexicographic comparisons.
  • String rotations and concatenationSmallest rotation and the Lyndon definition rely on understanding cyclic shifts and s+s.
  • Two-pointer techniquesBoth Duval’s and Booth’s algorithms use careful pointer advancement to achieve linear time.
  • Prefix function (KMP) or Z-functionUsed to test primitivity and reason about periodicity and borders.
  • Big-O notationTo analyze and compare the linear-time and constant-space guarantees.

Detailed Explanation

Tap terms for definitions

01Overview

Lyndon factorization is a powerful idea in string algorithms based on a special kind of word called a Lyndon word. A Lyndon word is a string that is strictly smaller, in lexicographic order, than any of its rotations. The Chen–Fox–Lyndon (CFL) theorem states that any finite string can be uniquely written as a concatenation of Lyndon words in nonincreasing lexicographic order. This decomposition is called the Lyndon (or CFL) factorization. The practical star of the show is Duval's algorithm, which computes this factorization in linear time O(n) and constant extra space. It proceeds with a clever two-pointer walk that finds the next maximal-length Lyndon factor starting at the current position, repeats it the correct number of times, and moves on. Closely related is a two-pointer algorithm (often attributed to Booth) that finds the lexicographically smallest rotation of a string in linear time. These tools are more than theoretical curiosities: they show up in competitive programming and computer science research. They enable fast solutions for problems about cyclic string comparison, minimal suffix/rotation, detecting primitive strings, and even play a role in constructing de Bruijn sequences and inspiring ideas in suffix array construction. They are simple, robust, and require only the ability to compare characters lexicographically.

02Intuition & Analogies

Think of a necklace made from lettered beads. If you rotate the necklace, you get another arrangement of the same beads. A Lyndon word is like a necklace arrangement that is the "best" (smallest in dictionary order) among all its rotations. In other words, it’s the most canonical way to present that necklace. Now imagine cutting a long beaded string into chunks so that each chunk is a best-possible necklace (a Lyndon word), and the chunks themselves are arranged from largest to smallest in dictionary order. That is Lyndon factorization: you greedily carve the string into the largest chunk at each step that still forms such a best necklace, and keep going. Because of how lexicographic order and rotations interact, this greedy strategy works globally and even yields a unique answer for any given string. For smallest rotation, picture turning the necklace little by little and asking: which starting bead makes the whole necklace read smallest left-to-right? The two-pointer algorithm compares candidate starts like two runners on a track, skipping over hopeless starting points in long leaps. Each time a mismatch shows one start is worse, you jump the worse start ahead by the amount you’ve already confirmed equal — so you never re-check equal stretches. This creates a linear-time method. Finally, to check if a string itself is a Lyndon word, combine these pictures: it must be the smallest among all its rotations (so the best start is the first bead) and it must not be a repetition of a shorter pattern (otherwise some rotation would tie rather than be strictly larger).

03Formal Definition

Let \(\) be a totally ordered alphabet. For a string \(w \) with length \(n\), define its rotation by \(k\) as \((w) = w[k..n-1] \, w[0..k-1]\) for \(1 k < n\). A nonempty string \(w\) is a Lyndon word if \(w < (w)\) in lexicographic order for every \(1 k < \); equivalently, \(w\) is strictly minimal in its conjugacy class under rotation and is primitive (not a power of a shorter string). The Chen–Fox–Lyndon theorem states that any string \(s \) can be uniquely written as \( ^{} L_\), where each \(\) is a Lyndon word, \( 1\) are integers (exponents of repetition), and \( > > \) in lexicographic order. The version without exponents is often presented as \( \) with \( \), where consecutive equal \(\) are grouped into \(L_\). Duval’s algorithm computes this factorization in \(O(n)\) time. It identifies the next Lyndon word starting at a position by maximal extension: use two indices \(i\) (start) and \(j\) (candidate comparison) and a span \(k\). While characters tie, advance \(k\); when they differ, decide whether to advance \(i\) or \(j\) by \(k+1\) to maintain linear progress. The minimal rotation of a string \(s\) is \( (s)\); a Booth/Duval two-pointer scheme also finds it in \(O(n)\) time.

04When to Use

Use Lyndon factorization when you need structural insight into a string’s lexicographic behavior: compressing a string into canonical building blocks, proving properties about periodicity/primitive words, or comparing large strings that share long repeated prefixes. It shines in problems where we must reason about all rotations or all suffixes but want to avoid (O(n^2)) brute force. Typical competitive programming scenarios include: finding the lexicographically smallest rotation of a string; checking if two cyclic strings are equal up to rotation or ordering them; recognizing whether a string is a power of a smaller string (primitivity); computing the lexicographically minimal suffix; or generating de Bruijn sequences by concatenating Lyndon words of lengths dividing a fixed (n). In some suffix-array-related reasoning, the factorization helps characterize minimal suffixes and compare big chunks at once. Prefer Duval’s factorization when you need the whole decomposition; prefer the dedicated smallest-rotation algorithm when you only need the best rotation. If you merely need substring search or borders, KMP/Z might be simpler. For global lex ordering across all suffixes, suffix arrays or suffix automata offer more power, but Duval-based tricks are lighter-weight and linear-time for the targeted tasks.

⚠️Common Mistakes

• Confusing nonincreasing with strictly decreasing: the CFL factorization yields a nonincreasing sequence of Lyndon words; equal consecutive factors are allowed and should be grouped with exponents. • Forgetting strictness in the Lyndon definition: a word like "aaa" is not Lyndon because its rotations are equal, not strictly larger; you must also check that the string is primitive. • Implementing smallest rotation with off-by-one errors on the doubled string: the standard algorithm never needs to actually build all rotations and should only compare within 2n bounds without modulo mistakes. • Re-checking characters in Duval’s loops: if you reset pointers incorrectly, you may degrade to (O(n^2)). The correct update is to jump the losing index by (k+1) and reset (k) to 0. • Assuming factor boundaries align with word semantics: Lyndon factors are determined solely by lex order, not by dictionary words or syllables. • Mishandling alphabets and sentinels: for minimal suffix vs minimal rotation, the algorithms differ subtly; do not append a sentinel unless you can guarantee it is strictly smaller (or larger) than all characters. • Comparing rotations by constructing strings: building each rotation costs (O(n)) per rotation, leading to (O(n^2)); use the linear-time two-pointer method instead.

Key Formulas

Lyndon Definition via Rotations

Explanation: A string is a Lyndon word if it is strictly smaller than every nontrivial rotation. This captures the idea that the string is the canonical representative of its rotation class.

Chen–Fox–Lyndon Factorization

Explanation: Every string s decomposes uniquely into powers of Lyndon words whose bases decrease strictly in lexicographic order. Grouping equal consecutive factors yields the exponents.

Duval Complexity

Explanation: Duval's algorithm for factorization runs in linear time and constant extra space. Each position is advanced a constant number of times due to pointer jumps.

Smallest Rotation

Explanation: The minimal rotation problem asks for the shift r that minimizes the rotated string in lex order. Booth/Duval’s two-pointer method finds this in O(n).

Primitivity via Prefix Function

Explanation: Let be the KMP prefix function of s. If the length n is a multiple of n - , then s consists of repeats of a shorter block; otherwise s is primitive.

Factor Count Bound

Explanation: Each factor has positive length and factors are nonoverlapping; thus their count is at most the string length. In practice, the count is often much smaller due to grouping.

Length Conservation

Explanation: The total length of the factorization, accounting for exponents, equals the original string length. This is a simple but useful invariant for correctness checks.

Smallest Rotation Complexity

Explanation: The two-pointer algorithm for minimal rotation also runs in linear time and constant space, by skipping over confirmed-equal blocks.

Complexity Analysis

Duval’s factorization algorithm maintains three integers (i, j, k) with i the start of the current factor, j a comparison index, and k the current matched length. At every step, characters s[i+k] and s[j+k] are compared. If they are equal, k increases. Otherwise, the smaller character determines which start (i or j) is lexicographically better; the worse start is advanced by k+1, and k resets to 0. Crucially, i and j never move backward, so each position of the string participates in a constant number of comparisons. Therefore, the total number of character comparisons is O(n), yielding T(n) = O(n) time. The algorithm uses only a handful of integer variables in addition to the input and outputs factor boundaries, so its extra space is S(n) = O(1) (excluding the space to store the output factors themselves). Booth’s algorithm for smallest rotation uses the same two-pointer idea on the doubled string s+s without materializing all rotations. It tracks two candidate starts (i and j) and a matched length k; when a mismatch occurs, the worse start jumps by k+1, and k resets. Since each index advances monotonically and never exceeds 2n, the total comparisons are O(n). Like Duval, it uses constant extra space beyond the input. The isLyndon check combines Booth (O(n)) with a primitivity test via the prefix function (O(n)), preserving linear time. The minimal-suffix algorithm is another two-pointer variant with identical complexity guarantees. Overall, these algorithms are tight and optimal under the RAM model because any solution must inspect all characters in the worst case.

Code Examples

Duval's Algorithm: Lyndon (CFL) Factorization with Grouped Exponents
1#include <bits/stdc++.h>
2using namespace std;
3
4// Duval's algorithm: returns vector of (start, length) for Lyndon factors
5// Also demonstrates grouping equal consecutive factors into (base, count)
6
7vector<pair<int,int>> duval_factors(const string &s) {
8 int n = (int)s.size();
9 vector<pair<int,int>> res; // (start, length) of each factor (ungrouped)
10 int i = 0;
11 while (i < n) {
12 int j = i + 1;
13 int k = i;
14 // Find the end of the next Lyndon word starting at i
15 while (j < n && s[k] <= s[j]) {
16 if (s[k] < s[j]) {
17 // Found strictly smaller char at s[k], reset k to start
18 k = i;
19 } else {
20 // s[k] == s[j], advance k
21 ++k;
22 }
23 ++j;
24 }
25 // The Lyndon word has length (j - k)
26 int len = j - k;
27 // Output as many copies as fit from i to j
28 while (i <= k) {
29 res.emplace_back(i, len);
30 i += len;
31 }
32 }
33 return res;
34}
35
36int main() {
37 ios::sync_with_stdio(false);
38 cin.tie(nullptr);
39
40 string s;
41 if (!(cin >> s)) return 0;
42
43 auto factors = duval_factors(s);
44
45 cout << "Ungrouped Lyndon factors (start,length):\n";
46 for (auto [st, len] : factors) {
47 cout << "(" << st << "," << len << ") -> '" << s.substr(st, len) << "'\n";
48 }
49
50 // Group equal consecutive factors into (base, count)
51 cout << "\nGrouped as base^count (CFL form):\n";
52 for (size_t i = 0; i < factors.size(); ) {
53 int st = factors[i].first;
54 int len = factors[i].second;
55 string base = s.substr(st, len);
56 int cnt = 1;
57 size_t j = i + 1;
58 while (j < factors.size() && s.substr(factors[j].first, factors[j].second) == base) {
59 ++cnt; ++j;
60 }
61 cout << "('" << base << "')^" << cnt << " ";
62 i = j;
63 }
64 cout << "\n";
65 return 0;
66}
67

This program implements Duval's algorithm to factor a string into a nonincreasing sequence of Lyndon words. It reports each factor with its starting index and length, then groups equal consecutive factors into the canonical CFL form with exponents. The core two-pointer loop discovers the next Lyndon word and the repetition count in O(n) total time.

Time: O(n)Space: O(1) extra (O(k) for output factors)
Smallest Rotation (Booth/Duval Two-Pointer Algorithm)
1#include <bits/stdc++.h>
2using namespace std;
3
4// Returns the index r in [0, n) such that rot_r(s) is lexicographically minimal
5int smallest_rotation_index(const string &s) {
6 int n = (int)s.size();
7 if (n == 0) return 0;
8 string t = s + s; // conceptual doubling (we only index within 2n)
9 int i = 0, j = 1, k = 0;
10 while (i < n && j < n && k < n) {
11 char a = t[i + k];
12 char b = t[j + k];
13 if (a == b) {
14 ++k;
15 continue;
16 }
17 if (a > b) {
18 // i is worse; skip ahead past the matched block
19 i = i + k + 1;
20 if (i <= j) i = j + 1;
21 } else {
22 // j is worse
23 j = j + k + 1;
24 if (j <= i) j = i + 1;
25 }
26 k = 0;
27 }
28 int r = min(i, j);
29 return r >= n ? r - n : r;
30}
31
32string rotate_at(const string &s, int r) {
33 int n = (int)s.size();
34 return s.substr(r) + s.substr(0, r);
35}
36
37int main() {
38 ios::sync_with_stdio(false);
39 cin.tie(nullptr);
40
41 string s;
42 if (!(cin >> s)) return 0;
43
44 int r = smallest_rotation_index(s);
45 cout << "Index: " << r << "\n";
46 cout << "Smallest rotation: " << rotate_at(s, r) << "\n";
47 return 0;
48}
49

This code finds the lexicographically smallest rotation of the input string in linear time. It keeps two candidate starts (i and j) and a matched length k over the conceptual doubled string s+s, skipping over losing candidates by k+1 at each mismatch. The minimal of i and j modulo n is the answer.

Time: O(n)Space: O(1) extra
Check If a String Is a Lyndon Word
1#include <bits/stdc++.h>
2using namespace std;
3
4int smallest_rotation_index(const string &s) {
5 int n = (int)s.size();
6 if (n == 0) return 0;
7 string t = s + s;
8 int i = 0, j = 1, k = 0;
9 while (i < n && j < n && k < n) {
10 char a = t[i + k];
11 char b = t[j + k];
12 if (a == b) { ++k; continue; }
13 if (a > b) { i = i + k + 1; if (i <= j) i = j + 1; }
14 else { j = j + k + 1; if (j <= i) j = i + 1; }
15 k = 0;
16 }
17 int r = min(i, j);
18 return r >= n ? r - n : r;
19}
20
21// KMP prefix function
22vector<int> prefix_function(const string &s) {
23 int n = (int)s.size();
24 vector<int> pi(n, 0);
25 for (int i = 1; i < n; ++i) {
26 int j = pi[i - 1];
27 while (j > 0 && s[i] != s[j]) j = pi[j - 1];
28 if (s[i] == s[j]) ++j;
29 pi[i] = j;
30 }
31 return pi;
32}
33
34bool is_primitive(const string &s) {
35 int n = (int)s.size();
36 if (n == 0) return false;
37 auto pi = prefix_function(s);
38 int p = n - pi[n - 1];
39 return (n % p) != 0; // primitive iff minimal period does not divide n
40}
41
42bool is_lyndon(const string &s) {
43 if (s.empty()) return false;
44 // Smallest rotation must be the string itself (index 0) and string must be primitive
45 return smallest_rotation_index(s) == 0 && is_primitive(s);
46}
47
48int main() {
49 ios::sync_with_stdio(false);
50 cin.tie(nullptr);
51
52 string s;
53 if (!(cin >> s)) return 0;
54 cout << (is_lyndon(s) ? "YES" : "NO") << "\n";
55 return 0;
56}
57

A string is Lyndon if it is strictly smaller than any of its rotations and is primitive. We compute the smallest rotation index (Booth/Duval) and ensure it is 0, then test primitivity via the KMP prefix function to rule out equal rotations from periodicity.

Time: O(n)Space: O(n) for prefix function
Lexicographically Minimal Suffix (Duval Two-Pointer Variant)
1#include <bits/stdc++.h>
2using namespace std;
3
4// Returns the start index of the lexicographically minimal suffix of s
5int minimal_suffix_index(const string &s) {
6 int n = (int)s.size();
7 if (n == 0) return 0;
8 int i = 0, j = 1, k = 0;
9 while (j + k < n) {
10 char a = s[i + k];
11 char b = s[j + k];
12 if (a == b) {
13 ++k;
14 } else if (a < b) {
15 // suffix at i is better; skip j past the matched block
16 j = j + k + 1;
17 k = 0;
18 if (j <= i) j = i + 1;
19 } else {
20 // suffix at j is better; advance i similarly
21 i = max(i + k + 1, j);
22 k = 0;
23 j = i + 1;
24 }
25 }
26 return i;
27}
28
29int main() {
30 ios::sync_with_stdio(false);
31 cin.tie(nullptr);
32
33 string s;
34 if (!(cin >> s)) return 0;
35 int idx = minimal_suffix_index(s);
36 cout << "Index: " << idx << "\n";
37 cout << "Minimal suffix: " << s.substr(idx) << "\n";
38 return 0;
39}
40

This is the non-cyclic analogue of the smallest-rotation algorithm. It finds the starting index of the lexicographically minimal suffix of s in linear time using two pointers and a matched length k.

Time: O(n)Space: O(1) extra