Lyndon Factorization
Key Points
- •A Lyndon word is a string that is strictly smaller (lexicographically) than all of its nontrivial rotations.
- •Every string can be uniquely decomposed into a nonincreasing sequence of Lyndon words, known as the Chen–Fox–Lyndon (CFL) factorization.
- •Duval's algorithm computes the Lyndon factorization in O(n) time and O(1) extra space by scanning each character a constant number of times.
- •The lexicographically smallest rotation of a string can be found in O(n) time using a Duval/Booth-style two-pointer algorithm.
- •A string is a Lyndon word if and only if its smallest rotation is itself and the string is primitive (not a power of a shorter string).
- •Lyndon factorization helps in string comparison, minimal suffix/rotation problems, de Bruijn sequence construction, and has connections to suffix arrays.
- •The number of factors in the CFL decomposition is at most the string length, and the factors appear in nonincreasing lexicographic order.
- •These algorithms are cache-friendly, simple to implement in C++, and work for any ordered alphabet.
Prerequisites
- →Lexicographic order — Lyndon words and factorization are defined using lexicographic comparisons.
- →String rotations and concatenation — Smallest rotation and the Lyndon definition rely on understanding cyclic shifts and s+s.
- →Two-pointer techniques — Both Duval’s and Booth’s algorithms use careful pointer advancement to achieve linear time.
- →Prefix function (KMP) or Z-function — Used to test primitivity and reason about periodicity and borders.
- →Big-O notation — To analyze and compare the linear-time and constant-space guarantees.
Detailed Explanation
Tap terms for definitions01Overview
Lyndon factorization is a powerful idea in string algorithms based on a special kind of word called a Lyndon word. A Lyndon word is a string that is strictly smaller, in lexicographic order, than any of its rotations. The Chen–Fox–Lyndon (CFL) theorem states that any finite string can be uniquely written as a concatenation of Lyndon words in nonincreasing lexicographic order. This decomposition is called the Lyndon (or CFL) factorization. The practical star of the show is Duval's algorithm, which computes this factorization in linear time O(n) and constant extra space. It proceeds with a clever two-pointer walk that finds the next maximal-length Lyndon factor starting at the current position, repeats it the correct number of times, and moves on. Closely related is a two-pointer algorithm (often attributed to Booth) that finds the lexicographically smallest rotation of a string in linear time. These tools are more than theoretical curiosities: they show up in competitive programming and computer science research. They enable fast solutions for problems about cyclic string comparison, minimal suffix/rotation, detecting primitive strings, and even play a role in constructing de Bruijn sequences and inspiring ideas in suffix array construction. They are simple, robust, and require only the ability to compare characters lexicographically.
02Intuition & Analogies
Think of a necklace made from lettered beads. If you rotate the necklace, you get another arrangement of the same beads. A Lyndon word is like a necklace arrangement that is the "best" (smallest in dictionary order) among all its rotations. In other words, it’s the most canonical way to present that necklace. Now imagine cutting a long beaded string into chunks so that each chunk is a best-possible necklace (a Lyndon word), and the chunks themselves are arranged from largest to smallest in dictionary order. That is Lyndon factorization: you greedily carve the string into the largest chunk at each step that still forms such a best necklace, and keep going. Because of how lexicographic order and rotations interact, this greedy strategy works globally and even yields a unique answer for any given string. For smallest rotation, picture turning the necklace little by little and asking: which starting bead makes the whole necklace read smallest left-to-right? The two-pointer algorithm compares candidate starts like two runners on a track, skipping over hopeless starting points in long leaps. Each time a mismatch shows one start is worse, you jump the worse start ahead by the amount you’ve already confirmed equal — so you never re-check equal stretches. This creates a linear-time method. Finally, to check if a string itself is a Lyndon word, combine these pictures: it must be the smallest among all its rotations (so the best start is the first bead) and it must not be a repetition of a shorter pattern (otherwise some rotation would tie rather than be strictly larger).
03Formal Definition
04When to Use
Use Lyndon factorization when you need structural insight into a string’s lexicographic behavior: compressing a string into canonical building blocks, proving properties about periodicity/primitive words, or comparing large strings that share long repeated prefixes. It shines in problems where we must reason about all rotations or all suffixes but want to avoid (O(n^2)) brute force. Typical competitive programming scenarios include: finding the lexicographically smallest rotation of a string; checking if two cyclic strings are equal up to rotation or ordering them; recognizing whether a string is a power of a smaller string (primitivity); computing the lexicographically minimal suffix; or generating de Bruijn sequences by concatenating Lyndon words of lengths dividing a fixed (n). In some suffix-array-related reasoning, the factorization helps characterize minimal suffixes and compare big chunks at once. Prefer Duval’s factorization when you need the whole decomposition; prefer the dedicated smallest-rotation algorithm when you only need the best rotation. If you merely need substring search or borders, KMP/Z might be simpler. For global lex ordering across all suffixes, suffix arrays or suffix automata offer more power, but Duval-based tricks are lighter-weight and linear-time for the targeted tasks.
⚠️Common Mistakes
• Confusing nonincreasing with strictly decreasing: the CFL factorization yields a nonincreasing sequence of Lyndon words; equal consecutive factors are allowed and should be grouped with exponents. • Forgetting strictness in the Lyndon definition: a word like "aaa" is not Lyndon because its rotations are equal, not strictly larger; you must also check that the string is primitive. • Implementing smallest rotation with off-by-one errors on the doubled string: the standard algorithm never needs to actually build all rotations and should only compare within 2n bounds without modulo mistakes. • Re-checking characters in Duval’s loops: if you reset pointers incorrectly, you may degrade to (O(n^2)). The correct update is to jump the losing index by (k+1) and reset (k) to 0. • Assuming factor boundaries align with word semantics: Lyndon factors are determined solely by lex order, not by dictionary words or syllables. • Mishandling alphabets and sentinels: for minimal suffix vs minimal rotation, the algorithms differ subtly; do not append a sentinel unless you can guarantee it is strictly smaller (or larger) than all characters. • Comparing rotations by constructing strings: building each rotation costs (O(n)) per rotation, leading to (O(n^2)); use the linear-time two-pointer method instead.
Key Formulas
Lyndon Definition via Rotations
Explanation: A string is a Lyndon word if it is strictly smaller than every nontrivial rotation. This captures the idea that the string is the canonical representative of its rotation class.
Chen–Fox–Lyndon Factorization
Explanation: Every string s decomposes uniquely into powers of Lyndon words whose bases decrease strictly in lexicographic order. Grouping equal consecutive factors yields the exponents.
Duval Complexity
Explanation: Duval's algorithm for factorization runs in linear time and constant extra space. Each position is advanced a constant number of times due to pointer jumps.
Smallest Rotation
Explanation: The minimal rotation problem asks for the shift r that minimizes the rotated string in lex order. Booth/Duval’s two-pointer method finds this in O(n).
Primitivity via Prefix Function
Explanation: Let be the KMP prefix function of s. If the length n is a multiple of n - , then s consists of repeats of a shorter block; otherwise s is primitive.
Factor Count Bound
Explanation: Each factor has positive length and factors are nonoverlapping; thus their count is at most the string length. In practice, the count is often much smaller due to grouping.
Length Conservation
Explanation: The total length of the factorization, accounting for exponents, equals the original string length. This is a simple but useful invariant for correctness checks.
Smallest Rotation Complexity
Explanation: The two-pointer algorithm for minimal rotation also runs in linear time and constant space, by skipping over confirmed-equal blocks.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Duval's algorithm: returns vector of (start, length) for Lyndon factors 5 // Also demonstrates grouping equal consecutive factors into (base, count) 6 7 vector<pair<int,int>> duval_factors(const string &s) { 8 int n = (int)s.size(); 9 vector<pair<int,int>> res; // (start, length) of each factor (ungrouped) 10 int i = 0; 11 while (i < n) { 12 int j = i + 1; 13 int k = i; 14 // Find the end of the next Lyndon word starting at i 15 while (j < n && s[k] <= s[j]) { 16 if (s[k] < s[j]) { 17 // Found strictly smaller char at s[k], reset k to start 18 k = i; 19 } else { 20 // s[k] == s[j], advance k 21 ++k; 22 } 23 ++j; 24 } 25 // The Lyndon word has length (j - k) 26 int len = j - k; 27 // Output as many copies as fit from i to j 28 while (i <= k) { 29 res.emplace_back(i, len); 30 i += len; 31 } 32 } 33 return res; 34 } 35 36 int main() { 37 ios::sync_with_stdio(false); 38 cin.tie(nullptr); 39 40 string s; 41 if (!(cin >> s)) return 0; 42 43 auto factors = duval_factors(s); 44 45 cout << "Ungrouped Lyndon factors (start,length):\n"; 46 for (auto [st, len] : factors) { 47 cout << "(" << st << "," << len << ") -> '" << s.substr(st, len) << "'\n"; 48 } 49 50 // Group equal consecutive factors into (base, count) 51 cout << "\nGrouped as base^count (CFL form):\n"; 52 for (size_t i = 0; i < factors.size(); ) { 53 int st = factors[i].first; 54 int len = factors[i].second; 55 string base = s.substr(st, len); 56 int cnt = 1; 57 size_t j = i + 1; 58 while (j < factors.size() && s.substr(factors[j].first, factors[j].second) == base) { 59 ++cnt; ++j; 60 } 61 cout << "('" << base << "')^" << cnt << " "; 62 i = j; 63 } 64 cout << "\n"; 65 return 0; 66 } 67
This program implements Duval's algorithm to factor a string into a nonincreasing sequence of Lyndon words. It reports each factor with its starting index and length, then groups equal consecutive factors into the canonical CFL form with exponents. The core two-pointer loop discovers the next Lyndon word and the repetition count in O(n) total time.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Returns the index r in [0, n) such that rot_r(s) is lexicographically minimal 5 int smallest_rotation_index(const string &s) { 6 int n = (int)s.size(); 7 if (n == 0) return 0; 8 string t = s + s; // conceptual doubling (we only index within 2n) 9 int i = 0, j = 1, k = 0; 10 while (i < n && j < n && k < n) { 11 char a = t[i + k]; 12 char b = t[j + k]; 13 if (a == b) { 14 ++k; 15 continue; 16 } 17 if (a > b) { 18 // i is worse; skip ahead past the matched block 19 i = i + k + 1; 20 if (i <= j) i = j + 1; 21 } else { 22 // j is worse 23 j = j + k + 1; 24 if (j <= i) j = i + 1; 25 } 26 k = 0; 27 } 28 int r = min(i, j); 29 return r >= n ? r - n : r; 30 } 31 32 string rotate_at(const string &s, int r) { 33 int n = (int)s.size(); 34 return s.substr(r) + s.substr(0, r); 35 } 36 37 int main() { 38 ios::sync_with_stdio(false); 39 cin.tie(nullptr); 40 41 string s; 42 if (!(cin >> s)) return 0; 43 44 int r = smallest_rotation_index(s); 45 cout << "Index: " << r << "\n"; 46 cout << "Smallest rotation: " << rotate_at(s, r) << "\n"; 47 return 0; 48 } 49
This code finds the lexicographically smallest rotation of the input string in linear time. It keeps two candidate starts (i and j) and a matched length k over the conceptual doubled string s+s, skipping over losing candidates by k+1 at each mismatch. The minimal of i and j modulo n is the answer.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 int smallest_rotation_index(const string &s) { 5 int n = (int)s.size(); 6 if (n == 0) return 0; 7 string t = s + s; 8 int i = 0, j = 1, k = 0; 9 while (i < n && j < n && k < n) { 10 char a = t[i + k]; 11 char b = t[j + k]; 12 if (a == b) { ++k; continue; } 13 if (a > b) { i = i + k + 1; if (i <= j) i = j + 1; } 14 else { j = j + k + 1; if (j <= i) j = i + 1; } 15 k = 0; 16 } 17 int r = min(i, j); 18 return r >= n ? r - n : r; 19 } 20 21 // KMP prefix function 22 vector<int> prefix_function(const string &s) { 23 int n = (int)s.size(); 24 vector<int> pi(n, 0); 25 for (int i = 1; i < n; ++i) { 26 int j = pi[i - 1]; 27 while (j > 0 && s[i] != s[j]) j = pi[j - 1]; 28 if (s[i] == s[j]) ++j; 29 pi[i] = j; 30 } 31 return pi; 32 } 33 34 bool is_primitive(const string &s) { 35 int n = (int)s.size(); 36 if (n == 0) return false; 37 auto pi = prefix_function(s); 38 int p = n - pi[n - 1]; 39 return (n % p) != 0; // primitive iff minimal period does not divide n 40 } 41 42 bool is_lyndon(const string &s) { 43 if (s.empty()) return false; 44 // Smallest rotation must be the string itself (index 0) and string must be primitive 45 return smallest_rotation_index(s) == 0 && is_primitive(s); 46 } 47 48 int main() { 49 ios::sync_with_stdio(false); 50 cin.tie(nullptr); 51 52 string s; 53 if (!(cin >> s)) return 0; 54 cout << (is_lyndon(s) ? "YES" : "NO") << "\n"; 55 return 0; 56 } 57
A string is Lyndon if it is strictly smaller than any of its rotations and is primitive. We compute the smallest rotation index (Booth/Duval) and ensure it is 0, then test primitivity via the KMP prefix function to rule out equal rotations from periodicity.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Returns the start index of the lexicographically minimal suffix of s 5 int minimal_suffix_index(const string &s) { 6 int n = (int)s.size(); 7 if (n == 0) return 0; 8 int i = 0, j = 1, k = 0; 9 while (j + k < n) { 10 char a = s[i + k]; 11 char b = s[j + k]; 12 if (a == b) { 13 ++k; 14 } else if (a < b) { 15 // suffix at i is better; skip j past the matched block 16 j = j + k + 1; 17 k = 0; 18 if (j <= i) j = i + 1; 19 } else { 20 // suffix at j is better; advance i similarly 21 i = max(i + k + 1, j); 22 k = 0; 23 j = i + 1; 24 } 25 } 26 return i; 27 } 28 29 int main() { 30 ios::sync_with_stdio(false); 31 cin.tie(nullptr); 32 33 string s; 34 if (!(cin >> s)) return 0; 35 int idx = minimal_suffix_index(s); 36 cout << "Index: " << idx << "\n"; 37 cout << "Minimal suffix: " << s.substr(idx) << "\n"; 38 return 0; 39 } 40
This is the non-cyclic analogue of the smallest-rotation algorithm. It finds the starting index of the lexicographically minimal suffix of s in linear time using two pointers and a matched length k.