Longest Common Subsequence
Key Points
- •The Longest Common Subsequence (LCS) between two sequences is the longest sequence that appears in both, not necessarily contiguously.
- •A classic dynamic programming solution fills a 2D table dp[i][j] using the recurrence: match gives dp[i-1][j-1] + 1, else take max of top or left.
- •The standard DP runs in O(nm) time and O(nm) space, where n and m are the lengths of the two strings.
- •You can reconstruct one valid LCS by backtracking from dp[n][m] to dp[0][0] following matches and choices.
- •Space can be reduced to O(min(n, m)) if you only need the LCS length by keeping two rolling rows or columns.
- •Hirschberg’s algorithm reconstructs an LCS in O(nm) time but only O(n + m) space using divide-and-conquer.
- •If multiple LCS strings exist, you can reconstruct the lexicographically smallest one using suffix DP and next-occurrence arrays.
- •LCS is a core string DP used in diff tools, version control merges, DNA sequence comparison, and approximate matching.
Prerequisites
- →2D arrays and indexing — The DP table for LCS is a 2D array and off-by-one handling is crucial.
- →Dynamic programming basics — LCS relies on recognizing overlapping subproblems and optimal substructure.
- →String manipulation — Understanding substrings, subsequences, and character access is essential.
- →Big-O notation — To analyze time and space costs like O(nm) and O(min(n, m)).
- →Recursion and divide-and-conquer — Hirschberg’s algorithm uses recursion to split the problem.
- →Lexicographic order — Needed when reconstructing the lexicographically smallest LCS.
Detailed Explanation
Tap terms for definitions01Overview
The Longest Common Subsequence (LCS) problem asks: given two sequences (often strings), what is the longest sequence of elements that appears in both in the same order, but not necessarily contiguously? For example, the LCS of "ABCBDAB" and "BDCABC" is "BCAB" with length 4. LCS is a foundational dynamic programming problem because it demonstrates how overlapping subproblems and optimal substructure allow an exponential search to be solved in polynomial time. The standard dynamic programming (DP) approach constructs a 2D table dp where dp[i][j] stores the LCS length of the first i characters of s and the first j characters of t. By comparing current characters and propagating best results from subproblems, the algorithm fills the table in O(nm) time, where n and m are the lengths of the inputs. Once the table is built, you can backtrack from the bottom-right corner to reconstruct one optimal subsequence. Space can be optimized from O(nm) to O(min(n, m)) if you only need the length, and further, Hirschberg's algorithm enables reconstruction with only linear space. LCS underpins tools like file differencing, merge utilities in version control, and bioinformatics sequence analysis.
02Intuition & Analogies
Imagine you and a friend took notes in class. Your notes (s) and your friend’s notes (t) are similar but not identical. A common subsequence is like the set of key points that you both wrote down in the same order, even if you skipped different words in between. We want the longest such shared storyline. A naive way would be to try all ways of crossing out words from s and seeing if what remains appears in t; but this explodes quickly. Instead, think bottom-up: if we know the best answer for smaller note fragments, we can build up the answer for larger fragments. At position i in s and j in t, there are two cases. If the current words match, we happily include this word and solve the smaller problem of the previous prefixes (i-1 and j-1). If they don’t match, we try skipping a word in either s or t and take the better result—we postpone the decision about where the next match will happen. This feels like walking on a grid: moves left or up mean skipping in one string, and a diagonal move means we found the same word and advance in both. By systematically exploring these local choices and reusing solutions from smaller rectangles of the grid, we avoid recomputing the same comparisons. The DP table is our map of all these partial agreements, and backtracking along the map’s arrows recovers the shared storyline itself. If memory is tight, we can keep only one or two rows of this map at a time, because each new row depends only on the previous one. And if we still want the path (the subsequence) with low memory, Hirschberg’s divide-and-conquer approach cleverly splits the grid to stitch together the final path using only linear space.
03Formal Definition
04When to Use
Use LCS when you need to measure or reconstruct order-preserving similarity between two sequences. Typical applications include: diff and merge tools that show common lines and differences between two versions of a file; plagiarism or similarity detection where exact order matters but gaps are allowed; bioinformatics for DNA/RNA/protein sequence comparison where insertions and deletions are common; error-tolerant matching where mismatches can be skipped; and computing edit distance with only insertions/deletions (no substitutions), since the minimum number of insertions plus deletions equals n + m − 2·LCS(s, t). In competitive programming, reach for LCS when constraints are around n, m ≤ 2000–5000 for O(nm) solutions, or when strings are shorter but you must also output an actual subsequence. If memory is limited but you must reconstruct, consider Hirschberg. If only the length is required and one dimension is large (e.g., m ≈ 10^5, n ≈ 10^3), rolling arrays with O(min(n, m)) space are appropriate. When the alphabet is small and you need lexicographic control over the output LCS, complement DP with next-occurrence arrays to greedily choose the smallest valid next character.
⚠️Common Mistakes
Common pitfalls include: confusing subsequence with substring—substring requires contiguity, while subsequence allows gaps. Another is off-by-one indexing in the DP table; remember dp has size (n+1) × (m+1) and characters align with i−1 and j−1. When reconstructing, failing to store parent pointers or a consistent tie-breaking rule can lead to incorrect backtracking or infinite loops. For large inputs, allocating dp as dp[n][m] without the +1 row/column can cause out-of-bounds access on base cases. Over-optimizing: using O(min(n, m)) space rolling arrays is great for lengths, but you cannot reconstruct an LCS from rolling arrays alone—use Hirschberg to reconstruct with low space. For lexicographically smallest LCS, naive backtracking that chooses left vs. up arbitrarily will not guarantee the smallest string; you need a principled tie-break, such as using suffix DP with next-occurrence arrays. Also beware of character sets: next-occurrence techniques shown for lowercase letters need generalization for larger alphabets. Finally, mixing LCS with edit distance incorrectly: standard Levenshtein distance allows substitutions; LCS aligns with the insert/delete-only model via the identity edits = n + m − 2·LCS.
Key Formulas
LCS Recurrence
Explanation: Defines how to build the LCS length for prefixes of lengths i and j from smaller subproblems. If current characters match, extend diagonally; else take the better of skipping in s or t.
LCS Base Cases
Explanation: An empty string has LCS length 0 with any string. These initialize the first row and column of the DP table.
Time and Space of Tabulation
Explanation: The table has (n+1)(m+1) cells and each is filled in O(1) time. Storing the full table and parent pointers requires O(nm) memory.
Rolling Array Space
Explanation: If only the LCS length is needed, each row depends only on the previous one, so memory can be reduced to the smaller dimension.
Hirschberg Split Index
Explanation: F is the forward DP row for s[0..mid) vs. t[0..j] and G is the backward DP row for reverse(s[mid..)) vs. reverse(t[j..m)). The split j that maximizes their sum yields an optimal partition.
LCS Length Upper Bound
Explanation: A common subsequence cannot be longer than the shorter input. This trivial bound is useful for sanity checks.
Insert/Delete Edit Distance
Explanation: When only insertions and deletions are allowed, the fewest operations needed equals total length minus twice the LCS length.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Computes dp table for LCS length and reconstructs one LCS string. 5 int main() { 6 ios::sync_with_stdio(false); 7 cin.tie(nullptr); 8 9 string s, t; 10 if (!(cin >> s >> t)) { 11 return 0; // Expect two strings as input 12 } 13 int n = (int)s.size(); 14 int m = (int)t.size(); 15 16 // dp[i][j] = LCS length of s[0..i-1] and t[0..j-1] 17 vector<vector<int>> dp(n + 1, vector<int>(m + 1, 0)); 18 // parent[i][j]: 'D' (diag, match), 'U' (up), 'L' (left) 19 vector<vector<char>> parent(n + 1, vector<char>(m + 1, 0)); 20 21 for (int i = 1; i <= n; ++i) { 22 for (int j = 1; j <= m; ++j) { 23 if (s[i - 1] == t[j - 1]) { 24 dp[i][j] = dp[i - 1][j - 1] + 1; 25 parent[i][j] = 'D'; 26 } else if (dp[i - 1][j] >= dp[i][j - 1]) { 27 dp[i][j] = dp[i - 1][j]; 28 parent[i][j] = 'U'; 29 } else { 30 dp[i][j] = dp[i][j - 1]; 31 parent[i][j] = 'L'; 32 } 33 } 34 } 35 36 // Reconstruct one LCS by backtracking from (n, m) 37 string lcs; 38 int i = n, j = m; 39 while (i > 0 && j > 0) { 40 if (parent[i][j] == 'D') { 41 lcs.push_back(s[i - 1]); 42 --i; --j; 43 } else if (parent[i][j] == 'U') { 44 --i; 45 } else { // 'L' 46 --j; 47 } 48 } 49 reverse(lcs.begin(), lcs.end()); 50 51 cout << dp[n][m] << "\n"; // LCS length 52 cout << lcs << "\n"; // One LCS string 53 return 0; 54 } 55
This program builds the classic (n+1)×(m+1) DP table. It stores parent pointers to remember how each cell was derived. Backtracking from (n, m) to (0, 0) following the parent pointers reconstructs one valid LCS. Ties are resolved by preferring 'up' over 'left' when characters differ; any deterministic tie-break yields a valid LCS.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Computes LCS length using O(min(n, m)) additional space. 5 int main() { 6 ios::sync_with_stdio(false); 7 cin.tie(nullptr); 8 9 string s, t; 10 if (!(cin >> s >> t)) return 0; 11 // Ensure t is the shorter string to minimize memory 12 if (t.size() > s.size()) swap(s, t); 13 int n = (int)s.size(); 14 int m = (int)t.size(); 15 16 vector<int> prev(m + 1, 0), curr(m + 1, 0); 17 for (int i = 1; i <= n; ++i) { 18 fill(curr.begin(), curr.end(), 0); 19 for (int j = 1; j <= m; ++j) { 20 if (s[i - 1] == t[j - 1]) curr[j] = prev[j - 1] + 1; 21 else curr[j] = max(prev[j], curr[j - 1]); 22 } 23 swap(prev, curr); 24 } 25 26 cout << prev[m] << "\n"; 27 return 0; 28 } 29
Only the previous row is needed to compute the current row, so we keep two vectors of length m+1 and swap them after each iteration. This computes the LCS length without storing the entire 2D table. Note that reconstruction is not possible from this data alone.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Assumes lowercase 'a'..'z'. For larger alphabets, replace 26 with the size and map chars to indices. 5 int main() { 6 ios::sync_with_stdio(false); 7 cin.tie(nullptr); 8 9 string s, t; 10 if (!(cin >> s >> t)) return 0; 11 int n = (int)s.size(); 12 int m = (int)t.size(); 13 14 // dp[i][j] = LCS length of s[i:] and t[j:] 15 vector<vector<int>> dp(n + 1, vector<int>(m + 1, 0)); 16 for (int i = n - 1; i >= 0; --i) { 17 for (int j = m - 1; j >= 0; --j) { 18 if (s[i] == t[j]) dp[i][j] = 1 + dp[i + 1][j + 1]; 19 else dp[i][j] = max(dp[i + 1][j], dp[i][j + 1]); 20 } 21 } 22 int L = dp[0][0]; 23 24 // Next-occurrence arrays: nextS[i][c] = first position >= i where s[pos] == 'a'+c, or n if none 25 const int SIG = 26; 26 vector<array<int, SIG>> nextS(n + 1), nextT(m + 1); 27 for (int c = 0; c < SIG; ++c) nextS[n][c] = n, nextT[m][c] = m; 28 for (int i = n - 1; i >= 0; --i) { 29 nextS[i] = nextS[i + 1]; 30 if ('a' <= s[i] && s[i] <= 'z') nextS[i][s[i] - 'a'] = i; 31 } 32 for (int j = m - 1; j >= 0; --j) { 33 nextT[j] = nextT[j + 1]; 34 if ('a' <= t[j] && t[j] <= 'z') nextT[j][t[j] - 'a'] = j; 35 } 36 37 string ans; 38 ans.reserve(L); 39 int i = 0, j = 0; 40 while ((int)ans.size() < L) { 41 bool placed = false; 42 for (int c = 0; c < SIG; ++c) { 43 int i1 = nextS[i][c]; 44 int j1 = nextT[j][c]; 45 if (i1 < n && j1 < m) { 46 // Can we still reach total length L if we place this char? 47 if (1 + dp[i1 + 1][j1 + 1] >= L - (int)ans.size()) { 48 ans.push_back(char('a' + c)); 49 i = i1 + 1; 50 j = j1 + 1; 51 placed = true; 52 break; 53 } 54 } 55 } 56 if (!placed) break; // Safety: should not happen if dp is correct 57 } 58 59 cout << L << "\n" << ans << "\n"; 60 return 0; 61 } 62
We compute a suffix-oriented DP to know the LCS length from any pair of suffixes. Then, using next-occurrence arrays and a greedy scan over the alphabet from 'a' to 'z', we pick the smallest character that still allows us to complete an optimal LCS. This guarantees the lexicographically smallest LCS among all optimal solutions when inputs are lowercase.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Compute last DP row (LCS lengths) for A vs. B using O(|B|) space 5 static vector<int> lcsRow(const string &A, const string &B) { 6 int n = (int)A.size(), m = (int)B.size(); 7 vector<int> prev(m + 1, 0), curr(m + 1, 0); 8 for (int i = 1; i <= n; ++i) { 9 fill(curr.begin(), curr.end(), 0); 10 for (int j = 1; j <= m; ++j) { 11 if (A[i - 1] == B[j - 1]) curr[j] = prev[j - 1] + 1; 12 else curr[j] = max(prev[j], curr[j - 1]); 13 } 14 swap(prev, curr); 15 } 16 return prev; // dp[n][*] 17 } 18 19 // Hirschberg's divide-and-conquer LCS reconstruction 20 static string hirschberg(const string &A, const string &B) { 21 int n = (int)A.size(), m = (int)B.size(); 22 if (m == 0 || n == 0) return ""; 23 if (n == 1) { 24 // If A[0] appears in B, that's the LCS of length 1; otherwise empty 25 for (int j = 0; j < m; ++j) if (A[0] == B[j]) return string(1, A[0]); 26 return ""; 27 } 28 int mid = n / 2; 29 string Aleft = A.substr(0, mid); 30 string Aright = A.substr(mid); 31 32 vector<int> L = lcsRow(Aleft, B); // forward DP row 33 // For backward row, reverse both strings 34 string Arev = string(Aright.rbegin(), Aright.rend()); 35 string Brev = string(B.rbegin(), B.rend()); 36 vector<int> Rrev = lcsRow(Arev, Brev); // backward on reversed strings 37 38 // R[j] = LCS length of Aright vs. B[j:]; map from reversed index 39 vector<int> R(m + 1, 0); 40 for (int j = 0; j <= m; ++j) { 41 R[j] = Rrev[m - j]; 42 } 43 44 // Choose split k maximizing L[j] + R[j] 45 int k = 0, best = -1; 46 for (int j = 0; j <= m; ++j) { 47 int val = L[j] + R[j]; 48 if (val > best) { best = val; k = j; } 49 } 50 51 string left = hirschberg(Aleft, B.substr(0, k)); 52 string right = hirschberg(Aright, B.substr(k)); 53 return left + right; 54 } 55 56 int main() { 57 ios::sync_with_stdio(false); 58 cin.tie(nullptr); 59 60 string s, t; 61 if (!(cin >> s >> t)) return 0; 62 string lcs = hirschberg(s, t); 63 cout << (int)lcs.size() << "\n" << lcs << "\n"; 64 return 0; 65 } 66
Hirschberg splits the first string at its midpoint and computes one forward DP row and one backward DP row (on reversed suffixes) to decide where to split the second string. It then recurses on the two halves. This reconstructs an LCS using only O(n + m) space while preserving O(nm) time.