CTC Loss (Connectionist Temporal Classification)
Key Points
- CTC loss trains sequence models when you do not know the alignment between inputs (frames) and outputs (labels).
- It sums probabilities over all valid alignments that collapse (merge repeats, then remove blanks) to the target label sequence.
- Dynamic programming with forward–backward recurrences computes the loss efficiently in O(TU), where T is the input length and U is the label length.
- A special blank symbol allows the model to output "nothing" between labels and to separate identical adjacent labels.
- Greedy decoding picks the most likely symbol per time step, then collapses repeats and removes blanks; beam search performs better but is slower.
- Numerical stability requires working in log-space and using log-sum-exp to avoid underflow.
- Gradients with respect to logits are softmax − posterior, where the posterior is the probability of emitting each symbol at each time step given the target.
- CTC is widely used in speech recognition, OCR, and handwriting recognition when frame-level labels are unavailable.
Prerequisites
- Softmax and cross-entropy – CTC's gradients mirror cross-entropy with soft targets; understanding softmax normalization is essential.
- Dynamic programming – Forward–backward recurrences are DP over time and label positions.
- Log-space arithmetic – CTC requires stable computation using log-sum-exp to avoid underflow.
- Probability basics – Interpreting sums over paths and conditional independence relies on probability rules.
- Matrix and tensor indexing – Implementations depend on correct handling of T×K and T×S arrays.
- Numerical stability and floating-point – Understanding precision limits helps prevent NaNs and underflow.
Detailed Explanation
01 Overview
Connectionist Temporal Classification (CTC) is a training objective for sequence models that maps variable-length input sequences (such as audio frames or image columns) to shorter label sequences (such as words or characters) without requiring frame-level alignment. Instead of forcing a single alignment between inputs and outputs, CTC defines a set of all possible alignments that produce the target label sequence after applying a collapsing function: first merge consecutive duplicates, then remove a special blank symbol. The total probability of the target is the sum of probabilities of all these alignments. This sum would be exponential to compute naively, but dynamic programming (forward–backward) makes it tractable. The presence of a blank symbol lets the model "wait" between label emissions and disambiguates repeated labels (e.g., the two l's in "hello"). CTC assumes conditional independence across time steps given the input representation, which is often produced by an encoder (RNN, CNN, or Transformer). During training, we minimize the negative log-likelihood of the target sequence; at inference, we decode with greedy or beam search. CTC has been crucial in domains like speech recognition and OCR, where aligning each frame to a specific label is impractical or expensive.
02 Intuition & Analogies
Imagine a conveyor belt of T frames (e.g., tiny slices of audio). A clerk watches the belt and can stamp a character at each frame or choose to stamp nothing. The "nothing" stamp is the blank symbol. Multiple stamps in a row with the same character are merged into one final character, and all blank stamps are removed. This merging and blank removal is the collapsing function. If the final written string after collapsing is exactly the target word, then the sequence of stamps is a valid alignment path. For a short word like "no," there are many ways to stamp it: you might wait (blank, blank), stamp "n," wait again, then stamp "o," and so on. CTC doesn't force a single choice. Instead, it sums the probabilities of all valid ways the clerk could have produced "no." That is powerful because it doesn't require us to know at which exact frame each letter should appear. The blank plays two key roles: it gives the model room to spread letters over time, and it separates identical neighbors so repeats can be represented (e.g., "l" then "l" needs a blank or a time gap). Computing the sum over all valid stampings directly would be impossible for long sequences, but we can keep track of partial sums: how likely we are to have produced the first s symbols of an "extended" target (with blanks inserted between letters) by time t. These running sums flow forward (and similarly backward), letting us aggregate all alignments efficiently and stably in log-space.
03 Formal Definition
Let x be an input sequence of length T, and let the network output at each time step t a probability distribution y_t over K symbols, one of which is a special blank. A path π is any length-T sequence of symbols; under the conditional-independence assumption its probability is the product of per-step probabilities, Π_t y_t(π_t). The collapsing map B first merges consecutive duplicates in π and then removes blanks. The CTC probability of a label sequence l of length U is the sum of the probabilities of all paths π with B(π) = l, and the CTC loss is the negative logarithm of that probability. The Key Formulas section below makes each piece precise.
04 When to Use
Use CTC when you need to align a long input sequence to a shorter label sequence but do not have (or do not want to rely on) frame-level alignments. Classic examples include: speech recognition (map audio frames to characters or word-pieces), online handwriting recognition (map pen-stroke traces to text), optical character recognition for scanned lines, lip reading, and sign language recognition. CTC excels when the alignment between inputs and outputs is monotonic and mostly left-to-right: you do not need to reorder outputs, only to decide when to emit each label and how long to wait (blanks) between them. It is also appropriate when you want a streaming or low-latency system that can emit outputs as frames arrive. Consider CTC over attention-based encoder–decoders when you prefer simpler decoding and strong monotonic constraints, or when you want to train from weak supervision (no frame labels). On the other hand, if you need non-monotonic alignments, strong long-range dependencies between emissions, or explicit language modeling during training, you might prefer sequence-to-sequence attention models or RNN-T (which generalizes CTC by conditioning emissions on previous outputs).
⚠️ Common Mistakes
- Ignoring numerical stability: computing products of probabilities quickly underflows. Always work in log-space and use log-sum-exp for additions.
- Misusing the blank symbol: forgetting to add blanks around and between labels when building the extended target, or using the wrong blank index, breaks the recurrence.
- Wrong skip condition: the s−2 transition is only allowed if the current extended symbol is not blank and not equal to the symbol two steps back; otherwise repeated labels can be merged incorrectly.
- Mixing logits, probabilities, and log-probabilities: compute a stable log-softmax once per time step and stick to it in the DP; use probabilities only when turning posteriors or gradients into human-readable numbers.
- Ending state miscalculation: the total probability is the sum of forward variables at the last two extended positions (last label or trailing blank), not over all states.
- Not handling empty targets: CTC supports empty labels via an extended sequence of a single blank; implement this edge case.
- Forgetting masks for padded frames in batched code: if some sequences are shorter, mask out extra time steps consistently in forward, backward, and gradient computations.
- Over-relying on greedy decoding to assess training quality: greedy can be much worse than beam search; always evaluate with a reasonable beam.
Key Formulas
CTC probability
P(l | x) = Σ_{π ∈ B⁻¹(l)} Π_{t=1}^{T} y_t(π_t)
Explanation: The probability of the target label sequence equals the sum of probabilities of all alignment paths that collapse to it. Each path's probability is the product of per-time-step emission probabilities.
CTC loss
L = −log P(l | x)
Explanation: Training minimizes the negative log-likelihood of the target sequence. Lower loss means the model assigns higher probability to the correct sequence across all valid alignments.
Extended length
S = 2U + 1
Explanation: The extended target sequence inserts blanks between labels and at both ends, yielding length S. This enables waiting and separation of identical adjacent labels.
Forward recurrence (log-space)
log α_t(s) = log p_t(e_s) + LSE(log α_{t−1}(s), log α_{t−1}(s−1), log α_{t−1}(s−2))
where the α_{t−1}(s−2) term is included only if e_s ≠ blank and e_s ≠ e_{s−2}.
Explanation: The forward variable sums (in log-space) all ways to reach extended position s by time t. You can stay, move by one, or skip a blank by two when allowed.
Backward recurrence (log-space)
log β_t(s) = LSE(log β_{t+1}(s) + log p_{t+1}(e_s), log β_{t+1}(s+1) + log p_{t+1}(e_{s+1}), log β_{t+1}(s+2) + log p_{t+1}(e_{s+2}))
with the s+2 term subject to the same blank/repeat condition as the forward skip.
Explanation: The backward variable accumulates probabilities from time t to the end. It mirrors the forward transitions, adding the next step's emission log-probability.
Terminal sum
log Z = LSE(log α_T(S), log α_T(S−1))
Explanation: Valid alignments finish at the last blank or the last label in the extended sequence. The log-partition function is the log-sum-exp of these two forward values.
Alignment posterior
γ_t(s) = exp(log α_t(s) + log β_t(s) − log Z)
Explanation: This gives the probability mass of being at extended position s at time t given the target. These posteriors are used to compute gradients and diagnostics.
Gradient w.r.t. logits
∂L/∂a_{t,k} = ŷ_t(k) − Σ_{s: e_s = k} γ_t(s)
Explanation: The gradient equals the model's softmax probability minus the posterior probability of emitting symbol k at time t. This mirrors cross-entropy with soft targets given by CTC posteriors.
Log-sum-exp
LSE(x_1, …, x_n) = m + log Σ_i exp(x_i − m), where m = max_i x_i
Explanation: A stable way to sum probabilities in log-space by factoring out the maximum. It avoids numerical underflow when probabilities are tiny.
Collapsing map example
B(−aa−−abb) = aab   (merge repeats: −a−ab; then remove blanks: aab)
Explanation: Consecutive duplicates are merged and blanks removed. This illustrates how alignments map to final label sequences.
Complexity Analysis
The extended target has S = 2U + 1 positions. The forward and backward passes each visit T·S cells with at most three incoming transitions per cell, so time is O(T·S) = O(TU); storing both tables takes O(T·S) memory (the forward pass alone needs only O(S) if just the previous time step is kept). Greedy decoding is O(T·K); beam search with width W costs roughly O(T·W·K) per sequence.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Helper: log-sum-exp over a vector (returns -inf if all entries are -inf)
double logSumExp(const vector<double>& xs) {
    double m = -numeric_limits<double>::infinity();
    for (double v : xs) m = max(m, v);
    if (!isfinite(m)) return m; // all -inf
    double s = 0.0;
    for (double v : xs) s += exp(v - m);
    return m + log(s);
}

// Stable log-softmax for a single time step (fills log-probabilities and probabilities)
void logSoftmaxRow(const vector<double>& logits, vector<double>& logp, vector<double>& prob) {
    double m = *max_element(logits.begin(), logits.end());
    double sumexp = 0.0;
    prob.resize(logits.size());
    logp.resize(logits.size());
    for (size_t k = 0; k < logits.size(); ++k) sumexp += exp(logits[k] - m);
    double logZ = m + log(sumexp);
    for (size_t k = 0; k < logits.size(); ++k) {
        logp[k] = logits[k] - logZ; // log softmax
        prob[k] = exp(logp[k]);     // softmax
    }
}

struct CTCLossResult {
    double loss;                        // negative log-likelihood
    vector<vector<double>> grad_logits; // gradient w.r.t. logits, shape [T][K]
};

// Compute CTC loss and gradient for a single sequence.
// Inputs:
//   - logits: [T][K] raw scores (pre-softmax)
//   - labels: target labels without blanks, length U (values in [0, K-1], excluding the blank index)
//   - blank: index of the blank symbol in [0, K-1]
// Returns loss and gradient w.r.t. logits.
CTCLossResult ctc_loss_and_grad(const vector<vector<double>>& logits,
                                const vector<int>& labels,
                                int blank) {
    const int T = (int)logits.size();
    const int K = (int)logits[0].size();
    const int U = (int)labels.size();

    // Build the extended label sequence e of length S = 2U+1 (blanks around and between labels).
    // Edge case: empty target -> extended sequence is just [blank].
    vector<int> ext;
    if (U == 0) {
        ext = {blank};
    } else {
        ext.reserve(2*U + 1);
        ext.push_back(blank);
        for (int i = 0; i < U; ++i) {
            ext.push_back(labels[i]);
            ext.push_back(blank);
        }
    }
    const int S = (int)ext.size();

    // Per-time-step softmax and log-softmax
    vector<vector<double>> logp(T, vector<double>(K));
    vector<vector<double>> prob(T, vector<double>(K));
    for (int t = 0; t < T; ++t) logSoftmaxRow(logits[t], logp[t], prob[t]);

    // Initialize alpha [T][S] with -inf
    const double NEG_INF = -numeric_limits<double>::infinity();
    vector<vector<double>> alpha(T, vector<double>(S, NEG_INF));

    // t = 0: paths may start at the leading blank or the first label; the rest stay -inf
    alpha[0][0] = logp[0][ext[0]];
    if (S > 1) alpha[0][1] = logp[0][ext[1]];

    // Forward pass
    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            terms.push_back(alpha[t-1][s]);                  // stay at s
            if (s-1 >= 0) terms.push_back(alpha[t-1][s-1]);  // move from s-1
            if (s-2 >= 0 && ext[s] != blank && ext[s] != ext[s-2]) {
                terms.push_back(alpha[t-1][s-2]);            // skip a blank
            }
            alpha[t][s] = logSumExp(terms) + logp[t][ext[s]];
        }
    }

    // Terminal sum: valid paths end at S-1 (trailing blank) or S-2 (last label)
    double logZ;
    if (S == 1) logZ = alpha[T-1][0];
    else        logZ = logSumExp(vector<double>{alpha[T-1][S-1], alpha[T-1][S-2]});

    // Backward pass: beta [T][S]
    vector<vector<double>> beta(T, vector<double>(S, NEG_INF));
    // At t = T-1, finishing from the last two positions has probability 1 (0 in log-space)
    if (S == 1) {
        beta[T-1][0] = 0.0;
    } else {
        beta[T-1][S-1] = 0.0; // end at trailing blank
        beta[T-1][S-2] = 0.0; // or at last label
    }

    for (int t = T-2; t >= 0; --t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            terms.push_back(beta[t+1][s] + logp[t+1][ext[s]]);                  // stay at s
            if (s+1 < S) terms.push_back(beta[t+1][s+1] + logp[t+1][ext[s+1]]); // move to s+1
            if (s+2 < S && ext[s] != blank && ext[s] != ext[s+2]) {
                terms.push_back(beta[t+1][s+2] + logp[t+1][ext[s+2]]);          // skip a blank
            }
            beta[t][s] = logSumExp(terms);
        }
    }

    // Posterior over extended positions and gradient w.r.t. logits
    vector<vector<double>> grad(T, vector<double>(K, 0.0));
    for (int t = 0; t < T; ++t) {
        vector<double> gamma_k(K, 0.0); // posterior mass per class at time t
        for (int s = 0; s < S; ++s) {
            double log_gamma_ts = alpha[t][s] + beta[t][s] - logZ; // log posterior of (t,s)
            gamma_k[ext[s]] += exp(log_gamma_ts);
        }
        for (int k = 0; k < K; ++k) {
            grad[t][k] = prob[t][k] - gamma_k[k]; // dL/da = y_hat - posterior
        }
    }

    CTCLossResult res;
    res.loss = -logZ;
    res.grad_logits = move(grad);
    return res;
}

int main() {
    // Tiny example: alphabet {blank=0, a=1, b=2}
    // T=4 time steps, target "ab" -> labels {1, 2}
    vector<vector<double>> logits = {
        {3.0, 1.0, 0.0}, // t=0
        {2.0, 2.0, 0.0}, // t=1
        {3.0, 0.5, 1.5}, // t=2
        {2.0, 0.2, 2.2}  // t=3
    };
    vector<int> labels = {1, 2};
    int blank = 0;

    CTCLossResult out = ctc_loss_and_grad(logits, labels, blank);
    cout.setf(ios::fixed);
    cout << setprecision(6);
    cout << "CTC Loss: " << out.loss << "\n";
    cout << "Gradient (t,k):\n";
    for (size_t t = 0; t < out.grad_logits.size(); ++t) {
        for (size_t k = 0; k < out.grad_logits[t].size(); ++k) {
            cout << out.grad_logits[t][k] << (k+1 == out.grad_logits[t].size() ? "\n" : " ");
        }
    }
    return 0;
}
This program computes the CTC negative log-likelihood and gradients with respect to logits using a numerically stable forward–backward algorithm. It builds the extended target with blanks, runs forward and backward recurrences in log-space, computes the partition function (logZ), then obtains alignment posteriors. Gradients are y_hat − posterior, matching the standard CTC derivative. The implementation handles the empty-target edge case and uses log-sum-exp to avoid underflow.
#include <bits/stdc++.h>
using namespace std;

// Argmax over classes for a single time step
template <typename T>
int argmax(const vector<T>& v) {
    return (int)(max_element(v.begin(), v.end()) - v.begin());
}

// Greedy decode from logits: pick the argmax per time step, then collapse repeats and remove blanks
vector<int> ctc_greedy_decode(const vector<vector<double>>& logits, int blank) {
    const int T = (int)logits.size();

    // Step 1: argmax path
    vector<int> path(T);
    for (int t = 0; t < T; ++t) path[t] = argmax(logits[t]);

    // Step 2: collapse repeats and remove blanks
    vector<int> out;
    int prev = -1;
    for (int t = 0; t < T; ++t) {
        int k = path[t];
        if (k == blank) { prev = k; continue; } // drop blanks (and reset the repeat tracker)
        if (k == prev)  { continue; }           // collapse repeats
        out.push_back(k);
        prev = k;
    }
    return out;
}

int main() {
    // Alphabet: {blank=0, a=1, b=2, l=3, o=4}
    // Example logits over 7 frames that greedily decode to "lo" (indices 3, 4)
    vector<vector<double>> logits = {
        {3.0, 0.1, 0.2, 0.1, 0.1}, // blank
        {0.1, 0.2, 0.1, 3.2, 0.1}, // l
        {0.1, 0.2, 0.1, 3.0, 0.1}, // l (repeat, merged)
        {3.1, 0.1, 0.1, 0.1, 0.1}, // blank (separator)
        {0.1, 0.2, 0.1, 0.1, 3.0}, // o
        {0.1, 0.2, 0.1, 0.1, 3.1}, // o (repeat, merged)
        {3.0, 0.1, 0.1, 0.1, 0.1}  // blank
    };
    int blank = 0;
    vector<int> decoded = ctc_greedy_decode(logits, blank);
    cout << "Decoded labels (indices): ";
    for (int k : decoded) cout << k << ' ';
    cout << "\n"; // prints: Decoded labels (indices): 3 4
    return 0;
}
Greedy decoding selects the top-scoring symbol at each frame (argmax), then collapses consecutive duplicates and removes blanks to produce a label sequence. It is extremely fast and often good enough for quick checks, though beam search generally yields higher accuracy.