CTC Loss (Connectionist Temporal Classification)
Key Points
- CTC loss trains sequence models when you do not know the alignment between inputs (frames) and outputs (labels).
- It sums probabilities over all valid alignments that collapse (merge repeats, then remove blanks) to the target label sequence.
- Dynamic programming with forward–backward recurrences computes the loss efficiently in O(TU), where T is the input length and U is the label length.
- A special blank symbol allows the model to output "nothing" between labels and to separate identical adjacent labels.
- Greedy decoding picks the most likely symbol per time step, then collapses repeats and removes blanks; beam search performs better but is slower.
- Numerical stability requires working in log-space and using log-sum-exp to avoid underflow.
- Gradients with respect to logits are softmax − posterior, where the posterior is the probability of emitting each symbol at each time step given the target.
- CTC is widely used in speech recognition, OCR, and handwriting recognition when frame-level labels are unavailable.
Prerequisites
- Softmax and cross-entropy – CTC's gradients mirror cross-entropy with soft targets; understanding softmax normalization is essential.
- Dynamic programming – Forward–backward recurrences are DP over time and label positions.
- Log-space arithmetic – CTC requires stable computation using log-sum-exp to avoid underflow.
- Probability basics – Interpreting sums over paths and conditional independence relies on probability rules.
- Matrix and tensor indexing – Implementations depend on correct handling of T×K and T×S arrays.
- Numerical stability and floating-point – Understanding precision limits helps prevent NaNs and underflow.
Detailed Explanation
01 Overview
Connectionist Temporal Classification (CTC) is a training objective for sequence models that maps variable-length input sequences (such as audio frames or image columns) to shorter label sequences (such as words or characters) without requiring frame-level alignment. Instead of forcing a single alignment between inputs and outputs, CTC defines a set of all possible alignments that produce the target label sequence after applying a collapsing function: first merge consecutive duplicates, then remove a special blank symbol. The total probability of the target is the sum of probabilities of all these alignments. This sum would be exponential to compute naively, but dynamic programming (forward–backward) makes it tractable. The presence of a blank symbol lets the model "wait" between label emissions and disambiguates repeated labels (e.g., the two l's in "hello"). CTC assumes conditional independence across time steps given the input representation, which is often produced by an encoder (RNN, CNN, or Transformer). During training, we minimize the negative log-likelihood of the target sequence; at inference, we decode with greedy or beam search. CTC has been crucial in domains like speech recognition and OCR, where aligning each frame to a specific label is impractical or expensive.
02 Intuition & Analogies
Imagine a conveyor belt of T frames (e.g., tiny slices of audio). A clerk watches the belt and can stamp a character at each frame or choose to stamp nothing. The "nothing" stamp is the blank symbol. Multiple stamps in a row with the same character are merged into one final character, and all blank stamps are removed. This merging and blank removal is the collapsing function. If the final written string after collapsing is exactly the target word, then the sequence of stamps is a valid alignment path. For a short word like "no," there are many ways to stamp it: you might wait (blank, blank), stamp "n," wait again, then stamp "o," and so on. CTC doesn't force a single choice. Instead, it sums the probabilities of all valid ways the clerk could have produced "no." That is powerful because it doesn't require us to know at which exact frame each letter should appear. The blank plays two key roles: it gives the model room to spread letters over time, and it separates identical neighbors so repeats can be represented (e.g., "l" then "l" needs a blank or a time gap). Computing the sum over all valid stampings directly would be impossible for long sequences, but we can keep track of partial sums: how likely we are to have produced the first s symbols of an "extended" target (with blanks inserted between letters) by time t. These running sums flow forward (and similarly backward), letting us aggregate all alignments efficiently and stably in log-space.
03 Formal Definition
Let x be an input sequence of length T, and let the network output at each time step t a probability distribution y_t over K symbols, one of which is a special blank. A path π is any length-T sequence of symbols; under the conditional-independence assumption its probability is the product of per-step probabilities, Π_t y_t(π_t). The collapsing map B first merges consecutive duplicates in π and then removes blanks. The CTC probability of a label sequence l of length U is the sum of the probabilities of all paths π with B(π) = l, and the CTC loss is the negative logarithm of that probability. The Key Formulas section below makes each piece precise.
04 When to Use
Use CTC when you need to align a long input sequence to a shorter label sequence but do not have (or do not want to rely on) frame-level alignments. Classic examples include: speech recognition (map audio frames to characters or word-pieces), online handwriting recognition (map pen-stroke traces to text), optical character recognition for scanned lines, lip reading, and sign language recognition. CTC excels when the alignment between inputs and outputs is monotonic and mostly left-to-right: you do not need to reorder outputs, only to decide when to emit each label and how long to wait (blanks) between them. It is also appropriate when you want a streaming or low-latency system that can emit outputs as frames arrive. Consider CTC over attention-based encoder–decoders when you prefer simpler decoding and strong monotonic constraints, or when you want to train from weak supervision (no frame labels). On the other hand, if you need non-monotonic alignments, strong long-range dependencies between emissions, or explicit language modeling during training, you might prefer sequence-to-sequence attention models or RNN-T (which generalizes CTC by conditioning emissions on previous outputs).
⚠️ Common Mistakes
- Ignoring numerical stability: computing products of probabilities quickly underflows. Always work in log-space and use log-sum-exp for additions.
- Misusing the blank symbol: forgetting to add blanks around and between labels when building the extended target, or using the wrong blank index, breaks the recurrence.
- Wrong skip condition: the s−2 transition is only allowed if the current extended symbol is not blank and not equal to the symbol two steps back; otherwise repeated labels can be merged incorrectly.
- Mixing logits, probabilities, and log-probabilities: compute a stable log-softmax once per time step and stick to it in the DP; use probabilities only when turning posteriors or gradients into human-readable numbers.
- Ending state miscalculation: the total probability is the sum of forward variables at the last two extended positions (last label or trailing blank), not over all states.
- Not handling empty targets: CTC supports empty labels via an extended sequence of a single blank; implement this edge case.
- Forgetting masks for padded frames in batched code: if some sequences are shorter, mask out extra time steps consistently in forward, backward, and gradient computations.
- Over-relying on greedy decoding to assess training quality: greedy can be much worse than beam search; always evaluate with a reasonable beam.
Key Formulas
CTC probability
P(l | x) = Σ_{π ∈ B⁻¹(l)} Π_{t=1}^{T} y_t(π_t)
Explanation: The probability of the target label sequence equals the sum of probabilities of all alignment paths that collapse to it. Each path's probability is the product of per-time-step emission probabilities.
CTC loss
L = −log P(l | x)
Explanation: Training minimizes the negative log-likelihood of the target sequence. Lower loss means the model assigns higher probability to the correct sequence across all valid alignments.
Extended length
S = 2U + 1
Explanation: The extended target sequence inserts blanks between labels and at both ends, yielding length S. This enables waiting and separation of identical adjacent labels.
Forward recurrence (log-space)
log α_t(s) = log p_t(e_s) + LSE(log α_{t−1}(s), log α_{t−1}(s−1), log α_{t−1}(s−2))
where the α_{t−1}(s−2) term is included only if e_s ≠ blank and e_s ≠ e_{s−2}.
Explanation: The forward variable sums (in log-space) all ways to reach extended position s by time t. You can stay, move by one, or skip a blank by two when allowed.
Backward recurrence (log-space)
log β_t(s) = LSE(log β_{t+1}(s) + log p_{t+1}(e_s), log β_{t+1}(s+1) + log p_{t+1}(e_{s+1}), log β_{t+1}(s+2) + log p_{t+1}(e_{s+2}))
with the s+2 term subject to the same blank/repeat condition as the forward skip.
Explanation: The backward variable accumulates probabilities from time t to the end. It mirrors the forward transitions, adding the next step's emission log-probability.
Terminal sum
log Z = LSE(log α_T(S), log α_T(S−1))
Explanation: Valid alignments finish at the last blank or the last label in the extended sequence. The log-partition function is the log-sum-exp of these two forward values.
Alignment posterior
γ_t(s) = exp(log α_t(s) + log β_t(s) − log Z)
Explanation: This gives the probability mass of being at extended position s at time t given the target. These posteriors are used to compute gradients and diagnostics.
Gradient w.r.t. logits
∂L/∂a_{t,k} = ŷ_t(k) − Σ_{s: e_s = k} γ_t(s)
Explanation: The gradient equals the model's softmax probability minus the posterior probability of emitting symbol k at time t. This mirrors cross-entropy with soft targets given by CTC posteriors.
Log-sum-exp
LSE(x_1, …, x_n) = m + log Σ_i exp(x_i − m), where m = max_i x_i
Explanation: A stable way to sum probabilities in log-space by factoring out the maximum. It avoids numerical underflow when probabilities are tiny.
Collapsing map example
B(−aa−−abb) = aab   (merge repeats: −a−ab; then remove blanks: aab)
Explanation: Consecutive duplicates are merged and blanks removed. This illustrates how alignments map to final label sequences.
Complexity Analysis
The extended target has S = 2U + 1 positions. The forward and backward passes each visit T·S cells with at most three incoming transitions per cell, so time is O(T·S) = O(TU); storing both tables takes O(T·S) memory (the forward pass alone needs only O(S) if just the previous time step is kept). Greedy decoding is O(T·K); beam search with width W costs roughly O(T·W·K) per sequence.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Helper: log-sum-exp over a vector (returns -inf if all entries are -inf)
double logSumExp(const vector<double>& xs) {
    double m = -numeric_limits<double>::infinity();
    for (double v : xs) m = max(m, v);
    if (!isfinite(m)) return m; // all -inf
    double s = 0.0;
    for (double v : xs) s += exp(v - m);
    return m + log(s);
}

// Stable log-softmax for a single time step (fills log-probabilities and probabilities)
void logSoftmaxRow(const vector<double>& logits, vector<double>& logp, vector<double>& prob) {
    double m = *max_element(logits.begin(), logits.end());
    double sumexp = 0.0;
    prob.resize(logits.size());
    logp.resize(logits.size());
    for (size_t k = 0; k < logits.size(); ++k) sumexp += exp(logits[k] - m);
    double logZ = m + log(sumexp);
    for (size_t k = 0; k < logits.size(); ++k) {
        logp[k] = logits[k] - logZ; // log softmax
        prob[k] = exp(logp[k]);     // softmax
    }
}

struct CTCLossResult {
    double loss;                        // negative log-likelihood
    vector<vector<double>> grad_logits; // gradient w.r.t. logits, shape [T][K]
};

// Compute CTC loss and gradient for a single sequence.
// Inputs:
//   - logits: [T][K] raw scores (pre-softmax)
//   - labels: target labels without blanks, length U (values in [0, K-1], excluding the blank index)
//   - blank: index of the blank symbol in [0, K-1]
// Returns loss and gradient w.r.t. logits.
CTCLossResult ctc_loss_and_grad(const vector<vector<double>>& logits,
                                const vector<int>& labels,
                                int blank) {
    const int T = (int)logits.size();
    const int K = (int)logits[0].size();
    const int U = (int)labels.size();

    // Build the extended label sequence e of length S = 2U+1 (blanks around and between labels).
    // Edge case: empty target -> extended sequence is just [blank].
    vector<int> ext;
    if (U == 0) {
        ext = {blank};
    } else {
        ext.reserve(2*U + 1);
        ext.push_back(blank);
        for (int i = 0; i < U; ++i) {
            ext.push_back(labels[i]);
            ext.push_back(blank);
        }
    }
    const int S = (int)ext.size();

    // Per-time-step softmax and log-softmax
    vector<vector<double>> logp(T, vector<double>(K));
    vector<vector<double>> prob(T, vector<double>(K));
    for (int t = 0; t < T; ++t) logSoftmaxRow(logits[t], logp[t], prob[t]);

    // Initialize alpha [T][S] with -inf
    const double NEG_INF = -numeric_limits<double>::infinity();
    vector<vector<double>> alpha(T, vector<double>(S, NEG_INF));

    // t = 0: paths may start at the leading blank or the first label; the rest stay -inf
    alpha[0][0] = logp[0][ext[0]];
    if (S > 1) alpha[0][1] = logp[0][ext[1]];

    // Forward pass
    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            terms.push_back(alpha[t-1][s]);                  // stay at s
            if (s-1 >= 0) terms.push_back(alpha[t-1][s-1]);  // move from s-1
            if (s-2 >= 0 && ext[s] != blank && ext[s] != ext[s-2]) {
                terms.push_back(alpha[t-1][s-2]);            // skip a blank
            }
            alpha[t][s] = logSumExp(terms) + logp[t][ext[s]];
        }
    }

    // Terminal sum: valid paths end at S-1 (trailing blank) or S-2 (last label)
    double logZ;
    if (S == 1) logZ = alpha[T-1][0];
    else        logZ = logSumExp(vector<double>{alpha[T-1][S-1], alpha[T-1][S-2]});

    // Backward pass: beta [T][S]
    vector<vector<double>> beta(T, vector<double>(S, NEG_INF));
    // At t = T-1, finishing from the last two positions has probability 1 (0 in log-space)
    if (S == 1) {
        beta[T-1][0] = 0.0;
    } else {
        beta[T-1][S-1] = 0.0; // end at trailing blank
        beta[T-1][S-2] = 0.0; // or at last label
    }

    for (int t = T-2; t >= 0; --t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            terms.push_back(beta[t+1][s] + logp[t+1][ext[s]]);                  // stay at s
            if (s+1 < S) terms.push_back(beta[t+1][s+1] + logp[t+1][ext[s+1]]); // move to s+1
            if (s+2 < S && ext[s] != blank && ext[s] != ext[s+2]) {
                terms.push_back(beta[t+1][s+2] + logp[t+1][ext[s+2]]);          // skip a blank
            }
            beta[t][s] = logSumExp(terms);
        }
    }

    // Posterior over extended positions and gradient w.r.t. logits
    vector<vector<double>> grad(T, vector<double>(K, 0.0));
    for (int t = 0; t < T; ++t) {
        vector<double> gamma_k(K, 0.0); // posterior mass per class at time t
        for (int s = 0; s < S; ++s) {
            double log_gamma_ts = alpha[t][s] + beta[t][s] - logZ; // log posterior of (t,s)
            gamma_k[ext[s]] += exp(log_gamma_ts);
        }
        for (int k = 0; k < K; ++k) {
            grad[t][k] = prob[t][k] - gamma_k[k]; // dL/da = y_hat - posterior
        }
    }

    CTCLossResult res;
    res.loss = -logZ;
    res.grad_logits = move(grad);
    return res;
}

int main() {
    // Tiny example: alphabet {blank=0, a=1, b=2}
    // T=4 time steps, target "ab" -> labels {1, 2}
    vector<vector<double>> logits = {
        {3.0, 1.0, 0.0}, // t=0
        {2.0, 2.0, 0.0}, // t=1
        {3.0, 0.5, 1.5}, // t=2
        {2.0, 0.2, 2.2}  // t=3
    };
    vector<int> labels = {1, 2};
    int blank = 0;

    CTCLossResult out = ctc_loss_and_grad(logits, labels, blank);
    cout.setf(ios::fixed);
    cout << setprecision(6);
    cout << "CTC Loss: " << out.loss << "\n";
    cout << "Gradient (t,k):\n";
    for (size_t t = 0; t < out.grad_logits.size(); ++t) {
        for (size_t k = 0; k < out.grad_logits[t].size(); ++k) {
            cout << out.grad_logits[t][k] << (k+1 == out.grad_logits[t].size() ? "\n" : " ");
        }
    }
    return 0;
}
This program computes the CTC negative log-likelihood and gradients with respect to logits using a numerically stable forward–backward algorithm. It builds the extended target with blanks, runs forward and backward recurrences in log-space, computes the partition function (logZ), then obtains alignment posteriors. Gradients are y_hat − posterior, matching the standard CTC derivative. The implementation handles the empty-target edge case and uses log-sum-exp to avoid underflow.
#include <bits/stdc++.h>
using namespace std;

// Argmax over classes for a single time step
template <typename T>
int argmax(const vector<T>& v) {
    return (int)(max_element(v.begin(), v.end()) - v.begin());
}

// Greedy decode from logits: pick the argmax per time step, then collapse repeats and remove blanks
vector<int> ctc_greedy_decode(const vector<vector<double>>& logits, int blank) {
    const int T = (int)logits.size();

    // Step 1: argmax path
    vector<int> path(T);
    for (int t = 0; t < T; ++t) path[t] = argmax(logits[t]);

    // Step 2: collapse repeats and remove blanks
    vector<int> out;
    int prev = -1;
    for (int t = 0; t < T; ++t) {
        int k = path[t];
        if (k == blank) { prev = k; continue; } // drop blanks (and reset the repeat tracker)
        if (k == prev)  { continue; }           // collapse repeats
        out.push_back(k);
        prev = k;
    }
    return out;
}

int main() {
    // Alphabet: {blank=0, a=1, b=2, l=3, o=4}
    // Example logits over 7 frames that greedily decode to "lo" (indices 3, 4)
    vector<vector<double>> logits = {
        {3.0, 0.1, 0.2, 0.1, 0.1}, // blank
        {0.1, 0.2, 0.1, 3.2, 0.1}, // l
        {0.1, 0.2, 0.1, 3.0, 0.1}, // l (repeat, merged)
        {3.1, 0.1, 0.1, 0.1, 0.1}, // blank (separator)
        {0.1, 0.2, 0.1, 0.1, 3.0}, // o
        {0.1, 0.2, 0.1, 0.1, 3.1}, // o (repeat, merged)
        {3.0, 0.1, 0.1, 0.1, 0.1}  // blank
    };
    int blank = 0;
    vector<int> decoded = ctc_greedy_decode(logits, blank);
    cout << "Decoded labels (indices): ";
    for (int k : decoded) cout << k << ' ';
    cout << "\n"; // prints: Decoded labels (indices): 3 4
    return 0;
}
Greedy decoding selects the top-scoring symbol at each frame (argmax), then collapses consecutive duplicates and removes blanks to produce a label sequence. It is extremely fast and often good enough for quick checks, though beam search generally yields higher accuracy.