How I Study AI - Learn AI Papers & Lectures the Easy Way
Theory · Advanced

CTC Loss (Connectionist Temporal Classification)

Key Points

  • CTC loss trains sequence models when you do not know the alignment between inputs (frames) and outputs (labels).
  • It sums probabilities over all valid alignments that collapse (remove blanks and merge repeats) to the target label sequence.
  • Dynamic programming with forward–backward recurrences computes the loss efficiently in O(TU), where T is time length and U is label length.
  • A special blank symbol allows the model to output “nothing” between labels and to separate identical adjacent labels.
  • Greedy decoding picks the most likely symbol per time step, then collapses repeats and removes blanks; beam search performs better but is slower.
  • Numerical stability requires working in log-space and using log-sum-exp to avoid underflow.
  • Gradients with respect to the logits equal the softmax output minus the posterior, where the posterior is the probability of emitting each symbol at each time step given the target.
  • CTC is widely used in speech recognition, OCR, and handwriting recognition when frame-level labels are unavailable.

Prerequisites

  • Softmax and cross-entropy — CTC’s gradients mirror cross-entropy with soft targets; understanding softmax normalization is essential.
  • Dynamic programming — Forward–backward recurrences are DP over time and label positions.
  • Log-space arithmetic — CTC requires stable computation using log-sum-exp to avoid underflow.
  • Probability basics — Interpreting sums over paths and conditional independence relies on probability rules.
  • Matrix and tensor indexing — Implementations depend on correct handling of T×K and T×S arrays.
  • Numerical stability and floating-point — Understanding precision limits helps prevent NaNs and underflow.

Detailed Explanation


01 Overview

Connectionist Temporal Classification (CTC) is a training objective for sequence models that maps variable-length input sequences (such as audio frames or image columns) to shorter label sequences (such as words or characters) without requiring frame-level alignment. Instead of forcing a single alignment between inputs and outputs, CTC defines a set of all possible alignments that produce the target label sequence after applying a collapsing function: first merge consecutive duplicates, then remove a special blank symbol. The total probability of the target is the sum of probabilities of all these alignments. This sum would be exponential to compute naively, but dynamic programming (forward–backward) makes it tractable. The presence of a blank symbol lets the model “wait” between label emissions and disambiguates repeated labels (e.g., the two l’s in “hello”). CTC assumes conditional independence across time steps given the input representation, which is often produced by an encoder (RNN, CNN, or Transformer). During training, we minimize the negative log-likelihood of the target sequence; at inference, we decode with greedy or beam search. CTC has been crucial in domains like speech recognition and OCR, where aligning each frame to a specific label is impractical or expensive.

02 Intuition & Analogies

Imagine a conveyor belt of T frames (e.g., tiny slices of audio). A clerk watches the belt and can stamp a character at each frame or choose to stamp nothing. The "nothing" stamp is the blank symbol. Multiple stamps in a row with the same character are merged into one final character, and all blank stamps are removed. This merging and blank removal is the collapsing function. If the final written string after collapsing is exactly the target word, then the sequence of stamps is a valid alignment path. For a short word like “no,” there are many ways to stamp it: you might wait (blank, blank), stamp ‘n,’ wait again, then stamp ‘o,’ and so on. CTC doesn’t force a single choice. Instead, it sums the probabilities of all valid ways the clerk could have produced “no.” That is powerful because it doesn’t require us to know at which exact frame each letter should appear. The blank plays two key roles: it gives the model room to spread letters over time, and it separates identical neighbors so repeats can be represented (e.g., ‘l’ then ‘l’ needs a blank or a time gap). Computing the sum over all valid stampings directly would be impossible for long sequences, but we can keep track of partial sums: how likely we are to have produced the first s symbols of an “extended” target (with blanks inserted between letters) by time t. These running sums flow forward (and similarly backward), letting us aggregate all alignments efficiently and stably in log-space.

03 Formal Definition

Let the label alphabet be \(A\) and augment it with a blank symbol \(\varnothing\). Given input sequence length \(T\), the model outputs per-time-step distributions \(y_t(k) = P(k \mid x, t)\) over \(A \cup \{\varnothing\}\), where \(k\) is a symbol index. For a target sequence \(y = (y_1, \ldots, y_U)\) without blanks, define the collapsing map \(B: (A \cup \{\varnothing\})^T \to A^*\) that merges consecutive duplicates and then removes blanks. The CTC probability of the target is \(P(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} y_t(\pi_t)\), summing over all alignment paths \(\pi\) that collapse to \(y\). To compute this efficiently, create the extended target sequence \(e\) of length \(S = 2U + 1\): insert blanks at the beginning, between every pair of labels, and at the end. Define forward variables \(\alpha(t, s)\) as the log-probability of paths that end at time \(t\) aligned to position \(s\) in \(e\). The recurrence allows transitions from \(s\), from \(s-1\), and, when \(e_s \neq \varnothing\) and \(e_s \neq e_{s-2}\), from \(s-2\). A symmetric backward variable \(\beta(t, s)\) accumulates probabilities from time \(t\) to the end. With zero-based indexing, the total log-probability is \(\log P(y \mid x) = \mathrm{LSE}(\alpha(T-1, S-1), \alpha(T-1, S-2))\), since valid paths must end at the trailing blank or the last label. Posteriors over alignment positions are \(p(t, s) \propto \exp(\alpha(t, s) + \beta(t, s))\). The loss is the negative log-likelihood \(L = -\log P(y \mid x)\). For logits \(a_t\) with \(y_t = \mathrm{softmax}(a_t)\), the gradients are \(\partial L / \partial a_t(k) = y_t(k) - \sum_{s : e_s = k} p(t, s)\).
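As a worked illustration of the extended sequence and the skip condition (using the definitions above), take the target \(y = (\ell, \ell)\), so \(U = 2\):

```latex
% Extended target for y = (l, l), with zero-based positions 0..4:
e = (\varnothing,\ \ell,\ \varnothing,\ \ell,\ \varnothing), \qquad S = 2U + 1 = 5.
% The skip transition s-2 \to s is allowed only when
% e_s \neq \varnothing \ \text{and}\ e_s \neq e_{s-2}.
% Here the jump from position 1 (first l) to position 3 (second l) is forbidden,
% because e_3 = e_1 = \ell: every valid path must pass through the separating
% blank at position 2, which is exactly what keeps "ll" from collapsing to "l".
```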

04 When to Use

Use CTC when you need to align a long input sequence to a shorter label sequence but do not have (or do not want to rely on) frame-level alignments. Classic examples include: speech recognition (map audio frames to characters or word-pieces), online handwriting recognition (map pen-stroke traces to text), optical character recognition for scanned lines, lip reading, and sign language recognition. CTC excels when the alignment between inputs and outputs is monotonic and mostly left-to-right: you do not need to reorder outputs, only to decide when to emit each label and how long to wait (blanks) between them. It is also appropriate when you want a streaming or low-latency system that can emit outputs as frames arrive. Consider CTC over attention-based encoder–decoders when you prefer simpler decoding and strong monotonic constraints, or when you want to train from weak supervision (no frame labels). On the other hand, if you need non-monotonic alignments, strong long-range dependencies between emissions, or explicit language modeling during training, you might prefer sequence-to-sequence attention models or RNN-T (which generalizes CTC by conditioning emissions on previous outputs).

⚠️Common Mistakes

  • Ignoring numerical stability: computing products of probabilities quickly underflows. Always work in log-space and use log-sum-exp for additions.
  • Misusing the blank symbol: forgetting to add blanks around and between labels when building the extended target, or using the wrong blank index, breaks the recurrence.
  • Wrong skip condition: the s−2 transition is only allowed if the current extended symbol is not blank and not equal to the symbol two steps back; otherwise repeated labels can be merged incorrectly.
  • Mixing logits, probabilities, and log-probabilities: compute a stable log-softmax once per time step and stick to it in DP; use probabilities only when turning posteriors or gradients into human-readable numbers.
  • Ending state miscalculation: the total probability is the sum of forward variables at the last two extended positions (last label or trailing blank), not over all states.
  • Not handling empty targets: CTC supports empty labels via an extended sequence of a single blank; implement this edge case.
  • Forgetting masks for padded frames in batched code: if some sequences are shorter, mask out extra time steps consistently in forward, backward, and gradient.
  • Over-relying on greedy decoding to assess training quality: greedy can be much worse than beam search; always evaluate with a reasonable beam.

Key Formulas

CTC probability

\(P(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} y_t(\pi_t)\)

Explanation: The probability of the target label sequence equals the sum of probabilities of all alignment paths that collapse to it. Each path’s probability is the product of per-time-step emission probabilities.

CTC loss

L(x,y)=−logP(y∣x)

Explanation: Training minimizes the negative log-likelihood of the target sequence. Lower loss means the model assigns higher probability to the correct sequence across all valid alignments.

Extended length

S=2U+1

Explanation: The extended target sequence inserts blanks between labels and at both ends, yielding length S. This enables waiting and separation of identical adjacent labels.

Forward recurrence (log-space)

\(\alpha(t,s) = \log y_t(e_s) + \mathrm{LSE}\big(\alpha(t-1,s),\ \alpha(t-1,s-1),\ \alpha(t-1,s-2)\ [\text{only if } e_s \neq \varnothing \text{ and } e_s \neq e_{s-2}]\big)\)

Explanation: The forward variable sums (in log-space) all ways to reach extended position s by time t. You can stay, move by one, or skip a blank by two when allowed.

Backward recurrence (log-space)

\(\beta(t,s) = \mathrm{LSE}\big(\beta(t+1,s) + \log y_{t+1}(e_s),\ \beta(t+1,s+1) + \log y_{t+1}(e_{s+1}),\ \beta(t+1,s+2) + \log y_{t+1}(e_{s+2})\ [\text{only if } e_s \neq \varnothing \text{ and } e_s \neq e_{s+2}]\big)\)

Explanation: The backward variable accumulates probabilities from time t to the end. It mirrors the forward transitions, adding the next step’s emission log-probability.

Terminal sum

\(\log Z = \mathrm{LSE}(\alpha(T-1, S-1),\ \alpha(T-1, S-2))\)

Explanation: Valid alignments finish at the last blank or the last label in the extended sequence. The log-partition function is the log-sum-exp of these two forward values.

Alignment posterior

\(p(t,s) = \exp(\alpha(t,s) + \beta(t,s) - \log Z)\)

Explanation: This gives the probability mass of being at extended position s at time t given the target. These posteriors are used to compute gradients and diagnostics.

Gradient w.r.t. logits

\(\partial L / \partial a_t(k) = y_t(k) - \sum_{s : e_s = k} p(t,s)\)

Explanation: The gradient equals the model’s softmax probability minus the posterior probability of emitting symbol k at time t. This mirrors cross-entropy with soft targets given by CTC posteriors.

Log-sum-exp

\(\mathrm{LSE}(z_1, \ldots, z_n) = m + \log \sum_{i=1}^{n} \exp(z_i - m), \quad m = \max_i z_i\)

Explanation: A stable way to sum probabilities in log-space by factoring out the maximum. It avoids numerical underflow when probabilities are tiny.

Collapsing map example

\(B([\ldots, a, a, \varnothing, a, \ldots]) = [\ldots, a, a, \ldots]\)

Explanation: Consecutive duplicates are merged and blanks removed. This illustrates how alignments map to final label sequences.

Complexity Analysis

Let T be the number of time steps, U the number of labels in the target, K the alphabet size including the blank, and S = 2U + 1 the extended length. The forward–backward dynamic program fills a T×S table for both alpha and beta. Each cell aggregates up to three predecessor/successor states via log-sum-exp and adds one emission log-probability. Therefore, the time complexity is O(TS) = O(TU). Computing the per-time-step softmax and log-softmax costs O(TK). When gradients are required with respect to logits, we also compute posteriors by combining alpha and beta and summing over positions mapping to each symbol, which costs O(TS + TK). Overall, for a single sequence, the dominant term is O(TU + TK), often simplified to O(TU) when K is comparable to U or to O(TK) when U is small relative to K.

Space complexity depends on whether you need gradients. For loss-only, a single pass with O(S) memory per time step suffices by discarding previous rows (streaming forward), giving O(S) space. For gradients, both alpha and beta (or sufficient intermediate summaries) are needed simultaneously to compute posteriors, requiring O(TS) space. Practical implementations may trade memory for recomputation (checkpointing) to reduce peak memory while keeping time near-linear.

Beam search decoding adds a factor of the beam width B to the per-time-step cost: O(T·B·K) for a straightforward prefix beam search, with modest additional memory for maintaining beams and prefix scores. Greedy decoding is O(TK) time and O(1) extra space beyond outputs.

Code Examples

Numerically stable CTC loss and gradient (forward–backward in log-space)
#include <bits/stdc++.h>
using namespace std;

// Helper: log-sum-exp over a vector (returns -inf if all are -inf)
double logSumExp(const vector<double>& xs) {
    double m = -numeric_limits<double>::infinity();
    for (double v : xs) m = max(m, v);
    if (!isfinite(m)) return m; // all -inf
    double s = 0.0;
    for (double v : xs) s += exp(v - m);
    return m + log(s);
}

// Stable log-softmax for a single time step (fills probabilities and log-probabilities)
void logSoftmaxRow(const vector<double>& logits, vector<double>& logp, vector<double>& prob) {
    double m = *max_element(logits.begin(), logits.end());
    double sumexp = 0.0;
    prob.resize(logits.size());
    logp.resize(logits.size());
    for (size_t k = 0; k < logits.size(); ++k) sumexp += exp(logits[k] - m);
    double logZ = m + log(sumexp);
    for (size_t k = 0; k < logits.size(); ++k) {
        logp[k] = logits[k] - logZ; // log softmax
        prob[k] = exp(logp[k]);     // softmax
    }
}

struct CTCLossResult {
    double loss;                        // negative log-likelihood
    vector<vector<double>> grad_logits; // gradient w.r.t. logits, shape [T][K]
};

// Compute CTC loss and gradient for a single sequence.
// Inputs:
//   - logits: [T][K] raw scores (pre-softmax)
//   - labels: target labels without blanks, of length U (values in [0, K-1], excluding the blank index)
//   - blank:  index of the blank symbol in [0, K-1]
// Returns loss and gradient w.r.t. logits.
CTCLossResult ctc_loss_and_grad(const vector<vector<double>>& logits,
                                const vector<int>& labels,
                                int blank) {
    const int T = (int)logits.size();
    const int K = (int)logits[0].size();
    const int U = (int)labels.size();

    // Edge case: empty target -> extended sequence is just [blank]
    vector<int> ext; // extended label sequence e of length S = 2U+1
    if (U == 0) {
        ext = {blank};
    } else {
        ext.reserve(2*U + 1);
        for (int i = 0; i < U; ++i) {
            if (i == 0) ext.push_back(blank);
            ext.push_back(labels[i]);
            ext.push_back(blank);
        }
    }
    const int S = (int)ext.size();

    // Compute per-time softmax and log-softmax
    vector<vector<double>> logp(T, vector<double>(K));
    vector<vector<double>> prob(T, vector<double>(K));
    for (int t = 0; t < T; ++t) {
        logSoftmaxRow(logits[t], logp[t], prob[t]);
    }

    // Initialize alpha [T][S] with -inf
    const double NEG_INF = -numeric_limits<double>::infinity();
    vector<vector<double>> alpha(T, vector<double>(S, NEG_INF));

    // t=0 initialization: paths may start at the leading blank or the first label
    alpha[0][0] = logp[0][ext[0]];
    if (S > 1) {
        alpha[0][1] = logp[0][ext[1]];
    }
    // the rest remain -inf

    // Forward pass
    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            // stay at s
            terms.push_back(alpha[t-1][s]);
            // move from s-1
            if (s-1 >= 0) terms.push_back(alpha[t-1][s-1]);
            // skip from s-2, only if the current symbol is not blank and differs from e[s-2]
            if (s-2 >= 0) {
                int es = ext[s];
                int es2 = ext[s-2];
                if (es != blank && es != es2) {
                    terms.push_back(alpha[t-1][s-2]);
                }
            }
            alpha[t][s] = logSumExp(terms) + logp[t][ext[s]];
        }
    }

    // Terminal sum: paths may end at S-1 (trailing blank) or S-2 (last label)
    double logZ;
    if (S == 1) {
        logZ = alpha[T-1][0];
    } else {
        logZ = logSumExp(vector<double>{alpha[T-1][S-1], alpha[T-1][S-2]});
    }

    // Backward pass: initialize beta [T][S]
    vector<vector<double>> beta(T, vector<double>(S, NEG_INF));
    // At t = T-1, the probability of finishing from the last two positions is 1 (0 in log-space)
    if (S == 1) {
        beta[T-1][0] = 0.0;
    } else {
        beta[T-1][S-1] = 0.0; // end at trailing blank
        beta[T-1][S-2] = 0.0; // or at last label
    }

    for (int t = T-2; t >= 0; --t) {
        for (int s = 0; s < S; ++s) {
            vector<double> terms;
            // stay at s: next emits ext[s] at t+1
            terms.push_back(beta[t+1][s] + logp[t+1][ext[s]]);
            // move to s+1: next emits ext[s+1]
            if (s+1 < S) terms.push_back(beta[t+1][s+1] + logp[t+1][ext[s+1]]);
            // skip to s+2, mirroring the forward skip condition
            if (s+2 < S) {
                int es = ext[s];
                int es2 = ext[s+2];
                if (es != blank && es != es2) {
                    terms.push_back(beta[t+1][s+2] + logp[t+1][ext[s+2]]);
                }
            }
            beta[t][s] = logSumExp(terms);
        }
    }

    // Posterior over extended positions and gradient w.r.t. logits
    vector<vector<double>> grad(T, vector<double>(K, 0.0));
    for (int t = 0; t < T; ++t) {
        vector<double> gamma_k(K, 0.0); // posterior mass per class at time t
        for (int s = 0; s < S; ++s) {
            double log_gamma_ts = alpha[t][s] + beta[t][s] - logZ; // log posterior of (t,s)
            gamma_k[ext[s]] += exp(log_gamma_ts);
        }
        for (int k = 0; k < K; ++k) {
            grad[t][k] = prob[t][k] - gamma_k[k]; // dL/da = y_hat - posterior
        }
    }

    CTCLossResult res;
    res.loss = -logZ;
    res.grad_logits = move(grad);
    return res;
}

int main() {
    // Tiny example: alphabet {blank=0, a=1, b=2}
    // T=4 time steps, target = "ab" -> labels {1,2}
    vector<vector<double>> logits = {
        {3.0, 1.0, 0.0}, // t=0
        {2.0, 2.0, 0.0}, // t=1
        {3.0, 0.5, 1.5}, // t=2
        {2.0, 0.2, 2.2}  // t=3
    };
    vector<int> labels = {1, 2};
    int blank = 0;

    CTCLossResult out = ctc_loss_and_grad(logits, labels, blank);
    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "CTC Loss: " << out.loss << "\n";
    cout << "Gradient (t,k):\n";
    for (size_t t = 0; t < out.grad_logits.size(); ++t) {
        for (size_t k = 0; k < out.grad_logits[t].size(); ++k) {
            cout << out.grad_logits[t][k] << (k+1 == out.grad_logits[t].size() ? "\n" : " ");
        }
    }
    return 0;
}

This program computes the CTC negative log-likelihood and gradients with respect to logits using a numerically stable forward–backward algorithm. It builds the extended target with blanks, runs forward and backward recurrences in log-space, computes the partition function (logZ), then obtains alignment posteriors. Gradients are y_hat − posterior, matching the standard CTC derivative. The implementation handles the empty-target edge case and uses log-sum-exp to avoid underflow.

Time: O(TU + TK), where T is time steps, U is label length, and K is alphabet size.
Space: O(TU + TK) for storing alpha, beta, probabilities, and gradients.
Greedy CTC decoding (collapse repeats and remove blanks)
#include <bits/stdc++.h>
using namespace std;

// Argmax over classes for each time step
template <typename T>
int argmax(const vector<T>& v) {
    return (int)(max_element(v.begin(), v.end()) - v.begin());
}

// Greedy decode from logits: pick argmax per time, then collapse repeats and remove blanks
vector<int> ctc_greedy_decode(const vector<vector<double>>& logits, int blank) {
    const int T = (int)logits.size();

    // Step 1: argmax path
    vector<int> path(T);
    for (int t = 0; t < T; ++t) path[t] = argmax(logits[t]);

    // Step 2: collapse repeats and remove blanks
    vector<int> out;
    int prev = -1;
    for (int t = 0; t < T; ++t) {
        int k = path[t];
        if (k == blank) { prev = k; continue; } // drop blanks (and reset the repeat tracker)
        if (k == prev) { continue; }            // collapse repeats
        out.push_back(k);
        prev = k;
    }
    return out;
}

int main() {
    // Alphabet: {blank=0, a=1, b=2, l=3, o=4}
    // Seven frames whose argmax path is [blank, l, l, blank, o, o, blank].
    // Collapsing repeats and removing blanks yields "lo" (indices 3 4);
    // to emit a double letter such as "ll", a blank would have to separate
    // the two l frames.
    vector<vector<double>> logits = {
        {3.0, 0.1, 0.2, 0.1, 0.1}, // blank
        {0.1, 0.2, 0.1, 3.2, 0.1}, // l
        {0.1, 0.2, 0.1, 3.0, 0.1}, // l (repeat, merged)
        {3.1, 0.1, 0.1, 0.1, 0.1}, // blank (separator)
        {0.1, 0.2, 0.1, 0.1, 3.0}, // o
        {0.1, 0.2, 0.1, 0.1, 3.1}, // o (repeat, merged)
        {3.0, 0.1, 0.1, 0.1, 0.1}  // blank
    };
    int blank = 0;
    vector<int> decoded = ctc_greedy_decode(logits, blank);
    cout << "Decoded labels (indices): ";
    for (int k : decoded) cout << k << ' ';
    cout << "\n";
    return 0;
}

Greedy decoding selects the top-scoring symbol at each frame (argmax), then collapses consecutive duplicates and removes blanks to produce a label sequence. It is extremely fast and often good enough for quick checks, though beam search generally yields higher accuracy.

Time: O(TK) to compute argmax per time step.
Space: O(1) additional space beyond outputs.
#ctc loss#connectionist temporal classification#forward backward#blank symbol#sequence alignment#speech recognition#ocr#beam search#log-sum-exp#greedy decoding#posterior#dynamic programming#temporal modeling#logits#softmax