Cross-Entropy Loss
Key Points
- Cross-entropy loss measures how well predicted probabilities match the true labels by penalizing confident wrong predictions heavily.
- For multiclass problems, cross-entropy is computed as the negative log of the predicted probability for the correct class.
- Using logits with the log-sum-exp trick and softmax yields numerically stable cross-entropy computations.
- Binary cross-entropy specializes the formula to two classes and pairs naturally with the sigmoid function.
- The gradient of softmax cross-entropy with respect to logits is simply predicted probabilities minus one-hot labels.
- Minimizing cross-entropy is equivalent to maximum likelihood estimation under a categorical or Bernoulli model.
- Cross-entropy connects to KL divergence: it equals entropy plus the KL divergence between true and predicted distributions.
- Efficient C++ implementations run in O(nk) time per batch, where n is the batch size and k is the number of classes.
Prerequisites
- Logarithms and Exponentials: Cross-entropy uses log and exp; stability relies on properties like log-sum-exp.
- Basic Probability Distributions: Understanding probabilities and the categorical and Bernoulli models is essential to interpret the loss.
- Vectors and Matrices: Implementations use vector/matrix operations for logits, probabilities, and gradients.
- Gradient-Based Optimization: Training with cross-entropy requires computing gradients and using algorithms like SGD.
- One-Hot Encoding and Label Formats: Correct loss computation depends on matching the label representation to the formula.
Detailed Explanation
01 Overview
Hook: Imagine a weather app that says there's a 99% chance of sun, but it rains. You'd be more upset than if it had said 55%. Cross-entropy captures exactly that frustration: it punishes being confidently wrong more than being uncertain.
Concept: Cross-entropy loss is a numerical score used to evaluate probabilistic predictions. When you assign a probability to each possible outcome, cross-entropy tells you how surprising the true outcome is under your prediction. The lower the cross-entropy, the better your probabilities match reality. In classification, it is the standard objective used to train models to output calibrated probabilities.
Example: In a 3-class task, if the true class is 2 and your model outputs p = [0.1, 0.8, 0.1], the loss is −log(0.8), which is small because you were quite confident and correct. If instead p = [0.98, 0.01, 0.01], the loss is −log(0.01), which is huge because you were confident in the wrong class.
02 Intuition & Analogies
Hook: Think of cross-entropy like a guessing game where you bet points on outcomes. If you bet almost all your points on the wrong outcome, you lose big; if you spread your bets or put more on the correct outcome, you lose less.
Concept: The negative log function turns probabilities into a penalty scale that grows rapidly as probabilities approach zero. This means the loss strongly discourages placing very low probability on the true class. Because logs turn products into sums, cross-entropy aggregates independent prediction penalties cleanly across samples. This is why it's the go-to loss for models that output probabilities.
Example: Suppose you predict the next word in a sentence. If the true next word gets probability 0.5, the loss is −log(0.5) ≈ 0.693; if it gets 0.1, the loss is ≈ 2.302; if it gets 0.001, the loss is ≈ 6.908. Each time you cut the assigned probability by 10, the penalty goes up by about log(10), reflecting how much more "surprised" your model is. This aligns with how we want learning to behave: fix big mistakes (very confident and wrong) before nudging minor ones.
03 Formal Definition
Given a true distribution p and a predicted distribution q over k classes, the cross-entropy is H(p, q) = −Σ_i p_i log q_i. For a single sample whose label is the one-hot vector for class y, this reduces to the negative log-likelihood L = −log q_y. In the binary case with label y ∈ {0, 1} and prediction p̂, it becomes L = −[y·log p̂ + (1 − y)·log(1 − p̂)]. These forms are stated individually in the Key Formulas section below.
04 When to Use
Hook: Use cross-entropy whenever your model must output probabilities that reflect real-world uncertainty.
Concept: It is ideal for classification tasks where each sample belongs to exactly one class (softmax cross-entropy), to one of two classes (binary cross-entropy), or to multiple independent labels (sum of binary cross-entropies). Because it matches the negative log-likelihood of categorical/Bernoulli models, minimizing cross-entropy corresponds to maximum likelihood estimation, yielding statistically grounded learning.
Example: Choose cross-entropy for image recognition (softmax over object classes), language modeling (softmax over vocabulary), click-through prediction (binary with sigmoid), and recommendation with multi-label outputs (sum of BCE terms). Avoid it when you predict continuous values (use MSE) or when outputs are not probabilities (unless you map them through sigmoid/softmax).
⚠️ Common Mistakes
Hook: Many training instabilities blamed on "bad optimization" are actually due to subtle mistakes in how cross-entropy is computed.
Concept: Frequent pitfalls include: computing softmax naively without numerical stabilization (overflow/underflow), taking log(0) due to zero probabilities, mixing label formats (index vs one-hot), and forgetting to average over the batch, which changes learning-rate scaling. Another common error is applying softmax twice or mixing sigmoid with softmax.
Example: If logits are large, e^{z} can overflow. The fix is the log-sum-exp trick: subtract the maximum logit m before exponentiating. Similarly, compute BCE from logits using the stable formula max(z, 0) − z·y + log(1 + e^{−|z|}). Also ensure that one-hot labels sum to 1, class indices are in range, and that the reduction (mean vs sum) is chosen consistently. For multi-label tasks, do not use softmax; use independent sigmoids with BCE across classes. Finally, regularize and monitor calibration if predicted probabilities appear overconfident.
Key Formulas
Cross-Entropy
H(p, q) = −Σ_i p_i · log q_i
Explanation: This is the expected negative log-probability the model assigns to outcomes drawn from p. In classification with one-hot labels, it reduces to the negative log of the true class probability.
Binary Cross-Entropy (BCE)
L = −[ y·log(p̂) + (1 − y)·log(1 − p̂) ]
Explanation: This specializes cross-entropy to two classes using the Bernoulli model, where p̂ is the predicted probability of class 1. It penalizes assigning low probability to the true label.
Softmax
softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
Explanation: Softmax converts unnormalized scores (logits) into a valid probability distribution over k classes. It is used before computing multiclass cross-entropy.
Stable NLL from Logits
L = −(z_y − m) + log Σ_j e^{z_j − m}, where m = max_j z_j
Explanation: This computes cross-entropy directly from logits without explicitly forming probabilities. Subtracting m implements the log-sum-exp trick for numerical stability.
Stable Binary Log-Loss from Logit
L = max(z, 0) − z·y + log(1 + e^{−|z|})
Explanation: A numerically stable BCE computed directly from the logit z. It avoids overflow or underflow in exp and log for large |z|.
Softmax CE Gradient
∂L/∂z_i = p_i − y_i, where p = softmax(z) and y is the one-hot label
Explanation: For multiclass softmax cross-entropy, the gradient with respect to each logit is simply the predicted probability minus the one-hot label. This makes backpropagation efficient.
Binary CE Gradient
∂L/∂z = σ(z) − y
Explanation: For BCE with a sigmoid, the derivative w.r.t. the logit equals the prediction minus the label. This compact form simplifies gradient computation.
Cross-Entropy Decomposition
H(p, q) = H(p) + KL(p ‖ q)
Explanation: Cross-entropy equals the entropy of p plus the KL divergence from p to q. Since H(p) is fixed, minimizing cross-entropy is equivalent to minimizing KL divergence.
Perplexity
PPL = exp( (1/N) · Σ_i L_i )
Explanation: Perplexity is the exponential of average cross-entropy, commonly used in language modeling. Lower perplexity indicates better predictive performance.
Empirical Risk (Average CE)
R̂ = (1/n) · Σ_{i=1..n} L_i
Explanation: The dataset loss is the average of per-sample cross-entropy losses. Using the mean keeps the gradient scale consistent across batch sizes.
Complexity Analysis
For a batch of n samples over k classes, computing the softmax, the loss, and the gradient each costs O(k) per sample, so one pass over the batch runs in O(nk) time with O(k) extra space per sample.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute softmax probabilities using the log-sum-exp trick for stability.
vector<double> softmax(const vector<double>& logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    vector<double> exps(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        exps[i] = exp(logits[i] - m);
        sumExp += exps[i];
    }
    vector<double> probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) probs[i] = exps[i] / sumExp;
    return probs;
}

// Cross-entropy from probabilities and one-hot label.
double cross_entropy_from_probs(const vector<double>& probs, const vector<int>& one_hot) {
    const double eps = 1e-15; // avoid log(0)
    if (probs.size() != one_hot.size()) throw runtime_error("Size mismatch");
    double loss = 0.0;
    for (size_t i = 0; i < probs.size(); ++i) {
        if (one_hot[i]) {
            loss = -log(max(probs[i], eps));
            break; // one-hot has a single 1
        }
    }
    return loss;
}

// Cross-entropy directly from logits and true class index using stable log-sum-exp.
double cross_entropy_from_logits(const vector<double>& logits, int y) {
    if (y < 0 || y >= (int)logits.size()) throw runtime_error("Label out of range");
    double m = *max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (double z : logits) sumExp += exp(z - m);
    // L = -(z_y - m) + log(sum_j exp(z_j - m))
    return -(logits[y] - m) + log(sumExp);
}

int main() {
    // Example: 4-class problem, logits for one sample
    vector<double> logits = {2.3, -1.2, 0.7, 1.1};
    int true_class = 0; // class index 0..3

    vector<double> probs = softmax(logits);
    vector<int> one_hot = {1, 0, 0, 0};

    double ce_probs = cross_entropy_from_probs(probs, one_hot);
    double ce_logits = cross_entropy_from_logits(logits, true_class);

    cout << fixed << setprecision(6);
    cout << "Probabilities: ";
    for (double p : probs) cout << p << ' ';
    cout << "\nCE (from probs): " << ce_probs << "\nCE (from logits): " << ce_logits << "\n";
    return 0;
}
This program shows two stable ways to compute multiclass cross-entropy. The softmax function uses the log-sum-exp trick by subtracting the maximum logit. The loss is computed either from probabilities with an epsilon guard or directly from logits, which is preferred for both stability and efficiency.
#include <bits/stdc++.h>
using namespace std;

// Numerically stable sigmoid
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

// Stable BCE computed directly from logit z and label y in {0,1}
static inline double bce_from_logit(double z, int y) {
    // L = max(z,0) - z*y + log(1 + exp(-|z|))
    double a = fabs(z);
    return max(z, 0.0) - z * (double)y + log1p(exp(-a));
}

// Gradient dL/dz = sigmoid(z) - y
static inline double bce_grad_wrt_logit(double z, int y) {
    return sigmoid(z) - (double)y;
}

int main() {
    vector<double> logits = {5.0, -5.0, 0.0, 2.0, -2.0};
    vector<int> labels = {1, 0, 1, 0, 1};

    cout << fixed << setprecision(6);
    for (size_t i = 0; i < logits.size(); ++i) {
        double z = logits[i];
        int y = labels[i];
        double loss = bce_from_logit(z, y);
        double grad = bce_grad_wrt_logit(z, y);
        double p = sigmoid(z);
        cout << "z=" << z << ", y=" << y
             << ", p(sigmoid)=" << p
             << ", loss(BCE)=" << loss
             << ", grad(dL/dz)=" << grad << "\n";
    }
    return 0;
}
This example implements a numerically stable binary cross-entropy directly from logits and computes its gradient. The stable form avoids overflow for large |z|, and the gradient simplifies to sigmoid(z) − y, which is central to logistic regression training.
#include <bits/stdc++.h>
using namespace std;

// Utilities for stable softmax and loss
vector<double> softmax(const vector<double>& z) {
    double m = *max_element(z.begin(), z.end());
    double sumExp = 0.0;
    vector<double> e(z.size());
    for (size_t i = 0; i < z.size(); ++i) { e[i] = exp(z[i] - m); sumExp += e[i]; }
    for (size_t i = 0; i < z.size(); ++i) e[i] /= sumExp;
    return e;
}

double nll_from_logits(const vector<double>& z, int y) {
    double m = *max_element(z.begin(), z.end());
    double sumExp = 0.0;
    for (double zi : z) sumExp += exp(zi - m);
    return -(z[y] - m) + log(sumExp);
}

// Matrix-vector multiply: y = W x + b, W: k x d, x: d
vector<double> affine_logits(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& x) {
    size_t k = W.size(), d = x.size();
    vector<double> z(k);
    for (size_t i = 0; i < k; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d; ++j) s += W[i][j] * x[j];
        z[i] = s;
    }
    return z;
}

int main() {
    // Create a toy 2D, 3-class dataset (linearly separable-ish)
    std::mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.2);
    vector<vector<double>> X; vector<int> Y;
    int n_per_class = 60; // total n = 180
    vector<vector<double>> centers = {{2,2}, {-2,0}, {0,-2}};
    for (int c = 0; c < 3; ++c) {
        for (int i = 0; i < n_per_class; ++i) {
            vector<double> x = {centers[c][0] + noise(rng), centers[c][1] + noise(rng)};
            X.push_back(x); Y.push_back(c);
        }
    }

    // Model: W (k x d), b (k)
    size_t k = 3, d = 2;
    vector<vector<double>> W(k, vector<double>(d));
    vector<double> b(k, 0.0);
    // Initialize small random weights
    uniform_real_distribution<double> uni(-0.01, 0.01);
    for (size_t i = 0; i < k; ++i) for (size_t j = 0; j < d; ++j) W[i][j] = uni(rng);

    double lr = 0.1; int epochs = 50; int n = (int)X.size();

    for (int ep = 1; ep <= epochs; ++ep) {
        // Shuffle data each epoch
        vector<int> idx(n);
        iota(idx.begin(), idx.end(), 0);
        shuffle(idx.begin(), idx.end(), rng);

        double epoch_loss = 0.0;
        // Simple SGD with batch size 1 (for brevity)
        for (int t = 0; t < n; ++t) {
            int i = idx[t];
            const vector<double>& x = X[i];
            int y = Y[i];
            // Forward
            vector<double> z = affine_logits(W, b, x);
            vector<double> p = softmax(z);
            double loss = -log(max(1e-15, p[y]));
            epoch_loss += loss;
            // Gradients: dL/dz = p - y_onehot
            vector<double> dz = p; dz[y] -= 1.0; // size k
            // Update W and b: dL/dW = dz * x^T, dL/db = dz
            for (size_t r = 0; r < k; ++r) {
                for (size_t j = 0; j < d; ++j) W[r][j] -= lr * dz[r] * x[j];
                b[r] -= lr * dz[r];
            }
        }
        cout << "Epoch " << ep << ": avg loss = " << (epoch_loss / n) << "\n";
    }

    // Evaluate accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        vector<double> z = affine_logits(W, b, X[i]);
        vector<double> p = softmax(z);
        int pred = (int)(max_element(p.begin(), p.end()) - p.begin());
        if (pred == Y[i]) ++correct;
    }
    cout << "Training accuracy: " << (100.0 * correct / n) << "%\n";
    return 0;
}
This program trains a softmax regression model with stochastic gradient descent on a toy 2D, 3-class dataset. It uses the classic gradient dL/dz = p − y_onehot and updates weights with dL/dW = (p − y) x^T. The example demonstrates how cross-entropy integrates with model training end-to-end.