Math · Intermediate

Cross-Entropy Loss

Key Points

  • β€’
    Cross-entropy loss measures how well predicted probabilities match the true labels by penalizing confident wrong predictions heavily.
  • β€’
    For multiclass problems, cross-entropy is computed as the negative log of the predicted probability for the correct class.
  • β€’
    Using logits with the log-sum-exp trick and softmax yields numerically stable cross-entropy computations.
  • β€’
    Binary cross-entropy specializes the formula to two classes and pairs naturally with the sigmoid function.
  • β€’
    The gradient of softmax cross-entropy with respect to logits is simply predicted probabilities minus one-hot labels.
  • β€’
    Minimizing cross-entropy is equivalent to maximum likelihood estimation under a categorical or Bernoulli model.
  • β€’
    Cross-entropy connects to KL divergence: it equals entropy plus the KL divergence between true and predicted distributions.
  • β€’
    Efficient C++ implementations run in O(nk) time per batch where n is batch size and k is number of classes.

Prerequisites

  • β†’Logarithms and Exponentials β€” Cross-entropy uses log and exp; stability relies on properties like log-sum-exp.
  • β†’Basic Probability Distributions β€” Understanding probabilities, categorical and Bernoulli models is essential to interpret the loss.
  • β†’Vectors and Matrices β€” Implementations use vector/matrix operations for logits, probabilities, and gradients.
  • β†’Gradient-Based Optimization β€” Training with cross-entropy requires computing gradients and using algorithms like SGD.
  • β†’One-Hot Encoding and Label Formats β€” Correct loss computation depends on matching label representation to the formula.

Detailed Explanation


01 Overview

Hook β†’ Imagine a weather app that says there’s a 99% chance of sun, but it rains. You’d be more upset than if it had said 55%. Cross-entropy captures exactly that frustration: it punishes being confidently wrong more than being uncertain.

Concept β†’ Cross-entropy loss is a numerical score used to evaluate probabilistic predictions. When you assign a probability to each possible outcome, cross-entropy tells you how surprising the true outcome is under your prediction. The lower the cross-entropy, the better your probabilities match reality. In classification, it is the standard objective used to train models to output calibrated probabilities.

Example β†’ In a 3-class task, if the true class is 2 and your model outputs p = [0.1, 0.8, 0.1], the loss is βˆ’log(0.8), which is small because you were quite confident and correct. If instead p = [0.98, 0.01, 0.01], the loss is βˆ’log(0.01), which is huge because you were confident in the wrong class.

02 Intuition & Analogies

Hook β†’ Think of cross-entropy like a guessing game where you bet points on outcomes. If you bet almost all your points on the wrong outcome, you lose big; if you spread your bets or put more on the correct outcome, you lose less.

Concept β†’ The negative log function turns probabilities into a penalty scale that grows rapidly as probabilities approach zero. This means the loss strongly discourages placing very low probability on the true class. Because logs turn products into sums, cross-entropy aggregates independent prediction penalties cleanly across samples. This is why it’s the go-to loss for models that output probabilities.

Example β†’ Suppose you predict the next word in a sentence. If the true next word gets probability 0.5, the loss is βˆ’log(0.5) β‰ˆ 0.693; if it gets 0.1, the loss is β‰ˆ 2.302; if it gets 0.001, the loss is β‰ˆ 6.908. Each time you cut the assigned probability by 10, the penalty goes up by about log(10), reflecting how much more β€œsurprised” your model is. This aligns with how we want learning to behave: fix big mistakes (very confident and wrong) before nudging minor ones.

03 Formal Definition

Hook β†’ To compare "what is true" against "what we predict," we need a function that is zero when they match perfectly and increases as they diverge.

Concept β†’ For discrete distributions p (true) and q (predicted) over classes i = 1, ..., k, the cross-entropy is H(p, q) = βˆ’βˆ‘_{i=1}^{k} p_i log q_i. In supervised classification with one-hot labels y, this reduces per sample to L = βˆ’log(q_y), the negative log-probability assigned to the true class. For binary classification with label y ∈ {0, 1} and predicted probability p, the loss is L = βˆ’[y log p + (1 βˆ’ y) log(1 βˆ’ p)].

Example β†’ If y = [0, 1, 0, 0] and q = [0.05, 0.9, 0.03, 0.02], then L = βˆ’log(0.9). If multiple labels are independently active (multi-label), we sum binary cross-entropies over classes.

04 When to Use

Hook β†’ Use cross-entropy whenever your model must output probabilities that reflect real-world uncertainty.

Concept β†’ It is ideal for classification tasks where each sample belongs to exactly one class (softmax cross-entropy), to one of two classes (binary cross-entropy), or to multiple independent labels (sum of binary cross-entropies). Because it matches the negative log-likelihood of categorical/Bernoulli models, minimizing cross-entropy corresponds to maximum likelihood estimation, yielding statistically grounded learning.

Example β†’ Choose cross-entropy for image recognition (softmax over object classes), language modeling (softmax over vocabulary), click-through prediction (binary with sigmoid), and recommendation with multi-label outputs (sum of BCE terms). Avoid it when you predict continuous values (use MSE) or when outputs are not probabilities (unless you map them through sigmoid/softmax).

⚠️ Common Mistakes

Hook β†’ Many training instabilities blamed on β€œbad optimization” are actually due to subtle mistakes in how cross-entropy is computed.

Concept β†’ Frequent pitfalls include: computing softmax naively without numerical stabilization (overflow/underflow), taking log(0) due to zero probabilities, mixing label formats (index vs one-hot), and forgetting to average over the batch, which changes learning rate scaling. Another common error is applying softmax twice or mixing sigmoid with softmax.

Example β†’ If logits are large, e^{z} can overflow. The fix is the log-sum-exp trick: subtract the maximum logit m before exponentiating. Similarly, compute BCE from logits using the stable formula max(z, 0) βˆ’ z·y + log(1 + e^{βˆ’|z|}). Also ensure that one-hot labels sum to 1, class indices are in range, and that the reduction (mean vs sum) is chosen consistently. For multi-label tasks, do not use softmax; use independent sigmoids with BCE across classes. Finally, regularize and monitor calibration if predicted probabilities appear overconfident.

Key Formulas

Cross-Entropy

H(p, q) = βˆ’βˆ‘_{i=1}^{k} p_i log q_i

Explanation: This is the expected negative log-probability the model assigns to outcomes drawn from p. In classification with one-hot labels, it reduces to the negative log of the true class probability.

Binary Cross-Entropy (BCE)

L_binary(y, p) = βˆ’[y log p + (1 βˆ’ y) log(1 βˆ’ p)]

Explanation: This specializes cross-entropy to two classes using the Bernoulli model. It penalizes assigning low probability to the true label.

Softmax

p_i = softmax(z)_i = e^{z_i} / βˆ‘_{j=1}^{k} e^{z_j}

Explanation: Softmax converts unnormalized scores (logits) into a valid probability distribution over k classes. It is used before computing multiclass cross-entropy.

Stable NLL from Logits

L = βˆ’z_y + log βˆ‘_{j=1}^{k} e^{z_j} = βˆ’(z_y βˆ’ m) + log βˆ‘_{j=1}^{k} e^{z_j βˆ’ m},  where m = max_j z_j

Explanation: This computes cross-entropy directly from logits without explicitly forming probabilities. Subtracting m implements the log-sum-exp trick for numerical stability.

Stable Binary Log-Loss from Logit

Οƒ(z) = 1 / (1 + e^{βˆ’z}),  L_BCE-logit(z, y) = max(z, 0) βˆ’ z·y + log(1 + e^{βˆ’|z|})

Explanation: A numerically stable BCE computed directly from the logit z. It avoids overflow or underflow in exp and log for large |z|.

Softmax CE Gradient

βˆ‚L/βˆ‚z_i = p_i βˆ’ y_i

Explanation: For multiclass softmax cross-entropy, the gradient with respect to each logit is simply the predicted probability minus the one-hot label. This makes backpropagation efficient.

Binary CE Gradient

βˆ‚L/βˆ‚z = Οƒ(z) βˆ’ y

Explanation: For BCE with a sigmoid, the derivative w.r.t. the logit equals the prediction minus the label. This compact form simplifies gradient computation.

Cross-Entropy Decomposition

H(p, q) = H(p) + D_KL(p βˆ₯ q)

Explanation: Cross-entropy equals the entropy of p plus the KL divergence from p to q. Since H(p) is fixed, minimizing cross-entropy is equivalent to minimizing KL divergence.

Perplexity

PP = exp((1/n) βˆ‘_{t=1}^{n} βˆ’log p(y_t))

Explanation: Perplexity is the exponential of average cross-entropy, commonly used in language modeling. Lower perplexity indicates better predictive performance.

Empirical Risk (Average CE)

L_emp = (1/n) βˆ‘_{i=1}^{n} H(y^{(i)}, q^{(i)})

Explanation: The dataset loss is the average of per-sample cross-entropy losses. Using the mean keeps the gradient scale consistent across batch sizes.

Complexity Analysis

Computing cross-entropy for a single multiclass example from probabilities is O(k) time and O(1) extra space, where k is the number of classes, because we only access the probability at the true index or sum over k entries for one-hot labels. When computed from logits using the stable log-sum-exp trick, we require one pass to find the maximum (O(k)), one pass to accumulate the shifted exponentials (O(k)), and constant-time arithmetic; this remains O(k) time and O(1) extra space beyond the logits themselves.

For a batch of n examples, the total cost is O(nk), which typically dominates in high-class-count tasks (e.g., large vocabularies). Binary cross-entropy is O(1) per example, since it involves a single probability; for multi-label classification with k independent sigmoids, it is O(k), similar to softmax CE but without the normalization cost.

In training with softmax regression (multinomial logistic regression) with d input features and k classes, a forward pass computes logits as W x + b in O(dk) per example, and the gradient has the same complexity. Over n examples and E epochs, the total time is O(E n d k), and memory is O(dk) for parameters plus O(k) per example for temporary vectors (probabilities, gradients).

Using numerically stable forms (log-sum-exp or BCE-from-logits) adds only constant overhead for max and absolute-value operations while significantly reducing the risk of floating-point overflow or underflow. Parallelization over the batch dimension and vectorized BLAS operations can reduce constant factors but do not change asymptotic complexity.

Code Examples

Numerically Stable Softmax and Multiclass Cross-Entropy (from logits or probabilities)
#include <bits/stdc++.h>
using namespace std;

// Compute softmax probabilities using the log-sum-exp trick for stability.
vector<double> softmax(const vector<double>& logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    vector<double> exps(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        exps[i] = exp(logits[i] - m);
        sumExp += exps[i];
    }
    vector<double> probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) probs[i] = exps[i] / sumExp;
    return probs;
}

// Cross-entropy from probabilities and one-hot label.
double cross_entropy_from_probs(const vector<double>& probs, const vector<int>& one_hot) {
    const double eps = 1e-15; // avoid log(0)
    if (probs.size() != one_hot.size()) throw runtime_error("Size mismatch");
    double loss = 0.0;
    for (size_t i = 0; i < probs.size(); ++i) {
        if (one_hot[i]) {
            loss = -log(max(probs[i], eps));
            break; // one-hot has a single 1
        }
    }
    return loss;
}

// Cross-entropy directly from logits and true class index using stable log-sum-exp.
double cross_entropy_from_logits(const vector<double>& logits, int y) {
    if (y < 0 || y >= (int)logits.size()) throw runtime_error("Label out of range");
    double m = *max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (double z : logits) sumExp += exp(z - m);
    // L = -(z_y - m) + log(sum_j exp(z_j - m))
    return -(logits[y] - m) + log(sumExp);
}

int main() {
    // Example: 4-class problem, logits for one sample
    vector<double> logits = {2.3, -1.2, 0.7, 1.1};
    int true_class = 0; // class index 0..3

    vector<double> probs = softmax(logits);
    vector<int> one_hot = {1, 0, 0, 0};

    double ce_probs = cross_entropy_from_probs(probs, one_hot);
    double ce_logits = cross_entropy_from_logits(logits, true_class);

    cout << fixed << setprecision(6);
    cout << "Probabilities: ";
    for (double p : probs) cout << p << ' ';
    cout << "\nCE (from probs): " << ce_probs << "\nCE (from logits): " << ce_logits << "\n";
    return 0;
}

This program shows two stable ways to compute multiclass cross-entropy. The softmax function uses the log-sum-exp trick by subtracting the maximum logit. The loss is computed either from probabilities with an epsilon guard or directly from logits, which is preferred for both stability and efficiency.

Time: O(k) per example · Space: O(k) for temporary arrays
Stable Binary Cross-Entropy from Logits with Gradient
#include <bits/stdc++.h>
using namespace std;

// Numerically stable sigmoid
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

// Stable BCE computed directly from logit z and label y in {0,1}
static inline double bce_from_logit(double z, int y) {
    // L = max(z,0) - z*y + log(1 + exp(-|z|))
    double a = fabs(z);
    return max(z, 0.0) - z * (double)y + log1p(exp(-a));
}

// Gradient dL/dz = sigmoid(z) - y
static inline double bce_grad_wrt_logit(double z, int y) {
    return sigmoid(z) - (double)y;
}

int main() {
    vector<double> logits = {5.0, -5.0, 0.0, 2.0, -2.0};
    vector<int> labels = {1, 0, 1, 0, 1};

    cout << fixed << setprecision(6);
    for (size_t i = 0; i < logits.size(); ++i) {
        double z = logits[i];
        int y = labels[i];
        double loss = bce_from_logit(z, y);
        double grad = bce_grad_wrt_logit(z, y);
        double p = sigmoid(z);
        cout << "z=" << z << ", y=" << y
             << ", p(sigmoid)=" << p
             << ", loss(BCE)=" << loss
             << ", grad(dL/dz)=" << grad << "\n";
    }
    return 0;
}

This example implements a numerically stable binary cross-entropy directly from logits and computes its gradient. The stable form avoids overflow for large |z|, and the gradient simplifies to sigmoid(z) βˆ’ y, which is central to logistic regression training.

Time: O(1) per example · Space: O(1)
Softmax Regression (Multinomial Logistic Regression) Training on a Toy Dataset
#include <bits/stdc++.h>
using namespace std;

// Utilities for stable softmax and loss
vector<double> softmax(const vector<double>& z) {
    double m = *max_element(z.begin(), z.end());
    double sumExp = 0.0;
    vector<double> e(z.size());
    for (size_t i = 0; i < z.size(); ++i) { e[i] = exp(z[i] - m); sumExp += e[i]; }
    for (size_t i = 0; i < z.size(); ++i) e[i] /= sumExp;
    return e;
}

double nll_from_logits(const vector<double>& z, int y) {
    double m = *max_element(z.begin(), z.end());
    double sumExp = 0.0;
    for (double zi : z) sumExp += exp(zi - m);
    return -(z[y] - m) + log(sumExp);
}

// Matrix-vector multiply: y = W x + b, W: k x d, x: d
vector<double> affine_logits(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& x) {
    size_t k = W.size(), d = x.size();
    vector<double> z(k);
    for (size_t i = 0; i < k; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d; ++j) s += W[i][j] * x[j];
        z[i] = s;
    }
    return z;
}

int main() {
    // Create a toy 2D, 3-class dataset (linearly separable-ish)
    std::mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.2);
    vector<vector<double>> X; vector<int> Y;
    int n_per_class = 60; // total n = 180
    vector<vector<double>> centers = {{2,2}, {-2,0}, {0,-2}};
    for (int c = 0; c < 3; ++c) {
        for (int i = 0; i < n_per_class; ++i) {
            vector<double> x = {centers[c][0] + noise(rng), centers[c][1] + noise(rng)};
            X.push_back(x); Y.push_back(c);
        }
    }

    // Model: W (k x d), b (k)
    size_t k = 3, d = 2;
    vector<vector<double>> W(k, vector<double>(d));
    vector<double> b(k, 0.0);
    // Initialize small random weights
    uniform_real_distribution<double> uni(-0.01, 0.01);
    for (size_t i = 0; i < k; ++i) for (size_t j = 0; j < d; ++j) W[i][j] = uni(rng);

    double lr = 0.1; int epochs = 50; int n = (int)X.size();

    for (int ep = 1; ep <= epochs; ++ep) {
        // Shuffle data each epoch
        vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
        shuffle(idx.begin(), idx.end(), rng);

        double epoch_loss = 0.0;
        // Simple SGD with batch size 1 (for brevity)
        for (int t = 0; t < n; ++t) {
            int i = idx[t];
            const vector<double>& x = X[i];
            int y = Y[i];
            // Forward
            vector<double> z = affine_logits(W, b, x);
            vector<double> p = softmax(z);
            double loss = -log(max(1e-15, p[y]));
            epoch_loss += loss;
            // Gradients: dL/dz = p - y_onehot
            vector<double> dz = p; dz[y] -= 1.0; // size k
            // Update W and b: dL/dW = dz * x^T, dL/db = dz
            for (size_t r = 0; r < k; ++r) {
                for (size_t j = 0; j < d; ++j) W[r][j] -= lr * dz[r] * x[j];
                b[r] -= lr * dz[r];
            }
        }
        cout << "Epoch " << ep << ": avg loss = " << (epoch_loss / n) << "\n";
    }

    // Evaluate accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        vector<double> z = affine_logits(W, b, X[i]);
        vector<double> p = softmax(z);
        int pred = (int)(max_element(p.begin(), p.end()) - p.begin());
        if (pred == Y[i]) ++correct;
    }
    cout << "Training accuracy: " << (100.0 * correct / n) << "%\n";
    return 0;
}

This program trains a softmax regression model with stochastic gradient descent on a toy 2D, 3-class dataset. It uses the classic gradient dL/dz = p βˆ’ y_onehot and updates weights with dL/dW = (p βˆ’ y) x^T. The example demonstrates how cross-entropy integrates with model training end-to-end.

Time: O(E n d k) for E epochs over n samples, d features, k classes · Space: O(dk) for parameters plus O(k) temporaries
Tags: cross-entropy · binary cross-entropy · softmax · sigmoid · negative log-likelihood · log-sum-exp · kl divergence · logistic regression · multiclass classification · numerical stability · label smoothing · perplexity · probability calibration · maximum likelihood