How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Cross-Entropy

Key Points

  • Cross-entropy measures how well a proposed distribution Q predicts outcomes actually generated by a true distribution P.
  • It equals the entropy of P plus the Kullback–Leibler divergence from P to Q: H(P,Q) = H(P) + D_KL(P‖Q).
  • Lower cross-entropy means Q puts higher probability on outcomes that P considers likely; the minimum is reached when Q = P.
  • In machine learning, minimizing cross-entropy is equivalent to maximizing the log-likelihood of the observed data.
  • For classification, softmax cross-entropy is the standard loss for multi-class problems, and binary cross-entropy is used for two classes.
  • Numerical stability is crucial: use log-sum-exp for softmax and stable logistic-loss formulas to avoid overflow/underflow.
  • The computational cost is linear in the number of classes and data points, typically O(nK) for n samples and K classes.
  • Perplexity is the exponential of cross-entropy and interprets the loss as an effective branching factor of a predictive model.

Prerequisites

  • →Basic Probability — Understanding probability mass functions and expectations is essential to interpret p(x), q(x), and averages.
  • →Logarithms and Exponents — Cross-entropy uses logarithms; knowing log bases, log-sum-exp, and exponent properties is crucial.
  • →Entropy — Cross-entropy generalizes entropy; recognizing H(P) helps interpret minimal coding cost.
  • →KL Divergence — The relation H(P,Q) = H(P) + D_KL(P||Q) underpins interpretation and inequalities.
  • →Linear Algebra (vectors/matrices) — Model logits, probabilities, and gradients are vector/matrix operations in code.
  • →Calculus for Gradients — Training with cross-entropy relies on derivatives with respect to logits and parameters.
  • →Floating-Point Numerics — Awareness of overflow/underflow motivates stable implementations (log-sum-exp, log1p).
  • →Optimization Basics — Gradient descent and regularization concepts are needed to train models with cross-entropy.

Detailed Explanation


01 Overview

Cross-entropy is a way to quantify how well one probability distribution (Q) predicts data that actually follows another distribution (P). Imagine you have a model that assigns probabilities to possible outcomes. Cross-entropy tells you the average number of "surprise bits" you incur when you use Q to encode events that really come from P. Formally, for discrete variables, it is H(P,Q) = −∑_x p(x) log q(x). If Q matches P perfectly, cross-entropy is just the entropy of P, the irreducible uncertainty in the process. If Q mismatches P, the cross-entropy is larger, reflecting extra coding cost or prediction error.

In machine learning, cross-entropy appears as the standard loss for probabilistic classification and language modeling: it is exactly the negative log-likelihood of the data under the model. Training by minimizing cross-entropy pushes the model to assign higher probability to observed outcomes. Two common special cases are binary cross-entropy (for two-class problems, often with a sigmoid output) and softmax cross-entropy (for multi-class problems, with probabilities from a softmax over logits). Because it directly penalizes miscalibrated probabilities, cross-entropy is not just about getting the right label, but about predicting well-calibrated probabilities.

Computationally, cross-entropy is straightforward and efficient: it is a sum or average of −log predicted probabilities for the correct outcomes. However, numerical stability must be handled carefully, especially when probabilities are very small. Techniques like log-sum-exp for softmax and stable logistic-loss formulas are standard to ensure robust computation.

02 Intuition & Analogies

Think of guessing the next word in a sentence. If your model says "the" has probability 0.2 and a rare word has probability 0.001, and the true next word is actually "the," then your guess was pretty good—you won't be very surprised. But if your model assigned a tiny probability to the word that actually appears, your surprise is huge. Cross-entropy takes that feeling of surprise, converts it into a numeric penalty (−log of the assigned probability), and averages it over many trials.

A helpful analogy is data compression. Suppose you design a code to compress messages based on your belief Q about symbol frequencies. If your belief Q matches the true frequencies P, you achieve the theoretical compression limit, the entropy H(P). If you're wrong, your code wastes bits: you spend more bits than necessary on common symbols or too few bits on rare ones and have to compensate overall. The average wasted bits per symbol is exactly the KL divergence D_KL(P‖Q), which adds on top of H(P) to form the cross-entropy H(P,Q) = H(P) + D_KL(P‖Q). So cross-entropy is the total expected code length when coding P using a code optimized for Q.

In classification, each example says, "the correct class is k." Your model provides a probability for each class. The cross-entropy loss for that example is −log of the model's probability of class k. If the model is confident and correct (probability near 1), the penalty is near 0; if it's confident and wrong (probability near 0), the penalty skyrockets. This gradually nudges the model to move probability mass toward the correct labels in future predictions.

03 Formal Definition

Let P and Q be probability distributions on a discrete set X. The cross-entropy from P to Q is defined as H(P,Q) = −∑_{x∈X} p(x) log q(x), where p(x) and q(x) are the probability mass functions of P and Q, respectively, and the logarithm base determines the unit (bits for base 2, nats for base e). If there exists an x with p(x) > 0 but q(x) = 0, we define H(P,Q) = +∞, since the event occurs under P but is deemed impossible by Q. Entropy is H(P) = −∑_x p(x) log p(x). The Kullback–Leibler divergence is D_KL(P‖Q) = ∑_x p(x) log(p(x)/q(x)). These satisfy H(P,Q) = H(P) + D_KL(P‖Q), and therefore H(P,Q) ≥ H(P) with equality iff P = Q almost everywhere. In expectation notation, H(P,Q) = E_{x∼P}[−log q(x)]. For empirical data {x_i}_{i=1}^n sampled i.i.d. from P, an unbiased estimator of cross-entropy is the sample mean: Ĥ(P,Q) = −(1/n) ∑_{i=1}^n log q(x_i). In supervised multi-class learning, with a one-hot label vector y (y_k = 1 for the correct class k), model logits z produce probabilities via softmax: q_i = e^{z_i} / ∑_j e^{z_j}. The per-example softmax cross-entropy is L = −∑_i y_i log q_i = −log q_k. In binary classification with logit z and label y ∈ {0,1}, the binary cross-entropy can be written stably as ℓ(z,y) = log(1 + e^z) − yz, which equals −[y log σ(z) + (1−y) log(1−σ(z))] with σ the sigmoid.

04 When to Use

Use cross-entropy whenever your model outputs probabilities and you want to compare them to true outcomes: classification, language modeling, next-token prediction, and probabilistic forecasting. In supervised classification, softmax cross-entropy is the default for K-class problems; it directly optimizes log-likelihood and yields convenient gradients (softmax probability minus one-hot label). For binary outcomes, use binary cross-entropy with a single logit and a sigmoid. Cross-entropy is also appropriate in density estimation and generative modeling, where minimizing cross-entropy between the data distribution and the model distribution corresponds to maximum likelihood. In information retrieval or recommendation, if you frame the task as choosing the correct item among many, cross-entropy over a softmax of scores is standard (sometimes with sampled softmax for efficiency).

When you care about calibrated probabilities (not just accuracy), cross-entropy is preferable to 0–1 loss because it penalizes overconfident wrong predictions more strongly. It is also differentiable almost everywhere, enabling gradient-based training.

However, don't use cross-entropy when targets are not probabilities (e.g., regression to real values); use squared error or other regression losses instead. If classes are extremely imbalanced, you might modify cross-entropy with class weights or focal loss so rare classes get sufficient emphasis. When numeric ranges are large (e.g., many classes or extreme logits), always use numerically stable implementations (log-sum-exp, stable logistic loss).

⚠️Common Mistakes

• Feeding raw logits into −log without softmax: For multi-class problems, you must convert logits to probabilities using softmax (or use the equivalent log-sum-exp form). Skipping this leads to meaningless losses.
• Numerical underflow/overflow: Computing e^z directly for large |z| can overflow; small probabilities can underflow to zero. Use log-sum-exp for softmax cross-entropy and the stable logistic-loss formula log(1 + e^z) − yz for binary cross-entropy. Many libraries also provide log1p and expm1 for better numerical behavior.
• Not guarding q(x) = 0 when p(x) > 0: Theoretically this yields infinite loss. In code, clamp probabilities with a small ε, or better, operate in log-space to avoid taking log(0).
• Mismatched labels and predictions: Using integer class IDs with a loss that expects probability vectors, or mixing one-hot and sparse formulations inconsistently. Ensure the ground-truth format matches the loss API.
• Ignoring normalization: For discrete distributions p and q, forgetting to normalize to sum to 1 skews cross-entropy and KL results. Always verify sums and handle negative or NaN entries.
• Wrong log base interpretation: The base of the logarithm changes the unit (bits vs nats) and the perplexity calculation. Be consistent across experiments, especially when comparing to reported baselines.
• Averaging vs summing: Mixing up mean loss and sum loss across a batch changes the gradients' effective scale. Use a consistent reduction and tune learning rates accordingly.
• Label smoothing misuse: Applying label smoothing can improve calibration but changes the effective target distribution. Don't compare smoothed-loss values directly to unsmoothed ones without noting the change.

Key Formulas

Cross-Entropy (Discrete)

H(P,Q) = −∑_{x∈X} p(x) log q(x)

Explanation: Average surprise when encoding outcomes from P using probabilities from Q. Lower values indicate Q aligns better with P.

Entropy

H(P) = −∑_{x∈X} p(x) log p(x)

Explanation: Intrinsic uncertainty of P. This is the best possible average code length if you know P exactly.

KL Divergence

D_KL(P‖Q) = ∑_{x∈X} p(x) log(p(x)/q(x))

Explanation: Extra cost of using Q instead of the true P. It is always nonnegative and zero iff P = Q almost everywhere.

Cross-Entropy Decomposition

H(P,Q) = H(P) + D_KL(P‖Q)

Explanation: Cross-entropy equals irreducible uncertainty plus mismatch penalty. It shows why H(P,Q) ≥ H(P).

Expectation Form

H(P,Q) = E_{x∼P}[−log q(x)]

Explanation: Cross-entropy is the P-expectation of negative log probability assigned by Q. Useful for Monte Carlo and empirical estimates.

Empirical Cross-Entropy

Ĥ(P,Q) = −(1/n) ∑_{i=1}^n log q(x_i)

Explanation: Given i.i.d. samples from P, the sample mean negative log-probability estimates cross-entropy and equals average NLL.

Softmax

q_i = softmax(z)_i = e^{z_i} / ∑_{j=1}^K e^{z_j}

Explanation: Turns logits into a valid probability distribution over K classes. Used before computing multi-class cross-entropy.

Softmax Cross-Entropy (One-Hot)

L = −∑_{i=1}^K y_i log q_i = −log q_k

Explanation: For a one-hot label y (class k), the loss is the negative log-probability assigned to the correct class.

Log-Sum-Exp Trick

log ∑_{i=1}^K e^{z_i} = m + log ∑_{i=1}^K e^{z_i − m},  where m = max_i z_i

Explanation: Stabilizes softmax computations by factoring out the maximum logit, preventing overflow/underflow.

Gradient of Softmax Cross-Entropy

∂L/∂z_i = softmax(z)_i − y_i

Explanation: Elegant gradient used in backprop: predicted probability minus one-hot target for each class.

Stable Binary Cross-Entropy (Logistic Loss)

ℓ(z,y) = log(1 + e^z) − yz

Explanation: Equivalent to −[y log σ(z) + (1−y) log(1−σ(z))] but numerically stable for large |z|.

Perplexity

PP = e^{H(P,Q)}

Explanation: The exponential of cross-entropy (with natural logs). Interpreted as the effective number of equally likely outcomes.

Cross-Entropy Lower Bound

H(P,Q)≥H(P)

Explanation: Cross-entropy is minimized when Q = P, otherwise it exceeds H(P) by the KL divergence.

Label Smoothing

y_k^LS = (1−α)·1[k=y] + α/K

Explanation: Smoothed targets mix the true class with a uniform distribution by factor α to reduce overconfidence.

Complexity Analysis

For discrete distributions with K outcomes, computing cross-entropy H(P,Q) requires a single pass over K entries: multiply p(x) by log q(x) and sum. This is O(K) time and O(1) extra space when streaming over the vectors. If you also compute H(P) and D_KL(P‖Q), you still remain O(K) with a small constant-factor increase. Normalizing inputs to valid probability vectors adds another O(K) pass.

In supervised learning with a batch of n examples and K classes, softmax cross-entropy requires computing the log-sum-exp for each example (O(K)) and subtracting the correct logit, yielding O(nK) time. The gradient with respect to the logits is also O(nK), since it is softmax minus one-hot. Memory-wise, if you compute in place and avoid storing the full softmax, you can keep O(K) temporary space per example; with a batch, that is O(nK) if you retain all probabilities for backprop. Many frameworks fuse these steps to reduce memory (e.g., computing gradients from logits and labels without materializing the softmax).

Binary cross-entropy for n examples uses O(nd) time when computing logits z = Xw for feature dimension d, plus O(n) for the loss itself. Using the stable logistic loss log(1 + e^z) − yz avoids overflow without changing complexity. The gradient computation is O(nd), and storing parameters is O(d). If you add L2 regularization, the cost remains linear.

In large-class settings (e.g., language models with large vocabularies), the naive O(K) per example can be expensive; approximations like sampled softmax or hierarchical softmax reduce the time to O(log K) or O(s) with s samples. Vectorization and cache-friendly memory layouts significantly affect practical runtime.

Code Examples

Cross-Entropy, Entropy, and KL for Discrete Distributions
#include <bits/stdc++.h>
using namespace std;

// Normalize a vector to sum to 1 and clamp negatives to 0
vector<double> normalize(const vector<double>& v) {
    vector<double> out(v.size());
    double sum = 0.0;
    for (double x : v) sum += max(0.0, x);
    if (sum == 0.0) throw runtime_error("All entries are non-positive; cannot normalize.");
    for (size_t i = 0; i < v.size(); ++i) out[i] = max(0.0, v[i]) / sum;
    return out;
}

// Safe log with epsilon to avoid log(0).
double safe_log(double x, double eps = 1e-15) {
    if (x <= eps) x = eps;
    return log(x); // natural log -> nats
}

struct InfoMeasures {
    double entropy_P;     // H(P)
    double cross_entropy; // H(P,Q)
    double kl_PQ;         // D_KL(P||Q)
};

InfoMeasures compute_info_measures(const vector<double>& p_in, const vector<double>& q_in) {
    if (p_in.size() != q_in.size()) throw runtime_error("Size mismatch between p and q.");
    vector<double> p = normalize(p_in);
    vector<double> q = normalize(q_in);

    double H_P = 0.0, H_PQ = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        double pi = p[i];
        if (pi == 0.0) continue; // term is 0 in both sums
        H_P  += -pi * safe_log(pi);
        H_PQ += -pi * safe_log(q[i]); // if q[i]=0 -> large penalty
    }
    double KL = H_PQ - H_P;
    return {H_P, H_PQ, KL};
}

int main() {
    // Example: P is near-uniform over 4 outcomes, Q is biased
    vector<double> P = {0.24, 0.26, 0.25, 0.25};
    vector<double> Q = {0.70, 0.10, 0.10, 0.10};

    try {
        InfoMeasures res = compute_info_measures(P, Q);
        cout.setf(ios::fixed); cout << setprecision(6);
        cout << "H(P)       = " << res.entropy_P << " nats\n";
        cout << "H(P,Q)     = " << res.cross_entropy << " nats\n";
        cout << "D_KL(P||Q) = " << res.kl_PQ << " nats\n";
        // Verify relation numerically
        cout << "Check: H(P)+KL = " << (res.entropy_P + res.kl_PQ) << "\n";
    } catch (const exception& e) {
        cerr << "Error: " << e.what() << "\n";
    }
    return 0;
}

This program normalizes two discrete distributions P and Q, then computes entropy H(P), cross-entropy H(P,Q), and KL divergence D_KL(P||Q). It safeguards logs with a small epsilon to avoid log(0). The output verifies the identity H(P,Q) = H(P) + D_KL(P||Q).

Time: O(K) where K is the number of outcomes.
Space: O(K) to store normalized vectors; O(1) extra during summation.
Numerically Stable Softmax Cross-Entropy and Gradient (Batch)
#include <bits/stdc++.h>
using namespace std;

// Compute stable log-sum-exp of a vector
static inline double log_sum_exp(const vector<double>& z) {
    double m = *max_element(z.begin(), z.end());
    double sum = 0.0;
    for (double v : z) sum += exp(v - m);
    return m + log(sum);
}

// Given logits (batch x K) and integer labels (batch), compute mean loss and gradient dL/dz
struct SCEOutput { double loss; vector<vector<double>> grad; };

SCEOutput softmax_cross_entropy_with_grad(const vector<vector<double>>& logits,
                                          const vector<int>& labels) {
    size_t n = logits.size();
    if (n == 0) throw runtime_error("Empty batch");
    size_t K = logits[0].size();
    if (labels.size() != n) throw runtime_error("Label size mismatch");
    for (const auto& row : logits) if (row.size() != K) throw runtime_error("Jagged logits");

    vector<vector<double>> grad(n, vector<double>(K, 0.0));
    double loss_sum = 0.0;

    for (size_t i = 0; i < n; ++i) {
        int yi = labels[i];
        if (yi < 0 || (size_t)yi >= K) throw runtime_error("Label out of range");
        const auto& z = logits[i];
        double lse = log_sum_exp(z); // log denominator
        // Loss: -z[yi] + logsumexp(z)
        loss_sum += (-z[yi] + lse);
        // Gradient: softmax(z) - one_hot(yi)
        for (size_t k = 0; k < K; ++k) {
            double pk = exp(z[k] - lse); // softmax prob
            grad[i][k] = pk - (k == (size_t)yi ? 1.0 : 0.0);
        }
    }

    SCEOutput out;
    out.loss = loss_sum / static_cast<double>(n);
    // Average gradient over batch (common reduction)
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < K; ++k)
            grad[i][k] /= static_cast<double>(n);
    out.grad = move(grad);
    return out;
}

int main() {
    // Example: 3 samples, 4 classes
    vector<vector<double>> logits = {
        {2.0, 1.0, 0.1, -0.5},  // true class 0
        {0.3, 1.2, -0.7, -0.2}, // true class 1
        {-1.0, 0.0, 0.5, 1.0}   // true class 3
    };
    vector<int> labels = {0, 1, 3};

    auto res = softmax_cross_entropy_with_grad(logits, labels);
    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Mean loss (nats): " << res.loss << "\n";

    cout << "Gradient (first sample):\n";
    for (double g : res.grad[0]) cout << g << ' ';
    cout << "\n";
    return 0;
}

This code computes the average softmax cross-entropy over a batch using the stable log-sum-exp trick and returns the gradient with respect to the logits. The gradient simplifies to softmax probabilities minus the one-hot labels, facilitating efficient backpropagation.

Time: O(nK) for n samples and K classes.
Space: O(nK) to store gradients; O(K) temporary per sample.
Binary Cross-Entropy (Stable) with Logistic Regression Training
#include <bits/stdc++.h>
using namespace std;

// Stable logistic loss: log(1 + exp(z)) - y*z
static inline double logistic_loss(double z, int y) {
    // For stability: use conditional form
    if (z >= 0) {
        return log1p(exp(-z)) + (1 - y) * z; // equals log(1+e^z) - y*z
    } else {
        return log1p(exp(z)) - y * z; // avoids overflow when z << 0
    }
}

// Sigmoid for predictions (may underflow for extreme z but fine for reporting)
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

// Compute loss and gradient over batch X (n x d), labels y in {0,1}
struct LossGrad { double loss; vector<double> grad_w; double grad_b; };

LossGrad loss_and_grad(const vector<vector<double>>& X, const vector<int>& y,
                       const vector<double>& w, double b, double l2 = 0.0) {
    size_t n = X.size();
    size_t d = w.size();
    if (y.size() != n) throw runtime_error("Label size mismatch");
    for (const auto& xi : X) if (xi.size() != d) throw runtime_error("Feature size mismatch");

    vector<double> grad_w(d, 0.0);
    double grad_b = 0.0;
    double loss_sum = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double z = inner_product(X[i].begin(), X[i].end(), w.begin(), 0.0) + b;
        loss_sum += logistic_loss(z, y[i]);
        double p = sigmoid(z);
        double err = p - static_cast<double>(y[i]); // derivative of loss wrt z
        for (size_t j = 0; j < d; ++j) grad_w[j] += err * X[i][j];
        grad_b += err;
    }

    // Average and add L2 regularization (if l2 > 0)
    double inv_n = 1.0 / static_cast<double>(n);
    for (size_t j = 0; j < d; ++j) grad_w[j] = grad_w[j] * inv_n + l2 * w[j];
    grad_b *= inv_n; // often bias is not regularized
    double reg = 0.5 * l2 * inner_product(w.begin(), w.end(), w.begin(), 0.0);
    double mean_loss = loss_sum * inv_n + reg;

    return {mean_loss, grad_w, grad_b};
}

int main() {
    // Create a simple linearly separable dataset in 2D
    vector<vector<double>> X; vector<int> y;
    mt19937 rng(42);
    normal_distribution<double> N1(-2.0, 1.0), N2(2.0, 1.0);
    for (int i = 0; i < 100; ++i) {
        double x0 = N1(rng), x1 = N1(rng); X.push_back({x0, x1}); y.push_back(0);
        double x2 = N2(rng), x3 = N2(rng); X.push_back({x2, x3}); y.push_back(1);
    }

    size_t d = 2;
    vector<double> w(d, 0.0);
    double b = 0.0;
    double lr = 0.1, l2 = 0.0;

    for (int it = 0; it < 200; ++it) {
        auto lg = loss_and_grad(X, y, w, b, l2);
        // Gradient descent update
        for (size_t j = 0; j < d; ++j) w[j] -= lr * lg.grad_w[j];
        b -= lr * lg.grad_b;
        if ((it + 1) % 50 == 0) {
            cout.setf(ios::fixed); cout << setprecision(6);
            cout << "Iter " << (it + 1) << ": loss = " << lg.loss << "\n";
        }
    }

    // Evaluate accuracy
    int correct = 0; int total = (int)X.size();
    for (int i = 0; i < total; ++i) {
        double z = inner_product(X[i].begin(), X[i].end(), w.begin(), 0.0) + b;
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == y[i]);
    }
    cout << "Training accuracy: " << (100.0 * correct / total) << "%\n";

    return 0;
}

This program trains a logistic regression classifier using the stable binary cross-entropy (logistic) loss. It computes the mean loss and gradients over a batch, applies gradient descent, and reports the training loss and accuracy. The implementation uses numerically stable formulas to prevent overflow.

Time: O(nd) per iteration for n samples and d features.
Space: O(d) for parameters and gradients; O(1) extra per sample.
#cross-entropy #entropy #kl-divergence #negative-log-likelihood #softmax #sigmoid #log-sum-exp #perplexity #label-smoothing #classification-loss #maximum-likelihood #probability-calibration #binary-cross-entropy #softmax-gradient #logistic-regression