RLHF Mathematics
Key Points
- RLHF turns human preferences between two model outputs into training signals using a probabilistic model of choice.
- The Bradley–Terry model says the chance y₁ is preferred over y₂ is a sigmoid of the reward difference: P(y₁ ≻ y₂) = σ(r(y₁) − r(y₂)).
- Only differences in rewards matter; adding a constant to all rewards changes nothing, so we must fix a gauge (e.g., mean-zero) or regularize.
- Fitting the model maximizes a concave log-likelihood, which can be optimized by gradient ascent or (damped) Newton’s method.
- Pairwise preference learning is equivalent to logistic regression on the feature difference φ(y₁) − φ(y₂).
- Regularization and careful numerics (stable sigmoid, bounded steps) are essential to avoid overfitting and overflow.
- The method scales linearly with the number of comparisons and, with features, linearly in feature dimension.
- Beyond RLHF, Bradley–Terry is used in sports rankings, A/B testing, and information retrieval (pairwise ranking).
Prerequisites
- Logistic regression — Bradley–Terry with features is logistic regression on feature differences; understanding its loss and gradients directly applies.
- Probability and odds — Interpreting sigmoid outputs, log-odds, and Bernoulli likelihoods is central to the model.
- Convex optimization basics — Fitting the model uses gradient ascent/descent and benefits from knowledge of concavity and regularization.
- Linear algebra — Feature-based models use dot products; Hessian structure relates to graph Laplacians.
- Numerical stability — Stable evaluation of exp/sigmoid prevents overflow/underflow in training.
Detailed Explanation
Overview
Reinforcement Learning from Human Feedback (RLHF) often relies on pairwise comparisons: given two candidate responses to the same prompt, which one is better? The Bradley–Terry model provides a simple, powerful way to turn these comparisons into probabilities. Each response y is assigned a real-valued score r(y), interpreted as “how good this response is.” The probability that response y₁ is preferred over y₂ is modeled as a logistic (sigmoid) function of their score difference: the larger r(y₁) − r(y₂), the more likely y₁ is chosen. Because only differences matter, we can learn r by maximizing the likelihood of the observed human choices. This yields a concave maximization problem (equivalently, convex minimization of the negative log-likelihood) with well-known gradients and Hessians, allowing efficient training.

In RLHF practice, r(y) is the output of a reward model (often a neural network) that maps text to a scalar. The pairwise Bradley–Terry loss is differentiable and stable, making it a standard choice for training reward models before using them to guide policy optimization.

The same mathematics also underlies classic ranking tasks (sports, search) and connects closely to logistic regression when r is linear in features. The key advantages are simplicity, interpretability (odds scale), and scalability.
Intuition & Analogies
Imagine a taste test for two cookies, A and B. If A is much tastier than B, almost everyone will choose A; if they taste nearly the same, choices are closer to a coin flip. The Bradley–Terry model turns this story into math by assigning each cookie a hidden “tastiness score” r. The chance someone picks A over B depends only on the difference r(A)−r(B). Pass that difference through the S-shaped sigmoid curve σ, and you get a probability between 0 and 1. Big positive differences push the probability near 1 (A almost surely wins), big negative differences push it near 0 (B almost surely wins), and small differences hover near 0.5 (hard to tell).

In RLHF, replace cookies with model responses. Humans compare two responses to the same prompt and say which is better. We assume there’s an underlying reward r(y) the human is implicitly judging. We don’t see r directly, but by observing many choices, we can back out r so that the model’s predicted preference probabilities match human behavior.

Another way to see it: think of two teams playing a game. The log-odds that team 1 beats team 2 equals the difference in their strengths. That’s exactly what Bradley–Terry encodes: log-odds = r₁−r₂. This odds view makes interpretation intuitive: increasing r₁ by 1 multiplies the odds of winning by e.

Finally, because only differences matter, sliding all scores up by the same constant changes nothing—so we must anchor the scale by making the average score zero or by adding regularization.
Formal Definition
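In symbols, using the notation from the rest of this article:

```latex
% Logistic sigmoid and the Bradley–Terry preference probability
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
P(y_1 \succ y_2) = \sigma\bigl(r(y_1) - r(y_2)\bigr)
  = \frac{e^{r(y_1)}}{e^{r(y_1)} + e^{r(y_2)}}.

% Given m comparisons (a_k, b_k, y_k), with y_k = +1 if a_k was preferred
% and y_k = -1 otherwise, the regularized log-likelihood to maximize is
\mathcal{L}(r) = \sum_{k=1}^{m} \log \sigma\bigl(y_k \,(r(a_k) - r(b_k))\bigr)
  - \frac{\lambda}{2} \sum_{i=1}^{n} r(i)^2 .
```

Because P depends on r only through differences, the unregularized likelihood is invariant to a constant shift of all scores; the λ term (or a mean-zero constraint) fixes that gauge.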
When to Use
Use the Bradley–Terry model whenever your supervision is pairwise preferences instead of absolute scores. In RLHF, humans often find “A is better than B” judgments easier and more consistent than giving numeric ratings, making pairwise data abundant. It is well suited for training a reward model r(y) from comparisons before using that model to optimize a policy. Beyond RLHF, apply it to rank sports teams from match outcomes, compare products in A/B tests, or learn to rank search results from click data (pairwise click preferences). Choose it when: (1) you want a probabilistic, differentiable objective compatible with gradient methods; (2) you need interpretable odds (log-odds = score difference); (3) your data’s graph of comparisons is reasonably connected so information can propagate. Prefer the feature-based (logistic regression) variant when generalizing to unseen items via features φ(y) matters (e.g., new model responses). If you require listwise ranking over more than two items at once, consider the Plackett–Luce extension, but pairwise BT remains a solid baseline for scalability and robustness.
⚠️Common Mistakes
- Ignoring identifiability: Adding a constant to all r(y) leaves probabilities unchanged. Without anchoring (e.g., mean-zero constraint) or regularization, parameters can drift and training may be unstable.
- Numerical instability: Computing e^{-x} for large |x| can overflow/underflow. Always use a numerically stable sigmoid.
- Data sparsity or disconnected comparison graph: If some items never connect to others via comparison paths, their scores are weakly determined or undefined. Ensure sufficient coverage or add priors/regularization.
- Overfitting: With few comparisons per item, unregularized fits memorize noise. Use ℓ₂ penalties, early stopping, or Bayesian priors.
- Label encoding mix-ups: Be consistent about whether o ∈ {0, 1} or y ∈ {±1}. The gradient formulas differ by a simple transformation; a mismatch silently degrades training.
- Using accuracy alone: Raw win-rate accuracy hides calibration. Check log-likelihood, AUC, and calibration (e.g., reliability plots) to ensure probabilities are meaningful.
- Forgetting that only differences matter: Penalizing absolute r(y) with asymmetric constraints can bias results. Prefer symmetric constraints (mean-zero) or isotropic ℓ₂.
- Too-large learning rates: Because the sigmoid saturates, large steps can stall learning at extremes. Use smaller steps or second-order methods (damped Newton/IRLS).
Key Formulas
Logistic Sigmoid
σ(x) = 1 / (1 + e^{−x})
Explanation: Maps any real input to (0,1), used to convert score differences into probabilities. It is S-shaped, approaching 0 or 1 for large negative or positive inputs.
Bradley–Terry Probability
P(y₁ ≻ y₂) = σ(r(y₁) − r(y₂)) = e^{r(y₁)} / (e^{r(y₁)} + e^{r(y₂)})
Explanation: The probability that y₁ is preferred to y₂ depends only on the difference of their rewards. Larger differences increase confidence of preference.
Log-Likelihood for Pairwise Data
ℓ(r) = Σₖ log σ(r(wₖ) − r(lₖ)), where wₖ is the winner and lₖ the loser of comparison k
Explanation: This sums the log-probabilities assigned to each observed choice. Maximizing this fits r to match observed preferences.
Gradient w.r.t. Item Scores
∂ℓ/∂rᵢ = Σ_{k: wₖ=i} (1 − pₖ) − Σ_{k: lₖ=i} (1 − pₖ) − λrᵢ, with pₖ = σ(r(wₖ) − r(lₖ))
Explanation: Each comparison contributes error terms based on how surprising the outcome is under current scores. The last term is regularization.
Compact Loss with ±1 Labels
ℓ(r) = Σₖ log σ(yₖ (r(aₖ) − r(bₖ))), with yₖ = +1 if aₖ is preferred and yₖ = −1 otherwise
Explanation: Using ±1 labels simplifies the expression. Preferred outcome has positive margin; the log-sigmoid encourages large positive margins.
Feature-based Reward (Logistic Regression)
r(y) = wᵀφ(y), so P(y₁ ≻ y₂) = σ(wᵀ(φ(y₁) − φ(y₂)))
Explanation: If reward is linear in features, BT reduces to logistic regression on the difference of feature vectors, enabling standard ML tooling.
Gradient for Pairwise Logistic Regression
∇_w ℓ = Σₖ (oₖ − pₖ) Δφₖ − λw, where Δφₖ = φ(y₁ₖ) − φ(y₂ₖ), pₖ = σ(wᵀΔφₖ), and oₖ ∈ {0, 1}
Explanation: Classic logistic regression gradient where the feature is the difference of the two items’ features. Regularization keeps weights bounded.
Log-Odds Equals Score Difference
log [ P(y₁ ≻ y₂) / P(y₂ ≻ y₁) ] = r(y₁) − r(y₂)
Explanation: The log-odds are linear in the score difference, making interpretation of r straightforward on the odds scale.
Hessian (Laplacian Structure)
−∇²ℓ = Σₖ pₖ(1 − pₖ)(e_{aₖ} − e_{bₖ})(e_{aₖ} − e_{bₖ})ᵀ, where eᵢ is the i-th standard basis vector
Explanation: The negative Hessian is a weighted graph Laplacian over items, guaranteeing concavity and efficient second-order methods when the graph is connected.
ℓ₂ Regularized Objective
J(r) = Σₖ log σ(yₖ (r(aₖ) − r(bₖ))) − (λ/2) Σᵢ rᵢ², maximized over r
Explanation: Penalizes large scores to improve generalization and fix the gauge implicitly. The strength controls the trade-off.
Complexity Analysis
Each training epoch visits every comparison once, so fitting item scores costs O(m) time per epoch for m comparisons, plus O(n) for the update and mean-centering of n items; memory is O(n + m). The feature-based variant costs O(md) per epoch for feature dimension d. Damped Newton/IRLS solves a linear system in the (Laplacian-structured) Hessian per iteration, which is more expensive per step but typically needs far fewer iterations.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Pair { int a, b; int winner; }; // winner = 0 means a>b, 1 means b>a

// Numerically stable sigmoid
static inline double sigmoid(double x) {
    if (x >= 0) {
        double z = exp(-x);
        return 1.0 / (1.0 + z);
    } else {
        double z = exp(x);
        return z / (1.0 + z);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Generate synthetic data
    int n = 8;     // number of items (responses)
    int m = 5000;  // number of comparisons
    unsigned seed = 42;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);
    uniform_int_distribution<int> uid(0, n - 1);

    // Ground-truth scores (mean-centered)
    vector<double> r_true(n);
    for (int i = 0; i < n; ++i) r_true[i] = nd(rng);
    double mean = accumulate(r_true.begin(), r_true.end(), 0.0) / n;
    for (double &x : r_true) x -= mean;

    // Create comparisons according to Bradley–Terry
    vector<Pair> data;
    data.reserve(m);
    for (int k = 0; k < m; ++k) {
        int a = uid(rng), b = uid(rng);
        while (b == a) b = uid(rng);
        double p = sigmoid(r_true[a] - r_true[b]);
        bernoulli_distribution bd(p);
        bool a_wins = bd(rng);
        Pair pr{a, b, a_wins ? 0 : 1};
        data.push_back(pr);
    }

    // 2) Train Bradley–Terry scores with gradient ascent
    vector<double> r(n, 0.0);  // initialize learned scores to 0
    double lr = 1e-3;          // learning rate (small: the gradient below is a sum over all m pairs)
    double lambda = 1e-3;      // L2 regularization strength (stabilizes scale)
    int epochs = 20;

    vector<double> grad(n, 0.0);
    for (int ep = 0; ep < epochs; ++ep) {
        fill(grad.begin(), grad.end(), 0.0);
        shuffle(data.begin(), data.end(), rng);
        double loglik = 0.0;
        for (const auto &pr : data) {
            int a = pr.a, b = pr.b;
            // Outcome y in {+1, -1}: +1 means a preferred, -1 means b preferred
            int y = (pr.winner == 0) ? +1 : -1;
            double z = (r[a] - r[b]) * y;  // signed margin
            double p = sigmoid(z);         // probability of observed outcome
            loglik += log(max(1e-15, p));  // accumulate log-likelihood (safe)
            double coeff = (1.0 - p) * y;  // gradient of log sigma(z) wrt (r[a]-r[b])
            grad[a] += coeff;
            grad[b] -= coeff;
        }
        // Add L2 regularization gradient and update
        for (int i = 0; i < n; ++i) {
            grad[i] -= lambda * r[i];
            r[i] += lr * grad[i];
        }
        // Mean-center to fix gauge (optional but helps stability)
        double mu = accumulate(r.begin(), r.end(), 0.0) / n;
        for (double &x : r) x -= mu;

        if ((ep + 1) % 5 == 0) {
            cerr << "Epoch " << (ep + 1) << ", avg loglik per pair = " << (loglik / m) << "\n";
        }
    }

    // 3) Evaluate: rank correlation with ground truth & sample predictions
    vector<int> idx(n);
    iota(idx.begin(), idx.end(), 0);
    sort(idx.begin(), idx.end(), [&](int i, int j){ return r[i] > r[j]; });
    cout << "Learned scores (descending):\n";
    for (int id : idx) cout << "item " << id << ": r=" << r[id] << ", r_true=" << r_true[id] << "\n";

    // Predict win probability for a sample pair
    int A = 0, B = 1;
    double pred = sigmoid(r[A] - r[B]);
    cout << fixed << setprecision(3);
    cout << "P(item " << A << " > item " << B << ") = " << pred << "\n";
    return 0;
}
```
This program synthesizes pairwise comparisons from hidden true scores, then learns item scores r by maximizing the Bradley–Terry log-likelihood using gradient ascent with L2 regularization and mean-centering. It reports learned vs. true scores and predicts a sample preference probability. The sigmoid is implemented stably to avoid overflow/underflow.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Stable sigmoid
static inline double sigmoid(double x) {
    if (x >= 0) { double z = exp(-x); return 1.0 / (1.0 + z); }
    else { double z = exp(x); return z / (1.0 + z); }
}

struct Example {
    vector<double> f1; // features of response 1
    vector<double> f2; // features of response 2
    int y;             // 1 if response 1 preferred, 0 otherwise
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int d = 10;   // feature dimension
    int m = 4000; // number of pairwise examples
    unsigned seed = 7;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);

    // Ground-truth linear reward: r(y) = w*^T phi(y)
    vector<double> w_true(d);
    for (int i = 0; i < d; ++i) w_true[i] = nd(rng);

    // Generate synthetic examples
    vector<Example> data;
    data.reserve(m);
    for (int k = 0; k < m; ++k) {
        Example ex;
        ex.f1.resize(d);
        ex.f2.resize(d);
        for (int i = 0; i < d; ++i) { ex.f1[i] = nd(rng); ex.f2[i] = nd(rng); }
        // Preference probability depends on the difference of features
        double z = 0.0;
        for (int i = 0; i < d; ++i) z += w_true[i] * (ex.f1[i] - ex.f2[i]);
        double p = sigmoid(z);
        bernoulli_distribution bd(p);
        ex.y = bd(rng) ? 1 : 0;
        data.push_back(move(ex));
    }

    // Train logistic regression on feature differences
    vector<double> w(d, 0.0);
    double lr = 0.2;      // learning rate
    double lambda = 1e-3; // L2 regularization
    int epochs = 15;

    vector<int> order(m);
    iota(order.begin(), order.end(), 0);
    for (int ep = 0; ep < epochs; ++ep) {
        shuffle(order.begin(), order.end(), rng);
        double loglik = 0.0;
        // Accumulate gradient (full-batch here for simplicity)
        vector<double> grad(d, 0.0);
        for (int idx : order) {
            const auto &ex = data[idx];
            // Delta features
            double z = 0.0;
            for (int i = 0; i < d; ++i) z += w[i] * (ex.f1[i] - ex.f2[i]);
            double p = sigmoid(z);
            loglik += ex.y ? log(max(1e-15, p)) : log(max(1e-15, 1.0 - p));
            double err = (ex.y - p); // gradient coefficient
            for (int i = 0; i < d; ++i) grad[i] += err * (ex.f1[i] - ex.f2[i]);
        }
        for (int i = 0; i < d; ++i) {
            grad[i] -= lambda * w[i];
            w[i] += lr * grad[i] / m; // average gradient update
        }
        if ((ep + 1) % 5 == 0) cerr << "Epoch " << (ep + 1) << ": avg loglik = " << (loglik / m) << "\n";
    }

    // Evaluate: accuracy on the training set (for demo) and alignment with w_true
    int correct = 0;
    for (const auto &ex : data) {
        double z = 0.0;
        for (int i = 0; i < d; ++i) z += w[i] * (ex.f1[i] - ex.f2[i]);
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == ex.y);
    }
    cout << fixed << setprecision(3);
    cout << "Training accuracy: " << (double)correct / m << "\n";

    // Cosine similarity between learned and true weight directions
    double dot = 0, n1 = 0, n2 = 0;
    for (int i = 0; i < d; ++i) { dot += w[i] * w_true[i]; n1 += w[i] * w[i]; n2 += w_true[i] * w_true[i]; }
    cout << "Cosine similarity(w, w_true): " << (dot / (sqrt(n1) * sqrt(n2) + 1e-12)) << "\n";

    return 0;
}
```
This example trains a linear reward model from pairwise comparisons, which is exactly logistic regression on the feature differences. It synthesizes data from a ground-truth weight vector, fits w with gradient ascent on log-likelihood plus L2 regularization, and reports training accuracy and alignment with the true weights.
```cpp
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double x) {
    if (x >= 0) { double z = exp(-x); return 1.0 / (1.0 + z); }
    else { double z = exp(x); return z / (1.0 + z); }
}

struct Pair { int i, j; double uncertainty; };

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Suppose we have current Bradley–Terry scores r for n items
    int n = 10;
    vector<double> r = {0.9, -0.2, 1.3, 0.1, -1.1, 0.4, -0.7, 0.0, 0.8, -0.5};

    // For active data collection, prioritize pairs with high predictive uncertainty.
    // A natural measure is p(1-p) where p = sigma(r[i]-r[j]); it is maximized near 0.5.
    vector<Pair> candidates;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double p = sigmoid(r[i] - r[j]);
            double u = p * (1.0 - p); // variance of Bernoulli; proxy for informativeness
            candidates.push_back({i, j, u});
        }
    }

    int K = 8; // request top-K most uncertain pairs
    int k_eff = min(K, (int)candidates.size());
    nth_element(candidates.begin(), candidates.begin() + k_eff, candidates.end(),
                [](const Pair &a, const Pair &b){ return a.uncertainty > b.uncertainty; });
    sort(candidates.begin(), candidates.begin() + k_eff,
         [](const Pair &a, const Pair &b){ return a.uncertainty > b.uncertainty; });

    cout << fixed << setprecision(4);
    cout << "Top-" << K << " most uncertain pairs (i,j) with uncertainty p(1-p):\n";
    for (int k = 0; k < k_eff; ++k) {
        auto &c = candidates[k];
        cout << "(" << c.i << "," << c.j << ")\tunc=" << c.uncertainty << "\n";
    }

    return 0;
}
```
Given current scores, this utility selects the top-K pairs with highest predictive uncertainty p(1−p). These are most informative for labeling in the next round of data collection, aligning with active learning principles in RLHF preference gathering.