RLHF Mathematics
Key Points
- RLHF turns human preferences between two model outputs into training signals using a probabilistic model of choice.
- The Bradley–Terry model says the chance y₁ is preferred over y₂ is a sigmoid of the reward difference: P(y₁ ≻ y₂) = σ(r(y₁) − r(y₂)).
- Only differences in rewards matter; adding a constant to all rewards changes nothing, so we must fix a gauge (e.g., mean-zero) or regularize.
- Fitting the model maximizes a concave log-likelihood, which can be optimized by gradient ascent or (damped) Newton’s method.
- Pairwise preference learning is equivalent to logistic regression on the feature difference φ(y₁) − φ(y₂).
- Regularization and careful numerics (stable sigmoid, bounded steps) are essential to avoid overfitting and overflow.
- The method scales linearly with the number of comparisons and, with features, linearly in feature dimension.
- Beyond RLHF, Bradley–Terry is used in sports rankings, A/B testing, and information retrieval (pairwise ranking).
Prerequisites
- Logistic regression — Bradley–Terry with features is logistic regression on feature differences; understanding its loss and gradients directly applies.
- Probability and odds — Interpreting sigmoid outputs, log-odds, and Bernoulli likelihoods is central to the model.
- Convex optimization basics — Fitting the model uses gradient ascent/descent and benefits from knowledge of concavity and regularization.
- Linear algebra — Feature-based models use dot products; Hessian structure relates to graph Laplacians.
- Numerical stability — Stable evaluation of exp/sigmoid prevents overflow/underflow in training.
Detailed Explanation
Overview
Reinforcement Learning from Human Feedback (RLHF) often relies on pairwise comparisons: given two candidate responses to the same prompt, which one is better? The Bradley–Terry model provides a simple, powerful way to turn these comparisons into probabilities. Each response y is assigned a real-valued score r(y), interpreted as “how good this response is.” The probability that response y₁ is preferred over y₂ is modeled as a logistic (sigmoid) function of their score difference: the larger r(y₁) − r(y₂), the more likely y₁ is chosen. Because only differences matter, we can learn r by maximizing the likelihood of the observed human choices. This yields a concave maximization problem (equivalently, convex minimization of the negative log-likelihood) with well-known gradients and Hessians, allowing efficient training.

In RLHF practice, r(y) is the output of a reward model (often a neural network) that maps text to a scalar. The pairwise Bradley–Terry loss is differentiable and stable, making it a standard choice for training reward models before using them to guide policy optimization.

The same mathematics also underlies classic ranking tasks (sports, search) and connects closely to logistic regression when r is linear in features. The key advantages are simplicity, interpretability (odds scale), and scalability.
Intuition & Analogies
Imagine a taste test for two cookies, A and B. If A is much tastier than B, almost everyone will choose A; if they taste nearly the same, choices are closer to a coin flip. The Bradley–Terry model turns this story into math by assigning each cookie a hidden “tastiness score” r. The chance someone picks A over B depends only on the difference r(A)−r(B). Pass that difference through the S-shaped sigmoid curve σ, and you get a probability between 0 and 1. Big positive differences push the probability near 1 (A almost surely wins), big negative differences push it near 0 (B almost surely wins), and small differences hover near 0.5 (hard to tell).

In RLHF, replace cookies with model responses. Humans compare two responses to the same prompt and say which is better. We assume there’s an underlying reward r(y) the human is implicitly judging. We don’t see r directly, but by observing many choices, we can back out r so that the model’s predicted preference probabilities match human behavior.

Another way to see it: think of two teams playing a game. The log-odds that team 1 beats team 2 equals the difference in their strengths. That’s exactly what Bradley–Terry encodes: log-odds = r₁−r₂. This odds view makes interpretation intuitive: increasing r₁ by 1 multiplies the odds of winning by e.

Finally, because only differences matter, sliding all scores up by the same constant changes nothing—so we must anchor the scale by making the average score zero or by adding regularization.
Formal Definition
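In symbols, using the notation from the rest of this article:

```latex
% Logistic sigmoid and the Bradley–Terry preference probability
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
P(y_1 \succ y_2) = \sigma\bigl(r(y_1) - r(y_2)\bigr)
  = \frac{e^{r(y_1)}}{e^{r(y_1)} + e^{r(y_2)}}.

% Given m comparisons (a_k, b_k, y_k), with y_k = +1 if a_k was preferred
% and y_k = -1 otherwise, the regularized log-likelihood to maximize is
\mathcal{L}(r) = \sum_{k=1}^{m} \log \sigma\bigl(y_k \,(r(a_k) - r(b_k))\bigr)
  - \frac{\lambda}{2} \sum_{i=1}^{n} r(i)^2 .
```

Because P depends on r only through differences, the unregularized likelihood is invariant to a constant shift of all scores; the λ term (or a mean-zero constraint) fixes that gauge.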
When to Use
Use the Bradley–Terry model whenever your supervision is pairwise preferences instead of absolute scores. In RLHF, humans often find “A is better than B” judgments easier and more consistent than giving numeric ratings, making pairwise data abundant. It is well suited for training a reward model r(y) from comparisons before using that model to optimize a policy. Beyond RLHF, apply it to rank sports teams from match outcomes, compare products in A/B tests, or learn to rank search results from click data (pairwise click preferences). Choose it when: (1) you want a probabilistic, differentiable objective compatible with gradient methods; (2) you need interpretable odds (log-odds = score difference); (3) your data’s graph of comparisons is reasonably connected so information can propagate. Prefer the feature-based (logistic regression) variant when generalizing to unseen items via features φ(y) matters (e.g., new model responses). If you require listwise ranking over more than two items at once, consider the Plackett–Luce extension, but pairwise BT remains a solid baseline for scalability and robustness.
⚠️Common Mistakes
- Ignoring identifiability: Adding a constant to all r(y) leaves probabilities unchanged. Without anchoring (e.g., mean-zero constraint) or regularization, parameters can drift and training may be unstable.
- Numerical instability: Computing e^{-x} for large |x| can overflow/underflow. Always use a numerically stable sigmoid.
- Data sparsity or disconnected comparison graph: If some items never connect to others via comparison paths, their scores are weakly determined or undefined. Ensure sufficient coverage or add priors/regularization.
- Overfitting: With few comparisons per item, unregularized fits memorize noise. Use ℓ₂ penalties, early stopping, or Bayesian priors.
- Label encoding mix-ups: Be consistent about whether o ∈ {0, 1} or y ∈ {±1}. The gradient formulas differ by a simple transformation; a mismatch silently degrades training.
- Using accuracy alone: Raw win-rate accuracy hides calibration. Check log-likelihood, AUC, and calibration (e.g., reliability plots) to ensure probabilities are meaningful.
- Forgetting that only differences matter: Penalizing absolute r(y) with asymmetric constraints can bias results. Prefer symmetric constraints (mean-zero) or isotropic ℓ₂.
- Too-large learning rates: Because the sigmoid saturates, large steps can stall learning at extremes. Use smaller steps or second-order methods (damped Newton/IRLS).
Key Formulas
Logistic Sigmoid
σ(x) = 1 / (1 + e^{−x})
Explanation: Maps any real input to (0,1), used to convert score differences into probabilities. It is S-shaped, approaching 0 or 1 for large negative or positive inputs.
Bradley–Terry Probability
P(y₁ ≻ y₂) = σ(r(y₁) − r(y₂)) = e^{r(y₁)} / (e^{r(y₁)} + e^{r(y₂)})
Explanation: The probability that y₁ is preferred to y₂ depends only on the difference of their rewards. Larger differences increase confidence of preference.
Log-Likelihood for Pairwise Data
ℓ(r) = Σₖ log σ(r(wₖ) − r(lₖ)), where wₖ is the winner and lₖ the loser of comparison k
Explanation: This sums the log-probabilities assigned to each observed choice. Maximizing this fits r to match observed preferences.
Gradient w.r.t. Item Scores
∂ℓ/∂rᵢ = Σ_{k: wₖ=i} (1 − pₖ) − Σ_{k: lₖ=i} (1 − pₖ) − λrᵢ, with pₖ = σ(r(wₖ) − r(lₖ))
Explanation: Each comparison contributes error terms based on how surprising the outcome is under current scores. The last term is regularization.
Compact Loss with ±1 Labels
ℓ(r) = Σₖ log σ(yₖ (r(aₖ) − r(bₖ))), with yₖ = +1 if aₖ is preferred and yₖ = −1 otherwise
Explanation: Using ±1 labels simplifies the expression. Preferred outcome has positive margin; the log-sigmoid encourages large positive margins.
Feature-based Reward (Logistic Regression)
r(y) = wᵀφ(y), so P(y₁ ≻ y₂) = σ(wᵀ(φ(y₁) − φ(y₂)))
Explanation: If reward is linear in features, BT reduces to logistic regression on the difference of feature vectors, enabling standard ML tooling.
Gradient for Pairwise Logistic Regression
∇_w ℓ = Σₖ (oₖ − pₖ) Δφₖ − λw, where Δφₖ = φ(y₁ₖ) − φ(y₂ₖ), pₖ = σ(wᵀΔφₖ), and oₖ ∈ {0, 1}
Explanation: Classic logistic regression gradient where the feature is the difference of the two items’ features. Regularization keeps weights bounded.
Log-Odds Equals Score Difference
log [ P(y₁ ≻ y₂) / P(y₂ ≻ y₁) ] = r(y₁) − r(y₂)
Explanation: The log-odds are linear in the score difference, making interpretation of r straightforward on the odds scale.
Hessian (Laplacian Structure)
−∇²ℓ = Σₖ pₖ(1 − pₖ)(e_{aₖ} − e_{bₖ})(e_{aₖ} − e_{bₖ})ᵀ, where eᵢ is the i-th standard basis vector
Explanation: The negative Hessian is a weighted graph Laplacian over items, guaranteeing concavity and efficient second-order methods when the graph is connected.
ℓ₂ Regularized Objective
J(r) = Σₖ log σ(yₖ (r(aₖ) − r(bₖ))) − (λ/2) Σᵢ rᵢ², maximized over r
Explanation: Penalizes large scores to improve generalization and fix the gauge implicitly. The strength controls the trade-off.
Complexity Analysis
Each training epoch visits every comparison once, so fitting item scores costs O(m) time per epoch for m comparisons, plus O(n) for the update and mean-centering of n items; memory is O(n + m). The feature-based variant costs O(md) per epoch for feature dimension d. Damped Newton/IRLS solves a linear system in the (Laplacian-structured) Hessian per iteration, which is more expensive per step but typically needs far fewer iterations.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Pair { int a, b; int winner; }; // winner = 0 means a>b, 1 means b>a

// Numerically stable sigmoid
static inline double sigmoid(double x) {
    if (x >= 0) {
        double z = exp(-x);
        return 1.0 / (1.0 + z);
    } else {
        double z = exp(x);
        return z / (1.0 + z);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Generate synthetic data
    int n = 8;     // number of items (responses)
    int m = 5000;  // number of comparisons
    unsigned seed = 42;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);
    uniform_int_distribution<int> uid(0, n - 1);

    // Ground-truth scores (mean-centered)
    vector<double> r_true(n);
    for (int i = 0; i < n; ++i) r_true[i] = nd(rng);
    double mean = accumulate(r_true.begin(), r_true.end(), 0.0) / n;
    for (double &x : r_true) x -= mean;

    // Create comparisons according to Bradley–Terry
    vector<Pair> data;
    data.reserve(m);
    for (int k = 0; k < m; ++k) {
        int a = uid(rng), b = uid(rng);
        while (b == a) b = uid(rng);
        double p = sigmoid(r_true[a] - r_true[b]);
        bernoulli_distribution bd(p);
        bool a_wins = bd(rng);
        Pair pr{a, b, a_wins ? 0 : 1};
        data.push_back(pr);
    }

    // 2) Train Bradley–Terry scores with gradient ascent
    vector<double> r(n, 0.0);  // initialize learned scores to 0
    double lr = 1e-3;          // learning rate (small: the gradient below is a sum over all m pairs)
    double lambda = 1e-3;      // L2 regularization strength (stabilizes scale)
    int epochs = 20;

    vector<double> grad(n, 0.0);
    for (int ep = 0; ep < epochs; ++ep) {
        fill(grad.begin(), grad.end(), 0.0);
        shuffle(data.begin(), data.end(), rng);
        double loglik = 0.0;
        for (const auto &pr : data) {
            int a = pr.a, b = pr.b;
            // Outcome y in {+1, -1}: +1 means a preferred, -1 means b preferred
            int y = (pr.winner == 0) ? +1 : -1;
            double z = (r[a] - r[b]) * y;  // signed margin
            double p = sigmoid(z);         // probability of observed outcome
            loglik += log(max(1e-15, p));  // accumulate log-likelihood (safe)
            double coeff = (1.0 - p) * y;  // gradient of log sigma(z) wrt (r[a]-r[b])
            grad[a] += coeff;
            grad[b] -= coeff;
        }
        // Add L2 regularization gradient and update
        for (int i = 0; i < n; ++i) {
            grad[i] -= lambda * r[i];
            r[i] += lr * grad[i];
        }
        // Mean-center to fix gauge (optional but helps stability)
        double mu = accumulate(r.begin(), r.end(), 0.0) / n;
        for (double &x : r) x -= mu;

        if ((ep + 1) % 5 == 0) {
            cerr << "Epoch " << (ep + 1) << ", avg loglik per pair = " << (loglik / m) << "\n";
        }
    }

    // 3) Evaluate: rank correlation with ground truth & sample predictions
    vector<int> idx(n);
    iota(idx.begin(), idx.end(), 0);
    sort(idx.begin(), idx.end(), [&](int i, int j){ return r[i] > r[j]; });
    cout << "Learned scores (descending):\n";
    for (int id : idx) cout << "item " << id << ": r=" << r[id] << ", r_true=" << r_true[id] << "\n";

    // Predict win probability for a sample pair
    int A = 0, B = 1;
    double pred = sigmoid(r[A] - r[B]);
    cout << fixed << setprecision(3);
    cout << "P(item " << A << " > item " << B << ") = " << pred << "\n";
    return 0;
}
```
This program synthesizes pairwise comparisons from hidden true scores, then learns item scores r by maximizing the Bradley–Terry log-likelihood using gradient ascent with L2 regularization and mean-centering. It reports learned vs. true scores and predicts a sample preference probability. The sigmoid is implemented stably to avoid overflow/underflow.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Stable sigmoid
static inline double sigmoid(double x) {
    if (x >= 0) { double z = exp(-x); return 1.0 / (1.0 + z); }
    else { double z = exp(x); return z / (1.0 + z); }
}

struct Example {
    vector<double> f1; // features of response 1
    vector<double> f2; // features of response 2
    int y;             // 1 if response 1 preferred, 0 otherwise
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int d = 10;   // feature dimension
    int m = 4000; // number of pairwise examples
    unsigned seed = 7;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);

    // Ground-truth linear reward: r(y) = w*^T phi(y)
    vector<double> w_true(d);
    for (int i = 0; i < d; ++i) w_true[i] = nd(rng);

    // Generate synthetic examples
    vector<Example> data;
    data.reserve(m);
    for (int k = 0; k < m; ++k) {
        Example ex;
        ex.f1.resize(d);
        ex.f2.resize(d);
        for (int i = 0; i < d; ++i) { ex.f1[i] = nd(rng); ex.f2[i] = nd(rng); }
        // Preference probability depends on the difference of features
        double z = 0.0;
        for (int i = 0; i < d; ++i) z += w_true[i] * (ex.f1[i] - ex.f2[i]);
        double p = sigmoid(z);
        bernoulli_distribution bd(p);
        ex.y = bd(rng) ? 1 : 0;
        data.push_back(move(ex));
    }

    // Train logistic regression on feature differences
    vector<double> w(d, 0.0);
    double lr = 0.2;      // learning rate
    double lambda = 1e-3; // L2 regularization
    int epochs = 15;

    vector<int> order(m);
    iota(order.begin(), order.end(), 0);
    for (int ep = 0; ep < epochs; ++ep) {
        shuffle(order.begin(), order.end(), rng);
        double loglik = 0.0;
        // Accumulate gradient (full-batch here for simplicity)
        vector<double> grad(d, 0.0);
        for (int idx : order) {
            const auto &ex = data[idx];
            // Delta features
            double z = 0.0;
            for (int i = 0; i < d; ++i) z += w[i] * (ex.f1[i] - ex.f2[i]);
            double p = sigmoid(z);
            loglik += ex.y ? log(max(1e-15, p)) : log(max(1e-15, 1.0 - p));
            double err = (ex.y - p); // gradient coefficient
            for (int i = 0; i < d; ++i) grad[i] += err * (ex.f1[i] - ex.f2[i]);
        }
        for (int i = 0; i < d; ++i) {
            grad[i] -= lambda * w[i];
            w[i] += lr * grad[i] / m; // average gradient update
        }
        if ((ep + 1) % 5 == 0) cerr << "Epoch " << (ep + 1) << ": avg loglik = " << (loglik / m) << "\n";
    }

    // Evaluate: accuracy on the training set (for demo) and alignment with w_true
    int correct = 0;
    for (const auto &ex : data) {
        double z = 0.0;
        for (int i = 0; i < d; ++i) z += w[i] * (ex.f1[i] - ex.f2[i]);
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == ex.y);
    }
    cout << fixed << setprecision(3);
    cout << "Training accuracy: " << (double)correct / m << "\n";

    // Cosine similarity between learned and true weight directions
    double dot = 0, n1 = 0, n2 = 0;
    for (int i = 0; i < d; ++i) { dot += w[i] * w_true[i]; n1 += w[i] * w[i]; n2 += w_true[i] * w_true[i]; }
    cout << "Cosine similarity(w, w_true): " << (dot / (sqrt(n1) * sqrt(n2) + 1e-12)) << "\n";

    return 0;
}
```
This example trains a linear reward model from pairwise comparisons, which is exactly logistic regression on the feature differences. It synthesizes data from a ground-truth weight vector, fits w with gradient ascent on log-likelihood plus L2 regularization, and reports training accuracy and alignment with the true weights.
```cpp
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double x) {
    if (x >= 0) { double z = exp(-x); return 1.0 / (1.0 + z); }
    else { double z = exp(x); return z / (1.0 + z); }
}

struct Pair { int i, j; double uncertainty; };

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Suppose we have current Bradley–Terry scores r for n items
    int n = 10;
    vector<double> r = {0.9, -0.2, 1.3, 0.1, -1.1, 0.4, -0.7, 0.0, 0.8, -0.5};

    // For active data collection, prioritize pairs with high predictive uncertainty.
    // A natural measure is p(1-p) where p = sigma(r[i]-r[j]); it is maximized near 0.5.
    vector<Pair> candidates;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double p = sigmoid(r[i] - r[j]);
            double u = p * (1.0 - p); // variance of Bernoulli; proxy for informativeness
            candidates.push_back({i, j, u});
        }
    }

    int K = 8; // request top-K most uncertain pairs
    int k_eff = min(K, (int)candidates.size());
    nth_element(candidates.begin(), candidates.begin() + k_eff, candidates.end(),
                [](const Pair &a, const Pair &b){ return a.uncertainty > b.uncertainty; });
    sort(candidates.begin(), candidates.begin() + k_eff,
         [](const Pair &a, const Pair &b){ return a.uncertainty > b.uncertainty; });

    cout << fixed << setprecision(4);
    cout << "Top-" << K << " most uncertain pairs (i,j) with uncertainty p(1-p):\n";
    for (int k = 0; k < k_eff; ++k) {
        auto &c = candidates[k];
        cout << "(" << c.i << "," << c.j << ")\tunc=" << c.uncertainty << "\n";
    }

    return 0;
}
```
Given current scores, this utility selects the top-K pairs with highest predictive uncertainty p(1−p). These are most informative for labeling in the next round of data collection, aligning with active learning principles in RLHF preference gathering.