How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Contrastive Learning

Key Points

  • Contrastive learning teaches models by pulling together similar examples (positives) and pushing apart dissimilar ones (negatives).
  • The core building block is a similarity score (often cosine similarity) and a softmax-based loss like InfoNCE/NT-Xent with a temperature parameter.
  • Positive pairs are usually created by augmenting the same input twice, while negatives come from other items in the batch or a memory bank.
  • Large, diverse negatives and strong augmentations are crucial to avoid collapse and to learn invariant, discriminative features.
  • The computational bottleneck is pairwise similarity over the batch, which costs O(B^2 d) time and O(B^2) memory if you materialize the matrix.
  • Projection heads and L2 normalization are standard tricks that stabilize training and improve downstream performance.
  • Contrastive pretraining transfers well to classification, retrieval, and multimodal alignment (e.g., CLIP).
  • Careful implementation details (masking self-similarity, stable log-sum-exp, correct temperature) make a big difference in training stability.

Prerequisites

  • Linear algebra (vectors, dot product, norms) — Cosine similarity, normalization, and matrix multiplications rely on vector and matrix operations.
  • Probability and expectations — Understanding softmax, temperature scaling, and InfoNCE’s probabilistic interpretation requires basic probability.
  • Cross-entropy and softmax — The NT-Xent/InfoNCE loss is a cross-entropy over similarities.
  • Backpropagation and gradient-based optimization — Training contrastive models requires computing gradients and updating parameters with optimizers like SGD/Adam.
  • Data augmentation — Positives are formed by augmentations, so knowing modality-appropriate transforms is essential.
  • Numerical stability (log-sum-exp) — Stable computation of softmax/log-likelihood avoids overflow and underflow.
  • Batch processing and memory complexity — Contrastive loss scales quadratically with batch size; understanding O(B^2) effects helps design efficient training.
  • Metric learning basics — Contrastive and triplet losses are classic metric learning objectives closely related to this paradigm.

Detailed Explanation


01 Overview

Contrastive learning is a way for models to learn useful representations without needing labeled data. Instead of predicting explicit labels, the model compares pairs of examples and learns which ones should be considered similar (positives) and which should be dissimilar (negatives). The model encodes inputs into vectors (embeddings) and uses a similarity function—usually cosine similarity—to measure how close two embeddings are. The training objective (e.g., InfoNCE or NT-Xent) encourages the embedding of a view of an example to be close to another view of the same example, while pushing it away from embeddings of other examples in the batch. This process leads the model to discover features that are invariant to augmentations (like cropping, color jitter, or noise) and yet discriminative between different inputs.

A typical pipeline uses two stochastic augmentations to create two "views" of each input, processes them with a shared encoder (such as a CNN or Transformer), and applies a small projection head before the contrastive loss. The loss is computed across the whole batch, treating all other samples as negatives. Variants like MoCo use a momentum encoder and a memory bank to increase the number of negatives, while BYOL and SimSiam avoid explicit negatives by using stop-gradients and asymmetric networks. Once trained, the projection head is usually discarded and the frozen encoder serves as a general feature extractor for downstream tasks like classification (via a linear probe), retrieval (nearest neighbor search), or alignment across modalities (as in CLIP).

02 Intuition & Analogies

Imagine teaching someone to recognize your friend in photos. You show two different pictures of your friend (different lighting, angle, clothes) and say, "these are the same person," and then show many pictures of other people and say, "these are different." With enough such comparisons, you learn what facial features matter (eyes, nose, shape) and what doesn’t (lighting, color tint, background). That’s contrastive learning in spirit.

Consider magnets on a whiteboard: each input is a magnet. Positive pairs are two magnets that should snap together; negatives are magnets that should repel each other. The learning algorithm adjusts each magnet’s position on the board (the embedding space) so that positives end up near each other and negatives far apart. If you always bring the same magnet pairs close and push different ones away, a natural structure emerges: clusters for similar things.

Temperature is like the stiffness of the magnet force: a low temperature makes the attraction/repulsion very sharp (only the closest matter), while a high temperature softens it (many neighbors have influence). Data augmentation is like showing the same object under different disguises—cropping, noise, or color changes—so the model learns to ignore superficial changes and focus on essential identity.

Finally, think of study flashcards. For each term, you create synonyms (positives) and gather unrelated terms (negatives). By consistently pairing terms with their synonyms and distinguishing them from unrelated ones, you internalize the concept’s essence. Contrastive learning similarly builds an internal map of concepts by repeatedly comparing and contrasting.

03 Formal Definition

Let x ∈ X be an input and t ∼ T a random augmentation. Two views of the same input are v₁ = t₁(x) and v₂ = t₂(x). An encoder f_θ : X → ℝ^d maps a view to a representation h = f_θ(v). A projection head g_φ : ℝ^d → ℝ^p yields z = g_φ(h), which is L2-normalized to z̃ = z/∥z∥. On normalized vectors, similarity is cosine: sim(u, v) = uᵀv. For a batch of N inputs we form two augmented views per input, producing 2N normalized embeddings {z̃₁, …, z̃_{2N}}. Each sample i ∈ {1, …, 2N} has exactly one positive index p(i) (its paired view) and 2N − 2 negatives. The NT-Xent loss with temperature τ > 0 is

\[ \mathcal{L} = \frac{1}{2N} \sum_{i=1}^{2N} -\log \frac{\exp(\mathrm{sim}(\tilde{z}_i, \tilde{z}_{p(i)})/\tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(\mathrm{sim}(\tilde{z}_i, \tilde{z}_k)/\tau)}. \]

This is a cross-entropy loss in which the positive index is the correct class and all other indices are impostors. Variants (InfoNCE) differ in how positives and negatives are defined and sampled (in-batch negatives, memory banks, or queues). In MoCo, a momentum encoder defines a slowly changing key set; in BYOL/SimSiam, stop-gradients avoid explicit negatives. After training, g_φ is often discarded and f_θ is used as the representation function for downstream tasks.

04 When to Use

  • When labels are scarce or expensive: Pretrain with contrastive learning on large unlabeled datasets, then fine-tune or train a linear classifier with few labels.
  • When you need invariant features: Use strong augmentations to make the representation robust to nuisances (e.g., crops, color jitter, noise, masking).
  • For retrieval and metric learning: Contrastive representations support nearest-neighbor search and ranking tasks because distances reflect semantic similarity.
  • For multimodal alignment: Learn a shared space where images and texts about the same concept are close (e.g., CLIP). Positives are matched pairs across modalities; negatives are mismatched pairs.
  • For self-supervised pretraining in vision, audio, graphs, and NLP: Contrastive learning is broadly applicable; choose augmentations appropriate to the modality.
  • When batch size is limited: Consider memory banks or momentum queues (MoCo) to increase negatives without huge batches, or negative-free methods (BYOL/SimSiam) if negatives are impractical.
  • When interpretability matters: Contrastive embeddings can be probed with simple tools (k-NN, linear probes) to assess what was learned before committing to full fine-tuning.

⚠️Common Mistakes

  • Forgetting L2 normalization: Without normalization, dot products depend on vector norms, destabilizing training and making temperature hard to tune.
  • Not masking self-similarity: Including an item’s similarity with itself in the denominator leaks trivial positives and reduces effective negative count.
  • Weak or trivial augmentations: If two views are too similar, the task is too easy and the model may not learn invariances. Use domain-appropriate, sufficiently strong augmentations.
  • Too small batch or too few negatives: Leads to poor discrimination. Use larger batches, memory banks, or queues; or switch to methods designed for small-batch settings.
  • Bad temperature τ: Too small gives overly peaky distributions (training instability); too large makes the task ambiguous. Tune τ on a validation proxy (e.g., linear probe accuracy).
  • Training collapse: All embeddings become identical. Mitigate with negatives, stop-gradients (BYOL/SimSiam), predictor heads, or regularization.
  • Ignoring projection head: Skipping g_φ typically hurts; using it during training but discarding it at evaluation is a well-supported best practice.
  • Numerical instability: Use the log-sum-exp trick when computing denominators; avoid overflow/underflow in exp; maintain float precision.
  • Data leakage between views: Ensure two augmentations are independently sampled; don’t accidentally reuse identical views. Ensure proper shuffling across workers.

Key Formulas

Cosine Similarity

sim(u,v)=∥u∥∥v∥u⊤v​

Explanation: This measures how aligned two vectors are, ignoring their magnitudes. In contrastive learning, we usually L2-normalize embeddings so the similarity is just a dot product in [-1, 1].

InfoNCE Loss

\[ \mathcal{L}_{\mathrm{InfoNCE}} = -\mathbb{E}\left[ \log \frac{\exp(s(x,y)/\tau)}{\exp(s(x,y)/\tau) + \sum_{k=1}^{K} \exp(s(x, y_k^{-})/\tau)} \right] \]

Explanation: Softmax assigns probability to the positive pair relative to the positive plus the K negatives. Minimizing the loss encourages the model to score the true positive higher than all negatives.

NT-Xent (SimCLR)

\[ \mathcal{L}_{\text{NT-Xent}} = \frac{1}{2N} \sum_{i=1}^{2N} -\log \frac{\exp(\mathrm{sim}(z_i, z_{p(i)})/\tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \]

Explanation: Each view in a batch acts as an anchor with exactly one positive (its paired view) and all others as negatives. The temperature τ controls the sharpness of the softmax over similarities.

InfoNCE Mutual Information Lower Bound

\[ I(X;Y) \ge \log N - \mathcal{L}_{\mathrm{NCE}} \]

Explanation: For a batch with N candidates, the InfoNCE loss lower-bounds the mutual information between X and Y. Lower loss implies higher estimated mutual information.
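A quick plug-in check of the bound, with illustrative numbers (not from the lesson): for N = 1024 candidates and a measured loss of 2.0 nats,

```latex
I(X;Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}}
       \;=\; \log 1024 - 2.0
       \;\approx\; 6.93 - 2.0
       \;=\; 4.93 \ \text{nats}.
```

Because the bound can never exceed log N (≈ 6.93 nats here), more candidates are needed to certify higher mutual information, which is one theoretical motivation for large batches or queues.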

Softmax

\[ p_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)} \]

Explanation: Converts scores to a probability distribution. In contrastive learning, scores are often similarities divided by temperature before applying softmax.

Log-Sum-Exp Trick

\[ m = \max_j s_j, \qquad \tilde{s}_j = s_j - m, \qquad \log \sum_j \exp(s_j) = m + \log \sum_j \exp(\tilde{s}_j) \]

Explanation: Subtracting the maximum score before exponentiation improves numerical stability. This avoids overflow when computing softmax or log-likelihoods.

Contrastive (Siamese) Loss

\[ \mathcal{L}_{\mathrm{contrastive}} = y\, d(a,p)^2 + (1-y)\, \max\{0,\; m - d(a,n)\}^2 \]

Explanation: The classic pairwise (siamese) contrastive loss for labeled pairs: positives (y=1) are pulled together, and negatives (y=0) are pushed apart up to a margin m under distance d.

Triplet Loss

\[ \mathcal{L}_{\mathrm{triplet}} = \max\{0,\; d(a,p) - d(a,n) + m\} \]

Explanation: Enforces that the anchor-positive distance is at least m smaller than the anchor-negative distance. Common in metric learning with hard negative mining.

Temperature-Scaled Similarity

\[ s(u, v; \tau) = \frac{u^\top v}{\tau} \]

Explanation: Dividing by τ rescales logits before softmax. Smaller τ increases the contrast between close and far pairs, yielding sharper probability distributions.

Complexity Analysis

Let B be the number of items per batch (producing 2B views) and d the embedding dimension. Computing the contrastive loss requires all pairwise similarities between anchors and candidates. If you explicitly materialize the 2B × 2B similarity matrix S = Z Zᵀ (with normalized rows), the time cost is O(B^2 d), because each dot product costs O(d) and there are O(B^2) pairs. The memory to store S is O(B^2). With stable log-sum-exp, you also perform O(B^2) exponentials and additions. If memory is tight, you can avoid forming the full matrix by computing per-anchor similarities on the fly, which reduces peak memory to O(B d) but keeps the time at O(B^2 d).

Using a memory bank or momentum queue of size Q changes the cost: computing similarities against the queue is O(B Q d) per step and O(Q d) memory for the bank, while the in-batch part remains O(B^2 d) if you keep it. On GPU, matrix multiplications leverage BLAS and are bandwidth/compute efficient; still, the dominant term scales quadratically with batch size, making large B the key bottleneck.

Backpropagation through the softmax and similarity adds only constant factors to the forward cost. L2 normalization is O(B d). Projection heads (small MLPs) typically have O(d p) parameters and add O(B d p) compute, usually minor compared to the all-pairs similarity. For retrieval or k-NN evaluation with M queries and N database items, a brute-force search costs O(M N d) time and O(N d) memory, often mitigated by approximate nearest-neighbor indexing in practice.

Code Examples

Compute NT-Xent (InfoNCE) loss from precomputed embeddings (no deep learning library)
#include <bits/stdc++.h>
using namespace std;

// L2-normalize a vector in-place
void l2_normalize(vector<double>& v) {
    double norm = 0.0;
    for (double x : v) norm += x * x;
    norm = sqrt(max(norm, 1e-30));
    for (double& x : v) x /= norm;
}

// Dot product
double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Compute NT-Xent loss for two sets of embeddings Z1 and Z2 (size N x d), temperature tau.
// Assumes Z1[i] and Z2[i] are positive pairs. Uses cosine similarity via pre-normalization.
double nt_xent_loss(const vector<vector<double>>& Z1_in,
                    const vector<vector<double>>& Z2_in,
                    double tau) {
    int N = (int)Z1_in.size();
    int d = (int)Z1_in[0].size();
    // Copy and L2-normalize
    vector<vector<double>> Z1 = Z1_in, Z2 = Z2_in;
    for (int i = 0; i < N; ++i) { l2_normalize(Z1[i]); l2_normalize(Z2[i]); }

    // Concatenate to 2N embeddings
    vector<vector<double>> Z(2 * N, vector<double>(d));
    for (int i = 0; i < N; ++i) Z[i] = Z1[i];
    for (int i = 0; i < N; ++i) Z[N + i] = Z2[i];

    double loss = 0.0;
    int total = 2 * N;

    // For each anchor, compute a stable log-softmax over all others
    for (int i = 0; i < total; ++i) {
        int pos = (i < N) ? (N + i) : (i - N); // paired index

        // Similarities s[i][k] = dot(Z[i], Z[k]) / tau for all k != i
        vector<double> sims; sims.reserve(total - 1);
        double maxSim = -1e100;
        for (int k = 0; k < total; ++k) if (k != i) {
            double s = dot(Z[i], Z[k]) / tau;
            sims.push_back(s);
            if (s > maxSim) maxSim = s;
        }
        // Log-sum-exp denominator
        double denom = 0.0;
        for (double s : sims) denom += exp(s - maxSim);
        double logDenom = maxSim + log(max(denom, 1e-300));

        // Numerator: log(exp(s_ip)) = s_ip
        double s_ip = dot(Z[i], Z[pos]) / tau;

        loss += -(s_ip - logDenom);
    }
    return loss / total;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int N = 4;        // batch size (number of items); total views = 2N
    int d = 8;        // embedding dimension
    double tau = 0.2; // temperature

    // Create toy embeddings for two views of each item
    vector<vector<double>> Z1(N, vector<double>(d));
    vector<vector<double>> Z2(N, vector<double>(d));
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.1);

    // Base prototypes per item
    vector<vector<double>> base(N, vector<double>(d));
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < d; ++j) base[i][j] = (i + 1) * 0.5 + 0.1 * j; // simple separable pattern
    }
    // Two noisy views per item
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < d; ++j) {
            Z1[i][j] = base[i][j] + noise(rng);
            Z2[i][j] = base[i][j] + noise(rng);
        }
    }

    double L = nt_xent_loss(Z1, Z2, tau);
    cout << fixed << setprecision(6);
    cout << "NT-Xent loss (toy): " << L << "\n";

    return 0;
}

This standalone C++ program computes the NT-Xent (InfoNCE) loss for two sets of precomputed embeddings, treating Z1[i] and Z2[i] as positives and all other pairs as negatives. It L2-normalizes embeddings to make the dot product equal to cosine similarity, builds the 2N set of views, and for each anchor applies a numerically stable log-sum-exp softmax to form the loss. This demonstrates the essential mechanics of contrastive loss without requiring any deep learning framework.

Time: O(B^2 d), where B=N is the batch size (2B views) and d is the embedding dimension.
Space: O(B d) to store embeddings and O(B) per-anchor temporary storage (no full similarity matrix is materialized).
Minimal SimCLR-style training step with LibTorch (C++ PyTorch)
#include <torch/torch.h>
#include <iostream>
#include <vector>

// Simple MLP projection head: Linear -> ReLU -> Linear
struct MLP : torch::nn::Module {
    torch::nn::Linear fc1{nullptr}, fc2{nullptr};
    MLP(int in_dim, int hidden, int out_dim) {
        fc1 = register_module("fc1", torch::nn::Linear(in_dim, hidden));
        fc2 = register_module("fc2", torch::nn::Linear(hidden, out_dim));
    }
    torch::Tensor forward(torch::Tensor x) {
        x = torch::relu(fc1->forward(x));
        x = fc2->forward(x);
        return x;
    }
};

// NT-Xent loss using cosine similarity with temperature
torch::Tensor nt_xent(torch::Tensor z1, torch::Tensor z2, double tau) {
    namespace F = torch::nn::functional;
    // L2-normalize along the feature dimension
    z1 = F::normalize(z1, F::NormalizeFuncOptions().p(2).dim(1));
    z2 = F::normalize(z2, F::NormalizeFuncOptions().p(2).dim(1));

    // Concatenate: [2N, d]
    auto z = torch::cat({z1, z2}, 0);
    auto N = z1.size(0);

    // Similarity matrix: [2N, 2N] = z * z^T / tau
    auto sim = torch::mm(z, z.t()) / tau;

    // Mask self-similarity by setting the diagonal to a large negative value
    auto mask = torch::eye(2 * N, sim.options()).to(torch::kBool);
    sim = sim.masked_fill(mask, -1e9);

    // For each anchor i, the positive index is its paired view:
    // for i in [0, N-1], pos = N + i; for i in [N, 2N-1], pos = i - N
    auto targets = torch::arange(0, 2 * N, torch::kLong);
    targets = torch::where(targets < N, targets + N, targets - N);

    // Cross-entropy over rows of sim
    return torch::nn::functional::cross_entropy(sim, targets);
}

int main() {
    torch::manual_seed(0);
    // Hyperparameters
    const int B = 8;        // batch size (items); total views = 2B
    const int in_dim = 32;  // input feature size (toy)
    const int hid = 64;     // MLP hidden size
    const int out_dim = 32; // projection size
    const double tau = 0.2;

    // Toy "encoder": a single Linear layer stands in for a real encoder
    auto encoder = torch::nn::Linear(in_dim, out_dim);
    auto projector = std::make_shared<MLP>(out_dim, hid, out_dim);

    // Collect parameter groups for the optimizer
    std::vector<torch::optim::OptimizerParamGroup> params;
    params.emplace_back(encoder->parameters());
    params.emplace_back(projector->parameters());
    torch::optim::Adam optim(params, torch::optim::AdamOptions(1e-3));

    // Create a toy batch of inputs X: [B, in_dim]
    auto X = torch::randn({B, in_dim});

    // Two simple augmentations: Gaussian noise and dropout-like masking
    auto augment = [](torch::Tensor t) {
        auto noise = 0.1 * torch::randn_like(t);
        auto mask = (torch::rand_like(t) > 0.1).to(t.dtype()); // keep ~90% of entries
        return (t * mask) + noise;
    };

    auto X1 = augment(X);
    auto X2 = augment(X);

    // Forward: encoder -> projector
    auto h1 = encoder->forward(X1);
    auto h2 = encoder->forward(X2);
    auto z1 = projector->forward(h1);
    auto z2 = projector->forward(h2);

    auto loss = nt_xent(z1, z2, tau);
    std::cout << "Loss before step: " << loss.item<double>() << "\n";

    optim.zero_grad();
    loss.backward();
    optim.step();

    auto loss_after = nt_xent(projector->forward(encoder->forward(augment(X))),
                              projector->forward(encoder->forward(augment(X))), tau);
    std::cout << "Loss after one step (toy): " << loss_after.item<double>() << "\n";

    return 0;
}

This example shows a minimal SimCLR-style training step in LibTorch (PyTorch’s C++ API). A simple linear encoder and a small MLP projection head map two independently augmented views of each input to embeddings. The NT-Xent loss is computed by building a [2N, 2N] similarity matrix (cosine via normalization and dot product), masking self-similarities, constructing the positive indices, and applying cross-entropy. An optimizer step updates the parameters. Although the encoder here is a toy linear layer, the same routine applies to CNNs or Transformers.

Time: O(B^2 d) for the loss (matrix multiply and softmax), plus O(B d out_dim) for the encoder/projector forward/backward.
Space: O(B d) for activations plus O(B^2) for the similarity matrix; parameters add O(d·out_dim + hidden·out_dim).
k-NN retrieval with cosine similarity on learned embeddings
#include <bits/stdc++.h>
using namespace std;

// L2-normalize a vector in-place
void l2_normalize(vector<double>& v) {
    double n = 0.0;
    for (double x : v) n += x * x;
    n = sqrt(max(n, 1e-30));
    for (double& x : v) x /= n;
}

double cosine(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s; // after normalization, dot = cosine
}

// Return the label of the nearest gallery neighbor for each query
vector<int> knn_top1(const vector<vector<double>>& gallery,
                     const vector<int>& gallery_labels,
                     const vector<vector<double>>& queries) {
    int G = (int)gallery.size();
    int Q = (int)queries.size();

    // Copy and normalize
    vector<vector<double>> Gz = gallery, Qz = queries;
    for (auto& v : Gz) l2_normalize(v);
    for (auto& v : Qz) l2_normalize(v);

    vector<int> pred(Q);
    for (int i = 0; i < Q; ++i) {
        double best = -1e100; int bestj = -1;
        for (int j = 0; j < G; ++j) {
            double s = cosine(Qz[i], Gz[j]);
            if (s > best) { best = s; bestj = j; }
        }
        pred[i] = gallery_labels[bestj];
    }
    return pred;
}

int main() {
    // Toy gallery with labels {0, 1}
    vector<vector<double>> gallery = {{1, 0, 0}, {0.9, 0.1, 0}, {0, 1, 0}, {0.1, 0.9, 0}};
    vector<int> labels = {0, 0, 1, 1};
    vector<vector<double>> queries = {{0.8, 0.2, 0}, {0.2, 0.8, 0}};

    auto preds = knn_top1(gallery, labels, queries);
    for (size_t i = 0; i < preds.size(); ++i)
        cout << "Query " << i << " -> label " << preds[i] << "\n";
    return 0;
}

This code evaluates embeddings using nearest-neighbor classification with cosine similarity. After L2 normalization, cosine reduces to a dot product. For each query, it finds the most similar gallery item and returns its label. This mirrors common contrastive evaluation protocols (k-NN or linear probe) to assess representation quality.

Time: O(Q · G · d), where Q is the number of queries, G is the gallery size, and d is the dimension.
Space: O(G d + Q d) to store normalized copies; additional O(1) per-query scratch.
#contrastive learning#infonce#nt-xent#simclr#moco#byol#cosine similarity#temperature scaling#projection head#self-supervised learning#metric learning#triplet loss#memory bank#k-nn retrieval#mutual information