📚 Theory · Intermediate

Embedding Spaces & Distributed Representations

Key Points

  • Embedding spaces map discrete things like words or products to dense vectors so that similar items are close together.
  • Distances and angles in these spaces (often cosine similarity) reflect semantic or functional similarity.
  • An embedding matrix is just a big table of numbers; looking up an item selects a row, which is its vector.
  • We train embeddings by predicting contexts, co-occurrences, or labels so that useful geometry emerges.
  • Cosine similarity, dot product, and Euclidean distance are the core operations used for search and clustering.
  • Training at scale relies on tricks like negative sampling, mini-batching, and regularization.
  • Choosing the dimension d is a trade-off between expressiveness, speed, and overfitting.
  • Normalization and careful evaluation (nearest neighbors, clustering quality, downstream tasks) are essential.

Prerequisites

  • →Linear Algebra Fundamentals — Embeddings rely on vectors, dot products, norms, and matrices.
  • →Basic Probability — Negative sampling, softmax, and probabilistic objectives use distributions.
  • →Gradient Descent and Optimization — Training embeddings is an optimization problem over parameters.
  • →Arrays and Memory Layout — Embedding tables are large arrays; cache/memory access patterns matter.
  • →Vector Similarities and Distances — Cosine, dot product, and Euclidean distance define neighborhoods in embedding space.
  • →C++ Basics and Standard Library — Implementing lookup, training steps, and search efficiently requires C++ proficiency.
  • →Dimensionality Reduction — Compression techniques like PCA or random projections are common with embeddings.

Detailed Explanation


01 Overview

Embedding spaces and distributed representations turn symbolic objects (words, users, products, molecules) into dense numeric vectors in \(\mathbb{R}^d\). Instead of treating items as unrelated IDs, embeddings arrange them so that geometric relationships—distances and angles—capture meaningful notions of similarity. For example, in a good word embedding space, vectors for "cat" and "dog" are closer to each other than to "banana". These vectors are learned from data by optimizing objectives that reward useful proximity (e.g., predicting context words, clicks, or labels).

The central tool is an embedding matrix: one row per item, with d coordinates each. Looking up an item returns its vector, and simple linear algebra (dot products, norms, cosine similarity) powers retrieval, clustering, and recommendation. Training methods include word2vec (skip-gram with negative sampling), GloVe (matrix factorization of co-occurrence), and contrastive or triplet losses in deep models. Dimensionality reduction (e.g., PCA or random projections) helps compress or visualize spaces.

Why this matters: vectors make learning and search fast. Once items live in a shared space, you can do k-nearest-neighbor search for recommendation, compute centroid prototypes for classification, or measure diversity. Embeddings are now a universal interface between raw symbols and machine learning models.

02 Intuition & Analogies

Imagine organizing a library without reading the books. You place books on shelves so that related topics end up near each other. Over time, you learn to co-locate physics with math, and cooking near nutrition. An embedding space is like a huge, precise library where every item has an address in a multi-dimensional room. The closer two items are, the more they tend to be used together, co-occur, or share properties.

Another analogy: coordinates on a world map. A city’s latitude and longitude are just two numbers that tell you where it sits; nearby cities likely share climate and culture. Embeddings give each item many coordinates (perhaps 100–1000). Two items with similar coordinates are likely related. The “direction” between two points might encode a concept: moving in the “plural” direction could turn "car" into "cars", or in a “royalty” direction might relate "man" to "king" and "woman" to "queen".

Think of a recipe as a list of ingredients and amounts. Two recipes that use similar ingredients in similar proportions taste alike. If you represent each recipe as a vector of ingredient quantities, comparing recipes becomes comparing vectors. Likewise, if two words appear in similar contexts (the distributional hypothesis: “You shall know a word by the company it keeps”), their vectors should resemble each other.

Finally, picture magnets on a whiteboard. During training, we nudge magnets so that items that should be together pull closer, and unrelated items push apart. After many nudges guided by data, a stable arrangement emerges where geometry mirrors meaning.

03 Formal Definition

Let \(V\) be a finite vocabulary of items. An embedding is a function \(f: V \to \mathbb{R}^d\) that assigns each item \(w \in V\) a vector \(x_w = f(w)\). We usually parameterize \(f\) by an embedding matrix \(E \in \mathbb{R}^{|V| \times d}\), where the \(w\)-th row \(E_{w,:}\) is \(x_w\). If \(e_w\) is the one-hot basis vector for \(w\), then \(x_w = e_w^\top E\). A similarity function \(s: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\) measures relatedness. Common choices include the dot product \(s(x,y) = x^\top y\), cosine similarity \(\cos(\theta) = \frac{x^\top y}{\|x\|_2 \|y\|_2}\), and the negative of a distance metric such as Euclidean distance. Learning embeddings means optimizing parameters \(E\) (and possibly additional matrices) to minimize a loss \(L(E)\) over observed data so that desired relationships are preserved in the geometry. Examples: (1) Skip-gram with negative sampling maximizes the likelihood that targets predict their contexts while pushing against random negatives. (2) Matrix factorization finds \(E\) and \(F\) so that a co-occurrence matrix \(M \approx E F^\top\) under some weighting. The embedding space is the set \(\{x_w : w \in V\} \subset \mathbb{R}^d\) equipped with \(s\) (or a distance). Downstream tasks—nearest-neighbor retrieval, clustering, classification via linear models—operate by simple linear algebra in this space.

04 When to Use

Use embeddings whenever you have discrete identifiers but want to generalize or compute similarity:

  • Language and text: represent words, subwords, documents for search, topic modeling, or as inputs to neural networks.
  • Recommendations: map users and items to a joint space where inner products approximate click or purchase probabilities.
  • Knowledge graphs: embed entities and relations to enable link prediction.
  • Bioinformatics and chemistry: embed proteins, genes, or molecules to capture structural or functional similarity.
  • Vision and audio: embed images or clips so that similar content clusters together; e.g., face recognition via triplet loss.

Embeddings shine when labels are sparse but co-occurrence or interaction data is plentiful. They enable fast similarity search and “cold-start” generalization to unseen combinations. Choose cosine similarity when only direction matters (scale-invariant comparisons), dot product when magnitude carries signal (as in implicit-feedback recommenders), and Euclidean distance when your loss or model assumes it.

Do not overuse embeddings for tiny vocabularies (one-hot may suffice) or when interpretability is paramount (dimensions lack direct meaning). If exactness is critical, be careful with approximate nearest neighbor methods; for high-dimensional spaces, their speed-ups come with recall trade-offs.

⚠️Common Mistakes

  • Skipping normalization when using cosine similarity: cosine is scale-invariant, but if you compare unnormalized vectors with a dot product and call it “cosine,” rankings will be biased by vector lengths. Normalize vectors to unit norm when you intend cosine.
  • Confusing metrics: Euclidean distance and cosine similarity induce different neighborhoods, especially when norms vary. Decide based on your loss and evaluation task, and keep it consistent between training and retrieval.
  • Dimension d too small or too large: too small underfits (can’t separate concepts), too large overfits and costs memory/compute. Start with a heuristic (e.g., 50–300 for small vocabularies, 256–1024 for large) and tune.
  • Poor negative sampling: uniform negatives in skewed vocabularies yield slow or unstable training. Use frequency-smoothed distributions (e.g., \(p(w) \propto f(w)^{3/4}\)).
  • Ignoring OOV (out-of-vocabulary) handling: freezing at training vocabulary can break downstream use. Add UNK embeddings, subword models, or hashing tricks.
  • Training–inference mismatch: training with dot product but retrieving with cosine (or vice versa) can degrade performance. Match the similarity used in loss to the one used at query time or include explicit normalization layers.
  • Not monitoring drift and bias: embedding spaces can encode dataset biases or drift over time. Periodically recalibrate, regularize, and evaluate fairness and stability.
  • Numerical instability: sigmoid on large-magnitude dot products can overflow. Use stable implementations and consider clipping or temperature scaling.

Key Formulas

Embedding Lookup

\(x_w = e_w^\top E = E_{w,:}\)

Explanation: The embedding of item w is the w-th row of the embedding matrix E. Multiplying a one-hot vector by E selects that row.

Dot Product

\(x^\top y = \sum_{i=1}^{d} x_i y_i\)

Explanation: Measures alignment between two vectors by summing elementwise products. Larger values indicate higher alignment.

Cosine Similarity

\(\cos(\theta) = \frac{x^\top y}{\|x\|_2 \|y\|_2}\)

Explanation: Angle-based similarity that ignores vector magnitudes. Often used for text and retrieval where direction matters more than scale.

Euclidean Distance

\(\|x - y\|_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}\)

Explanation: Straight-line distance between vectors. Useful when the learning objective assumes Euclidean geometry.

Softmax

\(p_i = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}\)

Explanation: Turns scores into a probability distribution over classes or contexts. Used in language models and classification.

Cross-Entropy Loss

\(L_{\mathrm{CE}} = -\sum_{i=1}^{C} y_i \log p_i\)

Explanation: Measures the gap between the true one-hot label y and predicted probabilities p. For a single true class, it reduces to -log of that class's probability.

Skip-gram Negative Sampling

\(L_{\mathrm{SGNS}} = -\log \sigma(u_c^\top v_t) - \sum_{k=1}^{K} \log \sigma(-u_{n_k}^\top v_t)\)

Explanation: Encourages the target vector \(v_t\) and the true context \(u_c\) to align while pushing away the \(K\) sampled negatives \(u_{n_k}\). This yields efficient training for word embeddings.

L2 Regularization

\(\Omega(E) = \lambda \|E\|_F^2 = \lambda \sum_{w=1}^{|V|} \sum_{i=1}^{d} E_{w,i}^2\)

Explanation: Penalizes large weights to prevent overfitting and improve generalization. The Frobenius norm is the L2 norm over all matrix entries.

Triplet Loss

\(L_{\mathrm{triplet}} = \max(0,\; m + d(a,p) - d(a,n))\)

Explanation: For anchor a, positive p, and negative n, enforces that a is closer to p than to n by at least margin m under distance d.

PCA Projection

\(C = \frac{1}{n} X^\top X, \quad Z = X W_m\)

Explanation: Compute covariance \(C\) of centered data \(X\), take the top-m eigenvectors \(W_m\), and project \(X\) to lower dimension \(Z\). Preserves maximal variance linearly.

Johnson–Lindenstrauss Bound

\(m \ge \frac{4 \ln(n)}{\epsilon^2/2 - \epsilon^3/3}\)

Explanation: Number of random projection dimensions m required to preserve pairwise distances within (1±ε) for n points with high probability.

Nearest Neighbor Retrieval

\(\arg\max_{i \in V} s(x_q, x_i)\)

Explanation: Given a query vector \(x_q\), return the item \(i\) with the highest similarity under \(s\). For k-NN, return the top-k indices.

k-Means Objective

\(\min_{\{\mu_c\}} \sum_{i=1}^{n} \|x_i - \mu_{c(i)}\|_2^2\)

Explanation: Partitions vectors into clusters by minimizing within-cluster squared distances to centroids \(\mu_c\). Useful to summarize or segment embedding spaces.

Complexity Analysis

Let |V| be the vocabulary size and d the embedding dimension. Storing the embedding matrix requires O(|V| d) space; for example, |V| = 1M and d = 300 at 4 bytes/float needs about 1.2 GB. A single embedding lookup is O(d) time (just reading a row). Computing cosine similarity between two vectors is O(d) for the dot product plus O(d) for norms if not precomputed; with unit-normalized vectors, it reduces to a single O(d) dot product.

Brute-force nearest-neighbor search over N items is O(N d) per query: compute N dot products of size d and select the top-k. Exact indexes (kd-trees) struggle in high d due to the curse of dimensionality; approximate methods (e.g., HNSW, product quantization) reduce query time at the cost of recall. If vectors are pre-normalized, you can store them compactly (e.g., 8-bit quantization) to reduce memory bandwidth, often the bottleneck in similarity search.

For training with skip-gram negative sampling, each (target, context) update costs O((K+1) d), where K is the number of negatives, because you compute (K+1) dot products and update (K+1) output vectors plus one input vector. Over T training pairs, the total time is O(T (K+1) d). Adding L2 regularization does not change the asymptotic cost. Mini-batching improves cache locality and vectorizes operations.

Dimensionality reduction via random projection from D to m dimensions costs O(D m) per vector and uses O(D m) space for the projection matrix if dense. PCA training (exact SVD) is O(n D^2) or O(D^3) depending on regime; incremental or randomized methods are preferred for large D. Overall, the dominant costs are memory for E and bandwidth for similarity queries; algorithmic choices (K, d, indexing) control the speed–accuracy trade-off.

Code Examples

Cosine similarity and brute-force k-NN over an embedding table
#include <bits/stdc++.h>
using namespace std;

// Compute L2 norm of a vector
double l2norm(const vector<double>& v){
    double s = 0.0;
    for(double x : v) s += x*x;
    return sqrt(max(0.0, s));
}

// Normalize a vector to unit L2 norm (in-place)
void normalize(vector<double>& v){
    double n = l2norm(v);
    if(n == 0) return;
    for(double &x : v) x /= n;
}

// Dot product
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    size_t d = a.size();
    for(size_t i=0;i<d;++i) s += a[i]*b[i];
    return s;
}

// Return top-k nearest neighbors by cosine (assumes all rows normalized)
vector<pair<int,double>> topk_cosine(const vector<vector<double>>& E, const vector<double>& q, int k){
    vector<pair<double,int>> scores;
    scores.reserve(E.size());
    for(size_t i=0;i<E.size();++i){
        scores.emplace_back(dot(E[i], q), (int)i);
    }
    // nth_element partitions the top-k to the front; then sort just that prefix
    nth_element(scores.begin(), scores.begin()+k, scores.end(),
                [](auto& a, auto& b){ return a.first > b.first; });
    sort(scores.begin(), scores.begin()+k,
         [](auto& a, auto& b){ return a.first > b.first; });
    vector<pair<int,double>> ans;
    ans.reserve(k);
    for(int i=0;i<k;++i) ans.emplace_back(scores[i].second, scores[i].first);
    return ans;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Toy vocabulary
    vector<string> vocab = {"cat","dog","banana","apple","lion","car"};
    // Toy 3D embeddings (random-ish but we'll normalize)
    vector<vector<double>> E = {
        {0.9,  0.1,  0.0 }, // cat
        {0.88, 0.12, 0.05}, // dog
        {0.0,  0.9,  0.1 }, // banana
        {0.05, 0.88, 0.12}, // apple
        {0.95, 0.05, 0.0 }, // lion
        {0.1,  0.05, 0.95}  // car
    };

    // Normalize all rows for cosine
    for(auto& v : E) normalize(v);

    // Query: use the vector for "cat"
    vector<double> q = E[0]; // already normalized

    int k = 3;
    auto nn = topk_cosine(E, q, k);

    cout << "Top-" << k << " nearest to '" << vocab[0] << "':\n";
    for(auto [idx, score] : nn){
        cout << fixed << setprecision(4) << vocab[idx] << "\t" << score << "\n";
    }
    return 0;
}

We build a tiny embedding table, L2-normalize all rows, and compute cosine similarities to a query vector using dot products (since normalization makes cosine equal to dot). We then find the top-k nearest neighbors with partial sorting. This mirrors how semantic search or recommendation retrieves similar items.

Time: O(N d + N log k) per query (dot products plus partial sorting). With k << N, nth_element makes it close to O(N d).
Space: O(N d) for the embedding table plus O(N) for temporary scores.
One training step of Skip-gram with Negative Sampling (SGNS)
#include <bits/stdc++.h>
using namespace std;

// Numerically stable sigmoid
inline double sigmoid(double x){
    if(x >= 0){
        double z = exp(-x);
        return 1.0 / (1.0 + z);
    } else {
        double z = exp(x);
        return z / (1.0 + z);
    }
}

// Perform one SGNS update for (target t, positive context c) with K negatives
void sgns_update(vector<vector<double>>& W_in, vector<vector<double>>& W_out,
                 int t, int c, const vector<int>& negatives, double lr){
    int d = (int)W_in[0].size();
    vector<double> &v = W_in[t];   // input embedding for target
    vector<double> grad_v(d, 0.0); // accumulate gradient for v

    // Positive pair gradient
    vector<double> &u_pos = W_out[c];
    double pos_score = inner_product(v.begin(), v.end(), u_pos.begin(), 0.0);
    double pos_sig = sigmoid(pos_score); // sigma(u_c^T v)
    double coeff_pos = (pos_sig - 1.0);  // dL/d(u_c^T v)

    // Update u_pos and accumulate grad for v
    for(int i=0;i<d;++i){
        double gu = coeff_pos * v[i]; // dL/du_c_i
        u_pos[i] -= lr * gu;
        grad_v[i] += coeff_pos * u_pos[i]; // note: uses updated u_pos (OK for this SGD variant)
    }

    // Negatives: for each n, add sigma(u_n^T v) * u_n to grad_v and update u_n
    for(int n_id : negatives){
        vector<double> &u_neg = W_out[n_id];
        double neg_score = inner_product(v.begin(), v.end(), u_neg.begin(), 0.0);
        double neg_sig = sigmoid(neg_score); // sigma(u_n^T v)
        double coeff_neg = neg_sig;          // dL/d(u_n^T v)
        for(int i=0;i<d;++i){
            double gu = coeff_neg * v[i]; // dL/du_n_i
            u_neg[i] -= lr * gu;
            grad_v[i] += coeff_neg * u_neg[i];
        }
    }

    // Finally update v (target embedding)
    for(int i=0;i<d;++i){ v[i] -= lr * grad_v[i]; }
}

// Sample K negatives from a given cumulative distribution function (CDF)
vector<int> sample_negatives(const vector<double>& cdf, int K, mt19937& rng){
    uniform_real_distribution<double> U(0.0, 1.0);
    vector<int> out;
    out.reserve(K);
    for(int k=0;k<K;++k){
        double r = U(rng);
        int idx = (int)(lower_bound(cdf.begin(), cdf.end(), r) - cdf.begin());
        if(idx == (int)cdf.size()) idx = (int)cdf.size()-1;
        out.push_back(idx);
    }
    return out;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Small vocab and dim
    int V = 6, d = 8;
    double lr = 0.05;
    int K = 4;

    // Initialize embeddings with small random values
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 0.1);
    vector<vector<double>> W_in(V, vector<double>(d)), W_out(V, vector<double>(d));
    for(int i=0;i<V;++i){
        for(int j=0;j<d;++j){ W_in[i][j] = N(rng); W_out[i][j] = N(rng); }
    }

    // Example frequencies (smoothed) for negative sampling
    vector<double> freq = {10, 12, 3, 4, 9, 2};
    double sum = 0.0;
    for(double f : freq) sum += pow(f, 0.75);
    vector<double> prob(V), cdf(V);
    for(int i=0;i<V;++i){ prob[i] = pow(freq[i], 0.75) / sum; }
    partial_sum(prob.begin(), prob.end(), cdf.begin());

    // One observed (target, context) pair, e.g., ("cat","dog") as ids (0,1)
    int t = 0, c = 1;
    vector<int> negatives = sample_negatives(cdf, K, rng);

    // Before update: compute pos score
    double before = inner_product(W_in[t].begin(), W_in[t].end(), W_out[c].begin(), 0.0);

    sgns_update(W_in, W_out, t, c, negatives, lr);

    double after = inner_product(W_in[t].begin(), W_in[t].end(), W_out[c].begin(), 0.0);

    cout << fixed << setprecision(6);
    cout << "Positive score before: " << before << "\n";
    cout << "Positive score after : " << after << "\n";
    cout << "Sampled negatives: ";
    for(int id : negatives) cout << id << ' ';
    cout << "\n";
    return 0;
}

Implements a single SGD update for skip-gram with negative sampling. The positive pair (t, c) is pulled together, while K negatives are pushed away. We use a numerically stable sigmoid and sample negatives from a frequency-smoothed distribution. The printed dot product typically increases after the update, indicating stronger alignment between target and true context.

Time: O((K+1) d) per update (dot products and vector updates).
Space: O(V d) for W_in and O(V d) for W_out; O(K) temporary storage.
Random projection to compress embeddings (JL transform style)
#include <bits/stdc++.h>
using namespace std;

// Project N vectors in R^D to R^m using a dense Gaussian random matrix R (D x m)
vector<vector<double>> random_projection(const vector<vector<double>>& X, int m, unsigned seed=123){
    int N = (int)X.size();
    if(N == 0) return {};
    int D = (int)X[0].size();
    mt19937 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    // R scaled by 1/sqrt(m) so expected squared norms are preserved
    vector<vector<double>> R(D, vector<double>(m));
    double scale = 1.0 / sqrt((double)m);
    for(int i=0;i<D;++i)
        for(int j=0;j<m;++j) R[i][j] = N01(rng) * scale;

    vector<vector<double>> Z(N, vector<double>(m, 0.0));
    for(int n=0;n<N;++n){
        for(int j=0;j<m;++j){
            double s = 0.0;
            for(int i=0;i<D;++i) s += X[n][i] * R[i][j];
            Z[n][j] = s;
        }
    }
    return Z;
}

// Utility: cosine similarity
double cosine(const vector<double>& a, const vector<double>& b){
    auto dot = inner_product(a.begin(), a.end(), b.begin(), 0.0);
    auto na = sqrt(inner_product(a.begin(), a.end(), a.begin(), 0.0));
    auto nb = sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0));
    if(na == 0 || nb == 0) return 0.0;
    return dot / (na*nb);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build N synthetic D-dimensional vectors
    int N = 200, D = 128, m = 32;
    mt19937 rng(7);
    normal_distribution<double> N01(0.0, 1.0);
    vector<vector<double>> X(N, vector<double>(D));
    for(int n=0;n<N;++n)
        for(int i=0;i<D;++i) X[n][i] = N01(rng);

    // Compute a few pairwise cosines before projection
    vector<pair<int,int>> pairs = {{0,1},{0,2},{10,11},{50,120}};
    vector<double> before;
    for(auto [i,j] : pairs) before.push_back(cosine(X[i], X[j]));

    // Project down to m dimensions
    auto Z = random_projection(X, m, 123);

    // Compute the same cosines after projection
    vector<double> after;
    for(auto [i,j] : pairs) after.push_back(cosine(Z[i], Z[j]));

    cout << fixed << setprecision(5);
    for(size_t k=0;k<pairs.size();++k){
        cout << "Pair (" << pairs[k].first << "," << pairs[k].second << ")\tbefore="
             << before[k] << "\tafter=" << after[k] << "\n";
    }
    return 0;
}

Generates synthetic high-dimensional vectors, then compresses them using a dense Gaussian random projection scaled by 1/sqrt(m). The Johnson–Lindenstrauss lemma guarantees approximate distance (and thus cosine) preservation with high probability for sufficiently large m. The printed cosine pairs illustrate that similarities are largely preserved after compression.

Time: O(N D m) to form the projection and apply it (dominated by the matrix multiply X·R).
Space: O(D m) for the projection matrix plus O(N m) for the compressed vectors.
Tags: embeddings, dense vectors, cosine similarity, word2vec, negative sampling, nearest neighbor, random projection, PCA, triplet loss, matrix factorization, contrastive learning, vector search, normalization, distributed representations, semantic similarity