How I Study AI - Learn AI Papers & Lectures the Easy Way
āš™ļøAlgorithmIntermediate

t-SNE & UMAP

Key Points

  • t-SNE and UMAP are nonlinear dimensionality-reduction methods that preserve local neighborhoods to make high-dimensional data visible in 2D or 3D.
  • t-SNE builds probabilities from Gaussian kernels in high dimensions and uses a heavy-tailed Student-t distribution in low dimensions to avoid crowding.
  • UMAP builds a fuzzy k-nearest-neighbor graph and optimizes a cross-entropy objective with negative sampling to produce an embedding.
  • t-SNE's key knob is perplexity, which roughly controls neighborhood size; UMAP's main knobs are n_neighbors and min_dist, controlling locality and tightness of clusters.
  • Naive t-SNE is O(n^2) per iteration; scalable variants use Barnes–Hut or FFT-based approximations, while UMAP scales via approximate neighbor search (e.g., NN-descent).
  • Preprocessing (standardization, removing duplicates/outliers) dramatically affects results for both methods.
  • Global geometry is not faithfully preserved; interpret local distances and cluster separations with caution.
  • Initialization, random seed, and learning-rate schedules can noticeably change the final visualization.

Prerequisites

  • Euclidean distance and vector norms — Both algorithms rely on distances to build neighborhoods and measure similarity.
  • Probability distributions and entropy — t-SNE uses conditional probabilities and perplexity; UMAP uses fuzzy memberships and cross-entropy.
  • Gradient descent and momentum — Optimization of both objectives requires iterative gradient-based updates.
  • k-Nearest neighbors and graphs — UMAP constructs and optimizes over a kNN graph; understanding graphs is essential.
  • Numerical stability and scaling — Exponentials, normalization, and step sizes can cause instability without care.
  • Principal Component Analysis (PCA) — Commonly used as a preprocessing step to denoise and reduce dimensionality before t-SNE/UMAP.
  • Random number generation — Both methods rely on randomized initialization and, for UMAP, negative sampling.

Detailed Explanation


01 Overview

t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are two of the most popular algorithms for turning complex, high-dimensional datasets into low-dimensional visualizations. Both focus on preserving local structure: if two points are close in the original space, they should remain close in the embedding. t-SNE does this by translating distances into conditional probabilities and then finding low-dimensional points that match these probabilities under a heavy-tailed Student-t distribution. UMAP builds a graph of nearest neighbors with fuzzy (soft) connection strengths and then optimizes a cross-entropy objective so that pairs that are connected in the high-dimensional graph are also close in the embedding.

The key differences are in modeling and scalability. t-SNE’s heavy-tailed low-dimensional distribution directly mitigates the ā€œcrowding problem,ā€ but naive optimization is quadratic in the number of points. UMAP constructs an explicit k-nearest-neighbor graph and uses efficient, almost linear-time neighbor finding plus negative sampling during optimization, typically making it faster on large datasets. Hyperparameters also differ: t-SNE’s perplexity influences the effective neighborhood size, while UMAP’s n_neighbors and min_dist respectively set the local connectivity scale and how tightly points can pack. Used wisely, both give insightful, visually separable clusters that help you understand structure, outliers, and relationships in data.

02 Intuition & Analogies

Imagine plotting the locations of people in a huge city by their interests, habits, or preferences—features you cannot see directly. You want a 2D map where friends (similar people) stay near each other. In the original high-dimensional world, friendship strength decays smoothly with dissimilarity. t-SNE converts this idea into probabilities: for each person, it assigns a probability distribution over all other people that emphasizes close friends more than distant ones. Then it looks for a 2D placement so that, in 2D, the friendship probabilities look similar. The catch is that squeezing many points into 2D can cause crowding. t-SNE’s fix is to use a heavy-tailed Student-t distribution in 2D, which makes far-away points exert less force than a Gaussian would. This lets clusters spread out naturally and reduces the tendency for everything to collapse into the center.

UMAP thinks in terms of neighborhoods and roads. First, it builds a road map: each person connects to their k closest neighbors, and the thickness of each road reflects how strongly they are neighbors (a fuzzy membership strength). This creates a soft, weighted graph of local relationships. Next, UMAP draws a 2D map and asks: if two people are connected by a thick road in the original map, they should be close in 2D; if they are not connected, they shouldn’t be too close by accident. It optimizes a cross-entropy score that rewards keeping connected pairs close and, through negative sampling, discourages unconnected pairs from becoming near neighbors. Parameters like n_neighbors decide how many roads each person has, and min_dist says how tightly neighborhoods can be drawn in 2D. The result is a map that respects local streets while still allowing overall layout to emerge.

03 Formal Definition

For t-SNE, given points X = {x_1, …, x_n} in R^D, define conditional probabilities with Gaussian kernels:

p_{j|i} = exp(āˆ’ā€–x_i āˆ’ x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(āˆ’ā€–x_i āˆ’ x_k‖² / (2σ_i²)), for j ≠ i, with p_{i|i} = 0.

The bandwidth σ_i is chosen per point so that the perplexity Perp(P_i) = 2^{H(P_i)} matches a user-specified value, where H(P_i) = āˆ’Ī£_j p_{j|i} logā‚‚ p_{j|i}. Symmetrize to p_{ij} = (p_{i|j} + p_{j|i}) / (2n). For a low-dimensional embedding Y = {y_i} āŠ‚ R^d, define

q_{ij} = (1 + ‖y_i āˆ’ y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k āˆ’ y_l‖²)⁻¹.

t-SNE minimizes the Kullback–Leibler divergence C = Σ_{i≠j} p_{ij} log(p_{ij} / q_{ij}) via gradient descent.

For UMAP, compute a k-nearest-neighbor graph. For each i, let ρ_i be the distance to its closest nonzero-distance neighbor, and choose σ_i so that Σ_j exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) ā‰ˆ logā‚‚(k), with membership strengths μ_{ij} = 1 if d_{ij} ≤ ρ_i and exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) otherwise. Symmetrize with the fuzzy union w_{ij} = μ_{ij} + μ_{ji} āˆ’ μ_{ij} μ_{ji}. In the embedding space, define p^{(low)}_{ij} = 1 / (1 + a‖y_i āˆ’ y_j‖^{2b}), with a and b fit from min_dist. Optimize the cross-entropy

L = Σ_{i<j} [āˆ’w_{ij} log p^{(low)}_{ij} āˆ’ (1 āˆ’ w_{ij}) log(1 āˆ’ p^{(low)}_{ij})]

using stochastic gradient descent with negative sampling.

04 When to Use

Use t-SNE or UMAP when you want to visualize high-dimensional data (text embeddings, images, genetic profiles) in 2D or 3D while emphasizing local neighborhoods. t-SNE often excels at clear separation of clusters due to its heavy-tailed low-dimensional kernel and the early exaggeration phase, making it a top choice for small to medium datasets when you can afford O(n^2) work or have Barnes–Hut/FFT approximations available. If you care primarily about visual cluster purity and local neighborhoods, t-SNE is a strong default.

UMAP is typically faster and more scalable, especially when paired with approximate neighbor search (e.g., NN-descent). It controls the balance between local and global structure via n_neighbors and allows you to regulate how tightly points pack via min_dist. If you need to handle larger datasets, retain more of the global manifold (e.g., continuous trajectories), or embed new points into an existing model, UMAP is often the better choice. In pipelines where you may later perform clustering or nearest-neighbor queries in the embedding, UMAP’s graph-based framework and transform capability are advantageous. For both methods, always standardize features, reduce obvious noise (e.g., with PCA), and try several hyperparameters to validate stability.

āš ļøCommon Mistakes

• Interpreting absolute positions or axes: Only relative distances and neighborhoods are meaningful. Rotations, reflections, or translations of the embedding are arbitrary.
• Using raw, unscaled features: Euclidean distances are sensitive to scale. Standardize or whiten features, and consider PCA to denoise before t-SNE/UMAP.
• Expecting global geometry: t-SNE especially does not preserve global distances. Far-away clusters or cluster sizes can be misleading about true densities or distances.
• Poor hyperparameter choices: Too small or too large perplexity (t-SNE) or n_neighbors/min_dist (UMAP) can fragment clusters or over-merge them. Start with perplexity in [5, 50] and n_neighbors in [10, 50].
• Skipping early exaggeration in t-SNE: It improves separation of clusters early on; omitting or misusing it hurts results.
• Quadratic implementations on large n: Naive O(n^2) t-SNE quickly becomes infeasible. Use Barnes–Hut or FFT-based (FIt-SNE) methods; for UMAP, use NN-descent for kNN.
• Insufficient iterations or unstable learning rates: Both methods need careful optimization. Use momentum in t-SNE and appropriate learning rates; in UMAP, tune epochs and the negative sampling rate.
• Ignoring randomness: Different seeds can produce different layouts. For reproducibility and comparison, fix the seed and run multiple trials.
• Overtrusting cluster separations for decision making: These are visualization tools; validate findings with quantitative methods and out-of-sample checks.

Key Formulas

t-SNE Conditional Probability

p_{j|i} = exp(āˆ’ā€–x_i āˆ’ x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(āˆ’ā€–x_i āˆ’ x_k‖² / (2σ_i²))

Explanation: Defines how likely j is to be chosen as a neighbor of i under a Gaussian kernel. Each point i has its own bandwidth σi​ tuned to match a target perplexity.

Perplexity and Entropy

H(P_i) = āˆ’Ī£_j p_{j|i} logā‚‚ p_{j|i},  Perp(P_i) = 2^{H(P_i)}

Explanation: Entropy in bits determines the effective number of neighbors (perplexity). Matching a user-provided perplexity controls local scale in t-SNE.

Symmetric Joint Probabilities

p_{ij} = (p_{i|j} + p_{j|i}) / (2n)

Explanation: t-SNE uses a symmetric matrix P by averaging conditionals and normalizing by 2n. This prevents directional biases.

Student-t Low-Dim Similarities

q_{ij} = (1 + ‖y_i āˆ’ y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k āˆ’ y_l‖²)⁻¹

Explanation: In the embedding, similarities come from a heavy-tailed kernel that reduces crowding by allowing moderate influence from distant pairs.

t-SNE Objective (KL Divergence)

C=i=jāˆ‘ā€‹pij​logqij​pij​​

Explanation: Minimizing KL(P||Q) makes low-dimensional similarities qij​ match high-dimensional similarities pij​, emphasizing local structure.

t-SNE Gradient

āˆ‚C/āˆ‚y_i = 4 Σ_j (p_{ij} āˆ’ q_{ij}) (y_i āˆ’ y_j) / (1 + ‖y_i āˆ’ y_j‖²)

Explanation: The gradient pulls points together if pij​>qij​ and pushes apart otherwise. The heavy-tail denominator tempers forces at larger distances.

UMAP Directed Membership

μ_{ij} = 1 if d_{ij} ≤ ρ_i;  μ_{ij} = exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) if d_{ij} > ρ_i

Explanation: Directed edge strengths from i to j are 1 within a local threshold ρi​ and decay exponentially beyond it, with scale σi​ tuned per point.

UMAP Fuzzy Union

w_{ij} = μ_{ij} + μ_{ji} āˆ’ μ_{ij} μ_{ji}

Explanation: Combines directed memberships into a symmetric weight via a probabilistic OR (fuzzy union): w_{ij} is high when at least one direction is strong, and stays low only if both are weak.

UMAP Low-Dimensional Similarity

p^{(low)}_{ij} = 1 / (1 + a‖y_i āˆ’ y_j‖^{2b})

Explanation: UMAP uses a parametric curve with parameters a and b (fitted from min_dist and spread) that controls how quickly similarity decays with distance.

UMAP Cross-Entropy Objective

L = Σ_{i<j} [āˆ’w_{ij} log p^{(low)}_{ij} āˆ’ (1 āˆ’ w_{ij}) log(1 āˆ’ p^{(low)}_{ij})]

Explanation: Encourages connected pairs (high wij​) to be close (high pij(low)​) while discouraging unconnected pairs from being too close.

Squared Euclidean Distance

d²_{ij} = ‖x_i āˆ’ x_j‖² = Σ_{t=1}^{D} (x_{it} āˆ’ x_{jt})²

Explanation: Basic distance measure used for both methods by default. Preprocessing to equalize feature scales is crucial.

Complexity Analysis

Let n be the number of points and D the input dimension. Computing the full pairwise distance matrix costs O(n^2 D) time and O(n^2) space, which both t-SNE and a naive UMAP implementation may do initially.

For t-SNE, choosing per-point bandwidths (sigmas) via binary search costs O(n^2) overall (each row performs repeated kernel evaluations), so building P is O(n^2 D + n^2). The naive optimization step computes all pairwise interactions, making each iteration O(n^2) time and O(n^2) space if you explicitly form Q, or O(n^2) time and O(n) space if you stream forces. Barnes–Hut reduces this to roughly O(n log n) per iteration using a tree to aggregate far-field forces, and FIt-SNE reduces it further (empirically near-linear) using FFT-based convolutions on an interpolation grid.

UMAP's core cost is constructing a kNN graph. With naive all-pairs search, this is O(n^2 D) time and O(nk) space (storing only edges to k neighbors). With NN-descent or other approximate methods, it often approaches O(n log n) or better in practice. The fuzzy membership scaling (solving for σ_i) is O(nk) with a small constant from a short binary search per node. The optimization is linear in the number of edges: each epoch performs O(E) attractive updates plus O(E Ā· nsr) repulsive updates via negative sampling, where E ā‰ˆ nk and nsr is the negative sampling rate. Total time is therefore O(epochs Ā· E Ā· (1 + nsr)), and memory is O(E + nd) for the edges and the embedding.

In summary, naive t-SNE is quadratic and becomes slow for large n without approximations, while UMAP with approximate neighbors typically scales to much larger datasets with near-linear behavior in n.

Code Examples

t-SNE: Compute Perplexity-Based Conditional Probabilities and Symmetric P
#include <bits/stdc++.h>
using namespace std;

// Compute squared Euclidean distance matrix (n x n)
vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size();
    int d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++){
        D[i][i]=0.0;
        for(int j=i+1;j<n;j++){
            double s=0.0;
            for(int t=0;t<d;t++){
                double diff = X[i][t]-X[j][t];
                s += diff*diff;
            }
            D[i][j]=D[j][i]=s;
        }
    }
    return D;
}

// For each i, find beta = 1/(2 sigma_i^2) such that perplexity matches target.
// Returns conditional probabilities Pcond (n x n) with zeros on diagonal and sigmas.
pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter=50, double tol=1e-5){
    int n=(int)D.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    vector<double> sigmas(n, 0.0);
    double logU = log2(perplexity); // target entropy in bits

    for(int i=0;i<n;i++){
        double beta = 1.0; // 1/(2*sigma^2)
        double betamin = -numeric_limits<double>::infinity();
        double betamax = numeric_limits<double>::infinity();
        // Binary search for beta
        for(int it=0; it<max_iter; ++it){
            // Compute row probabilities with current beta
            double sumP = 0.0;
            for(int j=0;j<n;j++){
                if(i==j){ P[i][j]=0.0; continue; }
                double val = exp(-beta * D[i][j]);
                P[i][j] = val;
                sumP += val;
            }
            // Normalize
            if(sumP == 0.0){
                for(int j=0;j<n;j++) if(i!=j) P[i][j] = 1.0/(n-1);
                break;
            }
            for(int j=0;j<n;j++) if(i!=j) P[i][j] /= sumP;
            // Compute entropy H in bits
            double H = 0.0;
            for(int j=0;j<n;j++)
                if(i!=j && P[i][j] > 0.0) H -= P[i][j] * (log(P[i][j]) / log(2.0));
            // Check and update beta bounds
            double Hdiff = H - logU;
            if (fabs(Hdiff) < tol) break;
            if (Hdiff > 0) { // Entropy too high -> distribution too broad -> increase beta
                betamin = beta;
                if (isinf(betamax)) beta *= 2.0; else beta = 0.5*(beta + betamax);
            } else {         // Entropy too low -> distribution too peaky -> decrease beta
                betamax = beta;
                if (isinf(betamin)) beta *= 0.5; else beta = 0.5*(beta + betamin);
            }
        }
        // Save sigma from beta
        sigmas[i] = sqrt(1.0/(2.0*max(beta, 1e-12)));
    }

    return {P, sigmas};
}

// Symmetrize conditionals into joint probabilities P_ij = (p_{i|j} + p_{j|i})/(2n)
vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n = (int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++){
        for(int j=i+1;j<n;j++){
            double v = (Pcond[i][j] + Pcond[j][i]) / (2.0 * n);
            P[i][j]=P[j][i]=v;
        }
    }
    return P;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build a tiny 3D dataset with 3 clusters
    int n=60, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 0.3);
    for(int i=0;i<n;i++){
        int c = i/20; // cluster id 0,1,2
        for(int t=0;t<d;t++){
            double center = (c==0? -2.0 : (c==1? 0.0 : 2.0));
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    double perplexity = 30.0;
    auto pr = tsneConditionalP(D, perplexity);
    auto Pcond = pr.first; auto sigmas = pr.second;
    auto P = symmetrizeP(Pcond);

    // Simple diagnostics
    cout << fixed << setprecision(6);
    cout << "Average sigma: ";
    double avg=0; for(double s: sigmas) avg+=s; avg/=sigmas.size();
    cout << avg << "\n";

    // Check that P sums to ~1 over i!=j
    double sumP=0.0;
    for(int i=0;i<n;i++) for(int j=0;j<n;j++) if(i!=j) sumP+=P[i][j];
    cout << "Sum P (off-diagonal) = " << sumP << " (should be ~1.0)\n";

    // Show a few P values across clusters
    cout << "P[0][1] (likely same cluster): " << P[0][1] << "\n";
    cout << "P[0][50] (likely different cluster): " << P[0][50] << "\n";
    return 0;
}

This program constructs a small 3D dataset, computes the squared pairwise distances, finds per-point Gaussian bandwidths via binary search to match a target perplexity, and forms the symmetric t-SNE probability matrix P. It prints diagnostics to verify normalization and to compare within-cluster vs. across-cluster similarities.

Time: O(n^2 D) to build distances plus O(n^2) for the binary searches (each row performs repeated kernel evaluations).
Space: O(n^2) for storing the distance matrix and P.
t-SNE: One Simple Gradient-Descent Epoch in 2D (Naive O(n^2))
#include <bits/stdc++.h>
using namespace std;

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size();
    int d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double s=0.0;
        for(int t=0;t<d;t++){ double diff=X[i][t]-X[j][t]; s+=diff*diff; }
        D[i][j]=D[j][i]=s;
    }
    return D;
}

pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter=50, double tol=1e-5){
    int n=(int)D.size();
    vector<vector<double>> P(n, vector<double>(n,0.0));
    vector<double> sigmas(n,0.0);
    double logU = log2(perplexity);
    for(int i=0;i<n;i++){
        double beta=1.0;
        double betamin=-numeric_limits<double>::infinity();
        double betamax= numeric_limits<double>::infinity();
        for(int it=0; it<max_iter; ++it){
            double sumP=0.0;
            for(int j=0;j<n;j++){
                if(i==j){ P[i][j]=0.0; continue; }
                double v=exp(-beta*D[i][j]); P[i][j]=v; sumP+=v;
            }
            if(sumP==0){ for(int j=0;j<n;j++) if(i!=j) P[i][j]=1.0/(n-1); break; }
            for(int j=0;j<n;j++) if(i!=j) P[i][j]/=sumP;
            double H=0.0;
            for(int j=0;j<n;j++)
                if(i!=j && P[i][j]>0) H -= P[i][j]*(log(P[i][j])/log(2.0));
            double diff = H - logU;
            if(fabs(diff)<tol) break;
            if(diff>0){
                betamin=beta;
                beta = isinf(betamax)? beta*2.0 : 0.5*(beta+betamax);
            } else {
                betamax=beta;
                beta = isinf(betamin)? beta*0.5 : 0.5*(beta+betamin);
            }
        }
        sigmas[i]=sqrt(1.0/(2.0*max(beta,1e-12)));
    }
    return {P, sigmas};
}

vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n=(int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n,0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double v=(Pcond[i][j]+Pcond[j][i])/(2.0*n);
        P[i][j]=P[j][i]=v;
    }
    return P;
}

// Compute Q numerators (t-kernel) and normalization Z, and accumulate gradients.
void tsneEpoch(vector<vector<double>>& Y, const vector<vector<double>>& P,
               double learning_rate, double momentum, vector<vector<double>>& velocity){
    int n=(int)Y.size();
    // Compute pairwise 1/(1+dist2) and Z
    vector<vector<double>> num(n, vector<double>(n,0.0));
    double Z=0.0;
    for(int i=0;i<n;i++){
        for(int j=i+1;j<n;j++){
            double dx=Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
            double d2=dx*dx+dy*dy;
            double v = 1.0/(1.0 + d2);
            num[i][j]=num[j][i]=v; Z += 2.0*v;
        }
    }
    // Compute gradients: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)/(1 + d2)
    vector<vector<double>> grad(n, vector<double>(2,0.0));
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++) if(i!=j){
            double qij = num[i][j]/Z; // low-dim prob
            double pij = P[i][j];
            double dx=Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
            double d2 = dx*dx+dy*dy;
            double coeff = 4.0*(pij - qij)/(1.0 + d2);
            grad[i][0] += coeff*dx; grad[i][1] += coeff*dy;
        }
    }
    // Update with momentum; step AGAINST the gradient to minimize the KL divergence
    for(int i=0;i<n;i++){
        for(int a=0;a<2;a++){
            velocity[i][a] = momentum*velocity[i][a] - learning_rate*grad[i][a];
            Y[i][a] += velocity[i][a];
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Small dataset: two 3D clusters
    int n=80, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(0);
    normal_distribution<double> N(0.0, 0.35);
    for(int i=0;i<n;i++){
        int c=i/40;
        for(int t=0;t<d;t++){
            double center = (c==0? -2.0 : 2.0);
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    auto pr = tsneConditionalP(D, 30.0);
    auto P = symmetrizeP(pr.first);

    // Initialize Y randomly in a small range
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.0001, 0.0001);
    for(int i=0;i<n;i++){ Y[i][0]=U(rng); Y[i][1]=U(rng); }
    vector<vector<double>> vel(n, vector<double>(2,0.0));

    double lr=200.0; double mom=0.5; double exaggeration=12.0; int iters=400;

    // Apply early exaggeration by scaling P for the first phase
    vector<vector<double>> Pex = P;
    for(int i=0;i<n;i++) for(int j=0;j<n;j++) Pex[i][j]*=exaggeration;

    for(int t=0;t<iters;t++){
        if(t==250){ mom=0.8; lr=100.0; Pex=P; } // drop exaggeration
        tsneEpoch(Y, Pex, lr, mom, vel);
        if((t+1)%100==0){
            // Report simple spread metric (variance)
            double mx=0,my=0; for(auto &y:Y){mx+=y[0]; my+=y[1];} mx/=n; my/=n;
            double vx=0,vy=0;
            for(auto &y:Y){vx+=(y[0]-mx)*(y[0]-mx); vy+=(y[1]-my)*(y[1]-my);}
            vx/=n; vy/=n;
            cout << "Iter " << (t+1) << ": varX=" << vx << ", varY=" << vy << "\n";
        }
    }

    // Print first few embedded points
    cout << fixed << setprecision(4);
    for(int i=0;i<5;i++) cout << "Y["<<i<<"] = ("<<Y[i][0]<<", "<<Y[i][1]<<")\n";
    return 0;
}

This standalone demo computes t-SNE probabilities P from data, initializes a 2D embedding, and performs naive O(n^2) gradient steps with an early-exaggeration phase. It prints progress and a few coordinates. For larger datasets, replace the naive force computation with Barnes–Hut or FIt-SNE.

Time: per iteration O(n^2) for pairwise forces; initialization adds O(n^2 D) for distances and O(n^2) for building P.
Space: O(n^2) to store P and temporary numerators; O(n) for Y and velocities.
UMAP (Simplified): Build kNN Graph, Fuzzy Weights, and One Training Pass with Negative Sampling
#include <bits/stdc++.h>
using namespace std;

struct Edge{ int i, j; double w; };

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n=(int)X.size(), d=(int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n,0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double s=0.0;
        for(int t=0;t<d;t++){ double diff=X[i][t]-X[j][t]; s+=diff*diff; }
        D[i][j]=D[j][i]=s;
    }
    return D;
}

// Get k nearest neighbors (indices and distances) for each point (naive O(n^2))
vector<vector<pair<int,double>>> knn(const vector<vector<double>>& D, int k){
    int n=(int)D.size();
    vector<vector<pair<int,double>>> nbrs(n);
    for(int i=0;i<n;i++){
        vector<pair<double,int>> cand; cand.reserve(n-1);
        for(int j=0;j<n;j++) if(i!=j) cand.push_back({D[i][j], j});
        nth_element(cand.begin(), cand.begin()+k, cand.end());
        cand.resize(k);
        // Sort for a stable k-th distance
        sort(cand.begin(), cand.end());
        nbrs[i].reserve(k);
        for(auto &p: cand)
            nbrs[i].push_back({p.second, sqrt(max(p.first, 0.0))}); // store Euclidean distance
    }
    return nbrs;
}

// Solve sigma_i so that sum_j exp(-(d_ij - rho_i)/sigma_i) ~ log2(k)
double solveSigma(const vector<double>& dists, double rho, double target, int iters=32){
    // Binary search on sigma in (1e-3, 1e3)
    double lo=1e-3, hi=1e3;
    for(int it=0; it<iters; ++it){
        double mid = sqrt(lo*hi);
        double s=0.0;
        for(double d: dists){
            double val = (d<=rho)? 1.0 : exp(-(d - rho)/mid);
            s += val;
        }
        if(s > target) hi=mid; else lo=mid;
    }
    return (lo+hi)*0.5;
}

// Build UMAP fuzzy union weights w_ij
vector<Edge> buildUMAPGraph(const vector<vector<double>>& D, int k){
    int n=(int)D.size();
    auto nbrs = knn(D, k);

    vector<double> rho(n, 0.0), sigma(n, 1.0);
    for(int i=0;i<n;i++){
        // rho_i: distance to the nearest neighbor (minimal local threshold)
        double ri = nbrs[i].empty()? 0.0 : nbrs[i][0].second;
        rho[i] = ri;
        vector<double> dists; dists.reserve(nbrs[i].size());
        for(auto &p: nbrs[i]) dists.push_back(p.second);
        double target = log2(max(2, k));
        sigma[i] = solveSigma(dists, rho[i], target);
    }

    // Directed memberships mu_ij for j in N_k(i)
    vector<vector<pair<int,double>>> directed(n);
    for(int i=0;i<n;i++){
        for(auto &p: nbrs[i]){
            int j=p.first; double dij = p.second;
            double mu = (dij <= rho[i])? 1.0 : exp(-(dij - rho[i])/max(1e-12, sigma[i]));
            directed[i].push_back({j, mu});
        }
    }

    // Fuzzy union: w_ij = mu_ij + mu_ji - mu_ij * mu_ji, store only existing pairs.
    // Place all candidate edges in a map to merge both directions.
    struct PairHash { size_t operator()(const long long& x) const { return std::hash<long long>()(x); } };
    unordered_map<long long, pair<double,double>, PairHash> tmp; // key -> (mu_ij, mu_ji)
    auto key = [&](int a,int b){ return ( (long long)min(a,b) << 32 ) | (long long)max(a,b); };

    for(int i=0;i<n;i++) for(auto &p: directed[i]){
        int j=p.first; double mu=p.second; long long kx = key(i,j);
        if(i<j){ tmp[kx].first = max(tmp[kx].first, mu); }
        else   { tmp[kx].second = max(tmp[kx].second, mu); }
    }

    vector<Edge> edges; edges.reserve(tmp.size());
    for(auto &kv: tmp){
        // Recover i,j from the key
        int i = (int)(kv.first >> 32);
        int j = (int)(kv.first & 0xFFFFFFFF);
        double a = kv.second.first; double b = kv.second.second;
        double w = a + b - a*b; // fuzzy union
        if(w>0) edges.push_back({i,j,w});
    }
    return edges;
}

// One simplified training epoch (stochastic) for the UMAP embedding in 2D
void umapTrainEpoch(vector<vector<double>>& Y, const vector<Edge>& edges,
                    double a, double b, double lr, int negative_sample_rate,
                    mt19937& rng){
    int n=(int)Y.size();
    uniform_int_distribution<int> UnifIdx(0, n-1);
    vector<int> order(edges.size()); iota(order.begin(), order.end(), 0);
    shuffle(order.begin(), order.end(), rng);

    for(int ii=0; ii<(int)order.size(); ++ii){
        const Edge &e = edges[order[ii]]; int i=e.i, j=e.j; double w=e.w;
        // Attractive update for (i,j)
        double dx = Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
        double dist = sqrt(dx*dx+dy*dy) + 1e-7;
        double dist2b = pow(dist, 2.0*b);
        double pij = 1.0/(1.0 + a*dist2b);
        // Gradient magnitude for attractive term (approx; simplified from UMAP reference)
        double grad_coeff = -2.0 * a * b * w * pow(dist, 2.0*b - 2.0) * pij; // damped by pij
        double gx = grad_coeff * dx; double gy = grad_coeff * dy;
        // grad_coeff is negative, so adding lr*gx moves i toward j (and j toward i)
        Y[i][0] += lr * gx; Y[i][1] += lr * gy;
        Y[j][0] -= lr * gx; Y[j][1] -= lr * gy;

        // Negative samples: repel random non-neighbors
        for(int r=0; r<negative_sample_rate; ++r){
            int jn = UnifIdx(rng); if(jn==i) jn = (jn+1)%n;
            double dxn = Y[i][0]-Y[jn][0], dyn=Y[i][1]-Y[jn][1];
            double dn = sqrt(dxn*dxn+dyn*dyn) + 1e-7;
            double dn2b = pow(dn, 2.0*b);
            double pneg = 1.0/(1.0 + a*dn2b);
            // Repulsive gradient (pushing apart); scaled to be weaker at long range
            double rep_coeff = 2.0 * b * pneg * (1.0 - pneg) / (dn + 1e-3);
            double grx = rep_coeff * dxn; double gry = rep_coeff * dyn;
            Y[i][0]  += lr * grx; Y[i][1]  += lr * gry;
            Y[jn][0] -= lr * grx; Y[jn][1] -= lr * gry;
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Synthetic dataset: three 3D clusters
    int n=90, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(123);
    normal_distribution<double> N(0.0, 0.35);
    for(int i=0;i<n;i++){
        int c=i/30; double center = (c==0? -3.0 : (c==1? 0.0 : 3.0));
        for(int t=0;t<d;t++) X[i][t] = center + N(rng);
    }

    auto D = squaredDistanceMatrix(X);
    int n_neighbors = 15;
    auto edges = buildUMAPGraph(D, n_neighbors);

    // Embedding initialization
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.01, 0.01);
    for(int i=0;i<n;i++){ Y[i][0]=U(rng); Y[i][1]=U(rng); }

    // Typical (a,b) for min_dist~0.1 per UMAP defaults (pre-fitted constants)
    double a=1.929; double b=0.7915;
    double lr=1.0; int epochs=200; int neg_rate=5;

    for(int ep=0; ep<epochs; ++ep){
        umapTrainEpoch(Y, edges, a, b, lr, neg_rate, rng);
        if((ep+1)%50==0){
            // Report simple spread metric
            double mx=0,my=0; for(auto &y:Y){mx+=y[0]; my+=y[1];} mx/=n; my/=n;
            double vx=0,vy=0;
            for(auto &y:Y){vx+=(y[0]-mx)*(y[0]-mx); vy+=(y[1]-my)*(y[1]-my);}
            vx/=n; vy/=n;
            cout << "Epoch "<<(ep+1)<<": varX="<<vx<<", varY="<<vy<<"\n";
        }
    }

    cout << fixed << setprecision(4);
    for(int i=0;i<5;i++) cout<<"Y["<<i<<"]=("<<Y[i][0]<<","<<Y[i][1]<<")\n";
    return 0;
}

This educational UMAP sketch builds a kNN graph, computes fuzzy union weights, and runs a simplified SGD with negative sampling to optimize a 2D embedding. The attractive and repulsive gradients are approximations of the reference implementation; they illustrate the mechanics but are not tuned for large-scale use.

Time: O(n^2 D) to compute distances; O(n^2) to find kNN naively; O(nk) to compute fuzzy weights; and per epoch O(E Ā· (1 + nsr)) with E ā‰ˆ nk.
Space: O(n^2) if storing all distances (for simplicity here) or O(nk) for the graph plus O(n) for the embedding.
#t-sne#umap#dimensionality reduction#perplexity#k-nearest neighbors#kl divergence#student-t distribution#fuzzy simplicial set#negative sampling#barnes-hut#fit-sne#nn-descent#min_dist#cross-entropy#embedding