t-SNE & UMAP
Key Points
- t-SNE and UMAP are nonlinear dimensionality-reduction methods that preserve local neighborhoods to make high-dimensional data visible in 2D or 3D.
- t-SNE builds probabilities from Gaussian kernels in high dimensions and uses a heavy-tailed Student-t distribution in low dimensions to avoid crowding.
- UMAP builds a fuzzy k-nearest-neighbor graph and optimizes a cross-entropy objective with negative sampling to produce an embedding.
- t-SNE's key knob is perplexity, which roughly controls neighborhood size; UMAP's main knobs are n_neighbors and min_dist, controlling locality and tightness of clusters.
- Naive t-SNE is O(n^2) per iteration; scalable variants use Barnes–Hut or FFT-based approximations, while UMAP scales via approximate neighbor search (e.g., NN-descent).
- Preprocessing (standardization, removing duplicates/outliers) dramatically affects results for both methods.
- Global geometry is not faithfully preserved; interpret local distances and cluster separations with caution.
- Initialization, random seed, and learning-rate schedules can noticeably change the final visualization.
Prerequisites
- Euclidean distance and vector norms – Both algorithms rely on distances to build neighborhoods and measure similarity.
- Probability distributions and entropy – t-SNE uses conditional probabilities and perplexity; UMAP uses fuzzy memberships and cross-entropy.
- Gradient descent and momentum – Optimization of both objectives requires iterative gradient-based updates.
- k-Nearest neighbors and graphs – UMAP constructs and optimizes over a kNN graph; understanding graphs is essential.
- Numerical stability and scaling – Exponentials, normalization, and step sizes can cause instability without care.
- Principal Component Analysis (PCA) – Commonly used as a preprocessing step to denoise and reduce dimensionality before t-SNE/UMAP.
- Random number generation – Both methods rely on randomized initialization and, for UMAP, negative sampling.
Detailed Explanation
Overview
t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are two of the most popular algorithms for turning complex, high-dimensional datasets into low-dimensional visualizations. Both focus on preserving local structure: if two points are close in the original space, they should remain close in the embedding. t-SNE does this by translating distances into conditional probabilities and then finding low-dimensional points that match these probabilities under a heavy-tailed Student-t distribution. UMAP builds a graph of nearest neighbors with fuzzy (soft) connection strengths and then optimizes a cross-entropy objective so that pairs that are connected in the high-dimensional graph are also close in the embedding.
The key differences are in modeling and scalability. t-SNE's heavy-tailed low-dimensional distribution directly mitigates the "crowding problem," but naive optimization is quadratic in the number of points. UMAP constructs an explicit k-nearest-neighbor graph and uses efficient, almost linear-time neighbor finding plus negative sampling during optimization, typically making it faster on large datasets. Hyperparameters also differ: t-SNE's perplexity influences the effective neighborhood size, while UMAP's n_neighbors and min_dist respectively set the local connectivity scale and how tightly points can pack. Used wisely, both give insightful, visually separable clusters that help you understand structure, outliers, and relationships in data.
Intuition & Analogies
Imagine plotting the locations of people in a huge city by their interests, habits, or preferences—features you cannot see directly. You want a 2D map where friends (similar people) stay near each other. In the original high-dimensional world, friendship strength decays smoothly with dissimilarity. t-SNE converts this idea into probabilities: for each person, it assigns a probability distribution over all other people that emphasizes close friends more than distant ones. Then it looks for a 2D placement so that, in 2D, the friendship probabilities look similar. The catch is that squeezing many points into 2D can cause crowding. t-SNE's fix is to use a heavy-tailed Student-t distribution in 2D, which makes far-away points exert less force than a Gaussian would. This lets clusters spread out naturally and reduces the tendency for everything to collapse into the center.
UMAP thinks in terms of neighborhoods and roads. First, it builds a road map: each person connects to their k closest neighbors, and the thickness of each road reflects how strongly they are neighbors (a fuzzy membership strength). This creates a soft, weighted graph of local relationships. Next, UMAP draws a 2D map and asks: if two people are connected by a thick road in the original map, they should be close in 2D; if they are not connected, they shouldn't be too close by accident. It optimizes a cross-entropy score that rewards keeping connected pairs close and, through negative sampling, discourages unconnected pairs from becoming near neighbors. Parameters like n_neighbors decide how many roads each person has, and min_dist says how tightly neighborhoods can be drawn in 2D. The result is a map that respects local streets while still allowing overall layout to emerge.
Formal Definition
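The formal statement did not survive extraction here; a standard formulation, consistent with the Key Formulas listed below (following van der Maaten & Hinton 2008 and McInnes et al. 2018), is:

```latex
% t-SNE: given joint probabilities p_{ij} from Gaussian kernels,
% find the embedding Y = (y_1, \dots, y_n) minimizing the KL divergence
\min_{Y}\; \mathrm{KL}(P \,\|\, Q)
  = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}
              {\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

% UMAP: given fuzzy graph weights w_{ij}, minimize the cross-entropy
\min_{Y}\; \sum_{(i,j)} \Big[
    w_{ij} \log \frac{w_{ij}}{q_{ij}}
  + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - q_{ij}} \Big],
\qquad
q_{ij} = \big(1 + a \|y_i - y_j\|^{2b}\big)^{-1}
```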
When to Use
Use t-SNE or UMAP when you want to visualize high-dimensional data (text embeddings, images, genetic profiles) in 2D or 3D while emphasizing local neighborhoods. t-SNE often excels at clear separation of clusters due to its heavy-tailed low-dimensional kernel and the early exaggeration phase, making it a top choice for small to medium datasets when you can afford O(n^2) work or have Barnes–Hut/FFT approximations available. If you care primarily about visual cluster purity and local neighborhoods, t-SNE is a strong default.
UMAP is typically faster and more scalable, especially when paired with approximate neighbor search (e.g., NN-descent). It controls the balance between local and global structure via n_neighbors and allows you to regulate how tightly points pack via min_dist. If you need to handle larger datasets, retain more of the global manifold (e.g., continuous trajectories), or embed new points into an existing model, UMAP is often the better choice. In pipelines where you may later perform clustering or nearest-neighbor queries in the embedding, UMAP's graph-based framework and transform capability are advantageous. For both methods, always standardize features, reduce obvious noise (e.g., with PCA), and try several hyperparameters to validate stability.
⚠️ Common Mistakes
- Interpreting absolute positions or axes: Only relative distances and neighborhoods are meaningful. Rotations, reflections, or translations of the embedding are arbitrary.
- Using raw, unscaled features: Euclidean distances are sensitive to scale. Standardize or whiten features, and consider PCA to denoise before t-SNE/UMAP.
- Expecting global geometry: t-SNE especially does not preserve global distances. Far-away clusters or cluster sizes can be misleading about true densities or distances.
- Poor hyperparameter choices: Too small or too large perplexity (t-SNE) or n_neighbors/min_dist (UMAP) can fragment clusters or over-merge them. Start with perplexity in [5, 50] and n_neighbors in [10, 50].
- Skipping early exaggeration in t-SNE: It improves separation of clusters early on; omitting or misusing it hurts results.
- Quadratic implementations on large n: Naive O(n^2) t-SNE quickly becomes infeasible. Use Barnes–Hut or FFT-based (FIt-SNE) methods; for UMAP, use NN-descent for kNN.
- Insufficient iterations or unstable learning rates: Both methods need careful optimization. Use momentum in t-SNE and appropriate learning rates; in UMAP, tune epochs and negative sampling rate.
- Ignoring randomness: Different seeds can produce different layouts. For reproducibility and comparison, fix the seed and run multiple trials.
- Overtrusting cluster separations for decision making: These are visualization tools; validate findings with quantitative methods and out-of-sample checks.
Key Formulas
t-SNE Conditional Probability
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}
Explanation: Defines how likely j is to be chosen as a neighbor of i under a Gaussian kernel. Each point i has its own bandwidth \sigma_i tuned to match a target perplexity.
Perplexity and Entropy
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}
Explanation: Entropy in bits determines the effective number of neighbors (perplexity). Matching a user-provided perplexity controls local scale in t-SNE.
Symmetric Joint Probabilities
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
Explanation: t-SNE uses a symmetric matrix P by averaging conditionals and normalizing by 2n. This prevents directional biases.
Student-t Low-Dim Similarities
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
Explanation: In the embedding, similarities come from a heavy-tailed kernel that reduces crowding by allowing moderate influence from distant pairs.
t-SNE Objective (KL Divergence)
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
Explanation: Minimizing KL(P||Q) makes low-dimensional similarities q_{ij} match high-dimensional similarities p_{ij}, emphasizing local structure.
t-SNE Gradient
\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\,(1 + \|y_i - y_j\|^2)^{-1}
Explanation: The gradient pulls points together if p_{ij} > q_{ij} and pushes them apart otherwise. The heavy-tail denominator tempers forces at larger distances.
UMAP Directed Membership
\mu_{j|i} = \begin{cases} 1 & d(x_i, x_j) \le \rho_i \\ \exp\!\big(-(d(x_i, x_j) - \rho_i)/\sigma_i\big) & \text{otherwise} \end{cases}
Explanation: Directed edge strengths from i to j are 1 within a local threshold \rho_i and decay exponentially beyond it, with scale \sigma_i tuned per point.
UMAP Fuzzy Union
w_{ij} = \mu_{j|i} + \mu_{i|j} - \mu_{j|i}\,\mu_{i|j}
Explanation: Combines directed memberships into a symmetric weight capturing mutual neighborhood strength. High only if both directions are strong.
UMAP Low-Dimensional Similarity
q_{ij} = \big(1 + a\,\|y_i - y_j\|^{2b}\big)^{-1}
Explanation: UMAP uses a parametric curve with parameters a and b (fitted from min_dist and spread) that controls how quickly similarity decays with distance.
UMAP Cross-Entropy Objective
C = \sum_{(i,j)} \Big[ w_{ij} \log \frac{w_{ij}}{q_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - q_{ij}} \Big]
Explanation: Encourages connected pairs (high w_{ij}) to be close (high q_{ij}) while discouraging unconnected pairs from being too close.
Squared Euclidean Distance
d^2(x, y) = \sum_{t=1}^{d} (x_t - y_t)^2
Explanation: Basic distance measure used for both methods by default. Preprocessing to equalize feature scales is crucial.
Complexity Analysis
Naive t-SNE evaluates all pairwise interactions, costing O(n^2) time and memory per gradient iteration. The Barnes–Hut approximation reduces the per-iteration force computation to O(n log n), and FFT-based interpolation (FIt-SNE) scales further, to roughly linear in n. UMAP's dominant costs are approximate kNN-graph construction (empirically near-linear with NN-descent) and the SGD phase, which is proportional to the number of epochs times the number of edges (about n times n_neighbors), kept cheap by negative sampling.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute squared Euclidean distance matrix (n x n)
vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size();
    int d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; i++){
        for(int j = i + 1; j < n; j++){
            double s = 0.0;
            for(int t = 0; t < d; t++){
                double diff = X[i][t] - X[j][t];
                s += diff * diff;
            }
            D[i][j] = D[j][i] = s;
        }
    }
    return D;
}

// For each i, find beta = 1/(2 sigma_i^2) such that perplexity matches target.
// Returns conditional probabilities Pcond (n x n) with zeros on diagonal and sigmas.
pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter = 50, double tol = 1e-5){
    int n = (int)D.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    vector<double> sigmas(n, 0.0);
    double logU = log2(perplexity); // target entropy in bits

    for(int i = 0; i < n; i++){
        double beta = 1.0; // 1/(2*sigma^2)
        double betamin = -numeric_limits<double>::infinity();
        double betamax = numeric_limits<double>::infinity();
        // Binary search for beta
        for(int it = 0; it < max_iter; ++it){
            // Compute row probabilities with current beta
            double sumP = 0.0;
            for(int j = 0; j < n; j++){
                if(i == j){ P[i][j] = 0.0; continue; }
                double val = exp(-beta * D[i][j]);
                P[i][j] = val;
                sumP += val;
            }
            // Normalize (fall back to uniform if everything underflowed)
            if(sumP == 0.0){
                for(int j = 0; j < n; j++) if(i != j) P[i][j] = 1.0 / (n - 1);
                break;
            }
            for(int j = 0; j < n; j++) if(i != j) P[i][j] /= sumP;
            // Compute entropy H in bits
            double H = 0.0;
            for(int j = 0; j < n; j++)
                if(i != j && P[i][j] > 0.0) H -= P[i][j] * (log(P[i][j]) / log(2.0));
            // Check and update beta bounds
            double Hdiff = H - logU;
            if(fabs(Hdiff) < tol) break;
            if(Hdiff > 0){ // Entropy too high -> distribution too broad -> increase beta
                betamin = beta;
                beta = isinf(betamax) ? beta * 2.0 : 0.5 * (beta + betamax);
            } else {       // Entropy too low -> distribution too peaky -> decrease beta
                betamax = beta;
                beta = isinf(betamin) ? beta * 0.5 : 0.5 * (beta + betamin);
            }
        }
        // Save sigma from beta
        sigmas[i] = sqrt(1.0 / (2.0 * max(beta, 1e-12)));
    }

    return {P, sigmas};
}

// Symmetrize conditionals into joint probabilities P_ij = (p_{i|j} + p_{j|i})/(2n)
vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n = (int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; i++){
        for(int j = i + 1; j < n; j++){
            double v = (Pcond[i][j] + Pcond[j][i]) / (2.0 * n);
            P[i][j] = P[j][i] = v;
        }
    }
    return P;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build a tiny 3D dataset with 3 clusters
    int n = 60, d = 3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 0.3);
    for(int i = 0; i < n; i++){
        int c = i / 20; // cluster id 0,1,2
        for(int t = 0; t < d; t++){
            double center = (c == 0 ? -2.0 : (c == 1 ? 0.0 : 2.0));
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    double perplexity = 30.0;
    auto pr = tsneConditionalP(D, perplexity);
    auto Pcond = pr.first;
    auto sigmas = pr.second;
    auto P = symmetrizeP(Pcond);

    // Simple diagnostics
    cout << fixed << setprecision(6);
    cout << "Average sigma: ";
    double avg = 0;
    for(double s : sigmas) avg += s;
    avg /= sigmas.size();
    cout << avg << "\n";

    // Check that P sums to ~1 over i != j
    double sumP = 0.0;
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++)
            if(i != j) sumP += P[i][j];
    cout << "Sum P (off-diagonal) = " << sumP << " (should be ~1.0)\n";

    // Show a few P values across clusters
    cout << "P[0][1] (likely same cluster): " << P[0][1] << "\n";
    cout << "P[0][50] (likely different cluster): " << P[0][50] << "\n";
    return 0;
}
```
This program constructs a small 3D dataset, computes the squared pairwise distances, finds per-point Gaussian bandwidths via binary search to match a target perplexity, and forms the symmetric t-SNE probability matrix P. It prints diagnostics to verify normalization and to compare within-cluster vs. across-cluster similarities.
```cpp
#include <bits/stdc++.h>
using namespace std;

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size(), d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; i++) for(int j = i + 1; j < n; j++){
        double s = 0.0;
        for(int t = 0; t < d; t++){ double diff = X[i][t] - X[j][t]; s += diff * diff; }
        D[i][j] = D[j][i] = s;
    }
    return D;
}

pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter = 50, double tol = 1e-5){
    int n = (int)D.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    vector<double> sigmas(n, 0.0);
    double logU = log2(perplexity);
    for(int i = 0; i < n; i++){
        double beta = 1.0;
        double betamin = -numeric_limits<double>::infinity();
        double betamax = numeric_limits<double>::infinity();
        for(int it = 0; it < max_iter; ++it){
            double sumP = 0.0;
            for(int j = 0; j < n; j++){
                if(i == j){ P[i][j] = 0.0; continue; }
                double v = exp(-beta * D[i][j]); P[i][j] = v; sumP += v;
            }
            if(sumP == 0){ for(int j = 0; j < n; j++) if(i != j) P[i][j] = 1.0 / (n - 1); break; }
            for(int j = 0; j < n; j++) if(i != j) P[i][j] /= sumP;
            double H = 0.0;
            for(int j = 0; j < n; j++)
                if(i != j && P[i][j] > 0) H -= P[i][j] * (log(P[i][j]) / log(2.0));
            double diff = H - logU;
            if(fabs(diff) < tol) break;
            if(diff > 0){ betamin = beta; beta = isinf(betamax) ? beta * 2.0 : 0.5 * (beta + betamax); }
            else        { betamax = beta; beta = isinf(betamin) ? beta * 0.5 : 0.5 * (beta + betamin); }
        }
        sigmas[i] = sqrt(1.0 / (2.0 * max(beta, 1e-12)));
    }
    return {P, sigmas};
}

vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n = (int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; i++) for(int j = i + 1; j < n; j++){
        double v = (Pcond[i][j] + Pcond[j][i]) / (2.0 * n);
        P[i][j] = P[j][i] = v;
    }
    return P;
}

// One gradient step: compute Q numerators (t-kernel), normalization Z,
// accumulate gradients, and update positions with momentum.
void tsneEpoch(vector<vector<double>>& Y, const vector<vector<double>>& P,
               double learning_rate, double momentum, vector<vector<double>>& velocity){
    int n = (int)Y.size();
    // Pairwise 1/(1 + dist^2) and Z
    vector<vector<double>> num(n, vector<double>(n, 0.0));
    double Z = 0.0;
    for(int i = 0; i < n; i++){
        for(int j = i + 1; j < n; j++){
            double dx = Y[i][0] - Y[j][0], dy = Y[i][1] - Y[j][1];
            double d2 = dx * dx + dy * dy;
            double v = 1.0 / (1.0 + d2);
            num[i][j] = num[j][i] = v;
            Z += 2.0 * v;
        }
    }
    // Gradients of KL(P||Q)
    vector<vector<double>> grad(n, vector<double>(2, 0.0));
    for(int i = 0; i < n; i++){
        for(int j = 0; j < n; j++) if(i != j){
            double qij = num[i][j] / Z; // low-dim prob
            double pij = P[i][j];
            double dx = Y[i][0] - Y[j][0], dy = Y[i][1] - Y[j][1];
            double d2 = dx * dx + dy * dy;
            double coeff = 4.0 * (pij - qij) / (1.0 + d2);
            grad[i][0] += coeff * dx;
            grad[i][1] += coeff * dy;
        }
    }
    // Update with momentum; note the minus sign: we step AGAINST the
    // gradient to minimize the KL divergence (descent, not ascent)
    for(int i = 0; i < n; i++){
        for(int a = 0; a < 2; a++){
            velocity[i][a] = momentum * velocity[i][a] - learning_rate * grad[i][a];
            Y[i][a] += velocity[i][a];
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Small dataset: two 3D clusters
    int n = 80, d = 3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(0);
    normal_distribution<double> N(0.0, 0.35);
    for(int i = 0; i < n; i++){
        int c = i / 40;
        for(int t = 0; t < d; t++){
            double center = (c == 0 ? -2.0 : 2.0);
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    auto pr = tsneConditionalP(D, 30.0);
    auto P = symmetrizeP(pr.first);

    // Initialize Y randomly in a small range
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.0001, 0.0001);
    for(int i = 0; i < n; i++){ Y[i][0] = U(rng); Y[i][1] = U(rng); }
    vector<vector<double>> vel(n, vector<double>(2, 0.0));

    double lr = 200.0, mom = 0.5, exaggeration = 12.0;
    int iters = 400;

    // Apply early exaggeration by scaling P for the first phase
    vector<vector<double>> Pex = P;
    for(int i = 0; i < n; i++) for(int j = 0; j < n; j++) Pex[i][j] *= exaggeration;

    for(int t = 0; t < iters; t++){
        if(t == 250){ mom = 0.8; lr = 100.0; Pex = P; } // drop exaggeration
        tsneEpoch(Y, Pex, lr, mom, vel);
        if((t + 1) % 100 == 0){
            // Report a simple spread metric (coordinate variances)
            double mx = 0, my = 0;
            for(auto& y : Y){ mx += y[0]; my += y[1]; }
            mx /= n; my /= n;
            double vx = 0, vy = 0;
            for(auto& y : Y){ vx += (y[0] - mx) * (y[0] - mx); vy += (y[1] - my) * (y[1] - my); }
            vx /= n; vy /= n;
            cout << "Iter " << (t + 1) << ": varX=" << vx << ", varY=" << vy << "\n";
        }
    }

    // Print first few embedded points
    cout << fixed << setprecision(4);
    for(int i = 0; i < 5; i++)
        cout << "Y[" << i << "] = (" << Y[i][0] << ", " << Y[i][1] << ")\n";
    return 0;
}
```
This standalone demo computes t-SNE probabilities P from data, initializes a 2D embedding, and performs naive O(n^2) gradient steps with an early-exaggeration phase. It prints progress and a few coordinates. For larger datasets, replace the naive force computation with Barnes–Hut or FIt-SNE.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Edge{ int i, j; double w; };

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size(), d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i = 0; i < n; i++) for(int j = i + 1; j < n; j++){
        double s = 0.0;
        for(int t = 0; t < d; t++){ double diff = X[i][t] - X[j][t]; s += diff * diff; }
        D[i][j] = D[j][i] = s;
    }
    return D;
}

// Get k nearest neighbors (indices and distances) for each point (naive O(n^2))
vector<vector<pair<int,double>>> knn(const vector<vector<double>>& D, int k){
    int n = (int)D.size();
    vector<vector<pair<int,double>>> nbrs(n);
    for(int i = 0; i < n; i++){
        vector<pair<double,int>> cand; cand.reserve(n - 1);
        for(int j = 0; j < n; j++) if(i != j) cand.push_back({D[i][j], j});
        nth_element(cand.begin(), cand.begin() + k, cand.end());
        cand.resize(k);
        sort(cand.begin(), cand.end()); // sort for stable kth distance
        nbrs[i].reserve(k);
        for(auto& p : cand) nbrs[i].push_back({p.second, sqrt(max(p.first, 0.0))}); // Euclidean distance
    }
    return nbrs;
}

// Solve sigma_i so that sum_j exp(-(d_ij - rho_i)/sigma_i) ~ log2(k)
double solveSigma(const vector<double>& dists, double rho, double target, int iters = 32){
    // Binary search on sigma in (1e-3, 1e3)
    double lo = 1e-3, hi = 1e3;
    for(int it = 0; it < iters; ++it){
        double mid = sqrt(lo * hi);
        double s = 0.0;
        for(double d : dists) s += (d <= rho) ? 1.0 : exp(-(d - rho) / mid);
        if(s > target) hi = mid; else lo = mid;
    }
    return (lo + hi) * 0.5;
}

// Build UMAP fuzzy union weights w_ij
vector<Edge> buildUMAPGraph(const vector<vector<double>>& D, int k){
    int n = (int)D.size();
    auto nbrs = knn(D, k);

    vector<double> rho(n, 0.0), sigma(n, 1.0);
    for(int i = 0; i < n; i++){
        // rho_i: distance to the nearest neighbor, used as the local threshold
        rho[i] = nbrs[i].empty() ? 0.0 : nbrs[i][0].second;
        vector<double> dists; dists.reserve(nbrs[i].size());
        for(auto& p : nbrs[i]) dists.push_back(p.second);
        double target = log2(max(2, k));
        sigma[i] = solveSigma(dists, rho[i], target);
    }

    // Directed memberships mu_{j|i} for j in N_k(i)
    vector<vector<pair<int,double>>> directed(n);
    for(int i = 0; i < n; i++){
        for(auto& p : nbrs[i]){
            int j = p.first; double dij = p.second;
            double mu = (dij <= rho[i]) ? 1.0 : exp(-(dij - rho[i]) / max(1e-12, sigma[i]));
            directed[i].push_back({j, mu});
        }
    }

    // Fuzzy union: w_ij = mu_ij + mu_ji - mu_ij*mu_ji.
    // Merge both directions via a map keyed on the unordered pair (i,j).
    unordered_map<long long, pair<double,double>> tmp; // key -> (mu_ij, mu_ji)
    auto key = [&](int a, int b){ return ((long long)min(a, b) << 32) | (long long)max(a, b); };

    for(int i = 0; i < n; i++) for(auto& p : directed[i]){
        int j = p.first; double mu = p.second; long long kx = key(i, j);
        if(i < j) tmp[kx].first  = max(tmp[kx].first,  mu);
        else      tmp[kx].second = max(tmp[kx].second, mu);
    }

    vector<Edge> edges; edges.reserve(tmp.size());
    for(auto& kv : tmp){
        int i = (int)(kv.first >> 32);
        int j = (int)(kv.first & 0xFFFFFFFF);
        double a = kv.second.first, b = kv.second.second;
        double w = a + b - a * b; // fuzzy union
        if(w > 0) edges.push_back({i, j, w});
    }
    return edges;
}

// One simplified training epoch (stochastic) for UMAP embedding in 2D
void umapTrainEpoch(vector<vector<double>>& Y, const vector<Edge>& edges,
                    double a, double b, double lr, int negative_sample_rate,
                    mt19937& rng){
    int n = (int)Y.size();
    uniform_int_distribution<int> UnifIdx(0, n - 1);
    vector<int> order(edges.size()); iota(order.begin(), order.end(), 0);
    shuffle(order.begin(), order.end(), rng);

    for(int idx : order){
        const Edge& e = edges[idx]; int i = e.i, j = e.j; double w = e.w;
        // Attractive update for (i,j)
        double dx = Y[i][0] - Y[j][0], dy = Y[i][1] - Y[j][1];
        double dist = sqrt(dx * dx + dy * dy) + 1e-7;
        double pij = 1.0 / (1.0 + a * pow(dist, 2.0 * b));
        // Gradient coefficient for the attractive term (approx; simplified
        // from the UMAP reference, damped by pij). Note grad_coeff < 0,
        // so adding lr*g below moves the pair TOGETHER.
        double grad_coeff = -2.0 * a * b * w * pow(dist, 2.0 * b - 2.0) * pij;
        double gx = grad_coeff * dx, gy = grad_coeff * dy;
        Y[i][0] += lr * gx; Y[i][1] += lr * gy;
        Y[j][0] -= lr * gx; Y[j][1] -= lr * gy;

        // Negative samples: repel random non-neighbors
        for(int r = 0; r < negative_sample_rate; ++r){
            int jn = UnifIdx(rng); if(jn == i) jn = (jn + 1) % n;
            double dxn = Y[i][0] - Y[jn][0], dyn = Y[i][1] - Y[jn][1];
            double dn = sqrt(dxn * dxn + dyn * dyn) + 1e-7;
            double pneg = 1.0 / (1.0 + a * pow(dn, 2.0 * b));
            // Repulsive gradient (pushing apart); scaled to be weaker at long range
            double rep_coeff = 2.0 * b * pneg * (1.0 - pneg) / (dn + 1e-3);
            double grx = rep_coeff * dxn, gry = rep_coeff * dyn;
            Y[i][0]  += lr * grx; Y[i][1]  += lr * gry;
            Y[jn][0] -= lr * grx; Y[jn][1] -= lr * gry;
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Synthetic dataset: three 3D clusters
    int n = 90, d = 3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(123);
    normal_distribution<double> N(0.0, 0.35);
    for(int i = 0; i < n; i++){
        int c = i / 30;
        double center = (c == 0 ? -3.0 : (c == 1 ? 0.0 : 3.0));
        for(int t = 0; t < d; t++) X[i][t] = center + N(rng);
    }

    auto D = squaredDistanceMatrix(X);
    int n_neighbors = 15;
    auto edges = buildUMAPGraph(D, n_neighbors);

    // Embedding initialization
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.01, 0.01);
    for(int i = 0; i < n; i++){ Y[i][0] = U(rng); Y[i][1] = U(rng); }

    // Typical (a,b) for min_dist ~ 0.1 per UMAP defaults (pre-fitted constants)
    double a = 1.929, b = 0.7915;
    double lr = 1.0; int epochs = 200, neg_rate = 5;

    for(int ep = 0; ep < epochs; ++ep){
        umapTrainEpoch(Y, edges, a, b, lr, neg_rate, rng);
        if((ep + 1) % 50 == 0){
            // Report a simple spread metric
            double mx = 0, my = 0;
            for(auto& y : Y){ mx += y[0]; my += y[1]; }
            mx /= n; my /= n;
            double vx = 0, vy = 0;
            for(auto& y : Y){ vx += (y[0] - mx) * (y[0] - mx); vy += (y[1] - my) * (y[1] - my); }
            vx /= n; vy /= n;
            cout << "Epoch " << (ep + 1) << ": varX=" << vx << ", varY=" << vy << "\n";
        }
    }

    cout << fixed << setprecision(4);
    for(int i = 0; i < 5; i++)
        cout << "Y[" << i << "]=(" << Y[i][0] << "," << Y[i][1] << ")\n";
    return 0;
}
```
This educational UMAP sketch builds a kNN graph, computes fuzzy union weights, and runs a simplified SGD with negative sampling to optimize a 2D embedding. The attractive and repulsive gradients are approximations of the reference implementation; they illustrate the mechanics but are not tuned for large-scale use.