How I Study AI - Learn AI Papers & Lectures the Easy Way
āš™ļøAlgorithmIntermediate

t-SNE & UMAP

Key Points

  • t-SNE and UMAP are nonlinear dimensionality-reduction methods that preserve local neighborhoods to make high-dimensional data visible in 2D or 3D.
  • t-SNE builds probabilities from Gaussian kernels in high dimensions and uses a heavy-tailed Student-t distribution in low dimensions to avoid crowding.
  • UMAP builds a fuzzy k-nearest-neighbor graph and optimizes a cross-entropy objective with negative sampling to produce an embedding.
  • t-SNE's key knob is perplexity, which roughly controls neighborhood size; UMAP's main knobs are n_neighbors and min_dist, controlling locality and tightness of clusters.
  • Naive t-SNE is O(n^2) per iteration; scalable variants use Barnes–Hut or FFT-based approximations, while UMAP scales via approximate neighbor search (e.g., NN-descent).
  • Preprocessing (standardization, removing duplicates/outliers) dramatically affects results for both methods.
  • Global geometry is not faithfully preserved; interpret local distances and cluster separations with caution.
  • Initialization, random seed, and learning-rate schedules can noticeably change the final visualization.

Prerequisites

  • Euclidean distance and vector norms — Both algorithms rely on distances to build neighborhoods and measure similarity.
  • Probability distributions and entropy — t-SNE uses conditional probabilities and perplexity; UMAP uses fuzzy memberships and cross-entropy.
  • Gradient descent and momentum — Optimization of both objectives requires iterative gradient-based updates.
  • k-Nearest neighbors and graphs — UMAP constructs and optimizes over a kNN graph; understanding graphs is essential.
  • Numerical stability and scaling — Exponentials, normalization, and step sizes can cause instability without care.
  • Principal Component Analysis (PCA) — Commonly used as a preprocessing step to denoise and reduce dimensionality before t-SNE/UMAP.
  • Random number generation — Both methods rely on randomized initialization and, for UMAP, negative sampling.

Detailed Explanation


01 Overview

t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are two of the most popular algorithms for turning complex, high-dimensional datasets into low-dimensional visualizations. Both focus on preserving local structure: if two points are close in the original space, they should remain close in the embedding. t-SNE does this by translating distances into conditional probabilities and then finding low-dimensional points that match these probabilities under a heavy-tailed Student-t distribution. UMAP builds a graph of nearest neighbors with fuzzy (soft) connection strengths and then optimizes a cross-entropy objective so that pairs that are connected in the high-dimensional graph are also close in the embedding.

The key differences are in modeling and scalability. t-SNE’s heavy-tailed low-dimensional distribution directly mitigates the ā€œcrowding problem,ā€ but naive optimization is quadratic in the number of points. UMAP constructs an explicit k-nearest-neighbor graph and uses efficient, almost linear-time neighbor finding plus negative sampling during optimization, typically making it faster on large datasets. Hyperparameters also differ: t-SNE’s perplexity influences the effective neighborhood size, while UMAP’s n_neighbors and min_dist respectively set the local connectivity scale and how tightly points can pack. Used wisely, both give insightful, visually separable clusters that help you understand structure, outliers, and relationships in data.

02 Intuition & Analogies

Imagine plotting the locations of people in a huge city by their interests, habits, or preferences—features you cannot see directly. You want a 2D map where friends (similar people) stay near each other. In the original high-dimensional world, friendship strength decays smoothly with dissimilarity. t-SNE converts this idea into probabilities: for each person, it assigns a probability distribution over all other people that emphasizes close friends more than distant ones. Then it looks for a 2D placement so that, in 2D, the friendship probabilities look similar. The catch is that squeezing many points into 2D can cause crowding. t-SNE’s fix is to use a heavy-tailed Student-t distribution in 2D, which makes far-away points exert less force than a Gaussian would. This lets clusters spread out naturally and reduces the tendency for everything to collapse into the center.

UMAP thinks in terms of neighborhoods and roads. First, it builds a road map: each person connects to their k closest neighbors, and the thickness of each road reflects how strongly they are neighbors (a fuzzy membership strength). This creates a soft, weighted graph of local relationships. Next, UMAP draws a 2D map and asks: if two people are connected by a thick road in the original map, they should be close in 2D; if they are not connected, they shouldn’t be too close by accident. It optimizes a cross-entropy score that rewards keeping connected pairs close and, through negative sampling, discourages unconnected pairs from becoming near neighbors. Parameters like n_neighbors decide how many roads each person has, and min_dist says how tightly neighborhoods can be drawn in 2D. The result is a map that respects local streets while still allowing overall layout to emerge.

03 Formal Definition

For t-SNE, given points X = {x_1, …, x_n} in R^D, define conditional probabilities with Gaussian kernels:

p_{j|i} = exp(āˆ’ā€–x_i āˆ’ x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(āˆ’ā€–x_i āˆ’ x_k‖² / (2σ_i²)), for j ≠ i, with p_{i|i} = 0.

The bandwidth σ_i is chosen per point so that the perplexity Perp(P_i) = 2^{H(P_i)} matches a user-specified value, where H(P_i) = āˆ’Ī£_j p_{j|i} logā‚‚ p_{j|i}. Symmetrize to p_{ij} = (p_{i|j} + p_{j|i}) / (2n). For a low-dimensional embedding Y = {y_i} āŠ‚ R^d, define

q_{ij} = (1 + ‖y_i āˆ’ y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k āˆ’ y_l‖²)⁻¹.

t-SNE minimizes the Kullback–Leibler divergence C = Σ_{i≠j} p_{ij} log(p_{ij} / q_{ij}) via gradient descent.

For UMAP, compute a k-nearest-neighbor graph. For each i, let ρ_i be the distance to its closest nonzero-distance neighbor, and choose σ_i so that Σ_j exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) ā‰ˆ logā‚‚(k), with membership strengths μ_{ij} = 1 if d_{ij} ≤ ρ_i and exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) otherwise. Symmetrize with the fuzzy union w_{ij} = μ_{ij} + μ_{ji} āˆ’ μ_{ij} μ_{ji}. In the embedding space, define p^{(low)}_{ij} = 1 / (1 + a‖y_i āˆ’ y_j‖^{2b}), with a and b fit from min_dist. Optimize the cross-entropy

L = Σ_{i<j} [āˆ’w_{ij} log p^{(low)}_{ij} āˆ’ (1 āˆ’ w_{ij}) log(1 āˆ’ p^{(low)}_{ij})]

using stochastic gradient descent with negative sampling.

04 When to Use

Use t-SNE or UMAP when you want to visualize high-dimensional data (text embeddings, images, genetic profiles) in 2D or 3D while emphasizing local neighborhoods. t-SNE often excels at clear separation of clusters due to its heavy-tailed low-dimensional kernel and the early exaggeration phase, making it a top choice for small to medium datasets when you can afford O(n^2) work or have Barnes–Hut/FFT approximations available. If you care primarily about visual cluster purity and local neighborhoods, t-SNE is a strong default.

UMAP is typically faster and more scalable, especially when paired with approximate neighbor search (e.g., NN-descent). It controls the balance between local and global structure via n_neighbors and allows you to regulate how tightly points pack via min_dist. If you need to handle larger datasets, retain more of the global manifold (e.g., continuous trajectories), or embed new points into an existing model, UMAP is often the better choice. In pipelines where you may later perform clustering or nearest-neighbor queries in the embedding, UMAP’s graph-based framework and transform capability are advantageous. For both methods, always standardize features, reduce obvious noise (e.g., with PCA), and try several hyperparameters to validate stability.

āš ļøCommon Mistakes

• Interpreting absolute positions or axes: Only relative distances and neighborhoods are meaningful. Rotations, reflections, or translations of the embedding are arbitrary.
• Using raw, unscaled features: Euclidean distances are sensitive to scale. Standardize or whiten features, and consider PCA to denoise before t-SNE/UMAP.
• Expecting global geometry: t-SNE especially does not preserve global distances. Far-away clusters or cluster sizes can be misleading about true densities or distances.
• Poor hyperparameter choices: Too small or too large perplexity (t-SNE) or n_neighbors/min_dist (UMAP) can fragment clusters or over-merge them. Start with perplexity in [5, 50] and n_neighbors in [10, 50].
• Skipping early exaggeration in t-SNE: It improves separation of clusters early on; omitting or misusing it hurts results.
• Quadratic implementations on large n: Naive O(n^2) t-SNE quickly becomes infeasible. Use Barnes–Hut or FFT-based (FIt-SNE) methods; for UMAP, use NN-descent for kNN.
• Insufficient iterations or unstable learning rates: Both methods need careful optimization. Use momentum in t-SNE and appropriate learning rates; in UMAP, tune epochs and the negative sampling rate.
• Ignoring randomness: Different seeds can produce different layouts. For reproducibility and comparison, fix the seed and run multiple trials.
• Overtrusting cluster separations for decision making: These are visualization tools; validate findings with quantitative methods and out-of-sample checks.

Key Formulas

t-SNE Conditional Probability

p_{j|i} = exp(āˆ’ā€–x_i āˆ’ x_j‖² / (2σ_i²)) / Σ_{k≠i} exp(āˆ’ā€–x_i āˆ’ x_k‖² / (2σ_i²))

Explanation: Defines how likely j is to be chosen as a neighbor of i under a Gaussian kernel. Each point i has its own bandwidth σi​ tuned to match a target perplexity.

Perplexity and Entropy

H(P_i) = āˆ’Ī£_j p_{j|i} logā‚‚ p_{j|i},  Perp(P_i) = 2^{H(P_i)}

Explanation: Entropy in bits determines the effective number of neighbors (perplexity). Matching a user-provided perplexity controls local scale in t-SNE.

Symmetric Joint Probabilities

p_{ij} = (p_{i|j} + p_{j|i}) / (2n)

Explanation: t-SNE uses a symmetric matrix P by averaging conditionals and normalizing by 2n. This prevents directional biases.

Student-t Low-Dim Similarities

q_{ij} = (1 + ‖y_i āˆ’ y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k āˆ’ y_l‖²)⁻¹

Explanation: In the embedding, similarities come from a heavy-tailed kernel that reduces crowding by allowing moderate influence from distant pairs.

t-SNE Objective (KL Divergence)

C=i=jāˆ‘ā€‹pij​logqij​pij​​

Explanation: Minimizing KL(P||Q) makes low-dimensional similarities qij​ match high-dimensional similarities pij​, emphasizing local structure.

t-SNE Gradient

āˆ‚C/āˆ‚y_i = 4 Σ_j (p_{ij} āˆ’ q_{ij}) (y_i āˆ’ y_j) / (1 + ‖y_i āˆ’ y_j‖²)

Explanation: The gradient pulls points together if pij​>qij​ and pushes apart otherwise. The heavy-tail denominator tempers forces at larger distances.

UMAP Directed Membership

μ_{ij} = 1 if d_{ij} ≤ ρ_i;  μ_{ij} = exp(āˆ’(d_{ij} āˆ’ ρ_i) / σ_i) if d_{ij} > ρ_i

Explanation: Directed edge strengths from i to j are 1 within a local threshold ρi​ and decay exponentially beyond it, with scale σi​ tuned per point.

UMAP Fuzzy Union

w_{ij} = μ_{ij} + μ_{ji} āˆ’ μ_{ij} μ_{ji}

Explanation: Combines directed memberships into a symmetric weight via a probabilistic OR (fuzzy union): w_{ij} is high when at least one direction is strong, and stays low only if both are weak.

UMAP Low-Dimensional Similarity

p^{(low)}_{ij} = 1 / (1 + a‖y_i āˆ’ y_j‖^{2b})

Explanation: UMAP uses a parametric curve with parameters a and b (fitted from min_dist and spread) that controls how quickly similarity decays with distance.

UMAP Cross-Entropy Objective

L = Σ_{i<j} [āˆ’w_{ij} log p^{(low)}_{ij} āˆ’ (1 āˆ’ w_{ij}) log(1 āˆ’ p^{(low)}_{ij})]

Explanation: Encourages connected pairs (high wij​) to be close (high pij(low)​) while discouraging unconnected pairs from being too close.

Squared Euclidean Distance

d²_{ij} = ‖x_i āˆ’ x_j‖² = Σ_{t=1}^{D} (x_{it} āˆ’ x_{jt})²

Explanation: Basic distance measure used for both methods by default. Preprocessing to equalize feature scales is crucial.

Complexity Analysis

Let n be the number of points and D the input dimension. Computing the full pairwise distance matrix costs O(n^2 D) time and O(n^2) space, which both t-SNE and a naive UMAP implementation may do initially.

For t-SNE, choosing per-point bandwidths (sigmas) via binary search costs O(n^2) overall (each row performs repeated kernel evaluations), so building P is O(n^2 D + n^2). The naive optimization step computes all pairwise interactions, making each iteration O(n^2) time and O(n^2) space if you explicitly form Q, or O(n^2) time and O(n) space if you stream forces. Barnes–Hut reduces this to roughly O(n log n) per iteration using a tree to aggregate far-field forces, and FIt-SNE reduces it further (empirically near-linear) using FFT-based convolutions on an interpolation grid.

UMAP's core cost is constructing a kNN graph. With naive all-pairs search, this is O(n^2 D) time and O(nk) space (storing only edges to k neighbors). With NN-descent or other approximate methods, it often approaches O(n log n) or better in practice. The fuzzy membership scaling (solving for σ_i) is O(nk) with a small constant from a short binary search per node. The optimization is linear in the number of edges: each epoch performs O(E) attractive updates plus O(E Ā· nsr) repulsive updates via negative sampling, where E ā‰ˆ nk and nsr is the negative sampling rate. Total time is therefore O(epochs Ā· E Ā· (1 + nsr)), and memory is O(E + nd) for the edges and the embedding.

In summary, naive t-SNE is quadratic and becomes slow for large n without approximations, while UMAP with approximate neighbors typically scales to much larger datasets with near-linear behavior in n.

Code Examples

t-SNE: Compute Perplexity-Based Conditional Probabilities and Symmetric P
#include <bits/stdc++.h>
using namespace std;

// Compute squared Euclidean distance matrix (n x n)
vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size();
    int d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++){
        D[i][i]=0.0;
        for(int j=i+1;j<n;j++){
            double s=0.0;
            for(int t=0;t<d;t++){
                double diff = X[i][t]-X[j][t];
                s += diff*diff;
            }
            D[i][j]=D[j][i]=s;
        }
    }
    return D;
}

// For each i, find beta = 1/(2 sigma_i^2) such that perplexity matches target.
// Returns conditional probabilities Pcond (n x n) with zeros on diagonal and sigmas.
pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter=50, double tol=1e-5){
    int n=(int)D.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    vector<double> sigmas(n, 0.0);
    double logU = log2(perplexity); // target entropy in bits

    for(int i=0;i<n;i++){
        double beta = 1.0; // 1/(2*sigma^2)
        double betamin = -numeric_limits<double>::infinity();
        double betamax = numeric_limits<double>::infinity();
        // Binary search for beta
        for(int it=0; it<max_iter; ++it){
            // Compute row probabilities with current beta
            double sumP = 0.0;
            for(int j=0;j<n;j++){
                if(i==j){ P[i][j]=0.0; continue; }
                double val = exp(-beta * D[i][j]);
                P[i][j] = val;
                sumP += val;
            }
            // Normalize
            if(sumP == 0.0){
                for(int j=0;j<n;j++) if(i!=j) P[i][j] = 1.0/(n-1);
                break;
            }
            for(int j=0;j<n;j++) if(i!=j) P[i][j] /= sumP;
            // Compute entropy H in bits
            double H = 0.0;
            for(int j=0;j<n;j++)
                if(i!=j && P[i][j] > 0.0) H -= P[i][j] * (log(P[i][j]) / log(2.0));
            // Check and update beta bounds
            double Hdiff = H - logU;
            if (fabs(Hdiff) < tol) break;
            if (Hdiff > 0) { // Entropy too high -> distribution too broad -> increase beta
                betamin = beta;
                if (isinf(betamax)) beta *= 2.0; else beta = 0.5*(beta + betamax);
            } else {         // Entropy too low -> distribution too peaky -> decrease beta
                betamax = beta;
                if (isinf(betamin)) beta *= 0.5; else beta = 0.5*(beta + betamin);
            }
        }
        // Save sigma from beta
        sigmas[i] = sqrt(1.0/(2.0*max(beta, 1e-12)));
    }

    return {P, sigmas};
}

// Symmetrize conditionals into joint probabilities P_ij = (p_{i|j} + p_{j|i})/(2n)
vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n = (int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++){
        for(int j=i+1;j<n;j++){
            double v = (Pcond[i][j] + Pcond[j][i]) / (2.0 * n);
            P[i][j]=P[j][i]=v;
        }
    }
    return P;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build a tiny 3D dataset with 3 clusters
    int n=60, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 0.3);
    for(int i=0;i<n;i++){
        int c = i/20; // cluster id 0,1,2
        for(int t=0;t<d;t++){
            double center = (c==0? -2.0 : (c==1? 0.0 : 2.0));
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    double perplexity = 30.0;
    auto pr = tsneConditionalP(D, perplexity);
    auto Pcond = pr.first; auto sigmas = pr.second;
    auto P = symmetrizeP(Pcond);

    // Simple diagnostics
    cout << fixed << setprecision(6);
    cout << "Average sigma: ";
    double avg=0; for(double s: sigmas) avg+=s; avg/=sigmas.size();
    cout << avg << "\n";

    // Check that P sums to ~1 over i!=j
    double sumP=0.0;
    for(int i=0;i<n;i++) for(int j=0;j<n;j++) if(i!=j) sumP+=P[i][j];
    cout << "Sum P (off-diagonal) = " << sumP << " (should be ~1.0)\n";

    // Show a few P values across clusters
    cout << "P[0][1] (likely same cluster): " << P[0][1] << "\n";
    cout << "P[0][50] (likely different cluster): " << P[0][50] << "\n";
    return 0;
}

This program constructs a small 3D dataset, computes the squared pairwise distances, finds per-point Gaussian bandwidths via binary search to match a target perplexity, and forms the symmetric t-SNE probability matrix P. It prints diagnostics to verify normalization and to compare within-cluster vs. across-cluster similarities.

Time: O(n^2 D) to build distances plus O(n^2) for the binary searches (each row performs repeated kernel evaluations).
Space: O(n^2) for storing the distance matrix and P.
t-SNE: One Simple Gradient-Descent Epoch in 2D (Naive O(n^2))
#include <bits/stdc++.h>
using namespace std;

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n = (int)X.size();
    int d = (int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n, 0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double s=0.0;
        for(int t=0;t<d;t++){ double diff=X[i][t]-X[j][t]; s+=diff*diff; }
        D[i][j]=D[j][i]=s;
    }
    return D;
}

pair<vector<vector<double>>, vector<double>> tsneConditionalP(
        const vector<vector<double>>& D, double perplexity,
        int max_iter=50, double tol=1e-5){
    int n=(int)D.size();
    vector<vector<double>> P(n, vector<double>(n,0.0));
    vector<double> sigmas(n,0.0);
    double logU = log2(perplexity);
    for(int i=0;i<n;i++){
        double beta=1.0;
        double betamin=-numeric_limits<double>::infinity();
        double betamax= numeric_limits<double>::infinity();
        for(int it=0; it<max_iter; ++it){
            double sumP=0.0;
            for(int j=0;j<n;j++){
                if(i==j){ P[i][j]=0.0; continue; }
                double v=exp(-beta*D[i][j]); P[i][j]=v; sumP+=v;
            }
            if(sumP==0){ for(int j=0;j<n;j++) if(i!=j) P[i][j]=1.0/(n-1); break; }
            for(int j=0;j<n;j++) if(i!=j) P[i][j]/=sumP;
            double H=0.0;
            for(int j=0;j<n;j++)
                if(i!=j && P[i][j]>0) H -= P[i][j]*(log(P[i][j])/log(2.0));
            double diff = H - logU;
            if(fabs(diff)<tol) break;
            if(diff>0){
                betamin=beta;
                beta = isinf(betamax)? beta*2.0 : 0.5*(beta+betamax);
            } else {
                betamax=beta;
                beta = isinf(betamin)? beta*0.5 : 0.5*(beta+betamin);
            }
        }
        sigmas[i]=sqrt(1.0/(2.0*max(beta,1e-12)));
    }
    return {P, sigmas};
}

vector<vector<double>> symmetrizeP(const vector<vector<double>>& Pcond){
    int n=(int)Pcond.size();
    vector<vector<double>> P(n, vector<double>(n,0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double v=(Pcond[i][j]+Pcond[j][i])/(2.0*n);
        P[i][j]=P[j][i]=v;
    }
    return P;
}

// Compute Q numerators (t-kernel) and normalization Z, and accumulate gradients.
void tsneEpoch(vector<vector<double>>& Y, const vector<vector<double>>& P,
               double learning_rate, double momentum, vector<vector<double>>& velocity){
    int n=(int)Y.size();
    // Compute pairwise 1/(1+dist2) and Z
    vector<vector<double>> num(n, vector<double>(n,0.0));
    double Z=0.0;
    for(int i=0;i<n;i++){
        for(int j=i+1;j<n;j++){
            double dx=Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
            double d2=dx*dx+dy*dy;
            double v = 1.0/(1.0 + d2);
            num[i][j]=num[j][i]=v; Z += 2.0*v;
        }
    }
    // Compute gradients: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)/(1 + d2)
    vector<vector<double>> grad(n, vector<double>(2,0.0));
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++) if(i!=j){
            double qij = num[i][j]/Z; // low-dim prob
            double pij = P[i][j];
            double dx=Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
            double d2 = dx*dx+dy*dy;
            double coeff = 4.0*(pij - qij)/(1.0 + d2);
            grad[i][0] += coeff*dx; grad[i][1] += coeff*dy;
        }
    }
    // Update with momentum; step AGAINST the gradient to minimize the KL divergence
    for(int i=0;i<n;i++){
        for(int a=0;a<2;a++){
            velocity[i][a] = momentum*velocity[i][a] - learning_rate*grad[i][a];
            Y[i][a] += velocity[i][a];
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Small dataset: two 3D clusters
    int n=80, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(0);
    normal_distribution<double> N(0.0, 0.35);
    for(int i=0;i<n;i++){
        int c=i/40;
        for(int t=0;t<d;t++){
            double center = (c==0? -2.0 : 2.0);
            X[i][t] = center + N(rng);
        }
    }

    auto D = squaredDistanceMatrix(X);
    auto pr = tsneConditionalP(D, 30.0);
    auto P = symmetrizeP(pr.first);

    // Initialize Y randomly in a small range
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.0001, 0.0001);
    for(int i=0;i<n;i++){ Y[i][0]=U(rng); Y[i][1]=U(rng); }
    vector<vector<double>> vel(n, vector<double>(2,0.0));

    double lr=200.0; double mom=0.5; double exaggeration=12.0; int iters=400;

    // Apply early exaggeration by scaling P for the first phase
    vector<vector<double>> Pex = P;
    for(int i=0;i<n;i++) for(int j=0;j<n;j++) Pex[i][j]*=exaggeration;

    for(int t=0;t<iters;t++){
        if(t==250){ mom=0.8; lr=100.0; Pex=P; } // drop exaggeration
        tsneEpoch(Y, Pex, lr, mom, vel);
        if((t+1)%100==0){
            // Report simple spread metric (variance)
            double mx=0,my=0; for(auto &y:Y){mx+=y[0]; my+=y[1];} mx/=n; my/=n;
            double vx=0,vy=0;
            for(auto &y:Y){vx+=(y[0]-mx)*(y[0]-mx); vy+=(y[1]-my)*(y[1]-my);}
            vx/=n; vy/=n;
            cout << "Iter " << (t+1) << ": varX=" << vx << ", varY=" << vy << "\n";
        }
    }

    // Print first few embedded points
    cout << fixed << setprecision(4);
    for(int i=0;i<5;i++) cout << "Y["<<i<<"] = ("<<Y[i][0]<<", "<<Y[i][1]<<")\n";
    return 0;
}

This standalone demo computes t-SNE probabilities P from data, initializes a 2D embedding, and performs naive O(n^2) gradient steps with an early-exaggeration phase. It prints progress and a few coordinates. For larger datasets, replace the naive force computation with Barnes–Hut or FIt-SNE.

Time: per iteration O(n^2) for pairwise forces; initialization adds O(n^2 D) for distances and O(n^2) for building P.
Space: O(n^2) to store P and temporary numerators; O(n) for Y and velocities.
UMAP (Simplified): Build kNN Graph, Fuzzy Weights, and One Training Pass with Negative Sampling
#include <bits/stdc++.h>
using namespace std;

struct Edge{ int i, j; double w; };

vector<vector<double>> squaredDistanceMatrix(const vector<vector<double>>& X){
    int n=(int)X.size(), d=(int)X[0].size();
    vector<vector<double>> D(n, vector<double>(n,0.0));
    for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
        double s=0.0;
        for(int t=0;t<d;t++){ double diff=X[i][t]-X[j][t]; s+=diff*diff; }
        D[i][j]=D[j][i]=s;
    }
    return D;
}

// Get k nearest neighbors (indices and distances) for each point (naive O(n^2))
vector<vector<pair<int,double>>> knn(const vector<vector<double>>& D, int k){
    int n=(int)D.size();
    vector<vector<pair<int,double>>> nbrs(n);
    for(int i=0;i<n;i++){
        vector<pair<double,int>> cand; cand.reserve(n-1);
        for(int j=0;j<n;j++) if(i!=j) cand.push_back({D[i][j], j});
        nth_element(cand.begin(), cand.begin()+k, cand.end());
        cand.resize(k);
        // Sort for a stable k-th distance
        sort(cand.begin(), cand.end());
        nbrs[i].reserve(k);
        for(auto &p: cand)
            nbrs[i].push_back({p.second, sqrt(max(p.first, 0.0))}); // store Euclidean distance
    }
    return nbrs;
}

// Solve sigma_i so that sum_j exp(-(d_ij - rho_i)/sigma_i) ~ log2(k)
double solveSigma(const vector<double>& dists, double rho, double target, int iters=32){
    // Binary search on sigma in (1e-3, 1e3)
    double lo=1e-3, hi=1e3;
    for(int it=0; it<iters; ++it){
        double mid = sqrt(lo*hi);
        double s=0.0;
        for(double d: dists){
            double val = (d<=rho)? 1.0 : exp(-(d - rho)/mid);
            s += val;
        }
        if(s > target) hi=mid; else lo=mid;
    }
    return (lo+hi)*0.5;
}

// Build UMAP fuzzy union weights w_ij
vector<Edge> buildUMAPGraph(const vector<vector<double>>& D, int k){
    int n=(int)D.size();
    auto nbrs = knn(D, k);

    vector<double> rho(n, 0.0), sigma(n, 1.0);
    for(int i=0;i<n;i++){
        // rho_i: distance to the nearest neighbor (minimal local threshold)
        double ri = nbrs[i].empty()? 0.0 : nbrs[i][0].second;
        rho[i] = ri;
        vector<double> dists; dists.reserve(nbrs[i].size());
        for(auto &p: nbrs[i]) dists.push_back(p.second);
        double target = log2(max(2, k));
        sigma[i] = solveSigma(dists, rho[i], target);
    }

    // Directed memberships mu_ij for j in N_k(i)
    vector<vector<pair<int,double>>> directed(n);
    for(int i=0;i<n;i++){
        for(auto &p: nbrs[i]){
            int j=p.first; double dij = p.second;
            double mu = (dij <= rho[i])? 1.0 : exp(-(dij - rho[i])/max(1e-12, sigma[i]));
            directed[i].push_back({j, mu});
        }
    }

    // Fuzzy union: w_ij = mu_ij + mu_ji - mu_ij * mu_ji, store only existing pairs.
    // Place all candidate edges in a map to merge both directions.
    struct PairHash { size_t operator()(const long long& x) const { return std::hash<long long>()(x); } };
    unordered_map<long long, pair<double,double>, PairHash> tmp; // key -> (mu_ij, mu_ji)
    auto key = [&](int a,int b){ return ( (long long)min(a,b) << 32 ) | (long long)max(a,b); };

    for(int i=0;i<n;i++) for(auto &p: directed[i]){
        int j=p.first; double mu=p.second; long long kx = key(i,j);
        if(i<j){ tmp[kx].first = max(tmp[kx].first, mu); }
        else   { tmp[kx].second = max(tmp[kx].second, mu); }
    }

    vector<Edge> edges; edges.reserve(tmp.size());
    for(auto &kv: tmp){
        // Recover i,j from the key
        int i = (int)(kv.first >> 32);
        int j = (int)(kv.first & 0xFFFFFFFF);
        double a = kv.second.first; double b = kv.second.second;
        double w = a + b - a*b; // fuzzy union
        if(w>0) edges.push_back({i,j,w});
    }
    return edges;
}

// One simplified training epoch (stochastic) for the UMAP embedding in 2D
void umapTrainEpoch(vector<vector<double>>& Y, const vector<Edge>& edges,
                    double a, double b, double lr, int negative_sample_rate,
                    mt19937& rng){
    int n=(int)Y.size();
    uniform_int_distribution<int> UnifIdx(0, n-1);
    vector<int> order(edges.size()); iota(order.begin(), order.end(), 0);
    shuffle(order.begin(), order.end(), rng);

    for(int ii=0; ii<(int)order.size(); ++ii){
        const Edge &e = edges[order[ii]]; int i=e.i, j=e.j; double w=e.w;
        // Attractive update for (i,j)
        double dx = Y[i][0]-Y[j][0], dy=Y[i][1]-Y[j][1];
        double dist = sqrt(dx*dx+dy*dy) + 1e-7;
        double dist2b = pow(dist, 2.0*b);
        double pij = 1.0/(1.0 + a*dist2b);
        // Gradient magnitude for attractive term (approx; simplified from UMAP reference)
        double grad_coeff = -2.0 * a * b * w * pow(dist, 2.0*b - 2.0) * pij; // damped by pij
        double gx = grad_coeff * dx; double gy = grad_coeff * dy;
        // grad_coeff is negative, so adding lr*gx moves i toward j (and j toward i)
        Y[i][0] += lr * gx; Y[i][1] += lr * gy;
        Y[j][0] -= lr * gx; Y[j][1] -= lr * gy;

        // Negative samples: repel random non-neighbors
        for(int r=0; r<negative_sample_rate; ++r){
            int jn = UnifIdx(rng); if(jn==i) jn = (jn+1)%n;
            double dxn = Y[i][0]-Y[jn][0], dyn=Y[i][1]-Y[jn][1];
            double dn = sqrt(dxn*dxn+dyn*dyn) + 1e-7;
            double dn2b = pow(dn, 2.0*b);
            double pneg = 1.0/(1.0 + a*dn2b);
            // Repulsive gradient (pushing apart); scaled to be weaker at long range
            double rep_coeff = 2.0 * b * pneg * (1.0 - pneg) / (dn + 1e-3);
            double grx = rep_coeff * dxn; double gry = rep_coeff * dyn;
            Y[i][0]  += lr * grx; Y[i][1]  += lr * gry;
            Y[jn][0] -= lr * grx; Y[jn][1] -= lr * gry;
        }
    }
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    // Synthetic dataset: three 3D clusters
    int n=90, d=3;
    vector<vector<double>> X(n, vector<double>(d));
    mt19937 rng(123);
    normal_distribution<double> N(0.0, 0.35);
    for(int i=0;i<n;i++){
        int c=i/30; double center = (c==0? -3.0 : (c==1? 0.0 : 3.0));
        for(int t=0;t<d;t++) X[i][t] = center + N(rng);
    }

    auto D = squaredDistanceMatrix(X);
    int n_neighbors = 15;
    auto edges = buildUMAPGraph(D, n_neighbors);

    // Embedding initialization
    vector<vector<double>> Y(n, vector<double>(2));
    uniform_real_distribution<double> U(-0.01, 0.01);
    for(int i=0;i<n;i++){ Y[i][0]=U(rng); Y[i][1]=U(rng); }

    // Typical (a,b) for min_dist~0.1 per UMAP defaults (pre-fitted constants)
    double a=1.929; double b=0.7915;
    double lr=1.0; int epochs=200; int neg_rate=5;

    for(int ep=0; ep<epochs; ++ep){
        umapTrainEpoch(Y, edges, a, b, lr, neg_rate, rng);
        if((ep+1)%50==0){
            // Report simple spread metric
            double mx=0,my=0; for(auto &y:Y){mx+=y[0]; my+=y[1];} mx/=n; my/=n;
            double vx=0,vy=0;
            for(auto &y:Y){vx+=(y[0]-mx)*(y[0]-mx); vy+=(y[1]-my)*(y[1]-my);}
            vx/=n; vy/=n;
            cout << "Epoch "<<(ep+1)<<": varX="<<vx<<", varY="<<vy<<"\n";
        }
    }

    cout << fixed << setprecision(4);
    for(int i=0;i<5;i++) cout<<"Y["<<i<<"]=("<<Y[i][0]<<","<<Y[i][1]<<")\n";
    return 0;
}

This educational UMAP sketch builds a kNN graph, computes fuzzy union weights, and runs a simplified SGD with negative sampling to optimize a 2D embedding. The attractive and repulsive gradients are approximations of the reference implementation; they illustrate the mechanics but are not tuned for large-scale use.

Time: O(n^2 D) to compute distances; O(n^2) to find kNN naively; O(nk) to compute fuzzy weights; and per epoch O(E Ā· (1 + nsr)) with E ā‰ˆ nk.
Space: O(n^2) if storing all distances (for simplicity here) or O(nk) for the graph plus O(n) for the embedding.
#t-sne#umap#dimensionality reduction#perplexity#k-nearest neighbors#kl divergence#student-t distribution#fuzzy simplicial set#negative sampling#barnes-hut#fit-sne#nn-descent#min_dist#cross-entropy#embedding