Positional Encoding Theory
Key Points
- Transformers are permutation-invariant by default, so they need positional encodings to understand word order in sequences.
- Sinusoidal positional encodings inject absolute position using sines and cosines at exponentially spaced frequencies.
- The encoding for position pos and channels 2i/2i+1 is sin(pos/10000^{2i/d_model}) and cos(pos/10000^{2i/d_model}).
- These encodings allow the model to compute relative positions linearly, thanks to trigonometric addition identities.
- They are parameter-free, deterministic, and extrapolate gracefully to longer sequences than seen in training.
- Adding positional encodings is O(n d) in time and space, negligible compared to attention’s O(n² d).
- Use sinusoidal encodings when you want simple, memory-light positional signals with good length generalization.
Prerequisites
- Vectors and matrices — Positional encodings and attention use matrix operations and vector additions.
- Trigonometric functions — Sinusoidal encodings rely on sine/cosine and angle addition identities.
- Softmax and numerical stability — Attention normalizes scores with softmax; stability matters for implementation.
- Embeddings — Understanding how tokens map to continuous vectors is essential before adding positions.
- Transformer architecture basics — Context for where and why positional encodings are added.
- Time and space complexity — To analyze the overhead of generating and applying positional encodings.
- C++ I/O and STL containers — Implementations use vectors, loops, and printing.
- Floating-point precision — Large positions and exponentials can be sensitive to precision.
Detailed Explanation
01 Overview
Hook → Imagine reading a sentence with all words shuffled: the bag of words is the same, but the meaning collapses without order.

Concept → Transformers, powered by self-attention, naturally see inputs as sets, not ordered sequences. Positional encodings are the “GPS coordinates” that tell the model where each token sits in the sequence. Sinusoidal positional encodings (from the original Transformer paper) encode absolute position using sinusoids at multiple frequencies so the model can infer both coarse and fine-grained distances. They are computed by simple formulas—no learned parameters—and then added to token embeddings before attention.

Example → If d_model = 8 and pos = 5, we compute eight numbers: sin and cos pairs at four frequencies. This vector is added to the token’s embedding at position 5, nudging attention to notice its place in line.
02 Intuition & Analogies
Hook → Think of a marching band on a football field. Each musician must know both who they are (their instrument) and exactly where they stand (their spot on the grid).

Concept → Token embeddings tell the model “who” each token is; positional encodings tell it “where” they are in the sequence. Using sinusoids is like giving every position a unique barcode made from low and high musical notes (frequencies). Low frequencies change slowly across positions and capture global order (beginning vs. end), while high frequencies change rapidly and capture local differences (neighbors). Because these barcodes are built from sines and cosines, shifting a position corresponds to a simple rotation of each sinusoidal pair. This means the model can compute relative distances with linear operations—great for attention layers which are linear before softmax.

Example → If you move 3 steps forward, each [sin, cos] pair just rotates by an angle tied to that frequency. The model can learn weights that recognize, say, “a verb tends to attend to a noun 1–3 steps back,” regardless of the absolute position, because the relative shift looks consistent across the whole field.
03 Formal Definition
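Formally, following the original Transformer formulation used throughout this page: for an even model dimension d_model, each position pos ∈ {0, 1, 2, …} is mapped to a vector PE_pos ∈ ℝ^{d_model}, defined pairwise by

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
i = 0, \dots, \tfrac{d_{model}}{2} - 1.
```

This vector is added elementwise to the token embedding at position pos before the first attention layer.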
04 When to Use
Hook → You need order, but don’t want extra parameters or training complexity.

Concept → Use sinusoidal positional encodings in Transformers when absolute order matters (language, code, time series, speech, DNA/proteins) and you want strong extrapolation to longer lengths than training. They are ideal for resource-constrained settings, deterministic pipelines, or as a baseline to compare against learned or relative schemes.

Example use cases:
- Machine translation where sentences may be longer at inference time than in training.
- Sensor and time-series models where timestamps can exceed training windows.
- Lightweight inference on CPU where parameter count must be minimal.
- Educational or research prototypes that need reproducibility without tuning extra parameters.

Prefer learned or relative encodings (e.g., rotary or bias-based) when data-specific patterns or very long-context performance require it, but sinusoidal remains a robust, simple default.
⚠️ Common Mistakes
- Mixing degrees and radians: std::sin and std::cos in C++ use radians. Feeding degrees silently produces wrong patterns. Always compute angles in radians as defined by the formula.
- Off-by-one positions: The standard uses pos starting at 0. Starting at 1 shifts all phases and can subtly degrade performance or mismatch pretrained configs.
- Odd d_model handling: The classic definition assumes an even d_model for [sin, cos] pairs. If d_model is odd, decide how to handle the leftover channel (e.g., drop it or pad), but be consistent.
- Recomputing each step with wrong max_len: Generating encodings for a smaller max_len than your longest sequence causes out-of-bounds or silent truncation. Precompute for the maximum needed length.
- Forgetting to add (not concatenate): The vanilla approach adds PE to token embeddings elementwise. Concatenation changes dimensionality and requires reinitializing projection layers.
- Precision and scaling: Using float with very large pos can accumulate error; double precision is safer. Also remember attention’s usual 1/\sqrt{d_k} scaling; positional encodings don’t replace that.
- Misinterpreting extrapolation: Sinusoidal encodes arbitrarily long positions, but model parameters were trained on finite contexts; attention may still degrade with extreme lengths even though the encoding itself generalizes.
Key Formulas
Angle per channel

\theta_{pos,i} = \frac{pos}{10000^{2i/d_{model}}}

Explanation: This defines the phase (angle) for the i-th sinusoidal pair at position pos. Higher i uses a lower frequency (longer wavelength) due to the exponential denominator.
Sinusoidal positional encoding

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

Explanation: Each position is mapped to interleaved sine and cosine values at multiple frequencies. These values are added to token embeddings to inject order information.
Wavelength per channel

\lambda_i = 2\pi \cdot 10000^{2i/d_{model}}

Explanation: This gives the period of the sinusoid for channel pair i. Larger i leads to longer wavelengths that capture coarser, global order; smaller i gives shorter wavelengths that capture finer local differences.
Scaled dot-product attention

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

Explanation: Attention weights are computed from query–key similarities scaled by 1/\sqrt{d_k} and normalized by softmax. These weights mix the values V to produce outputs.
Angle addition identities

\sin(a + b) = \sin a \cos b + \cos a \sin b, \qquad \cos(a + b) = \cos a \cos b - \sin a \sin b

Explanation: Shifting positions corresponds to adding angles. Each [sin, cos] pair transforms by a 2×2 rotation, enabling relative position computation via linear operations.
Rotation form of shifts

\begin{pmatrix} \sin(\theta_{pos,i} + \Delta_i) \\ \cos(\theta_{pos,i} + \Delta_i) \end{pmatrix} = \begin{pmatrix} \cos\Delta_i & \sin\Delta_i \\ -\sin\Delta_i & \cos\Delta_i \end{pmatrix} \begin{pmatrix} \sin\theta_{pos,i} \\ \cos\theta_{pos,i} \end{pmatrix}, \qquad \Delta_i = \frac{k}{10000^{2i/d_{model}}}

Explanation: A shift by k positions is a rotation by angle Δ_i in each sinusoidal pair. This linear structure is why sinusoidal encodings support relative-distance reasoning.
Dot product depends on relative distance

PE_{pos} \cdot PE_{pos+k} = \sum_{i=0}^{d_{model}/2 - 1} \cos\!\left(\frac{k}{10000^{2i/d_{model}}}\right)

Explanation: The dot product between two positional encodings depends only on their angle differences. This links similarity of positions to their relative distance across frequencies.
Generation complexity

T(n, d) = O(n \cdot d), \qquad S(n, d) = O(n \cdot d)

Explanation: Creating a positional encoding matrix for n positions and dimension d is linear in both. Time and space grow proportionally to n×d, small compared to attention’s cost.
Complexity Analysis
Generating the positional encoding table for n positions and model dimension d takes O(n·d) time and O(n·d) space, and adding it to the embeddings is another elementwise O(n·d) pass. Both are negligible next to self-attention’s O(n²·d) cost, so sinusoidal encodings add essentially no overhead.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <stdexcept>

using Matrix = std::vector<std::vector<double>>;

// Create an n x d_model matrix of sinusoidal positional encodings.
// PE[pos, 2i]   = sin(pos / 10000^{2i/d_model})
// PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model})
Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    if (d_model % 2 != 0) {
        throw std::invalid_argument("d_model must be even for [sin, cos] pairs.");
    }

    Matrix pe(max_len, std::vector<double>(d_model, 0.0));

    // Precompute the denominators for each pair to avoid recomputing pow in inner loops.
    std::vector<double> div_terms(d_model / 2);
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        div_terms[i] = std::pow(10000.0, exponent); // 10000^{2i/d_model}
    }

    for (std::size_t pos = 0; pos < max_len; ++pos) {
        for (std::size_t i = 0; i < d_model / 2; ++i) {
            double theta = static_cast<double>(pos) / div_terms[i];
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }

    return pe;
}

int main() {
    std::size_t max_len = 10;
    std::size_t d_model = 8; // must be even

    Matrix pe = sinusoidal_positional_encoding(max_len, d_model);

    // Print every row of the table
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t pos = 0; pos < max_len; ++pos) {
        std::cout << "pos " << pos << ": ";
        for (std::size_t j = 0; j < d_model; ++j) {
            std::cout << std::setw(10) << pe[pos][j] << ' ';
        }
        std::cout << '\n';
    }
    return 0;
}
```
This program builds the standard sinusoidal positional encoding table for a given maximum length and model dimension. It precomputes the exponential denominators for efficiency and fills an n×d matrix with interleaved sine and cosine values per position. The output shows the encoding vectors for positions 0..9 with d_model=8.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <cmath>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.02);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed_tokens(const std::vector<int>& tokens, const Matrix& E) {
    std::size_t n = tokens.size();
    std::size_t d = E[0].size();
    Matrix X(n, std::vector<double>(d));
    for (std::size_t i = 0; i < n; ++i) {
        int id = tokens[i];
        for (std::size_t j = 0; j < d; ++j) X[i][j] = E[id][j];
    }
    return X;
}

void add_positional_encoding(Matrix& X, const Matrix& PE) {
    for (std::size_t i = 0; i < X.size(); ++i) {
        for (std::size_t j = 0; j < X[0].size(); ++j) {
            X[i][j] += PE[i][j]; // add position-wise
        }
    }
}

int main() {
    // Toy example: sequence of token IDs
    std::vector<int> tokens = {3, 1, 4, 1, 5};
    std::size_t vocab_size = 10;
    std::size_t d_model = 8; // even

    Matrix E = make_embedding_matrix(vocab_size, d_model);
    Matrix PE = sinusoidal_positional_encoding(tokens.size(), d_model);

    Matrix X = embed_tokens(tokens, E);
    add_positional_encoding(X, PE);

    // Print every row to illustrate the addition
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t i = 0; i < X.size(); ++i) {
        std::cout << "pos " << i << ": ";
        for (double v : X[i]) std::cout << std::setw(10) << v << ' ';
        std::cout << '\n';
    }
    return 0;
}
```
We simulate a small embedding table and a short token sequence. After embedding tokens, we add the corresponding sinusoidal positional encoding row to each token’s vector. This mirrors the standard Transformer input step x_pos = embed(token_pos) + PE_pos.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <random>
#include <string>
#include <algorithm>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 123) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.1);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed(const std::vector<int>& ids, const Matrix& E) {
    Matrix X(ids.size(), std::vector<double>(E[0].size()));
    for (std::size_t i = 0; i < ids.size(); ++i)
        for (std::size_t j = 0; j < E[0].size(); ++j)
            X[i][j] = E[ids[i]][j];
    return X;
}

Matrix matmul(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size();
    std::size_t m = B[0].size();
    std::size_t k = B.size();
    Matrix C(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t t = 0; t < k; ++t)
            for (std::size_t j = 0; j < m; ++j)
                C[i][j] += A[i][t] * B[t][j];
    return C;
}

Matrix transpose(const Matrix& A) {
    Matrix T(A[0].size(), std::vector<double>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

void row_softmax(Matrix& A) {
    for (auto& row : A) {
        double maxv = row[0];
        for (double v : row) maxv = std::max(maxv, v);
        double sum = 0.0;
        for (double& v : row) { v = std::exp(v - maxv); sum += v; }
        for (double& v : row) v /= (sum + 1e-12);
    }
}

Matrix add(const Matrix& A, const Matrix& B) {
    Matrix C = A;
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix attention_weights_from_inputs(const Matrix& X) {
    // Scaled dot-product attention with Q=K=X (single head, no masks), return only weights
    Matrix Kt = transpose(X);
    Matrix scores = matmul(X, Kt);
    double scale = 1.0 / std::sqrt(static_cast<double>(X[0].size()));
    for (auto& row : scores) for (double& v : row) v *= scale;
    row_softmax(scores);
    return scores;
}

void print_matrix(const Matrix& A, const std::string& name) {
    std::cout << name << "\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto& row : A) {
        for (double v : row) std::cout << std::setw(8) << v << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Two sequences with same tokens but reversed order
    std::vector<int> seq1 = {1, 2, 3, 4};
    std::vector<int> seq2 = {4, 3, 2, 1};

    std::size_t vocab = 10, d_model = 8;
    Matrix E = make_embedding_matrix(vocab, d_model, 7);

    Matrix X1 = embed(seq1, E);
    Matrix X2 = embed(seq2, E);

    // Without positional encodings
    Matrix W1 = attention_weights_from_inputs(X1);
    Matrix W2 = attention_weights_from_inputs(X2);

    // With positional encodings added
    Matrix PE = sinusoidal_positional_encoding(seq1.size(), d_model);
    Matrix X1p = add(X1, PE);
    Matrix X2p = add(X2, PE);
    Matrix W1p = attention_weights_from_inputs(X1p);
    Matrix W2p = attention_weights_from_inputs(X2p);

    print_matrix(W1, "Weights seq1 (no PE)");
    print_matrix(W2, "Weights seq2 (no PE)");
    print_matrix(W1p, "Weights seq1 (with PE)");
    print_matrix(W2p, "Weights seq2 (with PE)");

    std::cout << "\nObserve: without PE, seq2's weights are largely a permutation of seq1's; "
                 "with PE, the patterns differ because order is encoded." << std::endl;
    return 0;
}
```
This program contrasts attention weights computed from embeddings alone versus embeddings plus positional encodings for two reversed sequences. Without positions, attention is permutation-equivariant, so reversing inputs mainly permutes the weight matrix. After adding sinusoidal encodings, the attention patterns change because absolute order information is now present.