Positional Encoding Theory
Key Points
- Transformers are permutation-invariant by default, so they need positional encodings to understand word order in sequences.
- Sinusoidal positional encodings inject absolute position using sines and cosines at exponentially spaced frequencies.
- The encoding for position pos and channels 2i/2i+1 is sin(pos/10000^{2i/d_model}) and cos(pos/10000^{2i/d_model}).
- These encodings allow the model to compute relative positions linearly, thanks to trigonometric addition identities.
- They are parameter-free, deterministic, and extrapolate gracefully to longer sequences than seen in training.
- Adding positional encodings is O(n d) in time and space, negligible compared to attention’s O(n² d).
- Use sinusoidal encodings when you want simple, memory-light positional signals with good length generalization.
Prerequisites
- Vectors and matrices — Positional encodings and attention use matrix operations and vector additions.
- Trigonometric functions — Sinusoidal encodings rely on sine/cosine and angle addition identities.
- Softmax and numerical stability — Attention normalizes scores with softmax; stability matters for implementation.
- Embeddings — Understanding how tokens map to continuous vectors is essential before adding positions.
- Transformer architecture basics — Context for where and why positional encodings are added.
- Time and space complexity — To analyze the overhead of generating and applying positional encodings.
- C++ I/O and STL containers — Implementations use vectors, loops, and printing.
- Floating-point precision — Large positions and exponentials can be sensitive to precision.
Detailed Explanation
01 Overview
Hook → Imagine reading a sentence with all words shuffled: the bag of words is the same, but the meaning collapses without order.

Concept → Transformers, powered by self-attention, naturally see inputs as sets, not ordered sequences. Positional encodings are the “GPS coordinates” that tell the model where each token sits in the sequence. Sinusoidal positional encodings (from the original Transformer paper) encode absolute position using sinusoids at multiple frequencies so the model can infer both coarse and fine-grained distances. They are computed by simple formulas—no learned parameters—and then added to token embeddings before attention.

Example → If d_model = 8 and pos = 5, we compute eight numbers: sin and cos pairs at four frequencies. This vector is added to the token’s embedding at position 5, nudging attention to notice its place in line.
02 Intuition & Analogies
Hook → Think of a marching band on a football field. Each musician must know both who they are (their instrument) and exactly where they stand (their spot on the grid).

Concept → Token embeddings tell the model “who” each token is; positional encodings tell it “where” they are in the sequence. Using sinusoids is like giving every position a unique barcode made from low and high musical notes (frequencies). Low frequencies change slowly across positions and capture global order (beginning vs. end), while high frequencies change rapidly and capture local differences (neighbors). Because these barcodes are built from sines and cosines, shifting a position corresponds to a simple rotation of each sinusoidal pair. This means the model can compute relative distances with linear operations—great for attention layers which are linear before softmax.

Example → If you move 3 steps forward, each [sin, cos] pair just rotates by an angle tied to that frequency. The model can learn weights that recognize, say, “a verb tends to attend to a noun 1–3 steps back,” regardless of the absolute position, because the relative shift looks consistent across the whole field.
03 Formal Definition
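Formally, following the original Transformer formulation used throughout this page: for an even model dimension d_model, each position pos ∈ {0, 1, 2, …} is mapped to a vector PE_pos ∈ ℝ^{d_model}, defined pairwise by

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
i = 0, \dots, \tfrac{d_{model}}{2} - 1.
```

This vector is added elementwise to the token embedding at position pos before the first attention layer.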
04 When to Use
Hook → You need order, but don’t want extra parameters or training complexity.

Concept → Use sinusoidal positional encodings in Transformers when absolute order matters (language, code, time series, speech, DNA/proteins) and you want strong extrapolation to longer lengths than training. They are ideal for resource-constrained settings, deterministic pipelines, or as a baseline to compare against learned or relative schemes.

Example use cases:
- Machine translation where sentences may be longer at inference time than in training.
- Sensor and time-series models where timestamps can exceed training windows.
- Lightweight inference on CPU where parameter count must be minimal.
- Educational or research prototypes that need reproducibility without tuning extra parameters.

Prefer learned or relative encodings (e.g., rotary or bias-based) when data-specific patterns or very long-context performance require it, but sinusoidal remains a robust, simple default.
⚠️ Common Mistakes
- Mixing degrees and radians: std::sin and std::cos in C++ use radians. Feeding degrees silently produces wrong patterns. Always compute angles in radians as defined by the formula.
- Off-by-one positions: The standard uses pos starting at 0. Starting at 1 shifts all phases and can subtly degrade performance or mismatch pretrained configs.
- Odd d_model handling: The classic definition assumes an even d_model for [sin, cos] pairs. If d_model is odd, decide how to handle the leftover channel (e.g., drop it or pad), but be consistent.
- Recomputing each step with wrong max_len: Generating encodings for a smaller max_len than your longest sequence causes out-of-bounds or silent truncation. Precompute for the maximum needed length.
- Forgetting to add (not concatenate): The vanilla approach adds PE to token embeddings elementwise. Concatenation changes dimensionality and requires reinitializing projection layers.
- Precision and scaling: Using float with very large pos can accumulate error; double precision is safer. Also remember attention’s usual 1/\sqrt{d_k} scaling; positional encodings don’t replace that.
- Misinterpreting extrapolation: Sinusoidal encodes arbitrarily long positions, but model parameters were trained on finite contexts; attention may still degrade with extreme lengths even though the encoding itself generalizes.
Key Formulas
Angle per channel

\theta_{pos,i} = \frac{pos}{10000^{2i/d_{model}}}

Explanation: This defines the phase (angle) for the i-th sinusoidal pair at position pos. Higher i uses a lower frequency (longer wavelength) due to the exponential denominator.
Sinusoidal positional encoding

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

Explanation: Each position is mapped to interleaved sine and cosine values at multiple frequencies. These values are added to token embeddings to inject order information.
Wavelength per channel

\lambda_i = 2\pi \cdot 10000^{2i/d_{model}}

Explanation: This gives the period of the sinusoid for channel pair i. Larger i leads to longer wavelengths that capture coarser, global order; smaller i gives shorter wavelengths that capture finer local differences.
Scaled dot-product attention

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

Explanation: Attention weights are computed from query–key similarities scaled by 1/\sqrt{d_k} and normalized by softmax. These weights mix the values V to produce outputs.
Angle addition identities

\sin(a + b) = \sin a \cos b + \cos a \sin b, \qquad \cos(a + b) = \cos a \cos b - \sin a \sin b

Explanation: Shifting positions corresponds to adding angles. Each [sin, cos] pair transforms by a 2×2 rotation, enabling relative position computation via linear operations.
Rotation form of shifts

\begin{pmatrix} \sin(\theta_{pos,i} + \Delta_i) \\ \cos(\theta_{pos,i} + \Delta_i) \end{pmatrix} = \begin{pmatrix} \cos\Delta_i & \sin\Delta_i \\ -\sin\Delta_i & \cos\Delta_i \end{pmatrix} \begin{pmatrix} \sin\theta_{pos,i} \\ \cos\theta_{pos,i} \end{pmatrix}, \qquad \Delta_i = \frac{k}{10000^{2i/d_{model}}}

Explanation: A shift by k positions is a rotation by angle Δ_i in each sinusoidal pair. This linear structure is why sinusoidal encodings support relative-distance reasoning.
Dot product depends on relative distance

PE_{pos} \cdot PE_{pos+k} = \sum_{i=0}^{d_{model}/2 - 1} \cos\!\left(\frac{k}{10000^{2i/d_{model}}}\right)

Explanation: The dot product between two positional encodings depends only on their angle differences. This links similarity of positions to their relative distance across frequencies.
Generation complexity

T(n, d) = O(n \cdot d), \qquad S(n, d) = O(n \cdot d)

Explanation: Creating a positional encoding matrix for n positions and dimension d is linear in both. Time and space grow proportionally to n×d, small compared to attention’s cost.
Complexity Analysis
Generating the positional encoding table for n positions and model dimension d takes O(n·d) time and O(n·d) space, and adding it to the embeddings is another elementwise O(n·d) pass. Both are negligible next to self-attention’s O(n²·d) cost, so sinusoidal encodings add essentially no overhead.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <stdexcept>

using Matrix = std::vector<std::vector<double>>;

// Create an n x d_model matrix of sinusoidal positional encodings.
// PE[pos, 2i]   = sin(pos / 10000^{2i/d_model})
// PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model})
Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    if (d_model % 2 != 0) {
        throw std::invalid_argument("d_model must be even for [sin, cos] pairs.");
    }

    Matrix pe(max_len, std::vector<double>(d_model, 0.0));

    // Precompute the denominators for each pair to avoid recomputing pow in inner loops.
    std::vector<double> div_terms(d_model / 2);
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        div_terms[i] = std::pow(10000.0, exponent); // 10000^{2i/d_model}
    }

    for (std::size_t pos = 0; pos < max_len; ++pos) {
        for (std::size_t i = 0; i < d_model / 2; ++i) {
            double theta = static_cast<double>(pos) / div_terms[i];
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }

    return pe;
}

int main() {
    std::size_t max_len = 10;
    std::size_t d_model = 8; // must be even

    Matrix pe = sinusoidal_positional_encoding(max_len, d_model);

    // Print every row of the table
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t pos = 0; pos < max_len; ++pos) {
        std::cout << "pos " << pos << ": ";
        for (std::size_t j = 0; j < d_model; ++j) {
            std::cout << std::setw(10) << pe[pos][j] << ' ';
        }
        std::cout << '\n';
    }
    return 0;
}
```
This program builds the standard sinusoidal positional encoding table for a given maximum length and model dimension. It precomputes the exponential denominators for efficiency and fills an n×d matrix with interleaved sine and cosine values per position. The output shows the encoding vectors for positions 0..9 with d_model=8.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <cmath>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.02);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed_tokens(const std::vector<int>& tokens, const Matrix& E) {
    std::size_t n = tokens.size();
    std::size_t d = E[0].size();
    Matrix X(n, std::vector<double>(d));
    for (std::size_t i = 0; i < n; ++i) {
        int id = tokens[i];
        for (std::size_t j = 0; j < d; ++j) X[i][j] = E[id][j];
    }
    return X;
}

void add_positional_encoding(Matrix& X, const Matrix& PE) {
    for (std::size_t i = 0; i < X.size(); ++i) {
        for (std::size_t j = 0; j < X[0].size(); ++j) {
            X[i][j] += PE[i][j]; // add position-wise
        }
    }
}

int main() {
    // Toy example: sequence of token IDs
    std::vector<int> tokens = {3, 1, 4, 1, 5};
    std::size_t vocab_size = 10;
    std::size_t d_model = 8; // even

    Matrix E = make_embedding_matrix(vocab_size, d_model);
    Matrix PE = sinusoidal_positional_encoding(tokens.size(), d_model);

    Matrix X = embed_tokens(tokens, E);
    add_positional_encoding(X, PE);

    // Print every row to illustrate the addition
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t i = 0; i < X.size(); ++i) {
        std::cout << "pos " << i << ": ";
        for (double v : X[i]) std::cout << std::setw(10) << v << ' ';
        std::cout << '\n';
    }
    return 0;
}
```
We simulate a small embedding table and a short token sequence. After embedding tokens, we add the corresponding sinusoidal positional encoding row to each token’s vector. This mirrors the standard Transformer input step x_pos = embed(token_pos) + PE_pos.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <random>
#include <string>
#include <algorithm>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 123) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.1);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed(const std::vector<int>& ids, const Matrix& E) {
    Matrix X(ids.size(), std::vector<double>(E[0].size()));
    for (std::size_t i = 0; i < ids.size(); ++i)
        for (std::size_t j = 0; j < E[0].size(); ++j)
            X[i][j] = E[ids[i]][j];
    return X;
}

Matrix matmul(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size();
    std::size_t m = B[0].size();
    std::size_t k = B.size();
    Matrix C(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t t = 0; t < k; ++t)
            for (std::size_t j = 0; j < m; ++j)
                C[i][j] += A[i][t] * B[t][j];
    return C;
}

Matrix transpose(const Matrix& A) {
    Matrix T(A[0].size(), std::vector<double>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

void row_softmax(Matrix& A) {
    for (auto& row : A) {
        double maxv = row[0];
        for (double v : row) maxv = std::max(maxv, v);
        double sum = 0.0;
        for (double& v : row) { v = std::exp(v - maxv); sum += v; }
        for (double& v : row) v /= (sum + 1e-12);
    }
}

Matrix add(const Matrix& A, const Matrix& B) {
    Matrix C = A;
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix attention_weights_from_inputs(const Matrix& X) {
    // Scaled dot-product attention with Q=K=X (single head, no masks), return only weights
    Matrix Kt = transpose(X);
    Matrix scores = matmul(X, Kt);
    double scale = 1.0 / std::sqrt(static_cast<double>(X[0].size()));
    for (auto& row : scores) for (double& v : row) v *= scale;
    row_softmax(scores);
    return scores;
}

void print_matrix(const Matrix& A, const std::string& name) {
    std::cout << name << "\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto& row : A) {
        for (double v : row) std::cout << std::setw(8) << v << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Two sequences with same tokens but reversed order
    std::vector<int> seq1 = {1, 2, 3, 4};
    std::vector<int> seq2 = {4, 3, 2, 1};

    std::size_t vocab = 10, d_model = 8;
    Matrix E = make_embedding_matrix(vocab, d_model, 7);

    Matrix X1 = embed(seq1, E);
    Matrix X2 = embed(seq2, E);

    // Without positional encodings
    Matrix W1 = attention_weights_from_inputs(X1);
    Matrix W2 = attention_weights_from_inputs(X2);

    // With positional encodings added
    Matrix PE = sinusoidal_positional_encoding(seq1.size(), d_model);
    Matrix X1p = add(X1, PE);
    Matrix X2p = add(X2, PE);
    Matrix W1p = attention_weights_from_inputs(X1p);
    Matrix W2p = attention_weights_from_inputs(X2p);

    print_matrix(W1, "Weights seq1 (no PE)");
    print_matrix(W2, "Weights seq2 (no PE)");
    print_matrix(W1p, "Weights seq1 (with PE)");
    print_matrix(W2p, "Weights seq2 (with PE)");

    std::cout << "\nObserve: without PE, seq2's weights are largely a permutation of seq1's; "
                 "with PE, the patterns differ because order is encoded." << std::endl;
    return 0;
}
```
This program contrasts attention weights computed from embeddings alone versus embeddings plus positional encodings for two reversed sequences. Without positions, attention is permutation-equivariant, so reversing inputs mainly permutes the weight matrix. After adding sinusoidal encodings, the attention patterns change because absolute order information is now present.