Positional Encoding Mathematics
Key Points
- Sinusoidal positional encoding represents each token's position using pairs of sine and cosine waves at exponentially spaced frequencies.
- The key trick is that shifting position by k is equivalent to rotating each sine–cosine pair by the angle k·ω_i, which preserves relative position information linearly.
- The dot product between two positional encodings depends only on their relative distance, not their absolute positions.
- Frequencies are chosen on a log scale so the model can represent both short-range and long-range relationships.
- Implementation is simple: alternate sin and cos values across dimensions using ω_i for i = 0, …, (d/2 − 1).
- Everything is computed in radians and is deterministic, so it extrapolates to sequence lengths longer than those seen during training.
- Numerically, double precision and careful indexing help avoid subtle bugs and precision drift.
- Time and space costs are linear in sequence length times embedding dimension, which is efficient in practice.
Prerequisites
- Trigonometric identities — The sin and cos addition formulas justify the rotation property.
- Vectors and dot products — Understanding similarity and how attention scores relate to inner products.
- Matrices and linear transformations — Rotations as linear maps explain shift behavior across sin–cos pairs.
- Big-O notation — Analyzing time and space for constructing and using positional encodings.
- C++ basics and the standard library — Implementing arrays/vectors, loops, and math functions reliably.
- Exponentials and logarithms — Understanding logarithmic spacing of frequencies and pow(10000, exponent).
- Floating-point arithmetic — Avoiding precision pitfalls when p is large and when comparing results.
Detailed Explanation
Overview
Positional encoding is how Transformers tell where each token is in a sequence. Because attention itself is order-agnostic, we inject position information into token embeddings. The sinusoidal version builds a position vector from sine and cosine functions at different frequencies. Each position p is mapped to a d-dimensional vector that alternates sin and cos values with increasing wavelengths. This encoding has two crucial properties: it is deterministic (no learned parameters) and it gracefully extrapolates to longer sequences. More importantly, relative position differences appear linearly in the representation: shifting a position by k corresponds to applying a simple rotation to each sine–cosine pair. This means attention layers (which are linear before softmax) can infer relative distances without learning explicit positional tables.

Unlike learned embeddings, sinusoidal encodings use a fixed formula, which makes them lightweight and easy to reproduce. Frequencies are spaced logarithmically, allowing high frequencies to capture fine-grained local order and low frequencies to capture long-range structure. The dot product between two positional encodings depends on the difference p−q, not on p or q alone, making relative comparisons natural. This balance of simplicity, extrapolation, and linear relative-position behavior explains why sinusoidal positional encodings remain a foundational choice in many Transformer variants.
Intuition & Analogies
Imagine labeling seats in a theater with colors that cycle in patterns: fast-changing stripes for the front rows (blue–red–blue–red every few seats) and slow-changing gradients for the entire hall (from light to dark over hundreds of seats). If you look at a seat's combined color pattern (from all stripes and gradients), you can tell not only where it is, but also how far it is from another seat by comparing patterns.

That's sinusoidal positional encoding: we paint each position with multiple repeating signals (sine and cosine waves) at different speeds (frequencies). Fast waves change quickly between adjacent positions, encoding fine details; slow waves change slowly, encoding broad location. Now think of moving from seat p to seat p+k. For each wave, your new color is just the old color rotated by an angle proportional to k. Because rotations are linear operations, any linear layer can easily work with shifts—this is perfect for attention, which largely manipulates vectors linearly before softmax.

Also, by mixing many waves whose speeds grow on a log scale, the combined pattern is unique over a very long range, like a multi-hand clock: the second hand cycles fast, the minute hand slower, the hour hand slowest. Two times can be distinguished by looking at all hands together; similarly, two positions can be distinguished by looking at all the sine–cosine pairs together. This gives the model both local sensitivity and long-distance awareness, all from a simple deterministic formula.
Formal Definition
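For position p ∈ {0, 1, 2, …} and an even embedding dimension d, the standard sinusoidal encoding PE(p) ∈ ℝ^d is defined componentwise by:

```latex
\omega_i = 10000^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1,
```

```latex
PE(p)_{2i} = \sin(p\,\omega_i), \qquad PE(p)_{2i+1} = \cos(p\,\omega_i).
```

All angles are in radians, and the definition involves no learned parameters: the full encoding is a deterministic function of p and d.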
When to Use
Use sinusoidal positional encoding when: (1) you need a simple, parameter-free way to inject order into Transformer models; (2) you want extrapolation to longer sequences than seen in training; (3) relative position information matters and you prefer a representation where shifts correspond to linear transformations; (4) memory or parameter budget is tight and learned positional tables would add overhead. It’s particularly effective in tasks like machine translation, text classification, and time-series modeling, where sequences may extend beyond training lengths. Consider sinusoidal PE for rapid prototyping and for on-device or edge deployments where determinism and compactness are valuable. It also serves as a solid baseline to compare against learned or relative position methods. However, if your data exhibits very specific, learnable position biases (e.g., fixed-length inputs with strong absolute position semantics), learned positional embeddings or advanced relative attention mechanisms may slightly outperform sinusoidal PE. For very long contexts with aliasing risk, pair sinusoidal PE with careful frequency ranges or alternative encodings (e.g., RoPE) designed for stable long-context behavior.
⚠️Common Mistakes
- Mixing degrees and radians: C++ std::sin and std::cos expect radians. Always compute p·ω_i in radians, never degrees.
- Incorrect frequency exponent: The standard uses ω_i = 10000^{−2i/d}. A common bug is using 10000^{−i/d} or a base of 1000, which shifts the wavelength range and harms performance.
- Odd d_model handling: The encoding requires pairs; if d is odd, you must decide how to handle the final component (e.g., drop one, pad with zero). Safer: enforce even d.
- Wrong interleaving: The pattern is [sin, cos, sin, cos, ...]. Accidentally placing all sin first then all cos breaks the rotation and dot-product properties.
- Precision issues: Using float may accumulate error for large p. Prefer double for stability, especially when verifying rotation properties or extrapolating.
- Off-by-one positions: Ensure positions start at 0 and increment by 1 per token. Skipping or duplicating indices misaligns attention with content.
- Forgetting extrapolation limits: While sinusoidal PE extrapolates, extremely long sequences can still suffer numerical drift or aliasing across frequencies. Validate on your maximum intended length.
- Not normalizing embeddings: When adding PE to token embeddings, keep scales comparable (e.g., initialize token embeddings with similar variance) to avoid either term dominating.
Key Formulas
Frequency Schedule
ω_i = 10000^{−2i/d}, for i = 0, 1, …, d/2 − 1
Explanation: Defines the i-th angular frequency for the sinusoidal encoding on a logarithmic scale. Lower i gives higher frequency, higher i gives lower frequency, covering short and long ranges.
Sinusoidal Components
PE(p)_{2i} = sin(p·ω_i),  PE(p)_{2i+1} = cos(p·ω_i)
Explanation: Each position p maps to sine–cosine pairs at multiple frequencies. Interleaving sin and cos preserves phase information and enables rotation properties.
Wavelength per Dimension
λ_i = 2π/ω_i = 2π·10000^{2i/d}
Explanation: The number of positions needed to complete one full cycle at frequency ω_i. Larger i corresponds to longer wavelengths capturing long-range structure; wavelengths range from 2π at i = 0 up to roughly 2π·10000 at the largest i.
Shift as Rotation
[sin((p+k)·ω_i), cos((p+k)·ω_i)]ᵀ = R(k·ω_i)·[sin(p·ω_i), cos(p·ω_i)]ᵀ, with R(θ) = [[cos θ, sin θ], [−sin θ, cos θ]]
Explanation: Shifting position by k rotates each sine–cosine pair by the angle k·ω_i. This linear relationship lets attention layers infer relative positions easily.
Dot Product Depends on Offset
⟨PE(p), PE(q)⟩ = Σ_{i=0}^{d/2−1} cos((p−q)·ω_i)
Explanation: The similarity between two positional encodings depends only on their relative distance p−q. This property supports relative comparisons within attention.
Construction Cost
Time O(n·d), Space O(n·d)
Explanation: Building encodings for n positions in d dimensions requires computing n×d trigonometric values. Time and memory grow linearly with both n and d.
Angle Addition Identities
sin(a + b) = sin a·cos b + cos a·sin b,  cos(a + b) = cos a·cos b − sin a·sin b
Explanation: Core trigonometric identities used to derive the rotation property of sinusoidal positional encodings.
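Applying these identities with a = p·ω_i and b = k·ω_i yields the rotation property directly:

```latex
\begin{aligned}
\sin((p+k)\,\omega_i) &= \sin(p\,\omega_i)\cos(k\,\omega_i) + \cos(p\,\omega_i)\sin(k\,\omega_i),\\
\cos((p+k)\,\omega_i) &= \cos(p\,\omega_i)\cos(k\,\omega_i) - \sin(p\,\omega_i)\sin(k\,\omega_i),
\end{aligned}
```

which in matrix form is exactly a 2×2 rotation acting on the sin–cos pair:

```latex
\begin{pmatrix} \sin((p+k)\,\omega_i) \\ \cos((p+k)\,\omega_i) \end{pmatrix}
=
\begin{pmatrix} \cos(k\,\omega_i) & \sin(k\,\omega_i) \\ -\sin(k\,\omega_i) & \cos(k\,\omega_i) \end{pmatrix}
\begin{pmatrix} \sin(p\,\omega_i) \\ \cos(p\,\omega_i) \end{pmatrix}.
```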
Complexity Analysis
Constructing the full encoding table for n positions in d dimensions costs O(n·d) time and O(n·d) space (one sin or cos evaluation per entry). Applying the shift rotation to a single vector costs O(d), as does a dot product between two encodings. Because the encoding is a fixed formula, rows can also be computed on the fly in O(d) each if storing the table is undesirable.
Code Examples
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

// Generate sinusoidal positional encodings.
// Returns an n x d matrix where row p is PE(p).
vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even for sin/cos pairs");
    vector<vector<double>> pe(n, vector<double>(d, 0.0));

    // Precompute frequencies omega_i = 10000^{-2i/d} for i in [0, d/2)
    size_t half = d / 2;
    vector<double> omega(half);
    for (size_t i = 0; i < half; ++i) {
        double exponent = -2.0 * static_cast<double>(i) / static_cast<double>(d);
        omega[i] = pow(10000.0, exponent);  // radians per position
    }

    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < half; ++i) {
            double angle = static_cast<double>(p) * omega[i];
            pe[p][2*i]     = sin(angle);  // even index
            pe[p][2*i + 1] = cos(angle);  // odd index
        }
    }
    return pe;
}

int main() {
    size_t n = 8;  // positions 0..7
    size_t d = 6;  // must be even
    auto pe = sinusoidal_pe(n, d);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Positional Encoding (n=" << n << ", d=" << d << ")\n";
    for (size_t p = 0; p < n; ++p) {
        cout << "p=" << p << ": ";
        for (size_t j = 0; j < d; ++j) cout << pe[p][j] << (j+1 == d ? '\n' : ' ');
    }
    return 0;
}
Computes the standard sinusoidal positional encodings for n positions and even dimension d. Frequencies are spaced on a log scale via ω_i=10000^{−2i/d}. The encoding interleaves sin and cos so each pair forms a 2D plane that supports rotation under shifts. The code prints the table for inspection and is suitable for precomputing PE for Transformer inputs.
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even");
    size_t half = d / 2;
    vector<double> omega(half);
    for (size_t i = 0; i < half; ++i) {
        double exponent = -2.0 * static_cast<double>(i) / static_cast<double>(d);
        omega[i] = pow(10000.0, exponent);
    }
    vector<vector<double>> pe(n, vector<double>(d));
    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < half; ++i) {
            double angle = static_cast<double>(p) * omega[i];
            pe[p][2*i]   = sin(angle);
            pe[p][2*i+1] = cos(angle);
        }
    }
    return pe;
}

// Apply block-diagonal rotation corresponding to shift k.
// For each pair (2i, 2i+1): v' = R(k*omega_i) * v.
vector<double> rotate_shift(const vector<double>& v, const vector<double>& omega, long long k) {
    size_t d = v.size();
    size_t half = d / 2;
    vector<double> out(d, 0.0);
    for (size_t i = 0; i < half; ++i) {
        double theta = static_cast<double>(k) * omega[i];
        double c = cos(theta), s = sin(theta);
        double sin_p = v[2*i];
        double cos_p = v[2*i+1];
        // [sin, cos]^T -> [[c, s], [-s, c]] * [sin, cos]^T
        out[2*i]   =  c * sin_p + s * cos_p;  // sin((p+k)w)
        out[2*i+1] = -s * sin_p + c * cos_p;  // cos((p+k)w)
    }
    return out;
}

int main() {
    size_t n = 16, d = 8; long long k = 3;  // shift by k positions
    auto pe = sinusoidal_pe(n, d);

    // Rebuild omega to reuse in rotation
    vector<double> omega(d/2);
    for (size_t i = 0; i < d/2; ++i) omega[i] = pow(10000.0, -2.0 * (double)i / (double)d);

    // Compare PE(p+k) vs rotate_shift(PE(p))
    double max_err = 0.0;
    for (size_t p = 0; p + (size_t)k < n; ++p) {
        vector<double> pred = rotate_shift(pe[p], omega, k);
        const vector<double>& truth = pe[p + (size_t)k];
        for (size_t j = 0; j < d; ++j) max_err = max(max_err, fabs(pred[j] - truth[j]));
    }

    cout.setf(std::ios::fixed); cout << setprecision(10);
    cout << "Max absolute error over all comparisons: " << max_err << "\n";
    return 0;
}
This program tests the core property: shifting by k positions rotates each sin–cos pair by angle k·ω_i. We apply a block-diagonal rotation to PE(p) and compare against directly computed PE(p+k). With double precision, the maximum error is near machine epsilon, confirming the rotation identity.
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even");
    vector<vector<double>> pe(n, vector<double>(d));
    vector<double> omega(d/2);
    for (size_t i = 0; i < d/2; ++i) omega[i] = pow(10000.0, -2.0 * (double)i / (double)d);
    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < d/2; ++i) {
            double a = static_cast<double>(p) * omega[i];
            pe[p][2*i]   = sin(a);
            pe[p][2*i+1] = cos(a);
        }
    }
    return pe;
}

// Direct dot product of PE(p) and PE(q)
double dot_pe(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Closed form using sum_i cos((p-q)*omega_i)
double dot_closed_form(long long p, long long q, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d/2; ++i) {
        double omega = pow(10000.0, -2.0 * (double)i / (double)d);
        s += cos(static_cast<double>(p - q) * omega);
    }
    return s;
}

int main() {
    size_t n = 64, d = 16;  // test grid
    auto pe = sinusoidal_pe(n, d);

    double max_err = 0.0;
    for (size_t p = 0; p < n; ++p) {
        for (size_t q = 0; q < n; ++q) {
            double lhs = dot_pe(pe[p], pe[q]);
            double rhs = dot_closed_form((long long)p, (long long)q, d);
            max_err = max(max_err, fabs(lhs - rhs));
        }
    }

    cout.setf(std::ios::fixed); cout << setprecision(10);
    cout << "Max absolute error between direct dot and closed form: " << max_err << "\n";

    // Show how similarity decays with |p-q|
    size_t p0 = 10;
    cout << "Dot products vs relative offset for p0=" << p0 << ":\n";
    for (int dq = -10; dq <= 10; ++dq) {
        int q = (int)p0 + dq;
        if (q < 0 || q >= (int)n) continue;
        double s = dot_pe(pe[p0], pe[(size_t)q]);
        cout << "dq=" << setw(3) << dq << ": " << s << "\n";
    }
    return 0;
}
The program verifies that ⟨PE(p),PE(q)⟩ equals Σ_i cos((p−q)·ω_i). It also prints how similarity varies with relative offset around a reference position, illustrating that the encoding captures relative distance information naturally.