Positional Encoding Mathematics
Key Points
- Sinusoidal positional encoding represents each token's position using pairs of sine and cosine waves at exponentially spaced frequencies.
- The key trick is that shifting position by k is equivalent to rotating each sine–cosine pair by the angle k·ω_i, which preserves relative position information linearly.
- The dot product between two positional encodings depends only on their relative distance, not their absolute positions.
- Frequencies are chosen on a log scale so the model can represent both short-range and long-range relationships.
- Implementation is simple: alternate sin and cos values across dimensions using ω_i for i = 0, …, (d/2 − 1).
- Everything is computed in radians and is deterministic, so it extrapolates to sequence lengths longer than those seen during training.
- Numerically, double precision and careful indexing help avoid subtle bugs and precision drift.
- Time and space costs are linear in sequence length times embedding dimension, which is efficient in practice.
Prerequisites
- Trigonometric identities — The sin and cos addition formulas justify the rotation property.
- Vectors and dot products — Understanding similarity and how attention scores relate to inner products.
- Matrices and linear transformations — Rotations as linear maps explain shift behavior across sin–cos pairs.
- Big-O notation — Analyzing time and space for constructing and using positional encodings.
- C++ basics and the standard library — Implementing arrays/vectors, loops, and math functions reliably.
- Exponentials and logarithms — Understanding logarithmic spacing of frequencies and pow(10000, exponent).
- Floating-point arithmetic — Avoiding precision pitfalls when p is large and when comparing results.
Detailed Explanation
Overview
Positional encoding is how Transformers tell where each token is in a sequence. Because attention itself is order-agnostic, we inject position information into token embeddings. The sinusoidal version builds a position vector from sine and cosine functions at different frequencies. Each position p is mapped to a d-dimensional vector that alternates sin and cos values with increasing wavelengths. This encoding has two crucial properties: it is deterministic (no learned parameters) and it gracefully extrapolates to longer sequences. More importantly, relative position differences appear linearly in the representation: shifting a position by k corresponds to applying a simple rotation to each sine–cosine pair. This means attention layers (which are linear before softmax) can infer relative distances without learning explicit positional tables.

Unlike learned embeddings, sinusoidal encodings use a fixed formula, which makes them lightweight and easy to reproduce. Frequencies are spaced logarithmically, allowing high frequencies to capture fine-grained local order and low frequencies to capture long-range structure. The dot product between two positional encodings depends on the difference p−q, not on p or q alone, making relative comparisons natural. This balance of simplicity, extrapolation, and linear relative-position behavior explains why sinusoidal positional encodings remain a foundational choice in many Transformer variants.
Intuition & Analogies
Imagine labeling seats in a theater with colors that cycle in patterns: fast-changing stripes for the front rows (blue–red–blue–red every few seats) and slow-changing gradients for the entire hall (from light to dark over hundreds of seats). If you look at a seat's combined color pattern (from all stripes and gradients), you can tell not only where it is, but also how far it is from another seat by comparing patterns.

That's sinusoidal positional encoding: we paint each position with multiple repeating signals (sine and cosine waves) at different speeds (frequencies). Fast waves change quickly between adjacent positions, encoding fine details; slow waves change slowly, encoding broad location. Now think of moving from seat p to seat p+k. For each wave, your new color is just the old color rotated by an angle proportional to k. Because rotations are linear operations, any linear layer can easily work with shifts—this is perfect for attention, which largely manipulates vectors linearly before softmax.

Also, by mixing many waves whose speeds grow on a log scale, the combined pattern is unique over a very long range, like a multi-hand clock: the second hand cycles fast, the minute hand slower, the hour hand slowest. Two times can be distinguished by looking at all hands together; similarly, two positions can be distinguished by looking at all the sine–cosine pairs together. This gives the model both local sensitivity and long-distance awareness, all from a simple deterministic formula.
Formal Definition
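For position p ∈ {0, 1, 2, …} and an even embedding dimension d, the standard sinusoidal encoding PE(p) ∈ ℝ^d is defined componentwise by:

```latex
\omega_i = 10000^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1,
```

```latex
PE(p)_{2i} = \sin(p\,\omega_i), \qquad PE(p)_{2i+1} = \cos(p\,\omega_i).
```

All angles are in radians, and the definition involves no learned parameters: the full encoding is a deterministic function of p and d.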
When to Use
Use sinusoidal positional encoding when: (1) you need a simple, parameter-free way to inject order into Transformer models; (2) you want extrapolation to longer sequences than seen in training; (3) relative position information matters and you prefer a representation where shifts correspond to linear transformations; (4) memory or parameter budget is tight and learned positional tables would add overhead. It’s particularly effective in tasks like machine translation, text classification, and time-series modeling, where sequences may extend beyond training lengths. Consider sinusoidal PE for rapid prototyping and for on-device or edge deployments where determinism and compactness are valuable. It also serves as a solid baseline to compare against learned or relative position methods. However, if your data exhibits very specific, learnable position biases (e.g., fixed-length inputs with strong absolute position semantics), learned positional embeddings or advanced relative attention mechanisms may slightly outperform sinusoidal PE. For very long contexts with aliasing risk, pair sinusoidal PE with careful frequency ranges or alternative encodings (e.g., RoPE) designed for stable long-context behavior.
⚠️Common Mistakes
- Mixing degrees and radians: C++ std::sin and std::cos expect radians. Always compute p·ω_i in radians, never degrees.
- Incorrect frequency exponent: The standard uses ω_i = 10000^{−2i/d}. A common bug is using 10000^{−i/d} or a base of 1000, which shifts the wavelength range and harms performance.
- Odd d_model handling: The encoding requires pairs; if d is odd, you must decide how to handle the final component (e.g., drop one, pad with zero). Safer: enforce even d.
- Wrong interleaving: The pattern is [sin, cos, sin, cos, ...]. Accidentally placing all sin first then all cos breaks the rotation and dot-product properties.
- Precision issues: Using float may accumulate error for large p. Prefer double for stability, especially when verifying rotation properties or extrapolating.
- Off-by-one positions: Ensure positions start at 0 and increment by 1 per token. Skipping or duplicating indices misaligns attention with content.
- Forgetting extrapolation limits: While sinusoidal PE extrapolates, extremely long sequences can still suffer numerical drift or aliasing across frequencies. Validate on your maximum intended length.
- Not normalizing embeddings: When adding PE to token embeddings, keep scales comparable (e.g., initialize token embeddings with similar variance) to avoid either term dominating.
Key Formulas
Frequency Schedule
ω_i = 10000^{−2i/d}, for i = 0, 1, …, d/2 − 1
Explanation: Defines the i-th angular frequency for the sinusoidal encoding on a logarithmic scale. Lower i gives higher frequency, higher i gives lower frequency, covering short and long ranges.
Sinusoidal Components
PE(p)_{2i} = sin(p·ω_i),  PE(p)_{2i+1} = cos(p·ω_i)
Explanation: Each position p maps to sine–cosine pairs at multiple frequencies. Interleaving sin and cos preserves phase information and enables rotation properties.
Wavelength per Dimension
λ_i = 2π/ω_i = 2π·10000^{2i/d}
Explanation: The number of positions needed to complete one full cycle at frequency ω_i. Larger i corresponds to longer wavelengths capturing long-range structure; wavelengths range from 2π at i = 0 up to roughly 2π·10000 at the largest i.
Shift as Rotation
[sin((p+k)·ω_i), cos((p+k)·ω_i)]ᵀ = R(k·ω_i)·[sin(p·ω_i), cos(p·ω_i)]ᵀ, with R(θ) = [[cos θ, sin θ], [−sin θ, cos θ]]
Explanation: Shifting position by k rotates each sine–cosine pair by the angle k·ω_i. This linear relationship lets attention layers infer relative positions easily.
Dot Product Depends on Offset
⟨PE(p), PE(q)⟩ = Σ_{i=0}^{d/2−1} cos((p−q)·ω_i)
Explanation: The similarity between two positional encodings depends only on their relative distance p−q. This property supports relative comparisons within attention.
Construction Cost
Time O(n·d), Space O(n·d)
Explanation: Building encodings for n positions in d dimensions requires computing n×d trigonometric values. Time and memory grow linearly with both n and d.
Angle Addition Identities
sin(a + b) = sin a·cos b + cos a·sin b,  cos(a + b) = cos a·cos b − sin a·sin b
Explanation: Core trigonometric identities used to derive the rotation property of sinusoidal positional encodings.
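Applying these identities with a = p·ω_i and b = k·ω_i yields the rotation property directly:

```latex
\begin{aligned}
\sin((p+k)\,\omega_i) &= \sin(p\,\omega_i)\cos(k\,\omega_i) + \cos(p\,\omega_i)\sin(k\,\omega_i),\\
\cos((p+k)\,\omega_i) &= \cos(p\,\omega_i)\cos(k\,\omega_i) - \sin(p\,\omega_i)\sin(k\,\omega_i),
\end{aligned}
```

which in matrix form is exactly a 2×2 rotation acting on the sin–cos pair:

```latex
\begin{pmatrix} \sin((p+k)\,\omega_i) \\ \cos((p+k)\,\omega_i) \end{pmatrix}
=
\begin{pmatrix} \cos(k\,\omega_i) & \sin(k\,\omega_i) \\ -\sin(k\,\omega_i) & \cos(k\,\omega_i) \end{pmatrix}
\begin{pmatrix} \sin(p\,\omega_i) \\ \cos(p\,\omega_i) \end{pmatrix}.
```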
Complexity Analysis
Constructing the full encoding table for n positions in d dimensions costs O(n·d) time and O(n·d) space (one sin or cos evaluation per entry). Applying the shift rotation to a single vector costs O(d), as does a dot product between two encodings. Because the encoding is a fixed formula, rows can also be computed on the fly in O(d) each if storing the table is undesirable.
Code Examples
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

// Generate sinusoidal positional encodings.
// Returns an n x d matrix where row p is PE(p).
vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even for sin/cos pairs");
    vector<vector<double>> pe(n, vector<double>(d, 0.0));

    // Precompute frequencies omega_i = 10000^{-2i/d} for i in [0, d/2)
    size_t half = d / 2;
    vector<double> omega(half);
    for (size_t i = 0; i < half; ++i) {
        double exponent = -2.0 * static_cast<double>(i) / static_cast<double>(d);
        omega[i] = pow(10000.0, exponent);  // radians per position
    }

    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < half; ++i) {
            double angle = static_cast<double>(p) * omega[i];
            pe[p][2*i]     = sin(angle);  // even index
            pe[p][2*i + 1] = cos(angle);  // odd index
        }
    }
    return pe;
}

int main() {
    size_t n = 8;  // positions 0..7
    size_t d = 6;  // must be even
    auto pe = sinusoidal_pe(n, d);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Positional Encoding (n=" << n << ", d=" << d << ")\n";
    for (size_t p = 0; p < n; ++p) {
        cout << "p=" << p << ": ";
        for (size_t j = 0; j < d; ++j) cout << pe[p][j] << (j+1 == d ? '\n' : ' ');
    }
    return 0;
}
Computes the standard sinusoidal positional encodings for n positions and even dimension d. Frequencies are spaced on a log scale via ω_i=10000^{−2i/d}. The encoding interleaves sin and cos so each pair forms a 2D plane that supports rotation under shifts. The code prints the table for inspection and is suitable for precomputing PE for Transformer inputs.
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even");
    size_t half = d / 2;
    vector<double> omega(half);
    for (size_t i = 0; i < half; ++i) {
        double exponent = -2.0 * static_cast<double>(i) / static_cast<double>(d);
        omega[i] = pow(10000.0, exponent);
    }
    vector<vector<double>> pe(n, vector<double>(d));
    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < half; ++i) {
            double angle = static_cast<double>(p) * omega[i];
            pe[p][2*i]   = sin(angle);
            pe[p][2*i+1] = cos(angle);
        }
    }
    return pe;
}

// Apply block-diagonal rotation corresponding to shift k.
// For each pair (2i, 2i+1): v' = R(k*omega_i) * v.
vector<double> rotate_shift(const vector<double>& v, const vector<double>& omega, long long k) {
    size_t d = v.size();
    size_t half = d / 2;
    vector<double> out(d, 0.0);
    for (size_t i = 0; i < half; ++i) {
        double theta = static_cast<double>(k) * omega[i];
        double c = cos(theta), s = sin(theta);
        double sin_p = v[2*i];
        double cos_p = v[2*i+1];
        // [sin, cos]^T -> [[c, s], [-s, c]] * [sin, cos]^T
        out[2*i]   =  c * sin_p + s * cos_p;  // sin((p+k)w)
        out[2*i+1] = -s * sin_p + c * cos_p;  // cos((p+k)w)
    }
    return out;
}

int main() {
    size_t n = 16, d = 8; long long k = 3;  // shift by k positions
    auto pe = sinusoidal_pe(n, d);

    // Rebuild omega to reuse in rotation
    vector<double> omega(d/2);
    for (size_t i = 0; i < d/2; ++i) omega[i] = pow(10000.0, -2.0 * (double)i / (double)d);

    // Compare PE(p+k) vs rotate_shift(PE(p))
    double max_err = 0.0;
    for (size_t p = 0; p + (size_t)k < n; ++p) {
        vector<double> pred = rotate_shift(pe[p], omega, k);
        const vector<double>& truth = pe[p + (size_t)k];
        for (size_t j = 0; j < d; ++j) max_err = max(max_err, fabs(pred[j] - truth[j]));
    }

    cout.setf(std::ios::fixed); cout << setprecision(10);
    cout << "Max absolute error over all comparisons: " << max_err << "\n";
    return 0;
}
This program tests the core property: shifting by k positions rotates each sin–cos pair by angle k·ω_i. We apply a block-diagonal rotation to PE(p) and compare against directly computed PE(p+k). With double precision, the maximum error is near machine epsilon, confirming the rotation identity.
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

vector<vector<double>> sinusoidal_pe(size_t n, size_t d) {
    if (d % 2 != 0) throw runtime_error("d must be even");
    vector<vector<double>> pe(n, vector<double>(d));
    vector<double> omega(d/2);
    for (size_t i = 0; i < d/2; ++i) omega[i] = pow(10000.0, -2.0 * (double)i / (double)d);
    for (size_t p = 0; p < n; ++p) {
        for (size_t i = 0; i < d/2; ++i) {
            double a = static_cast<double>(p) * omega[i];
            pe[p][2*i]   = sin(a);
            pe[p][2*i+1] = cos(a);
        }
    }
    return pe;
}

// Direct dot product of PE(p) and PE(q)
double dot_pe(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Closed form using sum_i cos((p-q)*omega_i)
double dot_closed_form(long long p, long long q, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d/2; ++i) {
        double omega = pow(10000.0, -2.0 * (double)i / (double)d);
        s += cos(static_cast<double>(p - q) * omega);
    }
    return s;
}

int main() {
    size_t n = 64, d = 16;  // test grid
    auto pe = sinusoidal_pe(n, d);

    double max_err = 0.0;
    for (size_t p = 0; p < n; ++p) {
        for (size_t q = 0; q < n; ++q) {
            double lhs = dot_pe(pe[p], pe[q]);
            double rhs = dot_closed_form((long long)p, (long long)q, d);
            max_err = max(max_err, fabs(lhs - rhs));
        }
    }

    cout.setf(std::ios::fixed); cout << setprecision(10);
    cout << "Max absolute error between direct dot and closed form: " << max_err << "\n";

    // Show how similarity decays with |p-q|
    size_t p0 = 10;
    cout << "Dot products vs relative offset for p0=" << p0 << ":\n";
    for (int dq = -10; dq <= 10; ++dq) {
        int q = (int)p0 + dq;
        if (q < 0 || q >= (int)n) continue;
        double s = dot_pe(pe[p0], pe[(size_t)q]);
        cout << "dq=" << setw(3) << dq << ": " << s << "\n";
    }
    return 0;
}
The program verifies that ⟨PE(p),PE(q)⟩ equals Σ_i cos((p−q)·ω_i). It also prints how similarity varies with relative offset around a reference position, illustrating that the encoding captures relative distance information naturally.