How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Positional Encoding Theory

Key Points

  • Transformers are permutation-invariant by default, so they need positional encodings to understand word order in sequences.
  • Sinusoidal positional encodings inject absolute position using sines and cosines at exponentially spaced frequencies.
  • The encoding for position pos and channels 2i/2i+1 is sin(pos/10000^{2i/d_model}) and cos(pos/10000^{2i/d_model}).
  • These encodings allow the model to compute relative positions linearly, thanks to trigonometric addition identities.
  • They are parameter-free, deterministic, and extrapolate gracefully to sequences longer than those seen in training.
  • Adding positional encodings is O(n d) in time and space, negligible compared to attention's O(n² d).
  • Use sinusoidal encodings when you want simple, memory-light positional signals with good length generalization.

Prerequisites

  • Vectors and matrices — Positional encodings and attention use matrix operations and vector additions.
  • Trigonometric functions — Sinusoidal encodings rely on sine/cosine and angle addition identities.
  • Softmax and numerical stability — Attention normalizes scores with softmax; stability matters for implementation.
  • Embeddings — Understanding how tokens map to continuous vectors is essential before adding positions.
  • Transformer architecture basics — Context for where and why positional encodings are added.
  • Time and space complexity — To analyze the overhead of generating and applying positional encodings.
  • C++ I/O and STL containers — Implementations use vectors, loops, and printing.
  • Floating-point precision — Large positions and exponentials can be sensitive to precision.

Detailed Explanation


01 Overview

Hook → Imagine reading a sentence with all words shuffled: the bag of words is the same, but the meaning collapses without order. Concept → Transformers, powered by self-attention, naturally see inputs as sets, not ordered sequences. Positional encodings are the “GPS coordinates” that tell the model where each token sits in the sequence. Sinusoidal positional encodings (from the original Transformer paper) encode absolute position using sinusoids at multiple frequencies so the model can infer both coarse and fine-grained distances. They are computed by simple formulas—no learned parameters—and then added to token embeddings before attention. Example → If d_model = 8 and pos = 5, we compute eight numbers: sin and cos pairs at four frequencies. This vector is added to the token’s embedding at position 5, nudging attention to notice its place in line.

02 Intuition & Analogies

Hook → Think of a marching band on a football field. Each musician must know both who they are (their instrument) and exactly where they stand (their spot on the grid). Concept → Token embeddings tell the model “who” each token is; positional encodings tell it “where” they are in the sequence. Using sinusoids is like giving every position a unique barcode made from low and high musical notes (frequencies). Low frequencies change slowly across positions and capture global order (beginning vs. end), while high frequencies change rapidly and capture local differences (neighbors). Because these barcodes are built from sines and cosines, shifting a position corresponds to a simple rotation of each sinusoidal pair. This means the model can compute relative distances with linear operations—great for attention layers which are linear before softmax. Example → If you move 3 steps forward, each [sin, cos] pair just rotates by an angle tied to that frequency. The model can learn weights that recognize, say, “a verb tends to attend to a noun 1–3 steps back,” regardless of the absolute position, because the relative shift looks consistent across the whole field.

03 Formal Definition

Hook → We want a position map with structure the network can exploit algebraically. Concept → For model dimension d_model (assumed even), define angles θ_{pos,i} = pos / 10000^{2i/d_model} for i = 0, 1, …, d_model/2 − 1. The positional encoding is a vector PE_pos ∈ R^{d_model} with interleaved sine and cosine channels: PE_{pos,2i} = sin(θ_{pos,i}) and PE_{pos,2i+1} = cos(θ_{pos,i}). These channels correspond to exponentially spaced wavelengths λ_i = 2π · 10000^{2i/d_model}. Example → For d_model = 8, we get i ∈ {0, 1, 2, 3}; compute four angles and fill [sin θ_0, cos θ_0, sin θ_1, cos θ_1, …]. The complete input to attention is x_pos = embed(token_pos) + PE_pos. Crucially, trigonometric addition identities imply that PE_{pos+k} is obtained from PE_pos by a linear transform (a 2×2 rotation of each [PE_{pos,2i}, PE_{pos,2i+1}] pair), enabling relative-position reasoning by linear layers.

04 When to Use

Hook → You need order, but don’t want extra parameters or training complexity. Concept → Use sinusoidal positional encodings in Transformers when absolute order matters (language, code, time series, speech, DNA/proteins) and you want strong extrapolation to longer lengths than training. They are ideal for resource-constrained settings, deterministic pipelines, or as a baseline to compare against learned or relative schemes. Example use cases: (1) Machine translation where sentences may be longer at inference time than in training; (2) Sensor and time-series models where timestamps can exceed training windows; (3) Lightweight inference on CPU where parameter count must be minimal; (4) Educational or research prototypes that need reproducibility without tuning extra parameters. Prefer learned or relative encodings (e.g., rotary or bias-based) when data-specific patterns or very long-context performance require it, but sinusoidal remains a robust, simple default.

⚠️Common Mistakes

  • Mixing degrees and radians: std::sin and std::cos in C++ use radians. Feeding degrees silently produces wrong patterns. Always compute angles in radians as defined by the formula.
  • Off-by-one positions: The standard uses pos starting at 0. Starting at 1 shifts all phases and can subtly degrade performance or mismatch pretrained configs.
  • Odd d_model handling: The classic definition assumes an even d_model for [sin, cos] pairs. If d_model is odd, decide how to handle the leftover channel (e.g., drop it or pad), but be consistent.
  • Recomputing each step with wrong max_len: Generating encodings for a smaller max_len than your longest sequence causes out-of-bounds access or silent truncation. Precompute for the maximum needed length.
  • Forgetting to add (not concatenate): The vanilla approach adds PE to token embeddings elementwise. Concatenation changes dimensionality and requires reinitializing projection layers.
  • Precision and scaling: Using float with very large pos can accumulate error; double precision is safer. Also remember attention's usual 1/√d_k scaling; positional encodings don't replace that.
  • Misinterpreting extrapolation: Sinusoidal encoding handles arbitrarily long positions, but model parameters were trained on finite contexts; attention may still degrade at extreme lengths even though the encoding itself generalizes.

Key Formulas

Angle per channel

θ_{pos,i} = pos / 10000^{2i/d_model}

Explanation: This defines the phase (angle) for the i-th sinusoidal pair at position pos. Higher i uses higher frequency (smaller wavelength) due to the exponential denominator.

Sinusoidal positional encoding

PE_{pos,2i} = sin(θ_{pos,i}),   PE_{pos,2i+1} = cos(θ_{pos,i})

Explanation: Each position is mapped to interleaved sine and cosine values at multiple frequencies. These values are added to token embeddings to inject order information.

Wavelength per channel

λ_i = 2π · 10000^{2i/d_model}

Explanation: This gives the period of the sinusoid for channel pair i. Larger i leads to shorter wavelengths which capture finer positional differences.

Scaled dot-product attention

Attention(Q, K, V) = softmax(QK⊤ / √d_k) V

Explanation: Attention weights are computed from query–key similarities scaled by 1/√d_k and normalized by softmax. These weights mix the values V to produce outputs.

Angle addition identities

sin(a + b) = sin a cos b + cos a sin b,
cos(a + b) = cos a cos b − sin a sin b.

Explanation: Shifting positions corresponds to adding angles. Each [sin, cos] pair transforms by a 2×2 rotation, enabling relative position computation via linear operations.

Rotation form of shifts

[sin(θ_{pos+k,i}), cos(θ_{pos+k,i})]⊤ = [[cos Δ_i, sin Δ_i], [−sin Δ_i, cos Δ_i]] · [sin(θ_{pos,i}), cos(θ_{pos,i})]⊤,   where Δ_i = k / 10000^{2i/d_model}

Explanation: A shift by k positions is a rotation by angle Δ_i in each sinusoidal pair. This linear structure is why sinusoidal encodings support relative-distance reasoning.

Dot product depends on relative distance

PE(pos) · PE(pos′) = Σ_{i=0}^{d_model/2 − 1} cos(θ_{pos,i} − θ_{pos′,i})

Explanation: The dot product between two positional encodings depends only on their angle differences. This links similarity of positions to their relative distance across frequencies.

Generation complexity

T_PE(n, d) = O(n d),   S_PE(n, d) = O(n d)

Explanation: Creating a positional encoding matrix for n positions and dimension d is linear in both. Time and space grow proportionally to n×d, small compared to attention’s cost.

Complexity Analysis

Generating sinusoidal positional encodings for a sequence of length n and model dimension d requires computing one sine and one cosine per pair of channels at each position. This is O(n d) time, since we fill an n×d matrix, and O(n d) space to store it. The per-element work is constant (the pow for each pair's base frequency can be precomputed and shared, leaving one sin/cos per element), so practical throughput is dominated by memory bandwidth rather than compute. Adding positional encodings to token embeddings is a simple elementwise addition, also O(n d) time and O(1) extra space if done in place (besides the storage for the PE matrix).

In the context of a Transformer layer, the main cost comes from self-attention, which requires forming Q, K, and V (O(n d²) for dense projections) and computing attention scores QK⊤ (O(n² d_k)), followed by softmax and multiplication by V (O(n² d_v)). Compared to these quadratic terms, the O(n d) positional encoding step is negligible for typical sequence lengths (n ≳ 128). If positional encodings are precomputed up to a maximum length and reused across batches, there is no per-step recomputation cost beyond indexing the first n rows, keeping runtime consistent. Memory-wise, storing PE for max_len × d is modest (e.g., 4096×1024 doubles ≈ 32 MB) and can be reduced by generating each needed slice on the fly if memory is tight, trading a little compute for lower storage. Precision-wise, double mitigates accumulated error at very large positions, though float is usually sufficient for standard lengths.

Code Examples

Generate sinusoidal positional encoding matrix
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <stdexcept>

using Matrix = std::vector<std::vector<double>>;

// Create an n x d_model matrix of sinusoidal positional encodings.
// PE[pos, 2i]   = sin(pos / 10000^{2i/d_model})
// PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model})
Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    if (d_model % 2 != 0) {
        throw std::invalid_argument("d_model must be even for [sin, cos] pairs.");
    }

    Matrix pe(max_len, std::vector<double>(d_model, 0.0));

    // Precompute the denominators for each pair to avoid recomputing pow in inner loops.
    std::vector<double> div_terms(d_model / 2);
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        div_terms[i] = std::pow(10000.0, exponent); // 10000^{2i/d_model}
    }

    for (std::size_t pos = 0; pos < max_len; ++pos) {
        for (std::size_t i = 0; i < d_model / 2; ++i) {
            double theta = static_cast<double>(pos) / div_terms[i];
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }

    return pe;
}

int main() {
    std::size_t max_len = 10;
    std::size_t d_model = 8; // must be even

    Matrix pe = sinusoidal_positional_encoding(max_len, d_model);

    // Print every row of the table
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t pos = 0; pos < max_len; ++pos) {
        std::cout << "pos " << pos << ": ";
        for (std::size_t j = 0; j < d_model; ++j) {
            std::cout << std::setw(10) << pe[pos][j] << ' ';
        }
        std::cout << '\n';
    }
    return 0;
}

This program builds the standard sinusoidal positional encoding table for a given maximum length and model dimension. It precomputes the exponential denominators for efficiency and fills an n×d matrix with interleaved sine and cosine values per position. The output shows the encoding vectors for positions 0..9 with d_model=8.

Time: O(n d) · Space: O(n d)
Add positional encodings to token embeddings
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <cmath>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.02);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed_tokens(const std::vector<int>& tokens, const Matrix& E) {
    std::size_t n = tokens.size();
    std::size_t d = E[0].size();
    Matrix X(n, std::vector<double>(d));
    for (std::size_t i = 0; i < n; ++i) {
        int id = tokens[i];
        for (std::size_t j = 0; j < d; ++j) X[i][j] = E[id][j];
    }
    return X;
}

void add_positional_encoding(Matrix& X, const Matrix& PE) {
    for (std::size_t i = 0; i < X.size(); ++i) {
        for (std::size_t j = 0; j < X[0].size(); ++j) {
            X[i][j] += PE[i][j]; // add position-wise
        }
    }
}

int main() {
    // Toy example: sequence of token IDs
    std::vector<int> tokens = {3, 1, 4, 1, 5};
    std::size_t vocab_size = 10;
    std::size_t d_model = 8; // even

    Matrix E = make_embedding_matrix(vocab_size, d_model);
    Matrix PE = sinusoidal_positional_encoding(tokens.size(), d_model);

    Matrix X = embed_tokens(tokens, E);
    add_positional_encoding(X, PE);

    // Print every row to illustrate the addition
    std::cout << std::fixed << std::setprecision(6);
    for (std::size_t i = 0; i < X.size(); ++i) {
        std::cout << "pos " << i << ": ";
        for (double v : X[i]) std::cout << std::setw(10) << v << ' ';
        std::cout << '\n';
    }
    return 0;
}

We simulate a small embedding table and a short token sequence. After embedding tokens, we add the corresponding sinusoidal positional encoding row to each token’s vector. This mirrors the standard Transformer input step x_pos = embed(token_pos) + PE_pos.

Time: O(n d) · Space: O(vocab_size · d + n d)
Attention weights with and without positional encodings
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <random>
#include <string>

using Matrix = std::vector<std::vector<double>>;

Matrix sinusoidal_positional_encoding(std::size_t max_len, std::size_t d_model) {
    Matrix pe(max_len, std::vector<double>(d_model, 0.0));
    for (std::size_t i = 0; i < d_model / 2; ++i) {
        double exponent = static_cast<double>(2 * i) / static_cast<double>(d_model);
        double div = std::pow(10000.0, exponent);
        for (std::size_t pos = 0; pos < max_len; ++pos) {
            double theta = static_cast<double>(pos) / div;
            pe[pos][2 * i] = std::sin(theta);
            pe[pos][2 * i + 1] = std::cos(theta);
        }
    }
    return pe;
}

Matrix make_embedding_matrix(std::size_t vocab_size, std::size_t d_model, unsigned seed = 123) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 0.1);
    Matrix E(vocab_size, std::vector<double>(d_model));
    for (std::size_t v = 0; v < vocab_size; ++v)
        for (std::size_t j = 0; j < d_model; ++j)
            E[v][j] = dist(gen);
    return E;
}

Matrix embed(const std::vector<int>& ids, const Matrix& E) {
    Matrix X(ids.size(), std::vector<double>(E[0].size()));
    for (std::size_t i = 0; i < ids.size(); ++i)
        for (std::size_t j = 0; j < E[0].size(); ++j)
            X[i][j] = E[ids[i]][j];
    return X;
}

Matrix matmul(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size();
    std::size_t m = B[0].size();
    std::size_t k = B.size();
    Matrix C(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t t = 0; t < k; ++t)
            for (std::size_t j = 0; j < m; ++j)
                C[i][j] += A[i][t] * B[t][j];
    return C;
}

Matrix transpose(const Matrix& A) {
    Matrix T(A[0].size(), std::vector<double>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

void row_softmax(Matrix& A) {
    for (auto& row : A) {
        double maxv = row[0];
        for (double v : row) maxv = std::max(maxv, v);
        double sum = 0.0;
        for (double& v : row) { v = std::exp(v - maxv); sum += v; }
        for (double& v : row) v /= (sum + 1e-12);
    }
}

Matrix add(const Matrix& A, const Matrix& B) {
    Matrix C = A;
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix attention_weights_from_inputs(const Matrix& X) {
    // Scaled dot-product attention with Q = K = X (single head, no masks); return only the weights
    Matrix Kt = transpose(X);
    Matrix scores = matmul(X, Kt);
    double scale = 1.0 / std::sqrt(static_cast<double>(X[0].size()));
    for (auto& row : scores) for (double& v : row) v *= scale;
    row_softmax(scores);
    return scores;
}

void print_matrix(const Matrix& A, const std::string& name) {
    std::cout << name << "\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto& row : A) {
        for (double v : row) std::cout << std::setw(8) << v << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Two sequences with the same tokens but reversed order
    std::vector<int> seq1 = {1, 2, 3, 4};
    std::vector<int> seq2 = {4, 3, 2, 1};

    std::size_t vocab = 10, d_model = 8;
    Matrix E = make_embedding_matrix(vocab, d_model, 7);

    Matrix X1 = embed(seq1, E);
    Matrix X2 = embed(seq2, E);

    // Without positional encodings
    Matrix W1 = attention_weights_from_inputs(X1);
    Matrix W2 = attention_weights_from_inputs(X2);

    // With positional encodings added
    Matrix PE = sinusoidal_positional_encoding(seq1.size(), d_model);
    Matrix X1p = add(X1, PE);
    Matrix X2p = add(X2, PE);
    Matrix W1p = attention_weights_from_inputs(X1p);
    Matrix W2p = attention_weights_from_inputs(X2p);

    print_matrix(W1, "Weights seq1 (no PE)");
    print_matrix(W2, "Weights seq2 (no PE)");
    print_matrix(W1p, "Weights seq1 (with PE)");
    print_matrix(W2p, "Weights seq2 (with PE)");

    std::cout << "\nObserve: without PE, seq2's weights are largely a permutation of seq1's; with PE, the patterns differ because order is encoded." << std::endl;
    return 0;
}

This program contrasts attention weights computed from embeddings alone versus embeddings plus positional encodings for two reversed sequences. Without positions, attention is permutation-equivariant, so reversing inputs mainly permutes the weight matrix. After adding sinusoidal encodings, the attention patterns change because absolute order information is now present.

Time: O(n² d) for attention, O(n d) for adding PE · Space: O(n²) for weights, O(n d) for inputs
#positional encoding#sinusoidal encoding#transformer#self-attention#scaled dot-product#absolute position#relative position#rotary embedding#sequence modeling#wavelength#frequency band#permutation invariance#softmax#extrapolation