Transformer Expressiveness
Key Points
- Transformer expressiveness studies what kinds of sequence-to-sequence mappings a Transformer can represent or approximate.
- With sufficient width, depth, and appropriate positional encodings, Transformers are universal approximators on bounded-length sequences.
- Softmax attention can approximate hard routing (argmax) by using large logits or small temperature, enabling near-discrete token selection.
- Stacking attention with feed-forward networks composes lookups and nonlinear transforms, letting Transformers simulate complex algorithms.
- Positional information (absolute or relative) is essential; without it, a Transformer cannot distinguish permutations of the inputs.
- Expressiveness does not imply easy training or good generalization; it only states existence of parameters that realize a mapping.
- Theoretical universality results often assume fixed maximum length and continuous outputs; discrete tasks are handled by sharp approximations.
- Self-attention layers have quadratic cost in sequence length, which can be a practical limit despite strong expressiveness.
Prerequisites
- Linear algebra (vectors, matrices, dot products) — Attention is built from projections and dot products; understanding these is essential to follow the mechanics.
- Probability and softmax function — Softmax converts logits to a distribution that weights token contributions and controls routing sharpness.
- Neural networks and ReLU MLPs — Position-wise feed-forward networks provide nonlinearity and universal approximation in feature space.
- Sequence modeling basics — Grasping tokens, embeddings, and positional encodings clarifies how order enters the computation.
- Big-O complexity analysis — Expressiveness must be balanced with the quadratic cost of attention in practice.
- Continuity and compact sets — Formal universality statements are framed for continuous functions on compact domains.
Detailed Explanation
01 Overview
Transformer expressiveness asks: given an input sequence, what functions can a Transformer compute or approximate? A modern Transformer layer combines self-attention (which dynamically routes information among tokens) with position-wise feed-forward networks and residual connections. The key insight is that attention can focus on particular tokens based on their content and position, while the feed-forward part applies rich nonlinear transformations. Theoretical results show that, under mild assumptions (e.g., bounded sequence length, suitable positional encodings, and enough width), Transformers are universal approximators of continuous sequence-to-sequence functions. Intuitively, they can simulate lookup tables, conditional logic, and arithmetic with arbitrary precision by making attention nearly discrete and by stacking layers to compose operations. This is similar in spirit to classical universal approximation theorems for multilayer perceptrons but tailored to variable-length structured inputs. In practice, this means that if a task can be expressed as a function over sequences—translation, copying, reversing, sorting small lists, aggregating statistics, or emulating finite-state machines—then some Transformer with appropriate parameters can implement it on inputs up to a fixed length. However, expressiveness is a worst-case existence statement; optimization, data, inductive bias, and computational cost determine whether we can actually find those parameters and run them efficiently.
02 Intuition & Analogies
Picture a Transformer as a team of spotlight operators on a stage of tokens. Each spotlight (an attention head) can swivel to highlight whichever token seems most relevant for its current position. When the spotlight is very bright on exactly one actor (thanks to a sharp softmax), the head is effectively choosing that token, much like a pointer in a program. Multiple spotlights can work in parallel to gather different pieces of information. After the spotlights settle, a local crew (the feed-forward network at each position) transforms the gathered information—comparable to applying a small toolbox of nonlinear gadgets. Residual connections let the model keep what it already knows while adding new computed details, like taking notes in layers. Positional encodings are the stage marks on the floor telling each actor where they stand; without them, a choreographer who only sees the actors but not their marks could not stage order-specific moves. By stacking layers, Transformers can perform sequences of instructions: first look up a specific partner, then compute a relation, then forward the result to someone else who uses it for the next step. Sharpening the softmax (by scaling logits) turns the fuzzy spotlight into a near on/off switch, enabling almost discrete control flow. Because ReLU feed-forward networks can approximate arbitrary continuous functions on compact sets, once the relevant context is routed by attention, the local MLPs can implement the desired transformation. In short, attention does the routing, MLPs do the computing, and positions give order—together forming a programmable circuit over sequences.
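The "near on/off switch" behaviour is easy to check numerically. The sketch below is a toy illustration (the four logits and the scaling factors are arbitrary choices, not from any particular model): it prints the softmax of the same scores at increasing scales and shows the weights collapsing onto the largest entry.

#include <bits/stdc++.h>
using namespace std;

// How scaling logits sharpens softmax toward a one-hot (argmax-like) selection.
int main() {
    vector<double> logits = {1.0, 2.0, 1.5, 0.5};    // arbitrary example scores
    vector<double> scales = {1.0, 5.0, 20.0, 100.0}; // larger scale ~ lower temperature

    cout << fixed << setprecision(4);
    for (double s : scales) {
        // Numerically stable softmax of s * logits
        vector<double> z(logits.size());
        for (size_t i = 0; i < z.size(); ++i) z[i] = s * logits[i];
        double m = *max_element(z.begin(), z.end()), sum = 0.0;
        for (size_t i = 0; i < z.size(); ++i) { z[i] = exp(z[i] - m); sum += z[i]; }
        for (double &v : z) v /= sum;

        cout << "scale " << setw(6) << s << ": ";
        for (double v : z) cout << v << ' ';
        cout << '\n'; // weights concentrate on index 1 (the largest logit) as the scale grows
    }
    return 0;
}

At scale 1 the weights are spread out; by scale 100 they are numerically indistinguishable from a one-hot vector, which is exactly the regime where attention behaves like a pointer.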
03 Formal Definition
Fix a maximum sequence length L, a token dimension d, and a compact set K ⊆ R^(L×d) of input sequences. A Transformer T_θ (stacked self-attention and position-wise feed-forward blocks with positional encodings and residual connections) ε-approximates a continuous sequence-to-sequence function F: K → R^(L×d) if sup_{X ∈ K} ‖T_θ(X) − F(X)‖ < ε. Universality results state that for every such F and every ε > 0 there exist a depth, width, number of heads, and parameters θ achieving this bound; the claim asserts only the existence of suitable parameters, not that training will find them.
04 When to Use
Use Transformer expressiveness theory to reason about whether a Transformer architecture can, in principle, perform your sequence task. It applies when you design models for translation, summarization, program synthesis, sorting short lists, routing-based reasoning, copying/reversing, or tasks that require content- and position-dependent token interactions. If your data are bounded in length (e.g., fixed-size protocol messages, small sequences in embedded systems), universality results give confidence that a suitable Transformer can represent the mapping. It’s also helpful when crafting hand-engineered modules: relative position biases can enforce algorithmic behaviors (like shifting or reversing), and multiple heads can specialize in complementary sub-tasks (e.g., one head finds a delimiter while another aggregates numbers). The theory also guides simplifications: if you only need pooling, a single head with uniform attention plus an FFN can implement sum/average, potentially replacing more complex recurrent designs. However, remember limitations: universality says parameters exist but does not address how easy they are to learn, how well they generalize to longer sequences than seen in training, or the computational cost of quadratic attention. For very long sequences, consider sparse/linearized attention; though expressiveness might change, practice often favors tractable approximations with good inductive bias.
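As a sketch of how a relative position bias can enforce an algorithmic behaviour, the toy program below implements a shift-by-one: a strong bias toward position j = i − 1 makes each token copy its left neighbour. The bias magnitude 50 and baseline −10 are arbitrary demo values, and the first token simply attends to itself here because it has no left neighbour.

#include <bits/stdc++.h>
using namespace std;

// Shift-by-one via attention: a strong relative position bias toward j = i - 1
// makes each position (approximately) copy the value of its left neighbour.
int main() {
    int L = 6;
    vector<double> V(L);
    for (int i = 0; i < L; ++i) V[i] = (double)(i + 1); // contents: 1..6

    double big = 50.0, baseline = -10.0; // arbitrary demo values; bigger gap -> sharper copy
    vector<double> Z(L, 0.0);
    for (int i = 0; i < L; ++i) {
        int target = (i > 0) ? i - 1 : i; // first token just attends to itself
        vector<double> logits(L, baseline);
        logits[target] = big;

        // Numerically stable softmax over the logits
        double m = *max_element(logits.begin(), logits.end()), sum = 0.0;
        vector<double> a(L);
        for (int j = 0; j < L; ++j) { a[j] = exp(logits[j] - m); sum += a[j]; }
        for (int j = 0; j < L; ++j) Z[i] += (a[j] / sum) * V[j];
    }

    cout << fixed << setprecision(4);
    cout << "Input:   "; for (double v : V) cout << v << ' '; cout << '\n';
    cout << "Shifted: "; for (double v : Z) cout << v << ' '; cout << '\n'; // ~ 1 1 2 3 4 5
    return 0;
}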
⚠️Common Mistakes
- Equating expressiveness with trainability: A Transformer may be able to represent a function, but gradient descent might not find those parameters without careful initialization, curricula, or inductive biases.
- Ignoring positional information: Without positional encodings or relative position biases, self-attention is permutation-equivariant (permuting the inputs merely permutes the outputs), so order-dependent tasks (like reversing or sorting) cannot be represented; a small numerical check of this appears after this list.
- Confusing soft and hard attention: Softmax never becomes exactly one-hot at finite scale; it only approximates argmax. Expect small approximation error unless logits are very large—watch out for numerical overflow.
- Overlooking fixed-length assumptions: Many universality theorems assume a maximum sequence length. A model that implements a function perfectly up to L may fail to generalize to longer sequences without architectural bias.
- Neglecting precision and stability: Implementations with float32 can saturate softmax when logits are large, breaking the intended near-discrete routing; use numerically stable softmax and consider temperature/scale carefully.
- Misattributing performance: Strong empirical results can come from pretraining and data scale, not just expressiveness. Conversely, failures do not contradict universality; they may reflect optimization or sample complexity issues.
- Underestimating quadratic costs: Even if a function is representable, O(n^2) memory/time for attention can make it impractical for long sequences—consider approximate attention or chunking.
- Forgetting capacity trade-offs: Depth, width, and number of heads interact; too few heads can bottleneck routing, while overly wide layers can overfit without improving algorithmic generalization.
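To make the permutation point concrete, here is a minimal, self-contained sketch (the embeddings, the permutation, and the identity Q/K/V projections are arbitrary choices for the demo): it runs one attention head without positional encodings on a sequence and on a shuffled copy, and confirms the outputs are the same values in shuffled order, so the head by itself carries no notion of position.

#include <bits/stdc++.h>
using namespace std;

// Single attention head with Q = K = V = X (identity projections) and NO positional encoding.
// Checks that, without position information, attention is permutation-equivariant:
// permuting the input rows just permutes the output rows in the same way.
vector<vector<double>> attend(const vector<vector<double>>& X) {
    int L = (int)X.size(), d = (int)X[0].size();
    double scale = 1.0 / sqrt((double)d);
    vector<vector<double>> Z(L, vector<double>(d, 0.0));
    for (int i = 0; i < L; ++i) {
        // Logits followed by a numerically stable softmax
        vector<double> logits(L);
        for (int j = 0; j < L; ++j) {
            double s = 0.0;
            for (int c = 0; c < d; ++c) s += X[i][c] * X[j][c];
            logits[j] = s * scale;
        }
        double m = *max_element(logits.begin(), logits.end()), sum = 0.0;
        vector<double> a(L);
        for (int j = 0; j < L; ++j) { a[j] = exp(logits[j] - m); sum += a[j]; }
        for (int j = 0; j < L; ++j)
            for (int c = 0; c < d; ++c) Z[i][c] += (a[j] / sum) * X[j][c];
    }
    return Z;
}

int main() {
    // Arbitrary toy embeddings (no positional signal added)
    vector<vector<double>> X = {{1.0, 0.2}, {0.5, -1.0}, {-0.3, 0.7}, {2.0, 0.1}};
    vector<int> perm = {2, 0, 3, 1}; // an arbitrary permutation of the 4 positions

    vector<vector<double>> Xp(X.size());
    for (size_t i = 0; i < perm.size(); ++i) Xp[i] = X[perm[i]];

    auto Z = attend(X), Zp = attend(Xp);

    // Output at permuted position i should equal output at original position perm[i]
    double max_diff = 0.0;
    for (size_t i = 0; i < perm.size(); ++i)
        for (size_t c = 0; c < Z[0].size(); ++c)
            max_diff = max(max_diff, fabs(Zp[i][c] - Z[perm[i]][c]));

    cout << "Max difference between outputs of permuted inputs and permuted outputs: "
         << max_diff << '\n'; // ~0: the head alone cannot encode order
    return 0;
}

Adding even a small positional signal to the embeddings, as in the first full example in the Code Examples section, breaks this symmetry and lets the model represent order-dependent functions.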
Key Formulas
Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Explanation: Attention weights are computed by a scaled dot product between queries and keys, passed through softmax, and used to average the values. This is the core mechanism for routing information among tokens.
Multi-Head Attention: MultiHead(X) = Concat(head_1, …, head_h) W^O, where head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
Explanation: Multiple heads compute attention in parallel with their own Q,K,V projections; their outputs are concatenated and linearly projected. This allows the model to capture different relationships simultaneously.
Position-wise Feed-Forward: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
Explanation: A small MLP is applied independently at each position. With ReLU or GELU nonlinearity, it can approximate arbitrary continuous functions on compact domains when sufficiently wide.
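A minimal sketch of such a position-wise feed-forward pass is shown below (the weight matrices, biases, and inputs are arbitrary toy numbers): the same two-layer ReLU network is applied independently to the vector at every position.

#include <bits/stdc++.h>
using namespace std;

// Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2,
// applied independently to the vector at each position.
vector<double> ffn(const vector<double>& x,
                   const vector<vector<double>>& W1, const vector<double>& b1,
                   const vector<vector<double>>& W2, const vector<double>& b2) {
    // Hidden layer with ReLU
    vector<double> h(W1[0].size(), 0.0);
    for (size_t j = 0; j < h.size(); ++j) {
        double s = b1[j];
        for (size_t i = 0; i < x.size(); ++i) s += x[i] * W1[i][j];
        h[j] = max(0.0, s);
    }
    // Output projection
    vector<double> y(W2[0].size(), 0.0);
    for (size_t j = 0; j < y.size(); ++j) {
        double s = b2[j];
        for (size_t i = 0; i < h.size(); ++i) s += h[i] * W2[i][j];
        y[j] = s;
    }
    return y;
}

int main() {
    // Tiny example: d_model = 2, hidden width 3 (all numbers arbitrary)
    vector<vector<double>> W1 = {{1.0, -0.5, 0.3}, {0.2, 0.8, -1.0}};
    vector<double> b1 = {0.0, 0.1, -0.2};
    vector<vector<double>> W2 = {{0.5, 1.0}, {-0.3, 0.4}, {0.7, -0.6}};
    vector<double> b2 = {0.05, -0.05};

    vector<vector<double>> tokens = {{1.0, 2.0}, {-0.5, 0.3}};
    cout << fixed << setprecision(4);
    for (size_t p = 0; p < tokens.size(); ++p) {
        vector<double> y = ffn(tokens[p], W1, b1, W2, b2); // same weights at every position
        cout << "pos " << p << ": " << y[0] << ' ' << y[1] << '\n';
    }
    return 0;
}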
Transformer Universal Approximation: for every continuous F: K → R^(L×d) on a compact set K and every ε > 0, there exist Transformer parameters θ with sup_{X ∈ K} ‖T_θ(X) − F(X)‖ < ε
Explanation: For any continuous target function F on a compact set K of sequences and any desired error bound, there are Transformer parameters that achieve that accuracy. This formalizes expressiveness as an existence claim.
Softmax Temperature Limit: softmax(z/τ) → one-hot(argmax_i z_i) as τ → 0⁺ (assuming a unique maximum)
Explanation: As temperature approaches zero, the softmax distribution concentrates on the maximum entry, behaving like a one-hot argmax. This underpins near-discrete token selection via attention.
Relative Position Bias: logit_ij = q_i · k_j / √d_k + b_(i−j)
Explanation: An additive bias depending on relative positions sharpens or redirects attention patterns. Properly chosen biases can implement algorithmic behaviors like shifting or reversing.
Sinusoidal Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Explanation: Sine and cosine waves of different frequencies encode absolute positions. Adding these to embeddings gives the model access to order information without learned parameters.
Attention Complexity: O(L^2 · d) time and O(L^2) memory per self-attention layer for sequence length L
Explanation: Computing all pairwise interactions between L tokens scales quadratically in time and memory. This cost often dominates Transformer layers on long sequences.
Convex Combination View: z_i = Σ_j a_ij v_j, with a_ij ≥ 0 and Σ_j a_ij = 1
Explanation: Each output token is a convex combination of value vectors, weighted by attention. With sharp weights, this acts like selecting or copying specific tokens.
Supremum Norm: ‖f‖_∞ = sup_{x ∈ K} |f(x)|
Explanation: The sup-norm measures worst-case error over a domain. Universal approximation guarantees control this maximum deviation.
Complexity Analysis
For sequence length L and model dimension d, one self-attention layer computes L^2 query-key scores and an L × L weight matrix, costing O(L^2 · d) time and O(L^2) memory; the position-wise feed-forward network adds O(L · d · d_ff) time, which is linear in L, and depth multiplies these costs by the number of layers. The practical upshot is that the same all-pairs routing that makes Transformers expressive is also the quadratic bottleneck on long sequences, which is why sparse, chunked, or linearized attention variants are used when L grows large.
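As a quick illustration of the quadratic term, the sketch below simply counts the pairwise query-key scores a single attention layer computes as the sequence length doubles; each doubling multiplies the count by four.

#include <bits/stdc++.h>
using namespace std;

// Count the pairwise query-key scores one self-attention layer computes,
// illustrating the quadratic growth in sequence length L.
int main() {
    long long prev = 0;
    for (long long L = 128; L <= 4096; L *= 2) {
        long long scores = L * L; // one dot product per (query, key) pair
        cout << "L = " << setw(5) << L << "  pairwise scores = " << setw(9) << scores;
        if (prev > 0) cout << "  (" << scores / prev << "x the previous row)";
        cout << '\n';
        prev = scores;
    }
    return 0;
}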
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Numerically stable softmax over a vector
vector<double> softmax(const vector<double>& x) {
    double m = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> exps(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        exps[i] = exp(x[i] - m);
        sum += exps[i];
    }
    for (size_t i = 0; i < x.size(); ++i) exps[i] /= max(sum, 1e-300);
    return exps;
}

// Sinusoidal positional encoding of dimension d_model for position pos
vector<double> sinusoidal_positional_encoding(int pos, int d_model) {
    vector<double> pe(d_model, 0.0);
    for (int i = 0; i < d_model; i += 2) {
        double denom = pow(10000.0, (double)i / d_model);
        pe[i] = sin(pos / denom);
        if (i + 1 < d_model) pe[i + 1] = cos(pos / denom);
    }
    return pe;
}

// Simple linear layer: y = x W + b; shapes: x(1 x d_in), W(d_in x d_out)
vector<double> linear(const vector<double>& x, const vector<vector<double>>& W, const vector<double>& b) {
    int d_in = (int)W.size();
    int d_out = (int)W[0].size();
    vector<double> y(d_out, 0.0);
    for (int j = 0; j < d_out; ++j) {
        double s = b[j];
        for (int i = 0; i < d_in; ++i) s += x[i] * W[i][j];
        y[j] = s;
    }
    return y;
}

// Dot product
inline double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Self-attention forward pass (single head) with optional relative position bias
// X: L x d_model token embeddings; returns L x d_v outputs
vector<vector<double>> self_attention_single_head(
    const vector<vector<double>>& X,
    const vector<vector<double>>& Wq,
    const vector<vector<double>>& Wk,
    const vector<vector<double>>& Wv,
    const vector<double>& bq,
    const vector<double>& bk,
    const vector<double>& bv,
    const vector<vector<double>>& rel_bias // L x L positional bias (can be all zeros)
) {
    int L = (int)X.size();
    int d_k = (int)Wq[0].size(); // also Wk[0].size()
    int d_v = (int)Wv[0].size();

    vector<vector<double>> Q(L), K(L), V(L);
    for (int i = 0; i < L; ++i) {
        Q[i] = linear(X[i], Wq, bq);
        K[i] = linear(X[i], Wk, bk);
        V[i] = linear(X[i], Wv, bv);
    }

    // Compute attention outputs
    vector<vector<double>> Z(L, vector<double>(d_v, 0.0));
    double scale = 1.0 / sqrt((double)d_k);
    for (int i = 0; i < L; ++i) {
        vector<double> logits(L, 0.0);
        for (int j = 0; j < L; ++j) {
            logits[j] = dot(Q[i], K[j]) * scale + rel_bias[i][j];
        }
        vector<double> a = softmax(logits); // attention weights over j
        for (int j = 0; j < L; ++j) {
            for (int c = 0; c < d_v; ++c) Z[i][c] += a[j] * V[j][c];
        }
    }
    return Z;
}

int main() {
    // Example: 4 tokens, model dim 6
    int L = 4, d_model = 6, d_k = 4, d_v = 4;

    // Toy token embeddings (could be outputs of an embedding layer)
    vector<vector<double>> tokens(L, vector<double>(d_model, 0.0));
    // Put simple numeric content in the first coordinate, rest zeros; add positional encodings
    vector<double> contents = {1.0, 2.0, 3.0, 4.0};
    for (int i = 0; i < L; ++i) {
        tokens[i][0] = contents[i];
        vector<double> pe = sinusoidal_positional_encoding(i, d_model);
        for (int k = 0; k < d_model; ++k) tokens[i][k] += 0.1 * pe[k]; // small positional signal
    }

    // Initialize small random weights for Q,K,V (fixed seed for reproducibility)
    std::mt19937 gen(42);
    std::normal_distribution<double> nd(0.0, 0.2);
    auto rand_matrix = [&](int r, int c) {
        vector<vector<double>> W(r, vector<double>(c));
        for (int i = 0; i < r; ++i) for (int j = 0; j < c; ++j) W[i][j] = nd(gen);
        return W;
    };
    auto rand_bias = [&](int c) { vector<double> b(c, 0.0); return b; };

    vector<vector<double>> Wq = rand_matrix(d_model, d_k);
    vector<vector<double>> Wk = rand_matrix(d_model, d_k);
    vector<vector<double>> Wv = rand_matrix(d_model, d_v);
    vector<double> bq = rand_bias(d_k), bk = rand_bias(d_k), bv = rand_bias(d_v);

    // No relative bias in this basic example
    vector<vector<double>> rel_bias(L, vector<double>(L, 0.0));

    auto Z = self_attention_single_head(tokens, Wq, Wk, Wv, bq, bk, bv, rel_bias);

    cout << fixed << setprecision(4);
    cout << "Self-attention outputs (one head):\n";
    for (int i = 0; i < L; ++i) {
        cout << "pos " << i << ": ";
        for (double v : Z[i]) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}
This program implements a minimal single-head self-attention forward pass with sinusoidal positional encodings. It projects tokens to queries, keys, and values, computes scaled dot-product attention with a numerically stable softmax, and returns the weighted sum of values. Although weights are random here, the same code can realize algorithmic behaviors (copying, shifting) once Q/K/V and positional biases are set appropriately—illustrating how attention performs content- and position-dependent routing.
#include <bits/stdc++.h>
using namespace std;

// Numerically stable softmax over a vector
vector<double> softmax(const vector<double>& x) {
    double m = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> e(x.size());
    for (size_t i = 0; i < x.size(); ++i) { e[i] = exp(x[i] - m); sum += e[i]; }
    for (double &v : e) v /= max(sum, 1e-300);
    return e;
}

// Reverse via attention: set logits to a very large value at j = L-1-i and small elsewhere.
// This mimics a Transformer head with a strong relative position bias.
int main() {
    int L = 6; // sequence length
    vector<vector<double>> V(L, vector<double>(1, 0.0)); // values hold the content (scalar per token)
    for (int i = 0; i < L; ++i) V[i][0] = (double)(i + 1); // contents: 1,2,3,4,5,6

    // Build relative bias matrix B where B[i][j] is large if j == L-1-i
    double big = 50.0; // large scale -> softmax ~ one-hot
    vector<vector<double>> B(L, vector<double>(L, -10.0)); // small baseline elsewhere
    for (int i = 0; i < L; ++i) {
        int j_star = L - 1 - i;
        B[i][j_star] = big;
    }

    // Attention weights for each query position i
    vector<vector<double>> A(L, vector<double>(L));
    for (int i = 0; i < L; ++i) A[i] = softmax(B[i]);

    // Output Z = A V (here d_v = 1, so it's just a weighted sum of scalars)
    vector<double> Z(L, 0.0);
    for (int i = 0; i < L; ++i) {
        for (int j = 0; j < L; ++j) Z[i] += A[i][j] * V[j][0];
    }

    cout << fixed << setprecision(6);
    cout << "Input:  "; for (int i = 0; i < L; ++i) cout << V[i][0] << ' '; cout << '\n';
    cout << "Output: "; for (int i = 0; i < L; ++i) cout << Z[i] << ' '; cout << '\n';

    // Show how close to exact reversal we are by comparing to the reversed sequence
    cout << "Target: "; for (int i = 0; i < L; ++i) cout << (double)(L - i) << ' '; cout << '\n';

    double max_err = 0.0;
    for (int i = 0; i < L; ++i) max_err = max(max_err, fabs(Z[i] - (double)(L - i)));
    cout << "Max abs error: " << max_err << '\n';
    return 0;
}
This example demonstrates that a single attention head with an appropriate relative position bias can implement sequence reversal. By assigning a very large logit to the reversed index j = L − 1 − i and small values elsewhere, the softmax becomes nearly one-hot, so each position i copies the value from its reversed partner. Increasing the scale makes the operation arbitrarily close to exact, illustrating how soft attention approximates hard routing and why Transformers can realize algorithmic, order-dependent functions.
#include <bits/stdc++.h>
using namespace std;

// Compute the sum of a sequence via attention: a special query attends uniformly to all tokens.
int main() {
    int L = 5; // number of tokens
    // Values hold 2D features; here we only use the first dim as numbers to sum
    vector<vector<double>> V = {{1.0, 0.0}, {2.0, 0.0}, {3.0, 0.0}, {4.0, 0.0}, {5.0, 0.0}};

    // A single query (CLS) with zero logits to every key -> uniform softmax:
    // the softmax of all-zero logits is the uniform distribution 1/L
    vector<double> A(L, 1.0 / L);

    // Z = sum_j A_j * V_j is the mean; multiply by L to get the sum
    vector<double> Z(2, 0.0);
    for (int j = 0; j < L; ++j) {
        Z[0] += A[j] * V[j][0];
        Z[1] += A[j] * V[j][1];
    }
    double mean_first = Z[0];
    double sum_first = mean_first * L; // emulate an FFN/linear layer that scales by L

    cout << fixed << setprecision(4);
    cout << "Numbers: "; for (int j = 0; j < L; ++j) cout << V[j][0] << ' '; cout << '\n';
    cout << "Uniform weights: "; for (int j = 0; j < L; ++j) cout << A[j] << ' '; cout << '\n';
    cout << "Mean (first dim): " << mean_first << '\n';
    cout << "Sum (first dim): " << sum_first << '\n';
    return 0;
}
A special query (like a CLS token) with identical logits to all keys produces uniform attention. The attended value is the mean of token values; a subsequent linear layer can multiply by L to recover the sum. This showcases how attention plus a simple feed-forward step can exactly implement useful aggregations, a building block for many sequence-to-sequence tasks.