How I Study AI - Learn AI Papers & Lectures the Easy Way

Transformer Expressiveness

Key Points

  • Transformer expressiveness studies what kinds of sequence-to-sequence mappings a Transformer can represent or approximate.
  • With sufficient width, depth, and appropriate positional encodings, Transformers are universal approximators on bounded-length sequences.
  • Softmax attention can approximate hard routing (argmax) by using large logits or a small temperature, enabling near-discrete token selection.
  • Stacking attention with feed-forward networks composes lookups and nonlinear transforms, letting Transformers simulate complex algorithms.
  • Positional information (absolute or relative) is essential; without it, a Transformer cannot distinguish permutations of the inputs.
  • Expressiveness does not imply easy training or good generalization; it only states the existence of parameters that realize a mapping.
  • Theoretical universality results often assume a fixed maximum length and continuous outputs; discrete tasks are handled by sharp approximations.
  • Self-attention layers have quadratic cost in sequence length, which can be a practical limit despite strong expressiveness.

Prerequisites

  • Linear algebra (vectors, matrices, dot products) — Attention is built from projections and dot products; understanding these is essential to follow the mechanics.
  • Probability and the softmax function — Softmax converts logits to a distribution that weights token contributions and controls routing sharpness.
  • Neural networks and ReLU MLPs — Position-wise feed-forward networks provide nonlinearity and universal approximation in feature space.
  • Sequence modeling basics — Grasping tokens, embeddings, and positional encodings clarifies how order enters the computation.
  • Big-O complexity analysis — Expressiveness must be balanced with the quadratic cost of attention in practice.
  • Continuity and compact sets — Formal universality statements are framed for continuous functions on compact domains.

Detailed Explanation


01Overview

Transformer expressiveness asks: given an input sequence, what functions can a Transformer compute or approximate? A modern Transformer layer combines self-attention (which dynamically routes information among tokens) with position-wise feed-forward networks and residual connections. The key insight is that attention can focus on particular tokens based on their content and position, while the feed-forward part applies rich nonlinear transformations. Theoretical results show that, under mild assumptions (e.g., bounded sequence length, suitable positional encodings, and enough width), Transformers are universal approximators of continuous sequence-to-sequence functions. Intuitively, they can simulate lookup tables, conditional logic, and arithmetic with arbitrary precision by making attention nearly discrete and by stacking layers to compose operations. This is similar in spirit to classical universal approximation theorems for multilayer perceptrons, but tailored to variable-length structured inputs. In practice, this means that if a task can be expressed as a function over sequences—translation, copying, reversing, sorting small lists, aggregating statistics, or emulating finite-state machines—then some Transformer with appropriate parameters can implement it on inputs up to a fixed length. However, expressiveness is a worst-case existence statement; optimization, data, inductive bias, and computational cost determine whether we can actually find those parameters and run them efficiently.

02Intuition & Analogies

Picture a Transformer as a team of spotlight operators on a stage of tokens. Each spotlight (an attention head) can swivel to highlight whichever token seems most relevant for its current position. When the spotlight is very bright on exactly one actor (thanks to a sharp softmax), the head is effectively choosing that token, much like a pointer in a program. Multiple spotlights can work in parallel to gather different pieces of information. After the spotlights settle, a local crew (the feed-forward network at each position) transforms the gathered information—comparable to applying a small toolbox of nonlinear gadgets. Residual connections let the model keep what it already knows while adding new computed details, like taking notes in layers. Positional encodings are the stage marks on the floor telling each actor where they stand; without them, a choreographer who only sees the actors but not their marks could not stage order-specific moves. By stacking layers, Transformers can perform sequences of instructions: first look up a specific partner, then compute a relation, then forward the result to someone else who uses it for the next step. Sharpening the softmax (by scaling logits) turns the fuzzy spotlight into a near on/off switch, enabling almost discrete control flow. Because ReLU feed-forward networks can approximate arbitrary continuous functions on compact sets, once the relevant context is routed by attention, the local MLPs can implement the desired transformation. In short, attention does the routing, MLPs do the computing, and positions give order—together forming a programmable circuit over sequences.

03Formal Definition

Let X ⊂ R^{d_in} be a compact set of token embeddings and consider sequences of length L with inputs in X^L. A Transformer T_θ maps X^L to Y^{L'} ⊂ R^{d_out} via compositions of: (1) multi-head attention layers MHA(Q, K, V) with positional encodings added to the inputs, and (2) position-wise feed-forward networks (FFN) with nonlinearities (e.g., ReLU), along with residual connections and layer normalization. A universal approximation statement for Transformers asserts: for any continuous function F: X^L → Y^{L'} and any ε > 0, there exist parameters θ such that sup_{x∈X^L} ∥T_θ(x) − F(x)∥_∞ < ε. For discrete alphabets Σ and bounded length L, viewing tokens as one-hot or embedded points in R^d, any function f: Σ^{≤L} → Γ^{≤L'} can be represented exactly by a sufficiently wide/deep Transformer with near-hard attention (e.g., with large-magnitude logits or a small temperature), or approximated arbitrarily well in the continuous embedding space. A standard attention head computes, for i, j ∈ {1, …, L}, S_ij = (QK⊤)_ij / √d_k plus an optional positional bias, A_ij = softmax_j(S_ij), and outputs Z = AV. Multi-head attention concatenates head outputs and applies a linear map. With positional encodings e_pos(i) added to token embeddings, attention becomes position-sensitive, which is crucial for representing order-dependent functions.

04When to Use

Use Transformer expressiveness theory to reason about whether a Transformer architecture can, in principle, perform your sequence task. It applies when you design models for translation, summarization, program synthesis, sorting short lists, routing-based reasoning, copying/reversing, or tasks that require content- and position-dependent token interactions. If your data are bounded in length (e.g., fixed-size protocol messages, small sequences in embedded systems), universality results give confidence that a suitable Transformer can represent the mapping. It’s also helpful when crafting hand-engineered modules: relative position biases can enforce algorithmic behaviors (like shifting or reversing), and multiple heads can specialize in complementary sub-tasks (e.g., one head finds a delimiter while another aggregates numbers). The theory also guides simplifications: if you only need pooling, a single head with uniform attention plus an FFN can implement sum/average, potentially replacing more complex recurrent designs. However, remember limitations: universality says parameters exist but does not address how easy they are to learn, how well they generalize to longer sequences than seen in training, or the computational cost of quadratic attention. For very long sequences, consider sparse/linearized attention; though expressiveness might change, practice often favors tractable approximations with good inductive bias.

⚠️Common Mistakes

• Equating expressiveness with trainability: A Transformer may be able to represent a function, but gradient descent might not find those parameters without careful initialization, curricula, or inductive biases.
• Ignoring positional information: Without positional encodings or relative position biases, self-attention is permutation-invariant, so order-dependent tasks (like reversing or sorting) cannot be represented.
• Confusing soft and hard attention: Softmax never becomes exactly one-hot at finite scale; it only approximates argmax. Expect small approximation error unless logits are very large, and watch out for numerical overflow.
• Overlooking fixed-length assumptions: Many universality theorems assume a maximum sequence length. A model that implements a function perfectly up to L may fail to generalize to longer sequences without architectural bias.
• Neglecting precision and stability: Implementations with float32 can saturate softmax when logits are large, breaking the intended near-discrete routing; use a numerically stable softmax and choose temperature/scale carefully.
• Misattributing performance: Strong empirical results can come from pretraining and data scale, not just expressiveness. Conversely, failures do not contradict universality; they may reflect optimization or sample-complexity issues.
• Underestimating quadratic costs: Even if a function is representable, O(L^2) memory/time for attention can make it impractical for long sequences; consider approximate attention or chunking.
• Forgetting capacity trade-offs: Depth, width, and number of heads interact; too few heads can bottleneck routing, while overly wide layers can overfit without improving algorithmic generalization.

Key Formulas

Scaled Dot-Product Attention

A = softmax(QK⊤ / √d_k),  Z = AV

Explanation: Attention weights are computed by a scaled dot product between queries and keys, passed through softmax, and used to average the values. This is the core mechanism for routing information among tokens.

Multi-Head Attention

MHA(X) = concat(Z^(1), …, Z^(h)) W_O

Explanation: Multiple heads compute attention in parallel with their own Q,K,V projections; their outputs are concatenated and linearly projected. This allows the model to capture different relationships simultaneously.

Position-wise Feed-Forward

FFN(x) = W_2 σ(W_1 x + b_1) + b_2

Explanation: A small MLP is applied independently at each position. With ReLU or GELU nonlinearity, it can approximate arbitrary continuous functions on compact domains when sufficiently wide.

Transformer Universal Approximation

∀ε > 0 ∃θ : sup_{x∈K} ∥T_θ(x) − F(x)∥_∞ < ε

Explanation: For any continuous target function F on a compact set K of sequences and any desired error bound, there are Transformer parameters that achieve that accuracy. This formalizes expressiveness as an existence claim.

Softmax Temperature Limit

lim_{τ→0⁺} softmax(z/τ) = e_{argmax_i z_i}

Explanation: As temperature approaches zero, the softmax distribution concentrates on the maximum entry, behaving like a one-hot argmax. This underpins near-discrete token selection via attention.

Relative Position Bias

S_ij = Q_i K_j⊤ / √d_k + B_{i−j}

Explanation: An additive bias depending on relative positions sharpens or redirects attention patterns. Properly chosen biases can implement algorithmic behaviors like shifting or reversing.

Sinusoidal Positional Encoding

PE(pos, 2k) = sin(pos / 10000^{2k/d_model}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_model})

Explanation: Sine and cosine waves of different frequencies encode absolute positions. Adding these to embeddings gives the model access to order information without learned parameters.

Attention Complexity

C_attn(L, d) = O(L^2 d),  M_attn(L) = O(L^2)

Explanation: Computing all pairwise interactions between L tokens scales quadratically in time and memory. This cost often dominates Transformer layers on long sequences.

Convex Combination View

Z_i = Σ_{j=1}^{L} A_ij V_j,  with A_ij ≥ 0 and Σ_j A_ij = 1

Explanation: Each output token is a convex combination of value vectors, weighted by attention. With sharp weights, this acts like selecting or copying specific tokens.

Supremum Norm

∥F∥_∞ = sup_{x∈K} ∥F(x)∥

Explanation: The sup-norm measures worst-case error over a domain. Universal approximation guarantees control this maximum deviation.

Complexity Analysis

A standard Transformer layer with sequence length L, model dimension d, and h heads (each of size d_k = d/h) has attention cost dominated by computing QK⊤ and AV. Forming Q and K takes O(L d^2) for dense projections, but the pairwise score matrix S = QK⊤/√d_k requires O(L^2 d_k) per head, i.e., O(L^2 d) total. The softmax and the multiplication by V add O(L^2) and O(L^2 d_k) per head, respectively. Overall, attention is O(L^2 d) time and O(L^2) memory for storing S (and often A). The position-wise FFN costs O(L d d_ff) time and O(L d_ff) memory for activations, where d_ff is the hidden width (typically 4d). For moderate L, the FFN can dominate; for large L, the quadratic attention term becomes the bottleneck. In space, besides parameters O(d^2 + d d_ff), runtime memory includes activations: O(L d) for token representations and O(L^2) for attention maps (during training/backprop; inference can sometimes avoid storing A fully). Expressiveness-wise, more heads h increase parallel routing capacity without changing the asymptotic cost beyond the constant factor h, since the total head width is fixed to d. Techniques like sparse attention, low-rank kernels, or linear-attention variants reduce time and memory to near O(L d)–O(L log L), but may trade off exact pairwise interaction modeling, potentially affecting expressiveness for certain functions. Nonetheless, for bounded L (the regime of most universality results), the O(L^2 d) cost is acceptable in theory, and width/depth can be scaled to achieve approximation guarantees.

Code Examples

Single-head self-attention with sinusoidal positional encodings (forward pass only)
#include <bits/stdc++.h>
using namespace std;

// Numerically stable softmax over a vector
vector<double> softmax(const vector<double>& x) {
    double m = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> exps(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        exps[i] = exp(x[i] - m);
        sum += exps[i];
    }
    for (size_t i = 0; i < x.size(); ++i) exps[i] /= max(sum, 1e-300);
    return exps;
}

// Sinusoidal positional encoding of dimension d_model for position pos
vector<double> sinusoidal_positional_encoding(int pos, int d_model) {
    vector<double> pe(d_model, 0.0);
    for (int i = 0; i < d_model; i += 2) {
        double denom = pow(10000.0, (double)i / d_model);
        pe[i] = sin(pos / denom);
        if (i + 1 < d_model) pe[i + 1] = cos(pos / denom);
    }
    return pe;
}

// Simple linear layer: y = x W + b; shapes: x (1 x d_in), W (d_in x d_out)
vector<double> linear(const vector<double>& x, const vector<vector<double>>& W, const vector<double>& b) {
    int d_in = (int)W.size();
    int d_out = (int)W[0].size();
    vector<double> y(d_out, 0.0);
    for (int j = 0; j < d_out; ++j) {
        double s = b[j];
        for (int i = 0; i < d_in; ++i) s += x[i] * W[i][j];
        y[j] = s;
    }
    return y;
}

// Dot product
inline double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Self-attention forward pass (single head) with optional relative position bias
// X: L x d_model token embeddings; returns L x d_v outputs
vector<vector<double>> self_attention_single_head(
    const vector<vector<double>>& X,
    const vector<vector<double>>& Wq,
    const vector<vector<double>>& Wk,
    const vector<vector<double>>& Wv,
    const vector<double>& bq,
    const vector<double>& bk,
    const vector<double>& bv,
    const vector<vector<double>>& rel_bias // L x L positional bias (can be all zeros)
) {
    int L = (int)X.size();
    int d_k = (int)Wq[0].size(); // also Wk[0].size()
    int d_v = (int)Wv[0].size();

    vector<vector<double>> Q(L), K(L), V(L);
    for (int i = 0; i < L; ++i) {
        Q[i] = linear(X[i], Wq, bq);
        K[i] = linear(X[i], Wk, bk);
        V[i] = linear(X[i], Wv, bv);
    }

    // Compute attention outputs
    vector<vector<double>> Z(L, vector<double>(d_v, 0.0));
    double scale = 1.0 / sqrt((double)d_k);
    for (int i = 0; i < L; ++i) {
        vector<double> logits(L, 0.0);
        for (int j = 0; j < L; ++j) {
            logits[j] = dot(Q[i], K[j]) * scale + rel_bias[i][j];
        }
        vector<double> a = softmax(logits); // attention weights over j
        for (int j = 0; j < L; ++j) {
            for (int c = 0; c < d_v; ++c) Z[i][c] += a[j] * V[j][c];
        }
    }
    return Z;
}

int main() {
    // Example: 4 tokens, model dim 6
    int L = 4, d_model = 6, d_k = 4, d_v = 4;

    // Toy token embeddings (could be outputs of an embedding layer)
    vector<vector<double>> tokens(L, vector<double>(d_model, 0.0));
    // Put simple numeric content in the first coordinate, rest zeros; add positional encodings
    vector<double> contents = {1.0, 2.0, 3.0, 4.0};
    for (int i = 0; i < L; ++i) {
        tokens[i][0] = contents[i];
        vector<double> pe = sinusoidal_positional_encoding(i, d_model);
        for (int k = 0; k < d_model; ++k) tokens[i][k] += 0.1 * pe[k]; // small positional signal
    }

    // Initialize small random weights for Q, K, V (fixed seed for reproducibility)
    std::mt19937 gen(42);
    std::normal_distribution<double> nd(0.0, 0.2);
    auto rand_matrix = [&](int r, int c) {
        vector<vector<double>> W(r, vector<double>(c));
        for (int i = 0; i < r; ++i)
            for (int j = 0; j < c; ++j) W[i][j] = nd(gen);
        return W;
    };
    auto rand_bias = [&](int c) { vector<double> b(c, 0.0); return b; };

    vector<vector<double>> Wq = rand_matrix(d_model, d_k);
    vector<vector<double>> Wk = rand_matrix(d_model, d_k);
    vector<vector<double>> Wv = rand_matrix(d_model, d_v);
    vector<double> bq = rand_bias(d_k), bk = rand_bias(d_k), bv = rand_bias(d_v);

    // No relative bias in this basic example
    vector<vector<double>> rel_bias(L, vector<double>(L, 0.0));

    auto Z = self_attention_single_head(tokens, Wq, Wk, Wv, bq, bk, bv, rel_bias);

    cout << fixed << setprecision(4);
    cout << "Self-attention outputs (one head):\n";
    for (int i = 0; i < L; ++i) {
        cout << "pos " << i << ": ";
        for (double v : Z[i]) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}

This program implements a minimal single-head self-attention forward pass with sinusoidal positional encodings. It projects tokens to queries, keys, and values, computes scaled dot-product attention with a numerically stable softmax, and returns the weighted sum of values. Although weights are random here, the same code can realize algorithmic behaviors (copying, shifting) once Q/K/V and positional biases are set appropriately—illustrating how attention performs content- and position-dependent routing.

Time: O(L^2 d_k + L d_model d_k + L d_model d_v) · Space: O(L^2 + L(d_k + d_v))
Using relative position bias to implement sequence reversal (near-exact via sharp softmax)
#include <bits/stdc++.h>
using namespace std;

// Numerically stable softmax
vector<double> softmax(const vector<double>& x) {
    double m = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> e(x.size());
    for (size_t i = 0; i < x.size(); ++i) { e[i] = exp(x[i] - m); sum += e[i]; }
    for (double& v : e) v /= max(sum, 1e-300);
    return e;
}

// Reverse via attention: set logits to a very large value at j = L-1-i and small elsewhere.
// This mimics a Transformer head with a strong relative position bias.
int main() {
    int L = 6; // sequence length
    vector<vector<double>> V(L, vector<double>(1, 0.0)); // values hold the content (one scalar per token)
    for (int i = 0; i < L; ++i) V[i][0] = (double)(i + 1); // contents: 1,2,3,4,5,6

    // Build relative bias matrix B where B[i][j] is large iff j == L-1-i
    double big = 50.0; // large scale -> softmax ~ one-hot
    vector<vector<double>> B(L, vector<double>(L, -10.0)); // small baseline elsewhere
    for (int i = 0; i < L; ++i) {
        int j_star = L - 1 - i;
        B[i][j_star] = big;
    }

    // Attention weights for each query position i
    vector<vector<double>> A(L, vector<double>(L));
    for (int i = 0; i < L; ++i) A[i] = softmax(B[i]);

    // Output Z = A V (here d_v = 1, so it's just a weighted sum of scalars)
    vector<double> Z(L, 0.0);
    for (int i = 0; i < L; ++i) {
        for (int j = 0; j < L; ++j) Z[i] += A[i][j] * V[j][0];
    }

    cout << fixed << setprecision(6);
    cout << "Input:  "; for (int i = 0; i < L; ++i) cout << V[i][0] << ' '; cout << '\n';
    cout << "Output: "; for (int i = 0; i < L; ++i) cout << Z[i] << ' '; cout << '\n';

    // Show how close to exact reversal we are by comparing to the reversed sequence
    cout << "Target: "; for (int i = 0; i < L; ++i) cout << (double)(L - i) << ' '; cout << '\n';

    double max_err = 0.0;
    for (int i = 0; i < L; ++i) max_err = max(max_err, fabs(Z[i] - (double)(L - i)));
    cout << "Max abs error: " << max_err << '\n';
    return 0;
}

This example demonstrates that a single attention head with an appropriate relative position bias can implement sequence reversal. By assigning a very large logit to the reversed index j = L − 1 − i and small values elsewhere, the softmax becomes nearly one-hot, so each position i copies the value from its reversed partner. Increasing the scale makes the operation arbitrarily close to exact, illustrating how soft attention approximates hard routing and why Transformers can realize algorithmic, order-dependent functions.

Time: O(L^2) · Space: O(L^2)
Uniform-attention pooling to compute the sum (CLS-style aggregation)
#include <bits/stdc++.h>
using namespace std;

// Compute the sum of a sequence via attention: a special query attends uniformly to all tokens.
int main() {
    int L = 5; // number of tokens
    // Values hold 2D features; here we only use the first dim as the numbers to sum
    vector<vector<double>> V = {{1.0, 0.0}, {2.0, 0.0}, {3.0, 0.0}, {4.0, 0.0}, {5.0, 0.0}};

    // A single query (CLS) with zero logits to every key: the softmax of all zeros is uniform 1/L
    vector<double> A(L, 1.0 / L);

    // Z = sum_j A_j * V_j is the mean; multiply by L to get the sum
    vector<double> Z(2, 0.0);
    for (int j = 0; j < L; ++j) {
        Z[0] += A[j] * V[j][0];
        Z[1] += A[j] * V[j][1];
    }
    double mean_first = Z[0];
    double sum_first = mean_first * L; // emulate an FFN that scales by L

    cout << fixed << setprecision(4);
    cout << "Numbers: "; for (int j = 0; j < L; ++j) cout << V[j][0] << ' '; cout << '\n';
    cout << "Uniform weights: "; for (int j = 0; j < L; ++j) cout << A[j] << ' '; cout << '\n';
    cout << "Mean (first dim): " << mean_first << '\n';
    cout << "Sum (first dim): " << sum_first << '\n';
    return 0;
}

A special query (like a CLS token) with identical logits to all keys produces uniform attention. The attended value is the mean of token values; a subsequent linear layer can multiply by L to recover the sum. This showcases how attention plus a simple feed-forward step can exactly implement useful aggregations, a building block for many sequence-to-sequence tasks.

Time: O(L d_v) · Space: O(d_v)
Tags: transformer expressiveness · universal approximation · self-attention · multi-head attention · positional encoding · relative position bias · softmax temperature · sequence-to-sequence · routing · feed-forward network · causal mask · sup norm · bounded length · quadratic complexity · algorithmic tasks