Scaled Dot-Product Attention
Key Points
- Scaled dot-product attention scores how much each value in V should contribute to a query by taking dot products with the keys K, scaling by \(1/\sqrt{d_k}\), applying softmax, and forming a weighted sum.
- The \(1/\sqrt{d_k}\) scale keeps gradients well-behaved by preventing overly peaky softmax distributions as the feature dimension grows.
- Softmax turns similarity scores into a probability distribution, so the weights are non-negative and sum to 1 across keys.
- Masks (such as causal or padding masks) are added before softmax as large negative numbers, blocking attention to disallowed positions.
- In matrix form, attention processes many queries at once: \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\).
- Multi-head attention applies this operation several times in parallel on projected subspaces and concatenates the results.
- A numerically stable softmax subtracts the row-wise maximum before exponentiating to avoid overflow.
- The naive implementation runs in \(O(mnd_k + mn + mnd_v)\) time for \(m\) queries of dimension \(d_k\) and \(n\) keys/values, and uses \(O(mn)\) extra memory for the score matrix.
Prerequisites
- Matrix multiplication and transposition — Attention is expressed and implemented via the matrix products QK^T and AV.
- Dot product and cosine similarity — Similarity between queries and keys is computed via dot products.
- Softmax and numerical stability — Turning scores into probabilities and keeping computations stable is essential.
- Basic probability distributions — Attention weights are probabilities that sum to 1 across keys.
- Time and space complexity — Understanding the O(mn) memory and O(mnd) time costs guides design decisions.
- Linear transformations (affine layers) — Multi-head attention uses learned linear projections of inputs into Q, K, and V.
- C++ vectors and loops — Implementations rely on nested loops over std::vector for clarity.
Detailed Explanation
01 Overview
Scaled dot-product attention is a core operation in modern sequence models like Transformers. Given a set of queries Q, keys K, and values V, it computes attention weights that represent how relevant each value is to each query. It does this by first measuring similarity between queries and keys using dot products, then scaling these scores by the square root of the key dimension to keep values in a well-behaved numerical range. The scores are turned into probabilities with a softmax, and the final output is a weighted sum of values using these probabilities as weights. In compact matrix form, the operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Because it is differentiable end-to-end, attention can be learned via gradient descent, allowing models to dynamically focus on different parts of the input. Scaled dot-product attention is efficient, parallelizable, and flexible: it can be applied over time steps in text, patches in images, or nodes in graphs. Variants include masking to enforce causality (no peeking into the future) and multi-head attention, which runs several attention operations in parallel on different learned subspaces, capturing diverse relationships.
02 Intuition & Analogies
Imagine you are searching a bookshelf for a topic (the query). Each book has an index describing its content (the key) and its actual content (the value). To answer your question, you compare your question (query) to each book’s index (key): the more similar they are, the more that book should influence your answer. You convert these similarities into a set of percentages that sum to 100% (softmax), then take a weighted average of the books’ content (values). That final averaged content is your attended answer. The dot product is a fast way to measure how aligned two descriptions are: a bigger dot product means they point in similar directions. However, if your description vectors are long (high dimensional), raw dot products tend to get large just because there are many components adding up. When scores get too large, softmax becomes overly confident about just one book, which harms learning. Dividing by \(\sqrt{d_k}\) acts like a temperature knob, cooling the scores so the softmax doesn’t collapse too quickly. Masks are like a “do not look here” sticker: before computing the percentages, you subtract a huge number from the forbidden books’ scores so they effectively get weight zero after softmax. Multi-head attention is like asking several librarians with different specialties to each give a recommendation and then combining all their suggestions.
03 Formal Definition
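Stated formally, in the notation used throughout this section:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
\qquad Q \in \mathbb{R}^{m \times d_k},\; K \in \mathbb{R}^{n \times d_k},\; V \in \mathbb{R}^{n \times d_v},
\]
```

where the softmax is applied row-wise, so each query's weights over the n keys form a probability distribution.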
04 When to Use
Use scaled dot-product attention whenever you need a model to flexibly relate elements across a set or sequence. Typical cases include natural language processing (e.g., a word’s representation should attend to other relevant words), computer vision (e.g., a patch attends to other patches), and recommendation systems (e.g., a user’s current action attends to historical actions). Employ causal masking for autoregressive generation to prevent information leakage from future positions. Use padding masks to ignore padded tokens in batches of variable-length sequences. Choose multi-head attention when a single similarity notion is insufficient; multiple heads allow the model to capture different relationships (syntax vs. semantics, local vs. global, etc.). For small inputs or prototyping, a direct O(mn) score matrix is fine. For very long sequences (tens of thousands), consider approximate or sparse variants that reduce the quadratic cost, but be mindful of their trade-offs in accuracy and implementation complexity.
⚠️Common Mistakes
- Forgetting the scaling by \(\sqrt{d_k}\) leads to overly sharp softmax distributions and unstable training, especially as \(d_k\) grows. Fix by explicitly dividing scores before softmax or by using a temperature parameter \(\tau = \sqrt{d_k}\).
- Applying masks after softmax instead of before. Correct masking is additive before softmax, using large negative numbers to ensure masked positions receive near-zero probability.
- Numerical instability in softmax due to large scores; always subtract the row-wise maximum prior to exponentials (log-sum-exp trick) and prefer double precision for teaching/demos or float32 with care in production.
- Mixing up keys and values: keys determine where to look; values are the content to aggregate. Ensure K and V are aligned by position.
- Dimension mismatches when implementing multi-head attention: verify that d_model is divisible by the number of heads and that reshapes/splits are along the correct axes.
- Ignoring padding tokens. Without a padding mask, attention can latch onto meaningless pads, degrading performance.
- Using zero to mask before softmax. Zeros do not nullify probabilities; you must add \(-\infty\) (or a large negative constant) so softmax returns (near) zero probability for masked entries.
Key Formulas
Scaled Dot-Product Attention
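The formula this explanation refers to, in standard notation:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
```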
Explanation: Computes similarity between queries and keys, scales by the square root of key dimension, converts to probabilities via softmax, and uses them to form a weighted sum of values.
Softmax
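In standard notation, for a score vector z with n entries:

```latex
\[
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
\]
```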
Explanation: Turns a vector of scores into non-negative weights that sum to 1, making them interpretable as probabilities.
Numerically Stable Softmax
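The stable form shifts every score by the maximum before exponentiating:

```latex
\[
m = \max_i z_i, \qquad
\mathrm{softmax}(z)_i = \frac{e^{z_i - m}}{\sum_{j=1}^{n} e^{z_j - m}}
\]
```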
Explanation: Subtracting the maximum score m prevents overflow in the exponentials while leaving the probabilities unchanged.
Multi-Head Attention
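In the standard formulation, with learned projections \(W_i^Q, W_i^K, W_i^V\) per head and an output projection \(W^O\):

```latex
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
\]
```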
Explanation: Each head attends in a different learned subspace; concatenating and projecting combines their information.
Masked Attention
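With an additive mask M applied inside the softmax:

```latex
\[
\mathrm{Attention}(Q, K, V, M) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,
\qquad M_{ij} \in \{0, -\infty\}
\]
```

In practice \(-\infty\) is replaced by a large negative constant such as \(-10^9\).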
Explanation: Masks are added to scores before softmax so masked positions receive (near) zero probability.
Motivation for Scaling
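Assuming the components of q and k are independent with mean 0 and variance 1:

```latex
\[
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k,
\qquad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
\]
```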
Explanation: With independent components, the variance of dot products grows with dimension. Dividing by \(\sqrt{d_k}\) keeps score magnitudes stable as dimensionality increases.
Softmax Jacobian
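Writing \(p = \mathrm{softmax}(z)\), the Jacobian is:

```latex
\[
\frac{\partial p_i}{\partial z_j} = p_i \left(\delta_{ij} - p_j\right)
\]
```

When one \(p_i\) approaches 1 (a collapsed softmax), every entry of this Jacobian approaches 0, which is why unscaled scores stall learning.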
Explanation: Describes how changes in a score affect each output probability; important for understanding gradient flow in attention.
Time Complexity (Naive)
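For m queries, n keys/values, key dimension \(d_k\), and value dimension \(d_v\):

```latex
\[
O(m n d_k) + O(m n) + O(m n d_v) = O\big(mn(d_k + d_v)\big)
\]
```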
Explanation: Computing the scores \(QK^\top\) takes \(O(mnd_k)\), the row-wise softmax is \(O(mn)\), and multiplying the weights by V costs \(O(mnd_v)\).
Space for Score Matrix
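The extra memory for the m-by-n score (or weight) matrix:

```latex
\[
O(mn)
\]
```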
Explanation: Storing the attention scores (and optionally weights) requires memory proportional to the number of query-key pairs.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute a numerically stable softmax over a vector
vector<double> softmax_stable(const vector<double>& x) {
    double mx = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> exps(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        exps[i] = exp(x[i] - mx); // subtract max for numerical stability
        sum += exps[i];
    }
    for (size_t i = 0; i < x.size(); ++i) exps[i] /= (sum + 1e-12);
    return exps; // probabilities sum to ~1
}

// Dot product between two vectors of equal length
double dot(const vector<double>& a, const vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Single-query attention: q is 1 x d_k, K is n x d_k, V is n x d_v
vector<double> attention_single(const vector<double>& q,
                                const vector<vector<double>>& K,
                                const vector<vector<double>>& V) {
    size_t n = K.size();
    size_t d_k = q.size();
    assert(n == V.size());
    // 1) Scores: s_i = (q . K_i) / sqrt(d_k)
    vector<double> scores(n);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < n; ++i) {
        assert(K[i].size() == d_k);
        scores[i] = dot(q, K[i]) * scale;
    }
    // 2) Softmax to get attention weights
    vector<double> w = softmax_stable(scores);
    // 3) Weighted sum of values
    size_t d_v = V[0].size();
    vector<double> out(d_v, 0.0);
    for (size_t i = 0; i < n; ++i) {
        assert(V[i].size() == d_v);
        for (size_t j = 0; j < d_v; ++j) out[j] += w[i] * V[i][j];
    }
    return out;
}

int main() {
    // Example with small, readable numbers: d_k = 4, n = 3 keys/values, d_v = 2
    vector<double> q = {0.1, 0.2, 0.3, 0.4};
    vector<vector<double>> K = {
        {0.0, 0.1, 0.0, 0.1},
        {0.2, 0.1, 0.0, 0.0},
        {0.1, 0.0, 0.3, 0.1}
    };
    vector<vector<double>> V = {
        {1.0, 0.0},
        {0.0, 1.0},
        {1.0, 1.0}
    };

    vector<double> out = attention_single(q, K, V);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output: ";
    for (double x : out) cout << x << ' ';
    cout << "\n";

    // Extra: verify that the attention weights sum to ~1
    double scale = 1.0 / sqrt((double)q.size());
    vector<double> scores;
    for (auto& k : K) scores.push_back(dot(q, k) * scale);
    vector<double> w = softmax_stable(scores);
    double sumw = accumulate(w.begin(), w.end(), 0.0);
    cout << "Sum of attention weights: " << sumw << "\n";
    return 0;
}
```
This program computes scaled dot-product attention for a single query vector. It forms similarity scores to each key via dot products, scales by \(1/\sqrt{d_k}\), applies a numerically stable softmax, and returns the weighted sum of values. The example prints the output vector and confirms that attention weights sum to 1.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Mat = vector<vector<double>>; // simple dense matrix helper

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Build a lower-triangular (causal) mask: allow j <= i, disallow j > i
Mat causal_mask(size_t m, size_t n) {
    Mat M(m, vector<double>(n, 0.0));
    const double NEG_INF = -1e9; // practical -inf for softmax
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            if (j > i) M[i][j] = NEG_INF;
    return M;
}

// Batched attention: Q (m x d_k), K (n x d_k), V (n x d_v), optional additive mask M (m x n)
Mat attention_batched(const Mat& Q, const Mat& K, const Mat& V, const Mat* M = nullptr) {
    size_t m = Q.size(), d_k = Q[0].size();
    size_t n = K.size();
    assert(K[0].size() == d_k);
    assert(V.size() == n);

    // 1) S = Q K^T / sqrt(d_k)
    Mat KT = transpose(K);
    Mat S = matmul(Q, KT);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            S[i][j] *= scale;

    // 2) Add mask before softmax, if provided
    if (M) {
        assert((*M).size() == m && (*M)[0].size() == n);
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                S[i][j] += (*M)[i][j];
    }

    // 3) Row-wise softmax to get attention weights A
    softmax_rows_inplace(S); // now S holds A

    // 4) Output O = A V
    return matmul(S, V);
}

int main() {
    // Small demo: m = n = 4, d_k = 3, d_v = 2, with causal mask
    size_t m = 4, n = 4, d_k = 3, d_v = 2;
    Mat Q(m, vector<double>(d_k)), K(n, vector<double>(d_k)), V(n, vector<double>(d_v));

    // Initialize with deterministic values for reproducibility
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < d_k; ++j)
            Q[i][j] = (i + 1) * 0.1 + (j + 1) * 0.01;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_k; ++j)
            K[i][j] = (i + 1) * 0.05 + (j + 1) * 0.02;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_v; ++j)
            V[i][j] = (i + 1) * 0.2 + j * 0.1;

    Mat M = causal_mask(m, n);
    Mat O = attention_batched(Q, K, V, &M);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output (each row corresponds to a query):\n";
    for (auto& row : O) {
        for (double x : row) cout << x << ' ';
        cout << '\n';
    }
    return 0;
}
```
This program implements the full matrix form of scaled dot-product attention with an additive causal mask. It computes S = QK^T/√d_k, adds a lower-triangular mask (−1e9 above the diagonal), applies stable row-wise softmax to get attention weights, and multiplies by V to produce outputs. The example constructs small deterministic matrices for clarity and prints the resulting attended representations.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Mat = vector<vector<double>>;

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Slice columns [c0, c1) from A
Mat slice_cols(const Mat& A, size_t c0, size_t c1) {
    size_t r = A.size();
    Mat S(r, vector<double>(c1 - c0));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = c0; j < c1; ++j)
            S[i][j - c0] = A[i][j];
    return S;
}

// Concatenate matrices horizontally (same row count)
Mat hconcat(const vector<Mat>& parts) {
    size_t r = parts[0].size();
    size_t totalc = 0;
    for (auto& P : parts) { assert(P.size() == r); totalc += P[0].size(); }
    Mat C(r, vector<double>(totalc));
    size_t off = 0;
    for (auto& P : parts) {
        size_t c = P[0].size();
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < c; ++j)
                C[i][j + off] = P[i][j];
        off += c;
    }
    return C;
}

Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    size_t d_k = Q[0].size();
    Mat S = matmul(Q, transpose(K));
    double scale = 1.0 / sqrt((double)d_k);
    for (auto& row : S) for (double& x : row) x *= scale;
    softmax_rows_inplace(S);
    return matmul(S, V);
}

int main() {
    // Single-sequence self-attention: X (n x d_model)
    size_t n = 5, d_model = 8, h = 2; // 2 heads => d_k = d_v = 4
    assert(d_model % h == 0);
    size_t d_k = d_model / h;

    // Input tokens (deterministic pattern for the demo)
    Mat X(n, vector<double>(d_model));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_model; ++j)
            X[i][j] = 0.01 * (i + 1) + 0.02 * (j + 1);

    // Projection matrices: W_Q, W_K, W_V, W_O (d_model x d_model)
    auto init = [&](double base) {
        Mat W(d_model, vector<double>(d_model));
        for (size_t i = 0; i < d_model; ++i)
            for (size_t j = 0; j < d_model; ++j)
                W[i][j] = base + 0.001 * (i + 1) + 0.0001 * (j + 1);
        return W;
    };
    Mat WQ = init(0.05), WK = init(0.04), WV = init(0.03), WO = init(0.02);

    // Linear projections
    Mat Q = matmul(X, WQ), K = matmul(X, WK), V = matmul(X, WV);

    // Split into heads, attend per head, and collect the outputs
    vector<Mat> head_outputs;
    for (size_t head = 0; head < h; ++head) {
        size_t c0 = head * d_k, c1 = c0 + d_k;
        Mat Qh = slice_cols(Q, c0, c1);
        Mat Kh = slice_cols(K, c0, c1);
        Mat Vh = slice_cols(V, c0, c1); // d_v == d_k in this minimal demo
        head_outputs.push_back(attention(Qh, Kh, Vh));
    }

    // Concat heads and apply the output projection
    Mat O_concat = hconcat(head_outputs); // n x d_model
    Mat Out = matmul(O_concat, WO);       // n x d_model

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Multi-head self-attention output (first 3 tokens):\n";
    for (size_t i = 0; i < min((size_t)3, n); ++i) {
        for (size_t j = 0; j < d_model; ++j) cout << Out[i][j] << ' ';
        cout << '\n';
    }
    return 0;
}
```
This program assembles a minimal multi-head self-attention layer from the scaled dot-product primitive. It projects inputs into Q, K, V, splits along the feature dimension into h heads, applies attention per head, concatenates outputs, and applies a final output projection. This mirrors the Transformer’s core computation while keeping the math simple and explicit.