📚 Theory · Intermediate

Sequence-to-Sequence with Attention

Key Points

  • Sequence-to-sequence with attention lets a decoder focus on the most relevant parts of the input at each output step, rather than compressing everything into a single vector.
  • Attention computes a set of alignment scores over encoder states and forms a context vector as a weighted average using softmax weights.
  • Common attention scoring functions include dot, general (bilinear), additive (Bahdanau), and scaled dot-product (as in Transformers).
  • The computational cost typically scales as O(T_s T_t d), where T_s and T_t are source and target lengths and d is the hidden size.
  • Attention greatly improves translation, summarization, and dialogue models by handling long dependencies and variable-length inputs.
  • In C++, you can implement attention with plain vectors and matrices: compute scores, apply softmax row-wise, and take weighted sums.
  • Training uses teacher forcing and cross-entropy loss, while decoding commonly uses greedy or beam search with attention at each step.
  • Numerical stability (e.g., softmax with max subtraction) and correct masking are essential for robust implementations.

Prerequisites

  • Linear Algebra (vectors, matrices, dot products) — Attention is implemented with matrix multiplications and vector operations.
  • Probability and Softmax — Attention weights and output distributions are probabilities computed via softmax.
  • RNNs / GRUs / LSTMs — Classic Seq2Seq encoders/decoders are recurrent; understanding state updates is key.
  • Backpropagation and Gradient Descent — Training learns attention and projection parameters by minimizing cross-entropy.
  • Numerical Stability — Stable softmax and masking are crucial to avoid overflow and incorrect attention.
  • Autoregressive Decoding — The decoder predicts each token conditioned on previous tokens and attention.
  • Beam Search (optional) — Improves inference quality over greedy decoding by exploring multiple hypotheses.
  • Transformer Basics (optional) — Scaled dot-product attention generalizes the same idea to fully attention-based models.

Detailed Explanation

01Overview

Sequence-to-Sequence (Seq2Seq) with Attention is an encoder–decoder architecture for mapping an input sequence (like a sentence) to an output sequence (like its translation). In a vanilla Seq2Seq model, the encoder compresses the entire input into a single fixed-size vector, and the decoder generates outputs from this bottleneck. Attention removes this bottleneck by allowing the decoder to dynamically look back at the encoder’s hidden states at every time step. Concretely, the decoder computes a set of relevance scores over all encoder states and forms a context vector as a weighted average. This context vector, combined with the decoder’s internal state, predicts the next token. This mechanism lets the model handle much longer sequences and alignments (which input parts support which outputs). The approach generalizes well beyond text: it’s used in speech recognition, captioning, and even non-neural algorithms that need soft alignment. Variants include Bahdanau (additive) attention, Luong (dot/general) attention, and scaled dot-product attention (central to Transformers). The practical result is better accuracy, interpretable alignments, and improved gradient flow during training, especially for long-range dependencies.

02Intuition & Analogies

Imagine translating a paragraph with a highlighter in your hand. You don’t first memorize the entire paragraph and then write the translation from memory; instead, at each translated word, you glance back and highlight the most relevant original words. That highlight intensity over the source words is your attention distribution, and the weighted combination of those highlighted words is your context. The decoder is like you choosing the next word to write, guided by what you’re currently focusing on plus your memory of what you’ve already written. Another analogy: think of a searchlight sweeping across a landscape (the encoder states). At each moment (each output token), the beam focuses on certain regions (relevant positions). The brightness at each location is the attention weight, and the combined reflected light is the context vector. Crucially, the beam can move and spread, capturing multiple relevant spots when necessary. Without attention, you’d be forced to summarize the whole landscape into a single snapshot—a lossy compression. With attention, you get to re-consult the original scene repeatedly, choosing different parts as the needs of the translation evolve. This dynamic lookup not only improves accuracy but also yields a human-interpretable alignment map, showing which input pieces influenced each output decision.

03Formal Definition

Let an input sequence x = (x_1, …, x_{T_s}) be encoded into hidden states h_i ∈ R^{d_h} using a recurrent, convolutional, or transformer encoder: h_i = f(x_i, h_{i-1}) or a parallel variant. The decoder maintains a state s_t ∈ R^{d_s} and predicts y_t given previous outputs y_{<t}. Attention introduces a score function e_{t,i} = score(s_{t-1}, h_i) that measures the relevance of encoder state h_i to producing y_t. The attention weights are α_{t,i} = exp(e_{t,i}) / ∑_{j=1}^{T_s} exp(e_{t,j}). The context vector is c_t = ∑_{i=1}^{T_s} α_{t,i} h_i. Typical scoring choices: dot (e_{t,i} = s_{t-1}^T h_i), general/bilinear (e_{t,i} = s_{t-1}^T W h_i), and additive/Bahdanau (e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i)). The decoder update is s_t = g(s_{t-1}, y_{t-1}, c_t), followed by the output distribution p(y_t | y_{<t}, x) = softmax(W_o [s_t; c_t] + b_o). Training minimizes the negative log-likelihood L = −∑_{t=1}^{T_t} log p(y*_t | y*_{<t}, x) using teacher forcing (feeding the ground-truth y*_{t-1}). During inference, greedy or beam search selects y_t from p(· | y_{<t}, x), recomputing attention at each step.

04When to Use

Use Seq2Seq with attention whenever you need to map variable-length inputs to variable-length outputs and long-range dependencies matter. Core scenarios include machine translation, summarization (abstractive), question answering with spans or generated answers, image captioning (attention over CNN features), speech recognition (attention over acoustic frames), and code generation. Attention is also valuable when interpretability is required: alignment heatmaps show which input positions influenced each output token, aiding debugging and trust. If your baseline Seq2Seq struggles with long sentences or loses critical details, attention is a natural upgrade that usually boosts performance without changing the overall training recipe. For resource-constrained settings, attention’s extra O(T_s T_t) compute is still often worthwhile thanks to significant accuracy gains. If you need massive scalability or purely parallel computation, consider transformer-style self-attention encoders and decoders, but the cross-attention idea remains the same: compute relevance between decoder queries and encoder keys/values. In structured prediction (like parsing) or program synthesis, attention provides soft alignment, which can serve as a differentiable proxy for hard alignment or selection.

⚠️Common Mistakes

  • Forgetting numerical stability in softmax: directly exponentiating large scores can overflow. Fix by subtracting the row-wise maximum before exp. Also clamp extremely small probabilities if you compute logs.
  • Ignoring masks: you must mask padding positions in the encoder sequence so they get zero probability mass. Otherwise, the model can attend to meaningless padding and degrade.
  • Mismatched dimensions: score functions require consistent shapes (e.g., dot product needs d_s = d_h or a projection W to bridge sizes). Double-check matrix sizes and concatenations.
  • Treating context and state interchangeably: the decoder state s_t and context c_t play different roles; concatenation vs. addition changes capacity and should match your chosen variant (Luong vs. Bahdanau).
  • Overly small attention hidden size (for additive attention): too small d_a bottlenecks alignment expressiveness; too large increases compute and overfitting risk.
  • Overlooking inference-time complexity: beam search multiplies compute by beam size B and vocabulary size V per step; plan memory and time accordingly.
  • Not inspecting alignments: bad or peaky/flat attention patterns often reveal bugs (masking, scaling) or learning issues (learning rate, regularization).
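The masking pitfall can be made concrete with a short sketch. The helper below (a hypothetical name, not from the examples later in this page) assigns negative infinity to padded positions before a stable softmax, so they receive exactly zero probability mass; it assumes at least one valid position per row.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Stable softmax over scores, where mask[i] == false marks a padding
// position that must receive zero probability mass.
std::vector<double> masked_softmax(std::vector<double> scores,
                                   const std::vector<bool> &mask) {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i)
        if (!mask[i]) scores[i] = neg_inf;  // exclude padding from the max and the sum
    double mx = *std::max_element(scores.begin(), scores.end());
    double sum = 0.0;
    std::vector<double> p(scores.size(), 0.0);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (mask[i]) { p[i] = std::exp(scores[i] - mx); sum += p[i]; }
    }
    for (double &v : p) v /= sum;  // valid positions sum to 1; padded stay exactly 0
    return p;
}
```

With scores {1.0, 2.0, 5.0} and the last position masked, the largest raw score contributes nothing: the weights are distributed over the first two positions only.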

Key Formulas

Encoder Recurrence

h_t = f(x_t, h_{t-1})

Explanation: The encoder processes the input sequence step by step, updating its hidden state from the previous state and the current token. f can be an RNN, GRU, LSTM, CNN block, or a transformer layer (in parallel form).

Attention Score

e_{t,i} = score(s_{t-1}, h_i)

Explanation: Each encoder position i gets a relevance score for producing the t-th output. The score function can be dot, general (bilinear), or additive (Bahdanau).

Attention Weights (Softmax)

α_{t,i} = exp(e_{t,i}) / ∑_{j=1}^{T_s} exp(e_{t,j})

Explanation: Scores are normalized into a probability distribution over positions. Larger scores get higher weights, but the weights sum to 1 across all positions.

Context Vector

c_t = ∑_{i=1}^{T_s} α_{t,i} h_i

Explanation: The context is a weighted average of encoder states, emphasizing relevant positions. It is fed to the decoder along with its hidden state to predict the next token.

Luong (Dot/General) Scores

e_{t,i}^{dot} = s_{t-1}^T h_i,    e_{t,i}^{general} = s_{t-1}^T W h_i

Explanation: Dot uses a simple inner product, requiring matched dimensions. General introduces a learned projection W to align different spaces.

Bahdanau (Additive) Score

e_{t,i}^{add} = v^T tanh(W_s s_{t-1} + W_h h_i)

Explanation: A small feed-forward network combines decoder state and encoder state through tanh, then projects with v to a scalar score. This often works well when d_s and d_h differ.

Scaled Dot-Product Attention

Attn(Q, K, V) = softmax(QK^T / √d_k) V

Explanation: Queries match against keys to produce attention weights, scaled by 1/√d_k for stability. The output is a weighted sum over the values.

Decoder Output Distribution

p(y_t | y_{<t}, x) = softmax(W_o [s_t; c_t] + b_o)

Explanation: The probability over the vocabulary for the next token is computed from a linear projection of the decoder state concatenated with the context vector.

Cross-Entropy Loss

L(θ) = −∑_{t=1}^{T_t} log p_θ(y*_t | y*_{<t}, x)

Explanation: Training minimizes the negative log-likelihood of ground-truth tokens given the input and prior ground-truth outputs. This encourages the model to assign high probability to the correct sequence.
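The loss above can be sketched in a few lines. The hypothetical helper below sums the negative log-probability of each ground-truth token under the per-step output distributions (the distributions a decoder would produce under teacher forcing); the small clamp guards against log(0).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Negative log-likelihood of a ground-truth token sequence.
// probs[t] is the softmax output distribution at step t;
// targets[t] is the ground-truth token id y*_t fed by teacher forcing.
double sequence_nll(const std::vector<std::vector<double>> &probs,
                    const std::vector<int> &targets) {
    double loss = 0.0;
    for (std::size_t t = 0; t < targets.size(); ++t)
        loss -= std::log(probs[t][targets[t]] + 1e-12);  // clamp avoids log(0)
    return loss;
}
```

For instance, if the model assigns probability 0.5 to the correct first token and 1.0 to the correct second token, the total loss is −log(0.5) − log(1.0) = log 2 ≈ 0.693.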

Stable Softmax

softmax(z)_i = exp(z_i − max_k z_k) / ∑_j exp(z_j − max_k z_k)

Explanation: Subtracting the maximum score per row prevents numerical overflow in exponentials without changing the resulting probabilities.
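A quick sketch of this point: softmax over {1000, 1001} would overflow if exponentiated directly (exp(1000) is infinity in double precision), but after subtracting the row maximum it equals softmax over {0, 1}. The helper name below is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stable softmax: subtracting the row maximum before exponentiating
// leaves the probabilities mathematically unchanged but keeps exp() finite.
std::vector<double> stable_softmax(const std::vector<double> &z) {
    double mx = *std::max_element(z.begin(), z.end());
    double sum = 0.0;
    std::vector<double> p(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) { p[i] = std::exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= sum;
    return p;
}
```

Here stable_softmax({1000, 1001}) returns the same weights as softmax({0, 1}), roughly {0.269, 0.731}, where the naive formula would produce inf/inf = NaN.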

Attention Complexity

Complexity ≈ O(T_s T_t d)

Explanation: At each of the T_t decoding steps, attention compares the decoder state to all T_s encoder states with hidden size d. This dominates the runtime in RNN-based Seq2Seq with attention.

Beam Search Work

Beam expansions per step = B · V

Explanation: Each decoding step expands B hypotheses over a vocabulary of size V before pruning. This multiplicative factor must be considered in inference time.

Complexity Analysis

For RNN-based Seq2Seq with attention, the dominant cost arises from computing attention at each decoder step. Let T_s be the source length, T_t the target length, and d a representative hidden size (e.g., the max of d_s, d_h, d_a). For additive (Bahdanau) attention, each step computes e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i) for i = 1..T_s, costing O(T_s · d_a · (d_s + d_h) + T_s · d_a) ≈ O(T_s · d_a · (d_s + d_h)). Applying softmax adds O(T_s), and forming the context c_t adds O(T_s · d_h). Across all T_t steps, the total is O(T_s T_t d), where d captures these sizes. For dot/general attention, the cost per step is O(T_s · d) (or O(T_s · d^2) if a projection W is applied once per step), again leading to O(T_s T_t d) overall. The decoder update (e.g., GRU/LSTM) adds O(T_t · d^2), which is typically smaller than attention when T_s is large.

Memory usage is dominated by storing the encoder states, O(T_s · d_h); the decoder only needs its current state O(d_s) and temporary attention buffers O(T_s). During training with teacher forcing, the output softmax over a vocabulary of size V costs O(T_t · V · d_out) for the final linear layer and normalizer; for large V this can dominate, motivating approximations like sampled softmax. At inference, greedy decoding pays the same attention cost per step, while beam search multiplies compute roughly by the beam size B (O(B · T_s · d) per step) and expands B · V hypotheses before pruning. Transformers replace recurrence with parallel self-attention but keep the same cross-attention cost shape, O(T_s T_t d).

Code Examples

Scaled dot-product attention (single/multi-query) with stable softmax
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;

Matrix transpose(const Matrix &A) {
    int n = A.size(), m = A[0].size();
    Matrix T(m, vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            T[j][i] = A[i][j];
    return T;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    int n = A.size(), d = A[0].size(), m = B[0].size();
    Matrix C(n, vector<double>(m, 0.0));
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < d; ++k)
            for (int j = 0; j < m; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

vector<double> softmax_row(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - mx); // stable softmax
        sum += p[i];
    }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

Matrix scaled_dot_product_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v]
    int n_q = Q.size();
    int d_k = Q[0].size();
    int n_k = K.size();

    // Compute scores = Q * K^T / sqrt(d_k) => [n_q, n_k]
    Matrix Kt = transpose(K);
    Matrix scores = matmul(Q, Kt);
    double scale = 1.0 / sqrt((double)d_k);
    for (int i = 0; i < n_q; ++i)
        for (int j = 0; j < n_k; ++j)
            scores[i][j] *= scale;

    // Row-wise softmax to get attention weights
    Matrix weights(n_q, vector<double>(n_k));
    for (int i = 0; i < n_q; ++i)
        weights[i] = softmax_row(scores[i]);

    // Output = weights * V => [n_q, d_v]
    return matmul(weights, V);
}

int main() {
    // Example with small matrices:
    // n_q=2 queries, n_k=4 keys/values, d_k=3, d_v=2
    Matrix Q = {{0.2, 0.1, -0.3},
                {0.0, 0.4,  0.5}};
    Matrix K = {{ 0.1, -0.2, 0.3},
                { 0.0,  0.5, 0.2},
                { 0.7, -0.1, 0.0},
                {-0.3,  0.2, 0.1}};
    Matrix V = {{ 0.5,  0.0},
                { 0.1,  0.2},
                {-0.2,  0.3},
                { 0.4, -0.1}};

    Matrix out = scaled_dot_product_attention(Q, K, V);

    cout << fixed << setprecision(4);
    cout << "Attention outputs (rows per query):\n";
    for (auto &row : out) {
        for (double v : row) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}

This program implements scaled dot-product attention. It computes scores by multiplying queries with transposed keys, scales by 1/sqrt(d_k) for stability, applies row-wise softmax to get attention weights, and multiplies by values to get context outputs. The example uses tiny matrices to print numerical outputs. The code includes a stable softmax and basic matrix utilities to remain dependency-free.

Time: O(n_q n_k d_k + n_q n_k + n_q n_k d_v), typically dominated by O(n_q n_k max(d_k, d_v))
Space: O(n_q n_k + n_k d_v + n_q d_v), dominated by the attention weight matrix O(n_q n_k)
Additive (Bahdanau) attention over encoder states
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

Vec add(const Vec &a, const Vec &b) {
    Vec c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

Vec tanh_vec(const Vec &a) {
    Vec b(a.size());
    for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]);
    return b;
}

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

vector<double> softmax_stable(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int main() {
    // Dimensions: d_s=3 (decoder state), d_h=4 (encoder state), d_a=5 (attention hidden)
    // One decoder state s, multiple encoder states h_i
    vector<Vec> H = {
        { 0.1, 0.0,  0.3, -0.2},
        { 0.2, 0.5, -0.1,  0.4},
        {-0.3, 0.2,  0.6,  0.1}
    }; // T_s = 3
    Vec s = {0.05, -0.1, 0.2};

    // Parameters (small fixed constants for the demo)
    Mat W_s = {
        { 0.1,  -0.2,   0.3},
        {-0.1,   0.0,   0.2},
        { 0.05,  0.1,  -0.05},
        { 0.2,  -0.1,   0.0},
        {-0.15,  0.05,  0.1}
    }; // 5x3
    Mat W_h = {
        { 0.2,   0.1, -0.2,   0.0},
        {-0.1,   0.3,  0.05, -0.05},
        { 0.0,  -0.2,  0.1,   0.2},
        { 0.1,   0.0,  0.0,  -0.1},
        {-0.05,  0.2, -0.1,   0.1}
    }; // 5x4
    Vec v = {0.1, -0.2, 0.05, 0.15, -0.1}; // 5

    // Precompute W_s s
    Vec Ws = matvec(W_s, s); // size d_a

    // Scores e_i = v^T tanh(W_s s + W_h h_i)
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh = matvec(W_h, H[i]);
        Vec z = add(Ws, Wh);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }

    // Attention weights
    vector<double> alpha = softmax_stable(e);

    // Context c = sum_i alpha_i * h_i
    Vec c(H[0].size(), 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < H[0].size(); ++j)
            c[j] += alpha[i] * H[i][j];

    cout << fixed << setprecision(4);
    cout << "Attention weights (alpha): ";
    for (double a_i : alpha) cout << a_i << ' ';
    cout << "\n";
    cout << "Context vector (c): ";
    for (double cj : c) cout << cj << ' ';
    cout << "\n";
    return 0;
}

This example computes Bahdanau (additive) attention for a single decoder state over multiple encoder states. It forms scores via a small feed-forward network, applies a stable softmax to obtain weights, and then computes the context vector as a weighted sum. Parameters are fixed for demonstration; in practice they are learned via gradient descent.

Time: O(T_s · d_a · (d_s + d_h) + T_s + T_s · d_h) ≈ O(T_s · d_a · (d_s + d_h))
Space: O(T_s + d_a + d_h) to store scores, intermediate activations, and the context vector
Greedy decoding loop with additive attention (toy end-to-end forward pass)
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Utility functions
Vec concat(const Vec &a, const Vec &b) { Vec c = a; c.insert(c.end(), b.begin(), b.end()); return c; }
Vec concat3(const Vec &a, const Vec &b, const Vec &c) { Vec ab = concat(a, b); return concat(ab, c); }

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j) y[i] += W[i][j] * x[j];
    return y;
}

Vec add(const Vec &a, const Vec &b) { Vec c(a.size()); for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i]; return c; }
Vec tanh_vec(const Vec &a) { Vec b(a.size()); for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]); return b; }

double dot(const Vec &a, const Vec &b) { double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; }

vector<double> softmax(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int argmax(const vector<double> &v) { return (int)(max_element(v.begin(), v.end()) - v.begin()); }

// Additive attention: e_i = v^T tanh(Ws s + Wh h_i)
vector<double> attention_scores(const Vec &s, const vector<Vec> &H, const Mat &Ws, const Mat &Wh, const Vec &v) {
    Vec Ws_s = matvec(Ws, s);
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh_hi = matvec(Wh, H[i]);
        Vec z = add(Ws_s, Wh_hi);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }
    return e;
}

Vec weighted_sum(const vector<Vec> &H, const vector<double> &alpha) {
    size_t d = H[0].size();
    Vec c(d, 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < d; ++j) c[j] += alpha[i] * H[i][j];
    return c;
}

int main() {
    // Dimensions
    const int d_h = 4; // encoder hidden size
    const int d_s = 4; // decoder state size
    const int d_a = 3; // attention hidden size
    const int d_e = 3; // embedding size
    const int V = 6;   // vocabulary size

    // Seeded random generator for reproducibility
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-0.2, 0.2);

    // Toy encoder states H (T_s x d_h)
    const int T_s = 5;
    vector<Vec> H(T_s, Vec(d_h));
    for (int i = 0; i < T_s; ++i)
        for (int j = 0; j < d_h; ++j) H[i][j] = dist(rng);

    // Parameters (random small values for demo)
    Mat Ws(d_a, Vec(d_s)), Wh(d_a, Vec(d_h));
    Vec v_att(d_a);
    for (int i = 0; i < d_a; ++i) {
        for (int j = 0; j < d_s; ++j) Ws[i][j] = dist(rng);
        for (int j = 0; j < d_h; ++j) Wh[i][j] = dist(rng);
        v_att[i] = dist(rng);
    }

    // RNN state update: s_t = tanh(Wr [s_{t-1}; emb(y_{t-1}); c_t] + b_r)
    Mat Wr(d_s, Vec(d_s + d_e + d_h));
    Vec b_r(d_s);
    for (int i = 0; i < d_s; ++i) {
        b_r[i] = dist(rng);
        for (int j = 0; j < (int)Wr[0].size(); ++j) Wr[i][j] = dist(rng);
    }

    // Output projection: logits = Wo [s_t; c_t] + b_o -> size V
    Mat Wo(V, Vec(d_s + d_h));
    Vec b_o(V);
    for (int i = 0; i < V; ++i) {
        b_o[i] = dist(rng);
        for (int j = 0; j < (int)Wo[0].size(); ++j) Wo[i][j] = dist(rng);
    }

    // Embedding matrix (V x d_e)
    Mat Emb(V, Vec(d_e));
    for (int i = 0; i < V; ++i)
        for (int j = 0; j < d_e; ++j) Emb[i][j] = dist(rng);

    // Special tokens (ids)
    const int SOS = 1, EOS = 2;

    // Initial states
    Vec s_prev(d_s, 0.0); // zero state
    Vec c_prev(d_h, 0.0); // initial context
    int y_prev = SOS;     // start token

    const int max_steps = 10;

    cout << fixed << setprecision(4);
    cout << "Greedy decoding with attention (token ids):\n";
    for (int t = 1; t <= max_steps; ++t) {
        // 1) Compute attention over encoder states using s_{t-1}
        vector<double> e = attention_scores(s_prev, H, Ws, Wh, v_att);
        vector<double> alpha = softmax(e);
        Vec c_t = weighted_sum(H, alpha);

        // 2) Update decoder state using previous state, previous token embedding, and new context
        Vec emb = Emb[y_prev];
        Vec inp = concat3(s_prev, emb, c_t); // [d_s + d_e + d_h]
        Vec s_t = tanh_vec(add(matvec(Wr, inp), b_r));

        // 3) Compute output distribution and choose next token (greedy)
        Vec out_inp = concat(s_t, c_t);             // [d_s + d_h]
        Vec logits = add(matvec(Wo, out_inp), b_o); // size V
        vector<double> probs = softmax(logits);
        int y_t = argmax(probs);

        // Print step info
        cout << "t=" << t << " -> y_t=" << y_t << ", attention=";
        for (double a : alpha) cout << a << ' ';
        cout << "\n";

        if (y_t == EOS) { cout << "<eos> reached, stop.\n"; break; }
        // Roll states forward
        s_prev = s_t;
        c_prev = c_t;
        y_prev = y_t;
    }

    return 0;
}

This toy program demonstrates one full greedy decoding loop with additive attention. At each step, it (1) computes attention scores over encoder states from the decoder state, (2) forms a context vector, (3) updates the decoder state via a simple tanh RNN cell using the previous state, previous token embedding, and context, and (4) projects to vocabulary logits, applies softmax, and picks the argmax token. The numbers are random but the control flow mirrors a real Seq2Seq with attention.

Time: O(T_t · (T_s · d_a · (d_s + d_h) + d_s · (d_s + d_e + d_h) + V · (d_s + d_h)))
Space: O(T_s · d_h + d_s + d_a + V + d_e) for encoder states, parameters, and working buffers
#sequence-to-sequence#attention#encoder-decoder#bahdanau#luong#scaled dot-product#context vector#alignment#softmax#beam search#teacher forcing#cross-entropy#rnn#lstm#transformer