# Evolutionary Path from Bahdanau Attention to Transformer

These notes accompany the intermediate notebooks and highlight what to emphasize when teaching the progression.

## Step 1 – Multi-Head Attention in a Bahdanau Decoder (`bahdanau-multihead-attention.ipynb`)

- **Objective**: keep the GRU encoder/decoder loop intact while letting the decoder consult multiple alignment subspaces simultaneously.
- **Key equations**:
  - Head-specific projections: $Q_i = H_{dec} W_q^{(i)},\; K_i = H_{enc} W_k^{(i)},\; V_i = H_{enc} W_v^{(i)}$.
  - Scaled attention per head: $\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_h}}\right) V_i$, with $d_h$ the per-head dimension.
  - Merge heads: $\text{MHA}(H_{dec}, H_{enc}) = [\text{head}_1;\ldots;\text{head}_h] W_o$.
- **Diagram**: show one Bahdanau decoder time step; after obtaining the GRU query, split it into heads, run parallel attention over the encoder states, concatenate, then feed through the usual GRU + dense head. Highlight that recurrence still provides temporal control.
- **Teaching cues**:
  - Compare with the scalar Bahdanau score $v^\top \tanh(W_q h + W_k H)$ to motivate why multiple subspaces help.
  - Emphasize the reshaping: `(batch, seq, hidden)` → `(batch, heads, seq, hidden/head)` (see the Step 1 sketch after Step 3).
  - Point out dropout on the attention weights as a regularizer before re-entering the GRU.

## Step 2 – GRU Encoder with Added Self-Attention (`encoder-self-attention-hybrid.ipynb`)

- **Objective**: keep the decoder unchanged while enhancing encoder representations with Transformer-style self-attention blocks stacked on top of the GRU outputs.
- **Block math**:
  1. Context states $H$ from the GRU.
  2. Self-attention: $Z = \text{LayerNorm}\big(H + \text{MHA}(H,H,H,\text{mask})\big)$.
  3. Position-wise feed-forward: $H' = \text{LayerNorm}\big(Z + \text{FFN}(Z)\big)$, where $\text{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2$.
- **Diagram**: GRU outputs feed into a residual block (AddNorm + MHA) followed by another residual block with the FFN; the resulting $H'$ replaces $H$ before cross-attention in the decoder.
- **Teaching cues**:
  - Stress how AddNorm (residual connection + LayerNorm) stabilizes depth.
  - Discuss masking with variable-length sequences: `valid_lens` from the dataloader (see the Step 2 sketch after Step 3).
  - Encourage comparing validation curves against the pure GRU encoder to show the benefit of global context on the source side.

## Step 3 – Decoder with Masked Self-Attention + GRU (`decoder-self-attention-hybrid.ipynb`)

- **Objective**: let the decoder attend to its own history before consulting encoder states, while still using a GRU cell to maintain hidden-state continuity.
- **Math**:
  - Build causal masks $M$ with $M_{t,t'}=-\infty$ for $t' > t$ and $0$ otherwise, so that adding $M$ to the attention scores before the softmax zeroes out the weights on future positions (see the sketch below).
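The sketch below illustrates the Step 1 head splitting and merging around one GRU decoder time step. It assumes PyTorch; the class name `MultiHeadCrossAttention`, the dimensions, and the way the context is fed back into the GRU are illustrative choices, not the notebook's exact code.

```python
# Minimal sketch (assumed PyTorch): the GRU decoder state is the query for
# h parallel attention heads over the encoder states, and the merged
# context re-enters the GRU. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadCrossAttention(nn.Module):
    """Q from the decoder state, K/V from encoder outputs, h heads."""

    def __init__(self, hidden_size, num_heads, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.d_head = hidden_size // num_heads
        self.W_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_v = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_o = nn.Linear(hidden_size, hidden_size, bias=False)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        # (batch, seq, hidden) -> (batch, heads, seq, hidden/heads)
        b, t, _ = x.shape
        return x.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, 1, hidden) -- query for the current time step
        # enc_outputs: (batch, src_len, hidden)
        q = self.split_heads(self.W_q(dec_state))
        k = self.split_heads(self.W_k(enc_outputs))
        v = self.split_heads(self.W_v(enc_outputs))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = self.dropout(F.softmax(scores, dim=-1))
        context = weights @ v                     # (batch, heads, 1, d_head)
        context = context.transpose(1, 2).reshape(dec_state.shape)
        return self.W_o(context)                  # merged heads


# One decoder time step: attend, then feed [embedding; context] to the GRU.
batch, src_len, hidden, heads = 2, 7, 64, 4
attn = MultiHeadCrossAttention(hidden, heads)
gru = nn.GRU(2 * hidden, hidden, batch_first=True)
enc_outputs = torch.randn(batch, src_len, hidden)
dec_hidden = torch.zeros(1, batch, hidden)
emb_t = torch.randn(batch, 1, hidden)             # current target embedding
context = attn(dec_hidden.transpose(0, 1), enc_outputs)
_, dec_hidden = gru(torch.cat([emb_t, context], dim=-1), dec_hidden)
```

Splitting one wide projection into heads keeps the parameter count comparable to single-head attention while letting each head learn a different alignment subspace, which is the contrast with the scalar Bahdanau score worth making in class.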
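For Step 2, a minimal sketch of the hybrid encoder, again assuming PyTorch: GRU outputs pass through one self-attention block and one position-wise FFN block, each wrapped in AddNorm, with `valid_lens` converted to a key padding mask. The class name and the use of `nn.MultiheadAttention` are illustrative, not the notebook's custom blocks.

```python
# Minimal sketch (assumed PyTorch) of the Step 2 hybrid encoder:
# GRU -> (self-attention + AddNorm) -> (FFN + AddNorm).
import torch
import torch.nn as nn


class SelfAttentionOnGRU(nn.Module):
    def __init__(self, hidden_size, num_heads, ffn_size, dropout=0.1):
        super().__init__()
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.mha = nn.MultiheadAttention(hidden_size, num_heads,
                                         dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(nn.Linear(hidden_size, ffn_size),
                                 nn.ReLU(),
                                 nn.Linear(ffn_size, hidden_size))
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x, valid_lens):
        # x: already-embedded source tokens, (batch, src_len, hidden)
        # valid_lens: (batch,) true lengths from the dataloader
        h, _ = self.gru(x)
        # key_padding_mask is True at padded positions (to be ignored)
        positions = torch.arange(x.shape[1], device=x.device)
        pad_mask = positions[None, :] >= valid_lens[:, None]
        z, _ = self.mha(h, h, h, key_padding_mask=pad_mask)
        z = self.norm1(h + z)                 # AddNorm around self-attention
        return self.norm2(z + self.ffn(z))    # AddNorm around the FFN


enc = SelfAttentionOnGRU(hidden_size=64, num_heads=4, ffn_size=128)
x = torch.randn(2, 9, 64)
valid_lens = torch.tensor([9, 5])
h_prime = enc(x, valid_lens)   # replaces H before cross-attention
```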
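For the Step 3 causal mask, a small sketch under the convention that $M$ is added to the attention scores before the softmax; the shapes in the demo are arbitrary.

```python
# Minimal sketch (assumed PyTorch): M[t, t'] = -inf for t' > t, 0 otherwise,
# so future positions receive zero attention weight after the softmax.
import torch


def causal_mask(seq_len, device=None):
    m = torch.full((seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(m, diagonal=1)   # keep -inf strictly above the diagonal


scores = torch.randn(2, 4, 5, 5)       # (batch, heads, tgt_len, tgt_len)
weights = torch.softmax(scores + causal_mask(5), dim=-1)
# every weight on a future position is exactly zero
assert torch.allclose(weights.triu(1), torch.zeros_like(weights))
```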