# Evolutionary Path from Bahdanau Attention to Transformer

These notes accompany the intermediate notebooks and highlight what to emphasize when teaching the progression.

## Step 1 – Multi-Head Attention in a Bahdanau Decoder (`bahdanau-multihead-attention.ipynb`)

- **Objective**: keep the GRU encoder/decoder loop intact while letting the decoder consult multiple alignment subspaces simultaneously.
- **Key equations**:
  - Head-specific projections: $Q_i = H_{dec} W_q^{(i)},\; K_i = H_{enc} W_k^{(i)},\; V_i = H_{enc} W_v^{(i)}$.
  - Scaled attention per head: $\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_h}}\right) V_i$, with $d_h$ the per-head dimension.
  - Merge heads: $\text{MHA}(H_{dec}, H_{enc}) = [\text{head}_1;\ldots;\text{head}_h] W_o$.
- **Diagram**: show one Bahdanau decoder time step; after obtaining the GRU query, split it into heads, run parallel attention over the encoder states, concatenate, then feed through the usual GRU + dense head. Highlight that recurrence still provides temporal control.
- **Teaching cues**:
  - Compare with the scalar Bahdanau score $v^\top \tanh(W_q h + W_k H)$ to motivate why multiple subspaces help.
  - Emphasize the reshaping: `(batch, seq, hidden)` → `(batch, heads, seq, hidden/head)` (see the Step 1 sketch after Step 3).
  - Point out dropout on the attention weights as a regularizer before re-entering the GRU.

## Step 2 – GRU Encoder with Added Self-Attention (`encoder-self-attention-hybrid.ipynb`)

- **Objective**: keep the decoder unchanged while enhancing encoder representations with Transformer-style self-attention blocks stacked on top of the GRU outputs.
- **Block math**:
  1. Context states $H$ from the GRU.
  2. Self-attention: $Z = \text{LayerNorm}\big(H + \text{MHA}(H,H,H,\text{mask})\big)$.
  3. Position-wise feed-forward: $H' = \text{LayerNorm}\big(Z + \text{FFN}(Z)\big)$, where $\text{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2$.
- **Diagram**: GRU outputs feed into a residual block (AddNorm + MHA) followed by another residual block with the FFN; the resulting $H'$ replaces $H$ before cross-attention in the decoder.
- **Teaching cues**:
  - Stress how AddNorm (residual connection + LayerNorm) stabilizes depth.
  - Discuss masking with variable-length sequences: `valid_lens` from the dataloader (see the Step 2 sketch after Step 3).
  - Encourage comparing validation curves against the pure GRU encoder to show the benefit of global context on the source side.

## Step 3 – Decoder with Masked Self-Attention + GRU (`decoder-self-attention-hybrid.ipynb`)

- **Objective**: let the decoder attend to its own history before consulting encoder states, while still using a GRU cell to maintain hidden-state continuity.
- **Math**:
  - Build causal masks $M$ with $M_{t,t'}=-\infty$ for $t' > t$ and $0$ otherwise, so that adding $M$ to the attention scores before the softmax zeroes out the weights on future positions (see the sketch below).
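The sketch below illustrates the Step 1 head splitting and merging around one GRU decoder time step. It assumes PyTorch; the class name `MultiHeadCrossAttention`, the dimensions, and the way the context is fed back into the GRU are illustrative choices, not the notebook's exact code.

```python
# Minimal sketch (assumed PyTorch): the GRU decoder state is the query for
# h parallel attention heads over the encoder states, and the merged
# context re-enters the GRU. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadCrossAttention(nn.Module):
    """Q from the decoder state, K/V from encoder outputs, h heads."""

    def __init__(self, hidden_size, num_heads, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.d_head = hidden_size // num_heads
        self.W_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_v = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_o = nn.Linear(hidden_size, hidden_size, bias=False)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        # (batch, seq, hidden) -> (batch, heads, seq, hidden/heads)
        b, t, _ = x.shape
        return x.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, 1, hidden) -- query for the current time step
        # enc_outputs: (batch, src_len, hidden)
        q = self.split_heads(self.W_q(dec_state))
        k = self.split_heads(self.W_k(enc_outputs))
        v = self.split_heads(self.W_v(enc_outputs))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = self.dropout(F.softmax(scores, dim=-1))
        context = weights @ v                     # (batch, heads, 1, d_head)
        context = context.transpose(1, 2).reshape(dec_state.shape)
        return self.W_o(context)                  # merged heads


# One decoder time step: attend, then feed [embedding; context] to the GRU.
batch, src_len, hidden, heads = 2, 7, 64, 4
attn = MultiHeadCrossAttention(hidden, heads)
gru = nn.GRU(2 * hidden, hidden, batch_first=True)
enc_outputs = torch.randn(batch, src_len, hidden)
dec_hidden = torch.zeros(1, batch, hidden)
emb_t = torch.randn(batch, 1, hidden)             # current target embedding
context = attn(dec_hidden.transpose(0, 1), enc_outputs)
_, dec_hidden = gru(torch.cat([emb_t, context], dim=-1), dec_hidden)
```

Splitting one wide projection into heads keeps the parameter count comparable to single-head attention while letting each head learn a different alignment subspace, which is the contrast with the scalar Bahdanau score worth making in class.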
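For Step 2, a minimal sketch of the hybrid encoder, again assuming PyTorch: GRU outputs pass through one self-attention block and one position-wise FFN block, each wrapped in AddNorm, with `valid_lens` converted to a key padding mask. The class name and the use of `nn.MultiheadAttention` are illustrative, not the notebook's custom blocks.

```python
# Minimal sketch (assumed PyTorch) of the Step 2 hybrid encoder:
# GRU -> (self-attention + AddNorm) -> (FFN + AddNorm).
import torch
import torch.nn as nn


class SelfAttentionOnGRU(nn.Module):
    def __init__(self, hidden_size, num_heads, ffn_size, dropout=0.1):
        super().__init__()
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.mha = nn.MultiheadAttention(hidden_size, num_heads,
                                         dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(nn.Linear(hidden_size, ffn_size),
                                 nn.ReLU(),
                                 nn.Linear(ffn_size, hidden_size))
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x, valid_lens):
        # x: already-embedded source tokens, (batch, src_len, hidden)
        # valid_lens: (batch,) true lengths from the dataloader
        h, _ = self.gru(x)
        # key_padding_mask is True at padded positions (to be ignored)
        positions = torch.arange(x.shape[1], device=x.device)
        pad_mask = positions[None, :] >= valid_lens[:, None]
        z, _ = self.mha(h, h, h, key_padding_mask=pad_mask)
        z = self.norm1(h + z)                 # AddNorm around self-attention
        return self.norm2(z + self.ffn(z))    # AddNorm around the FFN


enc = SelfAttentionOnGRU(hidden_size=64, num_heads=4, ffn_size=128)
x = torch.randn(2, 9, 64)
valid_lens = torch.tensor([9, 5])
h_prime = enc(x, valid_lens)   # replaces H before cross-attention
```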
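For the Step 3 causal mask, a small sketch under the convention that $M$ is added to the attention scores before the softmax; the shapes in the demo are arbitrary.

```python
# Minimal sketch (assumed PyTorch): M[t, t'] = -inf for t' > t, 0 otherwise,
# so future positions receive zero attention weight after the softmax.
import torch


def causal_mask(seq_len, device=None):
    m = torch.full((seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(m, diagonal=1)   # keep -inf strictly above the diagonal


scores = torch.randn(2, 4, 5, 5)       # (batch, heads, tgt_len, tgt_len)
weights = torch.softmax(scores + causal_mask(5), dim=-1)
# every weight on a future position is exactly zero
assert torch.allclose(weights.triu(1), torch.zeros_like(weights))
```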