⏱ 12 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

🎯 The Attention Problem

Before Transformers, the state-of-the-art for sequence-to-sequence tasks (like machine translation) was an RNN encoder-decoder architecture. The encoder read the entire source sentence and compressed it into a single fixed-size hidden state vector β€” the "thought vector" β€” which the decoder would then use to generate the output one word at a time. This bottleneck was the fatal flaw.

The RNN Bottleneck

Regardless of whether the source sentence is 5 words or 500 words, the encoder must compress all its meaning into a single vector of fixed dimension (say, 512 floats). For short sentences this works tolerably. For longer sentences, information is irreversibly lost β€” the last few words are strongly represented, early words are largely forgotten.

  • Fixed-size hidden state = fixed information capacity
  • Information from early tokens diluted by many subsequent updates
  • Longer sentences β†’ worse translation quality
  • First demonstrated as a real problem by Cho et al. (2014)
Information bottleneck Β· Degrades with length

The Alignment Problem

When translating "The cat sat on the mat" to French, the word "chat" (cat) should attend most strongly to the source word "cat," not to "mat" or "sat." But in a vanilla encoder-decoder, the decoder only has the single compressed context vector β€” it cannot selectively focus on the source words most relevant to the current output word being generated.

  • Each output word requires different source words to be emphasised
  • Soft alignment: Bahdanau et al. (2015) proposed learned attention weights
  • Attention score: how relevant is source position i to generating target position j?
  • Context vector: weighted sum of all encoder hidden states, not just the last
Soft alignment Β· Interpretable weights
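The soft-alignment idea can be shown in miniature. Below is a sketch with a hypothetical 3-word source sentence; the alignment scores are hard-coded stand-ins for what Bahdanau-style attention would compute with a small learned network:

```python
import numpy as np

# Toy encoder hidden states for source words "The", "cat", "sat" (2-d for clarity).
H = np.array([[0.1, 0.2],
              [0.9, 0.1],
              [0.2, 0.8]])

# Alignment scores for the decoder step that generates "chat" β€” in Bahdanau
# attention these come from a learned scoring network; hard-coded here.
scores = np.array([0.5, 3.0, 0.8])

weights = np.exp(scores) / np.exp(scores).sum()   # softmax β†’ attention weights
context = weights @ H                             # weighted sum of ALL encoder states
print(weights.round(2))                           # "cat" dominates
print(context)                                    # context vector for this decoder step
```

The decoder receives a different context vector at every output step, instead of one fixed "thought vector" for the whole sentence.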

The Intuition of Attention

Imagine reading a book and then answering a specific question about it. Rather than summarising the entire book into one sentence and answering from that summary, you'd go back and attend to the specific passages most relevant to the question. Attention mechanisms give neural networks this same ability: instead of relying on a single compressed representation, the model can dynamically focus on the most relevant parts of the input for each part of the output it produces.

πŸ”‘ Scaled Dot-Product Attention

The Transformer paper (Vaswani et al., 2017) formalised attention using three components β€” Queries, Keys, and Values β€” and introduced scaled dot-product as the scoring function. This formulation is elegant, parallelisable, and extraordinarily effective.

# Scaled Dot-Product Attention β€” the core formula
Attention(Q, K, V) = softmax(Q Β· Kα΅€ / √dβ‚–) Β· V

# Shapes (batch omitted for clarity):
#   Q (Queries): shape (seq_len_q, dβ‚–)   β€” "what am I looking for?"
#   K (Keys):    shape (seq_len_k, dβ‚–)   β€” "what do I contain?"
#   V (Values):  shape (seq_len_v, dα΅₯)   β€” "what do I output if selected?"
#   Output:      shape (seq_len_q, dα΅₯)   β€” weighted combination of Values

# Step-by-step walkthrough for a 3-token sequence ["The", "cat", "sat"]:

# Step 1: Project input embeddings into Q, K, V spaces (via learned Wq, Wk, Wv matrices)
Q = X Β· Wq    # (3 Γ— dβ‚–)
K = X Β· Wk    # (3 Γ— dβ‚–)
V = X Β· Wv    # (3 Γ— dα΅₯)

# Step 2: Compute raw attention scores β€” how much each query attends to each key
scores = Q Β· Kα΅€           # (3 Γ— 3) matrix of dot products

# Step 3: Scale by √dβ‚– to prevent softmax saturation in high dimensions
scores_scaled = scores / √dβ‚–

# Step 4: Convert scores to probabilities
weights = softmax(scores_scaled, dim=-1)   # each row sums to 1.0
# weights[i][j] = how much token i should attend to token j

# Step 5: Weighted sum of values
output = weights Β· V      # (3 Γ— dα΅₯)
# Each output token is a blended combination of all value vectors,
# proportional to attention weights.

# Example attention weights for "cat" might look like:
# "cat" attends to: "The"β†’0.05, "cat"β†’0.85, "sat"β†’0.10
# β†’ output for "cat" is mostly the value of "cat" itself (self-attention)
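The walkthrough above translates almost line-for-line into NumPy. This is a minimal sketch: the toy shapes and random matrices stand in for learned projections, and batching is omitted as in the formula:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len_q, seq_len_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1.0
    return weights @ V, weights          # output: (seq_len_q, d_v)

# 3-token toy example; random weights stand in for learned Wq, Wk, Wv.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))              # 3 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, w.shape)                # (3, 4) (3, 3)
```

Each row of `w` is one query's attention distribution over the three keys, exactly the 3 Γ— 3 weight matrix from Step 4.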

Queries, Keys, Values

The Q/K/V terminology comes from information retrieval. Think of a Key-Value store: a dictionary where each key maps to a value. A Query is what you're looking for; you compute similarity between the query and all keys to decide how much of each value to retrieve. Unlike a hard lookup, attention performs a soft retrieval β€” a weighted blend of all values.

Q = search query Β· K = index key Β· V = retrieved content

The √dβ‚– Scaling Factor

Without scaling, the variance of the dot products grows linearly with dβ‚–, so typical magnitudes grow like √dβ‚–: for dβ‚– = 64 and unit-variance components, a typical dot product has magnitude ~8. Softmax over large inputs pushes weights towards 0 and 1 (saturation), collapsing gradients. Dividing by √dβ‚– brings the softmax inputs back to roughly unit variance, maintaining healthy gradient flow.

Critical for training stability
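The variance argument is easy to verify empirically. A quick simulation with unit-variance random vectors (dβ‚– = 64, as in the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10,000 random query/key pairs with unit-variance components.
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = (q * k).sum(axis=1)        # unscaled dot products
scaled = raw / np.sqrt(d_k)      # with the 1/sqrt(d_k) factor

print(raw.std())                 # β‰ˆ 8: std grows like sqrt(d_k)
print(scaled.std())              # β‰ˆ 1: back to unit scale, softmax stays unsaturated
```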

Attention as a Soft Dictionary Lookup

Consider a Python dictionary {"The": v1, "cat": v2, "sat": v3}. A hard lookup for "cat" returns exactly v2. Attention performs a soft lookup: given a query vector, compute a similarity score with every key, convert to probabilities with softmax, and return a weighted average of all values. This is differentiable end-to-end β€” the query, keys, and values are all learned via gradient descent β€” allowing the model to discover what to attend to from data alone.
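The contrast between hard and soft lookup can be made concrete. A small sketch (the key/value vectors are made-up toy numbers):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hard lookup: an exact key match returns exactly one value.
hard = {"The": 1.0, "cat": 2.0, "sat": 3.0}
print(hard["cat"])                       # 2.0, nothing else

# Soft lookup: score the query against EVERY key, then blend all values.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # "The", "cat", "sat"
values = np.array([1.0, 2.0, 3.0])
query = np.array([0.1, 0.9])             # a query "mostly like 'cat'"

weights = softmax(keys @ query)          # similarities β†’ probabilities
print(weights.round(2))                  # largest weight on "cat"
print(weights @ values)                  # weighted blend, pulled towards 2.0
```

Because the blend is a smooth function of the query, keys, and values, gradients flow through the whole lookup, which a hard dictionary access cannot provide.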

πŸ”€ Multi-Head Attention

Single-head attention can only attend to information from one "perspective" at a time. Multi-head attention runs h attention operations in parallel, each with its own learned Q/K/V projections, then concatenates and projects the results. Different heads learn to specialise in different types of relationships.

# Multi-Head Attention β€” h parallel attention heads
MultiHead(Q, K, V) = Concat(head₁, headβ‚‚, …, headβ‚•) Β· Wβ‚’

# Each head i uses separate projection matrices:
headα΅’ = Attention(Q Β· Wα΅’q, K Β· Wα΅’k, V Β· Wα΅’v)

# Typical dimensions (BERT-large example):
#   d_model = 1024    (embedding dimension)
#   h = 16            (number of heads)
#   dβ‚– = dα΅₯ = d_model / h = 64   (per-head dimension)
#
# Each head works in a 64-dimensional subspace.
# Concatenating 16 Γ— 64-dim outputs = 1024-dim, then project with Wβ‚’.

# Total parameters per MHA layer:
#   Q/K/V projections across all 16 heads: 3 Γ— 1024 Γ— 1024 β‰ˆ 3.1M
#   Output projection Wβ‚’: 1024 Γ— 1024 β‰ˆ 1.0M
#   Total: β‰ˆ4.2M parameters per MHA layer
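The concat-and-project pattern, and the parameter count, can be checked with a minimal NumPy sketch (random weights stand in for learned projections; dimensions follow the example above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

d_model, h = 1024, 16
d_k = d_model // h                        # 64-dim subspace per head
rng = np.random.default_rng(0)
X = rng.normal(size=(5, d_model))         # 5 tokens

# One (d_model Γ— d_k) projection per head for each of Q, K, V.
Wq = rng.normal(size=(h, d_model, d_k)) * 0.02
Wk = rng.normal(size=(h, d_model, d_k)) * 0.02
Wv = rng.normal(size=(h, d_model, d_k)) * 0.02
Wo = rng.normal(size=(d_model, d_model)) * 0.02

heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ Wo   # 16 Γ— 64-dim β†’ 1024-dim β†’ project
print(out.shape)                            # (5, 1024)

n_params = Wq.size + Wk.size + Wv.size + Wo.size
print(n_params)                             # 4_194_304 β‰ˆ 4.2M
```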
| Property | Single-Head Attention | Multi-Head Attention |
| --- | --- | --- |
| Subspaces | One d_model-dimensional attention subspace | h independent dβ‚–-dimensional subspaces (dβ‚– = d_model/h) |
| Relationship types | One type of pattern per layer | Different heads specialise: syntax, coreference, positional proximity, semantic similarity |
| Interpretability | Single attention map is interpretable | Visualising individual heads reveals linguistic structure (BertViz tool) |
| Compute | O(TΒ²Β·d), quadratic in sequence length | Same asymptotic complexity; same total FLOPs if d_model is held constant |
| Expressiveness | More limited representation capacity per layer | Richer; empirically much better at capturing diverse linguistic phenomena |

πŸ—οΈ The Full Transformer Architecture

The Transformer is built from two stacks: an encoder (processes the input) and a decoder (generates the output). Many modern models use only one stack: BERT uses encoders only; GPT uses decoders only. T5 uses both.

# ── ENCODER (one layer, repeated N times, e.g. N=12 in BERT-base) ──

Input tokens  β†’  Token Embeddings + Positional Encoding
                          ↓
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚  Multi-Head Self-Attentionβ”‚  ← every token attends to every other token
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ + residual (add input to attention output)
                    Layer Normalisation
                          ↓
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚   Feed-Forward Network   β”‚  ← 2-layer MLP applied independently to each position
               β”‚ (expand to 4Γ—d β†’ GELU β†’ β”‚
               β”‚  project back to d)      β”‚
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ + residual
                    Layer Normalisation
                          ↓  (becomes input to next encoder layer)

# ── DECODER (one layer, also repeated N times) ──

Target tokens (shifted right) β†’  Embeddings + Positional Encoding
                          ↓
              Masked Multi-Head Self-Attention  ← causal mask: can only see past positions
                          ↓ + residual + LayerNorm
              Multi-Head Cross-Attention        ← Q from decoder, K/V from encoder output
                          ↓ + residual + LayerNorm
              Feed-Forward Network
                          ↓ + residual + LayerNorm
                     Linear + Softmax β†’ Output probabilities

# ── POSITIONAL ENCODING ──
# Since Transformers process all positions in parallel, they have no inherent
# sense of order. Positional encodings add position information to embeddings.

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# Learned positional embeddings (BERT, GPT-2) are an alternative.
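The sinusoidal formulas above can be implemented directly. A minimal NumPy version (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)          # (128, 512), same shape as the token embeddings it is added to
print(pe[0, :4])         # position 0: sin(0)=0, cos(0)=1 β†’ [0. 1. 0. 1.]
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique fingerprint and relative offsets correspond to fixed linear transformations.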
| Component | Role | Output Shape |
| --- | --- | --- |
| Token Embeddings | Map discrete token IDs to dense vectors. Learned lookup table of shape (vocab_size Γ— d_model). | (batch, seq_len, d_model) |
| Positional Encoding | Inject position information by adding sinusoidal (or learned) vectors to embeddings. Allows the model to distinguish "cat sat" from "sat cat." | (batch, seq_len, d_model), added to embeddings |
| Multi-Head Self-Attention | Each position attends to all other positions (encoder) or all prior positions (decoder). Captures contextual relationships across the full sequence. | (batch, seq_len, d_model) |
| Cross-Attention (Decoder) | Decoder queries attend to encoder key-value pairs, aligning generated tokens with relevant input tokens β€” the modern incarnation of the original alignment problem solution. | (batch, tgt_len, d_model) |
| Feed-Forward Network | Position-wise MLP: expands to 4Γ—d_model, applies GELU/ReLU, projects back. Provides most of the network's non-linear "thinking" capacity; ~2/3 of Transformer parameters live here. | (batch, seq_len, d_model) |
| Layer Normalisation | Normalises activations across the feature dimension (not batch). Applied after each sub-layer (Post-LN, original paper) or before each sub-layer (Pre-LN, GPT-3 style β€” more stable). | (batch, seq_len, d_model), unchanged shape |
| Residual Connections | Adds the input of each sub-layer directly to its output: output = sublayer(x) + x. Enables very deep Transformers (12–96 layers) to train by preserving gradient pathways, the same principle as ResNet. | (batch, seq_len, d_model), unchanged shape |
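The encoder-layer components above compose into a short forward pass. A minimal sketch in NumPy (Post-LN ordering as in the original paper; random untrained weights, single-head attention, ReLU instead of GELU, and LayerNorm without learned scale/shift, all for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise across the feature dimension, not the batch.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(X, p):
    # Sub-layer 1: self-attention, then residual add, then LayerNorm (Post-LN).
    X = layer_norm(X + self_attention(X, p["Wq"], p["Wk"], p["Wv"]) @ p["Wo"])
    # Sub-layer 2: position-wise FFN (expand to 4Γ—d, nonlinearity, project back).
    hidden = np.maximum(X @ p["W1"], 0.0)
    return layer_norm(X + hidden @ p["W2"])

d = 64
rng = np.random.default_rng(0)
p = {
    "Wq": rng.normal(size=(d, d)) * 0.1, "Wk": rng.normal(size=(d, d)) * 0.1,
    "Wv": rng.normal(size=(d, d)) * 0.1, "Wo": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, 4 * d)) * 0.1, "W2": rng.normal(size=(4 * d, d)) * 0.1,
}
X = rng.normal(size=(10, d))       # 10 tokens
out = encoder_layer(X, p)
print(out.shape)                   # (10, 64): shape preserved, as the table shows
```

Every sub-layer maps (seq_len, d_model) to (seq_len, d_model), which is what lets N identical layers stack cleanly.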

🌐 Transformer Variants & Models

Since 2017, the Transformer has spawned hundreds of variants. The most important distinction is the architecture type: encoder-only (good at understanding), decoder-only (good at generation), or encoder-decoder (good at transformation tasks like translation or summarisation).

| Model | Type | Parameters | Key Innovation |
| --- | --- | --- | --- |
| BERT (2018) | Encoder-only | 110M (base) / 340M (large) | Masked Language Modelling (MLM): randomly mask 15% of tokens and train to predict them. Bidirectional context. Created the pre-train β†’ fine-tune paradigm for NLP. |
| GPT-1/2/3/4 | Decoder-only | 117M β†’ 1.5B β†’ 175B β†’ undisclosed | Causal language modelling (predict the next token). GPT-3 showed few-shot learning emerges at scale; GPT-4 is multimodal. The dominant paradigm for generation tasks. |
| T5 (2019) | Encoder-decoder | 60M–11B | Frames every NLP task as text-to-text (e.g. "summarise: …" β†’ summary). Unified interface for translation, summarisation, Q&A, classification. Strong baseline. |
| Vision Transformer / ViT (2020) | Encoder-only (image) | 86M (ViT-B) to 632M (ViT-H) | Splits an image into 16Γ—16 patches and treats each patch as a token. Self-attention across patches gives a global receptive field from layer 1. Outperforms CNNs at scale. |
| LLaMA 2/3 (2023/24) | Decoder-only | 7B–70B (LLaMA 2); 8B–405B (LLaMA 3/3.1) | Open weights, RoPE positional embeddings, grouped-query attention (GQA) for inference efficiency, SwiGLU activation. Foundation of most of the open-source LLM ecosystem. |

Scaling Laws: Bigger Is Predictably Better

Kaplan et al. (OpenAI, 2020) discovered that Transformer loss follows smooth power laws as a function of model size, dataset size, and compute budget β€” independently and predictably. This means: double your parameters, and your loss decreases by a predictable amount, regardless of architecture details. These scaling laws allowed labs to plan multi-hundred-million-dollar training runs with confidence. Chinchilla (Hoffmann et al., 2022) refined this: most large models were undertrained on data β€” optimal training requires ~20 tokens per parameter. GPT-4 and LLaMA 3 applied this principle aggressively.
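The Chinchilla rule of thumb turns into a back-of-envelope calculator. The ~20 tokens/parameter ratio is from the text above; the C β‰ˆ 6Β·NΒ·D training-FLOPs estimate is a standard approximation, not from this article:

```python
def chinchilla_budget(n_params, tokens_per_param=20):
    """Compute-optimal data and rough training compute for a given model size."""
    n_tokens = tokens_per_param * n_params
    flops = 6 * n_params * n_tokens      # C β‰ˆ 6Β·NΒ·D approximation
    return n_tokens, flops

for n in (7e9, 70e9):
    d, c = chinchilla_budget(n)
    print(f"{n/1e9:.0f}B params β†’ {d/1e12:.2f}T tokens, ~{c:.1e} FLOPs")
```

By this estimate a 7B-parameter model wants roughly 140B training tokens; note that recent open models are often trained well past the compute-optimal point to improve quality at a fixed inference cost.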