The Attention Problem
Before Transformers, the state-of-the-art for sequence-to-sequence tasks (like machine translation) was an RNN encoder-decoder architecture. The encoder read the entire source sentence and compressed it into a single fixed-size hidden state vector, the "thought vector", which the decoder would then use to generate the output one word at a time. This bottleneck was the fatal flaw.
The RNN Bottleneck
Regardless of whether the source sentence is 5 words or 500 words, the encoder must compress all its meaning into a single vector of fixed dimension (say, 512 floats). For short sentences this works tolerably. For longer sentences, information is irreversibly lost: the last few words are strongly represented, while early words are largely forgotten.
- Fixed-size hidden state = fixed information capacity
- Information from early tokens diluted by many subsequent updates
- Longer sentences → worse translation quality
- First demonstrated as a real problem by Cho et al. (2014)
The Alignment Problem
When translating "The cat sat on the mat" to French, the word "chat" (cat) should attend most strongly to the source word "cat," not to "mat" or "sat." But in a vanilla encoder-decoder, the decoder only has the single compressed context vector: it cannot selectively focus on the source words most relevant to the current output word being generated.
- Each output word requires different source words to be emphasised
- Soft alignment: Bahdanau et al. (2015) proposed learned attention weights
- Attention score: how relevant is source position i to generating target position j?
- Context vector: weighted sum of all encoder hidden states, not just the last
The Intuition of Attention
Imagine reading a book and then answering a specific question about it. Rather than summarising the entire book into one sentence and answering from that summary, you'd go back and attend to the specific passages most relevant to the question. Attention mechanisms give neural networks this same ability: instead of relying on a single compressed representation, the model can dynamically focus on the most relevant parts of the input for each part of the output it produces.
Scaled Dot-Product Attention
The Transformer paper (Vaswani et al., 2017) formalised attention using three components (Queries, Keys, and Values) and introduced the scaled dot product as the scoring function. This formulation is elegant, parallelisable, and extraordinarily effective.
# Scaled Dot-Product Attention: the core formula
Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V
# Shapes (batch omitted for clarity):
# Q (Queries), shape (seq_len_q, dₖ): "what am I looking for?"
# K (Keys), shape (seq_len_k, dₖ): "what do I contain?"
# V (Values), shape (seq_len_v, dᵥ): "what do I output if selected?"
# Output, shape (seq_len_q, dᵥ): weighted combination of Values
# Step-by-step walkthrough for a 3-token sequence ["The", "cat", "sat"]:
# Step 1: Project input embeddings into Q, K, V spaces (via learned Wq, Wk, Wv matrices)
Q = X · Wq # (3 × dₖ)
K = X · Wk # (3 × dₖ)
V = X · Wv # (3 × dᵥ)
# Step 2: Compute raw attention scores: how much each query attends to each key
scores = Q · Kᵀ # (3 × 3) matrix of dot products
# Step 3: Scale by √dₖ to prevent softmax saturation in high dimensions
scores_scaled = scores / √dₖ
# Step 4: Convert scores to probabilities
weights = softmax(scores_scaled, dim=-1) # each row sums to 1.0
# weights[i][j] = how much token i should attend to token j
# Step 5: Weighted sum of values
output = weights · V # (3 × dᵥ)
# Each output token is a blended combination of all value vectors,
# proportional to attention weights.
# Example attention weights for "cat" might look like:
# "cat" attends to: "The" → 0.05, "cat" → 0.85, "sat" → 0.10
# → output for "cat" is mostly the value of "cat" itself (self-attention)
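The five steps above can be run end-to-end in a few lines. A minimal NumPy sketch, using toy random weights and a made-up embedding dimension of 8 (not trained values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # Steps 2-3: scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # Step 4: row-wise softmax
    return weights @ V, weights                            # Step 5: blend the values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                        # 3 tokens, toy embedding dim 8
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)  # Step 1: project
print(out.shape)      # (3, 8)
```

Each row of `weights` is a probability distribution over the three tokens, exactly like the example weights for "cat" above.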
Queries, Keys, Values
The Q/K/V terminology comes from information retrieval. Think of a Key-Value store: a dictionary where each key maps to a value. A Query is what you're looking for; you compute similarity between the query and all keys to decide how much of each value to retrieve. Unlike a hard lookup, attention performs a soft retrieval: a weighted blend of all values.
The √dₖ Scaling Factor
Without scaling, the variance of dot products grows proportionally to dₖ, so their typical magnitude grows like √dₖ. For dₖ = 64, a typical dot product has magnitude ~8 (= √64). The softmax of large values pushes weights towards 0 and 1 (saturation), collapsing gradients. Dividing by √dₖ keeps the inputs to softmax in a reasonable variance range, maintaining healthy gradient flow.
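A quick numerical check of that claim, sampling random vectors with unit-variance components at dₖ = 64:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(1)
q = rng.normal(size=(100_000, d_k))     # unit-variance components
k = rng.normal(size=(100_000, d_k))
dots = (q * k).sum(axis=-1)             # 100k sample dot products

print(dots.std())                       # ≈ 8, i.e. √64: magnitude grows like √dₖ
print((dots / np.sqrt(d_k)).std())      # ≈ 1 after scaling: softmax stays unsaturated
```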
Attention as a Soft Dictionary Lookup
Consider a Python dictionary {"The": v1, "cat": v2, "sat": v3}. A hard lookup for "cat" returns exactly v2. Attention performs a soft lookup: given a query vector, compute a similarity score with every key, convert to probabilities with softmax, and return a weighted average of all values. This is differentiable end-to-end β the query, keys, and values are all learned via gradient descent β allowing the model to discover what to attend to from data alone.
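That soft lookup fits in a few lines of plain Python. The 2-d keys, values, and query below are made-up toy numbers chosen so the query lands nearest the "cat" key:

```python
import math

def soft_lookup(query, keys, values):
    """Soft dictionary: softmax similarity over keys, weighted average of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                  # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    blended = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, blended

keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]          # keys for "The", "cat", "sat"
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = soft_lookup([0.0, 3.0], keys, values)  # query closest to "cat"'s key
print(weights.index(max(weights)))                    # 1: "cat"'s value dominates
```

Swap `math.exp` gradients for autograd and learn the keys, values, and queries, and this is exactly the attention mechanism.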
Multi-Head Attention
Single-head attention can only attend to information from one "perspective" at a time. Multi-head attention runs h attention operations in parallel, each with its own learned Q/K/V projections, then concatenates and projects the results. Different heads learn to specialise in different types of relationships.
# Multi-Head Attention: h parallel attention heads
MultiHead(Q, K, V) = Concat(head₁, head₂, …, headₕ) · Wₒ
# Each head i uses separate projection matrices:
headᵢ = Attention(Q · Wᵢq, K · Wᵢk, V · Wᵢv)
# Typical dimensions (illustrative; roughly GPT-2 medium scale):
# d_model = 1024 (embedding dimension)
# h = 16 (number of heads)
# dₖ = dᵥ = d_model / h = 64 (per-head dimension)
#
# Each head works in a 64-dimensional subspace.
# Concatenating 16 × 64-dim outputs = 1024-dim, then project with Wₒ.
# Total parameters per MHA layer:
# 16 heads × (Wq + Wk + Wv): 3 × 1024 × 1024 ≈ 3.1M
# Output projection Wₒ: 1024 × 1024 ≈ 1.0M
# Total: ~4.2M parameters per MHA layer
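The split-into-heads / concatenate / project flow can be sketched in NumPy. A minimal single-sequence version with no masking and random weights (the 0.02 initialisation scale is an arbitrary choice for the toy example):

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_o, h):
    """Minimal multi-head self-attention sketch (no batching, no masking)."""
    seq_len, d_model = X.shape
    d_head = d_model // h
    Q, K, V = (X @ W for W in W_qkv)                 # each (seq_len, d_model)

    def split(M):  # carve the model dimension into h heads of size d_head
        return M.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = map(split, (Q, K, V))               # (h, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)               # per-head attention weights
    heads = w @ Vh                                   # (h, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                              # final output projection

rng = np.random.default_rng(0)
d_model, h = 1024, 16
X = rng.normal(size=(3, d_model))
W_qkv = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3)]
W_o = rng.normal(size=(d_model, d_model)) * 0.02
out = multi_head_attention(X, W_qkv, W_o, h)
print(out.shape)                                     # (3, 1024)
n_params = sum(W.size for W in W_qkv) + W_o.size
print(n_params)                                      # 4194304, i.e. ~4.2M
```

The parameter count matches the arithmetic above: four 1024 × 1024 matrices.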
| Property | Single-Head Attention | Multi-Head Attention |
|---|---|---|
| Subspaces | One d_model-dimensional attention subspace | h independent dₖ-dimensional subspaces (dₖ = d_model/h) |
| Relationship types | One type of pattern per layer | Different heads specialise: syntax, coreference, positional proximity, semantic similarity |
| Interpretability | Single attention map is interpretable | Visualising individual heads reveals linguistic structure (BertViz tool) |
| Compute | O(T²·d), quadratic in sequence length | Same asymptotic complexity; same total FLOPs if d_model is held constant |
| Expressiveness | More limited representation capacity per layer | Richer; empirically much better at capturing diverse linguistic phenomena |
The Full Transformer Architecture
The Transformer is built from two stacks: an encoder (processes the input) and a decoder (generates the output). Many modern models use only one stack: BERT uses encoders only; GPT uses decoders only. T5 uses both.
# ── ENCODER (one layer, repeated N times, e.g. N=12 in BERT-base) ──
Input tokens → Token Embeddings + Positional Encoding
        ↓
┌───────────────────────────┐
│ Multi-Head Self-Attention │ ← every token attends to every other token
└─────────────┬─────────────┘
        ↓ + residual (add input to attention output)
Layer Normalisation
        ↓
┌───────────────────────────┐
│ Feed-Forward Network      │ ← 2-layer MLP applied independently to each position
│ (expand to 4×d → GELU →   │
│  project back to d)       │
└─────────────┬─────────────┘
        ↓ + residual
Layer Normalisation
        ↓ (becomes input to next encoder layer)
# ── DECODER (one layer, also repeated N times) ──
Target tokens (shifted right) → Embeddings + Positional Encoding
        ↓
Masked Multi-Head Self-Attention ← causal mask: can only see past positions
        ↓ + residual + LayerNorm
Multi-Head Cross-Attention ← Q from decoder, K/V from encoder output
        ↓ + residual + LayerNorm
Feed-Forward Network
        ↓ + residual + LayerNorm
Linear + Softmax → Output probabilities
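The causal mask in the first decoder sub-layer is just an upper-triangular matrix of −inf added to the scores before the softmax. A tiny NumPy illustration using uniform (all-zero) scores so the masking effect is the only thing visible:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)       # uniform scores, then mask
weights = np.exp(scores)                          # exp(-inf) = 0 kills future positions
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
print(weights[0])   # [1. 0. 0. 0.]        first token sees only itself
print(weights[3])   # [0.25 0.25 0.25 0.25] last token sees everything equally
```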
# ── POSITIONAL ENCODING ──
# Since Transformers process all positions in parallel, they have no inherent
# sense of order. Positional encodings add position information to embeddings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
# Learned positional embeddings (BERT, GPT-2) are an alternative.
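The two sinusoidal formulas translate directly to code. A sketch assuming an even d_model, vectorised over the pair index i:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: even columns sin, odd columns cos (assumes even d_model)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 128)
print(pe.shape)      # (50, 128)
print(pe[0, :4])     # [0. 1. 0. 1.]  position 0: sin(0), cos(0) interleaved
```

Every entry lies in [−1, 1], so the encodings can simply be added to the token embeddings without swamping them.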
| Component | Role | Output Shape |
|---|---|---|
| Token Embeddings | Map discrete token IDs to dense vectors. Learned lookup table of shape (vocab_size Γ d_model). | (batch, seq_len, d_model) |
| Positional Encoding | Inject position information by adding sinusoidal (or learned) vectors to embeddings. Allows the model to distinguish "cat sat" from "sat cat." | (batch, seq_len, d_model), added to embeddings |
| Multi-Head Self-Attention | Each position attends to all other positions (encoder) or all prior positions (decoder). Captures contextual relationships across the full sequence. | (batch, seq_len, d_model) |
| Cross-Attention (Decoder) | Decoder queries attend to encoder key-value pairs, aligning generated tokens with relevant input tokens, the modern incarnation of the original alignment-problem solution. | (batch, tgt_len, d_model) |
| Feed-Forward Network | Position-wise MLP: expands to 4×d_model, applies GELU/ReLU, projects back. Provides most of the network's non-linear "thinking" capacity. ~2/3 of Transformer parameters live here. | (batch, seq_len, d_model) |
| Layer Normalisation | Normalises activations across the feature dimension (not batch). Applied after each sub-layer (Post-LN, original paper) or before each sub-layer (Pre-LN, GPT-3 style, more stable). | (batch, seq_len, d_model), unchanged shape |
| Residual Connections | Adds the input of each sub-layer directly to its output: output = sublayer(x) + x. Enables very deep Transformers (12–96 layers) to train by preserving gradient pathways, same principle as ResNet. | (batch, seq_len, d_model), unchanged shape |
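The last three rows of the table compose into one repeated pattern. A sketch of the Pre-LN variant with a toy feature dimension, no learned scale/shift parameters, and a stand-in sublayer (real layers would use the attention or FFN blocks above):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise across the feature (last) dimension; no learned gain/bias here."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_sublayer(x, sublayer):
    """Pre-LN residual block: x + sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = pre_ln_sublayer(x, lambda h: 0.5 * h)   # toy sublayer stands in for MHA/FFN
print(out.shape)                              # (3, 8): shape unchanged, as the table notes
```

Because the residual path carries `x` through untouched, gradients flow straight back through every layer of the stack.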
Transformer Variants & Models
Since 2017, the Transformer has spawned hundreds of variants. The most important distinction is the architecture type: encoder-only (good at understanding), decoder-only (good at generation), or encoder-decoder (good at transformation tasks like translation or summarisation).
| Model | Type | Parameters | Key Innovation |
|---|---|---|---|
| BERT (2018) | Encoder-only | 110M (base) / 340M (large) | Masked Language Modelling (MLM): randomly mask 15% of tokens, train to predict them. Bidirectional context. Created the pre-train → fine-tune paradigm for NLP. |
| GPT-1/2/3/4 | Decoder-only | 117M → 1.5B → 175B → undisclosed | Causal language modelling (predict next token). GPT-3 showed few-shot learning emerges at scale. GPT-4 is multimodal. The dominant paradigm for generation tasks. |
| T5 (2019) | Encoder-decoder | 60M–11B | Frames every NLP task as text-to-text (e.g. "summarise: …" → summary). Unified interface for translation, summarisation, Q&A, classification. Strong baseline. |
| Vision Transformer / ViT (2020) | Encoder-only (image) | 86M (ViT-B) to 632M (ViT-H) | Splits the image into 16×16 patches and treats each patch as a token. Self-attention across patches achieves a global receptive field from layer 1. Outperforms CNNs at scale. |
| LLaMA 2/3 (2023/24) | Decoder-only | 7B–70B (LLaMA 2); 8B–405B (LLaMA 3/3.1) | Open weights, RoPE positional embeddings, grouped-query attention (GQA) for inference efficiency, SwiGLU activation. Foundation for most of the open-source LLM ecosystem. |
Scaling Laws: Bigger Is Predictably Better
Kaplan et al. (OpenAI, 2020) discovered that Transformer loss follows smooth power laws as a function of model size, dataset size, and compute budget, independently and predictably. This means: double your parameters, and your loss decreases by a predictable amount, regardless of architecture details. These scaling laws allowed labs to plan multi-hundred-million-dollar training runs with confidence. Chinchilla (Hoffmann et al., 2022) refined this: most large models were undertrained on data; optimal training requires ~20 tokens per parameter. GPT-4 and LLaMA 3 applied this principle aggressively.
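The ~20 tokens-per-parameter rule turns compute-optimal planning into simple arithmetic. A sketch (this is a rule of thumb from the Chinchilla analysis, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training-set size in tokens."""
    return n_params * tokens_per_param

# a 70B-parameter model is roughly compute-optimal around 1.4 trillion tokens
tokens = chinchilla_optimal_tokens(70 * 10**9)
print(tokens)   # 1400000000000
```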