The Attention Problem
Before Transformers, the state-of-the-art for sequence-to-sequence tasks (like machine translation) was an RNN encoder-decoder architecture. The encoder read the entire source sentence and compressed it into a single fixed-size hidden state vector, the "thought vector", which the decoder would then use to generate the output one word at a time. This bottleneck was the fatal flaw.
The RNN Bottleneck
Regardless of whether the source sentence is 5 words or 500 words, the encoder must compress all its meaning into a single vector of fixed dimension (say, 512 floats). For short sentences this works tolerably. For longer sentences, information is irreversibly lost: the last few words are strongly represented, while early words are largely forgotten.
- Fixed-size hidden state = fixed information capacity
- Information from early tokens diluted by many subsequent updates
- Longer sentences → worse translation quality
- First demonstrated as a real problem by Cho et al. (2014)
The Alignment Problem
When translating "The cat sat on the mat" to French, the word "chat" (cat) should attend most strongly to the source word "cat," not to "mat" or "sat." But in a vanilla encoder-decoder, the decoder only has the single compressed context vector: it cannot selectively focus on the source words most relevant to the current output word being generated.
- Each output word requires different source words to be emphasised
- Soft alignment: Bahdanau et al. (2015) proposed learned attention weights
- Attention score: how relevant is source position i to generating target position j?
- Context vector: weighted sum of all encoder hidden states, not just the last
The Intuition of Attention
Imagine reading a book and then answering a specific question about it. Rather than summarising the entire book into one sentence and answering from that summary, you'd go back and attend to the specific passages most relevant to the question. Attention mechanisms give neural networks this same ability: instead of relying on a single compressed representation, the model can dynamically focus on the most relevant parts of the input for each part of the output it produces.
Scaled Dot-Product Attention
The Transformer paper (Vaswani et al., 2017) formalised attention using three components (Queries, Keys, and Values) and introduced the scaled dot product as the scoring function. This formulation is elegant, parallelisable, and extraordinarily effective.
# Scaled Dot-Product Attention: the core formula
Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V
# Shapes (batch omitted for clarity):
# Q (Queries), shape (seq_len_q, dₖ): "what am I looking for?"
# K (Keys), shape (seq_len_k, dₖ): "what do I contain?"
# V (Values), shape (seq_len_v, dᵥ): "what do I output if selected?"
# Output, shape (seq_len_q, dᵥ): weighted combination of Values
# Step-by-step walkthrough for a 3-token sequence ["The", "cat", "sat"]:
# Step 1: Project input embeddings into Q, K, V spaces (via learned Wq, Wk, Wv matrices)
Q = X · Wq # (3 × dₖ)
K = X · Wk # (3 × dₖ)
V = X · Wv # (3 × dᵥ)
# Step 2: Compute raw attention scores: how much each query attends to each key
scores = Q · Kᵀ # (3 × 3) matrix of dot products
# Step 3: Scale by √dₖ to prevent softmax saturation in high dimensions
scores_scaled = scores / √dₖ
# Step 4: Convert scores to probabilities
weights = softmax(scores_scaled, dim=-1) # each row sums to 1.0
# weights[i][j] = how much token i should attend to token j
# Step 5: Weighted sum of values
output = weights · V # (3 × dᵥ)
# Each output token is a blended combination of all value vectors,
# proportional to attention weights.
# Example attention weights for "cat" might look like:
# "cat" attends to: "The" → 0.05, "cat" → 0.85, "sat" → 0.10
# → output for "cat" is mostly the value of "cat" itself (self-attention)
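The five steps above can be run end-to-end in a few lines. A minimal NumPy sketch, using toy random weights and a made-up embedding dimension of 8 (not trained values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # Steps 2-3: scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # Step 4: row-wise softmax
    return weights @ V, weights                            # Step 5: blend the values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                        # 3 tokens, toy embedding dim 8
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)  # Step 1: project
print(out.shape)      # (3, 8)
```

Each row of `weights` is a probability distribution over the three tokens, exactly like the example weights for "cat" above.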
Queries, Keys, Values
The Q/K/V terminology comes from information retrieval. Think of a Key-Value store: a dictionary where each key maps to a value. A Query is what you're looking for; you compute similarity between the query and all keys to decide how much of each value to retrieve. Unlike a hard lookup, attention performs a soft retrieval: a weighted blend of all values.
The √dₖ Scaling Factor
Without scaling, the variance of dot products grows proportionally to dₖ, so their typical magnitude grows like √dₖ. For dₖ = 64, a typical dot product has magnitude ~8 (= √64). The softmax of large values pushes weights towards 0 and 1 (saturation), collapsing gradients. Dividing by √dₖ keeps the inputs to softmax in a reasonable variance range, maintaining healthy gradient flow.
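A quick numerical check of that claim, sampling random vectors with unit-variance components at dₖ = 64:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(1)
q = rng.normal(size=(100_000, d_k))     # unit-variance components
k = rng.normal(size=(100_000, d_k))
dots = (q * k).sum(axis=-1)             # 100k sample dot products

print(dots.std())                       # ≈ 8, i.e. √64: magnitude grows like √dₖ
print((dots / np.sqrt(d_k)).std())      # ≈ 1 after scaling: softmax stays unsaturated
```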
Attention as a Soft Dictionary Lookup
Consider a Python dictionary {"The": v1, "cat": v2, "sat": v3}. A hard lookup for "cat" returns exactly v2. Attention performs a soft lookup: given a query vector, compute a similarity score with every key, convert to probabilities with softmax, and return a weighted average of all values. This is differentiable end-to-end β the query, keys, and values are all learned via gradient descent β allowing the model to discover what to attend to from data alone.
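That soft lookup fits in a few lines of plain Python. The 2-d keys, values, and query below are made-up toy numbers chosen so the query lands nearest the "cat" key:

```python
import math

def soft_lookup(query, keys, values):
    """Soft dictionary: softmax similarity over keys, weighted average of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                  # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    blended = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, blended

keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]          # keys for "The", "cat", "sat"
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = soft_lookup([0.0, 3.0], keys, values)  # query closest to "cat"'s key
print(weights.index(max(weights)))                    # 1: "cat"'s value dominates
```

Swap `math.exp` gradients for autograd and learn the keys, values, and queries, and this is exactly the attention mechanism.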
Multi-Head Attention
Single-head attention can only attend to information from one "perspective" at a time. Multi-head attention runs h attention operations in parallel, each with its own learned Q/K/V projections, then concatenates and projects the results. Different heads learn to specialise in different types of relationships.
# Multi-Head Attention: h parallel attention heads
MultiHead(Q, K, V) = Concat(head₁, head₂, …, headₕ) · Wₒ
# Each head i uses separate projection matrices:
headᵢ = Attention(Q · Wᵢq, K · Wᵢk, V · Wᵢv)
# Typical dimensions (illustrative; roughly GPT-2 medium scale):
# d_model = 1024 (embedding dimension)
# h = 16 (number of heads)
# dₖ = dᵥ = d_model / h = 64 (per-head dimension)
#
# Each head works in a 64-dimensional subspace.
# Concatenating 16 × 64-dim outputs = 1024-dim, then project with Wₒ.
# Total parameters per MHA layer:
# 16 heads × (Wq + Wk + Wv): 3 × 1024 × 1024 ≈ 3.1M
# Output projection Wₒ: 1024 × 1024 ≈ 1.0M
# Total: ~4.2M parameters per MHA layer
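The split-into-heads / concatenate / project flow can be sketched in NumPy. A minimal single-sequence version with no masking and random weights (the 0.02 initialisation scale is an arbitrary choice for the toy example):

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_o, h):
    """Minimal multi-head self-attention sketch (no batching, no masking)."""
    seq_len, d_model = X.shape
    d_head = d_model // h
    Q, K, V = (X @ W for W in W_qkv)                 # each (seq_len, d_model)

    def split(M):  # carve the model dimension into h heads of size d_head
        return M.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = map(split, (Q, K, V))               # (h, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)               # per-head attention weights
    heads = w @ Vh                                   # (h, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                              # final output projection

rng = np.random.default_rng(0)
d_model, h = 1024, 16
X = rng.normal(size=(3, d_model))
W_qkv = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3)]
W_o = rng.normal(size=(d_model, d_model)) * 0.02
out = multi_head_attention(X, W_qkv, W_o, h)
print(out.shape)                                     # (3, 1024)
n_params = sum(W.size for W in W_qkv) + W_o.size
print(n_params)                                      # 4194304, i.e. ~4.2M
```

The parameter count matches the arithmetic above: four 1024 × 1024 matrices.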
| Property | Single-Head Attention | Multi-Head Attention |
|---|---|---|
| Subspaces | One d_model-dimensional attention subspace | h independent dₖ-dimensional subspaces (dₖ = d_model/h) |
| Relationship types | One type of pattern per layer | Different heads specialise: syntax, coreference, positional proximity, semantic similarity |
| Interpretability | Single attention map is interpretable | Visualising individual heads reveals linguistic structure (BertViz tool) |
| Compute | O(T²·d), quadratic in sequence length | Same asymptotic complexity; same total FLOPs if d_model is held constant |
| Expressiveness | More limited representation capacity per layer | Richer; empirically much better at capturing diverse linguistic phenomena |
The Full Transformer Architecture
The Transformer is built from two stacks: an encoder (processes the input) and a decoder (generates the output). Many modern models use only one stack: BERT uses encoders only; GPT uses decoders only. T5 uses both.
# ── ENCODER (one layer, repeated N times, e.g. N=12 in BERT-base) ──
Input tokens → Token Embeddings + Positional Encoding
        ↓
┌───────────────────────────┐
│ Multi-Head Self-Attention │ ← every token attends to every other token
└─────────────┬─────────────┘
        ↓ + residual (add input to attention output)
Layer Normalisation
        ↓
┌───────────────────────────┐
│ Feed-Forward Network      │ ← 2-layer MLP applied independently to each position
│ (expand to 4×d → GELU →   │
│  project back to d)       │
└─────────────┬─────────────┘
        ↓ + residual
Layer Normalisation
        ↓ (becomes input to next encoder layer)
# ── DECODER (one layer, also repeated N times) ──
Target tokens (shifted right) → Embeddings + Positional Encoding
        ↓
Masked Multi-Head Self-Attention ← causal mask: can only see past positions
        ↓ + residual + LayerNorm
Multi-Head Cross-Attention ← Q from decoder, K/V from encoder output
        ↓ + residual + LayerNorm
Feed-Forward Network
        ↓ + residual + LayerNorm
Linear + Softmax → Output probabilities
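The causal mask in the first decoder sub-layer is just an upper-triangular matrix of −inf added to the scores before the softmax. A tiny NumPy illustration using uniform (all-zero) scores so the masking effect is the only thing visible:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)       # uniform scores, then mask
weights = np.exp(scores)                          # exp(-inf) = 0 kills future positions
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
print(weights[0])   # [1. 0. 0. 0.]        first token sees only itself
print(weights[3])   # [0.25 0.25 0.25 0.25] last token sees everything equally
```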
# ── POSITIONAL ENCODING ──
# Since Transformers process all positions in parallel, they have no inherent
# sense of order. Positional encodings add position information to embeddings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
# Learned positional embeddings (BERT, GPT-2) are an alternative.
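The two sinusoidal formulas translate directly to code. A sketch assuming an even d_model, vectorised over the pair index i:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: even columns sin, odd columns cos (assumes even d_model)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 128)
print(pe.shape)      # (50, 128)
print(pe[0, :4])     # [0. 1. 0. 1.]  position 0: sin(0), cos(0) interleaved
```

Every entry lies in [−1, 1], so the encodings can simply be added to the token embeddings without swamping them.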
| Component | Role | Output Shape |
|---|---|---|
| Token Embeddings | Map discrete token IDs to dense vectors. Learned lookup table of shape (vocab_size Γ d_model). | (batch, seq_len, d_model) |
| Positional Encoding | Inject position information by adding sinusoidal (or learned) vectors to embeddings. Allows the model to distinguish "cat sat" from "sat cat." | (batch, seq_len, d_model), added to embeddings |
| Multi-Head Self-Attention | Each position attends to all other positions (encoder) or all prior positions (decoder). Captures contextual relationships across the full sequence. | (batch, seq_len, d_model) |
| Cross-Attention (Decoder) | Decoder queries attend to encoder key-value pairs, aligning generated tokens with relevant input tokens, the modern incarnation of the original alignment-problem solution. | (batch, tgt_len, d_model) |
| Feed-Forward Network | Position-wise MLP: expands to 4×d_model, applies GELU/ReLU, projects back. Provides most of the network's non-linear "thinking" capacity. ~2/3 of Transformer parameters live here. | (batch, seq_len, d_model) |
| Layer Normalisation | Normalises activations across the feature dimension (not batch). Applied after each sub-layer (Post-LN, original paper) or before each sub-layer (Pre-LN, GPT-3 style, more stable). | (batch, seq_len, d_model), unchanged shape |
| Residual Connections | Adds the input of each sub-layer directly to its output: output = sublayer(x) + x. Enables very deep Transformers (12–96 layers) to train by preserving gradient pathways, same principle as ResNet. | (batch, seq_len, d_model), unchanged shape |
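The last three rows of the table compose into one repeated pattern. A sketch of the Pre-LN variant with a toy feature dimension, no learned scale/shift parameters, and a stand-in sublayer (real layers would use the attention or FFN blocks above):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise across the feature (last) dimension; no learned gain/bias here."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_sublayer(x, sublayer):
    """Pre-LN residual block: x + sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = pre_ln_sublayer(x, lambda h: 0.5 * h)   # toy sublayer stands in for MHA/FFN
print(out.shape)                              # (3, 8): shape unchanged, as the table notes
```

Because the residual path carries `x` through untouched, gradients flow straight back through every layer of the stack.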
Transformer Variants & Models
Since 2017, the Transformer has spawned hundreds of variants. The most important distinction is the architecture type: encoder-only (good at understanding), decoder-only (good at generation), or encoder-decoder (good at transformation tasks like translation or summarisation).
| Model | Type | Parameters | Key Innovation |
|---|---|---|---|
| BERT (2018) | Encoder-only | 110M (base) / 340M (large) | Masked Language Modelling (MLM): randomly mask 15% of tokens, train to predict them. Bidirectional context. Created the pre-train → fine-tune paradigm for NLP. |
| GPT-1/2/3/4 | Decoder-only | 117M → 1.5B → 175B → undisclosed | Causal language modelling (predict next token). GPT-3 showed few-shot learning emerges at scale. GPT-4 is multimodal. The dominant paradigm for generation tasks. |
| T5 (2019) | Encoder-decoder | 60M–11B | Frames every NLP task as text-to-text (e.g. "summarise: …" → summary). Unified interface for translation, summarisation, Q&A, classification. Strong baseline. |
| Vision Transformer / ViT (2020) | Encoder-only (image) | 86M (ViT-B) to 632M (ViT-H) | Splits the image into 16×16 patches and treats each patch as a token. Self-attention across patches achieves a global receptive field from layer 1. Outperforms CNNs at scale. |
| LLaMA 2/3 (2023/24) | Decoder-only | 7B–70B (LLaMA 2); 8B–405B (LLaMA 3/3.1) | Open weights, RoPE positional embeddings, grouped-query attention (GQA) for inference efficiency, SwiGLU activation. Foundation for most of the open-source LLM ecosystem. |
Scaling Laws: Bigger Is Predictably Better
Kaplan et al. (OpenAI, 2020) discovered that Transformer loss follows smooth power laws as a function of model size, dataset size, and compute budget, independently and predictably. This means: double your parameters, and your loss decreases by a predictable amount, regardless of architecture details. These scaling laws allowed labs to plan multi-hundred-million-dollar training runs with confidence. Chinchilla (Hoffmann et al., 2022) refined this: most large models were undertrained on data; optimal training requires ~20 tokens per parameter. GPT-4 and LLaMA 3 applied this principle aggressively.
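The ~20 tokens-per-parameter rule turns compute-optimal planning into simple arithmetic. A sketch (this is a rule of thumb from the Chinchilla analysis, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training-set size in tokens."""
    return n_params * tokens_per_param

# a 70B-parameter model is roughly compute-optimal around 1.4 trillion tokens
tokens = chinchilla_optimal_tokens(70 * 10**9)
print(tokens)   # 1400000000000
```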