⚡ Why Transformers Replaced RNNs
Before 2017, the dominant architecture for sequence modelling was the Recurrent Neural Network (RNN) and its gated variants LSTM and GRU. These models processed text one token at a time, passing a hidden state from left to right. While effective for short sequences, they suffered from fundamental limitations that made them impractical for modern large-scale language tasks.
The Sequential Bottleneck
RNNs must process token 1 before token 2, token 2 before token 3 — and so on. This strict sequential dependency means no parallelism during training. With a 512-token sequence, you need 512 serial steps. Modern GPUs are massively parallel processors; the RNN bottleneck leaves nearly all that hardware idle, making training orders of magnitude slower than it needs to be.
Vanishing & Exploding Gradients
Backpropagation through time (BPTT) requires multiplying gradients across every timestep. With 512 steps, a per-step gradient factor even slightly below 1 drives the product toward zero, while a factor slightly above 1 makes it explode. LSTMs mitigated this with gating, but long-range dependencies (e.g., a pronoun referring to a noun 200 tokens back) remained extremely difficult to learn reliably.
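A toy calculation makes this concrete. The sketch below is not BPTT itself; it just multiplies a single hypothetical per-step gradient factor across 512 timesteps to show how quickly the product degenerates.

```python
def gradient_product(per_step_factor: float, num_steps: int = 512) -> float:
    """Multiply one hypothetical gradient factor across all timesteps."""
    product = 1.0
    for _ in range(num_steps):
        product *= per_step_factor
    return product

vanished = gradient_product(0.99)  # factor just below 1: product ~ 0.006
exploded = gradient_product(1.01)  # factor just above 1: product ~ 163
```

Even factors within 1% of 1 yield a vanished or exploded product after 512 steps; LSTM gating softens this effect but does not remove it.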
Fixed-Size Context Bottleneck
In sequence-to-sequence RNN models (e.g., for translation), the entire input must be compressed into a single fixed-size hidden state vector before decoding begins. No matter how long the input, the decoder sees only that one vector. Critical early-sequence information is frequently overwritten by later content.
The Attention Is All You Need Moment (2017)
In June 2017, Vaswani et al. at Google Brain published "Attention Is All You Need" — the paper introducing the Transformer. Its key insight: you don't need recurrence at all. If you let every token attend directly to every other token simultaneously, you get full parallelism, O(1) path length between any two positions, and theoretically unlimited long-range dependency capture. The paper achieved state-of-the-art machine translation and launched an architectural revolution that produced BERT, GPT, T5, and every major LLM today.
The core trade-off: Transformers exchange the O(n) memory complexity of RNNs for O(n²) attention complexity (every token attends to every other), but this is manageable for the sequence lengths typical at training time, and the parallelism wins decisively on modern hardware. Efficient attention variants (Flash Attention, sliding window) address the quadratic cost as context lengths grow.
🏗️ Encoder vs Decoder vs Encoder-Decoder
The original Transformer had two halves: an encoder that reads and represents the input, and a decoder that generates the output. Subsequent research showed that each half was independently useful, leading to three major architectural families, each suited to different tasks.
| Architecture | Representative Models | Training Objective | Primary Use Cases |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa, ELECTRA | Masked Language Modelling (MLM) — predict randomly masked tokens using both left and right context | Text classification, NER, sentiment analysis, question answering (extractive), semantic embeddings |
| Decoder-only | GPT-2, GPT-3, GPT-4, Llama, Mistral, Falcon, Gemma | Causal Language Modelling (CLM) — predict the next token given all preceding tokens | Text generation, chat assistants, code generation, summarisation, any open-ended generation task |
| Encoder-Decoder | T5, BART, mT5, FLAN-T5, MarianMT | Span corruption (T5) or denoising (BART) — encoder reads corrupted input, decoder reconstructs | Machine translation, abstractive summarisation, document Q&A, structured prediction tasks |
Encoder-Only: Bidirectional Context
Every token in the encoder attends to every other token — past and future. This bidirectional attention gives a rich contextual representation of each position, ideal for classification and understanding tasks where the full input is available. BERT's success popularised this approach for NLP fine-tuning benchmarks.
Decoder-Only: Causal Masking
A decoder processes tokens left-to-right. To prevent "cheating" during training, a causal mask (also called an autoregressive mask) blocks each position from attending to future positions. The mask is a strictly upper-triangular matrix of −∞ values added to the attention logits before the softmax, zeroing out future attention weights. At inference, tokens are generated one at a time, each attending to all previously generated tokens.
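A minimal sketch in plain Python (no tensor library) shows how such a mask interacts with the softmax. `causal_mask` and `masked_softmax` are illustrative names, not a real framework API.

```python
import math

def causal_mask(seq_len: int) -> list[list[float]]:
    # 0.0 where attention is allowed (j <= i); -inf where position j
    # is in the future relative to i. Added to logits pre-softmax.
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(logits: list[float], mask_row: list[float]) -> list[float]:
    masked = [l + m for l, m in zip(logits, mask_row)]
    peak = max(masked)                       # subtract max for stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Position 1 of a 3-token sequence: the logit for position 2 is masked,
# so its attention weight comes out exactly 0.
weights = masked_softmax([1.0, 2.0, 3.0], causal_mask(3)[1])
```

Because `exp(−∞) = 0`, the masked positions contribute nothing to the weighted sum, no matter how large their raw logits were.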
Encoder-Decoder: Cross-Attention Bridge
The decoder in an encoder-decoder model has a special cross-attention layer in addition to self-attention. Queries come from the decoder's own hidden states, but keys and values come from the encoder's output. This lets every decoder position look at any encoder position, enabling the model to condition generation directly on the full input representation.
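A single-head sketch in plain Python illustrates the wiring; it omits the learned projection matrices a real model applies to produce Q, K, and V, and the vectors are made-up toy values.

```python
import math

def attention(queries, keys, values):
    # Scaled dot-product attention over plain Python lists of vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Cross-attention: the query comes from the decoder's hidden state,
# while keys and values both come from the encoder's output.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoder positions
decoder_state = [[1.0, 0.0]]                        # 1 decoder position
context = attention(decoder_state, encoder_out, encoder_out)
```

The decoder position ends up with a context vector weighted toward the encoder positions most similar to its query, which is exactly what lets generation condition on the input.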
Why Decoder-Only Dominates Modern LLMs
Despite encoder-decoder models being powerful for structured tasks, the current generation of large LLMs (GPT-4o, Llama 3, Mistral, Claude, DeepSeek) is decoder-only. The key reasons: (1) next-token prediction scales extremely well with data and compute; (2) a single decoder can handle both understanding and generation; (3) instruction fine-tuning turns completion into conversation without any architectural change; (4) decoder-only models can be used for virtually any NLP task by framing it as text generation. Reasoning models (o1, o3, DeepSeek-R1) extend this further: they generate long internal "thinking" traces as ordinary token sequences before producing the final answer.
🧩 Inside the Transformer Block
A Transformer model is built by stacking N identical blocks. Each block refines the token representations. Understanding what happens inside one block is the foundation for understanding everything else about LLMs.
```
# One Transformer block (decoder-only, post-norm variant)
Input: x  [shape: batch × seq_len × d_model]

# Step 1: Multi-Head Self-Attention
residual = x
x = MultiHeadSelfAttention(x)   # each token attends to all previous tokens
x = Dropout(x)
x = LayerNorm(x + residual)     # Add & Norm (residual connection)

# Step 2: Position-wise Feed-Forward Network
residual = x
x = Linear(x, d_model → d_ff)   # expand: d_ff = 4 × d_model typically
x = ReLU(x)                     # or GeLU in modern models
x = Linear(x, d_ff → d_model)   # project back
x = Dropout(x)
x = LayerNorm(x + residual)     # Add & Norm

Output: x  [same shape: batch × seq_len × d_model]
```
| Component | Role | Typical Dimensions (7B model) |
|---|---|---|
| Multi-Head Self-Attention | Lets each token gather information from other relevant tokens; captures syntactic and semantic relationships across the sequence | d_model = 4096, 32 heads, head_dim = 128 |
| Residual Connection (Add) | Adds input directly to output before normalisation; preserves gradient flow through deep stacks; ensures earlier representations are not lost | Same shape as d_model — no params |
| Layer Normalisation (Norm) | Normalises each token's vector to zero mean and unit variance; stabilises training; prevents internal covariate shift | 2 × d_model learnable params (scale + bias) |
| Feed-Forward Network (FFN) | Applies the same two-layer MLP independently to each token position; thought to store factual knowledge; provides non-linearity | d_ff = 11008 (≈ 2.7 × d_model for SwiGLU variants) |
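The Layer Normalisation row above can be made concrete for a single token vector. A minimal sketch assuming the standard LayerNorm formula, with the learnable scale and bias defaulting to the identity:

```python
import math

def layer_norm(x, scale=None, bias=None, eps=1e-5):
    # Normalise one token's vector to zero mean / unit variance, then
    # apply learnable scale and bias (the 2 x d_model parameters).
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    scale = scale or [1.0] * d
    bias = bias or [0.0] * d
    return [(v - mean) / math.sqrt(var + eps) * s + b
            for v, s, b in zip(x, scale, bias)]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])  # zero mean, ~unit variance
```

In the Add & Norm step this runs independently on every token position, after the residual has been added.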
Residual Connections: Why They Matter
Without residual (skip) connections, gradients must flow through every layer of the network during backpropagation. In a 96-layer model like GPT-3, this is catastrophic — gradients vanish. Residual connections provide a "highway" for gradients to travel directly from output to input, enabling training of networks with hundreds of layers. They were introduced in ResNet (2015) for vision and are equally critical in Transformers.
Pre-Norm vs Post-Norm
The original paper applies LayerNorm after the residual add (post-norm). Most modern LLMs (GPT-3, Llama, Mistral) use pre-norm: LayerNorm is applied to the input before the attention or FFN operation. Pre-norm is more stable to train, especially at large scale, because the residual path itself is left unnormalised, giving gradients a clean route through the full depth of the network. RMSNorm (used in Llama) is a simplified variant that skips the mean-centring step for efficiency.
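The difference is only where the normalisation sits relative to the residual add, which a two-line sketch captures. Here `sublayer` stands in for either attention or the FFN, and scalars stand in for tensors:

```python
def post_norm_block(x, sublayer, norm):
    # Original Transformer: normalise AFTER adding the residual.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # GPT-3 / Llama / Mistral: normalise the sublayer input; the
    # residual path x stays untouched, so gradients skip the norm.
    return x + sublayer(norm(x))
```

With a non-trivial `norm`, post-norm rescales the whole residual stream at every block, while pre-norm only rescales what each sublayer sees.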
The FFN as Knowledge Store
Research by Geva et al. (2021) found that FFN layers act like key-value memories: the first linear layer identifies which "fact pattern" matches the input, and the second linear layer retrieves the associated information. Experiments show that factual associations (e.g., "Paris is the capital of France") are stored in specific FFN neurons, and editing these neurons can change model outputs — the basis for model editing techniques like ROME and MEMIT.
📍 Positional Encoding
Attention is inherently order-agnostic: if you shuffle the tokens in a sequence, the attention operation produces exactly the same output (just shuffled). A model with no positional information cannot distinguish "The dog bit the man" from "The man bit the dog." Positional encodings inject order information into token representations.
Sinusoidal Encoding (Original Paper)
Vaswani et al. added a fixed, deterministic positional vector to each token embedding before the first layer. The vector is computed from sine and cosine functions whose frequencies decrease geometrically across the embedding dimensions:
```
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
```
Different dimensions encode position at different scales — low-frequency dimensions capture global order; high-frequency dimensions capture local position. No training required; generalises to unseen lengths.
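The formula translates directly into a few lines of Python. A sketch assuming the interleaved sin/cos layout of the original paper:

```python
import math

def sinusoidal_encoding(pos: int, d_model: int) -> list[float]:
    # Even dimensions use sine, odd dimensions use cosine, with
    # frequencies falling geometrically across dimension pairs.
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

vec = sinusoidal_encoding(pos=3, d_model=8)
```

Because the formula is deterministic, a vector for any position, including ones never seen in training, can be computed on the fly.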
Learned Positional Embeddings
Rather than a fixed formula, learned embeddings treat each position index as a learnable lookup — just like token embeddings. The model learns what positional representation works best from data. Used in BERT, GPT-2, and early GPT-3. The downside: the vocabulary of positions is fixed at training time; positions beyond the training maximum are unseen and produce poor representations. This limits generalisation to longer sequences.
RoPE — Rotary Positional Embeddings
Introduced by Su et al. (2021) and adopted by Llama, Mistral, Falcon, Qwen, and most modern open-source LLMs. Instead of adding a positional vector, RoPE rotates the query and key vectors by an angle proportional to their absolute position before computing dot-product attention. The key property: the dot product between a query at position m and a key at position n depends only on their relative offset (m − n), not their absolute values. This gives the model a natural sense of "how far apart" two tokens are.
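The relative-offset property can be checked numerically with a toy 4-dimensional rotation. `rotate` is an illustrative sketch of the RoPE idea, not a production implementation (real models rotate per attention head, and codebases differ on interleaved versus half-split dimension layouts):

```python
import math

def rotate(vec, position, theta=10000.0):
    # Rotate each (even, odd) dimension pair by an angle proportional
    # to the token's absolute position; the frequency varies per pair.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        angle = position / (theta ** (i / d))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same relative offset (m - n = 3) at two different absolute positions:
score_near = dot(rotate(q, 10), rotate(k, 7))
score_far = dot(rotate(q, 110), rotate(k, 107))
# score_near and score_far agree to floating-point precision
```

Shifting both positions by the same amount leaves every attention score unchanged, which is exactly the relative-position property described above.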
RoPE and Context Length Extrapolation
Because RoPE encodes relative position, it generalises more gracefully beyond training context lengths than learned absolute embeddings. Techniques like YaRN, NTK-aware scaling, and LongRoPE extend this further by rescaling the rotation frequencies, allowing models trained on 4k tokens to operate on 128k or even 1M+ tokens after fine-tuning. This is why Llama 3.1 can offer a 128k context window starting from a base model trained on shorter sequences.
| Method | Encoding Type | Max Length at Train | Extrapolation | Used By |
|---|---|---|---|---|
| Sinusoidal | Absolute, fixed | 512 | Poor beyond train length | Original Transformer, some early BERT variants |
| Learned Absolute | Absolute, trained | 2048 (GPT-3) | Very poor — unseen positions | BERT, GPT-2, GPT-3 |
| RoPE | Relative, applied to Q/K | 4096–8192 | Good; extendable with YaRN | Llama 2/3, Mistral, Falcon, Qwen |
| ALiBi | Relative, attention bias | 2048 | Good linear extrapolation | MPT, BLOOM variants |
📈 Scaling & Context Length
The power of Transformers comes not just from the architecture but from scaling — stacking more layers, widening the hidden dimension, and training on vastly more data. Understanding how parameters and compute interact is essential for anyone working with LLMs.
Parameter Count Formula
For a decoder-only Transformer, approximate total parameters:
```python
# Approximate decoder-only parameter count (a rough sketch)
def approx_params(vocab_size, d_model, d_ff, num_layers):
    embedding = vocab_size * d_model
    per_block = (
        4 * d_model * d_model   # Q, K, V, O projections (full MHA)
        + 2 * d_model * d_ff    # FFN up + down projections
        + 4 * d_model           # 2 LayerNorms, scale + bias each
    )
    return embedding + num_layers * per_block

# Example: Llama 3 8B (d_model=4096, d_ff=14336, num_layers=32,
# vocab=128,256). The naive formula gives ~6.4B; the actual 8B total
# also counts the untied output embedding, the third (gated) SwiGLU
# FFN matrix, and GQA's smaller K/V projections.
approx_params(128_256, 4096, 14336, 32)
```
Context Window Evolution
The maximum sequence length a model can process has grown dramatically:
- GPT-1 (2018): 512 tokens
- BERT (2018): 512 tokens
- GPT-3 (2020): 2,048 tokens
- GPT-4 Turbo (2023): 128k tokens
- GPT-4o (2024): 128k tokens
- Claude 3.5 / 3.7 (2024–25): 200k tokens
- Llama 3.1 / 3.3 (2024): 128k tokens
- Gemini 1.5 Pro (2024): 1,000,000 tokens
- Gemini 2.0 / 2.5 (2025): 1,000,000+ tokens
- o1 / o3 / o4-mini (2024–25): 200k tokens
KV-Cache: Inference Efficiency
During autoregressive generation, the model produces one token at a time. Without caching, it would recompute keys and values for all previous tokens at every step, so generating n tokens costs O(n²) key/value computation overall. The KV-cache stores the key and value tensors for all past positions: each new token computes only its own Q, K, V and attends to the cached K, V. This cuts total key/value computation from O(n²) to O(n), at the cost of memory proportional to sequence length × model size.
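The mechanism is easy to sketch with stand-in callables. `project_qkv` and `attend` are hypothetical placeholders for the per-token projection and attention steps:

```python
def generate_with_kv_cache(num_steps, project_qkv, attend):
    # Toy generation loop: keys/values for past tokens are cached,
    # so each step projects only the newest token (O(1) projection
    # work per step) and then attends over the growing cache.
    k_cache, v_cache = [], []
    token = 0  # toy starting token id
    for _ in range(num_steps):
        q, k, v = project_qkv(token)
        k_cache.append(k)
        v_cache.append(v)
        token = attend(q, k_cache, v_cache)
    return k_cache, v_cache
```

The caches grow by one entry per generated token, which is exactly why KV-cache memory scales linearly with sequence length.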
Quadratic Attention Complexity and Modern Mitigations
Standard dot-product attention scales as O(n²) in both time and memory with sequence length n — doubling the context quadruples the cost. At 128k–1M tokens, this is computationally brutal. Mitigations include: Flash Attention 2/3 (reduces memory bandwidth via IO-aware kernel fusion — FA3 reaches ~75% of H100 hardware peak), Grouped-Query Attention (GQA) (reduces KV-cache memory, used in Llama 3, Gemma, Mistral), Multi-head Latent Attention (MLA) (DeepSeek-V3 approach: projects K/V through a low-rank bottleneck, reducing KV-cache by over 90%), Sliding Window Attention (each token attends only to a local window, used in Mistral), and Mixture of Experts (MoE) routing which reduces active parameter compute per token (Mixtral, Gemini 1.5/2.x, DeepSeek-V3). For most use cases below 32k tokens, Flash Attention 2 is sufficient; for 1M+ context, GQA + MoE + efficient positional encoding (RoPE with YaRN scaling) are all required together.
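Of these, sliding window attention is the simplest to sketch: it is just a causal mask that additionally forbids attention beyond a fixed lookback. This toy mask builder is illustrative, not how any production kernel is written:

```python
import math

def sliding_window_mask(seq_len: int, window: int) -> list[list[float]]:
    # Causal mask restricted to the last `window` positions: position i
    # may attend to position j only when i - window < j <= i.
    return [[0.0 if i - window < j <= i else -math.inf
             for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
```

Each row now has at most `window` allowed entries, so attention cost per token becomes O(window) instead of O(n), at the price of no direct long-range attention within a single layer.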
| Scale | Model Examples | Approximate Parameters | Training Compute (FLOPs) |
|---|---|---|---|
| Small | GPT-2, Phi-4 (14B), Gemma 3 4B, Llama 3.2 3B | 125M – 14B | 10²² – 10²³ |
| Medium | Llama 3.1 8B, Mistral Small 3 (24B), Gemma 3 27B | 7B – 30B | 10²³ – 10²⁴ |
| Large | Llama 3.3 70B, Qwen2.5 72B, Mixtral 8×22B | 40B – 100B dense (or MoE equivalents) | 10²⁴ – 10²⁵ |
| Frontier | GPT-4o, Llama 3.1 405B, DeepSeek-V3 (671B MoE), Gemini 2.5 Pro | 200B dense – 1T+ MoE total (37B – 440B active) | 10²⁵ – 10²⁶+ |