⚡ Why Transformers Replaced RNNs
Before 2017, the dominant architecture for sequence modelling was the Recurrent Neural Network (RNN) and its gated variants LSTM and GRU. These models processed text one token at a time, passing a hidden state from left to right. While effective for short sequences, they suffered from fundamental limitations that made them impractical for modern large-scale language tasks.
The Sequential Bottleneck
RNNs must process token 1 before token 2, token 2 before token 3 — and so on. This strict sequential dependency means no parallelism during training. With a 512-token sequence, you need 512 serial steps. Modern GPUs are massively parallel processors; the RNN bottleneck leaves nearly all that hardware idle, making training orders of magnitude slower than it needs to be.
Vanishing & Exploding Gradients
Backpropagation through time (BPTT) requires multiplying gradients across every timestep. With 512 steps, a per-step gradient factor even slightly below 1 drives the product toward zero, while a factor slightly above 1 makes it explode. LSTMs mitigated this with gating, but long-range dependencies (e.g., a pronoun referring to a noun 200 tokens back) remained extremely difficult to learn reliably.
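A toy calculation makes this concrete. The sketch below is not BPTT itself; it just multiplies a single hypothetical per-step gradient factor across 512 timesteps to show how quickly the product degenerates.

```python
def gradient_product(per_step_factor: float, num_steps: int = 512) -> float:
    """Multiply one hypothetical gradient factor across all timesteps."""
    product = 1.0
    for _ in range(num_steps):
        product *= per_step_factor
    return product

vanished = gradient_product(0.99)  # factor just below 1: product ~ 0.006
exploded = gradient_product(1.01)  # factor just above 1: product ~ 163
```

Even factors within 1% of 1 yield a vanished or exploded product after 512 steps; LSTM gating softens this effect but does not remove it.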
Fixed-Size Context Bottleneck
In sequence-to-sequence RNN models (e.g., for translation), the entire input must be compressed into a single fixed-size hidden state vector before decoding begins. No matter how long the input, the decoder sees only that one vector. Critical early-sequence information is frequently overwritten by later content.
The Attention Is All You Need Moment (2017)
In June 2017, Vaswani et al. at Google Brain published "Attention Is All You Need" — the paper introducing the Transformer. Its key insight: you don't need recurrence at all. If you let every token attend directly to every other token simultaneously, you get full parallelism, O(1) path length between any two positions, and theoretically unlimited long-range dependency capture. The paper achieved state-of-the-art machine translation and launched an architectural revolution that produced BERT, GPT, T5, and every major LLM today.
The core trade-off: Transformers exchange the O(n) memory complexity of RNNs for O(n²) attention complexity (every token attends to every other), but this is manageable for the sequence lengths typical at training time, and the parallelism wins decisively on modern hardware. Efficient attention variants (Flash Attention, sliding window) address the quadratic cost as context lengths grow.
🏗️ Encoder vs Decoder vs Encoder-Decoder
The original Transformer had two halves: an encoder that reads and represents the input, and a decoder that generates the output. Subsequent research showed that each half was independently useful, leading to three major architectural families, each suited to different tasks.
| Architecture | Representative Models | Training Objective | Primary Use Cases |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa, ELECTRA | Masked Language Modelling (MLM) — predict randomly masked tokens using both left and right context | Text classification, NER, sentiment analysis, question answering (extractive), semantic embeddings |
| Decoder-only | GPT-2, GPT-3, GPT-4, Llama, Mistral, Falcon, Gemma | Causal Language Modelling (CLM) — predict the next token given all preceding tokens | Text generation, chat assistants, code generation, summarisation, any open-ended generation task |
| Encoder-Decoder | T5, BART, mT5, FLAN-T5, MarianMT | Span corruption (T5) or denoising (BART) — encoder reads corrupted input, decoder reconstructs | Machine translation, abstractive summarisation, document Q&A, structured prediction tasks |
Encoder-Only: Bidirectional Context
Every token in the encoder attends to every other token — past and future. This bidirectional attention gives a rich contextual representation of each position, ideal for classification and understanding tasks where the full input is available. BERT's success popularised this approach for NLP fine-tuning benchmarks.
Decoder-Only: Causal Masking
A decoder processes tokens left-to-right. To prevent "cheating" during training, a causal mask (also called an autoregressive mask) blocks each position from attending to future positions. The mask is a strictly upper-triangular matrix of −∞ values added to the attention logits before the softmax, zeroing out future attention weights. At inference, tokens are generated one at a time, each attending to all previously generated tokens.
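A minimal sketch in plain Python (no tensor library) shows how such a mask interacts with the softmax. `causal_mask` and `masked_softmax` are illustrative names, not a real framework API.

```python
import math

def causal_mask(seq_len: int) -> list[list[float]]:
    # 0.0 where attention is allowed (j <= i); -inf where position j
    # is in the future relative to i. Added to logits pre-softmax.
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(logits: list[float], mask_row: list[float]) -> list[float]:
    masked = [l + m for l, m in zip(logits, mask_row)]
    peak = max(masked)                       # subtract max for stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Position 1 of a 3-token sequence: the logit for position 2 is masked,
# so its attention weight comes out exactly 0.
weights = masked_softmax([1.0, 2.0, 3.0], causal_mask(3)[1])
```

Because `exp(−∞) = 0`, the masked positions contribute nothing to the weighted sum, no matter how large their raw logits were.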
Encoder-Decoder: Cross-Attention Bridge
The decoder in an encoder-decoder model has a special cross-attention layer in addition to self-attention. Queries come from the decoder's own hidden states, but keys and values come from the encoder's output. This lets every decoder position look at any encoder position, enabling the model to condition generation directly on the full input representation.
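A single-head sketch in plain Python illustrates the wiring; it omits the learned projection matrices a real model applies to produce Q, K, and V, and the vectors are made-up toy values.

```python
import math

def attention(queries, keys, values):
    # Scaled dot-product attention over plain Python lists of vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Cross-attention: the query comes from the decoder's hidden state,
# while keys and values both come from the encoder's output.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoder positions
decoder_state = [[1.0, 0.0]]                        # 1 decoder position
context = attention(decoder_state, encoder_out, encoder_out)
```

The decoder position ends up with a context vector weighted toward the encoder positions most similar to its query, which is exactly what lets generation condition on the input.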
Why Decoder-Only Dominates Modern LLMs
Despite encoder-decoder models being powerful for structured tasks, the current generation of large LLMs (GPT-4o, Llama 3, Mistral, Claude, DeepSeek) is decoder-only. The key reasons: (1) next-token prediction scales extremely well with data and compute; (2) a single decoder can handle both understanding and generation; (3) instruction fine-tuning turns completion into conversation without any architectural change; (4) decoder-only models can be used for virtually any NLP task by framing it as text generation. Reasoning models (o1, o3, DeepSeek-R1) extend this further: they generate long internal "thinking" traces as ordinary token sequences before producing the final answer.
🧩 Inside the Transformer Block
A Transformer model is built by stacking N identical blocks. Each block refines the token representations. Understanding what happens inside one block is the foundation for understanding everything else about LLMs.
```
# One Transformer block (decoder-only, post-norm variant)
Input: x  [shape: batch × seq_len × d_model]

# Step 1: Multi-Head Self-Attention
residual = x
x = MultiHeadSelfAttention(x)   # each token attends to all previous tokens
x = Dropout(x)
x = LayerNorm(x + residual)     # Add & Norm (residual connection)

# Step 2: Position-wise Feed-Forward Network
residual = x
x = Linear(x, d_model → d_ff)   # expand: d_ff = 4 × d_model typically
x = ReLU(x)                     # or GeLU in modern models
x = Linear(x, d_ff → d_model)   # project back
x = Dropout(x)
x = LayerNorm(x + residual)     # Add & Norm

Output: x  [same shape: batch × seq_len × d_model]
```
| Component | Role | Typical Dimensions (7B model) |
|---|---|---|
| Multi-Head Self-Attention | Lets each token gather information from other relevant tokens; captures syntactic and semantic relationships across the sequence | d_model = 4096, 32 heads, head_dim = 128 |
| Residual Connection (Add) | Adds input directly to output before normalisation; preserves gradient flow through deep stacks; ensures earlier representations are not lost | Same shape as d_model — no params |
| Layer Normalisation (Norm) | Normalises each token's vector to zero mean and unit variance; stabilises training; prevents internal covariate shift | 2 × d_model learnable params (scale + bias) |
| Feed-Forward Network (FFN) | Applies the same two-layer MLP independently to each token position; thought to store factual knowledge; provides non-linearity | d_ff = 11008 (≈ 2.7 × d_model for SwiGLU variants) |
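The Layer Normalisation row above can be made concrete for a single token vector. A minimal sketch assuming the standard LayerNorm formula, with the learnable scale and bias defaulting to the identity:

```python
import math

def layer_norm(x, scale=None, bias=None, eps=1e-5):
    # Normalise one token's vector to zero mean / unit variance, then
    # apply learnable scale and bias (the 2 x d_model parameters).
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    scale = scale or [1.0] * d
    bias = bias or [0.0] * d
    return [(v - mean) / math.sqrt(var + eps) * s + b
            for v, s, b in zip(x, scale, bias)]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])  # zero mean, ~unit variance
```

In the Add & Norm step this runs independently on every token position, after the residual has been added.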
Residual Connections: Why They Matter
Without residual (skip) connections, gradients must flow through every layer of the network during backpropagation. In a 96-layer model like GPT-3, this is catastrophic — gradients vanish. Residual connections provide a "highway" for gradients to travel directly from output to input, enabling training of networks with hundreds of layers. They were introduced in ResNet (2015) for vision and are equally critical in Transformers.
Pre-Norm vs Post-Norm
The original paper applies LayerNorm after the residual add (post-norm). Most modern LLMs (GPT-3, Llama, Mistral) use pre-norm: LayerNorm is applied to the input before the attention or FFN operation. Pre-norm is more stable to train, especially at large scale, because the residual path itself is left unnormalised, giving gradients a clean route through the full depth of the network. RMSNorm (used in Llama) is a simplified variant that skips the mean-centring step for efficiency.
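The difference is only where the normalisation sits relative to the residual add, which a two-line sketch captures. Here `sublayer` stands in for either attention or the FFN, and scalars stand in for tensors:

```python
def post_norm_block(x, sublayer, norm):
    # Original Transformer: normalise AFTER adding the residual.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # GPT-3 / Llama / Mistral: normalise the sublayer input; the
    # residual path x stays untouched, so gradients skip the norm.
    return x + sublayer(norm(x))
```

With a non-trivial `norm`, post-norm rescales the whole residual stream at every block, while pre-norm only rescales what each sublayer sees.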
The FFN as Knowledge Store
Research by Geva et al. (2021) found that FFN layers act like key-value memories: the first linear layer identifies which "fact pattern" matches the input, and the second linear layer retrieves the associated information. Experiments show that factual associations (e.g., "Paris is the capital of France") are stored in specific FFN neurons, and editing these neurons can change model outputs — the basis for model editing techniques like ROME and MEMIT.
📍 Positional Encoding
Attention is inherently order-agnostic: if you shuffle the tokens in a sequence, the attention operation produces exactly the same output (just shuffled). A model with no positional information cannot distinguish "The dog bit the man" from "The man bit the dog." Positional encodings inject order information into token representations.
Sinusoidal Encoding (Original Paper)
Vaswani et al. added a fixed, deterministic positional vector to each token embedding before the first layer. The vector is computed from sine and cosine functions whose frequencies decrease geometrically across the embedding dimensions:
```
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
```
Different dimensions encode position at different scales — low-frequency dimensions capture global order; high-frequency dimensions capture local position. No training required; generalises to unseen lengths.
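The formula translates directly into a few lines of Python. A sketch assuming the interleaved sin/cos layout of the original paper:

```python
import math

def sinusoidal_encoding(pos: int, d_model: int) -> list[float]:
    # Even dimensions use sine, odd dimensions use cosine, with
    # frequencies falling geometrically across dimension pairs.
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

vec = sinusoidal_encoding(pos=3, d_model=8)
```

Because the formula is deterministic, a vector for any position, including ones never seen in training, can be computed on the fly.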
Learned Positional Embeddings
Rather than a fixed formula, learned embeddings treat each position index as a learnable lookup — just like token embeddings. The model learns what positional representation works best from data. Used in BERT, GPT-2, and early GPT-3. The downside: the vocabulary of positions is fixed at training time; positions beyond the training maximum are unseen and produce poor representations. This limits generalisation to longer sequences.
RoPE — Rotary Positional Embeddings
Introduced by Su et al. (2021) and adopted by Llama, Mistral, Falcon, Qwen, and most modern open-source LLMs. Instead of adding a positional vector, RoPE rotates the query and key vectors by an angle proportional to their absolute position before computing dot-product attention. The key property: the dot product between a query at position m and a key at position n depends only on their relative offset (m − n), not their absolute values. This gives the model a natural sense of "how far apart" two tokens are.
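The relative-offset property can be checked numerically with a toy 4-dimensional rotation. `rotate` is an illustrative sketch of the RoPE idea, not a production implementation (real models rotate per attention head, and codebases differ on interleaved versus half-split dimension layouts):

```python
import math

def rotate(vec, position, theta=10000.0):
    # Rotate each (even, odd) dimension pair by an angle proportional
    # to the token's absolute position; the frequency varies per pair.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        angle = position / (theta ** (i / d))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same relative offset (m - n = 3) at two different absolute positions:
score_near = dot(rotate(q, 10), rotate(k, 7))
score_far = dot(rotate(q, 110), rotate(k, 107))
# score_near and score_far agree to floating-point precision
```

Shifting both positions by the same amount leaves every attention score unchanged, which is exactly the relative-position property described above.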
RoPE and Context Length Extrapolation
Because RoPE encodes relative position, it generalises more gracefully beyond training context lengths than learned absolute embeddings. Techniques like YaRN, NTK-aware scaling, and LongRoPE extend this further by rescaling the rotation frequencies, allowing models trained on 4k tokens to operate on 128k or even 1M+ tokens after fine-tuning. This is why Llama 3.1 can offer a 128k context window starting from a base model trained on shorter sequences.
| Method | Encoding Type | Max Length at Train | Extrapolation | Used By |
|---|---|---|---|---|
| Sinusoidal | Absolute, fixed | 512 | Poor beyond train length | Original Transformer, some early BERT variants |
| Learned Absolute | Absolute, trained | 2048 (GPT-3) | Very poor — unseen positions | BERT, GPT-2, GPT-3 |
| RoPE | Relative, applied to Q/K | 4096–8192 | Good; extendable with YaRN | Llama 2/3, Mistral, Falcon, Qwen |
| ALiBi | Relative, attention bias | 2048 | Good linear extrapolation | MPT, BLOOM variants |
📈 Scaling & Context Length
The power of Transformers comes not just from the architecture but from scaling — stacking more layers, widening the hidden dimension, and training on vastly more data. Understanding how parameters and compute interact is essential for anyone working with LLMs.
Parameter Count Formula
For a decoder-only Transformer, approximate total parameters:
```python
# Approximate decoder-only parameter count (a rough sketch)
def approx_params(vocab_size, d_model, d_ff, num_layers):
    embedding = vocab_size * d_model
    per_block = (
        4 * d_model * d_model   # Q, K, V, O projections (full MHA)
        + 2 * d_model * d_ff    # FFN up + down projections
        + 4 * d_model           # 2 LayerNorms, scale + bias each
    )
    return embedding + num_layers * per_block

# Example: Llama 3 8B (d_model=4096, d_ff=14336, num_layers=32,
# vocab=128,256). The naive formula gives ~6.4B; the actual 8B total
# also counts the untied output embedding, the third (gated) SwiGLU
# FFN matrix, and GQA's smaller K/V projections.
approx_params(128_256, 4096, 14336, 32)
```
Context Window Evolution
The maximum sequence length a model can process has grown dramatically:
- GPT-1 (2018): 512 tokens
- BERT (2018): 512 tokens
- GPT-3 (2020): 2,048 tokens
- GPT-4 Turbo (2023): 128k tokens
- GPT-4o (2024): 128k tokens
- Claude 3.5 / 3.7 (2024–25): 200k tokens
- Llama 3.1 / 3.3 (2024): 128k tokens
- Gemini 1.5 Pro (2024): 1,000,000 tokens
- Gemini 2.0 / 2.5 (2025): 1,000,000+ tokens
- o1 / o3 / o4-mini (2024–25): 200k tokens
KV-Cache: Inference Efficiency
During autoregressive generation, the model produces one token at a time. Without caching, it would recompute keys and values for all previous tokens at every step, so generating n tokens costs O(n²) key/value computation overall. The KV-cache stores the key and value tensors for all past positions: each new token computes only its own Q, K, V and attends to the cached K, V. This cuts total key/value computation from O(n²) to O(n), at the cost of memory proportional to sequence length × model size.
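The mechanism is easy to sketch with stand-in callables. `project_qkv` and `attend` are hypothetical placeholders for the per-token projection and attention steps:

```python
def generate_with_kv_cache(num_steps, project_qkv, attend):
    # Toy generation loop: keys/values for past tokens are cached,
    # so each step projects only the newest token (O(1) projection
    # work per step) and then attends over the growing cache.
    k_cache, v_cache = [], []
    token = 0  # toy starting token id
    for _ in range(num_steps):
        q, k, v = project_qkv(token)
        k_cache.append(k)
        v_cache.append(v)
        token = attend(q, k_cache, v_cache)
    return k_cache, v_cache
```

The caches grow by one entry per generated token, which is exactly why KV-cache memory scales linearly with sequence length.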
Quadratic Attention Complexity and Modern Mitigations
Standard dot-product attention scales as O(n²) in both time and memory with sequence length n — doubling the context quadruples the cost. At 128k–1M tokens, this is computationally brutal. Mitigations include: Flash Attention 2/3 (reduces memory bandwidth via IO-aware kernel fusion — FA3 reaches ~75% of H100 hardware peak), Grouped-Query Attention (GQA) (reduces KV-cache memory, used in Llama 3, Gemma, Mistral), Multi-head Latent Attention (MLA) (DeepSeek-V3 approach: projects K/V through a low-rank bottleneck, reducing KV-cache by over 90%), Sliding Window Attention (each token attends only to a local window, used in Mistral), and Mixture of Experts (MoE) routing which reduces active parameter compute per token (Mixtral, Gemini 1.5/2.x, DeepSeek-V3). For most use cases below 32k tokens, Flash Attention 2 is sufficient; for 1M+ context, GQA + MoE + efficient positional encoding (RoPE with YaRN scaling) are all required together.
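Of these, sliding window attention is the simplest to sketch: it is just a causal mask that additionally forbids attention beyond a fixed lookback. This toy mask builder is illustrative, not how any production kernel is written:

```python
import math

def sliding_window_mask(seq_len: int, window: int) -> list[list[float]]:
    # Causal mask restricted to the last `window` positions: position i
    # may attend to position j only when i - window < j <= i.
    return [[0.0 if i - window < j <= i else -math.inf
             for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
```

Each row now has at most `window` allowed entries, so attention cost per token becomes O(window) instead of O(n), at the price of no direct long-range attention within a single layer.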
| Scale | Model Examples | Approximate Parameters | Training Compute (FLOPs) |
|---|---|---|---|
| Small | GPT-2, Phi-4 (14B), Gemma 3 4B, Llama 3.2 3B | 125M – 14B | 10²² – 10²³ |
| Medium | Llama 3.1 8B, Mistral Small 3 (24B), Gemma 3 27B | 7B – 30B | 10²³ – 10²⁴ |
| Large | Llama 3.3 70B, Qwen2.5 72B, Mixtral 8×22B | 40B – 100B dense (or MoE equivalents) | 10²⁴ – 10²⁵ |
| Frontier | GPT-4o, Llama 3.1 405B, DeepSeek-V3 (671B MoE), Gemini 2.5 Pro | 200B dense – 1T+ MoE total (37B – 440B active) | 10²⁵ – 10²⁶+ |