Why Sequences Need Special Handling
Many of the most important data types in the real world are sequences: words in a sentence, frames in a video, measurements from a sensor over time, notes in a melody. Feedforward networks handle fixed-size inputs, but sequences are fundamentally different in two ways: they have variable length, and the order of elements carries meaning.
The Variable-Length Problem
An MLP requires a fixed input dimension. A sentence of 5 words and a sentence of 50 words cannot both feed into the same input layer without padding or truncation. More critically, padding and truncation discard information: sequence length itself is informative, and relationships between distant elements matter.
- Sentences: "The cat sat" vs "The cat, despite being tired and hungry, still sat"
- Time series: variable-length sensor readings between events
- Audio: speech utterances of different durations
- Code: programs range from 5 lines to 5 million lines
The Hidden State Concept
An RNN solves the variable-length problem by processing the sequence one step at a time, maintaining a hidden state vector (h_t) that summarises everything seen so far. At each step t, the network takes the current input x_t and the previous hidden state h_{t-1}, computes a new hidden state h_t, and optionally produces an output y_t. The same weights are used at every step: weight sharing across time.
- h_t acts as a compressed "memory" of the sequence so far
- Same W matrices reused at every timestep (parameter efficiency)
- Can process sequences of any length with fixed parameter count
- Final hidden state h_T can represent the entire sequence
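This loop can be sketched in a few lines of pure Python. The sketch below is a toy with scalar (1-dimensional) states and made-up weights W_x, W_h, b; a real RNN uses weight matrices and vector states:

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    # One recurrence step: h_t = tanh(W_x * x_t + W_h * h_{t-1} + b)
    return math.tanh(W_x * x + W_h * h_prev + b)

def encode(sequence, W_x=0.5, W_h=0.8, b=0.0):
    # The same three parameters are reused at every step, so sequences
    # of any length produce a single fixed-size summary h_T.
    h = 0.0  # initial hidden state h_0
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

short_summary = encode([1.0, -1.0])                  # length-2 sequence
long_summary = encode([1.0, -1.0, 0.5, 0.2, 0.9])    # length-5, same parameters
```

Note that `encode` returns a fixed-size summary regardless of input length; that is the whole trick.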
Key Use Cases for Sequence Models
Text / NLP
Sentiment analysis, machine translation, text generation, question answering, named entity recognition
Speech
Automatic speech recognition (ASR), text-to-speech synthesis, speaker diarisation
Time Series
Financial forecasting, predictive maintenance, anomaly detection, weather prediction
Video
Action recognition, video captioning, temporal event detection
Vanilla RNN
The simplest RNN updates its hidden state using a single equation. Understanding this equation, and its failure modes, is the key to appreciating why the LSTM and GRU were invented.
# Vanilla RNN - the recurrence relation
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
# Where:
# x_t     = input vector at timestep t (shape: input_dim)
# h_{t-1} = hidden state from previous step (shape: hidden_dim)
# W_x     = input-to-hidden weight matrix (shape: hidden_dim × input_dim)
# W_h     = hidden-to-hidden weight matrix (shape: hidden_dim × hidden_dim)
# b       = bias vector (shape: hidden_dim)
# h_t     = new hidden state (shape: hidden_dim)
# tanh    = activation function (squashes to −1..+1)
# Output at each step (for sequence-to-sequence tasks):
y_t = softmax(W_y · h_t + b_y)
# Unrolled through 4 timesteps (e.g. "cats are cute"):
h_0 → h_1 → h_2 → h_3 → h_4
       ↑     ↑     ↑     ↑
      x_1   x_2   x_3   x_4
  ("cats") ("are") ("cute") (EOS)
# The same W_x, W_h, b are used at EVERY timestep.
Backpropagation Through Time (BPTT)
Training RNNs uses BPTT: unroll the network through all T timesteps, then apply the chain rule backwards through every step. The gradient that reaches an early step t is the product of the Jacobians of every step between T and t.
- For T timesteps, compute gradient through T multiplications
- If the gradient magnitudes are <1, multiplying them T times → 0 (vanishing)
- If gradient magnitudes are >1, multiplying T times → ∞ (exploding)
- Gradient clipping can partially address exploding gradients
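These multiplicative dynamics, and the clipping remedy, can be checked numerically (the clipping threshold below is an arbitrary illustrative value):

```python
def product_of_factors(factor, T):
    # Multiply T per-step gradient factors together, as BPTT effectively does
    p = 1.0
    for _ in range(T):
        p *= factor
    return p

vanishing = product_of_factors(0.9, 50)  # ~0.005: early steps barely influence the loss
exploding = product_of_factors(1.1, 50)  # ~117: updates blow up

def clip_by_norm(grads, max_norm=5.0):
    # Rescale the whole gradient vector when its L2 norm exceeds max_norm
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads
```

Clipping caps the size of each update, so it tames exploding gradients, but it does nothing for vanishing ones: a tiny gradient stays tiny.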
Why Vanilla RNNs Fail on Long Sequences
The vanishing gradient problem is catastrophic for long sequences. Gradients representing the influence of early timesteps shrink exponentially as they're propagated back through many steps. By step 50, the network has effectively forgotten what happened at step 1: it cannot learn long-range dependencies.
- Information from 50+ steps ago is essentially lost
- Critical for tasks like pronoun resolution ("The trophy wouldn't fit in the box because it was too big")
- Machine translation suffers at sentence boundaries
- Practical limit: ~10-20 meaningful steps in vanilla RNNs
The Vanishing Gradient Problem in Detail
During BPTT, the gradient of the loss with respect to the hidden state at step t involves repeated application of the recurrent Jacobian: the matrix W_h scaled by the tanh derivative, which is at most 1. If the largest singular value of W_h is less than 1, so that the Jacobian shrinks vectors, this repeated multiplication drives the gradient exponentially towards zero. Conversely, if the Jacobian expands vectors, gradients explode. This is not a software bug: it is a fundamental mathematical consequence of training deep recurrent structures with gradient-based methods, first analysed by Bengio et al. (1994).
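The effect is easy to reproduce: repeatedly applying a Jacobian whose largest eigenvalue is below 1 shrinks any gradient vector exponentially. A toy 2×2 example with values chosen purely for illustration:

```python
def matvec(M, v):
    # Multiply matrix M by vector v
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return sum(x * x for x in v) ** 0.5

J = [[0.9, 0.0], [0.0, 0.9]]  # recurrent Jacobian, largest eigenvalue 0.9
g = [1.0, 1.0]                # gradient arriving from the loss
for _ in range(50):
    g = matvec(J, g)          # one backward step through time

shrunk = norm(g)  # roughly 0.9**50 times the original norm
```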
Long Short-Term Memory (LSTM)
Invented by Hochreiter and Schmidhuber in 1997, the LSTM is the most successful solution to the vanishing gradient problem. Its key insight: introduce a separate cell state (C_t) that acts like a conveyor belt running through the entire sequence, with gating mechanisms controlling what information is added, removed, or read.
# LSTM equations - all vectors of shape (hidden_dim,)
# σ = sigmoid function (0 to 1); tanh = hyperbolic tangent (−1 to 1)
# [h_{t-1}, x_t] = concatenation of previous hidden state and current input
# 1. FORGET GATE - what to erase from cell state
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
# f_t ≈ 0: forget this memory; f_t ≈ 1: keep this memory
# 2. INPUT GATE - what new information to store
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)       # how much to add
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   # candidate values to add
# 3. CELL STATE UPDATE - update the conveyor belt
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
# ⊙ = element-wise (Hadamard) multiplication
# 4. OUTPUT GATE - what to expose as hidden state
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
# The cell state C_t can persist information across hundreds of timesteps
# because it only involves addition (+), not multiplication through many layers.
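The four equations can be sketched as a runnable cell. This toy version uses scalar states and arbitrary illustrative weights; W* acts on the input and U* on the previous hidden state, which is equivalent to a single matrix applied to the concatenation [h_{t-1}, x_t]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # Scalar LSTM step following the equations above
    f = sigmoid(p['Wf'] * x + p['Uf'] * h_prev + p['bf'])          # forget gate
    i = sigmoid(p['Wi'] * x + p['Ui'] * h_prev + p['bi'])          # input gate
    c_tilde = math.tanh(p['Wc'] * x + p['Uc'] * h_prev + p['bc'])  # candidate
    c = f * c_prev + i * c_tilde                                   # additive cell update
    o = sigmoid(p['Wo'] * x + p['Uo'] * h_prev + p['bo'])          # output gate
    h = o * math.tanh(c)                                           # exposed hidden state
    return h, c

# Arbitrary illustrative weights (all 0.5); a real model learns these.
params = {k: 0.5 for k in ['Wf', 'Uf', 'bf', 'Wi', 'Ui', 'bi',
                           'Wc', 'Uc', 'bc', 'Wo', 'Uo', 'bo']}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

Notice that h is always bounded by the final tanh, while the cell state c is only ever gated and added to, never squashed.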
| Gate | Purpose | Key Equation |
|---|---|---|
| Forget Gate (f_t) | Decides which information from the previous cell state to discard. For a language model, this might reset knowledge of the grammatical subject when a new sentence begins. | f_t = σ(W_f · [h_{t-1}, x_t] + b_f) |
| Input Gate (i_t) | Controls how much of the new candidate information C̃_t should be written to the cell state. Prevents every timestep from overwriting the entire memory. | i_t = σ(W_i · [h_{t-1}, x_t] + b_i) |
| Output Gate (o_t) | Controls which parts of the cell state are exposed as the hidden state h_t. Allows the LSTM to retain information internally without necessarily making it available to the output at every step. | o_t = σ(W_o · [h_{t-1}, x_t] + b_o) |
| Cell State (C_t) | The "conveyor belt" of LSTM memory. Gradients flow through the additive cell update almost unimpeded, allowing the LSTM to learn dependencies over hundreds of timesteps. | C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t |
Why the Cell State Solves Vanishing Gradients
The cell state update C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t is an additive operation. When gradients flow back through this addition, they are not multiplied by a weight matrix at every step; they pass through with minimal decay. This is the same intuition behind ResNet's skip connections. If the forget gate f_t is kept close to 1 (remember everything), the gradient highway from C_T back to C_1 is essentially clear, allowing the network to train on sequences thousands of steps long.
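This "gradient highway" claim can be checked directly: along the cell path, the gradient of C_T with respect to C_1 is just the running product of the forget-gate values, with no weight matrix in between. A scalar sketch:

```python
def cell_gradient(forget_gates):
    # dC_T/dC_1 along the additive cell path is the product of the
    # forget-gate values alone: no weight-matrix multiplication involved
    g = 1.0
    for f in forget_gates:
        g *= f
    return g

open_gate = cell_gradient([1.0] * 1000)  # gates fully open: the highway is clear
leaky = cell_gradient([0.9] * 50)        # decays, like a vanilla RNN would
```

With fully open gates the gradient survives a thousand steps untouched; whether the gates stay open is something the network learns.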
Gated Recurrent Unit (GRU)
Proposed by Cho et al. in 2014, the GRU simplifies the LSTM by merging the cell state and hidden state into one, and collapsing three gates into two. This reduces parameters by ~25% while achieving comparable performance on most tasks.
# GRU equations
# Two gates instead of three; no separate cell state
# 1. RESET GATE - how much of the past to forget when computing the candidate
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
# 2. UPDATE GATE - how much of the old state to keep vs. replace
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
# 3. CANDIDATE HIDDEN STATE
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t] + b)
# The reset gate r_t selectively zeros out the past hidden state
# before computing the candidate - a "soft reset"
# 4. FINAL HIDDEN STATE - interpolate between old and new
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
# z_t ≈ 0: keep old state (long-term memory)
# z_t ≈ 1: take new candidate (short-term update)
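As with the LSTM above, the equations can be sketched as a runnable scalar cell with arbitrary illustrative weights; W* acts on the input and U* on the previous hidden state, mirroring the concatenated form [h_{t-1}, x_t]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    # Scalar GRU step following the equations above
    r = sigmoid(p['Wr'] * x + p['Ur'] * h_prev + p['br'])             # reset gate
    z = sigmoid(p['Wz'] * x + p['Uz'] * h_prev + p['bz'])             # update gate
    h_tilde = math.tanh(p['W'] * x + p['U'] * (r * h_prev) + p['b'])  # candidate
    return (1 - z) * h_prev + z * h_tilde                             # interpolation

# Arbitrary illustrative weights (all 0.5); a real model learns these.
params = {k: 0.5 for k in ['Wr', 'Ur', 'br', 'Wz', 'Uz', 'bz', 'W', 'U', 'b']}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
```

Because h_t is a convex combination of the old state and a tanh-bounded candidate, the GRU's single state always stays in (−1, 1).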
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 gates (forget, input, output) | 2 gates (reset, update) |
| States | 2 states: cell state C_t + hidden state h_t | 1 state: hidden state h_t only |
| Parameters | 4 × (hidden² + hidden×input) weight matrices | 3 × (hidden² + hidden×input): ~25% fewer |
| Compute | Slower per step due to more operations | Faster per step; often better on small datasets |
| Long-range memory | Generally slightly better on very long sequences (separate cell state is advantageous) | Comparable in most practical tasks; slightly weaker on very long dependencies |
| When to prefer | Long sequences (>200 steps), large datasets, when every unit of performance matters | Shorter sequences, limited compute/memory, rapid prototyping, smaller datasets |
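The ~25% figure follows directly from counting gated weight blocks, assuming the standard formulation (per block: one input matrix, one recurrent matrix, one bias vector):

```python
def lstm_params(input_dim, hidden_dim):
    # 4 blocks: forget, input, candidate, output
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

def gru_params(input_dim, hidden_dim):
    # 3 blocks: reset, update, candidate
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

lstm_n = lstm_params(128, 256)     # 4 * (256*128 + 256*256 + 256)
gru_n = gru_params(128, 256)
ratio = gru_n / lstm_n             # exactly 3/4, i.e. 25% fewer parameters
```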
Modern Alternatives
RNNs dominated sequence modelling from the mid-2010s until around 2018-2020, when Transformers largely replaced them for NLP. However, several RNN variants and non-recurrent alternatives remain relevant.
Bidirectional RNNs
A standard RNN only sees past context. A bidirectional RNN runs two RNNs: one forwards (t=1 to T) and one backwards (t=T to 1). The hidden states from both directions are concatenated at each timestep, giving the model access to both past and future context.
- Output at step t: [h_forward_t ; h_backward_t] (doubled dimension)
- Useful when the entire sequence is available (not real-time)
- ELMo, built on bidirectional LSTMs, was an influential predecessor of BERT
- Cannot be used for real-time or autoregressive tasks
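The forward/backward wiring can be sketched by running the same scalar RNN (hypothetical weights, as before) in both directions and pairing the states per timestep:

```python
import math

def rnn_step(x, h, W_x=0.5, W_h=0.8, b=0.0):
    # Scalar RNN step with made-up weights, for illustration only
    return math.tanh(W_x * x + W_h * h + b)

def run(sequence):
    # Hidden state at every timestep for a single direction
    states, h = [], 0.0
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states

def bidirectional(sequence):
    fwd = run(sequence)              # processes t = 1 .. T
    bwd = run(sequence[::-1])[::-1]  # processes t = T .. 1, then re-aligned
    # Pair per timestep: each output sees past AND future context,
    # and the output dimension doubles
    return list(zip(fwd, bwd))

outputs = bidirectional([1.0, -1.0, 0.5])  # 3 (forward, backward) pairs
```

The backward pass is why this needs the whole sequence up front: bwd[0] depends on the last element, which a streaming system has not seen yet.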
Temporal Convolutional Networks (TCN)
TCNs apply 1D convolutions along the time dimension with causal masking (filters only see the past) and dilation (exponentially increasing gaps between filter elements). Dilated convolutions allow the receptive field to grow exponentially with layers, capturing very long-range dependencies without recurrence.
- No hidden state: fully parallelisable during training
- Dilation 2^l at layer l gives exponential receptive field growth
- Comparable to LSTM on many tasks, faster to train
- WaveNet (DeepMind) is a famous dilated causal CNN
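The exponential receptive-field growth is easy to verify. Assuming one causal convolution per layer with kernel size k and dilation 2^l at layer l, each layer extends the visible context by (k−1)·2^l steps:

```python
def receptive_field(kernel_size, num_layers):
    # Start from the current timestep (1), then add each layer's reach
    rf = 1
    for l in range(num_layers):
        rf += (kernel_size - 1) * 2 ** l
    return rf

depth_10 = receptive_field(2, 10)  # ten layers already cover 1024 timesteps
```

A recurrent network would need 1024 serial steps to carry information that far; the dilated stack does it in 10 layers of parallel convolutions.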
Why Transformers Replaced RNNs for NLP
The 2017 paper "Attention Is All You Need" showed that self-attention could model long-range dependencies more effectively than LSTMs, while being fully parallelisable. RNNs process sequences step by step: T steps of serial computation per sequence. Transformers process all positions simultaneously.
- Transformers: O(T²·d) time but fully parallel; RNNs: O(T·d²) but serial
- GPT, BERT, LLaMA: all Transformer-based, none RNN-based
- Self-attention directly models any token-to-token relationship
- Scale: Transformers absorb more compute and data efficiently
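The O(T²·d) vs O(T·d²) trade-off can be made concrete with rough FLOP counts (illustrative only; constants and feedforward layers are ignored):

```python
def transformer_flops(T, d):
    # Self-attention: every position attends to every other, ~T*T*d work,
    # but all positions are computed in parallel
    return T * T * d

def rnn_flops(T, d):
    # T serial steps, each multiplying by a d x d recurrent matrix
    return T * d * d

short_seq = (transformer_flops(128, 512), rnn_flops(128, 512))
long_seq = (transformer_flops(100_000, 512), rnn_flops(100_000, 512))
```

For short sequences (T < d) attention is actually cheaper in raw FLOPs; the quadratic term only dominates once T grows past d, and in practice the RNN's serial dependency hurts wall-clock training time regardless.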
Where RNNs Still Excel
RNNs are not obsolete. For edge / embedded deployment, a compact GRU with a 64-unit hidden state has a negligible memory footprint and produces outputs in real time with O(1) memory (the hidden state is fixed-size regardless of sequence length). For streaming inference, where you process one token at a time as it arrives and cannot wait for the full sequence, RNNs are natural; Transformers require the full context window. Newer hybrid architectures like Mamba (2023) and RWKV attempt to combine RNN-like streaming efficiency with Transformer-like modelling quality.