Why Sequences Need Special Handling
Many of the most important data types in the real world are sequences: words in a sentence, frames in a video, measurements from a sensor over time, notes in a melody. Feedforward networks handle fixed-size inputs, but sequences are fundamentally different in two ways: they have variable length, and the order of elements carries meaning.
The Variable-Length Problem
An MLP requires a fixed input dimension. A sentence of 5 words and a sentence of 50 words cannot both feed into the same input layer without padding or truncation. More critically, padding and truncation discard information: sequence length itself is informative, and relationships between distant elements matter.
- Sentences: "The cat sat" vs "The cat, despite being tired and hungry, still sat"
- Time series: variable-length sensor readings between events
- Audio: speech utterances of different durations
- Code: programs range from 5 lines to 5 million lines
The Hidden State Concept
An RNN solves the variable-length problem by processing the sequence one step at a time, maintaining a hidden state vector (h_t) that summarises everything seen so far. At each step t, the network takes the current input x_t and the previous hidden state h_{t-1}, computes a new hidden state h_t, and optionally produces an output y_t. The same weights are used at every step: weight sharing across time.
- h_t acts as a compressed "memory" of the sequence so far
- Same W matrices reused at every timestep (parameter efficiency)
- Can process sequences of any length with fixed parameter count
- Final hidden state h_T can represent the entire sequence
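This loop can be sketched in a few lines of pure Python. The sketch below is a toy with scalar (1-dimensional) states and made-up weights W_x, W_h, b; a real RNN uses weight matrices and vector states:

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    # One recurrence step: h_t = tanh(W_x * x_t + W_h * h_{t-1} + b)
    return math.tanh(W_x * x + W_h * h_prev + b)

def encode(sequence, W_x=0.5, W_h=0.8, b=0.0):
    # The same three parameters are reused at every step, so sequences
    # of any length produce a single fixed-size summary h_T.
    h = 0.0  # initial hidden state h_0
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

short_summary = encode([1.0, -1.0])                  # length-2 sequence
long_summary = encode([1.0, -1.0, 0.5, 0.2, 0.9])    # length-5, same parameters
```

Note that `encode` returns a fixed-size summary regardless of input length; that is the whole trick.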
Key Use Cases for Sequence Models
Text / NLP
Sentiment analysis, machine translation, text generation, question answering, named entity recognition
Speech
Automatic speech recognition (ASR), text-to-speech synthesis, speaker diarisation
Time Series
Financial forecasting, predictive maintenance, anomaly detection, weather prediction
Video
Action recognition, video captioning, temporal event detection
Vanilla RNN
The simplest RNN updates its hidden state using a single equation. Understanding this equation, and its failure modes, is the key to appreciating why the LSTM and GRU were invented.
# Vanilla RNN - the recurrence relation
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
# Where:
# x_t     = input vector at timestep t (shape: input_dim)
# h_{t-1} = hidden state from previous step (shape: hidden_dim)
# W_x     = input-to-hidden weight matrix (shape: hidden_dim × input_dim)
# W_h     = hidden-to-hidden weight matrix (shape: hidden_dim × hidden_dim)
# b       = bias vector (shape: hidden_dim)
# h_t     = new hidden state (shape: hidden_dim)
# tanh    = activation function (squashes to −1..+1)
# Output at each step (for sequence-to-sequence tasks):
y_t = softmax(W_y · h_t + b_y)
# Unrolled through 4 timesteps (e.g. "cats are cute"):
h_0 → h_1 → h_2 → h_3 → h_4
       ↑     ↑     ↑     ↑
      x_1   x_2   x_3   x_4
  ("cats") ("are") ("cute") (EOS)
# The same W_x, W_h, b are used at EVERY timestep.
Backpropagation Through Time (BPTT)
Training RNNs uses BPTT: unroll the network through all T timesteps, then apply the chain rule backwards through every step. The gradient that reaches an early step t is the product of the Jacobians of every step between T and t.
- For T timesteps, compute gradient through T multiplications
- If the gradient magnitudes are <1, multiplying them T times → 0 (vanishing)
- If gradient magnitudes are >1, multiplying T times → ∞ (exploding)
- Gradient clipping can partially address exploding gradients
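These multiplicative dynamics, and the clipping remedy, can be checked numerically (the clipping threshold below is an arbitrary illustrative value):

```python
def product_of_factors(factor, T):
    # Multiply T per-step gradient factors together, as BPTT effectively does
    p = 1.0
    for _ in range(T):
        p *= factor
    return p

vanishing = product_of_factors(0.9, 50)  # ~0.005: early steps barely influence the loss
exploding = product_of_factors(1.1, 50)  # ~117: updates blow up

def clip_by_norm(grads, max_norm=5.0):
    # Rescale the whole gradient vector when its L2 norm exceeds max_norm
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads
```

Clipping caps the size of each update, so it tames exploding gradients, but it does nothing for vanishing ones: a tiny gradient stays tiny.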
Why Vanilla RNNs Fail on Long Sequences
The vanishing gradient problem is catastrophic for long sequences. Gradients representing the influence of early timesteps shrink exponentially as they're propagated back through many steps. By step 50, the network has effectively forgotten what happened at step 1: it cannot learn long-range dependencies.
- Information from 50+ steps ago is essentially lost
- Critical for tasks like pronoun resolution ("The trophy wouldn't fit in the box because it was too big")
- Machine translation suffers at sentence boundaries
- Practical limit: ~10-20 meaningful steps in vanilla RNNs
The Vanishing Gradient Problem in Detail
During BPTT, the gradient of the loss with respect to the hidden state at step t involves repeated application of the recurrent Jacobian: the matrix W_h scaled by the tanh derivative, which is at most 1. If the largest singular value of W_h is less than 1, so that the Jacobian shrinks vectors, this repeated multiplication drives the gradient exponentially towards zero. Conversely, if the Jacobian expands vectors, gradients explode. This is not a software bug: it is a fundamental mathematical consequence of training deep recurrent structures with gradient-based methods, first analysed by Bengio et al. (1994).
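The effect is easy to reproduce: repeatedly applying a Jacobian whose largest eigenvalue is below 1 shrinks any gradient vector exponentially. A toy 2×2 example with values chosen purely for illustration:

```python
def matvec(M, v):
    # Multiply matrix M by vector v
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return sum(x * x for x in v) ** 0.5

J = [[0.9, 0.0], [0.0, 0.9]]  # recurrent Jacobian, largest eigenvalue 0.9
g = [1.0, 1.0]                # gradient arriving from the loss
for _ in range(50):
    g = matvec(J, g)          # one backward step through time

shrunk = norm(g)  # roughly 0.9**50 times the original norm
```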
Long Short-Term Memory (LSTM)
Invented by Hochreiter and Schmidhuber in 1997, the LSTM is the most successful solution to the vanishing gradient problem. Its key insight: introduce a separate cell state (C_t) that acts like a conveyor belt running through the entire sequence, with gating mechanisms controlling what information is added, removed, or read.
# LSTM equations - all vectors of shape (hidden_dim,)
# σ = sigmoid function (0 to 1); tanh = hyperbolic tangent (−1 to 1)
# [h_{t-1}, x_t] = concatenation of previous hidden state and current input
# 1. FORGET GATE - what to erase from cell state
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
# f_t ≈ 0: forget this memory; f_t ≈ 1: keep this memory
# 2. INPUT GATE - what new information to store
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)       # how much to add
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   # candidate values to add
# 3. CELL STATE UPDATE - update the conveyor belt
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
# ⊙ = element-wise (Hadamard) multiplication
# 4. OUTPUT GATE - what to expose as hidden state
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
# The cell state C_t can persist information across hundreds of timesteps
# because it only involves addition (+), not multiplication through many layers.
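The four equations can be sketched as a runnable cell. This toy version uses scalar states and arbitrary illustrative weights; W* acts on the input and U* on the previous hidden state, which is equivalent to a single matrix applied to the concatenation [h_{t-1}, x_t]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # Scalar LSTM step following the equations above
    f = sigmoid(p['Wf'] * x + p['Uf'] * h_prev + p['bf'])          # forget gate
    i = sigmoid(p['Wi'] * x + p['Ui'] * h_prev + p['bi'])          # input gate
    c_tilde = math.tanh(p['Wc'] * x + p['Uc'] * h_prev + p['bc'])  # candidate
    c = f * c_prev + i * c_tilde                                   # additive cell update
    o = sigmoid(p['Wo'] * x + p['Uo'] * h_prev + p['bo'])          # output gate
    h = o * math.tanh(c)                                           # exposed hidden state
    return h, c

# Arbitrary illustrative weights (all 0.5); a real model learns these.
params = {k: 0.5 for k in ['Wf', 'Uf', 'bf', 'Wi', 'Ui', 'bi',
                           'Wc', 'Uc', 'bc', 'Wo', 'Uo', 'bo']}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

Notice that h is always bounded by the final tanh, while the cell state c is only ever gated and added to, never squashed.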
| Gate | Purpose | Key Equation |
|---|---|---|
| Forget Gate (f_t) | Decides which information from the previous cell state to discard. For a language model, this might reset knowledge of the grammatical subject when a new sentence begins. | f_t = σ(W_f · [h_{t-1}, x_t] + b_f) |
| Input Gate (i_t) | Controls how much of the new candidate information C̃_t should be written to the cell state. Prevents every timestep from overwriting the entire memory. | i_t = σ(W_i · [h_{t-1}, x_t] + b_i) |
| Output Gate (o_t) | Controls which parts of the cell state are exposed as the hidden state h_t. Allows the LSTM to retain information internally without necessarily making it available to the output at every step. | o_t = σ(W_o · [h_{t-1}, x_t] + b_o) |
| Cell State (C_t) | The "conveyor belt" of LSTM memory. Gradients flow through the additive cell update almost unimpeded, allowing the LSTM to learn dependencies over hundreds of timesteps. | C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t |
Why the Cell State Solves Vanishing Gradients
The cell state update C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t is an additive operation. When gradients flow back through this addition, they are not multiplied by a weight matrix at every step; they pass through with minimal decay. This is the same intuition behind ResNet's skip connections. If the forget gate f_t is kept close to 1 (remember everything), the gradient highway from C_T back to C_1 is essentially clear, allowing the network to train on sequences thousands of steps long.
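This "gradient highway" claim can be checked directly: along the cell path, the gradient of C_T with respect to C_1 is just the running product of the forget-gate values, with no weight matrix in between. A scalar sketch:

```python
def cell_gradient(forget_gates):
    # dC_T/dC_1 along the additive cell path is the product of the
    # forget-gate values alone: no weight-matrix multiplication involved
    g = 1.0
    for f in forget_gates:
        g *= f
    return g

open_gate = cell_gradient([1.0] * 1000)  # gates fully open: the highway is clear
leaky = cell_gradient([0.9] * 50)        # decays, like a vanilla RNN would
```

With fully open gates the gradient survives a thousand steps untouched; whether the gates stay open is something the network learns.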
Gated Recurrent Unit (GRU)
Proposed by Cho et al. in 2014, the GRU simplifies the LSTM by merging the cell state and hidden state into one, and collapsing three gates into two. This reduces parameters by ~25% while achieving comparable performance on most tasks.
# GRU equations
# Two gates instead of three; no separate cell state
# 1. RESET GATE - how much of the past to forget when computing the candidate
r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
# 2. UPDATE GATE - how much of the old state to keep vs. replace
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
# 3. CANDIDATE HIDDEN STATE
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t] + b)
# The reset gate r_t selectively zeros out the past hidden state
# before computing the candidate - a "soft reset"
# 4. FINAL HIDDEN STATE - interpolate between old and new
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
# z_t ≈ 0: keep old state (long-term memory)
# z_t ≈ 1: take new candidate (short-term update)
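As with the LSTM above, the equations can be sketched as a runnable scalar cell with arbitrary illustrative weights; W* acts on the input and U* on the previous hidden state, mirroring the concatenated form [h_{t-1}, x_t]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    # Scalar GRU step following the equations above
    r = sigmoid(p['Wr'] * x + p['Ur'] * h_prev + p['br'])             # reset gate
    z = sigmoid(p['Wz'] * x + p['Uz'] * h_prev + p['bz'])             # update gate
    h_tilde = math.tanh(p['W'] * x + p['U'] * (r * h_prev) + p['b'])  # candidate
    return (1 - z) * h_prev + z * h_tilde                             # interpolation

# Arbitrary illustrative weights (all 0.5); a real model learns these.
params = {k: 0.5 for k in ['Wr', 'Ur', 'br', 'Wz', 'Uz', 'bz', 'W', 'U', 'b']}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
```

Because h_t is a convex combination of the old state and a tanh-bounded candidate, the GRU's single state always stays in (−1, 1).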
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 gates (forget, input, output) | 2 gates (reset, update) |
| States | 2 states: cell state C_t + hidden state h_t | 1 state: hidden state h_t only |
| Parameters | 4 × (hidden² + hidden×input) weight matrices | 3 × (hidden² + hidden×input): ~25% fewer |
| Compute | Slower per step due to more operations | Faster per step; often better on small datasets |
| Long-range memory | Generally slightly better on very long sequences (separate cell state is advantageous) | Comparable in most practical tasks; slightly weaker on very long dependencies |
| When to prefer | Long sequences (>200 steps), large datasets, when every unit of performance matters | Shorter sequences, limited compute/memory, rapid prototyping, smaller datasets |
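The ~25% figure follows directly from counting gated weight blocks, assuming the standard formulation (per block: one input matrix, one recurrent matrix, one bias vector):

```python
def lstm_params(input_dim, hidden_dim):
    # 4 blocks: forget, input, candidate, output
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

def gru_params(input_dim, hidden_dim):
    # 3 blocks: reset, update, candidate
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

lstm_n = lstm_params(128, 256)     # 4 * (256*128 + 256*256 + 256)
gru_n = gru_params(128, 256)
ratio = gru_n / lstm_n             # exactly 3/4, i.e. 25% fewer parameters
```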
Modern Alternatives
RNNs dominated sequence modelling from the mid-2010s until around 2018-2020, when Transformers largely replaced them for NLP. However, several RNN variants and non-recurrent alternatives remain relevant.
Bidirectional RNNs
A standard RNN only sees past context. A bidirectional RNN runs two RNNs: one forwards (t=1 to T) and one backwards (t=T to 1). The hidden states from both directions are concatenated at each timestep, giving the model access to both past and future context.
- Output at step t: [h_forward_t ; h_backward_t] (doubled dimension)
- Useful when the entire sequence is available (not real-time)
- ELMo, built on bidirectional LSTMs, was an influential predecessor of BERT
- Cannot be used for real-time or autoregressive tasks
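The forward/backward wiring can be sketched by running the same scalar RNN (hypothetical weights, as before) in both directions and pairing the states per timestep:

```python
import math

def rnn_step(x, h, W_x=0.5, W_h=0.8, b=0.0):
    # Scalar RNN step with made-up weights, for illustration only
    return math.tanh(W_x * x + W_h * h + b)

def run(sequence):
    # Hidden state at every timestep for a single direction
    states, h = [], 0.0
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states

def bidirectional(sequence):
    fwd = run(sequence)              # processes t = 1 .. T
    bwd = run(sequence[::-1])[::-1]  # processes t = T .. 1, then re-aligned
    # Pair per timestep: each output sees past AND future context,
    # and the output dimension doubles
    return list(zip(fwd, bwd))

outputs = bidirectional([1.0, -1.0, 0.5])  # 3 (forward, backward) pairs
```

The backward pass is why this needs the whole sequence up front: bwd[0] depends on the last element, which a streaming system has not seen yet.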
Temporal Convolutional Networks (TCN)
TCNs apply 1D convolutions along the time dimension with causal masking (filters only see the past) and dilation (exponentially increasing gaps between filter elements). Dilated convolutions allow the receptive field to grow exponentially with layers, capturing very long-range dependencies without recurrence.
- No hidden state: fully parallelisable during training
- Dilation 2^l at layer l gives exponential receptive field growth
- Comparable to LSTM on many tasks, faster to train
- WaveNet (DeepMind) is a famous dilated causal CNN
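The exponential receptive-field growth is easy to verify. Assuming one causal convolution per layer with kernel size k and dilation 2^l at layer l, each layer extends the visible context by (k−1)·2^l steps:

```python
def receptive_field(kernel_size, num_layers):
    # Start from the current timestep (1), then add each layer's reach
    rf = 1
    for l in range(num_layers):
        rf += (kernel_size - 1) * 2 ** l
    return rf

depth_10 = receptive_field(2, 10)  # ten layers already cover 1024 timesteps
```

A recurrent network would need 1024 serial steps to carry information that far; the dilated stack does it in 10 layers of parallel convolutions.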
Why Transformers Replaced RNNs for NLP
The 2017 paper "Attention Is All You Need" showed that self-attention could model long-range dependencies more effectively than LSTMs, while being fully parallelisable. RNNs process sequences step by step: T steps of serial computation per sequence. Transformers process all positions simultaneously.
- Transformers: O(T²·d) time but fully parallel; RNNs: O(T·d²) but serial
- GPT, BERT, LLaMA: all Transformer-based, none RNN-based
- Self-attention directly models any token-to-token relationship
- Scale: Transformers absorb more compute and data efficiently
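The O(T²·d) vs O(T·d²) trade-off can be made concrete with rough FLOP counts (illustrative only; constants and feedforward layers are ignored):

```python
def transformer_flops(T, d):
    # Self-attention: every position attends to every other, ~T*T*d work,
    # but all positions are computed in parallel
    return T * T * d

def rnn_flops(T, d):
    # T serial steps, each multiplying by a d x d recurrent matrix
    return T * d * d

short_seq = (transformer_flops(128, 512), rnn_flops(128, 512))
long_seq = (transformer_flops(100_000, 512), rnn_flops(100_000, 512))
```

For short sequences (T < d) attention is actually cheaper in raw FLOPs; the quadratic term only dominates once T grows past d, and in practice the RNN's serial dependency hurts wall-clock training time regardless.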
Where RNNs Still Excel
RNNs are not obsolete. For edge / embedded deployment, a compact GRU with a 64-unit hidden state has a negligible memory footprint and produces outputs in real time with O(1) memory (the hidden state is fixed-size regardless of sequence length). For streaming inference, where you process one token at a time as it arrives and cannot wait for the full sequence, RNNs are natural; Transformers require the full context window. Newer hybrid architectures like Mamba (2023) and RWKV attempt to combine RNN-like streaming efficiency with Transformer-like modelling quality.