⏱ 9 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

πŸ“œ Why Sequences Need Special Handling

Many of the most important data types in the real world are sequences: words in a sentence, frames in a video, measurements from a sensor over time, notes in a melody. Feedforward networks handle fixed-size inputs β€” but sequences are fundamentally different in two ways: they have variable length, and the order of elements carries meaning.

The Variable-Length Problem

An MLP requires a fixed input dimension. A sentence of 5 words and a sentence of 50 words cannot both feed into the same input layer without padding or truncation. Worse, padding-based approaches discard the fact that sequence length is itself informative and that relationships between distant elements matter.

  • Sentences: "The cat sat" vs "The cat, despite being tired and hungry, still sat"
  • Time series: variable-length sensor readings between events
  • Audio: speech utterances of different durations
  • Code: programs range from 5 lines to 5 million lines

The Hidden State Concept

An RNN solves the variable-length problem by processing the sequence one step at a time, maintaining a hidden state vector (hβ‚œ) that summarises everything seen so far. At each step t, the network takes the current input xβ‚œ and the previous hidden state hβ‚œβ‚‹β‚, computes a new hidden state hβ‚œ, and optionally produces an output yβ‚œ. The same weights are used at every step β€” weight sharing across time.

  • hβ‚œ acts as a compressed "memory" of the sequence so far
  • Same W matrices reused at every timestep (parameter efficiency)
  • Can process sequences of any length with fixed parameter count
  • Final hidden state h_T can represent the entire sequence

Key Use Cases for Sequence Models

Text / NLP

Sentiment analysis, machine translation, text generation, question answering, named entity recognition

Speech

Automatic speech recognition (ASR), text-to-speech synthesis, speaker diarisation

Time Series

Financial forecasting, predictive maintenance, anomaly detection, weather prediction

Video

Action recognition, video captioning, temporal event detection

πŸ”„ Vanilla RNN

The simplest RNN updates its hidden state using a single equation. Understanding this equation β€” and its failure modes β€” is the key to appreciating why LSTM and GRU were invented.

# Vanilla RNN β€” the recurrence relation
hβ‚œ = tanh(Wβ‚• Β· hβ‚œβ‚‹β‚ + Wβ‚“ Β· xβ‚œ + b)

# Where:
#   xβ‚œ  = input vector at timestep t          (shape: input_dim)
#   hβ‚œβ‚‹β‚ = hidden state from previous step    (shape: hidden_dim)
#   Wβ‚“  = input-to-hidden weight matrix        (shape: hidden_dim Γ— input_dim)
#   Wβ‚•  = hidden-to-hidden weight matrix       (shape: hidden_dim Γ— hidden_dim)
#   b   = bias vector                          (shape: hidden_dim)
#   hβ‚œ  = new hidden state                     (shape: hidden_dim)
#   tanh = activation function (squashes to βˆ’1..+1)

# Output at each step (for sequence-to-sequence tasks):
yβ‚œ = softmax(Wα΅§ Β· hβ‚œ + bα΅§)

# Unrolled through 4 timesteps (e.g. "cats are cute"):
hβ‚€ ──→ h₁ ──→ hβ‚‚ ──→ h₃ ──→ hβ‚„
  x₁↑     x₂↑     x₃↑     x₄↑
("cats") ("are") ("cute") (EOS)

# The same Wβ‚“, Wβ‚•, b are used at EVERY timestep.
      
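The recurrence above can be sketched directly in NumPy. This is a minimal illustration, not a library API; the names (`rnn_step`, `Wx`, `Wh`) and dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5

# Shared parameters, reused at every timestep (weight sharing across time)
Wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(h_prev, x):
    """One application of h_t = tanh(Wh . h_{t-1} + Wx . x_t + b)."""
    return np.tanh(Wh @ h_prev + Wx @ x + b)

# The same fixed parameters handle a sequence of any length
xs = rng.normal(size=(7, input_dim))   # a 7-step input sequence
h = np.zeros(hidden_dim)               # h_0
for x in xs:
    h = rnn_step(h, x)

print(h.shape)  # (5,) -- the final state h_T summarises the whole sequence
```

Note that the loop length is set by the data, not the parameters: feeding a 70-step sequence through the same `rnn_step` requires no changes.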

Backpropagation Through Time (BPTT)

Training RNNs uses BPTT: unroll the network through all T timesteps, then apply the chain rule backwards through every step. Gradients at step t are computed by multiplying together the Jacobians of all steps from t back to 1.

  • For T timesteps, compute gradient through T multiplications
  • If the gradient magnitudes are <1, multiplying them T times β†’ 0 (vanishing)
  • If gradient magnitudes are >1, multiplying T times β†’ ∞ (exploding)
  • Gradient clipping can partially address exploding gradients
Compute intensive · Truncated BPTT used in practice
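The vanishing/exploding arithmetic is easy to verify with scalars standing in for the per-step Jacobian norms, along with a sketch of clip-by-norm (a hand-rolled illustration, not a framework function):

```python
import numpy as np

# Toy view of BPTT: the gradient reaching step 1 is scaled by a
# product of T per-step factors.
T = 50
print(0.9 ** T)   # ~0.005 -- vanishes
print(1.1 ** T)   # ~117   -- explodes

# Gradient clipping rescales an oversized gradient to a fixed norm,
# which partially addresses the exploding case (not the vanishing one).
def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.full(10, 100.0)                  # an "exploded" gradient
print(np.linalg.norm(clip_by_norm(g)))  # 5.0
```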

Why Vanilla RNNs Fail on Long Sequences

The vanishing gradient problem is catastrophic for long sequences. Gradients representing the influence of early timesteps shrink exponentially as they're propagated back through many steps. By step 50, the network has effectively forgotten what happened at step 1 β€” it cannot learn long-range dependencies.

  • Information from 50+ steps ago is essentially lost
  • Critical for tasks like pronoun resolution ("The trophy wouldn't fit in the box because it was too big")
  • Machine translation suffers at sentence boundaries
  • Practical limit: ~10-20 meaningful steps in vanilla RNNs
Vanishing gradient · Short-term memory only

The Vanishing Gradient Problem in Detail

During BPTT, the gradient of the loss with respect to the hidden state at step t involves a repeated product of per-step Jacobians βˆ‚hβ‚œ/βˆ‚hβ‚œβ‚‹β‚ = diag(tanh′) Β· Wβ‚•. The tanh derivative is at most 1, so it can only shrink this product; if, in addition, the largest eigenvalue of Wβ‚• is less than 1, the repeated multiplication drives the gradient exponentially towards zero. Conversely, if eigenvalues exceed 1, gradients explode. This is not a software bug β€” it is a fundamental mathematical consequence of training deep recurrent structures with gradient-based methods, first analysed by Bengio et al. (1994).
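The eigenvalue argument can be checked numerically. Below we build a random hidden-to-hidden matrix, rescale it to spectral radius 0.9 (a hypothetical but representative value), and push a gradient vector back through 60 steps; the tanh derivative, being at most 1, would only shrink it further:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, T = 8, 60

# A hidden-to-hidden matrix rescaled to spectral radius 0.9
Wh = rng.normal(size=(hidden_dim, hidden_dim))
Wh *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wh)))

# Propagate a gradient vector backwards through T steps
g = np.ones(hidden_dim)
for _ in range(T):
    g = Wh.T @ g

print(np.linalg.norm(g))  # shrinks roughly like 0.9**60 -- effectively zero
```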

🧠 Long Short-Term Memory (LSTM)

Invented by Hochreiter and Schmidhuber in 1997, the LSTM is the most successful solution to the vanishing gradient problem. Its key insight: introduce a separate cell state (Cβ‚œ) that acts like a conveyor belt running through the entire sequence, with gating mechanisms controlling what information is added, removed, or read.

# LSTM equations β€” all vectors of shape (hidden_dim,)
# Οƒ = sigmoid function (0 to 1); tanh = hyperbolic tangent (-1 to 1)
# [hβ‚œβ‚‹β‚, xβ‚œ] = concatenation of previous hidden state and current input

# 1. FORGET GATE β€” what to erase from cell state
fβ‚œ = Οƒ(Wf Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bf)
# fβ‚œ β‰ˆ 0: forget this memory;  fβ‚œ β‰ˆ 1: keep this memory

# 2. INPUT GATE β€” what new information to store
iβ‚œ = Οƒ(Wi Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bi)   # how much to add
CΜƒβ‚œ = tanh(Wc Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bc) # candidate values to add

# 3. CELL STATE UPDATE β€” update the conveyor belt
Cβ‚œ = fβ‚œ βŠ™ Cβ‚œβ‚‹β‚ + iβ‚œ βŠ™ CΜƒβ‚œ
# βŠ™ = element-wise (Hadamard) multiplication

# 4. OUTPUT GATE β€” what to expose as hidden state
oβ‚œ = Οƒ(Wo Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bo)
hβ‚œ = oβ‚œ βŠ™ tanh(Cβ‚œ)

# The cell state Cβ‚œ can persist information across hundreds of timesteps
# because it only involves addition (+), not multiplication through many layers.
      
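The four equations above can be sketched as a single NumPy function. Stacking the four gate matrices into one `W` is an implementation choice of ours (mirroring how common frameworks store them), not part of the original formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x, W, b):
    """One LSTM step; W stacks the four gate matrices over [h_prev, x]."""
    hx = np.concatenate([h_prev, x])
    f, i, o, g = np.split(W @ hx + b, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates squashed to (0, 1)
    C = f * C_prev + i * np.tanh(g)               # additive cell-state update
    h = o * np.tanh(C)                            # exposed hidden state
    return h, C

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)

h = C = np.zeros(hidden_dim)
for x in rng.normal(size=(10, input_dim)):
    h, C = lstm_step(h, C, x, W, b)
print(h.shape, C.shape)  # both (6,)
```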
Each gate, its purpose, and its key equation:

  • Forget Gate (fβ‚œ): Decides which information from the previous cell state to discard. For a language model, this might reset knowledge of the grammatical subject when a new sentence begins. Key equation: fβ‚œ = Οƒ(Wf Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bf)
  • Input Gate (iβ‚œ): Controls how much of the new candidate information CΜƒβ‚œ should be written to the cell state. Prevents every timestep from overwriting the entire memory. Key equation: iβ‚œ = Οƒ(Wi Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bi)
  • Output Gate (oβ‚œ): Controls which parts of the cell state are exposed as the hidden state hβ‚œ. Allows the LSTM to retain information internally without necessarily making it available to the output at every step. Key equation: oβ‚œ = Οƒ(Wo Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bo)
  • Cell State (Cβ‚œ): The "conveyor belt" of LSTM memory. Gradients flow through the additive cell update almost unimpeded, allowing the LSTM to learn dependencies over hundreds of timesteps. Key equation: Cβ‚œ = fβ‚œ βŠ™ Cβ‚œβ‚‹β‚ + iβ‚œ βŠ™ CΜƒβ‚œ

Why the Cell State Solves Vanishing Gradients

The cell state update Cβ‚œ = fβ‚œ βŠ™ Cβ‚œβ‚‹β‚ + iβ‚œ βŠ™ CΜƒβ‚œ is an additive operation. When gradients flow back through this addition, they are not multiplied by a weight matrix repeatedly β€” they flow backwards with minimal decay. This is the same intuition behind ResNet's skip connections. If the forget gate fβ‚œ is kept close to 1 (remember everything), the gradient highway from Cβ‚œ back to Cβ‚€ is essentially clear, allowing the network to train on sequences thousands of steps long.
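Along the cell-state path, the gradient of C_T with respect to Cβ‚€ is just the element-wise product of the forget gates f₁ … f_T, since the additive update introduces no extra matrix multiplication. A quick numerical check (gate values are illustrative):

```python
import numpy as np

T = 200
f_open = np.full(T, 0.99)   # forget gate held near 1 ("remember everything")
f_leaky = np.full(T, 0.5)   # forget gate at 0.5

# dC_T/dC_0 along the cell path = product of the forget gates
print(np.prod(f_open))   # ~0.13  -- gradient survives 200 steps
print(np.prod(f_leaky))  # ~6e-61 -- gradient gone
```

This is why keeping fβ‚œ close to 1 preserves the "gradient highway", whereas any persistent leak compounds exponentially.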

⚑ Gated Recurrent Unit (GRU)

Proposed by Cho et al. in 2014, the GRU simplifies the LSTM by merging the cell state and hidden state into one, and collapsing three gates into two. This reduces parameters by ~25% while achieving comparable performance on most tasks.

# GRU equations
# Two gates instead of three; no separate cell state

# 1. RESET GATE β€” how much of the past to forget when computing candidate
rβ‚œ = Οƒ(Wr Β· [hβ‚œβ‚‹β‚, xβ‚œ] + br)

# 2. UPDATE GATE β€” how much of the old state to keep vs. replace
zβ‚œ = Οƒ(Wz Β· [hβ‚œβ‚‹β‚, xβ‚œ] + bz)

# 3. CANDIDATE HIDDEN STATE
hΜƒβ‚œ = tanh(W Β· [rβ‚œ βŠ™ hβ‚œβ‚‹β‚, xβ‚œ] + b)
# The reset gate rβ‚œ selectively zeros out past hidden state
# before computing the candidate β€” acts like a "soft reset"

# 4. FINAL HIDDEN STATE β€” interpolate between old and new
hβ‚œ = (1 βˆ’ zβ‚œ) βŠ™ hβ‚œβ‚‹β‚ + zβ‚œ βŠ™ hΜƒβ‚œ
# zβ‚œ β‰ˆ 0: keep old state (long-term memory)
# zβ‚œ β‰ˆ 1: take new candidate (short-term update)
      
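A NumPy sketch of one GRU step, following the same z-convention as the equations above (z β‰ˆ 1 takes the candidate). Names and shapes are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wr, Wz, W, br, bz, b):
    """One GRU step: two gates, no separate cell state."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx + br)                       # reset gate
    z = sigmoid(Wz @ hx + bz)                       # update gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)
    return (1 - z) * h_prev + z * h_cand            # interpolate old vs new

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
shape = (hidden_dim, hidden_dim + input_dim)
Wr, Wz, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))
br = bz = b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(10, input_dim)):
    h = gru_step(h, x, Wr, Wz, W, br, bz, b)
print(h.shape)  # (6,)
```

Be aware that some libraries flip the roles of z and (1 βˆ’ z); the behaviour is equivalent, but the convention matters when reading gate activations.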
LSTM vs GRU, property by property:

  • Gates β€” LSTM: 3 gates (forget, input, output). GRU: 2 gates (reset, update).
  • States β€” LSTM: 2 states (cell state Cβ‚œ + hidden state hβ‚œ). GRU: 1 state (hidden state hβ‚œ only).
  • Parameters β€” LSTM: 4 Γ— (hiddenΒ² + hiddenΓ—input) weight matrices. GRU: 3 Γ— (hiddenΒ² + hiddenΓ—input) β€” ~25% fewer.
  • Compute β€” LSTM: slower per step due to more operations. GRU: faster per step; often better on small datasets.
  • Long-range memory β€” LSTM: generally slightly better on very long sequences (the separate cell state is advantageous). GRU: comparable in most practical tasks; slightly weaker on very long dependencies.
  • When to prefer β€” LSTM: long sequences (>200 steps), large datasets, when every unit of performance matters. GRU: shorter sequences, limited compute/memory, rapid prototyping, smaller datasets.
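The ~25% figure in the parameter comparison follows directly from the 4-vs-3 weight-matrix counts (including biases; helper names are ours):

```python
# Parameter counts for a single LSTM vs GRU layer (weights + biases)
def lstm_params(hidden, inp):
    return 4 * (hidden * (hidden + inp) + hidden)

def gru_params(hidden, inp):
    return 3 * (hidden * (hidden + inp) + hidden)

h, x = 256, 128
print(lstm_params(h, x), gru_params(h, x))
print(1 - gru_params(h, x) / lstm_params(h, x))  # 0.25 -- exactly 25% fewer
```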

πŸš€ Modern Alternatives

RNNs dominated sequence modelling from the mid-2010s until around 2018-2020, when Transformers largely replaced them for NLP. However, several RNN variants and non-recurrent alternatives remain relevant.

Bidirectional RNNs

A standard RNN only sees past context. A bidirectional RNN runs two RNNs: one forwards (t=1 to T) and one backwards (t=T to 1). The hidden states from both directions are concatenated at each timestep, giving the model access to both past and future context.

  • Output at step t: [h_forward_t ; h_backward_t] (doubled dimension)
  • Useful when the entire sequence is available (not real-time)
  • ELMo, built on bidirectional LSTMs, was a direct predecessor of BERT
  • Cannot be used for real-time or autoregressive tasks
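The forward/backward concatenation can be sketched with the simple tanh RNN from earlier. For brevity this sketch shares weights between the two directions; a real bidirectional RNN uses a separate parameter set for each:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 6

Wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

def run_rnn(xs):
    """Return the hidden state at every step of a simple tanh RNN."""
    h, states = np.zeros(hidden_dim), []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(T, input_dim))
fwd = run_rnn(xs)               # processes t = 1 ... T
bwd = run_rnn(xs[::-1])[::-1]   # processes t = T ... 1, re-aligned
bi = np.concatenate([fwd, bwd], axis=1)
print(bi.shape)  # (6, 8) -- doubled hidden dimension at each step
```

Note why this rules out streaming: computing `bwd` for step 1 requires having already seen step T.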
Better NER, POS tagging · Not for streaming

Temporal Convolutional Networks (TCN)

TCNs apply 1D convolutions along the time dimension with causal masking (filters only see the past) and dilation (exponentially increasing gaps between filter elements). Dilated convolutions allow the receptive field to grow exponentially with layers, capturing very long-range dependencies without recurrence.

  • No hidden state β€” fully parallelisable during training
  • Dilation 2Λ‘ at layer l gives exponential receptive field growth
  • Comparable to LSTM on many tasks, faster to train
  • WaveNet (DeepMind) is a famous dilated causal CNN
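The exponential receptive-field growth follows from summing the dilated kernel spans: with kernel size k and dilation 2Λ‘ at layer l, a stack of L layers sees 1 + (k βˆ’ 1)(2ᴸ βˆ’ 1) past steps. A quick check (helper name is ours):

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked causal convs with dilation 2**l at layer l."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)

for layers in (4, 8, 10):
    print(layers, receptive_field(2, layers))
# With kernel size 2, just 10 layers already cover 1024 past steps
```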
Parallelisable · Dilated convolutions

Why Transformers Replaced RNNs for NLP

The 2017 paper "Attention Is All You Need" showed that self-attention could model long-range dependencies more effectively than LSTMs, while being fully parallelisable. RNNs process sequences step-by-step β€” T steps of serial computation per sequence. Transformers process all positions simultaneously.

  • Transformers: O(TΒ²Β·d) time but fully parallel; RNNs: O(TΒ·dΒ²) but serial
  • GPT, BERT, LLaMA β€” all Transformer-based, none RNN-based
  • Self-attention directly models any token-to-token relationship
  • Scale: Transformers absorb more compute and data efficiently
Parallel training · Scales to billions of params

β†’ Deep dive: Transformers & Attention

Where RNNs Still Excel

RNNs are not obsolete. For edge / embedded deployment, a compact GRU with a 64-unit hidden state has negligible memory footprint and produces outputs in real-time with O(1) memory (the hidden state is fixed-size regardless of sequence length). For streaming inference β€” where you process one token at a time as it arrives and cannot wait for the full sequence β€” RNNs are natural. Transformers require the full context window. Newer hybrid architectures like Mamba (2023) and RWKV attempt to combine RNN-like streaming efficiency with Transformer-like modelling quality.
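The O(1)-memory streaming property is visible in code: however long the stream runs, the only state carried between tokens is the fixed-size hidden vector (here a plain tanh RNN stands in for the compact GRU described above):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 64

Wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)  # the ONLY state kept between tokens
for step in range(10_000):           # an arbitrarily long stream
    x = rng.normal(size=input_dim)   # tokens arrive one at a time
    h = np.tanh(Wh @ h + Wx @ x)     # update in place; h never grows

print(h.shape)  # (64,) regardless of how many tokens were seen
```

Contrast with a Transformer, whose attention over the context window must retain (and attend over) every past token it is allowed to see.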