🖼️ The Problem CNNs Solve
Before CNNs, computer vision researchers tried applying standard Multilayer Perceptrons (MLPs) directly to images. This approach hit a wall almost immediately, not because the math was wrong, but because the scale was intractable.
The Parameter Explosion Problem
A modest 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. A single fully connected hidden layer with just 1,000 neurons would require 150,528 × 1,000 ≈ 150.5 million weights, for one layer, before learning anything useful. VGG16 processes such images with only 138M parameters total across 16 weight layers, thanks to convolutions; a fully connected network of comparable depth and width would require parameters on the order of trillions.
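The arithmetic above is easy to verify. A minimal sketch, with illustrative layer sizes (64 filters of 3×3 is an assumption, not a specific architecture):

```python
# Back-of-envelope parameter counts: a dense first layer versus a
# convolutional one for a 224x224 RGB input.
inputs = 224 * 224 * 3            # pixel values in the image
hidden = 1_000                    # neurons in one dense hidden layer

dense_weights = inputs * hidden   # one weight per pixel-neuron pair
print(f"{dense_weights:,}")       # 150,528,000

# A conv layer with 64 filters of size 3x3 over 3 input channels:
conv_weights = 64 * 3 * 3 * 3     # weights are shared across positions
print(f"{conv_weights:,}")        # 1,728
```

The convolutional layer's count is independent of image size, because the same small filters are reused at every spatial position.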
No Spatial Awareness in MLPs
An MLP treats each pixel as an independent input. It has no knowledge that pixel (100,100) is spatially adjacent to pixel (101,100). A cat's ear in the top-left corner and a cat's ear in the bottom-right corner activate completely different neurons, so the network must relearn the same feature at every position.
- Every pixel-weight connection is independent
- No concept of neighbourhood or proximity
- Features must be relearned at each spatial position
- Sensitive to translation: a shifted image looks "new"
The Shift-Invariance Need
In practice, the same visual feature (an eye, an edge, a wheel) can appear anywhere in an image. We want the network to recognise that feature regardless of where it appears. This property is called translation invariance (or shift invariance). CNNs achieve it via weight sharing: the same filter detects the same feature everywhere it scans.
- A "horizontal edge" detector should work at (10,10) and (200,200)
- Weight sharing: one filter = one learned pattern, applied everywhere
- Dramatically fewer parameters than equivalent MLP
- Natural inductive bias for spatially structured data
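Weight sharing can be illustrated with a toy example in plain Python: one filter, applied at two different positions, produces the same response to the same local pattern. The image and filter values here are made up for illustration.

```python
# One shared 3x3 filter responds identically to the same local pattern
# wherever it appears in the image.
def response(image, filt, r, c):
    """Dot product of a 3x3 filter with the image patch at (r, c)."""
    return sum(filt[i][j] * image[r + i][c + j]
               for i in range(3) for j in range(3))

edge = [[-1, 0, 1]] * 3               # vertical-edge filter
img = [[0] * 10 for _ in range(10)]
for r in range(10):                   # bright columns at c=3 and c=7
    img[r][3] = img[r][7] = 1

# Same feature, two different positions, one set of weights:
print(response(img, edge, 0, 1), response(img, edge, 5, 5))  # 3 3
```

One learned pattern works everywhere, which is exactly why the parameter count stays small.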
🔍 The Convolution Operation
The core of a CNN is the convolution: a small learnable filter (kernel) slides across the input, computing a dot product at each position. The result is a feature map that shows where and how strongly that pattern was detected.
# Convolution: a 3×3 filter sliding over a 5×5 grayscale input
Input (5×5):          Filter (3×3, vertical-edge detector):
 1  2  3  4  5          -1  0  1
 6  7  8  9 10          -1  0  1
11 12 13 14 15          -1  0  1
16 17 18 19 20
21 22 23 24 25

Step 1: Place filter at top-left (rows 0-2, cols 0-2)
Element-wise multiply and sum:
  (-1×1)  + (0×2)  + (1×3)
+ (-1×6)  + (0×7)  + (1×8)
+ (-1×11) + (0×12) + (1×13)
= (-1+0+3) + (-6+0+8) + (-11+0+13) = 2 + 2 + 2 = 6

Step 2: Slide right by stride=1, repeat
Output feature map will be (5-3+1) × (5-3+1) = 3×3

Output feature map (3×3): shows vertical-edge strength at each position.

# Key formula: output_size = floor((input_size - kernel_size + 2×padding) / stride) + 1
# With input=5, kernel=3, padding=0, stride=1: (5-3+0)/1 + 1 = 3
# With padding=1 ("same" padding): (5-3+2)/1 + 1 = 5 (preserves spatial size)
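The sliding computation above takes only a few lines of plain Python. This minimal sketch (stride 1, no padding) reproduces the worked example, confirming the 3×3 output size and the top-left value of 6:

```python
# Naive 2D convolution (cross-correlation form, stride 1, no padding).
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]

image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]  # 1..25
kernel = [[-1, 0, 1]] * 3                                      # vertical edges
out = conv2d(image, kernel)
print(len(out), len(out[0]))  # 3 3
print(out[0][0])              # 6
```

Because this input increases by exactly 1 per column everywhere, every entry of the feature map comes out as 6: the filter reports a constant horizontal gradient.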
Filters / Kernels
A filter is a small matrix of learnable weights, typically 3×3, 5×5, or 7×7. Each filter learns to detect a specific local pattern: horizontal edges, vertical edges, blobs of color, texture gradients. A Conv layer typically uses 32–512 filters in parallel, each producing its own feature map. All these feature maps are stacked depth-wise.
Stride
Stride controls how many pixels the filter moves per step. Stride=1 moves one pixel at a time (maximum overlap). Stride=2 moves two pixels, halving the output resolution. Larger strides act as a form of downsampling and reduce compute, but may miss fine-grained features.
Padding
Valid padding (no padding): the filter only runs where it fits completely, shrinking the output. Same padding (zero-padding around the border): the output has the same spatial dimensions as the input. Same padding is preferred in deep networks to prevent the feature map from shrinking to zero over many layers.
Receptive Field
The receptive field of a neuron is the region of the original input that influenced its value. A single Conv layer with a 3×3 filter has a 3×3 receptive field. Two stacked 3×3 layers have an effective 5×5 receptive field. This growth with depth is key: deep CNNs develop neurons that "see" very large portions of the image, enabling object-level reasoning.
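The growth of the receptive field with depth follows a simple recurrence: each k×k layer extends it by (k−1) times the product of all earlier strides. A small sketch:

```python
# Effective receptive field of a stack of conv layers.
def receptive_field(layers):
    """layers: list of (kernel_size, stride), first layer to last."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field
        jump *= s              # stride compounds the step size
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 layers
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 layers
```

With stride-2 layers interleaved, the field grows geometrically rather than linearly, which is how deep networks come to "see" most of the image.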
🧱 CNN Layer Types
| Layer Type | What It Does | Key Hyperparameters |
|---|---|---|
| Conv2D | Applies multiple learnable filters to detect spatial patterns. Produces a stack of feature maps, one per filter. Introduces non-linearity via activation (typically ReLU) applied immediately after. | Number of filters, kernel size (e.g. 3×3), stride, padding (same/valid), activation function |
| Max Pooling | Divides the feature map into non-overlapping rectangular windows and takes the maximum value from each. Reduces spatial dimensions, introduces limited translation invariance, and reduces computation. | Pool size (e.g. 2×2), stride (default = pool size), padding |
| Average Pooling | Like max pooling but takes the mean of each window. Preserves more background information; less aggressive than max pooling. Global Average Pooling (GAP) averages across the entire spatial dimension to produce a single value per feature map. | Pool size, stride. Global Average Pooling has no size parameter. |
| Batch Normalisation | Normalises the activations within each mini-batch to have zero mean and unit variance, then applies learnable scale (γ) and shift (β). Stabilises and accelerates training, allows higher learning rates, and provides mild regularisation. | momentum (for running statistics), epsilon (numerical stability), learnable γ and β |
| Fully Connected (Dense) | Standard MLP layer, typically used at the end of a CNN after flattening or Global Average Pooling. Combines the learned spatial features into class scores or a regression output. | Number of units, activation function |
| Dropout | During training, randomly sets a fraction p of activations to zero, preventing any single neuron from becoming overly dominant. Forces the network to learn redundant representations. Disabled at inference: in the classic formulation activations are scaled by 1−p at test time, while the now-common inverted dropout scales by 1/(1−p) during training instead. | Dropout rate p (typically 0.2–0.5) |
Pooling Trade-offs
Pooling reduces spatial dimensions, which cuts memory and compute for subsequent layers and helps limit overfitting. However, it discards precise spatial information: you know a feature was detected somewhere in that region, not exactly where. This is fine for classification (is there a cat?) but problematic for localisation (where is the cat?). Modern object detectors (YOLO, Faster R-CNN) and segmentation models (U-Net) use techniques such as feature pyramid networks and skip connections to recover spatial precision after pooling.
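The 2×2 max pooling described in the table is simple enough to sketch directly; the feature-map values below are made up for illustration:

```python
# 2x2 max pooling with stride 2: each non-overlapping window
# collapses to its maximum, halving both spatial dimensions.
def max_pool_2x2(fmap):
    return [[max(fmap[r][c],     fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [7, 2, 9, 1],
        [3, 1, 0, 8]]
print(max_pool_2x2(fmap))  # [[6, 5], [7, 9]]
```

Note how the output records that a strong activation occurred in each quadrant, but not where within the quadrant: exactly the trade-off discussed above.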
🏛️ Classic Architectures
The history of CNNs is a story of progressively deeper, more ingenious architectures, each building on lessons from the last. These milestones shaped modern deep learning.
LeNet-5 (1998)
Yann LeCun's groundbreaking network, designed for handwritten digit recognition (MNIST). It established the Conv → Pool → Conv → Pool → FC pattern that later CNNs would build on.
- Year: 1998 (LeCun, Bottou, Bengio, Haffner)
- Key innovation: first practical deep CNN end-to-end trained with backprop
- Parameters: ~60,000
- Input: 32×32 grayscale; Output: 10 digit classes
- Activation: tanh and sigmoid (ReLU not yet standard)
AlexNet (2012)
Won ImageNet ILSVRC 2012 by a margin of nearly 11 percentage points, triggering the deep learning revolution. First CNN to exploit GPU parallelism and use ReLU activations and dropout at scale.
- Year: 2012 (Krizhevsky, Sutskever, Hinton)
- Key innovations: ReLU activation, GPU training, dropout regularisation, data augmentation
- Parameters: ~61 million
- Top-5 ImageNet error: 15.3% (vs 26.2% runner-up)
- Used 2 GTX 580 GPUs with model split across them
VGG16 / VGG19 (2014)
Oxford's VGG group showed that depth matters: exclusively using 3×3 convolutions stacked deeply outperformed larger kernels. VGG16/19 are still widely used as backbone encoders for feature extraction.
- Year: 2014 (Simonyan & Zisserman)
- Key innovation: uniform 3×3 kernels; demonstrated the power of depth with small kernels
- Parameters: ~138 million (VGG16)
- Top-5 error: 7.3% (ILSVRC 2014 runner-up)
- Two 3×3 Conv = one 5×5 receptive field, but fewer params
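The parameter saving in the last bullet is quick to check. Assuming C channels in and out and ignoring biases (C = 64 here is an arbitrary example):

```python
# The VGG argument: two stacked 3x3 convs cover the same 5x5
# receptive field as one 5x5 conv, with fewer weights.
C = 64                            # channels in and out (illustrative)
two_3x3 = 2 * (3 * 3 * C * C)     # 73,728 weights
one_5x5 = 5 * 5 * C * C           # 102,400 weights
print(two_3x3, one_5x5, two_3x3 < one_5x5)
```

The stacked version is also interleaved with an extra non-linearity, so it is both cheaper and more expressive.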
ResNet (2015)
Microsoft Research introduced residual (skip) connections: a block's input is added directly to that block's output. This allows gradients to flow through the shortcuts, enabling networks 100+ layers deep to train without vanishing gradients.
- Year: 2015 (He, Zhang, Ren, Sun)
- Key innovation: skip connections, H(x) = F(x) + x (residual learning)
- Parameters: ResNet-50 ~25M, ResNet-152 ~60M
- Top-5 error: 3.57% (ILSVRC 2015 winner, below the roughly 5% human benchmark)
- Variants: ResNet-18, 34, 50, 101, 152; Wide ResNet; ResNeXt
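The residual idea fits in one line. A scalar sketch (real blocks use conv layers for F, but the structure is the same):

```python
# Residual learning: the block computes F(x), and the skip connection
# adds x back, so the block outputs H(x) = F(x) + x. If the ideal
# mapping is the identity, the block only has to learn F(x) = 0,
# which is much easier than learning H(x) = x from scratch.
def residual_block(f, x):
    return f(x) + x

print(residual_block(lambda x: 0.0, 5.0))  # 5.0: identity for free
```

During backpropagation the "+ x" term contributes a gradient of 1 along the shortcut, which is why very deep stacks of these blocks still receive a usable training signal.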
EfficientNet (2019)
Google Brain proposed compound scaling: systematically scale depth, width, and resolution together using a fixed ratio found by neural architecture search. EfficientNet-B7 achieved state-of-the-art with far fewer parameters than comparable models.
- Year: 2019 (Tan & Le, Google Brain)
- Key innovation: compound coefficient scales depth+width+resolution jointly
- Parameters: EfficientNet-B0 ~5.3M, B7 ~66M
- Top-1 ImageNet accuracy: B7 reaches 84.4%
- Uses MBConv blocks (mobile inverted bottleneck convolutions)
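Compound scaling can be sketched numerically. The base ratios below (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution) are the ones reported in the EfficientNet paper, found by grid search subject to α·β²·γ² ≈ 2:

```python
# Compound scaling: one coefficient phi scales depth, width, and
# resolution together, roughly doubling FLOPs per unit of phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # ratios from the paper

def compound_scale(phi):
    return {"depth": ALPHA ** phi,        # multiplier on layer count
            "width": BETA ** phi,         # multiplier on channels
            "resolution": GAMMA ** phi}   # multiplier on input size

print(round(ALPHA * BETA**2 * GAMMA**2, 2))  # close to 2
print(compound_scale(1))
```

Scaling all three dimensions in a fixed ratio is what distinguishes this from earlier practice, where depth, width, or resolution were tuned independently.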
🌐 CNN Applications Beyond Images
The convolution operation is not limited to 2D images. Any data with local structure β sequences, audio, graphs β can benefit from convolutional processing.
| Domain | Input Type | CNN Variant | Example Task |
|---|---|---|---|
| Natural Language Processing | Sequence of word embeddings (shape: seq_len × embed_dim) | 1D Conv (Conv1D) with filters of width 2, 3, 4, 5 tokens | Sentence classification, sentiment analysis, spam detection. TextCNN (Kim 2014) used this approach with multiple filter widths in parallel. |
| Time Series | Univariate or multivariate time series (shape: time_steps × features) | 1D Conv; dilated (atrous) convolutions for long-range dependencies | Anomaly detection, sensor fault prediction, ECG/EEG classification, stock pattern recognition. |
| Audio | Mel-frequency spectrogram (shape: freq_bins × time_frames × 1) | 2D Conv on spectrogram; or 1D Conv on raw waveform (WaveNet) | Keyword spotting, speaker identification, music genre classification, environmental sound recognition. |
| Video | Sequence of frames (shape: T × H × W × C) | 3D Conv (C3D, I3D); separable 3D Conv for efficiency | Action recognition, gesture detection, sports analytics, surveillance event detection. |
| Point Clouds / Graphs | Unstructured 3D points or graph node features | Graph Convolutional Networks (GCN); PointNet uses shared MLP with symmetric aggregation | 3D object detection, LiDAR segmentation, molecular property prediction, social network analysis. |
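The 1D variant used for text and time series is the same sliding dot product in one dimension. A minimal sketch, with a made-up step signal:

```python
# 1D convolution over a sequence (single channel, stride 1, no padding):
# a width-3 filter slides along the time axis.
def conv1d(seq, kernel, stride=1):
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(0, len(seq) - k + 1, stride)]

signal = [0, 0, 1, 1, 1, 0, 0]   # a step up and back down
edge = [-1, 0, 1]                # detects changes along the sequence
print(conv1d(signal, edge))      # [1, 1, 0, -1, -1]
```

Positive outputs mark the rising edge and negative outputs the falling edge, which is exactly the kind of local temporal pattern the table's time-series and audio rows rely on.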
The Inductive Bias Principle
CNNs encode the assumption that important patterns are local and position-invariant. This is a strong inductive bias that works beautifully for natural images and many physical signals. However, when this assumption breaks down (for example, in tasks where global context matters as much as local features, or where data has no natural spatial structure), alternative architectures such as Transformers and GNNs may outperform CNNs. Choosing an architecture means choosing which inductive biases to embed in your model.