⏱ 9 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

πŸ–ΌοΈ The Problem CNNs Solve

Before CNNs, computer vision researchers tried applying standard Multilayer Perceptrons (MLPs) directly to images. This approach hit a wall almost immediately β€” not because the math was wrong, but because the scale was intractable.

The Parameter Explosion Problem

A modest 224Γ—224 RGB image has 224 Γ— 224 Γ— 3 = 150,528 input values. A single fully connected hidden layer with just 1,000 neurons would require 150,528 Γ— 1,000 = 150.5 million weights β€” for one layer, before learning anything useful. For comparison, VGG16 processes such images with ~138 million parameters total across its 16 weight layers, thanks to convolutions; a fully connected network of similar depth would need on the order of hundreds of billions to trillions of parameters.
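A quick back-of-the-envelope check of these counts (plain Python; the layer sizes are the ones assumed in this paragraph):

```python
# One fully connected hidden layer on a 224x224 RGB image
inputs = 224 * 224 * 3           # 150,528 input values
hidden = 1_000                   # neurons in the hidden layer
fc_weights = inputs * hidden     # one weight per input-neuron pair
print(fc_weights)                # 150528000 (~150.5M for a single layer)

# Compare: one conv layer of 64 filters, each 3x3, over the same RGB input
conv_weights = 3 * 3 * 3 * 64    # kernel_h * kernel_w * in_channels * filters
print(conv_weights)              # 1728 weights, independent of image size
```

The convolutional count stays fixed no matter how large the image grows, which is exactly the weight-sharing saving discussed below.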

No Spatial Awareness in MLPs

An MLP treats each pixel as an independent input. It has no knowledge that pixel (100,100) is spatially adjacent to pixel (101,100). A cat's ear in the top-left corner and a cat's ear in the bottom-right corner activate completely different neurons β€” the network must relearn the same feature at every position.

  • Every pixel-weight connection is independent
  • No concept of neighbourhood or proximity
  • Features must be relearned at each spatial position
  • Sensitive to translation β€” a shifted image looks "new"

The Shift-Invariance Need

In practice, the same visual feature (an eye, an edge, a wheel) can appear anywhere in an image, and we want the network to recognise it regardless of where it appears. This property is called translation invariance (or shift invariance). Strictly speaking, weight sharing makes convolution translation equivariant (a shifted input produces a correspondingly shifted feature map); pooling and the final classifier layers then turn that into approximate invariance. Either way, the mechanism is weight sharing: the same filter detects the same feature everywhere it scans.

  • A "horizontal edge" detector should work at (10,10) and (200,200)
  • Weight sharing: one filter = one learned pattern, applied everywhere
  • Dramatically fewer parameters than equivalent MLP
  • Natural inductive bias for spatially structured data

πŸ” The Convolution Operation

The core of a CNN is the convolution: a small learnable filter (kernel) slides across the input, computing a dot product at each position. The result is a feature map that shows where and how strongly that pattern was detected. (Strictly, deep learning frameworks compute cross-correlation, sliding the kernel without flipping it, but the name convolution has stuck.)

# Convolution: a 3Γ—3 filter sliding over a 5Γ—5 grayscale input

Input (5Γ—5):               Filter (3Γ—3 β€” edge detector):
  1  2  3  4  5              -1  0  1
  6  7  8  9 10              -1  0  1
 11 12 13 14 15              -1  0  1
 16 17 18 19 20
 21 22 23 24 25

Step 1: Place filter at top-left (rows 0-2, cols 0-2)
  Element-wise multiply and sum:
  (-1Γ—1) + (0Γ—2) + (1Γ—3)
+ (-1Γ—6) + (0Γ—7) + (1Γ—8)
+ (-1Γ—11)+ (0Γ—12)+ (1Γ—13)
= (-1+0+3) + (-6+0+8) + (-11+0+13) = 2+2+2 = 6

Step 2: Slide right by stride=1, repeat
  Output feature map will be (5-3+1) Γ— (5-3+1) = 3Γ—3

Output feature map (3Γ—3) β€” shows vertical edge strength at each position.

# Key formula: output_size = floor((input_size βˆ’ kernel_size + 2Γ—padding) / stride) + 1
# With input=5, kernel=3, padding=0, stride=1: (5βˆ’3+0)/1 +1 = 3
# With padding=1 ("same" padding): (5βˆ’3+2)/1 +1 = 5  (preserves spatial size)
      

Filters / Kernels

A filter is a small matrix of learnable weights, typically 3Γ—3, 5Γ—5, or 7Γ—7. Each filter learns to detect a specific local pattern: horizontal edges, vertical edges, blobs of color, texture gradients. A Conv layer typically uses 32–512 filters in parallel, each producing its own feature map. All these feature maps are stacked depth-wise.


Stride

Stride controls how many pixels the filter moves per step. Stride=1 moves one pixel at a time (maximum overlap). Stride=2 moves two pixels, halving the output resolution. Larger strides act as a form of downsampling and reduce compute, but may miss fine-grained features.


Padding

Valid padding (no padding): the filter only runs where it fits completely, shrinking the output. Same padding (zero-padding around the border): the output has the same spatial dimensions as the input. Same padding is preferred in deep networks to prevent the feature map from shrinking to zero over many layers.

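Both stride and padding feed into the output-size formula shown earlier; a small sketch (the helper name is hypothetical):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # floor((input - kernel + 2*padding) / stride) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                         # 3   (valid padding)
print(conv_output_size(5, 3, padding=1))              # 5   ("same" padding at stride 1)
print(conv_output_size(224, 3, padding=1, stride=2))  # 112 (stride 2 halves the size)
```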

Receptive Field

The receptive field of a neuron is the region of the original input that influenced its value. A single Conv layer with a 3Γ—3 filter has a 3Γ—3 receptive field. Two stacked 3Γ—3 layers have an effective 5Γ—5 receptive field. This growth with depth is key: deep CNNs develop neurons that "see" very large portions of the image, enabling object-level reasoning.

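The growth follows a simple recurrence: each layer adds (kernel βˆ’ 1) Γ— (product of the strides of all earlier layers) to the receptive field. A sketch with an illustrative helper:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each new layer widens the field
        jump *= stride              # strides compound for later layers
    return rf

print(receptive_field([(3, 1)]))                  # 3: one 3x3 layer
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 layers
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three stacked 3x3 layers
```

This matches the claim in the text: two stacked 3Γ—3 layers see an effective 5Γ—5 region, and each extra 3Γ—3 layer adds 2 more pixels (more if strides exceed 1).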

🧱 CNN Layer Types

Conv2D
  • Applies multiple learnable filters to detect spatial patterns, producing a stack of feature maps, one per filter. Non-linearity is introduced by an activation (typically ReLU) applied immediately after.
  • Key hyperparameters: number of filters, kernel size (e.g. 3Γ—3), stride, padding (same/valid), activation function.

Max Pooling
  • Divides the feature map into non-overlapping rectangular windows and takes the maximum value from each. Reduces spatial dimensions, introduces limited translation invariance, and reduces computation.
  • Key hyperparameters: pool size (e.g. 2Γ—2), stride (default = pool size), padding.

Average Pooling
  • Like max pooling but takes the mean of each window. Preserves more background information; less aggressive than max pooling. Global Average Pooling (GAP) averages across the entire spatial dimension to produce a single value per feature map.
  • Key hyperparameters: pool size, stride. Global Average Pooling has no size parameter.

Batch Normalisation
  • Normalises the activations within each mini-batch to zero mean and unit variance, then applies a learnable scale (Ξ³) and shift (Ξ²). Stabilises and accelerates training, allows higher learning rates, and provides mild regularisation.
  • Key hyperparameters: momentum (for running statistics), epsilon (numerical stability), learnable Ξ³ and Ξ².

Fully Connected (Dense)
  • Standard MLP layer, typically used at the end of a CNN after flattening or Global Average Pooling. Combines the learned spatial features into class scores or a regression output.
  • Key hyperparameters: number of units, activation function.

Dropout
  • During training, randomly sets a fraction p of activations to zero, preventing any single neuron from becoming overly dominant and forcing the network to learn redundant representations. Disabled at inference; classical dropout rescales activations by 1βˆ’p at test time, while modern implementations use inverted dropout and instead scale by 1/(1βˆ’p) during training.
  • Key hyperparameters: dropout rate p (typically 0.2–0.5).
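As a concrete illustration of the dropout behaviour: most modern frameworks implement the inverted variant, scaling kept activations by 1/(1βˆ’p) during training so that inference needs no rescaling. A plain-Python sketch (the helper name is hypothetical):

```python
import random

def inverted_dropout(activations, p, training=True):
    """Inverted dropout: keep each unit with probability 1-p and scale it by
    1/(1-p) at training time; at inference the layer is the identity."""
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
acts = [1.0, 2.0, 3.0, 4.0]
dropped = inverted_dropout(acts, p=0.5)
# Each output is either 0.0 (dropped) or the input scaled by 1/(1-p) = 2.0
print(dropped)
```

Scaling during training keeps the expected activation unchanged, which is why no adjustment is needed when dropout is switched off at inference.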

Pooling Trade-offs

Pooling reduces spatial dimensions, which cuts memory and compute for subsequent layers and can help reduce overfitting. However, it discards precise spatial information: you know a feature was detected somewhere in that region, not exactly where. This is fine for classification (is there a cat?) but problematic for localisation (where is the cat?). Modern object detectors (YOLO, Faster R-CNN) and segmentation models (U-Net) use techniques such as feature pyramid networks and skip connections to recover spatial precision after pooling.
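The trade-off is visible in a toy example: 2Γ—2 max pooling records that a strong activation occurred in each window, but not where inside it (plain-Python sketch, illustrative helper):

```python
def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling (stride 2) over a 2D feature map."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 1],
        [4, 8, 0, 1],
        [5, 1, 9, 2],
        [0, 2, 3, 7]]

# The 8 was at (1,1) and the 9 at (2,2); after pooling we only know
# which 2x2 window each came from, not its exact position.
print(max_pool_2x2(fmap))  # [[8, 2], [5, 9]]
```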

πŸ›οΈ Classic Architectures

The history of CNNs is a story of progressively deeper, more ingenious architectures, each building on lessons from the last. These milestones shaped modern deep learning.

LeNet-5 (1998)

Yann LeCun's groundbreaking network, designed for handwritten digit recognition (MNIST). It established the Conv β†’ Pool β†’ Conv β†’ Pool β†’ FC pattern that all later CNNs would follow.

  • Year: 1998 (LeCun, Bottou, Bengio, Haffner)
  • Key innovation: first practical deep CNN end-to-end trained with backprop
  • Parameters: ~60,000
  • Input: 32Γ—32 grayscale; Output: 10 digit classes
  • Activation: tanh and sigmoid (ReLU not yet standard)

AlexNet (2012)

Won ImageNet ILSVRC 2012 by a stunning margin of almost 11 percentage points, triggering the deep learning revolution. It was the first CNN to exploit GPU parallelism and to use ReLU activations and dropout at scale.

  • Year: 2012 (Krizhevsky, Sutskever, Hinton)
  • Key innovations: ReLU activation, GPU training, dropout regularisation, data augmentation
  • Parameters: ~61 million
  • Top-5 ImageNet error: 15.3% (vs 26.2% runner-up)
  • Used 2 GTX 580 GPUs with model split across them

VGG16 / VGG19 (2014)

Oxford VGG group showed that depth matters: exclusively using 3Γ—3 convolutions stacked deeply outperformed larger kernels. VGG16/19 are still widely used as backbone encoders for feature extraction.

  • Year: 2014 (Simonyan & Zisserman)
  • Key innovation: uniform 3Γ—3 kernels; demonstrated depth vs width
  • Parameters: ~138 million (VGG16)
  • Top-5 error: 7.3% (ILSVRC 2014 runner-up)
  • Two 3Γ—3 Conv = one 5Γ—5 receptive field, but fewer params

ResNet (2015)

Microsoft Research introduced residual (skip) connections: the input to a block is added back to the block's output two (or three) layers later. This lets gradients flow directly through the shortcuts, enabling networks 100+ layers deep to train without vanishing gradients.

  • Year: 2015 (He, Zhang, Ren, Sun)
  • Key innovation: skip connections β€” H(x) = F(x) + x (residual learning)
  • Parameters: ResNet-50 ~25M, ResNet-152 ~60M
  • Top-5 error: 3.57% (ILSVRC 2015 winner β€” superhuman)
  • Variants: ResNet-18, 34, 50, 101, 152; Wide ResNet; ResNeXt
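The residual formulation H(x) = F(x) + x is simple enough to sketch directly. If the branch F outputs zeros, the block reduces to the identity, which is what makes very deep stacks trainable (illustrative sketch, not the actual ResNet block):

```python
def residual_block(x, f):
    """y = F(x) + x: the skip connection adds the input back unchanged."""
    return [a + b for a, b in zip(f(x), x)]

x = [1, 2, 3]

# A branch that has learned nothing (all zeros) leaves the block as the identity:
print(residual_block(x, lambda v: [0] * len(v)))        # [1, 2, 3]

# Otherwise the branch only has to model the *difference* from the input:
print(residual_block(x, lambda v: [2 * a for a in v]))  # [3, 6, 9]
```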

EfficientNet (2019)

Google Brain proposed compound scaling: systematically scale depth, width, and resolution together using a fixed ratio found by neural architecture search. EfficientNet-B7 achieved state-of-the-art with far fewer parameters than comparable models.

  • Year: 2019 (Tan & Le, Google Brain)
  • Key innovation: compound coefficient scales depth+width+resolution jointly
  • Parameters: EfficientNet-B0 ~5.3M, B7 ~66M
  • Top-1 ImageNet accuracy: B7 reaches 84.4%
  • Uses MBConv blocks (mobile inverted bottleneck convolutions)
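Compound scaling chooses a coefficient Ο† and scales depth by Ξ±^Ο†, width by Ξ²^Ο†, and resolution by Ξ³^Ο†, under the constraint Ξ± Β· Ξ²Β² Β· Ξ³Β² β‰ˆ 2 so that FLOPs roughly double per unit of Ο†. A sketch using the coefficients reported in the paper (Ξ± = 1.2, Ξ² = 1.1, Ξ³ = 1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # grid-searched on EfficientNet-B0

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPs grow roughly with depth * width^2 * resolution^2, so the constraint
# alpha * beta**2 * gamma**2 ~= 2 doubles FLOPs per unit increase in phi.
print(round(alpha * beta**2 * gamma**2, 3))   # 1.92, close to 2

d, w, r = compound_scale(3)
print(round(d, 2), round(w, 2), round(r, 2))  # 1.73 1.33 1.52
```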

🌐 CNN Applications Beyond Images

The convolution operation is not limited to 2D images. Any data with local structure β€” sequences, audio, graphs β€” can benefit from convolutional processing.

Natural Language Processing
  • Input: sequence of word embeddings (shape: seq_len Γ— embed_dim)
  • CNN variant: 1D Conv (Conv1D) with filters of width 2, 3, 4, 5 tokens
  • Example tasks: sentence classification, sentiment analysis, spam detection. TextCNN (Kim 2014) used this approach with multiple filter widths in parallel.

Time Series
  • Input: univariate or multivariate time series (shape: time_steps Γ— features)
  • CNN variant: 1D Conv; dilated (atrous) convolutions for long-range dependencies
  • Example tasks: anomaly detection, sensor fault prediction, ECG/EEG classification, stock pattern recognition.

Audio
  • Input: mel-frequency spectrogram (shape: freq_bins Γ— time_frames Γ— 1)
  • CNN variant: 2D Conv on the spectrogram; or 1D Conv on the raw waveform (WaveNet)
  • Example tasks: keyword spotting, speaker identification, music genre classification, environmental sound recognition.

Video
  • Input: sequence of frames (shape: T Γ— H Γ— W Γ— C)
  • CNN variant: 3D Conv (C3D, I3D); separable 3D Conv for efficiency
  • Example tasks: action recognition, gesture detection, sports analytics, surveillance event detection.

Point Clouds / Graphs
  • Input: unstructured 3D points or graph node features
  • CNN variant: Graph Convolutional Networks (GCN); PointNet uses shared MLPs with symmetric aggregation
  • Example tasks: 3D object detection, LiDAR segmentation, molecular property prediction, social network analysis.
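Several of these variants are small twists on the same operation. The dilated (atrous) convolution mentioned for time series, for instance, simply spaces the kernel taps `dilation` steps apart, widening the receptive field without adding weights (plain-Python sketch, illustrative helper):

```python
def conv1d_dilated(signal, kernel, dilation=1):
    """1D convolution whose kernel taps are `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation          # input range one output covers
    return [sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(signal) - span)]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, -1]                                  # simple difference filter

# dilation=1 compares adjacent points; dilation=2 compares points 2 steps apart
print(conv1d_dilated(signal, kernel, dilation=1))  # [-1, -1, -1, -1, -1, -1, -1]
print(conv1d_dilated(signal, kernel, dilation=2))  # [-2, -2, -2, -2, -2, -2]
```

Stacking layers with dilations 1, 2, 4, 8, ... (as in WaveNet) grows the receptive field exponentially with depth, which is what makes long-range dependencies tractable.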

The Inductive Bias Principle

CNNs encode the assumption that important patterns are local and position-invariant. This is a strong inductive bias that works beautifully for natural images and many physical signals. However, when this assumption breaks down β€” for example, in tasks where global context matters as much as local features, or where data has no natural spatial structure β€” alternative architectures (Transformers, GNNs) may outperform CNNs. Choosing an architecture means choosing which inductive biases to embed in your model.