⏱ 9 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

πŸ–ΌοΈ The Problem CNNs Solve

Before CNNs, computer vision researchers tried applying standard Multilayer Perceptrons (MLPs) directly to images. This approach hit a wall almost immediately β€” not because the math was wrong, but because the scale was intractable.

The Parameter Explosion Problem

A modest 224Γ—224 RGB image has 224 Γ— 224 Γ— 3 = 150,528 input values. A single fully connected hidden layer with just 1,000 neurons would require 150,528 Γ— 1,000 = 150.5 million weights β€” for one layer, before learning anything useful. For comparison, VGG16 processes such images with ~138 million parameters total across its 16 weight layers, thanks to convolutions; a fully connected network of similar depth would need on the order of hundreds of billions to trillions of parameters.
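A quick back-of-the-envelope check of these counts (plain Python; the layer sizes are the ones assumed in this paragraph):

```python
# One fully connected hidden layer on a 224x224 RGB image
inputs = 224 * 224 * 3           # 150,528 input values
hidden = 1_000                   # neurons in the hidden layer
fc_weights = inputs * hidden     # one weight per input-neuron pair
print(fc_weights)                # 150528000 (~150.5M for a single layer)

# Compare: one conv layer of 64 filters, each 3x3, over the same RGB input
conv_weights = 3 * 3 * 3 * 64    # kernel_h * kernel_w * in_channels * filters
print(conv_weights)              # 1728 weights, independent of image size
```

The convolutional count stays fixed no matter how large the image grows, which is exactly the weight-sharing saving discussed below.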

No Spatial Awareness in MLPs

An MLP treats each pixel as an independent input. It has no knowledge that pixel (100,100) is spatially adjacent to pixel (101,100). A cat's ear in the top-left corner and a cat's ear in the bottom-right corner activate completely different neurons β€” the network must relearn the same feature at every position.

  • Every pixel-weight connection is independent
  • No concept of neighbourhood or proximity
  • Features must be relearned at each spatial position
  • Sensitive to translation β€” a shifted image looks "new"

The Shift-Invariance Need

In practice, the same visual feature (an eye, an edge, a wheel) can appear anywhere in an image, and we want the network to recognise it regardless of where it appears. This property is called translation invariance (or shift invariance). Strictly speaking, weight sharing makes convolution translation equivariant (a shifted input produces a correspondingly shifted feature map); pooling and the final classifier layers then turn that into approximate invariance. Either way, the mechanism is weight sharing: the same filter detects the same feature everywhere it scans.

  • A "horizontal edge" detector should work at (10,10) and (200,200)
  • Weight sharing: one filter = one learned pattern, applied everywhere
  • Dramatically fewer parameters than equivalent MLP
  • Natural inductive bias for spatially structured data

πŸ” The Convolution Operation

The core of a CNN is the convolution: a small learnable filter (kernel) slides across the input, computing a dot product at each position. The result is a feature map that shows where and how strongly that pattern was detected. (Strictly, deep learning frameworks compute cross-correlation, sliding the kernel without flipping it, but the name convolution has stuck.)

# Convolution: a 3Γ—3 filter sliding over a 5Γ—5 grayscale input

Input (5Γ—5):               Filter (3Γ—3 β€” edge detector):
  1  2  3  4  5              -1  0  1
  6  7  8  9 10              -1  0  1
 11 12 13 14 15              -1  0  1
 16 17 18 19 20
 21 22 23 24 25

Step 1: Place filter at top-left (rows 0-2, cols 0-2)
  Element-wise multiply and sum:
  (-1Γ—1) + (0Γ—2) + (1Γ—3)
+ (-1Γ—6) + (0Γ—7) + (1Γ—8)
+ (-1Γ—11)+ (0Γ—12)+ (1Γ—13)
= (-1+0+3) + (-6+0+8) + (-11+0+13) = 2+2+2 = 6

Step 2: Slide right by stride=1, repeat
  Output feature map will be (5-3+1) Γ— (5-3+1) = 3Γ—3

Output feature map (3Γ—3) β€” shows vertical edge strength at each position.

# Key formula: output_size = floor((input_size βˆ’ kernel_size + 2Γ—padding) / stride) + 1
# With input=5, kernel=3, padding=0, stride=1: (5βˆ’3+0)/1 +1 = 3
# With padding=1 ("same" padding): (5βˆ’3+2)/1 +1 = 5  (preserves spatial size)
      

Filters / Kernels

A filter is a small matrix of learnable weights, typically 3Γ—3, 5Γ—5, or 7Γ—7. Each filter learns to detect a specific local pattern: horizontal edges, vertical edges, blobs of color, texture gradients. A Conv layer typically uses 32–512 filters in parallel, each producing its own feature map. All these feature maps are stacked depth-wise.


Stride

Stride controls how many pixels the filter moves per step. Stride=1 moves one pixel at a time (maximum overlap). Stride=2 moves two pixels, halving the output resolution. Larger strides act as a form of downsampling and reduce compute, but may miss fine-grained features.


Padding

Valid padding (no padding): the filter only runs where it fits completely, shrinking the output. Same padding (zero-padding around the border): the output has the same spatial dimensions as the input. Same padding is preferred in deep networks to prevent the feature map from shrinking to zero over many layers.

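Both stride and padding feed into the output-size formula shown earlier; a small sketch (the helper name is hypothetical):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # floor((input - kernel + 2*padding) / stride) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                         # 3   (valid padding)
print(conv_output_size(5, 3, padding=1))              # 5   ("same" padding at stride 1)
print(conv_output_size(224, 3, padding=1, stride=2))  # 112 (stride 2 halves the size)
```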

Receptive Field

The receptive field of a neuron is the region of the original input that influenced its value. A single Conv layer with a 3Γ—3 filter has a 3Γ—3 receptive field. Two stacked 3Γ—3 layers have an effective 5Γ—5 receptive field. This growth with depth is key: deep CNNs develop neurons that "see" very large portions of the image, enabling object-level reasoning.

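The growth follows a simple recurrence: each layer adds (kernel βˆ’ 1) Γ— (product of the strides of all earlier layers) to the receptive field. A sketch with an illustrative helper:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each new layer widens the field
        jump *= stride              # strides compound for later layers
    return rf

print(receptive_field([(3, 1)]))                  # 3: one 3x3 layer
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 layers
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three stacked 3x3 layers
```

This matches the claim in the text: two stacked 3Γ—3 layers see an effective 5Γ—5 region, and each extra 3Γ—3 layer adds 2 more pixels (more if strides exceed 1).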

🧱 CNN Layer Types

Conv2D
  • Applies multiple learnable filters to detect spatial patterns, producing a stack of feature maps, one per filter. Non-linearity is introduced by an activation (typically ReLU) applied immediately after.
  • Key hyperparameters: number of filters, kernel size (e.g. 3Γ—3), stride, padding (same/valid), activation function.

Max Pooling
  • Divides the feature map into non-overlapping rectangular windows and takes the maximum value from each. Reduces spatial dimensions, introduces limited translation invariance, and reduces computation.
  • Key hyperparameters: pool size (e.g. 2Γ—2), stride (default = pool size), padding.

Average Pooling
  • Like max pooling but takes the mean of each window. Preserves more background information; less aggressive than max pooling. Global Average Pooling (GAP) averages across the entire spatial dimension to produce a single value per feature map.
  • Key hyperparameters: pool size, stride. Global Average Pooling has no size parameter.

Batch Normalisation
  • Normalises the activations within each mini-batch to zero mean and unit variance, then applies a learnable scale (Ξ³) and shift (Ξ²). Stabilises and accelerates training, allows higher learning rates, and provides mild regularisation.
  • Key hyperparameters: momentum (for running statistics), epsilon (numerical stability), learnable Ξ³ and Ξ².

Fully Connected (Dense)
  • Standard MLP layer, typically used at the end of a CNN after flattening or Global Average Pooling. Combines the learned spatial features into class scores or a regression output.
  • Key hyperparameters: number of units, activation function.

Dropout
  • During training, randomly sets a fraction p of activations to zero, preventing any single neuron from becoming overly dominant and forcing the network to learn redundant representations. Disabled at inference; classical dropout rescales activations by 1βˆ’p at test time, while modern implementations use inverted dropout and instead scale by 1/(1βˆ’p) during training.
  • Key hyperparameters: dropout rate p (typically 0.2–0.5).
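As a concrete illustration of the dropout behaviour: most modern frameworks implement the inverted variant, scaling kept activations by 1/(1βˆ’p) during training so that inference needs no rescaling. A plain-Python sketch (the helper name is hypothetical):

```python
import random

def inverted_dropout(activations, p, training=True):
    """Inverted dropout: keep each unit with probability 1-p and scale it by
    1/(1-p) at training time; at inference the layer is the identity."""
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
acts = [1.0, 2.0, 3.0, 4.0]
dropped = inverted_dropout(acts, p=0.5)
# Each output is either 0.0 (dropped) or the input scaled by 1/(1-p) = 2.0
print(dropped)
```

Scaling during training keeps the expected activation unchanged, which is why no adjustment is needed when dropout is switched off at inference.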

Pooling Trade-offs

Pooling reduces spatial dimensions, which cuts memory and compute for subsequent layers and can help reduce overfitting. However, it discards precise spatial information: you know a feature was detected somewhere in that region, not exactly where. This is fine for classification (is there a cat?) but problematic for localisation (where is the cat?). Modern object detectors (YOLO, Faster R-CNN) and segmentation models (U-Net) use techniques such as feature pyramid networks and skip connections to recover spatial precision after pooling.
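The trade-off is visible in a toy example: 2Γ—2 max pooling records that a strong activation occurred in each window, but not where inside it (plain-Python sketch, illustrative helper):

```python
def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling (stride 2) over a 2D feature map."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 1],
        [4, 8, 0, 1],
        [5, 1, 9, 2],
        [0, 2, 3, 7]]

# The 8 was at (1,1) and the 9 at (2,2); after pooling we only know
# which 2x2 window each came from, not its exact position.
print(max_pool_2x2(fmap))  # [[8, 2], [5, 9]]
```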

πŸ›οΈ Classic Architectures

The history of CNNs is a story of progressively deeper, more ingenious architectures, each building on lessons from the last. These milestones shaped modern deep learning.

LeNet-5 (1998)

Yann LeCun's groundbreaking network, designed for handwritten digit recognition (MNIST). It established the Conv β†’ Pool β†’ Conv β†’ Pool β†’ FC pattern that all later CNNs would follow.

  • Year: 1998 (LeCun, Bottou, Bengio, Haffner)
  • Key innovation: first practical deep CNN end-to-end trained with backprop
  • Parameters: ~60,000
  • Input: 32Γ—32 grayscale; Output: 10 digit classes
  • Activation: tanh and sigmoid (ReLU not yet standard)

AlexNet (2012)

Won ImageNet ILSVRC 2012 by a stunning margin of almost 11 percentage points, triggering the deep learning revolution. It was the first CNN to exploit GPU parallelism and to use ReLU activations and dropout at scale.

  • Year: 2012 (Krizhevsky, Sutskever, Hinton)
  • Key innovations: ReLU activation, GPU training, dropout regularisation, data augmentation
  • Parameters: ~61 million
  • Top-5 ImageNet error: 15.3% (vs 26.2% runner-up)
  • Used 2 GTX 580 GPUs with model split across them

VGG16 / VGG19 (2014)

Oxford VGG group showed that depth matters: exclusively using 3Γ—3 convolutions stacked deeply outperformed larger kernels. VGG16/19 are still widely used as backbone encoders for feature extraction.

  • Year: 2014 (Simonyan & Zisserman)
  • Key innovation: uniform 3Γ—3 kernels; demonstrated depth vs width
  • Parameters: ~138 million (VGG16)
  • Top-5 error: 7.3% (ILSVRC 2014 runner-up)
  • Two 3Γ—3 Conv = one 5Γ—5 receptive field, but fewer params

ResNet (2015)

Microsoft Research introduced residual (skip) connections: the input to a block is added back to the block's output two (or three) layers later. This lets gradients flow directly through the shortcuts, enabling networks 100+ layers deep to train without vanishing gradients.

  • Year: 2015 (He, Zhang, Ren, Sun)
  • Key innovation: skip connections β€” H(x) = F(x) + x (residual learning)
  • Parameters: ResNet-50 ~25M, ResNet-152 ~60M
  • Top-5 error: 3.57% (ILSVRC 2015 winner β€” superhuman)
  • Variants: ResNet-18, 34, 50, 101, 152; Wide ResNet; ResNeXt
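The residual formulation H(x) = F(x) + x is simple enough to sketch directly. If the branch F outputs zeros, the block reduces to the identity, which is what makes very deep stacks trainable (illustrative sketch, not the actual ResNet block):

```python
def residual_block(x, f):
    """y = F(x) + x: the skip connection adds the input back unchanged."""
    return [a + b for a, b in zip(f(x), x)]

x = [1, 2, 3]

# A branch that has learned nothing (all zeros) leaves the block as the identity:
print(residual_block(x, lambda v: [0] * len(v)))        # [1, 2, 3]

# Otherwise the branch only has to model the *difference* from the input:
print(residual_block(x, lambda v: [2 * a for a in v]))  # [3, 6, 9]
```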

EfficientNet (2019)

Google Brain proposed compound scaling: systematically scale depth, width, and resolution together using a fixed ratio found by neural architecture search. EfficientNet-B7 achieved state-of-the-art with far fewer parameters than comparable models.

  • Year: 2019 (Tan & Le, Google Brain)
  • Key innovation: compound coefficient scales depth+width+resolution jointly
  • Parameters: EfficientNet-B0 ~5.3M, B7 ~66M
  • Top-1 ImageNet accuracy: B7 reaches 84.4%
  • Uses MBConv blocks (mobile inverted bottleneck convolutions)
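Compound scaling chooses a coefficient Ο† and scales depth by Ξ±^Ο†, width by Ξ²^Ο†, and resolution by Ξ³^Ο†, under the constraint Ξ± Β· Ξ²Β² Β· Ξ³Β² β‰ˆ 2 so that FLOPs roughly double per unit of Ο†. A sketch using the coefficients reported in the paper (Ξ± = 1.2, Ξ² = 1.1, Ξ³ = 1.15):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # grid-searched on EfficientNet-B0

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPs grow roughly with depth * width^2 * resolution^2, so the constraint
# alpha * beta**2 * gamma**2 ~= 2 doubles FLOPs per unit increase in phi.
print(round(alpha * beta**2 * gamma**2, 3))   # 1.92, close to 2

d, w, r = compound_scale(3)
print(round(d, 2), round(w, 2), round(r, 2))  # 1.73 1.33 1.52
```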

🌐 CNN Applications Beyond Images

The convolution operation is not limited to 2D images. Any data with local structure β€” sequences, audio, graphs β€” can benefit from convolutional processing.

Natural Language Processing
  • Input: sequence of word embeddings (shape: seq_len Γ— embed_dim)
  • CNN variant: 1D Conv (Conv1D) with filters of width 2, 3, 4, 5 tokens
  • Example tasks: sentence classification, sentiment analysis, spam detection. TextCNN (Kim 2014) used this approach with multiple filter widths in parallel.

Time Series
  • Input: univariate or multivariate time series (shape: time_steps Γ— features)
  • CNN variant: 1D Conv; dilated (atrous) convolutions for long-range dependencies
  • Example tasks: anomaly detection, sensor fault prediction, ECG/EEG classification, stock pattern recognition.

Audio
  • Input: mel-frequency spectrogram (shape: freq_bins Γ— time_frames Γ— 1)
  • CNN variant: 2D Conv on the spectrogram; or 1D Conv on the raw waveform (WaveNet)
  • Example tasks: keyword spotting, speaker identification, music genre classification, environmental sound recognition.

Video
  • Input: sequence of frames (shape: T Γ— H Γ— W Γ— C)
  • CNN variant: 3D Conv (C3D, I3D); separable 3D Conv for efficiency
  • Example tasks: action recognition, gesture detection, sports analytics, surveillance event detection.

Point Clouds / Graphs
  • Input: unstructured 3D points or graph node features
  • CNN variant: Graph Convolutional Networks (GCN); PointNet uses shared MLPs with symmetric aggregation
  • Example tasks: 3D object detection, LiDAR segmentation, molecular property prediction, social network analysis.
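Several of these variants are small twists on the same operation. The dilated (atrous) convolution mentioned for time series, for instance, simply spaces the kernel taps `dilation` steps apart, widening the receptive field without adding weights (plain-Python sketch, illustrative helper):

```python
def conv1d_dilated(signal, kernel, dilation=1):
    """1D convolution whose kernel taps are `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation          # input range one output covers
    return [sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(signal) - span)]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, -1]                                  # simple difference filter

# dilation=1 compares adjacent points; dilation=2 compares points 2 steps apart
print(conv1d_dilated(signal, kernel, dilation=1))  # [-1, -1, -1, -1, -1, -1, -1]
print(conv1d_dilated(signal, kernel, dilation=2))  # [-2, -2, -2, -2, -2, -2]
```

Stacking layers with dilations 1, 2, 4, 8, ... (as in WaveNet) grows the receptive field exponentially with depth, which is what makes long-range dependencies tractable.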

The Inductive Bias Principle

CNNs encode the assumption that important patterns are local and position-invariant. This is a strong inductive bias that works beautifully for natural images and many physical signals. However, when this assumption breaks down β€” for example, in tasks where global context matters as much as local features, or where data has no natural spatial structure β€” alternative architectures (Transformers, GNNs) may outperform CNNs. Choosing an architecture means choosing which inductive biases to embed in your model.