⏱ 7 min read 📊 Beginner 🗓 Updated Jan 2025

🧬 The Biological Inspiration

The human brain contains roughly 86 billion neurons, each connected to thousands of others through synapses. When neurophysiologist Warren McCulloch and logician Walter Pitts proposed the first mathematical model of a neuron in 1943, they launched one of the most consequential ideas in the history of science: that the logic of biological computation could one day be replicated in machines.

Biological Neuron

A biological neuron receives electrochemical signals through its dendrites from many upstream neurons. These signals are integrated in the cell body (soma). If the total signal exceeds a threshold, the neuron fires an electrical pulse down its axon to downstream neurons via synapses.

  • Dendrites: receive input signals from other neurons
  • Soma: integrates and processes incoming signals
  • Axon: transmits output signal to next neurons
  • Synapse: junction where signals transfer between neurons
  • Firing threshold: all-or-nothing action potential

Artificial Neuron (Perceptron)

The artificial neuron mirrors this structure mathematically. Each connection from a previous neuron carries a weight representing synaptic strength. A weighted sum of all inputs is computed in the node body. An activation function then decides the output — analogous to the firing threshold.

  • Inputs (x₁…xₙ): equivalent to dendrite signals
  • Weights (w₁…wₙ): equivalent to synaptic strength
  • Weighted sum + bias: equivalent to soma integration
  • Activation function: equivalent to firing threshold
  • Output: signal passed to next layer's neurons

Why the Analogy Is Imperfect

Biological neurons are far more complex than their artificial counterparts. Real neurons have diverse morphologies, use dozens of neurotransmitters, operate asynchronously, and exhibit plasticity at the subcellular level. Artificial neurons produce a graded output on every forward pass (not an all-or-nothing spike), perform deterministic floating-point arithmetic (not stochastic electrochemistry), and are organised into neat layers (real cortex is far messier). The analogy is a useful mental model, not a faithful simulation. Modern AI researchers rarely think of ANNs as brain models — they are mathematical function approximators that happen to share a structural metaphor with biology.

⚙️ Anatomy of an Artificial Neuron

Every artificial neuron performs the same two-step computation: a linear transformation followed by a non-linear activation. Understanding these steps in detail is the key to understanding everything that follows — from backpropagation to transformer attention.

# The two fundamental equations of every neuron

# Step 1: Weighted sum (linear transformation)
z = w₁·x₁ + w₂·x₂ + w₃·x₃ + … + wₙ·xₙ + b

# Compact vector notation
z = W · x + b        # dot product of weight vector and input vector, plus bias

# Step 2: Activation (non-linear transformation)
output = activation(z)

# Full neuron in one line:
output = activation(W · x + b)

# Example with 3 inputs and sigmoid activation:
x = [0.5, 0.8, 0.2]
w = [0.4, -0.6, 0.9]
b = 0.1
z = (0.4×0.5) + (-0.6×0.8) + (0.9×0.2) + 0.1
z = 0.20  - 0.48  + 0.18  + 0.10  = 0.00
output = sigmoid(0.00) = 1/(1+e⁻⁰) = 0.5
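The worked example above runs as ordinary Python; here is a minimal sketch (the function names `sigmoid` and `neuron` are ours, not from any particular library):

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """One artificial neuron: weighted sum plus bias, then activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# The worked example from above: z = 0.20 - 0.48 + 0.18 + 0.10 = 0.00
x = [0.5, 0.8, 0.2]
w = [0.4, -0.6, 0.9]
b = 0.1
print(neuron(x, w, b))  # 0.5, since sigmoid(0) = 0.5
```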
      

Inputs (x₁ … xₙ)

Inputs are the raw data or the outputs from the previous layer. They can represent pixel intensities, word embeddings, sensor readings — any numeric feature. A single neuron can receive hundreds or thousands of inputs simultaneously. Inputs are never modified during training; only the weights that connect to them change.

Fixed during forward pass · Can be any real number

Weights (w₁ … wₙ)

Each weight expresses how much a particular input should influence this neuron's output. A large positive weight amplifies that input; a large negative weight suppresses it. Weights are the learnable parameters — they are initialised randomly and refined by gradient descent during training. In a typical deep network, weights number in the millions.

Learned during training · Initialised randomly

Bias (b)

The bias is a single learnable scalar added to the weighted sum before the activation. It shifts the activation function left or right, allowing the neuron to fire even when all inputs are zero. Without a bias, every neuron's pre-activation would always pass through the origin, severely limiting the network's expressive power.

Learned during training · Adds offset flexibility
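To see the bias's effect concretely, here is a small sketch (all values are arbitrary): with all-zero inputs, the weighted sum reduces to the bias alone, so the bias decides the neuron's resting output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

zero_input = [0.0, 0.0, 0.0]
w = [0.4, -0.6, 0.9]

# With zero inputs the weighted sum is just b, so the output tracks the bias.
for b in (-2.0, 0.0, 2.0):
    z = sum(wi * xi for wi, xi in zip(w, zero_input)) + b
    print(b, sigmoid(z))
# Without a bias (b = 0), the output is pinned at sigmoid(0) = 0.5.
```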

Activation Function

Without a non-linear activation, stacking multiple layers would be equivalent to a single linear transformation — no matter how deep the network. Activation functions introduce non-linearity, enabling the network to model complex, curved decision boundaries. The choice of activation function significantly affects training dynamics.

Critical design choice · Introduces non-linearity
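The claim that stacked linear layers collapse into a single linear transformation can be verified directly; a minimal plain-Python sketch (all matrix values are arbitrary):

```python
# Two stacked linear layers with no activation:
#   y = W2 @ (W1 @ x + b1) + b2  ==  (W2 @ W1) @ x + (W2 @ b1 + b2)

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1, b1 = [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]
W2, b2 = [[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0]
x = [0.3, -0.7]

# Layer by layer (no activation between them)
h = [hi + bi for hi, bi in zip(matvec(W1, x), b1)]
y_stacked = [yi + bi for yi, bi in zip(matvec(W2, h), b2)]

# Single equivalent linear layer
W = matmul(W2, W1)
b = [bi + ci for bi, ci in zip(matvec(W2, b1), b2)]
y_single = [yi + bi for yi, bi in zip(matvec(W, x), b)]

print(y_stacked, y_single)  # identical up to float rounding
```

However deep the stack, without non-linear activations the whole network is just one `W` and one `b`.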

🗂️ Layer Types

Neurons are organised into layers — groups of neurons that all receive the same upstream outputs and send their own outputs to the same downstream layer. Every feedforward neural network has at least three types of layers.

Each layer type, with its role, typical size, and notes:

  • Input Layer: receives raw feature data and passes it unchanged to the first hidden layer; no computation occurs here. Size matches the feature dimensionality (e.g. 784 for 28×28 MNIST images). It has no weights or activations; it is purely a data ingestion point, often paired with normalisation preprocessing.
  • Hidden Layers: perform successive learned transformations, building progressively abstract representations of the data. Size varies widely: 64–4096 neurons per layer in typical MLPs. Each neuron in a dense (fully connected) hidden layer connects to every neuron in the adjacent layers; the "depth" of a network is the number of hidden layers.
  • Output Layer: produces the final prediction, with structure determined by the task (binary: 1 neuron; multi-class: C neurons, one per class; regression: 1+ neurons). Activation choice depends on the task: sigmoid for binary, softmax for multi-class, linear (none) for regression.
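A tiny network with all three layer types can be sketched in a few lines of plain Python (all weights and inputs are arbitrary illustrative values):

```python
import math

def relu(z): return max(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b, act):
    """One fully connected layer: every output neuron sees every input."""
    return [act(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# 3 input features -> 4 hidden neurons (ReLU) -> 1 output (sigmoid)
x = [0.5, -0.2, 0.8]          # input layer: raw features, no computation
W_h = [[0.1, 0.4, -0.3],
       [-0.2, 0.5, 0.2],
       [0.3, -0.1, 0.6],
       [0.0, 0.2, -0.4]]
b_h = [0.1, 0.0, -0.1, 0.2]
W_o = [[0.7, -0.5, 0.3, 0.1]]
b_o = [0.05]

h = dense(x, W_h, b_h, relu)      # hidden layer
y = dense(h, W_o, b_o, sigmoid)   # output layer: a binary probability
print(y)  # a single value in (0, 1)
```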

Depth vs Width: The Architecture Trade-off

A deep network (many layers, fewer neurons each) learns hierarchical, compositional representations — early layers detect edges, later layers detect shapes, final layers detect objects. A wide network (few layers, many neurons each) can theoretically approximate any function but requires far more parameters and may not generalise as well. The Universal Approximation Theorem guarantees that a single hidden layer with enough neurons can approximate any continuous function — but "enough" can be astronomically large. In practice, depth is almost always more parameter-efficient than width for complex tasks, which is why "deep" learning dominates.
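To make the parameter-count comparison concrete, here is a small sketch counting weights and biases in fully connected networks (the layer sizes are arbitrary illustrative choices, not from the text):

```python
def mlp_params(layer_sizes):
    """Total weights + biases in a fully connected network."""
    return sum((n_in + 1) * n_out  # +1 counts each output neuron's bias
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Two networks with roughly equal parameter budgets, spent differently:
deep = [784, 128, 128, 128, 10]   # three modest hidden layers
wide = [784, 170, 10]             # one large hidden layer
print(mlp_params(deep), mlp_params(wide))  # 134794 vs 135160
```

Similar budgets, but the deep network can compose features across layers, which is where its practical advantage comes from.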

Shallow Networks (1-2 hidden layers)

Suitable for tabular data, simple regression, or classification where features are already well-engineered. Fast to train, easy to interpret, less prone to overfitting on small datasets.

Fast training · Tabular data · Low overfitting risk

Deep Networks (5+ hidden layers)

Excel at raw data (pixels, waveforms, tokens) where the network must learn its own features. Require more data, more compute, and careful regularisation, but achieve far better performance on complex perceptual tasks.

Images / Audio / Text · Needs large dataset · Feature learning

⚡ Activation Functions

The activation function determines whether and how strongly a neuron "fires." It must be non-linear (otherwise deep networks collapse to single layers) and should be differentiable almost everywhere (so gradients can be computed during backpropagation).

Each function, with its formula, output range, and primary use case:

  • Sigmoid (σ): σ(z) = 1 / (1 + e⁻ᶻ); range (0, 1). Output layer for binary classification; historically used in hidden layers before ReLU's rise.
  • Tanh: tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ); range (−1, 1). Hidden layers in RNNs; its zero-centred output is advantageous for gradient flow compared to sigmoid.
  • ReLU: ReLU(z) = max(0, z); range [0, ∞). Default for hidden layers in CNNs and MLPs; computationally trivial with strong empirical performance.
  • Leaky ReLU: f(z) = z if z > 0, else αz (α ≈ 0.01); range (−∞, ∞). Drop-in replacement for ReLU when the dying-neuron problem is observed; the small negative slope keeps the gradient alive.
  • Softmax: σ(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ; range (0, 1), with outputs summing to 1. Output layer for multi-class classification; converts raw logits into a proper probability distribution.
  • GELU: GELU(z) = z·Φ(z), where Φ is the standard normal CDF; range approximately [−0.17, ∞). Used in modern Transformers (BERT, GPT); a smooth approximation of ReLU, motivated by stochastic regularisation, with better gradient properties.
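The activation functions above are simple to implement; a minimal pure-Python sketch (using the exact GELU form via the Gaussian CDF Φ, not the common tanh approximation):

```python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))
def tanh(z): return math.tanh(z)
def relu(z): return max(0.0, z)
def leaky_relu(z, alpha=0.01): return z if z > 0 else alpha * z

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```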

Why ReLU Dominates Today

Before ReLU (introduced to deep learning by Nair & Hinton in 2010), sigmoid and tanh were the go-to activations. The problem: both saturate at large magnitudes — their gradients approach zero, making learning in deep networks painfully slow (the vanishing gradient problem). ReLU has a gradient of exactly 1 for all positive inputs, meaning gradients flow through it unchanged. This simple property allowed researchers to successfully train networks dozens of layers deep for the first time.
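The saturation argument can be checked numerically; a small sketch comparing the two gradients (helper names are ours):

```python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25, vanishes for large |z|

def relu_grad(z): return 1.0 if z > 0 else 0.0

for z in (0.0, 5.0, 10.0):
    print(z, sigmoid_grad(z), relu_grad(z))
# sigmoid's gradient at z = 10 is ~4.5e-5; multiplied across many layers,
# the error signal vanishes. ReLU's gradient stays exactly 1 for z > 0.
```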

The Dying ReLU Problem

ReLU outputs zero for all negative inputs. A neuron that consistently receives negative inputs will always output zero and its gradient will be zero — it learns nothing and is permanently "dead." This can happen when learning rates are too high or weights are initialised poorly. Solutions include: Leaky ReLU (small negative slope), ELU (smooth negative region), or careful weight initialisation (He initialisation is standard for ReLU networks).
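A quick sketch of why a dead ReLU neuron stops learning, and how Leaky ReLU avoids it (helper names are ours):

```python
def relu(z): return max(0.0, z)
def relu_grad(z): return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01): return z if z > 0 else alpha * z
def leaky_relu_grad(z, alpha=0.01): return 1.0 if z > 0 else alpha

z = -2.5  # a neuron whose weighted sum is consistently negative
print(relu(z), relu_grad(z))              # output 0 AND gradient 0: nothing to learn from
print(leaky_relu(z), leaky_relu_grad(z))  # small nonzero gradient keeps learning alive
```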

🏗️ Network Architectures Overview

The neurons-layers-activations framework is a foundation. Different tasks call for different ways of connecting and organising these components. The following four families represent the major architectural lineages in deep learning.

Feedforward / MLP

The simplest architecture: information flows in one direction only, from input to output, through one or more fully connected hidden layers. No cycles, no memory. Also called a Multilayer Perceptron (MLP).

Best for: tabular data, simple classification/regression, as the final "head" on top of other architectures.

Dense layers only · Universal approximator

Convolutional (CNN)

Uses convolutional filters that slide across spatial inputs to detect local patterns (edges, textures, shapes). Weight sharing dramatically reduces parameter count. Hierarchical feature extraction makes CNNs exceptional at image and audio tasks.

Best for: images, video, audio spectrograms, 1D time series.

Spatial data · Weight sharing

→ Deep dive: Convolutional Networks

Recurrent (RNN / LSTM)

Processes sequential data one step at a time, maintaining a hidden state that acts as memory of previous steps. Enables variable-length input/output and temporal reasoning. LSTMs and GRUs address the vanishing gradient problem of vanilla RNNs.

Best for: time series, natural language, speech recognition, video sequences.

Sequential data · Memory state

→ Deep dive: Recurrent Networks

Transformer

Replaces recurrence with a self-attention mechanism that directly models relationships between all positions in a sequence simultaneously. Massively parallelisable during training. Powers all modern large language models (GPT, BERT, LLaMA) and vision models (ViT).

Best for: NLP, code generation, multimodal tasks, increasingly vision.

Attention mechanism · Scalable

→ Deep dive: Transformers & Attention

How Architectures Build on Each Other

These families are not mutually exclusive. Modern models often combine them: a Vision Transformer (ViT) applies Transformer attention to image patches. A speech model might use CNNs for feature extraction and an RNN decoder for transcription. Understanding the foundational neuron and layer types covered in this page is prerequisite knowledge for all of these — the later pages in this series each zoom in on one architectural family in depth.