⚙ Why NumPy?
The Problem with Python Lists
Python lists are general-purpose containers. Each element is a Python object with its own type tag, reference count, and memory allocation. Looping over a million integers in a Python list means a million object dereferences — slow by design.
- Dynamic typing: Python checks every element's type at runtime
- No contiguous memory: list elements scatter across the heap
- No BLAS/LAPACK integration: can't dispatch to optimised Fortran routines
- GIL limits true multi-threaded parallelism for CPU-bound loops
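The per-element boxing overhead is easy to measure. As a rough illustration (exact figures vary by CPython version), comparing the memory footprint of a list of a million integers against the equivalent ndarray:

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores n pointers plus n separate int objects (~28 bytes each on CPython)
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# The ndarray stores n raw 8-byte integers in one contiguous buffer
array_bytes = np_array.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")
print(f"array: ~{array_bytes / 1e6:.0f} MB")
```

The list version costs several times the memory before a single computation runs, and every access chases a pointer.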
How NumPy Solves This
NumPy's ndarray stores elements in a single contiguous block of memory with a fixed dtype. This allows vectorised operations that call into highly optimised C and Fortran libraries (OpenBLAS, MKL) without any Python overhead in the inner loop.
- Fixed dtype: one type check at array creation, not per element
- Contiguous memory: CPU cache-friendly, SIMD-friendly
- BLAS/LAPACK: matrix multiply achieves near-peak FLOP/s
- Vectorisation: element-wise ops run in C with zero Python loop
Benchmark Intuition
A simple sum of 10 million floats illustrates the difference dramatically. The NumPy version offloads work to a tight C loop and may use AVX-512 SIMD instructions to process 8 doubles per instruction.
- Python loop over list: ~1000 ms for 10M elements
- sum() built-in over list: ~80 ms (C loop, but boxing overhead)
- np.sum() over ndarray: ~5 ms (SIMD, no boxing)
- np.dot() for matrix multiply: orders of magnitude over nested loops
Installation & Import Convention
NumPy ships with every major Python distribution and is a dependency of pandas, scikit-learn, TensorFlow, and PyTorch. The community alias np is universal — you will see it in every textbook, tutorial, and Stack Overflow answer.
# Install (only needed once)
pip install numpy
# The universal import alias
import numpy as np
# Verify version
print(np.__version__) # e.g., 1.26.4
# ndarray vs list — same data, different performance
import time
py_list = list(range(10_000_000))
np_array = np.arange(10_000_000, dtype=np.float64)
t0 = time.perf_counter()
s1 = sum(py_list) # Python built-in
print(f"Python sum: {time.perf_counter()-t0:.3f}s")
t0 = time.perf_counter()
s2 = np.sum(np_array) # NumPy vectorised
print(f"NumPy sum: {time.perf_counter()-t0:.3f}s")
# Typical output: Python ~0.08s, NumPy ~0.005s
📈 Arrays & Dtypes
The ndarray Object
Every NumPy array is an instance of numpy.ndarray. Three attributes describe its structure completely: shape (tuple of dimension sizes), dtype (element type), and strides (bytes to step in each dimension).
- a.shape — tuple, e.g. (3, 4) for a 3×4 matrix
- a.dtype — e.g. float64, int32, bool, complex128
- a.ndim — number of dimensions (len of shape)
- a.size — total number of elements
- a.nbytes — total memory in bytes
- a.strides — bytes between elements per axis
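All six attributes can be inspected on a single small array; for a C-ordered float64 matrix the strides follow directly from the shape and item size:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)
print(a.shape)    # (3, 4)
print(a.dtype)    # float64
print(a.ndim)     # 2
print(a.size)     # 12
print(a.nbytes)   # 96 — 12 elements × 8 bytes
print(a.strides)  # (32, 8) — next row is 32 bytes away, next column 8
```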
Common Dtypes
Choosing the right dtype saves memory and can unlock hardware-specific optimisations. Deep learning frameworks default to float32; scientific computing often uses float64 for precision.
- float64 (f8) — default for floating-point, 8 bytes
- float32 (f4) — standard for ML/GPU work, 4 bytes
- float16 (f2) — half precision for GPU memory saving
- int64 (i8) — large integers, indices
- int32 (i4) — compact integers
- bool — masks, conditions, 1 byte per element
- complex128 — signal processing, FFT
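The memory impact is direct: down-casting with astype halves (or quarters) the footprint, at the cost of precision:

```python
import numpy as np

x64 = np.ones(1_000_000, dtype=np.float64)
x32 = x64.astype(np.float32)  # explicit down-cast — half the memory
x16 = x64.astype(np.float16)  # quarter the memory, ~3 decimal digits

print(x64.nbytes)  # 8000000
print(x32.nbytes)  # 4000000
print(x16.nbytes)  # 2000000
```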
Shape & Reshape
Reshape returns a view when possible (same data, new strides). Use -1 as a wildcard dimension and NumPy infers it. Flatten always returns a copy; ravel returns a view when memory is contiguous.
- a.reshape(3, 4) — explicit new shape
- a.reshape(-1, 4) — infer first dimension
- a.reshape(1, -1) — row vector (1, N)
- a.reshape(-1, 1) — column vector (N, 1)
- a.flatten() — 1D copy
- a.ravel() — 1D view (if possible)
- a.T — transpose (view, no copy)
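The view/copy distinction has real consequences: writing through a view changes the original array, writing through a copy does not. A small sketch:

```python
import numpy as np

a = np.arange(12)
v = a.reshape(3, 4)   # view — shares a's buffer
v[0, 0] = 99
print(a[0])           # 99 — the original changed too

f = a.flatten()       # always a copy
f[1] = -1
print(a[1])           # 1 — original untouched

r = a.ravel()         # view here, since a is contiguous
print(r.base is a)    # True — r points at a's memory
```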
import numpy as np
# ── 10 Essential Array Creation Methods ──────────────────────────────────────
# 1. From Python list / nested list
a = np.array([1, 2, 3, 4, 5]) # 1D, dtype inferred (int64)
M = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32) # 2D, explicit dtype
# 2. Zeros and ones
z = np.zeros((3, 4)) # 3×4 matrix of 0.0 (float64)
o = np.ones((2, 3, 4)) # 3D tensor of 1.0
e = np.empty((100, 100)) # Uninitialized — fastest, for overwriting
# 3. arange — like Python range() but returns ndarray
r = np.arange(0, 10, 2) # [0 2 4 6 8] (start, stop, step)
r_f = np.arange(0.0, 1.0, 0.1) # Floats work too
# 4. linspace — N evenly spaced points (inclusive of stop)
ls = np.linspace(0, 1, 11) # [0.0, 0.1, 0.2, ..., 1.0] — 11 points
# 5. Identity matrix
I = np.eye(4) # 4×4 identity (float64)
I3 = np.eye(4, dtype=int) # Integer identity
# 6. Random arrays (seed for reproducibility)
rng = np.random.default_rng(seed=42) # New-style generator (preferred)
u = rng.random((3, 3)) # Uniform [0, 1)
n = rng.standard_normal((100,)) # Standard normal N(0,1)
i = rng.integers(0, 10, size=(5, 5)) # Random integers [0, 10)
# 7. Full — fill with constant value
f = np.full((2, 4), fill_value=3.14) # 2×4 matrix all 3.14
f_like = np.full_like(M, 0.5) # Same shape+dtype as M, filled 0.5
# 8. Diagonal matrix
d = np.diag([1, 2, 3, 4]) # 4×4 matrix with given diagonal
diag_vals = np.diag(d) # Extract diagonal from existing matrix
# 9. Reshape examples
flat = np.arange(24) # [0..23]
cube = flat.reshape(2, 3, 4) # 2×3×4 tensor — view, no copy
col_vec = flat.reshape(-1, 1) # (24, 1) column vector
# 10. Stack arrays
row1 = np.array([1, 2, 3])
row2 = np.array([4, 5, 6])
stacked = np.vstack([row1, row2]) # (2, 3)
hstacked = np.hstack([row1, row2]) # (6,)
concatenated = np.concatenate([row1, row2], axis=0)
print(cube.shape, cube.dtype, cube.nbytes)
🔍 Indexing, Slicing & Boolean Masks
Basic Indexing
NumPy uses zero-based indexing like Python. Multi-dimensional arrays are indexed with comma-separated integers. Negative indices count from the end. Basic indexing that yields a subarray returns a view — no data is copied; selecting a single element returns a scalar.
- a[2] — third element of 1D array
- M[1, 2] — row 1, column 2 of 2D matrix
- M[-1, -1] — bottom-right corner
- T[0, 1, 2] — element in 3D tensor
Slicing
Slices follow start:stop:step syntax per axis. Omitted values default to start=0, stop=end, step=1. Slices return views — modifying the slice modifies the original. Use .copy() to avoid this.
- a[2:5] — elements 2, 3, 4
- a[::2] — every other element
- a[::-1] — reversed array (view)
- M[1:3, 0:2] — submatrix rows 1-2, cols 0-1
- M[:, 0] — entire first column
- M[0, :] — entire first row
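The slice-is-a-view rule is a classic gotcha worth seeing once:

```python
import numpy as np

a = np.arange(10)
s = a[2:5]          # view into a
s[0] = 100
print(a[2])         # 100 — original modified through the slice

c = a[2:5].copy()   # independent copy
c[0] = -1
print(a[2])         # still 100 — writes to the copy don't touch a
```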
Fancy & Boolean Indexing
Fancy indexing uses an integer array or list as an index — it always returns a copy. Boolean masks select elements where the mask is True. These are the workhorses of data filtering in ML pipelines.
- a[[0, 3, 5]] — elements at positions 0, 3, 5
- M[[0, 2], :] — rows 0 and 2
- a[a > threshold] — elements above threshold
- a[(a > 2) & (a < 8)] — combined condition
- np.where(cond, x, y) — element-wise ternary
import numpy as np
rng = np.random.default_rng(42)
# ── 1D Examples ───────────────────────────────────────────────────────────────
a = np.array([10, 20, 30, 40, 50, 60, 70, 80])
print(a[2]) # 30
print(a[-1]) # 80
print(a[1:5]) # [20 30 40 50]
print(a[::3]) # [10 40 70] every 3rd
print(a[::-1]) # [80 70 60 50 40 30 20 10]
# ── 2D Examples ───────────────────────────────────────────────────────────────
M = np.arange(16).reshape(4, 4)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]]
print(M[2, 1]) # 9
print(M[1:3, 1:3]) # [[5 6],[9 10]] — submatrix view
print(M[:, -1]) # [ 3 7 11 15] — last column
print(M[0]) # [0 1 2 3] — first row (shorthand for M[0,:])
# ── Fancy Indexing ────────────────────────────────────────────────────────────
rows = np.array([0, 2, 3])
print(M[rows]) # rows 0, 2, 3 — COPY, not view
print(M[rows, 1]) # elements M[0,1], M[2,1], M[3,1] = [1, 9, 13]
# ── Boolean Masking ───────────────────────────────────────────────────────────
data = rng.standard_normal(1000) # 1000 samples from N(0,1)
# Extract all values above 1.5 sigma
outliers = data[data > 1.5]
print(f"Values > 1.5: {len(outliers)} ({len(outliers)/1000*100:.1f}%)")
# Multi-condition filter
in_range = data[(data > -1.0) & (data < 1.0)]
print(f"Values in (-1, 1): {len(in_range)}")
# np.where — ternary element-wise operation
clipped = np.where(data > 2.0, 2.0, np.where(data < -2.0, -2.0, data))
print(f"Clipped range: [{clipped.min():.2f}, {clipped.max():.2f}]")
# Practical ML example: filter a dataset by label
X = rng.standard_normal((200, 10)) # 200 samples, 10 features
y = rng.integers(0, 3, 200) # Labels: 0, 1, 2
# Select only samples with label == 1
X_class1 = X[y == 1]
print(f"Class 1 samples: {X_class1.shape}")
# 3D indexing example (batch of images)
images = rng.random((32, 28, 28)) # 32 grayscale 28x28 images
first_img = images[0] # (28, 28)
top_left_patch = images[:, :14, :14] # (32, 14, 14) all images, top-left quadrant
print(top_left_patch.shape)
🔁 Broadcasting
Broadcasting is NumPy's mechanism for performing arithmetic on arrays with different — but compatible — shapes without copying data. It is one of the most powerful features for writing concise, memory-efficient numerical code.
The Three Broadcasting Rules
1. If arrays have different numbers of dimensions, the shape of the smaller array is padded with 1s on the left.
2. Arrays with size 1 along a dimension are stretched (conceptually) to match the other array's size in that dimension.
3. If neither array has size 1 in a mismatched dimension, a ValueError is raised.
| Shape A | Shape B | Result Shape | Valid? | Notes |
|---|---|---|---|---|
| (5,) | scalar | (5,) | Yes | Scalar broadcasts to any shape |
| (3, 4) | (4,) | (3, 4) | Yes | Row vector adds to each row |
| (3, 1) | (1, 4) | (3, 4) | Yes | Outer product pattern |
| (32, 28, 28) | (28, 28) | (32, 28, 28) | Yes | Per-image bias on batch |
| (3, 4) | (3,) | — | No | Right-align: (3,) pads to (1, 3); trailing dims 4 vs 3 mismatch |
| (3, 4) | (3, 1) | (3, 4) | Yes | Column vector adds to each column |
| (2, 3, 4) | (3, 4) | (2, 3, 4) | Yes | Trailing dims match |
| (2, 3) | (2, 4) | — | No | 3 vs 4 with neither being 1 |
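Every row of the table can be checked mechanically with np.broadcast_shapes (available since NumPy 1.20), which applies the three rules without allocating any data:

```python
import numpy as np

print(np.broadcast_shapes((3, 1), (1, 4)))          # (3, 4)
print(np.broadcast_shapes((32, 28, 28), (28, 28)))  # (32, 28, 28)
print(np.broadcast_shapes((3, 4), (3, 1)))          # (3, 4)

try:
    np.broadcast_shapes((3, 4), (3,))  # 4 vs 3 in the trailing dim
except ValueError as e:
    print("incompatible:", e)
```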
Broadcasting Avoids Loops
Without broadcasting, adding a bias vector to every row of a matrix requires an explicit Python loop or np.tile (which copies memory). Broadcasting does it in one C-level operation without materialising a tiled copy of the operand.
import numpy as np
# ── Scalar × Array ────────────────────────────────────────────────────────────
a = np.array([1.0, 2.0, 3.0, 4.0])
print(a * 2) # [2. 4. 6. 8.] — scalar broadcast to (4,)
print(a + 10) # [11. 12. 13. 14.]
print(a ** 2) # [ 1. 4. 9. 16.]
# ── Row Vector + Column Vector (outer-product-style grid) ─────────────────────
rows = np.array([[0], [10], [20], [30]]) # shape (4, 1)
cols = np.array([0, 1, 2, 3]) # shape (4,) → treated as (1, 4)
grid = rows + cols
# [[ 0  1  2  3]
#  [10 11 12 13]
#  [20 21 22 23]
#  [30 31 32 33]]
print(grid)
# ── Image Normalisation (common in ML preprocessing) ──────────────────────────
# Suppose each image has per-channel mean and std from the training set
images = np.random.rand(32, 3, 224, 224).astype(np.float32) # (batch, C, H, W)
channel_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32) # ImageNet means
channel_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
# Reshape to (1, 3, 1, 1) for broadcasting over batch and spatial dims
mean_bc = channel_mean.reshape(1, 3, 1, 1)
std_bc = channel_std.reshape(1, 3, 1, 1)
images_normalised = (images - mean_bc) / std_bc
print(images_normalised.shape) # (32, 3, 224, 224) — no extra memory for mean/std
# ── L2 Normalisation of Embedding Matrix ──────────────────────────────────────
embeddings = np.random.randn(1000, 128) # 1000 embeddings, dim 128
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) # (1000, 1)
unit_embeddings = embeddings / norms # broadcast: (1000, 128) / (1000, 1)
# Verify all unit length
print(np.allclose(np.linalg.norm(unit_embeddings, axis=1), 1.0)) # True
# ── Pairwise Distance Matrix (broadcast trick) ────────────────────────────────
A = np.random.randn(5, 3) # 5 points in R^3
B = np.random.randn(7, 3) # 7 points
# A[:, np.newaxis] → (5, 1, 3); B[np.newaxis, :] → (1, 7, 3)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :] # (5, 7, 3)
dist = np.sqrt((diff**2).sum(axis=2)) # (5, 7)
print(dist.shape) # (5, 7) — all pairwise distances, no loops
🧮 Linear Algebra & ML Ops
Matrix Multiplication
Matrix multiplication is the core computation in every neural network forward pass. NumPy provides multiple syntaxes that all call into the same BLAS routine.
- np.dot(A, B) — 1D and 2D matrix product; semantics differ from @ for ndim > 2
- A @ B — Python 3.5+ matmul operator (preferred)
- np.matmul(A, B) — explicit, same as @
- A * B — element-wise (Hadamard), NOT matrix multiply
- A.T — transpose; A.conj().T — conjugate transpose
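The @ vs * distinction is a frequent source of silent bugs, since both run without error on same-shaped matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(A @ B)  # [[19. 22.]
              #  [43. 50.]] — true matrix product
print(A * B)  # [[ 5. 12.]
              #  [21. 32.]] — element-wise, NOT a matrix product
print(np.allclose(A @ B, np.dot(A, B)))  # True — identical for 2D arrays
```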
np.linalg Toolkit
The numpy.linalg submodule wraps LAPACK for reliable, numerically stable implementations of core linear algebra operations.
- np.linalg.inv(A) — matrix inverse (avoid when possible)
- np.linalg.solve(A, b) — solve Ax = b (preferred over inv)
- np.linalg.svd(A) — singular value decomposition
- np.linalg.eig(A) — eigenvalues and eigenvectors
- np.linalg.norm(v) — vector/matrix norm
- np.linalg.det(A) — determinant
- np.linalg.matrix_rank(A) — numerical rank
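Why solve is preferred over inv: on a well-conditioned system both give the same answer, but solve factorises A directly (one LU decomposition) instead of forming the full inverse, which is both faster and numerically more stable. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)  # LU factorisation — preferred
x_inv = np.linalg.inv(A) @ b     # explicit inverse — slower, less stable

print(np.allclose(x_solve, x_inv))   # same solution here
print(np.allclose(A @ x_solve, b))   # residual check: Ax really equals b
```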
np.einsum
Einstein summation notation provides a concise, general syntax for any tensor contraction. Performance is comparable to explicit BLAS calls for common operations and often faster for exotic ones.
- 'ij,jk->ik' — matrix multiply
- 'ij,ij->i' — row-wise dot products
- 'ijk,jl->ikl' — contract a 3D tensor with a matrix over shared axis j
- 'ii->' — trace of a matrix
- 'ij->ji' — transpose
- optimize='optimal' for multi-operand chains
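Each signature above can be verified against its explicit equivalent, which is also a good habit when writing a new einsum expression:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
M = rng.standard_normal((4, 4))

# Matrix multiply vs @
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))               # True
# Row-wise dot products vs elementwise-multiply-and-sum
print(np.allclose(np.einsum('ij,ij->i', A, A), (A * A).sum(axis=1)))  # True
# Trace vs np.trace
print(np.isclose(np.einsum('ii->', M), np.trace(M)))                  # True
```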
import numpy as np
# ── Implement a Single Linear (Dense) Layer from Scratch ──────────────────────
# y = x @ W + b
# x: (batch, in_features)
# W: (in_features, out_features)
# b: (out_features,)
# y: (batch, out_features)
rng = np.random.default_rng(42)
batch_size = 32
in_features = 128
out_features = 64
# Xavier / Glorot uniform initialisation (standard for dense layers)
limit = np.sqrt(6.0 / (in_features + out_features))
W = rng.uniform(-limit, limit, (in_features, out_features)).astype(np.float32)
b = np.zeros(out_features, dtype=np.float32)
x = rng.standard_normal((batch_size, in_features)).astype(np.float32)
# Forward pass
y = x @ W + b # broadcast: b has shape (64,) → (1,64) → (32,64)
print(f"Output shape: {y.shape}") # (32, 64)
# ── ReLU activation ───────────────────────────────────────────────────────────
def relu(z):
    return np.maximum(0.0, z)  # element-wise max with 0 — returns a new array
y_activated = relu(y)
print(f"ReLU zeros: {(y_activated == 0).sum()} / {y_activated.size}")
# ── Softmax ───────────────────────────────────────────────────────────────────
def softmax(z):
    # Subtract max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
logits = rng.standard_normal((batch_size, 10)).astype(np.float32)
probs = softmax(logits)
print(f"Probs sum to 1: {np.allclose(probs.sum(axis=1), 1.0)}") # True
# ── Cross-Entropy Loss ────────────────────────────────────────────────────────
def cross_entropy(probs, labels):
    n = probs.shape[0]
    correct_probs = probs[np.arange(n), labels]  # fancy indexing
    return -np.mean(np.log(correct_probs + 1e-9))  # log + epsilon for stability
labels = rng.integers(0, 10, batch_size)
loss = cross_entropy(probs, labels)
print(f"Cross-entropy loss: {loss:.4f}")
# ── SVD for Dimensionality Reduction (manual PCA) ─────────────────────────────
data = rng.standard_normal((500, 50)) # 500 samples, 50 features
data_centred = data - data.mean(axis=0) # centre the data
U, S, Vt = np.linalg.svd(data_centred, full_matrices=False)
# Project onto top 10 principal components
k = 10
X_pca = data_centred @ Vt[:k].T # (500, 10)
explained = (S[:k]**2).sum() / (S**2).sum()
print(f"Top {k} PCs explain {explained*100:.1f}% of variance")
# ── Batched einsum: attention-like score matrix ───────────────────────────────
# Q, K: (batch, heads, seq_len, d_head)
Q = rng.standard_normal((4, 8, 64, 32)).astype(np.float32)
K = rng.standard_normal((4, 8, 64, 32)).astype(np.float32)
# Scaled dot-product attention scores: (batch, heads, seq_len, seq_len)
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(32.0)
print(f"Attention scores shape: {scores.shape}") # (4, 8, 64, 64)
NumPy as ML Foundation
Every operation above — linear layers, activations, losses, PCA, attention — runs as fast NumPy code with no frameworks. Understanding these primitives makes framework abstractions transparent and debugging trivial. When a PyTorch or TensorFlow op behaves unexpectedly, you can always re-implement it in NumPy to isolate the issue.