⚙ Why NumPy?
The Problem with Python Lists
Python lists are general-purpose containers. Each element is a Python object with its own type tag, reference count, and memory allocation. Looping over a million integers in a Python list means a million object dereferences — slow by design.
- Dynamic typing: Python checks every element's type at runtime
- No contiguous memory: list elements scatter across the heap
- No BLAS/LAPACK integration: can't dispatch to optimised Fortran routines
- GIL limits true multi-threaded parallelism for CPU-bound loops
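The per-element boxing overhead is easy to measure. As a rough illustration (exact figures vary by CPython version), comparing the memory footprint of a list of a million integers against the equivalent ndarray:

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores n pointers plus n separate int objects (~28 bytes each on CPython)
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# The ndarray stores n raw 8-byte integers in one contiguous buffer
array_bytes = np_array.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")
print(f"array: ~{array_bytes / 1e6:.0f} MB")
```

The list version costs several times the memory before a single computation runs, and every access chases a pointer.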
How NumPy Solves This
NumPy's ndarray stores elements in a single contiguous block of memory with a fixed dtype. This allows vectorised operations that call into highly optimised C and Fortran libraries (OpenBLAS, MKL) without any Python overhead in the inner loop.
- Fixed dtype: one type check at array creation, not per element
- Contiguous memory: CPU cache-friendly, SIMD-friendly
- BLAS/LAPACK: matrix multiply achieves near-peak FLOP/s
- Vectorisation: element-wise ops run in C with zero Python loop
Benchmark Intuition
A simple sum of 10 million floats illustrates the difference dramatically. The NumPy version offloads work to a tight C loop and may use AVX-512 SIMD instructions to process 8 doubles per instruction.
- Python loop over list: ~1000 ms for 10M elements
- sum() built-in over list: ~80 ms (C loop, but boxing overhead)
- np.sum() over ndarray: ~5 ms (SIMD, no boxing)
- np.dot() for matrix multiply: orders of magnitude over nested loops
Installation & Import Convention
NumPy ships with every major Python distribution and is a dependency of pandas, scikit-learn, TensorFlow, and PyTorch. The community alias np is universal — you will see it in every textbook, tutorial, and Stack Overflow answer.
# Install (only needed once)
pip install numpy
# The universal import alias
import numpy as np
# Verify version
print(np.__version__) # e.g., 1.26.4
# ndarray vs list — same data, different performance
import time
py_list = list(range(10_000_000))
np_array = np.arange(10_000_000, dtype=np.float64)
t0 = time.perf_counter()
s1 = sum(py_list) # Python built-in
print(f"Python sum: {time.perf_counter()-t0:.3f}s")
t0 = time.perf_counter()
s2 = np.sum(np_array) # NumPy vectorised
print(f"NumPy sum: {time.perf_counter()-t0:.3f}s")
# Typical output: Python ~0.08s, NumPy ~0.005s
📈 Arrays & Dtypes
The ndarray Object
Every NumPy array is an instance of numpy.ndarray. Three attributes describe its structure completely: shape (tuple of dimension sizes), dtype (element type), and strides (bytes to step in each dimension).
- a.shape — tuple, e.g. (3, 4) for a 3×4 matrix
- a.dtype — e.g. float64, int32, bool, complex128
- a.ndim — number of dimensions (len of shape)
- a.size — total number of elements
- a.nbytes — total memory in bytes
- a.strides — bytes between elements per axis
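All six attributes can be inspected on a single small array; for a C-ordered float64 matrix the strides follow directly from the shape and item size:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)
print(a.shape)    # (3, 4)
print(a.dtype)    # float64
print(a.ndim)     # 2
print(a.size)     # 12
print(a.nbytes)   # 96 — 12 elements × 8 bytes
print(a.strides)  # (32, 8) — next row is 32 bytes away, next column 8
```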
Common Dtypes
Choosing the right dtype saves memory and can unlock hardware-specific optimisations. Deep learning frameworks default to float32; scientific computing often uses float64 for precision.
- float64 (f8) — default for floating-point, 8 bytes
- float32 (f4) — standard for ML/GPU work, 4 bytes
- float16 (f2) — half precision for GPU memory saving
- int64 (i8) — large integers, indices
- int32 (i4) — compact integers
- bool — masks, conditions, 1 byte per element
- complex128 — signal processing, FFT
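The memory impact is direct: down-casting with astype halves (or quarters) the footprint, at the cost of precision:

```python
import numpy as np

x64 = np.ones(1_000_000, dtype=np.float64)
x32 = x64.astype(np.float32)  # explicit down-cast — half the memory
x16 = x64.astype(np.float16)  # quarter the memory, ~3 decimal digits

print(x64.nbytes)  # 8000000
print(x32.nbytes)  # 4000000
print(x16.nbytes)  # 2000000
```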
Shape & Reshape
Reshape returns a view when possible (same data, new strides). Use -1 as a wildcard dimension and NumPy infers it. Flatten always returns a copy; ravel returns a view when memory is contiguous.
- a.reshape(3, 4) — explicit new shape
- a.reshape(-1, 4) — infer first dimension
- a.reshape(1, -1) — row vector (1, N)
- a.reshape(-1, 1) — column vector (N, 1)
- a.flatten() — 1D copy
- a.ravel() — 1D view (if possible)
- a.T — transpose (view, no copy)
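The view/copy distinction has real consequences: writing through a view changes the original array, writing through a copy does not. A small sketch:

```python
import numpy as np

a = np.arange(12)
v = a.reshape(3, 4)   # view — shares a's buffer
v[0, 0] = 99
print(a[0])           # 99 — the original changed too

f = a.flatten()       # always a copy
f[1] = -1
print(a[1])           # 1 — original untouched

r = a.ravel()         # view here, since a is contiguous
print(r.base is a)    # True — r points at a's memory
```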
import numpy as np
# ── 10 Essential Array Creation Methods ──────────────────────────────────────
# 1. From Python list / nested list
a = np.array([1, 2, 3, 4, 5]) # 1D, dtype inferred (int64)
M = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32) # 2D, explicit dtype
# 2. Zeros and ones
z = np.zeros((3, 4)) # 3×4 matrix of 0.0 (float64)
o = np.ones((2, 3, 4)) # 3D tensor of 1.0
e = np.empty((100, 100)) # Uninitialized — fastest, for overwriting
# 3. arange — like Python range() but returns ndarray
r = np.arange(0, 10, 2) # [0 2 4 6 8] (start, stop, step)
r_f = np.arange(0.0, 1.0, 0.1) # Floats work too
# 4. linspace — N evenly spaced points (inclusive of stop)
ls = np.linspace(0, 1, 11) # [0.0, 0.1, 0.2, ..., 1.0] — 11 points
# 5. Identity matrix
I = np.eye(4) # 4×4 identity (float64)
I3 = np.eye(4, dtype=int) # Integer identity
# 6. Random arrays (seed for reproducibility)
rng = np.random.default_rng(seed=42) # New-style generator (preferred)
u = rng.random((3, 3)) # Uniform [0, 1)
n = rng.standard_normal((100,)) # Standard normal N(0,1)
i = rng.integers(0, 10, size=(5, 5)) # Random integers [0, 10)
# 7. Full — fill with constant value
f = np.full((2, 4), fill_value=3.14) # 2×4 matrix all 3.14
f_like = np.full_like(M, 0.5) # Same shape+dtype as M, filled 0.5
# 8. Diagonal matrix
d = np.diag([1, 2, 3, 4]) # 4×4 matrix with given diagonal
diag_vals = np.diag(d) # Extract diagonal from existing matrix
# 9. Reshape examples
flat = np.arange(24) # [0..23]
cube = flat.reshape(2, 3, 4) # 2×3×4 tensor — view, no copy
col_vec = flat.reshape(-1, 1) # (24, 1) column vector
# 10. Stack arrays
row1 = np.array([1, 2, 3])
row2 = np.array([4, 5, 6])
stacked = np.vstack([row1, row2]) # (2, 3)
hstacked = np.hstack([row1, row2]) # (6,)
concatenated = np.concatenate([row1, row2], axis=0)
print(cube.shape, cube.dtype, cube.nbytes)
🔍 Indexing, Slicing & Boolean Masks
Basic Indexing
NumPy uses zero-based indexing like Python. Multi-dimensional arrays are indexed with comma-separated integers. Negative indices count from the end. Basic indexing that yields a subarray returns a view — no data is copied; selecting a single element returns a scalar.
- a[2] — third element of 1D array
- M[1, 2] — row 1, column 2 of 2D matrix
- M[-1, -1] — bottom-right corner
- T[0, 1, 2] — element in 3D tensor
Slicing
Slices follow start:stop:step syntax per axis. Omitted values default to start=0, stop=end, step=1. Slices return views — modifying the slice modifies the original. Use .copy() to avoid this.
- a[2:5] — elements 2, 3, 4
- a[::2] — every other element
- a[::-1] — reversed array (view)
- M[1:3, 0:2] — submatrix rows 1-2, cols 0-1
- M[:, 0] — entire first column
- M[0, :] — entire first row
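The slice-is-a-view rule is a classic gotcha worth seeing once:

```python
import numpy as np

a = np.arange(10)
s = a[2:5]          # view into a
s[0] = 100
print(a[2])         # 100 — original modified through the slice

c = a[2:5].copy()   # independent copy
c[0] = -1
print(a[2])         # still 100 — writes to the copy don't touch a
```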
Fancy & Boolean Indexing
Fancy indexing uses an integer array or list as an index — it always returns a copy. Boolean masks select elements where the mask is True. These are the workhorses of data filtering in ML pipelines.
- a[[0, 3, 5]] — elements at positions 0, 3, 5
- M[[0, 2], :] — rows 0 and 2
- a[a > threshold] — elements above threshold
- a[(a > 2) & (a < 8)] — combined condition
- np.where(cond, x, y) — element-wise ternary
import numpy as np
rng = np.random.default_rng(42)
# ── 1D Examples ───────────────────────────────────────────────────────────────
a = np.array([10, 20, 30, 40, 50, 60, 70, 80])
print(a[2]) # 30
print(a[-1]) # 80
print(a[1:5]) # [20 30 40 50]
print(a[::3]) # [10 40 70] every 3rd
print(a[::-1]) # [80 70 60 50 40 30 20 10]
# ── 2D Examples ───────────────────────────────────────────────────────────────
M = np.arange(16).reshape(4, 4)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]]
print(M[2, 1]) # 9
print(M[1:3, 1:3]) # [[5 6],[9 10]] — submatrix view
print(M[:, -1]) # [ 3 7 11 15] — last column
print(M[0]) # [0 1 2 3] — first row (shorthand for M[0,:])
# ── Fancy Indexing ────────────────────────────────────────────────────────────
rows = np.array([0, 2, 3])
print(M[rows]) # rows 0, 2, 3 — COPY, not view
print(M[rows, 1]) # elements M[0,1], M[2,1], M[3,1] = [1, 9, 13]
# ── Boolean Masking ───────────────────────────────────────────────────────────
data = rng.standard_normal(1000) # 1000 samples from N(0,1)
# Extract all values above 1.5 sigma
outliers = data[data > 1.5]
print(f"Values > 1.5: {len(outliers)} ({len(outliers)/1000*100:.1f}%)")
# Multi-condition filter
in_range = data[(data > -1.0) & (data < 1.0)]
print(f"Values in (-1, 1): {len(in_range)}")
# np.where — ternary element-wise operation
clipped = np.where(data > 2.0, 2.0, np.where(data < -2.0, -2.0, data))
print(f"Clipped range: [{clipped.min():.2f}, {clipped.max():.2f}]")
# Practical ML example: filter a dataset by label
X = rng.standard_normal((200, 10)) # 200 samples, 10 features
y = rng.integers(0, 3, 200) # Labels: 0, 1, 2
# Select only samples with label == 1
X_class1 = X[y == 1]
print(f"Class 1 samples: {X_class1.shape}")
# 3D indexing example (batch of images)
images = rng.random((32, 28, 28)) # 32 grayscale 28x28 images
first_img = images[0] # (28, 28)
top_left_patch = images[:, :14, :14] # (32, 14, 14) all images, top-left quadrant
print(top_left_patch.shape)
🔁 Broadcasting
Broadcasting is NumPy's mechanism for performing arithmetic on arrays with different — but compatible — shapes without copying data. It is one of the most powerful features for writing concise, memory-efficient numerical code.
The Three Broadcasting Rules
1. If arrays have different numbers of dimensions, the shape of the smaller array is padded with 1s on the left.
2. Arrays with size 1 along a dimension are stretched (conceptually) to match the other array's size in that dimension.
3. If neither array has size 1 in a mismatched dimension, a ValueError is raised.
| Shape A | Shape B | Result Shape | Valid? | Notes |
|---|---|---|---|---|
| (5,) | scalar | (5,) | Yes | Scalar broadcasts to any shape |
| (3, 4) | (4,) | (3, 4) | Yes | Row vector adds to each row |
| (3, 1) | (1, 4) | (3, 4) | Yes | Outer product pattern |
| (32, 28, 28) | (28, 28) | (32, 28, 28) | Yes | Per-image bias on batch |
| (3, 4) | (3,) | — | No | Right-align: (3,) pads to (1, 3); trailing dims 4 vs 3 mismatch |
| (3, 4) | (3, 1) | (3, 4) | Yes | Column vector adds to each column |
| (2, 3, 4) | (3, 4) | (2, 3, 4) | Yes | Trailing dims match |
| (2, 3) | (2, 4) | — | No | 3 vs 4 with neither being 1 |
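Every row of the table can be checked mechanically with np.broadcast_shapes (available since NumPy 1.20), which applies the three rules without allocating any data:

```python
import numpy as np

print(np.broadcast_shapes((3, 1), (1, 4)))          # (3, 4)
print(np.broadcast_shapes((32, 28, 28), (28, 28)))  # (32, 28, 28)
print(np.broadcast_shapes((3, 4), (3, 1)))          # (3, 4)

try:
    np.broadcast_shapes((3, 4), (3,))  # 4 vs 3 in the trailing dim
except ValueError as e:
    print("incompatible:", e)
```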
Broadcasting Avoids Loops
Without broadcasting, adding a bias vector to every row of a matrix requires an explicit Python loop or np.tile (which copies memory). Broadcasting does it in one C-level operation without materialising a tiled copy of the operand.
import numpy as np
# ── Scalar × Array ────────────────────────────────────────────────────────────
a = np.array([1.0, 2.0, 3.0, 4.0])
print(a * 2) # [2. 4. 6. 8.] — scalar broadcast to (4,)
print(a + 10) # [11. 12. 13. 14.]
print(a ** 2) # [ 1. 4. 9. 16.]
# ── Row Vector + Column Vector (outer-product-style grid) ─────────────────────
rows = np.array([[0], [10], [20], [30]]) # shape (4, 1)
cols = np.array([0, 1, 2, 3]) # shape (4,) → treated as (1, 4)
grid = rows + cols
# [[ 0  1  2  3]
#  [10 11 12 13]
#  [20 21 22 23]
#  [30 31 32 33]]
print(grid)
# ── Image Normalisation (common in ML preprocessing) ──────────────────────────
# Suppose each image has per-channel mean and std from the training set
images = np.random.rand(32, 3, 224, 224).astype(np.float32) # (batch, C, H, W)
channel_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32) # ImageNet means
channel_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
# Reshape to (1, 3, 1, 1) for broadcasting over batch and spatial dims
mean_bc = channel_mean.reshape(1, 3, 1, 1)
std_bc = channel_std.reshape(1, 3, 1, 1)
images_normalised = (images - mean_bc) / std_bc
print(images_normalised.shape) # (32, 3, 224, 224) — no extra memory for mean/std
# ── L2 Normalisation of Embedding Matrix ──────────────────────────────────────
embeddings = np.random.randn(1000, 128) # 1000 embeddings, dim 128
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) # (1000, 1)
unit_embeddings = embeddings / norms # broadcast: (1000, 128) / (1000, 1)
# Verify all unit length
print(np.allclose(np.linalg.norm(unit_embeddings, axis=1), 1.0)) # True
# ── Pairwise Distance Matrix (broadcast trick) ────────────────────────────────
A = np.random.randn(5, 3) # 5 points in R^3
B = np.random.randn(7, 3) # 7 points
# A[:, np.newaxis] → (5, 1, 3); B[np.newaxis, :] → (1, 7, 3)
diff = A[:, np.newaxis, :] - B[np.newaxis, :, :] # (5, 7, 3)
dist = np.sqrt((diff**2).sum(axis=2)) # (5, 7)
print(dist.shape) # (5, 7) — all pairwise distances, no loops
🧮 Linear Algebra & ML Ops
Matrix Multiplication
Matrix multiplication is the core computation in every neural network forward pass. NumPy provides multiple syntaxes that all call into the same BLAS routine.
- np.dot(A, B) — 1D and 2D matrix product; semantics differ from @ for ndim > 2
- A @ B — Python 3.5+ matmul operator (preferred)
- np.matmul(A, B) — explicit, same as @
- A * B — element-wise (Hadamard), NOT matrix multiply
- A.T — transpose; A.conj().T — conjugate transpose
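The @ vs * distinction is a frequent source of silent bugs, since both run without error on same-shaped matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(A @ B)  # [[19. 22.]
              #  [43. 50.]] — true matrix product
print(A * B)  # [[ 5. 12.]
              #  [21. 32.]] — element-wise, NOT a matrix product
print(np.allclose(A @ B, np.dot(A, B)))  # True — identical for 2D arrays
```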
np.linalg Toolkit
The numpy.linalg submodule wraps LAPACK for reliable, numerically stable implementations of core linear algebra operations.
- np.linalg.inv(A) — matrix inverse (avoid when possible)
- np.linalg.solve(A, b) — solve Ax = b (preferred over inv)
- np.linalg.svd(A) — singular value decomposition
- np.linalg.eig(A) — eigenvalues and eigenvectors
- np.linalg.norm(v) — vector/matrix norm
- np.linalg.det(A) — determinant
- np.linalg.matrix_rank(A) — numerical rank
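Why solve is preferred over inv: on a well-conditioned system both give the same answer, but solve factorises A directly (one LU decomposition) instead of forming the full inverse, which is both faster and numerically more stable. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)  # LU factorisation — preferred
x_inv = np.linalg.inv(A) @ b     # explicit inverse — slower, less stable

print(np.allclose(x_solve, x_inv))   # same solution here
print(np.allclose(A @ x_solve, b))   # residual check: Ax really equals b
```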
np.einsum
Einstein summation notation provides a concise, general syntax for any tensor contraction. Performance is comparable to explicit BLAS calls for common operations and often faster for exotic ones.
- 'ij,jk->ik' — matrix multiply
- 'ij,ij->i' — row-wise dot products
- 'ijk,jl->ikl' — contract a 3D tensor with a matrix over shared axis j
- 'ii->' — trace of a matrix
- 'ij->ji' — transpose
- optimize='optimal' for multi-operand chains
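Each signature above can be verified against its explicit equivalent, which is also a good habit when writing a new einsum expression:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
M = rng.standard_normal((4, 4))

# Matrix multiply vs @
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))               # True
# Row-wise dot products vs elementwise-multiply-and-sum
print(np.allclose(np.einsum('ij,ij->i', A, A), (A * A).sum(axis=1)))  # True
# Trace vs np.trace
print(np.isclose(np.einsum('ii->', M), np.trace(M)))                  # True
```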
import numpy as np
# ── Implement a Single Linear (Dense) Layer from Scratch ──────────────────────
# y = x @ W + b
# x: (batch, in_features)
# W: (in_features, out_features)
# b: (out_features,)
# y: (batch, out_features)
rng = np.random.default_rng(42)
batch_size = 32
in_features = 128
out_features = 64
# Xavier / Glorot uniform initialisation (standard for dense layers)
limit = np.sqrt(6.0 / (in_features + out_features))
W = rng.uniform(-limit, limit, (in_features, out_features)).astype(np.float32)
b = np.zeros(out_features, dtype=np.float32)
x = rng.standard_normal((batch_size, in_features)).astype(np.float32)
# Forward pass
y = x @ W + b # broadcast: b has shape (64,) → (1,64) → (32,64)
print(f"Output shape: {y.shape}") # (32, 64)
# ── ReLU activation ───────────────────────────────────────────────────────────
def relu(z):
    return np.maximum(0.0, z)  # element-wise max with 0 — returns a new array
y_activated = relu(y)
print(f"ReLU zeros: {(y_activated == 0).sum()} / {y_activated.size}")
# ── Softmax ───────────────────────────────────────────────────────────────────
def softmax(z):
    # Subtract max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
logits = rng.standard_normal((batch_size, 10)).astype(np.float32)
probs = softmax(logits)
print(f"Probs sum to 1: {np.allclose(probs.sum(axis=1), 1.0)}") # True
# ── Cross-Entropy Loss ────────────────────────────────────────────────────────
def cross_entropy(probs, labels):
    n = probs.shape[0]
    correct_probs = probs[np.arange(n), labels]  # fancy indexing
    return -np.mean(np.log(correct_probs + 1e-9))  # log + epsilon for stability
labels = rng.integers(0, 10, batch_size)
loss = cross_entropy(probs, labels)
print(f"Cross-entropy loss: {loss:.4f}")
# ── SVD for Dimensionality Reduction (manual PCA) ─────────────────────────────
data = rng.standard_normal((500, 50)) # 500 samples, 50 features
data_centred = data - data.mean(axis=0) # centre the data
U, S, Vt = np.linalg.svd(data_centred, full_matrices=False)
# Project onto top 10 principal components
k = 10
X_pca = data_centred @ Vt[:k].T # (500, 10)
explained = (S[:k]**2).sum() / (S**2).sum()
print(f"Top {k} PCs explain {explained*100:.1f}% of variance")
# ── Batched einsum: attention-like score matrix ───────────────────────────────
# Q, K: (batch, heads, seq_len, d_head)
Q = rng.standard_normal((4, 8, 64, 32)).astype(np.float32)
K = rng.standard_normal((4, 8, 64, 32)).astype(np.float32)
# Scaled dot-product attention scores: (batch, heads, seq_len, seq_len)
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(32.0)
print(f"Attention scores shape: {scores.shape}") # (4, 8, 64, 64)
NumPy as ML Foundation
Every operation above — linear layers, activations, losses, PCA, attention — runs as fast NumPy code with no frameworks. Understanding these primitives makes framework abstractions transparent and debugging trivial. When a PyTorch or TensorFlow op behaves unexpectedly, you can always re-implement it in NumPy to isolate the issue.