## ⚖ PyTorch vs TensorFlow
| Feature | PyTorch | TensorFlow / Keras |
|---|---|---|
| Computation graph | Dynamic (define-by-run) — graph built on each forward pass | TF2: eager by default; @tf.function builds static graph |
| Debugging | Standard Python debugger works; inspect tensors with print() | Eager mode debuggable; tf.function can be opaque |
| Training loop | Explicit loop (zero_grad → forward → loss → backward → step) | model.fit() abstracts loop; GradientTape for custom |
| Research papers | Dominant — the large majority of recent ML papers use PyTorch | Strong in production/industry; minority share in new research |
| Production serving | TorchServe, TorchScript, ONNX export | TF Serving, SavedModel, TFX — more mature ecosystem |
| Mobile / Edge | PyTorch Mobile (experimental) | TFLite — mature, widely deployed |
| Distributed training | torch.distributed, DDP (DistributedDataParallel) | tf.distribute — Strategy API |
| HuggingFace ecosystem | Primary backend for Transformers library | Supported but secondary in most HF models |
| PyTorch 2.0 | torch.compile() — 2x speedup via kernel fusion | XLA compilation via jit_compile=True |
### Why Researchers Prefer PyTorch
The dynamic graph model means you can use native Python control flow (if/else, loops) in your model and it just works. You can set a breakpoint inside a forward pass and inspect tensors. For novel architectures where the computation graph changes per sample (e.g., tree-structured networks, variable-length sequences without padding), PyTorch's define-by-run approach is far more natural.
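As a sketch of this flexibility, the toy module below (the name `DynamicDepthNet` and the stopping rule are made up for illustration) uses a data-dependent `while` loop inside `forward` — something a static graph cannot express without special constructs like `tf.while_loop`:

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Toy model whose depth depends on the input at run time."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 8)

    def forward(self, x):
        # Native Python control flow inside forward — no graph tracing needed
        steps = 0
        while x.norm() > 1.0 and steps < 10:  # data-dependent loop
            x = torch.relu(self.layer(x))
            steps += 1
        return x, steps

model = DynamicDepthNet()
out, steps = model(torch.randn(8) * 5)
print(f"applied the layer {steps} times")
```

The number of layer applications varies per input, and a debugger breakpoint inside `forward` lets you inspect `x` and `steps` at any iteration.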
## 💫 Tensors & Autograd

### torch.Tensor Basics
PyTorch tensors are similar to NumPy arrays but can live on GPU and participate in automatic differentiation. Most NumPy creation functions have direct PyTorch equivalents.
- `torch.zeros(3, 4)`, `torch.ones()`, `torch.rand()` — standard creation functions
- `torch.tensor([1, 2, 3])` — from Python list (copies data)
- `torch.from_numpy(arr)` — zero-copy from NumPy
- `t.to('cuda')` / `t.cuda()` — move to GPU
- `t.to('mps')` — Apple Silicon GPU
- `t.numpy()` — back to NumPy (CPU only)
- `t.item()` — extract scalar from single-element tensor
### Autograd: Automatic Differentiation
PyTorch's autograd engine tracks operations on tensors with requires_grad=True, building a computation graph in the background. Calling .backward() on a scalar computes gradients via reverse-mode automatic differentiation.
- `t = torch.tensor([3.0], requires_grad=True)` — all operations on `t` are recorded
- `loss.backward()` — compute all gradients
- `t.grad` — gradient of loss w.r.t. `t`
- `optimizer.zero_grad()` — clear accumulated grads
- `with torch.no_grad():` — disable grad tracking
- Gradients accumulate! Must zero before each backward pass
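The accumulation pitfall is easy to demonstrate: calling `.backward()` twice without zeroing sums the gradients.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
(x ** 2).sum().backward()   # d(x^2)/dx = 2x = 4
(x ** 2).sum().backward()   # second backward adds: 4 + 4 = 8
accumulated = x.grad.item()
print(accumulated)          # 8.0
x.grad.zero_()              # reset before the next backward pass
```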
### Device Management
Tensors must be on the same device to interact. The standard practice is to define a device variable at the top and move all tensors and models to it.
- `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')`
- `model.to(device)` — move model parameters
- `x = x.to(device)` — move input tensors
- `t.detach().cpu().numpy()` — safe tensor → NumPy conversion
- Mixed precision: `torch.cuda.amp.autocast()`
```python
import torch
import numpy as np

# ── Device setup ──────────────────────────────────────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ── Tensor creation ───────────────────────────────────────────────────────────
a = torch.zeros(3, 4)             # (3,4) float32 zeros
b = torch.ones(2, 3, dtype=torch.float64)
c = torch.rand(5, 5)              # uniform [0,1)
d = torch.randn(100, 20)          # N(0,1)
e = torch.arange(0, 10, 2)        # [0,2,4,6,8]
f = torch.linspace(0, 1, 11)      # [0.0,...,1.0]

# From NumPy (zero-copy, shares memory)
np_arr = np.random.randn(4, 4).astype(np.float32)
t_from_np = torch.from_numpy(np_arr)  # shares memory
print(t_from_np.shape, t_from_np.dtype)

# ── Manual gradient descent on y = w*x + b ────────────────────────────────────
torch.manual_seed(42)

# Synthetic data: y = 2x + 0.5 + noise
x_data = torch.linspace(-2, 2, 100)
y_data = 2.0 * x_data + 0.5 + torch.randn(100) * 0.2

# Learnable parameters
w = torch.tensor([0.0], requires_grad=True)
b_param = torch.tensor([0.0], requires_grad=True)
lr = 0.01

for epoch in range(200):
    # Forward pass
    y_pred = w * x_data + b_param
    # Loss (MSE)
    loss = ((y_pred - y_data) ** 2).mean()
    # Backward pass — compute gradients
    loss.backward()
    # Update parameters (no_grad so update isn't tracked)
    with torch.no_grad():
        w -= lr * w.grad
        b_param -= lr * b_param.grad
    # Zero gradients for next iteration
    w.grad.zero_()
    b_param.grad.zero_()
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1:3d}: loss={loss.item():.6f} "
              f"w={w.item():.4f} b={b_param.item():.4f}")
# Should converge toward w≈2.0, b≈0.5

# ── no_grad for inference ─────────────────────────────────────────────────────
model_output = torch.randn(5, requires_grad=True)
with torch.no_grad():
    softmax_out = torch.softmax(model_output, dim=0)
print(f"no_grad preserves grads on input: {model_output.requires_grad}")
print(f"output requires_grad: {softmax_out.requires_grad}")  # False
```
## 🏗 Building Models with nn.Module

### nn.Module Pattern
All PyTorch models subclass nn.Module. The __init__ method defines layers as attributes; forward defines the computation. This separation keeps architecture and forward logic explicit.
- Register layers in `__init__` as `self.xxx`
- Define computation in `forward(self, x)`
- Never call `forward()` directly — call `model(x)`
- `model.parameters()` — all learnable parameters
- `model.named_parameters()` — with names
- `model.train()` / `model.eval()` — toggle training/inference modes
### Key nn Layers
- `nn.Linear(in, out)` — dense layer (no activation)
- `nn.Conv2d(in_ch, out_ch, kernel_size)`
- `nn.BatchNorm1d` / `nn.BatchNorm2d` — batch normalisation
- `nn.Dropout(p)` / `nn.Dropout2d(p)`
- `nn.LSTM(input_size, hidden_size, num_layers)`
- `nn.Transformer(d_model, nhead, ...)`
- `nn.Embedding(vocab_size, embed_dim)`
- `nn.Sequential(layer1, layer2, ...)`
### Activation Functions
- `nn.ReLU()` / `F.relu(x)`
- `nn.GELU()` — Gaussian error linear unit
- `nn.Sigmoid()` / `nn.Tanh()`
- `nn.Softmax(dim=1)` — for probabilities
- `nn.LeakyReLU(0.01)` — negative slope
- `nn.SiLU()` — sigmoid linear unit (swish)
- Functional variants: `import torch.nn.functional as F`
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Feedforward Network ───────────────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(prev_dim, h),
                nn.BatchNorm1d(h),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FeedForwardNet(
    input_dim=64,
    hidden_dims=[256, 128, 64],
    output_dim=10,
    dropout=0.3,
).to(device)

# ── Inspect parameters ────────────────────────────────────────────────────────
total_params = sum(p.numel() for p in model.parameters())
train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total_params:,}")
print(f"Trainable params: {train_params:,}")
print("\nLayer breakdown:")
for name, param in model.named_parameters():
    print(f"  {name:40s}: {list(param.shape)}")

# ── CNN Example ───────────────────────────────────────────────────────────────
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (B,32,28,28)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (B,64,28,28)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                           # (B,64,14,14)
            nn.Dropout2d(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)  # logits (no softmax — use CrossEntropyLoss)

cnn = SimpleCNN(num_classes=10).to(device)
dummy = torch.randn(8, 1, 28, 28).to(device)  # batch of 8 MNIST-size images
logits = cnn(dummy)
print(f"\nCNN output shape: {logits.shape}")  # (8, 10)
```
## 🔥 Training Loop

### Dataset & DataLoader
Dataset defines how to access one sample; DataLoader handles batching, shuffling, and parallel loading. Custom datasets subclass torch.utils.data.Dataset.
- Implement `__len__` and `__getitem__`
- `DataLoader(dataset, batch_size, shuffle, num_workers)`
- `num_workers=4` — parallel data-loading processes
- `pin_memory=True` — faster CPU→GPU transfer
- `collate_fn` — custom batch assembly
- Built-in: `TensorDataset` wraps tensors directly
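For tensors that are already in memory, the built-in `TensorDataset` avoids writing a custom class; a minimal sketch:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(10, 3)
y = torch.arange(10)
ds = TensorDataset(X, y)                       # pairs X[i] with y[i]
loader = DataLoader(ds, batch_size=4, shuffle=False)
xb, yb = next(iter(loader))
print(xb.shape, yb.tolist())                   # torch.Size([4, 3]) [0, 1, 2, 3]
```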
### Optimizers
Optimizers update model parameters based on gradients. Adam is the default starting point; SGD with momentum is preferred for vision tasks with careful tuning.
- `torch.optim.Adam(model.parameters(), lr=1e-3)`
- `torch.optim.AdamW` — Adam + decoupled weight decay
- `torch.optim.SGD(lr, momentum=0.9, nesterov=True)`
- `torch.optim.RMSprop` — good for RNNs
- LR schedulers: `CosineAnnealingLR`, `OneCycleLR`, `ReduceLROnPlateau`
### Loss Functions
- `nn.CrossEntropyLoss()` — multi-class (expects raw logits)
- `nn.BCEWithLogitsLoss()` — binary (expects logits; numerically stable)
- `nn.MSELoss()` — regression (mean squared error)
- `nn.L1Loss()` — MAE; robust to outliers
- `nn.SmoothL1Loss()` — Huber loss
- `nn.NLLLoss()` — use after `log_softmax`
- Pass `weight=class_weights` for imbalanced classes
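The `weight=` option for imbalanced classes can be sketched as follows (the weight values here are arbitrary illustration numbers, not a recommendation):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 3)                     # batch of 8, 3 classes
targets = torch.randint(0, 3, (8,))
class_weights = torch.tensor([1.0, 2.0, 0.5])  # e.g. up-weight a rare class
criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(logits, targets)              # weighted mean; scalar
print(loss.item())
```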
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Custom Dataset ────────────────────────────────────────────────────────────
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Synthetic data
rng = np.random.default_rng(42)
X_np = rng.standard_normal((2000, 32)).astype(np.float32)
y_np = rng.integers(0, 4, 2000)
dataset = TabularDataset(X_np, y_np)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size],
                                generator=torch.Generator().manual_seed(42))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False, num_workers=0)

# ── Model ─────────────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 128), nn.BatchNorm1d(128), nn.GELU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.BatchNorm1d(64), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(64, 4),  # 4 classes — raw logits
).to(device)

# ── Optimizer & Loss ──────────────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30, eta_min=1e-5)

# ── Complete Training Loop ────────────────────────────────────────────────────
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()                 # 1. clear gradients
        logits = model(X_batch)               # 2. forward pass
        loss = criterion(logits, y_batch)     # 3. compute loss
        loss.backward()                       # 4. backprop
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()                      # 5. update weights
        total_loss += loss.item() * len(y_batch)
        correct += (logits.argmax(1) == y_batch).sum().item()
        total += len(y_batch)
    return total_loss / total, correct / total

@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        total_loss += loss.item() * len(y_batch)
        correct += (logits.argmax(1) == y_batch).sum().item()
        total += len(y_batch)
    return total_loss / total, correct / total

best_val_acc = 0.0
for epoch in range(1, 31):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_model.pt')
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: "
              f"train_loss={train_loss:.4f} train_acc={train_acc:.3f} | "
              f"val_loss={val_loss:.4f} val_acc={val_acc:.3f} "
              f"lr={scheduler.get_last_lr()[0]:.6f}")

print(f"Best val accuracy: {best_val_acc:.4f}")

# Load best checkpoint
model.load_state_dict(torch.load('best_model.pt', map_location=device))
```
## 🌎 Ecosystem & Deployment
| Tool | Purpose | Install |
|---|---|---|
| torchvision | Image datasets (ImageNet, CIFAR, MNIST), transforms, pretrained models (ResNet, ViT, EfficientNet) | pip install torchvision |
| torchaudio | Audio I/O, spectrograms, transforms, audio datasets (LibriSpeech, Common Voice) | pip install torchaudio |
| torchtext | Text datasets, tokenisation, vocabulary utilities | pip install torchtext |
| HuggingFace Transformers | 1000s of pretrained language models (BERT, GPT, LLaMA, Whisper) with PyTorch backend | pip install transformers |
| torch.onnx | Export model to ONNX format for cross-framework deployment (ONNXRuntime, TensorRT) | Bundled with PyTorch |
| TorchScript | Compile model to serialisable IR; run without Python in C++ or mobile | Bundled with PyTorch |
| TorchServe | Production model serving server; REST and gRPC endpoints; batching and scaling | pip install torchserve |
| torch.compile | PyTorch 2.0 compiler; 2-3x speedup via kernel fusion and triton backends | Bundled with PyTorch 2.0+ |
| Lightning | PyTorch Lightning — structured training loop, logging, multi-GPU, mixed precision | pip install lightning |
```python
import torch
import torch.nn as nn

# ── Transfer learning with torchvision pretrained model ───────────────────────
# from torchvision import models, transforms
# model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# # Freeze backbone
# for param in model.parameters():
#     param.requires_grad = False
# # Replace classifier head for 10 classes
# model.fc = nn.Linear(model.fc.in_features, 10)
# # Only classifier parameters are trainable
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# ── HuggingFace quick example ─────────────────────────────────────────────────
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
#
# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
#
# texts = ["This is great!", "Terrible experience."]
# encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
# with torch.no_grad():
#     outputs = model(**encoded)
# probs = torch.softmax(outputs.logits, dim=1)
# print(probs)

# ── ONNX export ───────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()
dummy_input = torch.randn(1, 32)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=17,
    input_names=['features'],
    output_names=['logits'],
    dynamic_axes={
        'features': {0: 'batch_size'},
        'logits': {0: 'batch_size'},
    },
)
print("ONNX model exported to model.onnx")

# ── TorchScript export (no Python runtime needed) ─────────────────────────────
scripted = torch.jit.script(model)
scripted.save('model_scripted.pt')

# Load and run without the Python model definition:
loaded_script = torch.jit.load('model_scripted.pt')
out = loaded_script(torch.randn(4, 32))
print(f"TorchScript inference shape: {out.shape}")

# ── torch.compile (PyTorch 2.0+) ──────────────────────────────────────────────
# compiled_model = torch.compile(model, mode='reduce-overhead')
# # First call triggers compilation (slow); subsequent calls are fast
# with torch.no_grad():
#     out = compiled_model(torch.randn(64, 32))

# ── Mixed precision training ──────────────────────────────────────────────────
# scaler = torch.cuda.amp.GradScaler()
# for X_batch, y_batch in train_loader:
#     X_batch, y_batch = X_batch.to(device), y_batch.to(device)
#     optimizer.zero_grad()
#     with torch.cuda.amp.autocast():  # fp16 forward pass
#         logits = model(X_batch)
#         loss = criterion(logits, y_batch)
#     scaler.scale(loss).backward()  # scaled backward
#     scaler.step(optimizer)
#     scaler.update()
```