⏱ 6 min read 📊 Beginner 🗓 Updated Jan 2025

⚖ PyTorch vs TensorFlow

Feature               | PyTorch                                                      | TensorFlow / Keras
Computation graph     | Dynamic (define-by-run) — graph built on each forward pass   | TF2: eager by default; @tf.function builds a static graph
Debugging             | Standard Python debugger works; inspect tensors with print() | Eager mode debuggable; tf.function can be opaque
Training loop         | Explicit loop (zero_grad → forward → loss → backward → step) | model.fit() abstracts the loop; GradientTape for custom
Research papers       | Dominant — roughly 80% of recent ML papers use PyTorch       | Strong in production/industry; shrinking share in research
Production serving    | TorchServe, TorchScript, ONNX export                         | TF Serving, SavedModel, TFX — more mature ecosystem
Mobile / Edge         | PyTorch Mobile (beta; succeeded by ExecuTorch)               | TFLite — mature, widely deployed
Distributed training  | torch.distributed, DDP (DistributedDataParallel)             | tf.distribute — Strategy API
HuggingFace ecosystem | Primary backend for the Transformers library                 | Supported, but secondary in most HF models
Compilation           | torch.compile() — speedups via kernel fusion (PyTorch 2.0+)  | XLA compilation via jit_compile=True

Why Researchers Prefer PyTorch

The dynamic graph model means you can use native Python control flow (if/else, loops) in your model and it just works. You can set a breakpoint inside a forward pass and inspect tensors. For novel architectures where the computation graph changes per sample (e.g., tree-structured networks, variable-length sequences without padding), PyTorch's define-by-run approach is far more natural.
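A minimal sketch of that flexibility (the module and the stopping threshold below are invented for illustration): the forward pass uses an ordinary Python while-loop whose iteration count depends on the input values, which a statically compiled graph cannot express directly.

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Applies the same layer a data-dependent number of times."""
    def __init__(self, dim=8):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        # Plain Python control flow: loop until the activation norm is
        # small, capped at 10 steps. The graph is rebuilt on every call.
        steps = 0
        while x.norm() > 1.0 and steps < 10:
            x = torch.tanh(self.layer(x))
            steps += 1
        return x, steps

net = DynamicDepthNet()
out, steps = net(torch.randn(1, 8) * 5)
print(f"applied layer {steps} time(s), output shape {tuple(out.shape)}")
```

You can drop a breakpoint inside that while-loop and step through it with pdb, exactly as with any other Python code.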

💫 Tensors & Autograd

torch.Tensor Basics

PyTorch tensors are similar to NumPy arrays but can live on GPU and participate in automatic differentiation. Most NumPy creation functions have direct PyTorch equivalents.

  • torch.zeros(3, 4), torch.ones(), torch.rand()
  • torch.tensor([1,2,3]) — from Python list (copies data)
  • torch.from_numpy(arr) — zero-copy from NumPy
  • t.to('cuda') / t.cuda() — move to GPU
  • t.to('mps') — Apple Silicon GPU
  • t.numpy() — back to NumPy (CPU only)
  • t.item() — extract scalar from single-element tensor

Autograd: Automatic Differentiation

PyTorch's autograd engine tracks operations on tensors with requires_grad=True, building a computation graph in the background. Calling .backward() on a scalar computes gradients via reverse-mode automatic differentiation.

  • t = torch.tensor([3.0], requires_grad=True)
  • All operations on t are recorded
  • loss.backward() — compute all gradients
  • t.grad — gradient of loss w.r.t. t
  • optimizer.zero_grad() — clear accumulated grads
  • with torch.no_grad(): — disable grad tracking
  • Gradients accumulate! You must zero them before each backward pass
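The accumulation gotcha in the last bullet is easy to see directly: two backward passes without zeroing sum their gradients.

```python
import torch

t = torch.tensor([3.0], requires_grad=True)

loss = (t ** 2).sum()          # d(loss)/dt = 2t = 6
loss.backward()
print(t.grad)                  # tensor([6.])

loss = (t ** 2).sum()          # fresh graph, same gradient
loss.backward()                # gradients ADD: 6 + 6
print(t.grad)                  # tensor([12.])

t.grad.zero_()                 # what optimizer.zero_grad() does per parameter
loss = (t ** 2).sum()
loss.backward()
print(t.grad)                  # tensor([6.]) again
```

This is why every training loop starts (or ends) each iteration with `optimizer.zero_grad()`.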

Device Management

Tensors must be on the same device to interact. The standard practice is to define a device variable at the top and move all tensors and models to it.

  • device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  • model.to(device) — move model parameters
  • x = x.to(device) — move input tensors
  • t.detach().cpu().numpy() — safe tensor → NumPy
  • Mixed precision: torch.amp.autocast('cuda') (older alias: torch.cuda.amp.autocast())
import torch
import numpy as np

# ── Device setup ──────────────────────────────────────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ── Tensor creation ───────────────────────────────────────────────────────────
a = torch.zeros(3, 4)                     # (3,4) float32 zeros
b = torch.ones(2, 3, dtype=torch.float64)
c = torch.rand(5, 5)                      # uniform [0,1)
d = torch.randn(100, 20)                  # N(0,1)
e = torch.arange(0, 10, 2)               # [0,2,4,6,8]
f = torch.linspace(0, 1, 11)             # [0.0,...,1.0]

# From NumPy (zero-copy, shares memory)
np_arr = np.random.randn(4, 4).astype(np.float32)
t_from_np = torch.from_numpy(np_arr)       # shares memory
print(t_from_np.shape, t_from_np.dtype)

# ── Manual gradient descent on y = w*x + b ────────────────────────────────────
torch.manual_seed(42)
# Synthetic data: y = 2x + 0.5 + noise
x_data = torch.linspace(-2, 2, 100)
y_data = 2.0 * x_data + 0.5 + torch.randn(100) * 0.2

# Learnable parameters
w = torch.tensor([0.0], requires_grad=True)
b_param = torch.tensor([0.0], requires_grad=True)

lr = 0.01
for epoch in range(200):
    # Forward pass
    y_pred = w * x_data + b_param

    # Loss (MSE)
    loss = ((y_pred - y_data) ** 2).mean()

    # Backward pass — compute gradients
    loss.backward()

    # Update parameters (no_grad so update isn't tracked)
    with torch.no_grad():
        w       -= lr * w.grad
        b_param -= lr * b_param.grad

    # Zero gradients for next iteration
    w.grad.zero_()
    b_param.grad.zero_()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1:3d}: loss={loss.item():.6f}  "
              f"w={w.item():.4f}  b={b_param.item():.4f}")
# Should converge toward w≈2.0, b≈0.5

# ── no_grad for inference ─────────────────────────────────────────────────────
model_output = torch.randn(5, requires_grad=True)
with torch.no_grad():
    softmax_out = torch.softmax(model_output, dim=0)
print(f"input still requires_grad: {model_output.requires_grad}")     # True
print(f"output requires_grad: {softmax_out.requires_grad}")   # False

🏗 Building Models with nn.Module

nn.Module Pattern

All PyTorch models subclass nn.Module. The __init__ method defines layers as attributes; forward defines the computation. This separation keeps architecture and forward logic explicit.

  • Register layers in __init__ as self.xxx
  • Define computation in forward(self, x)
  • Never call forward() directly — call model(x)
  • model.parameters() — all learnable parameters
  • model.named_parameters() — with names
  • model.train() / model.eval() — toggle modes
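What the train()/eval() toggle actually changes is easiest to see with dropout, which is active in training mode and a no-op at inference time:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)

drop.train()
y_train = drop(x)   # ~half the entries zeroed; survivors scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)    # identity: dropout does nothing in eval mode

print(f"zeros in train mode: {(y_train == 0).sum().item()}")
print(f"eval output equals input: {torch.equal(y_eval, x)}")   # True
```

Batch norm switches behaviour the same way (batch statistics vs. running statistics), which is why forgetting `model.eval()` before validation is a classic bug.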

Key nn Layers

  • nn.Linear(in, out) — dense layer (no activation)
  • nn.Conv2d(in_ch, out_ch, kernel_size)
  • nn.BatchNorm1d / BatchNorm2d — batch normalisation
  • nn.Dropout(p) / nn.Dropout2d(p)
  • nn.LSTM(input_size, hidden_size, num_layers)
  • nn.Transformer(d_model, nhead, ...)
  • nn.Embedding(vocab_size, embed_dim)
  • nn.Sequential(layer1, layer2, ...)

Activation Functions

  • nn.ReLU() / F.relu(x)
  • nn.GELU() — Gaussian error linear unit
  • nn.Sigmoid() / nn.Tanh()
  • nn.Softmax(dim=1) — for probabilities
  • nn.LeakyReLU(0.01) — negative slope
  • nn.SiLU() — sigmoid linear unit (swish)
  • Functional: import torch.nn.functional as F
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Feedforward Network ────────────────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(prev_dim, h),
                nn.BatchNorm1d(h),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FeedForwardNet(
    input_dim=64,
    hidden_dims=[256, 128, 64],
    output_dim=10,
    dropout=0.3,
).to(device)

# ── Inspect parameters ────────────────────────────────────────────────────────
total_params = sum(p.numel() for p in model.parameters())
train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params:     {total_params:,}")
print(f"Trainable params: {train_params:,}")
print(f"\nLayer breakdown:")
for name, param in model.named_parameters():
    print(f"  {name:40s}: {list(param.shape)}")

# ── CNN Example ───────────────────────────────────────────────────────────────
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (B,32,28,28)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (B,64,28,28)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                           # (B,64,14,14)
            nn.Dropout2d(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)    # logits (no softmax — use CrossEntropyLoss)

cnn = SimpleCNN(num_classes=10).to(device)
dummy = torch.randn(8, 1, 28, 28).to(device)    # batch of 8 MNIST-size images
logits = cnn(dummy)
print(f"\nCNN output shape: {logits.shape}")      # (8, 10)

🔥 Training Loop

Dataset & DataLoader

Dataset defines how to access one sample; DataLoader handles batching, shuffling, and parallel loading. Custom datasets subclass torch.utils.data.Dataset.

  • Implement __len__ and __getitem__
  • DataLoader(dataset, batch_size, shuffle, num_workers)
  • num_workers=4 — parallel data loading processes
  • pin_memory=True — faster CPU→GPU transfer
  • collate_fn — custom batch assembly
  • Built-in: TensorDataset wraps tensors directly
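A short sketch of the `collate_fn` bullet for variable-length sequences (the toy dataset here is made up; `pad_sequence` comes from `torch.nn.utils.rnn`). The default collation can only stack equal-sized tensors, so ragged batches need a custom function:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class ToySeqDataset(Dataset):
    """Three sequences of different lengths."""
    def __init__(self):
        self.seqs = [torch.arange(n, dtype=torch.float32) for n in (2, 5, 3)]

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        return self.seqs[idx]

def pad_collate(batch):
    lengths = torch.tensor([len(s) for s in batch])
    padded = pad_sequence(batch, batch_first=True)   # (B, max_len), zero-padded
    return padded, lengths

loader = DataLoader(ToySeqDataset(), batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape)   # torch.Size([3, 5])
print(lengths)        # tensor([2, 5, 3])
```

The returned `lengths` tensor is what you would later feed to `pack_padded_sequence` or use to build an attention mask.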

Optimizers

Optimizers update model parameters based on their gradients. Adam (or AdamW) is the usual default starting point; SGD with momentum is often preferred for vision tasks when carefully tuned.

  • torch.optim.Adam(model.parameters(), lr=1e-3)
  • torch.optim.AdamW — Adam + decoupled weight decay
  • torch.optim.SGD(lr, momentum=0.9, nesterov=True)
  • torch.optim.RMSprop — good for RNNs
  • LR schedulers: CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

Loss Functions

  • nn.CrossEntropyLoss() — multi-class (expects raw logits)
  • nn.BCEWithLogitsLoss() — binary (expects logits, numerically stable)
  • nn.MSELoss() — regression (mean squared error)
  • nn.L1Loss() — MAE; robust to outliers
  • nn.SmoothL1Loss() — Huber loss
  • nn.NLLLoss() — use after log_softmax
  • Pass weight=class_weights for imbalanced classes
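A quick sketch of the `weight=` bullet for an imbalanced problem; the inverse-frequency weighting used here is one common heuristic, not the only choice:

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])    # class 0 dominates
counts = torch.bincount(labels, minlength=3).float()   # tensor([6., 2., 1.])
weights = counts.sum() / (len(counts) * counts)        # inverse frequency
criterion = nn.CrossEntropyLoss(weight=weights)        # rare classes count more

logits = torch.randn(len(labels), 3)                   # dummy model output
loss = criterion(logits, labels)
print(f"class weights: {weights}")
print(f"weighted loss: {loss.item():.4f}")
```

With these weights, a mistake on the rare class 2 contributes six times as much to the loss as a mistake on class 0.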
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Custom Dataset ────────────────────────────────────────────────────────────
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Synthetic data
import numpy as np
rng = np.random.default_rng(42)
X_np = rng.standard_normal((2000, 32)).astype(np.float32)
y_np = rng.integers(0, 4, 2000)

dataset = TabularDataset(X_np, y_np)
train_size = int(0.8 * len(dataset))
val_size   = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size],
                                 generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=0)

# ── Model ─────────────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 128), nn.BatchNorm1d(128), nn.GELU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.BatchNorm1d(64),  nn.GELU(), nn.Dropout(0.2),
    nn.Linear(64, 4),   # 4 classes — raw logits
).to(device)

# ── Optimizer & Loss ──────────────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30, eta_min=1e-5)

# ── Complete Training Loop ────────────────────────────────────────────────────
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()              # 1. clear gradients
        logits = model(X_batch)            # 2. forward pass
        loss = criterion(logits, y_batch)  # 3. compute loss
        loss.backward()                    # 4. backprop
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()                   # 5. update weights
        total_loss += loss.item() * len(y_batch)
        correct    += (logits.argmax(1) == y_batch).sum().item()
        total      += len(y_batch)
    return total_loss / total, correct / total

@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)
        total_loss += loss.item() * len(y_batch)
        correct    += (logits.argmax(1) == y_batch).sum().item()
        total      += len(y_batch)
    return total_loss / total, correct / total

best_val_acc = 0.0
for epoch in range(1, 31):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss,   val_acc   = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_model.pt')
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: "
              f"train_loss={train_loss:.4f} train_acc={train_acc:.3f} | "
              f"val_loss={val_loss:.4f} val_acc={val_acc:.3f} "
              f"lr={scheduler.get_last_lr()[0]:.6f}")

print(f"Best val accuracy: {best_val_acc:.4f}")

# Load best checkpoint
model.load_state_dict(torch.load('best_model.pt', map_location=device))
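Saving only the model `state_dict`, as above, is enough for inference. To *resume* training you also need the optimizer (and scheduler) state, so it is common to save a checkpoint dict instead — a sketch, where the file name and dict keys are arbitrary conventions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(32, 4)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Bundle everything needed to pick up training where it stopped
checkpoint = {
    'epoch': 30,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'best_val_acc': 0.0,
}
torch.save(checkpoint, 'checkpoint.pt')

# Later: restore both model and optimizer before continuing the loop
ckpt = torch.load('checkpoint.pt', map_location='cpu')
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optimizer_state'])
print(f"resuming from epoch {ckpt['epoch']}")
```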

🌎 Ecosystem & Deployment

Tool                     | Purpose                                                                                             | Install
torchvision              | Image datasets (ImageNet, CIFAR, MNIST), transforms, pretrained models (ResNet, ViT, EfficientNet)  | pip install torchvision
torchaudio               | Audio I/O, spectrograms, transforms, audio datasets (LibriSpeech, Common Voice)                     | pip install torchaudio
torchtext                | Text datasets, tokenisation, vocabulary utilities (no longer actively developed)                    | pip install torchtext
HuggingFace Transformers | Thousands of pretrained language models (BERT, GPT, LLaMA, Whisper) with a PyTorch backend          | pip install transformers
torch.onnx               | Export models to ONNX for cross-framework deployment (ONNX Runtime, TensorRT)                       | Bundled with PyTorch
TorchScript              | Compile models to a serialisable IR; run without Python in C++ or on mobile                         | Bundled with PyTorch
TorchServe               | Production model-serving server; REST and gRPC endpoints; batching and scaling                      | pip install torchserve
torch.compile            | PyTorch 2.0 compiler; speedups via kernel fusion and Triton backends                                | Bundled with PyTorch 2.0+
Lightning                | PyTorch Lightning — structured training loops, logging, multi-GPU, mixed precision                  | pip install lightning
import torch
import torch.nn as nn

# ── Transfer learning with torchvision pretrained model ────────────────────────
# from torchvision import models, transforms
# model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# # Freeze backbone
# for param in model.parameters():
#     param.requires_grad = False
# # Replace classifier head for 10 classes
# model.fc = nn.Linear(model.fc.in_features, 10)
# # Only classifier parameters are trainable
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# ── HuggingFace quick example ─────────────────────────────────────────────────
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# import torch
#
# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
#
# texts = ["This is great!", "Terrible experience."]
# encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
# with torch.no_grad():
#     outputs = model(**encoded)
# probs = torch.softmax(outputs.logits, dim=1)
# print(probs)

# ── ONNX export ───────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

dummy_input = torch.randn(1, 32)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=17,
    input_names=['features'],
    output_names=['logits'],
    dynamic_axes={
        'features': {0: 'batch_size'},
        'logits':   {0: 'batch_size'},
    },
)
print("ONNX model exported to model.onnx")

# ── TorchScript export (no Python runtime needed) ────────────────────────────
scripted = torch.jit.script(model)
scripted.save('model_scripted.pt')
# Load and run without Python model definition:
loaded_script = torch.jit.load('model_scripted.pt')
out = loaded_script(torch.randn(4, 32))
print(f"TorchScript inference shape: {out.shape}")

# ── torch.compile (PyTorch 2.0+) ──────────────────────────────────────────────
# compiled_model = torch.compile(model, mode='reduce-overhead')
# # First call triggers compilation (slow); subsequent calls are fast
# with torch.no_grad():
#     out = compiled_model(torch.randn(64, 32))

# ── Mixed precision training ─────────────────────────────────────────────────
# scaler = torch.amp.GradScaler('cuda')    # formerly torch.cuda.amp.GradScaler()
# for X_batch, y_batch in train_loader:
#     X_batch, y_batch = X_batch.to(device), y_batch.to(device)
#     optimizer.zero_grad()
#     with torch.amp.autocast('cuda'):      # fp16/bf16 forward pass
#         logits = model(X_batch)
#         loss   = criterion(logits, y_batch)
#     scaler.scale(loss).backward()         # backward on the scaled loss
#     scaler.step(optimizer)                # unscales grads, then steps
#     scaler.update()