⏱ 6 min read 📊 Beginner 🗓 Updated Jan 2025

⚖ PyTorch vs TensorFlow

Feature               | PyTorch                                                      | TensorFlow / Keras
Computation graph     | Dynamic (define-by-run) — graph built on each forward pass   | TF2: eager by default; @tf.function builds a static graph
Debugging             | Standard Python debugger works; inspect tensors with print() | Eager mode debuggable; tf.function can be opaque
Training loop         | Explicit loop (zero_grad → forward → loss → backward → step) | model.fit() abstracts the loop; GradientTape for custom
Research papers       | Dominant — roughly 80% of recent ML papers use PyTorch       | Strong in production/industry; shrinking share in research
Production serving    | TorchServe, TorchScript, ONNX export                         | TF Serving, SavedModel, TFX — more mature ecosystem
Mobile / Edge         | PyTorch Mobile (beta; succeeded by ExecuTorch)               | TFLite — mature, widely deployed
Distributed training  | torch.distributed, DDP (DistributedDataParallel)             | tf.distribute — Strategy API
HuggingFace ecosystem | Primary backend for the Transformers library                 | Supported, but secondary in most HF models
Compilation           | torch.compile() — speedups via kernel fusion (PyTorch 2.0+)  | XLA compilation via jit_compile=True

Why Researchers Prefer PyTorch

The dynamic graph model means you can use native Python control flow (if/else, loops) in your model and it just works. You can set a breakpoint inside a forward pass and inspect tensors. For novel architectures where the computation graph changes per sample (e.g., tree-structured networks, variable-length sequences without padding), PyTorch's define-by-run approach is far more natural.
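A minimal sketch of that flexibility (the module and the stopping threshold below are invented for illustration): the forward pass uses an ordinary Python while-loop whose iteration count depends on the input values, which a statically compiled graph cannot express directly.

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Applies the same layer a data-dependent number of times."""
    def __init__(self, dim=8):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        # Plain Python control flow: loop until the activation norm is
        # small, capped at 10 steps. The graph is rebuilt on every call.
        steps = 0
        while x.norm() > 1.0 and steps < 10:
            x = torch.tanh(self.layer(x))
            steps += 1
        return x, steps

net = DynamicDepthNet()
out, steps = net(torch.randn(1, 8) * 5)
print(f"applied layer {steps} time(s), output shape {tuple(out.shape)}")
```

You can drop a breakpoint inside that while-loop and step through it with pdb, exactly as with any other Python code.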

💫 Tensors & Autograd

torch.Tensor Basics

PyTorch tensors are similar to NumPy arrays but can live on GPU and participate in automatic differentiation. Most NumPy creation functions have direct PyTorch equivalents.

  • torch.zeros(3, 4), torch.ones(), torch.rand()
  • torch.tensor([1,2,3]) — from Python list (copies data)
  • torch.from_numpy(arr) — zero-copy from NumPy
  • t.to('cuda') / t.cuda() — move to GPU
  • t.to('mps') — Apple Silicon GPU
  • t.numpy() — back to NumPy (CPU only)
  • t.item() — extract scalar from single-element tensor

Autograd: Automatic Differentiation

PyTorch's autograd engine tracks operations on tensors with requires_grad=True, building a computation graph in the background. Calling .backward() on a scalar computes gradients via reverse-mode automatic differentiation.

  • t = torch.tensor([3.0], requires_grad=True)
  • All operations on t are recorded
  • loss.backward() — compute all gradients
  • t.grad — gradient of loss w.r.t. t
  • optimizer.zero_grad() — clear accumulated grads
  • with torch.no_grad(): — disable grad tracking
  • Gradients accumulate! You must zero them before each backward pass
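The accumulation gotcha in the last bullet is easy to see directly: two backward passes without zeroing sum their gradients.

```python
import torch

t = torch.tensor([3.0], requires_grad=True)

loss = (t ** 2).sum()          # d(loss)/dt = 2t = 6
loss.backward()
print(t.grad)                  # tensor([6.])

loss = (t ** 2).sum()          # fresh graph, same gradient
loss.backward()                # gradients ADD: 6 + 6
print(t.grad)                  # tensor([12.])

t.grad.zero_()                 # what optimizer.zero_grad() does per parameter
loss = (t ** 2).sum()
loss.backward()
print(t.grad)                  # tensor([6.]) again
```

This is why every training loop starts (or ends) each iteration with `optimizer.zero_grad()`.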

Device Management

Tensors must be on the same device to interact. The standard practice is to define a device variable at the top and move all tensors and models to it.

  • device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  • model.to(device) — move model parameters
  • x = x.to(device) — move input tensors
  • t.detach().cpu().numpy() — safe tensor → NumPy
  • Mixed precision: torch.amp.autocast('cuda') (older alias: torch.cuda.amp.autocast())
import torch
import numpy as np

# ── Device setup ──────────────────────────────────────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ── Tensor creation ───────────────────────────────────────────────────────────
a = torch.zeros(3, 4)                     # (3,4) float32 zeros
b = torch.ones(2, 3, dtype=torch.float64)
c = torch.rand(5, 5)                      # uniform [0,1)
d = torch.randn(100, 20)                  # N(0,1)
e = torch.arange(0, 10, 2)               # [0,2,4,6,8]
f = torch.linspace(0, 1, 11)             # [0.0,...,1.0]

# From NumPy (zero-copy, shares memory)
np_arr = np.random.randn(4, 4).astype(np.float32)
t_from_np = torch.from_numpy(np_arr)       # shares memory
print(t_from_np.shape, t_from_np.dtype)

# ── Manual gradient descent on y = w*x + b ────────────────────────────────────
torch.manual_seed(42)
# Synthetic data: y = 2x + 0.5 + noise
x_data = torch.linspace(-2, 2, 100)
y_data = 2.0 * x_data + 0.5 + torch.randn(100) * 0.2

# Learnable parameters
w = torch.tensor([0.0], requires_grad=True)
b_param = torch.tensor([0.0], requires_grad=True)

lr = 0.01
for epoch in range(200):
    # Forward pass
    y_pred = w * x_data + b_param

    # Loss (MSE)
    loss = ((y_pred - y_data) ** 2).mean()

    # Backward pass — compute gradients
    loss.backward()

    # Update parameters (no_grad so update isn't tracked)
    with torch.no_grad():
        w       -= lr * w.grad
        b_param -= lr * b_param.grad

    # Zero gradients for next iteration
    w.grad.zero_()
    b_param.grad.zero_()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1:3d}: loss={loss.item():.6f}  "
              f"w={w.item():.4f}  b={b_param.item():.4f}")
# Should converge toward w≈2.0, b≈0.5

# ── no_grad for inference ─────────────────────────────────────────────────────
model_output = torch.randn(5, requires_grad=True)
with torch.no_grad():
    softmax_out = torch.softmax(model_output, dim=0)
print(f"input still requires_grad: {model_output.requires_grad}")     # True
print(f"output requires_grad: {softmax_out.requires_grad}")   # False

🏗 Building Models with nn.Module

nn.Module Pattern

All PyTorch models subclass nn.Module. The __init__ method defines layers as attributes; forward defines the computation. This separation keeps architecture and forward logic explicit.

  • Register layers in __init__ as self.xxx
  • Define computation in forward(self, x)
  • Never call forward() directly — call model(x)
  • model.parameters() — all learnable parameters
  • model.named_parameters() — with names
  • model.train() / model.eval() — toggle modes
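What the train()/eval() toggle actually changes is easiest to see with dropout, which is active in training mode and a no-op at inference time:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)

drop.train()
y_train = drop(x)   # ~half the entries zeroed; survivors scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)    # identity: dropout does nothing in eval mode

print(f"zeros in train mode: {(y_train == 0).sum().item()}")
print(f"eval output equals input: {torch.equal(y_eval, x)}")   # True
```

Batch norm switches behaviour the same way (batch statistics vs. running statistics), which is why forgetting `model.eval()` before validation is a classic bug.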

Key nn Layers

  • nn.Linear(in, out) — dense layer (no activation)
  • nn.Conv2d(in_ch, out_ch, kernel_size)
  • nn.BatchNorm1d / BatchNorm2d — batch normalisation
  • nn.Dropout(p) / nn.Dropout2d(p)
  • nn.LSTM(input_size, hidden_size, num_layers)
  • nn.Transformer(d_model, nhead, ...)
  • nn.Embedding(vocab_size, embed_dim)
  • nn.Sequential(layer1, layer2, ...)

Activation Functions

  • nn.ReLU() / F.relu(x)
  • nn.GELU() — Gaussian error linear unit
  • nn.Sigmoid() / nn.Tanh()
  • nn.Softmax(dim=1) — for probabilities
  • nn.LeakyReLU(0.01) — negative slope
  • nn.SiLU() — sigmoid linear unit (swish)
  • Functional: import torch.nn.functional as F
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Feedforward Network ────────────────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(prev_dim, h),
                nn.BatchNorm1d(h),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FeedForwardNet(
    input_dim=64,
    hidden_dims=[256, 128, 64],
    output_dim=10,
    dropout=0.3,
).to(device)

# ── Inspect parameters ────────────────────────────────────────────────────────
total_params = sum(p.numel() for p in model.parameters())
train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params:     {total_params:,}")
print(f"Trainable params: {train_params:,}")
print(f"\nLayer breakdown:")
for name, param in model.named_parameters():
    print(f"  {name:40s}: {list(param.shape)}")

# ── CNN Example ───────────────────────────────────────────────────────────────
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (B,32,28,28)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (B,64,28,28)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                           # (B,64,14,14)
            nn.Dropout2d(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)    # logits (no softmax — use CrossEntropyLoss)

cnn = SimpleCNN(num_classes=10).to(device)
dummy = torch.randn(8, 1, 28, 28).to(device)    # batch of 8 MNIST-size images
logits = cnn(dummy)
print(f"\nCNN output shape: {logits.shape}")      # (8, 10)

🔥 Training Loop

Dataset & DataLoader

Dataset defines how to access one sample; DataLoader handles batching, shuffling, and parallel loading. Custom datasets subclass torch.utils.data.Dataset.

  • Implement __len__ and __getitem__
  • DataLoader(dataset, batch_size, shuffle, num_workers)
  • num_workers=4 — parallel data loading processes
  • pin_memory=True — faster CPU→GPU transfer
  • collate_fn — custom batch assembly
  • Built-in: TensorDataset wraps tensors directly
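A short sketch of the `collate_fn` bullet for variable-length sequences (the toy dataset here is made up; `pad_sequence` comes from `torch.nn.utils.rnn`). The default collation can only stack equal-sized tensors, so ragged batches need a custom function:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class ToySeqDataset(Dataset):
    """Three sequences of different lengths."""
    def __init__(self):
        self.seqs = [torch.arange(n, dtype=torch.float32) for n in (2, 5, 3)]

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        return self.seqs[idx]

def pad_collate(batch):
    lengths = torch.tensor([len(s) for s in batch])
    padded = pad_sequence(batch, batch_first=True)   # (B, max_len), zero-padded
    return padded, lengths

loader = DataLoader(ToySeqDataset(), batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape)   # torch.Size([3, 5])
print(lengths)        # tensor([2, 5, 3])
```

The returned `lengths` tensor is what you would later feed to `pack_padded_sequence` or use to build an attention mask.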

Optimizers

Optimizers update model parameters based on their gradients. Adam (or AdamW) is the usual default starting point; SGD with momentum is often preferred for vision tasks when carefully tuned.

  • torch.optim.Adam(model.parameters(), lr=1e-3)
  • torch.optim.AdamW — Adam + decoupled weight decay
  • torch.optim.SGD(lr, momentum=0.9, nesterov=True)
  • torch.optim.RMSprop — good for RNNs
  • LR schedulers: CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

Loss Functions

  • nn.CrossEntropyLoss() — multi-class (expects raw logits)
  • nn.BCEWithLogitsLoss() — binary (expects logits, numerically stable)
  • nn.MSELoss() — regression (mean squared error)
  • nn.L1Loss() — MAE; robust to outliers
  • nn.SmoothL1Loss() — Huber loss
  • nn.NLLLoss() — use after log_softmax
  • Pass weight=class_weights for imbalanced classes
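A quick sketch of the `weight=` bullet for an imbalanced problem; the inverse-frequency weighting used here is one common heuristic, not the only choice:

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])    # class 0 dominates
counts = torch.bincount(labels, minlength=3).float()   # tensor([6., 2., 1.])
weights = counts.sum() / (len(counts) * counts)        # inverse frequency
criterion = nn.CrossEntropyLoss(weight=weights)        # rare classes count more

logits = torch.randn(len(labels), 3)                   # dummy model output
loss = criterion(logits, labels)
print(f"class weights: {weights}")
print(f"weighted loss: {loss.item():.4f}")
```

With these weights, a mistake on the rare class 2 contributes six times as much to the loss as a mistake on class 0.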
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Custom Dataset ────────────────────────────────────────────────────────────
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Synthetic data
import numpy as np
rng = np.random.default_rng(42)
X_np = rng.standard_normal((2000, 32)).astype(np.float32)
y_np = rng.integers(0, 4, 2000)

dataset = TabularDataset(X_np, y_np)
train_size = int(0.8 * len(dataset))
val_size   = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size],
                                 generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=0)

# ── Model ─────────────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 128), nn.BatchNorm1d(128), nn.GELU(), nn.Dropout(0.3),
    nn.Linear(128, 64), nn.BatchNorm1d(64),  nn.GELU(), nn.Dropout(0.2),
    nn.Linear(64, 4),   # 4 classes — raw logits
).to(device)

# ── Optimizer & Loss ──────────────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30, eta_min=1e-5)

# ── Complete Training Loop ────────────────────────────────────────────────────
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()              # 1. clear gradients
        logits = model(X_batch)            # 2. forward pass
        loss = criterion(logits, y_batch)  # 3. compute loss
        loss.backward()                    # 4. backprop
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()                   # 5. update weights
        total_loss += loss.item() * len(y_batch)
        correct    += (logits.argmax(1) == y_batch).sum().item()
        total      += len(y_batch)
    return total_loss / total, correct / total

@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0, 0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)
        total_loss += loss.item() * len(y_batch)
        correct    += (logits.argmax(1) == y_batch).sum().item()
        total      += len(y_batch)
    return total_loss / total, correct / total

best_val_acc = 0.0
for epoch in range(1, 31):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss,   val_acc   = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_model.pt')
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: "
              f"train_loss={train_loss:.4f} train_acc={train_acc:.3f} | "
              f"val_loss={val_loss:.4f} val_acc={val_acc:.3f} "
              f"lr={scheduler.get_last_lr()[0]:.6f}")

print(f"Best val accuracy: {best_val_acc:.4f}")

# Load best checkpoint
model.load_state_dict(torch.load('best_model.pt', map_location=device))
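Saving only the model `state_dict`, as above, is enough for inference. To *resume* training you also need the optimizer (and scheduler) state, so it is common to save a checkpoint dict instead — a sketch, where the file name and dict keys are arbitrary conventions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(32, 4)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Bundle everything needed to pick up training where it stopped
checkpoint = {
    'epoch': 30,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'best_val_acc': 0.0,
}
torch.save(checkpoint, 'checkpoint.pt')

# Later: restore both model and optimizer before continuing the loop
ckpt = torch.load('checkpoint.pt', map_location='cpu')
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optimizer_state'])
print(f"resuming from epoch {ckpt['epoch']}")
```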

🌎 Ecosystem & Deployment

Tool                     | Purpose                                                                                             | Install
torchvision              | Image datasets (ImageNet, CIFAR, MNIST), transforms, pretrained models (ResNet, ViT, EfficientNet)  | pip install torchvision
torchaudio               | Audio I/O, spectrograms, transforms, audio datasets (LibriSpeech, Common Voice)                     | pip install torchaudio
torchtext                | Text datasets, tokenisation, vocabulary utilities (no longer actively developed)                    | pip install torchtext
HuggingFace Transformers | Thousands of pretrained language models (BERT, GPT, LLaMA, Whisper) with a PyTorch backend          | pip install transformers
torch.onnx               | Export models to ONNX for cross-framework deployment (ONNX Runtime, TensorRT)                       | Bundled with PyTorch
TorchScript              | Compile models to a serialisable IR; run without Python in C++ or on mobile                         | Bundled with PyTorch
TorchServe               | Production model-serving server; REST and gRPC endpoints; batching and scaling                      | pip install torchserve
torch.compile            | PyTorch 2.0 compiler; speedups via kernel fusion and Triton backends                                | Bundled with PyTorch 2.0+
Lightning                | PyTorch Lightning — structured training loops, logging, multi-GPU, mixed precision                  | pip install lightning
import torch
import torch.nn as nn

# ── Transfer learning with torchvision pretrained model ────────────────────────
# from torchvision import models, transforms
# model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# # Freeze backbone
# for param in model.parameters():
#     param.requires_grad = False
# # Replace classifier head for 10 classes
# model.fc = nn.Linear(model.fc.in_features, 10)
# # Only classifier parameters are trainable
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# ── HuggingFace quick example ─────────────────────────────────────────────────
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# import torch
#
# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
#
# texts = ["This is great!", "Terrible experience."]
# encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
# with torch.no_grad():
#     outputs = model(**encoded)
# probs = torch.softmax(outputs.logits, dim=1)
# print(probs)

# ── ONNX export ───────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

dummy_input = torch.randn(1, 32)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=17,
    input_names=['features'],
    output_names=['logits'],
    dynamic_axes={
        'features': {0: 'batch_size'},
        'logits':   {0: 'batch_size'},
    },
)
print("ONNX model exported to model.onnx")

# ── TorchScript export (no Python runtime needed) ────────────────────────────
scripted = torch.jit.script(model)
scripted.save('model_scripted.pt')
# Load and run without Python model definition:
loaded_script = torch.jit.load('model_scripted.pt')
out = loaded_script(torch.randn(4, 32))
print(f"TorchScript inference shape: {out.shape}")

# ── torch.compile (PyTorch 2.0+) ──────────────────────────────────────────────
# compiled_model = torch.compile(model, mode='reduce-overhead')
# # First call triggers compilation (slow); subsequent calls are fast
# with torch.no_grad():
#     out = compiled_model(torch.randn(64, 32))

# ── Mixed precision training ─────────────────────────────────────────────────
# scaler = torch.amp.GradScaler('cuda')    # formerly torch.cuda.amp.GradScaler()
# for X_batch, y_batch in train_loader:
#     X_batch, y_batch = X_batch.to(device), y_batch.to(device)
#     optimizer.zero_grad()
#     with torch.amp.autocast('cuda'):      # fp16/bf16 forward pass
#         logits = model(X_batch)
#         loss   = criterion(logits, y_batch)
#     scaler.scale(loss).backward()         # backward on the scaled loss
#     scaler.step(optimizer)                # unscales grads, then steps
#     scaler.update()