📚 What is Jupyter?
The Notebook Concept
A Jupyter notebook is a document that mixes executable code cells with rich-text Markdown cells and their outputs (text, tables, charts, images, LaTeX). This literate-programming style makes notebooks ideal for data exploration, analysis, and communicating findings.
- Code cells — Python (or R, Julia, etc.) executed by the kernel
- Markdown cells — headings, paragraphs, LaTeX equations, images
- Output cells — text, DataFrames, matplotlib plots, HTML widgets
- Raw cells — unformatted content, passed to nbconvert as-is
- Stored as JSON (.ipynb) — both code and outputs are saved
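Because the on-disk format is plain JSON, a notebook can be inspected with nothing but the standard library. A minimal sketch — the nbformat-v4 field names are real, but this tiny notebook dict is hand-built for illustration:

```python
import json

# A minimal nbformat-v4 notebook, constructed by hand for illustration
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print('hi')"], "outputs": []},
    ],
}

text = json.dumps(nb, indent=1)   # this is roughly what an .ipynb file contains
parsed = json.loads(text)
code_cells = [c for c in parsed["cells"] if c["cell_type"] == "code"]
print(f"{len(parsed['cells'])} cells, {len(code_cells)} code")
```

Note that code cells carry an `outputs` list and an `execution_count` — exactly the fields that make version-control diffs noisy.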
The Kernel
The kernel is a separate process that executes code. The notebook frontend communicates with the kernel via ZeroMQ messages. This separation means you can restart the kernel without closing the browser tab, and one notebook server can host kernels in multiple languages.
- IPython kernel — the standard Python kernel
- Each notebook has its own kernel process
- State is persistent within a session (variables survive across cells)
- Restart & Run All — reproducibility test: can anyone run this fresh?
- Multiple kernels: Python, R (IRkernel), Julia (IJulia), bash
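The persistent-state behaviour can be sketched outside Jupyter: a kernel is essentially one long-lived namespace that every cell executes into. A toy simulation (not the real ZeroMQ protocol) that also shows why out-of-order execution is a trap:

```python
# Simulate a kernel: every "cell" executes into one shared namespace,
# so state persists across cells — and re-running cells changes results.
ns = {}
cell_1 = "x = 10"
cell_2 = "y = x * 2"

exec(cell_1, ns)
exec(cell_2, ns)
print(ns["y"])        # 20 — top-to-bottom order behaves as expected

exec("x = 100", ns)   # re-running an earlier cell mutates shared state
exec(cell_2, ns)
print(ns["y"])        # 200 — same cell, different result
```

This is exactly what "Restart & Run All" guards against: it throws the namespace away and checks the notebook still works top to bottom.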
JupyterLab vs Classic vs VS Code
The Jupyter ecosystem has evolved significantly. JupyterLab is now the recommended interface for serious work, though VS Code's notebook support has become excellent for integrated development.
- JupyterLab — modern IDE-like interface; tabs, file browser, terminal
- Classic Notebook — the original interface; simpler; still works
- VS Code Notebooks — integrated with editor, debugger, git; great DX
- Google Colab — cloud, free GPU, no install; based on Jupyter
- JupyterLab Desktop — standalone Electron app for local use
JupyterLab is the Modern Choice
JupyterLab replaces the classic Notebook interface. It provides a full-featured IDE experience with a file browser, multiple notebooks/terminals in tabs, a text editor, CSV viewer, image viewer, and a rich extension ecosystem. Install with pip install jupyterlab and launch with jupyter lab.
# ── Installation ──────────────────────────────────────────────────────────────
# Install JupyterLab (recommended)
# pip install jupyterlab
# Install classic Notebook
# pip install notebook
# Install with conda (includes many data science packages)
# conda install -c conda-forge jupyterlab
# Launch JupyterLab
# jupyter lab
# jupyter lab --port=8889 --no-browser # specific port, no auto-open
# Launch classic Notebook
# jupyter notebook
# ── IPython features available in notebooks ────────────────────────────────────
# Tab completion: type pandas. then Tab → see all methods
# ? after a function: np.random.randn? → show docstring
# ?? after a function: np.random.randn?? → show source code
# Shift+Tab inside function parens → show signature
import numpy as np
import pandas as pd
# These work in any Jupyter cell:
# display() renders rich HTML for DataFrames
df = pd.DataFrame({'a': [1,2,3], 'b': [4.1, 5.2, 6.3]})
# display(df) # renders as styled HTML table
# df # last expression in cell auto-displays
# Multiple displays per cell:
# from IPython.display import display
# display(df.head(3))
# display(df.describe())
⚡ Essential Workflow
Keyboard Shortcuts
Learning the keyboard shortcuts dramatically speeds up notebook work. There are two modes: Command mode (press Esc) for cell-level operations and Edit mode (press Enter) for editing cell content; the classic Notebook marks them with a blue and a green cell border respectively.
- Shift+Enter — run cell and move to next
- Ctrl+Enter — run cell in place
- Alt+Enter — run cell and insert new below
- Esc → A — insert cell above (command mode)
- Esc → B — insert cell below
- Esc → D D — delete cell
- Esc → M — convert to Markdown
- Esc → Y — convert to Code
- Ctrl+Shift+P — command palette (JupyterLab)
Kernel Management
A common notebook trap: running cells out of order creates state that doesn't match a top-to-bottom execution. Always validate reproducibility by restarting the kernel and running all cells in order.
- Restart Kernel — clears all variables, keeps outputs
- Restart & Clear Output — clean slate
- Restart & Run All — full reproducibility test
- Variable inspector — Jupyter extension, or the %who / %whos magics
- del variable_name — free memory explicitly
- Watch the kernel status indicator (circle in top-right)
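Freeing memory with del works the same way it does in any Python process — a stdlib-only sketch. One caveat specific to notebooks: the output history (Out[n], _, __) can keep extra references to an object alive, so del on your own name is sometimes not enough.

```python
import gc
import sys

big = [0.0] * 1_000_000                  # a large throwaway object
size_mb = sys.getsizeof(big) / 1e6
print(f"list object: {size_mb:.0f} MB")  # ~8 MB of pointers on 64-bit CPython

del big        # drop the notebook's reference to it
gc.collect()   # collect any reference cycles immediately
```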
Output & Display
Notebooks can render rich media directly inline — HTML tables, interactive widgets, images, audio, and video. Pandas DataFrames render as styled HTML tables automatically.
- display(obj) — explicitly render any object
- pd.set_option('display.max_rows', 100)
- pd.set_option('display.float_format', '{:.3f}'.format)
- from IPython.display import Image, HTML, Latex
- ipywidgets — interactive sliders, dropdowns, buttons
- Rich output is embedded in .ipynb JSON — sharable
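The pandas display options change how every DataFrame renders from that point on. A quick sketch of the effect — float_format also applies to the plain-text repr, which makes it easy to verify outside a notebook:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3.14159, 2.71828, 1.41421]})

pd.set_option("display.float_format", "{:.3f}".format)
pd.set_option("display.max_rows", 100)

text = repr(df)   # same formatting feeds the notebook's HTML table view
print(text)       # column b now shows 3.142, 2.718, 1.414

pd.reset_option("display.float_format")   # restore the default
```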
| Magic Command | What it does | Example |
|---|---|---|
| %timeit | Time a single line over many iterations; reports mean ± std | %timeit np.dot(A, B) |
| %%time | Time the entire cell once (wall time and CPU time) | First line of cell: %%time |
| %%timeit | Time the entire cell over multiple iterations | First line of cell: %%timeit |
| %matplotlib inline | Render matplotlib plots inline in the notebook | Put in first cell of notebook |
| %matplotlib widget | Interactive matplotlib plots (zoom/pan) via ipympl | pip install ipympl first |
| %load_ext autoreload | Auto-reload modules when their source changes on disk | Then: %autoreload 2 |
| %who / %whos | List all variables in the namespace (%whos adds type/size) | %whos DataFrame — only DataFrames |
| %run script.py | Execute an external script; its variables land in the notebook namespace | %run train.py |
| !command | Run a shell command and display its output | !pip install xgboost, !ls -la |
| %env VAR=value | Set or display environment variables | %env CUDA_VISIBLE_DEVICES=0 |
| %%bash | Run the entire cell as a bash script | Multi-line shell commands in cell |
| %pdb on | Enable the post-mortem debugger on exceptions | Opens interactive pdb on error |
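Outside a notebook, %timeit maps onto the standard-library timeit module, which IPython uses under the hood — a sketch comparing two ways of writing the same reduction:

```python
import timeit

setup = "xs = list(range(1000))"
t_gen = timeit.timeit("sum(x * x for x in xs)", setup=setup, number=10_000)
t_list = timeit.timeit("sum([x * x for x in xs])", setup=setup, number=10_000)

print(f"generator expr:     {t_gen:.3f}s for 10k runs")
print(f"list comprehension: {t_list:.3f}s for 10k runs")
```

%timeit adds the repeat-and-report-statistics layer on top of this, plus automatic choice of iteration count.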
📉 Effective ML Workflows
Recommended Notebook Structure
Structuring notebooks consistently makes them easier to review, debug, and hand off. Each major section should be in its own cells with Markdown headers.
- 1. Imports & Config — all imports, constants, paths, seeds
- 2. Data Loading — read raw files, no transformations yet
- 3. EDA — distributions, correlations, missing values, outliers
- 4. Preprocessing — cleaning, encoding, feature engineering
- 5. Model Training — fit, cross-validate, tune
- 6. Evaluation — metrics, plots, error analysis
- 7. Conclusions — findings, next steps, limitations
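The section list above maps directly onto a jupytext py:percent file, where # %% starts a code cell and # %% [markdown] a Markdown cell. A skeleton of the first few sections — cell bodies here are placeholders, not prescribed code:

```python
# %% [markdown]
# # 1. Imports & Config

# %%
import random

RANDOM_SEED = 42
random.seed(RANDOM_SEED)

# %% [markdown]
# # 2. Data Loading

# %%
# df = pd.read_csv(...)   # raw read only — no transformations yet

# %% [markdown]
# # 3. EDA

# %%
# df.describe(); df.isna().sum()
```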
Exporting & Sharing
nbconvert transforms notebooks into static formats for sharing with non-technical stakeholders or including in documentation.
- jupyter nbconvert --to html notebook.ipynb
- jupyter nbconvert --to pdf notebook.ipynb (needs LaTeX)
- jupyter nbconvert --to script notebook.ipynb — plain .py file
- jupyter nbconvert --execute notebook.ipynb — run, then convert
- Quarto — modern alternative for scientific publishing
- GitHub renders .ipynb files natively (static view)
Version Control
Notebooks are JSON with embedded outputs — diffs are noisy, merge conflicts are painful, and outputs include execution counts that change every run. Several tools address this.
- nbstripout — git pre-commit hook strips outputs before commit
- jupytext — sync .ipynb with .py or .md (text format)
- nbdime — notebook-aware diff and merge tools
- Best practice: commit only stripped notebooks or .py equivalents
- Never commit large embedded outputs (plots, model artefacts) — strip them first
Commit .py Files, Not .ipynb Files, for Serious Projects
Notebooks accumulate cell outputs that balloon file size and create meaningless diffs. Use jupytext --sync to maintain a paired .py file (percent format) that imports cleanly, diffs readably, and works with standard Python tooling. Treat the .ipynb as a rendered artifact, not the source of truth.
# ── nbstripout: strip outputs before every git commit ─────────────────────────
# pip install nbstripout
# nbstripout --install # installs git filter for current repo
# nbstripout --install --global # for all repos
# ── jupytext: sync notebook to .py percent format ────────────────────────────
# pip install jupytext
# jupytext --to py:percent my_analysis.ipynb # one-time conversion
# jupytext --sync my_analysis.ipynb # sync paired files
# In notebook metadata, add: "jupytext": {"formats": "ipynb,py:percent"}
# Now both files stay in sync on save
# ── papermill: parameterised notebook execution ──────────────────────────────
# pip install papermill
# Run a notebook with parameters from command line:
# papermill template_notebook.ipynb output_run_42.ipynb \
# -p LEARNING_RATE 0.001 \
# -p N_EPOCHS 50 \
# -p RANDOM_SEED 42
# In the notebook, mark the parameters cell with the tag "parameters":
# (Add tag via View > Cell Toolbar > Tags in classic Notebook)
# LEARNING_RATE = 0.01 # default value — overridden by papermill
# N_EPOCHS = 100
# RANDOM_SEED = 0
# ── Example: well-structured notebook imports cell ────────────────────────────
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.4f}'.format)
plt.rcParams.update({
'figure.figsize': (12, 5),
'axes.grid': True,
'grid.alpha': 0.3,
})
# Paths
DATA_DIR = '../data'
MODELS_DIR = '../models'
FIGURES_DIR = '../figures'
print(f"numpy {np.__version__}")
print(f"pandas {pd.__version__}")
print(f"matplotlib {matplotlib.__version__}")
🖼 Visualisation in Notebooks
Matplotlib (Static)
The foundation of Python plotting. Use the object-oriented fig, ax = plt.subplots() API for anything beyond a single plot — it gives you full control over every plot element.
- fig, ax = plt.subplots() — preferred API
- fig, axes = plt.subplots(2, 3, figsize=(15, 8)) — grid of subplots
- ax.plot(), ax.scatter(), ax.bar(), ax.hist()
- ax.set_xlabel/ylabel/title(); ax.legend()
- plt.tight_layout() — prevent label clipping
- fig.savefig('plot.png', dpi=150, bbox_inches='tight')
Seaborn (Statistical)
Seaborn builds on matplotlib for statistical visualisation. It understands pandas DataFrames natively and produces publication-quality plots with minimal code.
- sns.histplot(data=df, x='col', hue='label')
- sns.boxplot, sns.violinplot — distribution by group
- sns.heatmap(corr_matrix, annot=True)
- sns.pairplot(df, hue='label') — feature scatter matrix
- sns.scatterplot, sns.lineplot
- sns.set_theme(style='darkgrid') — global theme
Plotly (Interactive)
Plotly generates interactive HTML charts that render natively in notebooks. Hover tooltips, zoom, pan, and dropdown menus make exploratory analysis much more productive.
- import plotly.express as px — high-level API
- px.scatter(df, x='feat1', y='feat2', color='label')
- px.histogram, px.box, px.density_heatmap
- px.scatter_3d — 3D scatter for PCA visualisation
- fig.show() renders inline in JupyterLab
- Export to HTML: fig.write_html('plot.html')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline # put this in a cell at notebook start
# Generate synthetic ML results data
rng = np.random.default_rng(42)
n = 300
X = rng.standard_normal((n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
probs = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1]))) + rng.normal(0, 0.1, n)
probs = np.clip(probs, 0, 1)
# ── 1. Training/Validation Loss Curve ────────────────────────────────────────
epochs = np.arange(1, 51)
train_loss = 1.5 * np.exp(-0.08 * epochs) + rng.normal(0, 0.02, 50)
val_loss = 1.5 * np.exp(-0.06 * epochs) + rng.normal(0, 0.04, 50) + 0.1
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(epochs, train_loss, label='Train loss', color='#00d9ff', linewidth=2)
ax.plot(epochs, val_loss, label='Val loss', color='#8b5cf6', linewidth=2, linestyle='--')
ax.axvline(np.argmin(val_loss)+1, color='#f97316', linestyle=':', label=f'Best epoch={np.argmin(val_loss)+1}')
ax.set_xlabel('Epoch'); ax.set_ylabel('Loss')
ax.set_title('Training and Validation Loss', fontweight='bold')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
# plt.savefig('loss_curve.png', dpi=150, bbox_inches='tight')
plt.show()
# ── 2. Confusion Matrix ───────────────────────────────────────────────────────
from sklearn.metrics import confusion_matrix
y_pred = (probs > 0.5).astype(int)
cm = confusion_matrix(y, y_pred)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Pred 0','Pred 1'], yticklabels=['True 0','True 1'],
ax=ax, linewidths=0.5)
ax.set_title('Confusion Matrix', fontweight='bold')
plt.tight_layout(); plt.show()
# ── 3. Feature Importance Bar Chart ──────────────────────────────────────────
feature_names = [f'feat_{i}' for i in range(10)]
importances = np.abs(rng.standard_normal(10))
importances /= importances.sum()
order = np.argsort(importances)[::-1]
fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(np.array(feature_names)[order], importances[order], color='#0066ff', alpha=0.8)
ax.set_xlabel('Importance'); ax.set_title('Feature Importances', fontweight='bold')
ax.invert_yaxis()
plt.tight_layout(); plt.show()
# ── 4. Distribution by Class ─────────────────────────────────────────────────
df = pd.DataFrame({'value': X[:, 0], 'label': y.astype(str)})
fig, ax = plt.subplots(figsize=(8, 4))
for label, color in [('0', '#00d9ff'), ('1', '#f97316')]:
sns.kdeplot(df[df.label==label]['value'], label=f'Class {label}',
ax=ax, fill=True, alpha=0.3, color=color)
ax.set_xlabel('Feature value'); ax.set_ylabel('Density')
ax.set_title('Feature Distribution by Class', fontweight='bold')
ax.legend(); plt.tight_layout(); plt.show()
# ── 5. Correlation Heatmap ────────────────────────────────────────────────────
feature_data = pd.DataFrame(
rng.standard_normal((200, 8)),
columns=[f'F{i}' for i in range(8)]
)
feature_data['F3'] = feature_data['F0'] * 0.8 + rng.normal(0, 0.2, 200)
corr = feature_data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool)) # upper triangle mask
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
center=0, vmin=-1, vmax=1, ax=ax, linewidths=0.5)
ax.set_title('Feature Correlation Heatmap', fontweight='bold')
plt.tight_layout(); plt.show()
🚀 Performance & Remote Notebooks
Profiling in Notebooks
Before optimising, measure. IPython magic commands make profiling trivial. Identify the actual bottleneck before reaching for parallelism or GPU.
- %timeit expr — micro-benchmark a line (auto-repeats)
- %%timeit — benchmark an entire cell
- %prun func() — profile a call with cProfile
- %lprun -f func func(args) — line-by-line timing (line_profiler)
- %mprun -f func func(args) — line-by-line memory (memory_profiler)
- %load_ext memory_profiler — enable memory profiling
GPU in Notebooks
Notebooks can run GPU-accelerated code natively — just ensure your PyTorch or TensorFlow installation includes CUDA support and check device availability.
- import torch; torch.cuda.is_available()
- torch.cuda.get_device_name(0) — GPU model
- torch.cuda.memory_allocated() / 1e9 — GB used
- !nvidia-smi — GPU utilisation, memory, temperature
- %%time + GPU vs CPU: see the speedup directly in the notebook
- torch.cuda.empty_cache() — release cached memory
Remote & Team Notebooks
Running notebooks on a remote GPU server avoids transferring large datasets. SSH tunnels make the remote Jupyter server accessible through a local browser window.
- SSH tunnel: ssh -L 8888:localhost:8888 user@gpu-server
- Start on the remote: jupyter lab --no-browser --port=8888
- Use the token from the remote server output in your local browser
- JupyterHub — multi-user server; one URL, login for each user
- BinderHub — share reproducible notebooks via URL (GitHub + Docker)
| Platform | GPU Available | Cost | Best For |
|---|---|---|---|
| Google Colab (Free) | T4 GPU, sometimes A100 (limited) | Free (with time limits) | Learning, quick experiments, prototyping without local GPU |
| Google Colab Pro | A100 / V100; priority access | ~$10-50/month | Regular GPU work, longer sessions, larger RAM |
| Kaggle Notebooks | P100 / T4 GPU, 30h/week free | Free | Competitions, datasets already available on platform |
| AWS SageMaker Studio | Any EC2 GPU instance (p3, p4, g4dn) | Pay-per-second for GPU instances | Production ML pipelines, MLOps, enterprise teams |
| Paperspace Gradient | A100, A6000, RTX4000 | Free tier + pay-per-use | GPU notebooks with persistent storage, no time limits |
| Local JupyterLab | Your own GPU (NVIDIA/AMD/Apple M-series) | Hardware cost only | Privacy-sensitive data, frequent use, full control |
| JupyterHub (self-hosted) | Depends on server | Server cost | Teams sharing a GPU server; each user gets isolated environment |
# ── Profiling a slow function ──────────────────────────────────────────────────
# In a notebook cell:
# %load_ext line_profiler
# %load_ext memory_profiler
import numpy as np
def slow_pairwise_distance(X):
"""Naive O(n^2) pairwise distances — intentionally slow."""
n = len(X)
dist = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
diff = X[i] - X[j]
dist[i, j] = dist[j, i] = np.sqrt((diff**2).sum())
return dist
def fast_pairwise_distance(X):
"""Vectorised using broadcasting — fast."""
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :] # (n,n,d)
return np.sqrt((diff**2).sum(axis=2))
X_small = np.random.randn(200, 10)
# In notebook: %timeit slow_pairwise_distance(X_small)
# In notebook: %timeit fast_pairwise_distance(X_small)
# Expected speedup: ~100x for n=200
# Line profiler (requires %load_ext line_profiler):
# %lprun -f slow_pairwise_distance slow_pairwise_distance(X_small)
# ── GPU check cell (put at top of GPU notebooks) ──────────────────────────────
import torch
if torch.cuda.is_available():
gpu = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu.name}")
print(f"VRAM: {gpu.total_memory / 1e9:.1f} GB")
print(f"CUDA version: {torch.version.cuda}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
print("Apple Silicon MPS available")
else:
print("No GPU — running on CPU")
# ── Google Colab: mount Google Drive ─────────────────────────────────────────
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('/content/drive/MyDrive/my_dataset.csv')
# ── Install packages in Colab ─────────────────────────────────────────────────
# !pip install -q transformers accelerate bitsandbytes
# !pip install -q 'xformers<0.0.27' # specific version
# ── Colab: check GPU quota remaining ─────────────────────────────────────────
# !nvidia-smi
# from google.colab import runtime
# runtime.unassign() # release GPU when done