📚 What is Jupyter?
The Notebook Concept
A Jupyter notebook is a document that mixes executable code cells with rich-text Markdown cells and their outputs (text, tables, charts, images, LaTeX). This literate-programming style makes notebooks ideal for data exploration, analysis, and communicating findings.
- Code cells — Python (or R, Julia, etc.) executed by the kernel
- Markdown cells — headings, paragraphs, LaTeX equations, images
- Output cells — text, DataFrames, matplotlib plots, HTML widgets
- Raw cells — unformatted content, passed to nbconvert as-is
- Stored as JSON (.ipynb) — both code and outputs are saved
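Because the on-disk format is plain JSON, a notebook can be inspected with nothing but the standard library. A minimal sketch — the nbformat-v4 field names are real, but this tiny notebook dict is hand-built for illustration:

```python
import json

# A minimal nbformat-v4 notebook, constructed by hand for illustration
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print('hi')"], "outputs": []},
    ],
}

text = json.dumps(nb, indent=1)   # this is roughly what an .ipynb file contains
parsed = json.loads(text)
code_cells = [c for c in parsed["cells"] if c["cell_type"] == "code"]
print(f"{len(parsed['cells'])} cells, {len(code_cells)} code")
```

Note that code cells carry an `outputs` list and an `execution_count` — exactly the fields that make version-control diffs noisy.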
The Kernel
The kernel is a separate process that executes code. The notebook frontend communicates with the kernel via ZeroMQ messages. This separation means you can restart the kernel without closing the browser tab, and one notebook server can host kernels in multiple languages.
- IPython kernel — the standard Python kernel
- Each notebook has its own kernel process
- State is persistent within a session (variables survive across cells)
- Restart & Run All — reproducibility test: can anyone run this fresh?
- Multiple kernels: Python, R (IRkernel), Julia (IJulia), bash
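The persistent-state behaviour can be sketched outside Jupyter: a kernel is essentially one long-lived namespace that every cell executes into. A toy simulation (not the real ZeroMQ protocol) that also shows why out-of-order execution is a trap:

```python
# Simulate a kernel: every "cell" executes into one shared namespace,
# so state persists across cells — and re-running cells changes results.
ns = {}
cell_1 = "x = 10"
cell_2 = "y = x * 2"

exec(cell_1, ns)
exec(cell_2, ns)
print(ns["y"])        # 20 — top-to-bottom order behaves as expected

exec("x = 100", ns)   # re-running an earlier cell mutates shared state
exec(cell_2, ns)
print(ns["y"])        # 200 — same cell, different result
```

This is exactly what "Restart & Run All" guards against: it throws the namespace away and checks the notebook still works top to bottom.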
JupyterLab vs Classic vs VS Code
The Jupyter ecosystem has evolved significantly. JupyterLab is now the recommended interface for serious work, though VS Code's notebook support has become excellent for integrated development.
- JupyterLab — modern IDE-like interface; tabs, file browser, terminal
- Classic Notebook — the original interface; simpler; still works
- VS Code Notebooks — integrated with editor, debugger, git; great DX
- Google Colab — cloud, free GPU, no install; based on Jupyter
- JupyterLab Desktop — standalone Electron app for local use
JupyterLab is the Modern Choice
JupyterLab replaces the classic Notebook interface. It provides a full-featured IDE experience with a file browser, multiple notebooks/terminals in tabs, a text editor, CSV viewer, image viewer, and a rich extension ecosystem. Install with pip install jupyterlab and launch with jupyter lab.
# ── Installation ──────────────────────────────────────────────────────────────
# Install JupyterLab (recommended)
# pip install jupyterlab
# Install classic Notebook
# pip install notebook
# Install with conda (includes many data science packages)
# conda install -c conda-forge jupyterlab
# Launch JupyterLab
# jupyter lab
# jupyter lab --port=8889 --no-browser # specific port, no auto-open
# Launch classic Notebook
# jupyter notebook
# ── IPython features available in notebooks ────────────────────────────────────
# Tab completion: type pandas. then Tab → see all methods
# ? after a function: np.random.randn? → show docstring
# ?? after a function: np.random.randn?? → show source code
# Shift+Tab inside function parens → show signature
import numpy as np
import pandas as pd
# These work in any Jupyter cell:
# display() renders rich HTML for DataFrames
df = pd.DataFrame({'a': [1,2,3], 'b': [4.1, 5.2, 6.3]})
# display(df) # renders as styled HTML table
# df # last expression in cell auto-displays
# Multiple displays per cell:
# from IPython.display import display
# display(df.head(3))
# display(df.describe())
⚡ Essential Workflow
Keyboard Shortcuts
Learning the keyboard shortcuts dramatically speeds up notebook work. There are two modes: Command mode (press Esc) for cell-level operations and Edit mode (press Enter) for editing cell content; the classic Notebook marks them with a blue and a green cell border respectively.
- Shift+Enter — run cell and move to next
- Ctrl+Enter — run cell in place
- Alt+Enter — run cell and insert new below
- Esc → A — insert cell above (command mode)
- Esc → B — insert cell below
- Esc → D D — delete cell
- Esc → M — convert to Markdown
- Esc → Y — convert to Code
- Ctrl+Shift+P — command palette (JupyterLab)
Kernel Management
A common notebook trap: running cells out of order creates state that doesn't match a top-to-bottom execution. Always validate reproducibility by restarting the kernel and running all cells in order.
- Restart Kernel — clears all variables, keeps outputs
- Restart & Clear Output — clean slate
- Restart & Run All — full reproducibility test
- Variable inspector — Jupyter extension, or the %who / %whos magics
- del variable_name — free memory explicitly
- Watch the kernel status indicator (circle in top-right)
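Freeing memory with del works the same way it does in any Python process — a stdlib-only sketch. One caveat specific to notebooks: the output history (Out[n], _, __) can keep extra references to an object alive, so del on your own name is sometimes not enough.

```python
import gc
import sys

big = [0.0] * 1_000_000                  # a large throwaway object
size_mb = sys.getsizeof(big) / 1e6
print(f"list object: {size_mb:.0f} MB")  # ~8 MB of pointers on 64-bit CPython

del big        # drop the notebook's reference to it
gc.collect()   # collect any reference cycles immediately
```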
Output & Display
Notebooks can render rich media directly inline — HTML tables, interactive widgets, images, audio, and video. Pandas DataFrames render as styled HTML tables automatically.
- display(obj) — explicitly render any object
- pd.set_option('display.max_rows', 100)
- pd.set_option('display.float_format', '{:.3f}'.format)
- from IPython.display import Image, HTML, Latex
- ipywidgets — interactive sliders, dropdowns, buttons
- Rich output is embedded in .ipynb JSON — sharable
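The pandas display options change how every DataFrame renders from that point on. A quick sketch of the effect — float_format also applies to the plain-text repr, which makes it easy to verify outside a notebook:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3.14159, 2.71828, 1.41421]})

pd.set_option("display.float_format", "{:.3f}".format)
pd.set_option("display.max_rows", 100)

text = repr(df)   # same formatting feeds the notebook's HTML table view
print(text)       # column b now shows 3.142, 2.718, 1.414

pd.reset_option("display.float_format")   # restore the default
```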
| Magic Command | What it does | Example |
|---|---|---|
| %timeit | Time a single line over many iterations; reports mean ± std | %timeit np.dot(A, B) |
| %%time | Time the entire cell once (wall time and CPU time) | First line of cell: %%time |
| %%timeit | Time the entire cell over multiple iterations | First line of cell: %%timeit |
| %matplotlib inline | Render matplotlib plots inline in the notebook | Put in first cell of notebook |
| %matplotlib widget | Interactive matplotlib plots (zoom/pan) via ipympl | pip install ipympl first |
| %load_ext autoreload | Auto-reload modules when their source changes on disk | Then: %autoreload 2 |
| %who / %whos | List all variables in the namespace (%whos adds type/size) | %whos DataFrame — only DataFrames |
| %run script.py | Execute an external script; its variables land in the notebook namespace | %run train.py |
| !command | Run a shell command and display its output | !pip install xgboost, !ls -la |
| %env VAR=value | Set or display environment variables | %env CUDA_VISIBLE_DEVICES=0 |
| %%bash | Run the entire cell as a bash script | Multi-line shell commands in cell |
| %pdb on | Enable the post-mortem debugger on exceptions | Opens interactive pdb on error |
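Outside a notebook, %timeit maps onto the standard-library timeit module, which IPython uses under the hood — a sketch comparing two ways of writing the same reduction:

```python
import timeit

setup = "xs = list(range(1000))"
t_gen = timeit.timeit("sum(x * x for x in xs)", setup=setup, number=10_000)
t_list = timeit.timeit("sum([x * x for x in xs])", setup=setup, number=10_000)

print(f"generator expr:     {t_gen:.3f}s for 10k runs")
print(f"list comprehension: {t_list:.3f}s for 10k runs")
```

%timeit adds the repeat-and-report-statistics layer on top of this, plus automatic choice of iteration count.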
📉 Effective ML Workflows
Recommended Notebook Structure
Structuring notebooks consistently makes them easier to review, debug, and hand off. Each major section should be in its own cells with Markdown headers.
- 1. Imports & Config — all imports, constants, paths, seeds
- 2. Data Loading — read raw files, no transformations yet
- 3. EDA — distributions, correlations, missing values, outliers
- 4. Preprocessing — cleaning, encoding, feature engineering
- 5. Model Training — fit, cross-validate, tune
- 6. Evaluation — metrics, plots, error analysis
- 7. Conclusions — findings, next steps, limitations
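The section list above maps directly onto a jupytext py:percent file, where # %% starts a code cell and # %% [markdown] a Markdown cell. A skeleton of the first few sections — cell bodies here are placeholders, not prescribed code:

```python
# %% [markdown]
# # 1. Imports & Config

# %%
import random

RANDOM_SEED = 42
random.seed(RANDOM_SEED)

# %% [markdown]
# # 2. Data Loading

# %%
# df = pd.read_csv(...)   # raw read only — no transformations yet

# %% [markdown]
# # 3. EDA

# %%
# df.describe(); df.isna().sum()
```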
Exporting & Sharing
nbconvert transforms notebooks into static formats for sharing with non-technical stakeholders or including in documentation.
- jupyter nbconvert --to html notebook.ipynb
- jupyter nbconvert --to pdf notebook.ipynb (needs LaTeX)
- jupyter nbconvert --to script notebook.ipynb — plain .py file
- jupyter nbconvert --execute notebook.ipynb — run, then convert
- Quarto — modern alternative for scientific publishing
- GitHub renders .ipynb files natively (static view)
Version Control
Notebooks are JSON with embedded outputs — diffs are noisy, merge conflicts are painful, and outputs include execution counts that change every run. Several tools address this.
- nbstripout — git pre-commit hook strips outputs before commit
- jupytext — sync .ipynb with .py or .md (text format)
- nbdime — notebook-aware diff and merge tools
- Best practice: commit only stripped notebooks or .py equivalents
- Never commit large embedded outputs (plots, model artefacts) — strip them first
Commit .py Files, Not .ipynb Files, for Serious Projects
Notebooks accumulate cell outputs that balloon file size and create meaningless diffs. Use jupytext --sync to maintain a paired .py file (percent format) that imports cleanly, diffs readably, and works with standard Python tooling. Treat the .ipynb as a rendered artifact, not the source of truth.
# ── nbstripout: strip outputs before every git commit ─────────────────────────
# pip install nbstripout
# nbstripout --install # installs git filter for current repo
# nbstripout --install --global # for all repos
# ── jupytext: sync notebook to .py percent format ────────────────────────────
# pip install jupytext
# jupytext --to py:percent my_analysis.ipynb # one-time conversion
# jupytext --sync my_analysis.ipynb # sync paired files
# In notebook metadata, add: "jupytext": {"formats": "ipynb,py:percent"}
# Now both files stay in sync on save
# ── papermill: parameterised notebook execution ──────────────────────────────
# pip install papermill
# Run a notebook with parameters from command line:
# papermill template_notebook.ipynb output_run_42.ipynb \
# -p LEARNING_RATE 0.001 \
# -p N_EPOCHS 50 \
# -p RANDOM_SEED 42
# In the notebook, mark the parameters cell with the tag "parameters":
# (Add tag via View > Cell Toolbar > Tags in classic Notebook)
# LEARNING_RATE = 0.01 # default value — overridden by papermill
# N_EPOCHS = 100
# RANDOM_SEED = 0
# ── Example: well-structured notebook imports cell ────────────────────────────
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.4f}'.format)
plt.rcParams.update({
'figure.figsize': (12, 5),
'axes.grid': True,
'grid.alpha': 0.3,
})
# Paths
DATA_DIR = '../data'
MODELS_DIR = '../models'
FIGURES_DIR = '../figures'
print(f"numpy {np.__version__}")
print(f"pandas {pd.__version__}")
print(f"matplotlib {matplotlib.__version__}")
🖼 Visualisation in Notebooks
Matplotlib (Static)
The foundation of Python plotting. Use the object-oriented fig, ax = plt.subplots() API for anything beyond a single plot — it gives you full control over every plot element.
- fig, ax = plt.subplots() — preferred API
- fig, axes = plt.subplots(2, 3, figsize=(15, 8)) — grid of subplots
- ax.plot(), ax.scatter(), ax.bar(), ax.hist()
- ax.set_xlabel/ylabel/title(); ax.legend()
- plt.tight_layout() — prevent label clipping
- fig.savefig('plot.png', dpi=150, bbox_inches='tight')
Seaborn (Statistical)
Seaborn builds on matplotlib for statistical visualisation. It understands pandas DataFrames natively and produces publication-quality plots with minimal code.
- sns.histplot(data=df, x='col', hue='label')
- sns.boxplot, sns.violinplot — distribution by group
- sns.heatmap(corr_matrix, annot=True)
- sns.pairplot(df, hue='label') — feature scatter matrix
- sns.scatterplot, sns.lineplot
- sns.set_theme(style='darkgrid') — global theme
Plotly (Interactive)
Plotly generates interactive HTML charts that render natively in notebooks. Hover tooltips, zoom, pan, and dropdown menus make exploratory analysis much more productive.
- import plotly.express as px — high-level API
- px.scatter(df, x='feat1', y='feat2', color='label')
- px.histogram, px.box, px.density_heatmap
- px.scatter_3d — 3D scatter for PCA visualisation
- fig.show() renders inline in JupyterLab
- Export to HTML: fig.write_html('plot.html')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline # put this in a cell at notebook start
# Generate synthetic ML results data
rng = np.random.default_rng(42)
n = 300
X = rng.standard_normal((n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
probs = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1]))) + rng.normal(0, 0.1, n)
probs = np.clip(probs, 0, 1)
# ── 1. Training/Validation Loss Curve ────────────────────────────────────────
epochs = np.arange(1, 51)
train_loss = 1.5 * np.exp(-0.08 * epochs) + rng.normal(0, 0.02, 50)
val_loss = 1.5 * np.exp(-0.06 * epochs) + rng.normal(0, 0.04, 50) + 0.1
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(epochs, train_loss, label='Train loss', color='#00d9ff', linewidth=2)
ax.plot(epochs, val_loss, label='Val loss', color='#8b5cf6', linewidth=2, linestyle='--')
ax.axvline(np.argmin(val_loss)+1, color='#f97316', linestyle=':', label=f'Best epoch={np.argmin(val_loss)+1}')
ax.set_xlabel('Epoch'); ax.set_ylabel('Loss')
ax.set_title('Training and Validation Loss', fontweight='bold')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
# plt.savefig('loss_curve.png', dpi=150, bbox_inches='tight')
plt.show()
# ── 2. Confusion Matrix ───────────────────────────────────────────────────────
from sklearn.metrics import confusion_matrix
y_pred = (probs > 0.5).astype(int)
cm = confusion_matrix(y, y_pred)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Pred 0','Pred 1'], yticklabels=['True 0','True 1'],
ax=ax, linewidths=0.5)
ax.set_title('Confusion Matrix', fontweight='bold')
plt.tight_layout(); plt.show()
# ── 3. Feature Importance Bar Chart ──────────────────────────────────────────
feature_names = [f'feat_{i}' for i in range(10)]
importances = np.abs(rng.standard_normal(10))
importances /= importances.sum()
order = np.argsort(importances)[::-1]
fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(np.array(feature_names)[order], importances[order], color='#0066ff', alpha=0.8)
ax.set_xlabel('Importance'); ax.set_title('Feature Importances', fontweight='bold')
ax.invert_yaxis()
plt.tight_layout(); plt.show()
# ── 4. Distribution by Class ─────────────────────────────────────────────────
df = pd.DataFrame({'value': X[:, 0], 'label': y.astype(str)})
fig, ax = plt.subplots(figsize=(8, 4))
for label, color in [('0', '#00d9ff'), ('1', '#f97316')]:
sns.kdeplot(df[df.label==label]['value'], label=f'Class {label}',
ax=ax, fill=True, alpha=0.3, color=color)
ax.set_xlabel('Feature value'); ax.set_ylabel('Density')
ax.set_title('Feature Distribution by Class', fontweight='bold')
ax.legend(); plt.tight_layout(); plt.show()
# ── 5. Correlation Heatmap ────────────────────────────────────────────────────
feature_data = pd.DataFrame(
rng.standard_normal((200, 8)),
columns=[f'F{i}' for i in range(8)]
)
feature_data['F3'] = feature_data['F0'] * 0.8 + rng.normal(0, 0.2, 200)
corr = feature_data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool)) # upper triangle mask
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
center=0, vmin=-1, vmax=1, ax=ax, linewidths=0.5)
ax.set_title('Feature Correlation Heatmap', fontweight='bold')
plt.tight_layout(); plt.show()
🚀 Performance & Remote Notebooks
Profiling in Notebooks
Before optimising, measure. IPython magic commands make profiling trivial. Identify the actual bottleneck before reaching for parallelism or GPU.
- %timeit expr — micro-benchmark a line (auto-repeats)
- %%timeit — benchmark an entire cell
- %prun func() — profile a call with cProfile
- %lprun -f func func(args) — line-by-line timing (line_profiler)
- %mprun -f func func(args) — line-by-line memory (memory_profiler)
- %load_ext memory_profiler — enable memory profiling
GPU in Notebooks
Notebooks can run GPU-accelerated code natively — just ensure your PyTorch or TensorFlow installation includes CUDA support and check device availability.
- import torch; torch.cuda.is_available()
- torch.cuda.get_device_name(0) — GPU model
- torch.cuda.memory_allocated() / 1e9 — GB used
- !nvidia-smi — GPU utilisation, memory, temperature
- %%time + GPU vs CPU: see the speedup directly in the notebook
- torch.cuda.empty_cache() — release cached memory
Remote & Team Notebooks
Running notebooks on a remote GPU server avoids transferring large datasets. SSH tunnels make the remote Jupyter server accessible through a local browser window.
- SSH tunnel: ssh -L 8888:localhost:8888 user@gpu-server
- Start on the remote: jupyter lab --no-browser --port=8888
- Use the token from the remote server output in your local browser
- JupyterHub — multi-user server; one URL, login for each user
- BinderHub — share reproducible notebooks via URL (GitHub + Docker)
| Platform | GPU Available | Cost | Best For |
|---|---|---|---|
| Google Colab (Free) | T4 GPU, sometimes A100 (limited) | Free (with time limits) | Learning, quick experiments, prototyping without local GPU |
| Google Colab Pro | A100 / V100; priority access | ~$10-50/month | Regular GPU work, longer sessions, larger RAM |
| Kaggle Notebooks | P100 / T4 GPU, 30h/week free | Free | Competitions, datasets already available on platform |
| AWS SageMaker Studio | Any EC2 GPU instance (p3, p4, g4dn) | Pay-per-second for GPU instances | Production ML pipelines, MLOps, enterprise teams |
| Paperspace Gradient | A100, A6000, RTX4000 | Free tier + pay-per-use | GPU notebooks with persistent storage, no time limits |
| Local JupyterLab | Your own GPU (NVIDIA/AMD/Apple M-series) | Hardware cost only | Privacy-sensitive data, frequent use, full control |
| JupyterHub (self-hosted) | Depends on server | Server cost | Teams sharing a GPU server; each user gets isolated environment |
# ── Profiling a slow function ──────────────────────────────────────────────────
# In a notebook cell:
# %load_ext line_profiler
# %load_ext memory_profiler
import numpy as np
def slow_pairwise_distance(X):
"""Naive O(n^2) pairwise distances — intentionally slow."""
n = len(X)
dist = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
diff = X[i] - X[j]
dist[i, j] = dist[j, i] = np.sqrt((diff**2).sum())
return dist
def fast_pairwise_distance(X):
"""Vectorised using broadcasting — fast."""
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :] # (n,n,d)
return np.sqrt((diff**2).sum(axis=2))
X_small = np.random.randn(200, 10)
# In notebook: %timeit slow_pairwise_distance(X_small)
# In notebook: %timeit fast_pairwise_distance(X_small)
# Expected speedup: ~100x for n=200
# Line profiler (requires %load_ext line_profiler):
# %lprun -f slow_pairwise_distance slow_pairwise_distance(X_small)
# ── GPU check cell (put at top of GPU notebooks) ──────────────────────────────
import torch
if torch.cuda.is_available():
gpu = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu.name}")
print(f"VRAM: {gpu.total_memory / 1e9:.1f} GB")
print(f"CUDA version: {torch.version.cuda}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
print("Apple Silicon MPS available")
else:
print("No GPU — running on CPU")
# ── Google Colab: mount Google Drive ─────────────────────────────────────────
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('/content/drive/MyDrive/my_dataset.csv')
# ── Install packages in Colab ─────────────────────────────────────────────────
# !pip install -q transformers accelerate bitsandbytes
# !pip install -q 'xformers<0.0.27' # specific version
# ── Colab: check GPU quota remaining ─────────────────────────────────────────
# !nvidia-smi
# from google.colab import runtime
# runtime.unassign() # release GPU when done