🗂 Data Sources
Every machine learning project begins with a fundamental question: where does the data come from? The source shapes everything downstream — quality, bias, licensing constraints, and ultimately model performance. Understanding the trade-offs between different data sourcing strategies is a core practitioner skill.
| Source Type | Examples | Pros | Cons |
|---|---|---|---|
| Public Datasets | Kaggle, HuggingFace Hub, UCI ML Repository, data.gov, World Bank Open Data, ImageNet, Common Crawl | Free, pre-labelled, community-validated, benchmarks available | May be outdated, over-used (benchmark saturation), not domain-specific, licensing varies |
| Web Scraping | News articles, product reviews, social media posts, forum threads, job listings | Massive scale, real-world distribution, up-to-date | Legal risks, noisy, requires heavy cleaning, robots.txt/ToS constraints, rate limits |
| APIs | Twitter/X API, Reddit API, OpenWeatherMap, financial data APIs (Alpha Vantage), Google Maps | Structured, clean, usually real-time, terms of service clear | Rate limits, cost at scale, API changes can break pipelines, data access may be restricted |
| Internal Databases | CRM data, server logs, ERP systems, application telemetry, transaction records | Domain-specific, proprietary advantage, often labelled by business process | Data silos, schema inconsistencies, PII concerns, requires data engineering |
| Synthetic Generation | GANs, diffusion models, LLM-generated text, SMOTE, simulation engines | Privacy-preserving, controllable distribution, unlimited volume | Distribution shift from real data, mode collapse risk, requires validation |
| Data Purchase / Licensing | Bloomberg data, Refinitiv, medical record datasets, commercial annotation services | High quality, expert-labelled, legal clarity | Expensive, locked-in, may not cover edge cases, resale restrictions |
Public Dataset Repositories
- Kaggle — competitions + community datasets; wide variety of domains; CSV-heavy
- HuggingFace Hub — the go-to for NLP; models, datasets, and spaces in one place
- UCI ML Repository — classic benchmark datasets; tabular data focus; well-cited
- Google Dataset Search — meta-search across thousands of published datasets
- data.gov / EU Open Data — government datasets; geospatial, demographic, regulatory
- OpenML — benchmark suites with reproducible experiments and metadata
Domain-Specific Sources
- Computer Vision — COCO, Open Images, LAION-5B, CelebA, CIFAR-10/100
- NLP / Text — Common Crawl, The Pile, Wikipedia dumps, BookCorpus
- Audio / Speech — LibriSpeech, Mozilla Common Voice, VoxCeleb
- Cybersecurity — CICIDS, KDD Cup 99, NSL-KDD, UNSW-NB15 for intrusion detection
- Medical — MIMIC-III, NIH Chest X-rays, PhysioNet (strict access controls)
- Finance — Yahoo Finance, Quandl, SEC EDGAR filings
📊 Data Volume Planning
One of the most common questions in applied ML is "how much data do I need?" The honest answer depends on model complexity, task difficulty, and the quality of your labels. Here are evidence-based heuristics to guide collection efforts before you commit significant resources.
Quality >> Quantity
A carefully curated dataset of 10,000 clean, correctly labelled examples will often outperform a noisy dataset of 1,000,000 examples. Before scaling collection, maximise label quality, resolve ambiguous examples, and remove near-duplicates.
Classical ML Rules of Thumb
- Aim for at least 10× as many training examples as features to reduce the risk of overfitting
- For classification: aim for at least 100–1000 examples per class, more for complex boundaries
- Learning curves are your best guide — plot validation performance vs training set size to see if more data still helps
- Diminishing returns typically set in — once past the steep initial part of the learning curve, doubling the data often yields only modest gains
- Use cross-validation on small datasets to maximise use of available data
- Feature engineering quality can substitute for raw data volume in shallow models
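The learning-curve heuristic above can be sketched as a simple decision rule. It assumes validation score grows roughly linearly in the log of the training-set size (a common empirical pattern); the function name and the gain threshold are illustrative, not standard:

```python
import math

def more_data_likely_helps(sizes, scores, min_gain_per_doubling=0.005):
    """Extrapolate the last segment of a learning curve: is doubling the
    training set projected to gain at least min_gain_per_doubling?"""
    (s0, v0), (s1, v1) = (sizes[-2], scores[-2]), (sizes[-1], scores[-1])
    slope = (v1 - v0) / (math.log2(s1) - math.log2(s0))  # gain per doubling
    return slope >= min_gain_per_doubling
```

For example, validation accuracies of 0.80 / 0.85 / 0.90 at 1k / 2k / 4k examples still project a healthy gain per doubling, while a flat tail like 0.888 / 0.890 / 0.891 suggests effort is better spent on label quality than on volume.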
Deep Learning & LLM Scaling
- Chinchilla scaling laws (Hoffmann et al., 2022): for compute-optimal training, tokens ≈ 20× model parameters
- GPT-3 (175B params) would have been Chinchilla-optimal at ~3.5T tokens; it was actually trained on ~300B tokens
- For fine-tuning: even hundreds to low thousands of high-quality examples can work with LoRA/PEFT
- Transfer learning changes the equation — leverage pre-trained models to reduce data requirements by orders of magnitude
- Data diversity often matters more than raw count for generalisation
- Consider active learning to label only the most informative examples
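The Chinchilla rule of thumb above is simple arithmetic; a toy helper (the function name is ours) makes the GPT-3 comparison concrete:

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal token budget ≈ 20 × parameter count (Hoffmann et al., 2022)."""
    return 20 * n_params

# GPT-3 scale: 175e9 parameters → 3.5e12 tokens would have been compute-optimal,
# more than 10× the ~300B tokens it was actually trained on.
budget = chinchilla_optimal_tokens(175e9)
```

Chinchilla itself (70B params, ~1.4T tokens) sits almost exactly on this line, which is where the heuristic comes from.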
| Task Type | Minimum Viable | Production Quality | Notes |
|---|---|---|---|
| Binary classification (tabular) | 500–2,000 per class | 10,000+ per class | Depends heavily on feature count and separability |
| Multi-class classification (10 classes) | 200–500 per class | 5,000+ per class | Use stratified sampling |
| Image classification (CNN fine-tune) | 100–500 per class | 2,000–10,000 per class | With ImageNet pre-training |
| LLM instruction fine-tuning | 500–2,000 QA pairs | 50,000–500,000 pairs | Quality of instructions critical; RLHF adds more |
| Named entity recognition | 1,000 sentences | 20,000+ sentences | Token-level annotation; consider distant supervision |
🕸 Web Scraping & APIs
Web scraping enables collection of data at a scale and recency that no public dataset can match, but it comes with technical, legal, and ethical complexity. Understanding how to scrape responsibly is as important as knowing how to scrape effectively.
Responsible Scraping Principles
- robots.txt — always check example.com/robots.txt; respect Disallow directives even when they are technically bypassable
- Rate limiting — add delays (1–5 seconds) between requests; use exponential backoff on 429 responses
- User-Agent — identify your bot honestly; contact site owners for large-scale crawls
- Terms of Service — many ToS explicitly prohibit scraping; hiQ v. LinkedIn established some legal precedent but landscape is unsettled
- Cache aggressively — avoid re-fetching; store raw HTML before parsing to avoid re-crawls
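The exponential-backoff advice in the rate-limiting bullet can be sketched as a small retry wrapper. This is a toy: the fetch callable returning a bare status code is our simplifying convention, and real code would inspect a response object and honour any Retry-After header:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0):
    """Retry fetch(url) on 429, doubling the delay on each attempt (plus jitter)."""
    for attempt in range(max_tries):
        status = fetch(url)
        if status != 429:
            return status
        # exponential backoff with multiplicative jitter to avoid thundering herds
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```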
Deduplication & Pagination
- URL normalisation — strip tracking parameters, normalise trailing slashes, resolve redirects before storing
- Content hashing — MD5/SHA256 hash of page body to detect exact duplicates across different URLs
- Near-deduplication — MinHash or SimHash for near-duplicate detection at scale
- Pagination patterns — ?page=N, offset/limit params, cursor-based (next_token), infinite scroll (XHR interception)
- Sitemap.xml — use sitemaps as a structured URL frontier for comprehensive crawls
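The MinHash idea from the near-deduplication bullet can be sketched with the standard library alone — a toy version for intuition (real pipelines would use a tuned implementation such as the datasketch library; the function names here are ours):

```python
import hashlib

def shingles(text, k=3):
    """Overlapping word k-grams — the unit of comparison for near-duplicates."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two pages differing by a few words share most shingles, so their signatures agree in most slots; a similarity threshold then flags near-duplicate pairs without expensive pairwise set comparisons.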
Minimal Python Scraper Outline
```python
import hashlib
import time
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

# ---- Check robots.txt ----
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

BASE_URL = "https://example.com/articles"
DELAY = 2     # seconds between requests
seen = set()  # content hashes, for exact-duplicate detection

def scrape_page(url):
    if not rp.can_fetch("*", url):
        print(f"Blocked by robots.txt: {url}")
        return None
    resp = requests.get(url, headers={"User-Agent": "MyResearchBot/1.0"}, timeout=10)
    resp.raise_for_status()
    content_hash = hashlib.md5(resp.content).hexdigest()
    if content_hash in seen:
        return None  # exact duplicate of a page we already stored
    seen.add(content_hash)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")
    article = soup.find("article")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "body": article.get_text(separator=" ", strip=True) if article else "",
    }

# ---- Paginated crawl ----
results = []
for page_num in range(1, 20):
    data = scrape_page(f"{BASE_URL}?page={page_num}")
    if data:
        results.append(data)
    time.sleep(DELAY)

# For large-scale scraping, use Scrapy + scrapy-redis for distributed crawling:
# pip install scrapy scrapy-redis
```
Scrapy for Production-Scale Crawls
For anything beyond a few hundred pages, use Scrapy — it provides built-in rate limiting (AutoThrottle), retry logic, proxy-rotation middleware, item pipelines for cleaning, and scrapy-redis for distributed crawling across multiple nodes. The scrapy crawl command handles concurrency, politeness, and logging automatically.
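A sketch of what the politeness knobs look like in a Scrapy project's settings.py — the values are illustrative starting points rather than universal recommendations, and the bot-info URL is a placeholder:

```python
# settings.py — illustrative politeness configuration for a Scrapy project
USER_AGENT = "MyResearchBot/1.0 (+https://example.com/bot-info)"  # identify honestly
ROBOTSTXT_OBEY = True                # respect robots.txt Disallow rules
DOWNLOAD_DELAY = 2                   # base delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # keep per-site load low
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 30
RETRY_ENABLED = True                 # retry transient failures (429/5xx)
RETRY_TIMES = 3
HTTPCACHE_ENABLED = True             # cache raw responses to avoid re-fetching
```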
🧪 Synthetic Data Generation
When real data is scarce, sensitive (medical records, financial fraud), expensive to label, or contains privacy constraints, synthetic data bridges the gap. The key question is always whether the synthetic distribution is close enough to the real one to be useful for training.
| Method | Use Case | Quality | Risk / Caveats |
|---|---|---|---|
| SMOTE | Tabular minority class oversampling for imbalanced datasets | Good for linear boundaries; weak for high-dimensional data | Can create unrealistic interpolations; noisy boundary samples |
| GANs (DCGAN, CTGAN) | Image synthesis; tabular data (CTGAN for mixed types) | High visual quality; CTGAN handles categorical columns | Training instability; mode collapse; long training times |
| Diffusion Models | High-quality image & audio synthesis; increasingly text | State-of-the-art image fidelity (DALL-E, Stable Diffusion) | Computationally expensive; copyright concerns on training data |
| LLM-Generated Text | Instruction-tuning pairs, question generation, data augmentation | High semantic quality; can target specific formats and styles | Model collapse if used to train the generating model; hallucinations |
| Simulation / Rule-Based | Robotics, games, autonomous driving (CARLA), network traffic | Perfect ground truth; unlimited volume; controllable conditions | Sim-to-real gap; unrealistic sensor noise; brittle assumptions |
| Differential Privacy Synthesis | Privacy-preserving release of sensitive tabular data | Moderate — privacy-utility trade-off | Added noise reduces statistical fidelity; complex to tune epsilon |
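SMOTE's core interpolation step is simple enough to sketch in a few lines — a toy version for intuition (production work would use imbalanced-learn's SMOTE; the function name and defaults here are ours):

```python
import math
import random

def smote_sample(minority, k=2, n_new=10, seed=0):
    """SMOTE-style oversampling: synthesise new points by interpolating between
    a minority-class sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p != base),
            key=lambda p: math.dist(base, p),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        new_points.append(tuple(b + t * (q - b) for b, q in zip(base, nb)))
    return new_points
```

Because every synthetic point lies on a segment between two real minority points, the "unrealistic interpolations" caveat from the table follows directly: if the minority class is non-convex, some of those segments cross regions where no real example would ever occur.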
When Synthetic Data Shines
- Rare events that never appear in real data (edge cases in autonomous driving)
- Privacy-sensitive domains where real data can't be shared (medical, financial)
- Counterfactual generation — "what if" scenarios for fairness testing
- Bootstrapping a model before real data collection starts
- Data augmentation to teach invariances (see the augmentation page in this series)
Validating Synthetic Data
- Train-on-synthetic, test-on-real (TSTR) — the gold standard evaluation
- Statistical similarity — compare marginal distributions, correlations, and joint distributions
- Privacy audit — membership inference attacks to check if real records are memorised
- Domain expert review — human evaluation for plausibility
- FID (Fréchet Inception Distance) for image quality
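Marginal-distribution comparison can start as simply as a two-sample Kolmogorov–Smirnov statistic per numeric column. A standard-library sketch (scipy.stats.ks_2samp does this properly, with p-values):

```python
import bisect

def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    sr, ss = sorted(real), sorted(synth)
    n, m = len(sr), len(ss)
    gap = 0.0
    for x in sr + ss:  # ECDF steps only occur at sample points
        f_real = bisect.bisect_right(sr, x) / n
        f_synth = bisect.bisect_right(ss, x) / m
        gap = max(gap, abs(f_real - f_synth))
    return gap
```

A statistic near 0 means the synthetic marginal tracks the real one; near 1 means the distributions barely overlap. Marginals alone are not sufficient — correlations and joint structure still need separate checks — but they catch gross failures cheaply.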
⚖ Data Licensing & Ethics
The legal and ethical landscape around training data has shifted dramatically. High-profile lawsuits (Getty Images v. Stability AI, The New York Times v. OpenAI) and incoming EU AI Act requirements mean practitioners must understand licensing before they collect, not after.
Common Open Data Licences
- CC0 (Public Domain) — no restrictions; ideal for training data
- CC BY — attribution required; commercial use allowed; fine for most ML
- CC BY-SA — ShareAlike: derived works must use same licence; problematic for proprietary models
- CC BY-NC — non-commercial only; cannot use for commercial product training
- ODbL (Open Database Licence) — used by OpenStreetMap; share-alike for the database
- MIT / Apache 2.0 / BSD — common for code datasets; usually permissive
GDPR & Data Collection Rules
- Lawful basis — need consent, legitimate interest, or another Article 6 basis to collect personal data
- Purpose limitation — data collected for one purpose can't be repurposed for unrelated ML training without new consent
- Data minimisation — collect only what's needed; anonymise where possible
- Right to erasure — "machine unlearning" is an open research problem; design systems to handle deletion requests
- DPA registration — notify your Data Protection Authority if processing at scale
Training on Scraped Web Data — Legal Frontier
Multiple ongoing lawsuits challenge whether scraping publicly accessible websites for ML training constitutes copyright infringement. The legal landscape differs by jurisdiction: the US has a fair use doctrine that may protect some training uses; the EU is more restrictive. Best practices include: using datasets with explicit training permissions (like LAION with watermarked provenance), maintaining data cards documenting sources, and consulting legal counsel for commercial projects. The EU AI Act will require disclosure of training data for general-purpose AI models.
Bias in Data Sourcing
Bias enters ML systems long before modelling begins. Common sourcing biases include:
- Selection bias — datasets reflect who had access to the data-generating system (e.g., medical datasets skew toward patients who sought care)
- Historical bias — datasets encoding past discrimination (hiring data, criminal justice records) perpetuate it
- Measurement bias — different quality sensors or annotation standards across demographic groups
- Label bias — annotators bring their own cultural assumptions; use multiple annotators and measure inter-annotator agreement (Cohen's kappa)
- Representation bias — Common Crawl-derived datasets over-represent English and Western perspectives and under-represent Global South populations
- Provenance documentation — use Datasheets for Datasets (Gebru et al.) or Data Cards to record collection methodology, known limitations, and intended uses
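The inter-annotator agreement measure from the label-bias bullet is easy to compute directly. A minimal Cohen's kappa for two annotators (the helper name is ours; libraries like scikit-learn provide cohen_kappa_score):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators, corrected for
    the agreement expected by chance given each annotator's label frequencies."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)
```

Kappa is 1.0 for perfect agreement and around 0 when annotators agree no more often than chance; persistently low kappa on particular slices of the data is itself a signal of label bias worth recording in the dataset's documentation.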