⏱ 7 min read 📊 Beginner 🗓 Updated Jan 2025

🗂 Data Sources

Every machine learning project begins with a fundamental question: where does the data come from? The source shapes everything downstream — quality, bias, licensing constraints, and ultimately model performance. Understanding the trade-offs between different data sourcing strategies is a core practitioner skill.

| Source Type | Examples | Pros | Cons |
|---|---|---|---|
| Public Datasets | Kaggle, HuggingFace Hub, UCI ML Repository, data.gov, World Bank Open Data, ImageNet, Common Crawl | Free, pre-labelled, community-validated, benchmarks available | May be outdated, over-used (benchmark saturation), not domain-specific, licensing varies |
| Web Scraping | News articles, product reviews, social media posts, forum threads, job listings | Massive scale, real-world distribution, up-to-date | Legal risks, noisy, requires heavy cleaning, robots.txt/ToS constraints, rate limits |
| APIs | Twitter/X API, Reddit API, OpenWeatherMap, financial data APIs (Alpha Vantage), Google Maps | Structured, clean, usually real-time, clear terms of service | Rate limits, cost at scale, API changes can break pipelines, data access may be restricted |
| Internal Databases | CRM data, server logs, ERP systems, application telemetry, transaction records | Domain-specific, proprietary advantage, often labelled by business process | Data silos, schema inconsistencies, PII concerns, requires data engineering |
| Synthetic Generation | GANs, diffusion models, LLM-generated text, SMOTE, simulation engines | Privacy-preserving, controllable distribution, unlimited volume | Distribution shift from real data, mode collapse risk, requires validation |
| Data Purchase / Licensing | Bloomberg data, Refinitiv, medical record datasets, commercial annotation services | High quality, expert-labelled, legal clarity | Expensive, locked-in, may not cover edge cases, resale restrictions |

Public Dataset Repositories

  • Kaggle — competitions + community datasets; wide variety of domains; CSV-heavy
  • HuggingFace Hub — the go-to for NLP; models, datasets, and spaces in one place
  • UCI ML Repository — classic benchmark datasets; tabular data focus; well-cited
  • Google Dataset Search — meta-search across thousands of published datasets
  • data.gov / EU Open Data — government datasets; geospatial, demographic, regulatory
  • OpenML — benchmark suites with reproducible experiments and metadata

Domain-Specific Sources

  • Computer Vision — COCO, Open Images, LAION-5B, CelebA, CIFAR-10/100
  • NLP / Text — Common Crawl, The Pile, Wikipedia dumps, BookCorpus
  • Audio / Speech — LibriSpeech, Mozilla Common Voice, VoxCeleb
  • Cybersecurity — CICIDS, KDD Cup 99, NSL-KDD, UNSW-NB15 for intrusion detection
  • Medical — MIMIC-III, NIH Chest X-rays, PhysioNet (strict access controls)
  • Finance — Yahoo Finance, Quandl, SEC EDGAR filings

📊 Data Volume Planning

One of the most common questions in applied ML is "how much data do I need?" The honest answer depends on model complexity, task difficulty, and the quality of your labels. Here are evidence-based heuristics to guide collection efforts before you commit significant resources.

Quality >> Quantity

A carefully curated dataset of 10,000 clean, correctly-labelled examples almost always outperforms a noisy dataset of 1,000,000 examples. Before scaling collection, maximise label quality, resolve ambiguous examples, and remove near-duplicates.

Classical ML Rules of Thumb

  • Start with at least 10× as many training examples as features to reduce overfitting risk
  • For classification: aim for at least 100–1000 examples per class, more for complex boundaries
  • Learning curves are your best guide — plot validation performance vs training set size to see if more data still helps
  • Diminishing returns typically set in — doubling data often yields only modest gains after the initial slope
  • Use cross-validation on small datasets to maximise use of available data
  • Feature engineering quality can substitute for raw data volume in shallow models
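
The learning-curve heuristic above can be sketched with scikit-learn's `learning_curve` helper. The dataset and model here are purely illustrative (synthetic data, logistic regression); substitute your own:

```python
# Sketch: use a learning curve to decide whether more data would still help.
# Assumes scikit-learn is available; dataset and model are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training split
    cv=5, scoring="accuracy",
)

# If validation accuracy has plateaued across these sizes, more data
# of the same kind is unlikely to help much.
for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} examples -> validation accuracy {score:.3f}")
```

Plot `val_scores.mean(axis=1)` against `train_sizes` to see the curve; a still-rising tail is the signal that collecting more data is worthwhile.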

Deep Learning & LLM Scaling

  • Chinchilla scaling laws (Hoffmann et al., 2022): for compute-optimal training, tokens ≈ 20× model parameters
  • GPT-3 (175B params) would be Chinchilla-optimal at ~3.5T tokens; it was trained on ~300B
  • For fine-tuning: even hundreds to low thousands of high-quality examples can work with LoRA/PEFT
  • Transfer learning changes the equation — leverage pre-trained models to reduce data requirements by orders of magnitude
  • Data diversity often matters more than raw count for generalisation
  • Consider active learning to label only the most informative examples
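
The Chinchilla rule of thumb above is simple enough to compute directly. The model sizes below are illustrative:

```python
# Back-of-envelope Chinchilla check: compute-optimal tokens ~ 20 x parameters
# (Hoffmann et al., 2022). Model sizes below are illustrative.
def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Approximate compute-optimal training token count for a parameter count."""
    return ratio * n_params

for name, params in [("1B model", 1e9), ("7B model", 7e9), ("GPT-3 (175B)", 175e9)]:
    tokens = chinchilla_optimal_tokens(params)
    print(f"{name}: ~{tokens / 1e12:.1f}T tokens")
# GPT-3 at 175B parameters works out to ~3.5T tokens, matching the figure above
```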

| Task Type | Minimum Viable | Production Quality | Notes |
|---|---|---|---|
| Binary classification (tabular) | 500–2,000 per class | 10,000+ per class | Depends heavily on feature count and separability |
| Multi-class classification (10 classes) | 200–500 per class | 5,000+ per class | Use stratified sampling |
| Image classification (CNN fine-tune) | 100–500 per class | 2,000–10,000 per class | With ImageNet pre-training |
| LLM instruction fine-tuning | 500–2,000 QA pairs | 50,000–500,000 pairs | Quality of instructions critical; RLHF adds more |
| Named entity recognition | 1,000 sentences | 20,000+ sentences | Token-level annotation; consider distant supervision |

🕸 Web Scraping & APIs

Web scraping enables collection of data at a scale and recency that no public dataset can match, but it comes with technical, legal, and ethical complexity. Understanding how to scrape responsibly is as important as knowing how to scrape effectively.

Responsible Scraping Principles

  • robots.txt — always check example.com/robots.txt; respect Disallow directives even when technically bypassable
  • Rate limiting — add delays (1–5 seconds) between requests; use exponential backoff on 429 responses
  • User-Agent — identify your bot honestly; contact site owners for large-scale crawls
  • Terms of Service — many ToS explicitly prohibit scraping; hiQ v. LinkedIn established some legal precedent but landscape is unsettled
  • Cache aggressively — avoid re-fetching; store raw HTML before parsing to avoid re-crawls
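
The backoff rule above (wait, then retry with exponentially longer delays on 429 responses) can be sketched in a few lines. `fetch_fn` is a placeholder for your actual HTTP call, not a real library function:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: base * 2^attempt, capped, plus 0-1s noise."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def fetch_with_backoff(fetch_fn, url, max_retries=5):
    """Sketch of a polite fetch loop; fetch_fn(url) -> (status_code, body)."""
    for attempt in range(max_retries):
        status, body = fetch_fn(url)
        if status == 429:                      # rate-limited: wait, then retry
            time.sleep(backoff_delay(attempt))
            continue
        return body
    raise RuntimeError(f"Gave up on {url} after {max_retries} retries")
```

The jitter term matters in practice: without it, many clients that were throttled at the same moment all retry at the same moment again.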

Deduplication & Pagination

  • URL normalisation — strip tracking parameters, normalise trailing slashes, resolve redirects before storing
  • Content hashing — MD5/SHA256 hash of page body to detect exact duplicates across different URLs
  • Near-deduplication — MinHash or SimHash for near-duplicate detection at scale
  • Pagination patterns — ?page=N, offset/limit params, cursor-based (next_token), infinite scroll (XHR interception)
  • Sitemap.xml — use sitemaps as a structured URL frontier for comprehensive crawls
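
The URL-normalisation bullet above can be sketched with the standard library. The tracking-parameter list is illustrative, not exhaustive:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Illustrative subset of common tracking parameters to strip before storing
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalise_url(url: str) -> str:
    """Canonicalise a URL for the crawl frontier: lowercase the host,
    drop tracking parameters, sort the query, strip trailing slash and fragment."""
    parts = urlparse(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", query, ""))

print(normalise_url("https://Example.com/a/?utm_source=x&page=2"))
# -> https://example.com/a?page=2
```

Normalising before hashing means the same article reached via two tracking links counts as one URL, which keeps the frontier and the dedup set small.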

Minimal Python Scraper Outline

import requests
from bs4 import BeautifulSoup
import time
import hashlib
from urllib.robotparser import RobotFileParser

# ---- Check robots.txt ----
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

BASE_URL = "https://example.com/articles"
DELAY    = 2        # seconds between requests
seen     = set()    # URL deduplication

def scrape_page(url):
    if not rp.can_fetch("*", url):
        print(f"Blocked by robots.txt: {url}")
        return None

    try:
        resp = requests.get(url, headers={"User-Agent": "MyResearchBot/1.0"}, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:  # network error or non-2xx status
        print(f"Request failed for {url}: {exc}")
        return None             # skip this page instead of aborting the crawl

    content_hash = hashlib.md5(resp.content).hexdigest()
    if content_hash in seen:
        return None          # exact duplicate
    seen.add(content_hash)

    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "url":   url,
        "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else "",
        "body":  soup.find("article").get_text(separator=" ", strip=True) if soup.find("article") else "",
    }

# ---- Paginated crawl ----
results = []
for page_num in range(1, 20):
    url  = f"{BASE_URL}?page={page_num}"
    data = scrape_page(url)
    if data:
        results.append(data)
    time.sleep(DELAY)

# For large-scale scraping, use Scrapy + Redis for distributed crawling
# pip install scrapy scrapy-redis

Scrapy for Production-Scale Crawls

For anything beyond a few hundred pages, use Scrapy — it provides built-in middleware for rate limiting, retry logic, proxy rotation, item pipelines for cleaning, and Scrapy-Redis for distributed crawling across multiple nodes. The scrapy crawl command handles concurrency, politeness, and logging automatically.

🧪 Synthetic Data Generation

When real data is scarce, sensitive (medical records, financial fraud), expensive to label, or subject to privacy constraints, synthetic data bridges the gap. The key question is always whether the synthetic distribution is close enough to the real one to be useful for training.

| Method | Use Case | Quality | Risk / Caveats |
|---|---|---|---|
| SMOTE | Tabular minority-class oversampling for imbalanced datasets | Good for linear boundaries; weak for high-dimensional data | Can create unrealistic interpolations; noisy boundary samples |
| GANs (DCGAN, CTGAN) | Image synthesis; tabular data (CTGAN for mixed types) | High visual quality; CTGAN handles categorical columns | Training instability; mode collapse; long training times |
| Diffusion Models | High-quality image & audio synthesis; increasingly text | State-of-the-art image fidelity (DALL-E, Stable Diffusion) | Computationally expensive; copyright concerns on training data |
| LLM-Generated Text | Instruction-tuning pairs, question generation, data augmentation | High semantic quality; can target specific formats and styles | Model collapse if used to train the generating model; hallucinations |
| Simulation / Rule-Based | Robotics, games, autonomous driving (CARLA), network traffic | Perfect ground truth; unlimited volume; controllable conditions | Sim-to-real gap; unrealistic sensor noise; brittle assumptions |
| Differential Privacy Synthesis | Privacy-preserving release of sensitive tabular data | Moderate; privacy–utility trade-off | Added noise reduces statistical fidelity; complex to tune epsilon |
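
SMOTE's core step is simple enough to sketch: pick a minority-class point and interpolate toward one of its k nearest minority neighbours. This toy version uses plain tuples and brute-force distances; for real work use imbalanced-learn's implementation:

```python
import random

def smote_sample(minority, k=3, rng=random.Random(0)):
    """One SMOTE step: interpolate between a minority point and one of its
    k nearest minority neighbours by a random factor in [0, 1]."""
    x = rng.choice(minority)
    # squared Euclidean distance to every other minority point (brute force)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbours = sorted((p for p in minority if p is not x),
                        key=lambda p: dist(x, p))[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()
    return tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
synthetic = [smote_sample(minority) for _ in range(5)]
```

Because every synthetic point lies on a segment between two real minority points, SMOTE can only fill in the convex hull of the minority class, which is exactly why it struggles with non-linear boundaries and high dimensions, as noted in the table.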

When Synthetic Data Shines

  • Rare events that never appear in real data (edge cases in autonomous driving)
  • Privacy-sensitive domains where real data can't be shared (medical, financial)
  • Counterfactual generation — "what if" scenarios for fairness testing
  • Bootstrapping a model before real data collection starts
  • Data augmentation to teach invariances (see the augmentation page in this series)

Validating Synthetic Data

  • Train-on-synthetic, test-on-real (TSTR) — the gold standard evaluation
  • Statistical similarity — compare marginal distributions, correlations, and joint distributions
  • Privacy audit — membership inference attacks to check if real records are memorised
  • Domain expert review — human evaluation for plausibility
  • FID (Fréchet Inception Distance) for image quality
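
The statistical-similarity check above can start very simply: compare per-column marginals of real vs synthetic data. This stdlib-only sketch reports mean and standard-deviation gaps; a full validation would also compare correlations and run TSTR:

```python
from statistics import mean, stdev

def marginal_report(real, synthetic):
    """Compare per-column mean and standard deviation of real vs synthetic rows.
    Large gaps flag columns whose marginal distribution has drifted."""
    report = []
    for col in range(len(real[0])):
        r = [row[col] for row in real]
        s = [row[col] for row in synthetic]
        report.append({
            "column": col,
            "mean_gap": abs(mean(r) - mean(s)),
            "std_gap": abs(stdev(r) - stdev(s)),
        })
    return report

real = [(0.0, 10.0), (2.0, 12.0), (4.0, 14.0)]       # toy data
synth = [(1.0, 11.0), (3.0, 13.0)]
for row in marginal_report(real, synth):
    print(row)
```

Matching marginals is necessary but not sufficient: two datasets can have identical per-column statistics and completely different joint structure, which is why TSTR remains the gold standard.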

⚖ Data Licensing & Ethics

The legal and ethical landscape around training data has shifted dramatically. High-profile lawsuits (Getty Images v. Stability AI, The New York Times v. OpenAI) and incoming EU AI Act requirements mean practitioners must understand licensing before they collect, not after.

Common Open Data Licences

  • CC0 (Public Domain) — no restrictions; ideal for training data
  • CC BY — attribution required; commercial use allowed; fine for most ML
  • CC BY-SA — ShareAlike: derived works must use same licence; problematic for proprietary models
  • CC BY-NC — non-commercial only; cannot use for commercial product training
  • ODbL (Open Database Licence) — used by OpenStreetMap; share-alike for the database
  • MIT / Apache 2.0 / BSD — common for code datasets; usually permissive

GDPR & Data Collection Rules

  • Lawful basis — need consent, legitimate interest, or another Article 6 basis to collect personal data
  • Purpose limitation — data collected for one purpose can't be repurposed for unrelated ML training without new consent
  • Data minimisation — collect only what's needed; anonymise where possible
  • Right to erasure — "machine unlearning" is an open research problem; design systems to handle deletion requests
  • DPA registration — notify your Data Protection Authority if processing at scale
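
One common data-minimisation technique for direct identifiers is a keyed hash. A minimal sketch (the key name and value are illustrative; in practice the key lives in a secrets manager, not in code):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-and-store-me-in-a-vault"  # illustrative; keep out of code

def pseudonymise(value: str) -> str:
    """Keyed hash (HMAC-SHA256) of a direct identifier. Unlike a plain hash,
    an attacker without the key cannot brute-force common values such as
    email addresses. Note: under GDPR this is pseudonymisation, not
    anonymisation - the key holder can still link records."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

user_id = pseudonymise("alice@example.com")
```

Pseudonymised data is still personal data under GDPR, so deletion requests and purpose limitation continue to apply; the technique reduces exposure, it does not remove obligations.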

Training on Scraped Web Data — Legal Frontier

Multiple ongoing lawsuits challenge whether scraping publicly accessible websites for ML training constitutes copyright infringement. The legal landscape differs by jurisdiction: the US has a fair use doctrine that may protect some training uses; the EU is more restrictive. Best practices include: using datasets with explicit training permissions (like LAION with watermarked provenance), maintaining data cards documenting sources, and consulting legal counsel for commercial projects. The EU AI Act will require disclosure of training data for general-purpose AI models.

Bias in Data Sourcing

Bias enters ML systems long before modelling begins. Common sourcing biases include:

  • Selection bias — datasets reflect who had access to the data-generating system (e.g., medical datasets skew toward patients who sought care)
  • Historical bias — datasets encoding past discrimination (hiring data, criminal justice records) perpetuate it
  • Measurement bias — different quality sensors or annotation standards across demographic groups
  • Label bias — annotators bring their own cultural assumptions; use multiple annotators and measure inter-annotator agreement (Cohen's kappa)
  • Representation bias — Common Crawl-derived datasets over-represent English and Western perspectives; under-represent Global South populations
  • Provenance documentation — use Datasheets for Datasets (Gebru et al.) or Data Cards to record collection methodology, known limitations, and intended uses
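
The inter-annotator agreement measure mentioned above, Cohen's kappa, corrects observed agreement for agreement expected by chance, and fits in a few lines:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), observed vs chance agreement."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]   # annotator 1 (toy labels)
b = ["pos", "neg", "neg", "neg", "pos", "neg"]   # annotator 2
print(f"kappa = {cohens_kappa(a, b):.2f}")
# -> kappa = 0.67
```

A common reading: kappa above roughly 0.8 indicates strong agreement, 0.6 to 0.8 substantial, and anything lower suggests the annotation guidelines need tightening before labels are trusted.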