🗂 Data Sources
Every machine learning project begins with a fundamental question: where does the data come from? The source shapes everything downstream — quality, bias, licensing constraints, and ultimately model performance. Understanding the trade-offs between different data sourcing strategies is a core practitioner skill.
| Source Type | Examples | Pros | Cons |
|---|---|---|---|
| Public Datasets | Kaggle, HuggingFace Hub, UCI ML Repository, data.gov, World Bank Open Data, ImageNet, Common Crawl | Free, pre-labelled, community-validated, benchmarks available | May be outdated, over-used (benchmark saturation), not domain-specific, licensing varies |
| Web Scraping | News articles, product reviews, social media posts, forum threads, job listings | Massive scale, real-world distribution, up-to-date | Legal risks, noisy, requires heavy cleaning, robots.txt/ToS constraints, rate limits |
| APIs | Twitter/X API, Reddit API, OpenWeatherMap, financial data APIs (Alpha Vantage), Google Maps | Structured, clean, usually real-time, terms of service clear | Rate limits, cost at scale, API changes can break pipelines, data access may be restricted |
| Internal Databases | CRM data, server logs, ERP systems, application telemetry, transaction records | Domain-specific, proprietary advantage, often labelled by business process | Data silos, schema inconsistencies, PII concerns, requires data engineering |
| Synthetic Generation | GANs, diffusion models, LLM-generated text, SMOTE, simulation engines | Privacy-preserving, controllable distribution, unlimited volume | Distribution shift from real data, mode collapse risk, requires validation |
| Data Purchase / Licensing | Bloomberg data, Refinitiv, medical record datasets, commercial annotation services | High quality, expert-labelled, legal clarity | Expensive, locked-in, may not cover edge cases, resale restrictions |
Public Dataset Repositories
- Kaggle — competitions + community datasets; wide variety of domains; CSV-heavy
- HuggingFace Hub — the go-to for NLP; models, datasets, and spaces in one place
- UCI ML Repository — classic benchmark datasets; tabular data focus; well-cited
- Google Dataset Search — meta-search across thousands of published datasets
- data.gov / EU Open Data — government datasets; geospatial, demographic, regulatory
- OpenML — benchmark suites with reproducible experiments and metadata
Domain-Specific Sources
- Computer Vision — COCO, Open Images, LAION-5B, CelebA, CIFAR-10/100
- NLP / Text — Common Crawl, The Pile, Wikipedia dumps, BookCorpus
- Audio / Speech — LibriSpeech, Mozilla Common Voice, VoxCeleb
- Cybersecurity — CICIDS, KDD Cup 99, NSL-KDD, UNSW-NB15 for intrusion detection
- Medical — MIMIC-III, NIH Chest X-rays, PhysioNet (strict access controls)
- Finance — Yahoo Finance, Quandl, SEC EDGAR filings
📊 Data Volume Planning
One of the most common questions in applied ML is "how much data do I need?" The honest answer depends on model complexity, task difficulty, and the quality of your labels. Here are evidence-based heuristics to guide collection efforts before you commit significant resources.
Quality >> Quantity
A carefully curated dataset of 10,000 clean, correctly labelled examples will often outperform a noisy dataset of 1,000,000 examples. Before scaling collection, maximise label quality, resolve ambiguous examples, and remove near-duplicates.
Classical ML Rules of Thumb
- Aim for at least 10× as many training examples as features to reduce the risk of overfitting
- For classification: aim for at least 100–1000 examples per class, more for complex boundaries
- Learning curves are your best guide — plot validation performance vs training set size to see if more data still helps
- Diminishing returns typically set in — once past the steep initial part of the learning curve, doubling the data often yields only modest gains
- Use cross-validation on small datasets to maximise use of available data
- Feature engineering quality can substitute for raw data volume in shallow models
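The learning-curve heuristic above can be sketched as a simple decision rule. It assumes validation score grows roughly linearly in the log of the training-set size (a common empirical pattern); the function name and the gain threshold are illustrative, not standard:

```python
import math

def more_data_likely_helps(sizes, scores, min_gain_per_doubling=0.005):
    """Extrapolate the last segment of a learning curve: is doubling the
    training set projected to gain at least min_gain_per_doubling?"""
    (s0, v0), (s1, v1) = (sizes[-2], scores[-2]), (sizes[-1], scores[-1])
    slope = (v1 - v0) / (math.log2(s1) - math.log2(s0))  # gain per doubling
    return slope >= min_gain_per_doubling
```

For example, validation accuracies of 0.80 / 0.85 / 0.90 at 1k / 2k / 4k examples still project a healthy gain per doubling, while a flat tail like 0.888 / 0.890 / 0.891 suggests effort is better spent on label quality than on volume.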
Deep Learning & LLM Scaling
- Chinchilla scaling laws (Hoffmann et al., 2022): for compute-optimal training, tokens ≈ 20× model parameters
- GPT-3 (175B params) would have been Chinchilla-optimal at ~3.5T tokens; it was actually trained on ~300B tokens
- For fine-tuning: even hundreds to low thousands of high-quality examples can work with LoRA/PEFT
- Transfer learning changes the equation — leverage pre-trained models to reduce data requirements by orders of magnitude
- Data diversity often matters more than raw count for generalisation
- Consider active learning to label only the most informative examples
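The Chinchilla rule of thumb above is simple arithmetic; a toy helper (the function name is ours) makes the GPT-3 comparison concrete:

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal token budget ≈ 20 × parameter count (Hoffmann et al., 2022)."""
    return 20 * n_params

# GPT-3 scale: 175e9 parameters → 3.5e12 tokens would have been compute-optimal,
# more than 10× the ~300B tokens it was actually trained on.
budget = chinchilla_optimal_tokens(175e9)
```

Chinchilla itself (70B params, ~1.4T tokens) sits almost exactly on this line, which is where the heuristic comes from.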
| Task Type | Minimum Viable | Production Quality | Notes |
|---|---|---|---|
| Binary classification (tabular) | 500–2,000 per class | 10,000+ per class | Depends heavily on feature count and separability |
| Multi-class classification (10 classes) | 200–500 per class | 5,000+ per class | Use stratified sampling |
| Image classification (CNN fine-tune) | 100–500 per class | 2,000–10,000 per class | With ImageNet pre-training |
| LLM instruction fine-tuning | 500–2,000 QA pairs | 50,000–500,000 pairs | Quality of instructions critical; RLHF adds more |
| Named entity recognition | 1,000 sentences | 20,000+ sentences | Token-level annotation; consider distant supervision |
🕸 Web Scraping & APIs
Web scraping enables collection of data at a scale and recency that no public dataset can match, but it comes with technical, legal, and ethical complexity. Understanding how to scrape responsibly is as important as knowing how to scrape effectively.
Responsible Scraping Principles
- robots.txt — always check example.com/robots.txt; respect Disallow directives even when they are technically bypassable
- Rate limiting — add delays (1–5 seconds) between requests; use exponential backoff on 429 responses
- User-Agent — identify your bot honestly; contact site owners for large-scale crawls
- Terms of Service — many ToS explicitly prohibit scraping; hiQ v. LinkedIn established some legal precedent but landscape is unsettled
- Cache aggressively — avoid re-fetching; store raw HTML before parsing to avoid re-crawls
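The exponential-backoff advice in the rate-limiting bullet can be sketched as a small retry wrapper. This is a toy: the fetch callable returning a bare status code is our simplifying convention, and real code would inspect a response object and honour any Retry-After header:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0):
    """Retry fetch(url) on 429, doubling the delay on each attempt (plus jitter)."""
    for attempt in range(max_tries):
        status = fetch(url)
        if status != 429:
            return status
        # exponential backoff with multiplicative jitter to avoid thundering herds
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```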
Deduplication & Pagination
- URL normalisation — strip tracking parameters, normalise trailing slashes, resolve redirects before storing
- Content hashing — MD5/SHA256 hash of page body to detect exact duplicates across different URLs
- Near-deduplication — MinHash or SimHash for near-duplicate detection at scale
- Pagination patterns — ?page=N, offset/limit params, cursor-based (next_token), infinite scroll (XHR interception)
- Sitemap.xml — use sitemaps as a structured URL frontier for comprehensive crawls
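The MinHash idea from the near-deduplication bullet can be sketched with the standard library alone — a toy version for intuition (real pipelines would use a tuned implementation such as the datasketch library; the function names here are ours):

```python
import hashlib

def shingles(text, k=3):
    """Overlapping word k-grams — the unit of comparison for near-duplicates."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two pages differing by a few words share most shingles, so their signatures agree in most slots; a similarity threshold then flags near-duplicate pairs without expensive pairwise set comparisons.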
Minimal Python Scraper Outline
```python
import hashlib
import time
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

# ---- Check robots.txt ----
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

BASE_URL = "https://example.com/articles"
DELAY = 2     # seconds between requests
seen = set()  # content hashes, for exact-duplicate detection

def scrape_page(url):
    if not rp.can_fetch("*", url):
        print(f"Blocked by robots.txt: {url}")
        return None
    resp = requests.get(url, headers={"User-Agent": "MyResearchBot/1.0"}, timeout=10)
    resp.raise_for_status()
    content_hash = hashlib.md5(resp.content).hexdigest()
    if content_hash in seen:
        return None  # exact duplicate of a page we already stored
    seen.add(content_hash)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h1")
    article = soup.find("article")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "body": article.get_text(separator=" ", strip=True) if article else "",
    }

# ---- Paginated crawl ----
results = []
for page_num in range(1, 20):
    data = scrape_page(f"{BASE_URL}?page={page_num}")
    if data:
        results.append(data)
    time.sleep(DELAY)

# For large-scale scraping, use Scrapy + scrapy-redis for distributed crawling:
# pip install scrapy scrapy-redis
```
Scrapy for Production-Scale Crawls
For anything beyond a few hundred pages, use Scrapy — it provides built-in rate limiting (AutoThrottle), retry logic, proxy-rotation middleware, item pipelines for cleaning, and scrapy-redis for distributed crawling across multiple nodes. The scrapy crawl command handles concurrency, politeness, and logging automatically.
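A sketch of what the politeness knobs look like in a Scrapy project's settings.py — the values are illustrative starting points rather than universal recommendations, and the bot-info URL is a placeholder:

```python
# settings.py — illustrative politeness configuration for a Scrapy project
USER_AGENT = "MyResearchBot/1.0 (+https://example.com/bot-info)"  # identify honestly
ROBOTSTXT_OBEY = True                # respect robots.txt Disallow rules
DOWNLOAD_DELAY = 2                   # base delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # keep per-site load low
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 30
RETRY_ENABLED = True                 # retry transient failures (429/5xx)
RETRY_TIMES = 3
HTTPCACHE_ENABLED = True             # cache raw responses to avoid re-fetching
```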
🧪 Synthetic Data Generation
When real data is scarce, sensitive (medical records, financial fraud), expensive to label, or contains privacy constraints, synthetic data bridges the gap. The key question is always whether the synthetic distribution is close enough to the real one to be useful for training.
| Method | Use Case | Quality | Risk / Caveats |
|---|---|---|---|
| SMOTE | Tabular minority class oversampling for imbalanced datasets | Good for linear boundaries; weak for high-dimensional data | Can create unrealistic interpolations; noisy boundary samples |
| GANs (DCGAN, CTGAN) | Image synthesis; tabular data (CTGAN for mixed types) | High visual quality; CTGAN handles categorical columns | Training instability; mode collapse; long training times |
| Diffusion Models | High-quality image & audio synthesis; increasingly text | State-of-the-art image fidelity (DALL-E, Stable Diffusion) | Computationally expensive; copyright concerns on training data |
| LLM-Generated Text | Instruction-tuning pairs, question generation, data augmentation | High semantic quality; can target specific formats and styles | Model collapse if used to train the generating model; hallucinations |
| Simulation / Rule-Based | Robotics, games, autonomous driving (CARLA), network traffic | Perfect ground truth; unlimited volume; controllable conditions | Sim-to-real gap; unrealistic sensor noise; brittle assumptions |
| Differential Privacy Synthesis | Privacy-preserving release of sensitive tabular data | Moderate — privacy-utility trade-off | Added noise reduces statistical fidelity; complex to tune epsilon |
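SMOTE's core interpolation step is simple enough to sketch in a few lines — a toy version for intuition (production work would use imbalanced-learn's SMOTE; the function name and defaults here are ours):

```python
import math
import random

def smote_sample(minority, k=2, n_new=10, seed=0):
    """SMOTE-style oversampling: synthesise new points by interpolating between
    a minority-class sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p != base),
            key=lambda p: math.dist(base, p),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        new_points.append(tuple(b + t * (q - b) for b, q in zip(base, nb)))
    return new_points
```

Because every synthetic point lies on a segment between two real minority points, the "unrealistic interpolations" caveat from the table follows directly: if the minority class is non-convex, some of those segments cross regions where no real example would ever occur.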
When Synthetic Data Shines
- Rare events that never appear in real data (edge cases in autonomous driving)
- Privacy-sensitive domains where real data can't be shared (medical, financial)
- Counterfactual generation — "what if" scenarios for fairness testing
- Bootstrapping a model before real data collection starts
- Data augmentation to teach invariances (see the augmentation page in this series)
Validating Synthetic Data
- Train-on-synthetic, test-on-real (TSTR) — the gold standard evaluation
- Statistical similarity — compare marginal distributions, correlations, and joint distributions
- Privacy audit — membership inference attacks to check if real records are memorised
- Domain expert review — human evaluation for plausibility
- FID (Fréchet Inception Distance) for image quality
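Marginal-distribution comparison can start as simply as a two-sample Kolmogorov–Smirnov statistic per numeric column. A standard-library sketch (scipy.stats.ks_2samp does this properly, with p-values):

```python
import bisect

def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    sr, ss = sorted(real), sorted(synth)
    n, m = len(sr), len(ss)
    gap = 0.0
    for x in sr + ss:  # ECDF steps only occur at sample points
        f_real = bisect.bisect_right(sr, x) / n
        f_synth = bisect.bisect_right(ss, x) / m
        gap = max(gap, abs(f_real - f_synth))
    return gap
```

A statistic near 0 means the synthetic marginal tracks the real one; near 1 means the distributions barely overlap. Marginals alone are not sufficient — correlations and joint structure still need separate checks — but they catch gross failures cheaply.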
⚖ Data Licensing & Ethics
The legal and ethical landscape around training data has shifted dramatically. High-profile lawsuits (Getty Images v. Stability AI, The New York Times v. OpenAI) and incoming EU AI Act requirements mean practitioners must understand licensing before they collect, not after.
Common Open Data Licences
- CC0 (Public Domain) — no restrictions; ideal for training data
- CC BY — attribution required; commercial use allowed; fine for most ML
- CC BY-SA — ShareAlike: derived works must use same licence; problematic for proprietary models
- CC BY-NC — non-commercial only; cannot use for commercial product training
- ODbL (Open Database Licence) — used by OpenStreetMap; share-alike for the database
- MIT / Apache 2.0 / BSD — common for code datasets; usually permissive
GDPR & Data Collection Rules
- Lawful basis — need consent, legitimate interest, or another Article 6 basis to collect personal data
- Purpose limitation — data collected for one purpose can't be repurposed for unrelated ML training without new consent
- Data minimisation — collect only what's needed; anonymise where possible
- Right to erasure — "machine unlearning" is an open research problem; design systems to handle deletion requests
- DPA registration — notify your Data Protection Authority if processing at scale
Training on Scraped Web Data — Legal Frontier
Multiple ongoing lawsuits challenge whether scraping publicly accessible websites for ML training constitutes copyright infringement. The legal landscape differs by jurisdiction: the US has a fair use doctrine that may protect some training uses; the EU is more restrictive. Best practices include: using datasets with explicit training permissions (like LAION with watermarked provenance), maintaining data cards documenting sources, and consulting legal counsel for commercial projects. The EU AI Act will require disclosure of training data for general-purpose AI models.
Bias in Data Sourcing
Bias enters ML systems long before modelling begins. Common sourcing biases include:
- Selection bias — datasets reflect who had access to the data-generating system (e.g., medical datasets skew toward patients who sought care)
- Historical bias — datasets encoding past discrimination (hiring data, criminal justice records) perpetuate it
- Measurement bias — different quality sensors or annotation standards across demographic groups
- Label bias — annotators bring their own cultural assumptions; use multiple annotators and measure inter-annotator agreement (Cohen's kappa)
- Representation bias — Common Crawl-derived datasets over-represent English and Western perspectives and under-represent Global South populations
- Provenance documentation — use Datasheets for Datasets (Gebru et al.) or Data Cards to record collection methodology, known limitations, and intended uses
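The inter-annotator agreement measure from the label-bias bullet is easy to compute directly. A minimal Cohen's kappa for two annotators (the helper name is ours; libraries like scikit-learn provide cohen_kappa_score):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators, corrected for
    the agreement expected by chance given each annotator's label frequencies."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)
```

Kappa is 1.0 for perfect agreement and around 0 when annotators agree no more often than chance; persistently low kappa on particular slices of the data is itself a signal of label bias worth recording in the dataset's documentation.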