How data is split and why it matters for reliable machine learning
Every supervised machine learning project requires a disciplined separation of data into three distinct roles. Conflating them is one of the most common and costly mistakes in applied ML — it produces models that appear to perform well but fail catastrophically in production.
**Training set.** The data the model actually learns from. The optimization algorithm (e.g., gradient descent) sees these examples repeatedly, adjusting the model's weights to minimize training loss. It's the "study material."
Typically 60–80% of your total dataset. More is generally better, but returns diminish. The model should never be evaluated on training data as a measure of real performance.
**Validation set.** Used during development to tune hyperparameters and compare model architectures. Each time you make a decision based on validation performance (learning rate, number of layers, regularization strength), you are implicitly fitting to that signal.
Typically 10–20%. Because validation is used repeatedly to guide decisions, it leaks some information — hence the need for a separate, untouched test set.
**Test set.** The honest, final evaluation. It must be used only once — after all development decisions are finalized. It simulates the real-world data the model will encounter after deployment.
Typically 10–20%. If you evaluate on the test set multiple times and make changes, it becomes a de facto validation set and you lose your unbiased estimate of generalization performance.
Without a validation set, you have no way to tune hyperparameters without contaminating the test set. Without a test set, your final accuracy estimate is optimistic — it reflects how well you've tuned to the validation set, not true generalization. The three-way split is the minimum responsible practice for any model that will be deployed.
The method you use to split data is as important as the ratio. A naive random split is appropriate for some tasks but actively harmful for others. Choosing the wrong split can cause you to dramatically overestimate model performance.
**Random split.** Randomly shuffle all examples and allocate them by ratio. Works well when data is independent and identically distributed (i.i.d.) — i.e., there is no meaningful ordering or group structure.
```
# Pseudocode: random split
shuffle(dataset, seed=42)
train = dataset[0:70%]
val   = dataset[70%:85%]
test  = dataset[85%:100%]
```
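The pseudocode above can be made concrete in plain Python (the function name, ratios, and seed here are illustrative, not a specific library API):

```python
import random

def random_split(dataset, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle a copy of the dataset, then partition it by ratio."""
    items = list(dataset)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = random_split(range(100))
```

Fixing the seed matters: without it, every rerun produces a different split and results are not comparable across experiments.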
**Stratified split.** Preserves the class distribution across splits. Essential for classification tasks with imbalanced classes — without stratification, a small class might appear only in train or only in test.
```
# Pseudocode: stratified split
for each class_label in unique_labels:
    subset = filter(dataset, class=class_label)
    shuffle(subset, seed=42)
    train += subset[0:70%]
    val   += subset[70%:85%]
    test  += subset[85%:100%]
```
**Time-series split.** For temporal data, you must respect time order. Training on future data to predict the past is a form of data leakage. The test set must always be the most recent time window.
```
# Pseudocode: time-series split
sort(dataset, by="timestamp")
train = dataset[0:70%]     # oldest
val   = dataset[70%:85%]
test  = dataset[85%:100%]  # newest

# Walk-forward validation for robust
# time-series model selection:
for window in sliding_windows(dataset):
    train_fold, val_fold = window
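The chronological split can be sketched as follows (the `timestamp` field and ratios are illustrative assumptions about the record format):

```python
def time_series_split(records, ratios=(0.70, 0.15, 0.15),
                      key=lambda r: r["timestamp"]):
    """Chronological split: the test set is always the most recent window."""
    ordered = sorted(records, key=key)
    n = len(ordered)
    a = int(ratios[0] * n)
    b = int((ratios[0] + ratios[1]) * n)
    return ordered[:a], ordered[a:b], ordered[b:]

events = [{"timestamp": t, "value": t * 2} for t in range(100)]
train, val, test = time_series_split(events)
```

Note there is no shuffling anywhere: shuffling temporal data before splitting is exactly the leakage this section warns against.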
**Group split.** When samples are not independent — e.g., multiple scans from the same patient, or multiple transactions from the same user — entire groups must stay together in one split. Splitting a group across train and test leaks within-group correlations.
```
# Pseudocode: group split
groups = unique(dataset["patient_id"])
shuffle(groups, seed=42)
train_groups = groups[0:70%]
val_groups   = groups[70%:85%]
test_groups  = groups[85%:100%]
train = filter(dataset, id in train_groups)
val   = filter(dataset, id in val_groups)
test  = filter(dataset, id in test_groups)
```
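A runnable sketch of this pattern — split the *group identifiers*, then route each record to its group's split (the `patient_id` field and ratios are illustrative):

```python
import random

def group_split(records, group_key, ratios=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to splits so no group straddles train/val/test."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    a = int(ratios[0] * len(groups))
    b = int((ratios[0] + ratios[1]) * len(groups))
    split_of = {g: (0 if i < a else 1 if i < b else 2)
                for i, g in enumerate(groups)}
    splits = ([], [], [])
    for r in records:
        splits[split_of[group_key(r)]].append(r)
    return splits

# 20 patients, 3 scans each — all of a patient's scans stay together
scans = [{"patient_id": p, "scan": s} for p in range(20) for s in range(3)]
train, val, test = group_split(scans, lambda r: r["patient_id"])
```

Note the ratios apply to the number of *groups*, not records; if group sizes vary widely, the record-level proportions will drift from the target.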
| Scenario | Train | Validation | Test | Notes |
|---|---|---|---|---|
| General purpose | 70% | 15% | 15% | Safe default for medium datasets |
| Large dataset (>1M samples) | 98% | 1% | 1% | Even 1% of 1M = 10,000 evaluation examples |
| Small dataset (<1,000 samples) | 60% | 20% | 20% | Consider k-fold CV instead |
| Deep learning | 80% | 10% | 10% | More data for training is critical |
| Benchmark competitions | 80% | 10% | 10% (hidden) | Public LB vs. private LB split |
Cross-validation maximizes the use of available data by cycling through multiple train/validation splits. Instead of holding out a fixed validation set, the model is trained and evaluated k times, each time using a different portion of the data as the validation fold. The results are averaged for a more reliable performance estimate.
Cross-validation is especially valuable with small datasets where any fixed split would leave too few examples in either train or validation to be reliable.
**k-fold cross-validation.** The dataset is divided into k equal-sized folds. The model is trained k times: each iteration uses k−1 folds for training and 1 fold for validation. Final performance = mean across all k runs.
```
# Pseudocode: 5-fold CV
folds = split(dataset, k=5)
scores = []
for i in range(5):
    val_fold   = folds[i]
    train_fold = all folds except folds[i]
    model.fit(train_fold)
    scores.append(model.evaluate(val_fold))
mean_score = mean(scores)
std_score  = std(scores)
```
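The fold bookkeeping itself is easy to get wrong; here is a minimal generator that yields index sets for each fold, handling the remainder when n is not divisible by k (function name is illustrative):

```python
import random

def k_fold_indices(n, k=5, seed=42):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size, remainder = divmod(n, k)
    start = 0
    for i in range(k):
        # First `remainder` folds absorb one extra example each
        end = start + fold_size + (1 if i < remainder else 0)
        val = idx[start:end]
        train = idx[:start] + idx[end:]
        yield train, val
        start = end

folds = list(k_fold_indices(20, k=5))
```

Each index appears in exactly one validation fold, so every example is validated on exactly once across the k runs.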
k=5 and k=10 are standard choices. Higher k = less bias but more computation. Always fix the random seed for reproducibility.
**Stratified k-fold.** Each fold preserves the same class ratio as the full dataset. Essential for imbalanced classification: if your dataset is 5% positive class, each fold should also be 5% positive.
**Leave-one-out (LOOCV).** The extreme case of k-fold where k = n (the number of samples). Every single example serves as the validation set exactly once. Computationally very expensive, but it makes maximal use of the data — useful only for very small datasets (n < 100).
**Time-series cross-validation.** Walk-forward validation: each fold's training data is a growing prefix of the time series, and validation is always the next time window. Never use future data to train.
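Walk-forward folds can be generated with simple index arithmetic — a sketch assuming equal-sized validation windows (the function name and fold count are illustrative):

```python
def walk_forward_splits(series_len, n_folds=4):
    """Yield (train_idx, val_idx): each fold trains on a growing prefix and
    validates on the window that immediately follows it in time."""
    window = series_len // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * window))           # growing prefix
        val_idx = list(range(i * window, (i + 1) * window))  # next window
        yield train_idx, val_idx

splits = list(walk_forward_splits(100, n_folds=4))
```

Unlike plain k-fold, the folds here are not interchangeable: later folds have strictly more training data, which mimics how the model would actually be retrained over time in production.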
| Method | Best For | Computational Cost | Bias | Variance |
|---|---|---|---|---|
| k-Fold (k=5) | General purpose | 5x training cost | Low | Moderate |
| k-Fold (k=10) | More reliable estimate | 10x training cost | Very low | Low |
| Stratified k-Fold | Imbalanced classification | Same as k-Fold | Low | Low |
| Leave-One-Out | Tiny datasets (<100) | n x training cost | Minimal | High |
| Time-Series CV | Sequential / temporal data | Moderate | Low | Moderate |
Data leakage occurs when information from outside the training set is inadvertently used to build the model. It causes artificially inflated performance metrics during development, followed by dramatic failure in production. Many published ML benchmarks have been invalidated due to discovered leakage.
The golden rule: treat the test set as if it doesn't exist until you are completely done with all modeling decisions.
```
# WRONG: leaky pipeline
scaler.fit(all_data)                 # learns statistics from the test set!
X_scaled = scaler.transform(all_data)
X_train, X_test = split(X_scaled)

# CORRECT: leak-free pipeline
X_train, X_test = split(raw_data)
scaler.fit(X_train)                  # fit ONLY on training data
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)   # apply the same transform
```
In sklearn, use Pipeline objects to chain preprocessing and modeling — they automatically prevent leakage by fitting preprocessing steps only on training folds during cross-validation.
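A minimal end-to-end example of that pattern, assuming scikit-learn is installed (the dataset here is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# Scaling and classification chained into one estimator
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# cross_val_score refits the whole pipeline on each fold, so the scaler
# never sees that fold's validation data during fitting — no leakage.
scores = cross_val_score(pipe, X, y, cv=5)
```

Contrast this with calling `scaler.fit(X)` once up front: the scaler would then carry mean/variance statistics from every fold's validation data into every training run.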
A pervasive myth in ML is that "more data always wins." While scale matters enormously for deep learning, the relationship between dataset size and model performance is nuanced — and data quality often matters more than raw quantity.
A learning curve plots model performance as a function of training set size. It reveals critical information about your model and data regime:

- If training and validation scores converge at a low level, the model is underfitting (high bias) — more data alone will not help; you need a more expressive model or better features.
- If the training score is high but the validation score lags well behind it, the model is overfitting (high variance) — more data, or stronger regularization, is likely to help.
- If the validation curve has already plateaued, additional data offers diminishing returns.
Adding more low-quality data can actively hurt model performance. Noisy labels, selection bias, and distribution mismatch are harder to overcome with scale alone.
- Classical ML (trees, SVMs, linear models): hundreds to tens of thousands of labeled examples per class are typically sufficient.
- Deep learning on images or text: millions of examples help, though transfer learning can achieve strong results with thousands.
- Fine-tuning pre-trained language models: even hundreds of high-quality labeled examples can be effective.

Always plot learning curves to understand whether collecting more data will actually improve your model.
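With scikit-learn, plotting a learning curve takes a few lines — a sketch on a synthetic dataset (the model and sizes are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)

# Train on 10%..100% of the available training data, 5-fold CV at each size
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A persistent gap between the mean curves suggests more data may help;
# two converged low curves suggest the model itself is too simple.
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```

Feed the two mean curves into any plotting library; the shapes, not the absolute numbers, carry the diagnosis.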