How data is split and why it matters for reliable machine learning
Every supervised machine learning project requires a disciplined separation of data into three distinct roles. Conflating them is one of the most common and costly mistakes in applied ML — it produces models that appear to perform well but fail catastrophically in production.
**Training set.** The data the model actually learns from. The optimization algorithm (e.g., gradient descent) sees these examples repeatedly, adjusting the model's weights to minimize training loss. It's the "study material."
Typically 60–80% of your total dataset. More is generally better, but returns diminish. The model should never be evaluated on training data as a measure of real performance.
**Validation set.** Used during development to tune hyperparameters and compare model architectures. Each time you make a decision based on validation performance (learning rate, number of layers, regularization strength), you are implicitly fitting to that signal.
Typically 10–20%. Because validation is used repeatedly to guide decisions, it leaks some information — hence the need for a separate, untouched test set.
**Test set.** The honest, final evaluation. It must be used only once — after all development decisions are finalized. It simulates the real-world data the model will encounter after deployment.
Typically 10–20%. If you evaluate on the test set multiple times and make changes, it becomes a de facto validation set and you lose your unbiased estimate of generalization performance.
Without a validation set, you have no way to tune hyperparameters without contaminating the test set. Without a test set, your final accuracy estimate is optimistic — it reflects how well you've tuned to the validation set, not true generalization. The three-way split is the minimum responsible practice for any model that will be deployed.
The method you use to split data is as important as the ratio. A naive random split is appropriate for some tasks but actively harmful for others. Choosing the wrong split can cause you to dramatically overestimate model performance.
**Random split.** Randomly shuffle all examples and allocate them by ratio. Works well when data is independent and identically distributed (i.i.d.) — i.e., there is no meaningful ordering or group structure.
```
# Pseudocode: random split
shuffle(dataset, seed=42)
train = dataset[0:70%]
val   = dataset[70%:85%]
test  = dataset[85%:100%]
```
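The pseudocode above can be made concrete in plain Python (the function name, ratios, and seed here are illustrative, not a specific library API):

```python
import random

def random_split(dataset, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle a copy of the dataset, then partition it by ratio."""
    items = list(dataset)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = random_split(range(100))
```

Fixing the seed matters: without it, every rerun produces a different split and results are not comparable across experiments.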
**Stratified split.** Preserves the class distribution across splits. Essential for classification tasks with imbalanced classes — without stratification, a small class might appear only in train or only in test.
```
# Pseudocode: stratified split
for each class_label in unique_labels:
    subset = filter(dataset, class=class_label)
    shuffle(subset, seed=42)
    train += subset[0:70%]
    val   += subset[70%:85%]
    test  += subset[85%:100%]
```
**Time-series split.** For temporal data, you must respect time order. Training on future data to predict the past is a form of data leakage. The test set must always be the most recent time window.
```
# Pseudocode: time-series split
sort(dataset, by="timestamp")
train = dataset[0:70%]     # oldest
val   = dataset[70%:85%]
test  = dataset[85%:100%]  # newest

# Walk-forward validation for robust
# time-series model selection:
for window in sliding_windows(dataset):
    train_fold, val_fold = window
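The chronological split can be sketched as follows (the `timestamp` field and ratios are illustrative assumptions about the record format):

```python
def time_series_split(records, ratios=(0.70, 0.15, 0.15),
                      key=lambda r: r["timestamp"]):
    """Chronological split: the test set is always the most recent window."""
    ordered = sorted(records, key=key)
    n = len(ordered)
    a = int(ratios[0] * n)
    b = int((ratios[0] + ratios[1]) * n)
    return ordered[:a], ordered[a:b], ordered[b:]

events = [{"timestamp": t, "value": t * 2} for t in range(100)]
train, val, test = time_series_split(events)
```

Note there is no shuffling anywhere: shuffling temporal data before splitting is exactly the leakage this section warns against.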
**Group split.** When samples are not independent — e.g., multiple scans from the same patient, or multiple transactions from the same user — entire groups must stay together in one split. Splitting a group across train and test leaks within-group correlations.
```
# Pseudocode: group split
groups = unique(dataset["patient_id"])
shuffle(groups, seed=42)
train_groups = groups[0:70%]
val_groups   = groups[70%:85%]
test_groups  = groups[85%:100%]
train = filter(dataset, id in train_groups)
val   = filter(dataset, id in val_groups)
test  = filter(dataset, id in test_groups)
```
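A runnable sketch of this pattern — split the *group identifiers*, then route each record to its group's split (the `patient_id` field and ratios are illustrative):

```python
import random

def group_split(records, group_key, ratios=(0.70, 0.15, 0.15), seed=42):
    """Assign whole groups to splits so no group straddles train/val/test."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    a = int(ratios[0] * len(groups))
    b = int((ratios[0] + ratios[1]) * len(groups))
    split_of = {g: (0 if i < a else 1 if i < b else 2)
                for i, g in enumerate(groups)}
    splits = ([], [], [])
    for r in records:
        splits[split_of[group_key(r)]].append(r)
    return splits

# 20 patients, 3 scans each — all of a patient's scans stay together
scans = [{"patient_id": p, "scan": s} for p in range(20) for s in range(3)]
train, val, test = group_split(scans, lambda r: r["patient_id"])
```

Note the ratios apply to the number of *groups*, not records; if group sizes vary widely, the record-level proportions will drift from the target.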
| Scenario | Train | Validation | Test | Notes |
|---|---|---|---|---|
| General purpose | 70% | 15% | 15% | Safe default for medium datasets |
| Large dataset (>1M samples) | 98% | 1% | 1% | Even 1% of 1M = 10,000 evaluation examples |
| Small dataset (<1,000 samples) | 60% | 20% | 20% | Consider k-fold CV instead |
| Deep learning | 80% | 10% | 10% | More data for training is critical |
| Benchmark competitions | 80% | 10% | 10% (hidden) | Public LB vs. private LB split |
Cross-validation maximizes the use of available data by cycling through multiple train/validation splits. Instead of holding out a fixed validation set, the model is trained and evaluated k times, each time using a different portion of the data as the validation fold. The results are averaged for a more reliable performance estimate.
Cross-validation is especially valuable with small datasets where any fixed split would leave too few examples in either train or validation to be reliable.
**k-fold cross-validation.** The dataset is divided into k equal-sized folds. The model is trained k times: each iteration uses k−1 folds for training and 1 fold for validation. Final performance = mean across all k runs.
```
# Pseudocode: 5-fold CV
folds = split(dataset, k=5)
scores = []
for i in range(5):
    val_fold   = folds[i]
    train_fold = all folds except folds[i]
    model.fit(train_fold)
    scores.append(model.evaluate(val_fold))
mean_score = mean(scores)
std_score  = std(scores)
```
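The fold bookkeeping itself is easy to get wrong; here is a minimal generator that yields index sets for each fold, handling the remainder when n is not divisible by k (function name is illustrative):

```python
import random

def k_fold_indices(n, k=5, seed=42):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size, remainder = divmod(n, k)
    start = 0
    for i in range(k):
        # First `remainder` folds absorb one extra example each
        end = start + fold_size + (1 if i < remainder else 0)
        val = idx[start:end]
        train = idx[:start] + idx[end:]
        yield train, val
        start = end

folds = list(k_fold_indices(20, k=5))
```

Each index appears in exactly one validation fold, so every example is validated on exactly once across the k runs.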
k=5 and k=10 are standard choices. Higher k = less bias but more computation. Always fix the random seed for reproducibility.
**Stratified k-fold.** Each fold preserves the same class ratio as the full dataset. Essential for imbalanced classification: if your dataset is 5% positive class, each fold should also be 5% positive.
**Leave-one-out (LOOCV).** The extreme case of k-fold where k = n (the number of samples). Every single example serves as the validation set exactly once. Computationally very expensive, but it makes maximal use of the data — useful only for very small datasets (n < 100).
**Time-series cross-validation.** Walk-forward validation: each fold's training data is a growing prefix of the time series, and validation is always the next time window. Never use future data to train.
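Walk-forward folds can be generated with simple index arithmetic — a sketch assuming equal-sized validation windows (the function name and fold count are illustrative):

```python
def walk_forward_splits(series_len, n_folds=4):
    """Yield (train_idx, val_idx): each fold trains on a growing prefix and
    validates on the window that immediately follows it in time."""
    window = series_len // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * window))           # growing prefix
        val_idx = list(range(i * window, (i + 1) * window))  # next window
        yield train_idx, val_idx

splits = list(walk_forward_splits(100, n_folds=4))
```

Unlike plain k-fold, the folds here are not interchangeable: later folds have strictly more training data, which mimics how the model would actually be retrained over time in production.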
| Method | Best For | Computational Cost | Bias | Variance |
|---|---|---|---|---|
| k-Fold (k=5) | General purpose | 5x training cost | Low | Moderate |
| k-Fold (k=10) | More reliable estimate | 10x training cost | Very low | Low |
| Stratified k-Fold | Imbalanced classification | Same as k-Fold | Low | Low |
| Leave-One-Out | Tiny datasets (<100) | n x training cost | Minimal | High |
| Time-Series CV | Sequential / temporal data | Moderate | Low | Moderate |
Data leakage occurs when information from outside the training set is inadvertently used to build the model. It causes artificially inflated performance metrics during development, followed by dramatic failure in production. Many published ML benchmarks have been invalidated due to discovered leakage.
The golden rule: treat the test set as if it doesn't exist until you are completely done with all modeling decisions.
```
# WRONG: leaky pipeline
scaler.fit(all_data)                 # learns statistics from the test set!
X_scaled = scaler.transform(all_data)
X_train, X_test = split(X_scaled)

# CORRECT: leak-free pipeline
X_train, X_test = split(raw_data)
scaler.fit(X_train)                  # fit ONLY on training data
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)   # apply the same transform
```
In sklearn, use Pipeline objects to chain preprocessing and modeling — they automatically prevent leakage by fitting preprocessing steps only on training folds during cross-validation.
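A minimal end-to-end example of that pattern, assuming scikit-learn is installed (the dataset here is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# Scaling and classification chained into one estimator
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# cross_val_score refits the whole pipeline on each fold, so the scaler
# never sees that fold's validation data during fitting — no leakage.
scores = cross_val_score(pipe, X, y, cv=5)
```

Contrast this with calling `scaler.fit(X)` once up front: the scaler would then carry mean/variance statistics from every fold's validation data into every training run.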
A pervasive myth in ML is that "more data always wins." While scale matters enormously for deep learning, the relationship between dataset size and model performance is nuanced — and data quality often matters more than raw quantity.
A learning curve plots model performance as a function of training set size. It reveals critical information about your model and data regime:

- If training and validation scores converge at a low level, the model is underfitting (high bias) — more data alone will not help; you need a more expressive model or better features.
- If the training score is high but the validation score lags well behind it, the model is overfitting (high variance) — more data, or stronger regularization, is likely to help.
- If the validation curve has already plateaued, additional data offers diminishing returns.
Adding more low-quality data can actively hurt model performance. Noisy labels, selection bias, and distribution mismatch are harder to overcome with scale alone.
- Classical ML (trees, SVMs, linear models): hundreds to tens of thousands of labeled examples per class are typically sufficient.
- Deep learning on images or text: millions of examples help, though transfer learning can achieve strong results with thousands.
- Fine-tuning pre-trained language models: even hundreds of high-quality labeled examples can be effective.

Always plot learning curves to understand whether collecting more data will actually improve your model.
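With scikit-learn, plotting a learning curve takes a few lines — a sketch on a synthetic dataset (the model and sizes are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)

# Train on 10%..100% of the available training data, 5-fold CV at each size
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A persistent gap between the mean curves suggests more data may help;
# two converged low curves suggest the model itself is too simple.
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```

Feed the two mean curves into any plotting library; the shapes, not the absolute numbers, carry the diagnosis.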