Training, Validation & Test Sets

How data is split and why it matters for reliable machine learning

Data Splitting · Cross-Validation · Data Leakage · Best Practices
ML Fundamentals Series
⏱ 7 min read 📊 Beginner 🗓 Updated Jan 2025

The Three Dataset Roles

Every supervised machine learning project requires a disciplined separation of data into three distinct roles. Conflating them is one of the most common and costly mistakes in applied ML — it produces models that appear to perform well but fail catastrophically in production.

[Figure: a typical split, Training 70% / Validation 15% / Test 15%]

Training Set

The data the model actually learns from. The optimization algorithm (e.g., gradient descent) sees these examples repeatedly, adjusting model weights to minimize the training loss. It's the "study material."

Typically 60–80% of your total dataset. More is generally better, but returns diminish. The model should never be evaluated on training data as a measure of real performance.

Validation Set

Used during development to tune hyperparameters and compare model architectures. Each time you make a decision based on validation performance (learning rate, number of layers, regularization strength), you are implicitly fitting your model to the validation set.

Typically 10–20%. Because validation is used repeatedly to guide decisions, it leaks some information — hence the need for a separate, untouched test set.

Test Set

The honest, final evaluation. It must be used only once — after all development decisions are finalized. It simulates the real-world data the model will encounter after deployment.

Typically 10–20%. If you evaluate on the test set multiple times and make changes, it becomes a de facto validation set and you lose your unbiased estimate of generalization performance.

Why All Three Are Necessary

Without a validation set, you have no way to tune hyperparameters without contaminating the test set. Without a test set, your final accuracy estimate is optimistic — it reflects how well you've tuned to the validation set, not true generalization. The three-way split is the minimum responsible practice for any model that will be deployed.

Splitting Strategies

The method you use to split data is as important as the ratio. A naive random split is appropriate for some tasks but actively harmful for others. Choosing the wrong split can cause you to dramatically overestimate model performance.

Random Split

Randomly shuffle all examples and allocate them by ratio. Works well when data is independent and identically distributed (i.i.d.) — i.e., there's no meaningful ordering or group structure.

# Pseudocode: random split
shuffle(dataset, seed=42)
train = dataset[0:70%]
val   = dataset[70%:85%]
test  = dataset[85%:100%]
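
The pseudocode above can be written out in plain Python. `random_split` below is an illustrative helper, not a library function; it uses integer percentages so the slice sizes are exact:

```python
import random

def random_split(dataset, pct=(70, 15, 15), seed=42):
    """Shuffle once with a fixed seed, then slice by percentage."""
    data = list(dataset)
    rng = random.Random(seed)        # fixed seed => reproducible split
    rng.shuffle(data)
    n = len(data)
    n_train = n * pct[0] // 100      # integer math: exact slice sizes
    n_val = n * pct[1] // 100
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = random_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

In practice you would reach for `train_test_split` from sklearn, but the logic is exactly this: one seeded shuffle, then contiguous slices.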

Stratified Split

Preserves the class distribution across splits. Essential for classification tasks with imbalanced classes — without stratification, a small class might appear only in train or only in test.

# Pseudocode: stratified split
for each class_label in unique_labels:
  subset = filter(dataset, class=class_label)
  shuffle(subset, seed=42)
  train += subset[0:70%]
  val   += subset[70%:85%]
  test  += subset[85%:100%]
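
As a concrete sketch of the loop above, the illustrative helper `stratified_split` (names are my own, not a library API) splits each class separately and recombines:

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, pct=(70, 15, 15), seed=42):
    """Split each class separately so every split keeps roughly the
    same class proportions as the full dataset."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for subset in by_class.values():
        rng.shuffle(subset)
        n = len(subset)
        n_train = n * pct[0] // 100
        n_val = n * pct[1] // 100
        train += subset[:n_train]
        val += subset[n_train:n_train + n_val]
        test += subset[n_train + n_val:]
    return train, val, test

# 90/10 imbalance: the minority class still reaches every split
X, y = list(range(100)), ["a"] * 90 + ["b"] * 10
tr, va, te = stratified_split(X, y)
print(len(tr), len(va), len(te))  # 70 14 16
```

With a plain random split, the 10 minority examples could easily land entirely in the training slice; here each split is guaranteed its share.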

Time-Series Split

For temporal data, you must respect time order. Training on future data to predict the past is a form of data leakage. The test set must always be the most recent time window.

# Pseudocode: time-series split
sort(dataset, by="timestamp")
train = dataset[0:70%]   # oldest
val   = dataset[70%:85%]
test  = dataset[85%:100%] # newest

# Walk-forward validation for robust
# time-series model selection:
for train_fold, val_fold in sliding_windows(dataset):
  model.fit(train_fold)
  score = model.evaluate(val_fold)
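
A minimal pure-Python sketch of walk-forward index generation (`walk_forward_splits` is an illustrative name; sklearn's `TimeSeriesSplit` does the same job):

```python
def walk_forward_splits(n, n_folds=4):
    """Yield (train_indices, val_indices) pairs over a time-ordered
    dataset of length n: training is a growing prefix, and validation
    is always the window immediately after it."""
    window = n // (n_folds + 1)
    for i in range(1, n_folds + 1):
        yield (list(range(0, i * window)),
               list(range(i * window, (i + 1) * window)))

for train_idx, val_idx in walk_forward_splits(100, n_folds=4):
    # training grows 20, 40, 60, 80; validation window stays 20
    print(len(train_idx), "->", len(val_idx))
```

Note that every training index is strictly earlier than every validation index, so the model never sees the future.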

Group Split

When samples are not independent — e.g., multiple scans from the same patient, or multiple transactions from the same user — entire groups must stay together in one split. Splitting groups across train/test leaks correlations.

# Pseudocode: group split
groups = unique(dataset["patient_id"])
shuffle(groups, seed=42)
train_groups = groups[0:70%]
val_groups   = groups[70%:85%]
test_groups  = groups[85%:100%]

train = filter(dataset, id in train_groups)
val   = filter(dataset, id in val_groups)
test  = filter(dataset, id in test_groups)
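
The same idea as runnable Python; `group_split` is an illustrative helper (sklearn's `GroupShuffleSplit`/`GroupKFold` are the production equivalents), and the `patient_id` field follows the pseudocode:

```python
import random

def group_split(records, group_key, pct=(70, 15, 15), seed=42):
    """Shuffle GROUPS (not rows) and assign each whole group to one split."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    n = len(groups)
    n_train = n * pct[0] // 100
    n_val = n * pct[1] // 100
    train_g = set(groups[:n_train])
    val_g = set(groups[n_train:n_train + n_val])
    train = [r for r in records if group_key(r) in train_g]
    val = [r for r in records if group_key(r) in val_g]
    test = [r for r in records if group_key(r) not in train_g | val_g]
    return train, val, test

# 10 patients x 3 scans: scans from one patient never straddle splits
records = [{"patient_id": p, "scan": s} for p in range(10) for s in range(3)]
tr, va, te = group_split(records, group_key=lambda r: r["patient_id"])
```

Note that the resulting row counts need not match the target ratios exactly, because whole groups move together.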

Common Split Ratios

| Scenario                       | Train | Validation | Test         | Notes                                      |
|--------------------------------|-------|------------|--------------|--------------------------------------------|
| General purpose                | 70%   | 15%        | 15%          | Safe default for medium datasets           |
| Large dataset (>1M samples)    | 98%   | 1%         | 1%           | Even 1% of 1M = 10,000 evaluation examples |
| Small dataset (<1,000 samples) | 60%   | 20%        | 20%          | Consider k-fold CV instead                 |
| Deep learning                  | 80%   | 10%        | 10%          | More data for training is critical         |
| Benchmark competitions         | 80%   | 10%        | 10% (hidden) | Public LB vs. private LB split             |

Cross-Validation

Cross-validation maximizes the use of available data by cycling through multiple train/validation splits. Instead of holding out a fixed validation set, the model is trained and evaluated k times, each time using a different portion of the data as the validation fold. The results are averaged for a more reliable performance estimate.

Cross-validation is especially valuable with small datasets where any fixed split would leave too few examples in either train or validation to be reliable.

k-Fold Cross-Validation

The dataset is divided into k equal-sized folds. The model is trained k times: each iteration uses k-1 folds for training and 1 fold for validation. Final performance = mean across all k runs.

# Pseudocode: 5-fold CV
folds = split(dataset, k=5)
scores = []

for i in range(5):
  val_fold   = folds[i]
  train_fold = concat(folds except folds[i])
  model.fit(train_fold)
  scores.append(model.evaluate(val_fold))

mean_score = mean(scores)
std_score  = std(scores)

k=5 and k=10 are standard choices. Higher k = less bias but more computation. Always fix the random seed for reproducibility.
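
The loop above, made runnable in plain Python. To keep the sketch self-contained, the "model" here is a toy mean-predictor scored by mean absolute error; in practice you would plug in a real estimator (or use sklearn's `cross_val_score`):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Shuffle indices once, then deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(ys, k=5):
    """k-fold CV of a toy mean-predictor, scored by mean absolute error."""
    folds = kfold_indices(len(ys), k)
    scores = []
    for i in range(k):
        held_out = set(folds[i])
        train_y = [y for j, y in enumerate(ys) if j not in held_out]
        pred = sum(train_y) / len(train_y)   # "fit": predict the training mean
        val_y = [ys[j] for j in folds[i]]
        scores.append(sum(abs(y - pred) for y in val_y) / len(val_y))
    mean = sum(scores) / k
    std = (sum((s - mean) ** 2 for s in scores) / k) ** 0.5
    return mean, std
```

Reporting the standard deviation alongside the mean is worthwhile: a large spread across folds signals an unstable estimate.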

Stratified k-Fold

Each fold preserves the same class ratio as the full dataset. Essential for imbalanced classification. If your dataset is 5% positive class, each fold should also be 5% positive.

Leave-One-Out (LOO)

Extreme case of k-fold where k = n (the number of samples): every single example serves as the validation set exactly once. Computationally very expensive but makes maximal use of the data; practical only for very small datasets (n < 100).

Time-Series CV

Walk-forward validation: each fold's training data is a growing prefix of the time series, and validation is always the next time window. Never use future data to train.

Cross-Validation Strategy Comparison

| Method            | Best For                   | Computational Cost | Bias     | Variance |
|-------------------|----------------------------|--------------------|----------|----------|
| k-Fold (k=5)      | General purpose            | 5x training cost   | Low      | Moderate |
| k-Fold (k=10)     | More reliable estimate     | 10x training cost  | Very low | Low      |
| Stratified k-Fold | Imbalanced classification  | Same as k-Fold     | Low      | Low      |
| Leave-One-Out     | Tiny datasets (<100)       | n x training cost  | Minimal  | High     |
| Time-Series CV    | Sequential / temporal data | Moderate           | Low      | Moderate |

Data Leakage

Data Leakage is the Silent Model Killer

Data leakage occurs when information from outside the training set is inadvertently used to build the model. It causes artificially inflated performance metrics during development, followed by dramatic failure in production. Many published ML benchmarks have been invalidated due to discovered leakage.

Common Forms of Data Leakage

  • Preprocessing leakage: fitting scalers, encoders, or feature selectors on the full dataset before splitting
  • Temporal leakage: using future data to train a model that predicts the past
  • Group leakage: correlated samples (scans from the same patient, transactions from the same user) split across train and test
  • Test-set reuse: evaluating on the test set repeatedly and tuning to the results

How to Prevent Data Leakage

The golden rule: treat the test set as if it doesn't exist until you are completely done with all modeling decisions.

# WRONG: leaky pipeline
scaler.fit(all_data)          # learns from test statistics!
X_scaled = scaler.transform(all_data)
X_train, X_test = split(X_scaled)

# CORRECT: leak-free pipeline
X_train, X_test = split(raw_data)
scaler.fit(X_train)           # fit ONLY on training data
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)  # apply same transform

In sklearn, use Pipeline objects to chain preprocessing and modeling — they automatically prevent leakage by fitting preprocessing steps only on training folds during cross-validation.
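
The leak-free pipeline above, written out in plain Python with a hand-rolled standardizer standing in for sklearn's `StandardScaler` (function names are illustrative):

```python
def fit_scaler(train_rows):
    """Learn per-feature mean and std from TRAINING rows only."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in train_rows) / n) ** 0.5 or 1.0
            for j in range(d)]               # guard: constant feature -> std 1
    return means, stds

def transform(rows, scaler):
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

# Leak-free order: split FIRST, then fit preprocessing on train only
X_train = [[1.0, 10.0], [3.0, 30.0]]
X_test = [[2.0, 20.0]]
scaler = fit_scaler(X_train)             # test rows never influence the fit
X_train_s = transform(X_train, scaler)
X_test_s = transform(X_test, scaler)     # same transform reused on test
```

The key invariant: `fit_scaler` only ever sees training rows, while `transform` may be applied to any split.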

Dataset Size & Quality

A pervasive myth in ML is that "more data always wins." While scale matters enormously for deep learning, the relationship between dataset size and model performance is nuanced — and data quality often matters more than raw quantity.

Learning Curves

A learning curve plots model performance as a function of training set size. It reveals critical information about your model and data regime:

  • Training error rises as more data is added (harder to perfectly fit more examples)
  • Validation error falls as more data is added (better generalization)
  • When both curves converge at a low error — your model is well-fitted
  • When both curves converge at a high error — you have underfitting; more data won't help, but a more complex model might
  • When there remains a large gap between curves — you have overfitting; more data or regularization is needed
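
The behavior described above can be observed on a tiny synthetic problem. The sketch below (hand-rolled least-squares fit on seeded noisy linear data; all names and numbers are illustrative) records training and validation MSE at growing training sizes, which is exactly the data a learning-curve plot needs:

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def mse(xs, ys, model):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]   # noisy line, noise std = 1
x_tr, y_tr, x_val, y_val = xs[:200], ys[:200], xs[200:], ys[200:]

# Train on growing prefixes; record (size, train MSE, val MSE)
curve = [(n,
          mse(x_tr[:n], y_tr[:n], fit_line(x_tr[:n], y_tr[:n])),
          mse(x_val, y_val, fit_line(x_tr[:n], y_tr[:n])))
         for n in (10, 25, 50, 100, 200)]
```

Plotting `curve` shows both errors converging toward the noise floor (MSE near 1 here), the well-fitted regime described above.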

Quality Over Quantity

Adding more low-quality data can actively hurt model performance. Noisy labels, selection bias, and distribution mismatch are harder to overcome with scale alone.

  • Label noise — Even 5% random label errors can significantly degrade a classifier; 10,000 clean examples often beat 100,000 noisy ones.
  • Selection bias — If your training data only captures part of the real-world distribution, the model will fail on underrepresented inputs.
  • Class imbalance — 99% negative class, 1% positive class makes raw accuracy misleading. Oversample the minority class or use weighted loss functions.
  • Distribution shift — Training data collected in one time period or context may not represent production data, making careful test set construction essential.

Practical Guidelines for Dataset Size

For classical ML (trees, SVMs, linear models): hundreds to tens of thousands of labeled examples per class are typically sufficient. For deep learning on images or text: millions of examples help, though transfer learning can achieve strong results with thousands. For fine-tuning pre-trained language models: even hundreds of high-quality labeled examples can be effective. Always plot learning curves to understand whether collecting more data will actually improve your model.