Understanding the bias-variance tradeoff and how to find the sweet spot
Overfitting occurs when a model learns the training data too well, including the noise, outliers, and random fluctuations that are specific to the training set and do not reflect the true underlying pattern. The model memorizes rather than generalizes.
An overfit model performs impressively on training data but fails on new, unseen examples. It has essentially learned a lookup table of training examples rather than the rules that govern them.
A degree-15 polynomial regression fitted to a dataset generated from a straight line will weave through every training point, achieving near-zero training error. But between training points the curve oscillates wildly, and any new point not in the training set is predicted poorly. This is textbook overfitting: perfect memory, zero generalization.
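The effect is easy to reproduce. The sketch below (illustrative data and constants, not from the original text) fits degree-1 and degree-15 polynomials with NumPy and compares training error with behavior just outside the training range:

```python
import numpy as np

# Hypothetical illustration: data generated from a straight line plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)

# Fit a simple line and an overly flexible degree-15 polynomial.
p1 = np.polynomial.Polynomial.fit(x, y, deg=1)
p15 = np.polynomial.Polynomial.fit(x, y, deg=15)

train_mse_1 = np.mean((p1(x) - y) ** 2)
train_mse_15 = np.mean((p15(x) - y) ** 2)
print(train_mse_1, train_mse_15)   # the degree-15 fit "wins" on training data

# ...but it extrapolates wildly just outside the training range.
x_new = 1.2
true_new = 2.0 * x_new + 1.0
print(abs(p1(x_new) - true_new), abs(p15(x_new) - true_new))
```

The higher-degree model always achieves lower training error (it nests the simpler one), which is exactly why training error alone cannot diagnose overfitting.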
Underfitting is the opposite problem: the model is too simple to capture the true patterns in the data. It fails to learn even the structure that's clearly present in the training set. A linear model trying to fit a curved relationship is a classic example.
Underfitting is sometimes overlooked because practitioners focus on preventing overfitting, but a severely underfit model is just as useless in production; it simply fails in a less subtle way.
The expected prediction error of a model can be decomposed into three components: Expected error = Bias² + Variance + Irreducible noise. Bias is systematic error from assumptions that are too simple, variance is sensitivity to the particular training sample, and irreducible noise is randomness in the data that no model can remove.
The fundamental tension: reducing bias typically increases variance, and vice versa. Increasing model complexity reduces bias (better approximation) but increases variance (more sensitive to training data). The goal is to find the sweet spot that minimizes total error.
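A small simulation can make the tradeoff concrete. The sketch below (synthetic sine data and illustrative constants, assumed for demonstration) repeatedly refits a low- and a high-degree polynomial on fresh noisy training sets and estimates bias² and variance empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)
n_repeats, n_train, noise = 300, 30, 0.2

# Collect predictions from many independently drawn training sets.
preds = {1: [], 9: []}
for _ in range(n_repeats):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = true_f(x_tr) + rng.normal(0, noise, n_train)
    for deg in preds:
        p = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=deg)
        preds[deg].append(p(x_test))

results = {}
for deg, ps in preds.items():
    ps = np.array(ps)                         # (n_repeats, n_test)
    mean_pred = ps.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)   # systematic error
    variance = np.mean(ps.var(axis=0))        # spread across training sets
    results[deg] = (bias2, variance)
    print(deg, results[deg])
```

The degree-1 model shows high bias and low variance; the degree-9 model the reverse, which is the tension described above.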
| Condition | High Bias (Underfitting) | High Variance (Overfitting) | Well-Balanced |
|---|---|---|---|
| Training Error | High | Very low | Moderate-low |
| Validation Error | High (similar to train) | Much higher than train | Close to train |
| Gap (val - train) | Small | Large | Small to moderate |
| Model Complexity | Too low | Too high | Appropriate |
| Primary Fix | Increase model capacity, better features, less regularization | More data, reduce complexity, add regularization | Monitor and maintain with ongoing evaluation |
| Example Algorithms | Linear regression on complex data, shallow trees | Deep unpruned trees, large neural nets with no dropout | Gradient boosting, regularized neural nets, ensembles |
Regularization techniques constrain the model during training to prevent it from memorizing noise. They add a penalty term or constraint that discourages model complexity, effectively trading a small amount of bias for a large reduction in variance.
L1 regularization (lasso) adds the sum of the absolute values of the weights to the loss function: Loss + λ·Σ|wᵢ|
L1 has a geometric property that pushes many weights exactly to zero, performing automatic feature selection. The resulting model is sparse: most features are dropped, keeping only the most informative ones. Particularly useful when you have many features and suspect most are irrelevant.
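The exact-zero behavior can be demonstrated with a minimal proximal-gradient (ISTA) sketch; the data, λ, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]            # only 2 of 10 features matter
y = X @ w_true + rng.normal(0, 0.1, n)

lam = 0.1                                        # L1 strength λ
step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)     # 1/L for the smooth part

w = np.zeros(d)
for _ in range(1000):
    grad = X.T @ (X @ w - y) / n
    z = w - step * grad
    # Soft-thresholding: the proximal operator of the L1 penalty.
    # Coordinates whose signal is below the threshold land at exactly zero.
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.round(w, 3))   # most entries are exactly zero
```

Note the small downward bias on the surviving weights: the λ penalty shrinks them slightly below their true values, the "small amount of bias" traded for variance.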
L2 regularization (ridge) adds the sum of squared weights to the loss function: Loss + λ·Σwᵢ²
L2 penalizes large weights but does not zero them out; it distributes the penalty across all weights, shrinking them uniformly. This leads to smooth, stable models. Ridge regression is analytically solvable and works well when many features have small but non-zero effects.
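A minimal sketch of that closed-form solution on illustrative synthetic data, showing shrinkage without exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    # Closed form: w = (XᵀX + λI)⁻¹ Xᵀy
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)    # ordinary least squares (λ = 0)
w_l2 = ridge(X, y, 10.0)    # ridge with an illustrative λ

# Ridge shrinks the overall weight norm, but every coefficient stays non-zero.
print(np.linalg.norm(w_ols), np.linalg.norm(w_l2))
print(np.round(w_l2, 3))
```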
Dropout: during each training step, randomly set a fraction of neurons to zero. The network must learn redundant representations because it cannot rely on any particular neuron always being present.
At inference time, all neurons are active but their outputs are scaled by the keep probability. Dropout effectively trains an exponential ensemble of thinned networks and averages their predictions.
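A minimal NumPy sketch of the common "inverted dropout" variant, which scales by 1/keep_prob at training time instead of at inference, so inference is the identity (constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5, training=True):
    if not training:
        return activations              # inference: no masking, no rescaling
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Divide by keep_prob so the expected activation is unchanged.
    return activations * mask / keep_prob

a = np.ones(100_000)
out = dropout(a, drop_prob=0.5)
print(np.mean(out == 0))   # ≈ 0.5 of units are silenced this step
print(out.mean())          # ≈ 1.0: expected activation is preserved
```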
Early stopping: monitor validation loss during training and stop when it stops improving. This prevents the model from continuing to learn training noise after it has already learned the useful signal.
Keep the model weights from the epoch with the lowest validation loss (best checkpoint). It is one of the simplest, most effective regularization techniques and has almost no hyperparameters to tune.
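The logic can be sketched as a small patience-based helper; the `val_losses` values and `patience` setting below are hypothetical stand-ins for a real training loop:

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss) for a sequence of per-epoch val losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # checkpoint weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss          # stop; restore checkpoint
    return best_epoch, best_loss

# Validation loss improves, then starts rising as overfitting begins.
losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.43, 0.47, 0.55, 0.60]
print(early_stopping(losses))   # → (4, 0.4): keep the epoch-4 weights
```

The `patience` parameter guards against stopping on a noisy single-epoch blip in validation loss.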
Elastic Net combines the L1 and L2 penalties: Loss + λ₁·Σ|wᵢ| + λ₂·Σwᵢ². It performs feature selection like L1 while maintaining the stability of L2 under correlated features. Recommended as a default over pure L1 or L2 when you're uncertain which to use.
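Sketched with proximal gradient descent (illustrative data and λ values), the elastic-net update soft-thresholds for the L1 term and shrinks for the L2 term:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)      # only 2 relevant features
y = X @ w_true + rng.normal(0, 0.1, 200)

lam1, lam2 = 0.1, 0.1                           # λ₁ (L1) and λ₂ (L2)
step = 1.0 / (np.linalg.norm(X, 2) ** 2 / len(y))

w = np.zeros(10)
for _ in range(1000):
    z = w - step * (X.T @ (X @ w - y) / len(y))
    # Elastic-net proximal step: L1 soft-threshold, then L2 shrinkage
    # (the 2·λ₂ factor comes from differentiating λ₂·Σwᵢ²).
    w = (np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
         / (1 + 2 * step * lam2))

print(np.round(w, 3))   # sparse like L1, shrunk like L2
```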
A learning curve plots training and validation error (or loss) against either training set size or number of training epochs. Reading learning curves is an essential skill for diagnosing model problems quickly, before investing more computation.
Overfitting curve: a large gap between training and validation loss. Validation loss may initially decrease and then rise (the moment overfitting begins). Adding more data would help close the gap.
Underfitting curve: both training and validation losses are high and converge together at an unsatisfactorily high level. More data doesn't help; the model needs more capacity or better features.
Well-fit curve: training and validation losses both decrease and converge at a low value, with a small, stable gap between them. This is what you want to see.
| What You See in Curves | Diagnosis | Remedy |
|---|---|---|
| Large gap, val loss rising | Overfitting | Add regularization, reduce model size, get more data, early stopping |
| Both high, small gap | Underfitting | More complex model, better features, more training, less regularization |
| Both decreasing but not converging | Still training; needs more epochs | Continue training, check learning rate |
| Both low, small stable gap | Well-fit; good generalization | Finalize model, evaluate on test set |
| Val loss noisy / oscillating | Validation set too small or learning rate too high | Reduce learning rate, use learning rate schedule, increase val set size |
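The table's rules of thumb can be encoded as a toy helper; the thresholds below ("high" loss, "large" gap) are illustrative assumptions, since in practice they depend on the task and the loss scale:

```python
def diagnose(train_loss, val_loss, high=0.5, large_gap=0.2):
    """Rule-of-thumb diagnosis from final train/validation losses."""
    gap = val_loss - train_loss
    if train_loss >= high and gap < large_gap:
        return "underfitting: increase capacity or improve features"
    if gap >= large_gap:
        return "overfitting: regularize, simplify, or get more data"
    return "well-fit: finalize and evaluate on the test set"

print(diagnose(0.60, 0.65))   # both high, small gap  -> underfitting
print(diagnose(0.05, 0.40))   # large gap             -> overfitting
print(diagnose(0.10, 0.15))   # both low, small gap   -> well-fit
```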