Overfitting & Underfitting

Understanding the bias-variance tradeoff and how to find the sweet spot

Topics: Bias-Variance · Regularization · Learning Curves · Model Complexity
ML Fundamentals Series
⏱ 6 min read · 📊 Beginner · 🗓 Updated Jan 2025

Overfitting

Overfitting occurs when a model learns the training data too well β€” including the noise, outliers, and random fluctuations that are specific to the training set and do not reflect the true underlying pattern. The model memorizes rather than generalizes.

An overfit model performs impressively on training data but fails on new, unseen examples. It has essentially learned a lookup table of training examples rather than the rules that govern them.

Signs of Overfitting

  • Training accuracy is very high (95–100%) but validation/test accuracy is much lower
  • Training loss continues to decrease while validation loss plateaus or rises
  • The model is extremely sensitive to small input perturbations
  • The decision boundary or regression curve is jagged/complex rather than smooth
  • Model performance degrades noticeably on slightly different real-world data

Causes of Overfitting

  • Model too complex β€” Too many parameters relative to training samples; a 100-layer network on 50 examples will overfit
  • Too little training data β€” Small datasets leave the model with little evidence to learn robust patterns
  • Training too long β€” Gradient descent will eventually fit training noise if run indefinitely
  • Noisy features β€” Irrelevant features give the model spurious patterns to latch on to
  • No regularization — An unconstrained model can take any shape the optimizer finds

Classic Overfitting Example

A degree-15 polynomial fitted to data generated from a straight line will weave through every training point perfectly, achieving near-zero training error. But between the training points the curve oscillates wildly, so any new point outside the training set is predicted poorly. This is textbook overfitting: perfect memory, zero generalization.
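
This can be demonstrated in a few lines of numpy; a minimal sketch, where the dataset size, noise level, and polynomial degrees are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 training points from a noisy straight line
x_train = np.linspace(-1, 1, 16)
y_train = 2 * x_train + 1 + rng.normal(0, 0.2, size=16)

# Degree-15 polynomial: enough parameters to interpolate all 16 points
overfit = np.polyfit(x_train, y_train, 15)
# Degree-1 polynomial: matches the true generating process
simple = np.polyfit(x_train, y_train, 1)

# Evaluate on unseen points that fall between the training points
x_test = np.linspace(-0.95, 0.95, 200)
y_test = 2 * x_test + 1

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print("degree-15 train MSE:", mse(overfit, x_train, y_train))  # near zero
print("degree-15 test  MSE:", mse(overfit, x_test, y_test))    # much larger
print("degree-1  test  MSE:", mse(simple, x_test, y_test))     # small
```

The degree-15 fit drives training error to essentially zero while test error explodes; the degree-1 fit does well on both.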

Underfitting

Underfitting is the opposite problem: the model is too simple to capture the true patterns in the data. It fails to learn even the structure that's clearly present in the training set. A linear model trying to fit a curved relationship is a classic example.

Underfitting is sometimes overlooked because practitioners focus on preventing overfitting, but a severely underfit model is just as useless in production β€” it simply fails in a less subtle way.

Signs of Underfitting

  • High error on both training and validation/test sets
  • Training loss stagnates at a high value early in training
  • Model predictions are similar regardless of input variation
  • The model underpredicts extreme values (excessive regression toward the mean)
  • Adding more training data doesn't improve performance meaningfully

Causes of Underfitting

  • Model too simple β€” Linear model for inherently non-linear data; not enough capacity
  • Excessive regularization β€” L1/L2 penalties so strong they force all weights toward zero
  • Too few features β€” Missing information the model needs to make accurate predictions
  • Too little training β€” Training stopped before convergence; learning rate too small
  • Poor feature engineering β€” Raw features don't expose the signal the model needs

The Bias-Variance Tradeoff

The expected prediction error of a model can be decomposed into three components: squared bias, variance, and irreducible noise.

Expected Error = BiasΒ² + Variance + Irreducible Noise

The fundamental tension: reducing bias typically increases variance, and vice versa. Increasing model complexity reduces bias (better approximation) but increases variance (more sensitive to training data). The goal is to find the sweet spot that minimizes total error.
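
The decomposition can be estimated empirically by refitting the same model class on many noise-resampled training sets; a sketch under assumed toy settings (a sine target, fixed inputs, and degrees 1 and 9 as the under- and over-complex models):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true function (assumption for the demo)
x = np.linspace(0, 1, 15)                # fixed training inputs
x_grid = np.linspace(0, 1, 50)           # evaluation points
sigma, trials = 0.3, 300                 # noise level, number of resamples

results = {}
for degree in (1, 9):
    preds = np.empty((trials, x_grid.size))
    for t in range(trials):
        y = f(x) + rng.normal(0, sigma, x.size)           # fresh noise each trial
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)  # squared bias
    var = np.mean(preds.var(axis=0))                        # variance
    results[degree] = (bias2, var)
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

Degree 1 shows high bias and low variance; degree 9 the reverse, exactly the tension described above.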

Bias, Variance, and the Right Balance

Condition | High Bias (Underfitting) | High Variance (Overfitting) | Well-Balanced
Training Error | High | Very low | Moderate to low
Validation Error | High (similar to train) | Much higher than train | Close to train
Gap (val − train) | Small | Large | Small to moderate
Model Complexity | Too low | Too high | Appropriate
Primary Fix | Increase model capacity, better features, less regularization | More data, reduce complexity, add regularization | Monitor and maintain with ongoing evaluation
Example Algorithms | Linear regression on complex data, shallow trees | Deep unpruned trees, large neural nets without dropout | Gradient boosting, regularized neural nets, ensembles

Regularization Techniques

Regularization techniques constrain the model during training to prevent it from memorizing noise. They add a penalty term or constraint that discourages model complexity, effectively trading a small amount of bias for a large reduction in variance.

L1 Regularization (Lasso)

Adds the sum of absolute values of weights to the loss function: Loss + Ξ» * Ξ£|wα΅’|

L1 has a geometric property that pushes many weights exactly to zero β€” performing automatic feature selection. The resulting model is sparse: most features are dropped, keeping only the most informative ones. Particularly useful when you have many features and suspect most are irrelevant.

  • Produces sparse models with many zero weights
  • Effective built-in feature selection
  • Can be unstable when features are correlated
  • Used in Lasso Regression, sparse neural networks
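
One way to see the zeroing behaviour concretely is proximal gradient descent (ISTA), whose update is a gradient step followed by soft-thresholding; a self-contained numpy sketch on synthetic data where only 3 of 20 features matter (all sizes and the penalty strength are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]            # only 3 of 20 features matter
y = X @ w_true + rng.normal(0, 0.1, n)

lam = 0.1 * np.max(np.abs(X.T @ y))      # penalty strength (illustrative choice)
L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

w = np.zeros(d)
for _ in range(500):                     # ISTA: gradient step + soft-threshold
    w = soft_threshold(w - (X.T @ (X @ w - y)) / L, lam / L)

print("non-zero weights:", np.count_nonzero(w), "of", d)
```

The soft-threshold step is what produces exact zeros: any coefficient whose gradient signal stays below the threshold is clipped to zero rather than merely shrunk.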

L2 Regularization (Ridge)

Adds the sum of squared weights to the loss function: Loss + Ξ» * Ξ£wα΅’Β²

L2 penalizes large weights but does not zero them out β€” it distributes the penalty across all weights, shrinking them uniformly. This leads to smooth, stable models. Ridge regression is analytically solvable and works well when many features have small but non-zero effects.

  • Weights shrink toward zero but rarely become exactly zero
  • Stable when features are correlated
  • Well-suited for problems with many small effects
  • Standard choice in logistic regression and linear models
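
The closed-form solution makes the contrast with L1 easy to check numerically; a sketch with an illustrative λ and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    """Closed form: w = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares
w_ridge = ridge(X, y, 50.0)

print("||w_ols||   =", np.linalg.norm(w_ols))
print("||w_ridge|| =", np.linalg.norm(w_ridge))                 # strictly smaller
print("exact zeros in ridge:", np.count_nonzero(w_ridge == 0))  # typically 0
```

The ridge weight vector has a smaller norm than the OLS solution, yet no coefficient lands exactly on zero, which is the behaviour described above.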

Dropout (Neural Networks)

During each training step, randomly set a fraction of neurons to zero. The network must learn redundant representations because it cannot rely on any particular neuron always being present.

At inference time, all neurons are active but their outputs are scaled by the keep probability. Dropout effectively trains an exponential ensemble of thinned networks and averages their predictions.

  • Dropout rate typically 0.2–0.5 for hidden layers
  • Applied between layers, not on input or output
  • Dramatically reduces overfitting in large networks
  • Enables training very deep networks with limited data
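
Modern implementations usually use the equivalent "inverted" variant, scaling the survivors at training time by 1/keep so that inference needs no change; a numpy sketch with an illustrative rate:

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of activations and rescale
    the survivors by 1/(1-rate) so the expected activation is unchanged.
    At inference (training=False) this is the identity."""
    if not training:
        return a
    keep = 1.0 - rate
    mask = rng.random(a.shape) < keep
    return a * mask / keep

rng = np.random.default_rng(0)
acts = np.ones((1000, 256))                    # dummy hidden-layer activations
out = dropout(acts, rate=0.5, rng=rng)

print("fraction zeroed:", np.mean(out == 0))   # close to 0.5
print("mean activation:", out.mean())          # close to 1.0, expectation preserved
```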

Early Stopping

Monitor validation loss during training and stop training when it stops improving. This prevents the model from continuing to learn training noise after it has already learned the useful signal.

Keep the model weights from the epoch with the lowest validation loss (best checkpoint). It is one of the simplest, most effective regularization techniques and has almost no hyperparameters to tune.

  • Set "patience" β€” wait N epochs before stopping to allow temporary plateau
  • Always save model weights at best validation checkpoint
  • Works for any iteratively trained model (neural nets, gradient boosting)
  • Free regularization β€” no extra computation cost
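
A minimal patience-based loop; the validation-loss sequence here is a dummy stand-in for a real training run:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Scan per-epoch validation losses; stop once `patience` epochs pass
    with no improvement. Returns (best_epoch, best_loss)."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
            # in real training: save a model checkpoint here
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss

# Validation loss improves, then begins to rise (overfitting sets in)
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74, 0.75]
print(train_with_early_stopping(losses))  # stops early; best epoch is 2
```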

Elastic Net: Best of Both Worlds

Elastic Net combines L1 and L2 penalties: Loss + λ₁*Ξ£|wα΅’| + Ξ»β‚‚*Ξ£wα΅’Β². It performs feature selection like L1 while maintaining the stability of L2 under correlated features. Recommended as a default over pure L1 or L2 when you're uncertain which to use.
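
Elastic Net fits the same proximal-gradient mold as L1 alone: the L2 term joins the smooth gradient, and the L1 term stays in the soft-threshold step. A numpy sketch on synthetic data, with illustrative penalty strengths:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]             # only 3 of 20 features matter
y = X @ w_true + rng.normal(0, 0.1, n)

lam1, lam2 = 20.0, 10.0                   # L1 and L2 strengths (illustrative)
L = np.linalg.norm(X, 2) ** 2 + lam2      # Lipschitz constant of the smooth part

w = np.zeros(d)
for _ in range(500):
    # gradient of the smooth part: squared-error loss plus the L2 penalty
    grad = X.T @ (X @ w - y) + lam2 * w
    z = w - grad / L
    # proximal step for the L1 penalty: soft-thresholding
    w = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)

print("non-zero weights:", np.count_nonzero(w), "of", d)  # sparse, like L1
```

The solution stays sparse thanks to the L1 term, while the L2 term keeps the estimates stable when features are correlated.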

Learning Curves β€” Diagnosing Over & Underfitting

A learning curve plots training and validation error (or loss) against either training set size or number of training epochs. Reading learning curves is an essential skill for diagnosing model problems quickly, before investing more computation.
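
Generating a size-based learning curve needs nothing more than refitting on growing training sets; a sketch with a fixed-capacity model on noisy sine data (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                # true function (toy assumption)
x_val = np.linspace(0.02, 0.98, 50)                # held-out validation inputs
y_val = f(x_val) + rng.normal(0, 0.3, x_val.size)

gaps = []
for n in (15, 30, 60, 120):                        # growing training-set sizes
    x_tr = np.linspace(0, 1, n)
    y_tr = f(x_tr) + rng.normal(0, 0.3, n)
    coeffs = np.polyfit(x_tr, y_tr, 9)             # fixed-capacity model
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    gaps.append(val_mse - train_mse)
    print(f"n={n:4d}  train={train_mse:.3f}  val={val_mse:.3f}")
```

With few examples the model overfits (large train/val gap); as the training set grows, the gap narrows, which is the overfitting-vs-data pattern the curves below describe.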

Overfitting Pattern

Large gap between training and validation loss. Validation loss may initially decrease then rise (the moment overfitting begins). Adding more data would help close this gap.

Loss
High |  val   ╲___________
     |         ╲_ _ _ _ _ _ _      (val levels off high)
     |
Low  |  train ╲______________      (train keeps falling)
     +---------------------------------
              Epochs / Data Size

Underfitting Pattern

Both training and validation losses are high and converge together at an unsatisfactorily high level. More data doesn't help. Model needs more capacity or better features.

Loss
High |  val   ╲_ _ _ _ _ _ _ _ _   (stays high)
     |  train ╲_ _ _ _ _ _ _ _     (also stays high)
     |                             (curves close, both high)
Low  |
     +---------------------------------
              Epochs / Data Size

Well-Fit Model Pattern

Training and validation losses both decrease and converge at a low value, with a small stable gap between them. This is what you want to see.

Loss
High |  val    ╲______________
     |          ╲_ _ _ _ _ _ ╲_    (small, stable gap)
     |  train  ╲__________
     |                             (both low, close together)
Low  |  . . . . . . . . . . .      (converged)
     +---------------------------------
              Epochs / Data Size

What You See in Curves | Diagnosis | Remedy
Large gap, val loss rising | Overfitting | Add regularization, reduce model size, get more data, early stopping
Both high, small gap | Underfitting | More complex model, better features, more training, less regularization
Both decreasing but not yet converged | Still training; needs more epochs | Continue training, check learning rate
Both low, small stable gap | Well-fit; good generalization | Finalize model, evaluate on test set
Val loss noisy / oscillating | Validation set too small or learning rate too high | Reduce learning rate, use a learning rate schedule, increase val set size