Understanding the bias-variance tradeoff and how to find the sweet spot
Overfitting occurs when a model learns the training data too well, including the noise, outliers, and random fluctuations that are specific to the training set and do not reflect the true underlying pattern. The model memorizes rather than generalizes.
An overfit model performs impressively on training data but fails on new, unseen examples. It has essentially learned a lookup table of training examples rather than the rules that govern them.
A degree-15 polynomial regression fitted to a dataset generated from a straight line will weave through every training point, achieving near-zero training error. But between training points the curve oscillates wildly, and any new point not in the training set is predicted poorly. This is textbook overfitting: perfect memory, zero generalization.
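The effect is easy to reproduce. The sketch below (illustrative data and constants, not from the original text) fits degree-1 and degree-15 polynomials with NumPy and compares training error with behavior just outside the training range:

```python
import numpy as np

# Hypothetical illustration: data generated from a straight line plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)

# Fit a simple line and an overly flexible degree-15 polynomial.
p1 = np.polynomial.Polynomial.fit(x, y, deg=1)
p15 = np.polynomial.Polynomial.fit(x, y, deg=15)

train_mse_1 = np.mean((p1(x) - y) ** 2)
train_mse_15 = np.mean((p15(x) - y) ** 2)
print(train_mse_1, train_mse_15)   # the degree-15 fit "wins" on training data

# ...but it extrapolates wildly just outside the training range.
x_new = 1.2
true_new = 2.0 * x_new + 1.0
print(abs(p1(x_new) - true_new), abs(p15(x_new) - true_new))
```

The higher-degree model always achieves lower training error (it nests the simpler one), which is exactly why training error alone cannot diagnose overfitting.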
Underfitting is the opposite problem: the model is too simple to capture the true patterns in the data. It fails to learn even the structure that's clearly present in the training set. A linear model trying to fit a curved relationship is a classic example.
Underfitting is sometimes overlooked because practitioners focus on preventing overfitting, but a severely underfit model is just as useless in production; it simply fails in a less subtle way.
The expected prediction error of a model can be decomposed into three components: Expected error = Bias² + Variance + Irreducible noise. Bias is systematic error from assumptions that are too simple, variance is sensitivity to the particular training sample, and irreducible noise is randomness in the data that no model can remove.
The fundamental tension: reducing bias typically increases variance, and vice versa. Increasing model complexity reduces bias (better approximation) but increases variance (more sensitive to training data). The goal is to find the sweet spot that minimizes total error.
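A small simulation can make the tradeoff concrete. The sketch below (synthetic sine data and illustrative constants, assumed for demonstration) repeatedly refits a low- and a high-degree polynomial on fresh noisy training sets and estimates bias² and variance empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)
n_repeats, n_train, noise = 300, 30, 0.2

# Collect predictions from many independently drawn training sets.
preds = {1: [], 9: []}
for _ in range(n_repeats):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = true_f(x_tr) + rng.normal(0, noise, n_train)
    for deg in preds:
        p = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=deg)
        preds[deg].append(p(x_test))

results = {}
for deg, ps in preds.items():
    ps = np.array(ps)                         # (n_repeats, n_test)
    mean_pred = ps.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)   # systematic error
    variance = np.mean(ps.var(axis=0))        # spread across training sets
    results[deg] = (bias2, variance)
    print(deg, results[deg])
```

The degree-1 model shows high bias and low variance; the degree-9 model the reverse, which is the tension described above.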
| Condition | High Bias (Underfitting) | High Variance (Overfitting) | Well-Balanced |
|---|---|---|---|
| Training Error | High | Very low | Moderate-low |
| Validation Error | High (similar to train) | Much higher than train | Close to train |
| Gap (val - train) | Small | Large | Small to moderate |
| Model Complexity | Too low | Too high | Appropriate |
| Primary Fix | Increase model capacity, better features, less regularization | More data, reduce complexity, add regularization | Monitor and maintain with ongoing evaluation |
| Example Algorithms | Linear regression on complex data, shallow trees | Deep unpruned trees, large neural nets with no dropout | Gradient boosting, regularized neural nets, ensembles |
Regularization techniques constrain the model during training to prevent it from memorizing noise. They add a penalty term or constraint that discourages model complexity, effectively trading a small amount of bias for a large reduction in variance.
L1 regularization (lasso) adds the sum of the absolute values of the weights to the loss function: Loss + λ·Σ|wᵢ|
L1 has a geometric property that pushes many weights exactly to zero, performing automatic feature selection. The resulting model is sparse: most features are dropped, keeping only the most informative ones. Particularly useful when you have many features and suspect most are irrelevant.
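The exact-zero behavior can be demonstrated with a minimal proximal-gradient (ISTA) sketch; the data, λ, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]            # only 2 of 10 features matter
y = X @ w_true + rng.normal(0, 0.1, n)

lam = 0.1                                        # L1 strength λ
step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)     # 1/L for the smooth part

w = np.zeros(d)
for _ in range(1000):
    grad = X.T @ (X @ w - y) / n
    z = w - step * grad
    # Soft-thresholding: the proximal operator of the L1 penalty.
    # Coordinates whose signal is below the threshold land at exactly zero.
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.round(w, 3))   # most entries are exactly zero
```

Note the small downward bias on the surviving weights: the λ penalty shrinks them slightly below their true values, the "small amount of bias" traded for variance.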
L2 regularization (ridge) adds the sum of squared weights to the loss function: Loss + λ·Σwᵢ²
L2 penalizes large weights but does not zero them out; it distributes the penalty across all weights, shrinking them uniformly. This leads to smooth, stable models. Ridge regression is analytically solvable and works well when many features have small but non-zero effects.
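A minimal sketch of that closed-form solution on illustrative synthetic data, showing shrinkage without exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    # Closed form: w = (XᵀX + λI)⁻¹ Xᵀy
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)    # ordinary least squares (λ = 0)
w_l2 = ridge(X, y, 10.0)    # ridge with an illustrative λ

# Ridge shrinks the overall weight norm, but every coefficient stays non-zero.
print(np.linalg.norm(w_ols), np.linalg.norm(w_l2))
print(np.round(w_l2, 3))
```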
Dropout: during each training step, randomly set a fraction of neurons to zero. The network must learn redundant representations because it cannot rely on any particular neuron always being present.
At inference time, all neurons are active but their outputs are scaled by the keep probability. Dropout effectively trains an exponential ensemble of thinned networks and averages their predictions.
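A minimal NumPy sketch of the common "inverted dropout" variant, which scales by 1/keep_prob at training time instead of at inference, so inference is the identity (constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5, training=True):
    if not training:
        return activations              # inference: no masking, no rescaling
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Divide by keep_prob so the expected activation is unchanged.
    return activations * mask / keep_prob

a = np.ones(100_000)
out = dropout(a, drop_prob=0.5)
print(np.mean(out == 0))   # ≈ 0.5 of units are silenced this step
print(out.mean())          # ≈ 1.0: expected activation is preserved
```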
Early stopping: monitor validation loss during training and stop when it stops improving. This prevents the model from continuing to learn training noise after it has already learned the useful signal.
Keep the model weights from the epoch with the lowest validation loss (best checkpoint). It is one of the simplest, most effective regularization techniques and has almost no hyperparameters to tune.
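The logic can be sketched as a small patience-based helper; the `val_losses` values and `patience` setting below are hypothetical stand-ins for a real training loop:

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss) for a sequence of per-epoch val losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # checkpoint weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss          # stop; restore checkpoint
    return best_epoch, best_loss

# Validation loss improves, then starts rising as overfitting begins.
losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.43, 0.47, 0.55, 0.60]
print(early_stopping(losses))   # → (4, 0.4): keep the epoch-4 weights
```

The `patience` parameter guards against stopping on a noisy single-epoch blip in validation loss.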
Elastic Net combines the L1 and L2 penalties: Loss + λ₁·Σ|wᵢ| + λ₂·Σwᵢ². It performs feature selection like L1 while maintaining the stability of L2 under correlated features. Recommended as a default over pure L1 or L2 when you're uncertain which to use.
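Sketched with proximal gradient descent (illustrative data and λ values), the elastic-net update soft-thresholds for the L1 term and shrinks for the L2 term:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)      # only 2 relevant features
y = X @ w_true + rng.normal(0, 0.1, 200)

lam1, lam2 = 0.1, 0.1                           # λ₁ (L1) and λ₂ (L2)
step = 1.0 / (np.linalg.norm(X, 2) ** 2 / len(y))

w = np.zeros(10)
for _ in range(1000):
    z = w - step * (X.T @ (X @ w - y) / len(y))
    # Elastic-net proximal step: L1 soft-threshold, then L2 shrinkage
    # (the 2·λ₂ factor comes from differentiating λ₂·Σwᵢ²).
    w = (np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
         / (1 + 2 * step * lam2))

print(np.round(w, 3))   # sparse like L1, shrunk like L2
```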
A learning curve plots training and validation error (or loss) against either training set size or number of training epochs. Reading learning curves is an essential skill for diagnosing model problems quickly, before investing more computation.
Overfitting curve: a large gap between training and validation loss. Validation loss may initially decrease and then rise (the moment overfitting begins). Adding more data would help close the gap.
Underfitting curve: both training and validation losses are high and converge together at an unsatisfactorily high level. More data doesn't help; the model needs more capacity or better features.
Well-fit curve: training and validation losses both decrease and converge at a low value, with a small, stable gap between them. This is what you want to see.
| What You See in Curves | Diagnosis | Remedy |
|---|---|---|
| Large gap, val loss rising | Overfitting | Add regularization, reduce model size, get more data, early stopping |
| Both high, small gap | Underfitting | More complex model, better features, more training, less regularization |
| Both decreasing but not converging | Still training; needs more epochs | Continue training, check learning rate |
| Both low, small stable gap | Well-fit; good generalization | Finalize model, evaluate on test set |
| Val loss noisy / oscillating | Validation set too small or learning rate too high | Reduce learning rate, use learning rate schedule, increase val set size |
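The table's rules of thumb can be encoded as a toy helper; the thresholds below ("high" loss, "large" gap) are illustrative assumptions, since in practice they depend on the task and the loss scale:

```python
def diagnose(train_loss, val_loss, high=0.5, large_gap=0.2):
    """Rule-of-thumb diagnosis from final train/validation losses."""
    gap = val_loss - train_loss
    if train_loss >= high and gap < large_gap:
        return "underfitting: increase capacity or improve features"
    if gap >= large_gap:
        return "overfitting: regularize, simplify, or get more data"
    return "well-fit: finalize and evaluate on the test set"

print(diagnose(0.60, 0.65))   # both high, small gap  -> underfitting
print(diagnose(0.05, 0.40))   # large gap             -> overfitting
print(diagnose(0.10, 0.15))   # both low, small gap   -> well-fit
```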