Feature Engineering & Selection

Transforming raw data into signals your model can actually learn from

ML Fundamentals Series
⏱ 7 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

What is Feature Engineering?

Feature engineering is the process of using domain knowledge and data intuition to extract, transform, and create input variables (features) that better represent the underlying problem for a machine learning model. Raw data rarely arrives in a form that algorithms can use effectively.

Before deep learning dominated perception tasks (vision, audio, text), feature engineering was the primary determinant of model quality. An expert who understood the problem could often outperform a naive deep model simply by crafting the right input representation. Even today, for tabular data, thoughtful feature engineering frequently delivers more improvement than model selection.

Garbage In, Garbage Out

No algorithm, regardless of sophistication, can extract signal from features that don't encode it. A model predicting house prices given only house color cannot learn much β€” not because the algorithm is weak, but because the feature doesn't contain the information. Feature engineering is fundamentally about channeling information into the representation the model sees.

Role of Domain Knowledge

The best features come from understanding the problem deeply. A domain expert knows which combinations of raw measurements are physically meaningful:

  • In finance: price/earnings ratio is more informative than price and earnings separately
  • In NLP: TF-IDF down-weights common words and up-weights rare discriminative terms
  • In healthcare: change in creatinine from baseline is more predictive of kidney failure than absolute creatinine
  • In e-commerce: days since last purchase captures recency better than last purchase date
  • In manufacturing: ratio of actual to expected production rate captures efficiency
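The e-commerce example above can be sketched with pandas (the column names and dates here are hypothetical):

```python
# Sketch: deriving a recency feature from a raw timestamp (hypothetical columns)
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_purchase": pd.to_datetime(["2025-01-01", "2024-12-15", "2024-11-30"]),
})
snapshot = pd.Timestamp("2025-01-31")  # "today" at feature-computation time

# A single number the model can use directly, unlike a raw date
df["days_since_last_purchase"] = (snapshot - df["last_purchase"]).dt.days
print(df["days_since_last_purchase"].tolist())  # [30, 47, 62]
```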

The Feature Engineering Process

  • Exploratory Data Analysis β€” Understand distributions, correlations, outliers, and missing value patterns
  • Hypothesis Generation β€” Based on domain knowledge, hypothesize which transformations would reveal signal
  • Feature Creation β€” Apply transformations, compute interactions, extract time/date components
  • Validation β€” Measure whether new features improve model performance on validation data
  • Iteration β€” Repeat based on model error analysis and residual patterns

Feature Transformation Techniques

Many ML algorithms are sensitive to the scale and distribution of input features. Gradient-based methods (neural networks, logistic regression, SVM) converge faster with normalized features. Distance-based methods (k-NN, k-means) are completely distorted by unscaled features. Tree-based models are scale-invariant β€” but even they benefit from well-shaped feature distributions.

Min-Max Normalization
x' = (x - x_min) / (x_max - x_min)
Scales all values to [0, 1]. Preserves relationships and zero values. Sensitive to outliers β€” one extreme outlier compresses all other values.
Z-Score Standardization
x' = (x - ΞΌ) / Οƒ
Centers at zero with unit variance. Not bounded. Handles outliers better than min-max. Default choice for most gradient-based models.
Log Transform
x' = log(x + 1)
Compresses right-skewed distributions (income, population, prices). +1 avoids log(0). Makes multiplicative relationships additive. Essential for heavy-tailed data.
Polynomial Features
x₁, xβ‚‚ β†’ x₁, xβ‚‚, x₁², xβ‚‚Β², x₁xβ‚‚
Creates interaction and higher-order terms, enabling linear models to learn non-linear patterns. Risk: feature space explodes β€” use with caution and regularization.
RobustScaler
x' = (x - median) / IQR
Uses median and interquartile range instead of mean and std. Resistant to outliers. Best when outliers are real signal, not errors, and you still want scaling.
Quantile Transform
Maps to uniform or normal distribution
Forces any distribution into a desired shape (uniform or Gaussian). Useful for extremely skewed data. Non-parametric β€” works on any distribution.
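As a sketch of how the transforms above behave on the same outlier-laden feature (assuming scikit-learn is available):

```python
# Sketch: the transforms above applied to one feature with an extreme outlier
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # last value is an outlier

minmax = MinMaxScaler().fit_transform(x)    # [0, 1]; outlier squashes the rest near 0
zscore = StandardScaler().fit_transform(x)  # zero mean, unit variance, unbounded
robust = RobustScaler().fit_transform(x)    # (x - median) / IQR; outlier barely matters
logged = np.log1p(x)                        # log(x + 1) compresses the heavy tail

print(minmax.ravel().round(3))  # first four values crowd near zero
```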

Transformation Selection Guide

Situation | Recommended Transform | Why
Neural networks, SVMs, k-NN | Z-Score Standardization | Algorithms sensitive to feature scale; zero-mean helps gradient flow
Need values in [0, 1] range | Min-Max Normalization | Image pixel values, probabilities, bounded outputs
Heavy-tailed / right-skewed data | Log Transform | Makes distribution more Gaussian; stabilizes variance
Outliers present but real | RobustScaler | Median/IQR not affected by extreme values
Linear model on non-linear problem | Polynomial Features | Adds expressiveness without changing algorithm
Tree-based models (RF, XGBoost) | Usually none needed | Trees are invariant to monotonic transformations of features

Encoding Categorical Variables

Most ML algorithms require numerical inputs. Categorical variables β€” city names, product types, user segments β€” must be converted to numbers. The encoding strategy has a major impact on model performance, especially for high-cardinality categories.

Encoding Method | How It Works | Pros | Cons | Best For
One-Hot Encoding | A binary column per category: Red/Blue/Green β†’ [1,0,0] / [0,1,0] / [0,0,1] | No ordinal assumption; works with linear models | Dimensionality explosion with high cardinality; sparse matrix | Low-cardinality nominal features (<20 categories)
Label Encoding | An integer per category: Red=0, Blue=1, Green=2 | Memory-efficient; single column | Implies a false ordinal relationship (Green is "bigger" than Red) | Tree-based models only; ordinal features
Ordinal Encoding | Integers respecting natural order: Low=0, Medium=1, High=2 | Preserves meaningful rank; compact | Assumes equal spacing between ranks | Genuinely ordered categories (education level, rating)
Target Encoding | Category replaced by the mean target value for that category: "City" β†’ avg house price per city | Compact; captures target relationship; handles high cardinality | Data leakage risk; needs cross-fold application; noisy for rare categories | High-cardinality features strongly related to the target
Frequency Encoding | Category replaced by its frequency in the dataset | Simple; handles high cardinality; no leakage risk | Two categories with the same frequency get the same encoding | High-cardinality features where frequency is informative
Binary Encoding | Label encode, then convert the integer to binary bits: 5 β†’ 101 β†’ [1,0,1] | Far fewer columns than one-hot for high cardinality | Less interpretable; binary patterns are arbitrary | Medium-cardinality features (20–1000 categories)
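A minimal sketch of two of these encodings with pandas (the city column is hypothetical):

```python
# Sketch: one-hot vs frequency encoding for a hypothetical "city" column
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice", "Paris"]})

# One-hot: one binary column per category
onehot = pd.get_dummies(df["city"], prefix="city")

# Frequency: each category replaced by its share of the rows
freq = df["city"].map(df["city"].value_counts(normalize=True))
print(freq.tolist())  # [0.6, 0.2, 0.6, 0.2, 0.6]
```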

Target Encoding Leakage Risk

When computing target encoding, you must use only training fold data β€” never the full dataset. In cross-validation, each fold's target encoding must be computed using only the other folds' data. Libraries like category_encoders handle this with leave-one-out or k-fold target encoding to prevent leakage.
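One way to sketch leakage-free K-fold target encoding by hand (the function and variable names are illustrative, not from a specific library):

```python
# Sketch: K-fold target encoding; each row is encoded with means
# computed only from the other folds, so its own target never leaks in
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(cat, target, n_splits=5, seed=0):
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    global_mean = target.mean()  # fallback for categories unseen in the fit folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kf.split(cat):
        # Per-category means from the other folds only
        fold_means = target.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        encoded.iloc[apply_idx] = (
            cat.iloc[apply_idx].map(fold_means).fillna(global_mean).to_numpy()
        )
    return encoded
```

Production implementations such as `category_encoders` add smoothing for rare categories on top of this, but the fold discipline is the part that prevents leakage.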

Feature Selection Methods

Not all features are useful. Irrelevant features add noise, increase training time, and can degrade model performance. Feature selection identifies which features to keep, which to discard, and which combinations matter most.

Three Families of Feature Selection

Method Family | How It Works | Examples | Pros | Cons
Filter Methods | Evaluate features independently of the model using statistical tests; select top-k by score | Pearson correlation, mutual information, chi-squared, ANOVA F-test, variance threshold | Fast; scalable; no model needed; avoids overfitting to a model | Ignores feature interactions; doesn't optimize for a specific model
Wrapper Methods | Search through feature subsets by training and evaluating the model on each subset | Recursive Feature Elimination (RFE), forward selection, backward elimination, exhaustive search | Considers feature interactions; optimized for the specific model | Computationally expensive; risk of overfitting to the validation set
Embedded Methods | Selection happens during model training as part of the learning algorithm itself | LASSO (L1) shrinks weights to zero; tree importance; gradient boosting feature importance | Efficient; considers interactions; less overfitting risk than wrappers | Tied to a specific model type; importance scores can mislead with correlated features
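A filter-method sketch using mutual information, assuming scikit-learn; the synthetic data is constructed so that only the first feature carries (non-linear) signal:

```python
# Sketch: filter-style selection by mutual information (non-linear aware)
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)  # only feature 0 matters, non-linearly

scores = mutual_info_regression(X, y, random_state=0)
print(int(np.argmax(scores)))  # feature 0 gets the highest score
```

Pearson correlation would largely miss this relationship (x vs. xΒ² has near-zero linear correlation for symmetric x), which is why mutual information is the better filter for non-linear signal.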

Recursive Feature Elimination (RFE)

Train the model on all features, rank features by importance, remove the least important, retrain, and repeat until the desired number of features is reached. Works with any model that exposes feature importance or coefficients.

# RFE: repeatedly drop the least important feature and retrain
import numpy as np

def rfe(model, X, y, features, target_k):
    features = list(features)
    while len(features) > target_k:
        model.fit(X[features], y)
        importances = model.feature_importances_
        worst = features[int(np.argmin(importances))]
        features.remove(worst)
    return features  # final selected set
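scikit-learn ships this loop ready-made as `sklearn.feature_selection.RFE`; a minimal sketch on synthetic data:

```python
# Sketch: the same elimination loop via scikit-learn's RFE
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Drop one feature per iteration until 3 remain
selector = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
print(selector.support_)  # boolean mask over the 10 original features
```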

Correlation-Based Filtering

Two types of correlation to check:

  • Feature–Target correlation β€” Keep features that are highly correlated with the target. Remove features with near-zero correlation.
  • Feature–Feature correlation β€” When two features are highly correlated with each other (>0.9), one is largely redundant. Keep the one with higher target correlation.

Pearson correlation captures linear relationships. Use mutual information for non-linear feature-target relationships. Use Spearman for monotonic but non-linear relationships.
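A sketch of feature-feature correlation filtering with pandas (the 0.9 threshold and column names are illustrative):

```python
# Sketch: drop one feature from each highly correlated pair (threshold 0.9)
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.01, size=200)  # near-duplicate of "a"
df["c"] = rng.normal(size=200)                            # independent

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```

This simple rule drops the later column of each pair; the refinement from the list above, keeping whichever member has the higher target correlation, would slot in at the `to_drop` step.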

Dimensionality Reduction

Dimensionality reduction transforms a high-dimensional feature space into a lower-dimensional one while retaining as much useful information as possible. Unlike feature selection (which keeps a subset of original features), dimensionality reduction creates new synthetic features that are combinations of the originals.

The Curse of Dimensionality

As the number of features grows, the amount of data needed to cover the feature space grows exponentially. In 2D, you need roughly N points to fill a square; in 100D, you need N^50 points to achieve equivalent coverage. Distance-based algorithms (k-NN, k-means, SVM) become unreliable in very high dimensions because all points become roughly equidistant. Dimensionality reduction counteracts this phenomenon.
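A quick numerical sketch of distance concentration: the relative spread of pairwise distances collapses as dimensionality grows (the dimensions 2 and 512 are chosen purely for illustration):

```python
# Sketch: relative spread of pairwise distances shrinks as dimensions grow
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 512):  # low- vs high-dimensional uniform data
    X = rng.uniform(size=(100, d))
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))[np.triu_indices(100, k=1)]
    ratios[d] = dists.std() / dists.mean()  # spread relative to typical distance

print({d: round(r, 3) for d, r in ratios.items()})  # ratio collapses in 512D
```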

PCA (Principal Component Analysis)

Finds the directions (principal components) of maximum variance in the data and projects onto them. The components are ordered by how much variance they explain. The first component captures the most variance, the second the next most, and so on.

  • Linear method β€” can only capture linear structure
  • Explained variance ratio tells you how much information each component retains
  • Select n components to retain 95% of total variance
  • Preprocessing: standardize features first (PCA is scale-sensitive)
  • Use for: preprocessing before clustering, linear models, visualization of 2D/3D projections
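A PCA sketch assuming scikit-learn: standardize first, then let a fractional `n_components` pick enough components for 95% variance. The synthetic data has only two underlying directions:

```python
# Sketch: standardize, then keep enough components for 95% of the variance
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))       # 2 true underlying directions
X = latent @ rng.normal(size=(2, 10))    # observed as 10 correlated features

X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=0.95).fit(X_std)    # fractional value = variance target
print(pca.n_components_)  # at most 2, since the data is rank-2
```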

t-SNE

t-Distributed Stochastic Neighbor Embedding preserves local neighborhood structure for visualization. Points that are near each other in high-dimensional space tend to land near each other in the 2D or 3D projection, though distances between well-separated clusters carry little meaning.

  • Non-linear β€” can reveal complex manifold structure
  • Primarily for visualization only (2D/3D) β€” not for downstream ML
  • Stochastic β€” run multiple times to check stability
  • Perplexity hyperparameter (~5–50) controls neighborhood size
  • Slow for very large datasets (>50k samples)
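A minimal t-SNE sketch on a subset of the digits dataset, assuming scikit-learn (the 500-sample cap just keeps it fast):

```python
# Sketch: t-SNE projection of digit images to 2D for plotting
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X = X[:500]  # small subset keeps this quick

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (500, 2)
```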

UMAP

Uniform Manifold Approximation and Projection is faster than t-SNE and better preserves global structure. It is rapidly becoming the preferred method for large-scale high-dimensional visualization and, in some cases, preprocessing.

  • Much faster than t-SNE for large datasets
  • Preserves both local and global structure better
  • Can be used for general dimensionality reduction (not just 2D)
  • Supports supervised, semi-supervised, and unsupervised modes
  • Used in single-cell RNA sequencing, NLP embedding visualization

When to Use Each Method

Method | Type | Primary Use | Linear? | Scale
PCA | Global variance | Preprocessing, linear structure | Yes | Any (fast)
t-SNE | Local neighborhood | 2D/3D visualization only | No | <50k samples
UMAP | Local + global | Visualization & preprocessing | No | Millions of samples
Autoencoder | Reconstruction | Non-linear compression, anomaly detection | No | Any (GPU helps)
ICA | Statistical independence | Signal separation (audio, EEG) | Yes | Moderate
LDA (Linear Discriminant) | Class separability | Supervised dimensionality reduction | Yes | Any

Feature Engineering in Practice

Start with good transformations and clean encoding before reaching for complex models. In Kaggle competitions, the most successful solutions typically involve extensive domain-driven feature engineering. A gradient boosting model with excellent features will outperform a deep neural network on poor ones. Invest in understanding your data before tuning algorithms β€” the return is far higher.