Feature Engineering & Selection

Transforming raw data into signals your model can actually learn from

ML Fundamentals Series
⏱ 7 min read πŸ“Š Intermediate πŸ—“ Updated Jan 2025

What is Feature Engineering?

Feature engineering is the process of using domain knowledge and data intuition to extract, transform, and create input variables (features) that better represent the underlying problem for a machine learning model. Raw data rarely arrives in a form that algorithms can use effectively.

Before deep learning dominated perception tasks (vision, audio, text), feature engineering was the primary determinant of model quality. An expert who understood the problem could often outperform a naive deep model simply by crafting the right input representation. Even today, for tabular data, thoughtful feature engineering frequently delivers more improvement than model selection.

Garbage In, Garbage Out

No algorithm, regardless of sophistication, can extract signal from features that don't encode it. A model predicting house prices given only house color cannot learn much β€” not because the algorithm is weak, but because the feature doesn't contain the information. Feature engineering is fundamentally about channeling information into the representation the model sees.

Role of Domain Knowledge

The best features come from understanding the problem deeply. A domain expert knows which combinations of raw measurements are physically meaningful:

  • In finance: price/earnings ratio is more informative than price and earnings separately
  • In NLP: TF-IDF down-weights common words and up-weights rare discriminative terms
  • In healthcare: change in creatinine from baseline is more predictive of kidney failure than absolute creatinine
  • In e-commerce: days since last purchase captures recency better than last purchase date
  • In manufacturing: ratio of actual to expected production rate captures efficiency
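The e-commerce example above can be sketched with pandas (the column names and dates here are hypothetical):

```python
# Sketch: deriving a recency feature from a raw timestamp (hypothetical columns)
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_purchase": pd.to_datetime(["2025-01-01", "2024-12-15", "2024-11-30"]),
})
snapshot = pd.Timestamp("2025-01-31")  # "today" at feature-computation time

# A single number the model can use directly, unlike a raw date
df["days_since_last_purchase"] = (snapshot - df["last_purchase"]).dt.days
print(df["days_since_last_purchase"].tolist())  # [30, 47, 62]
```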

The Feature Engineering Process

  • Exploratory Data Analysis β€” Understand distributions, correlations, outliers, and missing value patterns
  • Hypothesis Generation β€” Based on domain knowledge, hypothesize which transformations would reveal signal
  • Feature Creation β€” Apply transformations, compute interactions, extract time/date components
  • Validation β€” Measure whether new features improve model performance on validation data
  • Iteration β€” Repeat based on model error analysis and residual patterns

Feature Transformation Techniques

Many ML algorithms are sensitive to the scale and distribution of input features. Gradient-based methods (neural networks, logistic regression, SVM) converge faster with normalized features. Distance-based methods (k-NN, k-means) are completely distorted by unscaled features. Tree-based models are scale-invariant β€” but even they benefit from well-shaped feature distributions.

Min-Max Normalization
x' = (x - x_min) / (x_max - x_min)
Scales all values to [0, 1]. Preserves relationships and zero values. Sensitive to outliers β€” one extreme outlier compresses all other values.
Z-Score Standardization
x' = (x - ΞΌ) / Οƒ
Centers at zero with unit variance. Not bounded. Handles outliers better than min-max. Default choice for most gradient-based models.
Log Transform
x' = log(x + 1)
Compresses right-skewed distributions (income, population, prices). +1 avoids log(0). Makes multiplicative relationships additive. Essential for heavy-tailed data.
Polynomial Features
x₁, xβ‚‚ β†’ x₁, xβ‚‚, x₁², xβ‚‚Β², x₁xβ‚‚
Creates interaction and higher-order terms, enabling linear models to learn non-linear patterns. Risk: feature space explodes β€” use with caution and regularization.
RobustScaler
x' = (x - median) / IQR
Uses median and interquartile range instead of mean and std. Resistant to outliers. Best when outliers are real signal, not errors, and you still want scaling.
Quantile Transform
Maps to uniform or normal distribution
Forces any distribution into a desired shape (uniform or Gaussian). Useful for extremely skewed data. Non-parametric β€” works on any distribution.
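As a sketch of how the transforms above behave on the same outlier-laden feature (assuming scikit-learn is available):

```python
# Sketch: the transforms above applied to one feature with an extreme outlier
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # last value is an outlier

minmax = MinMaxScaler().fit_transform(x)    # [0, 1]; outlier squashes the rest near 0
zscore = StandardScaler().fit_transform(x)  # zero mean, unit variance, unbounded
robust = RobustScaler().fit_transform(x)    # (x - median) / IQR; outlier barely matters
logged = np.log1p(x)                        # log(x + 1) compresses the heavy tail

print(minmax.ravel().round(3))  # first four values crowd near zero
```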

Transformation Selection Guide

Situation | Recommended Transform | Why
Neural networks, SVMs, k-NN | Z-Score Standardization | Algorithms sensitive to feature scale; zero-mean helps gradient flow
Need values in [0, 1] range | Min-Max Normalization | Image pixel values, probabilities, bounded outputs
Heavy-tailed / right-skewed data | Log Transform | Makes distribution more Gaussian; stabilizes variance
Outliers present but real | RobustScaler | Median/IQR not affected by extreme values
Linear model on non-linear problem | Polynomial Features | Adds expressiveness without changing algorithm
Tree-based models (RF, XGBoost) | Usually none needed | Trees are invariant to monotonic transformations of features

Encoding Categorical Variables

Most ML algorithms require numerical inputs. Categorical variables β€” city names, product types, user segments β€” must be converted to numbers. The encoding strategy has a major impact on model performance, especially for high-cardinality categories.

Encoding Method | How It Works | Pros | Cons | Best For
One-Hot Encoding | A binary column per category: Red/Blue/Green β†’ [1,0,0] / [0,1,0] / [0,0,1] | No ordinal assumption; works with linear models | Dimensionality explosion with high cardinality; sparse matrix | Low-cardinality nominal features (<20 categories)
Label Encoding | An integer per category: Red=0, Blue=1, Green=2 | Memory-efficient; single column | Implies a false ordinal relationship (Green is "bigger" than Red) | Tree-based models only; ordinal features
Ordinal Encoding | Integers respecting natural order: Low=0, Medium=1, High=2 | Preserves meaningful rank; compact | Assumes equal spacing between ranks | Genuinely ordered categories (education level, rating)
Target Encoding | Category replaced by the mean target value for that category: "City" β†’ avg house price per city | Compact; captures target relationship; handles high cardinality | Data leakage risk; needs cross-fold application; noisy for rare categories | High-cardinality features strongly related to the target
Frequency Encoding | Category replaced by its frequency in the dataset | Simple; handles high cardinality; no leakage risk | Two categories with the same frequency get the same encoding | High-cardinality features where frequency is informative
Binary Encoding | Label encode, then convert the integer to binary bits: 5 β†’ 101 β†’ [1,0,1] | Far fewer columns than one-hot for high cardinality | Less interpretable; binary patterns are arbitrary | Medium-cardinality features (20–1000 categories)
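A minimal sketch of two of these encodings with pandas (the city column is hypothetical):

```python
# Sketch: one-hot vs frequency encoding for a hypothetical "city" column
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice", "Paris"]})

# One-hot: one binary column per category
onehot = pd.get_dummies(df["city"], prefix="city")

# Frequency: each category replaced by its share of the rows
freq = df["city"].map(df["city"].value_counts(normalize=True))
print(freq.tolist())  # [0.6, 0.2, 0.6, 0.2, 0.6]
```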

Target Encoding Leakage Risk

When computing target encoding, you must use only training fold data β€” never the full dataset. In cross-validation, each fold's target encoding must be computed using only the other folds' data. Libraries like category_encoders handle this with leave-one-out or k-fold target encoding to prevent leakage.
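One way to sketch leakage-free K-fold target encoding by hand (the function and variable names are illustrative, not from a specific library):

```python
# Sketch: K-fold target encoding; each row is encoded with means
# computed only from the other folds, so its own target never leaks in
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(cat, target, n_splits=5, seed=0):
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    global_mean = target.mean()  # fallback for categories unseen in the fit folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kf.split(cat):
        # Per-category means from the other folds only
        fold_means = target.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        encoded.iloc[apply_idx] = (
            cat.iloc[apply_idx].map(fold_means).fillna(global_mean).to_numpy()
        )
    return encoded
```

Production implementations such as `category_encoders` add smoothing for rare categories on top of this, but the fold discipline is the part that prevents leakage.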

Feature Selection Methods

Not all features are useful. Irrelevant features add noise, increase training time, and can degrade model performance. Feature selection identifies which features to keep, which to discard, and which combinations matter most.

Three Families of Feature Selection

Method Family | How It Works | Examples | Pros | Cons
Filter Methods | Evaluate features independently of the model using statistical tests; select top-k by score | Pearson correlation, mutual information, chi-squared, ANOVA F-test, variance threshold | Fast; scalable; no model needed; avoids overfitting to a model | Ignores feature interactions; doesn't optimize for a specific model
Wrapper Methods | Search through feature subsets by training and evaluating the model on each subset | Recursive Feature Elimination (RFE), forward selection, backward elimination, exhaustive search | Considers feature interactions; optimized for the specific model | Computationally expensive; risk of overfitting to the validation set
Embedded Methods | Selection happens during model training as part of the learning algorithm itself | LASSO (L1) shrinks weights to zero; tree importance; gradient boosting feature importance | Efficient; considers interactions; less overfitting risk than wrappers | Tied to a specific model type; importance scores can mislead with correlated features
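A filter-method sketch using mutual information, assuming scikit-learn; the synthetic data is constructed so that only the first feature carries (non-linear) signal:

```python
# Sketch: filter-style selection by mutual information (non-linear aware)
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)  # only feature 0 matters, non-linearly

scores = mutual_info_regression(X, y, random_state=0)
print(int(np.argmax(scores)))  # feature 0 gets the highest score
```

Pearson correlation would largely miss this relationship (x vs. xΒ² has near-zero linear correlation for symmetric x), which is why mutual information is the better filter for non-linear signal.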

Recursive Feature Elimination (RFE)

Train the model on all features, rank features by importance, remove the least important, retrain, and repeat until the desired number of features is reached. Works with any model that exposes feature importance or coefficients.

# RFE: repeatedly drop the least important feature and retrain
import numpy as np

def rfe(model, X, y, features, target_k):
    features = list(features)
    while len(features) > target_k:
        model.fit(X[features], y)
        importances = model.feature_importances_
        worst = features[int(np.argmin(importances))]
        features.remove(worst)
    return features  # final selected set
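scikit-learn ships this loop ready-made as `sklearn.feature_selection.RFE`; a minimal sketch on synthetic data:

```python
# Sketch: the same elimination loop via scikit-learn's RFE
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Drop one feature per iteration until 3 remain
selector = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
print(selector.support_)  # boolean mask over the 10 original features
```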

Correlation-Based Filtering

Two types of correlation to check:

  • Feature–Target correlation β€” Keep features that are highly correlated with the target. Remove features with near-zero correlation.
  • Feature–Feature correlation β€” When two features are highly correlated with each other (>0.9), one is largely redundant. Keep the one with higher target correlation.

Pearson correlation captures linear relationships. Use mutual information for non-linear feature-target relationships. Use Spearman for monotonic but non-linear relationships.
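A sketch of feature-feature correlation filtering with pandas (the 0.9 threshold and column names are illustrative):

```python
# Sketch: drop one feature from each highly correlated pair (threshold 0.9)
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.01, size=200)  # near-duplicate of "a"
df["c"] = rng.normal(size=200)                            # independent

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```

This simple rule drops the later column of each pair; the refinement from the list above, keeping whichever member has the higher target correlation, would slot in at the `to_drop` step.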

Dimensionality Reduction

Dimensionality reduction transforms a high-dimensional feature space into a lower-dimensional one while retaining as much useful information as possible. Unlike feature selection (which keeps a subset of original features), dimensionality reduction creates new synthetic features that are combinations of the originals.

The Curse of Dimensionality

As the number of features grows, the amount of data needed to cover the feature space grows exponentially. In 2D, you need roughly N points to fill a square; in 100D, you need N^50 points to achieve equivalent coverage. Distance-based algorithms (k-NN, k-means, SVM) become unreliable in very high dimensions because all points become roughly equidistant. Dimensionality reduction counteracts this phenomenon.
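A quick numerical sketch of distance concentration: the relative spread of pairwise distances collapses as dimensionality grows (the dimensions 2 and 512 are chosen purely for illustration):

```python
# Sketch: relative spread of pairwise distances shrinks as dimensions grow
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 512):  # low- vs high-dimensional uniform data
    X = rng.uniform(size=(100, d))
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))[np.triu_indices(100, k=1)]
    ratios[d] = dists.std() / dists.mean()  # spread relative to typical distance

print({d: round(r, 3) for d, r in ratios.items()})  # ratio collapses in 512D
```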

PCA (Principal Component Analysis)

Finds the directions (principal components) of maximum variance in the data and projects onto them. The components are ordered by how much variance they explain. The first component captures the most variance, the second the next most, and so on.

  • Linear method β€” can only capture linear structure
  • Explained variance ratio tells you how much information each component retains
  • Select n components to retain 95% of total variance
  • Preprocessing: standardize features first (PCA is scale-sensitive)
  • Use for: preprocessing before clustering, linear models, visualization of 2D/3D projections
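A PCA sketch assuming scikit-learn: standardize first, then let a fractional `n_components` pick enough components for 95% variance. The synthetic data has only two underlying directions:

```python
# Sketch: standardize, then keep enough components for 95% of the variance
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))       # 2 true underlying directions
X = latent @ rng.normal(size=(2, 10))    # observed as 10 correlated features

X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=0.95).fit(X_std)    # fractional value = variance target
print(pca.n_components_)  # at most 2, since the data is rank-2
```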

t-SNE

t-Distributed Stochastic Neighbor Embedding preserves local neighborhood structure for visualization. Points that are near each other in high-dimensional space tend to land near each other in the 2D or 3D projection, though distances between well-separated clusters carry little meaning.

  • Non-linear β€” can reveal complex manifold structure
  • Primarily for visualization only (2D/3D) β€” not for downstream ML
  • Stochastic β€” run multiple times to check stability
  • Perplexity hyperparameter (~5–50) controls neighborhood size
  • Slow for very large datasets (>50k samples)
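A minimal t-SNE sketch on a subset of the digits dataset, assuming scikit-learn (the 500-sample cap just keeps it fast):

```python
# Sketch: t-SNE projection of digit images to 2D for plotting
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X = X[:500]  # small subset keeps this quick

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (500, 2)
```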

UMAP

Uniform Manifold Approximation and Projection is faster than t-SNE and better preserves global structure. It is rapidly becoming the preferred method for large-scale high-dimensional visualization and, in some cases, preprocessing.

  • Much faster than t-SNE for large datasets
  • Preserves both local and global structure better
  • Can be used for general dimensionality reduction (not just 2D)
  • Supports supervised, semi-supervised, and unsupervised modes
  • Used in single-cell RNA sequencing, NLP embedding visualization

When to Use Each Method

Method | Type | Primary Use | Linear? | Scale
PCA | Global variance | Preprocessing, linear structure | Yes | Any (fast)
t-SNE | Local neighborhood | 2D/3D visualization only | No | <50k samples
UMAP | Local + global | Visualization & preprocessing | No | Millions of samples
Autoencoder | Reconstruction | Non-linear compression, anomaly detection | No | Any (GPU helps)
ICA | Statistical independence | Signal separation (audio, EEG) | Yes | Moderate
LDA (Linear Discriminant) | Class separability | Supervised dimensionality reduction | Yes | Any

Feature Engineering in Practice

Start with good transformations and clean encoding before reaching for complex models. In Kaggle competitions, the most successful solutions typically involve extensive domain-driven feature engineering. A gradient boosting model with excellent features will outperform a deep neural network on poor ones. Invest in understanding your data before tuning algorithms β€” the return is far higher.