⏱ 7 min read 📊 Beginner 🗓 Updated Jan 2025
🔍 Why EDA Matters

Find Problems Before They Become Expensive

Data quality issues discovered during EDA take minutes to fix. The same issues discovered after a week of training cost far more — in compute, time, and credibility with stakeholders.

  • Label leakage: a feature that directly encodes the target variable
  • Train/test contamination: the same entity appearing in both splits
  • Silent joins gone wrong: a many-to-many join that multiplied rows unnoticed
  • Temporal leakage: future information encoded in "current" features
  • Near-constant features: variables with 99% the same value provide no signal
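Two of these checks take only a few lines of pandas. A minimal sketch — the `entity_id` key and the split DataFrames are hypothetical placeholders for your own schema:

```python
import pandas as pd

def overlapping_entities(train: pd.DataFrame, test: pd.DataFrame, key: str) -> set:
    """Entity IDs present in both splits -- candidates for train/test contamination."""
    return set(train[key]) & set(test[key])

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Columns whose most frequent value covers >= `threshold` of rows."""
    flagged = []
    for col in df.columns:
        top_frac = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_frac >= threshold:
            flagged.append(col)
    return flagged
```

Run both before any modeling; a non-empty overlap set or a flagged column warrants a manual look, not an automatic drop.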

Discover What Your Data Actually Says

Documentation lies. Schema definitions are aspirational. The data actually in production tells a different story — and EDA is how you read it.

  • Distributions are rarely Gaussian — most real-world features are skewed or multimodal
  • Outliers are often real events that the model must handle, not noise to discard
  • Correlations between features suggest redundancy and multicollinearity risks
  • Missing value patterns encode information — missingness is rarely completely random
  • Class imbalance ratios determine your evaluation metric choices

Guide Feature Engineering

The best features are usually suggested by the data itself. EDA reveals which transformations, interactions, and aggregations are worth trying.

  • Highly skewed continuous features suggest log-transformation
  • Bimodal distributions suggest a hidden categorical split
  • Strong correlation between two features suggests ratio or interaction features
  • Cyclic datetime patterns (hour, day-of-week) suggest sinusoidal encoding
  • Groups in scatter plots suggest segment-specific models or interaction terms

Validate Your Assumptions

Every ML project starts with assumptions about the data. EDA is where you test them before committing to a model architecture and feature set.

  • "The classes are roughly balanced" — check this; often wrong
  • "Feature X is a good predictor of Y" — check correlation; sometimes spurious
  • "The data is i.i.d." — check for temporal autocorrelation and batch effects
  • "Missing values are random" — visualize missingness patterns; often systematic
  • "The test set is representative" — compare distributions between train and test
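The last assumption on the list can be tested directly with a two-sample Kolmogorov–Smirnov test per column. A sketch, assuming numeric features and hypothetical train/test DataFrames:

```python
import pandas as pd
from scipy.stats import ks_2samp

def split_shift_report(train, test, num_cols, alpha=0.01):
    """KS test per numeric feature; small p-values flag train/test distribution shift."""
    rows = []
    for col in num_cols:
        stat, p = ks_2samp(train[col].dropna(), test[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p,
                     "shifted": p < alpha})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
```

With many features, remember that some p-values will be small by chance; treat the report as a triage list, not a verdict.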

The 80/20 Reality of ML Projects

Studies consistently show that data work — collection, cleaning, validation, and EDA — consumes 70–80% of total ML project time. EDA is not optional overhead; it is the core of the work. A thorough EDA makes the remaining 20–30% (modeling) dramatically more effective by ensuring you're optimizing the right objective with the right features on clean, well-understood data.

Famous EDA Disasters

In one widely cited study, a wolf-vs-husky image classifier turned out to be keying on snow in the background rather than the animals — the model had learned the photographer's habits, not the classes. A healthcare ML team trained a pneumonia risk model that inadvertently learned that asthma patients had lower pneumonia risk — asthmatics were routed straight to the ICU and received aggressive care, so their observed mortality was lower, and the model would have steered them away from exactly the care that protected them. Both failures would have been caught by thorough EDA and data auditing before modeling began.

📊 Univariate Analysis

Numerical Features

Analyze each numerical feature in isolation before considering relationships. Key statistics and plots:

  • Histogram: reveals distribution shape — normal, skewed, bimodal, uniform
  • Box plot: shows median, IQR, and outliers using Tukey fences (1.5× IQR)
  • QQ plot: compare empirical quantiles against theoretical normal distribution
  • Skewness: |skew| > 1 is highly skewed; 0.5 < |skew| ≤ 1 is moderately skewed
  • Kurtosis: high kurtosis (leptokurtic) indicates heavy tails and extreme outlier risk
  • Mean vs median: large gap indicates skew; use median for robust central tendency reporting

Categorical Features

Categorical analysis focuses on the count and frequency of each level, plus cardinality assessment.

  • Value counts: frequency of each category; sort descending to see dominant levels
  • Bar chart: visualize value counts; horizontal orientation for long category names
  • Cardinality: number of unique values — high cardinality needs special encoding strategies
  • Rare categories: categories with <1% frequency may need grouping into "Other"
  • Unexpected values: typos, inconsistent casing ("Male"/"male"/"M"), legacy codes
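The last two bullets are cheap to automate. A sketch — the 1% cutoff and the "Other" label are common conventions, not rules:

```python
import pandas as pd

def clean_categories(s: pd.Series, min_frac: float = 0.01,
                     other: str = "Other") -> pd.Series:
    """Normalize casing/whitespace, then lump levels rarer than `min_frac` into `other`."""
    s = s.astype("string").str.strip().str.lower()
    freq = s.value_counts(normalize=True)
    rare = freq[freq < min_frac].index
    return s.where(~s.isin(rare), other)
```

Inspect the value counts before and after: casing fixes alone often collapse several "distinct" levels into one.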

Datetime & Time Series Features

Temporal data requires its own analysis to reveal trends, seasonality, and gaps.

  • Time-series plot: raw values over time; reveals trend and outlier events
  • Seasonality decomposition: STL decomposition splits trend + seasonal + residual components
  • Autocorrelation (ACF/PACF): detect lag-based dependencies
  • Gap detection: missing time periods that could indicate data pipeline failures
  • Distribution by time bucket: hour-of-day, day-of-week — reveals cyclic patterns
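Gap detection and lag-based dependence need no specialized library (STL itself lives in `statsmodels.tsa.seasonal.STL`). A plain-pandas sketch; the daily frequency and the lag choices are assumptions about your data:

```python
import numpy as np
import pandas as pd

def find_gaps(ts_index: pd.DatetimeIndex, freq: str = "D") -> pd.DatetimeIndex:
    """Timestamps a regular `freq` grid expects but the index is missing."""
    expected = pd.date_range(ts_index.min(), ts_index.max(), freq=freq)
    return expected.difference(ts_index)

def lag_autocorr(series: pd.Series, lags=(1, 7, 30)) -> dict:
    """Autocorrelation at selected lags; a spike at lag 7 suggests weekly seasonality."""
    return {lag: series.autocorr(lag) for lag in lags}
```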

Missing Value Analysis

Missingness should never be ignored — its pattern is itself informative.

  • Missing percentage per feature: sort descending; features >80% missing may need dropping
  • Missing completely at random (MCAR): missingness unrelated to any variable
  • Missing at random (MAR): missingness related to other observed variables — can impute
  • Missing not at random (MNAR): missingness related to the missing value itself — hardest case
  • Missingness heatmap: visualize which rows have missing values across which columns simultaneously
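A quick way to probe MCAR vs MAR: compare another column's mean between rows where the value is missing and rows where it is present. A sketch; the column names are placeholders:

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percent missing per column, sorted descending."""
    return df.isna().mean().mul(100).round(2).sort_values(ascending=False)

def missingness_assoc(df: pd.DataFrame, col: str, by: str):
    """Mean of `by` where `col` is missing vs present; a large gap hints at MAR, not MCAR."""
    mask = df[col].isna()
    return df.loc[mask, by].mean(), df.loc[~mask, by].mean()
```

MNAR cannot be detected this way — by definition the evidence is in the values you never observed — so domain knowledge still has the final word.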

Pandas Describe + Seaborn Histplot

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load dataset
df = pd.read_csv("dataset.csv")

# --- Basic statistics ---
print("Shape:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\nDescriptive statistics:\n", df.describe())

# Split columns by dtype
num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# --- Skewness and kurtosis for all numerical features ---
print("\nSkewness:")
for col in num_cols:
    sk = df[col].skew()
    ku = df[col].kurtosis()
    flag = " << HIGH SKEW" if abs(sk) > 1 else ""
    print(f"  {col}: skew={sk:.3f}, kurtosis={ku:.3f}{flag}")

# --- Histogram grid for numerical features ---
n = len(num_cols)
cols_per_row = 3
rows = (n + cols_per_row - 1) // cols_per_row
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 4 * rows))
axes = axes.flatten()

for i, col in enumerate(num_cols):
    sns.histplot(data=df, x=col, kde=True, ax=axes[i],
                 color="#00d9ff", alpha=0.7)
    axes[i].set_title(f"{col}\n(skew: {df[col].skew():.2f})")
    axes[i].set_facecolor("#0a0e27")

# Hide unused subplots
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.savefig("univariate_histograms.png", dpi=150, bbox_inches="tight")

# --- Categorical value counts ---
for col in cat_cols:
    print(f"\n{col} ({df[col].nunique()} unique values):")
    print(df[col].value_counts(normalize=True).head(10).to_string())

# --- Missing value heatmap ---
import missingno as msno
msno.heatmap(df, figsize=(12, 6))
plt.savefig("missing_heatmap.png", dpi=150)

🔗 Bivariate & Multivariate Analysis

Correlation Analysis

Measure pairwise linear and monotonic relationships between numerical features.

  • Pearson r: linear correlation; assumes normality; sensitive to outliers
  • Spearman ρ: monotonic correlation; rank-based; robust to outliers; preferred for skewed data
  • Kendall τ: concordant vs discordant pairs; more robust than Spearman; slower for large datasets
  • Strong correlations (>0.8) between features indicate multicollinearity — one may be droppable
  • Strong correlation with target indicates high predictive utility

Target Correlation Analysis

For supervised learning, the most important bivariate analysis is feature-vs-target correlation.

  • Rank features by absolute Spearman correlation with target
  • Low correlation does not mean useless — interactions and nonlinear relations matter
  • For classification: plot class-conditional distributions of each feature
  • Box plots (feature value | class) reveal separability at a glance
  • Mutual information captures nonlinear relationships that correlation misses

Multicollinearity Detection

Correlated features cause instability in linear models and inflate coefficient variances.

  • Correlation matrix heatmap: visualize all pairwise correlations at once
  • VIF (Variance Inflation Factor): VIF > 5 suggests problematic collinearity; VIF > 10 is severe
  • Cluster correlated features together and keep one representative per cluster
  • PCA collapses collinear features into orthogonal components
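statsmodels ships `variance_inflation_factor`, but VIF can also be read off the diagonal of the inverted correlation matrix, which is mathematically equivalent to regressing each feature on all the others. A numpy sketch under that identity:

```python
import numpy as np
import pandas as pd

def vif_scores(df_num: pd.DataFrame) -> pd.Series:
    """VIF per feature: diagonal of the inverse correlation matrix.
    Equivalent to 1 / (1 - R^2) from regressing each feature on the rest."""
    corr = df_num.corr().to_numpy()
    inv = np.linalg.inv(corr)  # raises LinAlgError if features are perfectly collinear
    return pd.Series(np.diag(inv), index=df_num.columns).sort_values(ascending=False)
```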

Categorical vs Continuous Relationships

When one variable is categorical and another is continuous, different plots reveal different aspects of the relationship.

  • Box plot: median and spread per category; good for outlier comparison
  • Violin plot: shows full distribution per category; better than box for multimodal distributions
  • Strip plot + swarm plot: all individual data points overlaid per category
  • ANOVA / Kruskal-Wallis test: statistical test for mean differences across groups
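The Kruskal-Wallis variant is the safer default because it does not assume normally distributed groups. A sketch, assuming a DataFrame with one categorical and one continuous column:

```python
import pandas as pd
from scipy.stats import kruskal

def group_difference_test(df: pd.DataFrame, cat_col: str, num_col: str):
    """Kruskal-Wallis H test: do groups differ in their distribution of num_col?"""
    groups = [g[num_col].dropna().to_numpy() for _, g in df.groupby(cat_col)]
    return kruskal(*groups)  # returns (H statistic, p-value)
```

A significant result says the groups differ somewhere, not which pairs differ — follow up with the box or violin plots above.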

Seaborn Pairplot + Correlation Heatmap

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")
target_col = "label"
num_cols = df.select_dtypes(include=np.number).columns.tolist()

# --- Correlation matrix (Spearman — robust to outliers and skew) ---
corr_matrix = df[num_cols].corr(method="spearman")

fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # show lower triangle only
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True, fmt=".2f",
    cmap="coolwarm", center=0,
    linewidths=0.5,
    ax=ax
)
ax.set_title("Spearman Correlation Matrix", fontsize=14)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")

# --- Feature correlations with target ---
target_corr = corr_matrix[target_col].drop(target_col).sort_values(ascending=False)
print("Feature correlations with target:")
print(target_corr.to_string())

# --- Pairplot for top correlated features (small feature sets only) ---
top_features = target_corr.abs().nlargest(5).index.tolist() + [target_col]
pair_df = df[top_features].copy()

pair_grid = sns.pairplot(
    pair_df,
    hue=target_col if df[target_col].nunique() <= 10 else None,
    diag_kind="kde",
    plot_kws={"alpha": 0.5, "s": 20}
)
pair_grid.fig.suptitle("Pairplot — Top 5 Features vs Target", y=1.02)
plt.savefig("pairplot.png", dpi=150, bbox_inches="tight")

# --- Mutual information (captures nonlinear relationships) ---
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X = df[num_cols].drop(columns=[target_col]).fillna(0)
y = df[target_col]

# Use classif for discrete target, regression for continuous
mi = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print("\nMutual information with target:")
print(mi_series.to_string())

# --- Violin plots: top features vs categorical target ---
fig, axes = plt.subplots(1, min(3, len(X.columns)), figsize=(15, 5),
                         squeeze=False)  # keep 2-D axes even for a single subplot
for i, col in enumerate(mi_series.index[:3]):
    sns.violinplot(data=df, x=target_col, y=col, ax=axes[0, i], palette="muted")
    axes[0, i].set_title(f"{col} by target class")
plt.tight_layout()
plt.savefig("violin_plots.png", dpi=150, bbox_inches="tight")

🤖 Automated EDA Tools

Manual EDA is thorough but time-consuming. Automated tools generate comprehensive reports in one or two lines of code, giving you a fast first pass that you can then drill into manually.

ydata-profiling

Formerly pandas-profiling, this is the most popular automated EDA tool. One line generates a comprehensive HTML report covering statistics, distributions, correlations, and missing values for every column.

from ydata_profiling import ProfileReport
import pandas as pd

df = pd.read_csv("dataset.csv")
profile = ProfileReport(
    df,
    title="Dataset EDA Report",
    explorative=True,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    }
)
profile.to_file("eda_report.html")

Sweetviz

Sweetviz specializes in train/test comparison — generating side-by-side visualizations that make distribution shift between splits immediately obvious. Essential for validating train/test splits.

import sweetviz as sv
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Compare train vs test distributions
compare_report = sv.compare(
    [train, "Train"], [test, "Test"],
    target_feat="label"
)
compare_report.show_html("train_test_comparison.html")

# Single dataset analysis
analyze_report = sv.analyze(train, target_feat="label")
analyze_report.show_html("train_analysis.html")

D-Tale

D-Tale provides an interactive web UI for exploring pandas DataFrames. You can filter, sort, chart, and compute statistics interactively — useful for ad-hoc exploration in development environments.

import dtale
import pandas as pd

df = pd.read_csv("dataset.csv")

# Launch interactive UI (opens browser tab)
d = dtale.show(df)
d.open_browser()

# In Jupyter
import dtale.app as dtale_app
dtale_app.USE_COLAB = False  # set True for Colab
dtale.show(df)

Lux

Lux is a Jupyter-native library that automatically recommends visualizations based on what attributes it detects in your DataFrame. It surfaces non-obvious patterns without you needing to specify what to plot.

import lux
import pandas as pd

df = pd.read_csv("dataset.csv")

# Just displaying the DataFrame in Jupyter
# activates Lux's automatic recommendation engine
df  # shows interactive Lux widget

# Specify an intent to focus recommendations
df.intent = ["label"]  # recommendations relative to target
df  # now shows intent-driven recommendations

Tool Comparison

Tool | Output Format | Best For | Install
ydata-profiling | Standalone HTML report; also JSON/widget | Full dataset audit; first pass on any new dataset | pip install ydata-profiling
Sweetviz | Standalone HTML report with side-by-side panels | Train/test comparison; detecting split distribution shift | pip install sweetviz
D-Tale | Interactive browser-based UI | Ad-hoc exploration; non-technical stakeholder demos | pip install dtale
Lux | Jupyter widget with interactive chart recommendations | Rapid visual exploration; discovering unexpected patterns | pip install lux-api
AutoViz | Matplotlib/Plotly charts saved to file | Quick automated charting; scriptable batch EDA | pip install autoviz

Automated EDA Is a Starting Point, Not a Destination

Automated tools catch obvious issues and generate a comprehensive overview in minutes. But they cannot replace domain knowledge. A profiling report won't tell you that "transaction_amount = 0.01" is suspicious for your fraud dataset, or that a certain feature combination is physically impossible. Always follow automated EDA with manual domain-specific investigation.

⚙️ EDA-Driven Feature Engineering Insights

Transforming Skewed Features

EDA reveals which features need transformation before they can be useful to linear models or distance-based algorithms.

  • Log transform: apply to right-skewed features (income, request counts, file sizes); use log(x+1) to handle zeros
  • Square root transform: milder than log; useful for count data with occasional large values
  • Box-Cox transform: data-driven power transform; requires all positive values
  • Yeo-Johnson transform: like Box-Cox but handles zeros and negatives
  • Always check skewness before and after; target |skew| < 0.5
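A before/after skew check on synthetic right-skewed data, sketched with scikit-learn's PowerTransformer (the lognormal sample stands in for a real skewed feature):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, strictly positive

x_log = np.log1p(x)                          # log(x + 1) handles zeros safely
pt = PowerTransformer(method="yeo-johnson")  # "box-cox" would need all-positive input
x_yj = pt.fit_transform(x.reshape(-1, 1)).ravel()

print(f"skew: raw={skew(x):.2f}  log1p={skew(x_log):.2f}  yeo-johnson={skew(x_yj):.2f}")
```

Per the checklist above, verify |skew| dropped below roughly 0.5 after transforming; if it did not, try the other transform.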

Encoding Outliers as Features

Rather than removing outliers, EDA may reveal they are meaningful events worth flagging explicitly.

  • Create binary "is_outlier_X" flag for extreme values in feature X
  • Winsorize: cap extreme values at 1st/99th percentile instead of dropping
  • In fraud detection, outliers are often the signal — never blindly remove them
  • Use IQR method (below Q1 - 1.5×IQR or above Q3 + 1.5×IQR) for outlier boundaries
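The last three bullets combined — flag via Tukey fences, then winsorize — as a sketch on a hypothetical column:

```python
import pandas as pd

def tukey_fences(s: pd.Series, k: float = 1.5):
    """Classic outlier boundaries: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_and_winsorize(df: pd.DataFrame, col: str,
                       lower_q: float = 0.01, upper_q: float = 0.99) -> pd.DataFrame:
    """Record an is_outlier flag FIRST, then cap values at the 1st/99th percentile."""
    out = df.copy()
    lo_f, hi_f = tukey_fences(out[col])
    out[f"is_outlier_{col}"] = ((out[col] < lo_f) | (out[col] > hi_f)).astype(int)
    out[col] = out[col].clip(out[col].quantile(lower_q), out[col].quantile(upper_q))
    return out
```

The ordering matters: computing the flag after capping would erase the very signal you are trying to preserve.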

Binning & Interaction Features

EDA-revealed non-linear relationships and bimodal distributions suggest creative feature engineering.

  • Binning/bucketing: convert continuous age into "young/middle/senior" buckets if EDA shows non-linear relationship with target
  • Interaction features: multiply two correlated features; ratio of two related features (e.g., debt/income)
  • Polynomial features: add x² or x³ if scatter plot suggests quadratic relationship with target
  • Document every feature engineering decision with the EDA evidence that motivated it
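The first two bullets in miniature, on made-up ages and a debt/income ratio (the bin edges are illustrative, not recommendations — derive yours from the EDA):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [22, 35, 41, 67, 29, 58],
    "debt":   [5_000, 12_000, 9_000, 3_000, 20_000, 7_000],
    "income": [40_000, 60_000, 55_000, 30_000, 45_000, 80_000],
})

# Bucket a continuous feature whose relation to the target looked non-linear in EDA
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 55, 120],
                          labels=["young", "middle", "senior"])

# Ratio feature suggested by a correlated pair
df["debt_to_income"] = df["debt"] / df["income"]
```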

Datetime & Missing Value Features

Temporal features and missingness patterns are two of the most commonly overlooked sources of signal.

  • Datetime decomposition: extract hour, day-of-week, month, quarter, is_weekend, is_holiday
  • Time since event: days since last purchase, hours since last login
  • Missing as signal: create "was_X_missing" binary flag before imputing X
  • Missingness count: total number of missing features per row as a feature
  • Recency, frequency, monetary (RFM): aggregate time-series data into summary statistics
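A toy sketch of the first three bullets; the columns are hypothetical, and the reference time is pinned to a constant so the example is reproducible (use pd.Timestamp.now() in practice):

```python
import pandas as pd

df = pd.DataFrame({
    "last_login": pd.to_datetime(["2025-01-03 08:15", "2025-01-04 22:40", None]),
    "income": [52_000.0, None, 48_000.0],
})

# Missing-as-signal flag, created BEFORE any imputation overwrites the pattern
df["was_income_missing"] = df["income"].isna().astype(int)

# Datetime decomposition
ts = df["last_login"]
df["login_hour"] = ts.dt.hour
df["login_dow"] = ts.dt.dayofweek        # Monday = 0
df["login_is_weekend"] = ts.dt.dayofweek.isin([5, 6])

# Recency relative to a fixed reference time
now = pd.Timestamp("2025-01-06")
df["days_since_login"] = (now - ts).dt.days
```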

Produce a Written Data Quality Report

EDA should culminate in a written document — not just notebooks with charts. The data quality report should include: dataset dimensions and provenance, feature-by-feature statistics and quality assessments, identified issues and how they were addressed, class balance, train/test distribution comparison, and recommended feature engineering steps. This report should be version-controlled alongside the code and referenced by anyone who works with the dataset.

EDA to Feature Engineering Workflow

EDA Finding | Feature Engineering Response | Why It Helps
Feature skewness > 1 | Log or Box-Cox transform | Normalizes distribution for linear models; reduces outlier influence
Bimodal distribution in feature X | Create binary split feature (X > threshold) | Captures the hidden categorical structure explicitly
Strong correlation between X and Y | Add X/Y ratio as interaction feature | Ratio often more informative than either component alone
Datetime column present | Extract hour, day-of-week, is_weekend, days_since | Exposes cyclic and recency patterns invisible in raw timestamp
Systematic missingness in column X | Add binary "was_X_missing" flag before imputation | Preserves the informational signal that missingness itself carries
Outliers in critical feature | Winsorize + add "is_extreme_X" flag | Prevents outlier distortion while retaining the event as a signal
Near-constant feature (>99% same value) | Drop the feature | Zero variance means zero predictive power; adds noise and compute cost
High cardinality categorical (1000+ levels) | Target encoding or embedding; rare-level grouping | One-hot encoding would create a sparse high-dimensional space; target encoding captures signal compactly