Find Problems Before They Become Expensive
Data quality issues discovered during EDA take minutes to fix. The same issues discovered after a week of training cost far more — in compute, time, and credibility with stakeholders.
- Label leakage: a feature that directly encodes the target variable
- Train/test contamination: the same entity appearing in both splits
- Silent joins gone wrong: many-to-many join that multiplied rows unnoticed
- Temporal leakage: future information encoded in "current" features
- Near-constant features: variables with 99% the same value provide no signal
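Two of these checks reduce to a few lines of pandas. A minimal sketch, assuming an entity key named `user_id` and a 99% constancy threshold (both illustrative, adapt to your schema):

```python
import pandas as pd

def near_constant_features(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Columns where the single most frequent value covers >= threshold of rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

def split_contamination(train: pd.DataFrame, test: pd.DataFrame, key: str) -> set:
    """Entities (by key column) that appear in both splits."""
    return set(train[key]) & set(test[key])

# Tiny made-up example: one constant column, one shared entity
train = pd.DataFrame({"user_id": [1, 2, 3], "flag": [0, 0, 0]})
test = pd.DataFrame({"user_id": [3, 4], "flag": [0, 1]})
print(near_constant_features(train))              # ['flag']
print(split_contamination(train, test, "user_id"))  # {3}
```

Run both before any training; an unexpected non-empty result is a stop-the-line finding.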
Discover What Your Data Actually Says
Documentation lies. Schema definitions are aspirational. The actual data in production tells a different story — and EDA is how you read it.
- Distributions are rarely Gaussian — most real-world features are skewed or multimodal
- Outliers are often real events that the model must handle, not noise to discard
- Correlations between features suggest redundancy and multicollinearity risks
- Missing value patterns encode information — missingness is rarely completely random
- Class imbalance ratios determine your evaluation metric choices
Guide Feature Engineering
The best features are usually suggested by the data itself. EDA reveals which transformations, interactions, and aggregations are worth trying.
- Highly skewed continuous features suggest log-transformation
- Bimodal distributions suggest a hidden categorical split
- Strong correlation between two features suggests ratio or interaction features
- Cyclic datetime patterns (hour, day-of-week) suggest sinusoidal encoding
- Groups in scatter plots suggest segment-specific models or interaction terms
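The cyclic-encoding suggestion can be sketched in a few lines (a 24-hour period is assumed); the point is that hour 23 and hour 0 become neighbors in feature space instead of being 23 units apart:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": range(24)})

# Sinusoidal encoding: map hour onto the unit circle
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```

The same pattern works for day-of-week (period 7) or month (period 12).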
Validate Your Assumptions
Every ML project starts with assumptions about the data. EDA is where you test them before committing to a model architecture and feature set.
- "The classes are roughly balanced" — check this; often wrong
- "Feature X is a good predictor of Y" — check correlation; sometimes spurious
- "The data is i.i.d." — check for temporal autocorrelation and batch effects
- "Missing values are random" — visualize missingness patterns; often systematic
- "The test set is representative" — compare distributions between train and test
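The last two checks are nearly one-liners. A sketch on synthetic data (column names are illustrative) using a two-sample Kolmogorov-Smirnov test for train/test representativeness:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
train = pd.DataFrame({"x": rng.normal(0, 1, 500), "y": rng.integers(0, 2, 500)})
test = pd.DataFrame({"x": rng.normal(0.5, 1, 500)})  # deliberately shifted

# "The classes are roughly balanced" -- verify, don't assume
print(train["y"].value_counts(normalize=True))

# "The test set is representative" -- two-sample KS test per feature
ks_stat, p_value = stats.ks_2samp(train["x"], test["x"])
print(f"KS statistic={ks_stat:.3f}, p={p_value:.4g}")
if p_value < 0.01:
    print("Distributions differ -- the test set may not be representative")
```

Repeat the KS test per numerical feature; a handful of low p-values across many features is expected, a systematic pattern is not.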
The 80/20 Reality of ML Projects
Studies consistently show that data work — collection, cleaning, validation, and EDA — consumes 70–80% of total ML project time. EDA is not optional overhead; it is the core of the work. A thorough EDA makes the remaining 20–30% (modeling) dramatically more effective by ensuring you're optimizing the right objective with the right features on clean, well-understood data.
Famous EDA Disasters
A widely repeated cautionary tale holds that early image classifiers learned background context rather than the labeled subject, keying on scenery or the people in frame instead of the object itself. A better-documented case comes from healthcare: a pneumonia risk model learned that asthma patients had lower pneumonia mortality — not because asthma is protective, but because asthmatic patients were routed straight to the ICU and treated aggressively, so fewer of them died in the study cohort. Both failure modes would have been caught by thorough EDA and data auditing before modeling began.
Numerical Features
Analyze each numerical feature in isolation before considering relationships. Key statistics and plots:
- Histogram: reveals distribution shape — normal, skewed, bimodal, uniform
- Box plot: shows median, IQR, and outliers using Tukey fences (1.5× IQR)
- QQ plot: compare empirical quantiles against theoretical normal distribution
- Skewness: |skew| > 1 indicates high skew; 0.5 < |skew| ≤ 1 indicates moderate skew
- Kurtosis: high kurtosis (leptokurtic) indicates heavy tails and extreme outlier risk
- Mean vs median: large gap indicates skew; use median for robust central tendency reporting
Categorical Features
Categorical analysis focuses on the count and frequency of each level, plus cardinality assessment.
- Value counts: frequency of each category; sort descending to see dominant levels
- Bar chart: visualize value counts; horizontal orientation for long category names
- Cardinality: number of unique values — high cardinality needs special encoding strategies
- Rare categories: categories with <1% frequency may need grouping into "Other"
- Unexpected values: typos, inconsistent casing ("Male"/"male"/"M"), legacy codes
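A hedged sketch of cleaning inconsistent casing and grouping rare levels — the alias map and the 1% cutoff are assumptions to adapt per dataset:

```python
import pandas as pd

s = pd.Series(["Male", "male", "M", "Female", "female", "F", "Male", "unknwn"])

# Normalize whitespace and casing first, then map known aliases to canonical levels
canonical = {"male": "male", "m": "male", "female": "female", "f": "female"}
cleaned = s.str.strip().str.lower().map(canonical).fillna("other")
print(cleaned.value_counts())

def group_rare(series: pd.Series, min_freq: float = 0.01) -> pd.Series:
    """Replace levels with frequency below min_freq by 'Other'."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_freq].index
    return series.where(~series.isin(rare), "Other")

s2 = pd.Series(["a"] * 200 + ["b"])   # 'b' is below 1% frequency
print(group_rare(s2).value_counts())
```

Always inspect the full `value_counts()` output before mapping; silent aliases ("N/A", "-", "unknown") are common.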
Datetime & Time Series Features
Temporal data requires its own analysis to reveal trends, seasonality, and gaps.
- Time-series plot: raw values over time; reveals trend and outlier events
- Seasonality decomposition: STL decomposition splits trend + seasonal + residual components
- Autocorrelation (ACF/PACF): detect lag-based dependencies
- Gap detection: missing time periods that could indicate data pipeline failures
- Distribution by time bucket: hour-of-day, day-of-week — reveals cyclic patterns
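Gap detection and time-bucket profiling need only pandas. A sketch assuming an hourly feed (match the frequency to your data):

```python
import pandas as pd

# Hypothetical hourly sensor feed with two hours deliberately removed
idx = pd.date_range("2024-01-01", periods=48, freq="h")
ts = pd.Series(range(48), index=idx).drop(idx[[10, 11]])

# Gap detection: rebuild the expected index and diff against the actual one
expected = pd.date_range(ts.index.min(), ts.index.max(), freq="h")
gaps = expected.difference(ts.index)
print("Missing periods:", list(gaps))

# Distribution by time bucket reveals cyclic patterns
by_hour = ts.groupby(ts.index.hour).mean()
print(by_hour.head())
```

Unexpected gaps usually point at pipeline failures rather than genuinely absent events, so treat them as data quality findings first.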
Missing Value Analysis
Missingness should never be ignored — its pattern is itself informative.
- Missing percentage per feature: sort descending; features >80% missing may need dropping
- Missing completely at random (MCAR): missingness unrelated to any variable
- Missing at random (MAR): missingness related to other observed variables — can impute
- Missing not at random (MNAR): missingness related to the missing value itself — hardest case
- Missingness heatmap: visualize which rows have missing values across which columns simultaneously
Pandas Describe + Seaborn Histplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Load dataset
df = pd.read_csv("dataset.csv")
# --- Basic statistics ---
print("Shape:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\nDescriptive statistics:\n", df.describe())
# Numerical columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
# --- Skewness and kurtosis for all numerical features ---
print("\nSkewness:")
for col in num_cols:
    sk = df[col].skew()
    ku = df[col].kurtosis()
    flag = " << HIGH SKEW" if abs(sk) > 1 else ""
    print(f"  {col}: skew={sk:.3f}, kurtosis={ku:.3f}{flag}")
# --- Histogram grid for numerical features ---
n = len(num_cols)
cols_per_row = 3
rows = (n + cols_per_row - 1) // cols_per_row
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 4 * rows))
axes = axes.flatten()
for i, col in enumerate(num_cols):
    sns.histplot(data=df, x=col, kde=True, ax=axes[i],
                 color="#00d9ff", alpha=0.7)
    axes[i].set_title(f"{col}\n(skew: {df[col].skew():.2f})")
    axes[i].set_facecolor("#0a0e27")
# Hide unused subplots
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.savefig("univariate_histograms.png", dpi=150, bbox_inches="tight")
# --- Categorical value counts ---
for col in cat_cols:
    print(f"\n{col} ({df[col].nunique()} unique values):")
    print(df[col].value_counts(normalize=True).head(10).to_string())
# --- Missing value heatmap ---
import missingno as msno
msno.heatmap(df, figsize=(12, 6))
plt.savefig("missing_heatmap.png", dpi=150)
Correlation Analysis
Measure pairwise linear and monotonic relationships between numerical features.
- Pearson r: linear correlation; assumes normality; sensitive to outliers
- Spearman ρ: monotonic correlation; rank-based; robust to outliers; preferred for skewed data
- Kendall τ: concordant vs discordant pairs; more robust than Spearman; slower for large datasets
- Strong correlations (>0.8) between features indicate multicollinearity — one may be droppable
- Strong correlation with target indicates high predictive utility
Target Correlation Analysis
For supervised learning, the most important bivariate analysis is feature-vs-target correlation.
- Rank features by absolute Spearman correlation with target
- Low correlation does not mean useless — interactions and nonlinear relations matter
- For classification: plot class-conditional distributions of each feature
- Box plots (feature value | class) reveal separability at a glance
- Mutual information captures nonlinear relationships that correlation misses
Multicollinearity Detection
Correlated features cause instability in linear models and inflate coefficient variances.
- Correlation matrix heatmap: visualize all pairwise correlations at once
- VIF (Variance Inflation Factor): VIF > 5 suggests problematic collinearity; VIF > 10 is severe
- Cluster correlated features together and keep one representative per cluster
- PCA collapses collinear features into orthogonal components
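VIF is usually computed with statsmodels' `variance_inflation_factor`; equivalently, each feature's VIF is the corresponding diagonal element of the inverse of the feature correlation matrix. This numpy-only sketch uses that identity on synthetic collinear data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.1, size=300),  # nearly collinear with x1
    "x3": rng.normal(size=300),                          # independent
})

# VIF_i equals the i-th diagonal element of the inverse correlation matrix
corr = X.corr().values
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif.round(2))  # x1 and x2 far exceed 5; x3 stays near 1
```

For production use, prefer the statsmodels implementation, which also handles an explicit intercept term.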
Categorical vs Continuous Relationships
When one variable is categorical and another is continuous, different plots reveal different aspects of the relationship.
- Box plot: median and spread per category; good for outlier comparison
- Violin plot: shows full distribution per category; better than box for multimodal distributions
- Strip plot + swarm plot: all individual data points overlaid per category
- ANOVA / Kruskal-Wallis test: statistical test for mean differences across groups
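A minimal Kruskal-Wallis check with scipy's `stats.kruskal` on synthetic groups (one group deliberately shifted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.8, 1.0, 100)  # shifted mean
group_c = rng.normal(0.0, 1.0, 100)

# Nonparametric test for location differences across 2+ groups
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H={h_stat:.2f}, p={p_value:.4g}")
```

Kruskal-Wallis is preferable to one-way ANOVA when the per-group distributions are skewed or contain outliers, which the univariate EDA above will already have told you.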
Seaborn Pairplot + Correlation Heatmap
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("dataset.csv")
target_col = "label"
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# --- Correlation matrix (Spearman — robust to outliers and skew) ---
corr_matrix = df[num_cols].corr(method="spearman")
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) # show lower triangle only
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True, fmt=".2f",
    cmap="coolwarm", center=0,
    linewidths=0.5,
    ax=ax
)
ax.set_title("Spearman Correlation Matrix", fontsize=14)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")
# --- Feature correlations with target (assumes a numeric target column) ---
target_corr = corr_matrix[target_col].drop(target_col).sort_values(ascending=False)
print("Feature correlations with target:")
print(target_corr.to_string())
# --- Pairplot for top correlated features (small feature sets only) ---
top_features = target_corr.abs().nlargest(5).index.tolist() + [target_col]
pair_df = df[top_features].copy()
pair_grid = sns.pairplot(
    pair_df,
    hue=target_col if df[target_col].nunique() <= 10 else None,
    diag_kind="kde",
    plot_kws={"alpha": 0.5, "s": 20}
)
pair_grid.fig.suptitle("Pairplot — Top 5 Features vs Target", y=1.02)
plt.savefig("pairplot.png", dpi=150, bbox_inches="tight")
# --- Mutual information (captures nonlinear relationships) ---
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
X = df[num_cols].drop(columns=[target_col]).fillna(0)
y = df[target_col]
# Use classif for discrete target, regression for continuous
mi = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print("\nMutual information with target:")
print(mi_series.to_string())
# --- Violin plots: top features vs categorical target ---
n_plots = min(3, len(X.columns))
fig, axes = plt.subplots(1, n_plots, figsize=(15, 5), squeeze=False)  # squeeze=False keeps axes 2D even for one plot
for i, col in enumerate(mi_series.index[:n_plots]):
    sns.violinplot(data=df, x=target_col, y=col, ax=axes[0, i], palette="muted")
    axes[0, i].set_title(f"{col} by target class")
plt.tight_layout()
plt.savefig("violin_plots.png", dpi=150, bbox_inches="tight")
Manual EDA is thorough but time-consuming. Automated tools generate comprehensive reports in one or two lines of code, giving you a fast first pass that you can then drill into manually.
ydata-profiling
Formerly pandas-profiling, this is the most popular automated EDA tool. One line generates a comprehensive HTML report covering statistics, distributions, correlations, and missing values for every column.
from ydata_profiling import ProfileReport
import pandas as pd
df = pd.read_csv("dataset.csv")
profile = ProfileReport(
    df,
    title="Dataset EDA Report",
    explorative=True,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    }
)
profile.to_file("eda_report.html")
Sweetviz
Sweetviz specializes in train/test comparison — generating side-by-side visualizations that make distribution shift between splits immediately obvious. Essential for validating train/test splits.
import sweetviz as sv
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Compare train vs test distributions
compare_report = sv.compare(
    [train, "Train"], [test, "Test"],
    target_feat="label"
)
compare_report.show_html("train_test_comparison.html")
# Single dataset analysis
analyze_report = sv.analyze(train, target_feat="label")
analyze_report.show_html("train_analysis.html")
D-Tale
D-Tale provides an interactive web UI for exploring pandas DataFrames. You can filter, sort, chart, and compute statistics interactively — useful for ad-hoc exploration in development environments.
import dtale
import pandas as pd
df = pd.read_csv("dataset.csv")
# Launch interactive UI (opens browser tab)
d = dtale.show(df)
d.open_browser()
# In Jupyter
import dtale.app as dtale_app
dtale_app.USE_COLAB = False # set True for Colab
dtale.show(df)
Lux
Lux is a Jupyter-native library that automatically recommends visualizations based on what attributes it detects in your DataFrame. It surfaces non-obvious patterns without you needing to specify what to plot.
import lux
import pandas as pd
df = pd.read_csv("dataset.csv")
# Just displaying the DataFrame in Jupyter
# activates Lux's automatic recommendation engine
df # shows interactive Lux widget
# Specify an intent to focus recommendations
df.intent = ["label"] # recommendations relative to target
df # now shows intent-driven recommendations
Tool Comparison
| Tool | Output Format | Best For | Install |
|---|---|---|---|
| ydata-profiling | Standalone HTML report; also JSON/widget | Full dataset audit; first pass on any new dataset | pip install ydata-profiling |
| Sweetviz | Standalone HTML report with side-by-side panels | Train/test comparison; detecting split distribution shift | pip install sweetviz |
| D-Tale | Interactive browser-based UI | Ad-hoc exploration; non-technical stakeholder demos | pip install dtale |
| Lux | Jupyter widget with interactive chart recommendations | Rapid visual exploration; discovering unexpected patterns | pip install lux-api |
| AutoViz | Matplotlib/Plotly charts saved to file | Quick automated charting; scriptable batch EDA | pip install autoviz |
Automated EDA Is a Starting Point, Not a Destination
Automated tools catch obvious issues and generate a comprehensive overview in minutes. But they cannot replace domain knowledge. A profiling report won't tell you that "transaction_amount = 0.01" is suspicious for your fraud dataset, or that a certain feature combination is physically impossible. Always follow automated EDA with manual domain-specific investigation.
Transforming Skewed Features
EDA reveals which features need transformation before they can be useful to linear models or distance-based algorithms.
- Log transform: apply to right-skewed features (income, request counts, file sizes); use log(x+1) to handle zeros
- Square root transform: milder than log; useful for count data with occasional large values
- Box-Cox transform: data-driven power transform; requires all positive values
- Yeo-Johnson transform: like Box-Cox but handles zeros and negatives
- Always check skewness before and after; target |skew| < 0.5
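A sketch of the before/after check, comparing `log1p` against scikit-learn's `PowerTransformer` (Yeo-Johnson) on synthetic right-skewed data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
x = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000))  # heavily right-skewed
print(f"before:      skew={x.skew():.2f}")

# log1p handles zeros; Yeo-Johnson additionally handles negatives
x_log = np.log1p(x)
pt = PowerTransformer(method="yeo-johnson")
x_yj = pd.Series(pt.fit_transform(x.to_frame()).ravel())

print(f"log1p:       skew={x_log.skew():.2f}")
print(f"yeo-johnson: skew={x_yj.skew():.2f}")
```

Fit the transformer on the training split only and reuse it on the test split to avoid leakage.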
Encoding Outliers as Features
Rather than removing outliers, EDA may reveal they are meaningful events worth flagging explicitly.
- Create binary "is_outlier_X" flag for extreme values in feature X
- Winsorize: cap extreme values at 1st/99th percentile instead of dropping
- In fraud detection, outliers are often the signal — never blindly remove them
- Use IQR method (below Q1 - 1.5×IQR or above Q3 + 1.5×IQR) for outlier boundaries
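A small helper combining Tukey-fence winsorization with an explicit outlier flag; the 1.5×IQR multiplier matches the convention above:

```python
import pandas as pd

def winsorize_with_flag(s: pd.Series, k: float = 1.5):
    """Cap values beyond Tukey fences; also return a 0/1 flag for extreme rows."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_extreme = (s < lower) | (s > upper)
    return s.clip(lower, upper), is_extreme.astype(int)

s = pd.Series([1, 2, 3, 2, 3, 1, 2, 100])  # one obvious outlier
capped, flag = winsorize_with_flag(s)
print(capped.max(), flag.tolist())
```

Keeping the flag preserves "an extreme event happened here" even after the magnitude has been capped.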
Binning & Interaction Features
EDA-revealed non-linear relationships and bimodal distributions suggest creative feature engineering.
- Binning/bucketing: convert continuous age into "young/middle/senior" buckets if EDA shows non-linear relationship with target
- Interaction features: multiply two correlated features; ratio of two related features (e.g., debt/income)
- Polynomial features: add x² or x³ if scatter plot suggests quadratic relationship with target
- Document every feature engineering decision with the EDA evidence that motivated it
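A sketch of EDA-driven binning and a ratio feature; the bucket thresholds and column names are illustrative and should come from your own EDA:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 34, 45, 52, 67, 71],
                   "debt": [5.0, 12.0, 8.0, 20.0, 15.0, 3.0, 6.0],
                   "income": [20.0, 40.0, 55.0, 80.0, 60.0, 30.0, 25.0]})

# Binning: thresholds chosen from the EDA, not library defaults
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                          labels=["young", "middle", "senior"])

# Ratio interaction suggested by correlated debt and income
df["debt_to_income"] = df["debt"] / df["income"]
print(df[["age", "age_bucket", "debt_to_income"]])
```

`pd.qcut` is the quantile-based alternative when you want equal-population buckets rather than hand-picked thresholds.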
Datetime & Missing Value Features
Temporal features and missingness patterns are two of the most commonly overlooked sources of signal.
- Datetime decomposition: extract hour, day-of-week, month, quarter, is_weekend, is_holiday
- Time since event: days since last purchase, hours since last login
- Missing as signal: create "was_X_missing" binary flag before imputing X
- Missingness count: total number of missing features per row as a feature
- Recency, frequency, monetary (RFM): aggregate time-series data into summary statistics
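The first and third ideas in a few lines of pandas (column names are illustrative; note the missingness flag is created before imputation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "last_login": pd.to_datetime(["2024-03-01 08:30", "2024-03-02 22:15", None]),
    "income": [50000.0, np.nan, 42000.0],
})

# Datetime decomposition via the .dt accessor
df["login_hour"] = df["last_login"].dt.hour
df["login_dow"] = df["last_login"].dt.dayofweek
df["is_weekend"] = df["login_dow"].isin([5, 6]).astype(int)

# Missing-as-signal: flag BEFORE imputing, or the signal is destroyed
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```

Ordering matters: imputing first and flagging second would flag nothing.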
Produce a Written Data Quality Report
EDA should culminate in a written document — not just notebooks with charts. The data quality report should include: dataset dimensions and provenance, feature-by-feature statistics and quality assessments, identified issues and how they were addressed, class balance, train/test distribution comparison, and recommended feature engineering steps. This report should be version-controlled alongside the code and referenced by anyone who works with the dataset.
EDA to Feature Engineering Workflow
| EDA Finding | Feature Engineering Response | Why It Helps |
|---|---|---|
| Feature skewness > 1 | Log or Box-Cox transform | Normalizes distribution for linear models; reduces outlier influence |
| Bimodal distribution in feature X | Create binary split feature (X > threshold) | Captures the hidden categorical structure explicitly |
| Strong correlation between X and Y | Add X/Y ratio as interaction feature | Ratio often more informative than either component alone |
| Datetime column present | Extract hour, day-of-week, is_weekend, days_since | Exposes cyclic and recency patterns invisible in raw timestamp |
| Systematic missingness in column X | Add binary "was_X_missing" flag before imputation | Preserves the informational signal that missingness itself carries |
| Outliers in critical feature | Winsorize + add "is_extreme_X" flag | Prevents outlier distortion while retaining the event as a signal |
| Near-constant feature (>99% same value) | Drop the feature | Near-zero variance carries almost no signal; keeping it adds noise and compute cost |
| High cardinality categorical (1000+ levels) | Target encoding or embedding; rare-level grouping | One-hot encoding would create sparse high-dim space; target encoding captures signal compactly |