Find Problems Before They Become Expensive
Data quality issues discovered during EDA take minutes to fix. The same issues discovered after a week of training cost far more — in compute, time, and credibility with stakeholders.
- Label leakage: a feature that directly encodes the target variable
- Train/test contamination: the same entity appearing in both splits
- Silent joins gone wrong: many-to-many join that multiplied rows unnoticed
- Temporal leakage: future information encoded in "current" features
- Near-constant features: variables with 99% the same value provide no signal
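Two of these checks reduce to a few lines of pandas. A minimal sketch, assuming an entity key named `user_id` and a 99% constancy threshold (both illustrative, adapt to your schema):

```python
import pandas as pd

def near_constant_features(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Columns where the single most frequent value covers >= threshold of rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

def split_contamination(train: pd.DataFrame, test: pd.DataFrame, key: str) -> set:
    """Entities (by key column) that appear in both splits."""
    return set(train[key]) & set(test[key])

# Tiny made-up example: one constant column, one shared entity
train = pd.DataFrame({"user_id": [1, 2, 3], "flag": [0, 0, 0]})
test = pd.DataFrame({"user_id": [3, 4], "flag": [0, 1]})
print(near_constant_features(train))              # ['flag']
print(split_contamination(train, test, "user_id"))  # {3}
```

Run both before any training; an unexpected non-empty result is a stop-the-line finding.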
Discover What Your Data Actually Says
Documentation lies. Schema definitions are aspirational. The actual data in production tells a different story — and EDA is how you read it.
- Distributions are rarely Gaussian — most real-world features are skewed or multimodal
- Outliers are often real events that the model must handle, not noise to discard
- Correlations between features suggest redundancy and multicollinearity risks
- Missing value patterns encode information — missingness is rarely completely random
- Class imbalance ratios determine your evaluation metric choices
Guide Feature Engineering
The best features are usually suggested by the data itself. EDA reveals which transformations, interactions, and aggregations are worth trying.
- Highly skewed continuous features suggest log-transformation
- Bimodal distributions suggest a hidden categorical split
- Strong correlation between two features suggests ratio or interaction features
- Cyclic datetime patterns (hour, day-of-week) suggest sinusoidal encoding
- Groups in scatter plots suggest segment-specific models or interaction terms
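The cyclic-encoding suggestion can be sketched in a few lines (a 24-hour period is assumed); the point is that hour 23 and hour 0 become neighbors in feature space instead of being 23 units apart:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": range(24)})

# Sinusoidal encoding: map hour onto the unit circle
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```

The same pattern works for day-of-week (period 7) or month (period 12).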
Validate Your Assumptions
Every ML project starts with assumptions about the data. EDA is where you test them before committing to a model architecture and feature set.
- "The classes are roughly balanced" — check this; often wrong
- "Feature X is a good predictor of Y" — check correlation; sometimes spurious
- "The data is i.i.d." — check for temporal autocorrelation and batch effects
- "Missing values are random" — visualize missingness patterns; often systematic
- "The test set is representative" — compare distributions between train and test
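The last two checks are nearly one-liners. A sketch on synthetic data (column names are illustrative) using a two-sample Kolmogorov-Smirnov test for train/test representativeness:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
train = pd.DataFrame({"x": rng.normal(0, 1, 500), "y": rng.integers(0, 2, 500)})
test = pd.DataFrame({"x": rng.normal(0.5, 1, 500)})  # deliberately shifted

# "The classes are roughly balanced" -- verify, don't assume
print(train["y"].value_counts(normalize=True))

# "The test set is representative" -- two-sample KS test per feature
ks_stat, p_value = stats.ks_2samp(train["x"], test["x"])
print(f"KS statistic={ks_stat:.3f}, p={p_value:.4g}")
if p_value < 0.01:
    print("Distributions differ -- the test set may not be representative")
```

Repeat the KS test per numerical feature; a handful of low p-values across many features is expected, a systematic pattern is not.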
The 80/20 Reality of ML Projects
Studies consistently show that data work — collection, cleaning, validation, and EDA — consumes 70–80% of total ML project time. EDA is not optional overhead; it is the core of the work. A thorough EDA makes the remaining 20–30% (modeling) dramatically more effective by ensuring you're optimizing the right objective with the right features on clean, well-understood data.
Famous EDA Disasters
A widely repeated cautionary tale holds that early image classifiers learned background context rather than the labeled subject, keying on scenery or the people in frame instead of the object itself. A better-documented case comes from healthcare: a pneumonia risk model learned that asthma patients had lower pneumonia mortality — not because asthma is protective, but because asthmatic patients were routed straight to the ICU and treated aggressively, so fewer of them died in the study cohort. Both failure modes would have been caught by thorough EDA and data auditing before modeling began.
Numerical Features
Analyze each numerical feature in isolation before considering relationships. Key statistics and plots:
- Histogram: reveals distribution shape — normal, skewed, bimodal, uniform
- Box plot: shows median, IQR, and outliers using Tukey fences (1.5× IQR)
- QQ plot: compare empirical quantiles against theoretical normal distribution
- Skewness: |skew| > 1 indicates high skew; 0.5 < |skew| ≤ 1 indicates moderate skew
- Kurtosis: high kurtosis (leptokurtic) indicates heavy tails and extreme outlier risk
- Mean vs median: large gap indicates skew; use median for robust central tendency reporting
Categorical Features
Categorical analysis focuses on the count and frequency of each level, plus cardinality assessment.
- Value counts: frequency of each category; sort descending to see dominant levels
- Bar chart: visualize value counts; horizontal orientation for long category names
- Cardinality: number of unique values — high cardinality needs special encoding strategies
- Rare categories: categories with <1% frequency may need grouping into "Other"
- Unexpected values: typos, inconsistent casing ("Male"/"male"/"M"), legacy codes
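A hedged sketch of cleaning inconsistent casing and grouping rare levels — the alias map and the 1% cutoff are assumptions to adapt per dataset:

```python
import pandas as pd

s = pd.Series(["Male", "male", "M", "Female", "female", "F", "Male", "unknwn"])

# Normalize whitespace and casing first, then map known aliases to canonical levels
canonical = {"male": "male", "m": "male", "female": "female", "f": "female"}
cleaned = s.str.strip().str.lower().map(canonical).fillna("other")
print(cleaned.value_counts())

def group_rare(series: pd.Series, min_freq: float = 0.01) -> pd.Series:
    """Replace levels with frequency below min_freq by 'Other'."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_freq].index
    return series.where(~series.isin(rare), "Other")

s2 = pd.Series(["a"] * 200 + ["b"])   # 'b' is below 1% frequency
print(group_rare(s2).value_counts())
```

Always inspect the full `value_counts()` output before mapping; silent aliases ("N/A", "-", "unknown") are common.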
Datetime & Time Series Features
Temporal data requires its own analysis to reveal trends, seasonality, and gaps.
- Time-series plot: raw values over time; reveals trend and outlier events
- Seasonality decomposition: STL decomposition splits trend + seasonal + residual components
- Autocorrelation (ACF/PACF): detect lag-based dependencies
- Gap detection: missing time periods that could indicate data pipeline failures
- Distribution by time bucket: hour-of-day, day-of-week — reveals cyclic patterns
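Gap detection and time-bucket profiling need only pandas. A sketch assuming an hourly feed (match the frequency to your data):

```python
import pandas as pd

# Hypothetical hourly sensor feed with two hours deliberately removed
idx = pd.date_range("2024-01-01", periods=48, freq="h")
ts = pd.Series(range(48), index=idx).drop(idx[[10, 11]])

# Gap detection: rebuild the expected index and diff against the actual one
expected = pd.date_range(ts.index.min(), ts.index.max(), freq="h")
gaps = expected.difference(ts.index)
print("Missing periods:", list(gaps))

# Distribution by time bucket reveals cyclic patterns
by_hour = ts.groupby(ts.index.hour).mean()
print(by_hour.head())
```

Unexpected gaps usually point at pipeline failures rather than genuinely absent events, so treat them as data quality findings first.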
Missing Value Analysis
Missingness should never be ignored — its pattern is itself informative.
- Missing percentage per feature: sort descending; features >80% missing may need dropping
- Missing completely at random (MCAR): missingness unrelated to any variable
- Missing at random (MAR): missingness related to other observed variables — can impute
- Missing not at random (MNAR): missingness related to the missing value itself — hardest case
- Missingness heatmap: visualize which rows have missing values across which columns simultaneously
Pandas Describe + Seaborn Histplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# Load dataset
df = pd.read_csv("dataset.csv")
# --- Basic statistics ---
print("Shape:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\nDescriptive statistics:\n", df.describe())
# Numerical columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
# --- Skewness and kurtosis for all numerical features ---
print("\nSkewness:")
for col in num_cols:
    sk = df[col].skew()
    ku = df[col].kurtosis()
    flag = " << HIGH SKEW" if abs(sk) > 1 else ""
    print(f"  {col}: skew={sk:.3f}, kurtosis={ku:.3f}{flag}")
# --- Histogram grid for numerical features ---
n = len(num_cols)
cols_per_row = 3
rows = (n + cols_per_row - 1) // cols_per_row
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 4 * rows))
axes = axes.flatten()
for i, col in enumerate(num_cols):
    sns.histplot(data=df, x=col, kde=True, ax=axes[i],
                 color="#00d9ff", alpha=0.7)
    axes[i].set_title(f"{col}\n(skew: {df[col].skew():.2f})")
    axes[i].set_facecolor("#0a0e27")
# Hide unused subplots
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)
plt.tight_layout()
plt.savefig("univariate_histograms.png", dpi=150, bbox_inches="tight")
# --- Categorical value counts ---
for col in cat_cols:
    print(f"\n{col} ({df[col].nunique()} unique values):")
    print(df[col].value_counts(normalize=True).head(10).to_string())
# --- Missing value heatmap ---
import missingno as msno
msno.heatmap(df, figsize=(12, 6))
plt.savefig("missing_heatmap.png", dpi=150)
Correlation Analysis
Measure pairwise linear and monotonic relationships between numerical features.
- Pearson r: linear correlation; assumes normality; sensitive to outliers
- Spearman ρ: monotonic correlation; rank-based; robust to outliers; preferred for skewed data
- Kendall τ: concordant vs discordant pairs; more robust than Spearman; slower for large datasets
- Strong correlations (>0.8) between features indicate multicollinearity — one may be droppable
- Strong correlation with target indicates high predictive utility
Target Correlation Analysis
For supervised learning, the most important bivariate analysis is feature-vs-target correlation.
- Rank features by absolute Spearman correlation with target
- Low correlation does not mean useless — interactions and nonlinear relations matter
- For classification: plot class-conditional distributions of each feature
- Box plots (feature value | class) reveal separability at a glance
- Mutual information captures nonlinear relationships that correlation misses
Multicollinearity Detection
Correlated features cause instability in linear models and inflate coefficient variances.
- Correlation matrix heatmap: visualize all pairwise correlations at once
- VIF (Variance Inflation Factor): VIF > 5 suggests problematic collinearity; VIF > 10 is severe
- Cluster correlated features together and keep one representative per cluster
- PCA collapses collinear features into orthogonal components
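VIF is usually computed with statsmodels' `variance_inflation_factor`; equivalently, each feature's VIF is the corresponding diagonal element of the inverse of the feature correlation matrix. This numpy-only sketch uses that identity on synthetic collinear data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.1, size=300),  # nearly collinear with x1
    "x3": rng.normal(size=300),                          # independent
})

# VIF_i equals the i-th diagonal element of the inverse correlation matrix
corr = X.corr().values
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif.round(2))  # x1 and x2 far exceed 5; x3 stays near 1
```

For production use, prefer the statsmodels implementation, which also handles an explicit intercept term.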
Categorical vs Continuous Relationships
When one variable is categorical and another is continuous, different plots reveal different aspects of the relationship.
- Box plot: median and spread per category; good for outlier comparison
- Violin plot: shows full distribution per category; better than box for multimodal distributions
- Strip plot + swarm plot: all individual data points overlaid per category
- ANOVA / Kruskal-Wallis test: statistical test for mean differences across groups
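A minimal Kruskal-Wallis check with scipy's `stats.kruskal` on synthetic groups (one group deliberately shifted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.8, 1.0, 100)  # shifted mean
group_c = rng.normal(0.0, 1.0, 100)

# Nonparametric test for location differences across 2+ groups
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H={h_stat:.2f}, p={p_value:.4g}")
```

Kruskal-Wallis is preferable to one-way ANOVA when the per-group distributions are skewed or contain outliers, which the univariate EDA above will already have told you.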
Seaborn Pairplot + Correlation Heatmap
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("dataset.csv")
target_col = "label"
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# --- Correlation matrix (Spearman — robust to outliers and skew) ---
corr_matrix = df[num_cols].corr(method="spearman")
fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) # show lower triangle only
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True, fmt=".2f",
    cmap="coolwarm", center=0,
    linewidths=0.5,
    ax=ax
)
ax.set_title("Spearman Correlation Matrix", fontsize=14)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")
# --- Feature correlations with target (assumes a numeric target column) ---
target_corr = corr_matrix[target_col].drop(target_col).sort_values(ascending=False)
print("Feature correlations with target:")
print(target_corr.to_string())
# --- Pairplot for top correlated features (small feature sets only) ---
top_features = target_corr.abs().nlargest(5).index.tolist() + [target_col]
pair_df = df[top_features].copy()
pair_grid = sns.pairplot(
    pair_df,
    hue=target_col if df[target_col].nunique() <= 10 else None,
    diag_kind="kde",
    plot_kws={"alpha": 0.5, "s": 20}
)
pair_grid.fig.suptitle("Pairplot — Top 5 Features vs Target", y=1.02)
plt.savefig("pairplot.png", dpi=150, bbox_inches="tight")
# --- Mutual information (captures nonlinear relationships) ---
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
X = df[num_cols].drop(columns=[target_col]).fillna(0)
y = df[target_col]
# Use classif for discrete target, regression for continuous
mi = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print("\nMutual information with target:")
print(mi_series.to_string())
# --- Violin plots: top features vs categorical target ---
n_plots = min(3, len(X.columns))
fig, axes = plt.subplots(1, n_plots, figsize=(15, 5), squeeze=False)  # squeeze=False keeps axes 2D even for one plot
for i, col in enumerate(mi_series.index[:n_plots]):
    sns.violinplot(data=df, x=target_col, y=col, ax=axes[0, i], palette="muted")
    axes[0, i].set_title(f"{col} by target class")
plt.tight_layout()
plt.savefig("violin_plots.png", dpi=150, bbox_inches="tight")
Manual EDA is thorough but time-consuming. Automated tools generate comprehensive reports in one or two lines of code, giving you a fast first pass that you can then drill into manually.
ydata-profiling
Formerly pandas-profiling, this is the most popular automated EDA tool. One line generates a comprehensive HTML report covering statistics, distributions, correlations, and missing values for every column.
from ydata_profiling import ProfileReport
import pandas as pd
df = pd.read_csv("dataset.csv")
profile = ProfileReport(
    df,
    title="Dataset EDA Report",
    explorative=True,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    }
)
profile.to_file("eda_report.html")
Sweetviz
Sweetviz specializes in train/test comparison — generating side-by-side visualizations that make distribution shift between splits immediately obvious. Essential for validating train/test splits.
import sweetviz as sv
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Compare train vs test distributions
compare_report = sv.compare(
    [train, "Train"], [test, "Test"],
    target_feat="label"
)
compare_report.show_html("train_test_comparison.html")
# Single dataset analysis
analyze_report = sv.analyze(train, target_feat="label")
analyze_report.show_html("train_analysis.html")
D-Tale
D-Tale provides an interactive web UI for exploring pandas DataFrames. You can filter, sort, chart, and compute statistics interactively — useful for ad-hoc exploration in development environments.
import dtale
import pandas as pd
df = pd.read_csv("dataset.csv")
# Launch interactive UI (opens browser tab)
d = dtale.show(df)
d.open_browser()
# In Jupyter
import dtale.app as dtale_app
dtale_app.USE_COLAB = False # set True for Colab
dtale.show(df)
Lux
Lux is a Jupyter-native library that automatically recommends visualizations based on what attributes it detects in your DataFrame. It surfaces non-obvious patterns without you needing to specify what to plot.
import lux
import pandas as pd
df = pd.read_csv("dataset.csv")
# Just displaying the DataFrame in Jupyter
# activates Lux's automatic recommendation engine
df # shows interactive Lux widget
# Specify an intent to focus recommendations
df.intent = ["label"] # recommendations relative to target
df # now shows intent-driven recommendations
Tool Comparison
| Tool | Output Format | Best For | Install |
|---|---|---|---|
| ydata-profiling | Standalone HTML report; also JSON/widget | Full dataset audit; first pass on any new dataset | pip install ydata-profiling |
| Sweetviz | Standalone HTML report with side-by-side panels | Train/test comparison; detecting split distribution shift | pip install sweetviz |
| D-Tale | Interactive browser-based UI | Ad-hoc exploration; non-technical stakeholder demos | pip install dtale |
| Lux | Jupyter widget with interactive chart recommendations | Rapid visual exploration; discovering unexpected patterns | pip install lux-api |
| AutoViz | Matplotlib/Plotly charts saved to file | Quick automated charting; scriptable batch EDA | pip install autoviz |
Automated EDA Is a Starting Point, Not a Destination
Automated tools catch obvious issues and generate a comprehensive overview in minutes. But they cannot replace domain knowledge. A profiling report won't tell you that "transaction_amount = 0.01" is suspicious for your fraud dataset, or that a certain feature combination is physically impossible. Always follow automated EDA with manual domain-specific investigation.
Transforming Skewed Features
EDA reveals which features need transformation before they can be useful to linear models or distance-based algorithms.
- Log transform: apply to right-skewed features (income, request counts, file sizes); use log(x+1) to handle zeros
- Square root transform: milder than log; useful for count data with occasional large values
- Box-Cox transform: data-driven power transform; requires all positive values
- Yeo-Johnson transform: like Box-Cox but handles zeros and negatives
- Always check skewness before and after; target |skew| < 0.5
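A sketch of the before/after check, comparing `log1p` against scikit-learn's `PowerTransformer` (Yeo-Johnson) on synthetic right-skewed data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
x = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000))  # heavily right-skewed
print(f"before:      skew={x.skew():.2f}")

# log1p handles zeros; Yeo-Johnson additionally handles negatives
x_log = np.log1p(x)
pt = PowerTransformer(method="yeo-johnson")
x_yj = pd.Series(pt.fit_transform(x.to_frame()).ravel())

print(f"log1p:       skew={x_log.skew():.2f}")
print(f"yeo-johnson: skew={x_yj.skew():.2f}")
```

Fit the transformer on the training split only and reuse it on the test split to avoid leakage.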
Encoding Outliers as Features
Rather than removing outliers, EDA may reveal they are meaningful events worth flagging explicitly.
- Create binary "is_outlier_X" flag for extreme values in feature X
- Winsorize: cap extreme values at 1st/99th percentile instead of dropping
- In fraud detection, outliers are often the signal — never blindly remove them
- Use IQR method (below Q1 - 1.5×IQR or above Q3 + 1.5×IQR) for outlier boundaries
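A small helper combining Tukey-fence winsorization with an explicit outlier flag; the 1.5×IQR multiplier matches the convention above:

```python
import pandas as pd

def winsorize_with_flag(s: pd.Series, k: float = 1.5):
    """Cap values beyond Tukey fences; also return a 0/1 flag for extreme rows."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_extreme = (s < lower) | (s > upper)
    return s.clip(lower, upper), is_extreme.astype(int)

s = pd.Series([1, 2, 3, 2, 3, 1, 2, 100])  # one obvious outlier
capped, flag = winsorize_with_flag(s)
print(capped.max(), flag.tolist())
```

Keeping the flag preserves "an extreme event happened here" even after the magnitude has been capped.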
Binning & Interaction Features
EDA-revealed non-linear relationships and bimodal distributions suggest creative feature engineering.
- Binning/bucketing: convert continuous age into "young/middle/senior" buckets if EDA shows non-linear relationship with target
- Interaction features: multiply two correlated features; ratio of two related features (e.g., debt/income)
- Polynomial features: add x² or x³ if scatter plot suggests quadratic relationship with target
- Document every feature engineering decision with the EDA evidence that motivated it
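A sketch of EDA-driven binning and a ratio feature; the bucket thresholds and column names are illustrative and should come from your own EDA:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 34, 45, 52, 67, 71],
                   "debt": [5.0, 12.0, 8.0, 20.0, 15.0, 3.0, 6.0],
                   "income": [20.0, 40.0, 55.0, 80.0, 60.0, 30.0, 25.0]})

# Binning: thresholds chosen from the EDA, not library defaults
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                          labels=["young", "middle", "senior"])

# Ratio interaction suggested by correlated debt and income
df["debt_to_income"] = df["debt"] / df["income"]
print(df[["age", "age_bucket", "debt_to_income"]])
```

`pd.qcut` is the quantile-based alternative when you want equal-population buckets rather than hand-picked thresholds.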
Datetime & Missing Value Features
Temporal features and missingness patterns are two of the most commonly overlooked sources of signal.
- Datetime decomposition: extract hour, day-of-week, month, quarter, is_weekend, is_holiday
- Time since event: days since last purchase, hours since last login
- Missing as signal: create "was_X_missing" binary flag before imputing X
- Missingness count: total number of missing features per row as a feature
- Recency, frequency, monetary (RFM): aggregate time-series data into summary statistics
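The first and third ideas in a few lines of pandas (column names are illustrative; note the missingness flag is created before imputation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "last_login": pd.to_datetime(["2024-03-01 08:30", "2024-03-02 22:15", None]),
    "income": [50000.0, np.nan, 42000.0],
})

# Datetime decomposition via the .dt accessor
df["login_hour"] = df["last_login"].dt.hour
df["login_dow"] = df["last_login"].dt.dayofweek
df["is_weekend"] = df["login_dow"].isin([5, 6]).astype(int)

# Missing-as-signal: flag BEFORE imputing, or the signal is destroyed
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```

Ordering matters: imputing first and flagging second would flag nothing.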
Produce a Written Data Quality Report
EDA should culminate in a written document — not just notebooks with charts. The data quality report should include: dataset dimensions and provenance, feature-by-feature statistics and quality assessments, identified issues and how they were addressed, class balance, train/test distribution comparison, and recommended feature engineering steps. This report should be version-controlled alongside the code and referenced by anyone who works with the dataset.
EDA to Feature Engineering Workflow
| EDA Finding | Feature Engineering Response | Why It Helps |
|---|---|---|
| Feature skewness > 1 | Log or Box-Cox transform | Normalizes distribution for linear models; reduces outlier influence |
| Bimodal distribution in feature X | Create binary split feature (X > threshold) | Captures the hidden categorical structure explicitly |
| Strong correlation between X and Y | Add X/Y ratio as interaction feature | Ratio often more informative than either component alone |
| Datetime column present | Extract hour, day-of-week, is_weekend, days_since | Exposes cyclic and recency patterns invisible in raw timestamp |
| Systematic missingness in column X | Add binary "was_X_missing" flag before imputation | Preserves the informational signal that missingness itself carries |
| Outliers in critical feature | Winsorize + add "is_extreme_X" flag | Prevents outlier distortion while retaining the event as a signal |
| Near-constant feature (>99% same value) | Drop the feature | Near-zero variance carries almost no signal; keeping it adds noise and compute cost |
| High cardinality categorical (1000+ levels) | Target encoding or embedding; rare-level grouping | One-hot encoding would create sparse high-dim space; target encoding captures signal compactly |