⚠ Common Data Quality Problems
Real-world data is never clean. Before any model training, you must understand what is wrong with your data, how much of it is affected, and what impact each problem will have on downstream learning. Rushing past this audit step is the single most common source of silent model failures.
| Problem | Example | Detection Method | Impact on ML |
|---|---|---|---|
| Missing Values | Age column has NaN for 30% of rows; sensor dropout leaving gaps in time series | df.isnull().sum(), missingness heatmap, missingno library | Many algorithms crash on NaN; wrong imputation introduces bias |
| Duplicate Records | Same transaction recorded twice; web-scraped article appears under two URLs | df.duplicated().sum(), hash-based dedup, similarity search | Inflates training examples; model memorises duplicates; inflated metrics |
| Wrong Data Types | Dates stored as strings; categorical IDs read as integers and averaged | df.dtypes, value distribution inspection, parsing errors | Numeric operations on strings fail; date arithmetic impossible; ordinal encoding wrong |
| Outliers | Age = 999; transaction amount = $1 billion; typo: 180cm entered as 1800cm | IQR / z-score / boxplot; isolation forest for multivariate outliers | Skew means and standard deviations; distort linear model coefficients; inflate MSE in regression |
| Inconsistent Formatting | "New York", "new york", "NY", "N.Y." all meaning the same city | df['col'].value_counts(), fuzzy string matching, regex audit | Creates spurious extra categories; increases cardinality; corrupts groupby aggregations |
| Mislabelled Examples | 5-star review tagged as 1-star; spam email in non-spam class | Confident learning (Cleanlab), cross-validation + review of misclassified examples | Corrupted labels are often worse than missing data; degrades model ceiling |
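The detection methods in the table can be chained into a single quick audit pass. A minimal sketch, using a hypothetical toy DataFrame with the problems seeded deliberately (the column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with seeded problems: a NaN, an impossible age, format variants
df = pd.DataFrame({
    "age": [25, 30, np.nan, 999, 30],
    "city": ["New York", "new york", "NY", "Chicago", "new york"],
})
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)  # inject an exact duplicate

# Missing values: fraction of NaN per column
missing = df.isnull().mean()

# Exact duplicates across all columns (keep='first' by default)
n_dupes = df.duplicated().sum()

# Spurious categories: raw value counts expose formatting variants
n_city_variants = df["city"].nunique()

print(missing["age"], n_dupes, n_city_variants)
```

Running each check before training, and logging the counts, turns "the data looked fine" into a verifiable claim.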
❓ Handling Missing Values
Not all missing data is the same. The mechanism by which values go missing determines which handling strategy is appropriate. Using the wrong strategy can introduce systematic bias — even worse than leaving the data alone.
MCAR — Missing Completely At Random
The probability of missingness is independent of both observed and unobserved data. Example: a sensor randomly fails 2% of the time due to hardware glitches.
- Listwise deletion is unbiased (though reduces sample size)
- Any imputation method is unbiased here (though simple methods still understate variance)
- Test: Little's MCAR test (but rarely conclusive)
MAR — Missing At Random
The probability of missingness depends on observed data but not on the missing value itself. Example: older patients less likely to have their weight recorded, but weight itself doesn't predict missingness after controlling for age.
- Conditional imputation is valid (condition on the observed predictors)
- Multiple imputation (MICE) handles MAR well
- Listwise deletion introduces bias — avoid it
MNAR — Missing Not At Random
The probability of missingness depends on the missing value itself. Example: high earners refuse to disclose income; very ill patients drop out of clinical trials.
- No imputation method fully corrects for MNAR
- Add a binary "was_missing" indicator feature — the fact of missingness carries information
- Collect more data; redesign the collection process
- Sensitivity analysis to quantify impact of different assumptions
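To see why listwise deletion is safe under MCAR but biased once missingness depends on the data, a small simulation on hypothetical synthetic incomes helps: when high earners are more likely to be missing, the mean of the surviving rows is pulled down.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.normal(loc=50_000, scale=10_000, size=100_000)

# MCAR: 30% of values dropped uniformly at random — surviving mean stays unbiased
mcar_mask = rng.random(income.size) < 0.30
mcar_mean = income[~mcar_mask].mean()

# MNAR-style: high earners are far more likely to be missing — surviving mean is biased
mnar_mask = rng.random(income.size) < np.where(income > 60_000, 0.80, 0.05)
mnar_mean = income[~mnar_mask].mean()

print(f"true mean  {income.mean():,.0f}")
print(f"MCAR mean  {mcar_mean:,.0f}")  # close to the truth
print(f"MNAR mean  {mnar_mean:,.0f}")  # noticeably below the truth
```

The same simulation framing underpins sensitivity analysis: vary the assumed missingness mechanism and measure how much your estimates move.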
Imputation Strategies
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Listwise Deletion | MCAR, <5% missing, large dataset | Simple; unbiased under MCAR | Loses data; biased under MAR/MNAR |
| Mean / Median / Mode | MCAR, quick baseline | Simple; no model needed | Reduces variance; ignores feature correlations; biased for MAR |
| KNN Imputation | MAR, moderate missing rate, <100K rows | Uses feature relationships; works for mixed types | Slow on large datasets; sensitive to scale (must normalise first) |
| MICE / Iterative Imputation | MAR, multiple columns with missingness | Preserves correlations; accounts for uncertainty | Computationally expensive; assumes linear relationships by default |
| Model-Based (Random Forest) | Complex non-linear relationships, mixed types | Captures non-linear feature interactions | Can overfit; slow; requires careful cross-validation |
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

df = pd.read_csv("dataset.csv")

# ---- Step 1: Audit missingness ----
missing_report = df.isnull().mean().sort_values(ascending=False)
print(missing_report[missing_report > 0])

# Add "was_missing" indicators BEFORE imputation (captures MNAR signal)
for col in df.columns[df.isnull().any()]:
    df[f"{col}_was_missing"] = df[col].isnull().astype(int)

# ---- Step 2: Simple imputation (baseline) ----
simple_imp = SimpleImputer(strategy="median")  # or "mean", "most_frequent", "constant"
df[["age", "income"]] = simple_imp.fit_transform(df[["age", "income"]])

# ---- Step 3: KNN imputation (preserves feature correlations) ----
# Note: KNNImputer requires scaled data — normalise first
knn_imp = KNNImputer(n_neighbors=5, weights="distance")
df[["bmi", "glucose"]] = knn_imp.fit_transform(df[["bmi", "glucose"]])

# ---- Step 4: MICE / Iterative imputation (best for MAR) ----
iter_imp = IterativeImputer(
    max_iter=10,
    random_state=42,
    initial_strategy="median",
    estimator=None,  # defaults to BayesianRidge; swap in RandomForestRegressor for non-linear data
)
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = iter_imp.fit_transform(df[numeric_cols])

# IMPORTANT: fit imputers on the train set only; transform train + test —
# store the fitted imputer objects and call .transform() on test data
```
💥 Outlier Detection & Treatment
Outliers are data points that deviate significantly from the rest. They can be genuine (a real billionaire in income data) or errors (age = 999). The distinction is critical — removing genuine outliers destroys real-world signal; keeping erroneous ones corrupts your model.
Outliers in Features vs. Outliers in Labels
These require different treatment. Feature outliers may distort distance calculations and gradient updates — consider capping or transforming. Label outliers (mislabelled examples) are often more dangerous — they directly teach the model wrong answers. Use confident-learning tools such as Cleanlab to detect label errors first, rather than blindly removing them as outliers.
Univariate Detection Methods
- IQR method — outliers are values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR; robust, widely used
- Z-score — |z| > 3 flags outliers; assumes normality; distorted by extreme values
- Modified Z-score — uses median absolute deviation (MAD) instead of std; robust to extreme values
- Percentile capping — clip at 1st/99th percentile; preserves all rows
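The modified z-score is easy to compute by hand. A sketch using the conventional 0.6745 scaling constant (which makes the MAD comparable to a standard deviation under normality) and the common 3.5 cutoff; the sample values are made up:

```python
import numpy as np

def modified_zscore(x: np.ndarray) -> np.ndarray:
    """Modified z-score: robust to extreme values because it uses median and MAD."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    return 0.6745 * (x - median) / mad

x = np.array([18.0, 21.0, 22.0, 20.0, 19.0, 23.0, 180.0])  # one gross error
scores = modified_zscore(x)
outliers = x[np.abs(scores) > 3.5]  # common cutoff for the modified z-score
print(outliers)  # → [180.]
```

Note that an ordinary z-score would be dragged toward the 180 value (it inflates the mean and standard deviation it is computed from), while the median and MAD ignore it.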
Multivariate Detection Methods
- Isolation Forest — isolates outliers by randomly partitioning feature space; fast; works in high dimensions
- Local Outlier Factor (LOF) — density-based; detects local outliers that wouldn't be flagged univariately
- DBSCAN — density clustering; noise points are outlier candidates; good for spatial data
- Mahalanobis distance — accounts for feature correlations; parametric; requires invertible covariance
- Autoencoder reconstruction error — high reconstruction loss = anomaly; works for complex patterns
```python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

df = pd.read_csv("dataset.csv")

# ---- IQR capping (winsorisation) ----
def cap_outliers_iqr(series, multiplier=1.5):
    Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - multiplier * IQR, Q3 + multiplier * IQR
    return series.clip(lower=lower, upper=upper), (lower, upper)

df["age"], age_bounds = cap_outliers_iqr(df["age"])
print(f"Age capped at {age_bounds}")

# ---- Z-score filtering (removes rows — use with caution) ----
z_scores = np.abs(stats.zscore(df[["income", "spending"]]))
df_clean = df[(z_scores < 3).all(axis=1)]
print(f"Removed {len(df) - len(df_clean)} rows via z-score filter")

# ---- Isolation Forest (multivariate) ----
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # 5% assumed outlier rate
feature_cols = ["age", "income", "spending", "credit_score"]
df["outlier_score"] = iso_forest.fit_predict(df[feature_cols])  # -1 = outlier, 1 = inlier
outliers = df[df["outlier_score"] == -1]
print(f"Isolation Forest flagged {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")

# Treatment options (pick based on domain knowledge):
# Option A: remove     df = df[df["outlier_score"] == 1]
# Option B: cap/clip   (use cap_outliers_iqr above)
# Option C: transform  df["income"] = np.log1p(df["income"])
# Option D: keep + flag (already done with the outlier_score column)
```
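Local Outlier Factor, from the multivariate list above, follows the same fit_predict convention as Isolation Forest. A minimal sketch on hypothetical synthetic data — a dense cloud plus one isolated point:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))           # dense inlier cloud around the origin
X = np.vstack([X, [[8.0, 8.0]]])       # one point far from any local neighbourhood

lof = LocalOutlierFactor(n_neighbors=10)  # density compared against 10 nearest neighbours
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier (same convention as IsolationForest)

print(labels[-1])  # the isolated point is flagged as -1
# negative_outlier_factor_ is ≈ -1 for inliers, much more negative for outliers
print(lof.negative_outlier_factor_[-1])
```

Unlike Isolation Forest, LOF (with default settings) has no `predict` for new data — it scores only the points it was fit on; pass `novelty=True` at construction if you need to score unseen points.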
🧹 Duplicate & Inconsistency Removal
Duplicates and format inconsistencies silently inflate your dataset and create spurious category splits. A model that learns "New York" and "new york" are different cities will fail systematically on any input that doesn't match training-time formatting.
Exact vs. Near-Duplicate Detection
- Exact duplicates — df.duplicated(); hash of all columns or a subset of key columns
- Near-duplicates (text) — MinHash + LSH (Locality-Sensitive Hashing); Jaccard similarity on shingles; efficient even for millions of documents
- Near-duplicates (images) — perceptual hash (pHash/dHash); cosine similarity of CNN embeddings
- Semantic duplicates — same meaning, different wording; requires embedding-based similarity search (FAISS)
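MinHash + LSH only approximates (at scale) the Jaccard similarity that exact shingle comparison computes. The underlying quantity is easy to sketch directly in pure Python, which is perfectly practical for small datasets; the example strings are made up:

```python
def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a case/whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

near = jaccard("Breaking: markets rally on rate cut",
               "Breaking - markets rally on rate cut!")
far = jaccard("Breaking: markets rally on rate cut",
              "Local team wins championship game")
print(f"{near:.2f} vs {far:.2f}")  # near-duplicates score high, unrelated text low
```

MinHash replaces each shingle set with a short signature whose agreement rate estimates this same Jaccard value, and LSH buckets signatures so that only likely matches are ever compared.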
String Standardisation
- Case normalisation — .str.lower().str.strip() as a minimum
- Regex cleaning — remove punctuation, extra whitespace, special characters
- Fuzzy matching — RapidFuzz / FuzzyWuzzy (token_sort_ratio); Levenshtein edit distance
- Canonical mapping — build a dictionary {"NY": "New York", "N.Y.": "New York"}; use for known value sets
- Entity resolution — link records across tables referring to the same real-world entity
```python
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.read_csv("customers.csv")

# ---- Exact duplicate removal ----
n_before = len(df)
df = df.drop_duplicates(subset=["email", "phone"], keep="first")
print(f"Removed {n_before - len(df)} exact duplicates")

# ---- String standardisation ----
df["city"] = df["city"].str.lower().str.strip()

# ---- Canonical category mapping ----
# Apply BEFORE stripping punctuation, so dotted variants like "n.y." still match
city_map = {
    "ny": "new york",
    "n.y.": "new york",
    "new york city": "new york",
    "nyc": "new york",
    "la": "los angeles",
    "l.a.": "los angeles",
}
df["city"] = df["city"].map(city_map).fillna(df["city"])
df["city"] = df["city"].str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation

# ---- Fuzzy deduplication for near-duplicates ----
# Given a reference list of canonical city names:
canonical_cities = ["new york", "los angeles", "chicago", "houston", "phoenix"]

def fuzzy_standardise(value, choices, threshold=85):
    if pd.isna(value):
        return value
    match, score, _ = process.extractOne(str(value), choices, scorer=fuzz.token_sort_ratio)
    return match if score >= threshold else value

df["city"] = df["city"].apply(lambda x: fuzzy_standardise(x, canonical_cities))

# ---- Category harmonisation ----
df["gender"] = df["gender"].str.lower().map({
    "m": "male", "man": "male", "male": "male",
    "f": "female", "woman": "female", "female": "female",
}).fillna("unknown")

print(df["city"].value_counts().head(10))
```
🔧 Cleaning Pipelines
Ad-hoc cleaning scripts are technical debt. The moment you need to apply the same cleaning logic to new data (a new batch, a test set, production inference), scattered notebook code becomes a liability. Building cleaning as a pipeline ensures reproducibility and prevents data leakage.
Always Keep the Raw Data
Never overwrite your raw data. Store the original files in an immutable location (S3 with versioning, DVC-tracked files) and always write cleaned data to a separate path. When your cleaning logic improves, you can re-run from scratch. Treat raw data as a read-only source of truth.
Data Quality Metrics to Track
- Completeness — % of non-null values per column; target >95% for key features
- Consistency — values conform to expected format/range; referential integrity holds
- Validity — values make domain sense (age between 0–120; probabilities sum to 1)
- Uniqueness — no unexpected duplicates in primary key columns
- Timeliness — data reflects current state; no stale records
- Accuracy — values match ground truth (hardest to measure automatically)
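The first four metrics are cheap to compute automatically on every incoming batch. A sketch with hypothetical column names and thresholds (accuracy and timeliness need external ground truth, so they are omitted):

```python
import pandas as pd

# Hypothetical batch: one null age, one out-of-range age, one duplicate user_id
df = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "age": [25, None, 47, 200],
    "email": ["a@x.com", "b@x.com", None, "c@x.com"],
})

report = {
    # Completeness: share of non-null values per column
    "completeness": df.notnull().mean().to_dict(),
    # Validity: values make domain sense, e.g. age within 0-120
    "validity_age": df["age"].dropna().between(0, 120).mean(),
    # Uniqueness: distinct primary-key values / rows (1.0 means no duplicates)
    "uniqueness_user_id": df["user_id"].nunique() / len(df),
}
print(report)
```

In production, the same numbers would feed threshold checks (e.g. fail the run if completeness of a key feature drops below 95%), which is exactly what schema-validation tools like Great Expectations or Pandera formalise.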
Pipeline Best Practices
- Fit on train, transform all — all statistics (mean, IQR bounds) computed on train split only
- Log everything removed — record how many rows/values were affected by each step
- Version your cleaning — tag dataset versions alongside model versions (use DVC or MLflow)
- Schema validation — use Great Expectations or Pandera to assert constraints on data before training
- Idempotent operations — running the pipeline twice should produce the same result
```python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

# ---- Custom cleaning steps ----
def remove_duplicates(X):
    n_before = len(X)
    X = X.drop_duplicates()
    log.info(f"Removed {n_before - len(X)} duplicate rows")
    return X

def cap_outliers(X, cols=None, multiplier=1.5):
    X = X.copy()
    cols = cols if cols is not None else X.select_dtypes(include=np.number).columns
    for col in cols:
        Q1, Q3 = X[col].quantile(0.25), X[col].quantile(0.75)
        IQR = Q3 - Q1
        lower, upper = Q1 - multiplier * IQR, Q3 + multiplier * IQR
        n_capped = ((X[col] < lower) | (X[col] > upper)).sum()
        X[col] = X[col].clip(lower, upper)
        log.info(f"Capped {n_capped} outliers in '{col}' to [{lower:.2f}, {upper:.2f}]")
    return X

def standardise_strings(X, cols):
    X = X.copy()
    for col in cols:
        X[col] = X[col].str.lower().str.strip()
    return X

# ---- Build the cleaner ----
# Note: sklearn Pipelines work with estimators; for pandas DataFrames,
# use FunctionTransformer or a custom BaseEstimator + TransformerMixin
class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, numeric_cols, string_cols):
        self.numeric_cols = numeric_cols
        self.string_cols = string_cols

    def fit(self, X, y=None):
        # Compute ALL statistics on TRAIN data only (fitted attrs get a trailing underscore)
        self.iqr_bounds_ = {}
        for col in self.numeric_cols:
            Q1, Q3 = X[col].quantile(0.25), X[col].quantile(0.75)
            IQR = Q3 - Q1
            self.iqr_bounds_[col] = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
        self.imputer_ = SimpleImputer(strategy="median").fit(X[self.numeric_cols])
        return self

    def transform(self, X):
        X = X.copy()
        X = remove_duplicates(X)
        X = standardise_strings(X, self.string_cols)
        for col, (lo, hi) in self.iqr_bounds_.items():
            X[col] = X[col].clip(lo, hi)
        X[self.numeric_cols] = self.imputer_.transform(X[self.numeric_cols])
        return X

# Usage
cleaner = DataCleaner(
    numeric_cols=["age", "income", "balance"],
    string_cols=["city", "country"],
)
X_train_clean = cleaner.fit_transform(X_train)  # fit + transform train
X_test_clean = cleaner.transform(X_test)        # transform test with TRAIN statistics
```