⚠ Common Data Quality Problems
Real-world data is never clean. Before any model training, you must understand what is wrong with your data, how much of it is affected, and what impact each problem will have on downstream learning. Rushing past this audit step is the single most common source of silent model failures.
| Problem | Example | Detection Method | Impact on ML |
|---|---|---|---|
| Missing Values | Age column has NaN for 30% of rows; sensor dropout leaving gaps in time series | df.isnull().sum(), missingness heatmap, missingno library | Many algorithms crash on NaN; wrong imputation introduces bias |
| Duplicate Records | Same transaction recorded twice; web-scraped article appears under two URLs | df.duplicated().sum(), hash-based dedup, similarity search | Inflates training examples; model memorises duplicates; inflated metrics |
| Wrong Data Types | Dates stored as strings; categorical IDs read as integers and averaged | df.dtypes, value distribution inspection, parsing errors | Numeric operations on strings fail; date arithmetic impossible; ordinal encoding wrong |
| Outliers | Age = 999; transaction amount = $1 billion; typo: 180cm entered as 1800cm | IQR / z-score / boxplot; isolation forest for multivariate outliers | Skew means and standard deviations; distort linear model coefficients; inflate MSE in regression |
| Inconsistent Formatting | "New York", "new york", "NY", "N.Y." all meaning the same city | df['col'].value_counts(), fuzzy string matching, regex audit | Creates spurious extra categories; increases cardinality; corrupts groupby aggregations |
| Mislabelled Examples | 5-star review tagged as 1-star; spam email in non-spam class | Confident learning (Cleanlab), cross-validation + review of misclassified examples | Corrupted labels are often worse than missing data; degrades model ceiling |
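The detection methods in the table can be chained into a single quick audit pass. A minimal sketch, using a hypothetical toy DataFrame with the problems seeded deliberately (the column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with seeded problems: a NaN, an impossible age, format variants
df = pd.DataFrame({
    "age": [25, 30, np.nan, 999, 30],
    "city": ["New York", "new york", "NY", "Chicago", "new york"],
})
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)  # inject an exact duplicate

# Missing values: fraction of NaN per column
missing = df.isnull().mean()

# Exact duplicates across all columns (keep='first' by default)
n_dupes = df.duplicated().sum()

# Spurious categories: raw value counts expose formatting variants
n_city_variants = df["city"].nunique()

print(missing["age"], n_dupes, n_city_variants)
```

Running each check before training, and logging the counts, turns "the data looked fine" into a verifiable claim.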
❓ Handling Missing Values
Not all missing data is the same. The mechanism by which values go missing determines which handling strategy is appropriate. Using the wrong strategy can introduce systematic bias — even worse than leaving the data alone.
MCAR — Missing Completely At Random
The probability of missingness is independent of both observed and unobserved data. Example: a sensor randomly fails 2% of the time due to hardware glitches.
- Listwise deletion is unbiased (though reduces sample size)
- Any imputation method is unbiased here (though simple methods still understate variance)
- Test: Little's MCAR test (but rarely conclusive)
MAR — Missing At Random
The probability of missingness depends on observed data but not on the missing value itself. Example: older patients less likely to have their weight recorded, but weight itself doesn't predict missingness after controlling for age.
- Conditional imputation is valid (condition on the observed predictors)
- Multiple imputation (MICE) handles MAR well
- Listwise deletion introduces bias — avoid it
MNAR — Missing Not At Random
The probability of missingness depends on the missing value itself. Example: high earners refuse to disclose income; very ill patients drop out of clinical trials.
- No imputation method fully corrects for MNAR
- Add a binary "was_missing" indicator feature — the fact of missingness carries information
- Collect more data; redesign the collection process
- Sensitivity analysis to quantify impact of different assumptions
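To see why listwise deletion is safe under MCAR but biased once missingness depends on the data, a small simulation on hypothetical synthetic incomes helps: when high earners are more likely to be missing, the mean of the surviving rows is pulled down.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.normal(loc=50_000, scale=10_000, size=100_000)

# MCAR: 30% of values dropped uniformly at random — surviving mean stays unbiased
mcar_mask = rng.random(income.size) < 0.30
mcar_mean = income[~mcar_mask].mean()

# MNAR-style: high earners are far more likely to be missing — surviving mean is biased
mnar_mask = rng.random(income.size) < np.where(income > 60_000, 0.80, 0.05)
mnar_mean = income[~mnar_mask].mean()

print(f"true mean  {income.mean():,.0f}")
print(f"MCAR mean  {mcar_mean:,.0f}")  # close to the truth
print(f"MNAR mean  {mnar_mean:,.0f}")  # noticeably below the truth
```

The same simulation framing underpins sensitivity analysis: vary the assumed missingness mechanism and measure how much your estimates move.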
Imputation Strategies
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Listwise Deletion | MCAR, <5% missing, large dataset | Simple; unbiased under MCAR | Loses data; biased under MAR/MNAR |
| Mean / Median / Mode | MCAR, quick baseline | Simple; no model needed | Reduces variance; ignores feature correlations; biased for MAR |
| KNN Imputation | MAR, moderate missing rate, <100K rows | Uses feature relationships; works for mixed types | Slow on large datasets; sensitive to scale (must normalise first) |
| MICE / Iterative Imputation | MAR, multiple columns with missingness | Preserves correlations; accounts for uncertainty | Computationally expensive; assumes linear relationships by default |
| Model-Based (Random Forest) | Complex non-linear relationships, mixed types | Captures non-linear feature interactions | Can overfit; slow; requires careful cross-validation |
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

df = pd.read_csv("dataset.csv")

# ---- Step 1: Audit missingness ----
missing_report = df.isnull().mean().sort_values(ascending=False)
print(missing_report[missing_report > 0])

# Add "was_missing" indicators BEFORE imputation (captures MNAR signal)
for col in df.columns[df.isnull().any()]:
    df[f"{col}_was_missing"] = df[col].isnull().astype(int)

# ---- Step 2: Simple imputation (baseline) ----
simple_imp = SimpleImputer(strategy="median")  # or "mean", "most_frequent", "constant"
df[["age", "income"]] = simple_imp.fit_transform(df[["age", "income"]])

# ---- Step 3: KNN imputation (preserves feature correlations) ----
# Note: KNNImputer requires scaled data — normalise first
knn_imp = KNNImputer(n_neighbors=5, weights="distance")
df[["bmi", "glucose"]] = knn_imp.fit_transform(df[["bmi", "glucose"]])

# ---- Step 4: MICE / Iterative imputation (best for MAR) ----
iter_imp = IterativeImputer(
    max_iter=10,
    random_state=42,
    initial_strategy="median",
    estimator=None,  # defaults to BayesianRidge; swap in RandomForestRegressor for non-linear data
)
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = iter_imp.fit_transform(df[numeric_cols])

# IMPORTANT: fit imputers on the train set only; transform train + test —
# store the fitted imputer objects and call .transform() on test data
```
💥 Outlier Detection & Treatment
Outliers are data points that deviate significantly from the rest. They can be genuine (a real billionaire in income data) or errors (age = 999). The distinction is critical — removing genuine outliers destroys real-world signal; keeping erroneous ones corrupts your model.
Outliers in Features vs. Outliers in Labels
These require different treatment. Feature outliers may distort distance calculations and gradient updates — consider capping or transforming. Label outliers (mislabelled examples) are often more dangerous — they directly teach the model wrong answers. Use confident-learning tools such as Cleanlab to detect label errors first, rather than blindly removing them as outliers.
Univariate Detection Methods
- IQR method — outliers are values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR; robust, widely used
- Z-score — |z| > 3 flags outliers; assumes normality; distorted by extreme values
- Modified Z-score — uses median absolute deviation (MAD) instead of std; robust to extreme values
- Percentile capping — clip at 1st/99th percentile; preserves all rows
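The modified z-score is easy to compute by hand. A sketch using the conventional 0.6745 scaling constant (which makes the MAD comparable to a standard deviation under normality) and the common 3.5 cutoff; the sample values are made up:

```python
import numpy as np

def modified_zscore(x: np.ndarray) -> np.ndarray:
    """Modified z-score: robust to extreme values because it uses median and MAD."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    return 0.6745 * (x - median) / mad

x = np.array([18.0, 21.0, 22.0, 20.0, 19.0, 23.0, 180.0])  # one gross error
scores = modified_zscore(x)
outliers = x[np.abs(scores) > 3.5]  # common cutoff for the modified z-score
print(outliers)  # → [180.]
```

Note that an ordinary z-score would be dragged toward the 180 value (it inflates the mean and standard deviation it is computed from), while the median and MAD ignore it.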
Multivariate Detection Methods
- Isolation Forest — isolates outliers by randomly partitioning feature space; fast; works in high dimensions
- Local Outlier Factor (LOF) — density-based; detects local outliers that wouldn't be flagged univariately
- DBSCAN — density clustering; noise points are outlier candidates; good for spatial data
- Mahalanobis distance — accounts for feature correlations; parametric; requires invertible covariance
- Autoencoder reconstruction error — high reconstruction loss = anomaly; works for complex patterns
```python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

df = pd.read_csv("dataset.csv")

# ---- IQR capping (winsorisation) ----
def cap_outliers_iqr(series, multiplier=1.5):
    Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - multiplier * IQR, Q3 + multiplier * IQR
    return series.clip(lower=lower, upper=upper), (lower, upper)

df["age"], age_bounds = cap_outliers_iqr(df["age"])
print(f"Age capped at {age_bounds}")

# ---- Z-score filtering (removes rows — use with caution) ----
z_scores = np.abs(stats.zscore(df[["income", "spending"]]))
df_clean = df[(z_scores < 3).all(axis=1)]
print(f"Removed {len(df) - len(df_clean)} rows via z-score filter")

# ---- Isolation Forest (multivariate) ----
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # 5% assumed outlier rate
feature_cols = ["age", "income", "spending", "credit_score"]
df["outlier_score"] = iso_forest.fit_predict(df[feature_cols])  # -1 = outlier, 1 = inlier
outliers = df[df["outlier_score"] == -1]
print(f"Isolation Forest flagged {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")

# Treatment options (pick based on domain knowledge):
# Option A: remove     df = df[df["outlier_score"] == 1]
# Option B: cap/clip   (use cap_outliers_iqr above)
# Option C: transform  df["income"] = np.log1p(df["income"])
# Option D: keep + flag (already done with the outlier_score column)
```
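Local Outlier Factor, from the multivariate list above, follows the same fit_predict convention as Isolation Forest. A minimal sketch on hypothetical synthetic data — a dense cloud plus one isolated point:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))           # dense inlier cloud around the origin
X = np.vstack([X, [[8.0, 8.0]]])       # one point far from any local neighbourhood

lof = LocalOutlierFactor(n_neighbors=10)  # density compared against 10 nearest neighbours
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier (same convention as IsolationForest)

print(labels[-1])  # the isolated point is flagged as -1
# negative_outlier_factor_ is ≈ -1 for inliers, much more negative for outliers
print(lof.negative_outlier_factor_[-1])
```

Unlike Isolation Forest, LOF (with default settings) has no `predict` for new data — it scores only the points it was fit on; pass `novelty=True` at construction if you need to score unseen points.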
🧹 Duplicate & Inconsistency Removal
Duplicates and format inconsistencies silently inflate your dataset and create spurious category splits. A model that learns "New York" and "new york" are different cities will fail systematically on any input that doesn't match training-time formatting.
Exact vs. Near-Duplicate Detection
- Exact duplicates — df.duplicated(); hash of all columns or a subset of key columns
- Near-duplicates (text) — MinHash + LSH (Locality-Sensitive Hashing); Jaccard similarity on shingles; efficient even for millions of documents
- Near-duplicates (images) — perceptual hash (pHash/dHash); cosine similarity of CNN embeddings
- Semantic duplicates — same meaning, different wording; requires embedding-based similarity search (FAISS)
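MinHash + LSH only approximates (at scale) the Jaccard similarity that exact shingle comparison computes. The underlying quantity is easy to sketch directly in pure Python, which is perfectly practical for small datasets; the example strings are made up:

```python
def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a case/whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

near = jaccard("Breaking: markets rally on rate cut",
               "Breaking - markets rally on rate cut!")
far = jaccard("Breaking: markets rally on rate cut",
              "Local team wins championship game")
print(f"{near:.2f} vs {far:.2f}")  # near-duplicates score high, unrelated text low
```

MinHash replaces each shingle set with a short signature whose agreement rate estimates this same Jaccard value, and LSH buckets signatures so that only likely matches are ever compared.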
String Standardisation
- Case normalisation — .str.lower().str.strip() as a minimum
- Regex cleaning — remove punctuation, extra whitespace, special characters
- Fuzzy matching — RapidFuzz / FuzzyWuzzy (token_sort_ratio); Levenshtein edit distance
- Canonical mapping — build a dictionary {"NY": "New York", "N.Y.": "New York"}; use for known value sets
- Entity resolution — link records across tables referring to the same real-world entity
```python
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.read_csv("customers.csv")

# ---- Exact duplicate removal ----
n_before = len(df)
df = df.drop_duplicates(subset=["email", "phone"], keep="first")
print(f"Removed {n_before - len(df)} exact duplicates")

# ---- String standardisation ----
df["city"] = df["city"].str.lower().str.strip()

# ---- Canonical category mapping ----
# Apply BEFORE stripping punctuation, so dotted variants like "n.y." still match
city_map = {
    "ny": "new york",
    "n.y.": "new york",
    "new york city": "new york",
    "nyc": "new york",
    "la": "los angeles",
    "l.a.": "los angeles",
}
df["city"] = df["city"].map(city_map).fillna(df["city"])
df["city"] = df["city"].str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation

# ---- Fuzzy deduplication for near-duplicates ----
# Given a reference list of canonical city names:
canonical_cities = ["new york", "los angeles", "chicago", "houston", "phoenix"]

def fuzzy_standardise(value, choices, threshold=85):
    if pd.isna(value):
        return value
    match, score, _ = process.extractOne(str(value), choices, scorer=fuzz.token_sort_ratio)
    return match if score >= threshold else value

df["city"] = df["city"].apply(lambda x: fuzzy_standardise(x, canonical_cities))

# ---- Category harmonisation ----
df["gender"] = df["gender"].str.lower().map({
    "m": "male", "man": "male", "male": "male",
    "f": "female", "woman": "female", "female": "female",
}).fillna("unknown")

print(df["city"].value_counts().head(10))
```
🔧 Cleaning Pipelines
Ad-hoc cleaning scripts are technical debt. The moment you need to apply the same cleaning logic to new data (a new batch, a test set, production inference), scattered notebook code becomes a liability. Building cleaning as a pipeline ensures reproducibility and prevents data leakage.
Always Keep the Raw Data
Never overwrite your raw data. Store the original files in an immutable location (S3 with versioning, DVC-tracked files) and always write cleaned data to a separate path. When your cleaning logic improves, you can re-run from scratch. Treat raw data as a read-only source of truth.
Data Quality Metrics to Track
- Completeness — % of non-null values per column; target >95% for key features
- Consistency — values conform to expected format/range; referential integrity holds
- Validity — values make domain sense (age between 0–120; probabilities sum to 1)
- Uniqueness — no unexpected duplicates in primary key columns
- Timeliness — data reflects current state; no stale records
- Accuracy — values match ground truth (hardest to measure automatically)
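The first four metrics are cheap to compute automatically on every incoming batch. A sketch with hypothetical column names and thresholds (accuracy and timeliness need external ground truth, so they are omitted):

```python
import pandas as pd

# Hypothetical batch: one null age, one out-of-range age, one duplicate user_id
df = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "age": [25, None, 47, 200],
    "email": ["a@x.com", "b@x.com", None, "c@x.com"],
})

report = {
    # Completeness: share of non-null values per column
    "completeness": df.notnull().mean().to_dict(),
    # Validity: values make domain sense, e.g. age within 0-120
    "validity_age": df["age"].dropna().between(0, 120).mean(),
    # Uniqueness: distinct primary-key values / rows (1.0 means no duplicates)
    "uniqueness_user_id": df["user_id"].nunique() / len(df),
}
print(report)
```

In production, the same numbers would feed threshold checks (e.g. fail the run if completeness of a key feature drops below 95%), which is exactly what schema-validation tools like Great Expectations or Pandera formalise.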
Pipeline Best Practices
- Fit on train, transform all — all statistics (mean, IQR bounds) computed on train split only
- Log everything removed — record how many rows/values were affected by each step
- Version your cleaning — tag dataset versions alongside model versions (use DVC or MLflow)
- Schema validation — use Great Expectations or Pandera to assert constraints on data before training
- Idempotent operations — running the pipeline twice should produce the same result
```python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)

# ---- Custom cleaning steps ----
def remove_duplicates(X):
    n_before = len(X)
    X = X.drop_duplicates()
    log.info(f"Removed {n_before - len(X)} duplicate rows")
    return X

def cap_outliers(X, cols=None, multiplier=1.5):
    X = X.copy()
    cols = cols if cols is not None else X.select_dtypes(include=np.number).columns
    for col in cols:
        Q1, Q3 = X[col].quantile(0.25), X[col].quantile(0.75)
        IQR = Q3 - Q1
        lower, upper = Q1 - multiplier * IQR, Q3 + multiplier * IQR
        n_capped = ((X[col] < lower) | (X[col] > upper)).sum()
        X[col] = X[col].clip(lower, upper)
        log.info(f"Capped {n_capped} outliers in '{col}' to [{lower:.2f}, {upper:.2f}]")
    return X

def standardise_strings(X, cols):
    X = X.copy()
    for col in cols:
        X[col] = X[col].str.lower().str.strip()
    return X

# ---- Build the cleaner ----
# Note: sklearn Pipelines work with estimators; for pandas DataFrames,
# use FunctionTransformer or a custom BaseEstimator + TransformerMixin
class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, numeric_cols, string_cols):
        self.numeric_cols = numeric_cols
        self.string_cols = string_cols

    def fit(self, X, y=None):
        # Compute ALL statistics on TRAIN data only (fitted attrs get a trailing underscore)
        self.iqr_bounds_ = {}
        for col in self.numeric_cols:
            Q1, Q3 = X[col].quantile(0.25), X[col].quantile(0.75)
            IQR = Q3 - Q1
            self.iqr_bounds_[col] = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
        self.imputer_ = SimpleImputer(strategy="median").fit(X[self.numeric_cols])
        return self

    def transform(self, X):
        X = X.copy()
        X = remove_duplicates(X)
        X = standardise_strings(X, self.string_cols)
        for col, (lo, hi) in self.iqr_bounds_.items():
            X[col] = X[col].clip(lo, hi)
        X[self.numeric_cols] = self.imputer_.transform(X[self.numeric_cols])
        return X

# Usage
cleaner = DataCleaner(
    numeric_cols=["age", "income", "balance"],
    string_cols=["city", "country"],
)
X_train_clean = cleaner.fit_transform(X_train)  # fit + transform train
X_test_clean = cleaner.transform(X_test)        # transform test with TRAIN statistics
```