📊 Series & DataFrame
Series — Labelled 1D Array
A pd.Series is a one-dimensional array with an associated index. The index can be integers, strings, datetimes, or any hashable type. Think of it as an ordered dictionary whose values are stored in a single NumPy array.
- Backed by a NumPy array (or extension array for nullable types)
- Index enables label-based alignment during arithmetic
- s.values — underlying NumPy array
- s.index — the Index object
- s.dtype — element dtype
- s.name — optional name (becomes the column name in a DataFrame)
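The attributes above can be checked directly. A minimal sketch with made-up data:

```python
import pandas as pd

# Build a small named Series and inspect its core attributes
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='demo')

print(s.values)   # the underlying NumPy array: [10 20 30]
print(s.index)    # Index(['a', 'b', 'c'], dtype='object')
print(s.dtype)    # int64
print(s.name)     # demo

# When placed in a DataFrame, the name becomes the column name
df = s.to_frame()
print(df.columns.tolist())  # ['demo']
```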
DataFrame — Tabular Data
A pd.DataFrame is a 2D labelled data structure — essentially a dict of Series sharing the same index. Each column is a Series. DataFrames are the primary object in any Pandas workflow.
- Columns can have different dtypes (unlike a 2D NumPy array)
- Row index and column names both labelled
- df.shape — (n_rows, n_cols)
- df.columns — column names (an Index)
- df.dtypes — dtype per column
- df.memory_usage(deep=True) — actual RAM used
Index Concept
The Index is central to Pandas. It enables O(1) label lookups, automatic alignment of Series during operations, and time-series resampling. Misaligned indices cause NaN fill rather than errors — a common gotcha for beginners.
- Default: RangeIndex(0, n) — memory efficient
- Custom: set with df.set_index('col')
- Reset: df.reset_index() — turns the index back into a column
- MultiIndex: hierarchical rows for grouped data
- DatetimeIndex: powers time-series operations
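A short sketch of set_index / reset_index and a MultiIndex, using invented city data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Bergen', 'Bergen'],
    'year': [2023, 2024, 2023, 2024],
    'pop':  [709, 716, 289, 291],
})

# Label-based lookups after set_index
by_city = df.set_index('city')
print(by_city.loc['Oslo'])               # both Oslo rows

# MultiIndex: hierarchical rows, addressed by tuple
mi = df.set_index(['city', 'year'])
print(mi.loc[('Bergen', 2024), 'pop'])   # 291

# reset_index turns the index levels back into columns
flat = mi.reset_index()
print(flat.columns.tolist())             # ['city', 'year', 'pop']
```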
import pandas as pd
import numpy as np
# ── Series creation ────────────────────────────────────────────────────────────
s1 = pd.Series([10, 20, 30, 40], name='values')
s2 = pd.Series({'a': 1.1, 'b': 2.2, 'c': 3.3}, name='floats')
print(s2['b']) # 2.2 — label-based access
print(s2.iloc[1]) # 2.2 — positional access (plain s2[1] is deprecated)
# Alignment: indices are matched, NaN for missing
s3 = pd.Series({'a': 100, 'b': 200, 'd': 400})
print(s2 + s3)
# a 101.1
# b 202.2
# c NaN ← 'c' not in s3
# d NaN ← 'd' not in s2
# ── DataFrame creation ────────────────────────────────────────────────────────
# From dict of lists (most common)
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'age': [25, 32, 28, 45, 31],
'salary': [70000, 85000, 72000, 110000, 68000],
'dept': ['Engineering', 'Marketing', 'Engineering', 'Finance', 'Marketing'],
'score': [88.5, 72.0, 91.2, 65.3, 78.8],
})
# From list of dicts (rows)
rows = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}, {'x': 5, 'y': 6}]
df2 = pd.DataFrame(rows)
# From NumPy array
arr = np.random.randn(5, 3)
df3 = pd.DataFrame(arr, columns=['feat_a', 'feat_b', 'feat_c'])
# ── Quick inspection ──────────────────────────────────────────────────────────
print(df.head(3)) # First 3 rows
print(df.tail(2)) # Last 2 rows
print(df.shape) # (5, 5)
print(df.dtypes) # dtype per column
df.info() # dtypes + non-null counts + memory (prints directly, returns None)
print(df.describe()) # count, mean, std, min, quartiles, max
print(df.describe(include='object')) # stats for categorical columns
print(df.value_counts('dept')) # frequency of each department
print(df.nunique()) # unique value count per column
💾 Loading & Saving Data
pd.read_csv Key Parameters
read_csv is the most-used Pandas function. Many performance and correctness issues stem from using wrong defaults — knowing the key parameters saves hours of debugging.
- sep='|' — delimiter (default comma)
- header=None — no header row in file
- names=[...] — specify column names
- usecols=['a','b'] — load only these columns
- dtype={'id': 'int32'} — override inferred dtypes
- na_values=['N/A', '-'] — extra strings → NaN
- parse_dates=['created_at'] — auto-parse timestamps
- nrows=10000 — load only first N rows
- chunksize=50000 — iterator for huge files
Other I/O Formats
Pandas supports a wide range of storage formats. Choosing the right format dramatically affects load time and file size for large datasets.
- pd.read_json() — JSON arrays or records
- pd.read_parquet() — columnar, compressed, fast
- pd.read_feather() — ultra-fast in-process
- pd.read_excel() — xlsx (requires openpyxl)
- pd.read_sql(query, conn) — from any SQLAlchemy engine
- pd.read_html(url) — scrape tables from web pages
- pd.read_clipboard() — paste from a spreadsheet
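A quick JSON round-trip sketch (StringIO avoids touching disk; passing a literal string to read_json is deprecated in recent pandas):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'id': [1, 2], 'score': [0.5, 0.9]})

# records orient: one JSON object per row, the shape most APIs emit
json_str = df.to_json(orient='records')
print(json_str)  # [{"id":1,"score":0.5},{"id":2,"score":0.9}]

# Round-trip back into a DataFrame
df_back = pd.read_json(StringIO(json_str), orient='records')
print(df_back)
```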
Saving Data
Always export processed datasets before feeding to ML pipelines so preprocessing can be reused without re-running. Parquet is the preferred format for any dataset above a few MB.
- df.to_csv('out.csv', index=False) — set index=False unless the index carries data
- df.to_parquet('out.parquet') — preserves dtypes
- df.to_json('out.json', orient='records')
- df.to_excel('out.xlsx', sheet_name='Data')
- df.to_sql('table', engine, if_exists='replace')
Always Use Parquet for Large Datasets
A 1GB CSV takes ~3-8 seconds to load and consumes 2-3x RAM during parsing. The same data as Parquet loads in ~0.3 seconds, uses 3-10x less disk space due to columnar compression, and preserves all dtypes exactly — no re-casting needed after load.
import pandas as pd
# ── Reading CSV with real-world options ───────────────────────────────────────
df = pd.read_csv(
'dataset.csv',
sep=',',
header=0, # first row is header (default)
usecols=['user_id', 'event', 'ts', 'amount'], # skip unwanted columns
dtype={
'user_id': 'int32', # saves memory vs int64
'amount': 'float32',
},
na_values=['NULL', 'N/A', '', '-'],
parse_dates=['ts'],
low_memory=False, # avoid mixed-type warnings
)
# ── Processing huge CSVs in chunks ────────────────────────────────────────────
chunk_iter = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunk_iter:
# Process each chunk (e.g., aggregate)
agg = chunk.groupby('category')['value'].sum()
results.append(agg)
final = pd.concat(results).groupby(level=0).sum()
# ── Parquet round-trip ────────────────────────────────────────────────────────
df.to_parquet('clean_data.parquet', index=False, compression='snappy')
df_loaded = pd.read_parquet('clean_data.parquet')
# Dtypes preserved exactly — no re-casting needed
# ── SQL query → DataFrame ─────────────────────────────────────────────────────
# from sqlalchemy import create_engine
# engine = create_engine('postgresql://user:pass@host/db')
# query = """
# SELECT user_id, COUNT(*) as events, SUM(amount) as total
# FROM transactions
# WHERE event_date >= '2024-01-01'
# GROUP BY user_id
# """
# df_sql = pd.read_sql(query, engine)
🧹 Cleaning & Handling Missing Data
Detecting Missing Data
Pandas represents missing values as NaN (float), pd.NaT (datetime), or pd.NA (nullable integer/string). Always audit missingness before any analysis or modelling.
- df.isnull().sum() — count NaN per column
- df.isnull().mean() * 100 — percent missing
- df.isnull().any(axis=1) — rows with any NaN
- df.notna().all(axis=1) — rows with no NaN
- Columns with >40% missing: consider dropping
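The audit-then-drop pattern from the list above, sketched on a toy frame (the 40% threshold is the rule of thumb, not a law):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],          # 25% missing
    'b': [np.nan, np.nan, np.nan, 1.0],    # 75% missing
    'c': [1, 2, 3, 4],                     # complete
})

# Percent missing per column
pct_missing = df.isnull().mean() * 100
print(pct_missing)

# Drop columns above the chosen threshold
keep = pct_missing[pct_missing <= 40].index
df_trimmed = df[keep]
print(df_trimmed.columns.tolist())  # ['a', 'c']
```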
Handling Missing Values
The right strategy depends on the mechanism: MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random). Never blindly fill or drop without understanding the cause.
- df.dropna() — drop any row with NaN
- df.dropna(subset=['col']) — only if a specific col is NaN
- df.fillna(df.mean(numeric_only=True)) — fill with column means
- df['col'].ffill() — forward fill (time series; fillna(method='ffill') is deprecated)
- df['col'].bfill() — backward fill
- Use SimpleImputer in scikit-learn for pipelines
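Forward and backward fill on a gappy hourly series (hypothetical sensor readings):

```python
import pandas as pd
import numpy as np

# Hourly readings with gaps, indexed by timestamp
idx = pd.date_range('2024-01-01', periods=6, freq='h')
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=idx)

# Forward fill: carry the last observation forward (typical for sensors)
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]

# Backward fill: pull the next observation back
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0, 6.0, 6.0]
```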
Duplicates & Type Conversion
Duplicate rows silently inflate statistics and model training. Type mismatches cause unexpected NaN after operations. Both are common sources of data bugs.
- df.duplicated().sum() — count duplicate rows
- df.drop_duplicates() — remove duplicate rows
- df.drop_duplicates(subset=['id']) — by key columns
- df['col'].astype('float32') — cast dtype
- pd.to_datetime(df['ts']) — parse date strings
- pd.to_numeric(df['col'], errors='coerce') — NaN on failure
import pandas as pd
import numpy as np
# ── Realistic cleaning pipeline ───────────────────────────────────────────────
# Simulate a messy dataset
data = {
'user_id': [1, 2, 2, 3, 4, 5, None, 6],
'age': ['25', '32', '32', None, '28', '999', '31', '45'],
'salary': [70000, None, None, 85000, 72000, 60000, 95000, None],
'email': [' [email protected] ', '[email protected]', '[email protected]',
'[email protected]', None, '[email protected]', '[email protected]', '[email protected]'],
'label': ['yes', 'no', 'no', 'yes', 'YES', 'No', 'yes', None],
}
df = pd.DataFrame(data)
print("=== Before Cleaning ===")
print(df)
print("\nMissing values:\n", df.isnull().sum())
# Step 1: Drop rows with missing user_id (primary key)
df = df.dropna(subset=['user_id'])
df['user_id'] = df['user_id'].astype(int)
# Step 2: Remove exact duplicate rows
print(f"\nDuplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()
# Step 3: Fix age column
df['age'] = pd.to_numeric(df['age'], errors='coerce') # unparseable values → NaN
# Replace implausible values with NaN
df.loc[df['age'] > 120, 'age'] = np.nan
df.loc[df['age'] < 18, 'age'] = np.nan
# Impute with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())
# Step 4: Salary — fill with group median (by age band)
df['age_band'] = pd.cut(df['age'], bins=[18, 30, 40, 120], labels=['young','mid','senior'])
df['salary'] = df.groupby('age_band')['salary'].transform(
lambda x: x.fillna(x.median())
)
df = df.drop(columns=['age_band'])
# Step 5: Clean string columns
df['email'] = df['email'].str.strip().str.lower()
df['label'] = df['label'].str.lower().str.strip()
df['label'] = df['label'].map({'yes': 1, 'no': 0}) # map leaves unmatched values as NaN
# Step 6: Drop rows still missing critical fields
df = df.dropna(subset=['email', 'label'])
df['label'] = df['label'].astype(int)
# Step 7: Report
print("\n=== After Cleaning ===")
print(df)
print("\nRemaining NaN:\n", df.isnull().sum())
print(f"Shape: {df.shape}")
🔨 Selection, Filtering & Transformation
loc vs iloc
loc is label-based: rows and columns are referenced by their index labels and column names. iloc is integer-position-based: 0-based integer positions, like NumPy. Never mix them up — they behave differently when the index is not a default RangeIndex.
- df.loc[2, 'salary'] — row with index label 2, salary column
- df.loc[2:5, ['name','salary']] — label slice (inclusive end)
- df.iloc[0, 3] — row 0, column 3 (0-based positions)
- df.iloc[1:4, :2] — rows 1-3, first 2 columns (exclusive end)
- df.iloc[-1] — last row
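The difference becomes visible once the index is not 0..n-1. A small sketch:

```python
import pandas as pd

# A non-default integer index makes loc and iloc diverge
df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Carol'], 'salary': [70000, 85000, 72000]},
    index=[10, 20, 30],
)

print(df.loc[20, 'salary'])   # 85000, row with LABEL 20
print(df.iloc[0, 1])          # 70000, row at POSITION 0, column 1

# Slicing: loc includes the end label, iloc excludes the end position
print(len(df.loc[10:20]))     # 2 rows (labels 10 and 20)
print(len(df.iloc[0:2]))      # 2 rows (positions 0 and 1)
```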
Boolean Filtering & query()
Boolean indexing with multiple conditions is the most common selection pattern. The query() method provides a SQL-like string syntax that is often more readable for complex conditions.
- df[df['age'] > 30] — rows where age > 30
- df[(df.age > 25) & (df.dept == 'Eng')] — AND
- df[df.dept.isin(['Eng','Finance'])] — membership
- df.query("age > 25 and salary > 80000") — string query
- df.query("dept in @depts") — inject a Python variable
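Boolean indexing and the equivalent query() call side by side, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 32, 28, 45],
    'salary': [70000, 85000, 72000, 110000],
    'dept': ['Eng', 'Marketing', 'Eng', 'Finance'],
})

# Boolean indexing: each condition needs its own parentheses
hits = df[(df['age'] > 25) & (df['dept'] == 'Eng')]
print(hits)

# Same filter via query, often more readable for long conditions
hits_q = df.query("age > 25 and dept == 'Eng'")

# @ injects a Python variable into the query expression
depts = ['Eng', 'Finance']
print(df.query("dept in @depts and salary > 80000"))
```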
apply / map / groupby
Transforming data element-wise, row-wise, or within groups is the core of feature engineering. Prefer vectorised operations over apply where possible — apply runs a Python loop internally.
- df['col'].map(dict) — recode values via dictionary
- df['col'].apply(func) — element-wise function
- df.apply(func, axis=1) — row-wise function (slow)
- df.assign(new_col=...) — add column, returns copy
- df.groupby('dept').agg({'salary': 'mean', 'age': 'max'})
- df.groupby('dept')['sal'].transform('mean') — broadcast back
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
'label': rng.integers(0, 3, n), # 3-class problem
'feat_1': rng.standard_normal(n),
'feat_2': rng.standard_normal(n) * 2 + 1,
'feat_3': rng.uniform(0, 10, n),
'category': rng.choice(['A','B','C','D'], n),
})
# ── Feature statistics per class (a standard EDA step) ───────────────────────
# Basic stats per label
class_stats = df.groupby('label').agg(
count=('feat_1', 'count'),
f1_mean=('feat_1', 'mean'),
f1_std=('feat_1', 'std'),
f2_mean=('feat_2', 'mean'),
f2_median=('feat_2', 'median'),
f3_q25=('feat_3', lambda x: x.quantile(0.25)),
f3_q75=('feat_3', lambda x: x.quantile(0.75)),
).round(3)
print(class_stats)
# ── Feature engineering with assign ──────────────────────────────────────────
df = df.assign(
feat_interaction = df['feat_1'] * df['feat_2'],
feat_1_abs = df['feat_1'].abs(),
feat_3_log = np.log1p(df['feat_3']),
is_category_A = (df['category'] == 'A').astype(int),
)
# ── Conditional transformation with np.where / map ───────────────────────────
df['feat_3_bin'] = pd.cut(
df['feat_3'],
bins=[0, 3, 7, 10],
labels=['low', 'mid', 'high']
)
df['label_name'] = df['label'].map({0: 'class_0', 1: 'class_1', 2: 'class_2'})
# ── Class-conditioned normalisation (group z-score) ───────────────────────────
for col in ['feat_1', 'feat_2', 'feat_3']:
group_mean = df.groupby('label')[col].transform('mean')
group_std = df.groupby('label')[col].transform('std')
df[f'{col}_znorm'] = (df[col] - group_mean) / (group_std + 1e-8)
# ── Pivot table — feature means by label and category ─────────────────────────
pivot = df.pivot_table(
values='feat_1',
index='label',
columns='category',
aggfunc='mean'
).round(3)
print("\nFeat_1 mean by label x category:")
print(pivot)
🔁 Merging & Reshaping
| Operation | Function | When to Use | SQL Equivalent |
|---|---|---|---|
| Inner join | pd.merge(df1, df2, on='id', how='inner') | Keep only matching rows in both tables | INNER JOIN |
| Left join | pd.merge(df1, df2, on='id', how='left') | Keep all rows from left, NaN for unmatched right | LEFT JOIN |
| Right join | pd.merge(df1, df2, on='id', how='right') | Keep all rows from right table | RIGHT JOIN |
| Outer join | pd.merge(df1, df2, on='id', how='outer') | Keep all rows from both, NaN for mismatches | FULL OUTER JOIN |
| Stack vertically | pd.concat([df1, df2], axis=0) | Append rows (same schema, different data) | UNION ALL |
| Stack horizontally | pd.concat([df1, df2], axis=1) | Add columns side by side (same row count) | — |
| Wide to long | df.melt(id_vars=['id'], value_vars=[...]) | Unpivot column headers into rows | UNPIVOT |
| Long to wide | df.pivot(index='id', columns='metric', values='val') | One row per entity with metric columns | PIVOT |
Merge Pitfalls
Always check the output row count after a merge. A many-to-many join will silently multiply rows (Cartesian product). Use validate='1:1', '1:m', or 'm:1' to assert the expected relationship and raise an error if violated.
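The silent row blow-up and the validate guard, sketched on tiny frames with duplicated keys:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 1, 2], 'x': [10, 11, 20]})
right = pd.DataFrame({'id': [1, 1, 2], 'y': [100, 101, 200]})

# Duplicate keys on both sides multiply: 2x2 matches for id=1, plus 1 for id=2
out = pd.merge(left, right, on='id')
print(len(left), len(right), '->', len(out))  # 3 3 -> 5

# validate turns the silent Cartesian blow-up into a hard error
try:
    pd.merge(left, right, on='id', validate='1:1')
except pd.errors.MergeError as e:
    print('MergeError:', e)
```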
import pandas as pd
import numpy as np
# ── Sample datasets ───────────────────────────────────────────────────────────
users = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'dept_id': [10, 20, 10, 30, 20],
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30, 40],
'dept_name': ['Engineering', 'Marketing', 'Finance', 'HR'],
'budget': [500000, 200000, 300000, 150000],
})
transactions = pd.DataFrame({
'user_id': [1, 1, 2, 3, 3, 3, 4],
'month': ['Jan','Feb','Jan','Jan','Feb','Mar','Jan'],
'revenue': [120, 95, 200, 80, 150, 60, 310],
})
# ── Join users to departments ─────────────────────────────────────────────────
enriched = pd.merge(
users, departments,
on='dept_id',
how='left', # keep all users even if dept not found
validate='m:1', # assert dept_id is unique in departments
)
print(enriched[['name', 'dept_name', 'budget']])
# ── Aggregate transactions and join back ──────────────────────────────────────
user_totals = transactions.groupby('user_id').agg(
total_revenue = ('revenue', 'sum'),
n_transactions = ('revenue', 'count'),
avg_revenue = ('revenue', 'mean'),
).reset_index()
full = pd.merge(enriched, user_totals, on='user_id', how='left')
full['total_revenue'] = full['total_revenue'].fillna(0)
print(full[['name', 'dept_name', 'total_revenue', 'n_transactions']].to_string())
# ── Wide to Long: multiple monthly columns → rows ─────────────────────────────
wide = transactions.pivot_table(
index='user_id',
columns='month',
values='revenue',
aggfunc='sum',
fill_value=0
).reset_index()
print("\nWide format:")
print(wide)
long = wide.melt(
id_vars=['user_id'],
var_name='month',
value_name='revenue'
).sort_values(['user_id', 'month'])
print("\nLong format:")
print(long.head(10))
# ── concat: stack train and validation sets ───────────────────────────────────
train = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6], 'split': 'train'})
valid = pd.DataFrame({'x': [7,8], 'y': [9,10], 'split': 'valid'})
combined = pd.concat([train, valid], ignore_index=True)
print("\nCombined:", combined.shape)