⏱ 6 min read 📊 Beginner 🗓 Updated Jan 2025

📊 Series & DataFrame

Series — Labelled 1D Array

A pd.Series is a one-dimensional array with an associated index. The index can hold integers, strings, datetimes, or any hashable type. Think of it as an ordered dictionary mapping index labels to values, backed by a NumPy array.

  • Backed by a NumPy array (or extension array for nullable types)
  • Index enables label-based alignment during arithmetic
  • s.values — underlying NumPy array
  • s.index — the Index object
  • s.dtype — element dtype
  • s.name — optional name (becomes column name in DataFrame)

DataFrame — Tabular Data

A pd.DataFrame is a 2D labelled data structure — essentially a dict of Series sharing the same index. Each column is a Series. DataFrames are the primary object in any Pandas workflow.

  • Columns can have different dtypes (unlike NumPy 2D array)
  • Row index and column names both labelled
  • df.shape — (n_rows, n_cols)
  • df.columns — column names Index
  • df.dtypes — dtype per column
  • df.memory_usage(deep=True) — actual RAM used

Index Concept

The Index is central to Pandas. It enables O(1) label lookups, automatic alignment of Series during operations, and time-series resampling. Misaligned indices cause NaN fill rather than errors — a common gotcha for beginners.

  • Default: RangeIndex(0, n) — memory efficient
  • Custom: set with df.set_index('col')
  • Reset: df.reset_index() — turns index back to column
  • MultiIndex: hierarchical rows for grouped data
  • DatetimeIndex: powers time-series operations
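The index bullets above in one short sketch (the 'date' and 'sales' columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date':  pd.date_range('2024-01-01', periods=4, freq='D'),
    'sales': [100, 120, 90, 110],
})
print(df.index)                       # RangeIndex(start=0, stop=4, step=1) — the default

df = df.set_index('date')             # 'date' becomes a DatetimeIndex
print(df.loc['2024-01-02', 'sales'])  # 120 — label-based lookup by date string

df = df.reset_index()                 # index turned back into a regular column
print(df.columns.tolist())            # ['date', 'sales']
```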
import pandas as pd
import numpy as np

# ── Series creation ────────────────────────────────────────────────────────────
s1 = pd.Series([10, 20, 30, 40], name='values')
s2 = pd.Series({'a': 1.1, 'b': 2.2, 'c': 3.3}, name='floats')
print(s2['b'])     # 2.2  — label-based access
print(s2.iloc[1])  # 2.2  — positional access (plain s2[1] on a labelled index is deprecated)

# Alignment: indices are matched, NaN for missing
s3 = pd.Series({'a': 100, 'b': 200, 'd': 400})
print(s2 + s3)
# a    101.1
# b    202.2
# c      NaN   ← 'c' not in s3
# d      NaN   ← 'd' not in s2

# ── DataFrame creation ────────────────────────────────────────────────────────
# From dict of lists (most common)
df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'age':    [25, 32, 28, 45, 31],
    'salary': [70000, 85000, 72000, 110000, 68000],
    'dept':   ['Engineering', 'Marketing', 'Engineering', 'Finance', 'Marketing'],
    'score':  [88.5, 72.0, 91.2, 65.3, 78.8],
})

# From list of dicts (rows)
rows = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}, {'x': 5, 'y': 6}]
df2 = pd.DataFrame(rows)

# From NumPy array
arr = np.random.randn(5, 3)
df3 = pd.DataFrame(arr, columns=['feat_a', 'feat_b', 'feat_c'])

# ── Quick inspection ──────────────────────────────────────────────────────────
print(df.head(3))           # First 3 rows
print(df.tail(2))           # Last 2 rows
print(df.shape)             # (5, 5)
print(df.dtypes)            # dtype per column
df.info()                   # dtypes + non-null counts + memory (prints directly; returns None)
print(df.describe())        # count, mean, std, min, quartiles, max
print(df.describe(include='object'))   # stats for categorical columns
print(df.value_counts('dept'))         # frequency of each department
print(df.nunique())         # unique value count per column

💾 Loading & Saving Data

pd.read_csv Key Parameters

read_csv is the most-used Pandas function. Many performance and correctness issues stem from using wrong defaults — knowing the key parameters saves hours of debugging.

  • sep='|' — delimiter (default comma)
  • header=None — no header row in file
  • names=[...] — specify column names
  • usecols=['a','b'] — load only these columns
  • dtype={'id': 'int32'} — override inferred dtypes
  • na_values=['N/A', '-'] — extra strings → NaN
  • parse_dates=['created_at'] — auto parse timestamps
  • nrows=10000 — load only first N rows
  • chunksize=50000 — iterator for huge files

Other I/O Formats

Pandas supports a wide range of storage formats. Choosing the right format dramatically affects load time and file size for large datasets.

  • pd.read_json() — JSON arrays or records
  • pd.read_parquet() — columnar, compressed, fast
  • pd.read_feather() — very fast Arrow-based interchange format
  • pd.read_excel() — xlsx (requires openpyxl)
  • pd.read_sql(query, conn) — from any SQLAlchemy engine
  • pd.read_html(url) — scrape tables from web pages
  • pd.read_clipboard() — paste from spreadsheet
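A minimal JSON round-trip as a sketch of the formats above (the tiny two-column frame is invented for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'value': [3.5, 4.5]})

# orient='records': one JSON object per row — the most interchange-friendly layout
json_str = df.to_json(orient='records')
print(json_str)   # [{"id":1,"value":3.5},{"id":2,"value":4.5}]

# Round-trip back; wrap the string in StringIO (passing literal JSON strings is deprecated)
df_back = pd.read_json(io.StringIO(json_str), orient='records')
print(df_back)
```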

Saving Data

Always export processed datasets before feeding to ML pipelines so preprocessing can be reused without re-running. Parquet is the preferred format for any dataset above a few MB.

  • df.to_csv('out.csv', index=False) — always set index=False
  • df.to_parquet('out.parquet') — preserves dtypes
  • df.to_json('out.json', orient='records')
  • df.to_excel('out.xlsx', sheet_name='Data')
  • df.to_sql('table', engine, if_exists='replace')

Always Use Parquet for Large Datasets

A 1GB CSV takes ~3-8 seconds to load and consumes 2-3x RAM during parsing. The same data as Parquet loads in ~0.3 seconds, uses 3-10x less disk space due to columnar compression, and preserves all dtypes exactly — no re-casting needed after load.

import pandas as pd

# ── Reading CSV with real-world options ───────────────────────────────────────
df = pd.read_csv(
    'dataset.csv',
    sep=',',
    header=0,                          # first row is header (default)
    usecols=['user_id', 'event', 'ts', 'amount'],   # skip unwanted columns
    dtype={
        'user_id': 'int32',            # saves memory vs int64
        'amount':  'float32',
    },
    na_values=['NULL', 'N/A', '', '-'],
    parse_dates=['ts'],
    low_memory=False,                  # avoid mixed-type warnings
)

# ── Processing huge CSVs in chunks ────────────────────────────────────────────
chunk_iter = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunk_iter:
    # Process each chunk (e.g., aggregate)
    agg = chunk.groupby('category')['value'].sum()
    results.append(agg)
final = pd.concat(results).groupby(level=0).sum()

# ── Parquet round-trip ────────────────────────────────────────────────────────
df.to_parquet('clean_data.parquet', index=False, compression='snappy')
df_loaded = pd.read_parquet('clean_data.parquet')
# Dtypes preserved exactly — no re-casting needed

# ── SQL query → DataFrame ─────────────────────────────────────────────────────
# from sqlalchemy import create_engine
# engine = create_engine('postgresql://user:pass@host/db')
# query = """
#     SELECT user_id, COUNT(*) as events, SUM(amount) as total
#     FROM transactions
#     WHERE event_date >= '2024-01-01'
#     GROUP BY user_id
# """
# df_sql = pd.read_sql(query, engine)

🧹 Cleaning & Handling Missing Data

Detecting Missing Data

Pandas represents missing values as NaN (float), pd.NaT (datetime), or pd.NA (nullable integer/string). Always audit missingness before any analysis or modelling.

  • df.isnull().sum() — count NaN per column
  • df.isnull().mean() * 100 — percent missing
  • df.isnull().any(axis=1) — rows with any NaN
  • df.notna().all(axis=1) — rows with no NaN
  • Columns with >40% missing: consider dropping
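A small missingness audit following the bullets above (columns 'a', 'b', 'c' and the 40% threshold are illustrative, not a rule):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [10, 20, 30, 40],
})

# Percent missing per column, worst first
pct_missing = df.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing)
# a    50.0
# b    25.0
# c     0.0

# Flag columns above the threshold as drop candidates (the cutoff is a judgement call)
drop_candidates = pct_missing[pct_missing > 40].index.tolist()
print(drop_candidates)   # ['a']
```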

Handling Missing Values

The right strategy depends on the mechanism: MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random). Never blindly fill or drop without understanding the cause.

  • df.dropna() — drop any row with NaN
  • df.dropna(subset=['col']) — only if specific col is NaN
  • df.fillna(df.mean()) — fill with column means
  • df['col'].ffill() — forward fill (time series)
  • df['col'].bfill() — backward fill (fillna(method=...) is deprecated)
  • Use SimpleImputer in scikit-learn for pipelines
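Forward and backward fill in a minimal time-series sketch (the five-day series is invented for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Forward fill: propagate the last observation — the usual choice for time series
print(s.ffill().tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]

# Backward fill: take the next observation instead; a trailing NaN has nothing to pull from
print(s.bfill().tolist())   # [1.0, 4.0, 4.0, 4.0, nan]
```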

Duplicates & Type Conversion

Duplicate rows silently inflate statistics and model training. Type mismatches cause unexpected NaN after operations. Both are common sources of data bugs.

  • df.duplicated().sum() — count duplicate rows
  • df.drop_duplicates() — remove duplicate rows
  • df.drop_duplicates(subset=['id']) — by key columns
  • df['col'].astype('float32') — cast dtype
  • pd.to_datetime(df['ts']) — parse date strings
  • pd.to_numeric(df['col'], errors='coerce') — NaN on failure
import pandas as pd
import numpy as np

# ── Realistic cleaning pipeline ───────────────────────────────────────────────
# Simulate a messy dataset
data = {
    'user_id':  [1, 2, 2, 3, 4, 5, None, 6],
    'age':      ['25', '32', '32', None, '28', '999', '31', '45'],
    'salary':   [70000, None, None, 85000, 72000, 60000, 95000, None],
    'email':    ['  [email protected]  ', '[email protected]', '[email protected]',
                 '[email protected]', None, '[email protected]', '[email protected]', '[email protected]'],
    'label':    ['yes', 'no', 'no', 'yes', 'YES', 'No', 'yes', None],
}
df = pd.DataFrame(data)

print("=== Before Cleaning ===")
print(df)
print("\nMissing values:\n", df.isnull().sum())

# Step 1: Drop rows with missing user_id (primary key)
df = df.dropna(subset=['user_id'])
df['user_id'] = df['user_id'].astype(int)

# Step 2: Remove exact duplicate rows
print(f"\nDuplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Step 3: Fix age column
df['age'] = pd.to_numeric(df['age'], errors='coerce')   # None / unparseable values → NaN
# Replace implausible values with NaN
df.loc[df['age'] > 120, 'age'] = np.nan
df.loc[df['age'] < 18,  'age'] = np.nan
# Impute with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())

# Step 4: Salary — fill with group median (by age band)
df['age_band'] = pd.cut(df['age'], bins=[18, 30, 40, 120], labels=['young','mid','senior'])
df['salary'] = df.groupby('age_band')['salary'].transform(
    lambda x: x.fillna(x.median())
)
df = df.drop(columns=['age_band'])

# Step 5: Clean string columns
df['email'] = df['email'].str.strip().str.lower()
df['label'] = df['label'].str.lower().str.strip()
df['label'] = df['label'].map({'yes': 1, 'no': 0})

# Step 6: Drop rows still missing critical fields
df = df.dropna(subset=['email', 'label'])
df['label'] = df['label'].astype(int)

# Step 7: Report
print("\n=== After Cleaning ===")
print(df)
print("\nRemaining NaN:\n", df.isnull().sum())
print(f"Shape: {df.shape}")

🔨 Selection, Filtering & Transformation

loc vs iloc

loc is label-based: rows and columns are referenced by their index labels and column names. iloc is integer-position-based: 0-based integer positions, like NumPy. Never mix them up — they behave differently when the index is not a default RangeIndex.

  • df.loc[2, 'salary'] — row with index label 2, salary col
  • df.loc[2:5, ['name','salary']] — label slice (inclusive end)
  • df.iloc[0, 3] — row 0, column 3 (0-based positions)
  • df.iloc[1:4, :2] — rows 1-3, first 2 columns (exclusive end)
  • df.iloc[-1] — last row
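The loc/iloc split is easiest to see on a non-default index, where labels and positions diverge (the small frame and labels 10/20/30 are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Carol'], 'salary': [70000, 85000, 72000]},
    index=[10, 20, 30],   # integer labels that are NOT positions
)

print(df.loc[20, 'salary'])   # 85000 — the row whose label is 20
print(df.iloc[0, 1])          # 70000 — row at position 0, column at position 1
print(len(df.loc[10:20]))     # 2 — label slice INCLUDES the end label
print(len(df.iloc[0:2]))      # 2 — position slice excludes the end, NumPy-style
```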

Boolean Filtering & query()

Boolean indexing with multiple conditions is the most common selection pattern. The query() method provides a SQL-like string syntax that is often more readable for complex conditions.

  • df[df['age'] > 30] — rows where age > 30
  • df[(df.age > 25) & (df.dept == 'Eng')] — AND
  • df[df.dept.isin(['Eng','Finance'])] — membership
  • df.query("age > 25 and salary > 80000") — string query
  • df.query("dept in @depts") — inject Python variable
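The same filter written both ways — boolean mask and query() string (the toy frame mirrors the bullets above):

```python
import pandas as pd

df = pd.DataFrame({
    'age':    [25, 32, 28, 45, 31],
    'salary': [70000, 85000, 72000, 110000, 68000],
    'dept':   ['Eng', 'Marketing', 'Eng', 'Finance', 'Marketing'],
})

# Boolean masks: parenthesise each condition and use & / | (never `and` / `or`)
mask = (df['age'] > 25) & (df['dept'] == 'Eng')
print(df[mask])        # one row: age 28, dept Eng

# Equivalent query() string; @ injects a Python variable into the expression
depts = ['Eng', 'Finance']
subset = df.query("age > 25 and dept in @depts")
print(len(subset))     # 2
```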

apply / map / groupby

Transforming data element-wise, row-wise, or within groups is the core of feature engineering. Prefer vectorised operations over apply where possible — apply runs a Python loop internally.

  • df['col'].map(dict) — recode values via dictionary
  • df['col'].apply(func) — element-wise function
  • df.apply(func, axis=1) — row-wise function (slow)
  • df.assign(new_col=...) — add column, returns copy
  • df.groupby('dept').agg({'salary': 'mean', 'age': 'max'})
  • df.groupby('dept')['salary'].transform('mean') — broadcast back
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'label':   rng.integers(0, 3, n),            # 3-class problem
    'feat_1':  rng.standard_normal(n),
    'feat_2':  rng.standard_normal(n) * 2 + 1,
    'feat_3':  rng.uniform(0, 10, n),
    'category': rng.choice(['A','B','C','D'], n),
})

# ── Feature statistics per class (a standard EDA step) ───────────────────────

# Basic stats per label
class_stats = df.groupby('label').agg(
    count=('feat_1', 'count'),
    f1_mean=('feat_1', 'mean'),
    f1_std=('feat_1', 'std'),
    f2_mean=('feat_2', 'mean'),
    f2_median=('feat_2', 'median'),
    f3_q25=('feat_3', lambda x: x.quantile(0.25)),
    f3_q75=('feat_3', lambda x: x.quantile(0.75)),
).round(3)
print(class_stats)

# ── Feature engineering with assign ──────────────────────────────────────────
df = df.assign(
    feat_interaction = df['feat_1'] * df['feat_2'],
    feat_1_abs       = df['feat_1'].abs(),
    feat_3_log       = np.log1p(df['feat_3']),
    is_category_A    = (df['category'] == 'A').astype(int),
)

# ── Conditional transformation with np.where / map ───────────────────────────
df['feat_3_bin'] = pd.cut(
    df['feat_3'],
    bins=[0, 3, 7, 10],
    labels=['low', 'mid', 'high']
)
df['label_name'] = df['label'].map({0: 'class_0', 1: 'class_1', 2: 'class_2'})

# ── Class-conditioned normalisation (group z-score) ───────────────────────────
for col in ['feat_1', 'feat_2', 'feat_3']:
    group_mean = df.groupby('label')[col].transform('mean')
    group_std  = df.groupby('label')[col].transform('std')
    df[f'{col}_znorm'] = (df[col] - group_mean) / (group_std + 1e-8)

# ── Pivot table — feature means by label and category ─────────────────────────
pivot = df.pivot_table(
    values='feat_1',
    index='label',
    columns='category',
    aggfunc='mean'
).round(3)
print("\nFeat_1 mean by label x category:")
print(pivot)

🔁 Merging & Reshaping

Common operations — function, when to use, SQL equivalent:

  • Inner join — pd.merge(df1, df2, on='id', how='inner') — keep only rows that match in both tables — INNER JOIN
  • Left join — pd.merge(df1, df2, on='id', how='left') — keep all left rows, NaN for unmatched right — LEFT JOIN
  • Right join — pd.merge(df1, df2, on='id', how='right') — keep all rows from the right table — RIGHT JOIN
  • Outer join — pd.merge(df1, df2, on='id', how='outer') — keep all rows from both, NaN for mismatches — FULL OUTER JOIN
  • Stack vertically — pd.concat([df1, df2], axis=0) — append rows (same schema, different data) — UNION ALL
  • Stack horizontally — pd.concat([df1, df2], axis=1) — add columns side by side (same row count) — no direct SQL equivalent
  • Wide to long — df.melt(id_vars=['id'], value_vars=[...]) — unpivot column headers into rows — UNPIVOT
  • Long to wide — df.pivot(index='id', columns='metric', values='val') — one row per entity with metric columns — PIVOT

Merge Pitfalls

Always check the output row count after a merge. A many-to-many join will silently multiply rows (Cartesian product). Use validate='1:1', '1:m', or 'm:1' to assert the expected relationship and raise an error if violated.
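The blow-up and the validate guard in a minimal sketch (the tiny frames with duplicated id=1 are invented to trigger the problem):

```python
import pandas as pd

left  = pd.DataFrame({'id': [1, 1, 2], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [1, 1, 2], 'y': [10, 20, 30]})

# id=1 appears twice on BOTH sides → 2×2 Cartesian matches, plus one for id=2
blown_up = pd.merge(left, right, on='id')
print(len(blown_up))   # 5 — silently more rows than either input

# validate asserts the expected relationship and fails fast instead
try:
    pd.merge(left, right, on='id', validate='1:1')
except pd.errors.MergeError as e:
    print('MergeError:', e)
```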

import pandas as pd
import numpy as np

# ── Sample datasets ───────────────────────────────────────────────────────────
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name':    ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'dept_id': [10, 20, 10, 30, 20],
})

departments = pd.DataFrame({
    'dept_id':   [10, 20, 30, 40],
    'dept_name': ['Engineering', 'Marketing', 'Finance', 'HR'],
    'budget':    [500000, 200000, 300000, 150000],
})

transactions = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3, 4],
    'month':   ['Jan','Feb','Jan','Jan','Feb','Mar','Jan'],
    'revenue': [120, 95, 200, 80, 150, 60, 310],
})

# ── Join users to departments ─────────────────────────────────────────────────
enriched = pd.merge(
    users, departments,
    on='dept_id',
    how='left',          # keep all users even if dept not found
    validate='m:1',      # assert dept_id is unique in departments
)
print(enriched[['name', 'dept_name', 'budget']])

# ── Aggregate transactions and join back ──────────────────────────────────────
user_totals = transactions.groupby('user_id').agg(
    total_revenue = ('revenue', 'sum'),
    n_transactions = ('revenue', 'count'),
    avg_revenue = ('revenue', 'mean'),
).reset_index()

full = pd.merge(enriched, user_totals, on='user_id', how='left')
full['total_revenue'] = full['total_revenue'].fillna(0)
print(full[['name', 'dept_name', 'total_revenue', 'n_transactions']].to_string())

# ── Wide to Long: multiple monthly columns → rows ─────────────────────────────
wide = transactions.pivot_table(
    index='user_id',
    columns='month',
    values='revenue',
    aggfunc='sum',
    fill_value=0
).reset_index()
print("\nWide format:")
print(wide)

long = wide.melt(
    id_vars=['user_id'],
    var_name='month',
    value_name='revenue'
).sort_values(['user_id', 'month'])
print("\nLong format:")
print(long.head(10))

# ── concat: stack train and validation sets ───────────────────────────────────
train = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6], 'split': 'train'})
valid = pd.DataFrame({'x': [7,8],   'y': [9,10],  'split': 'valid'})
combined = pd.concat([train, valid], ignore_index=True)
print("\nCombined:", combined.shape)