📊 Series & DataFrame
Series — Labelled 1D Array
A pd.Series is a one-dimensional array with an associated index. The index can be integers, strings, datetimes, or any hashable type. Think of it as an ordered dictionary whose values are stored in a single NumPy array.
- Backed by a NumPy array (or extension array for nullable types)
- Index enables label-based alignment during arithmetic
- s.values — underlying NumPy array
- s.index — the Index object
- s.dtype — element dtype
- s.name — optional name (becomes the column name in a DataFrame)
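The attributes above can be checked directly. A minimal sketch with made-up data:

```python
import pandas as pd

# Build a small named Series and inspect its core attributes
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='demo')

print(s.values)   # the underlying NumPy array: [10 20 30]
print(s.index)    # Index(['a', 'b', 'c'], dtype='object')
print(s.dtype)    # int64
print(s.name)     # demo

# When placed in a DataFrame, the name becomes the column name
df = s.to_frame()
print(df.columns.tolist())  # ['demo']
```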
DataFrame — Tabular Data
A pd.DataFrame is a 2D labelled data structure — essentially a dict of Series sharing the same index. Each column is a Series. DataFrames are the primary object in any Pandas workflow.
- Columns can have different dtypes (unlike a 2D NumPy array)
- Row index and column names both labelled
- df.shape — (n_rows, n_cols)
- df.columns — column names (an Index)
- df.dtypes — dtype per column
- df.memory_usage(deep=True) — actual RAM used
Index Concept
The Index is central to Pandas. It enables O(1) label lookups, automatic alignment of Series during operations, and time-series resampling. Misaligned indices cause NaN fill rather than errors — a common gotcha for beginners.
- Default: RangeIndex(0, n) — memory efficient
- Custom: set with df.set_index('col')
- Reset: df.reset_index() — turns the index back into a column
- MultiIndex: hierarchical rows for grouped data
- DatetimeIndex: powers time-series operations
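A short sketch of set_index / reset_index and a MultiIndex, using invented city data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Bergen', 'Bergen'],
    'year': [2023, 2024, 2023, 2024],
    'pop':  [709, 716, 289, 291],
})

# Label-based lookups after set_index
by_city = df.set_index('city')
print(by_city.loc['Oslo'])               # both Oslo rows

# MultiIndex: hierarchical rows, addressed by tuple
mi = df.set_index(['city', 'year'])
print(mi.loc[('Bergen', 2024), 'pop'])   # 291

# reset_index turns the index levels back into columns
flat = mi.reset_index()
print(flat.columns.tolist())             # ['city', 'year', 'pop']
```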
import pandas as pd
import numpy as np
# ── Series creation ────────────────────────────────────────────────────────────
s1 = pd.Series([10, 20, 30, 40], name='values')
s2 = pd.Series({'a': 1.1, 'b': 2.2, 'c': 3.3}, name='floats')
print(s2['b']) # 2.2 — label-based access
print(s2.iloc[1]) # 2.2 — positional access (plain s2[1] is deprecated)
# Alignment: indices are matched, NaN for missing
s3 = pd.Series({'a': 100, 'b': 200, 'd': 400})
print(s2 + s3)
# a 101.1
# b 202.2
# c NaN ← 'c' not in s3
# d NaN ← 'd' not in s2
# ── DataFrame creation ────────────────────────────────────────────────────────
# From dict of lists (most common)
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'age': [25, 32, 28, 45, 31],
'salary': [70000, 85000, 72000, 110000, 68000],
'dept': ['Engineering', 'Marketing', 'Engineering', 'Finance', 'Marketing'],
'score': [88.5, 72.0, 91.2, 65.3, 78.8],
})
# From list of dicts (rows)
rows = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}, {'x': 5, 'y': 6}]
df2 = pd.DataFrame(rows)
# From NumPy array
arr = np.random.randn(5, 3)
df3 = pd.DataFrame(arr, columns=['feat_a', 'feat_b', 'feat_c'])
# ── Quick inspection ──────────────────────────────────────────────────────────
print(df.head(3)) # First 3 rows
print(df.tail(2)) # Last 2 rows
print(df.shape) # (5, 5)
print(df.dtypes) # dtype per column
df.info() # dtypes + non-null counts + memory (prints directly, returns None)
print(df.describe()) # count, mean, std, min, quartiles, max
print(df.describe(include='object')) # stats for categorical columns
print(df.value_counts('dept')) # frequency of each department
print(df.nunique()) # unique value count per column
💾 Loading & Saving Data
pd.read_csv Key Parameters
read_csv is the most-used Pandas function. Many performance and correctness issues stem from using wrong defaults — knowing the key parameters saves hours of debugging.
- sep='|' — delimiter (default comma)
- header=None — no header row in file
- names=[...] — specify column names
- usecols=['a','b'] — load only these columns
- dtype={'id': 'int32'} — override inferred dtypes
- na_values=['N/A', '-'] — extra strings → NaN
- parse_dates=['created_at'] — auto-parse timestamps
- nrows=10000 — load only first N rows
- chunksize=50000 — iterator for huge files
Other I/O Formats
Pandas supports a wide range of storage formats. Choosing the right format dramatically affects load time and file size for large datasets.
- pd.read_json() — JSON arrays or records
- pd.read_parquet() — columnar, compressed, fast
- pd.read_feather() — ultra-fast in-process
- pd.read_excel() — xlsx (requires openpyxl)
- pd.read_sql(query, conn) — from any SQLAlchemy engine
- pd.read_html(url) — scrape tables from web pages
- pd.read_clipboard() — paste from a spreadsheet
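A quick JSON round-trip sketch (StringIO avoids touching disk; passing a literal string to read_json is deprecated in recent pandas):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'id': [1, 2], 'score': [0.5, 0.9]})

# records orient: one JSON object per row, the shape most APIs emit
json_str = df.to_json(orient='records')
print(json_str)  # [{"id":1,"score":0.5},{"id":2,"score":0.9}]

# Round-trip back into a DataFrame
df_back = pd.read_json(StringIO(json_str), orient='records')
print(df_back)
```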
Saving Data
Always export processed datasets before feeding to ML pipelines so preprocessing can be reused without re-running. Parquet is the preferred format for any dataset above a few MB.
- df.to_csv('out.csv', index=False) — set index=False unless the index carries data
- df.to_parquet('out.parquet') — preserves dtypes
- df.to_json('out.json', orient='records')
- df.to_excel('out.xlsx', sheet_name='Data')
- df.to_sql('table', engine, if_exists='replace')
Always Use Parquet for Large Datasets
A 1GB CSV takes ~3-8 seconds to load and consumes 2-3x RAM during parsing. The same data as Parquet loads in ~0.3 seconds, uses 3-10x less disk space due to columnar compression, and preserves all dtypes exactly — no re-casting needed after load.
import pandas as pd
# ── Reading CSV with real-world options ───────────────────────────────────────
df = pd.read_csv(
'dataset.csv',
sep=',',
header=0, # first row is header (default)
usecols=['user_id', 'event', 'ts', 'amount'], # skip unwanted columns
dtype={
'user_id': 'int32', # saves memory vs int64
'amount': 'float32',
},
na_values=['NULL', 'N/A', '', '-'],
parse_dates=['ts'],
low_memory=False, # avoid mixed-type warnings
)
# ── Processing huge CSVs in chunks ────────────────────────────────────────────
chunk_iter = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunk_iter:
# Process each chunk (e.g., aggregate)
agg = chunk.groupby('category')['value'].sum()
results.append(agg)
final = pd.concat(results).groupby(level=0).sum()
# ── Parquet round-trip ────────────────────────────────────────────────────────
df.to_parquet('clean_data.parquet', index=False, compression='snappy')
df_loaded = pd.read_parquet('clean_data.parquet')
# Dtypes preserved exactly — no re-casting needed
# ── SQL query → DataFrame ─────────────────────────────────────────────────────
# from sqlalchemy import create_engine
# engine = create_engine('postgresql://user:pass@host/db')
# query = """
# SELECT user_id, COUNT(*) as events, SUM(amount) as total
# FROM transactions
# WHERE event_date >= '2024-01-01'
# GROUP BY user_id
# """
# df_sql = pd.read_sql(query, engine)
🧹 Cleaning & Handling Missing Data
Detecting Missing Data
Pandas represents missing values as NaN (float), pd.NaT (datetime), or pd.NA (nullable integer/string). Always audit missingness before any analysis or modelling.
- df.isnull().sum() — count NaN per column
- df.isnull().mean() * 100 — percent missing
- df.isnull().any(axis=1) — rows with any NaN
- df.notna().all(axis=1) — rows with no NaN
- Columns with >40% missing: consider dropping
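The audit-then-drop pattern from the list above, sketched on a toy frame (the 40% threshold is the rule of thumb, not a law):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],          # 25% missing
    'b': [np.nan, np.nan, np.nan, 1.0],    # 75% missing
    'c': [1, 2, 3, 4],                     # complete
})

# Percent missing per column
pct_missing = df.isnull().mean() * 100
print(pct_missing)

# Drop columns above the chosen threshold
keep = pct_missing[pct_missing <= 40].index
df_trimmed = df[keep]
print(df_trimmed.columns.tolist())  # ['a', 'c']
```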
Handling Missing Values
The right strategy depends on the mechanism: MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random). Never blindly fill or drop without understanding the cause.
- df.dropna() — drop any row with NaN
- df.dropna(subset=['col']) — only if a specific col is NaN
- df.fillna(df.mean(numeric_only=True)) — fill with column means
- df['col'].ffill() — forward fill (time series; fillna(method='ffill') is deprecated)
- df['col'].bfill() — backward fill
- Use SimpleImputer in scikit-learn for pipelines
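Forward and backward fill on a gappy hourly series (hypothetical sensor readings):

```python
import pandas as pd
import numpy as np

# Hourly readings with gaps, indexed by timestamp
idx = pd.date_range('2024-01-01', periods=6, freq='h')
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=idx)

# Forward fill: carry the last observation forward (typical for sensors)
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]

# Backward fill: pull the next observation back
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0, 6.0, 6.0]
```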
Duplicates & Type Conversion
Duplicate rows silently inflate statistics and model training. Type mismatches cause unexpected NaN after operations. Both are common sources of data bugs.
- df.duplicated().sum() — count duplicate rows
- df.drop_duplicates() — remove duplicate rows
- df.drop_duplicates(subset=['id']) — by key columns
- df['col'].astype('float32') — cast dtype
- pd.to_datetime(df['ts']) — parse date strings
- pd.to_numeric(df['col'], errors='coerce') — NaN on failure
import pandas as pd
import numpy as np
# ── Realistic cleaning pipeline ───────────────────────────────────────────────
# Simulate a messy dataset
data = {
'user_id': [1, 2, 2, 3, 4, 5, None, 6],
'age': ['25', '32', '32', None, '28', '999', '31', '45'],
'salary': [70000, None, None, 85000, 72000, 60000, 95000, None],
'email': [' [email protected] ', '[email protected]', '[email protected]',
'[email protected]', None, '[email protected]', '[email protected]', '[email protected]'],
'label': ['yes', 'no', 'no', 'yes', 'YES', 'No', 'yes', None],
}
df = pd.DataFrame(data)
print("=== Before Cleaning ===")
print(df)
print("\nMissing values:\n", df.isnull().sum())
# Step 1: Drop rows with missing user_id (primary key)
df = df.dropna(subset=['user_id'])
df['user_id'] = df['user_id'].astype(int)
# Step 2: Remove exact duplicate rows
print(f"\nDuplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()
# Step 3: Fix age column
df['age'] = pd.to_numeric(df['age'], errors='coerce') # unparseable values → NaN
# Replace implausible values with NaN
df.loc[df['age'] > 120, 'age'] = np.nan
df.loc[df['age'] < 18, 'age'] = np.nan
# Impute with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())
# Step 4: Salary — fill with group median (by age band)
df['age_band'] = pd.cut(df['age'], bins=[18, 30, 40, 120], labels=['young','mid','senior'])
df['salary'] = df.groupby('age_band')['salary'].transform(
lambda x: x.fillna(x.median())
)
df = df.drop(columns=['age_band'])
# Step 5: Clean string columns
df['email'] = df['email'].str.strip().str.lower()
df['label'] = df['label'].str.lower().str.strip()
df['label'] = df['label'].map({'yes': 1, 'no': 0}) # map leaves unmatched values as NaN
# Step 6: Drop rows still missing critical fields
df = df.dropna(subset=['email', 'label'])
df['label'] = df['label'].astype(int)
# Step 7: Report
print("\n=== After Cleaning ===")
print(df)
print("\nRemaining NaN:\n", df.isnull().sum())
print(f"Shape: {df.shape}")
🔨 Selection, Filtering & Transformation
loc vs iloc
loc is label-based: rows and columns are referenced by their index labels and column names. iloc is integer-position-based: 0-based integer positions, like NumPy. Never mix them up — they behave differently when the index is not a default RangeIndex.
- df.loc[2, 'salary'] — row with index label 2, salary column
- df.loc[2:5, ['name','salary']] — label slice (inclusive end)
- df.iloc[0, 3] — row 0, column 3 (0-based positions)
- df.iloc[1:4, :2] — rows 1-3, first 2 columns (exclusive end)
- df.iloc[-1] — last row
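The difference becomes visible once the index is not 0..n-1. A small sketch:

```python
import pandas as pd

# A non-default integer index makes loc and iloc diverge
df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Carol'], 'salary': [70000, 85000, 72000]},
    index=[10, 20, 30],
)

print(df.loc[20, 'salary'])   # 85000, row with LABEL 20
print(df.iloc[0, 1])          # 70000, row at POSITION 0, column 1

# Slicing: loc includes the end label, iloc excludes the end position
print(len(df.loc[10:20]))     # 2 rows (labels 10 and 20)
print(len(df.iloc[0:2]))      # 2 rows (positions 0 and 1)
```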
Boolean Filtering & query()
Boolean indexing with multiple conditions is the most common selection pattern. The query() method provides a SQL-like string syntax that is often more readable for complex conditions.
- df[df['age'] > 30] — rows where age > 30
- df[(df.age > 25) & (df.dept == 'Eng')] — AND
- df[df.dept.isin(['Eng','Finance'])] — membership
- df.query("age > 25 and salary > 80000") — string query
- df.query("dept in @depts") — inject a Python variable
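Boolean indexing and the equivalent query() call side by side, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 32, 28, 45],
    'salary': [70000, 85000, 72000, 110000],
    'dept': ['Eng', 'Marketing', 'Eng', 'Finance'],
})

# Boolean indexing: each condition needs its own parentheses
hits = df[(df['age'] > 25) & (df['dept'] == 'Eng')]
print(hits)

# Same filter via query, often more readable for long conditions
hits_q = df.query("age > 25 and dept == 'Eng'")

# @ injects a Python variable into the query expression
depts = ['Eng', 'Finance']
print(df.query("dept in @depts and salary > 80000"))
```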
apply / map / groupby
Transforming data element-wise, row-wise, or within groups is the core of feature engineering. Prefer vectorised operations over apply where possible — apply runs a Python loop internally.
- df['col'].map(dict) — recode values via dictionary
- df['col'].apply(func) — element-wise function
- df.apply(func, axis=1) — row-wise function (slow)
- df.assign(new_col=...) — add column, returns copy
- df.groupby('dept').agg({'salary': 'mean', 'age': 'max'})
- df.groupby('dept')['sal'].transform('mean') — broadcast back
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
'label': rng.integers(0, 3, n), # 3-class problem
'feat_1': rng.standard_normal(n),
'feat_2': rng.standard_normal(n) * 2 + 1,
'feat_3': rng.uniform(0, 10, n),
'category': rng.choice(['A','B','C','D'], n),
})
# ── Feature statistics per class (a standard EDA step) ───────────────────────
# Basic stats per label
class_stats = df.groupby('label').agg(
count=('feat_1', 'count'),
f1_mean=('feat_1', 'mean'),
f1_std=('feat_1', 'std'),
f2_mean=('feat_2', 'mean'),
f2_median=('feat_2', 'median'),
f3_q25=('feat_3', lambda x: x.quantile(0.25)),
f3_q75=('feat_3', lambda x: x.quantile(0.75)),
).round(3)
print(class_stats)
# ── Feature engineering with assign ──────────────────────────────────────────
df = df.assign(
feat_interaction = df['feat_1'] * df['feat_2'],
feat_1_abs = df['feat_1'].abs(),
feat_3_log = np.log1p(df['feat_3']),
is_category_A = (df['category'] == 'A').astype(int),
)
# ── Conditional transformation with np.where / map ───────────────────────────
df['feat_3_bin'] = pd.cut(
df['feat_3'],
bins=[0, 3, 7, 10],
labels=['low', 'mid', 'high']
)
df['label_name'] = df['label'].map({0: 'class_0', 1: 'class_1', 2: 'class_2'})
# ── Class-conditioned normalisation (group z-score) ───────────────────────────
for col in ['feat_1', 'feat_2', 'feat_3']:
group_mean = df.groupby('label')[col].transform('mean')
group_std = df.groupby('label')[col].transform('std')
df[f'{col}_znorm'] = (df[col] - group_mean) / (group_std + 1e-8)
# ── Pivot table — feature means by label and category ─────────────────────────
pivot = df.pivot_table(
values='feat_1',
index='label',
columns='category',
aggfunc='mean'
).round(3)
print("\nFeat_1 mean by label x category:")
print(pivot)
🔁 Merging & Reshaping
| Operation | Function | When to Use | SQL Equivalent |
|---|---|---|---|
| Inner join | pd.merge(df1, df2, on='id', how='inner') | Keep only matching rows in both tables | INNER JOIN |
| Left join | pd.merge(df1, df2, on='id', how='left') | Keep all rows from left, NaN for unmatched right | LEFT JOIN |
| Right join | pd.merge(df1, df2, on='id', how='right') | Keep all rows from right table | RIGHT JOIN |
| Outer join | pd.merge(df1, df2, on='id', how='outer') | Keep all rows from both, NaN for mismatches | FULL OUTER JOIN |
| Stack vertically | pd.concat([df1, df2], axis=0) | Append rows (same schema, different data) | UNION ALL |
| Stack horizontally | pd.concat([df1, df2], axis=1) | Add columns side by side (same row count) | — |
| Wide to long | df.melt(id_vars=['id'], value_vars=[...]) | Unpivot column headers into rows | UNPIVOT |
| Long to wide | df.pivot(index='id', columns='metric', values='val') | One row per entity with metric columns | PIVOT |
Merge Pitfalls
Always check the output row count after a merge. A many-to-many join will silently multiply rows (Cartesian product). Use validate='1:1', '1:m', or 'm:1' to assert the expected relationship and raise an error if violated.
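The silent row blow-up and the validate guard, sketched on tiny frames with duplicated keys:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 1, 2], 'x': [10, 11, 20]})
right = pd.DataFrame({'id': [1, 1, 2], 'y': [100, 101, 200]})

# Duplicate keys on both sides multiply: 2x2 matches for id=1, plus 1 for id=2
out = pd.merge(left, right, on='id')
print(len(left), len(right), '->', len(out))  # 3 3 -> 5

# validate turns the silent Cartesian blow-up into a hard error
try:
    pd.merge(left, right, on='id', validate='1:1')
except pd.errors.MergeError as e:
    print('MergeError:', e)
```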
import pandas as pd
import numpy as np
# ── Sample datasets ───────────────────────────────────────────────────────────
users = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'dept_id': [10, 20, 10, 30, 20],
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30, 40],
'dept_name': ['Engineering', 'Marketing', 'Finance', 'HR'],
'budget': [500000, 200000, 300000, 150000],
})
transactions = pd.DataFrame({
'user_id': [1, 1, 2, 3, 3, 3, 4],
'month': ['Jan','Feb','Jan','Jan','Feb','Mar','Jan'],
'revenue': [120, 95, 200, 80, 150, 60, 310],
})
# ── Join users to departments ─────────────────────────────────────────────────
enriched = pd.merge(
users, departments,
on='dept_id',
how='left', # keep all users even if dept not found
validate='m:1', # assert dept_id is unique in departments
)
print(enriched[['name', 'dept_name', 'budget']])
# ── Aggregate transactions and join back ──────────────────────────────────────
user_totals = transactions.groupby('user_id').agg(
total_revenue = ('revenue', 'sum'),
n_transactions = ('revenue', 'count'),
avg_revenue = ('revenue', 'mean'),
).reset_index()
full = pd.merge(enriched, user_totals, on='user_id', how='left')
full['total_revenue'] = full['total_revenue'].fillna(0)
print(full[['name', 'dept_name', 'total_revenue', 'n_transactions']].to_string())
# ── Wide to Long: multiple monthly columns → rows ─────────────────────────────
wide = transactions.pivot_table(
index='user_id',
columns='month',
values='revenue',
aggfunc='sum',
fill_value=0
).reset_index()
print("\nWide format:")
print(wide)
long = wide.melt(
id_vars=['user_id'],
var_name='month',
value_name='revenue'
).sort_values(['user_id', 'month'])
print("\nLong format:")
print(long.head(10))
# ── concat: stack train and validation sets ───────────────────────────────────
train = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6], 'split': 'train'})
valid = pd.DataFrame({'x': [7,8], 'y': [9,10], 'split': 'valid'})
combined = pd.concat([train, valid], ignore_index=True)
print("\nCombined:", combined.shape)