Data Engineering
pandas Cleaning, Joins, and Panel Construction
Treat merge, reshape, and groupby as auditable transformations over matrices and key sets.
Mechanism Lab
Animation: how joins change sample support
The animation shows left keys, right keys, intersection, and left_only observations moving into the analysis sample.
Step 1 / 5
Keys
Define the unit and key first.
key = (id, year)Animation Control
Reduced-motion users receive the same step states without continuous motion.
01 / Intuition
Core Intuition
pandas is not an API memorization exercise; it is about preserving units, keys, time, and sample flow.
Every merge changes support; every groupby defines the conditioning set for a statistic.
Empirical bugs often come from duplicate keys, hidden many-to-many joins, unbalanced panels, and sample drift.
02 / Math
Joins and aggregations as mathematical objects
01 / Key sets
View left and right tables as key sets L and R. Inner joins keep the intersection.
support(inner_join) = L cap R02 / Many-to-many risk
If key k appears m_k and n_k times, the join creates m_k n_k rows.
rows_k = m_k * n_k03 / Conditional mean
groupby means are empirical expectations within each group.
mean_g(Y) = (1/n_g) sum_{i: G_i=g} Y_i03 / Code
Audited merge
Check duplicate keys and sample counts before and after merging.
import pandas as pd
left = pd.read_csv("students.csv")
right = pd.read_csv("scores.csv")
assert not left.duplicated(["student_id", "year"]).any()
assert not right.duplicated(["student_id", "year"]).any()
merged = left.merge(
right,
on=["student_id", "year"],
how="left",
validate="one_to_one",
indicator=True,
)
audit = merged["_merge"].value_counts(dropna=False)
print(audit)
panel = merged.pivot_table(
index="student_id",
columns="year",
values="score",
aggfunc="mean",
)04 / Case
Case: sample-flow audit for a student panel
- Verify student_id-year uniqueness.
- Record left_only, right_only, and both counts.
- Report missingness by year before DID or event-study analysis.
05 / Risks