Data Engineering

pandas Cleaning, Joins, and Panel Construction

Treat merge, reshape, and groupby as auditable transformations over matrices and key sets.

Mechanism Lab

Animation: how joins change sample support

The animation shows left keys, right keys, intersection, and left_only observations moving into the analysis sample.

Step 1 / 5

Keys

Define the unit and key first.

key = (id, year)

Animation Control

Reduced-motion users receive the same step states without continuous motion.

01 / Intuition

Core Intuition

pandas is not an API memorization exercise; it is about preserving units, keys, time, and sample flow.

Every merge changes support; every groupby defines the conditioning set for a statistic.

Empirical bugs often come from duplicate keys, hidden many-to-many joins, unbalanced panels, and sample drift.

02 / Math

Joins and aggregations as mathematical objects

01 / Key sets

View left and right tables as key sets L and R. Inner joins keep the intersection.

support(inner_join) = L cap R

02 / Many-to-many risk

If key k appears m_k and n_k times, the join creates m_k n_k rows.

rows_k = m_k * n_k

03 / Conditional mean

groupby means are empirical expectations within each group.

mean_g(Y) = (1/n_g) sum_{i: G_i=g} Y_i

03 / Code

Audited merge

Check duplicate keys and sample counts before and after merging.

import pandas as pd

left = pd.read_csv("students.csv")
right = pd.read_csv("scores.csv")

assert not left.duplicated(["student_id", "year"]).any()
assert not right.duplicated(["student_id", "year"]).any()

merged = left.merge(
    right,
    on=["student_id", "year"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

audit = merged["_merge"].value_counts(dropna=False)
print(audit)
panel = merged.pivot_table(
    index="student_id",
    columns="year",
    values="score",
    aggfunc="mean",
)

04 / Case

Case: sample-flow audit for a student panel

  • Verify student_id-year uniqueness.
  • Record left_only, right_only, and both counts.
  • Report missingness by year before DID or event-study analysis.

05 / Risks

Common Pitfalls

Not using validate in merge.
Dropping unmatched rows silently.
Failing to check period-by-period panel counts.

References