Causal ML / Orthogonalization
DML and Causal Forests
DML uses machine learning for high-dimensional nuisance functions, then protects the target parameter with Neyman-orthogonal moments; causal forests localize the same idea for heterogeneous effects.
Mechanism Lab
Animation: how DML turns prediction signal into orthogonal causal signal
The animation starts with high-dimensional X entering two nuisance models, then shows residualized Y and D, the orthogonal score, cross-fit folds, and local tau(x) from a causal forest.
Step 1 of 5: Nuisance ML
Use machine learning to estimate E[Y|X] and E[D|X], giving ell_hat(X) and m_hat(X).
01 / Intuition
Core Intuition
Standard machine learning is good at predicting Y or D, but causal work needs variation in the treatment D that is not explained by the confounders X, not just accurate prediction.
The key DML move is residualization: remove the parts of Y and D predictable from X, then estimate theta from the remaining orthogonal signal.
Cross-fitting keeps nuisance predictions out of sample so overfitting errors do not directly contaminate the target; causal forests turn theta into tau(x).
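A minimal sketch of the residualization idea on simulated data. To keep it short, plain least-squares fits stand in for the machine-learning nuisance models and no cross-fitting is used; the data-generating process and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical linear confounding: X drives both treatment D and outcome Y.
X = rng.normal(size=n)
D = X + rng.normal(size=n)
Y = 1.0 * D + 2.0 * X + rng.normal(size=n)  # true effect of D is 1.0


def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit with intercept."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc ** 2)


# Naive regression of Y on D absorbs the confounding that runs through X.
naive = ols_slope(D, Y)

# Residualize: remove the part of Y and D that X predicts (linear fit here).
y_resid = Y - (Y.mean() + ols_slope(X, Y) * (X - X.mean()))
d_resid = D - (D.mean() + ols_slope(X, D) * (X - X.mean()))

# Regress the Y residual on the D residual.
theta_hat = np.sum(d_resid * y_resid) / np.sum(d_resid ** 2)

print(f"naive slope  = {naive:.3f}")      # population value is 2.0 in this design
print(f"residualized = {theta_hat:.3f}")  # population value is 1.0
```

In this design the naive slope converges to 2.0 because D proxies for X, while the residual-on-residual slope recovers the true effect of 1.0.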
02 / Math
From a partially linear model to orthogonal moments and local forests
01 / Target model
With high-dimensional controls X, the partially linear model separates the treatment effect, a nonparametric control function, and residual noise. theta_0 is the marginal effect of D on Y, which this model holds constant across units.
Y = theta_0 D + g_0(X) + U
D = m_0(X) + V
E[U|X,D]=0, E[V|X]=0
02 / Residualization
Let ell_0(X)=E[Y|X]. Project Y and D on X, then take residuals. At the truth, the Y residual equals theta_0 times the treatment residual plus noise.
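The step behind this claim can be written out under the model above. Taking the conditional expectation of Y given X, and using E[D|X] = m_0(X) together with E[U|X] = 0 (implied by E[U|X,D] = 0), gives:

```latex
\begin{aligned}
\ell_0(X) &= E[Y \mid X]
           = \theta_0\, E[D \mid X] + g_0(X) + E[U \mid X]
           = \theta_0\, m_0(X) + g_0(X), \\
Y - \ell_0(X) &= \theta_0 D + g_0(X) + U - \theta_0\, m_0(X) - g_0(X)
               = \theta_0 \bigl(D - m_0(X)\bigr) + U .
\end{aligned}
```

The control function g_0 cancels, so only E[Y|X] and E[D|X] need to be learned.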
tilde Y = Y - ell_0(X)
tilde D = D - m_0(X)
tilde Y = theta_0 tilde D + U
03 / Orthogonal moment
DML does not treat machine-learning predictions as causal estimates. It inserts them into a moment condition that is insensitive to first-order nuisance error.
psi(W;theta,eta) = (Y - ell(X) - theta(D-m(X)))(D-m(X))
E[psi(W;theta_0,eta_0)] = 0
04 / Neyman orthogonality
Take Gateaux derivatives with respect to perturbations h_l for ell and h_m for m. At the truth, E[D-m_0(X)|X] and E[U|X] are zero, so the first-order terms vanish.
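A numerical sanity check of this claim on simulated data, offered as a sketch. The design, the perturbation direction h(X) = X, and the comparison score (Y - theta D - g(X)) D are assumptions chosen for illustration. The comparison score is not orthogonal with respect to g, so its moment moves at first order when g is perturbed, while the orthogonal moment stays flat.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
theta0 = 0.5

# Hypothetical partially linear design with known nuisance functions.
X = rng.normal(size=n)
V = rng.normal(size=n)
U = rng.normal(size=n)
m0 = X                    # E[D|X]
g0 = np.sin(X)            # control function
D = m0 + V
Y = theta0 * D + g0 + U
ell0 = theta0 * m0 + g0   # E[Y|X]

h = X                     # perturbation direction applied to each nuisance


def orthogonal_moment(t):
    """Sample mean of psi with ell and m both shifted by t * h."""
    ell, m = ell0 + t * h, m0 + t * h
    return np.mean((Y - ell - theta0 * (D - m)) * (D - m))


def naive_moment(t):
    """Sample mean of the non-orthogonal score with g shifted by t * h."""
    g = g0 + t * h
    return np.mean((Y - theta0 * D - g) * D)


def slope_at_zero(moment, t=1e-3):
    """Central finite difference of the moment in t, evaluated at t = 0."""
    return (moment(t) - moment(-t)) / (2 * t)


print(f"orthogonal score slope = {slope_at_zero(orthogonal_moment):+.4f}")
print(f"naive score slope      = {slope_at_zero(naive_moment):+.4f}")
```

Shifting ell and m together sums the two directional derivatives, so a slope near zero is consistent with both vanishing. In this design the naive slope has population value -E[X D] = -1.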
d/dt E[psi(theta_0, ell_0+t h_l, m_0)]|0 = -E[h_l(X)V] = 0
d/dt E[psi(theta_0, ell_0, m_0+t h_m)]|0 = E[h_m(X)(theta_0 V-U)] = 0
05 / Cross-fitted estimator
Split the sample into K folds, train nuisances outside each held-out fold, predict on the held-out fold, then regress residualized Y on residualized D.
theta_hat = sum_i tilde D_i tilde Y_i / sum_i tilde D_i^2
where tilde Y_i, tilde D_i are cross-fitted residuals
06 / Causal forest localization
If the treatment effect varies with X, a forest assigns neighborhood weights alpha_i(x) around each target point x and solves a local orthogonal moment.
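One way to picture the weights alpha_i(x) is leaf co-membership: a training point gets weight at x in proportion to how often it shares a leaf with x across trees. The sketch below uses an ordinary scikit-learn regression forest as a stand-in, so it is an assumption-laden illustration rather than a causal forest. The residuals are simulated directly, the forest is grown on the residual product as a hypothetical proxy target, and the helper names `forest_weights` and `local_tau` are invented here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4000

# Hypothetical setup: residuals are simulated directly so the sketch can
# focus on the weighting step. True effect is 1 for x0 <= 0 and 2 for x0 > 0.
X = rng.uniform(-1, 1, size=(n, 2))
tau = 1.0 + 1.0 * (X[:, 0] > 0)
d_resid = rng.normal(size=n)
y_resid = tau * d_resid + rng.normal(size=n)

# Stand-in for a causal forest: a regression forest grown on the residual
# product, whose conditional mean here is tau(x) * Var(d_resid).
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=50, random_state=0
).fit(X, y_resid * d_resid)

train_leaves = forest.apply(X)  # shape (n, n_trees): leaf index per tree


def forest_weights(x):
    """alpha_i(x): average over trees of leaf co-membership shares."""
    x_leaves = forest.apply(np.asarray(x, dtype=float).reshape(1, -1))
    same_leaf = train_leaves == x_leaves  # broadcasts to (n, n_trees)
    per_tree = same_leaf / same_leaf.sum(axis=0, keepdims=True)
    return per_tree.mean(axis=1)


def local_tau(x):
    """Weighted residual-on-residual ratio at the target point x."""
    alpha = forest_weights(x)
    return np.sum(alpha * d_resid * y_resid) / np.sum(alpha * d_resid ** 2)


tau_lo = local_tau([-0.5, 0.0])
tau_hi = local_tau([0.5, 0.0])
print(f"tau_hat(x0=-0.5) = {tau_lo:.2f}, tau_hat(x0=+0.5) = {tau_hi:.2f}")
```

Generalized random forests choose splits to maximize effect heterogeneity and use honest sample splitting; this sketch does neither, so it should be read only as an illustration of the weighting step.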
tau_hat(x) = [sum_i alpha_i(x) tilde D_i tilde Y_i] / [sum_i alpha_i(x) tilde D_i^2]
03 / Code
Python code: cross-fitted DML plus a simple heterogeneity display
The example simulates confounded data with 12 covariates, estimates nuisance functions with random forests, cross-fits residuals, and then estimates the average effect plus a display model for tau(x). In this simulation the true average of tau(X) is 1.1.
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
rng = np.random.default_rng(7)
n = 2500
p = 12
X = rng.normal(size=(n, p))
# Heterogeneous treatment effect for demonstration.
tau = 0.8 + 0.6 * (X[:, 0] > 0) - 0.35 * X[:, 1]
m = 0.4 * X[:, 0] - 0.25 * X[:, 2] + 0.2 * X[:, 3] ** 2
D = m + rng.normal(scale=1.0, size=n)
g = 1.2 * np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
Y = tau * D + g + rng.normal(scale=1.0, size=n)
base_y = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=20,
    random_state=11,
    n_jobs=-1,
)
base_d = RandomForestRegressor(
    n_estimators=250,
    min_samples_leaf=20,
    random_state=17,
    n_jobs=-1,
)
y_hat = np.zeros(n)
d_hat = np.zeros(n)
folds = KFold(n_splits=5, shuffle=True, random_state=23)
for train_idx, test_idx in folds.split(X):
    model_y = clone(base_y).fit(X[train_idx], Y[train_idx])
    model_d = clone(base_d).fit(X[train_idx], D[train_idx])
    y_hat[test_idx] = model_y.predict(X[test_idx])
    d_hat[test_idx] = model_d.predict(X[test_idx])
y_resid = Y - y_hat
d_resid = D - d_hat
theta_hat = np.sum(d_resid * y_resid) / np.sum(d_resid ** 2)
# The score's sample mean is zero at theta_hat by construction (up to
# floating-point error); its variance feeds the standard error.
score = (y_resid - theta_hat * d_resid) * d_resid
se = np.sqrt(np.mean(score ** 2) / (np.mean(d_resid ** 2) ** 2 * n))
# A simple heterogeneity display: pseudo-outcome for tau(X).
# Production causal forests use honest splitting and forest weights.
pseudo_tau = y_resid / np.where(np.abs(d_resid) < 0.05, np.nan, d_resid)
mask = np.isfinite(pseudo_tau) & (np.abs(d_resid) > 0.2)
tau_model = RandomForestRegressor(
    n_estimators=300,
    min_samples_leaf=40,
    random_state=31,
    n_jobs=-1,
).fit(X[mask], pseudo_tau[mask])
grid = pd.DataFrame({
    "x0_group": ["low X0", "high X0"],
    "tau_hat": [
        tau_model.predict(X[X[:, 0] <= 0]).mean(),
        tau_model.predict(X[X[:, 0] > 0]).mean(),
    ],
})
print(f"DML theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}")
print(f"orthogonal score mean = {score.mean():.4f}")
print(grid)
04 / Case
Case: selection bias and heterogeneity in a job-training program
- Question: does a job-training program raise later earnings? Participants differ in education, industry, region, prior earnings, and job-search history.
- Strong predictors of earnings are not causal evidence. DML first uses X to explain earnings and take-up, then estimates the average effect from residual treatment variation.
- A causal forest asks who benefits more: for example, whether low-baseline earners, industry switchers, or younger job seekers have higher tau(x).
- A credible report states the identification assumption, overlap checks, out-of-sample nuisance performance, cross-fitting design, ATE interval, heterogeneity calibration, and pre-specified subgroup interpretation.
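Two of the report items listed above, out-of-sample nuisance performance and an overlap check, can be sketched as follows. The data are simulated stand-ins for a binary training indicator and later earnings; the learners, the 0.05/0.95 overlap bounds, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(3)
n = 3000

# Hypothetical data: four baseline covariates, binary program take-up D,
# and an earnings-like outcome Y.
X = rng.normal(size=(n, 4))
p_true = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
D = rng.binomial(1, p_true)
Y = 1.0 * D + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each unit is predicted by models that never saw it.
p_hat = cross_val_predict(
    LogisticRegression(), X, D, cv=cv, method="predict_proba"
)[:, 1]
y_hat = cross_val_predict(LinearRegression(), X, Y, cv=cv)

report = {
    "outcome_r2_oos": r2_score(Y, y_hat),
    "takeup_auc_oos": roc_auc_score(D, p_hat),
    "propensity_min": p_hat.min(),
    "propensity_max": p_hat.max(),
    "share_outside_0.05_0.95": np.mean((p_hat < 0.05) | (p_hat > 0.95)),
}
for name, value in report.items():
    print(f"{name:>26s} = {value:.3f}")
```

Estimated propensities piling up near 0 or 1 would signal weak overlap: little residual treatment variation is left in those regions, so effect estimates there rest on extrapolation.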
05 / Risks
Common Pitfalls
References
- Chernozhukov et al. (2018), Double/Debiased Machine Learning for Treatment and Structural Parameters. https://doi.org/10.1111/ectj.12097
- Wager and Athey (2018), Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. https://doi.org/10.1080/01621459.2017.1319839
- Athey, Tibshirani, and Wager (2019), Generalized Random Forests. https://doi.org/10.1214/18-AOS1709