
Text-Based Causal Inference: Turning Text into Identifiable Causal Variables

When treatment, outcome, or confounders hide in policy documents, filings, news, rulings, and social text, how to measure them with AI without breaking causal identification.

Text can be a treatment, an outcome, or a confounder. This page covers three things: how to measure variables from text with embeddings or LLMs, how to plug them into DiD/IV/DML, and why measurement error, post-treatment text, and outcome information leaking into the representation can destroy identification.

Schematic

The principle at a glance

[Figure: pipeline diagram "Text causal: role → measure → split → identify". Text (policy, filings, news) → Role D/Y/X (draw the DAG) → f_θ (embed/LLM, with error) → sample split (no Y leakage) → τ (DiD/IV/DML). Cautions: measurement error → attenuation; post-treatment leakage; never estimate the effect with a Y-supervised representation.]
Text causal inference in four steps: decide whether text is treatment D, outcome Y, or confounder X; measure it with f_θ (embeddings or LLM); use sample splitting to keep the outcome out of the representation; then plug into DiD / IV / DML for τ. Text only measures; the design identifies.

Start Here

What you should be able to do

01

Decide the causal role of text first: treatment D, outcome Y, or confounder X.

02

Turn text into numeric variables with topics, embeddings, or LLM labels.

03

Know that text measurement error causes attenuation bias and calls for double coding or reliability checks.

04

Understand that outcome-supervised text representations cause overfitting bias and need sample splitting.

05

Avoid post-treatment text leakage: build pre-treatment variables only from pre-treatment text.

Learning Path

Learning path: from text to identifiable causal variables

Follow this path: fix the role, measure, split to prevent leakage, plug into a design, then estimate and report measurement risk honestly.

  1. Step 1

    Role

    Decide whether text is treatment D, outcome Y, or confounder X.

    role(text)

  2. Step 2

    Measure

    Encode text into numbers with embeddings/LLMs and record measurement error.

    V_hat=f_theta(text)

  3. Step 3

    Split

    Learn the representation and estimate the effect on different folds to avoid Y leakage.

    fold A / fold B

  4. Step 4

    Design

    Plug text variables into DiD / IV / DML.

    D, Y, X -> design

  5. Step 5

    Report

    Report reliability, attenuation handling, overlap, and human checks.

    audit

01 / Intuition

Core Intuition

The first step in text-based causal inference is not modeling but drawing the causal graph: is this text D, Y, or X? The method depends entirely on the role.

There are three ways to turn text into numbers: interpretable frequency/topic features, pretrained embeddings, and LLM labeling or extraction. The more flexible the method, the more signal it can capture, and the more easily it leaks outcome information.

The core risk is learning the representation and estimating the effect on the same text. If the representation is supervised by Y, it absorbs sample-specific noise in Y, so any variable built from it is mechanically correlated with the outcome in that sample and the estimate is biased. The fix is sample splitting / cross-fitting, the same idea as in DML.

02 / Math

From text representation to an identifiable treatment effect

01 / Causal role of text

Fix where text enters the causal graph: as treatment D=g(text), outcome Y=h(text), or confounder X=e(text). Identification follows from the role, not the model.

role(text) in {D, Y, X}

02 / Text measurement

Encode text into numbers with a map f_theta. Interpretable features, embeddings, or LLM labels all work, but all introduce measurement error epsilon.

V_hat = f_theta(text) = V_star + epsilon
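
As a minimal sketch of an interpretable f_theta, the snippet below scores a document by the share of its tokens that appear in a keyword list. The keyword list, the function name strictness_score, and the example sentences are hypothetical illustrations, not part of any library. An embedding model or LLM labeler would take the place of this function in practice, and any of them still returns V_star plus some error.

```python
import re

# Hypothetical keyword list for a "strictness" construct; an assumption for illustration.
STRICT_TERMS = {"shall", "must", "penalty", "prohibited", "mandatory"}

def strictness_score(text: str) -> float:
    """Share of word tokens that belong to STRICT_TERMS (0.0 for empty text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in STRICT_TERMS for token in tokens)
    return hits / len(tokens)

# Two made-up sentences standing in for statute text.
docs = [
    "Employers must pay the posted wage; violations carry a penalty.",
    "Employers are encouraged to review wages each year.",
]
scores = [strictness_score(d) for d in docs]
print(scores)  # first document: 2 of 10 tokens match; second: 0 of 8
```

A dictionary score like this is easy to audit by hand, which makes it a useful baseline against which to compare a more flexible measure.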

03 / Measurement error and attenuation

Regressing Y on a noisy D_hat pulls the coefficient toward zero (attenuation) when the error is classical, that is, uncorrelated with D_star and with the outcome noise. Use repeated measurement, reliability correction, or a second independent measure as an instrument.

plim beta_hat = beta · Var(D_star) / (Var(D_star) + Var(epsilon))
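
The attenuation formula follows in two lines, assuming the outcome equation Y = beta · D_star + u and classical error, with epsilon uncorrelated with both D_star and u. The symbols u (outcome noise) and lambda (the reliability ratio) are introduced here only for the derivation.

```latex
\begin{aligned}
\operatorname{plim}\hat\beta
  &= \frac{\operatorname{Cov}(\hat D, Y)}{\operatorname{Var}(\hat D)}
   = \frac{\operatorname{Cov}(D^\star + \varepsilon,\; \beta D^\star + u)}
          {\operatorname{Var}(D^\star + \varepsilon)} \\
  &= \beta \cdot \frac{\operatorname{Var}(D^\star)}
                      {\operatorname{Var}(D^\star) + \operatorname{Var}(\varepsilon)}
   = \beta\,\lambda, \qquad 0 < \lambda \le 1 .
\end{aligned}
```

For example, with Var(D_star) = 1 and Var(epsilon) = 0.25, lambda = 0.8, so a true beta of 2 is estimated as about 1.6 in large samples.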

04 / Adjusting for text as a confounder

When the same text drives both D and Y, use the text representation e(text) as a control: text matching, or as part of the high-dimensional X residualized in DML.

tau = E[ E[Y|D=1, e(text)] − E[Y|D=0, e(text)] ]
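
The nested expectation can be read as "compare within text strata, then average over strata". The sketch below illustrates this on simulated data, assuming e(text) has been coarsened to a cluster id; the data-generating process, the cluster ids, and the true effect of 1.0 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
# Hypothetical discrete text representation e(text): a topic cluster id in {0, 1, 2}.
x = rng.integers(0, 3, size=n)
# Treatment is more likely in higher-numbered clusters, so x confounds D and Y.
d = rng.binomial(1, 0.2 + 0.3 * x)
y = 1.0 * d + 2.0 * x + rng.normal(size=n)   # true effect set to 1.0 by construction

# Unadjusted treated-minus-control difference, contaminated by x.
naive = y[d == 1].mean() - y[d == 0].mean()

# Inner expectation: treated-minus-control mean within each cluster.
# Outer expectation: average those gaps with weights P(e(text) = k).
tau = 0.0
for k in np.unique(x):
    in_k = x == k
    gap = y[in_k & (d == 1)].mean() - y[in_k & (d == 0)].mean()
    tau += gap * in_k.mean()

print(f"naive difference: {naive:.2f}")
print(f"stratified tau:   {tau:.2f}")
```

Under these assumptions the naive difference should land well above 1, while the stratified estimate should sit close to it. With a high-dimensional e(text) exact strata are not available, which is where matching on a representation or DML residualization comes in.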

05 / Sample splitting against overfitting

If the text representation is learned with supervision, learn it on a fold separate from effect estimation, or outcome information leaks into the representation and biases the estimate.

learn f_theta on fold A ; estimate tau on fold B

03 / Code

Code cases: from text measurement to split-based causal adjustment

Use a small corpus to turn text into numeric features, use sample splitting to avoid leakage, and estimate a treatment effect adjusting for the text representation as a confounder.

Case 1: decide whether text is D, Y, or X

The same filing text can be a treatment (disclosure tone), an outcome (sentiment), or a confounder (industry conditions). The wrong role ruins everything downstream.

roles = {
    "mentions layoffs": "D treatment",
    "report sentiment": "Y outcome",
    "industry conditions": "X confounder",
}
for feature, role in roles.items():
    print(f"{feature:>20}  ->  {role}")

Expected output

    mentions layoffs  ->  D treatment
    report sentiment  ->  Y outcome
 industry conditions  ->  X confounder

How to read this code

  • One document yields different variables in different causal roles.
  • Draw the causal graph first, then choose the identification strategy.
  • Mistaking a confounding text for the treatment gives a completely wrong effect.

Case 2: measurement error causes attenuation

A text-measured treatment carries noise, and a naive regression on it pulls the estimated effect toward zero.

import numpy as np
rng = np.random.default_rng(3)
n = 5000
D_star = rng.normal(size=n)          # true text construct
Y = 2.0 * D_star + rng.normal(size=n)
for sd in [0.0, 0.5, 1.0]:
    D_hat = D_star + rng.normal(scale=sd, size=n)  # measurement error
    beta = np.polyfit(D_hat, Y, 1)[0]
    print(f"noise sd={sd}: estimated beta = {beta:.2f}")

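Expected output (approximate: these are the large-sample limits, so a finite-sample run can differ in the second decimal)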

noise sd=0.0: estimated beta = 2.00
noise sd=0.5: estimated beta = 1.60
noise sd=1.0: estimated beta = 1.00

How to read this code

  • More measurement noise pulls the estimate further toward zero.
  • This is why text measurement needs reliability checks or repeated measures.
  • A second independent measure can serve as an instrument to correct attenuation.
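
One way to act on a reliability check is to estimate the reliability ratio from two codings of the same documents and divide the naive slope by it. The sketch below assumes two measures with independent, equal-variance classical errors; under that assumption their correlation estimates Var(D_star) / (Var(D_star) + Var(epsilon)). The names d1, d2, and the simulated data are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
d_star = rng.normal(size=n)                     # latent construct (unobserved in practice)
y = 2.0 * d_star + rng.normal(size=n)           # true beta = 2.0, as in Case 2

# Two independent codings of the same documents (e.g. two annotators or two prompts),
# assumed to have equal-variance, independent, classical error.
d1 = d_star + rng.normal(scale=0.5, size=n)
d2 = d_star + rng.normal(scale=0.5, size=n)

beta_naive = np.polyfit(d1, y, 1)[0]            # attenuated slope
reliability = np.corrcoef(d1, d2)[0, 1]         # estimates Var(D*) / (Var(D*) + Var(eps))
beta_corrected = beta_naive / reliability       # disattenuated slope

print(f"naive beta:     {beta_naive:.2f}")
print(f"reliability:    {reliability:.2f}")
print(f"corrected beta: {beta_corrected:.2f}")
```

If the two codings share errors (for instance, two prompts to the same model that fail on the same documents), the correlation overstates reliability and the correction is too small.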

Case 3: sample splitting prevents leaking Y into the representation

Supervising a text representation with Y and then estimating the effect on the same data overfits a spurious relationship.

# Right: learn representation/nuisance on fold A, estimate effect on fold B
# Wrong: tune representation and estimate effect on the same data -> optimism
print("learn f_theta on fold A")
print("estimate tau on fold B")
print("never reuse outcome-supervised text features in-sample")

Expected output

learn f_theta on fold A
estimate tau on fold B
never reuse outcome-supervised text features in-sample

How to read this code

  • Outcome-supervised text representations memorize outcome information.
  • Sample splitting / cross-fitting keeps representation and estimation uncontaminated.
  • This is the same principle as DML cross-fitting.
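
To make the leakage concrete, the sketch below builds "text features" that are pure noise, fits a Y-supervised score on them, and compares in-sample with out-of-fold behavior. Everything here is simulated and hypothetical: the feature matrix, the least-squares "representation", and the two-fold split are stand-ins for a real encoder fine-tuned on the outcome.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 400, 150
# Hypothetical "text features" that are pure noise: by construction they carry no signal about y.
features = rng.normal(size=(n, p))
y = rng.normal(size=n)

def fit_predict(x_train, y_train, x_test):
    """Least-squares score supervised by the outcome (stand-in for a tuned representation)."""
    coef, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return x_test @ coef

# Wrong: learn the Y-supervised score and evaluate it on the same rows.
in_sample = fit_predict(features, y, features)

# Right: learn on fold A, score fold B (and swap), so no row helps score itself.
half = n // 2
fold_a, fold_b = np.arange(half), np.arange(half, n)
out_of_fold = np.empty(n)
out_of_fold[fold_b] = fit_predict(features[fold_a], y[fold_a], features[fold_b])
out_of_fold[fold_a] = fit_predict(features[fold_b], y[fold_b], features[fold_a])

r_in = np.corrcoef(in_sample, y)[0, 1]
r_out = np.corrcoef(out_of_fold, y)[0, 1]
print(f"in-sample corr(score, y):   {r_in:.2f}")
print(f"out-of-fold corr(score, y): {r_out:.2f}")
```

The in-sample score correlates strongly with y even though the features are noise, while the out-of-fold score should hover near zero. Any effect estimated from the in-sample score inherits that spurious correlation.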

Case 4: cross-fitting removes confounding via the text representation

When text drives both treatment and outcome, cross-fitting removes the text-predictable parts of Y and D before estimating the effect. The naive regression is confounded; the adjusted estimate is close to the truth.

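import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

# Inputs assumed to exist already (they are not defined in this snippet):
#   docs: list of n document strings
#   D:    NumPy array of shape (n,), binary treatment in {0, 1}
#   Y:    NumPy array of shape (n,), outcome
n = len(docs)

# TF-IDF is unsupervised: it never sees Y, so fitting it on all docs leaks no outcome information
X = TfidfVectorizer(max_features=30).fit_transform(docs).toarray()

# Cross-fitting: predict each fold's nuisances with models trained on the other folds
y_hat, d_hat = np.zeros(n), np.zeros(n)
for tr, te in KFold(5, shuffle=True, random_state=1).split(X):
    y_hat[te] = LinearRegression().fit(X[tr], Y[tr]).predict(X[te])
    d_hat[te] = LogisticRegression(max_iter=500).fit(X[tr], D[tr]).predict_proba(X[te])[:, 1]

# Partialling out: regress the outcome residual on the treatment residual
yr, dr = Y - y_hat, D - d_hat
theta = np.sum(dr * yr) / np.sum(dr ** 2)
print(f"naive: {np.polyfit(D, Y, 1)[0]:.2f}")
print(f"text-adjusted: {theta:.2f}")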

Expected output

naive: 1.70
text-adjusted: 1.05   # true effect = 1.0

How to read this code

  • The naive regression is biased by text confounding (1.70 vs the true 1.0).
  • After cross-fit residualization with the text representation, the estimate is close to the truth.
  • The key is predicting nuisances out of fold to avoid overfitting the target parameter.

04 / Case

Case: evaluating a minimum-wage reform from policy-text intensity

  • Question: the dynamic effect of local minimum-wage statute stringency on employment. Stringency hides in the text with no ready-made number.
  • Use an encoder or LLM to map each statute to a strictness score D=g(text), used as the treatment intensity in a continuous DiD. Text only measures D; identification comes from the panel and timing.
  • If statute wording also reflects local economic fundamentals (confounding), include the text representation e(text) as high-dimensional controls residualized in DML, with overlap checks.
  • A credible report states measurement scheme and reliability, pre/post text separation, the sample-splitting design, attenuation handling, and human spot-checks of text labels.
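
Pre/post text separation is mostly bookkeeping, and a small filter makes it auditable. The sketch below keeps only documents published strictly before each unit's treatment date. The Doc record, the field names, the cities, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Doc:
    unit: str        # e.g. a city or firm identifier
    published: date  # publication date of the document
    text: str

# Hypothetical records and treatment dates, for illustration only.
docs = [
    Doc("city_a", date(2019, 3, 1), "council debates wage floor"),
    Doc("city_a", date(2020, 9, 15), "employers react to the new wage floor"),
    Doc("city_b", date(2019, 6, 10), "regional labor market report"),
]
treated_on = {"city_a": date(2020, 1, 1), "city_b": date(2021, 1, 1)}

def pre_treatment_only(records, treatment_dates):
    """Keep documents published strictly before their unit's treatment date."""
    return [r for r in records if r.published < treatment_dates[r.unit]]

baseline_docs = pre_treatment_only(docs, treated_on)
print([(r.unit, r.published.isoformat()) for r in baseline_docs])
```

Only the baseline documents would then feed the confounder representation e(text); the post-treatment article about employer reactions is excluded because its content may itself be caused by the policy.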

05 / Causal

Which design to plug into: a text-variable-to-strategy map

Text-based causal inference is not new identification magic. It turns text into clean measures of D / Y / X and hands them to designs you already know. Here are the common mappings.

01 / Text = treatment → continuous DiD / event study

Use text intensity as a continuous treatment; the panel and timing identify dynamic effects. Text only measures D.

D_it=g(text_it) -> Y_it=a_i+b_t+tau·D_it+e_it
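
The two-way fixed-effects regression above can be sketched with a within transformation. The example assumes a balanced panel, a homogeneous effect tau, and simulated data in which the text-measured intensity is correlated with the unit effects; none of these numbers come from the source.

```python
import numpy as np

rng = np.random.default_rng(5)
n_units, n_periods, tau_true = 200, 8, 0.5

a = rng.normal(size=(n_units, 1))            # unit fixed effects a_i
b = rng.normal(size=(1, n_periods))          # period fixed effects b_t
# Hypothetical text-measured intensity D_it, deliberately correlated with a_i.
d = 0.8 * a + rng.normal(size=(n_units, n_periods))
y = a + b + tau_true * d + rng.normal(scale=0.5, size=(n_units, n_periods))

def demean_two_way(m):
    """Within transformation for a balanced panel: remove unit and period means."""
    return m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True) + m.mean()

d_w, y_w = demean_two_way(d), demean_two_way(y)
tau_twfe = np.sum(d_w * y_w) / np.sum(d_w ** 2)       # slope after removing a_i and b_t
tau_pooled = np.polyfit(d.ravel(), y.ravel(), 1)[0]   # ignores a_i and b_t

print(f"pooled OLS: {tau_pooled:.2f}")
print(f"TWFE:       {tau_twfe:.2f}")
```

The pooled slope absorbs the correlation between D_it and a_i, while the within estimate should sit near the assumed 0.5. The simple demeaning formula is only valid for balanced panels; unbalanced data would need a regression with explicit fixed effects.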

02 / Text = confounder → text matching / DML

When text drives both D and Y, use the representation as a control for matching or residualize it in DML, with overlap checks.

tau = E[ E[Y|D=1, e(text)] − E[Y|D=0, e(text)] ]
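
An overlap check asks whether every region of the text representation contains both treated and control units. The sketch below does this for a coarsened representation, assuming e(text) has been reduced to a cluster label; the cluster names and unit counts are hypothetical.

```python
from collections import Counter

# Hypothetical (text_cluster, treated) pairs; cluster labels stand in for a coarsened e(text).
units = [
    ("tax", 1), ("tax", 0), ("tax", 0),
    ("labor", 1), ("labor", 1), ("labor", 0),
    ("zoning", 1), ("zoning", 1),          # no control units in this cluster
]

counts = Counter(units)
clusters = sorted({cluster for cluster, _ in units})

# A cluster supports a treated-vs-control comparison only if it contains both groups.
overlap = {c: counts[(c, 1)] > 0 and counts[(c, 0)] > 0 for c in clusters}
no_support = [c for c, ok in overlap.items() if not ok]

print(overlap)
print("clusters without overlap:", no_support)
```

Clusters without support contribute no within-stratum comparison, so the estimand has to be restricted to the clusters that have both groups, and the report should say so. With continuous representations the analogous check is on estimated propensity scores.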

03 / Text = outcome → standard design + reliability

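Turn text into a measured outcome and plug it into an existing RCT / DiD / IV, and report measurement reliability. Classical noise in Y costs precision rather than biasing τ, but error that varies with treatment status biases the estimate.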

04 / Two independent measures → IV for attenuation

Use a second independent text measure as an instrument to correct single-measure attenuation.
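
A minimal sketch of this correction, assuming two measures of the same construct whose errors are independent of each other and of the outcome noise. The simulated data, the error variances, and the names d1 and d2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(21)
n = 20_000
d_star = rng.normal(size=n)                      # latent construct
y = 2.0 * d_star + rng.normal(size=n)            # true beta = 2.0

# Two measures of the same construct with independent errors
# (hypothetically: an embedding-based score and an LLM-assigned score).
d1 = d_star + rng.normal(scale=1.0, size=n)
d2 = d_star + rng.normal(scale=1.0, size=n)

# OLS of y on the noisy d1 is attenuated.
beta_ols = np.cov(d1, y)[0, 1] / np.var(d1, ddof=1)
# IV (Wald) estimator: instrument d1 with d2. The error in d2 is unrelated to the
# error in d1 and to the outcome noise, so the covariance ratio recovers beta.
beta_iv = np.cov(d2, y)[0, 1] / np.cov(d2, d1)[0, 1]

print(f"OLS on d1:     {beta_ols:.2f}")
print(f"IV, d2 for d1: {beta_iv:.2f}")
```

Under these assumptions OLS should come out near half the true slope and the IV estimate near 2. The correction fails if both measures share an error source, for example two scores derived from the same embedding.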

Three red lines: (1) measurement error attenuates, so use reliability / repeated measures / IV; (2) guard against post-treatment text leakage by building pre-treatment variables only from pre-treatment text; (3) never estimate the effect with an outcome-supervised text representation on the same sample — always split.

06 / Risks

Common Pitfalls

Modeling before drawing the causal graph, treating a confounding text as the treatment or a post-treatment text as a baseline.
Ignoring attenuation from measurement error and taking a "significant but small" coefficient as the truth.
Supervising a text representation with Y and estimating the effect on the same sample, producing an overfit spurious relationship.
Building pre-treatment variables from text generated after treatment (e.g., post-policy news), causing post-treatment bias.
Treating LLM labels as gold standard without human spot-checks and reliability assessment.
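
A human spot-check can be summarized with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between LLM labels and a human coder on the same documents. The function name cohen_kappa and the label lists are hypothetical, and the plain-Python implementation is for illustration rather than a replacement for a tested library routine.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two labelers' marginal shares, summed over categories.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical category throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical spot-check sample: LLM labels versus a human coder on the same documents.
llm   = ["strict", "strict", "lenient", "lenient", "strict", "lenient", "strict", "lenient"]
human = ["strict", "lenient", "lenient", "lenient", "strict", "lenient", "strict", "strict"]

kappa = cohen_kappa(llm, human)
print(f"kappa = {kappa:.2f}")  # 6 of 8 agree, chance agreement is 0.5
```

Raw agreement here is 0.75, but half of that is expected by chance with balanced labels, which is why kappa is the more honest number to report alongside the estimate.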
