Using StatsPAI for Research: From Data and a Question to an Audited Causal Estimate

Turn 'run a regression' into an agent-native workflow: detect design -> recommend estimator -> fit (handle) -> audit robustness -> sensitivity -> verifiable citations.

StatsPAI is an agent-native causal-inference toolkit, exposed as MCP tools, that makes hundreds of causal and econometric methods (DiD, IV, RD, synthetic control, DML, and more) callable. Its recommended workflow is to detect the design, get an estimator recommendation, fit to obtain a result handle, then chain that handle into the audit, sensitivity, and citation tools. Its value is not 'thinking causally for you' but managing execution, auditing, and provenance so that you can focus on identification and interpretation.

Start Here

What you should be able to do

  1. Understand the agent-native causal workflow: detect -> recommend -> fit -> audit -> sensitivity -> cite.

  2. Use a result handle (result_id) to chain an estimate into downstream tools without re-passing beta / sigma.

  3. Know that audit enumerates the missing robustness checks, turning identification into a checklist.

  4. Run at least one sensitivity analysis (e.g., E-value) to quantify robustness to unobserved confounding.

  5. Keep citation discipline: cite only from a verifiable bib, never letting the model fabricate references.

Learning Path

Learning path: detect -> recommend -> fit -> audit -> cite

Read StatsPAI along this path: detect the design, recommend an estimator, fit to a handle, audit robustness and run sensitivity, then close with verifiable citations.

  1. Detect: data shape -> design.

  2. Recommend: design -> robust estimator.

  3. Fit: a fit returns a result handle.

  4. Audit: enumerate missing checks and coverage.

  5. Cite: cite only from a verifiable bib.

01 / Intuition

Core Intuition

Most people treat stats software as a set of commands to memorize. StatsPAI makes it a workflow: first ask what design this is, then what estimator to use.

Identification always precedes estimation: on the same panel, estimating a staggered DiD with ordinary TWFE can bias the estimate when treatment effects differ across cohorts or over time, so classify the design first.

The result handle is the key abstraction: fit once to get a result_id that downstream audit, sensitivity, and tables all reference, keeping the evidence chain intact.

Credibility does not come from a significance star. It comes from audit coverage, sensitivity analysis, and verifiable citations, consistent with the causal spirit of the whole course.

02 / Math

Formalizing the agent-native causal workflow

01 / Detect design

Map the data shape to a design: panel + staggered treatment -> DiD, a cutoff -> RD, an instrument -> IV, else selection on observables.
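The mapping above can be sketched as a small rule-based classifier. This is a minimal illustration in plain Python, not StatsPAI's actual detection logic: the keyword flags (`is_panel`, `staggered_treatment`, `has_cutoff`, `has_instrument`) are hypothetical names chosen for this example, and the returned labels match the keys used in Case 1.

```python
def detect_design(is_panel=False, staggered_treatment=False,
                  has_cutoff=False, has_instrument=False):
    """Map coarse data-shape flags to a design label (illustrative rules only)."""
    if is_panel and staggered_treatment:
        return "staggered_did"
    if has_cutoff:
        return "regression_discontinuity"
    if has_instrument:
        return "instrumental_variables"
    # Fallback when no quasi-experimental structure is declared.
    return "selection_on_observables"

print(detect_design(is_panel=True, staggered_treatment=True))  # staggered_did
print(detect_design(has_cutoff=True))                          # regression_discontinuity
print(detect_design())                                         # selection_on_observables
```

The rule order is itself a design choice: here a staggered panel takes priority over a cutoff or an instrument, which a real classifier would need to justify or expose to the user.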

02 / Recommend estimator

Each design maps to a robust estimator, avoiding misuse (e.g., TWFE can be biased under staggered timing when treatment effects are heterogeneous).

03 / Estimand (ATT)

Most policy evaluations target the average treatment effect on the treated.
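In potential-outcomes notation the estimand can be written out explicitly. The symbols are assumed here, since the text does not fix a notation: $Y_i(1)$ and $Y_i(0)$ are unit $i$'s outcomes with and without treatment, and $D_i$ is the treatment indicator.

```latex
\mathrm{ATT} = \mathbb{E}\bigl[\, Y_i(1) - Y_i(0) \mid D_i = 1 \,\bigr]
```

Only $Y_i(1)$ is observed for treated units, so the job of the design is to justify a stand-in for the unobserved $\mathbb{E}[Y_i(0) \mid D_i = 1]$.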

04 / Result handle

A fit returns a handle encapsulating coefficients, variance, and diagnostics for downstream tools.

05 / Audit coverage

Auditing turns identification into a checklist: coverage is the share of recommended checks that have actually been run.

06 / Sensitivity (E-value)

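The E-value is the minimum strength of association, on the risk-ratio scale, that an unobserved confounder would need with both the treatment and the outcome to fully explain away the effect; a larger value means a more robust result.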
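For an estimate expressed as a risk ratio $\mathrm{RR}$, the standard closed form is the expression that the Case 4 code evaluates:

```latex
E = \mathrm{RR} + \sqrt{\mathrm{RR}\,(\mathrm{RR} - 1)}, \qquad \mathrm{RR} \ge 1
```

When $\mathrm{RR} < 1$, replace $\mathrm{RR}$ by $1/\mathrm{RR}$ before applying the formula. At $\mathrm{RR} = 1$ the E-value is 1, meaning no confounding is needed to explain a null association.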

03 / Code

Code cases: the workflow from detection to audit

Use plain Python to simulate the StatsPAI workflow logic: detect a design, recommend an estimator, chain with a handle, audit coverage, E-value sensitivity, and citation discipline. Real projects use the StatsPAI MCP tools.

Case 1: recommend a robust estimator by design

Once the design is identified, have StatsPAI recommend a matching estimator to avoid method misuse.

RECOMMEND = {
    "staggered_did": "callaway_santanna",
    "regression_discontinuity": "rdrobust",
    "instrumental_variables": "ivreg",
    "selection_on_observables": "dml",
}
design = "staggered_did"
print(f"design={design} -> recommended estimator: {RECOMMEND[design]}")
print("reason: TWFE is biased under heterogeneous treatment timing")

Expected output

design=staggered_did -> recommended estimator: callaway_santanna
reason: TWFE is biased under heterogeneous treatment timing

How to read this code

  • Under staggered timing, TWFE uses already-treated units as controls, which biases the estimate when effects are heterogeneous; callaway_santanna is therefore recommended.
  • Design-driven estimation is the first step of 'using the right method.'
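To see why the recommendation matters, the sketch below builds a hand-made, noise-free toy panel (constructed for this section, not StatsPAI output) with two units treated at different times and an effect that grows with time since treatment. It computes the TWFE coefficient through the Frisch-Waugh-Lovell shortcut for a balanced panel, namely the slope of the outcome on the two-way-demeaned treatment indicator, and compares it with the true ATT.

```python
# Two units, three periods, no noise. Unit A is treated from t=2, unit B from t=3.
# The effect is 1 in the first treated period and 2 in the second.
units, periods = ["A", "B"], [1, 2, 3]
D = {("A", 1): 0, ("A", 2): 1, ("A", 3): 1,
     ("B", 1): 0, ("B", 2): 0, ("B", 3): 1}
effect = {("A", 1): 0, ("A", 2): 1, ("A", 3): 2,
          ("B", 1): 0, ("B", 2): 0, ("B", 3): 1}
Y = dict(effect)  # unit and time effects are zero, so Y equals the treatment effect

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def two_way_demean(x):
    """Subtract unit and period means, add back the grand mean (balanced panel)."""
    by_unit = {i: mean(x[i, t] for t in periods) for i in units}
    by_time = {t: mean(x[i, t] for i in units) for t in periods}
    grand = mean(x.values())
    return {(i, t): x[i, t] - by_unit[i] - by_time[t] + grand for (i, t) in x}

# TWFE coefficient = slope of Y on the residualized treatment indicator.
D_tilde = two_way_demean(D)
twfe = sum(D_tilde[k] * Y[k] for k in Y) / sum(v * v for v in D_tilde.values())
true_att = mean(effect[k] for k in D if D[k] == 1)

print(f"TWFE coefficient = {twfe:.3f}")      # 0.500
print(f"true ATT         = {true_att:.3f}")  # 1.333
print(f"weight on (A, t=3) is proportional to {D_tilde['A', 3]:.3f}")  # -0.167
```

The treated cell (A, t=3) enters with a negative weight because unit A, already treated, serves as the control for unit B's later adoption. That is how TWFE ends up at 0.5 when the average effect on treated cells is 4/3.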

Case 2: chain estimation and audit with a result handle

A fit returns a result_id that downstream audit references directly, without re-passing beta / sigma.

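results = {}  # in-memory store: result_id -> fitted result

def fit(estimator, as_handle=True):
    # Simulated fit: att and se are illustrative placeholders, not real estimates.
    rid = f"res_{len(results)+1}"
    results[rid] = {"estimator": estimator, "att": 0.073, "se": 0.021}
    # Return the handle by default; return the raw result only when asked.
    return rid if as_handle else results[rid]

def audit_result(result_id):
    # Look the fit up by its handle instead of re-passing beta / sigma.
    r = results[result_id]
    return {"id": result_id, "t_stat": round(r["att"] / r["se"], 2),
            "missing": ["pretrends_test", "honest_did", "sensitivity"]}

rid = fit("callaway_santanna", as_handle=True)
print("handle:", rid)
print("audit:", audit_result(rid))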

Expected output

handle: res_1
audit: {'id': 'res_1', 't_stat': 3.48, 'missing': ['pretrends_test', 'honest_did', 'sensitivity']}

How to read this code

  • The handle keeps the evidence chain intact across multi-step analysis.
  • Audit reads the handle to report the t-stat and the still-missing checks.

Case 3: audit coverage turns identification into a checklist

List the recommended robustness checks; done / missing is immediately visible.

recommended = {"pretrends_test", "honest_did", "sensitivity", "placebo", "cluster_se"}
done = {"pretrends_test", "cluster_se"}
coverage = len(done) / len(recommended)
print(f"robustness coverage = {coverage:.0%}")
print("still missing:", sorted(recommended - done))

Expected output

robustness coverage = 40%
still missing: ['honest_did', 'placebo', 'sensitivity']

How to read this code

  • Coverage = done / recommended turns identification from post-hoc defense into a pre-flight checklist.
  • A clear gap tells you what to add next.

Case 4: E-value sensitivity analysis

The E-value measures how strong unobserved confounding must be to explain away the effect; larger is more robust.

import math
def e_value(rr):
    if rr < 1:
        rr = 1 / rr
    return round(rr + math.sqrt(rr * (rr - 1)), 2)

for rr in (1.2, 1.5, 2.0):
    print(f"RR={rr} -> E-value = {e_value(rr)}")

Expected output

RR=1.2 -> E-value = 1.69
RR=1.5 -> E-value = 2.37
RR=2.0 -> E-value = 3.41

How to read this code

  • The E-value grows as the risk ratio moves away from 1; ratios below 1 are inverted first, so protective effects are handled symmetrically.
  • It turns 'could there be an omitted variable?' into a reportable number.
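A point-estimate E-value says nothing about sampling uncertainty, so it is commonly reported alongside the E-value for the confidence limit closest to the null. The sketch below extends the Case 4 function under that convention; `e_value_ci` is a hypothetical helper name, and the estimate and interval are made-up numbers used only for illustration.

```python
import math

def e_value(rr):
    """E-value for a risk ratio; ratios below 1 are inverted first."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

def e_value_ci(rr, lo, hi):
    """E-value for the confidence limit closest to the null (RR = 1)."""
    if lo <= 1 <= hi:
        return 1.0  # the interval already contains the null
    return e_value(lo if rr > 1 else hi)

# Hypothetical estimate and 95% interval, chosen only for illustration.
rr, lo, hi = 1.5, 1.1, 2.05
print(f"point estimate: E-value = {e_value(rr):.2f}")             # 2.37
print(f"CI limit:       E-value = {e_value_ci(rr, lo, hi):.2f}")  # 1.43
```

A confounder of strength 2.37 would be needed to move the point estimate to the null, but one of strength 1.43 would be enough to move the interval so that it includes the null. The second number is the more conservative one to report.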

Case 5: citation discipline — reject fabricated references

All citations must come from a verifiable bib; any key not in the library is rejected.

VERIFIED_BIB = {
    "callaway2021": "Callaway & Sant'Anna (2021), J. Econometrics",
    "goodmanbacon2021": "Goodman-Bacon (2021), J. Econometrics",
}
def cite(keys):
    out, bad = [], []
    for k in keys:
        (out if k in VERIFIED_BIB else bad).append(k)
    return {"cited": [VERIFIED_BIB[k] for k in out], "rejected_invented": bad}

print(cite(["callaway2021", "smith2099"]))

Expected output

{'cited': ["Callaway & Sant'Anna (2021), J. Econometrics"], 'rejected_invented': ['smith2099']}

How to read this code

  • The model's biggest danger is not 'looking machine-like' but fabricating plausible citations.
  • Treat a single BibTeX file (paper.bib) as the only source of citations; a key that is not in it cannot be cited.
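In practice the verified set would be read from the bib file rather than typed by hand. The sketch below is one way to do that with regular expressions, under stated assumptions: `BIB_TEXT` stands in for the contents of paper.bib and carries only the fields the text gives, `DRAFT` is a made-up sentence, and the helper names are hypothetical. A real project would more likely use a proper BibTeX parser.

```python
import re

# Stand-in for the contents of paper.bib; the entries mirror VERIFIED_BIB above.
BIB_TEXT = """
@article{callaway2021,
  author  = {Callaway and Sant'Anna},
  journal = {J. Econometrics},
  year    = {2021}
}
@article{goodmanbacon2021,
  author  = {Goodman-Bacon},
  journal = {J. Econometrics},
  year    = {2021}
}
"""

# Draft sentence with one key that exists and one that does not.
DRAFT = r"TWFE can mislead under staggered timing \cite{goodmanbacon2021, smith2099}."

def bib_keys(bib_text):
    """Collect entry keys such as callaway2021 from '@type{key,' headers."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))

def cited_keys(tex):
    r"""Collect every key used inside \cite{...} commands, in order."""
    keys = []
    for group in re.findall(r"\\cite\w*\{([^}]*)\}", tex):
        keys.extend(k.strip() for k in group.split(","))
    return keys

known = bib_keys(BIB_TEXT)
unknown = [k for k in cited_keys(DRAFT) if k not in known]
print("known keys:", sorted(known))  # ['callaway2021', 'goodmanbacon2021']
print("not in bib:", unknown)        # ['smith2099']
```

Running a check like this before compiling turns "no fabricated references" from a promise into a failing test.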

04 / Case

Case: evaluate a staggered policy rollout with the StatsPAI workflow

  • Question: a policy rolled out across regions at different times; estimate its average effect on firm investment from a region-year panel.
  • detect_design classifies it as staggered DiD; recommend suggests callaway_santanna because, under staggered timing, TWFE can place negative weights on some treated observations and is biased when effects are heterogeneous.
  • The fit returns a result handle; audit_result flags missing pretrends, honest DiD, and placebo checks and reports coverage.
  • After adding sensitivity (E-value / honest DiD), report the estimate + interval + audit coverage + sensitivity + limitations; all citations go through bibtex, with no fabrication.
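The reporting step in the last bullet can be sketched as a function that bundles the estimate with its interval and audit state. This is an assumption-laden illustration: `build_report` and its `status` rule are hypothetical, the numbers are the placeholder values from Cases 2 and 3, and the 95% interval uses a normal approximation.

```python
def build_report(att, se, done, recommended, citations):
    """Bundle an estimate with its interval, audit coverage, and open checks."""
    z = 1.96  # normal critical value for a 95% interval
    done, recommended = set(done), set(recommended)
    missing = sorted(recommended - done)
    return {
        "att": att,
        "ci95": (round(att - z * se, 3), round(att + z * se, 3)),
        "coverage": len(done & recommended) / len(recommended),
        "missing_checks": missing,
        "citations": citations,
        # Treat the estimate as a conclusion only once every check has been run.
        "status": "conclusion" if not missing else "provisional",
    }

report = build_report(
    att=0.073, se=0.021,                    # placeholder values from Case 2
    done={"pretrends_test", "cluster_se"},  # sets from Case 3
    recommended={"pretrends_test", "honest_did", "sensitivity",
                 "placebo", "cluster_se"},
    citations=["callaway2021"],
)
print(report["ci95"])    # (0.032, 0.114)
print(report["status"])  # provisional
```

Making `status` depend on the missing checks is one way to encode the rule that an estimate without audit and sensitivity is not yet a conclusion.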

05 / Causal

Bridge to causal inference: collapse the whole course into one auditable pipeline

StatsPAI unifies the designs from the first three weeks (DiD / IV / RD / synthetic control / DML) into one agent-native pipeline: identification first, estimation in the middle, audit and sensitivity after, citations to close. It is the concrete form of 'using AI for causal inference.'

01 / Design drives estimation (design -> estimator)

Classify the design first, then let StatsPAI recommend a matching estimator to avoid method misuse.

02 / Unbroken evidence chain (handle -> downstream)

Use result_id to chain estimation, audit, sensitivity, tables, and citations into a rerunnable evidence chain.

03 / Audit is identification (audit -> checklist)

Turn identification assumptions into an audit checklist with quantifiable coverage and fillable gaps.

04 / Verifiable citations (claim -> citation)

Every conclusion maps to verifiable references and code, with no fabrication.

Three red lines: (1) identification comes from the design; tools can only change the functional form, they cannot defend your assumptions; (2) an estimate without audit and sensitivity is not a conclusion; (3) citations go only through a verifiable bib, and the model must not fabricate references.

06 / Risks

Common Pitfalls

  • Skipping identification and picking an estimator directly: under staggered timing, TWFE can place negative weights on treated observations and is biased when effects are heterogeneous.
  • Treating statistical significance as causal evidence: a star without audit and sensitivity is untrustworthy.
  • Re-passing beta / sigma to downstream tools by hand: this is error-prone, so chain with a result handle instead.
  • Skipping sensitivity analysis: this leaves 'how big a threat is unobserved confounding?' unanswered.
  • Letting the model fabricate references: citations must come from a verifiable bib (paper.bib as the single source).
  • Treating tool output as a conclusion: identification assumptions and claim boundaries still need human judgment.
