Bridging Balance and Efficiency

The Role of the Propensity Score in Causal Machine Learning

Nima Hejazi

Harvard Biostatistics

Mark van der Laan

UC Berkeley

August 6, 2023

Balance and the Propensity Score

The propensity score in causal inference

  • Cross-sectional problem: Evaluate average treatment effect (ATE) with data on \(n\) units \(O_i\vert_{i=1}^n\), for \(O = (X, Z, R) \sim P_0\)
    • \(X\) are baseline covariates (e.g., sex-at-birth, age)
    • \(Z \in \{0, 1\}\) is a binary “treatment” (e.g., smoking)
    • \(R\) is an outcome (e.g., cardiovascular disease)
  • Rosenbaum and Rubin (1983) introduced the propensity score: \(e_0(X) = \mathbb{P}_0(Z = 1 \mid X)\)
    • showed that \(e_0(X)\) is a critical tool for analyzing observational study data
    • tied \(e_n(X)\) to empirical balance, now a popular diagnostic (e.g., via the cobalt R package; Greifer 2022)
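As a minimal sketch of how an estimated propensity score connects to a balance diagnostic, the base-R snippet below fits a main-terms logistic regression for \(e_n(X)\) on simulated data and compares the raw and inverse-probability-weighted differences in covariate means between treatment arms. The treatment mechanism and the helper `mean_diff()` are assumptions made for illustration; packages such as cobalt report standardized versions of such differences.

```r
set.seed(11)
n <- 5000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
# hypothetical treatment mechanism (illustrative only)
Z <- rbinom(n, 1, plogis(-0.5 + X1 + 0.5 * X2))

# e_n(X): main-terms logistic regression estimate of the propensity score
ps_fit <- glm(Z ~ X1 + X2, family = binomial())
e_n <- predict(ps_fit, type = "response")

# inverse probability of treatment weights
w <- ifelse(Z == 1, 1 / e_n, 1 / (1 - e_n))

# (weighted) difference in covariate means between the two treatment arms
mean_diff <- function(x, z, w = rep(1, length(x))) {
  weighted.mean(x[z == 1], w[z == 1]) - weighted.mean(x[z == 0], w[z == 0])
}

raw_diff <- mean_diff(X1, Z)     # imbalance before weighting
ipw_diff <- mean_diff(X1, Z, w)  # imbalance after weighting by e_n
c(raw = raw_diff, weighted = ipw_diff)
```

In this setup the weighted difference should be much closer to zero than the raw one, which is the sense in which \(e_n(X)\) is tied to empirical balance.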

Causal inference from observational study data

  • When does statistical inference coincide with causal inference?
    • no unmeasured confounding (or randomized treatment)
    • “sufficient” experimentation in treatment assignment
  • Propensity score tied to positivity (sufficient experimentation)
    • \(\epsilon < e_0(X) < 1 - \epsilon\) for some \(\epsilon > 0\), together with no unmeasured confounding, implies that the ATE is identified and, by extension, so are the counterfactual means
    • its estimate, \(e_n(X)\), allows diagnosis of “random” positivity violations (Hernán and Robins 2023; Petersen et al. 2012)
  • Empirical balance (independence) \(X \perp\!\!\!\perp Z \mid e_0(X)\) may imply the necessary \(\{R(1), R(0)\} \perp\!\!\!\perp Z \mid X\) through \(e_0(X)\)
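One way to use \(e_n(X)\) to check positivity in practice can be sketched as follows: summarize the estimated propensity scores, count the units falling outside \([\epsilon, 1 - \epsilon]\), and inspect the largest implied weight. The data-generating mechanism and the threshold `eps` are assumptions chosen for illustration, not recommended defaults.

```r
set.seed(21)
n <- 2000
X <- rnorm(n)
# hypothetical treatment mechanism with strong dependence on X,
# so that some units have propensity scores near 0 or 1
Z <- rbinom(n, 1, plogis(2.5 * X))
e_n <- predict(glm(Z ~ X, family = binomial()), type = "response")

eps <- 0.025  # illustrative threshold only
summary(e_n)

# share of units with estimated propensity scores outside [eps, 1 - eps]
prop_violating <- mean(e_n < eps | e_n > 1 - eps)

# largest inverse probability weight implied by e_n
max_weight <- max(ifelse(Z == 1, 1 / e_n, 1 / (1 - e_n)))
c(prop_violating = prop_violating, max_weight = max_weight)
```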

Effect estimation in observational studies

  • Consider \(\tau_0 = \mathbb{E}_0 R(1)\) and its identification, based on IPW, as \(\tau_0 = \mathbb{E}_0 \{[\mathbb{I}(Z = 1)/e_0(X)] R \}\)
  • From this identification argument, the implied IPW estimator is \(\tau_n^{\text{IPW}} = n^{-1} \sum_{i=1}^n [\mathbb{I}(Z_i = 1)/e_n(X_i)] R_i\)
  • The key properties of this type of IPW estimator (Horvitz and Thompson 1952; Hernán and Robins 2023) depend on the estimator (e.g., logistic regression) used to obtain \(e_n(X_i)\)
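The IPW estimator above can be sketched in a few lines of base R. The simulation below is an assumption made for illustration (its treatment and outcome mechanisms are not those of the talk), chosen so that the true counterfactual mean is known. The standard error shown is a crude one based on the sample variance of the weighted outcomes; it ignores the estimation of \(e_n\).

```r
set.seed(31)
n <- 10000
X <- rbinom(n, 1, 0.5)
Z <- rbinom(n, 1, plogis(-0.3 + 0.8 * X))  # hypothetical treatment mechanism
R <- 1 + 2 * Z + X + rnorm(n)              # hypothetical outcome mechanism
tau_true <- 1 + 2 + 0.5                    # E R(1) = 3.5 under this setup

# e_n(X) from logistic regression (saturated here, since X is binary)
e_n <- predict(glm(Z ~ X, family = binomial()), type = "response")

# IPW estimate: average of [I(Z = 1) / e_n(X)] * R
ipw_terms <- (Z == 1) / e_n * R
tau_ipw <- mean(ipw_terms)

# crude standard error treating e_n as fixed
se_ipw <- sd(ipw_terms) / sqrt(n)
c(estimate = tau_ipw, se = se_ipw, truth = tau_true)
```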

Efficient estimation

Efficient estimation of causal effect estimands

  • A regular asymptotically linear (RAL) estimator \(\tau_n\) is asymptotically efficient if it admits the form \(\tau_n - \tau_0 = n^{-1} \sum_{i = 1}^n D^{\star}(P_0)(O_i) + o_p(n^{-1/2})\)
    • \(D^{\star}(P_0)(O)\) is the efficient influence function (EIF) at the distribution \(P_0 \in \mathcal{M}\), evaluated for the study unit \(O\)
    • when \(P_n D^{\star}(P_0) \approx 0\), \(\sqrt{n}(\tau_n - \tau_0) \to_{\text{D}} \text{N}(0, P_0 \{D^{\star}(P_0)\}^2)\)
  • Robins and Rotnitzky (1992)’s AIPW representation of the EIF: \(D^{\star}(P_0)(O) = D_{\text{IPW}}(P_0)(O) - D_{\text{CAR}}(P_0)(O)\)
    • \(D_{\text{IPW}} = [\mathbb{I}(Z = 1) / e_0(X)] R - \tau_0\); solved by \(\tau_n^{\text{IPW}}\)
    • \(D_{\text{CAR}} = [\overline{Q}_0(1, X) / e_0(X)] (Z - e_0(X))\), with outcome regression \(\overline{Q}_0(z, x) = \mathbb{E}_0[R \mid Z = z, X = x]\); projection onto CAR tangent space (van der Laan and Robins 2003)
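As a worked step, substituting the two components into the representation above recovers the familiar augmented form of the EIF. This assumes \(\overline{Q}_0(1, X)\) denotes the outcome regression \(\mathbb{E}_0[R \mid Z = 1, X]\) and uses \(Z = \mathbb{I}(Z = 1)\) for binary \(Z\):

\[
\begin{aligned}
D^{\star}(P_0)(O) &= D_{\text{IPW}}(P_0)(O) - D_{\text{CAR}}(P_0)(O) \\
&= \frac{\mathbb{I}(Z = 1)}{e_0(X)} R - \tau_0 - \frac{\overline{Q}_0(1, X)}{e_0(X)} \{Z - e_0(X)\} \\
&= \frac{\mathbb{I}(Z = 1)}{e_0(X)} \{R - \overline{Q}_0(1, X)\} + \overline{Q}_0(1, X) - \tau_0
\end{aligned}
\]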

Efficiency of IPW estimators of causal effects

  • Noted that \(\tau_n^{\text{IPW}}\) is generally not asymptotically efficient
    • it solves \(P_n D_{\text{IPW}}(e_n) \approx 0\) but \(P_n D_{\text{CAR}}(e_n, \overline{Q}_n)\) remains unsolved; improved balance of \(e_n(X)\) does not fix this
    • draws attention to the score equation \(P_n D_{\text{CAR}} \approx 0\), of form \(h_0(X)(Z - e_0(X))\) — must be solved for efficiency
  • Frameworks for constructing efficient estimators (e.g., sieve-based, DR-based, targeting) focus on solving score equations
    • result is that \(P_n D^{\star}(\overline{Q}_n, e_n) \approx 0\) is solved well enough
    • could involve undersmoothing \(e_n\), targeting \(\overline{Q}_n\) (or \(e_n\), per van der Laan (2014)), or “additive” debiasing using \(P_n D^{\star}\)
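The role of the unsolved \(D_{\text{CAR}}\) component can be illustrated numerically: subtracting \(P_n D_{\text{CAR}}(e_n, \overline{Q}_n)\) from the IPW estimate gives exactly the AIPW ("additive" debiasing) estimate. The sketch below uses a hypothetical simulation, with a main-terms propensity score model and an outcome regression that includes the interaction; none of these choices come from the talk.

```r
set.seed(41)
n <- 5000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
# hypothetical treatment and outcome mechanisms (illustrative only)
Z <- rbinom(n, 1, plogis(-0.4 + 0.6 * X1 + 0.5 * X2 - 0.8 * X1 * X2))
R <- 1 + 2 * Z + X1 - X2 + X1 * X2 + rnorm(n)
tau_true <- 3 + 0.3 - 0.6 + 0.3 * 0.6  # E R(1) = 2.88 under this setup
dat <- data.frame(X1, X2, Z, R)

# nuisance estimates: main-terms propensity score and outcome regression
e_n <- predict(glm(Z ~ X1 + X2, data = dat, family = binomial()),
               type = "response")
or_fit <- lm(R ~ Z + X1 * X2, data = dat)
Q1_n <- predict(or_fit, newdata = transform(dat, Z = 1))  # Q_n(1, X)

tau_ipw <- mean(Z / e_n * R)
pn_dcar <- mean(Q1_n / e_n * (Z - e_n))  # empirical mean of D_CAR

# removing the CAR component from IPW yields the AIPW estimate
tau_aipw_from_ipw <- tau_ipw - pn_dcar
tau_aipw_direct <- mean(Z / e_n * (R - Q1_n) + Q1_n)
c(ipw = tau_ipw, pn_dcar = pn_dcar, aipw = tau_aipw_direct, truth = tau_true)
```

The two AIPW expressions agree up to floating-point error, which is the algebraic identity \(D^{\star} = D_{\text{IPW}} - D_{\text{CAR}}\) evaluated at the estimated nuisance functions.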

The empirical balance-efficiency tradeoff

  • Empirical balance seems desirable but so too does efficiency – are these desiderata related? (Hejazi and van der Laan 2023)
  • Yes! Solving score equation \(P_n s(e_n; f) \approx 0\) leads to \(f\)-specific balance: \(f(X) \perp\!\!\!\perp Z \mid e_n(X)\) for \(s(e_n; f) = f(X) (Z - e_n(X))\)
  • Independence relation: \(\mathbb{E}_0 (Z \mid f(X), e_n(X)) = \mathbb{E}_0(Z \mid e_n(X))\)
    • when \(e_n(X)\) solves \(P_n s(e_n; f) \approx 0\), then \(f(X)\) contains no more information, beyond that in \(e_n(X)\), on confounders \(X\)
    • when \(f \in \mathcal{F}\), for a rich function class \(\mathcal{F}\), then satisfaction of \(f\)-specific empirical balance suggests desirable \(e_n(X)\)
    • if \(\mathcal{F}\) is constructed to contain the functions \(h_0(X)\) implied by \(D^{\star}\), then \(e_n(X)\) may lead to an efficient estimator \(\tau_n\) as well
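A small numerical check of the score-equation view: a logistic regression fit by maximum likelihood solves \(P_n f(X)(Z - e_n(X)) = 0\) for each \(f\) among its design columns, but not necessarily for functions left out of the model. The simulation and the helper `pn_score()` below are assumptions made for illustration.

```r
set.seed(51)
n <- 5000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
# hypothetical treatment mechanism with a strong X1:X2 interaction
Z <- rbinom(n, 1, plogis(-0.4 + 0.6 * X1 + 0.5 * X2 - 1.5 * X1 * X2))

e_main <- predict(glm(Z ~ X1 + X2, family = binomial()), type = "response")
e_full <- predict(glm(Z ~ X1 * X2, family = binomial()), type = "response")

# empirical score P_n f(X) (Z - e_n(X)) for a candidate function f
pn_score <- function(f, z, e_n) mean(f * (z - e_n))

scores <- rbind(
  main_terms = c(
    X1 = pn_score(X1, Z, e_main),
    X2 = pn_score(X2, Z, e_main),
    X1X2 = pn_score(X1 * X2, Z, e_main)
  ),
  with_interaction = c(
    X1 = pn_score(X1, Z, e_full),
    X2 = pn_score(X2, Z, e_full),
    X1X2 = pn_score(X1 * X2, Z, e_full)
  )
)
round(scores, 6)
```

The main-terms fit solves the score equations for \(f(X) = X_1\) and \(f(X) = X_2\) but leaves the one for \(f(X) = X_1 X_2\) unsolved; the fit with the interaction solves all three.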

Score-based empirical balance and efficiency

  • Consider \(\text{logit}(\mathbb{E}_0[Z \mid X]) = \text{logit}(e_n(X)) + \beta f(X)\), for \(f \in \mathcal{F}\), with \(e_n(X) = \mathbb{E}_n[Z \mid X]\) treated as an offset
    • \(H_0(f): \beta = 0\) vs. \(H_1(f): \beta \neq 0\) gives a hypothesis test where \(X \perp\!\!\!\perp Z \mid e_n(X) \equiv X \perp\!\!\!\perp Z \mid e_n(X), f(X)\) under \(H_0\)
    • at \(\beta = 0\), \(P_n s(e_n; f) = \mathbb{E}_n [f(X)(Z - e_n(X))]\) is the score of the empirical log-likelihood; it can be used as a test statistic
  • Ertefaie, Hejazi, and van der Laan (2022) formulated a class of nonparametric IPW estimators via undersmoothing of the highly adaptive lasso (HAL) (van der Laan 2015, 2017) for \(e_n(X)\)
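The offset regression above can be sketched with `glm()`: the fitted \(\text{logit}(e_n(X))\) enters as an offset, a candidate \(f(X)\) enters as the only covariate, and the test of \(\beta = 0\) is read from the coefficient table. The simulation and the choice \(f(X) = X_1 X_2\) are assumptions made for illustration, and a Wald test stands in here for the score test mentioned in the text.

```r
set.seed(61)
n <- 5000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
# hypothetical treatment mechanism with an X1:X2 interaction
Z <- rbinom(n, 1, plogis(-0.4 + 0.6 * X1 + 0.5 * X2 - 1.5 * X1 * X2))

# initial propensity score estimate omitting the interaction
e_n <- predict(glm(Z ~ X1 + X2, family = binomial()), type = "response")

# candidate function f(X) to check for f-specific balance
f_X <- X1 * X2

# logit E[Z | X] = logit(e_n(X)) + beta * f(X), with logit(e_n) as an offset
aug_fit <- glm(Z ~ 0 + f_X + offset(qlogis(e_n)), family = binomial())
beta_row <- summary(aug_fit)$coefficients["f_X", ]
p_value <- unname(beta_row["Pr(>|z|)"])

# empirical score at beta = 0, for comparison
score_at_zero <- mean(f_X * (Z - e_n))
c(beta = unname(beta_row["Estimate"]), p_value = p_value, score = score_at_zero)
```

A small p-value suggests that \(f(X)\) carries information about \(Z\) beyond \(e_n(X)\), i.e., that \(f\)-specific balance fails for this \(e_n\).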

Nonparametric IPW estimation

  • Ertefaie, Hejazi, and van der Laan (2022)’s IPW estimators:
    • use HAL to fit \(e_0(X)\), yielding \(e_{n, \lambda_{n, \text{CV}}}(X)\), where \(\lambda_{n, \text{CV}}\) is the \(L_1\) regularization penalty selected by cross-validation
    • HAL constructs indicator basis functions to span \(e_0(X)\), paring down the high-dimensional basis by regularization
    • undersmoothing relaxes regularization to restore basis functions, giving candidates \(e_{n, \lambda_n}\), where \(\lambda_n < \lambda_{n, \text{CV}}\)
    • proposal: pick \(\lambda_n\) so that \(e_{n, \lambda_n}\) solves \(P_n D_{\text{CAR}}(e_{n, \lambda_n}) \approx 0\), satisfying \(P_n D^{\star} \approx 0\) and attaining efficiency by extension
  • Classical measures of the balance property are incompatible

Numerical example

Simulation setup

Data from a cross-sectional observational study:

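library(data.table)

# ps_mech() and or_mech() are the treatment and outcome mechanisms of the
# simulation; their definitions are not shown here
n_obs <- 1000
X1 <- rbinom(n_obs, 1, 0.3)  # binary baseline covariates
X2 <- rbinom(n_obs, 1, 0.6)
Z <- rbinom(n_obs, 1, ps_mech(X1, X2))  # treatment drawn with probability ps_mech(X1, X2)
R <- or_mech(X1, X2, Z) + rnorm(n_obs, 0, 1)  # outcome with standard normal noise
data_obs <- data.table(X1 = X1, X2 = X2, Z = Z, R = R)
head(data_obs)
#>    X1 X2 Z          R
#> 1:  0  1 0 -1.4678054
#> 2:  1  0 1  3.9493182
#> 3:  0  0 0 -1.3367169
#> 4:  0  0 1  1.0713527
#> 5:  0  0 0 -0.4104178
#> 6:  0  1 0  1.1017295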

Truth: counterfactual mean \(\tau_0 = \mathbb{E} R(1) \approx\) 3.721

Fitting nuisance functions

Fit propensity score with two variants of logistic regression:

  1. A naive GLM including only main terms for each of \(X_1, X_2\)
  2. An “oracle” GLM using the form of the generating function

ps_data <- data.table(X1 = X1, X2 = X2, Z = Z)
# naive fit: main terms only
ps_glm_naive <- glm(Z ~ ., data = ps_data, family = "binomial")
ps_pred_naive <- predict(ps_glm_naive, type = "response")
# oracle fit: main terms plus the X1:X2 interaction
ps_glm_orac <- glm(Z ~ X1 + X2 + X1 * X2, data = ps_data, family = "binomial")
ps_pred_orac <- predict(ps_glm_orac, type = "response")

Fit outcome regression using the form of the generating function

or_data <- copy(ps_data)[, R := R]
or_lm_orac <- lm(R ~ Z + X1 + X2 + X1 * X2, data = or_data)
or_pred_orac <- predict(or_lm_orac)

Fitting nuisance functions

Fit propensity score using HAL with undersmoothing

library(hal9001)  # provides fit_hal()

# fit HAL with cross-validation to identify CV-selected regularization term
X_cols <- c("X1", "X2")
ps_hal_getcv <- fit_hal(
  X = as.matrix(ps_data[, ..X_cols]), Y = ps_data$Z, family = "binomial",
  smoothness_orders = 0, fit_control = list(cv_select = TRUE, n_folds = 5)
)
lambda_cv <- ps_hal_getcv$lambda_star
lambda_seq <- seq(lambda_cv, 1e-5 * lambda_cv, length.out = 1e4)

# fit HAL over a grid of relaxed regularizations, starting from CV's choice
ps_hal_usm <- fit_hal(
  X = as.matrix(ps_data[, ..X_cols]), Y = ps_data$Z, family = "binomial",
  smoothness_orders = 0, fit_control = list(cv_select = FALSE, n_folds = 5),
  lambda = lambda_seq
)
ps_hal_pred_lambdaseq <- predict(ps_hal_usm, new_data = ps_data)

Comparing IPW estimators

Select propensity score from HAL estimates so that \(P_n D_{\text{CAR}}\) is as close to zero as possible

# find the fit whose empirical mean of D_CAR is closest to zero
# (dcar() is a helper evaluating D_CAR; its definition is not shown here)
hal_dcar <- apply(ps_hal_pred_lambdaseq, 2, function(ps_pred) {
  mean(dcar(Z, ps_pred, or_pred_orac))
})
ps_pred_hal <- ps_hal_pred_lambdaseq[, which.min(abs(hal_dcar))]
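The helper `dcar()` is not defined in the code shown. A hypothetical stand-in consistent with the definition \(D_{\text{CAR}} = [\overline{Q}(1, X) / e(X)] (Z - e(X))\) is sketched below, together with a toy example of picking, from a set of candidate propensity score fits, the one whose empirical mean of \(D_{\text{CAR}}\) is closest to zero. The argument names, the toy inputs, and the assumption that the third argument holds predictions of \(\overline{Q}_n(1, X)\) are illustrative and may differ from the authors' implementation.

```r
# hypothetical stand-in for dcar(): evaluates D_CAR unit by unit, taking
# or_pred to be the outcome regression predicted under Z = 1
dcar <- function(Z, ps_pred, or_pred) {
  (or_pred / ps_pred) * (Z - ps_pred)
}

# toy inputs: four units and two candidate propensity score fits
Z_toy <- c(1, 0, 1, 0)
q1_toy <- c(2, 2, 4, 4)
ps_grid <- cbind(
  lambda_a = rep(0.5, 4),
  lambda_b = c(0.4, 0.4, 0.6, 0.6)
)

# empirical mean of D_CAR for each candidate fit
pn_dcar <- apply(ps_grid, 2, function(ps) mean(dcar(Z_toy, ps, q1_toy)))

# choose the candidate with P_n D_CAR closest to zero
best <- names(which.min(abs(pn_dcar)))
pn_dcar
best
```

Selecting on the absolute value matters when the empirical means can be negative: in this toy example the signed minimum is attained by `lambda_b`, whereas `lambda_a` is the candidate that actually solves \(P_n D_{\text{CAR}} = 0\).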

Candidate IPW estimators – bias and efficiency:

PS Estimator    IPW Estimate    SE       Pn D_CAR    Bias
GLM - oracle    3.745           0.006    1.791       0.023
GLM - naive     3.775           0.006    1.817       0.054
HAL             3.713           0.002    1.757       0.008

Comparing IPW estimators

Candidate IPW estimators – balance on \(X_1, X_2\):

Covariate      Type      GLM - oracle    GLM - naive    HAL
X1             Binary     0.064           0.075          0.076
X2             Binary    -0.253          -0.254         -0.262
X1_0 * X2_0    Binary     0.151           0.145          0.147
X1_0 * X2_1    Binary    -0.215          -0.221         -0.223
X1_1 * X2_0    Binary     0.103           0.109          0.115
X1_1 * X2_1    Binary    -0.038          -0.033         -0.039

Classical measures of the balance property are incompatible

Done!

  @nhejazi

  @nshejazi

  nimahejazi.org

  10.1353/obs.2023.0001

References

Cox, David R. 1958. “Two Further Applications of a Model for Binary Regression.” Biometrika 45 (3/4): 562–65. https://doi.org/10.1093/biomet/45.3-4.562.
Ertefaie, Ashkan, Nima S Hejazi, and Mark J van der Laan. 2022. “Nonparametric Inverse-Probability-Weighted Estimators Based on the Highly Adaptive Lasso.” Biometrics (in Press). https://doi.org/10.1111/biom.13719.
Greifer, Noah. 2022. cobalt: Covariate Balance Tables and Plots. https://CRAN.R-project.org/package=cobalt.
Hejazi, Nima S, and Mark J van der Laan. 2023. “Revisiting the Propensity Score’s Central Role: Towards Bridging Balance and Efficiency in the Era of Causal Machine Learning.” Observational Studies 9 (1): 23–34. https://doi.org/10.1353/obs.2023.0001.
Hernán, Miguel A, and James M Robins. 2023. Causal Inference: What If. CRC Press.
Hirano, Keisuke, Guido W Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89.
Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.
Petersen, Maya L, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. 2012. “Diagnosing and Responding to Violations in the Positivity Assumption.” Statistical Methods in Medical Research 21 (1): 31–54. https://doi.org/10.1177/0962280210386207.
Robins, James M, and Andrea Rotnitzky. 1992. “Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers.” In AIDS Epidemiology, 297–331. Springer.
Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
van der Laan, Mark J. 2014. “Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference.” International Journal of Biostatistics 10 (1): 29–57.
———. 2015. “A Generally Efficient Targeted Minimum Loss Based Estimator.” 343. University of California, Berkeley. https://biostats.bepress.com/ucbbiostat/paper343/.
———. 2017. “A Generally Efficient Targeted Minimum Loss Based Estimator Based on the Highly Adaptive Lasso.” International Journal of Biostatistics 13 (2).
van der Laan, Mark J, and James M Robins. 2003. Unified Methods for Censored Longitudinal Data and Causality. Springer.