Bridging Balance and Efficiency

The Role of the Propensity Score in Causal Machine Learning

Nima Hejazi

nhejazi@hsph.harvard.edu

Harvard Biostatistics

Mark van der Laan

laan@berkeley.edu

UC Berkeley

August 6, 2023

Balance and the Propensity Score

The propensity score in causal inference

Cross-sectional problem: evaluate the average treatment effect (ATE) with data on \(n\) units \(\{O_i\}_{i=1}^n\), for \(O = (X, Z, R) \sim P_0\)
- \(X\) are baseline covariates (e.g., sex-at-birth, age)
- \(Z \in \{0, 1\}\) is a binary “treatment” (e.g., smoking)
- \(R\) is an outcome (e.g., cardiovascular disease)
Rosenbaum and Rubin (1983) introduced the propensity score: \(e_0(X) = \mathbb{P}_0(Z = 1 \mid X)\)
- showed that \(e_0(X)\) is a critical tool for analyzing observational study data
- tied its estimate, \(e_n(X)\), to empirical balance, now a popular diagnostic, e.g., via the cobalt R package (Greifer 2022)
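As a minimal sketch of the link between \(e_n(X)\) and empirical balance, the following R snippet fits a main-terms logistic regression for the propensity score and compares the difference in covariate means between treatment arms before and after inverse-probability weighting. The data-generating process (a single Gaussian covariate, coefficient 0.8) is a hypothetical choice made only for illustration.

```r
# Hypothetical data-generating process, for illustration only
set.seed(1)
n <- 2000
X <- rnorm(n)                      # a single baseline covariate
e0 <- plogis(0.8 * X)              # true propensity score e_0(X)
Z <- rbinom(n, 1, e0)              # binary treatment

# Estimate e_n(X) by main-terms logistic regression
ps_fit <- glm(Z ~ X, family = binomial())
e_n <- fitted(ps_fit)

# Raw covariate imbalance: difference in means of X between arms
diff_raw <- mean(X[Z == 1]) - mean(X[Z == 0])

# Imbalance after inverse-probability weighting by e_n(X)
w <- ifelse(Z == 1, 1 / e_n, 1 / (1 - e_n))
diff_ipw <- weighted.mean(X[Z == 1], w[Z == 1]) -
  weighted.mean(X[Z == 0], w[Z == 0])

print(c(raw = diff_raw, weighted = diff_ipw))
```

Packages such as cobalt report this kind of weighted mean difference across many covariates at once; the base R version above only makes the arithmetic explicit.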

Causal inference from observational study data

When does statistical inference coincide with causal inference?
- no unmeasured confounding (or randomized treatment)
- “sufficient” experimentation in treatment assignment
Propensity score tied to positivity (sufficient experimentation)
- together with no unmeasured confounding, \(\epsilon < e_0(X) < 1 - \epsilon\) (for some \(\epsilon > 0\)) implies that the ATE is identified, and, by extension, counterfactual means, too
- its estimate, \(e_n(X)\), allows diagnosis of “random” positivity violations (Hernán and Robins 2023; Petersen et al. 2012)
Empirical balance (independence) \(X \perp\!\!\!\perp Z \mid e_0(X)\) may imply the necessary \(\{R(1), R(0)\} \perp\!\!\!\perp Z \mid X\) through \(e_0(X)\)
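One simple way to use \(e_n(X)\) as a positivity diagnostic is to check how many units have estimated propensity scores near 0 or 1 and how large the implied weights become. The sketch below assumes a hypothetical data-generating process and an arbitrary threshold `eps`; neither comes from the cited references.

```r
# Hypothetical setting in which treatment depends strongly on X
set.seed(2)
n <- 1000
X <- rnorm(n)
Z <- rbinom(n, 1, plogis(2 * X))
e_n <- fitted(glm(Z ~ X, family = binomial()))

eps <- 0.05                        # illustrative threshold, not a recommendation

# Distribution of the estimated propensity score
print(summary(e_n))

# Share of units whose estimated score falls outside [eps, 1 - eps]
prop_flagged <- mean(e_n < eps | e_n > 1 - eps)

# Largest inverse-probability weight among treated units
max_weight <- max(1 / e_n[Z == 1])

print(c(prop_flagged = prop_flagged, max_weight = max_weight))
```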

Effect estimation in observational studies

Consider \(\tau_0 = \mathbb{E}_0 R(1)\) and its identification, based on IPW, as \(\tau_0 = \mathbb{E}_0 \{[\mathbb{I}(Z = 1)/e_0(X)] R \}\)
From this identification argument, the implied IPW estimator is \(\tau_n^{\text{IPW}} = n^{-1} \sum_{i=1}^n [\mathbb{I}(Z_i = 1)/e_n(X_i)] R_i\)
The key properties of this type of IPW estimator (Horvitz and Thompson 1952; Hernán and Robins 2023) depend on how \(e_n(X_i)\) is estimated (e.g., by logistic regression)
- \(\tau_n^{\text{IPW}} \to \tau_0\) as \(n \to \infty\), provided \(e_n(X) \to e_0(X)\)
- …but \(\tau_n^{\text{IPW}}\) is never asymptotically efficient (Robins and Rotnitzky 1992; van der Laan and Robins 2003)
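The IPW estimator of \(\mathbb{E}_0 R(1)\) can be sketched in a few lines of R. The data-generating process below is hypothetical and chosen so that the true counterfactual mean is 3.5 by construction; the unadjusted mean among the treated is shown for contrast.

```r
# Hypothetical data-generating process with a binary confounder
set.seed(3)
n <- 5000
X <- rbinom(n, 1, 0.5)
e0 <- ifelse(X == 1, 0.7, 0.3)     # true propensity score
Z <- rbinom(n, 1, e0)
R <- 1 + 2 * Z + X + rnorm(n)      # E R(1) = 1 + 2 + 0.5 = 3.5

# Estimated propensity score via logistic regression
e_n <- fitted(glm(Z ~ X, family = binomial()))

# IPW estimator: n^{-1} sum_i [I(Z_i = 1) / e_n(X_i)] R_i
tau_ipw <- mean((Z == 1) / e_n * R)

# Unadjusted comparison: mean outcome among the treated (confounded by X)
tau_naive <- mean(R[Z == 1])

print(c(ipw = tau_ipw, naive = tau_naive, truth = 3.5))
```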

Efficient estimation

Efficient estimation of causal effect estimands

A regular asymptotically linear (RAL) estimator \(\tau_n\) is asymptotically efficient if it admits the form \(\tau_n - \tau_0 = n^{-1} \sum_{i = 1}^n D^{\star}(P_0)(O_i) + o_p(n^{-1/2})\)
- \(D^{\star}(P_0)(O)\) is the efficient influence function (EIF) at the distribution \(P_0 \in \mathcal{M}\), evaluated for the study unit \(O\)
- when the EIF estimating equation \(P_n D^{\star} \approx 0\) is solved, \(\sqrt{n}(\tau_n - \tau_0) \to_{\text{D}} \text{N}(0, P_0 \{D^{\star}(P_0)\}^2)\)
Robins and Rotnitzky (1992)’s AIPW representation of the EIF: \(D^{\star}(P_0)(O) = D_{\text{IPW}}(P_0)(O) - D_{\text{CAR}}(P_0)(O)\)
- \(D_{\text{IPW}} = [\mathbb{I}(Z = 1) / e_0(X)] R - \tau_0\); solved by \(\tau_n^{\text{IPW}}\)
- \(D_{\text{CAR}} = [\overline{Q}_0(1, X) / e_0(X)] (Z - e_0(X))\), with \(\overline{Q}_0(1, X) = \mathbb{E}_0[R \mid Z = 1, X]\) the outcome regression; projection onto CAR tangent space (van der Laan and Robins 2003)
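To connect this decomposition with the more familiar augmented form of the EIF, the difference can be expanded term by term. The derivation assumes \(\overline{Q}_0(1, X)\) denotes the outcome regression \(\mathbb{E}_0[R \mid Z = 1, X]\) and uses \(Z = \mathbb{I}(Z = 1)\) for binary \(Z\).

```latex
\begin{align*}
D^{\star}(P_0)(O)
  &= D_{\text{IPW}}(P_0)(O) - D_{\text{CAR}}(P_0)(O) \\
  &= \frac{\mathbb{I}(Z = 1)}{e_0(X)} R - \tau_0
     - \frac{\overline{Q}_0(1, X)}{e_0(X)} \{Z - e_0(X)\} \\
  &= \frac{\mathbb{I}(Z = 1)}{e_0(X)} \{R - \overline{Q}_0(1, X)\}
     + \overline{Q}_0(1, X) - \tau_0
\end{align*}
```

The last line is the usual "weighted residual plus plug-in" expression, which makes clear why the \(D_{\text{CAR}}\) term involves the outcome regression.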

Efficiency of IPW estimators of causal effects

Recall that \(\tau_n^{\text{IPW}}\) is never asymptotically efficient
- it solves \(P_n D_{\text{IPW}}(e_n) \approx 0\) but \(P_n D_{\text{CAR}}(e_n, \overline{Q}_n)\) remains unsolved; improved balance of \(e_n(X)\) does not fix this
- draws attention to the score equation \(P_n D_{\text{CAR}} \approx 0\), of form \(h_0(X)(Z - e_0(X))\) — must be solved for efficiency
Frameworks for constructing efficient estimators (e.g., sieve-based, DR-based, targeting) focus on solving score equations
- result is that \(P_n D^{\star}(\overline{Q}_n, e_n) \approx 0\) is solved well enough
- could involve undersmoothing \(e_n\), targeting \(\overline{Q}_n\) (or \(e_n\), per van der Laan (2014)), or “additive” debiasing using \(P_n D^{\star}\)
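Of the strategies listed, "additive" debiasing is the simplest to sketch: start from a plug-in estimate and add the empirical mean of the estimated EIF. The example below is a rough illustration under a hypothetical data-generating process (true \(\mathbb{E} R(1) = 2\)) with correctly specified parametric nuisance fits; it is not the nonparametric approach discussed in this deck.

```r
# Hypothetical data-generating process, for illustration only
set.seed(4)
n <- 5000
X <- rnorm(n)
Z <- rbinom(n, 1, plogis(0.5 * X))
R <- 1 + Z + X + rnorm(n)          # E R(1) = 1 + 1 + 0 = 2

# Nuisance estimates: propensity score and outcome regression at Z = 1
e_n <- fitted(glm(Z ~ X, family = binomial()))
or_fit <- lm(R ~ Z + X)
Q1_n <- predict(or_fit, newdata = data.frame(Z = 1, X = X))

# Plug-in estimate and estimated EIF evaluated at each observation
tau_plugin <- mean(Q1_n)
eif <- (Z == 1) / e_n * (R - Q1_n) + Q1_n - tau_plugin

# One-step ("additive") correction and an EIF-based standard error
tau_onestep <- tau_plugin + mean(eif)
se_onestep <- sd(eif) / sqrt(n)

print(c(estimate = tau_onestep, se = se_onestep))
```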

The empirical balance-efficiency tradeoff

Empirical balance seems desirable but so too does efficiency – are these desiderata related? (Hejazi and van der Laan 2023)
Yes! Solving score equation \(P_n s(e_n; f) \approx 0\) leads to \(f\)-specific balance: \(f(X) \perp\!\!\!\perp Z \mid e_n(X)\) for \(s(e_n; f) = f(X) (Z - e_n(X))\)
Independence relation: \(\mathbb{E}_0 (Z \mid f(X), e_n(X)) = \mathbb{E}_0(Z \mid e_n(X))\)
- when \(e_n(X)\) solves \(P_n s(e_n; f) \approx 0\), then \(f(X)\) contains no more information, beyond that in \(e_n(X)\), on confounders \(X\)
- when \(f \in \mathcal{F}\), for a rich function class \(\mathcal{F}\), then satisfaction of \(f\)-specific empirical balance suggests desirable \(e_n(X)\)
- construct \(\mathcal{F}\) to contain functions \(h_0(X)\), implied by \(D^{\star}\), then \(e_n(X)\) may lead to an efficient estimator \(\tau_n\) as well
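The notion of \(f\)-specific balance can be illustrated with ordinary logistic regression, whose maximum likelihood fit solves \(P_n f(X)(Z - e_n(X)) = 0\) for every \(f\) included as a regressor, but not necessarily for omitted functions. The sketch below uses a hypothetical treatment mechanism with an interaction and compares a main-terms fit with a fit that includes the interaction.

```r
# Hypothetical treatment mechanism with an X1-by-X2 interaction
set.seed(5)
n <- 2000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
Z <- rbinom(n, 1, plogis(-0.5 + X1 - X2 + 1.5 * X1 * X2))

# Empirical score P_n f(X) (Z - e_n(X)) for a given f and fitted e_n
score <- function(f, e_n) mean(f * (Z - e_n))

e_main <- fitted(glm(Z ~ X1 + X2, family = binomial()))
e_int <- fitted(glm(Z ~ X1 * X2, family = binomial()))

# Candidate functions f: the two main terms and their product
f_set <- cbind(X1 = X1, X2 = X2, X1X2 = X1 * X2)
scores <- rbind(
  main_terms = apply(f_set, 2, score, e_n = e_main),
  interaction = apply(f_set, 2, score, e_n = e_int)
)
print(signif(scores, 3))
```

In this setup the main-terms fit leaves the score for the product term unsolved, while the richer fit solves all three, which is the sense in which a richer class \(\mathcal{F}\) yields balance on more functions of \(X\).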

Score-based empirical balance and efficiency

Consider \(\text{logit}(\mathbb{E}_0[Z \mid X]) = \text{logit}(e_n(X)) + \beta f(X)\), for \(f \in \mathcal{F}\), with \(e_n(X) = \mathbb{E}_n[Z \mid X]\) entering as an offset (Cox 1958)
- \(H_0(f): \beta = 0\) vs. \(H_1(f): \beta \neq 0\) gives a hypothesis test where \(X \perp\!\!\!\perp Z \mid e_n(X) \equiv X \perp\!\!\!\perp Z \mid e_n(X), f(X)\) under \(H_0\)
- at \(\beta = 0\), \(P_n s(e_n; f) = \mathbb{E}_n f(X)(Z - e_n(X))\) is the score of the empirical log-likelihood; can be used as a test statistic
Ertefaie, Hejazi, and van der Laan (2022) formulated a class of nonparametric IPW estimators via undersmoothing (Hirano, Imbens, and Ridder 2003) of highly adaptive lasso (HAL) (van der Laan 2015, 2017) for \(e_n(X)\)
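The offset regression above can be fit directly with `glm()`. The sketch below is a rough illustration under a hypothetical treatment mechanism: it takes a main-terms \(e_n(X)\) as a fixed offset and tests \(\beta = 0\) for \(f(X) = X_1 X_2\) with a Wald test. Note the assumption that the offset is treated as known, so the reported p-value ignores the variability from estimating \(e_n\).

```r
# Hypothetical treatment mechanism with an X1-by-X2 interaction
set.seed(6)
n <- 5000
X1 <- rbinom(n, 1, 0.3)
X2 <- rbinom(n, 1, 0.6)
Z <- rbinom(n, 1, plogis(-0.5 + X1 - X2 + 1.5 * X1 * X2))

# Main-terms propensity score, omitting the interaction
e_n <- fitted(glm(Z ~ X1 + X2, family = binomial()))

# Candidate function f(X) whose balance is being tested
f <- X1 * X2

# logit E[Z | X] = logit(e_n(X)) + beta * f(X), with no intercept
aug_fit <- glm(Z ~ 0 + f + offset(qlogis(e_n)), family = binomial())
beta_tab <- summary(aug_fit)$coefficients
print(beta_tab)

# Score at beta = 0, i.e., P_n f(X) (Z - e_n(X))
score_f <- mean(f * (Z - e_n))
print(score_f)
```

A small p-value for \(\beta\) suggests that \(f(X)\) carries information about \(Z\) beyond \(e_n(X)\), that is, that \(f\)-specific balance fails for this choice of \(e_n\).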

Nonparametric IPW estimation

Ertefaie, Hejazi, and van der Laan (2022)’s IPW estimators:
- use HAL to fit \(e_0(X)\), yielding \(e_{n, \lambda_{n, \text{CV}}}(X)\), where \(\lambda_{n, \text{CV}}\) is the \(L_1\) regularization penalty selected by cross-validation
- HAL constructs indicator basis functions to span \(e_0(X)\), paring down the high-dimensional basis by regularization
- undersmoothing relaxes regularization to restore basis functions, giving candidates \(e_{n, \lambda_n}\), where \(\lambda_n < \lambda_{n, \text{CV}}\)
- proposal: pick \(\lambda_n\) so that \(e_{n, \lambda_n}\) solves \(P_n D_{\text{CAR}}(e_{n, \lambda_n}) \approx 0\), satisfying \(P_n D^{\star} \approx 0\) and attaining efficiency by extension
Classical measures of the balance property are incompatible

Numerical example

Simulation setup

Data from a cross-sectional observational study:

library(data.table)

# ps_mech() and or_mech() are the treatment and outcome mechanisms (definitions not shown)
n_obs <- 1000
X1 <- rbinom(n_obs, 1, 0.3)
X2 <- rbinom(n_obs, 1, 0.6)
Z <- rbinom(n_obs, 1, ps_mech(X1, X2))
R <- or_mech(X1, X2, Z) + rnorm(n_obs, 0, 1)
data_obs <- data.table(X1 = X1, X2 = X2, Z = Z, R = R)
head(data_obs)

   X1 X2 Z          R
1:  0  1 0 -1.4678054
2:  1  0 1  3.9493182
3:  0  0 0 -1.3367169
4:  0  0 1  1.0713527
5:  0  0 0 -0.4104178
6:  0  1 0  1.1017295

Truth: counterfactual mean \(\tau_0 = \mathbb{E} R(1) \approx\) 3.721

Fitting nuisance functions

Fit propensity score with two variants of logistic regression:

- A naive GLM including only main terms for each of \(X_1, X_2\)
- An “oracle” GLM using the form of the generating function

ps_data <- data.table(X1 = X1, X2 = X2, Z = Z)
ps_glm_naive <- glm(Z ~ ., data = ps_data, family = "binomial")
ps_pred_naive <- predict(ps_glm_naive, type = "response")
ps_glm_orac <- glm(Z ~ X1 + X2 + X1 * X2, data = ps_data, family = "binomial")
ps_pred_orac <- predict(ps_glm_orac, type = "response")

Fit outcome regression using the form of the generating function

or_data <- copy(ps_data)[, R := R]
or_lm_orac <- lm(R ~ Z + X1 + X2 + X1 * X2, data = or_data)
or_pred_orac <- predict(or_lm_orac)

Fitting nuisance functions

Fit propensity score using HAL with undersmoothing

library(hal9001)  # provides fit_hal()

# fit HAL with cross-validation to identify CV-selected regularization term
X_cols <- c("X1", "X2")
ps_hal_getcv <- fit_hal(
  X = as.matrix(ps_data[, ..X_cols]), Y = ps_data$Z, family = "binomial",
  smoothness_orders = 0, fit_control = list(cv_select = TRUE, n_folds = 5)
)
lambda_cv <- ps_hal_getcv$lambda_star
lambda_seq <- seq(lambda_cv, 1e-5 * lambda_cv, length = 1e4)

# fit HAL over a grid of relaxed regularizations, starting from CV's choice
ps_hal_usm <- fit_hal(
  X = as.matrix(ps_data[, ..X_cols]), Y = ps_data$Z, family = "binomial",
  smoothness_orders = 0, fit_control = list(cv_select = FALSE, n_folds = 5),
  lambda = lambda_seq
)
ps_hal_pred_lambdaseq <- predict(ps_hal_usm, new_data = ps_data)

Comparing IPW estimators

Select propensity score from HAL estimates to minimize \(|P_n D_{\text{CAR}}|\)

# find the HAL fit whose empirical mean of D_CAR is closest to zero
# (dcar() evaluates D_CAR at each observation; its definition is not shown)
hal_dcar <- apply(ps_hal_pred_lambdaseq, 2, function(ps_pred) {
  mean(dcar(Z, ps_pred, or_pred_orac))
})
ps_pred_hal <- ps_hal_pred_lambdaseq[, which.min(abs(hal_dcar))]
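The code above calls `dcar()` without showing its definition. One hypothetical version, written directly from the expression \(D_{\text{CAR}} = [\overline{Q}_0(1, X) / e_0(X)] (Z - e_0(X))\) given earlier, is sketched below; the argument names and order are assumptions matched to the call above, and `or_pred` is assumed to hold outcome regression predictions at \(Z = 1\).

```r
# Hypothetical helper: evaluates D_CAR at each observation.
#   Z        binary treatment indicator (0/1)
#   ps_pred  estimated propensity scores e_n(X)
#   or_pred  estimated outcome regression, assumed evaluated at Z = 1
dcar <- function(Z, ps_pred, or_pred) {
  (or_pred / ps_pred) * (Z - ps_pred)
}

# Toy check on two units with e_n(X) = 0.5 and outcome regression 2
toy <- dcar(Z = c(1, 0), ps_pred = c(0.5, 0.5), or_pred = c(2, 2))
print(toy)
```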

Candidate IPW estimators – bias and efficiency:

PS Estimator	IPW Estimate	SE	\(P_n D_{\text{CAR}}\)	Bias
GLM - oracle	3.745	0.006	1.791	0.023
GLM - naive	3.775	0.006	1.817	0.054
HAL	3.713	0.002	1.757	0.008

Comparing IPW estimators

Candidate IPW estimators – balance on \(X_1, X_2\):

	Type	GLM - oracle	GLM - naive	HAL
X1	Binary	0.064	0.075	0.076
X2	Binary	-0.253	-0.254	-0.262
X1_0 * X2_0	Binary	0.151	0.145	0.147
X1_0 * X2_1	Binary	-0.215	-0.221	-0.223
X1_1 * X2_0	Binary	0.103	0.109	0.115
X1_1 * X2_1	Binary	-0.038	-0.033	-0.039

Classical measures of the balance property are incompatible

Done!

@nhejazi

@nshejazi

nimahejazi.org

10.1353/obs.2023.0001

References

Cox, David R. 1958. “Two Further Applications of a Model for Binary Regression.” Biometrika 45 (3/4): 562–65. https://doi.org/10.1093/biomet/45.3-4.562.

Ertefaie, Ashkan, Nima S Hejazi, and Mark J van der Laan. 2022. “Nonparametric Inverse-Probability-Weighted Estimators Based on the Highly Adaptive Lasso.” Biometrics (in Press). https://doi.org/10.1111/biom.13719.

Greifer, Noah. 2022. cobalt: Covariate Balance Tables and Plots. https://CRAN.R-project.org/package=cobalt.

Hejazi, Nima S, and Mark J van der Laan. 2023. “Revisiting the Propensity Score’s Central Role: Towards Bridging Balance and Efficiency in the Era of Causal Machine Learning.” Observational Studies 9 (1): 23–34. https://doi.org/10.1353/obs.2023.0001.

Hernán, Miguel A, and James M Robins. 2023. Causal Inference: What If. CRC Press.

Hirano, Keisuke, Guido W Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89.

Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.

Petersen, Maya L, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. 2012. “Diagnosing and Responding to Violations in the Positivity Assumption.” Statistical Methods in Medical Research 21 (1): 31–54. https://doi.org/10.1177/0962280210386207.

Robins, James M, and Andrea Rotnitzky. 1992. “Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers.” In AIDS Epidemiology, 297–331. Springer.

Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.

van der Laan, Mark J. 2014. “Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference.” International Journal of Biostatistics 10 (1): 29–57.

———. 2015. “A Generally Efficient Targeted Minimum Loss Based Estimator.” 343. University of California, Berkeley. https://biostats.bepress.com/ucbbiostat/paper343/.

———. 2017. “A Generally Efficient Targeted Minimum Loss Based Estimator Based on the Highly Adaptive Lasso.” International Journal of Biostatistics 13 (2).

van der Laan, Mark J, and James M Robins. 2003. Unified Methods for Censored Longitudinal Data and Causality. Springer.