Valid Standard Errors for Propensity Score Matching, Anyone? 

Propensity Score Matching (PSM) has become an increasingly popular method for estimating treatment effects in observational studies. Most papers that use PSM also report standard errors for their treatment effect estimates. I always wonder where these standard errors actually come from. To my knowledge, there is still no method for calculating valid standard errors for PSM. What do you all think about this topic?


The issue is this: getting standard errors for PSM works out nicely when the true propensity score is known. Alberto Abadie and Guido Imbens have developed a formula that provides principled standard errors when matching is done on covariates or on the true propensity score. You can read about it here. This formula is used by their nnmatch matching software in Stata and by Jasjeet Sekhon's matching package in R.

Yet in observational settings we do not know the true propensity score, so we first have to estimate it. Usually people regress the treatment indicator on a handful of covariates using a probit or logit link function. The predicted probabilities from this model are then taken as the estimated propensity score to be matched on in the second step. (Some people match on the linear predictor instead, which is desirable because it does not cluster as much around 0 and 1.)
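To make the two steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the simulated data (two covariates, a true effect of 2), the hand-rolled Newton-Raphson logit fit, and the function names `solve`, `fit_logit` and `att_psm`. It matches each treated unit to its single nearest control on the estimated linear predictor, with replacement, and is not how nnmatch, psmatch2 or any other package implements matching.

```python
import math
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def fit_logit(X, d, iters=25):
    """Step 1: logit of treatment on covariates via Newton-Raphson.
    Each row of X must already contain the constant term."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]  # X'WX, the negative Hessian
        for row, di in zip(X, d):
            p = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(beta, row))))
            for a in range(k):
                grad[a] += (di - p) * row[a]
                for b in range(k):
                    hess[a][b] += p * (1.0 - p) * row[a] * row[b]
        beta = [bj + sj for bj, sj in zip(beta, solve(hess, grad))]
    return beta


def att_psm(X, d, y, beta):
    """Step 2: match each treated unit to the nearest control on the
    estimated linear predictor (with replacement) and average the gaps."""
    score = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    treated = [i for i, di in enumerate(d) if di == 1]
    controls = [i for i, di in enumerate(d) if di == 0]
    gaps = []
    for i in treated:
        j = min(controls, key=lambda c: abs(score[c] - score[i]))
        gaps.append(y[i] - y[j])
    return sum(gaps) / len(gaps)


# Hypothetical data: treatment depends on x1 and x2, true effect is 2.
random.seed(0)
X, d, y = [], [], []
for _ in range(500):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    p = 1.0 / (1.0 + math.exp(-(0.5 * x1 - 0.5 * x2)))
    di = 1 if random.random() < p else 0
    X.append([1.0, x1, x2])
    d.append(di)
    y.append(2.0 * di + x1 + x2 + random.gauss(0, 1))

beta = fit_logit(X, d)
att = att_psm(X, d, y, beta)
print("logit coefficients:", beta)
print("ATT estimate:", att)
```

Note that `att_psm` treats `beta` as if it were known. Any standard error computed from the matched sample alone inherits that assumption, which is exactly the problem with the first-step estimation uncertainty.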

Unfortunately, the above-mentioned formula does not work when matching on the estimated propensity score, because it does not account for the estimation uncertainty created in the first step. Thus, the confidence intervals for the treatment effect estimates in the second step will most likely not have the correct coverage.

This issue is not easily resolved. Why not just bootstrap the whole two-step procedure? Well, there is evidence to suggest that the bootstrap is likely to fail in the case of PSM. In the closely related problem of deriving standard errors for conventional nearest neighbor matching, Guido and Alberto show in a recent paper that even in the simple case of matching on a single continuous covariate (when the estimator is root-N consistent and asymptotically normally distributed with zero asymptotic bias) the bootstrap does not provide valid standard errors, so confidence intervals built from them do not have correct coverage. This is due to the extreme non-smoothness of nearest neighbor matching, which leads the bootstrap variance to diverge from the actual variance.
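For readers who want to poke at this themselves, here is a rough sketch of the kind of check involved, in plain Python. It uses one-nearest-neighbor matching on a single continuous covariate and compares the naive bootstrap standard error from one sample with the Monte Carlo standard deviation of the estimator across many independent samples. The data generating process, sample sizes and function names (`simulate`, `att_match`, `bootstrap_se`) are all assumptions of mine, not taken from the paper, and a single run like this proves nothing about the asymptotic result.

```python
import random
import statistics


def simulate(n, rng):
    """Hypothetical DGP: one continuous covariate, constant effect of 1."""
    data = []
    for _ in range(n):
        x = rng.random()
        d = 1 if rng.random() < 0.3 + 0.4 * x else 0
        y = 1.0 * d + x + rng.gauss(0, 1)
        data.append((x, d, y))
    return data


def att_match(data):
    """ATT from one-nearest-neighbor matching on x, with replacement."""
    treated = [(x, y) for x, d, y in data if d == 1]
    controls = [(x, y) for x, d, y in data if d == 0]
    gaps = []
    for x, y in treated:
        _, y0 = min(controls, key=lambda c: abs(c[0] - x))
        gaps.append(y - y0)
    return statistics.fmean(gaps)


def bootstrap_se(data, reps, rng):
    """Naive bootstrap: resample units, redo the matching, take the SD."""
    n = len(data)
    estimates = []
    for _ in range(reps):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(att_match(sample))
    return statistics.stdev(estimates)


rng = random.Random(1)
n, reps = 200, 200

# Benchmark: SD of the estimator across independent samples from the DGP.
mc_sd = statistics.stdev([att_match(simulate(n, rng)) for _ in range(reps)])

# Bootstrap SE computed from one single sample.
boot_se = bootstrap_se(simulate(n, rng), reps, rng)

print(f"Monte Carlo SD: {mc_sd:.3f}")
print(f"Bootstrap SE:   {boot_se:.3f}")
```

A more serious version would repeat the bootstrap over many samples, vary n, and track interval coverage rather than eyeballing two numbers. The same skeleton could be extended to the PSM case by re-estimating the propensity score inside each bootstrap replication.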

In the case of PSM, the same problem is likely to occur unless estimating the propensity score in the first step makes the matching estimator smooth enough for the bootstrap to work. But this is an open question. At least to my knowledge, there exists no Monte Carlo evidence or theoretical justification for why the bootstrap should work here. I would be interested to hear opinions on this issue. It's a critical question because the bootstrap for PSM is often used in practice: various matching routines (for example pscore or psmatch2 in Stata) offer bootstrapped standard error options for matching on the estimated propensity score.