More Questions About Balance (And No Answers) 

The recent posts on achieving good balance within matching have stimulated a certain amount of interest.  To this debate I offer more questions and, alas, no answers, though answers are what I'd really like to have.  (For what it's worth, I am not doing research in this area.  All of my questions are genuine, not rhetorical.)

As I understand it, the genetic algorithm that Diamond and Sekhon favor searches for matches that maximize the smallest p-value from hypothesis tests.  The subjects of these hypothesis tests are the covariates, taken one at a time, and their two-way interactions, also taken one at a time.

 My questions:  
Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?  

 If so, do we expect achieving balance in all univariate (i.e. marginal) and two-way distributions to accomplish this goal, given that the marginal distributions of any multidimensional random vector do not determine the joint?  On the other hand, if two sets of random vectors have the same joint distribution, would we expect hypothesis tests applied to individual (univariate) covariates or their interactions to achieve p-values of .15 or greater?   
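A small simulation may make the worry concrete. Here is a minimal sketch (in Python, with hypothetical ±1 covariates) of two groups whose marginal and two-way distributions agree while their joint distributions differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000

# "Treated" group: three independent fair +/-1 coins.
t = rng.choice([-1, 1], size=(n, 3))

# "Control" group: X3 = X1 * X2.  Each marginal is still a fair coin,
# and each PAIR of coordinates is independent, but the joint differs:
# in this group the three-way product X1 * X2 * X3 is always 1.
c12 = rng.choice([-1, 1], size=(n, 2))
c = np.column_stack([c12, c12[:, 0] * c12[:, 1]])

# Univariate and two-way comparisons are true nulls here, so their
# p-values carry no information about the joint difference...
for j in range(3):
    print(stats.ttest_ind(t[:, j], c[:, j]).pvalue)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(stats.ttest_ind(t[:, i] * t[:, j], c[:, i] * c[:, j]).pvalue)

# ...but the three-way product exposes the difference immediately.
print((t.prod(axis=1) == 1).mean())   # about 0.5
print((c.prod(axis=1) == 1).mean())   # exactly 1.0
```

Note that every pairwise product in the control group (e.g. X1·X3 = X1·X1·X2 = X2) is itself a fair coin, which is why no univariate or two-way check can distinguish the groups.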

Does the dimension of the vector (i.e., the number of covariates) play a role here, in that with, say, 20 covariates we would expect a comparison of individual covariates marginally to produce a few p-values below .15 even under perfect balance?  Perhaps more broadly, what theory tells us that the genetic algorithm's search is actually attempting to do the right thing, and what is that right thing?
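To put a number on that intuition: if the two groups really are draws from the same distribution, the p-values are roughly uniform, so about 20 × .15 = 3 of them should fall below .15 by chance alone. A quick sketch (hypothetical sample sizes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, reps = 250, 20, 500   # hypothetical group size, covariates, replications

counts = []
for _ in range(reps):
    # Treated and control drawn from the SAME distribution: perfect balance.
    t = rng.normal(size=(n, k))
    c = rng.normal(size=(n, k))
    p = stats.ttest_ind(t, c).pvalue    # one t-test per covariate column
    counts.append(int((p < 0.15).sum()))

mean_count = float(np.mean(counts))
print(mean_count)   # close to 20 * 0.15 = 3 "failures" per perfectly balanced sample
```

So a handful of small p-values among 20 marginal checks is exactly what perfect balance looks like.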

A propensity score method has answers to some of these questions, though it raises others.  On the plus side, the theorems say that observations with the same propensity score have the same joint (not merely marginal) distribution of the covariates.  Thus, if the goal is to replicate a randomized experiment's much-valued ability to produce observations with the same joint covariate distribution, conditioning on the true propensity score will do that.  That is the theory telling us that what propensity score matching attempts to do is the right thing.  The problem, of course, is that in any case that matters we don't know the true propensity scores, and estimating them raises profound questions about model fit and adequacy.  One can check disparities in marginal distributions, but for the reasons stated above, such checks are not really enough.  A question for advocates of propensity scores is the following:  if propensity score matching is designed to reduce dependence on the substantive model that relates outcomes to covariates, does it do so only by inducing dependence on proper specification of the propensity score model?
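The balancing property is easy to see in a simulation where the true propensity score is known by construction (a sketch; the covariates, coefficients, and score band below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Two covariates; the TRUE propensity score is known here only because
# we simulated the assignment mechanism ourselves.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
e = 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2)))
z = rng.random(n) < e   # treatment indicator

# Unconditionally, the treated have markedly higher x1...
print(x1[z].mean() - x1[~z].mean())

# ...but within a narrow band of the true propensity score, treated and
# control covariate distributions (jointly, not just marginally) agree.
band = (e > 0.55) & (e < 0.60)
print(x1[band & z].mean() - x1[band & ~z].mean())
```

The first difference is large; the second is near zero, up to the slight residual variation of the score within the band.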

 For those who would eschew hypothesis tests in assessing balance (see yesterday's post), how does one assess balance?  True, one can always reduce the power of any test to reject a null by discarding observations (I have heard that K-S in particular has low power), but any comparison of distributions rests on some set of criteria.  Looking at t-scores is a hypothesis test (how else would one decide when the set of scores is too big or too small?).  Are hypothesis tests the worst method of assessing balance, except for all of the others? 
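One commonly proposed alternative is the standardized mean difference, which, unlike a t-statistic, does not drift toward apparent "balance" as observations are discarded. A sketch of the contrast (simulated data; the 0.2-standard-deviation shift is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def t_stat(a, b):
    # Two-sample t statistic (Welch form): shrinks as n shrinks.
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def std_diff(a, b):
    # Standardized mean difference: free of sample size.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

t_full = rng.normal(0.2, 1.0, 4000)   # "treated": shifted by 0.2 sd
c_full = rng.normal(0.0, 1.0, 4000)   # "control"

for m in (4000, 400, 40):
    print(m, round(t_stat(t_full[:m], c_full[:m]), 2),
          round(std_diff(t_full[:m], c_full[:m]), 2))
# Discarding observations drives the t statistic toward "balance";
# the standardized difference stays near the true 0.2 shift (up to noise).
```

Of course, one still needs a criterion for when a standardized difference is "too big," which is the author's point: some threshold judgment is unavoidable.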

 I have only one suggestion on this subject:  whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results.  For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three.  Do a test and see what happens.  If the two groups have the same joint distribution of their covariates . . . .
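In the spirit of that suggestion, here is a sketch of such a "wild" check, comparing per-observation products of the named powers between two groups that share standard-normal marginals but not a joint distribution (the common-factor construction is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50_000

def wild(x):
    # Per observation: (covariate 1)^5 * (covariate 2)^3 * (covariate 3)^2.
    return x[:, 0]**5 * x[:, 1]**3 * x[:, 2]**2

# "Treated": three independent standard normals.
treated = rng.normal(size=(n, 3))

# "Control": same standard-normal marginals (0.36 + 0.64 = 1), but the
# coordinates share a common factor, so the joint distribution differs.
f = rng.normal(size=(n, 1))
control = 0.6 * f + 0.8 * rng.normal(size=(n, 3))

print(wild(treated).mean())    # near 0
print(wild(control).mean())    # well away from 0
print(stats.ttest_ind(wild(treated), wild(control)).pvalue)
```

Marginal checks pass by construction here, yet the wild statistic separates the groups; the price is that such high-moment statistics are heavy-tailed, so their tests can be noisy in small samples.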