What Did (and Do We Still) Learn from the LaLonde Dataset (Part II)?

I ended yesterday's post about the famous LaLonde dataset with the following two questions: (1) What have we learned from the LaLonde debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from these data and need to move on to new datasets?

 On the first point, VERY bluntly summarized, the comic-strip history goes somewhat like this. First, LaLonde showed that regression and IV do not get it right. Next, Heckman's research group released a string of papers in the late 1980s and 1990s trying to defend conventional regression and selection-based methods. Enter stage Dehejia and Wahba (1999). They showed that, apparently, propensity score methods (sub-classification and matching) get it right if one controls for more than one year of pre-intervention earnings. Smith and Todd (2002, 2004) are next in line, claiming that propensity score methods do not get it right: once one slightly tweaks the propensity score specification, the results are again all over the place. The ensuing debate spawned more than five papers as Rajeev Dehejia replied to the Smith and Todd findings (all papers of this debate can be found here). Then, last but not least, Diamond and Sekhon (2005) argue that matching does get it right if it is done properly, namely if one achieves a really high standard of balance (we have already had quite a controversy about balance on this very blog; see, for example, here).
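
For readers who want to see what the Dehejia-and-Wahba-style recipe amounts to in practice, here is a minimal sketch in Python of propensity score matching with a balance check. This is my own illustrative sketch, not code from any of the papers above: the file name and column names (treat, re74, re75, re78, and so on) follow common conventions for the NSW/LaLonde data but are assumptions, and a serious analysis would use a more careful specification and proper standard errors.

```python
# Minimal sketch: propensity score matching with a balance check on
# LaLonde-style data. File path and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("lalonde.csv")  # hypothetical path to the NSW/observational data

# Covariates include two years of pre-intervention earnings (re74, re75),
# the feature Dehejia and Wahba emphasize.
covariates = ["age", "educ", "black", "hispanic", "married",
              "nodegree", "re74", "re75"]
X = df[covariates].to_numpy(dtype=float)
t = df["treat"].to_numpy()
y = df["re78"].to_numpy(dtype=float)

# 1. Estimate propensity scores with a simple logistic model.
pscore = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# 2. One-to-one nearest-neighbor matching on the propensity score,
#    with replacement: each treated unit gets its closest control.
treated = np.where(t == 1)[0]
control = np.where(t == 0)[0]
dist = np.abs(pscore[treated][:, None] - pscore[control][None, :])
matches = control[np.argmin(dist, axis=1)]

att = np.mean(y[treated] - y[matches])
print(f"ATT estimate: {att:.0f}")

# 3. Balance check: standardized mean differences before and after matching.
for j, name in enumerate(covariates):
    sd = np.sqrt((X[treated, j].var() + X[control, j].var()) / 2)
    before = (X[treated, j].mean() - X[control, j].mean()) / sd
    after = (X[treated, j].mean() - X[matches, j].mean()) / sd
    print(f"{name:10s} SMD before: {before:+.2f}  after: {after:+.2f}")
```

The point of the balance table at the end is exactly the issue Diamond and Sekhon press on: whether the estimate can be trusted depends on how well the matched sample actually balances the covariates, not merely on having run a matching procedure.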


 So what does this leave applied researchers with? What do we take away from the LaLonde debate? Does anyone still think that regression (or maximum likelihood methods more generally) and/or two-stage least squares IV produce reliable causal inferences in real-world observational studies? In all seriousness, where is the validation? This is the million-dollar question, because MLE and IV methods represent the great majority of what is taught and published across the social sciences. Also, can we trust propensity score methods? How about other matching methods? Or is there little hope for causal inference from observational data in any case (in which case I fear we are all out of a job, and the philosophers get the last laugh)? This is not necessarily my personal opinion, but I would be interested to hear people's opinions. [The evidence is of course not limited to LaLonde; there is ample evidence from other studies with similar findings. See, for example, Friedlander and Robins (1995), Fraker and Maynard (1987), Agodini and Dynarski (2004), Wilde and Hollister (2002), and various Rubin papers, to name just a few.]

 On the second point, let me play the devil's advocate again and ask: What can we still learn from the LaLonde data? After all, it is just one single dataset, the standard errors even for the experimental data are large, and once we match in the observational data, why would we even expect to get it right? There is obviously a strong case to be made for selection on unobservables in the job training setting, so even if we manage to adjust for observed differences, why in the world should we get the estimate right? [Again, this is not my personal opinion, but I have heard a similar contention both at a recent conference and in Stat 214.] Maybe instead of a job training experiment, we should first use experimental and observational data on something like plants or frogs, where hidden bias may (!) be less of a problem (assuming that is actually the case)? Finally, what alternatives do we have? How would we know what the right answer was if we were not working with a LaLonde-style validation framework? Again, I would be interested in everybody's opinion on this point.