Applied Statistics - Gary King 

 This week, the Applied Statistics Workshop will present a talk by Gary King, the David Florence Professor of Government at Harvard and the Director of the Institute for Quantitative Social Science.  He has published over 100 articles, and his work has appeared in journals in public health, law, sociology, and statistics, as well as in every major journal in political science.  He is the author or co-author of seven books, many of which are standards in their fields.  His research has been recognized with numerous awards, and he is one of the most cited authors in political science. He is also the faculty convenor of this blog. 

 Professor King will present a talk entitled "How to Read 100 Million Blogs (and How to Classify Deaths without Physicians)."  The talk is based on two papers, one co-authored with Dan Hopkins and the other with Ying Lu.  The presentation will be at noon on Wednesday, April 11 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the talk and links to the papers follow after the jump: 


 How to Read 100 Million Blogs 
(and How to Classify Deaths without Physicians)
 
Gary King
 
We develop a new method of computerized content analysis that gives approximately unbiased and statistically consistent estimates of quantities of theoretical interest to social scientists. With a small subset of documents hand coded into investigator-chosen categories, our approach can give accurate estimates of the proportion of text documents in each category in a larger population. The hand coded subset need not be a random sample, and may differ in dramatic but specific ways from the population. Previous methods require random samples, which are often infeasible in social science text analysis applications; they also attempt to maximize the percent of individual documents correctly classified, a criterion which leaves open the possibility of substantial estimation bias for the aggregate proportions of interest. We also correct, apparently for the first time, for the far less-than-perfect levels of inter-coder reliability that typically characterize human attempts to classify documents, an approach that will normally outperform even population hand coding when that is feasible. We illustrate the effectiveness of this approach by tracking the daily opinions of millions of people about candidates for the 2008 presidential nominations in online blogs, data we introduce and make available with this article. We demonstrate the broad applicability of our approach through additional evaluations in a variety of available corpora from other areas, including large databases of movie reviews and university web sites. We also offer easy-to-use software that implements all methods described. 

 The methods for a key part of this paper build on King and Lu (2007), which the talk will also briefly cover. That paper offers a new method of estimating cause-specific mortality in areas without medical death certification from "verbal autopsy data" (symptom questionnaires given to caregivers). This method turned out to give estimates considerably better than the existing approaches, which included expensive and unreliable physician reviews (where three physicians spend 20 minutes with the answers to the symptom questions from each deceased to decide on the cause of death), expert rule-based algorithms, or model-dependent parametric statistical models.  
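
 For readers who want a feel for why estimating category proportions directly can differ from classifying each document and counting, here is a minimal sketch. It is not the authors' software, and it is a deliberate simplification: two categories, a single hypothetical indicator word, and the assumption that the word's frequency within each category is the same in the hand-coded set and in the population. Under that assumption, P(word) = P(word | cat 1) * pi + P(word | cat 0) * (1 - pi), which can be solved for pi, the population share of category 1. All function names and the toy counts below are made up for illustration.

```python
from fractions import Fraction


def share(flags):
    """Fraction of True values in a non-empty list of booleans."""
    return Fraction(sum(flags), len(flags))


def estimate_proportion(labeled, unlabeled):
    """Estimate the population share of category 1 (hypothetical helper).

    labeled:   list of (has_word, category) pairs, category in {0, 1}
    unlabeled: list of has_word booleans for the population
    Assumes P(word | category) is the same in both sets.
    """
    in_cat1 = [has_word for has_word, cat in labeled if cat == 1]
    in_cat0 = [has_word for has_word, cat in labeled if cat == 0]
    p_word_cat1 = share(in_cat1)
    p_word_cat0 = share(in_cat0)
    p_word = share(unlabeled)
    if p_word_cat1 == p_word_cat0:
        raise ValueError("the word does not distinguish the categories")
    # Solve p_word = p_word_cat1 * pi + p_word_cat0 * (1 - pi) for pi.
    pi = (p_word - p_word_cat0) / (p_word_cat1 - p_word_cat0)
    # Clamp to [0, 1], since sampling noise can push the solution outside.
    return min(max(pi, Fraction(0)), Fraction(1))


# Toy hand-coded set: half category 1, half category 0 (not a random sample).
# The word appears in 8 of 10 category-1 documents and 2 of 10 category-0 ones.
labeled = ([(True, 1)] * 8 + [(False, 1)] * 2
           + [(True, 0)] * 2 + [(False, 0)] * 8)

# Toy population of 100 documents whose true category-1 share is 20%:
# 20 category-1 documents (16 with the word) and 80 category-0 documents
# (16 with the word), so 32 documents contain the word in total.
unlabeled = [True] * 32 + [False] * 68

estimate = estimate_proportion(labeled, unlabeled)

# Classify-and-count baseline: label a document category 1 if it has the word.
naive = share(unlabeled)

print(estimate, naive)  # prints "1/5 8/25"
```

 In this toy setup, the per-document rule "category 1 if the word appears" is right for 80% of documents, yet counting its output puts the category-1 share at 32% when the true share is 20%. Solving for the proportion directly recovers 20%, even though the hand-coded set is split evenly between the categories.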

 Copies of the two papers are available at:  
  
http://gking.harvard.edu/files/abs/words-abs.shtml  
 http://gking.harvard.edu/files/abs/vamc-abs.shtml