
Authors' Committee

Chair:

Matt Blackwell (Gov)

Members:

Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship




30 March 2009

Spirling on "Bargaining Power in Practice: US Treaty-Making with American Indians, 1784-1911"

Please join us this Wednesday when Arthur Spirling, Department of Government, will present "Bargaining Power in Practice: US Treaty-making with American Indians, 1784-1911". Arthur provided the following overview for his talk:

I will discuss a new data set of treaties signed 1784--1911 between the United States government and American Indian tribes, and comment on some early findings using kernel methods to analyze these texts. I particularly welcome feedback and suggestions from the ASW on the appropriateness of the techniques given the problem at hand.

Arthur also provided the following abstract for a paper that is the basis for his talk:

Native Americans are unique among domestic actors in that their relations with the United States government involve treaty-making, with almost 600 such documents signed between the Revolutionary War and the turn of the twentieth century. We obtain and digitize all of these treaties for textual analysis. In particular, we employ new 'kernel methods' to study the evolution of their nature over time and show that the Indian Removal Act of 1830 represents a systematic shift in language. We relate our findings to a bargaining model with the parties---government and tribes---varying in power according to contemporary political and economic events. With a mind to earlier historical and legal literatures, we also show that the 'broken' treaties do not form their own cluster in the data, and that the post-1871 'agreements' represent a straightforward continuation of earlier treaty policy in both style and substance.

The Applied Statistics Workshop meets each Wednesday at 12 noon in K-354 CGIS-Knafel (1737 Cambridge St). The workshop begins with a light lunch, and presentations usually start around 12:15 and last until about 1:30 pm.

Posted by Justin Grimmer at 11:14 AM

27 March 2009

Automated Text Analysis of Political Scientists

[Image: cover of the essay volume]

This is a volume of 100 essays by political scientists, each of roughly 1,000 words or fewer, concerning one novel or insufficiently appreciated idea in some area of the discipline (edited by me, Kay Schlozman, and Norman Nie, in honor of Sidney Verba). (I might have heard a rumor that if you buy two copies, your next article will be accepted on the first round and you'll get a great new job offer!) In any event, the last blurb above says nice things about the organization of the essays, which of course we appreciate, but I especially like that the essays, except for a little fine tuning, were ordered via automated text analysis (using an algorithm Justin Grimmer and I are working on). The number of possible orderings of 100 essays is enormous, of course -- a tad less than the number of elementary particles in the universe squared -- and the idea that a "mere" human being could choose an optimal ordering is absurd. We've become accustomed to the idea that computers can do arithmetic far faster than we can, but we also need to start getting used to the fact that (with some help from modern statistics) they can "read" better than we can too.
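For readers curious what "ordering essays by automated text analysis" might look like mechanically, here is a toy Python sketch. It is emphatically not the Grimmer-King algorithm (which the post does not describe); it just illustrates the general idea of chaining documents so each is followed by its most textually similar unused neighbor, a cheap heuristic for an ordering problem that is intractable to solve exactly.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_order(docs):
    """Greedily chain documents: start at doc 0, then repeatedly append
    the most similar document not yet used."""
    bags = [Counter(d.lower().split()) for d in docs]
    order = [0]
    remaining = set(range(1, len(docs)))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda i: cosine(bags[last], bags[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Made-up essay "texts"; similar topics should end up adjacent.
essays = [
    "voting behavior in american elections",
    "turnout and voting in elections",
    "game theory models of bargaining",
    "bargaining and negotiation models",
]
print(greedy_order(essays))
```

With 100 essays this greedy pass is only an approximation, which is presumably why some human "fine tuning" was still needed.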

Posted by Gary King at 8:42 AM

26 March 2009

Small samples across the country

A recent paper by Shetty, DeLeire, White, and Bhattacharya looked at the effect of workplace smoking bans across the country. This paper follows previous papers that looked at single smoking bans and attempted to identify the effect of smoking bans on health outcomes by comparing the area with the ban to a similar area without a ban. The particular contribution of this paper is that it compared all smoking bans across the country at once to conclude that smoking bans have considerably smaller effects on heart attacks and mortality than previous articles suggested.

The analysis used region (county) fixed effects to determine the effect of smoking bans within regions, so it is conceptually similar to the differences-in-differences analyses that previous articles had used to address this same issue. The authors took pains to ensure that the results are not an artifact of modeling assumptions by including time-varying covariates in the model. This is where one of my pet peeves shows up: the key assumption in fixed effects models is that there are no unobservable time-varying effects, and this assumption can be probed not just by incorporating additional time-varying covariates, but also by including linear time trends that vary by region. The region-specific linear time trend provides a test of the assumption that there are no time trends that vary by region, independently of the included covariates.
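To make the region-trend point concrete, here is a minimal sketch with simulated data and hand-rolled OLS (all numbers invented, and this is not the paper's specification). Region 1 adopts a "ban" mid-sample but also has a steeper underlying time trend; the plain fixed-effects estimate of the ban is badly biased, while adding a region-specific linear trend recovers the true effect.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Two-region panel: region 1 enacts a ban at t = 5 and also has its own
# steeper time trend (the violation the region trend is meant to catch).
rows, y = [], []
for r1 in (0, 1):
    for t in range(10):
        ban = 1.0 if (r1 and t >= 5) else 0.0
        rows.append((r1, t, ban))
        y.append(1.0 + 0.2 * t + 0.5 * r1 + 0.3 * t * r1 + 2.0 * ban)

X_fe    = [[1.0, t, r1, ban]          for r1, t, ban in rows]  # common trend only
X_trend = [[1.0, t, r1, ban, r1 * t]  for r1, t, ban in rows]  # + region trend

b_fe, b_trend = ols(X_fe, y), ols(X_trend, y)
# Ban coefficient without vs. with the region-specific trend (truth = 2.0):
print(round(b_fe[3], 3), round(b_trend[3], 3))
```

The comparison of the two "ban" coefficients is exactly the informal specification test described above.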

Peeves aside, one of the gems in this paper is that the authors took the time to simulate possible pairwise comparisons between regions (i.e., simulating the results of previous studies). These simulations indicate that one is just as likely to find the large health-improving effects seen in earlier studies as equally large health-harming effects, which raises the possibility that the papers that were published came from the extremes of the distribution of pairwise outcomes.
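A toy version of that simulation exercise, with invented numbers: when regional changes are pure noise and no ban does anything, pairwise "studies" still produce large apparent effects in both directions.

```python
import random
random.seed(0)

# 200 regions whose change in heart-attack rates is pure noise: no region
# actually enacted an effective ban. (All numbers here are made up.)
changes = [random.gauss(0, 1) for _ in range(200)]

# Each simulated "study" compares one pseudo-ban region to one control.
estimates = [changes[i] - changes[i + 1] for i in range(0, 200, 2)]

big_neg = sum(e < -1.5 for e in estimates)
big_pos = sum(e > 1.5 for e in estimates)
print(big_neg, big_pos)  # large spurious "effects" in both directions
```

If journals preferentially publish the left tail, the literature looks like strong evidence that bans work even when they do nothing.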

Posted by Martin Andersen at 9:25 PM

25 March 2009

How to teach methods

Over on the polmeth mailing list there is a small discussion brewing about how to teach undergraduate methods classes. Much of the discussion is on how to manage the balance between computation and statistics. A few posters are using R as their main data analysis tool, which provoked others to comment that this might push a class too far away from its original intent: to learn research methods (although one teacher of R indicated that a bigger problem was the relative inability to handle .zip files). This got me thinking about how research methods, computing and statistics fit into the current education framework.

As a gross and unfair generalization, much of college is about learning how to take a set of skills and use them to make effective and persuasive arguments. In a literature class, for instance, one might use the skills of reading and writing to critically engage a text. In mathematics, one might take the "skill" of logic and use it to derive a proof.

The issue with introductory methods classes is that many undergraduates come into school without a key skill: computing. It is becoming increasingly important to have proficient computing skills in order to make cogent arguments with data. I wonder if it is time to rethink how we teach computing at lower levels of education to adequately prepare students for the modern workplace. There is often emphasis on using computers to teach students, but I think it will become increasingly important to teach computers to students. This way courses on research methods can focus on how to combine computing and statistics in order to answer interesting questions. We could spend more time matching tools to questions and less time simply explaining the tool.

Of course, my argument reeks of passing the buck. A broader question is this: where do data analysis and computing fit in the education model? Is this a more fundamental skill that we should build up in children earlier? Is it perfectly fine where it is, being taught in college?

Posted by Matt Blackwell at 3:08 PM

19 March 2009

Writing Excel Tables, Figures and Graphs Directly from R

As those of us who have worked on empirical projects surely know, at times a frustratingly large amount of time can be spent packaging results into tables or figures for publication or review. Fortunately, a number of modules have been developed to facilitate this process. For example, "write.csv" can be used from within R to output a table directly into an Excel-readable format. Likewise, practitioners of Stata can use the package "xml_tab" to do the same with a bit more flexibility.

Recently I've been involved in a large-scale modeling effort that requires very detailed multi-worksheet Excel output that, depending on the task, includes a mix of tables, graphs and figures created in both R and Stata. Given the amount of modeling we're doing, creating this output manually every time would either take up 90% of my time or would require hiring an army of RAs whose sole task is creating these Excel files. So, while the above packages are no doubt helpful in specific contexts, we've had to scour what's out there and come up with our own way to produce the outputs most efficiently.

What follows is what (I think) is a neat way to automate outputs directly from R. Hopefully, readers of this blog can benefit from using this in their own research. Basically, one can use the "write" function in R to write a Perl file that is then fed into the terrific Spreadsheet::WriteExcel module. This gives one the flexibility to, among other things, output to separate worksheets, format tables (with merged cells, different column widths, cell borders, etc.), include figures, and create charts, all in the same Excel file.

The example below is fairly simple -- it outputs two generic tables into separate worksheets -- but gives a good sense of how the powers of R and WriteExcel can be harnessed to really speed up the research process. Also, I'd appreciate any other thoughts on this from folks who have done similar things!

output_excel.txt
Here is the final output.
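Separately from the author's attached example, the generate-a-script pattern can be sketched as follows. This is a hypothetical illustration in Python rather than R (the file name, worksheet name, and results table are all made up); only Spreadsheet::WriteExcel's basic new/add_worksheet/write calls are taken from that module's documented interface.

```python
# Sketch: generate a Perl script that uses Spreadsheet::WriteExcel to dump
# a results table into a worksheet. Python plays the role R plays in the
# post, purely for illustration.
results = [("model", "estimate"), ("ols", 0.42), ("matching", 0.40)]

lines = [
    "use strict;",
    "use Spreadsheet::WriteExcel;",
    "my $wb = Spreadsheet::WriteExcel->new('results.xls');",
    "my $ws = $wb->add_worksheet('estimates');",
]
for i, row in enumerate(results):
    for j, val in enumerate(row):
        v = f"'{val}'" if isinstance(val, str) else str(val)
        lines.append(f"$ws->write({i}, {j}, {v});")
lines.append("$wb->close();")

with open("make_xls.pl", "w") as f:
    f.write("\n".join(lines) + "\n")
# Then run: perl make_xls.pl  (requires Spreadsheet::WriteExcel from CPAN)
```

The payoff is that every re-run of the models regenerates the script, and therefore the workbook, with no hand editing.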

Posted by John Graves at 12:06 PM

18 March 2009

Breastfeeding Research and Intention to Treat

The current issue of The Atlantic features an interesting story by Hanna Rosin called "The Case Against Breastfeeding." Rosin argues that the health benefits of breastfeeding have been overstated by advocates and professional associations and that, given the costs (in mothers' time and independence), mothers should not be made to feel guilty if they decide not to breastfeed for the full recommended period. One of her key points is that observational studies overstate the benefits of breastfeeding by failing to adequately adjust for background differences between mothers who breastfeed and those who don't. Observable differences, reports Rosin, are considerable: breastfeeding is more common among women who are "white, older, and educated; a woman who attended college, for instance, is roughly twice as likely to nurse for six months." In the course of making her argument Rosin provides a very nice layman's treatment of the difficulties of learning from observational studies; I think the article could be useful in teaching statistical concepts to a non-technical audience, although the politics of the issue might overwhelm the statistical content.

I followed up a bit on one experimental study she mentions in which researchers implemented an encouragement design in Belarus: new mothers in randomly selected clinics who were already breastfeeding were exposed to an intervention strongly encouraging them to nurse exclusively for several months, and the health outcomes of those babies as well as babies in non-selected clinics were tracked for several years. Rosin reports that this study found an effect of breastfeeding on gastrointestinal infection and infant rashes (and possibly IQ), but no effect on a host of other outcomes (weight, blood pressure, ear infections, or allergies).

I read what appears to be the first paper from the study (published in 2001 in JAMA), which reported that the intervention reduced GI infection and rashes. One thing that surprised me was that all health effects were reported in terms of "intention-to-treat," i.e., a raw comparison of outcomes in the treatment group and the control group, irrespective of whether the mother actually breastfed. The intervention increased the proportion of mothers doing any breastfeeding at 6 months from .36 to .5, so we know that whatever effects are found via ITT understate the impact of breastfeeding itself (because they measure the impact of being assigned to treatment, which changes breastfeeding status only for some mothers). (The authors know this too, and they raise the point in a rejoinder.)

The standard approach I learned is to estimate a "complier average treatment effect" by essentially dividing the ITT by the effect of treatment assignment on treatment status, but the study does not appear to do this. (The CATE for GI infection, according to my back-of-the-envelope calculation and assuming "no defiers," is about -.3, i.e., about a 30% decrease in the probability of infection for mothers who were induced to breastfeed by the intervention.) I suppose focusing on ITTs could be common in epidemiology because the ITT addresses the policymaker's question of whether it's worth implementing a similar program, assuming compliance rates would be similar. But for a mother thinking about what to do, the CATE gives much better information about whether or not to breastfeed.
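The arithmetic of that back-of-the-envelope calculation can be made explicit. The compliance rates (.36 rising to .50) are from the post; the ITT effect on infection below is a hypothetical stand-in chosen only so the result matches the "about -.3" figure, since the post does not reproduce the paper's raw numbers.

```python
# Wald / CATE estimator: ITT divided by the effect of assignment on
# treatment take-up (valid under "no defiers").
itt = -0.041                      # hypothetical ITT difference in infection prob.
p_treat, p_control = 0.50, 0.36   # share breastfeeding at 6 months, by arm (from post)
cate = itt / (p_treat - p_control)
print(round(cate, 2))             # effect among compliers
```

The denominator (.14) is small, which is why the CATE is several times larger in magnitude than the ITT.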

Posted by Andy Eggers at 7:44 AM

16 March 2009

Lenz on "Getting Richer in Office? Corruption and Wealth Accumulation in Congress"

Please join us this Wednesday when Gabriel Lenz, MIT Department of Political Science, will present "Getting Rich(er) in Office? Corruption and Wealth Accumulation in Congress", work that is joint with Kevin Lim. Gabe provided the following abstract:


How corrupt is Congress? We provide an indirect test by comparing wealth accumulation from 1995 to 2005 among members of the U.S. House of Representatives and members of the public. Data on representatives are from Personal Financial Disclosure forms and data on the public are from the Panel Study of Income Dynamics (PSID). To test whether representatives accumulate wealth at a faster rate than expected, we construct counterfactuals based on the PSID with two approaches. We first use statistical models, conditioning on asset distribution over stocks, bonds, businesses, and land, as well as demographic variables. These models find representatives accumulating wealth about 20 percent faster than expected. Second, we employ matching. Unlike the modeling approach, matching finds an almost identical rate of wealth accumulation among both groups. Further analysis reveals that matching reduces bias from several incorrect functional form assumptions in the statistical models. We thus conclude that representatives report accumulating wealth at a rate consistent with similar non-representatives, suggesting no aggregate corruption. Besides examining overall wealth accumulation, we also test for effects of committee assignments, safe seats, career trajectories, and campaign contributions.

The Applied Statistics Workshop meets each Wednesday at 12 noon in K-354 CGIS-Knafel (1737 Cambridge St). The workshop begins with a light lunch, and presentations usually start around 12:15 and last until about 1:30 pm.

Posted by Justin Grimmer at 3:34 PM

13 March 2009

English First Names for Chinese Americans

This entry uses people data from ZabaSearch to show which English first names are most popular among Chinese Americans.

When I worked at Google, I once did an employee search on "Vivian" and 26 of the 30 results were Chinese. This post examines this phenomenon a bit more scientifically, with two goals:


  • Find the most common English first names for Chinese last names (P(Name n | Chinese)).

  • Find the English first names that are differentially expressed -- that is, which are much more popular among Chinese Americans than among the general American public (i.e., P(Chinese | Name n)).
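The two quantities are linked by Bayes' rule: P(Chinese | name) = P(name | Chinese) P(Chinese) / P(name), so a name can be common among Chinese Americans without being distinctively Chinese if it is also common overall. A toy calculation, using the "David" frequencies reported below and a purely hypothetical population share:

```python
# Bayes-rule flip between the two quantities of interest.
p_chinese = 0.013                # hypothetical share of the population
p_name_given_chinese = 0.0511    # "David" among Chinese Americans (from post)
p_name = 0.0286                  # "David" in the general population (from post)

p_chinese_given_name = p_name_given_chinese * p_chinese / p_name
print(round(p_chinese_given_name, 3))
```

This is why the post needs a second analysis: the ranking by P(name | Chinese) and the ranking by P(Chinese | name) pick out different names.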

The ideal approach would be to download a phone book and tally the first names for Chinese last names. While there's nowhere to download a phone book, there are several searchable people databases online. The largest and most famous free option is ZabaSearch.

The first step: get a list of Chinese last names to search for. I used the 100 most common last names in both their Pinyin and Wade-Giles variants, for a total of 128 unique last names. With a script, I searched for each on ZabaSearch. Sadly, Zaba won't show you the results if there are over 1,000 -- it just says "1000's of CHIN's found!" If you search across the entire U.S., this happens for too many of the names, so I limited my searches to Boston. The 128 last names yielded 22,483 unique people (after also de-duping by address).

Among Chinese in Boston, the most common three first names are Wei (1.34%), Hong (0.916%) and Hui (0.836%). Only about 25% -- 5,949 of 22,483 -- of the first names are English.

Of the Chinese population with English first names, the most popular three male and female names are shown below. For the American public, these are downloaded from the latest census report. (Note: To standardize the population sizes, I limited both populations to those with first names in the top 500.)

Males:

Name     Rank across America   Rank among Chinese   Frequency across America   Frequency among Chinese
DAVID    6                     1                    0.0286                     0.0511
JOHN     2                     2                    0.0396                     0.0378
JAMES    1                     3                    0.0402                     0.0369

Females:

Name       Rank across America   Rank among Chinese   Frequency across America   Frequency among Chinese
JENNIFER   6                     1                    0.01300                    0.0311
AMY        32                    2                    0.00631                    0.0303
ANGELA     29                    3                    0.00655                    0.0178

The three most popular Chinese male first names are also very popular in America as a whole. A more interesting question is the one about P(Chinese | Name n) -- which English first names are much more common among Chinese Americans than among all Americans? To answer that, I conducted a binomial proportion test and sorted the results by p-value, identifying the most extreme differences. The top 10 male and female differences are given below.

Some of the top results are nicknames -- Chinese Americans are much more likely to pick "Andy" or "Jenny" as a legal name, while Americans in general are formally named with the longer versions.

The other names on the list are more interesting. For males, "Andrew," "Eric," "Peter" and "Albert" are much more common among Chinese than among Americans. For females, it's "Amy," "Grace," "May" and, yes, "Vivian." By comparing the frequencies, you can see that these names are all over five times more popular among Chinese Americans!

I'll leave interpretation to the sociologists.
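The binomial proportion test mentioned above can be sketched as a two-sample z-test with a normal approximation. The counts below are hypothetical, back-solved from the frequencies reported in the tables (the post does not give raw counts for the general-population side).

```python
import math

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided two-sample binomial proportion z-test (normal approx.)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))        # 2 * P(Z > |z|)

# Hypothetical counts: ~167 "Andrew"s among the 5,949 English first names
# in the Chinese-American sample, vs. a general rate of about 0.651%.
print(two_prop_pvalue(167, 5949, 6510, 1_000_000))
```

Sorting names by this p-value, as the post does, surfaces the proportions that differ most sharply relative to their sampling noise.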

Males:

Name     Frequency across America   Frequency among Chinese   p-value
ANDREW   0.006510                   0.02810                   2.2e-30
ANDY     0.000594                   0.00937                   2.0e-26
DAN      0.001220                   0.01120                   3.6e-23
PETER    0.004620                   0.01990                   5.3e-22
ALBERT   0.003810                   0.01570                   7.0e-17
ERIC     0.006590                   0.02120                   1.5e-16
ALAN     0.002470                   0.01150                   2.9e-14
SAM      0.001110                   0.00786                   3.7e-14
ALEX     0.001390                   0.00846                   1.4e-13
DAVID    0.028600                   0.05110                   1.7e-12

Females:

Name       Frequency across America   Frequency among Chinese   p-value
AMY        0.006310                   0.03030                   2.5e-29
JENNY      0.000951                   0.01330                   6.9e-28
GRACE      0.002640                   0.01670                   4.3e-21
MAY        0.000406                   0.00644                   3.1e-15
VIVIAN     0.001650                   0.01100                   5.2e-15
ALICE      0.004990                   0.01780                   3.5e-13
JENNIFER   0.013000                   0.03110                   2.7e-12
CECILIA    0.000769                   0.00644                   6.8e-11
JANE       0.003500                   0.01290                   2.7e-10
CINDY      0.002690                   0.01060                   2.2e-09

Posted by Kevin Bartz at 5:46 PM

11 March 2009

Differences-in-Differences in the Rubin Causal Model

At today's Applied Statistics Workshop, Dan Hopkins gave a talk on contextual effects on political views in the United States and United Kingdom. Dan presented evidence that national political discussions increase the salience of local context for opinion formation. Namely, those who live in areas of high immigrant populations tend to react more strongly to changes in the national discussion of immigration than others. The data and analysis are interesting, but the talk's derailment interested me slightly more.

The derailment involved Dan's choice of method, a version of the difference-in-differences (DID) estimator, and how to represent it in the Rubin Causal Model. Putting this model in terms of the usual counterfactual framework is slightly nuanced, but not impossible.

The typical setup for a DID estimator is that there are two groups G = {0,1} and two time periods T = {0,1}. Between time 0 and time 1, some policy is applied to group 1 and not applied to group 0. What we are interested in is the effect of that policy. For instance, if Y is the observed outcome in time 1, and Y(1) and Y(0) are the potential outcomes (in time 1) in the counterfactual worlds where we forced the policy to be implemented or withheld, then we can define a possible quantity of interest, the average treatment effect on the treated (ATT): E[Y(1) - Y(0) | G = 1].

We could proceed from here by simply making an ignorability assumption about the treatment assignment. Unfortunately, policies are often not randomly assigned to the groups, and the groups may differ in ways that affect the outcome. For instance, an example from the Wooldridge textbook is the effect of the placement of a trash processing facility on house prices. The two groups in this case are "houses close to the facility" and "houses far from the facility" and the policy is the facility's placement. It would be borderline insane to imagine city planners randomly assigning the location of the facility, and these two groups will differ in ways that are very related to house prices (I don't think I have seen too many newly minted trash dumps in rich neighborhoods). Thus, we cannot simply use the observed data from the control group to make the counterfactual inference.

What we can do, however, is look at how changes in the dependent variable occur for the two groups and use these changes to identify the model. For instance, if we let X be the outcome in period 0, with potential outcomes X(0) and X(1) defined analogously, then the DID identifying assumption is

E[Y(0) - X(0) | G = 1] = E[Y(0) - X(0) | G = 0],

which is simply saying that the change in potential outcomes under control is the same for both groups. Or, that group 1 would have followed the same "path" as group 0 if they had not received treatment. With this assumption in hand, we can identify the ATT as the typical DID estimator

E[Y(1) - Y(0) | G =1] = (E[Y|G=1] - E[X|G=1]) - (E[Y|G=0] - E[X|G=0]).

The proof is short and can be found in Abadie (2005) and Athey & Imbens (2006) (these papers also go into considerable depth well beyond this simple scheme).
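The estimator is just arithmetic on four group means. A toy Python example in the spirit of the trash-facility story (all prices made up):

```python
# Difference-in-differences on group means: change for the treated group
# minus change for the control group.
near = {"before": [200, 210, 190], "after": [195, 205, 185]}   # G = 1
far  = {"before": [300, 310, 290], "after": [315, 325, 305]}   # G = 0

def mean(v):
    return sum(v) / len(v)

change_treated = mean(near["after"]) - mean(near["before"])    # E[Y|G=1] - E[X|G=1]
change_control = mean(far["after"]) - mean(far["before"])      # E[Y|G=0] - E[X|G=0]
att = change_treated - change_control
print(att)  # prices near the facility fell relative to the control trend
```

The control-group change stands in for the "path" the treated group would have followed without the policy, which is exactly what the identifying assumption licenses.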

Two issues always arise for me when I see DID estimators. First is the incredibly difficult task of arguing that the policy is the only thing that changed between time 0 and time 1 with respect to the two groups. Perhaps, for instance, the city also placed a freeway through the part of town where the trash processing facility was built at the same time; the DID estimator would not be able to separate the two effects. Thus, it is up to the practitioner to argue that all other changes in the period are orthogonal to the two groups. Second, I have very little insight into how identification or estimands change as we move from a simple non-parametric world to a highly parametric world (where most applied researchers live). How, if at all, do inferences change when we move away from simple conditional expectations?

Posted by Matt Blackwell at 2:13 PM

9 March 2009

Hopkins on "Making Credible Inferences about the Effects of Local Contexts"

Please join us this Wednesday when Dan Hopkins, Post-Doctoral Fellow at Harvard University (and soon to be Assistant Professor at Georgetown), will present "Making Credible Inferences about the Effects of Local Contexts". Dan provided the following abstract for his presentation:

In the last decade, there has been an explosion of social science research exploring the influence of local contexts on attitudes and behavior. Yet such studies face methodological hurdles, including the endogeneity of individuals' moving decisions, significant measurement error, and ambiguity about their causal interpretation. This presentation reconceptualizes the effects of local contexts as an interaction between the local context and salient national issues. It then uses panel or time-series cross-sectional data to explore the impact of exogenous changes in the salience of national issues on local contextual effects. Across three empirical examples on attitudes toward immigration drawn from two countries, we observe that local contexts only correlate with attitudes when immigration is a nationally salient issue. The effects of local contexts vary in predictable ways with the topics of national politics. All politics might not be local after all.

Dan provided this paper as background for his talk.

The Applied Statistics Workshop meets each Wednesday in room K354, 1737 Cambridge St (CGIS-Knafel). A light lunch is served at 12 noon, presentations usually begin at 12:15 pm, and the workshop usually concludes by 1:30 pm. All are welcome!

Posted by Justin Grimmer at 8:23 PM

7 March 2009

How to Take Log of Zero Income

I encountered a problem when using a lognormal distribution to model the income distribution. Namely, there are a bunch of people in my dataset who report zero income, perhaps due to unemployment, and I am wondering how to take the log of the zero incomes. I notice some researchers just drop the observations with zero income, while others assign a small amount of income to them so that the logarithm can be taken legitimately. Obviously, we can try both ways and see how the results stand. But I am wondering if there are experts on this topic who can clarify the pros and cons of these and other approaches to treating zero incomes.
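A small sketch of the options usually on the table (incomes made up). Beyond dropping zeros or shifting by a constant, the inverse hyperbolic sine is one further transform some researchers use, since it behaves like the log for large values but is defined at zero.

```python
import math

incomes = [0, 0, 15_000, 32_000, 64_000, 128_000]   # made-up sample

# Option 1: drop zeros (changes the population being modeled).
logs_drop = [math.log(y) for y in incomes if y > 0]

# Option 2: add a small constant (results can be sensitive to its size).
logs_shift = [math.log(y + 1) for y in incomes]

# Option 3: inverse hyperbolic sine, asinh(y) = log(y + sqrt(y^2 + 1)),
# which is approximately log(2y) for large y and exactly 0 at y = 0.
logs_asinh = [math.asinh(y) for y in incomes]

print(len(logs_drop), logs_shift[0], round(logs_asinh[-1], 2))
```

None of these settles the substantive question, which is whether the zeros come from the same process as the positive incomes (a point-mass mixture, as suggested below) or are just censored small values.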

A related question is what model you think fits the income distribution best: a lognormal, a power-law distribution, a mixture of a normal and a point mass at zero, and so on. I look forward to your thoughts on these questions.

Lastly, here is an interesting animation of the income distribution in the USA.

Posted by Weihua An at 6:07 PM

6 March 2009

xkcd on Correlation and Causation

[xkcd comic on correlation and causation]

Posted by Andy Eggers at 1:09 PM

4 March 2009

Follow-up on Robins' Talk ("A Bold Vision of Artificial Intelligence and Philosophy")

A few blog readers asked for more information about Jamie Robins' talk today and the "pinch of magic and miracle" he promised in the abstract. I wanted to offer my non-expert report on the presentation, particularly because Jamie and his coauthors don't yet have a paper to circulate.

Jamie organized the talk around a research scenario in which five variables are measured in trillions of independent experiments and the task is to uncover the causal process relating the variables. (His example involved gene expression.) He led us through an algorithm that he claimed could accomplish this feat (in some circumstances) with no outside, substantive input. The algorithm involves looking for conditional independencies in the data, not just in its original form but also under various transformations in which one or more independencies are induced by inverse probability weighting and we check whether others exist. For some data generating processes, this algorithm will hit on conditional independencies such that (under a key assumption, which he was coy about until the end of the talk) the causal model will be revealed -- the ordering and all of the effect sizes.
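As a toy illustration of the reweighting step (my reconstruction of the general idea, not the algorithm from the talk): weighting each observation by the inverse probability of its observed treatment given a covariate produces a pseudo-population in which the two are independent, and one can then check which other independencies hold in that "star world."

```python
import random
random.seed(1)

# A depends strongly on L; weight each unit by 1 / P(A = a | L) and the
# dependence disappears in the weighted (pseudo-)population.
n = 100_000
data = []
for _ in range(n):
    L = random.random() < 0.5
    pA = 0.8 if L else 0.2            # true propensity, known here by construction
    A = random.random() < pA
    w = 1 / (pA if A else 1 - pA)     # inverse probability of the observed A
    data.append((L, A, w))

def wmean(f):
    """Weighted mean of f(L, A) over the sample."""
    tot = sum(w for _, _, w in data)
    return sum(f(L, A) * w for L, A, w in data) / tot

# Weighted P(A = 1 | L = 1) and P(A = 1 | L = 0): both should be near 0.5.
pa_l1 = wmean(lambda L, A: A and L) / wmean(lambda L, A: L)
pa_l0 = wmean(lambda L, A: A and not L) / wmean(lambda L, A: not L)
print(round(pa_l1, 2), round(pa_l0, 2))
```

In the talk's setting the propensities would have to be estimated, and the interesting part is what the induced independencies reveal about the causal structure, not the reweighting itself.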

The key assumption is "faithfulness," which states that when two variables are found to be conditionally independent in the data, we can conclude that there is no causal arrow between them (i.e. we can rule out that there is an arrow between them that is perfectly offset by other effects). Without that assumption we can't infer the causal model from a joint density, but with it we can -- and the point of Jamie's talk was that, in the "star worlds" in which independencies have been induced by reweighting, even more information can be gleaned from the joint density than has been recognized.

All of this may seem surprising to people who have followed the debates over causal modeling and "causal discovery," much of which has centered around the work of Spirtes, Glymour, and Scheines. In these debates, Jamie has been (by his own admission) a consistent critic of the faithfulness assumption and has insisted that substantive knowledge, not conditional independence in sampled data, is the way to draw causal models. Rest assured, he has not changed his position. (I think he described the embrace of the faithfulness assumption by mainstream statistics as "probably insane" at one point in the talk.) The point of the talk was not to defend faithfulness, but rather to show that it implies a lot more than was realized by researchers who currently employ it to uncover causal structure from joint densities.

Anyone else who wants to fill in or correct my account, please chime in.

Posted by Andy Eggers at 10:19 PM

2 March 2009

Jamie Robins on "A Bold Vision of Artificial Intelligence and Philosophy: Finding Causal Effects Without Background Knowledge or Statistical Independences"

Please join us this Wednesday, March 4th, when Jamie Robins will present "A Bold Vision of Artificial Intelligence and Philosophy: Finding Causal Effects Without Background Knowledge or Statistical Independences", a project that is joint with Thomas Richardson, Ilya Shpitser, and Steffen Lauritzen. Jamie provided the following abstract:

I describe a statistical methodology based on philosophy, causal directed acyclic graphs, and a pinch of magic and miracle that holds the promise of making a silk purse of causal knowledge out of the sow's ear of an observational data set with no obvious structure. In 10 years or so, for better or worse, this methodology may become part of mainstream genomics.

The workshop will meet at 12 noon in room K-354, CGIS-Knafel (1737 Cambridge St), with a light lunch served. The presentation will begin at 12:15 and usually ends around 1:30 pm. All are welcome!

Posted by Justin Grimmer at 6:56 PM