December 2005

21 December 2005

End-of-Year Hiatus

Jim Greiner

With universities out of session and many students away from their offices, the Social Science Statistics Blog will reduce the frequency of its postings. We will resume our at-least-one-per-day schedule in early January. Until then, check back periodically for the occasional entry.

Happy New Year!

Posted by James Greiner at 4:36 AM

20 December 2005

BUCLD: Statistical Learning in Language Development

Amy Perfors

The annual Boston University Conference on Language Development (BUCLD), held this year on November 4-6, consistently offers a glimpse into the state of the art in language development. The highlight for me this year was a lunchtime symposium titled "Statistical learning in language development: what is it, what is its potential, and what are its limitations?" It featured a dialogue among three of the biggest names in this area: Jeff Elman at UCSD, who studies connectionist models of many aspects of language development; Mark Johnson at Brown, a computational linguist who applies insights from machine learning and Bayesian reasoning to the study of human language understanding; and Lou-Ann Gerken at the University of Arizona, who studies infants' sensitivity to statistical aspects of linguistic structure.

I was most interested in the dialogue between Elman and Johnson. Elman focused on a number of phenomena in language acquisition that connectionist models capture. One of them is "the importance of starting small": the idea that beginning with limited capacities of memory and perception may actually help in learning ultimately very complex things, because it "forces" the learning mechanism to notice the broad, consistent generalizations first rather than being led astray too soon by local ambiguities and complications. Johnson seconded that argument and pointed out that models that learn using Expectation Maximization (EM) embody this just as well as neural networks do. Another key insight of Johnson's was that statistical models implicitly extract more information from the input than purely logical or rule-based models do. Because statistical models generally assume some underlying distributional form, the absence of data predicted by that distribution is itself a valuable form of negative evidence. And because there are a number of areas in which people appear to receive little explicit negative evidence, they must either rely on statistical assumptions of this kind or be innately biased toward the "right" answer.
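
As a concrete (if toy) illustration of what "a model that learns using EM" means, here is a minimal sketch in R of EM fitting a two-component Gaussian mixture to made-up data; the language-acquisition models discussed at the symposium are of course far richer.

  # Minimal EM for a two-component Gaussian mixture (illustrative toy, not a language model).
  set.seed(1)
  x  <- c(rnorm(200, mean = -2), rnorm(200, mean = 2))   # made-up data
  mu <- c(-1, 1); sigma <- c(1, 1); p1 <- 0.5             # crude starting values

  for (iter in 1:50) {
    # E-step: responsibility of component 1 for each observation
    d1 <- p1 * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p1) * dnorm(x, mu[2], sigma[2])
    r  <- d1 / (d1 + d2)
    # M-step: update the mixing weight, means, and standard deviations
    p1       <- mean(r)
    mu[1]    <- sum(r * x) / sum(r)
    mu[2]    <- sum((1 - r) * x) / sum(1 - r)
    sigma[1] <- sqrt(sum(r * (x - mu[1])^2) / sum(r))
    sigma[2] <- sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r))
  }
  round(c(weight1 = p1, mu1 = mu[1], mu2 = mu[2]), 2)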

The most valuable aspect of the symposium, however, was its clarification of many of the questions, in language development and in cognitive science generally, that statistical learning can help to answer. Some of these important questions: In any given problem, what are the units of generalization that human learners (and hence our models) should and do use (e.g., sentences, bigram frequencies, words, part-of-speech frequencies, phoneme transitions)? What is the range of computations the human brain is capable of (possibly changing at different stages of development)? What statistical and computational models capture these? What is the nature of the input (the data) that human learners see, and to what extent does this depend on factors external to them (the world) as opposed to internal factors (attentional biases, mental capacities, etc.)?

If we can answer these questions, we will have answered a great many of the difficult questions in cognitive science. If we can't, I'd be very surprised if we make much real progress on them.

Posted by Amy Perfors at 2:09 AM

19 December 2005

Beyond Standard Errors, Part II: What Makes an Inference Prone to Survive Rosenbaum-Type Sensitivity Tests?

Jens Hainmueller

Continuing from my previous post on this subject, sensitivity tests are still somewhat rarely (though increasingly) used in applied research. This is unfortunate, I think, because, at least in my own tests on several datasets, observational studies vary considerably in their sensitivity to hidden bias. Some results go away once you allow for only a tiny amount of hidden bias; others are rock solid, weathering even very strong hidden bias. I think one should always give the reader this information.

One (and maybe not the most important) reason why these tests are infrequently used is that they take time and effort to compute. So I was thinking: instead of computing the sensitivity tests each time, maybe it would be good to have some quick rules of thumb for judging whether a study is insensitive to hidden bias.

Imagine you have two studies with identical estimated effect sizes and standard errors. Which one would you trust more to be insensitive to hidden bias? In other words, are there particular features of the data that make an inference drawn from them more likely to excel on Rosenbaum-type sensitivity tests? The literature I have read thus far provides little guidance on this issue.

We have a few ideas about this (which are still underdeveloped). For example, ceteris paribus, one might think it is better to have a rather imbalanced vector of treatment assignments (only a few treated or only a few control units). Another idea is that, ceteris paribus, inferences obtained from a smaller (matched) dataset should be less prone to being knocked over in sensitivity tests. Or, in the case of propensity score methods, one would like covariates that strongly predict treatment assignment, so that an omitted variable cannot tweak the results much.
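
For readers who have not tried these tests, here is a rough sketch in R, with made-up matched-pair data, of the simplest Rosenbaum-type calculation: bounding the one-sided p-value of a sign (McNemar) test on a binary outcome as the hidden-bias parameter Gamma grows. This is only the textbook binary-outcome case, not the full machinery.

  # Rosenbaum-style sensitivity bound for a sign (McNemar) test on matched pairs.
  # 'treated' and 'control' are made-up 0/1 outcomes, one entry per matched pair.
  sens_bound <- function(treated, control, gammas = c(1, 1.5, 2, 3)) {
    n_disc <- sum(treated != control)            # discordant pairs drive the test
    n_pos  <- sum(treated == 1 & control == 0)   # pairs in which the treated unit did better
    sapply(gammas, function(g) {
      p_plus <- g / (1 + g)                      # worst-case chance a discordant pair favors treatment
      pbinom(n_pos - 1, n_disc, p_plus, lower.tail = FALSE)   # upper bound on the one-sided p-value
    })
  }

  set.seed(2)
  treated <- rbinom(100, 1, 0.55); control <- rbinom(100, 1, 0.40)
  round(sens_bound(treated, control), 3)

At Gamma = 1 the bound is just the usual exact p-value; the Gamma at which the bound first exceeds 0.05 is the headline sensitivity number.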

This is very much still work in progress; comments and feedback are highly appreciated.

Posted by James Greiner at 6:06 AM

16 December 2005

Redistricting and Electoral Competition: Part II

John Friedman and guest blogger Richard Holden

Yesterday, we blogged about whether gerrymandering or something else is the principal cause of low turnover in the House of Representatives and other elected bodies. We continue that discussion today.

How can we determine whether gerrymandering is the culprit, given that any number of reasons could account for the increase in the incumbent reelection rate? The key is that redistricting usually happens only once each decade (at least until the recent controversies in Texas). Other factors, such as money or electoral polarization, tend to change more smoothly over time. One can tease these factors apart with a "regression discontinuity" approach, separating the time series into 1) a smooth function and 2) jumps at the time of gerrymandering.

In a recent paper (available here), we find that redistricting has actually slightly reduced incumbent reelection rates over time. We also look to see whether there are systematic differences between "bipartisan" gerrymanders, designed to protect incumbents from both parties, and "partisan" gerrymanders, in which one party attempts to leverage its support into more representation in the state's Congressional delegation. There is no evidence that the incumbent reelection rate responds differently after either form of redistricting.

This research suggests that factors other than redistricting are the more important culprits in today's lack of electoral competition. In some sense, this isn't all that surprising. While the technology available has become more advanced, so have the constraints on gerrymanderers. Supreme Court decisions interpreting the 14th amendment and the Voting Rights Act have consistently narrowed the bounds within which redistricting must occur.

There may, of course, be other reasons to support independent commissions. For instance, they tend to create more geographically compact districts. Neutral bodies also help to avoid the most extreme cases of partisan gerrymandering, in which the neighborhood of an incumbent is grouped with distant voters in a tortuously shaped district. Perhaps most importantly, independent commissions may be able to ensure minority representation - though the Voting Rights Act also plays a fundamental role in this area.

The basic premise of supporters of non-partisan commissions - that political competition is important - is a sound one. But the evidence suggests that these advocates are focused in the wrong place. The redistricting process is far from the only cause of limited competition.

To increase competition in elections for Congress and state legislatures, we must pay more attention to other potential causes of the increase in the incumbent reelection rate. We must better understand how factors such as money, television, and candidate quality impact elections. But if we can direct towards these aspects of democracy the same spirit of reform that now supports the drive towards independent redistricting commissions, new and more promising solutions can't be far away.

Posted by James Greiner at 2:53 AM

15 December 2005

Redistricting and Electoral Competition: Part I

John Friedman and guest blogger Richard Holden

On Election Day, 2005, more than 48 million people in three states voted on whether non-partisan commissions, rather than elected state politicians, should conduct legislative redistricting. Though these initiatives were defeated, the popular movement towards non-partisan redistricting is gaining strength. Activists point to the systems in Iowa and Arizona - currently the only states without serious legislative involvement - as the future of redistricting in this country.

The non-partisan commission ballot initiatives - Proposition 77 in California and Issue Four in Ohio - were major policy items in their respective states. (The initiative in Florida was citizen-sponsored and attracted less attention.) California Governor Arnold Schwarzenegger placed the issue at the heart of his reform plan, commenting that "Nothing, absolutely nothing, is more important than the principle of 'one person - one vote.'" These measures also received broad bipartisan support from politicians, organized interest groups, and grassroots organizations. Though partisan political interest has played a role, many of these groups support the proposed move to independent commissions out of a sincere desire to increase competitiveness in the political system. Unfortunately, the latest academic research suggests that this well-intentioned effort is misplaced: gerrymandering has not caused the trend toward ever-lower legislative turnover in Congress.

Proponents of independent commissions argue that redistricting by politicians has led to a vast rise in incumbent reelection rates. For instance, members of the US House of Representatives are now reelected at a staggering 98% rate; prior to World War II, that rate hovered around 85%. Many in favor of independent commissions argue that new technologies available to redistricters, such as sophisticated map-drawing software, have allowed bipartisan gerrymandering: incumbents band together to protect each other's electoral prospects, creating impregnable districts packed with supporters. As Bob Stern of the non-partisan Center for Governmental Studies has said, "This lack of competition is due significantly to the legislature's decision to redraw electoral districts to protect party boundaries."

There are, however, a number of other factors that might explain the increase in incumbent reelection rates. For instance, there is a lot more money in politics than in the past. Incumbents, who usually have greater fund raising ability, raise large war chests for their campaigns. A more polarized electorate can also increase incumbent reelection rates because there are fewer swing voters for a potential challenger to persuade. Growing media penetration in the post-war period provides incumbents with free advertising, further increasing their prospects. All of these effects are magnified when more qualified challengers choose not to run against incumbents benefiting from these factors.

Tomorrow, we'll continue with our discussion of alternative explanations of low electoral turnover, plus a little about what we might do about it.

Posted by John N. Friedman at 2:25 AM

14 December 2005

Consumer Demand for Labor Standards, Part III

Michael Hiscox and Nicholas Smyth, guest bloggers

Continuing our discussion begun yesterday and the day before on labor standards labeling, perhaps the most important comments we received at the workshop had to do with how we might design our next set of experiments. It is very difficult to do anything fancy when it comes to in-store experiments. It could never be practical (and ABC would never give permission) for us to randomize treatments to individual items or brands on, say, a daily basis. The manner in which products are displayed (grouped by brand), the costs associated with altering labels and prices, and the potential problems for sales staff (and the potential complaints from frequent customers) impose severe constraints. Several workshop participants suggested that we conduct the next set of experiments through an online retailer. That way we might be able to randomly assign labels (and prices) to customers when they view product information on a website and decide whether or not to make a purchase. There would still be plenty of difficulties to iron out, as was quickly noted (e.g., making allowances for customers who attempt to return to the same product page at a later point in time, and for customers who "comparative shop" for products at multiple retailers). But this seems like the way to proceed in the future.
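
One practical wrinkle noted above is handling customers who return to the same product page, which suggests fixing assignments at the customer level rather than per page view. A purely hypothetical sketch in R of that kind of assignment (no retailer, product list, or price schedule here reflects an actual plan):

  # Hypothetical customer-level assignment for an online labeling experiment.
  set.seed(4)
  n_customers <- 1000
  assignment <- data.frame(
    customer_id = 1:n_customers,
    label = sample(c("labeled", "unlabeled"), n_customers, replace = TRUE),
    price = sample(c("baseline", "plus10pct", "plus20pct"), n_customers, replace = TRUE)
  )
  # Keying the assignment to a persistent customer id (rather than to each page view)
  # means a returning customer sees the same label and price as before.
  head(assignment)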

On a related theme, we noted that Ezaria.com, an online retailer run by Harvard students, is already planning to track a variety of economic data on its customers. Ezaria has a mission which involves providing markets for independent artisans from the developing world and donating 25% of profits to charity. At a minimum, looking at data on whether a customer is more likely to make a purchase after being shown the company's "mission" page (that explains their policies) would provide some measure of consumer demand for companies that source from high-standard producers. Perhaps we can persuade Ezaria to cooperate with us in a future experimental project. Or perhaps we can arrange the experiment with an even larger online retailer, with customers who are not so obviously self-selected as socially conscious.

Posted by James Greiner at 3:49 AM

13 December 2005

Consumer Demand for Labor Standards, Part II

Michael Hiscox and Nicholas Smyth, guest bloggers

We continue yesterday's entry discussing questions that arose during our recent presentation of our paper on consumer demand and labor standards labeling.

Another excellent question that was raised in the discussions concerned the evidence that sales of our labeled items actually rose (relative to sales of unlabeled control products) when their prices were raised. We have been interpreting this as evidence that consumers regarded the label as more credible when the product was more expensive relative to alternatives, since they expect to pay more for higher labor standards. One question was whether relative sales would have risen with price increases for any good (labeled or unlabeled) just because higher prices can signal better quality. Since we did not raise the price of unlabeled items, we cannot address this concern directly. It is not critical to one of our main findings: sales of labeled items increased markedly relative to sales of unlabeled alternatives when the labels were put in place (before prices were adjusted). But we will try to track down the research on the price-quality issue in the literature on consumer psychology. Our basic assumption is that the existing (equilibrium) product prices and sales levels at ABC (in the "baseline" period) accurately reflected the relative quality of treatment and control products.

Other questions raised concerned the evidence we discussed in the paper about the marked increase in sales of Fair Trade Certified coffee. It was pointed out that, to the extent that retailers like Starbucks are marketing only fair trade coffee as the brewed "coffee of the day," this seems more like a general corporate social responsibility (CSR) strategy by the firm and not a sign of demand for improved standards. We were really talking about sales of certified coffee beans, rather than brewed coffee. The labeled beans are sold in direct competition with similar (unlabeled) beans at both Starbucks and Peet's. But it is important that we check the data and see if we can discriminate clearly between sales in different categories.

In general, we felt we have to do better in accounting for seasonal patterns in demand for home furnishings at ABC and how they might bear on our findings. This is obviously not a problem for our core results, which hinge on the ratio of sales of labeled brands to unlabeled brands during each phase of the experiment. But for measuring price elasticities using changes in the absolute sales of labeled items over time, we would like to allow for the fact that sales of home furnishings were expected to dip during the summer months. To do this, we will probably need to estimate weekly sales for each brand using all the data we have from ABC prior to the start of our experiment (covering sales in 2004 and the first half of 2005). The relevant covariates would probably include recorded levels of total foot traffic in the store, total sales of other store products, some national or regional measures of economic activity and consumer confidence, variables accounting for any special sales and promotional campaigns, and seasonal dummies. We can then compare actual (absolute) sales of labeled brands with out-of-sample predictions based on those estimates and thereby gauge the impact of our experimental treatments.
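
A bare-bones sketch in R of that exercise, using simulated weekly data in place of ABC's records (all variable names and numbers here are invented):

  # Sketch with simulated weekly data; the real covariates would come from ABC's records.
  set.seed(5)
  weeks <- 1:78                                      # 2004 through the experimental period
  month <- factor(((weeks - 1) %/% 4) %% 12 + 1)     # crude month indicator for seasonality
  foot_traffic <- rnorm(78, 1000, 100)
  brand_sales  <- 50 + 0.05 * foot_traffic + 2 * as.numeric(month) + rnorm(78, sd = 5)

  pre  <- data.frame(brand_sales, foot_traffic, month)[1:60, ]    # pre-experiment weeks
  post <- data.frame(brand_sales, foot_traffic, month)[61:78, ]   # weeks during the experiment

  fit <- lm(brand_sales ~ foot_traffic + month, data = pre)
  predicted <- predict(fit, newdata = post)          # expected sales absent the treatment
  summary(post$brand_sales - predicted)              # gap attributable to the labels (plus noise)

The gap between actual and predicted sales during the experimental weeks is the quantity of interest, net of seasonality and store traffic.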

We will conclude our discussion in tomorrow's post.

Posted by James Greiner at 4:46 AM

Applied Statistics - No Meeting

There will be no session of the Applied Statistics workshop on Wednesday, December 14; the talk originally scheduled for this date will be rescheduled for next semester. Our next session will be held on Wednesday, February 1. We hope to see you then!

Posted by Mike Kellermann at 12:00 AM

12 December 2005

Consumer Demand for Labor Standards, Part I

Michael Hiscox and Nicholas Smyth, guest bloggers

We are very grateful to all the members of the Applied Statistics Workshop for inviting us to present our paper (abstract here) in the workshop this week. Thanks, especially, to Mike Kellermann for organizing everything and playing host. This was the first time we have presented the results from our experiments, and we received some very valuable feedback and suggestions for future work on this topic. One important question raised was why we do not simply assume that firms already know how much consumer demand there is for good labor standards. That is, if firms could make a buck doing this sort of thing, why not assume they would already be doing it? We think there are probably a couple of answers to this question. As we noted at the workshop (and in the paper), credible labeling would require cooperation from, and coordination with, independent non-profit organizations that could certify labor standards in factories abroad. So part of the issue for firms is the uncertainty surrounding whether such organizations would be willing and able to take on such a role. The uncertainty about establishing a credible labeling scheme with cooperation from independent groups, on top of the uncertainty about consumer demand itself, may explain why firms are not doing as much research in this area as (we think) is warranted.

The other answer, or part of the answer, is that many firms may consider it too risky to do market research on labor standards labeling. We talked a little about how many firms refused to participate in our labeling experiments because they could not vouch for labor standards in all the factories from which they source, and they were anxious about negative publicity if consumers or activist groups became curious about unlabeled items in their stores. Note that this is not evidence that labeling strategies must also be too risky for firms to ever contemplate. The risks of doing research on this issue are not identical to the risks of actually adopting a labeling strategy (which depend on what the research can tell us about consumer demand, and on whether a firm decides to switch to selling only labeled products or some combination of labeled and unlabeled products, etc.).

More on our paper and the questions that arose in the presentation tomorrow.

Posted by James Greiner at 2:38 AM

9 December 2005

What Did (and Do We Still) Learn from the LaLonde Dataset (Part II)?

Jens Hainmueller

I ended yesterday's post about the famous LaLonde dataset with the following two questions: (1) What have we learned from the LaLonde debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from it and need to move on to new datasets?

On the first point, VERY bluntly summarized, the comic-strip history goes something like this. First, LaLonde showed that regression and IV do not get it right. Next, Heckman's research group released a string of papers in the late 80s and 90s defending conventional regression and selection-based methods. Enter Dehejia and Wahba (1999), who showed that propensity score methods (subclassification and matching) apparently do get it right, provided one controls for more than one year of pre-intervention earnings. Smith and Todd (2002, 2004) are next in line, claiming that propensity score methods do not get it right: once one slightly tweaks the propensity score specification, the results are again all over the place. The ensuing debate spawned more than five papers as Rajeev Dehejia replied to the Smith and Todd findings (all papers in this debate can be found here). Last but not least, Diamond and Sekhon (2005) argue that matching does get it right if it is done properly, namely if one achieves a really high standard of balance (we have already had quite a controversy about balance on this very blog; see, for example, here).
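
For readers who have not run these diagnostics, "balance" in this debate is usually summarized by standardized differences in covariate means between treated and (matched) control units. A minimal sketch in R with made-up data (the cutoffs people use vary, and none of this is specific to the LaLonde files):

  # Standardized difference in covariate means, the usual quick balance diagnostic.
  std_diff <- function(x, treat) {
    m1 <- mean(x[treat == 1]); m0 <- mean(x[treat == 0])
    s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
    100 * (m1 - m0) / s          # in percent; people often aim for absolute values under 10 or so
  }

  set.seed(6)
  treat  <- rbinom(500, 1, 0.3)
  age    <- rnorm(500, 30 + 3 * treat, 8)                # imbalanced by construction
  earn75 <- rexp(500, 1 / (4000 + 2000 * treat))
  round(c(age = std_diff(age, treat), earn75 = std_diff(earn75, treat)), 1)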

So what does this leave applied researchers with? What do we take away from the LaLonde debate? Does anyone still think that regression (or maximum likelihood methods more generally) and/or two-stage least squares IV produce reliable causal inferences in real-world observational studies? In all seriousness, where is the validation? This is the million-dollar question, because MLE and IV methods represent the great majority of what is taught and published across the social sciences. Also, can we trust propensity score methods? How about other matching methods? Or is there little hope for causal inference from observational data in any case (in which case I fear we are all out of a job, and the philosophers get the last laugh)? This is not necessarily my personal opinion, but I would be interested to hear people's opinions. [The evidence is of course not limited to LaLonde; there is ample evidence from other studies with similar findings; see, for example, Friedlander and Robins (1995), Fraker and Maynard (1987), Agodini and Dynarski (2004), Wilde and Hollister (2002), and various Rubin papers, to name just a few.]

On the second point, let me play the devil's advocate again and ask: What can we still learn from the LaLonde data? After all, it is just one single dataset, the standard errors even for the experimental sample are large, and once we match in the observational data, why would we even expect to get it right? There is obviously a strong case to be made for selection on unobservables in the case of a job training experiment. So even if we manage to adjust for observed differences, why in the world should we get the estimate right? [Again, this is not my personal opinion, but I have heard a similar contention both at a recent conference and in Stat 214.] Maybe instead of a job training experiment, we should first use experimental and observational data on something like plants or frogs, where hidden bias may (!) be less of a problem (assuming that is actually the case). Finally, what alternatives do we have? How would we know what the right answer was if we were not working within a LaLonde-style framework? Again, I would be interested in everybody's opinion on this point.

Posted by James Greiner at 6:14 AM

8 December 2005

What Did (and Do We Still) Learn from the LaLonde Dataset (Part I)?

Jens Hainmueller

In a pioneering paper, Bob LaLonde (1986) used experimental data from the National Supported Work Demonstration Program (NSW) as well as observational data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID) to evaluate how reliably conventional estimators recover an experimental target estimate. He used the experimental data to establish a target estimate of the average treatment effect, replaced the experimental controls with several comparison groups constructed from the general population surveys, and then re-estimated the effects using conventional estimators. His crucial finding was that conventional regression, as well as refinements such as instrumental variables, gets it wrong; that is, these estimators do not reliably recover the causal effects estimated in the experimental data. This is troubling, of course, because usually we do not know what the correct answer is, so we simply accept the estimates that our conventional estimators spit out, not knowing how wrong we may be.

This finding (and others) sparked a fierce debate in both econometrics and applied statistics. Several authors have used the same data to evaluate other estimators, such as various matching estimators and related techniques. In fact, today the LaLonde data is THE canonical dataset in the causal inference literature. It has not only been used for many articles; it has also been widely distributed as a teaching tool. I think it is about time we stand back for a second and ask two essential questions: (1) What have we learned from the LaLonde debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from it and need to move on to new datasets? I wholeheartedly invite everybody to join the discussion. I will provide some suggestions in a subsequent post tomorrow.
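
To make LaLonde's exercise concrete before tomorrow's discussion, here is a schematic version in R on simulated data (not the NSW/CPS/PSID files themselves, and with invented parameters): compute an experimental benchmark, then swap in a self-selected comparison group and watch a regression estimate drift.

  # Schematic LaLonde-style exercise on simulated data (not the actual NSW/CPS/PSID files).
  set.seed(7)
  n <- 2000
  educ   <- rnorm(n, 12, 2)
  earn75 <- pmax(0, rnorm(n, 8000, 4000))
  motivation <- rnorm(n)                      # affects earnings and take-up, unseen by the analyst

  # Experimental benchmark: treatment randomized, true effect = 1000
  treat_exp <- rbinom(n, 1, 0.5)
  earn_exp  <- 2000 + 0.5 * earn75 + 300 * educ + 2000 * motivation +
               1000 * treat_exp + rnorm(n, sd = 3000)
  benchmark <- coef(lm(earn_exp ~ treat_exp))[["treat_exp"]]

  # Observational version: take-up depends on observables and on motivation
  p_take <- plogis(-1 + 0.3 * (educ - 12) - 0.0001 * (earn75 - 8000) + motivation)
  treat_obs <- rbinom(n, 1, p_take)
  earn_obs  <- 2000 + 0.5 * earn75 + 300 * educ + 2000 * motivation +
               1000 * treat_obs + rnorm(n, sd = 3000)

  naive    <- coef(lm(earn_obs ~ treat_obs))[["treat_obs"]]
  adjusted <- coef(lm(earn_obs ~ treat_obs + educ + earn75))[["treat_obs"]]  # still misses the confounder
  round(c(benchmark = benchmark, naive = naive, adjusted = adjusted))

In this toy world the benchmark sits near the true effect while the naive and covariate-adjusted observational estimates are biased upward, which is the flavor of the problem LaLonde documented with real data.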

Posted by Jens Hainmueller at 4:33 AM

7 December 2005

Applied Statistics - Michael Hiscox and Nicholas Smyth

Today, the Applied Statistics Workshop will present a talk by Michael Hiscox and Nicholas Smyth of the Harvard Government Department. Professor Hiscox received his Ph.D. from Harvard in 1997 and taught at the University of California at San Diego before returning to Harvard in 2001. His research interests focus on political economy and international trade, and his first book, International Trade and Political Conflict, won the Riker Prize for the best book in political economy in 2001. Nicholas Smyth is a senior in Harvard College concentrating in Government. He is an Undergraduate Scholar in the Institute for Quantitative Social Science. Hiscox and Smyth will present a paper entitled "Is There Consumer Demand for Improved Labor Standards? Evidence from Field Experiments in Social Labeling," based on joint research conducted this summer with the support of IQSS. The presentation will be at noon on Wednesday, December 7 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows after the jump:

A majority of surveyed consumers say they would be willing to pay extra for products made under good working conditions abroad rather than in sweatshops. But as yet there is no clear evidence that enough consumers would actually behave in this fashion, and pay a high enough premium, to make "social labeling" profitable for firms. Without clear evidence along these lines, firms and other actors (including independent groups that monitor and certify standards) may be unwilling to take a risk and invest in labeling. We provide new evidence on consumer behavior from experiments conducted in a major retail store in New York City. Sales rose dramatically for items labeled as being made under good labor standards, and demand for these products was very inelastic for price increases of up to 20% above baseline (unlabeled) levels. Estimated elasticities of demand for labeled towels, for example, ranged between -0.36 and -1.78. Given the observed demand for labor standards, it appears that many retailers could raise their profits by switching to labeled goods. If adopted by a large number of firms, this type of labeling strategy has the potential to markedly improve working conditions in developing nations without slowing trade, investment, and growth.

Posted by Mike Kellermann at 10:27 AM

Fun with R2

Mike Kellermann

This semester, I have been one of the TFs for Gov 2000 (the introductory statistics course for Ph.D. students in the Government Department). It is the first time that I've been on the teaching staff for a course, and it has been quite an experience so far. We've spent the past month or so introducing the basic linear model. Along the way, Ryan Moore (the other TF) and I have had some fun sharing the best quotes that we've come across about everyone's favorite regression output, R2:

Nothing in the CR model requires that R2 be high. Hence a high R2 is not evidence in favor of the model, and a low R2 is not evidence against it. Nevertheless, in empirical research reports, one often reads statements to the effect that "I have a high R2, so my theory is good," or "My R2 is higher than yours, so my theory is better than yours." (Arthur Goldberger, A Course in Econometrics, 1991)
Thus R2 measures directly neither causal strength nor goodness of fit. It is instead a Mulligan Stew composed of each of them plus the variance of the independent variable. Its use is best restricted to description of the shape of the point cloud with causal strength measured by the slopes and goodness of fit captured by the standard error of the regression. (Chris Achen, Interpreting and Using Regression, 1982)
Q: But do you really want me to stop using R2? After all, my R2 is higher than all of my friends and higher than those in all the articles in the last issue of APSR!
A: If your goal is to get a big R2, then your goal is not the same as that for which regression analysis was designed. The purpose of regression analysis and all of parametric statistical analyses is to estimate interesting population parameters....
If the goal is just to get a big R2, then even though that is unlikely to be relevant to any political science research question, here is some "advice": Include independent variables that are very similar to the dependent variable. The "best" choice is the dependent variable; your R2 will be 1.0. (Gary King, "How not to lie with statistics," AJPS, 1986).

So this is old news, right? Maybe not. Quite possibly the thing that has surprised me the most so far is just how much students want R2 to tell them how good their model is. You could almost see the anguish in their faces as we read these quotes to them, particularly among those who have taken some statistics in the past. The question I want to throw out is, why is R2 such an attractive number? Why do we want to believe it? Maybe our cognitive science colleagues have some insight....
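
King's "advice" is easy to demonstrate in R: add the dependent variable (or a thin disguise of it) to the right-hand side and R2 obligingly heads toward 1, whatever the model's merits. A two-minute sketch with simulated data:

  # R2 rewards regressors that resemble the outcome, not good models.
  set.seed(8)
  x <- rnorm(100)
  y <- 2 * x + rnorm(100, sd = 3)            # correctly specified but noisy: modest R2

  summary(lm(y ~ x))$r.squared               # around 0.3
  summary(lm(y ~ x + I(y + 0.01 * rnorm(100))))$r.squared   # essentially 1, and meaningless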

Posted by Mike Kellermann at 5:00 AM

6 December 2005

The BLOG inference engine

Amy Perfors

There are two ways of thinking about almost anything. Consider family and kinship. On the one hand, we all know certain rules about how people can be related to each other -- that your father's brother is your uncle, that your mother cannot be younger than you. But you can also do probabilistic reasoning about families -- for instance, that grandfathers tend to have white hair, that it is extremely unlikely (but possible) for your mother to also be your aunt, or that people are usually younger than their uncles (but not always). These aren't logical inferences; they are statistical generalizations based on the attributes of families you have experienced in the world.

Though the statistics-rule dichotomy still persists in a diluted form, today many cognitive scientists are not only recognizing that people can do both types of reasoning much of the time but also beginning to develop behavioral methods and statistical and computational models that can clarify exactly how they do it and what that means. The BLOG inference engine, whose prototype was released very recently by Stuart Russell's computer science group at Berkeley, is one of the more promising computational developments for this goal.

BLOG (which stands for Bayesian LOGic, alas, not our kind of blog!) is a logical language for generating objects and structures, then doing probabilistic inference over those structures. So for instance, you could specify objects, such as people, with rules for how those objects could be generated (perhaps a new person (a child) is generated with some probability from two opposite-gender parents), as well as how attributes of these objects vary. For example, you could specify that certain attributes of people depend probabilistically on family structure - if you have a parent with that attribute, you're more likely to have that attribute yourself. Other attributes might also be probabilistically distributed, but not based on family structure: we know that 50% of people are male and 50% are female regardless of the nature of their parents.

The power of BLOG is that it allows you both to specify quite complex generative models and interesting logical rules and to do probabilistic inference given the rules you've set up. Using BLOG, for instance, you could ask things such as the following. If I find a person with Bill's eyes, what is the probability that this person is Bill's child? Is it possible for Bill's son to also be his daughter?
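
BLOG has its own modeling language, so the following is not BLOG code; it is just a rough Monte Carlo sketch in R, with made-up probabilities, of the first query above, to give the flavor of probabilistic inference over a generative model.

  # Toy generative model: is a person with Bill's eyes Bill's child? (probabilities invented)
  set.seed(9)
  n <- 100000
  is_child <- rbinom(n, 1, 0.01)                  # prior: 1% of people encountered are Bill's children
  p_eyes   <- ifelse(is_child == 1, 0.5, 0.05)    # the eye trait is inherited half the time, else rare
  has_eyes <- rbinom(n, 1, p_eyes)

  mean(is_child[has_eyes == 1])    # Monte Carlo estimate of P(Bill's child | Bill's eyes)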

Though a few things are unexpectedly difficult in BLOG - reasoning about symmetric relations like "friend," for instance - I think it promises to be a tremendously valuable tool for anyone interested in how people do probabilistic reasoning over structures/rules, or in doing it themselves.

Posted by Amy Perfors at 3:01 AM

5 December 2005

Anchoring Vignettes (II)

Sebastian Bauhoff

In my last post I mentioned how differences in expectations and norms could affect self-rated responses in surveys. One fix is to use anchoring vignettes that let the interviewer control the context against which ratings are made.

For example, in a 2002 paper on the use of vignettes in health research, Salomon, Tandon and Murray ask respondents to rank their own difficulty in mobility on a scale from 'no difficulty' to 'extreme difficulty'. Then they let respondents apply the same scale to some hypothetical persons using descriptions like these:

"Paul is an active athlete who runs long distances of 20km twice a week and plays soccer with no problems."

"Mary has no problems walking, running or using her hands, arms, and legs. She jogs 4km twice a week."

Using the difference in how people assess these controlled scenarios, one can adjust the ratings of people's own health. Doing this across or within various populations then makes it possible to examine systematic differences across groups. These vignettes have been used in recent World Health Surveys in a number of countries.

King, Murray, Salomon and Tandon introduced the vignettes approach and used the measured differences to correct responses to self-rated questions on political efficacy. The idea is that applying the vignettes to a sub-sample is cheap and sufficient to understand systematic differences in self-reports. Their methods are laid out in the paper, and the results show how much difference the vignettes method can make: instead of suggesting that there is a higher level of political efficacy in China than in Mexico (as self-reports would indicate), the vignette method shows the exact opposite, because the Chinese have lower standards for efficacy and thus understand the scale differently.
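
The simplest version of the correction is purely nonparametric: recode each self-assessment by where it falls relative to that respondent's own vignette ratings. A rough sketch in R with one vignette and invented ratings (the published method handles multiple vignettes, ties, and inconsistent orderings; this does not):

  # Roughly the nonparametric recoding idea: one vignette, higher ratings = more difficulty;
  # multiple vignettes, ties, and inconsistent orderings are ignored here.
  relative_rating <- function(self, vignette) {
    ifelse(self < vignette, 1,          # reports less difficulty than the hypothetical person
    ifelse(self == vignette, 2, 3))     # the same as, or more than, the hypothetical person
  }

  self     <- c(3, 3, 4, 2, 5)          # invented self-assessments on a 1-5 scale
  vignette <- c(2, 4, 4, 4, 3)          # the same respondents' ratings of, say, "Mary"
  relative_rating(self, vignette)       # comparable across respondents who use the scale differently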

Intuitively, that's what we do all the time: once you've talked to enough Europeans and Americans about their (and other people's) well-being, you use your mental model to adjust their responses and stop taking the Europeans' minor complaints too seriously. Using this insight in survey-based research can make a huge difference too.

Posted by James Greiner at 6:41 AM

2 December 2005

Questions about Free Software

Jim Greiner

This past spring at Harvard, a group of students from a variety of academic disciplines agitated for a course in C, C++, and R focusing on implementing iterative statistical algorithms such as EM, Gibbs sampling, and Metropolis-Hastings. The result was an informal summer class sponsored by IQSS and taught by recent Department of Statistics graduate Gopi Goswami. Professor Goswami created (from scratch) class notes, problem sets, and sample programs, as well as compiling lists of web links and other useful materials. Course participants came from, among other places, Statistics, Biostatistics, Government, Japanese Studies, the Medical School, the Kennedy School, and Health Policy. For those interested in the lecture slides and other materials Professor Goswami compiled, the link is here. Principal among the subjects taught in the course was how to marry R's data-processing and display capabilities to an iterative inferential engine (try saying that phrase quickly three times) such as an EM or a Gibbs sampler, with the latter written in C or C++ so as to increase (vastly) the speed of runs. In other words, we learned how to have R do the front end (data manipulation, data formatting) and back end (analysis of results, graphics) of an analysis while letting a faster language do the hard work in the middle.
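
As a tiny illustration of this pattern (written from scratch here, not taken from the course materials), suppose the heavy lifting lives in a hypothetical C routine with signature void gibbs_sweep(double *theta, int *n), compiled into a shared library with R CMD SHLIB. The R side then looks roughly like this:

  # R front end around a compiled C routine (illustrative; the C side is assumed to be
  #   void gibbs_sweep(double *theta, int *n)
  # compiled beforehand with: R CMD SHLIB gibbs_sweep.c)

  dyn.load("gibbs_sweep.so")                     # "gibbs_sweep.dll" on Windows

  theta <- rnorm(1000)                           # data prepared and formatted in R
  out <- .C("gibbs_sweep",
            theta = as.double(theta),            # .C copies arguments to C and back
            n     = as.integer(length(theta)))

  hist(out$theta)                                # results analyzed and plotted back in R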

The course both demonstrates and facilitates a growing trend in the quantitative social sciences toward making open-source software stemming from scholarly publications freely available to the academic community. Two examples from the ever-expanding field of ecological inference are Gary King's EI program, based on a truncated bivariate normal model and implemented in GAUSS, and Kosuke Imai and Ying Lu's implementation of a Dirichlet-process-based model, written with an R-C interface.

The trend toward freely available, model-specific software has obvious potential upsides. Previously written code can save the time of a user interested in applying the model. Moreover, if the code is used often enough and potential bugs are reported and fixed, the software may become better than what a potential user could write on his or her own. After all, few of us interested in answers to real-world issues want to spend the rest of our lives coding in C.

Nevertheless, I confess to a certain amount of apprehension. For me at least, freely available, model-specific software provides a temptation to use models I do not fully understand. Relatedly, I often think that I understand a model fully, that I grasp all of its strengths and weaknesses, only to discover otherwise when I sit down to program it. Finally, oversight, hubris, or a desire to make the accompanying documentation readable may cause the author of the software not to describe fully the details of implementation or the compromises made therein. Thus, while I am excited by the possibilities freely available social science software holds, I worry about the potential for misuse as well.

Posted by James Greiner at 6:00 AM

1 December 2005

Anchors Down (I)

Sebastian Bauhoff

"How's it going?" If you ever tried to compare the answer to this question between the average American ("great") and European ("so-so" followed a list of minor complaints), you hit directly on a big problem in measuring self-reported variables.

Essentially, responses to questions on self-reported health, political voice, and so on are determined not only by differences in actual experience, but also by differences in expectations and norms. For a European, "so-so" is a rather acceptable state of well-being, whereas for Americans it might generate serious worry. Similarly, people's expectations about health may change with age, and responses can thus be incomparable even within a population (see this hilarious video on Gary King's website for an example).

A way to address this problem in surveys is to use "anchoring vignettes": ask people to rate themselves on some scale, and then also ask them to assess hypothetical people on the same scale. The idea is that ratings of the hypothetical persons reflect the respondents' norms and expectations in the same way as the ratings of their own situation. Since the hypothetical scenarios are fixed across respondents, any difference in the vignette ratings is due to interpersonal incomparability.

Using vignettes is better than simply asking people to rank themselves on a scale from "best" to "worst" health because it makes the context explicit and puts it under the experimenter's control. Gary and colleagues have done work on this issue showing that using vignettes can lead to very different results than self-reports alone (check out their site). I will write more on this in the next entry.

Posted by Sebastian Bauhoff at 2:21 AM