Visualization for data cleaning 

 Speaking of Fernanda Viegas and Martin Wattenberg's excellent presentation on visualization, I recently came across a data cleaning problem where visualization was a big help. Data cleaning is all about having powerful ways of finding mistakes quickly. Much of the time, clever scripting is the best way to detect errors, but in this case a simple data visualization turned out to be the best tool. Screenshot after the jump. 
 


 First, a little background on the project, which is a collaboration with Jens Hainmueller. The Times of London published election guides throughout the 20th century, including voting results and candidate bios for every constituency in every election to the House of Commons. We scanned and OCR'd seven volumes of this series and wrote scripts to extract information about each constituency race, including the name, vote total, and short bio of each candidate. The challenge then was to determine which appearances belonged to the same individual. For example, when "P G Agnew" runs in 1950 and "Peter Agnew" runs in 1955, are they the same person? We trained a clustering algorithm to do this matching based on name similarity, year of birth, party, and gender, and wrote some scripts to catch likely errors. When we thought we had done as well as we could, we decided to produce a little visualization to admire our perfectly cleaned data. To our surprise, the visualization revealed a number of remaining errors that our scripts had missed.

 As can be seen in the screenshot below, we listed the candidates alphabetically by surname and depicted their election career graphically with a colored rectangle for each appearance in a race. We selected the colors to reflect the margin in the race, with deep green indicating an easy victory and deep red indicating a resounding defeat.  
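
A color scale like this can be sketched in a few lines of Python. This is only an illustration, not the code behind the screenshot: the function name `margin_to_rgb`, the definition of margin as a signed share of the vote (positive for a win, negative for a loss), and the 50-point cutoff for the "deepest" color are all assumptions made for the example.

```python
def margin_to_rgb(margin, max_margin=0.5):
    """Map a signed victory margin to an (R, G, B) color.

    margin: hypothetical signed vote-share margin, e.g. +0.30 for a
            30-point win and -0.30 for a 30-point loss.
    max_margin: assumed margin at which the color is fully saturated.
    """
    # Scale to [-1, 1], clamping landslides to the ends of the scale.
    t = max(-1.0, min(1.0, margin / max_margin))
    strength = abs(t)

    pale = (255, 255, 255)  # a dead heat is drawn in white
    deep = (0, 100, 0) if t >= 0 else (139, 0, 0)  # deep green or deep red

    # Linear blend from white toward the deep color.
    return tuple(round(p + (d - p) * strength) for p, d in zip(pale, deep))
```

Each appearance then becomes one rectangle filled with `margin_to_rgb(margin)`, placed in the candidate's row at the column for that election.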
  
Depicting the candidates' campaign history in this way helped us see patterns that suggested that a single candidate had been incorrectly coded as separate candidates. Brian Batsford, shown at the top of the screenshot, was one such case: the Brian Batsford who ran in 1959, 1964, and 1970 was very likely to be the same person as the Brian Batsford who ran in 1966. Indeed, it turned out that they were the same person; our clustering algorithm had mistakenly split him into two candidates because the year of birth had been miscoded as 1928 in his 1966 appearance.  

 The key point here is that the pattern that allowed us to see this mistake is easier to see than it is to articulate and, perhaps more importantly, than it is to write in a script. (OK, I'll try: "Find pairs of candidates who have similar names and did not appear in the same elections, especially if they appeared in contiguous elections and had similar results.") I prefer the pretty colors.
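
For the curious, here is roughly what the first half of that rule might look like as a script. It is a minimal sketch under stated assumptions, not what we actually ran: the record layout (a dict from a cluster ID to a name and a set of election years), the function names, and the 0.85 similarity threshold are all made up for illustration, and name similarity is measured with the standard library's `difflib.SequenceMatcher`. The "contiguous elections" and "similar results" parts are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suspicious_pairs(candidates, threshold=0.85):
    """Flag pairs of clusters that may be the same person.

    candidates: hypothetical dict mapping a cluster ID to a tuple of
                (name, set of election years in which they ran).
    threshold:  assumed cutoff for "similar names".
    """
    flagged = []
    for (id_a, (name_a, years_a)), (id_b, (name_b, years_b)) in combinations(
        candidates.items(), 2
    ):
        # Two clusters that ran in the same election are different people.
        if years_a & years_b:
            continue
        if name_similarity(name_a, name_b) >= threshold:
            flagged.append((id_a, id_b))
    return flagged


# Illustrative input based on the Batsford case described above.
clusters = {
    1: ("Brian Batsford", {1959, 1964, 1970}),
    2: ("Brian Batsford", {1966}),
    3: ("Peter Agnew", {1955}),
}
print(suspicious_pairs(clusters))
```

On this toy input the only pair flagged is the two Batsford clusters. Even so, a script like this needs a hand-tuned threshold, whereas the gap in the row of rectangles is visible at a glance.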