Linguistics of the Debate 

 In last week's debate in Philadelphia, 

 
- Clinton's favorite phrase was "you know," which she used 49 times to Obama's 18
- Obama's favorite phrase was "American people," which he used 16 times to Clinton's 1
- Obama was the only one to use the words "politics" (10 times), "economic" (9 times) and "election" (9 times)
 


 Last week's debate provides a small but interesting corpus for analyzing the candidates' favorite linguistic formulations. Overall, 

 
- The candidates uttered 12,329 words in total
- Obama uttered 6,206 words (1,331 unique) in 40 chunks
- Clinton uttered 6,123 words (1,250 unique) in 37 chunks
 

 So all in all, the candidates spoke roughly the same number of words. But which words? We can check that with a basic corpus comparison method. In all, there were 1,971 unique words. For each of these, we test the hypothesis that both candidates spoke the word with equal probability, using a simple chi-squared test. Next we sort all words by their p-values, so that the words whose usage differs most between the candidates rise to the top. Here are the top 20 words by p-value, along with how often Obama and Clinton used each one.

    Word      obama  clinton  pval
    will         18       56  0.0000
    know         23       64  0.0000
    that's       43       12  0.0001
    she          16        0  0.0002
    it           41       79  0.0005
    how          36       12  0.0010
    clinton      14        1  0.0021
    i           150      205  0.0024
    he            5       21  0.0029
    politics     10        0  0.0047
    this         58       30  0.0047
    american     20        5  0.0056
    to          211      268  0.0058
    begin         0        9  0.0072
    york          0        9  0.0072
    decade        9        0  0.0081
    economic      9        0  0.0081
    election      9        0  0.0081
    going        49       26  0.0128
    give          1       10  0.0149
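 For readers who want to try the per-word test themselves, here is a minimal sketch in Python. It assumes a 2x2 table (this word vs. all other words, by candidate) with Yates' continuity correction; the post does not say which variant of the chi-squared test was used, so treat that choice, and the name `chi2_pvalue`, as assumptions made for illustration. The word totals come from the figures above.

```python
import math

# Per-candidate word totals, taken from the post.
OBAMA_TOTAL = 6206
CLINTON_TOTAL = 6123

def chi2_pvalue(count_a, count_b, total_a=OBAMA_TOTAL, total_b=CLINTON_TOTAL):
    """P-value for H0: both speakers use the word at the same rate.

    Assumption: 2x2 table (word vs. every other word, by speaker) with
    Yates' continuity correction; the post does not name the variant.
    """
    n = total_a + total_b
    word_total = count_a + count_b
    observed = [count_a, count_b, total_a - count_a, total_b - count_b]
    expected = [
        word_total * total_a / n,
        word_total * total_b / n,
        (n - word_total) * total_a / n,
        (n - word_total) * total_b / n,
    ]
    chi2 = sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no stats library is needed.
    return math.erfc(math.sqrt(chi2 / 2))

# A few (obama, clinton) counts copied from the table.
counts = {"politics": (10, 0), "begin": (0, 9), "she": (16, 0)}

# Sort so the smallest p-values (largest differences) come first.
ranked = sorted(counts, key=lambda w: chi2_pvalue(*counts[w]))
for word in ranked:
    print(f"{word:10s} {chi2_pvalue(*counts[word]):.4f}")
```

 Under these assumptions the sketch gives 0.0047 for "politics" and 0.0072 for "begin", the same values as in the table, which suggests (but does not confirm) that a continuity-corrected test was used.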

 Sometimes function words (I, it, etc.) are excluded from this kind of analysis, but here I thought it would be fun to leave them in so we could see each candidate's preferred constructions. Besides the points listed above, here are a few interesting notes: 
- Clinton used the word "I" 205 times to Obama's 150 
- Obama loves to start sentences with "That's": "That's why I'm...", "That's what we're...", etc. 
- Obama loves the word "decade" -- evidently he used the phrase "decades after decades" several times 

 Of course, unigrams (single words) can only tell you so much. If we run the same analysis on bigrams (pairs of adjacent words), a few more bits of information emerge: 

    Bigram           obama  clinton  pval
    you know            18       49  0.0002
    american people     16        1  0.0008
    senator clinton     13        0  0.0009
    the american        17        2  0.0014
    and that's          13        1  0.0035
    have a               5       20  0.0046
    this country        10        0  0.0047
    i will               7       23  0.0055
    going to            46       22  0.0061
    new york             0        9  0.0072

 So Clinton often punctuates her thoughts with "you know," while Obama attributes his goals to the "American people." 
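 Counting bigrams takes only a few lines of Python. The sketch below is one way to do it, not necessarily how the numbers above were produced: the tokenizer (lowercase, letters and apostrophes only) and the names `tokenize` and `ngram_counts` are assumptions, and the sample sentence is made up rather than quoted from the debate.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens, keeping apostrophes ("that's")."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(text, n=2):
    """Count n-grams (as space-joined strings) in one speaker's text."""
    tokens = tokenize(text)
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Made-up sample text, not a quote from the debate.
sample = "You know, I will fight. You know that I will."
bigrams = ngram_counts(sample, n=2)
print(bigrams.most_common(2))

# n=1 gives plain word counts; the number of keys is the unique-word count.
unigrams = ngram_counts(sample, n=1)
print(sum(unigrams.values()), len(unigrams))
```

 Note that this simple version lets bigrams span sentence boundaries ("fight you" is counted above), which a more careful analysis might want to avoid.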

 It will be interesting when McCain gets into the mix with one of these two. I think it would be fun to construct a language model: a model of the probability that each candidate would say a given sentence. Given the differences above, I bet such a model could easily figure out whether Obama, Clinton or McCain said a particular sentence!
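
 To make the idea concrete, here is a rough sketch of the simplest such model: a unigram model with add-one smoothing per candidate, where a sentence is assigned to whichever candidate's model gives it the highest log-probability. Everything here is an assumption for illustration: the class and function names are hypothetical, and the training text is a toy example built from the distinctive words above, not actual debate transcripts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

class UnigramModel:
    """Add-one smoothed unigram model for one speaker."""

    def __init__(self, text, vocab):
        self.counts = Counter(tokenize(text))
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 slot for unseen words

    def log_prob(self, sentence):
        """Sum of smoothed log-probabilities of the sentence's words."""
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in tokenize(sentence)
        )

def classify(sentence, models):
    """Return the speaker whose model scores the sentence highest."""
    return max(models, key=lambda name: models[name].log_prob(sentence))

# Toy, made-up training text; NOT quotes from the debate.
training = {
    "obama": "that's why the american people want change in this country",
    "clinton": "you know i will begin in new york and you know i will give",
}
vocab = {w for text in training.values() for w in tokenize(text)}
models = {name: UnigramModel(text, vocab) for name, text in training.items()}

print(classify("you know i will", models))
print(classify("the american people", models))
```

 A real attempt would train on the full transcripts and would probably use bigrams as well, since the tables above show that phrases like "you know" carry much of the signal.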