Words and Credit Scores 

Find statistical evidence that borrowers who use words like "bill," "bills," and "need" in their loan applications are twice as likely to default. This post uses freely available data from the P2P lending site LendingClub.


LendingClub is a P2P lending site much like Prosper. What makes them special is that they've released a full data set of all 4,564 past loans and their current status. As a data source this is extraordinary, since most literature on credit scoring uses proprietary data. For the LendingClub data, can we beat the FICO score at default prediction by incorporating additional clues?

 This post focuses on the borrower's "Loan Description," which I use along with FICO scores to predict defaults. The loan description is written by the borrower and usually pitches his qualifications and reasons for needing the money. Here's a randomly chosen example from someone who is current on his payments. 

 I have some credit card debt that I would like to pay-off. It makes sense to pay one lender as opposed to 5 credit card companies. I'd rather pay interest to one payee rather than split between 5 or 6. 

 This is a relatively short one -- the average description is 58 words long. Perhaps there are keywords in the description that impact the probability of default after controlling for the FICO score. Here's what I did to test for these keywords: 
 
 1. Find the 300 most common words in all loan descriptions.
 2. For each word w, test the hypothesis that use of w is conditionally independent of delinquency given the FICO score range (six ranges from 640 up). I apply the Mantel-Haenszel test. Note that I am ignoring the survival analysis aspect of the problem here (i.e., some loans are newer than others) for simplicity, since all loans are relatively new anyway (LendingClub started in January of 2007).
 3. Order all the words by the test's p-value. Check that the distribution of p-values is non-uniform to ensure significance in the presence of multiple comparisons.
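The three steps above could be sketched in Python roughly as follows. This is not the code behind the numbers in this post, just a minimal standard-library illustration under some assumptions: the names top_words, mantel_haenszel_test and ks_distance_from_uniform are made up for the example, words are counted once per description, the test statistic uses a continuity correction, and the counts in the usage lines are invented rather than taken from the LendingClub data.

```python
import math
import re
from collections import Counter


def top_words(descriptions, n=300):
    """Return the n words that appear in the most descriptions."""
    counts = Counter()
    for text in descriptions:
        # Assumption: count each word once per description.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return [word for word, _ in counts.most_common(n)]


def mantel_haenszel_test(strata):
    """Mantel-Haenszel chi-square test across a list of 2x2 tables.

    Each stratum (one FICO range) is a tuple (a, b, c, d):
        a = delinquent loans that use the word
        b = non-delinquent loans that use the word
        c = delinquent loans that do not use the word
        d = non-delinquent loans that do not use the word
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return 0.0, 1.0
    # Assumption: continuity-corrected form of the statistic.
    chi_square = max(abs(observed - expected) - 0.5, 0.0) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of p_values and Uniform(0, 1)."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


# Invented counts for one word in two FICO ranges, for illustration only.
example_strata = [(12, 88, 30, 470), (6, 94, 15, 485)]
chi_square, p_value = mantel_haenszel_test(example_strata)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

For the last step, a distance well above about 1.36 / sqrt(n) (the usual large-sample 5% cutoff for the Kolmogorov-Smirnov statistic, roughly 0.08 for 300 words) would suggest the p-values are not uniform. Words co-occur, so the p-values are not independent and this is only a rough check.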

 Now, the fun stuff. For our purposes, define a Delinquency as either being late in your payments or having defaulted completely. The words with the smallest p-values are below. I report marginal delinquency probabilities, not broken out by FICO score, simply for brevity; the actual M-H test controlled for the FICO scores.

 Word        Loans With   P(Delinquency|No word)   P(Delinquency|Word)   p-value
 also               215                    0.067                 0.140    0.0004
 need               608                    0.062                 0.105    0.0015
 business           233                    0.069                 0.116    0.0038
 live                91                    0.070                 0.154    0.0057
 already             64                    0.071                 0.156    0.0059
 other              285                    0.068                 0.112    0.0081
 bills              223                    0.067                 0.135    0.0082
 bill               279                    0.066                 0.125    0.0117
 interest           660                    0.081                 0.053    0.0136
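To see how the table relates to the "twice as likely" figure, one can divide the two probability columns for each word. The snippet below is a small sketch that does only that, using the values copied from the table; the variable names are made up for the example, and the ratios are marginal, so unlike the M-H test they do not control for FICO score.

```python
# (word, P(Delinquency | No word), P(Delinquency | Word)), copied from the table.
rows = [
    ("also", 0.067, 0.140),
    ("need", 0.062, 0.105),
    ("business", 0.069, 0.116),
    ("live", 0.070, 0.154),
    ("already", 0.071, 0.156),
    ("other", 0.068, 0.112),
    ("bills", 0.067, 0.135),
    ("bill", 0.066, 0.125),
    ("interest", 0.081, 0.053),
]

# Marginal relative risk: how much more (or less) often loans that use the
# word are delinquent compared with loans that do not use it.
relative_risk = {word: p_with / p_without for word, p_without, p_with in rows}

for word, ratio in sorted(relative_risk.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {ratio:.2f}")
```

A ratio above 1 means the word goes with more delinquency, and a ratio below 1 means it goes with less.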

Some speculative reasoning: A word like "also" implies that the loan will be used for more than one purpose, which points to a heightened risk. Here's a randomly chosen delinquent borrower who used "also." It's clear that he has multiple goals in mind for the money and has obviously racked up quite a bit of debt.

 I have good credit and am looking to consolidate all my debt into one easy payment.  I am looking to get married soon so the less multiple bills we have to keep track of the better. I have two credit cards with low balances that I would like to pay off.  I have a furniture debt that I would also like to consolidate and I need to overhaul the commuter vehicle my fiance will begin driving.  I have no recorded late or delinquent payments on my credit. I have worked for my current employer for 5 1/2 yrs and have good standing.  I am excited to join hands in marriage with my lovely fiance and the remainder balance after consolidation will be used for marraige documentation purposes. I appreciate your consideration. Thank you. 

As for the other words, "need" implies that the borrower is in straits of some kind, while "live," "bill" and "bills" suggest that the money will be used for day-to-day expenses rather than a targeted goal, implying a systemic negative cash flow. "Already" suggests an existing outstanding loan. Every word on the list except "interest" is associated with a higher delinquency risk.
 "Business" is somewhat surprising -- people who want money to start businesses must be greater risks. Here's an example:

 i am trying to buy a residential Land in emerging and booming market like new delhi where building cost is very cheap and return of investment is 150% in just six months.  I intend to purchase the land build the house with my friends help who is in building house business and make a six flats/3 floor  house. and sale it each one of them under USD 12, 000.00. 

I'm stunned something like this got funded!

All in all, such keywords look like a good building block for enhancing a credit score model that goes beyond FICO scores. In a saner credit market, a viable strategy would be to fund P2P loans judged by an enhanced model to minimize default risk. Right now, however, I'd be worried that the credit crisis could wipe out all these sites at the drop of a hat.