Saturday, 21 May 2016

IMPLEMENTING DECISION TREES WITH PYTHON

Hello, welcome to my blog. In my previous post I introduced another classification algorithm called decision trees. In this post I want to demonstrate how to implement decision trees using the scikit-learn library in Python.



For this demonstration, I will be using the LendingClub dataset. LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. The goal of this demonstration is to build a decision tree classification model that will predict whether or not a loan provided by LendingClub is likely to default. The dataset contains records of about 120,000 loans granted by LendingClub.

The dataset is in SFrame format, so I will use the ‘SFrame’ class from GraphLab to load the data. Feel free to use the standalone ‘sframe’ library instead. As I mentioned earlier, I will use the scikit-learn library to build a decision tree model from the LendingClub data. The full code for this demonstration is given at the end of this post; here I will just display screenshots of the results of the program.

PRELIMINARY STEPS
The dataset has a disproportionately large number of safe loans (about 80% of the loans in the dataset are safe loans). This presents a problem because it could give a misleading picture of the classifier’s performance: a model that labels every loan as safe would already be about 80% accurate. To combat this problem, I undersampled the majority class so that the numbers of safe and unsafe loans were approximately equal. This is a crude method of handling imbalanced classes because we are throwing away data points, which is not always a good idea. Read this excellent post for other ideas you can try.
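
To make the undersampling idea concrete, here is a minimal sketch in plain Python 3. It is not the code I ran for this post (that used SFrame's ‘sample’ method, which only gives an approximately equal split). The helper name undersample_majority and the toy records are made up for illustration.

```python
import random

def undersample_majority(rows, label_key, seed=1):
    """Return a shuffled, balanced copy of rows by dropping majority-class rows."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    # The smallest class sets the size that every class is cut down to.
    target_size = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target_size))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 80 safe loans (+1) and 20 risky loans (-1).
toy_loans = [{"id": i, "safe_loans": +1} for i in range(80)]
toy_loans += [{"id": 80 + i, "safe_loans": -1} for i in range(20)]

balanced = undersample_majority(toy_loans, "safe_loans")
n_safe = sum(1 for r in balanced if r["safe_loans"] == +1)
n_risky = sum(1 for r in balanced if r["safe_loans"] == -1)
print(n_safe, n_risky)  # 20 20
```

Note that 60 of the 80 safe loans are simply discarded here, which is exactly the drawback mentioned above.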

Next, I converted the categorical variables in the data to binary features via one-hot encoding. This is because scikit-learn’s decision tree implementation requires numerical values for the data you give it. Finally, I split the data into training (80%) and test (20%) sets and used a random seed to ensure reproducibility.
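
The two steps above can be sketched in plain Python 3 as follows. This is only an illustration on made-up rows: the helper names one_hot_encode and train_test_split_rows are hypothetical, and unlike SFrame's ‘random_split’ (which assigns each row at random, so the split is only approximately 80/20) this version cuts the shuffled list at an exact position.

```python
import random

def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[column + "." + category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

def train_test_split_rows(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows with a fixed seed and cut it into two parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy rows (hypothetical values) with one categorical and one numeric column.
toy = [
    {"grade": "A", "dti": 10.5},
    {"grade": "B", "dti": 22.1},
    {"grade": "C", "dti": 17.3},
    {"grade": "A", "dti": 8.0},
    {"grade": "B", "dti": 30.2},
]

encoded = one_hot_encode(toy, "grade")
print(encoded[0])  # {'dti': 10.5, 'grade.A': 1, 'grade.B': 0, 'grade.C': 0}

train, test = train_test_split_rows(encoded)
print(len(train), len(test))  # 4 1
```

Because the seed is fixed, running the split again gives the same training and test rows.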

TRAINING THE DECISION TREE MODEL
To create a decision tree model, I simply created an object of the sklearn.tree.DecisionTreeClassifier class. I specified max_depth = 2 so that I can easily visualize and interpret the decision tree, since a small tree is much easier to draw and explain than a very large and complex one. I then used the ‘fit’ method to fit a decision tree model on the training data.
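
Here is a minimal sketch of that create-then-fit pattern on synthetic data, written for Python 3 and a reasonably recent scikit-learn (I am assuming a version that has the ‘get_depth’ method). The data and the random_state values are made up; this is not the LendingClub model itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 3 numeric features. The label is +1 when
# the first feature is positive and -1 otherwise.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1, -1)

# max_depth caps how many successive splits any path in the tree may contain.
clf = DecisionTreeClassifier(max_depth=2, random_state=1)
clf.fit(X, y)

print(clf.get_depth())       # never more than 2
print(clf.predict(X[:5]))    # predicted labels for the first five rows
```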

VISUALIZING THE TREE
We can get a visual representation of the decision tree using scikit-learn and the Graphviz package. Below is a picture of the decision tree that was grown from the LendingClub dataset.
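
If Graphviz is not available, a plain-text rendering is a possible alternative. The sketch below assumes a newer scikit-learn than the one I used for this post (the ‘export_text’ function was added in version 0.21) and uses a tiny made-up dataset, with feature names borrowed from the post purely for readability.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny hypothetical dataset with two binary features.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [0, 1]]
y = [-1, 1, 1, 1, -1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

# export_text returns the tree as an indented string, one line per node.
tree_text = export_text(clf, feature_names=["grade.A", "grade.B"])
print(tree_text)
```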



Interpreting the Decision Tree
The first split was made on X[7], which is the feature grade.A (this column has a value of 1 if the loan was a grade A loan or 0 if it wasn’t). The left side of this split contains the loans that were not grade A loans, which amount to 32,094 of the 37,224 training examples. From this node another split was made on X[8], which is the feature grade.B. On the left side of this split, i.e. loans that were neither grade A nor grade B, 12,875 loans were unsafe while 8,853 were safe. On the right side of this split, i.e. loans that were grade B, 4,343 loans were unsafe while 6,023 were safe.

Going back to the first split on X[7], the right side of this split contains the grade A loans, which account for 5,130 of the 37,224 training examples. From this node another split was made on X[6], which is the total late fees received for the loan (total_rec_late_fee). On the left side of this split, i.e. grade A loans with X[6] less than or equal to 14.8301, 1,153 loans were unsafe while 3,834 were safe. On the right side of this split, i.e. grade A loans with X[6] greater than 14.8301, 105 were unsafe while 38 were safe.

Since this is a classification problem, each leaf predicts the majority class among the training examples that fall into it. We can therefore classify a loan as safe or unsafe using the following rules:
  • If grade.A = 0 and grade.B = 0 then loan is unsafe.
  • If grade.A = 0 and grade.B = 1 then loan is safe.
  • If grade.A = 1 and total_rec_late_fee <= 14.8301 then loan is safe.
  • If grade.A = 1 and total_rec_late_fee > 14.8301 then loan is unsafe.
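
The four rules above can be written as a small Python function. This is just a hand-coded restatement of the tree for illustration (the function name classify_loan is my own), together with a check that each rule matches the majority class of its leaf, using the counts quoted earlier.

```python
def classify_loan(grade_a, grade_b, total_rec_late_fee):
    """Apply the four leaf rules of the depth-2 tree; returns 'safe' or 'unsafe'."""
    if grade_a == 0:
        return "safe" if grade_b == 1 else "unsafe"
    return "safe" if total_rec_late_fee <= 14.8301 else "unsafe"

# (unsafe, safe) training counts in each leaf, as read off the tree.
leaf_counts = {
    "not A, not B": (12875, 8853),
    "not A, B": (4343, 6023),
    "A, late fee <= 14.8301": (1153, 3834),
    "A, late fee > 14.8301": (105, 38),
}

leaf_majority = {}
for leaf, (unsafe, safe) in leaf_counts.items():
    leaf_majority[leaf] = "safe" if safe > unsafe else "unsafe"
    print(leaf, "->", leaf_majority[leaf])

print(classify_loan(0, 0, 0.0))    # unsafe
print(classify_loan(0, 1, 0.0))    # safe
print(classify_loan(1, 0, 5.0))    # safe
print(classify_loan(1, 0, 20.0))   # unsafe
```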

EVALUATING THE DECISION TREE
I evaluated the performance of the decision tree on the test data using the ‘score’ method. The model had an accuracy of 62% on the test data. Since the classes were balanced to roughly 50/50, always predicting a single class would score about 50%, so the tree does pick up some signal for discriminating safe loans from unsafe loans, although there is plenty of room for improvement. One of the things that limits the performance of the tree is the ‘max_depth’ parameter, which I set to 2. Increasing this parameter will likely improve the performance of the tree on test data up to a point, but it should not be too large in order to avoid overfitting.
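
As a sanity check on that figure, the training accuracy can be worked out by hand from the leaf counts quoted in the interpretation section. The Python 3 sketch below does the arithmetic; it uses only numbers from this post and is separate from the ‘score’ call, which was run on the test data.

```python
# (unsafe, safe) training counts per leaf, taken from the tree described in the post.
leaves = [(12875, 8853), (4343, 6023), (1153, 3834), (105, 38)]

total = sum(unsafe + safe for unsafe, safe in leaves)

# Each leaf predicts its majority class, so it gets max(unsafe, safe) rows right.
correct = sum(max(unsafe, safe) for unsafe, safe in leaves)
training_accuracy = correct / total

# Baseline: always predict whichever class is more common overall.
total_unsafe = sum(unsafe for unsafe, _ in leaves)
total_safe = sum(safe for _, safe in leaves)
baseline_accuracy = max(total_unsafe, total_safe) / total

print(total)                         # 37224
print(correct)                       # 22837
print(round(training_accuracy, 3))   # 0.614
print(round(baseline_accuracy, 3))   # 0.504
```

The training accuracy of about 61.4% is close to the 61.9% measured on the test data, which suggests that a tree this shallow is not overfitting.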

ASIDE
While working on this blog post, I noticed the ‘purpose’ column in the dataset which represents the reason why the borrower is taking the loan. I wondered if there was any relationship between the purpose of a loan and whether the borrower defaulted on the loan or not. This is something I feel is worth investigating and I will report my results on this blog when I do that.

I also want to announce that I have created a GitHub repository that has my implementation of gradient descent (for linear regression) and gradient ascent (for logistic regression). More projects will be added to the repository very soon.

SUMMARY
In this blog post, I demonstrated how to implement decision trees using the scikit-learn library in Python. Thank you once again for reading this post. If you have any questions about this post or any other post, leave a comment and I will do my best to answer you. Once again, I urge you to subscribe to my blog posts in case you have not done so. Enjoy the rest of your weekend. Cheers!!!

Code for decision trees

#import needed libraries
import numpy as np
import graphlab as gl #GraphLab Create, used to load the data (stored in SFrame format)
import sklearn.tree

#Load the dataset
loans = gl.SFrame('lending-club-data.gl/')

#Check the number of rows and columns in the data
print loans.shape

#Recode the 'bad_loans' column in a more intuitive way
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x==0 else -1)
loans.remove_column('bad_loans')

#Proportion of safe loans and risky loans
print "Proportion of safe loans: ", np.sum(np.array(loans['safe_loans'] == +1)) /float(len(loans))
print "Proportion of risky loans: ", np.sum(np.array(loans['safe_loans'] == -1)) /float(len(loans))

#Extract subset of features from the dataset 
features = ['grade', # grade of the loan
'sub_grade', # sub-grade of the loan
'short_emp', # one year or less of employment
'emp_length_num', # number of years of employment
'home_ownership', # home_ownership status: own, mortgage or rent
'dti', # debt to income ratio
'purpose', # the purpose of the loan
'term', # the term of the loan
'last_delinq_none', # has borrower had a delinquency
'last_major_derog_none', # has borrower had 90 day or worse rating
'revol_util', # percent of available credit being used
'total_rec_late_fee', # total late fees received to date
]

target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)
# Extract the feature columns and target column
loans = loans[features + [target]]

safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]

print "Number of safe loans : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)


# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
risky_loans = risky_loans_raw

#Sample this percentage from the safe_loans data
#setting seed for reproducibility
safe_loans = safe_loans_raw.sample(percentage, seed=1)
# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)

#Check the proportion of safe and risky loans to ensure they are about the same
print "Proportion of safe loans: ", np.sum(np.array(loans_data['safe_loans'] == +1)) /float(len(loans_data))
print "Proportion of risky loans: ", np.sum(np.array(loans_data['safe_loans'] == -1)) /float(len(loans_data))

#Perform one-hot encoding for the categorical (string-typed) variables
categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)
for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
    
    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

#Split data into train and test data
train_data, test_data = loans_data.random_split(.8, seed = 1)

featuresList = loans_data.column_names()
#Remove target column 'safe_loans' from featuresList
featuresList.remove('safe_loans')

#Index using our features list to create the training and test sets
pred_train = train_data[featuresList]
pred_test = test_data[featuresList]

#Do the same for the target column 
target_train = train_data['safe_loans']
target_test = test_data['safe_loans']

#Convert them to numpy arrays
pred_train = pred_train.to_numpy()
pred_test = pred_test.to_numpy()
target_train = target_train.to_numpy()
target_test = target_test.to_numpy()

#Let's train a decision tree classifier with max_depth = 2
treeClassifier = sklearn.tree.DecisionTreeClassifier(max_depth=2)

#Fit classifier on the training data
decision_tree_model = treeClassifier.fit(pred_train, target_train)

#Visualizing the tree
from io import BytesIO as StringIO
from IPython.display import Image

out = StringIO()
sklearn.tree.export_graphviz(decision_tree_model, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

#Evaluating the tree
#the score function gives us the accuracy of the decision tree model
print "Accuracy of decision tree on test data: %.3f" % decision_tree_model.score(pred_test, target_test)

#Accuracy of decision tree on test data: 0.619