## Thursday, 12 May 2016

### DECISION TREES

Hello, welcome to my blog. Recently, I have been talking about classification. I have talked about a linear classifier, how to use logistic regression to learn coefficients for a linear classifier. Furthermore, I demonstrated how to implement logistic regression using the sci-kit learn library in Python and I talked about evaluation metrics for classifiers.

Today, I want to talk about another algorithm for classifying data – Decision Trees. This algorithm takes a different approach to classifying data which I will talk about in this post.

INTUITION BEHIND DECISION TREES
When making a difficult or important decision, we often consider or weigh the pros and cons of making a particular decision from a set of possible decisions. For example, consider a jobseeker who has to choose one job out of several job offers. He may decide to choose the job with the highest pay and benefits but what if that job required that he had to move to a different country that spoke a different language? He may decide to consider the second job that is closer to him but what if that job does not pay as much as the previous job? Finally, he may decide to take a job that with decent pay and benefits and also is not too far his friends and family. You can see that his decision on either accepting or rejecting a job offer was made by a series of small specific choices.

Concretely, the goal of decision tree modelling is to build a tree (from the data) that lets us make predictions about our data. This is done by splitting the data into two or more homogeneous sets based on the most significant feature in the data. The decision tree is ‘grown’ from data by continually splitting the data into as many homogeneous sets as possible until a stopping criterion is reached.

Decision trees are best suited for problems where the dependent variables are categorical or the relationship between the dependent and independent variable is non-linear. Although decision trees are best suited for classification problems, they can also be applied to regression problems as well.

Before we go on let me mention some important terms associated with decision trees:
• Root node: This is the population or data we use to build the decision tree. This data gets subdivided into two or identical more sets.
• Splitting: This is the act of dividing a node into two or more sub-nodes.
• Pruning: The process of removing sub-nodes from a tree.
• Leaf/Terminal Node: This is a node that is not split on. These nodes are used for making predictions about our data.
• Decision Node: This is a sub-node that is split into two or more sub-nodes.
 Decision Tree Terminology
CRITERIA FOR SPLITTING
How do we decide the best feature to split on? The best feature is the feature that when split on gives the lowest classification error. This split results in the ‘purest’ or most ‘homogenous’ subset(s) of the data. There are various ways of measuring how ‘pure’ a split is. Some of them are
• Gini index: In this case we choose the feature which when split on results in a higher Gini score.
• Entropy: We choose the splitting feature as the one that results in the lowest entropy for a split.

STOPPING CRITERIA
Now that we know how to split our data. The next question is when do we stop splitting? There are three primary conditions:
1. When a feature split results in a homogeneous split. In this case we stop growing the tree because all the data points for that particular subset belong to the same target class so we don’t need to split that node.
2. There are no more features to split on.
3.  If the number of data-points in a node is too small. Making a split on such a node may result in predictions that may not be reliable. A good number is somewhere between 10 (for small data-sets) and 100 (for large data-sets).

MAKING PREDICTIONS
When a decision tree is fully ‘grown’ how do we make predictions with it? For a particular input xwe traverse the grown tree using features of that input until we reach a leaf/terminal node where we make a prediction. The prediction at a leaf node is usually the majority class of data points in that leaf (for classification tasks) or the mean or median of the target class for data points in that leaf (for regression tasks).