Monday, 29 February 2016


Hello, welcome to my blog. In my previous posts I have talked extensively about linear regression and how it can be implemented in Python. Now, I want to talk about another popular technique in Machine Learning – Nearest Neighbours.

The basic idea behind nearest neighbours is that similar items tend to be close to each other. Just like the English saying ‘Birds of a feather flock together’, nearest neighbours uses this reasoning for either classification or prediction. For classification, a nearest neighbour classifier assigns an unlabelled example the class of its most similar labelled examples. A similar routine is followed for prediction, except that the target values of the nearest neighbours are averaged instead of voted on.

For our problem of predicting house prices, a nearest neighbour approach is very intuitive. If I have a house I want to sell, I would probably sell it at a price close to that of houses similar to mine. This pattern of reasoning applies to other items as well, e.g. cars. The method is called k-Nearest Neighbours (KNN) when we use more than one neighbour for prediction or classification, i.e. we consider the k closest observations to our test observation, where k can be any positive whole number.
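To make this concrete, here is a minimal sketch of KNN prediction in Python. The data, the feature choices and the function name are made up for illustration: it simply averages the sale prices of the k houses whose features are closest to the query house.

```python
def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k closest training points."""
    # Squared Euclidean distance is enough for ranking neighbours
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_X]
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical houses: features are (square feet / 100, bedrooms); targets are sale prices
X = [(10.0, 2), (15.0, 3), (16.0, 3), (30.0, 5)]
y = [200_000, 290_000, 310_000, 600_000]
print(knn_predict(X, y, (15.5, 3), k=3))  # ≈ 266666.67, the mean of the 3 closest prices
```

Note that with k=1 the prediction is just the price of the single most similar house; larger k smooths the prediction over more neighbours.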

The nearest neighbour approach is very simple to understand and implement, and surprisingly it is very powerful. This method is best used in cases where the relationship between the features and the target class is complicated and difficult to define, but items of the same class tend to be fairly homogeneous.

I said that nearest neighbours operates based on similarity or closeness between observations. How do we define similarity? A very popular distance metric is the Euclidean distance, which defines the distance between two observations xj and xq using the formula

dist(xj, xq) = √( (xj1 − xq1)² + (xj2 − xq2)² + … + (xjd − xqd)² )

where:
xj and xq are vectors containing the numeric feature values for the observations j and q
d is the number of features (or attributes)

The formula above may look complicated but really it’s not. It simply computes the differences between corresponding values in the two vectors, squares them, and sums the squares. Finally, the square root of this sum is computed and returned as the distance between xj and xq.
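As a quick illustration, the computation can be written in a few lines of Python (the example feature vectors below are made up):

```python
import math

def euclidean_distance(x_j, x_q):
    """Square each coordinate difference, sum them, take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_j, x_q)))

# Two hypothetical houses: (square feet / 100, bedrooms, bathrooms)
print(euclidean_distance([15.0, 3, 2], [12.0, 2, 1]))  # √(9 + 1 + 1) ≈ 3.3166
```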

Other popular distance metrics are:
  1. Manhattan distance
  2. Hamming distance
  3. Cosine similarity
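For illustration, each of these metrics can be written in a line or two of Python (the sample vectors below are made up):

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(manhattan([1, 2], [4, 6]))          # 7
print(hamming("ATG", "ACG"))              # 1
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Note that cosine similarity measures orientation rather than magnitude, so unlike the other two it grows as vectors become more alike, not less.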

Nearest neighbours has some notable strengths:
  1. Simple to implement and very effective
  2. Makes no assumptions about the data, unlike linear regression
  3. Fast training phase

Obviously this method also has some drawbacks. Some of them are
  1. Requires lots of data. Nearest neighbours works best when we have a considerable amount of data. This makes sense: it is easier to find close matches among 1,000 items than among 100.
  2. Nearest neighbours does not produce a model the way linear regression does. Therefore, it does not give us a concise description of the relationship between the features and the target class.

Despite these drawbacks, nearest neighbours has been used successfully in many areas, for example:
  1. Computer vision applications, including optical character recognition and face detection in images and video
  2. Predicting whether or not a person would like a movie recommended to him/her
  3. Identifying patterns in genetic data

In this post I have introduced the concept of using nearest neighbours for prediction and classification. In the next post I will show how to implement it in Python.

Thanks for reading this post. As always if you have any suggestion, remark or observation about what I wrote please feel free to comment about it. Have a wonderful week ahead.