Hello, welcome to my blog. In my previous posts I have talked
extensively about linear regression and how it can be implemented in Python.
Now, I want to talk about another popular technique in Machine Learning –
Nearest Neighbours.
The basic idea behind nearest neighbours is that similar
items tend to be close to each other. Just as the English saying goes, 'Birds of
a feather flock together', nearest neighbours applies this reasoning to either
classification or prediction. For classification, a nearest neighbour
classifier assigns an unlabelled example the class of its most similar
labelled examples. For prediction, the same routine is followed, except that the target
values of the most similar examples are combined, typically by averaging them.
For our problem of predicting house prices, a nearest
neighbour approach is very intuitive. If I have a house I want to sell, I would
probably sell it at a price close to that of houses similar to mine.
This pattern of reasoning applies to other items as well, e.g. cars.
The method is called k-Nearest Neighbours (KNN) because we consider the k closest
observations to our test observation when making a prediction or classification,
where k can be any positive whole number.
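To make this concrete, here is a minimal sketch of a k-nearest-neighbour price prediction in Python. The data and the feature choices (floor area and number of bedrooms) are made up purely for illustration, and the distance function used is the Euclidean distance discussed later in this post.

```python
import math

# Hypothetical training data: (floor area in square metres, bedrooms) -> selling price
houses = [
    ((70, 2), 150_000),
    ((85, 3), 180_000),
    ((120, 4), 260_000),
    ((60, 1), 120_000),
    ((95, 3), 200_000),
]

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(query, data, k=3):
    # Sort the training examples by distance to the query and keep the k closest
    nearest = sorted(data, key=lambda item: euclidean(query, item[0]))[:k]
    # For prediction we average the target values (prices) of the k neighbours
    return sum(price for _, price in nearest) / k

# Price estimate for a 90 square metre, 3-bedroom house from its 3 most similar houses
print(knn_predict((90, 3), houses))
```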
The nearest neighbour approach is very simple to
understand and implement, and yet it is surprisingly powerful. This method is
best used in cases where the relationship between the features and the target
class is complicated and hard to describe, but items of the same class tend to be
fairly homogeneous.
HOW DO WE DEFINE CLOSENESS OR SIMILARITY?
I said that nearest neighbours operates on the basis of similarity, or
closeness, between observations. So how do we define similarity? A very popular
distance metric is the Euclidean distance, which defines the distance between
two observations using the formula

dist(x_j, x_q) = √[(x_j1 − x_q1)² + (x_j2 − x_q2)² + ... + (x_jd − x_qd)²]
where:
- x_j and x_q are the vectors containing the numeric feature values of observations j and q
- d is the number of features (or attributes)
The formula above may look complicated, but really it is not.
It just computes the difference between corresponding values in the two vectors,
squares these differences and sums them up. Finally, the square root of that sum
is computed and returned as the distance between x_j and x_q.
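As a quick illustration, here is that computation with NumPy for two made-up feature vectors:

```python
import numpy as np

x_j = np.array([70.0, 2.0])   # hypothetical feature vector for observation j
x_q = np.array([85.0, 3.0])   # hypothetical feature vector for observation q

# Differences of corresponding values, squared, summed, then square-rooted
distance = np.sqrt(np.sum((x_j - x_q) ** 2))
print(distance)  # approximately 15.03
```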
Other popular distance metrics include the following (a short sketch of two of them follows the list):
- Manhattan distance
- Hamming distance
- Cosine similarity
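For comparison, here is a small sketch of how two of these alternatives could be computed for the same pair of made-up vectors used above:

```python
import numpy as np

x_j = np.array([70.0, 2.0])
x_q = np.array([85.0, 3.0])

# Manhattan distance: the sum of absolute differences between corresponding values
manhattan = np.sum(np.abs(x_j - x_q))  # |70 - 85| + |2 - 3| = 16

# Cosine similarity: the cosine of the angle between the two vectors
# (a similarity measure rather than a distance, so larger means more similar)
cosine = np.dot(x_j, x_q) / (np.linalg.norm(x_j) * np.linalg.norm(x_q))

print(manhattan, cosine)
```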
ADVANTAGES OF NEAREST NEIGHBOURS
- Simple to implement and very effective
- Makes no assumptions about the underlying data, unlike linear regression
- Fast training phase
Obviously, this method also has some drawbacks. Some of them are:
- Requires lots of data. Nearest neighbours works best when we have a considerable amount of data. This makes sense: it is easier to find similar items among 1,000 examples than among 100.
- Nearest neighbours does not produce a model the way linear regression does. Therefore, it does not give us a concise description of the relationship between the features and the target class.
POPULAR APPLICATIONS
- Computer vision applications, including optical character recognition and face detection in images and video
- Predicting whether or not a person would like a movie recommended to him/her
- Identifying patterns in genetic data
Summary
In this post I have introduced the concept of using nearest
neighbours for prediction and classification. In the next post I will show how
to implement it in Python.
Thanks for reading this post. As always, if you have any
suggestions, remarks or observations about what I wrote, please feel free to
comment. Have a wonderful week ahead.