Welcome to Plain Data: IMPLEMENTING NEAREST NEIGHBOURS IN PYTHON

Hello, welcome to my blog. In the last post I introduced the concept of nearest neighbours and how it can be used to for either prediction or classification. In case you have not read it, you can read it here.

In this post, I will show how to implement nearest neighbours in Python. This time it will be a little different because I will use my own code instead of a library (like I did in other posts). sklearn (a Python library) provides an implementation of nearest neighbours but I think it better if I implemented it myself so I can explain what is really happening in the program. I will give the full program at the end of the post. Here, I will just show relevant screenshots of parts of the program.

For this implementation I will be using a smaller version of the King County dataset – it contains about 8,000 rows compared to the previous version that had roughly 19,000 rows. Another thing - this dataset is not in a CSV format like the others I have talked about. It’s in an SFrame which is another very good data structure for loading and working with datasets. For more info on SFrames click here. Here is a list of the Python libraries I will be using:

sframe – for loading the dataset
matplotlib – for visualization
numpy – for array and mathematical computation.

First I used the ‘SFrame’ function the from the sframe library to load the dataset. Next, I randomly split the data into three parts. The ‘seed’ argument ensures that the dataset is ‘randomly’ split the same way every time I run the program. One part for training, validation and testing respectively. Before we can start implementing KNN, the dataset needs to be converted to a numpy array so we can efficiently compute the distances between data points. I also normalized the features so that large features like ‘sqft_living’ do not exert more influence on the distance compared to smaller features like ‘bedrooms’.

COMPUTING DISTANCE

Nearest neighbours predicts or classifies based on distance between observations. Here is a function that computes the distance between two points

Remember that long formula from the previous post? This function does exactly the same thing. It accepts two data points x & y as arguments and returns the Euclidean distance between them.

MAKING PREDICTIONS

Next, how do we make predictions? We can do this by taking a house from the query set and predicting the price of that house to be the average (mean) of the prices of the k closest houses in the training set to it. Remember k is simply a placeholder for any whole positive number you choose. For example, we can base our predictions on the average price of the 3 closest houses in the training set for any house in the query set. The function that does that is shown below

First, the function computes the distance between every house in the training set and query set. These distances are in the numpy array ‘myDistances’. Next, I used the np.argsort function to compute the sorted indices for the array (i.e. the indices which would sort the array in ascending order). Then I select the first k closest indices based on the value of k specified in the function call and compute the mean price of these houses. Finally, the predicted price is appended to a list which is converted to a numpy array before it is returned.

FINDING THE BEST K

To find the best k, I tested the function above using different values of k ranging from 1 to 15 using the validation set as my query set. In order to aid the selection of k, I made a plot of the RSS for each value of k against k. The plot is shown below

From the plot, it is clear that k=8 has the lowest RSS on the validation data. On the test data, using k=8 gave an RSS value 1.33*10¹⁴.

CONCLUSION

In this post, I have shown how to implement nearest neighbours in Python. Hope you enjoyed reading this post. I really appreciate feedback and I would like to know what you think about my posts. Feel free to leave a comment if there is anything you would like to say. Hope to see you soon.

Full code for the program

import sframe, numpy as np #importing required libraries

#Load the dataset

sales = sframe.SFrame('kc_house_data_small.gl/')

#Split into training, validation and test sets

(train_and_validation, test) = sales.random_split(.8, seed=1)

(train, validation) = train_and_validation.random_split(.8, seed=1)

#code to convert an SFrame to a numpy array

def get_numpy_data(data_sframe, features, output):

data_sframe['constant'] = 1 #add a constant column to the SFrame

features = ['constant'] + features

features_sframe = data_sframe[features] #select out features specified by feature list

#convert features_sframe to a numpy matrix

feature_matrix = features_sframe.to_numpy()

#assign output_sarray to the target output

output_sarray = data_sframe[output]

#convert sarray to a numpy array

output_array = output_sarray.to_numpy()

return (feature_matrix, output_array)

def normalize_features(feature_matrix):

norms = np.linalg.norm(feature_matrix, axis=0)

normalized_features = feature_matrix / norms

return (normalized_features, norms)

feature_list = ['bedrooms',

'bathrooms',

'sqft_living',

'sqft_lot',

'floors',

'waterfront',

'view',

'condition',

'grade',

'sqft_above',

'sqft_basement',

'yr_built',

'yr_renovated',

'lat',

'long',

'sqft_living15',

'sqft_lot15']

#Make numpy array of our features and the output

features_train, output_train = get_numpy_data(train, feature_list, 'price')

features_test, output_test = get_numpy_data(test, feature_list, 'price')

features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')

features_train, norms = normalize_features(features_train) # normalize training set features (columns)

features_test = features_test / norms # normalize test set by training set norms

features_valid = features_valid / norms # normalize validation set by training set norms

def multPredFromKNN(k, features_train, output_train, query_set):

priceList = []

for i in xrange(query_set.shape[0]):

myDiff = features_train[0:len(features_train)] - query_set[i]

myDistances = np.sqrt(np.sum(myDiff**2, axis=1))

sortDist = np.argsort(myDistances) #not really sorted

sortInd = sortDist[0:k]

pred = np.mean(output_train[sortInd])

priceList.append(pred)

return np.array(priceList)

#function to compute the RSS

def get_RSS(predictions, output):

resid = output - predictions

temp = resid**2

RSS = sum(temp)

return RSS

#Find the best the value of k in a range of values

def find_best_k(query_set):

all_RSS = [] #list to store RSS values

for k in xrange(1, 16):

myPred = multPredFromKNN(k, features_train, output_train, query_set)

tempRSS = get_RSS(myPred, output_valid)

all_RSS.append(tempRSS)

return np.array(all_RSS)

#Make plot to guide choice of k

import matplotlib.pyplot as plt

%matplotlib inline

kvals = range(1, 16)

plt.plot(kvals, RSS_list,'bo-')

plt.xlabel('Values of k')

plt.ylabel('RSS on validation data')

#Best k = 8

#Use k = 8 for prediction on the test set

KNN_8 = multPredFromKNN(8, features_train, output_train, features_test)

#Use RSS function to get RSS value for k = 8

get_RSS(KNN_8, output_test)

Welcome to Plain Data

Thursday, 10 March 2016

IMPLEMENTING NEAREST NEIGHBOURS IN PYTHON

No comments:

Post a Comment

Search This Blog

Blog Archive

About Me

Popular Posts

Translate