Thursday, 10 March 2016

IMPLEMENTING NEAREST NEIGHBOURS IN PYTHON

Hello, welcome to my blog. In the last post I introduced the concept of nearest neighbours and how it can be used to for either prediction or classification. In case you have not read it, you can read it here

In this post, I will show how to implement nearest neighbours in Python. This time it will be a little different because I will use my own code instead of a library (like I did in other posts). sklearn (a Python library) provides an implementation of nearest neighbours but I think it better if I implemented it myself so I can explain what is really happening in the program. I will give the full program at the end of the post. Here, I will just show relevant screenshots of parts of the program.

For this implementation I will be using a smaller version of the King County dataset – it contains about 8,000 rows compared to the previous version that had roughly 19,000 rows. Another thing - this dataset is not in a CSV format like the others I have talked about. It’s in an SFrame which is another very good data structure for loading and working with datasets. For more info on SFrames click here. Here is a list of the Python libraries I will be using:
  1. sframe – for loading the dataset
  2. matplotlib – for visualization
  3. numpy – for array and mathematical computation.
First I used the ‘SFrame’ function the from the sframe library to load the dataset. Next, I randomly split the data into three parts. The ‘seed’ argument ensures that the dataset is ‘randomly’ split the same way every time I run the program. One part for training, validation and testing respectively. Before we can start implementing KNN, the dataset needs to be converted to a numpy array so we can efficiently compute the distances between data points. I also normalized the features so that large features like ‘sqft_living’ do not exert more influence on the distance compared to smaller features like ‘bedrooms’. 

COMPUTING DISTANCE
Nearest neighbours predicts or classifies based on distance between observations. Here is a function that computes the distance between two points


Remember that long formula from the previous post? This function does exactly the same thing. It accepts two data points x & y as arguments and returns the Euclidean distance between them.

MAKING PREDICTIONS
Next, how do we make predictions? We can do this by taking a house from the query set and predicting the price of that house to be the average (mean) of the prices of the k closest houses in the training set to it. Remember k is simply a placeholder for any whole positive number you choose. For example, we can base our predictions on the average price of the 3 closest houses in the training set for any house in the query set. The function that does that is shown below




First, the function computes the distance between every house in the training set and query set. These distances are in the numpy array ‘myDistances’. Next, I used the np.argsort function to compute the sorted indices for the array (i.e. the indices which would sort the array in ascending order). Then I select the first k closest indices based on the value of k specified in the function call and compute the mean price of these houses. Finally, the predicted price is appended to a list which is converted to a numpy array before it is returned.

FINDING THE BEST K
To find the best k, I tested the function above using different values of k ranging from 1 to 15 using the validation set as my query set. In order to aid the selection of k, I made a plot of the RSS for each value of k against k. The plot is shown below




From the plot, it is clear that k=8 has the lowest RSS on the validation data. On the test data, using k=8 gave an RSS value 1.33*1014.

CONCLUSION
In this post, I have shown how to implement nearest neighbours in Python. Hope you enjoyed reading this post. I really appreciate feedback and I would like to know what you think about my posts. Feel free to leave a comment if there is anything you would like to say. Hope to see you soon. 


Full code for the program

import sframe, numpy as np #importing required libraries

#Load the dataset
sales = sframe.SFrame('kc_house_data_small.gl/')

#Split into training, validation and test sets
(train_and_validation, test) = sales.random_split(.8, seed=1)
(train, validation) = train_and_validation.random_split(.8, seed=1)

#code to convert an SFrame to a numpy array
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 #add a constant column to the SFrame
    features = ['constant'] + features
    features_sframe = data_sframe[features] #select out features specified by feature list
    #convert features_sframe to a numpy matrix
    feature_matrix = features_sframe.to_numpy()
    #assign output_sarray to the target output
    output_sarray = data_sframe[output]
    #convert sarray to a numpy array
    output_array = output_sarray.to_numpy()
    return (feature_matrix, output_array)

def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0)
    normalized_features = feature_matrix / norms
    return (normalized_features, norms)

feature_list = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']

#Make numpy array of our features and the output
features_train, output_train = get_numpy_data(train, feature_list, 'price')
features_test, output_test = get_numpy_data(test, feature_list, 'price')
features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')

features_train, norms = normalize_features(features_train) # normalize training set features (columns)
features_test = features_test / norms # normalize test set by training set norms
features_valid = features_valid / norms # normalize validation set by training set norms

def multPredFromKNN(k, features_train, output_train, query_set):
    priceList = []
    for i in xrange(query_set.shape[0]):
        myDiff = features_train[0:len(features_train)] - query_set[i]
        myDistances = np.sqrt(np.sum(myDiff**2, axis=1))
        sortDist = np.argsort(myDistances) #not really sorted
        sortInd = sortDist[0:k]
        pred = np.mean(output_train[sortInd])
        priceList.append(pred)
    return np.array(priceList)

#function to compute the RSS
def get_RSS(predictions, output):
    resid = output - predictions
    temp = resid**2
    RSS = sum(temp)
    return RSS

#Find the best the value of k in a range of values
def find_best_k(query_set):
    all_RSS = [] #list to store RSS values
    for k in xrange(1, 16):
        myPred = multPredFromKNN(k, features_train, output_train, query_set)
        tempRSS = get_RSS(myPred, output_valid)
        all_RSS.append(tempRSS)
    return np.array(all_RSS)

#Make plot to guide choice of k
import matplotlib.pyplot as plt
%matplotlib inline

kvals = range(1, 16)
plt.plot(kvals, RSS_list,'bo-')
plt.xlabel('Values of k')
plt.ylabel('RSS on validation data')

#Best k = 8
#Use k = 8 for prediction on the test set
KNN_8 = multPredFromKNN(8, features_train, output_train, features_test)

#Use RSS function to get RSS value for k = 8
get_RSS(KNN_8, output_test)