Hello,
welcome to my blog. In the last post I introduced the concept of nearest
neighbours and how it can be used to for either prediction or classification.
In case you have not read it, you can read it here. 
In
this post, I will show how to implement nearest neighbours in Python. This time
it will be a little different because I will use my own code instead of a
library (like I did in other posts). sklearn (a Python library) provides an
implementation of nearest neighbours but I think it better if I implemented it myself
so I can explain what is really happening in the program. I will give the full program at the end of the post. Here, I will just show
relevant screenshots of parts of the program.
For this implementation I will be using a smaller version of
the King County dataset – it contains about 8,000 rows compared to the previous
version that had roughly 19,000 rows. Another thing - this dataset is not in a
CSV format like the others I have talked about. It’s in an SFrame which is
another very good data structure for loading and working with datasets. For
more info on SFrames click here. Here is a list of the Python libraries
I will be using:
- sframe – for loading the dataset
- matplotlib – for visualization
- numpy – for array and mathematical computation.
First I used the ‘SFrame’ function the from the sframe library
to load the dataset. Next, I randomly split the data into three parts. The ‘seed’
argument ensures that the dataset is ‘randomly’ split the same way every time I
run the program. One part for training, validation and testing respectively. Before
we can start implementing KNN, the dataset needs to be converted to a numpy
array so we can efficiently compute the distances between data points. I also normalized
the features so that large features like ‘sqft_living’ do not exert more
influence on the distance compared to smaller features like ‘bedrooms’. 
COMPUTING DISTANCE
Nearest neighbours predicts or classifies based on distance
between observations. Here is a function that computes the distance between two
points
Remember that long formula from the previous post? This
function does exactly the same thing. It accepts two data points x & y as
arguments and returns the Euclidean distance between them.
MAKING PREDICTIONS
Next, how do we make predictions? We can do this by taking a
house from the query set and predicting the price of that house to be the
average (mean) of the prices of the k closest
houses in the training set to it. Remember k
is simply a placeholder for any whole
positive number you choose. For example, we can base our predictions on the
average price of the 3 closest houses in the training set for any house in the
query set. The function that does that is shown below
First, the function computes the distance between every house
in the training set and query set. These distances are in the numpy array
‘myDistances’. Next, I used the np.argsort function to compute the sorted
indices for the array (i.e. the indices which would sort the array in ascending
order). Then I select the first k closest
indices based on the value of k specified
in the function call and compute the mean price of these houses. Finally, the
predicted price is appended to a list which is converted to a numpy array
before it is returned. 
FINDING THE BEST K
To find the best k,
I tested the function above using different values of k ranging from 1 to 15 using the validation set as my query set. In
order to aid the selection of k, I
made a plot of the RSS for each value of k
against k. The plot is shown below
From the plot, it is clear that k=8 has the lowest RSS on the validation data. On the test data,
using k=8 gave an RSS value 1.33*1014.
CONCLUSION
In this post, I have shown how to implement nearest neighbours
in Python. Hope you enjoyed reading this post. I really appreciate feedback and
I would like to know what you think about my posts. Feel free to leave a
comment if there is anything you would like to say. Hope to see you soon. 
Full code for the program
import sframe, numpy as np #importing required libraries
#Load the dataset
sales = sframe.SFrame('kc_house_data_small.gl/')
#Split into training, validation and test sets
(train_and_validation, test) = sales.random_split(.8, seed=1)
(train, validation) = train_and_validation.random_split(.8, seed=1)
#code to convert an SFrame to a numpy array
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 #add a constant column to the SFrame
    features = ['constant'] + features
    features_sframe = data_sframe[features] #select out features specified by feature list
    #convert features_sframe to a numpy matrix
    feature_matrix = features_sframe.to_numpy()
    #assign output_sarray to the target output
    output_sarray = data_sframe[output]
    #convert sarray to a numpy array
    output_array = output_sarray.to_numpy()
    return (feature_matrix, output_array)
def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0)
    normalized_features = feature_matrix / norms
    return (normalized_features, norms)
feature_list = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']
#Make numpy array of our features and the output
features_train, output_train = get_numpy_data(train, feature_list, 'price')
features_test, output_test = get_numpy_data(test, feature_list, 'price')
features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')
features_train, norms = normalize_features(features_train) # normalize training set features (columns)
features_test = features_test / norms # normalize test set by training set norms
features_valid = features_valid / norms # normalize validation set by training set norms
def multPredFromKNN(k, features_train, output_train, query_set):
    priceList = []
    for i in xrange(query_set.shape[0]):
        myDiff = features_train[0:len(features_train)] - query_set[i]
        myDistances = np.sqrt(np.sum(myDiff**2, axis=1))
        sortDist = np.argsort(myDistances) #not really sorted
        sortInd = sortDist[0:k]
        pred = np.mean(output_train[sortInd])
        priceList.append(pred)
    return np.array(priceList)
#function to compute the RSS
def get_RSS(predictions, output):
    resid = output - predictions
    temp = resid**2
    RSS = sum(temp)
    return RSS
#Find the best the value of k in a range of values
def find_best_k(query_set):
    all_RSS = [] #list to store RSS values
    for k in xrange(1, 16):
        myPred = multPredFromKNN(k, features_train, output_train, query_set)
        tempRSS = get_RSS(myPred, output_valid)
        all_RSS.append(tempRSS)
    return np.array(all_RSS)
#Make plot to guide choice of k
import matplotlib.pyplot as plt
%matplotlib inline
kvals = range(1, 16)
plt.plot(kvals, RSS_list,'bo-')
plt.xlabel('Values of k')
plt.ylabel('RSS on validation data')
#Best k = 8
#Use k = 8 for prediction on the test set
KNN_8 = multPredFromKNN(8, features_train, output_train, features_test)
#Use RSS function to get RSS value for k = 8
get_RSS(KNN_8, output_test)
 
No comments:
Post a Comment