Classify Data Based On K-Nearest Neighbor Algorithm In Machine Learning

Introduction

K-Nearest Neighbour (KNN) is a basic classification algorithm in Machine Learning. It comes under supervised learning and is often used to solve classification problems in industry. It is widely used in pattern recognition, data mining, etc. It stores all the available cases from the training dataset and classifies new cases based on a distance function.

I will explain the KNN algorithm with the help of the Euclidean distance formula.

Euclidean Distance 

The Euclidean distance formula is used to measure the distance between two points in a plane. It is the most common way to compute the straight-line distance between two points.

Let's say (x1, y1) and (x2, y2) are two points in 2-dimensional space; the distance between them follows from the Pythagorean theorem.

Then, the Euclidean distance between (x1, y1) and (x2, y2) is,

d = √((x2 - x1)^2 + (y2 - y1)^2)

So, in short form, for two points p and q with n coordinates each,

d(p, q) = √((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

This general form is what the euclidean_distance function in the implementation below computes.
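For example, the distance between the points (1, 2) and (4, 6) is √((4 - 1)^2 + (6 - 2)^2) = √(9 + 16) = 5.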
Data Classification Based on the Euclidean Distance Formula

We have two different kinds of data, mentioned below.
  • Training Data

    This set of data contains the x and y values of each point together with its known classification. The classifier learns from these labeled points.

  • Test Data

    This set of data contains only the x and y values; it does not contain the classification type. Its classification will be predicted based on the training data, as illustrated by the sample files after this list.
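The implementation below reads both sets from plain CSV files. The attached files may contain different values; purely as an illustration of the expected format, training_data.csv could look like this (x, y, classification),

  1.0,1.0,A
  1.5,2.0,A
  2.0,1.5,A
  6.0,6.0,B
  6.5,7.0,B
  7.0,6.5,B

and test_data.csv would contain only the x and y values,

  1.2,1.8
  6.8,6.2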

Implementation
 
Import the libraries below.
  import csv
  import sys
  from collections import Counter
  from math import sqrt
Import the training data set.
  x = []
  y = []
  z = []
  with open('training_data.csv', 'rt') as f:
      reader = csv.reader(f)
      for row in reader:
          x.append(float(row[0]))
          y.append(float(row[1]))
          z.append(row[2])

  # Map each (x, y) training point to its class label
  coordinates = list(zip(x, y))
  input_data = {coordinates[i]: z[i] for i in range(len(coordinates))}
Import the test data set.
  test_x = []
  test_y = []
  with open('test_data.csv', 'rt') as f:
      reader = csv.reader(f)
      for row in reader:
          test_x.append(float(row[0]))
          test_y.append(float(row[1]))

  test_coordinates = list(zip(test_x, test_y))
  print(test_coordinates)
Define the Euclidean distance function.
  def euclidean_distance(x, y):
      if len(x) != len(y):
          return "Error: try equal length vectors"
      else:
          return sqrt(sum([(x[i] - y[i]) ** 2 for i in range(len(y))]))
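As a quick check of this helper (not part of the original listing), the call below should return 5.0.

  # Distance between (0, 0) and (3, 4)
  print(euclidean_distance((0, 0), (3, 4)))  # prints 5.0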
KNN classifier.
  def knn_classifier(neighbors, input_data):
      # Look up the label of each neighbour and return the majority vote
      knn = [input_data[i] for i in neighbors]
      knn = Counter(knn)
      classifier, _ = knn.most_common(1)[0]
      return classifier
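To see how the majority vote works, here is a minimal, hand-built illustration (the dictionary and points below are made up and are not part of the attached data); with two 'A' labels and one 'B' label among the neighbours, Counter picks 'A'.

  # Hypothetical lookup table mapping points to class labels
  sample_data = {(1.0, 1.0): 'A', (1.5, 2.0): 'A', (6.0, 6.0): 'B'}
  print(knn_classifier([(1.0, 1.0), (1.5, 2.0), (6.0, 6.0)], sample_data))  # prints A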
Generate the neighbours.
  def neighbors(k, trained_points, new_point):
      neighbor_distances = {}

      # Distance from the new point to every training point
      for point in trained_points:
          if point not in neighbor_distances:
              neighbor_distances[point] = euclidean_distance(point, new_point)

      # Sort the training points by distance, nearest first
      least_common = sorted(neighbor_distances.items(), key=lambda x: x[1])

      # Keep only the k nearest points (dropping their distances)
      k_nearest_neighbors = list(zip(*least_common[:k]))

      return list(k_nearest_neighbors[0])
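For example, assuming the illustrative training file sketched earlier, asking for the 3 nearest neighbours of the test point (1.2, 1.8) would return the three 'A' points, ordered from nearest to farthest (a hypothetical trace, not actual output).

  # neighbors(3, input_data.keys(), (1.2, 1.8))
  # -> [(1.5, 2.0), (1.0, 1.0), (2.0, 1.5)]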
Print Results.
  results = {}
  for item in test_coordinates:
      results[item] = knn_classifier(neighbors(3, input_data.keys(), item), input_data)

  print(results)
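With the illustrative CSV files sketched earlier (again, a hypothetical run, not the attached data), the printed dictionary would map each test point to its predicted class.

  # {(1.2, 1.8): 'A', (6.8, 6.2): 'B'}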
Output
 
 
 
Here, the x and y data have been classified into different groups based on their nearest training points. I have attached the zipped Python code. Python 3 or later is required to execute this code.
 
Conclusion

The K-Nearest Neighbor algorithm is an important supervised learning algorithm in Machine Learning. It classifies each new point based on the classes of its k nearest training points.
