k-Nearest Neighbor Algorithm in Python


Introduction

The k-Nearest Neighbor (k-NN) algorithm is a powerful yet straightforward technique for solving classification and regression problems. It predicts the output for an input sample by examining how similar that sample is to the training examples. In this article we explain the k-NN technique and its implementation in Python using two different approaches. To ensure a clear understanding of this well-known technique, we offer step-by-step explanations complete with executable code and results.

k-Nearest Neighbor Algorithm

The k-Nearest Neighbour (k-NN) algorithm is a supervised machine learning (ML) technique used for classification and regression problems. It operates on the premise that similar instances tend to produce similar results. Given a new input, the algorithm locates the k closest training examples and predicts either the majority class among those neighbors (classification) or the average of their values (regression).
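To make this premise concrete, here is a minimal from-scratch sketch of the idea using NumPy. It is illustrative only (the knn_predict helper is our own, not part of any library); in practice you would use the scikit-learn classes shown below.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Classification: majority vote among the k nearest labels;
    # for regression you would return y_train[nearest].mean() instead
    return Counter(y_train[nearest]).most_common(1)[0][0]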

Syntax

The syntax for both tasks is similar. Here we will use the scikit-learn library, which provides ready-made implementations of the k-NN method in Python. Two classes are available depending on the user's needs: KNeighborsClassifier for classification tasks, and KNeighborsRegressor for predicting numerical values.

1. Classification −

from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=k)

# Train the classifier using the training data
knn.fit(X_train, y_train)

# Make predictions on new data
predictions = knn.predict(X_test)

2. Regression −

from sklearn.neighbors import KNeighborsRegressor

# Create an instance of the k-NN regressor
knn = KNeighborsRegressor(n_neighbors=k)

# Train the regressor using the training data
knn.fit(X_train, y_train)

# Make predictions on new data
predictions = knn.predict(X_test)

In the code above, k denotes the number of neighbors to take into account, X_train and y_train denote the training features and labels, and X_test denotes the new data on which predictions should be made.

Explanation of Syntax

  • The relevant class is imported from the sklearn.neighbors module.

  • We create an instance of the k-NN classifier or regressor, specifying the number of neighbors it should take into account.

  • We train the classifier or regressor on the training data with the fit() method.

  • Finally, we call the predict() method on the new data to generate predictions.
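Both classes also accept optional parameters beyond n_neighbors. A brief sketch of two that commonly matter (these are genuine scikit-learn parameters; the values shown are illustrative choices, not requirements):

from sklearn.neighbors import KNeighborsClassifier

# weights='distance' makes closer neighbors count more than distant ones
# (the default is 'uniform'); metric='minkowski' with p=2 is ordinary
# Euclidean distance, which is also the default behavior.
knn = KNeighborsClassifier(n_neighbors=5, weights='distance',
                           metric='minkowski', p=2)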

Algorithm

  • Step 1 − Load the Data: Read or load the dataset into your Python environment.

  • Step 2 − Split the Data: Divide the dataset into training and testing sets so that you can assess the algorithm's effectiveness.

  • Step 3 − Preprocess the Data: Carry out any necessary preparation steps, such as scaling or normalization, to ensure a consistent data representation (see the pipeline sketch after this list).

  • Step 4 − Train the k-NN Model: Create an instance of the k-NN classifier or regressor and fit it to the training data.

  • Step 5 − Evaluate the Model: Make predictions on the testing set and assess the model's performance using suitable metrics such as accuracy or mean squared error.
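Steps 3 and 4 are often combined in a scikit-learn Pipeline, which guarantees that the scaling learned from the training data is applied consistently to the test data. A minimal sketch, assuming X_train, y_train, and X_test have already been created as in Step 2:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale features to zero mean and unit variance before k-NN; scaling
# matters because k-NN decisions depend on raw distances between samples
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)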

Approach

  • Approach 1 − k-NN Classification Example

  • Approach 2 − k-NN Regression Example

Approach 1: k-NN Classification Example

Let's have a look at a real-world application of k-NN classification: identifying the species of iris flowers based on the dimensions of their sepals and petals. For this demonstration, we'll use the well-known Iris dataset.

Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create an instance of the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier using the training data
knn.fit(X_train, y_train)

# Make predictions on the testing set
predictions = knn.predict(X_test)

# Calculate and print the accuracy of the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Output

Accuracy: 1.0

In Approach 1, the Iris dataset is loaded, divided into training and testing sets, and a k-NN classifier instance with n_neighbors=3 is created.

Using the training set of data, we train the classifier, and then we make predictions using the testing set.

To determine the model's accuracy, we compare the predicted labels to the actual labels. In this instance, the output shows an accuracy of 1.0, or 100%, meaning the k-NN classifier identified the species of every iris flower in the testing set correctly.
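A perfect score on a single split can depend on the particular train/test division and the chosen k, so it is worth checking the choice of k with cross-validation. A sketch using GridSearchCV, continuing from the variables defined in the example above:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search several candidate values of k with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best k:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)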

Approach 2: k-NN Regression Example

As an example of regression, let's use the Boston Housing dataset to predict the median price of owner-occupied homes. For this task, we'll use the k-NN regressor.

Example

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset
# (note: load_boston requires scikit-learn < 1.2; see the note after this example)
boston = load_boston()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)

# Create an instance of the k-NN regressor
knn = KNeighborsRegressor(n_neighbors=5)

# Train the regressor using the training data
knn.fit(X_train, y_train)

# Make predictions on the testing set
predictions = knn.predict(X_test)

# Calculate and print the mean squared error of the model
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

Output

Mean Squared Error: 30.137858823529412

In Approach 2, the Boston Housing dataset is loaded, divided into training and testing sets, and a k-NN regressor instance with n_neighbors=5 is created.

We train the regressor using the training data and then make predictions on the testing set.

Finally, we compare the predicted and actual values to compute the model's mean squared error (MSE). The output shows an MSE of roughly 30.1379, which is the average squared difference between the predicted and true median home values in the testing set. A lower MSE indicates a more accurate regression model. The exact value depends on the n_neighbors parameter used to build the KNeighborsRegressor instance and on the random_state used to split the data into training and testing sets.
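One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the example above runs only on older versions. On current scikit-learn, the same workflow can be reproduced with the California Housing dataset; note that the resulting MSE will differ from the output shown above.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# The California Housing dataset is downloaded on first use
housing = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, predictions))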

Conclusion

The k-Nearest Neighbour (k-NN) algorithm is a flexible and popular machine learning method, especially helpful for problems involving classification and regression. This article covered the fundamentals of the k-NN technique, its syntax in Python, and detailed descriptions of how to implement it. We also looked at two approaches to using k-NN, one for classification and one for regression, complete with executable code and results. By understanding the k-NN method and its practical applications, you can use it as a powerful tool for a wide range of machine learning problems.

