Scikit Learn - KNN Learning

k-NN (k-Nearest Neighbors), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that it makes no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that no explicit model is built during training: the algorithm simply stores the training data points and defers the computation to the testing phase, where the whole training data is used.

The k-NN algorithm consists of the following two steps −

Step 1

In this step, it computes and stores the k nearest neighbors for each sample in the training set.

Step 2

In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).
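The two steps above can be sketched in a few lines of NumPy. This is only an illustrative brute-force version, not how scikit-learn implements the search; the function name knn_predict and the toy data are assumptions made for this sketch −

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of one query point by majority vote (hypothetical helper)."""
    # Euclidean distance from the query to every training sample
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # The most common label among those neighbors wins the vote
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated groups labeled 0 and 1
X_train = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                    [5.0, 5.0], [5.5, 4.5], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.2, 0.1]))
label_b = knn_predict(X_train, y_train, np.array([5.2, 5.1]))
print(label_a, label_b)
```

Because every query scans the whole training set, this sketch also shows why faster neighbor-search structures are worth having.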

The sklearn.neighbors module, which implements the k-nearest neighbors algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.

The unsupervised nearest neighbors learner implements different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, which is discussed above, and it is the foundation of many algorithms that require a neighbor search (KNN and K-means being the best-known ones). In simple words, it is an unsupervised learner for implementing neighbor searches.

On the other hand, the supervised neighbors-based learning is used for classification as well as regression.

Unsupervised KNN Learning

As discussed, there exist many algorithms like KNN and K-Means that require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own learner. The reason for making neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is not very efficient, so the faster search structures are provided once and can be reused. Let's see the class used by Sklearn to implement unsupervised nearest neighbor learning, along with an example.

Scikit-learn module

sklearn.neighbors.NearestNeighbors is the class used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.
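As a rough check of the "uniform interface" idea, the sketch below runs the same query through all three algorithms and compares the returned distances. The random data, the seed and the tolerance are assumptions chosen for this illustration −

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up data: 50 random points in the plane (seed fixed for repeatability)
rng = np.random.RandomState(0)
points = rng.rand(50, 2)

results = {}
for name in ("ball_tree", "kd_tree", "brute"):
    # Only the algorithm argument changes; fit and kneighbors are called identically
    nn = NearestNeighbors(n_neighbors=3, algorithm=name).fit(points)
    distances, _ = nn.kneighbors(points)
    results[name] = distances

# Compare with a small tolerance, since the algorithms may differ in rounding
same = all(np.allclose(results["brute"], results[name], atol=1e-6)
           for name in results)
print(same)
```

The three algorithms differ in how they organize the data and in speed, not in the neighbors they are meant to return.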

Parameters

The following table lists the parameters used by the NearestNeighbors class −

Sr.No	Parameter & Description
1	n_neighbors − int, optional The number of neighbors to get. The default value is 5.
2	radius − float, optional It limits the distance of the neighbors to return. The default value is 1.0.
3	algorithm − {auto, ball_tree, kd_tree, brute}, optional This parameter takes the algorithm (BallTree, KDTree or Brute-force) you want to use to compute the nearest neighbors. If you provide auto, it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.
4	leaf_size − int, optional It can affect the speed of the construction and query as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.
5	metric − string or callable It is the metric to use for distance computation between points. We can pass it as a string or a callable function. In the case of a callable function, the metric is called on each pair of rows and the resulting value is recorded. It is less efficient than passing the metric name as a string. We can choose a metric from scikit-learn or scipy.spatial.distance. The valid values are as follows − Scikit-learn − [cosine, manhattan, euclidean, l1, l2, cityblock] Scipy.spatial.distance − [braycurtis, canberra, chebyshev, dice, hamming, jaccard, correlation, kulsinski, mahalanobis, minkowski, rogerstanimoto, russellrao, sokalmichener, sokalsneath, seuclidean, sqeuclidean, yule]. The default metric is minkowski.
6	p − integer, optional It is the parameter for the Minkowski metric. The default value is 2, which is equivalent to using euclidean_distance (l2); p = 1 is equivalent to manhattan_distance (l1).
7	metric_params − dict, optional These are the additional keyword arguments for the metric function. The default value is None.
8	n_jobs − int or None, optional It represents the number of parallel jobs to run for the neighbor search. The default value is None.
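To see how the p parameter can change the result, the following sketch queries the same three points with the Euclidean (p = 2) and Manhattan (p = 1) forms of the default Minkowski metric. The points are made up for this illustration −

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up points: from the origin, (2, 2) is closer in Euclidean terms
# (about 2.83 versus 3), while (3, 0) is closer in Manhattan terms (3 versus 4)
points = np.array([[0.0, 0.0], [3.0, 0.0], [2.0, 2.0]])
query = np.array([[0.0, 0.0]])

euclid = NearestNeighbors(n_neighbors=2, p=2).fit(points)
manhat = NearestNeighbors(n_neighbors=2, p=1).fit(points)

# Column 0 is the query point itself (distance 0); column 1 is its nearest other point
_, idx_euclid = euclid.kneighbors(query)
_, idx_manhat = manhat.kneighbors(query)
print(idx_euclid, idx_manhat)
```

The choice of metric is therefore part of the model: the same data can yield different neighbors under different distance definitions.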

Implementation Example

The example below finds the nearest neighbors among a set of data points by using the sklearn.neighbors.NearestNeighbors class.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Now, after importing the packages, define the set of data points among which we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])

Next, apply the unsupervised learning algorithm, as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with input data set.

nrst_neigh.fit(Input_data)

Now, find the k nearest neighbors of each point in the data set. It will return the distances and indices of the neighbors of each point.

distances, indices = nrst_neigh.kneighbors(Input_data)
indices

Output

array(
   [
      [0, 1, 3],
      [1, 2, 0],
      [2, 1, 0],
      [3, 4, 0],
      [4, 5, 3],
      [5, 6, 4],
      [6, 5, 4]
   ], dtype = int64
)
distances

Output

array(
   [
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712],
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712]
   ]
)

The above output shows that the nearest neighbor of each point is the point itself, i.e. at distance zero. This is because the query set is the same as the training set.
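Two common ways of avoiding the trivial self-match are sketched below: calling kneighbors with no argument, in which case scikit-learn does not count a training point as its own neighbor, and querying with a point that is not in the training set. The query point [0, 1] is an assumption chosen for this illustration −

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(Input_data)

# No argument: neighbors of the training points, excluding each point itself
dist_no_self, idx_no_self = nrst_neigh.kneighbors()

# A new query point that is not part of the training data
new_point = np.array([[0, 1]])
dist_new, idx_new = nrst_neigh.kneighbors(new_point)

print(dist_no_self[:, 0])  # distance to the closest *other* point
print(idx_new, dist_new)
```

For the new point, the closest training samples are index 0 (distance 1) and index 3 (distance of about 1.41).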

Example

We can also show a connection between neighboring points by producing a sparse graph as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

Output

array(
   [
      [1., 1., 0., 1., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 0., 0., 1., 1., 0., 0.],
      [0., 0., 0., 1., 1., 1., 0.],
      [0., 0., 0., 0., 1., 1., 1.],
      [0., 0., 0., 0., 1., 1., 1.]
   ]
)

Once we fit the unsupervised NearestNeighbors model, the data is stored in a data structure determined by the value set for the algorithm argument. After that, we can use this unsupervised learner's kneighbors method in any model that requires neighbor searches.
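The neighbor graph can also carry distances instead of 0/1 connectivity flags. The sketch below passes mode='distance' to kneighbors_graph on the same data; the variable names dist_graph and dense are chosen for this illustration −

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(Input_data)

# mode='distance' stores the distance to each neighbor instead of a 1
dist_graph = nrst_neigh.kneighbors_graph(Input_data, mode='distance')

# In the dense view, a 0 means either "the point itself" or "not a neighbor"
dense = dist_graph.toarray()
print(dense.round(2))
```

A weighted graph of this kind is a typical input for methods that need to know how far apart neighboring points are, not just which points are neighbors.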

Complete working/executable program

from sklearn.neighbors import NearestNeighbors
import numpy as np
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm='ball_tree')
nrst_neigh.fit(Input_data)
distances, indices = nrst_neigh.kneighbors(Input_data)
indices
distances
nrst_neigh.kneighbors_graph(Input_data).toarray()

Supervised KNN Learning

Supervised neighbors-based learning is used for the following −

Classification, for the data with discrete labels
Regression, for the data with continuous labels.

Nearest Neighbor Classifier

We can understand neighbors-based classification with the help of the following two characteristics −

It is computed from a simple majority vote of the nearest neighbors of each point.
It simply stores instances of the training data, which is why it is a type of non-generalizing learning.

Scikit-learn modules

Following are the two different types of nearest neighbor classifiers used by scikit-learn −

S.No.	Classifiers & Description
1.	KNeighborsClassifier The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k depends on the data.
2.	RadiusNeighborsClassifier The Radius in the name of this classifier represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this classifier implements learning based on the neighbors within a fixed radius r of each training point.
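A minimal sketch of both classifiers on the Iris data set follows. The values n_neighbors=5, radius=1.5, random_state=0 and the stratified split are assumptions picked for this illustration, not recommended settings −

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=0, stratify=iris.target)

# Scale the features so that no single feature dominates the distance
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Vote among the 5 nearest training samples
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# Vote among all training samples within the radius; outlier_label supplies
# a fallback label for query points that have no neighbor inside the radius
rad_clf = RadiusNeighborsClassifier(
    radius=1.5, outlier_label='most_frequent').fit(X_train_s, y_train)

knn_acc = knn_clf.score(X_test_s, y_test)
rad_acc = rad_clf.score(X_test_s, y_test)
print("k-NN accuracy:", knn_acc)
print("Radius accuracy:", rad_acc)
```

The radius-based variant tends to suit data that is not uniformly sampled, since the number of voters adapts to the local density.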

Nearest Neighbor Regressor

It is used in cases where the data labels are continuous in nature. The label assigned to a query point is computed as the mean of the labels of its nearest neighbors.
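The "mean of the neighbors' labels" rule can be checked by hand, as in the sketch below. It assumes the default uniform weighting, and the toy data (y = x squared) and the query value 2.4 are made up for this illustration −

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-dimensional data with labels y = x ** 2
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
query = np.array([[2.4]])

# Look up the two nearest training samples and average their labels by hand
_, idx = reg.kneighbors(query)
manual = y[idx[0]].mean()

# The regressor's own prediction should match the manual average
predicted = reg.predict(query)[0]
print(manual, predicted)
```

Here the two nearest samples are x = 2 and x = 3, so both values come out as (4 + 9) / 2 = 6.5.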

Following are the two different types of nearest neighbor regressors used by scikit-learn −

KNeighborsRegressor

The K in the name of this regressor represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this regressor implements learning based on the k nearest neighbors. The choice of the value of k depends on the data. Let's understand it more with the help of an implementation example.

Implementation Example

In this example, we will implement KNN on the Iris Flower data set by using scikit-learn KNeighborsRegressor.

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data in the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the KNeighborsRegressor class from Sklearn and set the number of neighbors as follows.

Example

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 8)
knnr.fit(X_train, y_train)

Output

KNeighborsRegressor(
   algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 8, p = 2,
   weights = 'uniform'
)

Example

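Now, we can find the MSE (Mean Squared Error) on the held-out test data as follows −

print("The MSE is:", np.power(y_test - knnr.predict(X_test), 2).mean())

Output

The printed value depends on the random split made by train_test_split, so it changes from run to run unless random_state is fixed.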
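As a cross-check, the hand-written formula can be compared with scikit-learn's mean_squared_error helper. The sketch below repeats the workflow in a self-contained form; random_state=42 is an assumption added only to make the split repeatable −

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=42)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(X_train)
knnr = KNeighborsRegressor(n_neighbors=8).fit(scaler.transform(X_train), y_train)
y_pred = knnr.predict(scaler.transform(X_test))

# MSE is the mean of the squared differences between true and predicted labels
mse_manual = np.mean((y_test - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)
print("Manual MSE:", mse_manual)
print("sklearn MSE:", mse_sklearn)
```

Both values should agree, which confirms that the exponent in the manual formula is 2 and that the error is measured on the scaled test data.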

Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))

Output

[0.66666667]

Complete working/executable program

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=8)
knnr.fit(X_train, y_train)

print("The MSE is:", np.power(y_test - knnr.predict(X_test), 2).mean())

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))
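The separate scaling and fitting steps used above can also be chained into a single estimator. The following is a sketch using make_pipeline; the variable name model and random_state=0 are assumptions made for this illustration −

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=0)

# The pipeline fits the scaler on the training data only and reapplies
# the same transformation automatically whenever predict() is called
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=8))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred.shape)
```

Bundling the scaler with the regressor makes it harder to predict on unscaled data by accident, which matters for a distance-based method such as KNN.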

RadiusNeighborsRegressor

The Radius in the name of this regressor represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this regressor implements learning based on the neighbors within a fixed radius r of each training point. Let's understand it more with the help of an implementation example −

Implementation Example

In this example, we will implement radius-based neighbors regression on the Iris Flower data set by using scikit-learn RadiusNeighborsRegressor −

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data in the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the RadiusNeighborsRegressor class from Sklearn and provide the value of radius as follows −

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)

Example

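Now, we can find the MSE (Mean Squared Error) on the held-out test data as follows −

print("The MSE is:", np.power(y_test - knnr_r.predict(X_test), 2).mean())

Output

The printed value depends on the random split made by train_test_split, so it changes from run to run unless random_state is fixed.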

Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))

Output

[1.]
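In the toy example above, the query 2.5 has two training points (2 and 3) within radius 1, so the prediction is the mean of their labels. A query with no training point inside the radius behaves differently: to the best of my knowledge, recent scikit-learn versions issue a warning and predict NaN for it. The sketch below assumes that behavior and uses float labels; the far-away query value 10.0 is made up for this illustration −

```python
import warnings

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])  # float labels, so a missing value can be NaN

knnr_r = RadiusNeighborsRegressor(radius=1).fit(X, y)

# 2.5 has the training points 2 and 3 within the radius: mean of labels 1 and 1
near = knnr_r.predict([[2.5]])

# 10.0 has no training point within the radius; the warning is silenced here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    far = knnr_r.predict([[10.0]])

print(near, far)
```

When choosing the radius, it is therefore worth checking the predictions for missing values, for example with np.isnan.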

Complete working/executable program

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X_train, y_train)
print("The MSE is:", np.power(y_test - knnr_r.predict(X_test), 2).mean())
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))
