- Scikit Learn Tutorial
- Scikit Learn - Home
- Scikit Learn - Introduction
- Scikit Learn - Modelling Process
- Scikit Learn - Data Representation
- Scikit Learn - Estimator API
- Scikit Learn - Conventions
- Scikit Learn - Linear Modeling
- Scikit Learn - Extended Linear Modeling
- Stochastic Gradient Descent
- Scikit Learn - Support Vector Machines
- Scikit Learn - Anomaly Detection
- Scikit Learn - K-Nearest Neighbors
- Scikit Learn - KNN Learning
- Classification with Naïve Bayes
- Scikit Learn - Decision Trees
- Randomized Decision Trees
- Scikit Learn - Boosting Methods
- Scikit Learn - Clustering Methods
- Clustering Performance Evaluation
- Dimensionality Reduction using PCA
- Scikit Learn Useful Resources
- Scikit Learn - Quick Guide
- Scikit Learn - Useful Resources
- Scikit Learn - Discussion

- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who

# Scikit Learn - KNN Learning

k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption for the underlying data distribution i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that for the purpose of model generation, it does not require any training data points and whole training data is used in the testing phase.

The k-NN algorithm consist of the following two steps −

### Step 1

In this step, it computes and stores the k nearest neighbors for each sample in the training set.

### Step 2

In this step, for an unlabeled sample, it retrieves the k nearest neighbors from dataset. Then among these k-nearest neighbors, it predicts the class through voting (class with majority votes wins).

The module, **sklearn.neighbors** that implements the k-nearest neighbors algorithm, provides the functionality for **unsupervised** as well as **supervised** neighbors-based learning methods.

The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, which is discussed above, and the foundation of many algorithms (KNN and K-means being the famous one) which require the neighbor search. In simple words, it is Unsupervised learner for implementing neighbor searches.

On the other hand, the supervised neighbors-based learning is used for classification as well as regression.

## Unsupervised KNN Learning

As discussed, there exist many algorithms like KNN and K-Means that requires nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own “learner”. The reason behind making neighbor search as a separate learner is that computing all pairwise distance for finding a nearest neighbor is obviously not very efficient. Let’s see the module used by Sklearn to implement unsupervised nearest neighbor learning along with example.

### Scikit-learn module

**sklearn.neighbors.NearestNeighbors** is the module used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.

### Parameters

Followings table consist the parameters used by **NearestNeighbors** module −

Sr.No | Parameter & Description |
---|---|

1 |
The number of neighbors to get. The default value is 5. |

2 |
It limits the distance of neighbors to returns. The default value is 1.0. |

3 |
This parameter will take the algorithm (BallTree, KDTree or Brute-force) you want to use to compute the nearest neighbors. If you will provide ‘auto’, it will attempt to decide the most appropriate algorithm based on the values passed to fit method. |

4 |
It can affect the speed of the construction & query as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30. |

5 |
It is the metric to use for distance computation between points. We can pass it as a string or callable function. In case of callable function, the metric is called on each pair of rows and the resulting value is recorded. It is less efficient than passing the metric name as a string. We can choose from metric from scikit-learn or scipy.spatial.distance. the valid values are as follows − Scikit-learn − [‘cosine’,’manhattan’,‘Euclidean’, ‘l1’,’l2’, ‘cityblock’] Scipy.spatial.distance − [‘braycurtis’,‘canberra’,‘chebyshev’,‘dice’,‘hamming’,‘jaccard’, ‘correlation’,‘kulsinski’,‘mahalanobis’,‘minkowski’,‘rogerstanimoto’,‘russellrao’, ‘sokalmicheme’,’sokalsneath’, ‘seuclidean’, ‘sqeuclidean’, ‘yule’]. The default metric is ‘Minkowski’. |

6 |
It is the parameter for the Minkowski metric. The default value is 2 which is equivalent to using Euclidean_distance(l2). |

7 |
This is the additional keyword arguments for the metric function. The default value is None. |

8 |
It reprsetst the numer of parallel jobs to run for neighbor search. The default value is None. |

**Implementation Example**

The example below will find the nearest neighbors between two sets of data by using the **sklearn.neighbors.NearestNeighbors** module.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors import numpy as np

Now, after importing the packages, define the sets of data in between we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])

Next, apply the unsupervised learning algorithm, as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with input data set.

nrst_neigh.fit(Input_data)

Now, find the K-neighbors of data set. It will return the indices and distances of the neighbors of each point.

distances, indices = nbrs.kneighbors(Input_data) indices

**Output**

array( [ [0, 1, 3], [1, 2, 0], [2, 1, 0], [3, 4, 0], [4, 5, 3], [5, 6, 4], [6, 5, 4] ], dtype = int64 ) distances

**Output**

array( [ [0. , 1.41421356, 2.23606798], [0. , 1.41421356, 1.41421356], [0. , 1.41421356, 2.82842712], [0. , 1.41421356, 2.23606798], [0. , 1.41421356, 1.41421356], [0. , 1.41421356, 1.41421356], [0. , 1.41421356, 2.82842712] ] )

The above output shows that the nearest neighbor of each point is the point itself i.e. at zero. It is because the query set matches the training set.

**Example**

We can also show a connection between neighboring points by producing a sparse graph as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

**Output**

array( [ [1., 1., 0., 1., 0., 0., 0.], [1., 1., 1., 0., 0., 0., 0.], [1., 1., 1., 0., 0., 0., 0.], [1., 0., 0., 1., 1., 0., 0.], [0., 0., 0., 1., 1., 1., 0.], [0., 0., 0., 0., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1.] ] )

Once we fit the unsupervised **NearestNeighbors** model, the data will be stored in a data structure based on the value set for the argument **‘algorithm’**. After that we can use this unsupervised learner’s **kneighbors** in a model which requires neighbor searches.

**Complete working/executable program**

from sklearn.neighbors import NearestNeighbors import numpy as np Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]]) nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm='ball_tree') nrst_neigh.fit(Input_data) distances, indices = nbrs.kneighbors(Input_data) indices distances nrst_neigh.kneighbors_graph(Input_data).toarray()

## Supervised KNN Learning

The supervised neighbors-based learning is used for following −

- Classification, for the data with discrete labels
- Regression, for the data with continuous labels.

### Nearest Neighbor Classifier

We can understand Neighbors-based classification with the help of following two characteristics −

- It is computed from a simple majority vote of the nearest neighbors of each point.
- It simply stores instances of the training data, that’s why it is a type of non-generalizing learning.

### Scikit-learn modules

Followings are the two different types of nearest neighbor classifiers used by scikit-learn −

S.No. | Classifiers & Description |
---|---|

1. | KNeighborsClassifier
The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k is dependent on data. |

2. | RadiusNeighborsClassifier
The Radius in the name of this classifier represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence as the name suggests, this classifier implements learning based on the number neighbors within a fixed radius r of each training point. |

## Nearest Neighbor Regressor

It is used in the cases where data labels are continuous in nature. The assigned data labels are computed on the basis on the mean of the labels of its nearest neighbors.

Followings are the two different types of nearest neighbor regressors used by scikit-learn −

## KNeighborsRegressor

The K in the name of this regressor represents the k nearest neighbors, where **k** is an **integer value** specified by the user. Hence, as the name suggests, this regressor implements learning based on the k nearest neighbors. The choice of the value of k is dependent on data. Let’s understand it more with the help of an implementation example.

Followings are the two different types of nearest neighbor regressors used by scikit-learn −

### Implementation Example

In this example, we will be implementing KNN on data set named Iris Flower data set by using scikit-learn **KNeighborsRegressor**.

First, import the iris dataset as follows −

from sklearn.datasets import load_iris iris = load_iris()

Now, we need to split the data into training and testing data. We will be using Sklearn **train_test_split** function to split the data into the ratio of 70 (training data) and 20 (testing data) −

X = iris.data[:, :4] y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test)

Next, import the **KNeighborsRegressor** class from Sklearn and provide the value of neighbors as follows.

### Example

import numpy as np from sklearn.neighbors import KNeighborsRegressor knnr = KNeighborsRegressor(n_neighbors = 8) knnr.fit(X_train, y_train)

### Output

KNeighborsRegressor( algorithm = 'auto', leaf_size = 30, metric = 'minkowski', metric_params = None, n_jobs = None, n_neighbors = 8, p = 2, weights = 'uniform' )

### Example

Now, we can find the MSE (Mean Squared Error) as follows −

print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean()))

### Output

The MSE is: 4.4333349609375

### Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]] y = [0, 0, 1, 1] from sklearn.neighbors import KNeighborsRegressor knnr = KNeighborsRegressor(n_neighbors = 3) knnr.fit(X, y) print(knnr.predict([[2.5]]))

### Output

[0.66666667]

### Complete working/executable program

from sklearn.datasets import load_iris iris = load_iris() X = iris.data[:, :4] y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) import numpy as np from sklearn.neighbors import KNeighborsRegressor knnr = KNeighborsRegressor(n_neighbors=8) knnr.fit(X_train, y_train) print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean())) X = [[0], [1], [2], [3]] y = [0, 0, 1, 1] from sklearn.neighbors import KNeighborsRegressor knnr = KNeighborsRegressor(n_neighbors=3) knnr.fit(X, y) print(knnr.predict([[2.5]]))

## RadiusNeighborsRegressor

The Radius in the name of this regressor represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence as the name suggests, this regressor implements learning based on the number neighbors within a fixed radius r of each training point. Let’s understand it more with the help if an implementation example −

### Implementation Example

In this example, we will be implementing KNN on data set named Iris Flower data set by using scikit-learn **RadiusNeighborsRegressor** −

First, import the iris dataset as follows −

from sklearn.datasets import load_iris iris = load_iris()

Now, we need to split the data into training and testing data. We will be using Sklearn train_test_split function to split the data into the ratio of 70 (training data) and 20 (testing data) −

X = iris.data[:, :4] y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test)

Next, import the **RadiusneighborsRegressor** class from Sklearn and provide the value of radius as follows −

import numpy as np from sklearn.neighbors import RadiusNeighborsRegressor knnr_r = RadiusNeighborsRegressor(radius=1) knnr_r.fit(X_train, y_train)

### Example

Now, we can find the MSE (Mean Squared Error) as follows −

print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))

### Output

The MSE is: The MSE is: 5.666666666666667

### Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]] y = [0, 0, 1, 1] from sklearn.neighbors import RadiusNeighborsRegressor knnr_r = RadiusNeighborsRegressor(radius=1) knnr_r.fit(X, y) print(knnr_r.predict([[2.5]]))

### Output

[1.]

### Complete working/executable program

from sklearn.datasets import load_iris iris = load_iris() X = iris.data[:, :4] y = iris.target from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20) from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit(X_train) X_train = scaler.transform(X_train) X_test = scaler.transform(X_test) import numpy as np from sklearn.neighbors import RadiusNeighborsRegressor knnr_r = RadiusNeighborsRegressor(radius = 1) knnr_r.fit(X_train, y_train) print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean())) X = [[0], [1], [2], [3]] y = [0, 0, 1, 1] from sklearn.neighbors import RadiusNeighborsRegressor knnr_r = RadiusNeighborsRegressor(radius = 1) knnr_r.fit(X, y) print(knnr_r.predict([[2.5]]))