# What is the K-nearest neighbors algorithm?

Data MiningDatabaseData Structure

A k-nearest-neighbors algorithm is a classification approach that does not create assumptions about the structure of the relationship among the class membership (Y) and the predictors X1, X2,…. Xn.

This is a nonparametric approach because it does not include estimation of parameters in a pretended function form, including the linear form pretended in linear regression. This approach draws data from similarities among the predictor values of the data in the dataset.

The concept in k-nearest-neighbors methods is to recognize k records in the training dataset that are the same as the new data that it is required to classify. It can use these similar (neighboring) records to define the new record into a class, creating the new data to the predominant class between these neighbors. It indicates the values of the predictors for this new record by X1, X2,…. Xn.

A central question is how to calculate the distance between data depending on their predictor values. The well-known measure of distance is the Euclidean distance. The Euclidean distance among two records (X1, X2,…. Xn) and (U1, U2,…. Un) is

$$\mathrm{\sqrt{(X_1-U_1)^2+(X_2-U_2)^2+...+(X_n-U_n)^2}}$$

The k-NN algorithm depends on several distance computations (between each data to be forecasted and each data in the training set), and hence the Euclidean distance, which is computationally inexpensive, is the most popular in k-NN.

It can balance the scales that the several predictors can have, in most cases, predictors must be standardized before calculating a Euclidean distance. The means and standard deviations that can standardize new data are those of the training data, and the new data is not involved in calculating them. The validation data, such as new data, are also not involved in this calculation.

After calculating the distances between the data to be defined and current records, it is required a rule to assign a class to the record to be classified, depending on the classes of its neighbors.

The simplest case is k = 1, where we look for the closest data (the nearest neighbor) and classify the new data as belonging to the equal class as its closest neighbor.

It is an extraordinary fact that this simple, perceptive concept of using a single nearest neighbor to classify records can be strong when we have multiple records in the training set. It changes out that the misclassification error of the 1-nearest neighbor design has a misclassification rate that is no more than twice the error when it can understand accurately the probability density functions for each class.