What are the benefits of k-NN Algorithms?

Data Mining Database Data Structure

A k-nearest-neighbors algorithm is a classification approach that does not create assumptions about the structure of the relationship among the class membership (Y) and the predictors X₁, X₂,…. X_n.

This is a nonparametric approach because it does not contain the estimation of parameters in a pretended function form, including the linear form pretended in linear regression. This method draws data from similarities among the predictor values of the data in the dataset.

The benefit of k-NN methods is their integrity and the need for parametric assumptions. In the presence of a huge training set, these approaches perform especially well, when each class is featured by several combinations of predictor values.

For example, in real-estate databases, there are likely to be several sets of {home type, number of rooms, neighborhood, asking price, etc.} that characterize homes that sell fastly vs. those that remain for a high period in the industry.

There are three difficulties with the realistic exploitation of the power of the k-NN method.

Although no time is needed to calculate parameters from the training data (as would be the case for parametric models including regression), the time to discover the nearest neighbors in a huge training set can be restrictive. Multiple concepts have been implemented to overcome this difficulty. The main concept is as follows −

It can reduce the time taken to compute distances by working in a reduced dimension using dimension reduction techniques such as principal components analysis.
It can use sophisticated data structures such as search trees to speed up the identification of the nearest neighbor. This method often settles for an “almost nearest” neighbor to enhance speed. An instance is using bucketing, where the data are combined into buckets so that data inside each bucket are near to each other.

The multiple data needed in the training set to qualify as large increases exponentially with the multiple predictor’s p. This is because the expected distance to the nearest neighbor goes up badly with p unless the amount of the training set boosts exponentially with p. This phenomenon is called the curse of dimensionality, a fundamental problem pertinent to some classification, prediction, and clustering approaches.

k-NN is a “lazy learner” − The time-consuming computation is delayed to the time of prediction. For each data to be predicted, it can calculate its distances from the complete set of training data only at the time of prediction. This behavior constraints using this algorithm for real-time prediction of multiple data simultaneously.

Ginni

Updated on: 10-Feb-2022

173 Views

Kickstart Your Career

Get certified by completing the course

Get Started