k-Nearest Neighbor Algorithm in Python
The k-Nearest Neighbor (k-NN) algorithm is a powerful and straightforward machine learning technique for classification and regression problems. It makes predictions by finding the most similar samples in the training data. This article explains k-NN implementation in Python using scikit-learn with practical examples.
What is k-Nearest Neighbor Algorithm?
The k-Nearest Neighbor algorithm is a supervised machine learning technique that works on the principle that similar instances often produce similar results. Given a new input, the algorithm finds the k closest training examples and determines the prediction based on majority voting (classification) or averaging (regression) of those neighbors.
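To make the majority-voting principle concrete before turning to scikit-learn, here is a minimal from-scratch sketch using NumPy on a tiny toy dataset (the data and the `knn_predict` helper are illustrative, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters labeled 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.1, 0.9])))  # → 0
print(knn_predict(X, y, np.array([5.1, 5.0])))  # → 1
```

A point near the first cluster is assigned label 0 because its three nearest neighbors all carry that label; scikit-learn's classes perform the same computation with optimized neighbor search.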
Syntax
Python's scikit-learn library provides two main classes for k-NN implementation:
For Classification:
from sklearn.neighbors import KNeighborsClassifier

# Create k-NN classifier
knn = KNeighborsClassifier(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
predictions = knn.predict(X_test)
For Regression:
from sklearn.neighbors import KNeighborsRegressor

# Create k-NN regressor
knn = KNeighborsRegressor(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
predictions = knn.predict(X_test)
Where k represents the number of neighbors, X_train and y_train are training features and labels, and X_test is the new data for predictions.
k-NN Classification Example
Let's implement k-NN classification using the Iris dataset to predict flower species:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Create k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the classifier
knn.fit(X_train, y_train)
# Make predictions
predictions = knn.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("Predicted classes:", predictions)
Accuracy: 1.0
Predicted classes: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
The classifier achieved perfect accuracy (100%) on the test set, correctly identifying all iris flower species.
k-NN Regression Example
For regression, let's use the California housing dataset to predict median house values:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load California housing dataset
housing = fetch_california_housing()
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Create k-NN regressor with k=5
knn = KNeighborsRegressor(n_neighbors=5)
# Train the regressor
knn.fit(X_train, y_train)
# Make predictions
predictions = knn.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", round(mse, 4))
print("R² Score:", round(r2, 4))
print("Sample predictions:", predictions[:5])
Mean Squared Error: 0.5983
R² Score: 0.5452
Sample predictions: [2.086 1.506 3.554 3.224 1.358]
The regressor shows reasonable performance with an R² score of 0.5452, meaning it explains about 54% of the variance in median house values on the test set.
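Because k-NN predictions depend entirely on distance computations, features with large numeric ranges (such as population) can dominate features with small ranges (such as median income). Standardizing the features before fitting usually improves results; the sketch below wraps the same regressor in a scikit-learn pipeline with StandardScaler (the exact score will depend on the split, so treat the improvement as typical rather than guaranteed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Same data and split as the regression example above
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Pipeline: standardize each feature to zero mean and unit variance,
# then apply k-NN regression on the scaled features
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)

r2_scaled = r2_score(y_test, model.predict(X_test))
print("R² with scaling:", round(r2_scaled, 4))
```

Using a pipeline ensures the scaler is fit only on the training data, so no information from the test set leaks into the scaling step.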
Key Parameters
Important parameters of the k-NN classes in scikit-learn:
- n_neighbors (k): Number of neighbors to consider. Higher values reduce overfitting but may underfit.
- weights: 'uniform' (default) or 'distance' for distance-weighted predictions.
- metric: Distance metric such as 'euclidean', 'manhattan', or 'minkowski'.
- algorithm: Algorithm for neighbor search: 'auto', 'ball_tree', 'kd_tree', or 'brute'.
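As a quick sketch of how these parameters are passed, the classifier below uses distance-weighted voting with the Manhattan metric on the same Iris split used earlier (the specific parameter choices here are illustrative, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Same Iris split as the classification example above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Closer neighbors get larger voting weight; distances are Manhattan (L1)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')
knn.fit(X_train, y_train)

acc = accuracy_score(y_test, knn.predict(X_test))
print("Accuracy:", acc)
```

With `weights='distance'`, ties among neighbors are rare because each neighbor's vote is scaled by the inverse of its distance to the query point.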
Choosing the Right k Value
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Test different k values
k_values = range(1, 21)
cv_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
cv_scores.append(scores.mean())
# Find best k
best_k = k_values[np.argmax(cv_scores)]
print(f"Best k value: {best_k}")
print(f"Best cross-validation score: {max(cv_scores):.4f}")
Best k value: 13
Best cross-validation score: 0.9667
Conclusion
The k-Nearest Neighbor algorithm is a simple yet effective machine learning technique for both classification and regression tasks. For best results, choose k via cross-validation and preprocess your data, in particular by scaling features, since k-NN relies on distance computations.
