k-Nearest Neighbor Algorithm in Python
The k-Nearest Neighbor (k-NN) algorithm is a powerful and straightforward machine learning technique for classification and regression problems. It makes predictions by finding the most similar samples in the training data. This article explains k-NN implementation in Python using scikit-learn with practical examples.
What is k-Nearest Neighbor Algorithm?
The k-Nearest Neighbor algorithm is a supervised machine learning technique that works on the principle that similar instances often produce similar results. Given a new input, the algorithm finds the k closest training examples and determines the prediction based on majority voting (classification) or averaging (regression) of those neighbors.
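To make the majority-voting principle concrete before turning to scikit-learn, here is a minimal from-scratch sketch using NumPy on a tiny toy dataset (the data and the `knn_predict` helper are illustrative, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters labeled 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.1, 0.9])))  # → 0
print(knn_predict(X, y, np.array([5.1, 5.0])))  # → 1
```

A point near the first cluster is assigned label 0 because its three nearest neighbors all carry that label; scikit-learn's classes perform the same computation with optimized neighbor search.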
Syntax
Python's scikit-learn library provides two main classes for k-NN implementation:
For Classification:
from sklearn.neighbors import KNeighborsClassifier

# Create k-NN classifier
knn = KNeighborsClassifier(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
predictions = knn.predict(X_test)
For Regression:
from sklearn.neighbors import KNeighborsRegressor

# Create k-NN regressor
knn = KNeighborsRegressor(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
predictions = knn.predict(X_test)
Where k represents the number of neighbors, X_train and y_train are training features and labels, and X_test is the new data for predictions.
k-NN Classification Example
Let's implement k-NN classification using the Iris dataset to predict flower species:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Create k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the classifier
knn.fit(X_train, y_train)
# Make predictions
predictions = knn.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("Predicted classes:", predictions)
Accuracy: 1.0
Predicted classes: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
The classifier achieved perfect accuracy (100%) on the test set, correctly identifying all iris flower species.
k-NN Regression Example
For regression, let's use the California housing dataset to predict median house values:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load California housing dataset
housing = fetch_california_housing()
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Create k-NN regressor with k=5
knn = KNeighborsRegressor(n_neighbors=5)
# Train the regressor
knn.fit(X_train, y_train)
# Make predictions
predictions = knn.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", round(mse, 4))
print("R² Score:", round(r2, 4))
print("Sample predictions:", predictions[:5])
Mean Squared Error: 0.5983
R² Score: 0.5452
Sample predictions: [2.086 1.506 3.554 3.224 1.358]
The regressor shows reasonable performance with an R² score of 0.5452, meaning it explains about 54% of the variance in median house values on the test set.
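Because k-NN predictions depend entirely on distance computations, features with large numeric ranges (such as population) can dominate features with small ranges (such as median income). Standardizing the features before fitting usually improves results; the sketch below wraps the same regressor in a scikit-learn pipeline with StandardScaler (the exact score will depend on the split, so treat the improvement as typical rather than guaranteed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Same data and split as the regression example above
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Pipeline: standardize each feature to zero mean and unit variance,
# then apply k-NN regression on the scaled features
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)

r2_scaled = r2_score(y_test, model.predict(X_test))
print("R² with scaling:", round(r2_scaled, 4))
```

Using a pipeline ensures the scaler is fit only on the training data, so no information from the test set leaks into the scaling step.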
Key Parameters
Important parameters of the k-NN classes in scikit-learn:
- n_neighbors (k): Number of neighbors to consider. Higher values reduce overfitting but may underfit.
- weights: 'uniform' (default) or 'distance' for distance-weighted predictions.
- metric: Distance metric such as 'euclidean', 'manhattan', or 'minkowski'.
- algorithm: Algorithm for neighbor search: 'auto', 'ball_tree', 'kd_tree', or 'brute'.
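As a quick sketch of how these parameters are passed, the classifier below uses distance-weighted voting with the Manhattan metric on the same Iris split used earlier (the specific parameter choices here are illustrative, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Same Iris split as the classification example above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Closer neighbors get larger voting weight; distances are Manhattan (L1)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')
knn.fit(X_train, y_train)

acc = accuracy_score(y_test, knn.predict(X_test))
print("Accuracy:", acc)
```

With `weights='distance'`, ties among neighbors are rare because each neighbor's vote is scaled by the inverse of its distance to the query point.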
Choosing the Right k Value
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Test different k values
k_values = range(1, 21)
cv_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
cv_scores.append(scores.mean())
# Find best k
best_k = k_values[np.argmax(cv_scores)]
print(f"Best k value: {best_k}")
print(f"Best cross-validation score: {max(cv_scores):.4f}")
Best k value: 13
Best cross-validation score: 0.9667
Conclusion
The k-Nearest Neighbor algorithm is a simple yet effective machine learning technique for both classification and regression tasks. For best results, choose k via cross-validation and preprocess your data, in particular by scaling features, since k-NN relies on distance computations.
