How to Increase Classification Model Accuracy?
Machine learning classification models rely heavily on accuracy as a key performance indicator. Improving accuracy involves multiple strategies including data preprocessing, feature engineering, model selection, and hyperparameter tuning.
This article explores practical techniques to enhance classification model performance with Python examples.
Data Preprocessing
Quality data preprocessing forms the foundation of accurate models. Clean, normalized data significantly improves model performance.
Data Cleaning and Normalization
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# Sample dataset with missing values
data = pd.DataFrame({
    'feature1': [1, 2, np.nan, 4, 5],
    'feature2': [10, 20, 30, np.nan, 50],
    'target': [0, 1, 0, 1, 0]
})
# Handle missing values
imputer = SimpleImputer(strategy='mean')
data[['feature1', 'feature2']] = imputer.fit_transform(data[['feature1', 'feature2']])
# Normalize features
scaler = StandardScaler()
data[['feature1', 'feature2']] = scaler.fit_transform(data[['feature1', 'feature2']])
print(data)
   feature1  feature2  target
0 -1.414214 -1.322876       0
1 -0.707107 -0.566947       1
2  0.000000  0.188982       0
3  0.707107  0.000000       1
4  1.414214  1.700840       0
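The example above fits the imputer and scaler on the full dataset for brevity. In practice, fitting preprocessing steps only on the training split avoids data leakage into the test set. One way to enforce this is a scikit-learn Pipeline; the following is a minimal sketch using illustrative synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data with roughly 5% missing values injected
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Imputer and scaler are fitted on the training fold only,
# then applied unchanged to the test fold
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the pipeline is a single estimator, it also plugs directly into cross-validation and grid search without leaking test statistics.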
Feature Selection
Selecting relevant features reduces model complexity and prevents overfitting. Use correlation analysis and feature importance ranking to identify the best features.
Feature Importance with Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importance
importance = rf.feature_importances_
features = [f'Feature_{i}' for i in range(len(importance))]
# Display top 5 features
for i, (feature, imp) in enumerate(zip(features, importance)):
    if i < 5:
        print(f"{feature}: {imp:.3f}")
Feature_0: 0.094
Feature_1: 0.179
Feature_2: 0.092
Feature_3: 0.112
Feature_4: 0.068
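The section also mentions correlation analysis as a selection criterion. A simple univariate alternative is SelectKBest, which scores each feature against the target independently; the sketch below reuses the same synthetic data setup and keeps the five highest-scoring features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)

# Score each feature with the ANOVA F-statistic and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)
```

Univariate scores ignore feature interactions, so they complement rather than replace the model-based importances shown above.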
Model Selection and Comparison
Different algorithms perform better on different datasets. Compare multiple models to find the best performer.
Comparing Multiple Classifiers
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=42)
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42)
}
# Compare models using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Logistic Regression: 0.870 (+/- 0.066)
Decision Tree: 0.855 (+/- 0.103)
SVM: 0.885 (+/- 0.053)
Random Forest: 0.900 (+/- 0.071)
Hyperparameter Tuning
Fine-tuning hyperparameters optimizes model performance. Use Grid Search or Random Search to find optimal parameters.
Grid Search Example
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}
# Initialize model
rf = RandomForestClassifier(random_state=42)
# Grid search with cross-validation
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.3f}")
Best parameters: {'max_depth': 7, 'min_samples_split': 2, 'n_estimators': 100}
Best accuracy: 0.900
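Grid Search tries every combination, which grows quickly with the grid size. The Random Search alternative mentioned above samples a fixed number of combinations instead; a sketch with RandomizedSearchCV, using illustrative data since the grid-search snippet reuses X and y from the earlier comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}

# Evaluate 10 random combinations instead of the full 36-point grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions, n_iter=10, cv=3,
    scoring='accuracy', random_state=42,
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best accuracy: {search.best_score_:.3f}")
```

For large search spaces, random sampling often finds near-optimal settings at a fraction of the cost of an exhaustive grid.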
Handling Imbalanced Data
Imbalanced datasets can bias models toward majority classes. Use sampling techniques to balance class distribution.
SMOTE Oversampling
from imblearn.over_sampling import SMOTE
from collections import Counter
# Create imbalanced dataset
X_imb, y_imb = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                                   n_features=4, random_state=42)
print("Original distribution:", Counter(y_imb))
# Apply SMOTE
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_imb, y_imb)
print("Balanced distribution:", Counter(y_balanced))
Original distribution: Counter({0: 900, 1: 100})
Balanced distribution: Counter({0: 900, 1: 900})
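When adding the imbalanced-learn dependency is not an option, many scikit-learn classifiers offer a built-in alternative: class_weight='balanced' reweights each class inversely to its frequency during training instead of resampling the data. A sketch on the same 9:1 imbalanced setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same 9:1 class imbalance as the SMOTE example above
X_imb, y_imb = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                                   n_features=4, random_state=42)

# 'balanced' gives the minority class proportionally larger sample weights
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
scores = cross_val_score(clf, X_imb, y_imb, cv=5, scoring='f1')
print(f"Cross-validated F1 for the minority class: {scores.mean():.3f}")
```

Note the F1 scoring: on imbalanced data, plain accuracy rewards predicting the majority class, so a minority-sensitive metric gives a more honest picture.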
Cross-Validation Strategies
Proper validation prevents overfitting and provides reliable performance estimates. Use stratified cross-validation for imbalanced data.
Stratified K-Fold Cross-Validation
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
# Initialize stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
accuracies = []
for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    accuracies.append(accuracy_score(y_val, y_pred))
print(f"Cross-validation accuracy: {np.mean(accuracies):.3f} (+/- {np.std(accuracies):.3f})")
Cross-validation accuracy: 0.895 (+/- 0.050)
Key Strategies Summary
| Technique | Purpose | Best For |
|---|---|---|
| Data Preprocessing | Clean and normalize data | All datasets |
| Feature Selection | Remove irrelevant features | High-dimensional data |
| Hyperparameter Tuning | Optimize model parameters | Fine-tuning performance |
| Cross-Validation | Prevent overfitting | Model evaluation |
| Ensemble Methods | Combine multiple models | Complex datasets |
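The table lists ensemble methods, which were not demonstrated above. A minimal sketch combines the classifiers from the earlier comparison with a VotingClassifier on illustrative data; with hard voting, each base model casts one vote per sample and the majority class wins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=42)

# Hard voting: the predicted class is the majority vote of the base models
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
], voting='hard')

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"Voting ensemble accuracy: {scores.mean():.3f}")
```

Ensembles help most when the base models make different kinds of errors; combining three near-identical models gains little.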
Conclusion
Improving classification accuracy requires a systematic approach that combines data preprocessing, feature engineering, careful model selection, and robust validation. Perfect accuracy is rarely achievable, but applying these strategies together significantly improves model performance and reliability.
