Python - How and where to apply Feature Scaling?
Feature scaling is a crucial data preprocessing step applied to independent variables or features. It normalizes data within a particular range, ensuring all features contribute equally to machine learning algorithms.
Why Feature Scaling is Important
Most datasets contain features with vastly different magnitudes, units, and ranges: for example, age (20-80) versus income (20,000-100,000). Machine learning algorithms that use Euclidean distance take these raw magnitudes at face value:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Example: Age vs Income (unscaled)
data = np.array([[25, 50000], [30, 75000], [35, 100000]])
print("Original data:")
print("Age | Income")
for row in data:
    print(f"{row[0]:3d} | {row[1]:6d}")
# Calculate distances between points
dist1 = np.sqrt((30-25)**2 + (75000-50000)**2)
dist2 = np.sqrt((35-30)**2 + (100000-75000)**2)
print(f"\nDistance dominated by income: {dist1:.0f}, {dist2:.0f}")
Original data:
Age | Income
 25 |  50000
 30 |  75000
 35 | 100000

Distance dominated by income: 25000, 25000
The income feature dominates distance calculations, making age virtually irrelevant. Feature scaling solves this problem.
Feature Scaling Techniques
Standardization (Z-score Normalization)
Transforms features to have mean = 0 and standard deviation = 1 using the formula: x' = (x - μ) / σ
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[25, 50000], [30, 75000], [35, 100000]])
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print("Standardized data:")
print("Age | Income")
for row in standardized:
    print(f"{row[0]:7.2f} | {row[1]:7.2f}")
Standardized data:
Age | Income
  -1.22 |   -1.22
   0.00 |    0.00
   1.22 |    1.22
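Note that StandardScaler uses the population standard deviation (ddof=0), which is why the values above are ±1.22 rather than ±1. A quick sanity check, computing the z-scores by hand with NumPy on the same array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[25, 50000], [30, 75000], [35, 100000]], dtype=float)

# Manual z-score: (x - mean) / population std (ddof=0, matching sklearn)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# sklearn's result should agree
sk = StandardScaler().fit_transform(data)
print(np.allclose(manual, sk))
```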
Min-Max Scaling
Scales features to a fixed range [0,1] using: x' = (x - min(x)) / (max(x) - min(x))
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[25, 50000], [30, 75000], [35, 100000]])
scaler = MinMaxScaler()
minmax_scaled = scaler.fit_transform(data)
print("Min-Max scaled data:")
print("Age | Income")
for row in minmax_scaled:
    print(f"{row[0]:.2f} | {row[1]:.2f}")
Min-Max scaled data:
Age | Income
0.00 | 0.00
0.50 | 0.50
1.00 | 1.00
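MinMaxScaler is not limited to [0, 1]: its `feature_range` parameter accepts any target interval. A brief sketch scaling the same data to [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[25, 50000], [30, 75000], [35, 100000]], dtype=float)

# Scale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)
print(scaled)
```

The minimum of each column maps to -1, the maximum to 1, and the midpoint to 0.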
Robust Scaling
Uses median and interquartile range, making it robust to outliers: x' = (x - median) / IQR
from sklearn.preprocessing import RobustScaler
import numpy as np
data = np.array([[25, 50000], [30, 75000], [35, 100000], [100, 200000]])  # Added outlier
scaler = RobustScaler()
robust_scaled = scaler.fit_transform(data)
print("Robust scaled data (with outlier):")
print("Age | Income")
for row in robust_scaled:
    print(f"{row[0]:5.2f} | {row[1]:6.2f}")
Robust scaled data (with outlier):
Age | Income
-0.33 |  -0.67
-0.11 |  -0.22
 0.11 |   0.22
 3.00 |   2.00
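The robustness claim can be made concrete by comparing how much the outlier compresses the non-outlier rows under each scaler. A sketch on the same four-row array: the outlier inflates the standard deviation, squeezing the first three ages together, while the median/IQR used by RobustScaler preserves their spread.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

data = np.array([[25, 50000], [30, 75000], [35, 100000], [100, 200000]], dtype=float)

std_scaled = StandardScaler().fit_transform(data)
rob_scaled = RobustScaler().fit_transform(data)

# Spread (max - min) of the three non-outlier ages under each scaler
print("StandardScaler spread:", np.ptp(std_scaled[:3, 0]))
print("RobustScaler spread:  ", np.ptp(rob_scaled[:3, 0]))
```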
When to Apply Feature Scaling
Scale when algorithms use distance or assume normality:
| Algorithm Type | Scaling Required? | Reason |
|---|---|---|
| K-Nearest Neighbors | Yes | Uses Euclidean distance |
| SVM | Yes | Distance-based optimization |
| Neural Networks | Yes | Gradient descent optimization |
| PCA | Yes | Maximizes variance; unscaled large-range features dominate components |
| Decision Trees | No | Split-based, not distance-based |
| Random Forest | No | Tree-based ensemble |
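The table's claim that tree-based models need no scaling can be checked directly: standardization is a monotonic per-feature transform, so split thresholds shift but the resulting partitions (and predictions) stay the same. A quick sketch, assuming a fixed `random_state` for reproducible tree construction:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Same tree settings, one fit on raw and one on scaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Predictions on the respective representations of the same points match
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Tree predictions unchanged by scaling:", same)
```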
Practical Example
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
import numpy as np
# Create sample data with different scales
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X[:, 1] = X[:, 1] * 1000 # Scale second feature
# Without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=3)
knn_unscaled.fit(X, y)
accuracy_unscaled = knn_unscaled.score(X, y)
# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_scaled, y)
accuracy_scaled = knn_scaled.score(X_scaled, y)
print(f"Accuracy without scaling: {accuracy_unscaled:.3f}")
print(f"Accuracy with scaling: {accuracy_scaled:.3f}")
Accuracy without scaling: 0.840
Accuracy with scaling: 0.930
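One caveat for real workflows: fit the scaler on the training set only and reuse its statistics on the test set, so no test-set information leaks into training. A sketch of the standard pattern:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
print(f"Test accuracy: {knn.score(X_test_scaled, y_test):.3f}")
```

Wrapping the scaler and model in a scikit-learn `Pipeline` automates this pattern and keeps cross-validation leak-free.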
Conclusion
Feature scaling is essential for distance-based algorithms and gradient descent optimization. Use StandardScaler for normally distributed data, MinMaxScaler for bounded ranges, and RobustScaler when outliers are present. Always scale features when using KNN, SVM, neural networks, or PCA.
