What is Standardization in Machine Learning
Standardization is a crucial preprocessing technique in machine learning that ensures all features are on the same scale. This process transforms data to have a mean of 0 and a standard deviation of 1, making features comparable and improving model performance.
What is Standardization?
Standardization, also known as Z-score normalization, is a feature scaling technique that transforms data by subtracting the mean and dividing by the standard deviation. This process ensures that all features contribute equally to machine learning algorithms that are sensitive to feature scale.
Mathematical Formula
The standardization formula is:

Z = (X - μ) / σ

Where:
- Z is the standardized value
- X is the original data point
- μ is the mean of the dataset
- σ is the standard deviation of the dataset
Manual Calculation Example
Let's standardize the dataset [3, 5, 7, 8, 9, 4] manually:
import numpy as np
# Original data
data = [3, 5, 7, 8, 9, 4]
print("Original data:", data)
# Calculate mean and standard deviation
mean = np.mean(data)
std = np.std(data)
print(f"Mean: {mean}")
print(f"Standard deviation: {std:.2f}")
# Manual standardization
standardized = [(x - mean) / std for x in data]
print("Standardized data:", [round(z, 2) for z in standardized])
# Verify: mean should be ~0, std should be ~1
print(f"New mean: {np.mean(standardized):.2f}")
print(f"New std: {np.std(standardized):.2f}")
Original data: [3, 5, 7, 8, 9, 4]
Mean: 6.0
Standard deviation: 2.16
Standardized data: [-1.39, -0.46, 0.46, 0.93, 1.39, -0.93]
New mean: 0.00
New std: 1.00
Using StandardScaler from scikit-learn
For real-world applications, use scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
import numpy as np
# Create sample data matrix
X = np.array([[85, 72, 80],
[64, 35, 26],
[67, 48, 29],
[100, 11, 102],
[130, 14, 151]])
print("Original data:")
print(X)
# Create and fit StandardScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print("\nStandardized data:")
print(X_standardized.round(2))
# Verify standardization
print(f"\nMeans: {X_standardized.mean(axis=0).round(2)}")
print(f"Standard deviations: {X_standardized.std(axis=0).round(2)}")
Original data:
[[ 85  72  80]
 [ 64  35  26]
 [ 67  48  29]
 [100  11 102]
 [130  14 151]]

Standardized data:
[[-0.17  1.59  0.05]
 [-1.04 -0.04 -1.1 ]
 [-0.92  0.53 -1.04]
 [ 0.45 -1.11  0.52]
 [ 1.69 -0.97  1.56]]

Means: [ 0. -0.  0.]
Standard deviations: [1. 1. 1.]
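One practical detail worth noting: on a real project the scaler should be fitted on the training data only, then reused to transform the test data, so that test-set statistics never leak into preprocessing. A minimal sketch (the feature matrix here is made up for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical feature matrix, for illustration only
X = np.array([[85.0, 72.0],
              [64.0, 35.0],
              [67.0, 48.0],
              [100.0, 11.0],
              [130.0, 14.0],
              [90.0, 40.0]])

X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Training means:", scaler.mean_.round(2))
print("Scaled test data:\n", X_test_scaled.round(2))
```

Because the test rows are scaled with the training mean and standard deviation, their transformed values are generally not exactly zero-mean or unit-variance, and that is expected.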
When to Use Standardization
Standardization is essential for algorithms that are sensitive to feature scale:
- Support Vector Machines (SVM): rely on distance calculations between points
- K-Means Clustering: assigns clusters using Euclidean distance
- Principal Component Analysis (PCA): sensitive to the variance of each feature
- Neural Networks: standardized inputs improve convergence speed
- Logistic Regression with regularization: penalties assume features on comparable scales
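To keep the scaler and the model fitted together, scikit-learn pipelines are a common pattern: the scaler is fitted only on the data passed to `fit`, then applied automatically before prediction. A small sketch using SVM on the built-in iris dataset (the dataset choice is ours, not from the text above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits StandardScaler on the training data,
# then standardizes inputs automatically inside fit/predict
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```

The same pipeline object can be dropped into cross-validation, where each fold refits the scaler on its own training split.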
Standardization vs Normalization
| Aspect | Standardization | Normalization |
|---|---|---|
| Formula | (X - μ) / σ | (X - min) / (max - min) |
| Result Range | No fixed range | [0, 1] |
| Mean | 0 | Depends on data |
| Standard Deviation | 1 | Depends on data |
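The difference in the table can be seen directly by running both scalers on the same column of data; scikit-learn's MinMaxScaler implements the normalization formula shown above:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Same dataset as the manual example, as a single-feature column
data = np.array([[3.0], [5.0], [7.0], [8.0], [9.0], [4.0]])

standardized = StandardScaler().fit_transform(data)
normalized = MinMaxScaler().fit_transform(data)

print("Standardized:", standardized.ravel().round(2))  # mean 0, std 1, no fixed range
print("Normalized:  ", normalized.ravel().round(2))    # bounded to [0, 1]
```

Standardized values fall outside [0, 1] whenever a point is more than one standard deviation from the mean, while normalized values are always pinned between the column's min and max.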
Conclusion
Standardization ensures all features contribute equally to machine learning models by transforming them to have zero mean and unit variance. Use StandardScaler for consistent preprocessing and improved model performance.
