How to eliminate mean values from feature vector using scikit-learn library in Python?
Data preprocessing is essential for machine learning. It involves cleaning data, removing noise, and standardizing features. One common step is eliminating the mean from each feature so that the data is centered around zero. Centering matters for algorithms that are sensitive to feature offsets and scale, such as PCA, support vector machines, and models trained with gradient descent.
The scikit-learn library provides the preprocessing.scale() function to remove mean values and standardize features. This process is called standardization or z-score normalization.
Syntax
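sklearn.preprocessing.scale(X, *, axis=0, with_mean=True, with_std=True, copy=True)
Parameters
- X: Input array or matrix
- axis: Axis along which to compute the mean and standard deviation (0 standardizes each column, i.e. each feature; 1 standardizes each row, i.e. each sample)
- with_mean: Boolean; if True, center the data by subtracting the mean
- with_std: Boolean; if True, scale the data to unit variance
- copy: Boolean; if True, work on a copy instead of modifying X in place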
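The effect of the axis parameter is easiest to see on a tiny matrix. The following is a minimal sketch, assuming a made-up 3x2 array (the values and the names X, by_column, and by_row are illustrative and not part of the original example):

```python
import numpy as np
from sklearn import preprocessing

# Made-up data for illustration: 3 samples (rows) x 2 features (columns)
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# axis=0 (default): each column is centered and scaled independently
by_column = preprocessing.scale(X, axis=0)

# axis=1: each row is centered and scaled independently
by_row = preprocessing.scale(X, axis=1)

print("Column means after axis=0:", by_column.mean(axis=0))
print("Row means after axis=1:", by_row.mean(axis=1))
```

For feature preprocessing the default axis=0 is usually what you want, since each column normally holds one feature.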
Example
Let's eliminate mean values from a feature vector using scikit-learn:
import numpy as np
from sklearn import preprocessing
# Create sample input data
input_data = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])
print("Original data:")
print(input_data)
print("\nMean values:", input_data.mean(axis=0))
print("Standard deviation:", input_data.std(axis=0))
# Scale the data (remove mean and standardize)
data_scaled = preprocessing.scale(input_data)
print("\nAfter scaling:")
print("Mean values:", data_scaled.mean(axis=0))
print("Standard deviation:", data_scaled.std(axis=0))
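Original data:
[[ 34.78 31.9 -65.5 ]
 [-16.5 2.45 -83.5 ]
 [ 0.5 -87.98 45.62]
 [ 5.9 2.38 -55.82]]

Mean values: [ 6.17 -12.8125 -39.8 ]
Standard deviation: [18.4708067 45.03642047 50.30754615]

After scaling:
Mean values: [-2.60208521e-18 -8.32667268e-17 -1.11022302e-16]
Standard deviation: [1. 1. 1.]
The means after scaling are on the order of 1e-17 rather than exactly zero. This is floating-point rounding error, and the values are zero for practical purposes.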
How It Works
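The preprocessing.scale() function performs two operations on each feature (column):
- Mean removal: Subtracts the feature's mean from every value
- Standardization: Divides by the feature's standard deviation to get unit variance
The formula for each value is: z = (x - mean) / std, where mean and std are computed per column and std is the population standard deviation (NumPy's default, ddof=0).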
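To make the formula concrete, here is a small sketch that applies (x - mean) / std by hand with NumPy and compares the result to preprocessing.scale() on the same input data used above. The variable names manual and library are illustrative only:

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])

# Per-column statistics; np.std defaults to the population form (ddof=0)
mean = input_data.mean(axis=0)
std = input_data.std(axis=0)

# Apply the formula by hand; broadcasting handles each column
manual = (input_data - mean) / std

# Same operation through scikit-learn
library = preprocessing.scale(input_data)

print("Results match:", np.allclose(manual, library))
```

The two results should agree up to floating-point rounding. If you compute the standard deviation with ddof=1 instead (the sample standard deviation), the values will differ slightly.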
Only Removing Mean (Without Standardization)
To remove only the mean without standardizing:
import numpy as np
from sklearn import preprocessing
input_data = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])
# Remove mean only (keep original standard deviation)
data_centered = preprocessing.scale(input_data, with_std=False)
print("Mean after centering:", data_centered.mean(axis=0))
print("Std after centering:", data_centered.std(axis=0))
Mean after centering: [-1.38777878e-17 0.00000000e+00 2.77555756e-17]
Std after centering: [18.4708067 45.03642047 50.30754615]
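preprocessing.scale() computes the mean from the array it is given and does not keep it. If you need to apply the same centering to data you see later, one possible alternative is scikit-learn's StandardScaler class with with_std=False, which stores the column means after fitting. The sketch below is not part of the original example, and new_sample is a made-up row used only for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])

# Hypothetical unseen sample, invented for this illustration
new_sample = np.array([[10.0, 0.0, -40.0]])

# with_std=False: subtract the mean only, leave the spread untouched
centerer = StandardScaler(with_mean=True, with_std=False)
centerer.fit(train)

centered_train = centerer.transform(train)
# The new sample is shifted by the means learned from train
centered_new = centerer.transform(new_sample)

print("Stored column means:", centerer.mean_)
print("Centered new sample:", centered_new)
```

The design point is that the means come from the training data only, so later samples are shifted by the same amounts instead of by their own mean.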
Conclusion
Use preprocessing.scale() to eliminate mean values and standardize features in scikit-learn. Set with_std=False to remove only the mean while preserving original variance.
