How can data be scaled using scikit-learn library in Python?
Feature scaling is an important step in the data pre-processing stage when building machine learning models. It normalizes the data to fall within a specific range, which helps ensure that all features contribute comparably to the model's predictions.
It can also speed up the convergence of gradient-based learning algorithms.
Why is Feature Scaling Needed?
Data fed to learning algorithms should remain consistent and structured. All features of the input data should be on a similar scale to effectively predict values. However, in real-world scenarios, data is often unstructured and features have different scales.
For example, age might range from 0-100, while income could range from 0-100,000. Without scaling, the income feature would dominate the model simply due to its larger numeric range.
This is where normalization comes into the picture. It is one of the most important data-preparation steps, transforming feature values so that they fall on a common scale.
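As a quick illustration of the age/income example above, here is a minimal sketch (with made-up sample values) showing how an unscaled income feature dominates a Euclidean distance, while scaling makes both features contribute comparably:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical samples: [age, income]
X = np.array([[25.0, 30000.0],
              [30.0, 90000.0],
              [60.0, 31000.0]])

# Unscaled: the distance between the first two samples is
# dominated entirely by the income column.
d_unscaled = np.linalg.norm(X[0] - X[1])  # ~60000; the age difference of 5 barely registers

# After min-max scaling, both features lie in [0, 1] and
# contribute on a comparable footing.
X_scaled = MinMaxScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])  # ~1.01

print(d_unscaled, d_scaled)
```

A distance-based model (e.g. k-nearest neighbours) fed the unscaled data would effectively ignore age altogether.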
MinMaxScaler Example
Let's see how scikit-learn's MinMaxScaler can be used to scale features between 0 and 1:
```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])

data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("The scaled data is:")
print(data_scaled_minmax)
```

Output:

```
The scaled data is:
[[1.         1.         0.1394052 ]
 [0.         0.75433767 0.        ]
 [0.33151326 0.         1.        ]
 [0.43681747 0.75375375 0.21437423]]
```
StandardScaler Example
Another common scaling method is StandardScaler, which standardizes features by removing the mean and scaling to unit variance:
```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])

data_scaler_standard = preprocessing.StandardScaler()
data_scaled_standard = data_scaler_standard.fit_transform(input_data)
print("The standardized data is:")
print(np.round(data_scaled_standard, 5))  # rounded for readability
```

Output:

```
The standardized data is:
[[ 1.54893  0.99281 -0.51086]
 [-1.22734  0.33889 -0.86866]
 [-0.30697 -1.66904  1.69796]
 [-0.01462  0.33734 -0.31844]]
```

Each column of the standardized output now has mean 0 and unit variance, which you can verify with data_scaled_standard.mean(axis=0) and data_scaled_standard.std(axis=0).
Comparison of Scaling Methods
| Method | Range | Formula | Best For |
|---|---|---|---|
| MinMaxScaler | 0 to 1 | (X - X_min) / (X_max - X_min) | When you know the bounds |
| StandardScaler | Mean=0, Std=1 | (X - mean) / std | When data is normally distributed |
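The two formulas in the table can be applied column-wise with plain NumPy and checked against the scikit-learn transformers; this is a minimal sketch using the same input array as the examples above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([
    [34.78, 31.9, -65.5],
    [-16.5, 2.45, -83.5],
    [0.5, -87.98, 45.62],
    [5.9, 2.38, -55.82]
])

# Min-max formula, applied per column: (X - X_min) / (X_max - X_min)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization formula: (X - mean) / std
# (StandardScaler uses the population standard deviation, i.e. ddof=0,
# which is also NumPy's default)
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))
```

Both checks print True, confirming that the transformers implement exactly the formulas shown in the table.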
How It Works
The MinMaxScaler transforms features by scaling each feature to the range [0,1] using the minimum and maximum values.
The StandardScaler transforms features by removing the mean and scaling to unit variance (standard normal distribution).
Both methods use fit_transform() to compute the scaling parameters and apply the transformation in one step. The scaled data maintains the relative relationships between data points while ensuring all features contribute comparably to the model.
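In practice the two steps are often kept separate: fit() learns the scaling parameters from the training data only, and transform() applies them to any later data. A minimal sketch with made-up one-column data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test split for a single feature
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

# Learn min/max from the training data only...
scaler = MinMaxScaler().fit(train)

# ...then apply the same parameters to the test data.
# Note that test values outside the training range can
# fall outside [0, 1]: here 5.0 -> 0.5 and 20.0 -> 2.0.
print(scaler.transform(test))
```

Fitting the scaler on the full dataset (including test data) would leak information from the test set into preprocessing, so this fit-on-train, transform-on-test pattern is the standard practice.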
Conclusion
Feature scaling is essential for machine learning algorithms to perform optimally. Use MinMaxScaler when you need bounded ranges, and StandardScaler when your data follows a normal distribution. Both methods ensure features contribute equally to model predictions.
