What is Standardization in Machine Learning
A dataset is the heart of any ML model. It is of utmost importance that the data in a dataset are scaled and within a particular range, to provide accurate results.
Standardization in machine learning, a type of feature scaling, is used to bring uniformity to a dataset, so that the independent variables and features share the same scale and range. Standardization transforms the mean to 0 and the standard deviation to 1. To standardize, the mean is subtracted from each data point and the result is divided by the standard deviation, yielding rescaled data.
This technique is used in machine learning models such as Principal Component Analysis (PCA), Support Vector Machines (SVM), and k-means clustering, as they depend on the Euclidean distance.
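To see why distance-based models need scaling, consider a small sketch (the income/age values below are made-up illustrative data): without standardization, the feature with the larger numeric range dominates the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (in currency units) and age (years)
X = np.array([[50000.0, 25.0],
              [52000.0, 60.0],
              [90000.0, 26.0]])

# Without scaling, the distance between the first two rows is dominated
# by the 2000-unit income gap; the 35-year age gap barely registers
d_raw = np.linalg.norm(X[0] - X[1])

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[0] - X_std[1])

print(d_raw)  # on the order of 2000
print(d_std)  # on the order of 2
```

After scaling, the large age difference between the first two people is no longer drowned out by the income column.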
The mathematical representation is as follows −
Z = (X - m) / s
where
X − a data point
m − the mean
s − the standard deviation
Z − the standardized value
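The formula above can be applied directly with NumPy; this minimal sketch uses arbitrary sample values and verifies that the standardized data has mean 0 and standard deviation 1. (Note that `np.std` defaults to the population standard deviation.)

```python
import numpy as np

# Arbitrary sample data points
X = np.array([3.0, 5.0, 7.0, 8.0, 9.0, 4.0])

m = X.mean()       # the mean
s = X.std()        # the (population) standard deviation
Z = (X - m) / s    # Z = (X - m) / s, the standardized values

print(Z.mean())    # ~0
print(Z.std())     # ~1
```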
Algorithm
Step 1 − Import the required libraries. Some of the commonly imported libraries for standardizing data are numpy, pandas and scikit-learn.
Step 2 − Import the StandardScaler() class from the sklearn.preprocessing module.
Step 3 − Load the dataset that you want to standardize.
Step 4 − Split the data into training and testing sets: X_train, X_test, y_train and y_test.
Step 5 − Fit the StandardScaler() on the training data, then use it to transform both the training and testing data.
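The steps above can be sketched as follows; the feature matrix here is a toy placeholder standing in for a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 3: a toy feature matrix and labels (placeholder for a real dataset)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Step 4: split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 5: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # ~0 for each column
print(X_train_std.std(axis=0))   # ~1 for each column
```

Fitting on the training data alone avoids leaking information from the test set into the scaling parameters.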
Example
In this example, we will examine standardization by working through a set of random data values by hand. Let us consider the following data points −
3, 5, 7, 8, 9, 4
The mean m = 36/6 = 6
The (sample) standard deviation s = 2.36
Z1 = (3 - 6)/2.36 = -1.27
Z2 = (5 - 6)/2.36 = -0.42
Z3 = (7 - 6)/2.36 = 0.42
Z4 = (8 - 6)/2.36 = 0.85
Z5 = (9 - 6)/2.36 = 1.27
Z6 = (4 - 6)/2.36 = -0.85
Now, the mean of the standardized values is
(Z1 + Z2 + Z3 + Z4 + Z5 + Z6)/6 = (-1.27 - 0.42 + 0.42 + 0.85 + 1.27 - 0.85)/6 = 0
and their standard deviation is 1.
Thus, after standardization, the values are within the same range, the mean is 0 and the standard deviation is 1.
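The hand computation above can be checked in a few lines of NumPy. Since s = 2.36 is the sample standard deviation, we pass `ddof=1`:

```python
import numpy as np

X = np.array([3.0, 5.0, 7.0, 8.0, 9.0, 4.0])

m = X.mean()        # 6.0
s = X.std(ddof=1)   # sample standard deviation, ~2.366
Z = (X - m) / s

print(np.round(Z, 2))  # matches the Z1..Z6 values computed by hand
print(Z.mean())        # ~0
print(Z.std(ddof=1))   # ~1
```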
Example
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create a sample data matrix
X = np.array([[85, 72, 80],
              [64, 35, 26],
              [67, 48, 29],
              [100, 11, 102],
              [130, 14, 151]])

# Create an instance of StandardScaler
standard_scaler = StandardScaler()

# Fit the scaler to the data
standard_scaler.fit(X)

# Transform the data using the scaler
X_new = standard_scaler.transform(X)

# Print the transformed data
print("new data:", X_new)
Output
new data: [[-0.17359522  1.59410679  0.0511375 ]
 [-1.04157134 -0.04428074 -1.09945622]
 [-0.91757475  0.53136893 -1.03553435]
 [ 0.44638772 -1.10701861  0.5198979 ]
 [ 1.68635359 -0.97417637  1.56395517]]
In this program, the variable X holds the feature matrix as a NumPy array. The StandardScaler() is fitted to it, the data is transformed, and the standardized array is displayed.
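A fitted StandardScaler also exposes the per-column mean and standard deviation it learned, and can undo the transformation; a short sketch using the same data matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[85, 72, 80],
              [64, 35, 26],
              [67, 48, 29],
              [100, 11, 102],
              [130, 14, 151]], dtype=float)

scaler = StandardScaler()
X_new = scaler.fit_transform(X)  # fit and transform in one call

# The fitted scaler stores the per-column statistics it subtracts and divides by
print(scaler.mean_)   # column means of X
print(scaler.scale_)  # column standard deviations of X

# inverse_transform recovers the original values from the standardized ones
X_back = scaler.inverse_transform(X_new)
print(np.allclose(X_back, X))  # True
```

This is useful when predictions made in the standardized space need to be mapped back to the original units.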
Conclusion
Standardization is a reliable way to improve model results by rescaling the data. Datasets contain variables whose values can span very different ranges. This problem is addressed by standardization and normalization, both of which come under feature scaling. The motive of feature scaling is to ensure that all features are given equal importance when a machine learning model predicts the output.