What is Standardization in Machine Learning
A dataset is the heart of any ML model. It is of utmost importance that the data in a dataset are scaled and within a particular range, to provide accurate results.
Standardization in machine learning, a type of feature scaling, is used to bring uniformity to a dataset, so that the independent variables and features share the same scale and range. Standardization transforms the mean to 0 and the standard deviation to 1. To standardize, the mean is subtracted from each data point and the result is divided by the standard deviation, yielding rescaled data.
This technique is used in machine learning models such as Principal Component Analysis (PCA), Support Vector Machines (SVM), and k-means clustering, as they depend on the Euclidean distance.
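To see why distance-based models need scaling, consider a small sketch (the income/age values below are made-up illustrative data): without standardization, the feature with the larger numeric range dominates the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (in currency units) and age (years)
X = np.array([[50000.0, 25.0],
              [52000.0, 60.0],
              [90000.0, 26.0]])

# Without scaling, the distance between the first two rows is dominated
# by the 2000-unit income gap; the 35-year age gap barely registers
d_raw = np.linalg.norm(X[0] - X[1])

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[0] - X_std[1])

print(d_raw)  # on the order of 2000
print(d_std)  # on the order of 2
```

After scaling, the large age difference between the first two people is no longer drowned out by the income column.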
The mathematical representation is as follows −
Z = (X - m) / s
where
X − a data point
m − the mean
s − the standard deviation
Z − the standardized value
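The formula above can be applied directly with NumPy; this minimal sketch uses arbitrary sample values and verifies that the standardized data has mean 0 and standard deviation 1. (Note that `np.std` defaults to the population standard deviation.)

```python
import numpy as np

# Arbitrary sample data points
X = np.array([3.0, 5.0, 7.0, 8.0, 9.0, 4.0])

m = X.mean()       # the mean
s = X.std()        # the (population) standard deviation
Z = (X - m) / s    # Z = (X - m) / s, the standardized values

print(Z.mean())    # ~0
print(Z.std())     # ~1
```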
Algorithm
Step 1 − Import the required libraries. Some of the commonly imported libraries for standardizing data are numpy, pandas and scikit-learn.
Step 2 − Import the StandardScaler() class from the sklearn.preprocessing module.
Step 3 − Load the dataset that you want to standardize.
Step 4 − Split the data into training and testing sets: X_train, X_test, y_train and y_test.
Step 5 − Fit the StandardScaler() on the training data, then use it to transform both the training and testing data.
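The steps above can be sketched as follows; the feature matrix here is a toy placeholder standing in for a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 3: a toy feature matrix and labels (placeholder for a real dataset)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Step 4: split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 5: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # ~0 for each column
print(X_train_std.std(axis=0))   # ~1 for each column
```

Fitting on the training data alone avoids leaking information from the test set into the scaling parameters.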
Example
In this example, we will examine standardization by working through a set of random data values by hand. Let us consider the following data points −
3, 5, 7, 8, 9, 4
The mean m = 36/6 = 6
The (sample) standard deviation s = 2.36
Z1 = (3 - 6)/2.36 = -1.27
Z2 = (5 - 6)/2.36 = -0.42
Z3 = (7 - 6)/2.36 = 0.42
Z4 = (8 - 6)/2.36 = 0.85
Z5 = (9 - 6)/2.36 = 1.27
Z6 = (4 - 6)/2.36 = -0.85
Now, the mean of the standardized values is
(Z1 + Z2 + Z3 + Z4 + Z5 + Z6)/6 = (-1.27 - 0.42 + 0.42 + 0.85 + 1.27 - 0.85)/6 = 0
and their standard deviation is 1.
Thus, after standardization, the values are within the same range, the mean is 0 and the standard deviation is 1.
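The hand computation above can be checked in a few lines of NumPy. Since s = 2.36 is the sample standard deviation, we pass `ddof=1`:

```python
import numpy as np

X = np.array([3.0, 5.0, 7.0, 8.0, 9.0, 4.0])

m = X.mean()        # 6.0
s = X.std(ddof=1)   # sample standard deviation, ~2.366
Z = (X - m) / s

print(np.round(Z, 2))  # matches the Z1..Z6 values computed by hand
print(Z.mean())        # ~0
print(Z.std(ddof=1))   # ~1
```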
Example
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create a sample data matrix
X = np.array([[85, 72, 80],
              [64, 35, 26],
              [67, 48, 29],
              [100, 11, 102],
              [130, 14, 151]])

# Create an instance of StandardScaler
standard_scaler = StandardScaler()

# Fit the scaler to the data
standard_scaler.fit(X)

# Transform the data using the scaler
X_new = standard_scaler.transform(X)

# Print the transformed data
print("new data:", X_new)
Output
new data: [[-0.17359522  1.59410679  0.0511375 ]
 [-1.04157134 -0.04428074 -1.09945622]
 [-0.91757475  0.53136893 -1.03553435]
 [ 0.44638772 -1.10701861  0.5198979 ]
 [ 1.68635359 -0.97417637  1.56395517]]
In this program, the variable X holds the feature matrix as a NumPy array. The StandardScaler() is fitted to it, the data is transformed, and the standardized array is displayed.
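A fitted StandardScaler also exposes the per-column mean and standard deviation it learned, and can undo the transformation; a short sketch using the same data matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[85, 72, 80],
              [64, 35, 26],
              [67, 48, 29],
              [100, 11, 102],
              [130, 14, 151]], dtype=float)

scaler = StandardScaler()
X_new = scaler.fit_transform(X)  # fit and transform in one call

# The fitted scaler stores the per-column statistics it subtracts and divides by
print(scaler.mean_)   # column means of X
print(scaler.scale_)  # column standard deviations of X

# inverse_transform recovers the original values from the standardized ones
X_back = scaler.inverse_transform(X_new)
print(np.allclose(X_back, X))  # True
```

This is useful when predictions made in the standardized space need to be mapped back to the original units.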
Conclusion
Standardization is a reliable way to improve model results by rescaling the data. Datasets contain variables whose values can span very different ranges. This problem is addressed by standardization and normalization, both of which come under feature scaling. The motive of feature scaling is to ensure that all features are given equal importance when a machine learning model predicts the output.