Pre-processing data refers to cleaning the data: removing invalid entries, filtering out noise, replacing missing or corrupt values with relevant ones, and so on.
This is not limited to text data; images and video need pre-processing as well. It is an important step in the machine learning pipeline.
Data pre-processing is the task of gathering all the data (collected from one or more sources) into a common format or into uniform datasets, depending on the type of data.
This is done so that the learning algorithm can learn from the dataset and produce accurate, relevant results. Since real-world data is never ideal, it is likely to contain missing cells, errors, outliers, inconsistencies between columns, and much more.
Sometimes, images may be misaligned, unclear, or very large. The goal of pre-processing is to remove these discrepancies and errors. Data pre-processing is not a single task but a set of tasks performed step by step,
where the output of one step becomes the input to the next.
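This step-by-step flow can be sketched with scikit-learn's Pipeline, which chains transformers so that each step's output feeds the next. The example below is a minimal sketch using a made-up toy array: it first fills a missing cell with the column mean, then standardizes the features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy dataset with one missing cell (np.nan), as real-world data often has
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Each step's output becomes the next step's input:
# 1. replace NaN with the column mean, 2. scale to zero mean and unit variance
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

X_clean = pipe.fit_transform(X)
print(X_clean)
```

After the pipeline runs, the array has no missing values and each column has zero mean and unit variance, ready for a learning algorithm.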
Let us take the example of converting numerical values into Boolean values using scikit-learn's Binarizer:
import numpy as np
from sklearn import preprocessing

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])

# Values above the threshold become 1; values at or below it become 0
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("Values converted from numeric to Boolean :\n", data_binarized)
Values converted from numeric to Boolean :
 [[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]