How can the scikit-learn library be used to preprocess data in Python?

Pre-processing data refers to cleaning the data: removing invalid entries and noise, replacing missing or incorrect values with relevant ones, and so on.

Pre-processing is not limited to text data; it applies equally to images and video. It is an important step in the machine learning pipeline.

Data pre-processing also involves gathering all the data (collected from one or more sources) into a common format or into uniform datasets, depending on the type of data.

This is done so that the learning algorithm can learn from the dataset and produce relevant, accurate results. Since real-world data is rarely ideal, it may contain missing cells, errors, outliers, inconsistent columns, and more.
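As a minimal sketch of handling the missing cells mentioned above, scikit-learn's SimpleImputer can fill each gap with the mean of its column. The array `raw` and the choice of the mean strategy are assumptions made for illustration; other strategies may suit real data better.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: np.nan marks the missing cells
raw = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [5.0, np.nan]])

# Replace each missing cell with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(raw)

print(filled)
```

Each column here has two observed values, so every gap is filled with their average.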

Sometimes, images may be misaligned, unclear, or very large. The goal of pre-processing is to remove these discrepancies and errors. Data pre-processing isn’t a single task, but a set of tasks that are performed step by step.

The output of one step becomes the input to the next step and so on.
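This chaining of steps can be sketched with scikit-learn's make_pipeline, which passes the output of each transformer to the next one. The sample array and the two chosen steps (mean imputation followed by standard scaling) are assumptions for illustration, not a prescribed recipe.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing cell
samples = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, 30.0]])

# Step 1 fills the missing cell; step 2 receives the filled array and
# rescales each column to zero mean and unit variance
steps = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
result = steps.fit_transform(samples)

print(result)
```

Calling fit_transform on the pipeline runs the steps in order, so the scaler never sees the missing value.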

Let us take the example of converting numerical values into Boolean values −


import numpy as np
from sklearn import preprocessing

# Sample input: four rows with three numeric features each
input_data = np.array([[34.78, 31.9, -65.5], [-16.5, 2.45, -83.5], [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])

# Values strictly greater than the threshold become 1; all others become 0
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nValues converted from numeric to Boolean :\n", data_binarized)


Values converted from numeric to Boolean :
 [[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]


  • The required packages are imported.
  • The input data is created as a NumPy array.
  • The ‘Binarizer’ class in the ‘preprocessing’ module of sklearn is used to convert numerical values into Boolean values.
  • Boolean values here are 1 and 0: values greater than the threshold (0.5) become 1, and all other values become 0.
  • This converted data is printed on the console.
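In the example above, the value 0.5 in the third row maps to 0 because Binarizer uses a strict greater-than comparison against the threshold. The following sketch isolates that behaviour; the array `values` is hypothetical, and the final cast to a NumPy bool array is an optional extra step, since Binarizer itself returns floats.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical values just below, exactly at, and just above the threshold
values = np.array([[0.4, 0.5, 0.6]])

binarizer = Binarizer(threshold=0.5)
flags = binarizer.fit_transform(values)  # floats: 0.0 or 1.0

# Optional: convert the 0.0/1.0 floats into a true Boolean array
as_bool = flags.astype(bool)

print(flags)
print(as_bool)
```

Only the value above 0.5 is flagged; the value equal to the threshold is not.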
Published on 10-Dec-2020 13:34:59