
How can the scikit-learn library be used to preprocess data in Python?
Pre-processing data means cleaning it: removing invalid entries and noise, replacing missing or incorrect values with suitable ones, and so on.
The data need not be text; it can also be images or video. Pre-processing is an important step in the machine learning pipeline.
It also involves bringing all the data, whether collected from one source or several, into a common format or into uniform datasets (depending on the type of data).
This is done so that the learning algorithm can learn from the dataset and give relevant, accurate results. Real-world data is rarely ideal: it may have missing cells, errors, outliers, inconsistencies between columns, and more.
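As a small illustration of handling missing cells, the sketch below fills each gap with the mean of its column using scikit-learn's SimpleImputer. The array `raw` is made-up sample data, and mean imputation is only one of several possible strategies; it is an assumption for this example, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical sample data: np.nan marks the missing cells.
raw = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, np.nan]])

# Replace each missing cell with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(raw)

print(filled)
```

Other strategies such as "median" or "most_frequent" can be passed in the same way when the mean would be distorted by outliers.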
Sometimes, images may not be correctly aligned, may be blurry, or may be too large. The goal of pre-processing is to remove these discrepancies and errors. Data pre-processing isn’t a single task, but a sequence of tasks that are performed step by step.
The output of one step becomes the input to the next step, and so on.
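One way to express this chaining in scikit-learn is the Pipeline class, which passes the output of each step to the next. The following is a minimal sketch, assuming two steps chosen purely for illustration (median imputation followed by standardization); the step names and the sample array are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data with one missing cell.
raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, 30.0]])

# Step 1 fills missing cells; its output is fed to step 2,
# which rescales each column to zero mean and unit variance.
steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
prepared = steps.fit_transform(raw)

print(prepared)
```

Keeping the steps in a single object means the same sequence can later be applied to new data with `steps.transform(...)`.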
Let us take the example of converting numerical values into Boolean values −
Example
import numpy as np
from sklearn import preprocessing

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])
# Values greater than the threshold become 1; all others become 0
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nValues converted from numeric to Boolean :\n", data_binarized)
Output
Values converted from numeric to Boolean :
 [[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]
Explanation
- The required packages are imported.
- The input data is created as a NumPy array.
- The ‘Binarizer’ class in the ‘preprocessing’ module of sklearn is used to convert the numerical values into Boolean values: values greater than the threshold (0.5) become 1, and all other values become 0.
- Here, Boolean values means 1 and 0 only; they are returned as floating-point numbers (1. and 0.).
- This converted data is printed on the console.
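To make the thresholding rule concrete, the sketch below reproduces the same result with a plain NumPy comparison and checks it against Binarizer. It reuses the input array from the example; the variable names `by_sklearn` and `by_numpy` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])
threshold = 0.5

# Binarizer maps values strictly greater than the threshold to 1, the rest to 0.
by_sklearn = Binarizer(threshold=threshold).transform(input_data)

# The same rule written as an element-wise comparison, cast back to floats.
by_numpy = (input_data > threshold).astype(float)

print(np.array_equal(by_sklearn, by_numpy))

# If true Boolean values are needed, the 1./0. array can be cast to bool.
print(by_sklearn.astype(bool))
```

Note that 0.5 in the third row maps to 0, because the comparison is strict: a value equal to the threshold is not greater than it.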
