How can the scikit-learn library be used to preprocess data in Python?

Data preprocessing is the process of cleaning and transforming raw data into a format suitable for machine learning algorithms. The scikit-learn library provides tools to handle missing values, scale features, encode categorical variables, and binarize numeric data.

Real-world data often contains inconsistencies, missing values, outliers, and features with different scales. Many algorithms are sensitive to these problems, so preprocessing gives the model clean, consistently scaled input to learn from.
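As a minimal sketch of handling missing values, the example below fills gaps with the column mean using SimpleImputer. Note that this class lives in sklearn.impute rather than sklearn.preprocessing, and the toy array is a hypothetical illustration, not data from this article.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: np.nan marks the missing entries.
raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

# Replace each missing value with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(raw)

print(filled)
```

Here the missing entry in the first column becomes 2.0 (the mean of 1.0 and 3.0) and the one in the second column becomes 15.0. Other strategies, such as "median" or "most_frequent", may suit skewed or categorical data better.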

Binarization

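Binarization converts numerical values to binary (0 or 1) based on a threshold. Values strictly greater than the threshold become 1; values less than or equal to it become 0. In the example below, 0.5 equals the threshold and therefore maps to 0: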

import numpy as np
from sklearn import preprocessing

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])

data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("Values converted from numeric to Boolean:")
print(data_binarized)
Values converted from numeric to Boolean:
[[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]
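To make the threshold rule concrete, the following sketch compares Binarizer with a plain NumPy comparison. The three sample values are hypothetical and chosen only to sit just below, at, and just above the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical values: below, equal to, and above the threshold.
values = np.array([[0.4, 0.5, 0.6]])

binarized = Binarizer(threshold=0.5).fit_transform(values)

# Equivalent NumPy expression: a strict "greater than" comparison.
manual = (values > 0.5).astype(float)

print(binarized)
print(np.array_equal(binarized, manual))
```

Only 0.6 maps to 1, because the comparison is strict. If values equal to the cutoff should count as 1, one option is to lower the threshold slightly.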

Feature Scaling

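StandardScaler standardizes each feature to zero mean and unit variance. This puts features with very different ranges on a comparable scale, which helps gradient-based optimizers converge faster and keeps large-valued features from dominating distance-based models: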

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[100, 0.001],
                 [200, 0.005],
                 [300, 0.003]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nScaled data:")
print(scaled_data)
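Original data:
[[1.e+02 1.e-03]
 [2.e+02 5.e-03]
 [3.e+02 3.e-03]]

Scaled data:
[[-1.22474487 -1.22474487]
 [ 0.          1.22474487]
 [ 1.22474487  0.        ]]

NumPy prints the original array in scientific notation because its values span several orders of magnitude. In the second column, 0.003 is the column mean, so it scales to 0; floating-point rounding may show it as a tiny nonzero value instead, and the exact formatting can vary between NumPy versions.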
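The standardization itself is simple arithmetic: subtract each column's mean and divide by its standard deviation. The sketch below reproduces it by hand and checks the result against StandardScaler; the variable name manual is introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[100, 0.001],
                 [200, 0.005],
                 [300, 0.003]])

scaled = StandardScaler().fit_transform(data)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))               # scaler matches the formula
print(np.allclose(scaled.mean(axis=0), 0.0))     # each column has mean ~0
print(np.allclose(scaled.std(axis=0), 1.0))      # each column has std ~1
```

The checks use np.allclose rather than exact equality because floating-point rounding can leave differences in the last few digits.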

Label Encoding

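LabelEncoder converts categorical text labels into integer codes, assigned in sorted order of the class names. It is intended for target labels (y); for input features, scikit-learn provides OneHotEncoder and OrdinalEncoder: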

from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print("Original labels:", labels)
print("Encoded labels:", encoded_labels)
# Build the mapping with plain Python types so it prints the same across NumPy versions
print("Label mapping:", {str(label): int(code) for code, label in enumerate(encoder.classes_)})
Original labels: ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
Encoded labels: [0 1 2 0 1 2]
Label mapping: {'cat': 0, 'dog': 1, 'fish': 2}
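Encoded labels usually need to be turned back into their original names, for example when reporting model predictions. A minimal sketch using inverse_transform, with a hypothetical pair of labels to encode and decode:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['cat', 'dog', 'fish'])

# Encode a couple of labels, then map the codes back to text.
codes = encoder.transform(['fish', 'cat'])
decoded = encoder.inverse_transform(codes)

print(codes.tolist())
print(decoded.tolist())
```

Note that transform raises a ValueError for a label that was not seen during fit, so the encoder should be fitted on the full set of expected classes.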

Common Preprocessing Techniques

| Technique | Purpose | Scikit-learn Class |
|---|---|---|
| Binarization | Convert to binary values | Binarizer |
| Scaling | Normalize feature ranges | StandardScaler, MinMaxScaler |
| Label Encoding | Convert categories to numbers | LabelEncoder |
| One-hot Encoding | Create binary columns | OneHotEncoder |
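One-hot encoding is listed above but not demonstrated, so here is a minimal sketch. It assumes a single hypothetical categorical feature and converts the encoder's sparse result to a dense array with toarray() so that it prints readably.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
animals = np.array([['cat'], ['dog'], ['fish'], ['cat']])

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(animals).toarray()

print(encoder.categories_[0].tolist())  # column order of the encoded output
print(one_hot)
```

Each row contains a single 1 in the column of its category. Unlike integer codes, this representation does not imply any ordering between categories.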

Conclusion

Scikit-learn's preprocessing module provides essential tools for data cleaning and transformation, including Binarizer, StandardScaler, LabelEncoder, and OneHotEncoder. Applying them before training puts features on comparable scales and in the numeric form that most estimators expect.

Updated on: 2026-03-25T13:16:54+05:30
