How can the scikit-learn library be used to preprocess data in Python?

Data preprocessing is the process of cleaning and transforming raw data into a format suitable for machine learning algorithms. The scikit-learn library provides powerful preprocessing tools to handle missing values, scale features, encode categorical variables, and convert data formats.

Real-world data often contains inconsistencies, missing values, outliers, and features with different scales. Preprocessing puts the data into a clean, consistently scaled numeric form, which many machine learning models need in order to train well.
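Missing values are mentioned above but not demonstrated in the sections that follow, so here is a minimal sketch of one common approach: filling each gap with the column mean using `SimpleImputer` from `sklearn.impute`. The small `raw` array is hypothetical data made up for illustration, and mean imputation is only one of several available strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with two missing entries marked as NaN
raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(raw)

print("Data after mean imputation:")
print(filled)
```

Other values for `strategy`, such as `"median"` or `"most_frequent"`, may suit skewed or categorical columns better.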

Binarization

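Binarization converts numerical values to binary (0 or 1) based on a threshold. Values strictly greater than the threshold become 1, while values less than or equal to it become 0. In the example below, the entry 0.5 equals the threshold and therefore maps to 0: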

import numpy as np
from sklearn import preprocessing

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])

data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("Values converted from numeric to Boolean:")
print(data_binarized)

Values converted from numeric to Boolean:
[[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]
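To make the thresholding rule concrete, the following sketch compares `Binarizer` with the same rule written directly in NumPy. It reuses the input array from the example above; the variable names `sklearn_result` and `manual_result` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

input_data = np.array([[34.78, 31.9, -65.5],
                       [-16.5, 2.45, -83.5],
                       [0.5, -87.98, 45.62],
                       [5.9, 2.38, -55.82]])

threshold = 0.5

# Binarizer learns nothing from the data, so fit_transform only applies the rule
sklearn_result = Binarizer(threshold=threshold).fit_transform(input_data)

# The same rule in plain NumPy: strictly greater than the threshold maps to 1.0
manual_result = (input_data > threshold).astype(float)

print("Results match:", np.array_equal(sklearn_result, manual_result))
print("Entry equal to the threshold maps to:", manual_result[2, 0])
```

The advantage of the scikit-learn class over the one-line NumPy expression is that it can be placed inside a `Pipeline` alongside other preprocessing steps.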

Feature Scaling

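StandardScaler standardizes each feature (column) to zero mean and unit variance. This keeps features with large numeric ranges from dominating those with small ranges and often helps gradient-based algorithms converge faster: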

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[100, 0.001],
                 [200, 0.005],
                 [300, 0.003]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nScaled data:")
print(scaled_data)

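Original data:
[[1.e+02 1.e-03]
 [2.e+02 5.e-03]
 [3.e+02 3.e-03]]

Scaled data:
[[-1.22474487 -1.22474487]
 [ 0.          1.22474487]
 [ 1.22474487  0.        ]]

Each column holds three evenly spaced values, so both standardize to -1.22474487, 0, and 1.22474487, in the order the values appear. NumPy prints the original array in scientific notation because its two columns differ by several orders of magnitude. Because of floating-point rounding, the exact digits and notation of the scaled output may differ slightly; for example, a 0. entry may print as a very small nonzero number.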
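As a check on what StandardScaler computes, the sketch below reproduces the scaling by hand: subtract each column's mean and divide by its population standard deviation (NumPy's default, `ddof=0`). It also shows the fitted `mean_` and `scale_` attributes and `inverse_transform`. The names `manual` and `restored` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[100, 0.001],
                 [200, 0.005],
                 [300, 0.003]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print("Matches manual z-score:", np.allclose(scaled_data, manual))

# Statistics learned during fit, one value per column
print("Column means:", scaler.mean_)
print("Column standard deviations:", scaler.scale_)

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(scaled_data)
print("Round trip recovers the input:", np.allclose(restored, data))
```

In practice the scaler is usually fitted on the training data only, and the same fitted object is then applied to validation and test data with `transform`, so that no information from held-out data leaks into the statistics.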

Label Encoding

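LabelEncoder converts categorical text labels into integers, assigning codes in sorted order of the unique labels. It is intended for target labels (y); for categorical input features, OneHotEncoder is usually the better fit: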

from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print("Original labels:", labels)
print("Encoded labels:", encoded_labels)
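# Build the mapping with plain Python types so it prints cleanly on any NumPy version
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print("Label mapping:", mapping)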

Original labels: ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
Encoded labels: [0 1 2 0 1 2]
Label mapping: {'cat': 0, 'dog': 1, 'fish': 2}
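A fitted encoder can also map codes back to the original labels, and it rejects labels it did not see during fitting. The sketch below illustrates both behaviors with the same label list; the label `'bird'` is a hypothetical unseen value chosen for the demonstration.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

# Decode the integer codes back to the original strings
decoded = encoder.inverse_transform(encoded_labels)
print("Decoded labels:", [str(label) for label in decoded])

# A label that was not present during fit cannot be encoded
try:
    encoder.transform(['bird'])
except ValueError as exc:
    print("Unseen label rejected:", exc)
```

This is worth keeping in mind when new categories can appear at prediction time, since the encoder has to be refitted or the unseen values handled separately.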

Common Preprocessing Techniques

Technique	Purpose	Scikit-learn Class
Binarization	Convert to binary values	Binarizer
Scaling	Normalize feature ranges	StandardScaler, MinMaxScaler
Label Encoding	Convert categories to numbers	LabelEncoder
One-hot Encoding	Create binary columns	OneHotEncoder
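MinMaxScaler and OneHotEncoder appear in the table but have no example above, so here is a minimal sketch of both. It reuses the numeric array from the scaling section and a small hypothetical `animals` column; default settings are assumed throughout.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# MinMaxScaler rescales each column to the [0, 1] range by default
data = np.array([[100, 0.001],
                 [200, 0.005],
                 [300, 0.003]])
minmax_scaled = MinMaxScaler().fit_transform(data)
print("Min-max scaled data:")
print(minmax_scaled)

# OneHotEncoder expects 2-D input: one row per sample, one column per feature
animals = np.array([['cat'], ['dog'], ['fish'], ['cat']])
encoder = OneHotEncoder()

# The default output is a sparse matrix, so convert it to a dense array for display
one_hot = encoder.fit_transform(animals).toarray()
print("Categories:", encoder.categories_)
print("One-hot encoded data:")
print(one_hot)
```

Each row of the one-hot output contains a single 1 in the column of its category, so no artificial ordering is implied between categories, unlike with integer codes.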

Conclusion

Scikit-learn's preprocessing module provides essential tools for data cleaning and transformation, including binarization, feature scaling, and label encoding. Careful preprocessing matters because many models expect numeric, consistently scaled inputs.

AmitDiwan

Updated on: 2026-03-25T13:16:54+05:30
