How can scikit learn library be used to preprocess data in Python?
Data preprocessing is the process of cleaning and transforming raw data into a format suitable for machine learning algorithms. The scikit-learn library provides tools to impute missing values, scale features, encode categorical variables, and binarize numeric data.
Real-world data often contains inconsistencies, missing values, outliers, and features with different scales. Many algorithms are sensitive to these issues, so preprocessing gives the model clean, consistently scaled input to learn from.
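As a small illustration of the missing-value handling mentioned above, here is a minimal sketch using `SimpleImputer` from `sklearn.impute`. The toy array `raw` and the choice of the mean strategy are assumptions made for this example only:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: np.nan marks the missing entries.
raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

# Replace each missing value with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(raw)

print(filled)
```

Other strategies such as `"median"` or `"most_frequent"` may suit skewed or categorical columns better.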
Binarization
Binarization converts numerical values to binary (0 or 1) based on a threshold. Values strictly greater than the threshold become 1, while values less than or equal to it become 0, so the value 0.5 in the example below maps to 0:
import numpy as np
from sklearn import preprocessing
input_data = np.array([[34.78, 31.9, -65.5],
[-16.5, 2.45, -83.5],
[0.5, -87.98, 45.62],
[5.9, 2.38, -55.82]])
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("Values converted from numeric to Boolean:")
print(data_binarized)
Values converted from numeric to Boolean:
[[1. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]
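To make the threshold rule concrete, the sketch below compares `Binarizer` with a plain NumPy comparison. The small `values` array is a made-up example chosen to include entries exactly equal to the threshold:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data with some entries exactly at the threshold (0.5).
values = np.array([[0.4, 0.5, 0.6],
                   [-1.0, 2.0, 0.5]])

by_sklearn = Binarizer(threshold=0.5).fit_transform(values)

# Manual equivalent: strictly greater than the threshold -> 1.0, otherwise 0.0.
by_hand = (values > 0.5).astype(float)

print(by_sklearn)
print(np.array_equal(by_sklearn, by_hand))
```

Entries equal to the threshold end up as 0 in both versions, which matches the behavior seen for 0.5 in the example above.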
Feature Scaling
StandardScaler standardizes each feature to zero mean and unit variance, which helps scale-sensitive algorithms, such as gradient-based and distance-based methods, train more reliably:
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[100, 0.001],
[200, 0.005],
[300, 0.003]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nScaled data:")
print(scaled_data)
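Original data:
[[1.e+02 1.e-03]
 [2.e+02 5.e-03]
 [3.e+02 3.e-03]]

Scaled data:
[[-1.22474487 -1.22474487]
 [ 0.          1.22474487]
 [ 1.22474487  0.        ]]
Each column now has mean 0 and standard deviation 1. The exact display can vary with floating-point rounding and NumPy print settings; for instance, a value that is zero in theory may print as a tiny number in scientific notation.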
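The standardization performed here can be sketched by hand as z = (x - mean) / std, computed per column with the population standard deviation. The example below checks that equivalence and then reuses the fitted statistics on a new row; `new_sample` is a hypothetical unseen observation added for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[100.0, 0.001],
                  [200.0, 0.005],
                  [300.0, 0.003]])
new_sample = np.array([[250.0, 0.004]])  # hypothetical unseen row

# Learn the per-column mean and standard deviation from the training data.
scaler = StandardScaler().fit(train)

# Manual equivalent: z = (x - mean) / std, using the population std (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(scaler.transform(train), manual))

# Apply the statistics learned from the training data to new rows.
print(scaler.transform(new_sample))
```

Fitting on the training data and only calling `transform` on new data keeps information from unseen rows out of the learned mean and variance.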
Label Encoding
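LabelEncoder converts categorical text labels into integer codes, assigned in sorted order of the class names. It is intended for target labels rather than input features: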
from sklearn.preprocessing import LabelEncoder
labels = ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
print("Original labels:", labels)
print("Encoded labels:", encoded_labels)
# classes_ is sorted, so each label's position is its encoded value
print("Label mapping:", {str(label): index for index, label in enumerate(encoder.classes_)})
Original labels: ['cat', 'dog', 'fish', 'cat', 'dog', 'fish']
Encoded labels: [0 1 2 0 1 2]
Label mapping: {'cat': 0, 'dog': 1, 'fish': 2}
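A fitted encoder can also map codes back to the original labels with `inverse_transform`, and it rejects labels it has not seen. A minimal sketch, where the label `'bird'` is a made-up unseen value:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['cat', 'dog', 'fish'])

# Encode labels, then recover the original strings from the codes.
codes = encoder.transform(['fish', 'cat'])
restored = encoder.inverse_transform(codes)
print(list(codes))
print(list(restored))

# Labels that were not present during fit raise a ValueError.
try:
    encoder.transform(['bird'])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print("Unseen label rejected:", unseen_rejected)
```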
Common Preprocessing Techniques
| Technique | Purpose | Scikit-learn Class |
|---|---|---|
| Binarization | Convert to binary values | Binarizer |
| Scaling | Standardize or rescale feature ranges | StandardScaler, MinMaxScaler |
| Label Encoding | Convert categories to numbers | LabelEncoder |
| One-hot Encoding | Create one binary column per category | OneHotEncoder |
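The table lists MinMaxScaler and OneHotEncoder, which are not demonstrated above. The following is a minimal sketch of both; the `numbers` and `animals` arrays are illustrative data chosen for this example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# MinMaxScaler: rescale each column to the [0, 1] range by default.
numbers = np.array([[10.0], [20.0], [40.0]])
rescaled = MinMaxScaler().fit_transform(numbers)
print(rescaled.ravel())

# OneHotEncoder: one binary column per category (expects 2D input).
animals = np.array([['cat'], ['dog'], ['fish'], ['cat']])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(animals).toarray()  # default output is sparse
print(encoder.categories_)
print(one_hot)
```

Unlike LabelEncoder, one-hot encoding does not imply an ordering between categories, which usually makes it the safer choice for nominal input features.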
Conclusion
Scikit-learn's preprocessing module provides essential tools for transforming data before training, including binarization, scaling, and encoding. Because many models are sensitive to feature scale and require numeric input, applying these steps consistently is often a prerequisite for good results.
