How to binarize the data using Python Scikit-learn?
Binarization is a preprocessing technique that converts numerical data into binary values (0 and 1). The scikit-learn function sklearn.preprocessing.binarize() transforms data based on a threshold value: feature values less than or equal to the threshold become 0, while values above it become 1.
In this tutorial, we will learn to binarize data and sparse matrices using Scikit-learn in Python.
Basic Data Binarization
Let's see how to binarize array-like data (here, a nested list of numbers) using the Binarizer class:
# Importing the necessary packages
import numpy as np
from sklearn import preprocessing
# Sample data
X = [[0.4, -1.8, 2.9],
     [2.5, 0.9, 0.3],
     [0.0, 1.0, -1.5],
     [0.1, 2.9, 5.9]]
# Create binarizer with threshold 0.5
binarizer = preprocessing.Binarizer(threshold=0.5)
binarized_data = binarizer.transform(X)
print("Original data:")
print(np.array(X))
print("\nBinarized data (threshold=0.5):")
print(binarized_data)
Original data:
[[ 0.4 -1.8  2.9]
 [ 2.5  0.9  0.3]
 [ 0.   1.  -1.5]
 [ 0.1  2.9  5.9]]

Binarized data (threshold=0.5):
[[0. 0. 1.]
 [1. 1. 0.]
 [0. 1. 0.]
 [0. 1. 1.]]
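Thresholding learns nothing from the data, so one Binarizer object can be reused on any number of arrays. The following is a minimal sketch of that reuse; the arrays X_train and X_new are made-up illustrations and not part of the example above.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data, invented for this illustration
X_train = np.array([[0.2, 1.4],
                    [0.9, 0.5]])
X_new = np.array([[0.6, 0.1]])

binarizer = Binarizer(threshold=0.5)

# fit() has nothing to learn here, so fit_transform() gives the
# same result as calling transform() directly
train_bin = binarizer.fit_transform(X_train)

# The same object applies the same threshold to later data
new_bin = binarizer.transform(X_new)

print(train_bin)
print(new_bin)
```

Note that the value 0.5 in X_train sits exactly on the threshold and therefore maps to 0.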
Using the binarize() Function
You can also use the standalone binarize() function for direct transformation:
from sklearn.preprocessing import binarize
import numpy as np
# Sample data
data = [[1.2, -0.5, 2.1],
        [0.3, 1.8, 0.7]]
# Binarize with threshold 1.0
result = binarize(data, threshold=1.0)
print("Original data:")
print(np.array(data))
print("\nBinarized data (threshold=1.0):")
print(result)
Original data:
[[ 1.2 -0.5  2.1]
 [ 0.3  1.8  0.7]]

Binarized data (threshold=1.0):
[[1. 0. 1.]
 [0. 1. 0.]]
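For dense input, the threshold rule can be checked against a plain NumPy comparison. This sketch uses a slightly modified, made-up array in which one value lies exactly on the threshold, to show that the comparison is strictly "greater than".

```python
import numpy as np
from sklearn.preprocessing import binarize

# Illustrative data; the 1.0 in the second row equals the threshold
data = np.array([[1.2, -0.5, 2.1],
                 [0.3, 1.0, 0.7]])
threshold = 1.0

sk_result = binarize(data, threshold=threshold)

# Equivalent rule written by hand: strictly greater than the threshold
np_result = (data > threshold).astype(data.dtype)

print(sk_result)
print(np.array_equal(sk_result, np_result))
```

The value equal to the threshold becomes 0 in both results, which matches the rule described at the start of the article.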
Binarizing Sparse Matrices
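Sparse matrices contain mostly zero values and are memory-efficient because the zeros aren't stored. You can binarize sparse matrices using scikit-learn, but the threshold must be non-negative: with a negative threshold, every unstored zero would map to 1 and the result would no longer be sparse.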
Creating and Binarizing a Sparse Matrix
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize
# Create a sparse matrix
data = np.array([0.1, 0.8, 1.5, 0.3, 2.1])
row = np.array([0, 0, 1, 1, 2])
col = np.array([0, 2, 1, 3, 2])
sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 4))
print("Original sparse matrix:")
print(sparse_matrix.toarray())
# Binarize with threshold 0.5
binarized_sparse = binarize(sparse_matrix, threshold=0.5)
print("\nBinarized sparse matrix (threshold=0.5):")
print(binarized_sparse.toarray())
Original sparse matrix:
[[0.1 0.  0.8 0. ]
 [0.  1.5 0.  0.3]
 [0.  0.  2.1 0. ]]

Binarized sparse matrix (threshold=0.5):
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]]
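It can be useful to confirm that the result of binarizing a sparse matrix is itself sparse. The sketch below rebuilds the same matrix from a dense array and inspects the output type and the number of stored entries. How many entries remain stored after binarization is an implementation detail of scikit-learn, so the code prints the counts without relying on them.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import binarize

# Same values as the matrix above, built from a dense array
sparse_matrix = csr_matrix(np.array([[0.1, 0.0, 0.8, 0.0],
                                     [0.0, 1.5, 0.0, 0.3],
                                     [0.0, 0.0, 2.1, 0.0]]))

binarized_sparse = binarize(sparse_matrix, threshold=0.5)

# The output is still a SciPy sparse matrix, not a dense array
print(issparse(binarized_sparse))

# Stored entries before and after binarization
print(sparse_matrix.nnz, binarized_sparse.nnz)

# Number of ones in the result
print(int(binarized_sparse.sum()))
```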
Threshold Restriction for Sparse Matrices
When working with sparse matrices, the threshold cannot be negative. Here's what happens when you try:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize
# Create sparse matrix
sparse_matrix = csr_matrix([[0.1, 0.8], [1.2, 0.0]])
# This will raise an error
try:
    result = binarize(sparse_matrix, threshold=-0.5)
except ValueError as e:
    print("Error:", e)
Error: Cannot binarize a sparse matrix with threshold < 0
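If a negative threshold is really needed, one possible workaround is to convert the matrix to a dense array first. This is only a sketch and assumes the matrix is small enough to fit in memory in dense form, since the conversion gives up the memory savings of the sparse format.

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize

sparse_matrix = csr_matrix([[0.1, 0.8], [1.2, 0.0]])

# Assumption: the dense form fits comfortably in memory
dense = sparse_matrix.toarray()

# Dense input accepts a negative threshold
result = binarize(dense, threshold=-0.5)
print(result)
```

Every value here, including the zero, is greater than -0.5, so the whole result is filled with ones. This shows why a negative threshold cannot be represented sparsely.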
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| threshold | Feature values ≤ threshold become 0, values > threshold become 1 | 0.0 |
| copy | Whether to create a copy or modify in-place | True |
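The effect of the copy parameter can be sketched as follows. The array X is made up for illustration. With copy=False, scikit-learn tries to avoid a copy when the input is already a suitable NumPy array, so in-place modification should be treated as likely rather than guaranteed. The code therefore reports whether memory is shared instead of assuming it.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical float data, invented for this illustration
X = np.array([[0.2, 1.5],
              [0.7, 0.1]])

# Default copy=True: the input array is left untouched
X_kept = X.copy()
out_copy = binarize(X_kept, threshold=0.5)
print(np.array_equal(X_kept, X))

# copy=False: the input may be overwritten with the binarized values
X_inplace = X.copy()
out_inplace = binarize(X_inplace, threshold=0.5, copy=False)
print(np.shares_memory(out_inplace, X_inplace))

print(out_copy)
```

Use copy=False only when the original values are no longer needed.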
Conclusion
Binarization converts numerical data to binary values based on a threshold. Use preprocessing.Binarizer() for reusable transformers or binarize() for direct conversion. Remember that sparse matrices require non-negative threshold values.
