How to binarize the data using Python Scikit-learn?

Binarization is a preprocessing technique that converts numerical data into binary values (0 and 1). The scikit-learn function sklearn.preprocessing.binarize() compares each feature value against a threshold: values less than or equal to the threshold become 0, while values above it become 1.

In this tutorial, we will learn to binarize data and sparse matrices using Scikit-learn in Python.
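
As a rough illustration of the thresholding rule, the sketch below reproduces it with a plain NumPy comparison and checks the result against binarize(). The sample values and variable names are made up for this illustration; note that a value sitting exactly on the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical sample values; 0.5 sits exactly on the threshold
values = np.array([[0.2, 0.5, 0.7],
                   [1.5, -0.3, 0.5]])
threshold = 0.5

# Strictly greater than the threshold -> 1.0, everything else -> 0.0
manual = (values > threshold).astype(float)
library = binarize(values, threshold=threshold)

print(manual)
print("Same as binarize():", np.array_equal(manual, library))
```

The comparison is strict, so only 0.7 and 1.5 end up as 1 here.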

Basic Data Binarization

Let's see how to binarize a 2D list of numbers using the Binarizer class:

# Importing the necessary packages
import numpy as np
from sklearn import preprocessing

# Sample data
X = [[0.4, -1.8, 2.9],
     [2.5, 0.9, 0.3], 
     [0.0, 1.0, -1.5],
     [0.1, 2.9, 5.9]]

# Create binarizer with threshold 0.5
binarizer = preprocessing.Binarizer(threshold=0.5)
binarized_data = binarizer.transform(X)

print("Original data:")
print(np.array(X))
print("\nBinarized data (threshold=0.5):")
print(binarized_data)

Original data:
[[ 0.4 -1.8  2.9]
 [ 2.5  0.9  0.3]
 [ 0.   1.  -1.5]
 [ 0.1  2.9  5.9]]

Binarized data (threshold=0.5):
[[0. 0. 1.]
 [1. 1. 0.]
 [0. 1. 0.]
 [0. 1. 1.]]
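
Because Binarizer follows the usual transformer interface, the same object can be applied to more than one dataset. The sketch below assumes two small arrays, X_train and X_new, which are hypothetical and exist only for this illustration. Binarizer is stateless, so fit() does not learn anything from the data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical datasets, used only for illustration
X_train = np.array([[0.4, 2.9],
                    [2.5, 0.3]])
X_new = np.array([[0.6, 0.5],
                  [-1.0, 3.2]])

binarizer = Binarizer(threshold=0.5)

# fit_transform() is provided for API consistency; nothing is learned
train_bin = binarizer.fit_transform(X_train)

# The same threshold is applied to any later data
new_bin = binarizer.transform(X_new)

print(train_bin)
print(new_bin)
```

This stateless design is also why calling transform() without a prior fit() works for this transformer.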

Using the binarize() Function

You can also use the standalone binarize() function for direct transformation:

from sklearn.preprocessing import binarize
import numpy as np

# Sample data
data = [[1.2, -0.5, 2.1],
        [0.3, 1.8, 0.7]]

# Binarize with threshold 1.0
result = binarize(data, threshold=1.0)

print("Original data:")
print(np.array(data))
print("\nBinarized data (threshold=1.0):")
print(result)

Original data:
[[ 1.2 -0.5  2.1]
 [ 0.3  1.8  0.7]]

Binarized data (threshold=1.0):
[[1. 0. 1.]
 [0. 1. 0.]]

Binarizing Sparse Matrices

Sparse matrices contain mostly zero values and are memory-efficient because only the non-zero entries are stored. Scikit-learn can binarize sparse matrices directly, but the threshold must be non-negative.

Creating and Binarizing a Sparse Matrix

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize

# Create a sparse matrix
data = np.array([0.1, 0.8, 1.5, 0.3, 2.1])
row = np.array([0, 0, 1, 1, 2])
col = np.array([0, 2, 1, 3, 2])
sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 4))

print("Original sparse matrix:")
print(sparse_matrix.toarray())

# Binarize with threshold 0.5
binarized_sparse = binarize(sparse_matrix, threshold=0.5)
print("\nBinarized sparse matrix (threshold=0.5):")
print(binarized_sparse.toarray())

Original sparse matrix:
[[0.1 0.  0.8 0. ]
 [0.  1.5 0.  0.3]
 [0.  0.  2.1 0. ]]

Binarized sparse matrix (threshold=0.5):
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]]
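
To see what happens to the storage, the following sketch rebuilds the same matrix and inspects the result. It relies on SciPy's issparse() helper and the nnz attribute (the number of stored entries). The expectation that the output stays sparse and that entries at or below the threshold are dropped from storage reflects current scikit-learn behaviour as far as I know, so treat it as an assumption to verify on your version.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import binarize

# Same matrix as in the example above
data = np.array([0.1, 0.8, 1.5, 0.3, 2.1])
row = np.array([0, 0, 1, 1, 2])
col = np.array([0, 2, 1, 3, 2])
sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 4))

binarized_sparse = binarize(sparse_matrix, threshold=0.5)

# The result is still a sparse matrix, not a dense array
print("Still sparse:", issparse(binarized_sparse))

# Stored entries before and after binarization
print("Stored entries before:", sparse_matrix.nnz)
print("Stored entries after:", binarized_sparse.nnz)
```

With the default copy=True, the original sparse_matrix is left unchanged.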

Threshold Restriction for Sparse Matrices

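When working with sparse matrices, the threshold cannot be negative. With a negative threshold, every implicit zero would be greater than the threshold and would have to become 1, which would turn the sparse matrix into a dense one. Here's what happens when you try: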

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize

# Create sparse matrix
sparse_matrix = csr_matrix([[0.1, 0.8], [1.2, 0.0]])

# This will raise an error
try:
    result = binarize(sparse_matrix, threshold=-0.5)
except ValueError as e:
    print("Error:", e)

Error: Cannot binarize a sparse matrix with threshold < 0
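
If a negative threshold is really needed, one possible workaround is to convert the matrix to a dense array first. This is only a sketch: toarray() materializes every zero, so it assumes the matrix is small enough to fit in memory.

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import binarize

sparse_matrix = csr_matrix([[0.1, 0.8], [1.2, 0.0]])

# Converting to dense makes the implicit zeros explicit, so they can be
# compared against a negative threshold. This may use a lot of memory.
dense = sparse_matrix.toarray()
result = binarize(dense, threshold=-0.5)

print(result)
```

Every value here, including the zero, is greater than -0.5, so the whole result is 1.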

Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `threshold` | Feature values ≤ threshold become 0, values > threshold become 1 | 0.0 |
| `copy` | If True, work on a copy of the input; if False, try to binarize in place | True |
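
The effect of the copy parameter can be sketched as follows. The array X is a hypothetical example. In-place operation is best-effort: the assumption here is that a float NumPy array passed with copy=False is modified directly, while other inputs (such as Python lists) are still converted and therefore copied.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical float array; copy=False lets binarize() overwrite it
X = np.array([[0.2, 1.5],
              [3.0, 0.0]])

result = binarize(X, threshold=1.0, copy=False)

# X itself now holds the binarized values
print(X)
print("Shares memory with input:", np.shares_memory(X, result))
```

Keeping the default copy=True is the safer choice whenever the original values are still needed afterwards.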

Conclusion

Binarization converts numerical data to binary values based on a threshold. Use preprocessing.Binarizer() for reusable transformers or binarize() for direct conversion. Remember that sparse matrices require non-negative threshold values.

Gaurav Leekha

Updated on: 2026-03-26T22:12:47+05:30
