Pandas - Interaction with scipy.sparse



Pandas provides functionality for handling sparse data in both DataFrames and Series. Whether you are converting SciPy sparse matrices to Pandas objects or managing sparse Series, these methods help you use memory efficiently while keeping the flexibility of Pandas data structures. This is particularly useful when working with large or high-dimensional datasets in which most values are zero or missing.

In this tutorial, we will learn how to handle sparse data with Pandas and scipy.sparse objects, including conversions between SciPy sparse matrices and Pandas DataFrames and Series.

Converting a Sparse Matrix to a Sparse DataFrame

Pandas provides the DataFrame.sparse.from_spmatrix() method to easily convert a SciPy sparse matrix into a Pandas DataFrame with sparse values.

Example

Here is a basic example of creating a sparse DataFrame from a sparse matrix. In this example, we first create a sparse matrix using scipy.sparse.csr_matrix() and then convert it into a sparse DataFrame using the DataFrame.sparse.from_spmatrix() method.

from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd

# Create a random array and set about 90% of its values to zero
arr = np.random.random(size=(100, 5))
arr[arr < 0.9] = 0

# Convert the array to a CSR (Compressed Sparse Row) matrix
sp_arr = csr_matrix(arr)
print("Input array of the Sparse matrix:")
print(sp_arr)

# Create a sparse DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

print("\nOutput Sparse DataFrame from the sparse matrix:")
print(sdf.head())
print("\nData Type of Each Column:")
print(sdf.dtypes)

Following is the output of the above code (the values will differ on each run because the array is random) −

Input array of the Sparse matrix:
  (1, 1)	0.9137277003859157
  (3, 4)	0.9263452511427155
  (6, 3)	0.9832401131805535
  (8, 4)	0.9751146163778053
  (10, 2)	0.9696967110498368
  (12, 0)	0.9326065132533555
  (12, 1)	0.9990853392619211
  (13, 4)	0.9973662602891205
  (15, 0)	0.9079549759762293
  (15, 1)	0.9334145404658547
  (18, 0)	0.9921968603425697
  (19, 4)	0.9398274379484053
  (20, 4)	0.9795110690144339
  (22, 0)	0.9526258283069171
  (22, 3)	0.9218543168446979
  (23, 0)	0.9869169842331807
  (23, 3)	0.9869393496783473
  (24, 1)	0.9358536531710068
  (27, 1)	0.9250891723993367
  (29, 0)	0.9935875472700555
  (29, 4)	0.9824805592612114
  (32, 3)	0.9141945434103862
  (35, 4)	0.9154158109572998
  (37, 0)	0.9784189117255467
  (37, 4)	0.9253723150674816
  (38, 4)	0.9793466184464948
  (40, 2)	0.9534016769461144
  (50, 0)	0.9286513214811297
  (50, 3)	0.9065906405639927
  (54, 2)	0.9772390891062281
  (56, 3)	0.9243510758420271
  (67, 2)	0.983578938624393
  (69, 3)	0.9810396613781989
  (70, 1)	0.9232051506012988
  (70, 3)	0.9909535064779623
  (72, 3)	0.9015247637186882
  (77, 1)	0.9709105762700359
  (80, 1)	0.9573836327480426
  (83, 0)	0.9638579367616993
  (83, 3)	0.9423143492693533
  (83, 4)	0.9825050316896803
  (84, 2)	0.9507158381012188
  (86, 2)	0.9224509384078009
  (91, 2)	0.9434356087077086
  (91, 4)	0.9039509185806063
  (96, 4)	0.9704980927833841
  (98, 0)	0.9465290724610705
  (98, 1)	0.9987570168197035
  (99, 0)	0.9157188758677448

Output Sparse DataFrame from the sparse matrix:
     0         1    2    3         4
0  0.0  0.000000  0.0  0.0  0.000000
1  0.0  0.913728  0.0  0.0  0.000000
2  0.0  0.000000  0.0  0.0  0.000000
3  0.0  0.000000  0.0  0.0  0.926345
4  0.0  0.000000  0.0  0.0  0.000000

Data Type of Each Column:
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object
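
To make the memory benefit mentioned at the start of this tutorial concrete, the following sketch compares the footprint of a sparse DataFrame with its dense equivalent. The fixed seed and the 10,000 x 5 shape are assumptions chosen only for illustration, and the exact byte counts depend on your Pandas version and platform.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Seeded generator so the example is repeatable (the seed is arbitrary)
rng = np.random.default_rng(0)
arr = rng.random(size=(10_000, 5))
arr[arr < 0.9] = 0  # keep roughly 10% of the values

sdf = pd.DataFrame.sparse.from_spmatrix(csr_matrix(arr))
dense_df = sdf.sparse.to_dense()

# Fraction of values that are actually stored in the sparse DataFrame
print("Density:", sdf.sparse.density)

# Total bytes used by each representation
sparse_bytes = sdf.memory_usage(deep=True).sum()
dense_bytes = dense_df.memory_usage(deep=True).sum()
print("Sparse bytes:", sparse_bytes)
print("Dense bytes:", dense_bytes)
```

With about 10% of the values stored, the sparse DataFrame should need far fewer bytes than the dense one, since it keeps only the stored values and their positions.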

Converting Sparse DataFrame to SciPy COO Matrix

A sparse DataFrame can be converted back to a SciPy sparse matrix in COO format using the DataFrame.sparse.to_coo() method.

Example

This example uses the DataFrame.sparse.to_coo() method for converting the sparse DataFrame back to a SciPy sparse matrix.

from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd

# Create a random array and set about 90% of its values to zero
arr = np.random.random(size=(100, 5))
arr[arr < 0.9] = 0

# Convert the array to a CSR (Compressed Sparse Row) matrix
sp_arr = csr_matrix(arr)

# Create a sparse DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

print("Input Sparse DataFrame:")
print(sdf.head())

# Convert the sparse DataFrame back to a SciPy sparse matrix in COO format
coo_matrix = sdf.sparse.to_coo()

print("\nOutput SciPy sparse matrix in COO format:")
print(coo_matrix)

Following is the output of the above code (the values will differ on each run because the array is random) −

Input Sparse DataFrame:
     0         1         2    3         4
0  0.0  0.000000  0.908423  0.0  0.000000
1  0.0  0.000000  0.000000  0.0  0.000000
2  0.0  0.000000  0.000000  0.0  0.000000
3  0.0  0.963157  0.000000  0.0  0.926345
4  0.0  0.000000  0.000000  0.0  0.000000

Output SciPy sparse matrix in COO format:
  (15, 0)	0.9565290830076498
  (17, 0)	0.9993362833967224
  (30, 0)	0.9448638895454046
  (31, 0)	0.9402087004435209
  (58, 0)	0.9508291602896428
  (83, 0)	0.9640231591519308
  (97, 0)	0.9480057964425729
  (2, 1)	0.9527820297479814
  (6, 1)	0.9056492958192004
  (12, 1)	0.9379960920971506
  (20, 1)	0.9015685427514573
  (65, 1)	0.9983971860655619
  (82, 1)	0.9124624680792796
  (89, 1)	0.9068317839072992
  (94, 1)	0.9255111334071213
  (98, 1)	0.9225964783729922
  (8, 2)	0.9667760342577757
  (9, 2)	0.9702359301169143
  (43, 2)	0.9642705326313852
  (74, 2)	0.9259533004624514
  (77, 2)	0.9357506814171687
  (6, 3)	0.925468994786992
  (22, 3)	0.9500643025248764
  (26, 3)	0.9747272350720176
  (53, 3)	0.9035753562402721
  (70, 3)	0.9800270940419613
  (94, 3)	0.9711182636213348
  (13, 4)	0.9735913964763091
  (24, 4)	0.9728154298260043
  (41, 4)	0.9066571542352223
  (67, 4)	0.9524203675383792
  (68, 4)	0.9102136594380712
  (76, 4)	0.973817970417395
  (79, 4)	0.9149958113485477
  (88, 4)	0.9012937048788182
  (91, 4)	0.9682458600689072
  (92, 4)	0.9472791016324659
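
One way to confirm that the conversion loses no information is a round trip from SciPy to Pandas and back. The sketch below uses a fixed seed (an arbitrary choice) and compares the matrices through their dense arrays, which is reasonable only because this example is small.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Seeded generator so the example is repeatable (the seed is arbitrary)
rng = np.random.default_rng(1)
arr = rng.random(size=(100, 5))
arr[arr < 0.9] = 0

sp_arr = csr_matrix(arr)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

# to_coo() returns a COO matrix; convert it back to CSR
round_trip = sdf.sparse.to_coo().tocsr()

# Compare the dense forms (fine for a small matrix like this one)
same = np.array_equal(sp_arr.toarray(), round_trip.toarray())
print("Shapes:", sp_arr.shape, round_trip.shape)
print("Round trip preserved every value:", same)
```

If you need a format other than COO, SciPy conversion methods such as tocsr() or tocsc() can be applied to the returned matrix.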

Converting a Sparse Series with MultiIndex to COO Format

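The Series.sparse.to_coo() method also allows you to convert a sparse Series with a MultiIndex into a SciPy sparse COO matrix. The row_levels and column_levels parameters select which index levels form the rows and columns of the matrix.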

Example

This example demonstrates converting a Pandas Series with a MultiIndex into a SciPy sparse COO matrix using the Series.sparse.to_coo() method.

import numpy as np
import pandas as pd

# Create a Series and assign a four-level MultiIndex to it
s = pd.Series([1.0, np.nan, 7.0, 9.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, "a", 0),
                                     (1, 2, "a", 1),
                                     (1, 1, "b", 0),
                                     (1, 1, "b", 1),
                                     (2, 1, "b", 0),
                                     (2, 1, "b", 1)], names=["A", "B", "C", "D"])

# Convert it to Sparse Series
sparse_series = s.astype('Sparse')
print("Input MultiIndexed Sparse Series:")
print(sparse_series)

# Convert the MultiIndexed sparse Series to SciPy sparse matrix in COO format
coo_matrix, row_labels, col_labels = sparse_series.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True)

print("\nOutput SciPy sparse matrix in COO format:")
print(coo_matrix)
print("COO matrix in dense:")
print(coo_matrix.todense())
print("\nRow Labels:", row_labels)
print("Column Labels:", col_labels)

Following is the output of the above code −

Input MultiIndexed Sparse Series:
A  B  C  D
1  2  a  0    1.0
         1    NaN
   1  b  0    7.0
         1    9.0
2  1  b  0    NaN
         1    NaN
dtype: Sparse[float64, nan]

Output SciPy sparse matrix in COO format:
  (1, 0)	1.0
  (0, 2)	7.0
  (0, 3)	9.0
COO matrix in dense:
[[0. 0. 7. 9.]
 [1. 0. 0. 0.]
 [0. 0. 0. 0.]]

Row Labels: [(1, 1), (1, 2), (2, 1)]
Column Labels: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
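
The returned label lists tie each matrix position back to the original index values. As a minimal sketch, the code below pairs every stored value with its row and column labels. The names A, rows, cols, and entries are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, np.nan, 7.0, 9.0, np.nan, np.nan],
    index=pd.MultiIndex.from_tuples(
        [(1, 2, "a", 0), (1, 2, "a", 1), (1, 1, "b", 0),
         (1, 1, "b", 1), (2, 1, "b", 0), (2, 1, "b", 1)],
        names=["A", "B", "C", "D"],
    ),
)

A, rows, cols = s.astype("Sparse").sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)

# Pair each stored value with the labels at its row and column position
entries = {
    (tuple(rows[i]), tuple(cols[j])): float(v)
    for i, j, v in zip(A.row, A.col, A.data)
}
for (row_label, col_label), value in entries.items():
    print(row_label, col_label, value)
```

Each printed line shows the (A, B) row label, the (C, D) column label, and the value stored at that position, so the three non-missing values of the Series can be read straight from the matrix.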

Creating a Sparse Series from a COO Matrix

You can also create a Series with sparse values directly from a COO format sparse matrix using the Series.sparse.from_coo() method.

Example: Creating Basic Sparse Series from a COO Matrix

The following example demonstrates creating a Series with sparse values from a scipy.sparse.coo_matrix using the Series.sparse.from_coo() method.

import pandas as pd
from scipy import sparse

# Create a sparse COO matrix from (values, (row indices, column indices))
coo_mat = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

# Display the input sparse COO matrix
print("Input Sparse COO Matrix:")
print(coo_mat)

# Convert the sparse COO matrix to a sparse Series
ss = pd.Series.sparse.from_coo(coo_mat)

# Display the Sparse Series
print("\nSparse Series:")
print(ss)

Following is the output of the above code −

Input Sparse COO Matrix:
  (1, 0)	8.0
  (0, 2)	5.0
  (0, 3)	7.0

Sparse Series:
0  2    5.0
   3    7.0
1  0    8.0
dtype: Sparse[float64, nan]
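
The resulting Series is indexed by (row, column) pairs, so individual values can be looked up by coordinate. The following sketch shows a lookup and a few properties of the Series.sparse accessor. The variable name coo is used here only for illustration.

```python
import pandas as pd
from scipy import sparse

coo = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
ss = pd.Series.sparse.from_coo(coo)

# Look up a value by its (row, column) coordinate
print("Value at (0, 2):", ss.loc[(0, 2)])

# Number of stored values and the fill value used for everything else
print("Stored values:", ss.sparse.npoints)
print("Fill value:", ss.sparse.fill_value)

# The two index levels hold the row and column coordinates
row_coords = list(ss.index.get_level_values(0))
col_coords = list(ss.index.get_level_values(1))
print("Rows:", row_coords)
print("Columns:", col_coords)
```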

By default, the index of the Series returned by Series.sparse.from_coo() contains only the coordinates of the values stored in the matrix. If you specify dense_index=True, the index becomes the Cartesian product of the row and column coordinates of the matrix. This consumes more memory, but every row coordinate gets an entry for each column coordinate, with NaN where no value is stored.

Example: Creating Dense Indexed Sparse Series from a COO Matrix

This example creates a sparse Series with a dense index from a COO matrix by passing dense_index=True to the Series.sparse.from_coo() method.

import pandas as pd
from scipy import sparse

# Create a sparse COO matrix from (values, (row indices, column indices))
coo_mat = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

# Display the input sparse COO matrix
print("Input Sparse COO Matrix:")
print(coo_mat)

# Create a sparse Series with a dense index
ss = pd.Series.sparse.from_coo(coo_mat, dense_index=True)

# Display the Sparse Series
print("\nSparse Series with Dense Index:")
print(ss)

Following is the output of the above code −

Input Sparse COO Matrix:
  (1, 0)	8.0
  (0, 2)	5.0
  (0, 3)	7.0

Sparse Series with Dense Index:
1  0    8.0
   2    NaN
   3    NaN
0  0    NaN
   2    5.0
   3    7.0
   0    NaN
   2    5.0
   3    7.0
dtype: Sparse[float64, nan]
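
To see the cost of a dense index, the two variants can be compared side by side. This is a small sketch using the same matrix as above. The names compact and dense_idx are illustrative, and the exact number of index entries may depend on your Pandas version.

```python
import pandas as pd
from scipy import sparse

coo = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

compact = pd.Series.sparse.from_coo(coo)
dense_idx = pd.Series.sparse.from_coo(coo, dense_index=True)

# The dense index has more entries than there are stored values
print("Entries with default index:", len(compact))
print("Entries with dense index:", len(dense_idx))

# The values keep a sparse dtype in both cases
print("Dtype:", dense_idx.dtype)
```

The extra index entries are why dense_index=True consumes more memory, even though the values themselves are still stored sparsely.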