Pandas - Interaction with scipy.sparse
Pandas provides several tools for handling sparse data in both DataFrames and Series. Whether you are converting SciPy sparse matrices to Pandas objects or working with sparse Series, these methods give you memory-efficient storage and flexibility. This is particularly useful for large or high-dimensional datasets in which most values are zero or missing.
In this tutorial, we will learn how to handle sparse data with Pandas and scipy.sparse objects, including conversions between SciPy sparse matrices and sparse Pandas DataFrames and Series.
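As a rough illustration of the memory savings mentioned above, the following sketch compares a dense DataFrame with a sparse copy of the same data. The array size, the random seed, and the variable names are assumptions chosen only for this illustration; actual savings depend on how sparse your data is.

```python
import numpy as np
import pandas as pd

# Build a mostly-zero array: roughly 99% of the entries are set to 0
rng = np.random.default_rng(0)
arr = rng.random((10_000, 10))
arr[arr < 0.99] = 0

dense_df = pd.DataFrame(arr)
# Store only the non-zero entries; 0 is treated as the fill value
sparse_df = dense_df.astype(pd.SparseDtype("float64", fill_value=0))

dense_bytes = dense_df.memory_usage(index=False).sum()
sparse_bytes = sparse_df.memory_usage(index=False).sum()

print("Dense bytes: ", dense_bytes)
print("Sparse bytes:", sparse_bytes)
print("Density:", sparse_df.sparse.density)
```

The density value is the fraction of entries that are actually stored, so a lower density generally means a larger saving.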
Converting a Sparse Matrix to a Sparse DataFrame
Pandas provides the DataFrame.sparse.from_spmatrix() method to easily convert a SciPy sparse matrix into a Pandas DataFrame with sparse values.
Example
Here is a basic example of creating a sparse DataFrame from a sparse matrix. We first create a sparse matrix using scipy.sparse.csr_matrix() and then convert it into a sparse DataFrame using the DataFrame.sparse.from_spmatrix() method.
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
# Create a random array in which most values are zero
arr = np.random.random(size=(100, 5))
arr[arr < 0.9] = 0
# Convert the array to a CSR (Compressed Sparse Row) matrix
sp_arr = csr_matrix(arr)
print("Input array of the Sparse matrix:")
print(sp_arr)
# Create a sparse DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
print("\nOutput Sparse DataFrame from the sparse matrix:")
print(sdf.head())
print("\nData Type of Each Column:")
print(sdf.dtypes)
Following is the output of the above code −
Input array of the Sparse matrix:
  (1, 1)    0.9137277003859157
  (3, 4)    0.9263452511427155
  (6, 3)    0.9832401131805535
  (8, 4)    0.9751146163778053
  (10, 2)   0.9696967110498368
  (12, 0)   0.9326065132533555
  (12, 1)   0.9990853392619211
  (13, 4)   0.9973662602891205
  (15, 0)   0.9079549759762293
  (15, 1)   0.9334145404658547
  (18, 0)   0.9921968603425697
  (19, 4)   0.9398274379484053
  (20, 4)   0.9795110690144339
  (22, 0)   0.9526258283069171
  (22, 3)   0.9218543168446979
  (23, 0)   0.9869169842331807
  (23, 3)   0.9869393496783473
  (24, 1)   0.9358536531710068
  (27, 1)   0.9250891723993367
  (29, 0)   0.9935875472700555
  (29, 4)   0.9824805592612114
  (32, 3)   0.9141945434103862
  (35, 4)   0.9154158109572998
  (37, 0)   0.9784189117255467
  (37, 4)   0.9253723150674816
  (38, 4)   0.9793466184464948
  (40, 2)   0.9534016769461144
  (50, 0)   0.9286513214811297
  (50, 3)   0.9065906405639927
  (54, 2)   0.9772390891062281
  (56, 3)   0.9243510758420271
  (67, 2)   0.983578938624393
  (69, 3)   0.9810396613781989
  (70, 1)   0.9232051506012988
  (70, 3)   0.9909535064779623
  (72, 3)   0.9015247637186882
  (77, 1)   0.9709105762700359
  (80, 1)   0.9573836327480426
  (83, 0)   0.9638579367616993
  (83, 3)   0.9423143492693533
  (83, 4)   0.9825050316896803
  (84, 2)   0.9507158381012188
  (86, 2)   0.9224509384078009
  (91, 2)   0.9434356087077086
  (91, 4)   0.9039509185806063
  (96, 4)   0.9704980927833841
  (98, 0)   0.9465290724610705
  (98, 1)   0.9987570168197035
  (99, 0)   0.9157188758677448
Output Sparse DataFrame from the sparse matrix:
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 1 | 0.0 | 0.913728 | 0.0 | 0.0 | 0.000000 |
| 2 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 3 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.926345 |
| 4 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 |
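The from_spmatrix() method also accepts optional index and columns arguments for labeling the result. The sketch below uses a small fixed matrix so the result is reproducible; the labels "r0" to "r2" and "x" to "z" are hypothetical names chosen for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# A small, fixed 3x3 matrix with two non-zero entries
mat = csr_matrix([[0.0, 1.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [2.5, 0.0, 0.0]])

# index and columns supply row and column labels (hypothetical names)
sdf = pd.DataFrame.sparse.from_spmatrix(
    mat, index=["r0", "r1", "r2"], columns=["x", "y", "z"]
)

print(sdf)
print("\nData Type of Each Column:")
print(sdf.dtypes)
print("\nDensity:", sdf.sparse.density)
```

Each column should report a sparse dtype, and the density reflects the share of stored (non-zero) entries in the DataFrame.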
Converting Sparse DataFrame to SciPy COO Matrix
A sparse DataFrame can be converted back to a SciPy sparse matrix in COO format using the DataFrame.sparse.to_coo() method.
Example
This example uses the DataFrame.sparse.to_coo() method to convert the sparse DataFrame back to a SciPy sparse matrix.
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
# Create a random array in which most values are zero
arr = np.random.random(size=(100, 5))
arr[arr < 0.9] = 0
# Convert the array to a CSR (Compressed Sparse Row) matrix
sp_arr = csr_matrix(arr)
# Create a sparse DataFrame from the sparse matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
print("Input Sparse DataFrame:")
print(sdf.head())
# Convert the sparse DataFrame back to a SciPy sparse matrix in COO format
coo_matrix = sdf.sparse.to_coo()
print("\nOutput SciPy sparse matrix in COO format:")
print(coo_matrix)
Following is the output of the above code −
Input Sparse DataFrame:
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 0 | 0.000000 | 0.908423 | 0.0 | 0.000000 | |
| 1 | 0.000000 | 0.000000 | 0.0 | 0.000000 | |
| 2 | 0.000000 | 0.000000 | 0.0 | 0.000000 | |
| 3 | 0.963157 | 0.000000 | 0.0 | 0.926345 | |
| 4 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
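One way to check the conversion is a round trip: build a sparse DataFrame from a matrix, convert it back with to_coo(), and compare the result with the original data. The following is a minimal sketch; the random seed and variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Seeded generator so the data is reproducible
rng = np.random.default_rng(42)
arr = rng.random((100, 5))
arr[arr < 0.9] = 0
sp_arr = csr_matrix(arr)

# Matrix -> sparse DataFrame -> COO matrix
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
coo = sdf.sparse.to_coo()

print("Shape:", coo.shape, "Stored values:", coo.nnz)

# Compare the dense forms to confirm no value changed in the round trip
round_trip_ok = np.array_equal(coo.toarray(), arr)
print("Round trip preserved all values:", round_trip_ok)
```

If you need another SciPy format afterwards, the COO result can be converted with its own methods such as tocsr().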
Converting a Sparse Series with MultiIndex to COO Format
The Series.sparse.to_coo() method also allows you to convert a sparse Series with a MultiIndex into a SciPy sparse COO matrix.
Example
This example demonstrates converting a MultiIndexed sparse Series into a SciPy sparse COO matrix using the Series.sparse.to_coo() method. The row_levels and column_levels parameters select which index levels form the rows and the columns of the matrix.
import numpy as np
import pandas as pd
# Create a Series and assign it a four-level MultiIndex
s = pd.Series([1.0, np.nan, 7.0, 9.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, "a", 0),
(1, 2, "a", 1),
(1, 1, "b", 0),
(1, 1, "b", 1),
(2, 1, "b", 0),
(2, 1, "b", 1)], names=["A", "B", "C", "D"])
# Convert it to Sparse Series
sparse_series = s.astype('Sparse')
print("Input MultiIndexed Sparse Series:")
print(sparse_series)
# Convert the MultiIndexed sparse Series to SciPy sparse matrix in COO format
coo_matrix, row_labels, col_labels = sparse_series.sparse.to_coo(
row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True)
print("\nOutput SciPy sparse matrix in COO format:")
print(coo_matrix)
print("COO matrix in dense:")
print(coo_matrix.todense())
print("\nRow Labels:", row_labels)
print("Column Labels:", col_labels)
Following is the output of the above code −
Input MultiIndexed Sparse Series:
A  B  C  D
1  2  a  0    1.0
         1    NaN
   1  b  0    7.0
         1    9.0
2  1  b  0    NaN
         1    NaN
dtype: Sparse[float64, nan]
Output SciPy sparse matrix in COO format:
  (1, 0)    1.0
  (0, 2)    7.0
  (0, 3)    9.0
COO matrix in dense:
[[0. 0. 7. 9.]
 [1. 0. 0. 0.]
 [0. 0. 0. 0.]]
Row Labels: [(1, 1), (1, 2), (2, 1)]
Column Labels: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
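The row and column labels returned by Series.sparse.to_coo() can be used to translate matrix coordinates back into index labels. The sketch below rebuilds the same Series and pairs every stored value with its labels; the entries dictionary is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 7.0, 9.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [(1, 2, "a", 0), (1, 2, "a", 1), (1, 1, "b", 0),
     (1, 1, "b", 1), (2, 1, "b", 0), (2, 1, "b", 1)],
    names=["A", "B", "C", "D"],
)
ss = s.astype("Sparse")

coo, rows, cols = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)

# Pair each stored value with the labels of its row and column
entries = {
    (rows[r], cols[c]): float(v)
    for r, c, v in zip(coo.row, coo.col, coo.data)
}
for (row_label, col_label), value in entries.items():
    print("row", row_label, "col", col_label, "value", value)
```

Only the three non-NaN values appear, because NaN is the fill value of the sparse Series and is not stored in the matrix.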
Creating a Sparse Series from a COO Matrix
You can also create a Series with sparse values directly from a COO format sparse matrix using the Series.sparse.from_coo() method.
Example: Creating Basic Sparse Series from a COO Matrix
The following example demonstrates creating a Series with sparse values from a scipy.sparse.coo_matrix using the Series.sparse.from_coo() method.
import pandas as pd
from scipy import sparse
# Create a sparse COO matrix
coo_matrx = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
# Display the input sparse COO matrix
print("Input Sparse COO Matrix:")
print(coo_matrx)
# Convert the sparse COO matrix to a sparse Series
ss = pd.Series.sparse.from_coo(coo_matrx)
# Display the Sparse Series
print("\nSparse Series:")
print(ss)
Following is the output of the above code −
Input Sparse COO Matrix:
  (1, 0)    8.0
  (0, 2)    5.0
  (0, 3)    7.0
Sparse Series:
0  2    5.0
   3    7.0
1  0    8.0
dtype: Sparse[float64, nan]
By default, the Series.sparse.from_coo() method includes only the entries stored in the matrix, indexed by their row and column coordinates. If you specify dense_index=True, the resulting index is instead the Cartesian product of the matrix's row and column coordinates. This consumes more memory but makes the positions without a stored value explicit as NaN.
Example: Creating Dense Indexed Sparse Series from a COO Matrix
This example creates a dense-indexed sparse Series from a COO matrix using the Series.sparse.from_coo() method with the dense_index parameter set to True.
import pandas as pd
from scipy import sparse
# Create a sparse COO matrix
coo_matrx = sparse.coo_matrix(([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
# Display the input sparse COO matrix
print("Input Sparse COO Matrix:")
print(coo_matrx)
# Create a dense indexed sparse Series
ss = pd.Series.sparse.from_coo(coo_matrx, dense_index=True)
# Display the Sparse Series
print("\nSparse Series with Dense Index:")
print(ss)
Following is the output of the above code −
Input Sparse COO Matrix:
  (1, 0)    8.0
  (0, 2)    5.0
  (0, 3)    7.0
Sparse Series with Dense Index:
1  0    8.0
   2    NaN
   3    NaN
0  0    NaN
   2    5.0
   3    7.0
   0    NaN
   2    5.0
   3    7.0
dtype: Sparse[float64, nan]
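To see the effect of dense_index side by side, the following sketch builds both variants from the same COO matrix and compares their lengths. The variable names compact and dense are assumptions for this illustration, and the exact number of entries in the dense-indexed Series may depend on your Pandas version.

```python
import pandas as pd
from scipy import sparse

# Same 3x4 COO matrix as in the examples above
coo = sparse.coo_matrix(
    ([8.0, 5.0, 7.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
)

# Default: only the stored entries are indexed
compact = pd.Series.sparse.from_coo(coo)
# dense_index=True: index built from row and column coordinates
dense = pd.Series.sparse.from_coo(coo, dense_index=True)

print("Entries with default index:", len(compact))
print("Entries with dense_index=True:", len(dense))
print("Both are sparse:",
      isinstance(compact.dtype, pd.SparseDtype),
      isinstance(dense.dtype, pd.SparseDtype))
```

The default form is usually the better choice when memory matters, since it carries no index entries for positions that hold no value.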