Python Pandas - Advanced Parquet File Operations
Parquet is a columnar storage file format that is highly efficient for both reading and writing operations. Pandas provides advanced options for working with the Parquet file format, including data type handling, custom index management, data partitioning, and compression techniques.
In the previous tutorial, we learned the basics of the Parquet file format in Pandas, focusing on how to use it for basic operations like reading and writing data. In this tutorial, we will explore more advanced features of Parquet in Pandas, including custom handling of data types, managing indexes, partitioning data, and utilizing compression techniques for better storage efficiency.
Parquet Engines in Pandas
Pandas supports two main Parquet engines, PyArrow and Fastparquet. Each engine offers unique features that affect performance and compatibility.
The PyArrow engine provides advanced data type support, including efficient handling of complex data types and the preservation of extension types.
Fastparquet, on the other hand, is a lightweight engine that supports time zone-aware datetime objects. It is a good choice when managing datetime data with time zone support is essential.
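You can select the engine with the engine argument of the to_parquet() and read_parquet() methods. The following is a minimal sketch, assuming both the pyarrow and fastparquet packages are installed; the file name engine_demo.parquet is only illustrative.
import pandas as pd
# Create a small DataFrame for the demonstration
df = pd.DataFrame({"city": ["Delhi", "Mumbai"], "temperature": [31.5, 29.0]})
# Write the DataFrame using the PyArrow engine
df.to_parquet("engine_demo.parquet", engine="pyarrow")
# Read the same file back using the Fastparquet engine
result = pd.read_parquet("engine_demo.parquet", engine="fastparquet")
print(result)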
Handling Indexes in Pandas Parquet Files
When writing DataFrames to Parquet files, Pandas handles the DataFrame indices differently depending on the engine. By default, the index is stored along with the data.
This can lead to unexpected extra columns, especially when sharing Parquet files with other tools. Use the index parameter to specify whether the index should be saved or excluded during the write operation.
Example
This example demonstrates omitting the index when writing a Parquet file. Here, the index=False argument ensures that the index is not stored in the file.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Write DataFrame to Parquet file without index
df.to_parquet("without_index.parquet", engine="pyarrow", index=False)
# Reading the DataFrame from Parquet
result = pd.read_parquet("without_index.parquet", engine="pyarrow", dtype_backend="pyarrow")
print('\nLoaded DataFrame Parquet File:')
print(result)
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
Loaded DataFrame Parquet File:
|   | strings | numbers | dates |
|---|---|---|---|
| 0 | x | 1 | 2024-01-01 00:00:00 |
| 1 | y | 2 | 2024-01-02 00:00:00 |
| 2 | z | 3 | 2024-01-03 00:00:00 |
Handling Data Types in Parquet
The Parquet file format efficiently manages Pandas data types, including categorical and datetime data. However, some types, like Interval, are unsupported. By setting the dtype_backend argument of the read_parquet() method, you can control the data types used when the data is read back, ensuring the appropriate conversion of data types.
Example
This example demonstrates customizing the back-end data type while reading the DataFrame from a Parquet file. Here we have changed the back-end data type to pyarrow by setting the dtype_backend parameter of the read_parquet() method.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Write DataFrame to Parquet file
df.to_parquet("example_parquet_file.parquet", engine="pyarrow")
# Reading the DataFrame from Parquet with 'pyarrow' data type
result = pd.read_parquet("example_parquet_file.parquet", engine="pyarrow", dtype_backend="pyarrow")
# Display the DataFrame's data types
print("\nLoaded DataFrame's data types:")
print(result.dtypes)
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
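For comparison, the same file can also be read back with the nullable NumPy-backed dtypes instead of the PyArrow-backed ones. The following is a minimal sketch, assuming pandas 2.0 or later and the example_parquet_file.parquet file created above.
import pandas as pd
# Read the Parquet file back using the nullable NumPy dtype backend
result_nullable = pd.read_parquet("example_parquet_file.parquet", engine="pyarrow", dtype_backend="numpy_nullable")
# Display the resulting data types for comparison
print(result_nullable.dtypes)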
Data Partitioning in Parquet Files
Partitioning is a technique used to split large datasets into smaller, more manageable blocks based on the values of one or more columns. This improves data retrieval, because each segment of the file contains only the rows for a particular column value.
Partitioning can be done through the partition_cols argument of the to_parquet() method, which allows you to partition data when writing to the file.
Example
In this example, the DataFrame is partitioned by the "numbers" column using the partition_cols parameter, resulting in a Parquet directory structure where the data is stored in separate subdirectories based on the values in this column.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Partition by column 'numbers'
df.to_parquet("partitioned_data.parquet", engine="pyarrow", partition_cols=["numbers"])
print("\nPartitioned DataFrame is successfully saved in separate directories.")
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
Check your working directory: the partitioned data is saved under a parent directory named partitioned_data.parquet, with one subdirectory for each value of the numbers column.
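To load the partitioned dataset back into a single DataFrame, you can point read_parquet() at the parent directory. The following is a minimal sketch, assuming the partitioned_data.parquet directory created above exists in the working directory.
import pandas as pd
# Read all partitions back into a single DataFrame
result = pd.read_parquet("partitioned_data.parquet", engine="pyarrow")
# Note: the partition column 'numbers' may be read back as a categorical column
print(result)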
Compression Techniques in Parquet Files
The Parquet file format supports several compression techniques, including Snappy, Gzip, LZ4, and Brotli, which can be specified using the compression argument of the to_parquet() method. Compression is useful for reducing the storage size while maintaining efficient data access.
Example
The following example shows how to use the to_parquet() method to save a Pandas DataFrame as a Parquet file with compression. In this example, we apply Gzip compression to the Parquet file, which helps reduce the storage size while maintaining the ability to read and write the data efficiently.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})
print("Original DataFrame:")
print(df)
# Save the DataFrame to a parquet file with compression
df.to_parquet('compressed_data.parquet.gzip', compression='gzip')
print("\nDataFrame is saved as parquet format with compression..")
Following is the output of the above code −
Original DataFrame:
|   | Col_1 | Col_2 |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |
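When reading the file back, you do not need to specify the compression, as it is detected automatically from the file metadata. The following is a minimal sketch, assuming the compressed_data.parquet.gzip file created above exists in the working directory.
import pandas as pd
# Read the Gzip-compressed Parquet file; the compression is detected automatically
result = pd.read_parquet("compressed_data.parquet.gzip")
print(result)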