Python Pandas - Advanced Parquet File Operations
Parquet is a columnar storage file format that is highly efficient for both reading and writing operations. Pandas provides advanced options for working with the Parquet file format, including data type handling, custom index management, data partitioning, and compression techniques.
In the previous tutorial, we learned the basics of the Parquet file format in Pandas, focusing on how to use it for basic operations like reading and writing data. In this tutorial, we will explore more advanced features of Parquet in Pandas, including custom handling of data types, managing indexes, partitioning data, and utilizing compression techniques for better storage efficiency.
Parquet Engines in Pandas
Pandas supports two main Parquet engines, PyArrow and Fastparquet. Each engine offers unique features that affect performance and compatibility.
The PyArrow engine provides advanced data type support, including efficient handling of complex data types and the preservation of extension types.
Fastparquet, on the other hand, is a lightweight engine that supports time zone-aware datetime objects. It is a good choice when managing datetime data with time zone support is essential.
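You can select the engine with the engine argument of the to_parquet() and read_parquet() methods. The following is a minimal sketch, assuming both the pyarrow and fastparquet packages are installed; the file name engine_demo.parquet is only illustrative.
import pandas as pd
# Create a small DataFrame for the demonstration
df = pd.DataFrame({"city": ["Delhi", "Mumbai"], "temperature": [31.5, 29.0]})
# Write the DataFrame using the PyArrow engine
df.to_parquet("engine_demo.parquet", engine="pyarrow")
# Read the same file back using the Fastparquet engine
result = pd.read_parquet("engine_demo.parquet", engine="fastparquet")
print(result)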
Handling Indexes in Pandas Parquet Files
When writing DataFrames to Parquet files, Pandas handles the DataFrame indices differently depending on the engine. By default, the index is stored along with the data.
This can lead to unexpected extra columns, especially when sharing Parquet files with other tools. Use the index parameter to specify whether the index should be saved or excluded during the write operation.
Example
This example demonstrates omitting the index when writing a Parquet file. Here, the index=False argument ensures that the index is not stored in the file.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Write DataFrame to Parquet file without index
df.to_parquet("without_index.parquet", engine="pyarrow", index=False)
# Reading the DataFrame from Parquet
result = pd.read_parquet("without_index.parquet", engine="pyarrow", dtype_backend="pyarrow")
print('\nLoaded DataFrame Parquet File:')
print(result)
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
Loaded DataFrame Parquet File:
|   | strings | numbers | dates |
|---|---|---|---|
| 0 | x | 1 | 2024-01-01 00:00:00 |
| 1 | y | 2 | 2024-01-02 00:00:00 |
| 2 | z | 3 | 2024-01-03 00:00:00 |
Handling Data Types in Parquet
The Parquet file format efficiently manages Pandas data types, including categorical and datetime data. However, some types, like Interval, are unsupported. By setting the dtype_backend argument of the read_parquet() method, you can control the data types used when the data is read back, ensuring the appropriate conversion of data types.
Example
This example demonstrates customizing the back-end data type while reading the DataFrame from a Parquet file. Here we have changed the back-end data type to pyarrow by setting the dtype_backend parameter of the read_parquet() method.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Write DataFrame to Parquet file
df.to_parquet("example_parquet_file.parquet", engine="pyarrow")
# Reading the DataFrame from Parquet with 'pyarrow' data type
result = pd.read_parquet("example_parquet_file.parquet", engine="pyarrow", dtype_backend="pyarrow")
# Display the DataFrame's data types
print("\nLoaded DataFrame's data types:")
print(result.dtypes)
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
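For comparison, the same file can also be read back with the nullable NumPy-backed dtypes instead of the PyArrow-backed ones. The following is a minimal sketch, assuming pandas 2.0 or later and the example_parquet_file.parquet file created above.
import pandas as pd
# Read the Parquet file back using the nullable NumPy dtype backend
result_nullable = pd.read_parquet("example_parquet_file.parquet", engine="pyarrow", dtype_backend="numpy_nullable")
# Display the resulting data types for comparison
print(result_nullable.dtypes)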
Data Partitioning in Parquet Files
Partitioning is a technique used to split large datasets into smaller, more manageable blocks based on the values of one or more columns. This improves data retrieval, because each segment of the file contains only the rows for a particular column value.
Partitioning can be done through the partition_cols argument of the to_parquet() method, which allows you to partition data when writing to the file.
Example
In this example, the DataFrame is partitioned by the "numbers" column using the partition_cols parameter, resulting in a Parquet directory structure where the data is stored in separate subdirectories based on the values in this column.
import pandas as pd
# Creating a DataFrame with various types
df = pd.DataFrame({
"strings": ["x", "y", "z"],
"numbers": [1, 2, 3],
"dates": pd.date_range("2024-01-01", periods=3),
}, index=['Row1', 'Row2', 'Row3'])
print("Original DataFrame:")
print(df)
# Partition by column 'numbers'
df.to_parquet("partitioned_data.parquet", engine="pyarrow", partition_cols=["numbers"])
print("\nPartitioned DataFrame is successfully saved in separate directories.")
Following is the output of the above code −
Original DataFrame:
|      | strings | numbers | dates |
|---|---|---|---|
| Row1 | x | 1 | 2024-01-01 |
| Row2 | y | 2 | 2024-01-02 |
| Row3 | z | 3 | 2024-01-03 |
Check your working directory: the partitioned data is saved under a parent directory named partitioned_data.parquet, with one subdirectory for each value of the numbers column.
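To load the partitioned dataset back into a single DataFrame, you can point read_parquet() at the parent directory. The following is a minimal sketch, assuming the partitioned_data.parquet directory created above exists in the working directory.
import pandas as pd
# Read all partitions back into a single DataFrame
result = pd.read_parquet("partitioned_data.parquet", engine="pyarrow")
# Note: the partition column 'numbers' may be read back as a categorical column
print(result)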
Compression Techniques in Parquet Files
The Parquet file format supports several compression techniques, including Snappy, Gzip, LZ4, and Brotli, which can be specified using the compression argument of the to_parquet() method. Compression is useful for reducing the storage size while maintaining efficient data access.
Example
The following example shows how to use the to_parquet() method to save a Pandas DataFrame as a Parquet file with compression. In this example, we apply Gzip compression to the Parquet file, which helps reduce the storage size while maintaining the ability to read and write the data efficiently.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})
print("Original DataFrame:")
print(df)
# Save the DataFrame to a parquet file with compression
df.to_parquet('compressed_data.parquet.gzip', compression='gzip')
print("\nDataFrame is saved as parquet format with compression..")
Following is the output of the above code −
Original DataFrame:
|   | Col_1 | Col_2 |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |
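When reading the file back, you do not need to specify the compression, as it is detected automatically from the file metadata. The following is a minimal sketch, assuming the compressed_data.parquet.gzip file created above exists in the working directory.
import pandas as pd
# Read the Gzip-compressed Parquet file; the compression is detected automatically
result = pd.read_parquet("compressed_data.parquet.gzip")
print(result)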