How to Save Pandas Dataframe as gzip/zip File?


A Pandas DataFrame can be saved in gzip or zip format either through the compression argument of its own writer methods or with Python's gzip and zipfile modules. Pandas is a Python library used for data manipulation and analysis. Its DataFrame is a two-dimensional labeled data structure whose columns can hold different data types. Storing a DataFrame in gzip or zip format reduces the size of the file on disk. In this article, we will see how to save a Pandas DataFrame as a gzip or zip file.

Algorithm

A generalized algorithm to save a Pandas DataFrame as a compressed gzip/zip file is given below. The exact implementation may vary depending on the use case and the file format. For example, when using PyArrow and the Parquet format, the DataFrame is first converted to a PyArrow table, which is then written as a compressed Parquet file.

  • Import the necessary libraries: Pandas, gzip/zip library (e.g. zipfile for zip compression, gzip for gzip compression), and PyArrow (if using Parquet format).

  • Load or create the Pandas DataFrame that you want to save as a compressed file.

  • Choose the compression method you want to use (gzip or zip). Either open a file object to write to using the appropriate library (for example, "gzip.open" for gzip) or pass a file path and let Pandas do the compression.

  • Use the appropriate method (e.g. to_csv, to_pickle, to_parquet) to save the DataFrame. When writing to a file path, set the "compression" argument to the chosen method. When writing to a file object that already compresses, the file object does the compressing.

  • Close the file object if you opened one.
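
The steps above can be sketched as follows. This is a minimal illustration, assuming pandas is installed. The sample data, the file name "example.csv.gz" and the temporary directory are placeholders chosen for the example.

```python
import gzip
import os
import tempfile

import pandas as pd

# Load or create the DataFrame (sample data, for illustration only)
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Pick a location for the compressed file (hypothetical file name)
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'example.csv.gz')

# Open a gzip file object in text mode and write the DataFrame to it;
# the file object does the compressing
with gzip.open(out_path, 'wt', newline='') as f:
    df.to_csv(f, index=False)
# The with statement closes the file object here

# Read the file back to confirm the round trip
restored = pd.read_csv(out_path)
```

Here read_csv() recognises the gzip compression from the ".gz" extension, so no extra argument is needed when reading.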

Method 1: Using the to_csv() method

Saving Pandas Data Frame as a Gzip file

Gzip is a compression format that is widely used on Linux and UNIX systems, and Python supports it through the built-in gzip module. There are two ways to save a Pandas DataFrame as a gzip file. The first is to pass a file path to the to_csv() method together with the compression argument. The second is to import the gzip module, use its open() method to create a file object in write mode, and pass that file object to the to_csv() method of the DataFrame object.

Syntax

df.to_csv('data.csv.gz', index=False, compression='gzip')

Here, the to_csv() method saves a Pandas DataFrame as a compressed CSV file with gzip compression. The "index=False" argument specifies that the index column should not be included in the output file, and the "compression='gzip'" argument tells the method to apply gzip compression to the output file.
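
As a rough illustration of the syntax above, the sketch below writes the same DataFrame with and without gzip compression and compares the file sizes. The repeated sample rows and the file names are assumptions made for the example, and the actual saving depends on the data.

```python
import os
import tempfile

import pandas as pd

# Repeat a few sample rows so there is enough data to compress
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'] * 1000,
                   'Age': [25, 30, 35] * 1000})

out_dir = tempfile.mkdtemp()
plain_path = os.path.join(out_dir, 'data.csv')
gzip_path = os.path.join(out_dir, 'data.csv.gz')

# Write an uncompressed and a gzip-compressed copy of the same data
df.to_csv(plain_path, index=False)
df.to_csv(gzip_path, index=False, compression='gzip')

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(f'plain: {plain_size} bytes, gzip: {gzip_size} bytes')

# read_csv infers gzip compression from the .gz extension
restored = pd.read_csv(gzip_path)
```

The compression argument defaults to 'infer', so with a path ending in ".gz" the explicit compression='gzip' is optional. Spelling it out makes the intent clearer.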

Example

In the below code, we created a DataFrame with three columns (Name, Age, and Salary) and saved it as a gzip file named "data.gz". We used the with statement to ensure that the file object is closed after writing the DataFrame to it. The index=False argument tells the to_csv() method not to write the row index to the file.

import pandas as pd
import gzip

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

# Save DataFrame as a gzip file (text mode, so to_csv can write strings)
with gzip.open('data.gz', 'wt', newline='') as f:
   df.to_csv(f, index=False)

Output

The DataFrame is saved as a gzip file. The decompressed content of the file is shown below:

Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000

Saving Pandas DataFrame as a Zip File

Zip is a popular archive and compression format that is especially common on the Windows operating system. The zipfile module in Python provides a simple way to save a Pandas DataFrame as a compressed zip file.

To save a Pandas DataFrame as a zip file, we need to import the zipfile module and use its ZipFile class to create a ZipFile object in write mode. Then, we can use the open() method of the ZipFile object to create a file object inside the zip file. Finally, we can pass this file object to the to_csv() method of the DataFrame object.

Example

In the below code, we created a DataFrame with three columns (Name, Age, and Salary) and saved it as a zip file named "data.zip". We used the with statement to ensure that the ZipFile object is closed after writing the DataFrame to it. The compression=zipfile.ZIP_DEFLATED argument tells the ZipFile object to use the DEFLATE compression algorithm. The open() method of the ZipFile object creates a file object inside the zip file named "data.csv". The index=False argument tells the to_csv() method not to write the row index to the file.

import pandas as pd
import zipfile

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
   'Age': [25, 30, 35],
   'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

# Save DataFrame as a zip file (binary file objects need pandas 1.2 or later)
with zipfile.ZipFile('data.zip', 'w', compression=zipfile.ZIP_DEFLATED) as z:
   with z.open('data.csv', 'w') as f:
      df.to_csv(f, index=False)

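Output

The DataFrame is saved as "data.csv" inside "data.zip". The data stored in the file is shown below:

Name	Age	Salary
Alice	25	50000
Bob	30	60000
Charlie	35	70000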
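
Pandas can also create the zip archive itself, without the zipfile module. The sketch below assumes a pandas version that accepts a dictionary for the compression argument with an "archive_name" key. The temporary directory and file names are placeholders chosen for the example.

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

zip_path = os.path.join(tempfile.mkdtemp(), 'data.zip')

# 'archive_name' sets the name of the CSV stored inside the archive
df.to_csv(zip_path, index=False,
          compression={'method': 'zip', 'archive_name': 'data.csv'})

# Inspect the archive with the standard zipfile module
with zipfile.ZipFile(zip_path) as z:
    members = z.namelist()

# read_csv can read a zip archive that contains a single file
restored = pd.read_csv(zip_path)
```

This form is shorter than managing the ZipFile object by hand. The zipfile approach remains useful when several files have to go into one archive.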

Method 2: Using the to_pickle() method with gzip/zip compression

The to_pickle() method of a DataFrame object can be used to save a DataFrame as a pickle file with gzip or zip compression.

Syntax

df.to_pickle('data.pkl.gz', compression='gzip')
df.to_pickle('data.pkl.zip', compression='zip')

Here, the to_pickle() method saves a Pandas DataFrame as a compressed pickle file with gzip or zip compression. The "compression='gzip'" or "compression='zip'" argument tells the method to apply gzip or zip compression to the output file.
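
To check that a compressed pickle file holds the same data, it can be read back with pd.read_pickle(). The following is a minimal sketch of that round trip. The temporary directory and the sample data are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

out_dir = tempfile.mkdtemp()
gz_path = os.path.join(out_dir, 'data.pkl.gz')
zip_path = os.path.join(out_dir, 'data.pkl.zip')

# Write the same DataFrame with gzip and with zip compression
df.to_pickle(gz_path, compression='gzip')
df.to_pickle(zip_path, compression='zip')

# read_pickle infers the compression from the file extension
from_gz = pd.read_pickle(gz_path)
from_zip = pd.read_pickle(zip_path)
```

Unlike CSV, a pickle file keeps the index and the column data types. Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.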

Example

In the below code, we used the to_pickle() method to save the DataFrame as a pickle file with gzip and zip compression, respectively. The argument "compression='gzip'" or "compression='zip'" tells the method to apply gzip or zip compression to the output file.

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

# Save DataFrame as a gzipped pickle file
df.to_pickle('data.pkl.gz', compression='gzip')

# Save DataFrame as a zipped pickle file
df.to_pickle('data.pkl.zip', compression='zip')

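Output

The code prints nothing. It creates the files "data.pkl.gz" and "data.pkl.zip", and each file stores the following data:

Name	Age	Salary
Alice	25	50000
Bob	30	60000
Charlie	35	70000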

Method 3: Using the Parquet format with gzip/snappy compression

A DataFrame can also be saved as a compressed Parquet file, either with its to_parquet() method or with the PyArrow library directly. Parquet applies compression inside the file using a codec such as gzip or snappy. Zip is not one of its codecs.

Syntax

pq.write_table(table, 'data.gzip.parquet', compression='gzip')
pq.write_table(table, 'data.snappy.parquet', compression='snappy')

Here, the write_table() function saves a PyArrow table, created from the Pandas DataFrame, as a compressed Parquet file. The "compression='gzip'" or "compression='snappy'" argument selects the codec used inside the file. Because the compression is internal to the Parquet format, the result is not a gzip or zip archive, so the file keeps the ".parquet" extension.

Example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

# Convert DataFrame to PyArrow table
table = pa.Table.from_pandas(df)

# Save PyArrow table as a gzip-compressed Parquet file
pq.write_table(table, 'data.gzip.parquet', compression='gzip')

# Save PyArrow table as a snappy-compressed Parquet file
pq.write_table(table, 'data.snappy.parquet', compression='snappy')

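Output

The code prints nothing. It creates two compressed Parquet files, and each file stores the following data:

Name	Age	Salary
Alice	25	50000
Bob	30	60000
Charlie	35	70000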

Conclusion

In this article, we discussed how to save a Pandas DataFrame as a gzip/zip file using Python. We used the gzip module to create a gzip file and the zipfile module to create a zip file. We also used the compression argument of to_csv() and to_pickle(), and the compression codecs of the Parquet format. Both gzip and zip are widely used and can help reduce the size of large data files, making them easier to store and transmit. A gzip file holds a single compressed stream, which suits one DataFrame per file, while a zip file is an archive that can bundle several files together.

Updated on: 11-Jul-2023
