How to speed up Pandas with cuDF?


Pandas is one of the most widely used Python libraries for data analysis, valued for its data manipulation capabilities. On large datasets, however, Pandas can become slow, because it runs on the CPU. One alternative is cuDF, a GPU DataFrame library developed by NVIDIA as part of the RAPIDS ecosystem. cuDF uses the GPU to process data in parallel, which can make it significantly faster than the equivalent Pandas operations. This article walks through speeding up a Pandas workflow with cuDF, explaining each line of code along the way.

Installing cuDF

Before looking at the code, make sure cuDF is installed in your environment. You can install it with Conda, a widely used package manager for Python −

conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf

Note that cuDF requires a compatible NVIDIA GPU and a matching CUDA installation. The exact channels and version pins in the install command vary between RAPIDS releases, so check the official installation guide for current instructions and system requirements: https://rapids.ai/start.html
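
Before running the rest of the tutorial, it can help to confirm that cuDF is both installed and able to reach a GPU. The following is a minimal sketch, not an official RAPIDS check: the helper names cudf_is_installed and describe_backend are hypothetical, and the probe simply tries to build a small cudf.Series.

```python
import importlib.util


def cudf_is_installed() -> bool:
    """Return True if the cudf package can be found on the import path."""
    return importlib.util.find_spec("cudf") is not None


def describe_backend() -> str:
    """Report whether cuDF looks usable (hypothetical helper)."""
    if not cudf_is_installed():
        return "cudf not found; falling back to pandas on the CPU"
    try:
        import cudf

        # Building a tiny Series forces cuDF to touch the GPU.
        cudf.Series([1, 2, 3]).sum()
    except Exception as exc:  # e.g. no usable GPU or a driver mismatch
        return f"cudf is installed but not usable: {exc}"
    return f"cudf {cudf.__version__} is ready"


print(describe_backend())
```

If the message reports that cuDF is not usable, revisit the installation guide before continuing.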

Importing Pandas and cuDF

With the library installed, import Pandas and cuDF into your Python script −

import pandas as pd
import cudf

Loading Data into a Pandas DataFrame

To begin, we'll load data into a Pandas DataFrame. For simplicity, we'll build a small sample DataFrame with the pd.DataFrame() constructor.

data = {
   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
   'Age': [25, 30, 35, 28, 22],
   'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Austin']
}
pandas_df = pd.DataFrame(data)

Converting a Pandas DataFrame into a cuDF DataFrame

To use cuDF's GPU processing, the next step is to convert the Pandas DataFrame into a cuDF DataFrame. This is done with the cudf.from_pandas() function, which copies the data into GPU memory −

cudf_df = cudf.from_pandas(pandas_df)

From this point on, operations on cudf_df run on the GPU. On large datasets this can be considerably faster than the equivalent CPU-based Pandas operations. On a tiny DataFrame like this one, the cost of copying data to the GPU usually outweighs any gain.
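
To see whether the GPU actually helps on a given machine, it is worth timing the same operation in both libraries. The sketch below is a rough illustration, not a rigorous benchmark: the one-million-row size, the column names key and value, and the helper time_groupby are all assumptions made for this example, and the script falls back to Pandas-only timing when cuDF cannot be used.

```python
import time

import numpy as np
import pandas as pd

try:
    import cudf

    cudf.Series([0])  # fails early if no GPU is usable
except Exception:
    cudf = None


def time_groupby(df, repeats=3):
    """Return the best wall-clock time (seconds) for a groupby-mean on df."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = df.groupby("key")["value"].mean()
        len(result)  # touch the result before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(seed=0)
n_rows = 1_000_000  # illustrative size; adjust to your hardware
big_pandas_df = pd.DataFrame({
    "key": rng.integers(0, 1_000, size=n_rows),
    "value": rng.random(n_rows),
})

timings = {"pandas": time_groupby(big_pandas_df)}
if cudf is not None:
    # The host-to-GPU copy happens once here and is not included in the timing.
    big_cudf_df = cudf.from_pandas(big_pandas_df)
    timings["cudf"] = time_groupby(big_cudf_df)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

The conversion is deliberately kept outside the timed region. In a real workflow that copy is part of the total cost, so cuDF pays off most when several operations run on the GPU before the data is moved back.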

Manipulating Data with cuDF

With the data in a cuDF DataFrame, you can perform many of the same data manipulation operations that Pandas provides. For instance, let's filter the DataFrame to keep only the rows where 'Age' is greater than 25 −

filtered_cudf_df = cudf_df[cudf_df['Age'] > 25]
print(filtered_cudf_df)

Note that the syntax and function calls are virtually identical to those of Pandas, which makes it easy to move between the two libraries.
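
One practical consequence of the shared syntax is that a single function can often accept either kind of DataFrame. The following sketch illustrates this with a hypothetical helper, mean_age_over, which is not part of either library. It runs on Pandas alone when cuDF is unavailable.

```python
import pandas as pd

try:
    import cudf

    cudf.Series([0])  # fails early if no GPU is usable
except Exception:
    cudf = None


def mean_age_over(df, min_age):
    """Mean 'Age' of rows whose 'Age' exceeds min_age, for either library."""
    return float(df[df["Age"] > min_age]["Age"].mean())


people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 22],
})

print(mean_age_over(people, 25))  # prints 31.0

if cudf is not None:
    # Same function, same call, but the filter and mean run on the GPU.
    print(mean_age_over(cudf.from_pandas(people), 25))
```

Not every Pandas feature has a cuDF equivalent, so it is worth testing shared code against both libraries rather than assuming full compatibility.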

Converting a cuDF DataFrame Back to a Pandas DataFrame

After performing the desired operations with cuDF, you may need to convert the cuDF DataFrame back into a Pandas DataFrame for further processing or exporting. To do so, call the to_pandas() method −

filtered_pandas_df = filtered_cudf_df.to_pandas()
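
Code that hands results to other Pandas-based tools often needs to accept both kinds of DataFrame. A small wrapper can make that explicit. This is a sketch under the assumption that the input is either a Pandas or a cuDF DataFrame; ensure_pandas is a hypothetical name, not a cuDF function.

```python
import pandas as pd

try:
    import cudf

    cudf.Series([0])  # fails early if no GPU is usable
except Exception:
    cudf = None


def ensure_pandas(df):
    """Return a Pandas DataFrame, copying from the GPU only when needed."""
    if isinstance(df, pd.DataFrame):
        return df
    return df.to_pandas()  # cuDF DataFrame: device-to-host copy


pandas_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 22],
})

# Use the GPU when it is available, otherwise stay in Pandas.
source_df = cudf.from_pandas(pandas_df) if cudf is not None else pandas_df
filtered = source_df[source_df["Age"] > 25]

result = ensure_pandas(filtered)
print(list(result.index))         # prints [1, 2, 3]
print(result.to_csv(index=False))  # CSV text, ready for export
```

The original row labels (1, 2, 3) survive the filter and the conversion, which matches the output shown later in this article.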

Here is the entire Python code −

# Step 1: Installing cuDF (run this in your system's terminal or command prompt)
# conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf

# Step 2: Importing Pandas and cuDF
import pandas as pd
import cudf

# Step 3: Creating a Pandas DataFrame
data = {
   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
   'Age': [25, 30, 35, 28, 22],
   'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Austin']
}
pandas_df = pd.DataFrame(data)
print(pandas_df)

# Step 4: Converting Pandas DataFrame to cuDF DataFrame
cudf_df = cudf.from_pandas(pandas_df)

# Step 5: Applying data manipulation on cuDF DataFrame
filtered_cudf_df = cudf_df[cudf_df['Age'] > 25]
print(filtered_cudf_df)

# Step 6: Converting cuDF DataFrame back to Pandas DataFrame
filtered_pandas_df = filtered_cudf_df.to_pandas()
print(filtered_pandas_df)

This script creates a Pandas DataFrame with some sample data. It then converts that DataFrame into a cuDF DataFrame, which allows you to use GPU processing capabilities for data operations. The script filters the cuDF DataFrame to include only rows where the 'Age' is greater than 25. Finally, it converts the cuDF DataFrame back into a Pandas DataFrame.

The script prints three DataFrames. The expected output is shown below, with a label added above each one for readability −

Pandas DataFrame

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28  San Francisco
4      Eva   22         Austin

Filtered cuDF DataFrame

      Name  Age           City
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28  San Francisco

Filtered Pandas DataFrame

      Name  Age           City
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28  San Francisco
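
The script above builds its data in Pandas and then copies it to the GPU. When the data already lives in a file, that intermediate step can be skipped, because cuDF provides its own read_csv() function. The sketch below assumes a small CSV written to a temporary directory purely so the example is self-contained. The backend variable is an illustrative convention that selects Pandas when cuDF cannot be used.

```python
import os
import tempfile

import pandas as pd

try:
    import cudf

    cudf.Series([0])  # fails early if no GPU is usable
except Exception:
    cudf = None

# Pick whichever library is usable; both expose read_csv().
backend = cudf if cudf is not None else pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small sample file (a stand-in for a real dataset).
    csv_path = os.path.join(tmp_dir, "people.csv")
    pd.DataFrame({
        "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
        "Age": [25, 30, 35, 28, 22],
    }).to_csv(csv_path, index=False)

    # With cuDF, the file is parsed straight into GPU memory.
    people = backend.read_csv(csv_path)
    older = people[people["Age"] > 25]

print(f"Loaded with {backend.__name__}")
print(older)
```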

Conclusion

In summary, cuDF, as part of the RAPIDS ecosystem, offers a way to improve the performance of data analysis tasks. Because its API closely mirrors that of Pandas, it is easy to adopt for anyone already familiar with Pandas. By running operations in parallel on the GPU, cuDF can deliver a considerable performance boost on large datasets. Experiment with it on your own workloads to see how much it helps.

Updated on: 09-Aug-2023
