Drop One or Multiple Columns From PySpark DataFrame


PySpark is an API for Apache Spark, a powerful framework for real-time, large-scale data processing. Spark was originally written in the “Scala” programming language, and several APIs were later built to increase its reach and flexibility. These APIs provide interfaces that can be used to run Spark applications from our local environment.

One such API is PySpark, which was developed for the Python environment. The PySpark data frame also consists of rows and columns, but the processing part is different: it uses in-memory (RAM) computation to process the data.

In this article, we will perform and understand the basic operation of dropping single and multiple columns from a PySpark data frame. First, we will create a reference data frame.

Creating a PySpark Data Frame

We have to create a SparkSession, which deals with the configuration part of the data frame. A SparkSession acts as an entry point to access the Spark APIs. The SparkSession object we create handles the connection to the cluster manager and the functionality of the framework.

We can use this object to read the dataset and prepare the data frame. Generally, we supply a “schema” for generating the data frame, but Spark can also infer the structure from the dataset alone. Let’s create a data frame and enhance our understanding.
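Before the full example, here is a minimal sketch of the two schema options, assuming a couple of hypothetical sample rows of our own. It builds a small data frame once with an inferred schema and once with an explicit one.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("SchemaSketch").getOrCreate()

# Hypothetical sample rows used only for this illustration
rows = [("Laptop", 45000), ("Radio", 15000)]

# Option 1: give only column names and let Spark infer the types
inferred_df = spark.createDataFrame(rows, ["Device name", "Device price"])

# Option 2: supply an explicit schema with names and types
schema = StructType([
    StructField("Device name", StringType(), True),
    StructField("Device price", IntegerType(), True),
])
explicit_df = spark.createDataFrame(rows, schema)

inferred_df.show()
explicit_df.show()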

Example

  • We imported the pandas library and the SparkSession class from the pyspark.sql module.

  • We created an instance of SparkSession with the help of the “builder” attribute. This builder allows us to configure the framework, and we set the application name to “SparkDsc”. The “getOrCreate()” method retrieves a currently existing SparkSession instance or creates a new one.

  • After this, we prepared a dictionary dataset consisting of information related to different electronic gadgets. We used this dataset to generate a pandas data frame, which serves as the reference data structure for the PySpark data frame.

  • We created a PySpark data frame with the help of the “createDataFrame()” method and finally displayed it with the help of the “dataframe_spark.show()” method.

from pyspark.sql import SparkSession
import pandas as pd

# Create (or reuse) a SparkSession as the entry point to the Spark APIs
spark = SparkSession.builder.appName("SparkDsc").getOrCreate()

# Dictionary dataset describing different electronic gadgets
dataset = {"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
           "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
           "Device price": [45000, 30000, 50000, 15000],
           "Warranty": ["6 months", "8 months", "1 year", "4 months"]}

# Build the reference pandas data frame
dataframe_pd = pd.DataFrame(dataset, index=["Device 1", "Device 2", "Device 3", "Device 4"])

# Convert it into a PySpark data frame and display it
dataframe_spark = spark.createDataFrame(dataframe_pd)
print("The original spark data frame is: -")
dataframe_spark.show()

Output

The original spark data frame is: -
+------------+--------------+------------+--------+
| Device name|    Store name|Device price|Warranty|
+------------+--------------+------------+--------+
|      Laptop|JJ electronics|       45000|6 months|
|Mobile phone| Birla dealers|       30000|8 months|
|          TV| Ajay services|       50000|  1 year|
|       Radio| Kapoor stores|       15000|4 months|
+------------+--------------+------------+--------+

Now that we have successfully created a data frame, let’s quickly discuss the different methods to drop columns from it.

Using the drop() Function to Drop Columns From the Data Frame

The drop() function offers a simple method to eliminate unwanted data from the data frame. It can remove a single column as well as multiple columns, and there are several ways to pass the column names to it.

Dropping a Single Column

Let’s look at an implementation which drops a single column from a data frame:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("SparkDsc").getOrCreate()

# Reference dataset for the PySpark data frame
dataset = {"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
           "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
           "Device price": [45000, 30000, 50000, 15000],
           "Warranty": ["6 months", "8 months", "1 year", "4 months"]}

dataframe_pd = pd.DataFrame(dataset, index=["Device 1", "Device 2", "Device 3", "Device 4"])

dataframe_spark = spark.createDataFrame(dataframe_pd)
print("The original spark data frame is: -")
dataframe_spark.show()

# Drop a single column using the drop() method
dataframe_spark = dataframe_spark.drop("Warranty")
dataframe_spark.show()

Output

The original spark data frame is: -
+------------+--------------+------------+--------+
| Device name|    Store name|Device price|Warranty|
+------------+--------------+------------+--------+
|      Laptop|JJ electronics|       45000|6 months|
|Mobile phone| Birla dealers|       30000|8 months|
|          TV| Ajay services|       50000|  1 year|
|       Radio| Kapoor stores|       15000|4 months|
+------------+--------------+------------+--------+

+------------+--------------+------------+
| Device name|    Store name|Device price|
+------------+--------------+------------+
|      Laptop|JJ electronics|       45000|
|Mobile phone| Birla dealers|       30000|
|          TV| Ajay services|       50000|
|       Radio| Kapoor stores|       15000|
+------------+--------------+------------+

After creating the PySpark data frame, we used the drop() function to drop the “Warranty” column from the data frame. Note that drop() does not modify the data frame in place; it returns a new data frame without that column, which is why we reassign the result to the same variable.
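As a small sketch of this behavior, continuing with the data frame from the example above: assigning the result of drop() to a new variable leaves the original columns intact, and dropping a name that is not in the schema is simply a no-op rather than an error.

# drop() returns a new data frame and leaves the original one untouched
trimmed = dataframe_spark.drop("Device price")
print(dataframe_spark.columns)   # ['Device name', 'Store name', 'Device price']
print(trimmed.columns)           # ['Device name', 'Store name']

# Dropping a column name that is not in the schema is a silent no-op
print(dataframe_spark.drop("No such column").columns)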

Dropping Multiple Columns

We will use the same function to execute this operation, but this time we will pass several column names at once, using the “*” operator to unpack them from a tuple.

Example

# Drop multiple columns using the drop() method
dataframe_spark = dataframe_spark.drop(*("Device price", "Warranty"))
dataframe_spark.show()

Output

+------------+--------------+
| Device name|    Store name|
+------------+--------------+
|      Laptop|JJ electronics|
|Mobile phone| Birla dealers|
|          TV| Ajay services|
|       Radio| Kapoor stores|
+------------+--------------+

Here, we used the “*” operator to unpack and drop the “Device price” and “Warranty” columns from the data frame. We can also keep the column names in a list, as long as we unpack that list when passing it to the drop() function, since drop() expects individual column names rather than a list object.

Example

# Starting again from the original data frame, unpack a list of column names
dataframe_spark = dataframe_spark.drop(*["Store name", "Warranty"])
dataframe_spark.show()

Output

+------------+------------+
| Device name|Device price|
+------------+------------+
|      Laptop|       45000|
|Mobile phone|       30000|
|          TV|       50000|
|       Radio|       15000|
+------------+------------+

Either of the approaches discussed above can be used to remove as many columns as required from the data frame.
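Combining the two forms, here is a minimal sketch, assuming a hypothetical keep-list of our own, that drops every column except the ones we want to retain.

# A sketch: drop every column except a chosen keep-list
keep = {"Device name"}
columns_to_drop = [column for column in dataframe_spark.columns if column not in keep]
dataframe_spark.drop(*columns_to_drop).show()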

Using List Comprehension and the select() Method

We can use the select() method along with the list comprehension technique to keep only the required columns, which effectively drops the unwanted ones from the data frame.

Example

dataframe_spark = dataframe_spark.select([column for column in dataframe_spark.columns if column not in {"Device name", "Store name"}])
dataframe_spark.show()

Output

+------------+--------+
|Device price|Warranty|
+------------+--------+
|       45000|6 months|
|       30000|8 months|
|       50000|  1 year|
|       15000|4 months|
+------------+--------+

Here, the “Device name” and “Store name” columns are removed from the data frame with the help of list comprehension over dataframe_spark.columns. We used the select() method to keep every column except “Device name” and “Store name”; note that the names in the exclusion set must match the schema exactly, since column names are compared as plain strings here.
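A variant of the same pattern, sketched here as one possible extension, filters on column types instead of names by using the data frame’s dtypes attribute (the choice to keep only the non-numeric column is our own):

# A sketch: select columns by data type instead of by name
# dtypes yields (column name, type string) pairs, e.g. ('Device price', 'bigint')
non_numeric = [name for name, dtype in dataframe_spark.dtypes if dtype != "bigint"]
dataframe_spark.select(non_numeric).show()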

Conclusion

In this article, we performed the basic operation of dropping single and multiple columns from a PySpark data frame and discussed the different possible methods. We used the “drop()” function and the “select()” method to drop different columns.
