Drop One or Multiple Columns From PySpark DataFrame

A PySpark DataFrame is a distributed data structure built on Apache Spark that provides powerful data processing capabilities. Sometimes you need to remove unnecessary columns to optimize performance or focus on specific data. PySpark offers several methods to drop one or multiple columns from a DataFrame.

Creating a PySpark DataFrame

First, let's create a sample DataFrame to demonstrate column dropping operations:

from pyspark.sql import SparkSession
import pandas as pd

# Create SparkSession
spark = SparkSession.builder.appName("DropColumns").getOrCreate()

# Sample dataset
dataset = {
    "Device name": ["Laptop", "Mobile phone", "TV", "Radio"], 
    "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"], 
    "Device price": [45000, 30000, 50000, 15000], 
    "Warranty": ["6 months", "8 months", "1 year", "4 months"]
}

# Create pandas DataFrame first, then convert to PySpark
dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)

print("Original PySpark DataFrame:")
dataframe_spark.show()
Original PySpark DataFrame:
+------------+--------------+------------+--------+
|Device name |    Store name|Device price|Warranty|
+------------+--------------+------------+--------+
|      Laptop|JJ electronics|       45000|6 months|
|Mobile phone| Birla dealers|       30000|8 months|
|          TV| Ajay services|       50000|  1 year|
|       Radio| Kapoor stores|       15000|4 months|
+------------+--------------+------------+--------+

Method 1: Using drop() Function

The drop() method is the most straightforward way to remove columns from a PySpark DataFrame. Like all DataFrame transformations, it returns a new DataFrame and leaves the original unchanged.

Dropping a Single Column

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropColumns").getOrCreate()

dataset = {
    "Device name": ["Laptop", "Mobile phone", "TV", "Radio"], 
    "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"], 
    "Device price": [45000, 30000, 50000, 15000], 
    "Warranty": ["6 months", "8 months", "1 year", "4 months"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)

# Drop single column
df_single_drop = dataframe_spark.drop("Warranty")
print("After dropping 'Warranty' column:")
df_single_drop.show()
After dropping 'Warranty' column:
+------------+--------------+------------+
|Device name |    Store name|Device price|
+------------+--------------+------------+
|      Laptop|JJ electronics|       45000|
|Mobile phone| Birla dealers|       30000|
|          TV| Ajay services|       50000|
|       Radio| Kapoor stores|       15000|
+------------+--------------+------------+

Dropping Multiple Columns

You can drop multiple columns by unpacking a list with the * operator, or by chaining drop() calls:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropColumns").getOrCreate()

dataset = {
    "Device name": ["Laptop", "Mobile phone", "TV", "Radio"], 
    "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"], 
    "Device price": [45000, 30000, 50000, 15000], 
    "Warranty": ["6 months", "8 months", "1 year", "4 months"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)

# Option 1: Unpack a list of column names with the * operator
df_multiple_drop1 = dataframe_spark.drop(*["Device price", "Warranty"])
print("Dropping multiple columns with * operator:")
df_multiple_drop1.show()

# Option 2: Chain multiple drop() calls
df_multiple_drop2 = dataframe_spark.drop("Store name").drop("Warranty")
print("Dropping multiple columns with chained drop:")
df_multiple_drop2.show()
Dropping multiple columns with * operator:
+------------+--------------+
|Device name |    Store name|
+------------+--------------+
|      Laptop|JJ electronics|
|Mobile phone| Birla dealers|
|          TV| Ajay services|
|       Radio| Kapoor stores|
+------------+--------------+

Dropping multiple columns with chained drop:
+------------+------------+
|Device name |Device price|
+------------+------------+
|      Laptop|       45000|
|Mobile phone|       30000|
|          TV|       50000|
|       Radio|       15000|
+------------+------------+

Method 2: Using select() with List Comprehension

The select() method with list comprehension allows you to specify which columns to keep by excluding unwanted ones:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropColumns").getOrCreate()

dataset = {
    "Device name": ["Laptop", "Mobile phone", "TV", "Radio"], 
    "Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"], 
    "Device price": [45000, 30000, 50000, 15000], 
    "Warranty": ["6 months", "8 months", "1 year", "4 months"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)

# Drop columns using select with list comprehension
columns_to_drop = {"Device name", "Store name"}
df_select_drop = dataframe_spark.select([col for col in dataframe_spark.columns if col not in columns_to_drop])
print("Using select() with list comprehension:")
df_select_drop.show()
Using select() with list comprehension:
+------------+--------+
|Device price|Warranty|
+------------+--------+
|       45000|6 months|
|       30000|8 months|
|       50000|  1 year|
|       15000|4 months|
+------------+--------+

Comparison of Methods

Method        Syntax                          Best For
------        ------                          --------
drop()        df.drop("col")                  Simple single-column removal
drop(*list)   df.drop(*["col1", "col2"])      Dropping several named columns
select()      df.select([cols...])            Conditional column filtering

Conclusion

Use drop() for straightforward column removal and select() with list comprehension for conditional column filtering. The drop() method is generally more readable and efficient for most use cases.

---
Updated on: 2026-03-27T06:13:48+05:30
