Drop One or Multiple Columns From PySpark DataFrame
A PySpark DataFrame is a distributed data structure built on Apache Spark that provides powerful data processing capabilities. Sometimes you need to remove unnecessary columns to optimize performance or focus on specific data. PySpark offers several methods to drop one or multiple columns from a DataFrame.
Creating a PySpark DataFrame
First, let's create a sample DataFrame to demonstrate column dropping operations:
from pyspark.sql import SparkSession
import pandas as pd
# Create SparkSession
spark = SparkSession.builder.appName("DropColumns").getOrCreate()
# Sample dataset
dataset = {
"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
"Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
"Device price": [45000, 30000, 50000, 15000],
"Warranty": ["6 months", "8 months", "1 year", "4 months"]
}
# Create pandas DataFrame first, then convert to PySpark
dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)
print("Original PySpark DataFrame:")
dataframe_spark.show()
Original PySpark DataFrame:
+------------+--------------+------------+--------+
|Device name |    Store name|Device price|Warranty|
+------------+--------------+------------+--------+
|      Laptop|JJ electronics|       45000|6 months|
|Mobile phone| Birla dealers|       30000|8 months|
|          TV| Ajay services|       50000|  1 year|
|       Radio| Kapoor stores|       15000|4 months|
+------------+--------------+------------+--------+
Method 1: Using drop() Function
The drop() method is the most straightforward way to remove columns from a PySpark DataFrame. Because DataFrames are immutable, drop() returns a new DataFrame and leaves the original unchanged.
Dropping a Single Column
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropColumns").getOrCreate()
dataset = {
"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
"Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
"Device price": [45000, 30000, 50000, 15000],
"Warranty": ["6 months", "8 months", "1 year", "4 months"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)
# Drop single column
df_single_drop = dataframe_spark.drop("Warranty")
print("After dropping 'Warranty' column:")
df_single_drop.show()
After dropping 'Warranty' column:
+------------+--------------+------------+
|Device name |    Store name|Device price|
+------------+--------------+------------+
|      Laptop|JJ electronics|       45000|
|Mobile phone| Birla dealers|       30000|
|          TV| Ajay services|       50000|
|       Radio| Kapoor stores|       15000|
+------------+--------------+------------+
Dropping Multiple Columns
You can drop multiple columns by unpacking a list with the * operator, or by chaining drop() calls:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropColumns").getOrCreate()
dataset = {
"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
"Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
"Device price": [45000, 30000, 50000, 15000],
"Warranty": ["6 months", "8 months", "1 year", "4 months"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)
# Method 1: Using * operator
df_multiple_drop1 = dataframe_spark.drop(*["Device price", "Warranty"])
print("Dropping multiple columns with * operator:")
df_multiple_drop1.show()
# Method 2: Using drop with multiple calls
df_multiple_drop2 = dataframe_spark.drop("Store name").drop("Warranty")
print("Dropping multiple columns with chained drop:")
df_multiple_drop2.show()
Dropping multiple columns with * operator:
+------------+--------------+
|Device name |    Store name|
+------------+--------------+
|      Laptop|JJ electronics|
|Mobile phone| Birla dealers|
|          TV| Ajay services|
|       Radio| Kapoor stores|
+------------+--------------+

Dropping multiple columns with chained drop:
+------------+------------+
|Device name |Device price|
+------------+------------+
|      Laptop|       45000|
|Mobile phone|       30000|
|          TV|       50000|
|       Radio|       15000|
+------------+------------+
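The * operator here is plain Python argument unpacking: drop(*["a", "b"]) passes the list elements as separate positional arguments, exactly as if you had written drop("a", "b"). A minimal pure-Python sketch of this behavior (no Spark required; fake_drop is a hypothetical stand-in for the variadic DataFrame.drop(*cols) signature):

```python
def fake_drop(*cols):
    # Stand-in for DataFrame.drop(*cols): collects the
    # positional arguments it receives into a list
    return list(cols)

# Unpacking a list is equivalent to passing the names one by one
assert fake_drop(*["Device price", "Warranty"]) == fake_drop("Device price", "Warranty")
print(fake_drop(*["Device price", "Warranty"]))  # ['Device price', 'Warranty']
```

This is why drop(*columns_list) works even though drop() does not accept a plain list as a single argument.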
Method 2: Using select() with List Comprehension
The select() method with list comprehension allows you to specify which columns to keep by excluding unwanted ones:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropColumns").getOrCreate()
dataset = {
"Device name": ["Laptop", "Mobile phone", "TV", "Radio"],
"Store name": ["JJ electronics", "Birla dealers", "Ajay services", "Kapoor stores"],
"Device price": [45000, 30000, 50000, 15000],
"Warranty": ["6 months", "8 months", "1 year", "4 months"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spark = spark.createDataFrame(dataframe_pd)
# Drop columns using select with list comprehension
columns_to_drop = {"Device name", "Store name"}
df_select_drop = dataframe_spark.select([col for col in dataframe_spark.columns if col not in columns_to_drop])
print("Using select() with list comprehension:")
df_select_drop.show()
Using select() with list comprehension:
+------------+--------+
|Device price|Warranty|
+------------+--------+
|       45000|6 months|
|       30000|8 months|
|       50000|  1 year|
|       15000|4 months|
+------------+--------+
Comparison of Methods
| Method | Syntax | Best For |
|---|---|---|
| drop() | df.drop("col") | Simple column removal |
| drop(*list) | df.drop(*["col1", "col2"]) | Multiple specific columns |
| select() | df.select([cols...]) | Complex conditional dropping |
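As an example of the conditional dropping that select() enables, you can filter columns by type. The sketch below works on the list of (name, type) pairs that DataFrame.dtypes returns; the dtypes list is hard-coded here to match the sample DataFrame so the snippet runs without Spark, and on a real DataFrame you would read dataframe_spark.dtypes instead:

```python
# (name, type) pairs as DataFrame.dtypes would return them for the sample data
dtypes = [("Device name", "string"), ("Store name", "string"),
          ("Device price", "bigint"), ("Warranty", "string")]

# Collect every string-typed column name with a list comprehension
string_cols = [name for name, dtype in dtypes if dtype == "string"]
print(string_cols)  # ['Device name', 'Store name', 'Warranty']

# On a real DataFrame this would keep only the numeric columns:
# df_numeric = dataframe_spark.drop(*string_cols)
```

The same pattern works for any predicate on column names or types, which is what makes the select()/comprehension approach suited to conditional dropping.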
Conclusion
Use drop() for straightforward column removal and select() with a list comprehension for conditional column filtering. The drop() method is generally the more readable choice, and it is a no-op for column names that do not exist in the DataFrame, so dropping a missing column never raises an error.
