Drop Rows Containing a Specific Value in a PySpark DataFrame
When dealing with large datasets, PySpark provides powerful tools for data processing and manipulation. PySpark, Apache Spark's Python API, lets you drive distributed data processing directly from Python.
In this tutorial, we'll learn how to drop rows containing specific values from a PySpark DataFrame using different methods. This selective data elimination is essential for data cleaning and maintaining data relevance.
Creating a Sample PySpark DataFrame
First, let's create a sample DataFrame to demonstrate the row-dropping techniques:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()
# Sample dataset
data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
("15", "18", "45", "11"), ("12", "8", "18", "28"),
("25", "42", "18", "29"), ("45", "50", "17", "11")]
columns = ["Col1", "Col2", "Col3", "Col4"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
Original DataFrame:
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  18|  25|  28|  40|
|  15|  18|  45|  11|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+
Using filter() to Drop Rows
The filter() method allows you to keep rows that meet specific conditions. To drop rows with a particular value, use the "not equal" operator:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()
data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
("15", "18", "45", "11"), ("12", "8", "18", "28"),
("25", "42", "18", "29"), ("45", "50", "17", "11")]
columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)
# Drop rows where Col1 equals "18"
filtered_df = df.filter(df["Col1"] != "18")
print("After dropping rows with Col1 = '18':")
filtered_df.show()
After dropping rows with Col1 = '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  15|  18|  45|  11|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+
Using where() to Drop Rows
The where() method is an alias of filter(), so the two can be used interchangeably:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()
data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
("15", "18", "45", "11"), ("12", "8", "18", "28"),
("25", "42", "18", "29"), ("45", "50", "17", "11")]
columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)
# Drop rows where Col2 equals "18"
result_df = df.where(df["Col2"] != "18")
print("After dropping rows with Col2 = '18':")
result_df.show()
After dropping rows with Col2 = '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  18|  25|  28|  40|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+
Dropping Rows with Multiple Conditions
You can combine multiple conditions using logical operators to drop rows based on values in different columns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()
data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
("15", "18", "45", "11"), ("12", "8", "18", "28"),
("25", "42", "18", "29"), ("45", "50", "17", "11")]
columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)
# Drop rows containing "18" in any column
filtered_df = df.filter((df["Col1"] != "18") &
(df["Col2"] != "18") &
(df["Col3"] != "18") &
(df["Col4"] != "18"))
print("After dropping all rows containing '18':")
filtered_df.show()
After dropping all rows containing '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  45|  50|  17|  11|
+----+----+----+----+
Comparison
| Method | Syntax | Use Case |
|---|---|---|
| filter() | df.filter(condition) | Standard SQL-like filtering |
| where() | df.where(condition) | Alternative to filter(), identical functionality |
| Multiple conditions | df.filter(cond1 & cond2) | Complex filtering with logical operators |
Conclusion
Both filter() and where() methods effectively drop rows containing specific values in PySpark DataFrames. Use logical operators to combine conditions when filtering multiple columns. Always analyze your data carefully before applying filters to ensure data integrity.
