Drop Rows Containing a Specific Value in a PySpark DataFrame

When dealing with large datasets, PySpark provides powerful tools for data processing and manipulation. PySpark is Apache Spark's Python API, which lets you run distributed data processing jobs from Python.

In this tutorial, we'll learn how to drop rows containing specific values from a PySpark DataFrame using different methods. This selective data elimination is essential for data cleaning and maintaining data relevance.

Creating a Sample PySpark DataFrame

First, let's create a sample DataFrame to demonstrate the row-dropping techniques:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()

# Sample dataset
data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
        ("15", "18", "45", "11"), ("12", "8", "18", "28"),
        ("25", "42", "18", "29"), ("45", "50", "17", "11")]

columns = ["Col1", "Col2", "Col3", "Col4"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
Original DataFrame:
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  18|  25|  28|  40|
|  15|  18|  45|  11|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+

Using filter() to Drop Rows

The filter() method keeps only the rows that satisfy a condition. To drop rows with a particular value, use the "not equal" operator (!=):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()

data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
        ("15", "18", "45", "11"), ("12", "8", "18", "28"),
        ("25", "42", "18", "29"), ("45", "50", "17", "11")]

columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)

# Drop rows where Col1 equals "18"
filtered_df = df.filter(df["Col1"] != "18")
print("After dropping rows with Col1 = '18':")
filtered_df.show()
After dropping rows with Col1 = '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  15|  18|  45|  11|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+

Using where() to Drop Rows

The where() method is an alias of filter() and can be used interchangeably:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()

data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
        ("15", "18", "45", "11"), ("12", "8", "18", "28"),
        ("25", "42", "18", "29"), ("45", "50", "17", "11")]

columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)

# Drop rows where Col2 equals "18"
result_df = df.where(df["Col2"] != "18")
print("After dropping rows with Col2 = '18':")
result_df.show()
After dropping rows with Col2 = '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  23|  25|  18|  19|
|  18|  25|  28|  40|
|  12|   8|  18|  28|
|  25|  42|  18|  29|
|  45|  50|  17|  11|
+----+----+----+----+

Dropping Rows with Multiple Conditions

You can combine multiple conditions using logical operators to drop rows based on values in different columns ?

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate()

data = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
        ("15", "18", "45", "11"), ("12", "8", "18", "28"),
        ("25", "42", "18", "29"), ("45", "50", "17", "11")]

columns = ["Col1", "Col2", "Col3", "Col4"]
df = spark.createDataFrame(data, columns)

# Drop rows containing "18" in any column
filtered_df = df.filter((df["Col1"] != "18") & 
                       (df["Col2"] != "18") &
                       (df["Col3"] != "18") &
                       (df["Col4"] != "18"))

print("After dropping all rows containing '18':")
filtered_df.show()
After dropping all rows containing '18':
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  45|  50|  17|  11|
+----+----+----+----+

Comparison

Method               Syntax                     Use Case
filter()             df.filter(condition)       Standard SQL-like filtering
where()              df.where(condition)        Alias of filter(); identical functionality
Multiple conditions  df.filter(cond1 & cond2)   Complex filtering with logical operators

Conclusion

Both filter() and where() methods effectively drop rows containing specific values in PySpark DataFrames. Use logical operators to combine conditions when filtering multiple columns. Always analyze your data carefully before applying filters to ensure data integrity.

Updated on: 2026-03-27T06:14:16+05:30
