Filtering a PySpark DataFrame Using isin() by Exclusion

PySpark DataFrames are distributed collections of data organized into named columns, similar to tables in a database. When working with large datasets, you often need to filter out specific rows based on whether column values match a predefined list. The isin() function combined with the negation operator (~) provides an efficient way to exclude rows by filtering out unwanted values.

Understanding the isin() Function

The isin() function checks whether DataFrame values are present in a list of values. It returns a boolean result: True if the column value exists in the provided list, False otherwise.

Syntax

column.isin(list_of_values)

For exclusion filtering, we use the negation operator ~ before isin() to get the opposite result.

Basic Exclusion Filtering

Example 1: Excluding by Age

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("IsinExclusion").getOrCreate()

# Create sample DataFrame
data = [("Karthik", 25), ("Vijay", 30), ("Kruthik", 35), ("Sricharan", 40), ("Aditi", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Define exclusion list
exclusion_list = [25, 30]

# Filter out rows where age is in exclusion list
filtered_df = df.filter(~col("age").isin(exclusion_list))
filtered_df.show()
+---------+---+
|     name|age|
+---------+---+
|  Kruthik| 35|
|Sricharan| 40|
|    Aditi| 45|
+---------+---+

Example 2: Excluding by City

# Create DataFrame with city data
data = [("Karthik", "New York"), ("Vijay", "Chicago"), ("Kruthik", "San Francisco"), 
        ("Sricharan", "Los Angeles"), ("Aditi", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])

# Exclude specific cities
exclusion_list = ["New York", "Chicago", "Miami"]
filtered_df = df.filter(~col("city").isin(exclusion_list))
filtered_df.show()
+---------+-------------+
|     name|         city|
+---------+-------------+
|Sricharan|  Los Angeles|
|  Kruthik|San Francisco|
+---------+-------------+

Complex Filtering Conditions

You can combine isin() exclusion with other conditions using logical operators.

# Create DataFrame
data = [("Alice", "New York"), ("Bob", "Chicago"), ("Charlie", "San Francisco"), 
        ("David", "Los Angeles"), ("Eva", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])

# Complex condition: exclude cities but keep specific names
exclusion_cities = ["New York", "Chicago", "Miami"]
filtered_df = df.filter(~col("city").isin(exclusion_cities) | (col("name") == "David"))
filtered_df.show()
+-------+-------------+
|   name|         city|
+-------+-------------+
|  David|  Los Angeles|
|Charlie|San Francisco|
+-------+-------------+

Using filter() vs where()

Both filter() and where() methods work identically for exclusion filtering. Here's a comparison using a complete example:

# Create sample DataFrame
data = [[1, "Karthik", "Sharma"],
        [2, "Kruthik", "Ballari"],
        [3, "Vijay", "Kumar"],
        [4, "Aditi", "Gupta"],
        [5, "Sricharan", "Sharma"]]

columns = ['ID', 'NAME', 'LASTNAME']
dataframe = spark.createDataFrame(data, columns)

# Using filter() to exclude specific names
excluded_names = ["Vijay", "Aditi"]
result_filter = dataframe.filter(~col("NAME").isin(excluded_names))

# Using where() - same result
result_where = dataframe.where(~col("NAME").isin(excluded_names))

print("Using filter():")
result_filter.show()
Using filter():
+---+---------+--------+
| ID|     NAME|LASTNAME|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  2|  Kruthik| Ballari|
|  5|Sricharan|  Sharma|
+---+---------+--------+

Comparison

Method            Syntax                                 Use Case
filter()          df.filter(~col("column").isin(list))   Standard filtering approach
where()           df.where(~col("column").isin(list))    SQL-like syntax preference
Column notation   df.filter(~df.column.isin(list))       Direct column reference

Conclusion

The isin() function with negation (~) provides an efficient way to exclude rows from PySpark DataFrames based on column values. This approach is particularly useful when working with large datasets where you need to filter out specific categories or values, helping reduce data processing overhead and focus analysis on relevant information.

Updated on: 2026-03-27T11:58:04+05:30
