Filtering a PySpark DataFrame Using isin() by Exclusion


Python is a high-level, interpreted, object-oriented programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it particularly appealing for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components.

PySpark DataFrame

Data is organized into named columns in PySpark DataFrames, which are distributed collections of data that can be processed across multiple machines. These DataFrames may be built from existing resilient distributed datasets (RDDs), external databases, or structured data files.

Syntax - isin()

isin(list_)

The list_ parameter is the list of values against which the column's values are compared.

To check or filter whether DataFrame values are present in a given list of values, use PySpark's isin() function or the SQL IN operator.

The Column class's isin() method returns a boolean Column that evaluates to True wherever the column's value matches one of the supplied values.

Example 1

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Create a PySpark DataFrame
data = [("Karthik", 25), ("Vijay", 30), ("Kruthik", 35), ("Sricharan", 40), ("Aditi", 45)]
df = spark.createDataFrame(data, ["name", "age"])
 
# Define the list of values to exclude
exclusion_list = [25, 30]
 
# Filter out rows where age is in the exclusion list
filtered_df = df.filter(~col("age").isin(exclusion_list))
 
# Display the filtered DataFrame
filtered_df.show()

Output

+---------+---+
|     name|age|
+---------+---+
|  Kruthik| 35|
|Sricharan| 40|
|    Aditi| 45|
+---------+---+

Example 2

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Create a PySpark DataFrame
data = [("Karthik", "New York"), ("Vijay", "Chicago"), ("Kruthik", "San Francisco"), ("Sricharan", "Los Angeles"), ("Aditi", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])
# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"] 
# Filter out rows where city is in the exclusion list
filtered_df = df.filter(~col("city").isin(exclusion_list))
# Display the filtered DataFrame
filtered_df.show()

Output

+---------+-------------+
|     name|         city|
+---------+-------------+
|  Kruthik|San Francisco|
|Sricharan|  Los Angeles|
+---------+-------------+

Creating a DataFrame for Demonstration

Example 1

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Create a PySpark DataFrame
data = [("Alice", "New York"), ("Bob", "Chicago"), ("Charlie", "San Francisco"), ("David", "Los Angeles"), ("Eva", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])
# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"]
# Filter out rows where city is in the exclusion list and name is not "David"
filtered_df = df.filter(~(col("city").isin(exclusion_list) & (col("name") != "David")))
# Display the filtered DataFrame
filtered_df.show()

Output

+-------+-------------+
|   name|         city|
+-------+-------------+
|Charlie|San Francisco|
|  David|  Los Angeles|
+-------+-------------+

Example 2

# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students' data
data = [[1, "Karthik", "Sharma"],
        [2, "Kruthik", "Ballari"],
        [3, "Vijay", "Kumar"],
        [4, "Aditi", "Gupta"],
        [5, "Sricharan", "Sharma"]]
# specify column names
columns = ['ID', 'NAME', 'Lastname']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

Output

+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  2|  Kruthik| Ballari|
|  3|    Vijay|   Kumar|
|  4|    Aditi|   Gupta|
|  5|Sricharan|  Sharma|
+---+---------+--------+

Example 3

The following code fetches the rows whose NAME is Kruthik and displays them.

filter(): This clause checks a condition and returns the matching rows; filter() and where() are aliases and behave identically.

Syntax

dataframe.filter(condition)
# Getting Kruthik's name
dataframe.filter((dataframe.NAME).isin(['Kruthik'])).show()

Output

+---+-------+--------+
| ID|   NAME|Lastname|
+---+-------+--------+
|  2|Kruthik| Ballari|
+---+-------+--------+

Example 4

The following program fetches the rows whose last name is Sharma and prints their full names.

where(): This clause checks a condition and returns the matching rows.

Syntax

dataframe.where(condition)
# Fetching names of people whose last name is Sharma
dataframe.where((dataframe.Lastname).isin(['Sharma'])).show()

Output

+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  5|Sricharan|  Sharma|
+---+---------+--------+

Conclusion

Filtering with isin() by exclusion is a very helpful way to remove the rows of a PySpark DataFrame whose column value matches a predefined list of values, keeping only the rows that do not. It is flexible and can be applied in many different circumstances.

When working with very large datasets, this strategy is especially helpful because it can drastically reduce the amount of data that needs to be processed. By using isin() by exclusion, data scientists and analysts can quickly filter out irrelevant records and concentrate exclusively on the information they need for their analysis.

Updated on: 10-Aug-2023
