How to check for a substring in a PySpark dataframe?


PySpark is the Python library that provides an interface to Apache Spark. Apache Spark is a distributed computing system for processing large datasets. PySpark allows us to write Spark applications in Python, in addition to the traditional languages Java and Scala.

Checking for a substring in a PySpark DataFrame

PySpark provides a DataFrame API that lets us manipulate structured data in a way similar to SQL queries. It can read data in various formats such as Parquet, CSV, and JSON, and it also includes a machine learning library with support for classification, regression, clustering, and more. A reading example is shown below.
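For instance, reading a file into a DataFrame is a one-line call. The following is a minimal sketch, assuming a SparkSession named spark (we create one later in this article) and a hypothetical CSV file people.csv –

# people.csv is a hypothetical file; header=True treats the first row as
# column names, and inferSchema=True guesses the column types from the data
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)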

We have different ways to check for a substring in a PySpark DataFrame. In this article, we will discuss the following –

  • Using like() or rlike()

  • Using substr()

Installing PySpark

Before looking at each approach in detail, we have to install PySpark in our working environment using the following command.

!pip install pyspark

On successful installation, we will get output similar to the one shown below –

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.9.5 pyspark-3.3.2

Creating a DataFrame in PySpark

Now we have to create a PySpark DataFrame with the sample data. The following are the steps –

  • We have to import all the necessary PySpark modules and classes to create the DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
  • Next, we have to create a SparkSession to work with the PySpark DataFrame.

spark = SparkSession.builder \
    .appName("substring check") \
    .getOrCreate()
  • We have to define the schema for the DataFrame. Each StructField takes the column name, the data type, and a boolean indicating whether the column is nullable.

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("occupation", StringType(), True),
    StructField("salary", DoubleType(), True)
])
  • In this step, we will create the list of rows that will form the DataFrame.

data = [("John", 25, "Male", "Developer", 5000.0),
        ("Jane", 30, "Female", "Manager", 7000.0),
        ("Bob", 35, "Male", "Director", 10000.0),
        ("Alice", 40, "Female", "CEO", 15000.0)]
  • Finally, we create the DataFrame from the schema and data specified in the above two steps and display it.

df = spark.createDataFrame(data, schema)
df.show()

Output

The following is the output of the DataFrame created using PySpark –

+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

Now, let's look at each approach for checking a substring in a PySpark DataFrame.

Using the like() or rlike() functions

In PySpark, we have two functions, like() and rlike(), that can be used to check for a substring in a DataFrame column. The like() function checks whether a column matches a SQL LIKE pattern, where % matches any sequence of characters, whereas the rlike() function checks the column against a regular expression pattern.

from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)
df.show()

# Rows where the occupation column contains the substring "Developer"
df.where(col("occupation").like("%Developer%")).show()
df.where(col("occupation").rlike(".*Developer.*")).show()

Output

The output of the like() and rlike() functions is as follows –

+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+
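Note that rlike() matches when the regular expression is found anywhere in the value, so the leading and trailing .* in the pattern above are optional. The following sketch returns the same single row –

df.where(col("occupation").rlike("Developer")).show()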

Using the substr() function

The substr() function extracts a substring from a column, given a 1-based start position and a length; we can then compare the extracted substring with an expected value. Here, substr(1, 5) extracts the first five characters of the occupation column and compares them with "Manag", which matches only the "Manager" row.

from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)
df.show()

# Rows where the first five characters of occupation equal "Manag"
df.where(col("occupation").substr(1, 5) == "Manag").show()

Output

The output of the substr() function is as follows –

+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|Jane| 30|Female|   Manager|7000.0|
+----+---+------+----------+------+
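Since substr() compares characters at a fixed position, it is only suitable when we know exactly where the substring occurs in the value; for arbitrary substring checks, like() and rlike() are the more flexible choice.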
