How to check for a substring in a PySpark dataframe?

PySpark is a Python library that provides an interface to Apache Spark, a distributed computing system for processing large datasets. PySpark allows you to write Spark applications using Python, making it accessible for Python developers to work with big data.

What is Substring Checking in PySpark?

Substring checking in PySpark DataFrames involves searching for specific patterns or text within column values. This is commonly used for data filtering, pattern matching, and text analysis operations. PySpark provides several built-in functions to perform substring operations efficiently across distributed datasets.

Setting Up PySpark

First, let's create a sample DataFrame to demonstrate substring checking methods ?

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder \
    .appName("substring_check") \
    .getOrCreate()

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("occupation", StringType(), True),
    StructField("salary", DoubleType(), True)
])

# Sample data
data = [("John", 25, "Male", "Developer", 5000.0),
        ("Jane", 30, "Female", "Manager", 7000.0),
        ("Bob", 35, "Male", "Director", 10000.0),
        ("Alice", 40, "Female", "CEO", 15000.0)]

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

Method 1: Using like() Function

The like() function checks if a column contains a specified pattern using SQL-like wildcards. Use % as a wildcard to match any sequence of characters ?

# Filter rows where occupation contains "Dev"
result = df.where(col("occupation").like("%Dev%"))
result.show()
+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+

Method 2: Using rlike() Function

The rlike() function uses regular expressions for pattern matching, providing more flexibility than like() ?

# Filter using regular expression pattern
result = df.where(col("occupation").rlike(".*Manager.*"))
result.show()
+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|Jane| 30|Female|   Manager|7000.0|
+----+---+------+----------+------+

Method 3: Using substr() Function

The substr() function extracts a substring from a specific position and compares it with a target string ?

# Check if first 3 characters of occupation equal "Dir"
result = df.where(col("occupation").substr(1, 3) == "Dir")
result.show()
+----+---+------+----------+-------+
|name|age|gender|occupation| salary|
+----+---+------+----------+-------+
| Bob| 35|  Male|  Director|10000.0|
+----+---+------+----------+-------+

Method 4: Using contains() Function

The contains() function is a simple way to check if a column contains a specific substring ?

# Filter rows where name contains "Al"
result = df.where(col("name").contains("Al"))
result.show()
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

Comparison of Methods

Method Pattern Type Use Case
like() SQL wildcards (%) Simple pattern matching
rlike() Regular expressions Complex pattern matching
substr() Position-based Fixed position substring
contains() Exact substring Simple substring search

Conclusion

PySpark provides multiple methods for substring checking: like() for SQL-style patterns, rlike() for regex patterns, substr() for position-based extraction, and contains() for simple substring matching. Choose the method that best fits your specific pattern matching requirements.

Updated on: 2026-03-27T11:42:02+05:30

4K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements