How to check for a substring in a PySpark dataframe?
PySpark is a Python library that provides an interface to Apache Spark, a distributed computing system for processing large datasets. PySpark allows you to write Spark applications using Python, making it accessible for Python developers to work with big data.
What is Substring Checking in PySpark?
Substring checking in PySpark DataFrames involves searching for specific patterns or text within column values. This is commonly used for data filtering, pattern matching, and text analysis operations. PySpark provides several built-in functions to perform substring operations efficiently across distributed datasets.
Setting Up PySpark
First, let's create a sample DataFrame to demonstrate the substring checking methods:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("substring_check") \
    .getOrCreate()

# Define the schema for the sample data
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("occupation", StringType(), True),
    StructField("salary", DoubleType(), True)
])

# Sample data
data = [("John", 25, "Male", "Developer", 5000.0),
        ("Jane", 30, "Female", "Manager", 7000.0),
        ("Bob", 35, "Male", "Director", 10000.0),
        ("Alice", 40, "Female", "CEO", 15000.0)]

# Create the DataFrame and display it
df = spark.createDataFrame(data, schema)
df.show()
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+
Method 1: Using like() Function
The like() function checks whether a column value matches a pattern using SQL wildcards: % matches any sequence of characters, and _ matches exactly one character:
# Filter rows where occupation contains "Dev"
result = df.where(col("occupation").like("%Dev%"))
result.show()
+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|John| 25|  Male| Developer|5000.0|
+----+---+------+----------+------+
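The _ wildcard is handy when the position of a character matters; in PySpark such a filter would look like df.where(col("name").like("J_n%")). As a plain-Python sketch of the same pattern semantics (sql_like is a hypothetical helper, not part of PySpark):

```python
import re

def sql_like(value, pattern):
    """Hypothetical helper mimicking SQL LIKE semantics:
    % matches any run of characters, _ matches exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # SQL % -> regex .*
        elif ch == "_":
            parts.append(".")       # SQL _ -> regex .
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value) is not None

names = ["John", "Jane", "Bob", "Alice"]
# "J_n%" = 'J', then any single character, then 'n', then anything
print([n for n in names if sql_like(n, "J_n%")])  # ['Jane']
```

"Jane" matches because its second character can be anything, while "John" fails since its third character is not "n".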
Method 2: Using rlike() Function
The rlike() function matches column values against a Java regular expression, providing more flexibility than like():
# Filter using regular expression pattern
result = df.where(col("occupation").rlike(".*Manager.*"))
result.show()
+----+---+------+----------+------+
|name|age|gender|occupation|salary|
+----+---+------+----------+------+
|Jane| 30|Female|   Manager|7000.0|
+----+---+------+----------+------+
```
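Because rlike() looks for the pattern anywhere in the value, the leading and trailing .* in the pattern above are optional. rlike() is also case-sensitive by default; the inline (?i) flag, supported by both the Java regex engine Spark uses and Python's re module, makes the match case-insensitive. A plain-Python sketch of the same regex semantics (the sample list is hypothetical):

```python
import re

occupations = ["Developer", "MANAGER", "Director", "CEO"]

# (?i) turns on case-insensitive matching for the whole pattern, so
# "MANAGER" is found even though the pattern is lowercase.
matches = [o for o in occupations if re.search(r"(?i)manager", o)]
print(matches)  # ['MANAGER']
```

In PySpark the equivalent filter would be df.where(col("occupation").rlike(r"(?i)manager")).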
Method 3: Using substr() Function
The substr() function extracts a substring of a given length starting at a 1-based position, which can then be compared with a target string:
# Check if first 3 characters of occupation equal "Dir"
result = df.where(col("occupation").substr(1, 3) == "Dir")
result.show()
+----+---+------+----------+-------+
|name|age|gender|occupation| salary|
+----+---+------+----------+-------+
| Bob| 35|  Male|  Director|10000.0|
+----+---+------+----------+-------+
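The 1-based start position is a common trap for Python developers used to 0-based slicing. A plain-Python mimic of Column.substr() makes the offset explicit (spark_substr is a hypothetical helper written for illustration):

```python
def spark_substr(value, pos, length):
    """Hypothetical pure-Python mimic of Column.substr(startPos, length).
    startPos is 1-based, as in SQL, unlike Python's 0-based slices."""
    return value[pos - 1 : pos - 1 + length]

occupations = ["Developer", "Manager", "Director", "CEO"]
# Same check as the PySpark filter: first three characters equal "Dir"
print([o for o in occupations if spark_substr(o, 1, 3) == "Dir"])  # ['Director']
```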
Method 4: Using contains() Function
The contains() function is the simplest way to check whether a column value contains a specific substring; note that the match is case-sensitive:
# Filter rows where name contains "Al"
result = df.where(col("name").contains("Al"))
result.show()
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+
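Substring conditions can also be combined: in PySpark, column predicates join with & (and), | (or) and ~ (not), for example df.where(col("occupation").contains("e") & col("name").contains("J")). The same logic in plain Python over the sample rows (the row list is a hypothetical stand-in for the DataFrame):

```python
rows = [("John", "Developer"), ("Jane", "Manager"),
        ("Bob", "Director"), ("Alice", "CEO")]

# Keep rows where the occupation contains "e" AND the name contains "J";
# both checks are case-sensitive, so "CEO" does not contain "e".
result = [name for name, occ in rows if ("e" in occ) and ("J" in name)]
print(result)  # ['John', 'Jane']
```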
Comparison of Methods
| Method | Pattern Type | Use Case |
|---|---|---|
| like() | SQL wildcards (%, _) | Simple pattern matching |
| rlike() | Regular expressions | Complex pattern matching |
| substr() | Position-based | Fixed-position substring comparison |
| contains() | Exact substring | Simple substring search |
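All four methods are case-sensitive by default. A common workaround is to lowercase the column before matching; in PySpark that would be df.where(lower(col("occupation")).contains("ceo")), with lower imported from pyspark.sql.functions. The plain-Python equivalent of that normalization:

```python
occupations = ["Developer", "Manager", "Director", "CEO"]

# Lowercase each value before the substring check so that "CEO"
# matches the lowercase search term "ceo".
print([o for o in occupations if "ceo" in o.lower()])  # ['CEO']
```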
Conclusion
PySpark provides multiple methods for substring checking: like() for SQL-style patterns, rlike() for regex patterns, substr() for position-based extraction, and contains() for simple substring matching. Choose the method that best fits your specific pattern matching requirements.
