How to check if something is an RDD or a DataFrame in PySpark?


RDD stands for Resilient Distributed Dataset, PySpark's fundamental abstraction: an immutable, distributed collection of objects. RDDs are the primary building blocks of PySpark. The data is split into smaller chunks (partitions) that are distributed among the nodes of a cluster, and an RDD supports two kinds of operations: transformations and actions.
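As a quick illustration, the minimal sketch below (not part of the article's later code; the app name is an assumption) creates an RDD from a Python list, applies a transformation, and then runs an action.

from pyspark.sql import SparkSession

# Assumed setup: any existing SparkSession would work just as well
spark = SparkSession.builder.appName("rdd demo").getOrCreate()

numbers = spark.sparkContext.parallelize([1, 2, 3, 4])   # create an RDD
doubled = numbers.map(lambda x: x * 2)                    # transformation (lazy)
print(doubled.collect())                                  # action, prints [2, 4, 6, 8]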

DataFrame in PySpark

A DataFrame is a two-dimensional, labeled data structure used for data manipulation and analysis. It can hold different data types such as integers, floats, and strings. In PySpark, a DataFrame is organized into uniquely named columns and, like an RDD, its rows are distributed across the nodes of a cluster.

DataFrames are commonly used in machine learning and data analysis tasks to manipulate large data sets. They support operations such as filtering, sorting, merging, grouping, and transforming data, as sketched below.
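For example, the following minimal sketch (the column names and values here are illustrative assumptions, not the article's data set) shows filtering and grouping on a small DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe demo").getOrCreate()

people = spark.createDataFrame(
   [("John", 25), ("Jane", 30), ("Bob", 35)],
   ["name", "age"]
)

people.filter(people.age > 28).show()    # filtering rows
people.groupBy("name").count().show()    # grouping and aggregating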

Python provides a built-in function named isinstance() that can be used to check whether a given object is an RDD or a DataFrame.

Syntax

The following is the syntax for using the isinstance() function.

isinstance(data, RDD) or isinstance(data, DataFrame)

Where,

  • isinstance() is the built-in function used to check whether the data is an RDD or a DataFrame

  • data is the object being checked; the second argument is the class to test against (RDD or DataFrame)
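As a quick reminder of how isinstance() behaves in plain Python (this snippet is purely illustrative and independent of PySpark):

value = 42
print(isinstance(value, int))    # True: value is an int
print(isinstance(value, str))    # False: value is not a str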

Installing PySpark

First, we have to install the PySpark library in the Python environment using the command below.

pip install PySpark 

Output

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PySpark
  Downloading PySpark-3.3.2.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 5.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 KB 28.1 MB/s eta 0:00:00
Building wheels for collected packages: PySpark
  Building wheel for PySpark (setup.py) ... done
  Created wheel for PySpark: filename=PySpark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=184a9a6949d4be5a4746cd53cb28d40cf38a4771048f5f14445d8ee4ab14a07c
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built PySpark
Installing collected packages: py4j, PySpark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.9.5 PySpark-3.3.2
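Once the installation finishes, a quick way to confirm the package is importable is to print its version (a small sanity-check sketch, not part of the original article):

import pyspark

print(pyspark.__version__)   # e.g. 3.3.2, matching the wheel built above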

Before working with the isinstance() function, we have to create some data, either a DataFrame or an RDD.

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder \
   .appName("substring check") \
   .getOrCreate()

schema = StructType([
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True),
   StructField("gender", StringType(), True),
   StructField("occupation", StringType(), True),
   StructField("salary", DoubleType(), True)
])

df = [("John", 25, "Male", "Developer", 5000.0),
        ("Jane", 30, "Female", "Manager", 7000.0),
        ("Bob", 35, "Male", "Director", 10000.0),
        ("Alice", 40, "Female", "CEO", 15000.0)]
data = spark.createDataFrame(df, schema)
df.show()

Output

+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+

Example

In the following example, we pass the DataFrame created above to the isinstance() function together with the RDD and DataFrame classes.

from pyspark.sql import DataFrame
from pyspark.rdd import RDD
if isinstance(data, RDD):
   print("The given data is an RDD")
elif isinstance(data, DataFrame):
   print("The given data is a DataFrame")
else:
   print("The given data is neither an RDD nor a DataFrame")

Output

The given data is a DataFrame
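The data here is a DataFrame, so the RDD branch above never fires. As a complementary sketch (the variable name rdd_data is illustrative), the same check can be run against an actual RDD created with sparkContext.parallelize, using the SparkSession created earlier:

from pyspark.rdd import RDD
from pyspark.sql import DataFrame

rdd_data = spark.sparkContext.parallelize([1, 2, 3, 4])

if isinstance(rdd_data, RDD):
   print("The given data is an RDD")
elif isinstance(rdd_data, DataFrame):
   print("The given data is a DataFrame")
else:
   print("The given data is neither an RDD nor a DataFrame")

This prints "The given data is an RDD".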

Example

In the following example, we pass a plain Python list to the isinstance() function.

from pyspark.sql import DataFrame
from pyspark.rdd import RDD
data = [22,1,14,5,12,5,7,2,24,2,21,11]
if isinstance(data, RDD):
   print("The given data is an RDD")
elif isinstance(data, DataFrame):
   print("The given data is a DataFrame")
else:
   print("The given data is neither an RDD nor a DataFrame")

Output

The given data is neither an RDD nor a DataFrame
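If this check is needed in several places, it can be wrapped in a small helper. The function name check_spark_type below is a hypothetical example, not a PySpark API:

from pyspark.rdd import RDD
from pyspark.sql import DataFrame

def check_spark_type(obj):
   """Return a short label describing whether obj is an RDD, a DataFrame, or neither."""
   if isinstance(obj, RDD):
      return "RDD"
   if isinstance(obj, DataFrame):
      return "DataFrame"
   return "neither"

print(check_spark_type(data))                                    # DataFrame
print(check_spark_type(spark.sparkContext.parallelize([1, 2])))  # RDD
print(check_spark_type([1, 2, 3]))                               # neither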
