How to check if something is an RDD or a DataFrame in PySpark?
RDD stands for Resilient Distributed Dataset, which is PySpark's fundamental abstraction: an immutable collection of objects. RDDs are the primary building blocks of PySpark. They are split into smaller chunks and distributed among the nodes of a cluster, and they support two kinds of operations, transformations and actions.
DataFrame in PySpark
A DataFrame is a two-dimensional labeled data structure. It is used for data manipulation and analysis, and its columns can hold different data types such as integers, floats, and strings. Column labels are unique, while each row is labeled with a unique index value that makes it easy to access specific rows.
DataFrames are commonly used in machine learning tasks to manipulate and analyze large datasets. They support operations such as filtering, sorting, merging, grouping, and transforming data.
Python's built-in isinstance() function can be used to check whether a given object is an RDD or a DataFrame.
Syntax
Following is the syntax of the isinstance() function.
isinstance(data, RDD)
isinstance(data, DataFrame)
Where,
isinstance() is the built-in function used to check whether the data is an RDD or a DataFrame; it returns a boolean.
data is the input object to be checked.
RDD and DataFrame are the PySpark classes to test against.
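Since isinstance() is a standard Python built-in, its behavior can be demonstrated without Spark at all. The following is a minimal sketch using illustrative classes (Animal and Dog are hypothetical names, not part of PySpark):

```python
class Animal:
    pass

class Dog(Animal):  # Dog is a subclass of Animal
    pass

d = Dog()

# isinstance() matches the object's own class...
print(isinstance(d, Dog))            # True
# ...and any of its base classes
print(isinstance(d, Animal))         # True
# the second argument may also be a tuple of types
print(isinstance(d, (list, dict)))   # False
```

The same pattern applies when the classes being tested against are PySpark's RDD and DataFrame.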
Installing PySpark
First, we have to install the PySpark library in the Python environment using the following command.
pip install pyspark
Output
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PySpark
  Downloading PySpark-3.3.2.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 5.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 KB 28.1 MB/s eta 0:00:00
Building wheels for collected packages: PySpark
  Building wheel for PySpark (setup.py) ... done
  Created wheel for PySpark: filename=PySpark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=184a9a6949d4be5a4746cd53cb28d40cf38a4771048f5f14445d8ee4ab14a07c
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built PySpark
Installing collected packages: py4j, PySpark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.9.5 PySpark-3.3.2
Before working with the isinstance() function, we have to create some data, either a DataFrame or an RDD.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder \
    .appName("substring check") \
    .getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("occupation", StringType(), True),
    StructField("salary", DoubleType(), True)
])

rows = [("John", 25, "Male", "Developer", 5000.0),
        ("Jane", 30, "Female", "Manager", 7000.0),
        ("Bob", 35, "Male", "Director", 10000.0),
        ("Alice", 40, "Female", "CEO", 15000.0)]

data = spark.createDataFrame(rows, schema)
data.show()
Output
+-----+---+------+----------+-------+
| name|age|gender|occupation| salary|
+-----+---+------+----------+-------+
| John| 25|  Male| Developer| 5000.0|
| Jane| 30|Female|   Manager| 7000.0|
|  Bob| 35|  Male|  Director|10000.0|
|Alice| 40|Female|       CEO|15000.0|
+-----+---+------+----------+-------+
Example
In the following example, we pass the data created above to the isinstance() function, along with PySpark's RDD and DataFrame classes.
from pyspark.sql import DataFrame
from pyspark.rdd import RDD

if isinstance(data, RDD):
    print("The given data is an RDD")
elif isisinstance(data, DataFrame) if False else isinstance(data, DataFrame):
    print("The given data is a DataFrame")
else:
    print("The given data is neither an RDD nor a DataFrame")
Output
The given data is a DataFrame
Example
In the following example, we pass a plain Python list to the isinstance() function.
from pyspark.sql import DataFrame
from pyspark.rdd import RDD

data = [22, 1, 14, 5, 12, 5, 7, 2, 24, 2, 21, 11]

if isinstance(data, RDD):
    print("The given data is an RDD")
elif isinstance(data, DataFrame):
    print("The given data is a DataFrame")
else:
    print("The given data is neither an RDD nor a DataFrame")
Output
The given data is neither an RDD nor a DataFrame
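Finally, note that isinstance() is preferable to comparing type() directly, because isinstance() also matches subclasses; the concrete object Spark hands back may be a subclass of the documented class. A small sketch with illustrative classes (Base and Derived are hypothetical names, not part of PySpark):

```python
class Base:
    pass

class Derived(Base):
    pass

obj = Derived()

# type() comparison matches only the exact class
print(type(obj) == Base)       # False
# isinstance() also accepts instances of subclasses
print(isinstance(obj, Base))   # True
```

This is why the checks in the examples above use isinstance() rather than comparing type(data) against the RDD or DataFrame class.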