How to select a range of rows from a dataframe in PySpark?
A DataFrame in PySpark is a distributed collection of data organized into rows and columns, which allows it to be processed in parallel across machines. A range of rows is a horizontal slice of the dataset: the set of rows whose values satisfy a condition, typically bounded by a lowest and a highest value. PySpark provides methods such as filter(), where(), and collect() to select a range of rows from a DataFrame.
Syntax
The following syntax is used in the examples −
createDataFrame()
This SparkSession method builds a DataFrame from the given data; it also accepts a schema argument (or a list of column names) that defines the DataFrame's columns.
filter()
The filter() method of a DataFrame returns only the rows that satisfy a given condition.
where()
The where() method is an alias of filter(); it keeps the rows for which a given condition on one or more columns is true.
collect()
PySpark's collect() returns all the rows of a DataFrame to the driver as a list of Row objects; it is commonly used with loops.
show()
The show() method displays the contents of a DataFrame in tabular form; it is typically called at the end of the program to view the result.
Installation requirement −
pip install pyspark
Using createDataFrame()
The createDataFrame() method of the SparkSession accepts two parameters: the data (here customer_Data, which sets the value of each column in each row) and the column names (here data_Columns, which defines all the columns).
Example
In the following example, we start the program by building a SparkSession: SparkSession.builder sets the application name with appName(), and getOrCreate() creates (or reuses) the session, which is stored in the variable spark. We then place the customer records in the variable customer_Data and the column names in the variable data_Columns. Next, we create the DataFrame with createDataFrame(), which accepts the two parameters customer_Data and data_Columns, arranges the data in tabular form of rows and columns, and stores the result in the variable customer_DF. Finally, we call show() on customer_DF to display the data in every column.
# Import the SparkSession class (this import was missing from the snippet)
from pyspark.sql import SparkSession

# Creation of SparkSession
spark = SparkSession.builder \
   .appName('CUSTOMER') \
   .getOrCreate()

# customer DataFrame
customer_Data = [("PREM KUMAR", 1281, "AC", 40000, 4000),
   ("RATAN SINGH", 1289, "HOME THEATER", 35000, 5000),
   ("DAVID K", 1221, "NIKON CAMERA", 88000, 10000),
   ("JONATHAN REDDY", 1743, "GEYSER", 15000, 500),
   ("JASPREET BRAR", 1234, "HP LAPTOP", 78000, 3564),
   ("NEIL KAMANT", 1222, "WASHING MACHINE", 25000, 2000)
]
data_Columns = ["CUSTOMER NAME", "PRODUCT ID", "PRODUCT NAME",
   "Actual Price", "EMI PER MONTH"]
customer_DF = spark.createDataFrame(customer_Data, data_Columns)
customer_DF.show()
Output
+--------------+----------+---------------+------------+-------------+
| CUSTOMER NAME|PRODUCT ID|   PRODUCT NAME|Actual Price|EMI PER MONTH|
+--------------+----------+---------------+------------+-------------+
|    PREM KUMAR|      1281|             AC|       40000|         4000|
|   RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|       DAVID K|      1221|   NIKON CAMERA|       88000|        10000|
|JONATHAN REDDY|      1743|         GEYSER|       15000|          500|
| JASPREET BRAR|      1234|      HP LAPTOP|       78000|         3564|
|   NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+--------------+----------+---------------+------------+-------------+
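The value-based examples below select rows by a column's range. If the goal is simply the first few rows, the standard limit() and show(n) calls cover that case too; this is a small sketch reusing the customer_DF built above:

# Keep only the first three rows as a new DataFrame
first_three_DF = customer_DF.limit(3)
first_three_DF.show()

# Or display only the first two rows directly
customer_DF.show(2)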
Using Filter Method
This method builds on the program above; filter() lets us set the condition that selects a range of rows from the DataFrame.
Example
In the following example, we continue from the previous example code and apply two conditions to the DataFrame customer_DF on the Actual Price column, keeping only the rows whose price lies in the range 25000 to 40000. Finally, we call the show() method to display the result.
customer_DF.filter((customer_DF['Actual Price'] >= 25000) & (customer_DF['Actual Price'] <= 40000)).show()
Output
+-------------+----------+---------------+------------+-------------+
|CUSTOMER NAME|PRODUCT ID|   PRODUCT NAME|Actual Price|EMI PER MONTH|
+-------------+----------+---------------+------------+-------------+
|   PREM KUMAR|      1281|             AC|       40000|         4000|
|  RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|  NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+-------------+----------+---------------+------------+-------------+
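As an alternative to combining two comparisons with the & operator, a PySpark column also offers a between() method whose bounds are inclusive, just like the >= and <= pair. A sketch equivalent to the filter above:

# between() keeps rows whose Actual Price lies in [25000, 40000]
customer_DF.filter(customer_DF['Actual Price'].between(25000, 40000)).show()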
Using Where() Method
This method builds on the first example in this article; here the where() method is used to select the range of rows from the DataFrame in PySpark.
Example
In the following example, we use the where() method with two conditions combined by the and (&) operator to get the range of rows, and then chain the show() method to display the result.
customer_DF.where((customer_DF['EMI PER MONTH'] >= 10000) & (customer_DF['EMI PER MONTH'] <= 38000)).show()
Output
+-------------+----------+------------+------------+-------------+
|CUSTOMER NAME|PRODUCT ID|PRODUCT NAME|Actual Price|EMI PER MONTH|
+-------------+----------+------------+------------+-------------+
|      DAVID K|      1221|NIKON CAMERA|       88000|        10000|
+-------------+----------+------------+------------+-------------+
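The where() method also accepts its condition as a Spark SQL expression string. Since these column names contain spaces, they need backtick quoting inside the expression; the following sketch should produce the same single-row result:

# Same range condition written as a SQL expression string
customer_DF.where("`EMI PER MONTH` >= 10000 AND `EMI PER MONTH` <= 38000").show()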
Using Collect() Method
This method builds on the first example; collect() lets us iterate over the rows and pick out the ones whose column values fall within the desired range.
Example
In the following example, a for loop iterates over the rows returned by collect(), which brings back every row of the DataFrame. An if statement then checks whether the Actual Price lies between 30000 and 50000, and print() outputs each matching row.
for row in customer_DF.collect():
   if 30000 <= row['Actual Price'] <= 50000:
      print(row)
Output
Row(CUSTOMER NAME='PREM KUMAR', PRODUCT ID=1281, PRODUCT NAME='AC', Actual Price=40000, EMI PER MONTH=4000)
Row(CUSTOMER NAME='RATAN SINGH', PRODUCT ID=1289, PRODUCT NAME='HOME THEATER', Actual Price=35000, EMI PER MONTH=5000)
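Keep in mind that collect() brings every row back to the driver, which can be costly on large datasets. A common refinement is to push the range condition into the DataFrame first so that only matching rows are collected; take(n) similarly returns at most n rows as a list. A sketch of the same range applied before collecting:

# Filter on the executors, then collect only the matching rows
matching_rows = customer_DF.filter(customer_DF['Actual Price'].between(30000, 50000)).collect()
for row in matching_rows:
   print(row)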
Conclusion
We discussed the various methods for selecting a range of rows from a DataFrame in PySpark. All of the later methods build on the first example, because that example sets up the complete input structure of rows and columns. PySpark is used by top MNCs like Amazon, Walmart, Trivago, and many more.