How to select a range of rows from a dataframe in PySpark?


A DataFrame in PySpark is a distributed collection of data, organized into rows and columns, that can be processed in parallel across a cluster of machines. A range of rows is a horizontal slice of the dataset: the set of rows whose values fall between a lowest and a highest bound. PySpark provides methods such as filter(), where(), and collect() to select a range of rows from a DataFrame.

Syntax

The following syntax is used in the examples −

createDataFrame()

createDataFrame() is a SparkSession method that builds a DataFrame from data and a schema (or a list of column names).

filter()

filter() is a DataFrame method that returns only the rows satisfying a given condition.

where()

where() is an alias of filter(); it returns the rows of the DataFrame that match a given condition.

collect()

collect() returns all rows of a DataFrame to the driver as a list of Row objects, which can then be iterated with a loop.

show()

show() prints the contents of a DataFrame in tabular form.

Installation requirement −

pip install pyspark

Using createDataFrame()

The createDataFrame() method belongs to the SparkSession class and here accepts two parameters: customer_Data (the value of each column in each row) and data_Columns (the names of all the columns).

Example

In the following example, we start by building a SparkSession: SparkSession.builder sets the application name with appName(), and getOrCreate() creates (or reuses) the session, which is stored in the variable spark. We then store the customer records in the variable customer_Data and the column names in data_Columns. Next, createDataFrame() accepts these two parameters and returns a DataFrame that structures the data in tabular (rows and columns) form, stored in customer_DF. Finally, show() is called on customer_DF to display all the data.

from pyspark.sql import SparkSession

# Creation of SparkSession
spark = SparkSession.builder \
   .appName('CUSTOMER') \
   .getOrCreate()
# customer DataFrame
customer_Data = [("PREM KUMAR", 1281, "AC", 40000,4000),
   ("RATAN SINGH", 1289, "HOME THEATER", 35000, 5000),
   ("DAVID K", 1221, "NIKON CAMERA", 88000, 10000),
   ("JONATHAN REDDY", 1743, "GEYSER", 15000, 500),
   ("JASPREET BRAR", 1234, "HP LAPTOP", 78000, 3564),
   ("NEIL KAMANT", 1222, "WASHING MACHINE", 25000, 2000)
]
data_Columns = ["CUSTOMER NAME","PRODUCT ID","PRODUCT NAME",
   "Actual Price","EMI PER MONTH"]
customer_DF = spark.createDataFrame(customer_Data, data_Columns)
customer_DF.show()

Output

+--------------+----------+---------------+------------+-------------+
| CUSTOMER NAME|PRODUCT ID|   PRODUCT NAME|Actual Price|EMI PER MONTH|
+--------------+----------+---------------+------------+-------------+
|    PREM KUMAR|      1281|             AC|       40000|         4000|
|   RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|       DAVID K|      1221|   NIKON CAMERA|       88000|        10000|
|JONATHAN REDDY|      1743|         GEYSER|       15000|          500|
| JASPREET BRAR|      1234|      HP LAPTOP|       78000|         3564|
|   NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+--------------+----------+---------------+------------+-------------+

Using the filter() Method

This method builds on the previous program and allows us to set a condition that selects a range of rows from the DataFrame.

Example

In the following example, we reuse the DataFrame customer_DF from the previous example and combine two conditions on the Actual Price column with the & operator, keeping only the rows whose price lies between 25000 and 40000 (inclusive). Finally, show() displays the result.

customer_DF.filter((customer_DF['Actual Price'] >= 25000) & (customer_DF['Actual Price'] <= 40000)).show()

Output

+-------------+----------+---------------+------------+-------------+
|CUSTOMER NAME|PRODUCT ID|   PRODUCT NAME|Actual Price|EMI PER MONTH|
+-------------+----------+---------------+------------+-------------+
|   PREM KUMAR|      1281|             AC|       40000|         4000|
|  RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|  NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+-------------+----------+---------------+------------+-------------+

Using the where() Method

This method also builds on the first example, but uses where() to select the range of rows from the DataFrame in PySpark.

Example

In the following example, the where() method combines two conditions on the EMI PER MONTH column with the & operator to select the range of rows, and show() displays the result.

customer_DF.where((customer_DF['EMI PER MONTH'] >= 10000) & (customer_DF['EMI PER MONTH'] <= 38000)).show()

Output

+-------------+----------+------------+------------+-------------+
|CUSTOMER NAME|PRODUCT ID|PRODUCT NAME|Actual Price|EMI PER MONTH|
+-------------+----------+------------+------------+-------------+
|      DAVID K|      1221|NIKON CAMERA|       88000|        10000|
+-------------+----------+------------+------------+-------------+

Using the collect() Method

This method follows the first example and iterates over the collected rows to pick out the range of rows from the DataFrame in PySpark.

Example

In the following example, a for loop iterates over the rows returned by collect(), which brings every row of the DataFrame back to the driver. An if statement then keeps only the rows whose Actual Price lies between 30000 and 50000, and print() outputs each matching row.

for row in customer_DF.collect():
   if 30000 <= row['Actual Price'] <= 50000:
      print(row)

Output

Row(CUSTOMER NAME='PREM KUMAR', PRODUCT ID=1281, PRODUCT NAME='AC', Actual Price=40000, EMI PER MONTH=4000)
Row(CUSTOMER NAME='RATAN SINGH', PRODUCT ID=1289, PRODUCT NAME='HOME THEATER', Actual Price=35000, EMI PER MONTH=5000)

Conclusion

We discussed various methods to select a range of rows from a DataFrame in PySpark. All the methods build on the first example, because that example defines the complete input structure of the rows and columns data. PySpark is used by top MNCs such as Amazon, Walmart, Trivago, and many more.

Updated on: 17-Jul-2023
