How to sort by value in PySpark?
PySpark is the Python API for Apache Spark, a distributed data processing engine. It enables large-scale data processing and offers several built-in functions for sorting data, including orderBy(), sort(), sortBy(), and asc_nulls_last().
Installation
First, install PySpark using pip:
pip install pyspark
Key Sorting Functions
| Function | Usage | Best For |
|---|---|---|
| orderBy() | DataFrame column sorting | Single/multiple columns with custom order |
| sort() | DataFrame sorting with functions | Descending order and null handling |
| sortBy() | RDD sorting with lambda | Custom sorting logic on RDDs |
Sorting DataFrame by Single Column
Use orderBy() to sort a DataFrame by a specific column:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("SortExample").getOrCreate()
# Create DataFrame
student_data = [("Akash", 25), ("Bhuvan", 23), ("Peter", 18), ("Mohan", 26)]
df = spark.createDataFrame(student_data, ["Name", "Age"])
# Sort by Age in ascending order
sorted_df = df.orderBy("Age")
sorted_df.show()
spark.stop()
+------+---+
|  Name|Age|
+------+---+
| Peter| 18|
|Bhuvan| 23|
| Akash| 25|
| Mohan| 26|
+------+---+
Sorting DataFrame by Multiple Columns
Sort by multiple columns using a list of column names:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("MultiSort").getOrCreate()
# Create DataFrame
product_data = [("Umbrella", 125), ("Bottle", 20), ("Colgate", 118)]
df = spark.createDataFrame(product_data, ["Product", "Price"])
# Sort by Price first, then by Product name
sorted_df = df.orderBy(["Price", "Product"], ascending=[True, True])
sorted_df.show()
spark.stop()
+--------+-----+
| Product|Price|
+--------+-----+
|  Bottle|   20|
| Colgate|  118|
|Umbrella|  125|
+--------+-----+
Sorting in Descending Order
Use the desc() function for descending order sorting:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create SparkSession
spark = SparkSession.builder.appName("DescSort").getOrCreate()
# Create DataFrame
employee_data = [("Abhinav", 25, "Male"), ("Meera", 32, "Female"),
("Riya", 18, "Female"), ("Deepak", 33, "Male"), ("Elon", 50, "Male")]
df = spark.createDataFrame(employee_data, ["Name", "Age", "Gender"])
# Sort by Age in descending order
sorted_df = df.sort(desc("Age"))
sorted_df.show()
spark.stop()
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|   Elon| 50|  Male|
| Deepak| 33|  Male|
|  Meera| 32|Female|
|Abhinav| 25|  Male|
|   Riya| 18|Female|
+-------+---+------+
Sorting RDD by Value
Use sortBy() with a lambda function to sort RDD data:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("RDDSort").getOrCreate()
# Create RDD from list of tuples
data = [("X", 25), ("Y", 32), ("Z", 18)]
rdd = spark.sparkContext.parallelize(data)
# Sort RDD by second element (value)
sorted_rdd = rdd.sortBy(lambda x: x[1])
# Collect and display results
for record in sorted_rdd.collect():
    print(record)
spark.stop()
('Z', 18)
('X', 25)
('Y', 32)
Handling Null Values While Sorting
Use asc_nulls_last() to place null values at the end:
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc_nulls_last
# Create SparkSession
spark = SparkSession.builder.appName("NullSort").getOrCreate()
# Create DataFrame with null values
product_data = [("Charger", None), ("Mouse", 320), ("PEN", 18),
("Bag", 1000), ("Notebook", None)]
df = spark.createDataFrame(product_data, ["Product", "Price"])
# Sort by Price with nulls last
sorted_df = df.sort(asc_nulls_last("Price"))
sorted_df.show()
spark.stop()
+--------+-----+
| Product|Price|
+--------+-----+
|     PEN|   18|
|   Mouse|  320|
|     Bag| 1000|
| Charger| null|
|Notebook| null|
+--------+-----+
Conclusion
PySpark provides multiple ways to sort data: orderBy() for DataFrames, sort() with functions like desc(), and sortBy() for RDDs. Use asc_nulls_last() to handle null values appropriately during sorting operations.
