How to Order PySpark DataFrame by Multiple Columns?


When working with large datasets, one common PySpark operation is to order a DataFrame by multiple columns. You can prioritize the sorting based on various criteria when you sort data based on multiple columns. Using PySpark, we'll look at a few different approaches in this article.

In this article, we will learn How to Order PysPark DataFrame by Multiple Columns. PySpark provides several methods to order DataFrames where each method offers different features and performance characteristics, so let's dive into each one and understand how to use them.

Different Method to Order PysPark DataFrame by Multiple Columns

There are various approaches by which we can order our dataframe by multiple columns using pyspark in Python. Let's see some of the commonly used methods to perform this task:

Approach 1: Using orderBy() method

The first approach to order a data frame by multiple columns is using the orderBy() method. The DataFrame is sorted in either an ascending or descending order using this approach. It returns a brand−new DataFrame with the specified sorting order after taking as arguments the names of one or more columns. A column's descending order can be specified using pyspark.sql.functions module's desc() function. The sorting order is set to ascending by default. The strategy gives a method for arranging the whole DataFrame in light of the predetermined sections.

Syntax

Below is the syntax to order a Pyspark dataframe by multiple columns using the orderBy() method in Python:

dataframe.orderBy('column1', 'column2', 'column n').show()

Parameters

  • column1, column2, column n: One or more column names or expressions to sort the DataFrame.

  • ascending: (Optional) A boolean or list of booleans specifying the sorting order for each column. Default is True for ascending order. If a list is provided, it should have the same length as the number of columns being sorted.

Example

In this example, we order our DataFrame by multiple columns using the orderBy() method in PySpark. The DataFrame consists of information about individuals, including their name, age, gender, city, and salary. The DataFrame is ordered in descending order of the "P_Salary" column and then in ascending order of the "P_Name" column. The output shows the sorted DataFrame, with the individuals' information displayed accordingly.

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame
data = [
    ("John", 25, "Male", "New York", 2000),
    ("Emma", 23, "Female", "London", 1800),
    ("Mike", 30, "Male", "Paris", 2500),
    ("Alice", 28, "Female", "Berlin", 2200),
    ("David", 35, "Male", "Tokyo", 3000),
    ("Sophia", 26, "Female", "Sydney", 1900),
    ("Daniel", 32, "Male", "Toronto", 2700),
    ("Olivia", 29, "Female", "Melbourne", 2300),
    ("James", 31, "Male", "Chicago", 2600),
    ("Emily", 27, "Female", "Los Angeles", 2100)
]

df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])

# Order by multiple columns (Age in descending order, Salary in ascending order)
df.orderBy(desc("P_Salary"), "P_Name").show()

Output

+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|  John|   25|  Male|   New York|    2000|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
+------+-----+------+-----------+--------+

Approach 2: Using sort() method

The second approach to order a data frame by multiple columns is using the sort() method.

The DataFrame is sorted in ascending order using this approach based on one or more columns. It takes at least one segment name as argument and returns another DataFrame with the predetermined sorting order. The pyspark.sql.functions module's asc() and desc() functions can be used to specify a column's ascending or descending order, respectively. The DataFrame can be sorted in a variety of ways using the method, including by multiple columns in a different order.

Syntax

Below is the syntax to order a Pyspark dataframe by multiple columns using the sort() method in Python:

dataframe.sort('column1', 'column2', 'column n', ascending=True).show()

Parameters

  • column1, column2, column n: One or more column names or expressions to sort the DataFrame.

  • ascending: (Optional) A boolean or list of booleans specifying the sorting order for each column. Default is True for ascending order. If a list is provided, it should have the same length as the number of columns being sorted.

Example

In the above example, we sorted a data frame by multiple columns using the sort() method in PySpark. Similar to the previous example, the DataFrame contains information about individuals. However, this time the DataFrame is sorted in ascending order of the "P_Sex" column and then in descending order of the "P_Salary" column. The output presents the sorted DataFrame, with the individuals' information arranged based on the sorting criteria.

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame
data = [
    ("John", 25, "Male", "New York", 2000),
    ("Emma", 23, "Female", "London", 1800),
    ("Mike", 30, "Male", "Paris", 2500),
    ("Alice", 28, "Female", "Berlin", 2200),
    ("David", 35, "Male", "Tokyo", 3000),
    ("Sophia", 26, "Female", "Sydney", 1900),
    ("Daniel", 32, "Male", "Toronto", 2700),
    ("Olivia", 29, "Female", "Melbourne", 2300),
    ("James", 31, "Male", "Chicago", 2600),
    ("Emily", 27, "Female", "Los Angeles", 2100)
]

df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])

# Sort by multiple columns (Age in ascending order, Salary in descending order)
df.sort(df.P_Sex.asc(), df.P_Salary.desc()).show()

Output

+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|  John|   25|  Male|   New York|    2000|
+------+-----+------+-----------+--------+

Method 3: Using sortWithinPartitions() method

The third method to order a data frame by multiple columns is using the sortWithinPartitions() method. The DataFrame within each partition is sorted using this method according to one or more columns. It returns a new DataFrame with the specified sorting order within each partition and takes one or more column names as arguments. When sorting needs to be done on each partition separately rather than the entire DataFrame, this approach is helpful. Reducing data shuffle across partitions, makes the sorting process more efficient. The sorting order of columns within partitions can be specified using the asc() and desc() functions.

Syntax

Below is the syntax to order a Pyspark dataframe by multiple columns using the sortWithinPartitions() method in Python:

dataframe.sortWithinPartitions('column1', 'column2', 'column n').show()
  • column1, column2, column n: One or more column names or expressions to sort the DataFrame within partitions.

Example

In the above example, we used the sortWithinPartitions() method in PySpark to order the dataframe data. Again, the DataFrame includes information about individuals. However, in this case, the DataFrame is sorted within partitions based on the "P_Age" column in descending order. The output displays the sorted DataFrame, where the individuals' information is grouped within partitions and ordered by age within each partition. This method is useful for optimizing operations that involve sorting within partitions rather than the entire DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a sample DataFrame
data = [
    ("John", 25, "Male", "New York", 2000),
    ("Emma", 23, "Female", "London", 1800),
    ("Mike", 30, "Male", "Paris", 2500),
    ("Alice", 28, "Female", "Berlin", 2200),
    ("David", 35, "Male", "Tokyo", 3000),
    ("Sophia", 26, "Female", "Sydney", 1900),
    ("Daniel", 32, "Male", "Toronto", 2700),
    ("Olivia", 29, "Female", "Melbourne", 2300),
    ("James", 31, "Male", "Chicago", 2600),
    ("Emily", 27, "Female", "Los Angeles", 2100)
]

df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])

# Sort within partitions by a column (Age in descending order)
df.sortWithinPartitions(desc("P_Age")).show()

Output

+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|  Mike|   30|  Male|      Paris|    2500|
| Alice|   28|Female|     Berlin|    2200|
|  John|   25|  Male|   New York|    2000|
|  Emma|   23|Female|     London|    1800|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|Olivia|   29|Female|  Melbourne|    2300|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
+------+-----+------+-----------+--------+

Conclusion

PySpark provides multiple methods to order DataFrames by multiple columns which include the orderBy, sort, and sortWithinPartitions functions. Depending on your use case and the size of your dataset, you can choose the most suitable method.

Updated on: 31-Aug-2023

363 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements