How to Order PySpark DataFrame by Multiple Columns?
When working with large datasets, a common PySpark operation is ordering a DataFrame by multiple columns. Sorting on multiple columns lets you prioritize your criteria: rows are compared on the first column, and later columns break ties. PySpark provides several methods to accomplish this, each with different features and performance characteristics.
In this article, we will learn how to order PySpark DataFrame by multiple columns using three different approaches: orderBy(), sort(), and sortWithinPartitions().
Method 1: Using orderBy()
The orderBy() method sorts the DataFrame in ascending or descending order based on one or more columns and returns a new DataFrame with the specified ordering. For descending order on a column, use the desc() function from pyspark.sql.functions.
Syntax
dataframe.orderBy('column1', 'column2', 'column_n', ascending=True).show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort the DataFrame
ascending: (Optional) Boolean or list of booleans specifying the sorting order. Default is True for ascending
Example
In this example, we order the DataFrame by salary in descending order and then by name in ascending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Order by multiple columns (Salary descending, then Name ascending)
df.orderBy(desc("P_Salary"), "P_Name").show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|  John|   25|  Male|   New York|    2000|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
+------+-----+------+-----------+--------+
Method 2: Using sort()
The sort() method is functionally equivalent to orderBy() (in PySpark, orderBy() is an alias for sort()) and sorts the DataFrame based on one or more columns. You can use the asc() and desc() functions, or the corresponding Column methods, to control the order per column.
Syntax
dataframe.sort('column1', 'column2', 'column_n', ascending=True).show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort the DataFrame
ascending: (Optional) Boolean or list of booleans specifying the sorting order. Default is True
Example
Here we sort by gender in ascending order, then by salary in descending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Sort by multiple columns (Gender ascending, then Salary descending)
df.sort(df.P_Sex.asc(), df.P_Salary.desc()).show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|  John|   25|  Male|   New York|    2000|
+------+-----+------+-----------+--------+
Method 3: Using sortWithinPartitions()
The sortWithinPartitions() method sorts the data within each partition separately rather than across the entire DataFrame. Because it avoids the shuffle that a global sort requires, it is cheaper, but the result is not globally ordered.
Syntax
dataframe.sortWithinPartitions('column1', 'column2', 'column_n').show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort within partitions
Example
This example sorts the rows within each partition by age in descending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Sort within partitions by age in descending order
df.sortWithinPartitions(desc("P_Age")).show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|  Mike|   30|  Male|      Paris|    2500|
| Alice|   28|Female|     Berlin|    2200|
|  John|   25|  Male|   New York|    2000|
|  Emma|   23|Female|     London|    1800|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|Olivia|   29|Female|  Melbourne|    2300|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
+------+-----+------+-----------+--------+
Notice that the output is only sorted within each partition, not across the whole DataFrame.
Comparison
| Method | Scope | Performance | Best For |
|---|---|---|---|
| orderBy() | Entire DataFrame | High shuffle cost | Global ordering needed |
| sort() | Entire DataFrame | High shuffle cost | Same as orderBy() |
| sortWithinPartitions() | Within partitions | Low shuffle cost | Partition-level sorting |
Conclusion
PySpark provides three methods to order a DataFrame by multiple columns: orderBy(), sort(), and sortWithinPartitions(). Use orderBy() or sort() when you need the entire DataFrame ordered, and sortWithinPartitions() for better performance when per-partition ordering is sufficient.
