How to Order PySpark DataFrame by Multiple Columns?
When working with large datasets, a common PySpark operation is ordering a DataFrame by multiple columns. Sorting on multiple columns lets you prioritize your criteria: rows are compared on the first column, and later columns break ties. PySpark provides several methods to accomplish this, each with different features and performance characteristics.
In this article, we will learn how to order PySpark DataFrame by multiple columns using three different approaches: orderBy(), sort(), and sortWithinPartitions().
Method 1: Using orderBy()
The orderBy() method sorts the DataFrame in ascending or descending order based on one or more columns and returns a new DataFrame with the specified ordering. For descending order on a column, use the desc() function from pyspark.sql.functions.
Syntax
dataframe.orderBy('column1', 'column2', 'column_n', ascending=True).show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort the DataFrame
ascending: (Optional) Boolean or list of booleans specifying the sorting order. Default is True for ascending
Example
In this example, we order the DataFrame by salary in descending order and then by name in ascending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Order by multiple columns (Salary descending, then Name ascending)
df.orderBy(desc("P_Salary"), "P_Name").show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|  John|   25|  Male|   New York|    2000|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
+------+-----+------+-----------+--------+
Method 2: Using sort()
The sort() method is functionally equivalent to orderBy() (in PySpark, orderBy() is an alias for sort()) and sorts the DataFrame based on one or more columns. You can use the asc() and desc() functions, or the corresponding Column methods, to control the order per column.
Syntax
dataframe.sort('column1', 'column2', 'column_n', ascending=True).show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort the DataFrame
ascending: (Optional) Boolean or list of booleans specifying the sorting order. Default is True
Example
Here we sort by gender in ascending order, then by salary in descending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Sort by multiple columns (Gender ascending, then Salary descending)
df.sort(df.P_Sex.asc(), df.P_Salary.desc()).show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
|Olivia|   29|Female|  Melbourne|    2300|
| Alice|   28|Female|     Berlin|    2200|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
|  Emma|   23|Female|     London|    1800|
| David|   35|  Male|      Tokyo|    3000|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|  Mike|   30|  Male|      Paris|    2500|
|  John|   25|  Male|   New York|    2000|
+------+-----+------+-----------+--------+
Method 3: Using sortWithinPartitions()
The sortWithinPartitions() method sorts the data within each partition separately rather than across the entire DataFrame. Because it avoids the shuffle that a global sort requires, it is cheaper, but the result is not globally ordered.
Syntax
dataframe.sortWithinPartitions('column1', 'column2', 'column_n').show()
Parameters:
column1, column2, column_n: One or more column names or expressions to sort within partitions
Example
This example sorts the rows within each partition by age in descending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [
("John", 25, "Male", "New York", 2000),
("Emma", 23, "Female", "London", 1800),
("Mike", 30, "Male", "Paris", 2500),
("Alice", 28, "Female", "Berlin", 2200),
("David", 35, "Male", "Tokyo", 3000),
("Sophia", 26, "Female", "Sydney", 1900),
("Daniel", 32, "Male", "Toronto", 2700),
("Olivia", 29, "Female", "Melbourne", 2300),
("James", 31, "Male", "Chicago", 2600),
("Emily", 27, "Female", "Los Angeles", 2100)
]
df = spark.createDataFrame(data, ["P_Name", "P_Age", "P_Sex", "P_State", "P_Salary"])
# Sort within partitions by age in descending order
df.sortWithinPartitions(desc("P_Age")).show()
Output
+------+-----+------+-----------+--------+
|P_Name|P_Age| P_Sex|    P_State|P_Salary|
+------+-----+------+-----------+--------+
| David|   35|  Male|      Tokyo|    3000|
|  Mike|   30|  Male|      Paris|    2500|
| Alice|   28|Female|     Berlin|    2200|
|  John|   25|  Male|   New York|    2000|
|  Emma|   23|Female|     London|    1800|
|Daniel|   32|  Male|    Toronto|    2700|
| James|   31|  Male|    Chicago|    2600|
|Olivia|   29|Female|  Melbourne|    2300|
| Emily|   27|Female|Los Angeles|    2100|
|Sophia|   26|Female|     Sydney|    1900|
+------+-----+------+-----------+--------+
Notice that the output is only sorted within each partition, not across the whole DataFrame.
Comparison
| Method | Scope | Performance | Best For |
|---|---|---|---|
| orderBy() | Entire DataFrame | High shuffle cost | Global ordering needed |
| sort() | Entire DataFrame | High shuffle cost | Same as orderBy() |
| sortWithinPartitions() | Within partitions | Low shuffle cost | Partition-level sorting |
Conclusion
PySpark provides three methods to order a DataFrame by multiple columns: orderBy(), sort(), and sortWithinPartitions(). Use orderBy() or sort() when you need the entire DataFrame ordered, and sortWithinPartitions() for better performance when per-partition ordering is sufficient.
