How to sort by value in PySpark?
PySpark is the Python API for Apache Spark, a distributed data processing engine. It enables large-scale data processing and offers several built-in functions for sorting data, including orderBy(), sort(), sortBy(), and asc_nulls_last().
Installation
First, install PySpark using pip:
pip install pyspark
Key Sorting Functions
| Function | Usage | Best For |
|---|---|---|
| orderBy() | DataFrame column sorting | Single/multiple columns with custom order |
| sort() | DataFrame sorting with functions | Descending order and null handling |
| sortBy() | RDD sorting with lambda | Custom sorting logic on RDDs |
Sorting DataFrame by Single Column
Use orderBy() to sort a DataFrame by a specific column:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("SortExample").getOrCreate()
# Create DataFrame
student_data = [("Akash", 25), ("Bhuvan", 23), ("Peter", 18), ("Mohan", 26)]
df = spark.createDataFrame(student_data, ["Name", "Age"])
# Sort by Age in ascending order
sorted_df = df.orderBy("Age")
sorted_df.show()
spark.stop()
+------+---+
|  Name|Age|
+------+---+
| Peter| 18|
|Bhuvan| 23|
| Akash| 25|
| Mohan| 26|
+------+---+
Sorting DataFrame by Multiple Columns
Sort by multiple columns using a list of column names:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("MultiSort").getOrCreate()
# Create DataFrame
product_data = [("Umbrella", 125), ("Bottle", 20), ("Colgate", 118)]
df = spark.createDataFrame(product_data, ["Product", "Price"])
# Sort by Price first, then by Product name
sorted_df = df.orderBy(["Price", "Product"], ascending=[True, True])
sorted_df.show()
spark.stop()
+--------+-----+
| Product|Price|
+--------+-----+
|  Bottle|   20|
| Colgate|  118|
|Umbrella|  125|
+--------+-----+
Sorting in Descending Order
Use the desc() function for descending order sorting:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create SparkSession
spark = SparkSession.builder.appName("DescSort").getOrCreate()
# Create DataFrame
employee_data = [("Abhinav", 25, "Male"), ("Meera", 32, "Female"),
("Riya", 18, "Female"), ("Deepak", 33, "Male"), ("Elon", 50, "Male")]
df = spark.createDataFrame(employee_data, ["Name", "Age", "Gender"])
# Sort by Age in descending order
sorted_df = df.sort(desc("Age"))
sorted_df.show()
spark.stop()
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|   Elon| 50|  Male|
| Deepak| 33|  Male|
|  Meera| 32|Female|
|Abhinav| 25|  Male|
|   Riya| 18|Female|
+-------+---+------+
Sorting RDD by Value
Use sortBy() with a lambda function to sort RDD data:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("RDDSort").getOrCreate()
# Create RDD from list of tuples
data = [("X", 25), ("Y", 32), ("Z", 18)]
rdd = spark.sparkContext.parallelize(data)
# Sort RDD by second element (value)
sorted_rdd = rdd.sortBy(lambda x: x[1])
# Collect and display results
for record in sorted_rdd.collect():
    print(record)
spark.stop()
('Z', 18)
('X', 25)
('Y', 32)
Handling Null Values While Sorting
Use asc_nulls_last() to place null values at the end:
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc_nulls_last
# Create SparkSession
spark = SparkSession.builder.appName("NullSort").getOrCreate()
# Create DataFrame with null values
product_data = [("Charger", None), ("Mouse", 320), ("PEN", 18),
("Bag", 1000), ("Notebook", None)]
df = spark.createDataFrame(product_data, ["Product", "Price"])
# Sort by Price with nulls last
sorted_df = df.sort(asc_nulls_last("Price"))
sorted_df.show()
spark.stop()
+--------+-----+
| Product|Price|
+--------+-----+
|     PEN|   18|
|   Mouse|  320|
|     Bag| 1000|
| Charger| null|
|Notebook| null|
+--------+-----+
Conclusion
PySpark provides multiple ways to sort data: orderBy() for DataFrames, sort() with functions like desc(), and sortBy() for RDDs. Use asc_nulls_last() to handle null values appropriately during sorting operations.
