How to take a random row from a PySpark DataFrame?

In PySpark, working with large datasets often requires extracting a random row from a DataFrame for purposes such as sampling or testing. However, selecting a random row can be tricky because Spark distributes the data across partitions.

In this article, we explore efficient techniques to tackle this task, discussing different approaches and providing code examples to help you effortlessly extract a random row from a PySpark DataFrame.

Method 1: Using orderBy() and limit()

One approach to selecting a random row from a PySpark DataFrame involves using the orderBy() and limit() functions. We add a random column to the DataFrame using the rand() function, then order the DataFrame by this column, and finally select the top row using limit(1).

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

# Create a SparkSession
spark = SparkSession.builder.appName("RandomRow").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("David", 40), ("Eve", 45)]
df = spark.createDataFrame(data, ["Name", "Age"])

print("Original DataFrame:")
df.show()

# Add random column and order by it
df_with_random = df.withColumn("random", rand())
random_row = df_with_random.orderBy("random").limit(1)

print("Random row using orderBy() and limit():")
random_row.show()
Output:

Original DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
|    Eve| 45|
+-------+---+

Random row using orderBy() and limit():
+-------+---+------------------+
|   Name|Age|            random|
+-------+---+------------------+
|Charlie| 35|0.2749356601123299|
+-------+---+------------------+

Method 2: Using sample() Function

Another approach is to use the sample() function to randomly sample rows from the PySpark DataFrame. The fraction argument specifies the probability of including each row, so with withReplacement=False and a small fraction the result may contain zero, one, or several rows rather than exactly one:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("RandomRow").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("David", 40), ("Eve", 45)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Sample a random row using fraction
random_row = df.sample(withReplacement=False, fraction=0.2, seed=42)

print("Random row using sample():")
random_row.show()
Output:

Random row using sample():
+----+---+
|Name|Age|
+----+---+
| Bob| 30|
+----+---+

Method 3: Converting to RDD and Using takeSample()

PySpark DataFrames can be converted to RDDs (Resilient Distributed Datasets) for more flexibility. We can access the DataFrame's underlying RDD via df.rdd and call takeSample(), an action that returns exactly num random rows to the driver as a list of Row objects:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("RandomRow").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("David", 40), ("Eve", 45)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Convert DataFrame to RDD and sample
rdd = df.rdd
random_row = rdd.takeSample(withReplacement=False, num=1, seed=42)

print("Random row using RDD takeSample():")
print(random_row)
Output:

Random row using RDD takeSample():
[Row(Name='Charlie', Age=35)]

Comparison

Method               Performance                   Best For                          Returns
orderBy() + limit()  Full sort; slower at scale    Exactly one row as a DataFrame    DataFrame
sample()             Fast, probabilistic           Quick sampling with a fraction    DataFrame
RDD takeSample()     Action; collects to driver    Exact sample size as Row objects  List of Row objects

Complete Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

# Create SparkSession
spark = SparkSession.builder.appName("RandomRowMethods").getOrCreate()

# Create sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("David", 40), ("Eve", 45)]
df = spark.createDataFrame(data, ["Name", "Age"])

print("All three methods for getting random rows:")
print("\nMethod 1 - orderBy + limit:")
df.withColumn("random", rand()).orderBy("random").limit(1).select("Name", "Age").show()

print("Method 2 - sample:")
df.sample(withReplacement=False, fraction=0.2, seed=42).show()

print("Method 3 - RDD takeSample:")
random_rows = df.rdd.takeSample(withReplacement=False, num=1, seed=42)
for row in random_rows:
    print(f"Name: {row.Name}, Age: {row.Age}")
Output:

All three methods for getting random rows:

Method 1 - orderBy + limit:
+-------+---+
|   Name|Age|
+-------+---+
|Charlie| 35|
+-------+---+

Method 2 - sample:
+----+---+
|Name|Age|
+----+---+
| Bob| 30|
+----+---+

Method 3 - RDD takeSample:
Name: Charlie, Age: 35

Conclusion

Use orderBy() with rand() when you need exactly one row as a DataFrame and can afford a full sort. The sample() function is the fastest option for approximate, probabilistic sampling, while takeSample() on an RDD returns an exact number of Row objects directly to the driver. Choose based on your specific use case and performance requirements.

Updated on: 2026-03-27T09:22:07+05:30
