How to slice a PySpark DataFrame into two row-wise DataFrames?

PySpark DataFrames can be split into two row-wise DataFrames using several built-in methods. This process, often called slicing, is useful for data partitioning and parallel processing in distributed computing environments.

Syntax Overview

The key methods for slicing PySpark dataframes include:

  • limit(n) – returns the first n rows as a new DataFrame
  • subtract(df) – returns rows not present in another DataFrame (set difference)
  • collect() – retrieves all rows to the driver as a list of Row objects
  • head(n) – returns the first n rows as a list of Row objects
  • exceptAll(df) – returns rows not in another DataFrame, preserving duplicates
  • filter(condition) – keeps only the rows matching a condition

Installation

pip install pyspark

Using limit() and subtract()

The most straightforward approach uses limit() to get the first portion and subtract() to get the remaining rows.

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('Employee Data').getOrCreate()

# Sample data
rows = [['1', 'RAHUL', 'INDIA', '1243'],
        ['2', 'PETER', 'SRI LANKA', '5461'],
        ['3', 'JOHN', 'SOUTH KOREA', '2224'],
        ['4', 'MARK', 'NEWYORK', '9985'],
        ['5', 'SUNNY', 'BANGLADESH', '8912']]

cols = ['S.N', 'EMPLOYEE NAME', 'COUNTRY', 'EMP_ID']

# Create DataFrame
df = spark.createDataFrame(rows, cols)

# Split into two dataframes
df_first = df.limit(2)
df_second = df.subtract(df_first)

print("First DataFrame:")
df_first.show()

print("Second DataFrame:")
df_second.show()
First DataFrame:
+---+-------------+---------+------+
|S.N|EMPLOYEE NAME|  COUNTRY|EMP_ID|
+---+-------------+---------+------+
|  1|        RAHUL|    INDIA|  1243|
|  2|        PETER|SRI LANKA|  5461|
+---+-------------+---------+------+

Second DataFrame:
+---+-------------+-----------+------+
|S.N|EMPLOYEE NAME|    COUNTRY|EMP_ID|
+---+-------------+-----------+------+
|  3|         JOHN|SOUTH KOREA|  2224|
|  5|        SUNNY| BANGLADESH|  8912|
|  4|         MARK|    NEWYORK|  9985|
+---+-------------+-----------+------+

Using collect() and List Slicing

This method collects all rows into a list, slices it, then creates new dataframes from the sliced portions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Employee Data").getOrCreate()

# Create sample dataframe
data = [("Vivek", 31), ("Aman", 20), ("Sohan", 13), ("David", 24)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Collect rows and slice
rows = df.collect()
rows1 = rows[:2]  # First 2 rows
rows2 = rows[2:]  # Remaining rows

# Create new dataframes
df1 = spark.createDataFrame(rows1, df.schema)
df2 = spark.createDataFrame(rows2, df.schema)

print("First DataFrame:")
df1.show()

print("Second DataFrame:")
df2.show()
First DataFrame:
+-----+---+
| Name|Age|
+-----+---+
|Vivek| 31|
| Aman| 20|
+-----+---+

Second DataFrame:
+-----+---+
| Name|Age|
+-----+---+
|Sohan| 13|
|David| 24|
+-----+---+

Using filter() with Row Numbers

This approach adds row numbers and uses filtering to split the dataframe based on position.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

# Create DataFrame
data = [("Rabina", 35), ("Stephen", 31), ("Raman", 33), ("Salman", 44), ("Meera", 37)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define split point
n_rows_first_df = 2

# Add row numbers
df_with_row_number = df.withColumn("row_number", row_number().over(Window.orderBy("Name")))

# Filter into two dataframes
first_df = df_with_row_number.filter(df_with_row_number.row_number <= n_rows_first_df).drop("row_number")
second_df = df_with_row_number.filter(df_with_row_number.row_number > n_rows_first_df).drop("row_number")

print("First DataFrame:")
first_df.show()

print("Second DataFrame:")
second_df.show()
First DataFrame:
+------+---+
|  Name|Age|
+------+---+
| Meera| 37|
|Rabina| 35|
+------+---+

Second DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|  Raman| 33|
| Salman| 44|
|Stephen| 31|
+-------+---+

Using head() and exceptAll()

This method uses head() to extract the first portion and exceptAll() to get the remaining rows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create DataFrame
data = [("Karisma", 25), ("Bobby", 30), ("Champak", 35), ("Mark", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define split point
n_rows_first_df = 2

# Create first dataframe using head()
first_rows = df.head(n_rows_first_df)
first_df = spark.createDataFrame(first_rows, df.schema)

# Create second dataframe using exceptAll()
second_df = df.exceptAll(first_df)

print("First DataFrame:")
first_df.show()

print("Second DataFrame:")
second_df.show()
First DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|Karisma| 25|
|  Bobby| 30|
+-------+---+

Second DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|Champak| 35|
|   Mark| 40|
+-------+---+

Method Comparison

Method                   Memory Usage   Performance   Best For
limit() + subtract()     Low            High          Large datasets
collect() + slicing      High           Low           Small datasets
filter() + row_number()  Medium         Medium        Ordered splitting
head() + exceptAll()     Medium         Medium        Simple splitting

Conclusion

PySpark offers multiple methods to slice DataFrames row-wise, each with different performance characteristics. Use limit() and subtract() for large datasets, while collect() with list slicing works well for smaller data that fits in driver memory.

Updated on: 2026-03-27T08:08:52+05:30
