How to slice a PySpark dataframe in two row-wise dataframe?
PySpark dataframes can be split into two row-wise dataframes using various built-in methods. This process, called slicing, is useful for data partitioning and parallel processing in distributed computing environments.
Syntax Overview
The key methods for slicing PySpark dataframes include:
- limit(n): returns the first n rows as a new dataframe
- subtract(df): returns rows not present in another dataframe (set difference, duplicates removed)
- collect(): retrieves all rows to the driver as a Python list
- head(n): returns the first n rows as a list of Row objects
- exceptAll(df): returns rows not in another dataframe, preserving duplicates
- filter(condition): filters rows based on a condition
Installation
```
pip install pyspark
```
Using limit() and subtract()
The most straightforward approach uses limit() to take the first n rows and subtract() to get the remaining ones. Note that subtract() computes a set difference: it removes duplicate rows, and the row order of the second dataframe is not guaranteed, as the output below shows (rows come back as 3, 5, 4).
```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('Employee Data').getOrCreate()

# Sample data
rows = [['1', 'RAHUL', 'INDIA', '1243'],
        ['2', 'PETER', 'SRI LANKA', '5461'],
        ['3', 'JOHN', 'SOUTH KOREA', '2224'],
        ['4', 'MARK', 'NEWYORK', '9985'],
        ['5', 'SUNNY', 'BANGLADESH', '8912']]
cols = ['S.N', 'EMPLOYEE NAME', 'COUNTRY', 'EMP_ID']

# Create DataFrame
df = spark.createDataFrame(rows, cols)

# Split into two dataframes
df_first = df.limit(2)
df_second = df.subtract(df_first)

print("First DataFrame:")
df_first.show()
print("Second DataFrame:")
df_second.show()
```
```
First DataFrame:
+---+-------------+---------+------+
|S.N|EMPLOYEE NAME|  COUNTRY|EMP_ID|
+---+-------------+---------+------+
|  1|        RAHUL|    INDIA|  1243|
|  2|        PETER|SRI LANKA|  5461|
+---+-------------+---------+------+

Second DataFrame:
+---+-------------+-----------+------+
|S.N|EMPLOYEE NAME|    COUNTRY|EMP_ID|
+---+-------------+-----------+------+
|  3|         JOHN|SOUTH KOREA|  2224|
|  5|        SUNNY| BANGLADESH|  8912|
|  4|         MARK|    NEWYORK|  9985|
+---+-------------+-----------+------+
```
Using collect() and List Slicing
This method collects all rows to the driver as a Python list, slices the list, and builds new dataframes from the two slices. Because collect() pulls the entire dataset into driver memory, it is only suitable for small dataframes.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Employee Data").getOrCreate()

# Create sample dataframe
data = [("Vivek", 31), ("Aman", 20), ("Sohan", 13), ("David", 24)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Collect rows and slice
rows = df.collect()
rows1 = rows[:2]  # First 2 rows
rows2 = rows[2:]  # Remaining rows

# Create new dataframes
df1 = spark.createDataFrame(rows1, df.schema)
df2 = spark.createDataFrame(rows2, df.schema)

print("First DataFrame:")
df1.show()
print("Second DataFrame:")
df2.show()
```
```
First DataFrame:
+-----+---+
| Name|Age|
+-----+---+
|Vivek| 31|
| Aman| 20|
+-----+---+

Second DataFrame:
+-----+---+
| Name|Age|
+-----+---+
|Sohan| 13|
|David| 24|
+-----+---+
```
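Since collect() returns an ordinary Python list, the cut point does not have to be hard-coded; it can be computed from a fraction of the row count. A minimal sketch (split_at_fraction is a hypothetical helper, and plain tuples stand in for collected Row objects):

```python
def split_at_fraction(rows, fraction):
    # Hypothetical helper: cut a collected row list so that roughly
    # `fraction` of the rows land in the first slice.
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]

# collect() hands back a plain Python list, so ordinary slicing applies;
# tuples stand in for Row objects here.
rows = [("Vivek", 31), ("Aman", 20), ("Sohan", 13), ("David", 24)]
first, second = split_at_fraction(rows, 0.5)
print(first)   # [('Vivek', 31), ('Aman', 20)]
print(second)  # [('Sohan', 13), ('David', 24)]
```

The two slices can then be passed to spark.createDataFrame() exactly as shown above.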
Using filter() with Row Numbers
This approach assigns each row a position with row_number() over a window ordered by Name, then filters on that position to split the dataframe. Because the window must define an ordering, the split follows the sort order (alphabetical by Name here) rather than the original insertion order.
```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

# Create DataFrame
data = [("Rabina", 35), ("Stephen", 31), ("Raman", 33), ("Salman", 44), ("Meera", 37)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define split point
n_rows_first_df = 2

# Add row numbers
df_with_row_number = df.withColumn("row_number", row_number().over(Window.orderBy("Name")))

# Filter into two dataframes
first_df = df_with_row_number.filter(df_with_row_number.row_number <= n_rows_first_df).drop("row_number")
second_df = df_with_row_number.filter(df_with_row_number.row_number > n_rows_first_df).drop("row_number")

print("First DataFrame:")
first_df.show()
print("Second DataFrame:")
second_df.show()
```
```
First DataFrame:
+------+---+
|  Name|Age|
+------+---+
| Meera| 37|
|Rabina| 35|
+------+---+

Second DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|  Raman| 33|
| Salman| 44|
|Stephen| 31|
+-------+---+
```
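For intuition, what the window computation does can be mirrored in plain Python: sort, number the rows from 1, and partition on the threshold. This is a sketch of the logic only, not Spark code:

```python
data = [("Rabina", 35), ("Stephen", 31), ("Raman", 33), ("Salman", 44), ("Meera", 37)]
n_rows_first_df = 2

# Mirror Window.orderBy("Name") + row_number(): sort by name, number from 1.
numbered = [(i + 1, row) for i, row in enumerate(sorted(data, key=lambda r: r[0]))]

# Mirror the two filter() calls on the row_number column.
first = [row for i, row in numbered if i <= n_rows_first_df]
second = [row for i, row in numbered if i > n_rows_first_df]

print(first)   # [('Meera', 37), ('Rabina', 35)]
print(second)  # [('Raman', 33), ('Salman', 44), ('Stephen', 31)]
```

This matches the Spark output above because both order the rows alphabetically by Name before numbering them.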
Using head() and exceptAll()
This method uses head() to extract the first portion and exceptAll() to get the remaining rows.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create DataFrame
data = [("Karisma", 25), ("Bobby", 30), ("Champak", 35), ("Mark", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Define split point
n_rows_first_df = 2

# Create first dataframe using head()
first_rows = df.head(n_rows_first_df)
first_df = spark.createDataFrame(first_rows, df.schema)

# Create second dataframe using exceptAll()
second_df = df.exceptAll(first_df)

print("First DataFrame:")
first_df.show()
print("Second DataFrame:")
second_df.show()
```
```
First DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|Karisma| 25|
|  Bobby| 30|
+-------+---+

Second DataFrame:
+-------+---+
|   Name|Age|
+-------+---+
|Champak| 35|
|   Mark| 40|
+-------+---+
```
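The difference between exceptAll() and subtract() matters when the dataframe contains duplicate rows: subtract() is a set difference and drops all copies of a matching row, while exceptAll() removes each row only as many times as it appears in the other dataframe. A plain-Python sketch of that multiset behaviour (except_all here is a hypothetical stand-in, not the Spark API):

```python
from collections import Counter

def except_all(rows, other):
    # Multiset difference: remove each row from `rows` as many times as
    # it occurs in `other`, keeping any surplus duplicates.
    remaining = Counter(other)
    result = []
    for row in rows:
        if remaining[row] > 0:
            remaining[row] -= 1
        else:
            result.append(row)
    return result

rows = [("a", 1), ("a", 1), ("b", 2)]
print(except_all(rows, [("a", 1)]))  # [('a', 1), ('b', 2)] - one duplicate survives
```

Because of this, the head() + exceptAll() split stays correct even when the first rows have duplicates elsewhere in the dataframe, whereas subtract() would drop those copies as well.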
Method Comparison
| Method | Memory Usage | Performance | Best For |
|---|---|---|---|
| limit() + subtract() | Low | High | Large datasets |
| collect() + slicing | High | Low | Small datasets |
| filter() + row_number() | Medium | Medium | Ordered splitting |
| head() + exceptAll() | Medium | Medium | Simple splitting |
Conclusion
PySpark offers multiple ways to slice a dataframe row-wise, each with different performance characteristics. Prefer limit() with subtract() for large datasets, use collect() with list slicing for small data that fits in driver memory, and reach for row_number() with filter() when you need a deterministic, order-aware split.
