Divide a DataFrame in a ratio
Pandas DataFrames often need to be divided into smaller parts in a given ratio, for example for train-test splits in machine learning. pandas and NumPy offer several ways to split a DataFrame proportionally.
There are three main ways to divide DataFrame data based on ratio:
- Using np.random.rand()
- Using pandas.DataFrame.sample()
- Using numpy.split()
Using numpy.random.rand()
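This method draws a uniform random value in [0, 1) for each row and assigns the row to one part or the other by comparing that value with a threshold. For a 60-40 split, use 0.6 as the threshold. Because each row is assigned independently, the resulting sizes are only approximately 60% and 40%, and can be noticeably off for small DataFrames: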
Syntax
import numpy as np
ratio = np.random.rand(len(dataframe))
part1 = dataframe[ratio < threshold]
part2 = dataframe[ratio >= threshold]
Example
import numpy as np
import pandas as pd
# Create sample data
data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
'Age': [25, 30, 35, 28, 32, 29, 27, 31],
'Score': [85, 90, 78, 88, 92, 76, 89, 84]
})
print("Original DataFrame:")
print(data)
# Generate random values for splitting
ratio = np.random.rand(len(data))
# Split into 60% and 40%
train_data = data[ratio < 0.6]
test_data = data[ratio >= 0.6]
print(f"\nTrain data (60%): {len(train_data)} rows")
print(train_data)
print(f"\nTest data (40%): {len(test_data)} rows")
print(test_data)
Output (varies from run to run because no seed is set):
Original DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     78
3    David   28     88
4      Eve   32     92
5    Frank   29     76
6    Grace   27     89
7    Henry   31     84

Train data (60%): 4 rows
    Name  Age  Score
1    Bob   30     90
3  David   28     88
4    Eve   32     92
7  Henry   31     84

Test data (40%): 4 rows
      Name  Age  Score
0    Alice   25     85
2  Charlie   35     78
5    Frank   29     76
6    Grace   27     89
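The split above changes on every run. A minimal sketch of a reproducible variant follows, assuming NumPy's default_rng generator with a fixed seed. The helper name random_mask_split and the seed value are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd


def random_mask_split(df, threshold, seed=None):
    """Split df into two parts using one uniform random draw per row.

    Part sizes are approximate: each row lands in the first part
    with probability `threshold`, independently of the other rows.
    """
    rng = np.random.default_rng(seed)  # seeded generator gives repeatable draws
    draws = rng.random(len(df))        # one value in [0, 1) per row
    mask = draws < threshold
    return df[mask], df[~mask]


data = pd.DataFrame({"Number": range(1, 9)})

part1, part2 = random_mask_split(data, threshold=0.6, seed=0)
again1, again2 = random_mask_split(data, threshold=0.6, seed=0)

# Same seed gives the same split, and the two parts together cover every row.
print(part1.index.equals(again1.index))
print(len(part1) + len(part2) == len(data))
```

Seeding makes the split repeatable but does not make the sizes exact; the ratio is still only met on average.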
Using pandas.DataFrame.sample()
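The sample() method returns a random subset of rows. The frac parameter sets the fraction of rows to draw, and random_state seeds the sampling so the result is reproducible. Dropping the sampled rows from the original by index yields the remaining part, which requires the index labels to be unique: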
Syntax
dataframe.sample(frac=fraction, random_state=seed_value)
Example
import pandas as pd
# Create sample data
data = pd.DataFrame({
'Letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Number': [1, 2, 3, 4, 5, 6, 7, 8]
})
print("Original DataFrame:")
print(data)
# Split into 50-50 using sample()
train_data = data.sample(frac=0.5, random_state=42)
test_data = data.drop(train_data.index)
print(f"\nFirst 50% of data:")
print(train_data)
print(f"\nRemaining 50% of data:")
print(test_data)
Output:
Original DataFrame:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8

First 50% of data:
  Letters  Number
6       G       7
1       B       2
4       E       5
5       F       6

Remaining 50% of data:
  Letters  Number
0       A       1
2       C       3
3       D       4
7       H       8
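The sample-then-drop pattern can be wrapped in a small helper. This is a sketch with a hypothetical function name (sample_split); it assumes the DataFrame index is unique, since drop() removes rows by label and would otherwise remove more rows than were sampled.

```python
import pandas as pd


def sample_split(df, frac, seed=None):
    """Return (sampled part, remaining part) for a random split of df."""
    if not df.index.is_unique:
        raise ValueError("index must be unique for drop() to give the complement")
    first = df.sample(frac=frac, random_state=seed)  # random rows, size fixed by frac
    rest = df.drop(first.index)                      # everything not sampled
    return first, rest


data = pd.DataFrame({
    "Letters": list("ABCDEFGH"),
    "Number": range(1, 9),
})

first, rest = sample_split(data, frac=0.75, seed=42)
print(len(first), len(rest))  # 6 2
```

Unlike the random-threshold approach, the part sizes here are fixed by frac and the row count; only which rows land in each part is random.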
Using numpy.split()
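The numpy.split() function divides the DataFrame sequentially, in row order, without randomization. It splits at the index position int(ratio * len(dataframe)); since int() truncates, a ratio of 0.7 on 8 rows splits at position 5, giving 5 and 3 rows. Recent pandas versions may emit a FutureWarning when a DataFrame is passed to numpy.split():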
Syntax
numpy.split(dataframe, [int(ratio * len(dataframe))])
Example
import pandas as pd
import numpy as np
# Create sample data
data = pd.DataFrame({
'Letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'Number': [1, 2, 3, 4, 5, 6, 7, 8]
})
print("Original DataFrame:")
print(data)
# Split into 70-30 ratio
train_data, test_data = np.split(data, [int(0.7 * len(data))])
print(f"\nFirst 70% of data:")
print(train_data)
print(f"\nRemaining 30% of data:")
print(test_data)
Output:
Original DataFrame:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8

First 70% of data:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5

Remaining 30% of data:
  Letters  Number
5       F       6
6       G       7
7       H       8
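Positional slicing with iloc gives the same kind of sequential split without passing the DataFrame through NumPy, and it extends to more than two parts. The following is a sketch under that assumption; the helper name sequential_split is hypothetical, and it expects the ratios to sum to 1.

```python
import pandas as pd


def sequential_split(df, ratios):
    """Split df in row order into len(ratios) consecutive parts.

    Cut points are truncated with int(), and the last part takes
    whatever rows remain, so no row is lost to rounding.
    """
    n = len(df)
    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        stop = int(cumulative * n)        # same truncation as the np.split example
        parts.append(df.iloc[start:stop])
        start = stop
    parts.append(df.iloc[start:])         # remainder goes to the final part
    return parts


data = pd.DataFrame({
    "Letters": list("ABCDEFGH"),
    "Number": range(1, 9),
})

train, test = sequential_split(data, [0.7, 0.3])
print(len(train), len(test))              # 5 3

a, b, c = sequential_split(data, [0.5, 0.25, 0.25])
print(len(a), len(b), len(c))             # 4 2 2
```

To combine this with randomization, the DataFrame could be shuffled first (for instance with sample(frac=1, random_state=...)) and then split sequentially.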
Comparison
| Method | Randomization | Reproducible | Best For |
|---|---|---|---|
| np.random.rand() | Yes | No (unless seed set) | Simple random splits |
| sample() | Yes | Yes (with random_state) | Controlled random sampling |
| np.split() | No | Yes | Sequential splits |
Conclusion
Use sample() for reproducible random splits with the random_state parameter. Use np.split() for sequential division, and np.random.rand() for simple random partitioning without reproducibility requirements.
