Divide a DataFrame in a ratio

Pandas DataFrames often need to be divided into smaller parts in a given ratio, for example for train-test splits in machine learning. pandas and NumPy offer several ways to split a DataFrame proportionally, either at random or in row order.

There are three main ways to divide DataFrame data based on ratio:

Using numpy.random.rand()
Using pandas.DataFrame.sample()
Using numpy.split()

Using numpy.random.rand()

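This method draws a uniform random value in [0, 1) for each row and assigns the row to a part by comparing that value with a threshold. For a 60-40 split, use 0.6 as the threshold. Because each row is assigned independently, the part sizes only approximate the requested ratio, especially for small DataFrames: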

Syntax

import numpy as np
ratio = np.random.rand(len(dataframe))
part1 = dataframe[ratio < threshold]
part2 = dataframe[ratio >= threshold]

Example

import numpy as np
import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 28, 32, 29, 27, 31],
    'Score': [85, 90, 78, 88, 92, 76, 89, 84]
})

print("Original DataFrame:")
print(data)

# Generate random values for splitting
ratio = np.random.rand(len(data))

# Split into roughly 60% and 40% (exact sizes vary from run to run)
train_data = data[ratio < 0.6]
test_data = data[ratio >= 0.6]

print(f"\nTrain data (60%): {len(train_data)} rows")
print(train_data)
print(f"\nTest data (40%): {len(test_data)} rows") 
print(test_data)

Original DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     78
3    David   28     88
4      Eve   32     92
5    Frank   29     76
6    Grace   27     89
7    Henry   31     84

Train data (60%): 4 rows
      Name  Age  Score
1      Bob   30     90
3    David   28     88
4      Eve   32     92
7    Henry   31     84

Test data (40%): 4 rows
      Name  Age  Score
0    Alice   25     85
2  Charlie   35     78
5    Frank   29     76
6    Grace   27     89
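
Because each row is assigned independently, the mask approach only approximates the requested ratio, and the result changes on every run unless the random generator is seeded. The following is a minimal sketch of both points, assuming NumPy's np.random.default_rng with an arbitrary seed of 0 and a hypothetical 1,000-row DataFrame that is not part of the example above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 rows with a single numeric column
data = pd.DataFrame({"value": np.arange(1000)})

# A seeded generator makes the random mask repeatable (seed 0 is arbitrary)
rng = np.random.default_rng(0)
mask = rng.random(len(data)) < 0.6

train_data = data[mask]
test_data = data[~mask]

# The two parts cover every row exactly once,
# but the train share is only close to 60%, not exactly 60%
print(len(train_data), len(test_data))
print(f"Train share: {len(train_data) / len(data):.3f}")
```

Using ~mask for the second part, rather than a second comparison, guarantees that the two parts are complementary.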

Using pandas.DataFrame.sample()

The sample() method gives more control over the split. Its frac parameter sets the fraction of rows to draw, and random_state fixes the seed so the result is reproducible. The remaining rows are obtained by dropping the sampled index labels:

Syntax

dataframe.sample(frac=fraction, random_state=seed_value)

Example

import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Number': [1, 2, 3, 4, 5, 6, 7, 8]
})

print("Original DataFrame:")
print(data)

# Split into 50-50 using sample()
train_data = data.sample(frac=0.5, random_state=42)
test_data = data.drop(train_data.index)

print("\nFirst 50% of data:")
print(train_data)
print("\nRemaining 50% of data:")
print(test_data)

Original DataFrame:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8

First 50% of data:
  Letters  Number
6       G       7
1       B       2
4       E       5
5       F       6

Remaining 50% of data:
  Letters  Number
0       A       1
2       C       3
3       D       4
7       H       8
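
Note that data.drop(train_data.index) relies on the index labels being unique. As a sketch of an alternative, the hypothetical helper split_by_ratio below (not a pandas function) shuffles the rows once with sample(frac=1) and then cuts by position, which gives exact part sizes even when labels repeat:

```python
import pandas as pd


def split_by_ratio(df, frac, seed=None):
    """Hypothetical helper: shuffle df, then cut it at round(frac * len(df))."""
    shuffled = df.sample(frac=1, random_state=seed)  # random permutation of rows
    cut = round(frac * len(df))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Index labels repeat on purpose to show that positions, not labels, are used
data = pd.DataFrame(
    {"Letters": list("ABCDEFGH"), "Number": list(range(1, 9))},
    index=[0, 0, 1, 1, 2, 2, 3, 3],
)

train_data, test_data = split_by_ratio(data, 0.75, seed=42)
print(len(train_data), len(test_data))  # 6 2
```

Passing the same seed again should reproduce the same two parts, just as random_state does in the example above.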

Using numpy.split()

The numpy.split() function divides the DataFrame sequentially, without randomization. It cuts at the position int(ratio * len(dataframe)); because int() truncates, the first part can be slightly smaller than the requested ratio:

Syntax

numpy.split(dataframe, [int(ratio * len(dataframe))])

Example

import pandas as pd
import numpy as np

# Create sample data
data = pd.DataFrame({
    'Letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Number': [1, 2, 3, 4, 5, 6, 7, 8]
})

print("Original DataFrame:")
print(data)

# Split near 70-30: int(0.7 * 8) = 5, so 5 rows and 3 rows
train_data, test_data = np.split(data, [int(0.7 * len(data))])

print("\nFirst 70% of data:")
print(train_data)
print("\nRemaining 30% of data:")
print(test_data)

Original DataFrame:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8

First 70% of data:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5

Remaining 30% of data:
  Letters  Number
5       F       6
6       G       7
7       H       8
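
The same sequential cut can also be written with positional slicing. This sketch assumes that plain DataFrame.iloc is an acceptable substitute for np.split(); recent pandas releases may emit a deprecation warning when np.split() is applied to a DataFrame, which slicing sidesteps:

```python
import pandas as pd

data = pd.DataFrame({
    "Letters": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Number": [1, 2, 3, 4, 5, 6, 7, 8],
})

# int() truncates, so 0.7 * 8 = 5.6 becomes a cut position of 5
cut = int(0.7 * len(data))

train_data = data.iloc[:cut]   # rows at positions 0..4
test_data = data.iloc[cut:]    # rows at positions 5..7

print(len(train_data), len(test_data))  # 5 3
```

The original index labels are kept in both parts; call reset_index(drop=True) on each part if a fresh 0-based index is preferred.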

Comparison

| Method | Randomization | Reproducible | Best For |
|---|---|---|---|
| `np.random.rand()` | Yes | No (unless seed set) | Simple random splits |
| `sample()` | Yes | Yes (with random_state) | Controlled random sampling |
| `np.split()` | No | Yes | Sequential splits |

Conclusion

Use sample() with the random_state parameter for reproducible random splits. Use np.split() for sequential division in row order. Use np.random.rand() for quick random partitioning when approximate part sizes are acceptable and reproducibility is not required.

Niharika Aitam

Updated on: 2026-03-27T15:55:20+05:30
