Divide a DataFrame in a ratio


The pandas library is used to manipulate and analyze data. It provides two main data structures, Series and DataFrame. A DataFrame is a two-dimensional data structure made up of rows and columns.

There are different ways to divide a DataFrame based on a ratio. Let’s see them one by one.

  • Using numpy.random.rand()

  • Using pandas.DataFrame.sample()

  • Using numpy.split()

Using numpy.random.rand()

The numpy.random.rand() function returns an array of random numbers drawn uniformly from the interval [0, 1). If we generate one random number per row and compare the array with a threshold such as 0.6, we get a Boolean mask that selects roughly 60% of the rows; the rows that fail the comparison form the remaining part.

Syntax

The following is the syntax.

import numpy as np
ratio = np.random.rand(len(dataframe))
dataframe[ratio comparison_operator value]

Example

In the following example, we divide the first 20 rows of the Titanic dataset into two parts using the random.rand() function. To split the data in the proportion 60% and 40%, we compare the random values with the threshold 0.6. Because the mask is random, the split is only approximately 60/40 and the selected rows change on every run.

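import numpy as np
import pandas as pd

# Load the Titanic dataset and keep the first 20 rows
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data = data[:20]

# One random number in [0, 1) per row
ratio = np.random.rand(len(data))

# Rows whose random number is below 0.6 (about 60%) go to the first part
train_data = data[ratio < 0.6]
# The remaining rows (about 40%) go to the second part
test_data = data[ratio >= 0.6]

# Show the first rows of the second part
print(test_data.head())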

Output

    PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
4             5         0       3  ...   8.0500   NaN         S
8             9         1       3  ...  11.1333   NaN         S
10           11         1       3  ...  16.7000    G6         S
14           15         0       3  ...   7.8542   NaN         S
17           18         1       2  ...  13.0000   NaN         S

[5 rows x 12 columns]
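
Because np.random.rand() draws new numbers on every call, the output above differs from run to run. As a minimal sketch, assuming a small made-up DataFrame instead of the Titanic file and NumPy's np.random.default_rng() with an arbitrary seed of 0 (neither appears in the example above), the mask can be made reproducible as follows. The sizes of the two parts still only approximate the 60/40 ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for illustration
data = pd.DataFrame({"Number": range(20)})

# A seeded generator returns the same random numbers on every run
rng = np.random.default_rng(0)  # the seed 0 is an arbitrary choice
ratio = rng.random(len(data))

mask = ratio < 0.6
train_data = data[mask]
test_data = data[~mask]

# Every row lands in exactly one part; the sizes only approximate 12 and 8
print(len(train_data), len(test_data))
```

Using the mask and its negation (~mask) guarantees that the two parts never overlap and together cover the whole DataFrame.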

Using pandas.DataFrame.sample()

Another way to divide a DataFrame in a ratio is the sample() method of the DataFrame. Its frac parameter defines the fraction of rows to return, and random_state takes a seed value for the random number generator so that the result is reproducible.

Syntax

The following is the syntax.

dataframe.sample(frac=fraction, random_state=seed)

Example

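In the following example, we divide the data into two equal parts (50% each) using the sample() method available in the pandas library. The first part is a random sample of half the rows, and the second part is obtained by dropping the sampled rows from the original data.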

import pandas as pd
dic = {"Letters":['A','B','C','D','E','F','G','H'],
      "Number":[1,2,3,4,5,6,7,8]}
data = pd.DataFrame(dic)
print("The Original data:")
print(data)
print("The 50% of the original data")
train_data = data.sample(frac = 0.5, random_state = 40)
print(train_data)
print("Another 50% of the data")
test_data = data.drop(train_data.index)
print(test_data)

Output

The Original data:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8
The 50% of the original data
  Letters  Number
7       H       8
1       B       2
2       C       3
4       E       5
Another 50% of the data
  Letters  Number
0       A       1
3       D       4
5       F       6
6       G       7
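
The sample-then-drop pattern above can be wrapped in a small helper. The following is a sketch only: the function name split_by_ratio and its parameters are hypothetical, not part of pandas. It assumes the index labels are unique, because drop() removes every row that carries a sampled label.

```python
import pandas as pd

def split_by_ratio(df, frac, seed=None):
    """Return (first, second), where first is a random `frac` share of df.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    if not df.index.is_unique:
        raise ValueError("split_by_ratio expects a unique index")
    # Randomly pick the requested fraction of rows
    first = df.sample(frac=frac, random_state=seed)
    # Everything that was not picked forms the second part
    second = df.drop(first.index)
    return first, second

data = pd.DataFrame({"Letters": list("ABCDEFGH"),
                     "Number": [1, 2, 3, 4, 5, 6, 7, 8]})
train_data, test_data = split_by_ratio(data, frac=0.75, seed=40)
print(len(train_data), len(test_data))
```

With 8 rows and frac=0.75, the first part holds 6 rows and the second part holds the remaining 2.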

Using numpy.split() function

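Another way to divide a DataFrame based on a ratio is numpy.split(). The split() function in the NumPy library takes the DataFrame and a list of positions at which to split it; here the position is int(ratio * len(dataframe)). Unlike the previous methods, the rows are not shuffled, so the first part holds the leading rows and the second part holds the rest.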

Syntax

The following is the syntax.

numpy.split(dataframe, [int(ratio*len(dataframe))])

Example

The following example divides the DataFrame in a ratio of 70% and 30% using the numpy.split() function. With 8 rows, int(0.7 * 8) is 5, so the first part gets 5 rows and the second part gets 3.

import pandas as pd
import numpy as np
dic = {"Letters":['A','B','C','D','E','F','G','H'],
      "Number":[1,2,3,4,5,6,7,8]}
data = pd.DataFrame(dic)
print("The Original data:")
print(data)
print("The 70% of the original data")
train_data, test_data= np.split(data,[int(0.7*len(data))])
print(train_data)
print("Another 30% of the data")
print(test_data)

Output

The Original data:
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
5       F       6
6       G       7
7       H       8
The 70% of the original data
  Letters  Number
0       A       1
1       B       2
2       C       3
3       D       4
4       E       5
Another 30% of the data
  Letters  Number
5       F       6
6       G       7
7       H       8
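
numpy.split() keeps the rows in their original order, and recent pandas releases may emit a deprecation warning when it is called on a DataFrame. A possible alternative, sketched here under the assumption that a shuffled split is wanted, is to shuffle the rows with sample(frac=1) and then slice by position with iloc. The seed value 40 is an arbitrary choice.

```python
import pandas as pd

data = pd.DataFrame({"Letters": list("ABCDEFGH"),
                     "Number": [1, 2, 3, 4, 5, 6, 7, 8]})

# Shuffle all rows; frac=1 returns every row in random order
shuffled = data.sample(frac=1, random_state=40)

# Position at which to cut: int(0.7 * 8) is 5
cut = int(0.7 * len(shuffled))

train_data = shuffled.iloc[:cut]   # first 5 shuffled rows
test_data = shuffled.iloc[cut:]    # remaining 3 rows
print(len(train_data), len(test_data))
```

Slicing with iloc works on positions rather than index labels, so this approach does not depend on the index being unique.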

Updated on: 02-Nov-2023
