How to split a Dataset into Train sets and Test sets in Python?



In this tutorial, we will learn how to split a dataset into a train set and a test set using Python.

Introduction

While building Machine Learning and Deep Learning models, we often need to both train and evaluate a model using the same dataset. In such cases, we divide the dataset into separate groups, or sets, and use each set for one specific task (e.g. training). This is where train and test sets come in.

Need for Train and Test sets

Splitting a dataset is one of the most essential and easiest preprocessing techniques. Two common issues in Machine Learning models are overfitting and underfitting. Overfitting occurs when the model performs very well on the training data but fails to generalize to unseen samples. This may happen if the model learns noise from the data.

The other problem is underfitting, where the model does not perform well even on the training data and hence does not generalize well either. This can happen if the model is too simple for the data or if there is too little data for training.

One of the easiest ways to detect these kinds of issues is to split the dataset into train and test sets. The train set is used to train the model, that is, to learn the model parameters. The test set is used to evaluate the model's performance on data it has not seen during training.
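The role of the two sets can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the iris dataset and the decision tree classifier are chosen here purely as examples and are not part of this tutorial. A large gap between the two accuracies would suggest overfitting, while low accuracy on both would suggest underfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset and model (assumptions, not taken from the tutorial)
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the model parameters on the train set only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Compare accuracy on data the model has seen and data it has not
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```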

A Few Terms

Train set

The part of the dataset used for training the model. It usually makes up around 70% of the whole dataset, but other percentages such as 60% or 80% can be used depending on the use case. This part of the dataset is used for learning and fitting the parameters of the model.

Test set

The part of the dataset used for evaluating the model. It usually makes up around 30% of the whole dataset, but other percentages such as 40% or 20% can be used depending on the use case.

Generally, we divide the dataset between the train and test sets in a ratio such as 70:30 or 80:20, depending on our requirements.
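To make the ratios concrete, here is a small sketch of the arithmetic. The helper name split_sizes and the dataset size of 1000 are hypothetical, introduced only for illustration.

```python
def split_sizes(n_samples, train_pct):
    """Return (n_train, n_test) for a given train percentage (hypothetical helper)."""
    # Integer arithmetic avoids floating-point rounding surprises
    n_train = n_samples * train_pct // 100
    n_test = n_samples - n_train
    return n_train, n_test

n_samples = 1000  # hypothetical dataset size
for train_pct in (60, 70, 80):
    n_train, n_test = split_sizes(n_samples, train_pct)
    print(f"{train_pct}:{100 - train_pct} split -> {n_train} train, {n_test} test")
```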

Splitting the dataset into Train and Test sets in Python

There are three common ways to split a dataset:

  • Using sklearn's train_test_split

  • Using numpy indexing

  • Using pandas

Let's have a brief look at each of the above methods.

1. Using sklearn's train_test_split

Example

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 5 features each, plus binary labels
x = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Hold out 30% of the samples for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=4)

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)

Output

Shape of x_train is (7, 5)
Shape of x_test is (3, 5)
Shape of y_train is (7,)
Shape of y_test is (3,)
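When the classes are imbalanced, a purely random split may leave too few samples of the rare class in one of the sets. As a hedged sketch, train_test_split accepts a stratify argument that keeps the class proportions roughly equal in both sets. The imbalanced labels below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
x = np.arange(500).reshape(100, 5)
y = np.array([0] * 90 + [1] * 10)

# stratify=y asks for the same class proportions in the train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4, stratify=y)

print("Class 1 share in train:", y_train.mean())
print("Class 1 share in test:", y_test.mean())
```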

2. Using numpy indexing

Example

import numpy as np

# Random data: 100 samples with 5 features each, and one target column
x = np.random.rand(100, 5)
y = np.random.rand(100, 1)

# The first 80 rows form the train set, the remaining 20 rows the test set
x_train, x_test = x[:80, :], x[80:, :]
y_train, y_test = y[:80, :], y[80:, :]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)

Output

Shape of x_train is (80, 5)
Shape of x_test is (20, 5)
Shape of y_train is (80, 1)
Shape of y_test is (20, 1)
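Plain slicing keeps the rows in their original order, which can bias the split if the data is sorted (for example by label or by date). One possible variation, sketched here under the assumption that a random split is wanted, is to shuffle the row indices first and then slice. The seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility
x = rng.random((100, 5))
y = rng.random((100, 1))

# Shuffle the row indices, then cut them at the 80% mark
indices = rng.permutation(len(x))
split = int(0.8 * len(x))
train_idx, test_idx = indices[:split], indices[split:]

# Index x and y with the same indices so rows stay aligned
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)
```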

3. Using pandas sample

Example

import pandas as pd
import numpy as np

data = np.random.randint(10, 25, size=(5, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

print("Dataset shape : {}".format(df.shape))
print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))

Output

Dataset shape : (5, 3)
Train dataset shape : (4, 3)
Test dataset shape : (1, 3)
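An equivalent way to build the test set, shown here as a sketch, is to drop the sampled rows by index label with DataFrame.drop. Like the isin approach, this assumes the index labels are unique. The 10-row DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.integers(10, 25, size=(10, 3))
df = pd.DataFrame(data, columns=["col1", "col2", "col3"])

# Sample 80% of the rows for training
train_df = df.sample(frac=0.8, random_state=100)

# Everything that was not sampled becomes the test set
test_df = df.drop(train_df.index)

print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))
```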

Conclusion

The train-test split is a very important preprocessing step in machine learning tasks in Python. It helps to detect problems such as overfitting and underfitting by evaluating the model on data it was not trained on.

