 
How to split a Dataset into Train sets and Test sets in Python?
In this tutorial, we will learn how to split a dataset into a train set and a test set using Python.
Introduction
While building Machine Learning and Deep Learning models, we often need to both train and evaluate a model using a single dataset. In such cases, we divide the dataset into separate sets and use each set for one specific task (e.g. training). This is where train and test sets come in.
Need for Train and Test sets
Splitting a dataset is one of the most essential and easiest preprocessing techniques. Two common issues in Machine Learning models are overfitting and underfitting. Overfitting occurs when the model performs very well on the training data but fails to generalize to unseen samples. This may happen if the model learns noise from the data.
Underfitting is the opposite problem: the model does not perform well even on the training data and hence does not generalize well either. This can happen when the model is too simple to capture the patterns in the data or has not been trained enough.
One of the easiest ways to detect these issues is to split the dataset into train and test sets. The train set is used to train the model, that is, to learn the model parameters. The test set is used to evaluate the model's performance on data it has not seen during training.
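As a rough illustration of why a held-out test set matters, the sketch below fits a deliberately over-flexible polynomial to a few noisy points and compares the error on the training points with the error on unseen points. The synthetic sine data, the polynomial degree, and the helper name `mse` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical noisy data: y = sin(2*pi*x) + Gaussian noise
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)

# The samples are already in random order, so slicing gives a random split:
# 10 points for training, 20 points held out for testing
x_train, y_train = x[:10], y[:10]
x_test, y_test = x[10:], y[10:]


def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial (given by coeffs) on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# A degree-9 polynomial has enough freedom to pass through all 10 training points
coeffs = np.polyfit(x_train, y_train, deg=9)

train_mse = mse(coeffs, x_train, y_train)
test_mse = mse(coeffs, x_test, y_test)

print(f"Train MSE: {train_mse:.6f}")
print(f"Test MSE:  {test_mse:.6f}")
```

A training error far below the test error is the typical sign of overfitting, and it only becomes visible because some data was kept aside.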
Key Terms
Train set
The part of the dataset used for training the model, that is, for learning and fitting the model parameters. It is usually around 70% of the whole dataset, but other percentages such as 60% or 80% can be used depending on the use case.
Test set
The part of the dataset used for evaluating the model. It is usually around 30% of the whole dataset, but other percentages such as 40% or 20% can be used depending on the use case.
In general, the dataset is divided between train and test sets in a ratio such as 70:30 or 80:20, depending on the requirements.
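To make the ratios concrete, here is a minimal sketch that turns a test fraction into train and test sample counts. The helper name `split_sizes` and the rounding rule are assumptions for illustration; libraries may round slightly differently when the fraction does not divide the dataset evenly.

```python
def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a dataset of n_samples rows."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(n_samples * test_fraction)  # rows held out for testing
    n_train = n_samples - n_test               # everything else is used for training
    return n_train, n_test


# Compare a few common ratios on a hypothetical dataset of 1000 rows
for fraction in (0.2, 0.3, 0.4):
    n_train, n_test = split_sizes(1000, fraction)
    print(f"test fraction {fraction}: {n_train} train rows, {n_test} test rows")
```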
Splitting the dataset into Train and Test sets in Python
There are three common ways to split a dataset:
- Using sklearn's train_test_split 
- Using numpy indexing 
- Using pandas 
Let's have a brief look at each of the above methods.
1. Using sklearn's train_test_split
Example
    import numpy as np
    from sklearn.model_selection import train_test_split

    # 10 samples with 5 features each, and one binary label per sample
    x = np.arange(0, 50).reshape(10, 5)
    y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

    # Hold out 30% of the samples for testing; random_state fixes the shuffle
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=4)

    print("Shape of x_train is", x_train.shape)
    print("Shape of x_test is", x_test.shape)
    print("Shape of y_train is", y_train.shape)
    print("Shape of y_test is", y_test.shape)
Output
    Shape of x_train is (7, 5)
    Shape of x_test is (3, 5)
    Shape of y_train is (7,)
    Shape of y_test is (3,)
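The `random_state` argument used above controls the shuffling applied before the split. The following sketch, which reuses the same toy arrays, checks that two calls with the same `random_state` return identical splits; the variable names `first`, `second`, and `same_split` are chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Each call returns [x_train, x_test, y_train, y_test]
first = train_test_split(x, y, test_size=0.3, random_state=4)
second = train_test_split(x, y, test_size=0.3, random_state=4)

# Compare the four arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical splits with the same random_state:", same_split)
```

Fixing the seed this way is useful when results need to be compared across runs.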
2. Using numpy indexing
Example
    import numpy as np

    # 100 random samples with 5 features each, and one target value per sample
    x = np.random.rand(100, 5)
    y = np.random.rand(100, 1)

    # First 80 rows for training, remaining 20 rows for testing
    x_train, x_test = x[:80, :], x[80:, :]
    y_train, y_test = y[:80, :], y[80:, :]

    print("Shape of x_train is", x_train.shape)
    print("Shape of x_test is", x_test.shape)
    print("Shape of y_train is", y_train.shape)
    print("Shape of y_test is", y_test.shape)
Output
    Shape of x_train is (80, 5)
    Shape of x_test is (20, 5)
    Shape of y_train is (80, 1)
    Shape of y_test is (20, 1)
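Plain slicing keeps the rows in their original order, which can bias the split if the dataset is sorted (for example, by class or by date). A possible variation, sketched below, shuffles the row indices first and then slices them. The seed value and the toy arrays are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for a repeatable shuffle

# Toy data: row i of x starts with 5*i, and y[i] == i, so pairing is easy to check
x = np.arange(500).reshape(100, 5)
y = np.arange(100).reshape(100, 1)

# Shuffle the row indices, then cut them into an 80/20 split
indices = rng.permutation(len(x))
n_train = int(0.8 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Index x and y with the same indices so features and targets stay aligned
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
```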
3. Using pandas sample
Example
    import pandas as pd
    import numpy as np

    # Small random DataFrame with 5 rows and 3 columns
    data = np.random.randint(10, 25, size=(5, 3))
    df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

    # Randomly sample 80% of the rows for training; the rest form the test set
    train_df = df.sample(frac=0.8, random_state=100)
    test_df = df[~df.index.isin(train_df.index)]

    print("Dataset shape : {}".format(df.shape))
    print("Train dataset shape : {}".format(train_df.shape))
    print("Test dataset shape : {}".format(test_df.shape))
Output
    Dataset shape : (5, 3)
    Train dataset shape : (4, 3)
    Test dataset shape : (1, 3)
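The same idea can be wrapped in a small reusable function. This is only a sketch: the function name `split_dataframe` and its default arguments are hypothetical, and it assumes the DataFrame has a unique index so that `drop` removes exactly the sampled rows.

```python
import numpy as np
import pandas as pd


def split_dataframe(df, train_frac=0.8, seed=100):
    """Split df into (train_df, test_df); assumes df has a unique index."""
    train_df = df.sample(frac=train_frac, random_state=seed)
    test_df = df.drop(train_df.index)  # rows that were not sampled for training
    return train_df, test_df


# Toy DataFrame with 10 rows and 3 columns
df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=["col1", "col2", "col3"])
train_df, test_df = split_dataframe(df)

print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))
```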
Conclusion
The train-test split is a very important preprocessing step in Python machine learning tasks. Evaluating the model on data it has not seen during training helps detect problems such as overfitting and underfitting.
