Explain how the scikit-learn library can be used to split a dataset for training and testing purposes in Python


Scikit-learn, commonly known as sklearn, is a Python library used to implement machine learning algorithms. It is powerful and robust because it provides a wide variety of tools for statistical modelling.

These tools cover classification, regression, clustering, dimensionality reduction, and much more, all available through a consistent and stable Python interface. The library is built on NumPy, SciPy, and Matplotlib.

Before the input data is passed to a machine learning algorithm, it has to be split into training and test datasets.

The chosen model is fit to the training dataset. During training, the model learns from this data.

The aim is for the model to generalize to new data as well. The test dataset is not used during the training of the model.

Once all the hyperparameters are tuned and the optimum weights are set, the test dataset is provided to the trained model.

This is the dataset that is used to check how well the model generalizes to new data. Let us see how data can be split using the scikit-learn library.

Example

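from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset and separate the features from the target values
my_data = load_iris()
X = my_data.data
y = my_data.target

# Hold out 20 percent of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)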
print("The dimensions of the features of training data ")
print(X_train.shape)
print("The dimensions of the features of test data ")
print(X_test.shape)
print("The dimensions of the target values of training data ")
print(y_train.shape)
print("The dimensions of the target values of test data ")
print(y_test.shape)

Output

The dimensions of the features of training data
(120, 4)
The dimensions of the features of test data
(30, 4)
The dimensions of the target values of training data
(120,)
The dimensions of the target values of test data
(30,)

Explanation

  • The required packages are imported.
  • The Iris dataset is loaded into the environment.
  • The features and the target values are separated from the dataset.
  • The data is split into training and test sets in the ratio 80 percent to 20 percent respectively (120 and 30 of the 150 samples).
  • This means 20 percent of the data will be used to check how well the model generalizes on new data.
  • The dimensions of each split are printed on the console.
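
The way the two splits are typically used can be sketched as follows. This is a minimal illustration rather than part of the original example: the choice of KNeighborsClassifier, the small list of candidate n_neighbors values, and the use of 5-fold cross-validation for tuning are all assumptions made here. The hyperparameter is tuned on the training data only, and the test data is used once at the end to estimate how well the model generalizes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same split as in the example above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Tune one hyperparameter with cross-validation on the training split only.
# The candidate values are an arbitrary, illustrative grid.
candidate_ks = [1, 3, 5, 7]
cv_scores = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    ).mean()
    for k in candidate_ks
}
best_k = max(cv_scores, key=cv_scores.get)

# Fit the final model on the full training split
model = KNeighborsClassifier(n_neighbors=best_k)
model.fit(X_train, y_train)

# The test split is touched only now, to estimate generalization
test_accuracy = model.score(X_test, y_test)
print("Chosen n_neighbors:", best_k)
print("Accuracy on the held-out test data:", test_accuracy)
```

Because the test data plays no part in choosing n_neighbors or in fitting the model, the final accuracy is a fairer estimate of performance on unseen data than the accuracy measured on the training data would be.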

Updated on: 11-Dec-2020
