Scikit-learn, commonly known as sklearn, is a Python library used to implement machine learning algorithms. It provides a wide variety of tools for statistical modelling.
These include classification, regression, clustering, dimensionality reduction, and much more, all available through a consistent and stable Python interface. It is built on the NumPy, SciPy, and Matplotlib libraries.
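As a rough illustration of that shared interface, the sketch below applies three kinds of estimator to the iris dataset that ships with scikit-learn. The particular estimators (KNeighborsClassifier, KMeans, PCA) and their parameter values are arbitrary choices made for this example, and everything is fit on the full dataset only to keep the sketch short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Classification: fit on features and labels, then predict labels.
# (n_neighbors=5 is an arbitrary value chosen for illustration.)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
predicted_labels = classifier.predict(X[:5])

# Clustering: fit on features only and read back the cluster assignments.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

# Dimensionality reduction: fit on features, then project them to 2 columns.
reducer = PCA(n_components=2)
X_reduced = reducer.fit_transform(X)

print(predicted_labels.shape, cluster_ids.shape, X_reduced.shape)
```

Each estimator is created, fit, and then used to predict or transform, which is what makes it easy to swap one algorithm for another.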
Before the input data is passed to a machine learning algorithm, it has to be split into a training dataset and a test dataset.
The chosen model is then fit to the training dataset. During training, the model learns patterns from this data.
The aim is for the model to generalize to data it has not seen. The test dataset is not used while the model is being trained.
Once the hyperparameters are tuned and the model's weights have been learned, the test dataset is passed to the trained model.
This dataset is used to check how well the model generalizes to new data. Let us see how data can be split using the scikit-learn library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset: 150 samples with 4 features each
my_data = load_iris()
X = my_data.data
y = my_data.target

# Hold out 20% of the samples as test data; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

print("The dimensions of the features of training data ")
print(X_train.shape)
print("The dimensions of the features of test data ")
print(X_test.shape)
print("The dimensions of the target values of training data ")
print(y_train.shape)
print("The dimensions of the target values of test data ")
print(y_test.shape)
The dimensions of the features of training data
(120, 4)
The dimensions of the features of test data
(30, 4)
The dimensions of the target values of training data
(120,)
The dimensions of the target values of test data
(30,)
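To connect the split back to the training and testing steps described earlier, here is a minimal sketch that fits a model on the training portion and checks it on the held-out test portion. The choice of LogisticRegression, the max_iter value, and accuracy as the metric are assumptions made for illustration; any scikit-learn classifier could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the example above: 20% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# The model only ever sees the training data while it is being fit.
# (LogisticRegression and max_iter=1000 are illustrative choices.)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The test data is used once, at the end, to estimate generalization.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the held-out test data:", test_accuracy)
```

Because the test samples played no part in fitting, the reported accuracy is an estimate of how the model would behave on new data.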