
Explain how scikit-learn library can be used to split the dataset for training and testing purposes in Python?
Scikit-learn, commonly known as sklearn, is a Python library used to implement machine learning algorithms. It is powerful and robust, since it provides a wide variety of tools for statistical modelling.
This includes classification, regression, clustering, dimensionality reduction, and much more, all through a consistent and stable interface. It is built on the NumPy, SciPy, and Matplotlib libraries.
Before the input data is passed to a machine learning algorithm, it has to be split into a training dataset and a test dataset.
The model is fit to the training dataset. During training, the model learns patterns from this data so that it can generalize to new data.
The test dataset is not used during training.
Once the hyperparameters have been tuned and the optimum weights have been learned, the test dataset is passed to the trained model.
This dataset is used to check how well the algorithm generalizes to previously unseen data. Let us see how data can be split using the scikit-learn library.
Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

my_data = load_iris()
X = my_data.data
y = my_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

print("The dimensions of the features of training data ")
print(X_train.shape)
print("The dimensions of the features of test data ")
print(X_test.shape)
print("The dimensions of the target values of training data ")
print(y_train.shape)
print("The dimensions of the target values of test data ")
print(y_test.shape)
Output
The dimensions of the features of training data
(120, 4)
The dimensions of the features of test data
(30, 4)
The dimensions of the target values of training data
(120,)
The dimensions of the target values of test data
(30,)
Explanation
- The required packages are imported.
- The dataset required for this is also loaded into the environment.
- The features and the target values are separated from the dataset.
- The data is split into training and test sets in an 80:20 ratio (test_size=0.2).
- This means 20 percent of the data is held out to check how well the model generalizes to new data.
- The dimensions of these splits are printed on the console.
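For classification problems such as Iris, it is often useful to preserve the class proportions in both splits. The sketch below shows how the same split can be made stratified by passing the (optional) stratify argument of train_test_split; the printed class counts are an illustrative check, not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

my_data = load_iris()
X = my_data.data
y = my_data.target

# stratify=y keeps the class proportions of y the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Iris has 50 samples per class, so a stratified 20 percent test split
# holds out exactly 10 samples of each of the 3 classes
print(np.bincount(y_test))   # [10 10 10]
```

Without stratification, a random split may leave one class slightly over- or under-represented in the test set, which can distort the evaluation on small datasets.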
