How to transform the scikit-learn Iris dataset to a 2-feature dataset in Python?
Iris, a multivariate flower dataset, is one of the most useful datasets bundled with Python's scikit-learn. It contains measurements of the sepal and petal parts of three Iris species, namely Iris setosa, Iris virginica, and Iris versicolor, with 50 instances of each species (150 in total). Each instance has four features: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).
We can use Principal Component Analysis (PCA) to transform the Iris dataset into a new feature space with only 2 features.
Steps
We can follow the steps given below to transform the Iris dataset to a 2-feature dataset using PCA in Python −
Step 1 − First, import the necessary packages from scikit-learn. We need the datasets and decomposition modules.
Step 2 − Load the Iris dataset.
Step 3 − Print the details about the dataset.
Step 4 − Initialize principal component analysis (PCA) with n_components=2 and apply the fit() function to fit the data.
Step 5 − Transform the dataset to the new dimensions, i.e., a 2-feature dataset.
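Note that these steps apply PCA directly to the raw measurements. The four Iris features are all in centimetres and lie on similar scales, so this works well here; when features lie on very different scales, it is common to standardize them first. Below is a minimal sketch of this optional step, using scikit-learn's StandardScaler (not part of the steps above) −

# Optional: standardize features before PCA when their scales differ.
# A minimal sketch; the Iris example below skips this step.
from sklearn import datasets, decomposition
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(iris.data)

# PCA on the standardized data
X_pca = decomposition.PCA(n_components=2).fit_transform(X_scaled)
print(X_pca.shape)   # (150, 2)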
Example
In the example below, we will use the above steps to transform the scikit-learn Iris plant dataset to 2 features with PCA.
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')

X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)

# Transforming iris data to new dimensions (with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations: ', X_iris_pca2.shape)
Output
It will produce the following output −
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

New Dataset size after transformations: (150, 2)
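To get a feel for the two new features, we can plot the transformed data and colour each point by its species. Below is a minimal visualization sketch; it assumes matplotlib is installed, which the example above does not require −

# A minimal visualization sketch (assumes matplotlib is installed)
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X_2d = decomposition.PCA(n_components=2).fit_transform(iris.data)

# Colour each point by its species label (0, 1 or 2)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Iris dataset projected onto 2 principal components')
plt.show()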
How to Transform the Iris Dataset to a 3-feature Dataset?
We can use the same statistical method, Principal Component Analysis (PCA), to transform the Iris dataset into a new feature space with 3 features. PCA linearly projects the data into a new feature space by analyzing the features of the original dataset.
The main concept behind PCA is to select the "principal" characteristics of the data and build features based on them. It gives us a new dataset that is smaller in size but retains most of the information of the original dataset.
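One way to check how much information the reduced dataset keeps is to map it back to the original four dimensions with PCA's inverse_transform() method and measure the reconstruction error. Below is a minimal sketch of this check −

# A minimal sketch: estimate the information kept by PCA
# by reconstructing the original features from the reduced ones.
import numpy as np
from sklearn import datasets, decomposition

iris = datasets.load_iris()
pca_3 = decomposition.PCA(n_components=3).fit(iris.data)

X_reduced = pca_3.transform(iris.data)            # shape (150, 3)
X_restored = pca_3.inverse_transform(X_reduced)   # back to shape (150, 4)

# Mean squared reconstruction error; a small value means little information lost
mse = np.mean((iris.data - X_restored) ** 2)
print('Reconstruction MSE :', mse)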
Example
In the example below, we will transform the scikit-learn Iris plant dataset with PCA initialized with 3 components.
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))

X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_3 = decomposition.PCA(n_components=3)
pca_3.fit(X_iris)

# Transforming iris data to new dimensions (with 3 features)
X_iris_pca3 = pca_3.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations : ', X_iris_pca3.shape)
print('\n')

# Getting the direction of maximum variance in data
print("Components : ", pca_3.components_)
print('\n')

# Getting the amount of variance explained by each component
print("Explained Variance:", pca_3.explained_variance_)
print('\n')

# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:", pca_3.explained_variance_ratio_)
print('\n')

# Getting the singular values for each component
print("Singular Values :", pca_3.singular_values_)
print('\n')

# Getting estimated noise covariance
print("Noise Variance :", pca_3.noise_variance_)
Output
It will produce the following output −
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

Target size : (150,)

New Dataset size after transformations : (150, 3)

Components : [[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
 [ 0.65658877  0.73016143 -0.17337266 -0.07548102]
 [-0.58202985  0.59791083  0.07623608  0.54583143]]

Explained Variance: [4.22824171 0.24267075 0.0782095 ]

Explained Variance Ratio: [0.92461872 0.05306648 0.01710261]

Singular Values : [25.09996044  6.01314738  3.41368064]

Noise Variance : 0.02383509297344944
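Rather than fixing the number of components in advance, PCA in scikit-learn also accepts a float between 0 and 1 for n_components, in which case it keeps just enough components to explain that fraction of the total variance. Below is a short sketch; given the explained variance ratios printed above, a 0.95 threshold keeps 2 components for Iris −

# A short sketch: let PCA choose the number of components that
# together explain at least 95% of the variance.
from sklearn import datasets, decomposition

iris = datasets.load_iris()
pca = decomposition.PCA(n_components=0.95)
X_reduced = pca.fit_transform(iris.data)

print('Components kept :', pca.n_components_)
print('Explained ratio :', pca.explained_variance_ratio_.sum())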