How to transform Scikit-learn IRIS dataset to 2-feature dataset in Python?


Iris, a multivariate flower dataset, is one of the most useful Python scikit-learn datasets. It contains sepal and petal measurements for three Iris species: Iris setosa, Iris versicolor, and Iris virginica. The dataset has 3 classes of 50 instances each (150 samples in total) and four features: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).

We can use Principal Component Analysis (PCA) to transform the Iris dataset into a new feature space with 2 features.

Steps

We can follow the steps given below to transform the Iris dataset to a 2-feature dataset using PCA in Python:

Step 1: Import the necessary packages from scikit-learn. We need the datasets and decomposition packages.

Step 2: Load the Iris dataset.

Step 3: Print the details of the dataset.

Step 4: Initialize principal component analysis (PCA) with 2 components and call the fit() function to fit it to the data.

Step 5: Transform the dataset to the new dimensions, i.e., a 2-feature dataset.

Example

In the example below, we use the above steps to reduce the scikit-learn Iris plant dataset to 2 features with PCA.


# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)

# Transforming iris data to new dimensions(with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations: ', X_iris_pca2.shape)

Output

It will produce the following output:

Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

New Dataset size after transformations: (150, 2)
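
As a compact variant of the example above, the fit and transform steps can be combined with fit_transform(). The snippet below is a minimal sketch rather than a replacement for the steps listed earlier; the variable names simply mirror the ones used in the example. It also sums explained_variance_ratio_ to show how much of the original variance the 2 new features keep, which should come out at roughly 98% for this dataset.

```python
# Importing the necessary packages
from sklearn import datasets, decomposition

# Load iris plant dataset
iris = datasets.load_iris()
X_iris = iris.data

# Fit PCA and transform the data in a single call
pca_2 = decomposition.PCA(n_components=2)
X_iris_pca2 = pca_2.fit_transform(X_iris)

# Shape of the reduced dataset: 150 samples, 2 features
print('New Dataset size after transformations: ', X_iris_pca2.shape)

# Fraction of the total variance kept by the 2 components
print('Variance retained: ', pca_2.explained_variance_ratio_.sum())
```

Calling fit_transform() gives the same result as calling fit() followed by transform() on the same data, so either form can be used.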

How to Transform Iris Dataset to 3-feature Dataset?

We can use a statistical method called Principal Component Analysis (PCA) to transform the Iris dataset into a new feature space with 3 features. PCA linearly projects the data into a new feature space by analyzing the variance of the features in the original dataset.

The main idea behind PCA is to find the "principal" directions of the data, i.e., the directions along which it varies the most, and to build new features from them. The resulting dataset has fewer dimensions but retains most of the variance (information) of the original dataset.
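
To make the idea of a linear projection concrete, the following sketch reproduces the PCA transformation by hand with NumPy: it centers the data, takes a singular value decomposition, and projects onto the first 3 directions. This is only an illustration of the technique, not a description of how scikit-learn implements PCA internally; the names X_centered, X_manual and X_sklearn are made up for this sketch. Because the sign of each principal direction is arbitrary, the two results are compared by absolute value.

```python
import numpy as np
from sklearn import datasets, decomposition

# Load the iris feature matrix (150 samples, 4 features)
X = datasets.load_iris().data

# PCA works on centered data, so subtract the mean of each feature
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Linear projection of the centered data onto the first 3 directions
X_manual = X_centered @ Vt[:3].T

# The same transformation done by scikit-learn
X_sklearn = decomposition.PCA(n_components=3).fit_transform(X)

# The sign of each component is arbitrary, so compare absolute values
same = np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6)
print('Manual projection matches scikit-learn (up to sign): ', same)
```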

Example

In the example below, we transform the scikit-learn Iris plant dataset with PCA (initialized with 3 components) and also print the attributes of the fitted PCA model.


# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))
X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_3 = decomposition.PCA(n_components=3)
pca_3.fit(X_iris)

# Transforming iris data to new dimensions (with 3 features)
X_iris_pca3 = pca_3.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations : ', X_iris_pca3.shape)
print('\n')

# Getting the direction of maximum variance in data
print("Components : ", pca_3.components_)
print('\n')

# Getting the amount of variance explained by each component
print("Explained Variance:",pca_3.explained_variance_)
print('\n')

# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:",pca_3.explained_variance_ratio_)
print('\n')

# Getting the singular values for each component
print("Singular Values :",pca_3.singular_values_)
print('\n')

# Getting the estimated noise variance
print("Noise Variance :",pca_3.noise_variance_)

Output

It will produce the following output:

Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

Target size : (150,)
New Dataset size after transformations : (150, 3)

Components : [[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]
[-0.58202985 0.59791083 0.07623608 0.54583143]]

Explained Variance: [4.22824171 0.24267075 0.0782095 ]

Explained Variance Ratio: [0.92461872 0.05306648 0.01710261]

Singular Values : [25.09996044 6.01314738 3.41368064]

Noise Variance : 0.02383509297344944
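
The attributes printed above are related to each other. The sketch below, which assumes the same 3-component PCA as in the example, checks that the explained variance equals the squared singular values divided by (number of samples - 1), prints the cumulative share of variance covered by the components, and uses inverse_transform() to measure how much information is lost when going from 4 features to 3. The names variance_from_singular, X_restored and mse are illustrative only.

```python
import numpy as np
from sklearn import datasets, decomposition

# Load the data and fit PCA with 3 components
X = datasets.load_iris().data
pca_3 = decomposition.PCA(n_components=3).fit(X)

# Explained variance can be recovered from the singular values
n_samples = X.shape[0]
variance_from_singular = pca_3.singular_values_ ** 2 / (n_samples - 1)
print('Matches explained_variance_: ',
      np.allclose(variance_from_singular, pca_3.explained_variance_))

# Cumulative share of the total variance covered by 1, 2 and 3 components
print('Cumulative Variance Ratio: ',
      np.cumsum(pca_3.explained_variance_ratio_))

# Map the 3-feature data back to 4 features and measure the error
X_restored = pca_3.inverse_transform(pca_3.transform(X))
mse = np.mean((X - X_restored) ** 2)
print('Mean squared reconstruction error: ', mse)
```

A small but non-zero reconstruction error shows that the reduced dataset keeps most, though not all, of the information in the original data.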

Gaurav Leekha

Updated on: 2022-10-04T08:38:18+05:30
