How to transform Scikit-learn IRIS dataset to 2-feature dataset in Python?

The Iris dataset is one of the most popular datasets in machine learning, containing measurements of sepal and petal dimensions for three Iris flower species. It has 150 samples with 4 features each. We can use Principal Component Analysis (PCA) to reduce the dimensionality while preserving most of the variance in the data.

What is PCA?

PCA is a dimensionality reduction technique that transforms data into a new coordinate system where the greatest variance lies on the first coordinate (principal component), the second greatest variance on the second coordinate, and so on.
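To make this concrete, here is a minimal from-scratch sketch of that coordinate transform using NumPy's eigendecomposition on toy data (scikit-learn's PCA uses SVD internally, but the idea is the same; the random data and variable names are illustrative assumptions):

```python
import numpy as np

# Toy data: 100 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

X_centered = X - X.mean(axis=0)         # PCA works on mean-centered data
cov = np.cov(X_centered, rowvar=False)  # 4x4 covariance matrix

# eigh returns eigenvalues in ascending order; sort descending by variance
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]      # keep the top-2 principal axes

X_2d = X_centered @ components          # project onto the new coordinates
print(X_2d.shape)                       # (100, 2)
```

The first column of `X_2d` is the direction of greatest variance, the second column the next greatest, exactly as described above.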

Transforming to 2 Features

Here's how to reduce the Iris dataset from 4 features to 2 features using PCA:

# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names:', iris.feature_names)
print('Features size:', iris.data.shape)
print('Target names:', iris.target_names)

X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)

# Transform iris data to new dimensions (with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)

# Print new dataset size
print('New Dataset size after transformations:', X_iris_pca2.shape)
print('Explained variance ratio:', pca_2.explained_variance_ratio_)
Output:

Features names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Features size: (150, 4)
Target names: ['setosa' 'versicolor' 'virginica']
New Dataset size after transformations: (150, 2)
Explained variance ratio: [0.92461872 0.05306648]
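Because the two components keep about 97.8% of the variance, mapping them back to the original 4-feature space loses little information. The sketch below (using `inverse_transform`, with the error threshold chosen for illustration) makes that measurable:

```python
import numpy as np
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X = iris.data

# fit_transform combines fit() and transform() in one call
pca_2 = decomposition.PCA(n_components=2)
X_2d = pca_2.fit_transform(X)

# Map the 2 components back to the original 4-feature space
X_approx = pca_2.inverse_transform(X_2d)

# Reconstruction error stays small because ~97.8% of the variance is kept
error = np.mean((X - X_approx) ** 2)
print('Mean squared reconstruction error:', error)
```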

Transforming to 3 Features

We can also transform the dataset to 3 features and examine additional PCA properties:

# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()
X_iris, Y_iris = iris.data, iris.target

# Initialize PCA with 3 components
pca_3 = decomposition.PCA(n_components=3)
pca_3.fit(X_iris)

# Transform iris data to new dimensions (with 3 features)
X_iris_pca3 = pca_3.transform(X_iris)

print('Original dataset shape:', X_iris.shape)
print('Transformed dataset shape:', X_iris_pca3.shape)
print()

# PCA properties
print('Explained variance ratio:', pca_3.explained_variance_ratio_)
print('Cumulative explained variance:', pca_3.explained_variance_ratio_.cumsum())
print('Total variance explained:', pca_3.explained_variance_ratio_.sum())
Output:

Original dataset shape: (150, 4)
Transformed dataset shape: (150, 3)

Explained variance ratio: [0.92461872 0.05306648 0.01710261]
Cumulative explained variance: [0.92461872 0.97768521 0.99478782]
Total variance explained: 0.9947878161267246
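Rather than fixing the component count up front, scikit-learn's PCA also accepts a float for n_components, meaning "keep enough components to explain at least this fraction of the variance." Given the cumulative ratios above, a 95% threshold should select 2 components:

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()

# A float n_components is a variance threshold, not a component count
pca_95 = decomposition.PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(iris.data)

print('Components kept:', pca_95.n_components_)   # 2
print('Reduced shape:', X_reduced.shape)          # (150, 2)
```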

Key Benefits of PCA

  • Dimensionality Reduction: Reduces computational complexity
  • Noise Reduction: Removes less important variations
  • Visualization: Makes high-dimensional data easier to plot
  • Feature Selection: Identifies the most important patterns

Understanding the Results

The first two principal components capture about 97.8% of the total variance in the Iris dataset. This means we can represent the data in 2D while retaining almost all the information from the original 4D space.
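As a quick check that the 2D representation retains the class structure, we can train a classifier on the two components alone. The choice of LogisticRegression and the split parameters here are illustrative assumptions, not part of the original walkthrough:

```python
from sklearn import datasets, decomposition
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Reduce to the 2 principal components as shown above
X_2d = decomposition.PCA(n_components=2).fit_transform(iris.data)

X_train, X_test, y_train, y_test = train_test_split(
    X_2d, iris.target, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print('Accuracy on 2 PCA features:', clf.score(X_test, y_test))
```

The accuracy stays high despite discarding half the features, which is what the 97.8% variance figure predicts.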

Conclusion

PCA is an effective technique for reducing the Iris dataset dimensionality from 4 to 2 or 3 features while preserving most of the data variance. The first two components alone capture over 97% of the variance, making them ideal for visualization and analysis.

Updated on: 2026-03-26T22:14:51+05:30
