Reduce Data Dimensionality using PCA - Python

A dataset used in machine learning may have many dimensions (features). Not all of them contribute useful information, and the extra ones increase the size and complexity of the model and can hurt its performance. Principal Component Analysis (PCA) is a common way to reduce the number of dimensions in such a dataset.

PCA does not drop original columns one by one. It is a feature extraction technique: it builds new features (principal components) as linear combinations of the original ones and maps the data from a higher-dimensional space to a lower-dimensional one, keeping the directions with the most variance. The result is a smaller, simpler dataset that retains most of the useful information.
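To make the idea of "maximizing variance" concrete, here is a minimal NumPy sketch of the computation behind PCA. The function name pca_numpy and the toy data are assumptions made for illustration; scikit-learn's own implementation uses different numerical routines.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project X onto its top principal components (illustrative sketch)."""
    # Center each feature so variance is measured around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are variables)
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvalues come back in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Keep the directions with the largest variance
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    return X_centered @ components, explained_ratio

rng = np.random.default_rng(0)
# Toy data: column 2 is nearly a copy of column 1, column 3 is independent noise
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])

projected, ratio = pca_numpy(X, n_components=2)
print("Projected shape:", projected.shape)
print("Explained variance ratio:", ratio)
```

Because two of the three toy columns are almost identical, two components should capture nearly all of the variance.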

Syntax

from sklearn.decomposition import PCA

pca = PCA(n_components=number)

Here, PCA is the class that performs dimension reduction and pca is the object created from it. The n_components parameter specifies the number of principal components we want as output.
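Besides an integer, n_components can also be a float between 0 and 1, in which case scikit-learn keeps as many components as are needed to reach that share of the variance. The sketch below assumes a 90% target and the wine dataset purely for illustration; svd_solver="full" is set explicitly because the fractional form relies on a full decomposition.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so every feature contributes on the same scale
X = StandardScaler().fit_transform(load_wine().data)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("Components kept:", pca.n_components_)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```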

Algorithm Steps

  • Step 1: Import Python's sklearn and pandas libraries along with the related submodules.

  • Step 2: Load the required dataset and convert it to a pandas DataFrame.

  • Step 3: Use StandardScaler to standardize the features and store the result as a pandas DataFrame.

  • Step 4: Create a PCA object, fit it on the scaled dataset, and transform the data.
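The steps above can be sketched as one reusable function. The helper name reduce_with_pca and the choice of the iris dataset are assumptions for illustration; chaining the scaler and PCA with make_pipeline is one possible design, not the only one.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def reduce_with_pca(df, n_components):
    """Standardize df, then project it onto n_components principal components."""
    # Steps 3 and 4: scaling and PCA are chained so they are always fitted together
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    reduced = pipeline.fit_transform(df)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(reduced, columns=columns, index=df.index)

# Steps 1 and 2: load a dataset and wrap it in a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

reduced_df = reduce_with_pca(df, n_components=2)
print(reduced_df.shape)  # (150, 2)
print(reduced_df.head())
```

Keeping the scaler and PCA in one pipeline also means the same fitted transformation can later be applied to new data with pipeline.transform.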

Example 1: Diabetes Dataset with 4 Components

Let's use the diabetes dataset from sklearn to demonstrate PCA with 4 principal components:

# Import required libraries 
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])
print("Original dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Standardize the features
scaler = StandardScaler()
# Keep the original column names so the scaled DataFrame stays readable
scaled_data = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("\nStandardized data shape:", scaled_data.shape)

# Apply PCA with 4 components
pca = PCA(n_components=4)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])

print("\nPCA transformed data:")
print(data_pca.head())
print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)

Output:

Original dataset shape: (442, 10)

First 5 rows:
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

        s4        s5        s6  
0 -0.002592  0.019908 -0.017646  
1 -0.039493 -0.068330 -0.092204  
2 -0.002592  0.002864 -0.025930  
3  0.034309  0.022692 -0.009362  
4 -0.002592 -0.031991 -0.046641  

Standardized data shape: (442, 10)

PCA transformed data:
        PC1       PC2       PC3       PC4
0 -0.315179  0.174059  0.089285 -0.195877
1  0.366455 -0.011952 -0.158859  0.220858
2 -0.213086  0.261699 -0.281493  0.022717
3  0.382026 -0.104704  0.378282 -0.024940
4 -0.067528  0.318908 -0.168350 -0.152484

Explained variance ratio:
[0.40242142 0.14923182 0.12059623 0.09554764]
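The four ratios printed above sum to roughly 0.77, so four components keep about 77% of the variance. One way to choose the number of components is to look at the cumulative explained variance across all of them. The following is a sketch under an assumed target of 95%; the threshold is arbitrary and should be chosen per task.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_diabetes().data)

# Fit PCA with all components to see how variance accumulates
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} components: {total:.3f}")

threshold = 0.95  # assumed target share of variance
# First position where the cumulative share reaches the threshold (1-based count)
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed:", n_needed)
```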

Example 2: Wine Dataset with 3 Components

Now let's use the wine dataset with 3 principal components:

# Import required libraries 
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load the wine dataset
wine = datasets.load_wine()
df = pd.DataFrame(wine['data'], columns=wine['feature_names'])
print("Original dataset shape:", df.shape)
print("\nFirst 3 rows:")
print(df.head(3))

# Standardize the features
scaler = StandardScaler()
# Keep the original column names so the scaled DataFrame stays readable
scaled_data = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Apply PCA with 3 components
pca = PCA(n_components=3)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])

print("\nPCA transformed data:")
print(data_pca.head())
print(f"\nTotal explained variance: {sum(pca.explained_variance_ratio_):.3f}")

Output:

Original dataset shape: (178, 13)

First 3 rows:
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   

   od280/od315_of_diluted_wines  proline  
0                          3.92   1065.0  
1                          3.40   1050.0  
2                          3.17   1185.0  

PCA transformed data:
        PC1       PC2       PC3
0  3.316751 -1.443463  0.165454
1  2.209465  0.333393  2.028117
2  2.516740 -1.008088 -0.869831
3  3.757066  2.756372  0.144206
4  1.008908 -0.869831  0.930992

Total explained variance: 0.665
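Each principal component is a weighted combination of the original features, and the weights are stored in the fitted object's components_ attribute. The sketch below lists the features with the largest absolute weight on the first component; the name loadings is just a label chosen here, and the signs of the weights can differ between library versions.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=3).fit(X)

# components_ has one row per component; transpose so rows are original features
loadings = pd.DataFrame(
    pca.components_.T,
    index=wine.feature_names,
    columns=["PC1", "PC2", "PC3"],
)

# Features that contribute most strongly to the first component
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(3)
print(top_pc1)
```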

Key Benefits of PCA

  • Dimensionality Reduction: Reduces dataset complexity while preserving most of the important information.

  • Noise Reduction: Discards low-variance components, which often carry mostly noise.

  • Visualization: Enables visualization of high-dimensional data in 2D or 3D.

  • Computational Efficiency: Training and prediction are usually faster with fewer features.
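How much information a reduced dataset preserves can be checked by mapping it back to the original space with inverse_transform and measuring the difference. This is a minimal sketch on the wine dataset; the helper reconstruction_error and the component counts tried are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

def reconstruction_error(X, n_components):
    """Mean squared error after compressing to n_components and mapping back."""
    pca = PCA(n_components=n_components).fit(X)
    X_restored = pca.inverse_transform(pca.transform(X))
    return float(np.mean((X - X_restored) ** 2))

# More components means less information lost; all 13 restores the data almost exactly
errors = {k: reconstruction_error(X, k) for k in (2, 5, 13)}
for k, err in errors.items():
    print(f"{k} components: reconstruction MSE = {err:.6f}")
```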

Conclusion

PCA is a powerful technique for dimensionality reduction that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible. By combining correlated features into a smaller set of components, it can reduce the risk of overfitting and speed up training, which makes it a useful step in many machine learning workflows.

Updated on: 2026-03-27T11:22:20+05:30
