Reduce Data Dimensionality using PCA - Python
Any dataset used in a machine learning algorithm may have many dimensions. Not all of them contribute useful signal; redundant features merely increase the size and complexity of the data and can make the model perform poorly. It is therefore often worthwhile to eliminate such features using Principal Component Analysis (PCA).
PCA removes dimensions that do not contribute to the result, producing a smaller and simpler dataset that retains most of the original, useful information. It is a feature-extraction technique: it maps data from a higher-dimensional space to a lower-dimensional one while preserving as much variance as possible.
Syntax
from sklearn.decomposition import PCA
pca = PCA(n_components=number)
Here, PCA is the class that performs dimensionality reduction and pca is an instance of it. The n_components parameter specifies how many principal components we want in the output.
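As a side note, n_components can also be passed as a float between 0 and 1, in which case sklearn keeps the smallest number of components that explains at least that fraction of the total variance. A minimal sketch on synthetic data (the random data here is purely illustrative):

```python
from sklearn.decomposition import PCA
import numpy as np

# Toy data: 100 samples, 10 features generated from 3 latent factors,
# so most of the variance lives in a 3-dimensional subspace
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(100, 10))

# Keep enough components to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
print("Variance kept:", pca.explained_variance_ratio_.sum())
```

Because the toy data has only 3 underlying factors, PCA needs far fewer than 10 components to hit the 95% threshold.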
Algorithm Steps
Step 1: Import Python's sklearn and pandas libraries along with the related submodules.
Step 2: Load the required dataset and convert it to a pandas DataFrame.
Step 3: Use StandardScaler to standardize the features and store the result as a pandas DataFrame.
Step 4: Apply the PCA class to the scaled dataset, fit the model, and transform the data.
Example 1: Diabetes Dataset with 4 Components
Let's use the diabetes dataset from sklearn to demonstrate PCA with 4 principal components:
# Import required libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])
print("Original dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Standardize the features
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(df))
print("\nStandardized data shape:", scaled_data.shape)
# Apply PCA with 4 components
pca = PCA(n_components=4)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])
print("\nPCA transformed data:")
print(data_pca.head())
print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)
Output
Original dataset shape: (442, 10)
First 5 rows:
age sex bmi bp s1 s2 s3 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
s4 s5 s6
0 -0.002592 0.019908 -0.017646
1 -0.039493 -0.068330 -0.092204
2 -0.002592 0.002864 -0.025930
3 0.034309 0.022692 -0.009362
4 -0.002592 -0.031991 -0.046641
Standardized data shape: (442, 10)
PCA transformed data:
PC1 PC2 PC3 PC4
0 -0.315179 0.174059 0.089285 -0.195877
1 0.366455 -0.011952 -0.158859 0.220858
2 -0.213086 0.261699 -0.281493 0.022717
3 0.382026 -0.104704 0.378282 -0.024940
4 -0.067528 0.318908 -0.168350 -0.152484
Explained variance ratio:
[0.40242142 0.14923182 0.12059623 0.09554764]
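The four components above together explain roughly 77% of the total variance. A common way to choose how many components to keep is to inspect the cumulative explained variance across all components; a short sketch on the same diabetes dataset:

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load and standardize the diabetes features
diabetes = datasets.load_diabetes()
X = StandardScaler().fit_transform(diabetes['data'])

# Fit PCA with all components and compute the running total of variance
pca = PCA()
pca.fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, c in enumerate(cumulative, start=1):
    print(f"{k} components -> {c:.3f} of total variance")
```

The last entry is always 1.0 (all components together reproduce the data exactly); the "elbow" in this curve is a practical cut-off point.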
Example 2: Wine Dataset with 3 Components
Now let's use the wine dataset with 3 principal components:
# Import required libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
# Load the wine dataset
wine = datasets.load_wine()
df = pd.DataFrame(wine['data'], columns=wine['feature_names'])
print("Original dataset shape:", df.shape)
print("\nFirst 3 rows:")
print(df.head(3))
# Standardize the features
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(df))
# Apply PCA with 3 components
pca = PCA(n_components=3)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])
print("\nPCA transformed data:")
print(data_pca.head())
print(f"\nTotal explained variance: {sum(pca.explained_variance_ratio_):.3f}")
Output
Original dataset shape: (178, 13)
First 3 rows:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
0 3.06 0.28 2.29 5.64 1.04
1 2.76 0.26 1.28 4.38 1.05
2 3.24 0.30 2.81 5.68 1.03
od280/od315_of_diluted_wines proline
0 3.92 1065.0
1 3.40 1050.0
2 3.17 1185.0
PCA transformed data:
PC1 PC2 PC3
0 3.316751 -1.443463 0.165454
1 2.209465 0.333393 2.028117
2 2.516740 -1.008088 -0.869831
3 3.757066 2.756372 0.144206
4 1.008908 -0.869831 0.930992
Total explained variance: 0.797
Key Benefits of PCA
Dimensionality Reduction: Reduces dataset complexity while preserving important information.
Noise Reduction: Filters out less important features that may contain noise.
Visualization: Enables visualization of high-dimensional data in 2D or 3D.
Computational Efficiency: Faster training and prediction with fewer features.
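The visualization benefit is easy to demonstrate: projecting the 13-dimensional wine data onto its first two principal components gives 2D coordinates that can be fed straight into a scatter plot. A sketch (assumes matplotlib is installed; the Agg backend and output filename are arbitrary choices):

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Standardize the wine features and project onto the first two components
wine = datasets.load_wine()
X = StandardScaler().fit_transform(wine['data'])
X2 = PCA(n_components=2).fit_transform(X)

# Color points by wine class to see how well PCA separates them
plt.scatter(X2[:, 0], X2[:, 1], c=wine['target'], cmap='viridis', s=15)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Wine dataset projected onto 2 principal components')
plt.savefig('wine_pca.png')
```

Even though PCA is unsupervised and never sees the class labels, the three wine classes form visibly distinct clusters in this 2D projection.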
Conclusion
PCA is a powerful dimensionality-reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible. By eliminating redundant features it helps mitigate overfitting and speeds up training and prediction, making it a standard tool in efficient machine learning workflows.
