How to transform the sklearn DIGITS dataset into 2- and 3-feature datasets in Python?
The sklearn DIGITS dataset contains 64 features, since each handwritten digit image is 8×8 pixels. We can use Principal Component Analysis (PCA) to reduce dimensionality and transform this dataset into a 2- or 3-feature dataset. While this significantly reduces the data size, it also discards some information and may impact ML model accuracy.
Transform DIGITS Dataset to 2 Features
We can reduce the 64-dimensional DIGITS dataset to 2 dimensions using PCA. This creates a simplified representation suitable for visualization and faster processing −
# Import necessary packages
from sklearn import datasets
from sklearn.decomposition import PCA
# Load DIGITS dataset
digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
print('Original DIGITS Dataset Size:', X_digits.shape, y_digits.shape)
# Initialize PCA with 2 components
pca_2 = PCA(n_components=2)
pca_2.fit(X_digits)
# Transform to 2 dimensions
X_digits_2d = pca_2.transform(X_digits)
print('New Dataset size after PCA transformation:', X_digits_2d.shape)
# Check explained variance ratio
print('Explained variance ratio:', pca_2.explained_variance_ratio_)
print('Total variance explained:', sum(pca_2.explained_variance_ratio_))
Original DIGITS Dataset Size: (1797, 64) (1797,)
New Dataset size after PCA transformation: (1797, 2)
Explained variance ratio: [0.14890594 0.13618771]
Total variance explained: 0.28509365061189297
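To see how much pixel information the 2-component projection discards, you can map the reduced data back into the original 64-dimensional space with inverse_transform and measure the reconstruction error. A minimal sketch (the mean-squared-error metric is just one illustrative way to quantify the loss) −

```python
# Reconstruct the 64-dimensional images from the 2 PCA components
# and measure how much pixel information was lost.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()
X = digits.data

pca_2 = PCA(n_components=2)
X_digits_2d = pca_2.fit_transform(X)

# Map the 2-feature data back into the original 64-feature space
X_reconstructed = pca_2.inverse_transform(X_digits_2d)
print('Reconstructed shape:', X_reconstructed.shape)

# Mean squared reconstruction error across all pixels;
# a nonzero value reflects the variance the 2 components cannot capture
mse = np.mean((X - X_reconstructed) ** 2)
print('Reconstruction MSE with 2 components:', mse)
```

The reconstruction is only an approximation of the original images, which is the flip side of keeping just ~28.5% of the variance.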
Transform DIGITS Dataset with Limited Classes
You can also transform a subset of the DIGITS dataset by loading only specific digit classes. This is useful when working with fewer categories −
# Import necessary packages
from sklearn import datasets
from sklearn.decomposition import PCA
# Load DIGITS dataset with only first 6 classes (digits 0-5)
digits_6 = datasets.load_digits(n_class=6)
X_digits_6, y_digits_6 = digits_6.data, digits_6.target
print('DIGITS Dataset Size (6 classes):', X_digits_6.shape, y_digits_6.shape)
# Apply PCA transformation
pca_2 = PCA(n_components=2)
X_digits_6_2d = pca_2.fit_transform(X_digits_6)
print('Transformed Dataset size:', X_digits_6_2d.shape)
DIGITS Dataset Size (6 classes): (1083, 64) (1083,)
Transformed Dataset size: (1083, 2)
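Since a main use of the 2-feature version is visualization, you can plot the 6-class subset in the PCA plane and color each point by its digit label. A sketch assuming matplotlib is installed (the figure size, colormap, and output filename are illustrative choices) −

```python
# Visualize the 6-class DIGITS subset in 2D PCA space,
# coloring each point by its digit label (assumes matplotlib is available).
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the plot saves to a file
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

digits_6 = datasets.load_digits(n_class=6)
X_2d = PCA(n_components=2).fit_transform(digits_6.data)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=digits_6.target, cmap='tab10', s=15)
plt.colorbar(scatter, label='Digit class')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('DIGITS (classes 0-5) projected to 2 PCA components')
plt.savefig('digits_pca_2d.png')
```

Points belonging to the same digit tend to cluster together in this plane, which is why 2 components are often enough for a quick visual inspection even though they explain well under half of the variance.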
Transform DIGITS Dataset to 3 Features
A 3-dimensional transformation provides more information retention compared to 2D while still achieving significant dimensionality reduction −
# Import necessary packages
from sklearn import datasets
from sklearn.decomposition import PCA
# Load DIGITS dataset
digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
print('Original DIGITS Dataset Size:', X_digits.shape, y_digits.shape)
# Initialize PCA with 3 components
pca_3 = PCA(n_components=3)
X_digits_3d = pca_3.fit_transform(X_digits)
print('New Dataset size after PCA transformation:', X_digits_3d.shape)
# Compare variance explained
print('Explained variance ratio (3D):', pca_3.explained_variance_ratio_)
print('Total variance explained:', sum(pca_3.explained_variance_ratio_))
Original DIGITS Dataset Size: (1797, 64) (1797,)
New Dataset size after PCA transformation: (1797, 3)
Explained variance ratio (3D): [0.14890594 0.13618771 0.11794594]
Total variance explained: 0.4030395874652833
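Rather than checking 2 and 3 components one at a time, you can fit PCA once with all components and read the cumulative explained variance, which also tells you how many components a given retention target needs. A sketch (the 90% target is an example threshold, not a recommendation from sklearn) −

```python
# Compute cumulative explained variance across all 64 components
# to see how many are needed for a given retention target.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

pca_full = PCA()  # n_components defaults to keeping all components
pca_full.fit(X)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print('Variance retained by 2 components: %.3f' % cumulative[1])
print('Variance retained by 3 components: %.3f' % cumulative[2])

# Smallest number of components that retains at least 90% of the variance
n_90 = int(np.argmax(cumulative >= 0.90)) + 1
print('Components needed for 90%% variance:', n_90)
```

The first two entries reproduce the ~28.5% and ~40.3% figures from the runs above; the 90% threshold shows that far fewer than 64 components already capture most of the dataset's structure.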
Comparison of Dimensionality Reduction
| Components | Dataset Shape | Variance Explained | Use Case |
|---|---|---|---|
| 2 | (1797, 2) | ~28.5% | 2D visualization, simple models |
| 3 | (1797, 3) | ~40.3% | 3D visualization, balanced reduction |
| 64 (original) | (1797, 64) | 100% | Full information, complex models |
Conclusion
PCA effectively reduces the DIGITS dataset from 64 to 2 or 3 features while preserving the most important variance. Use 2D for visualization and simple models, or 3D when you need slightly better information retention with manageable complexity.
