How to implement Random Projection using Python Scikit-learn?
Random projection is a dimensionality reduction technique that simplifies high-dimensional data by projecting it onto a lower-dimensional space using random matrices. It is particularly useful when traditional methods such as Principal Component Analysis (PCA) become computationally prohibitive on very high-dimensional data.
Python Scikit-learn provides the sklearn.random_projection module, which implements two types of random projection matrices:
- Gaussian Random Matrix: uses normally distributed random values
- Sparse Random Matrix: uses mostly zero values with occasional +1 or -1 entries
Gaussian Random Projection
The GaussianRandomProjection class reduces dimensionality by projecting data onto a randomly generated matrix with Gaussian-distributed elements. This method preserves pairwise distances approximately according to the Johnson-Lindenstrauss lemma.
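The lemma also dictates how many output dimensions are needed for a given distortion tolerance. Scikit-learn exposes this bound as johnson_lindenstrauss_min_dim, which both projection classes use internally when n_components='auto' (the default, with eps=0.1). A quick sketch of the bound:

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum safe target dimension for 100 samples at 10% distortion (eps=0.1)
print(johnson_lindenstrauss_min_dim(n_samples=100, eps=0.1))  # 3947

# The bound depends only on the number of samples, not the original dimensionality
print(johnson_lindenstrauss_min_dim(n_samples=25, eps=0.1))   # 2759

# Loosening the distortion tolerance shrinks the target dimension sharply
print(johnson_lindenstrauss_min_dim(n_samples=100, eps=0.5))  # 221
```

This is why the examples below, which leave n_components at its 'auto' default, end up with 3947 and 2759 output dimensions respectively.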
Example
Let's implement Gaussian random projection and visualize the transformation matrix:
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
import matplotlib.pyplot as plt
# Create random high-dimensional data
X_random = np.random.RandomState(0).rand(100, 10000)
# Apply Gaussian random projection
gauss_transformer = GaussianRandomProjection(random_state=0)
X_transformed = gauss_transformer.fit_transform(X_random)
print(f'Original shape: {X_random.shape}')
print(f'Transformed shape: {X_transformed.shape}')
# Visualize the transformation matrix elements
plt.figure(figsize=(8, 4))
plt.hist(gauss_transformer.components_.flatten(), bins=50)
plt.title('Distribution of Gaussian Random Projection Matrix Elements')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Original shape: (100, 10000)
Transformed shape: (100, 3947)
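To see the Johnson-Lindenstrauss guarantee in action, we can compare pairwise distances before and after projection. A minimal sketch, using smaller sizes chosen here purely for speed and an explicit n_components:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(50, 2000)  # 50 samples in 2000 dimensions

# Project down to 1000 dimensions explicitly
transformer = GaussianRandomProjection(n_components=1000, random_state=0)
X_proj = transformer.fit_transform(X)

# Ratio of projected to original distance for every pair of samples
ratios = pdist(X_proj) / pdist(X)
print(f'Mean ratio: {ratios.mean():.3f}')
print(f'Min/max ratio: {ratios.min():.3f} / {ratios.max():.3f}')
```

All ratios come out close to 1.0, confirming that pairwise distances survive the projection with only small distortion.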
Sparse Random Projection
The SparseRandomProjection class uses sparse matrices with mostly zero values and occasional +1 or -1 entries. This approach is more memory-efficient and computationally faster than Gaussian projection.
Example
Let's implement sparse random projection and analyze the sparsity pattern:
import numpy as np
from sklearn.random_projection import SparseRandomProjection
import matplotlib.pyplot as plt
# Create random data
rng = np.random.RandomState(42)
X_data = rng.rand(25, 3000)
# Apply sparse random projection
sparse_transformer = SparseRandomProjection(random_state=0)
X_transformed = sparse_transformer.fit_transform(X_data)
print(f'Original shape: {X_data.shape}')
print(f'Transformed shape: {X_transformed.shape}')
print(f'Transformation matrix shape: {sparse_transformer.components_.shape}')
print(f'Matrix density: {sparse_transformer.density_:.4f}')
# Analyze the sparse matrix structure
components_data = sparse_transformer.components_.data
total_elements = sparse_transformer.components_.shape[0] * sparse_transformer.components_.shape[1]
# Count positive, negative, and zero elements
positive_count = int((components_data > 0).sum())
negative_count = int((components_data < 0).sum())
zero_count = total_elements - len(components_data)
print(f'Positive values: {positive_count}')
print(f'Negative values: {negative_count}')
print(f'Zero values: {zero_count}')
Original shape: (25, 3000)
Transformed shape: (25, 2759)
Transformation matrix shape: (2759, 3000)
Matrix density: 0.0183
Positive values: ≈75,500
Negative values: ≈75,500
Zero values: ≈8,126,000
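The reported density is not arbitrary: with the default density='auto', scikit-learn uses the minimum density recommended by Ping Li et al., 1/sqrt(n_features). A short sketch confirming this relationship:

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

X = np.random.RandomState(42).rand(25, 3000)
transformer = SparseRandomProjection(random_state=0)
transformer.fit(X)

# With density='auto', density_ is set to 1 / sqrt(n_features)
expected = 1 / np.sqrt(3000)
print(f'density_:            {transformer.density_:.4f}')
print(f'1/sqrt(n_features):  {expected:.4f}')
```

Because only about 1.8% of the matrix entries are nonzero, the sparse matrix product is both faster and far cheaper to store than its dense Gaussian counterpart.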
Key Differences
| Aspect | Gaussian Random Projection | Sparse Random Projection |
|---|---|---|
| Matrix Elements | Gaussian distributed values | Mostly zeros with ±1 |
| Memory Usage | Higher | Lower (sparse storage) |
| Computation Speed | Slower | Faster |
| Best For | Dense data, better preservation | Large datasets, efficiency |
Choosing the Right Method
Use Gaussian Random Projection when you need better distance preservation and have sufficient computational resources. Choose Sparse Random Projection for large datasets where memory efficiency and speed are priorities.
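In either case, the eps parameter (default 0.1) controls the distortion tolerance, and therefore how many output dimensions are chosen when n_components='auto'. A sketch of the trade-off, reusing the data shape from the Gaussian example:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(100, 10000)

# A looser distortion tolerance permits far more aggressive reduction
for eps in (0.1, 0.25, 0.5):
    transformer = GaussianRandomProjection(eps=eps, random_state=0)
    transformer.fit(X)
    print(f'eps={eps}: {transformer.n_components_} components')
# eps=0.1:  3947 components
# eps=0.25: 707 components
# eps=0.5:  221 components
```

Raising eps from 0.1 to 0.5 cuts the output dimensionality by more than an order of magnitude, at the cost of allowing up to ~50% distortion of pairwise distances.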
Conclusion
Random projection offers an efficient alternative to PCA for dimensionality reduction. Gaussian projection provides slightly tighter distance preservation, while sparse projection excels in memory and computational efficiency, making it the practical choice for large-scale applications.
