Python - Create Test DataSets using Sklearn
The scikit-learn (sklearn) Python library provides functions that generate synthetic sample datasets, which can be used to create various plots. These datasets are useful for creating sample graphs and charts, observing how a plot changes as the values change, and experimenting with parameters such as colors and axes before using actual datasets.
Using make_blobs
The make_blobs function generates isotropic Gaussian blobs for clustering. This is useful for testing clustering algorithms and creating scatter plots with distinct groups of data points.
Example
In the example below, we use sklearn together with matplotlib to create a scatter plot with a specific style. We generate 200 data points around 4 cluster centers and set the marker color and size:
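from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from matplotlib import style

# Apply matplotlib's built-in "fast" style
style.use("fast")

# Generate 200 two-dimensional points around 4 cluster centers
X, y = make_blobs(n_samples=200, centers=4,
                  cluster_std=1, n_features=2)

# Plot the first feature against the second as red markers of size 60
plt.scatter(X[:, 0], X[:, 1], s=60, color='r')
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Sample Blobs Dataset")
plt.show()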
The output of the above code is:
A scatter plot showing 200 red data points arranged in 4 distinct clusters
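The plot above draws every point in the same color, so the labels returned in y go unused. The following is a minimal sketch of how the generated arrays could be inspected; random_state=42 is an assumption added here (it is not in the original example) so that repeated runs produce the same data.

```python
import numpy as np
from sklearn.datasets import make_blobs

# random_state=42 is an illustrative assumption that makes the output reproducible
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1,
                  n_features=2, random_state=42)

# X holds the point coordinates, y holds the cluster index of each point
print(X.shape)         # one row per sample, one column per feature
print(np.unique(y))    # the distinct cluster labels
print(np.bincount(y))  # number of points assigned to each cluster
```

Passing c=y instead of color='r' to plt.scatter would color each point by its cluster label, which makes the separate groups easier to see.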
Using make_circles
The make_circles function generates a dataset with two concentric circles. This is particularly useful for testing non-linear classification algorithms and creating circular patterns.
Example
Similar to the above approach, we use the make_circles function to generate 100 data points and plot them in blue:
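from sklearn.datasets import make_circles
from matplotlib import pyplot as plt
from matplotlib import style

# Apply matplotlib's built-in "fast" style
style.use("fast")

# Generate 100 points on two concentric circles with a little Gaussian noise
X, y = make_circles(n_samples=100, noise=0.04)

# Plot the first feature against the second as blue markers of size 40
plt.scatter(X[:, 0], X[:, 1], s=40, color='b')
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Sample Circles Dataset")
plt.show()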
The output of the above code is:
A scatter plot showing 100 blue data points arranged in two concentric circles
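To see how the two circles relate to the returned labels, the sketch below generates noise-free circles and measures the radius of each one. The values factor=0.5 and random_state=0 are illustrative assumptions, not taken from the example above.

```python
import numpy as np
from sklearn.datasets import make_circles

# No noise is added here, so every point lies exactly on one of the two circles.
# factor=0.5 and random_state=0 are illustrative choices.
X, y = make_circles(n_samples=100, factor=0.5, random_state=0)

# Distance of every point from the origin
radii = np.linalg.norm(X, axis=1)

# Label 0 marks the outer circle, label 1 marks the inner circle
outer_radius = radii[y == 0].mean()
inner_radius = radii[y == 1].mean()
print(outer_radius, inner_radius)
```

The ratio of the inner radius to the outer radius equals the factor argument, so a smaller factor moves the inner circle further away from the outer one.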
Common Parameters
Several parameters customize the generated datasets. Of the parameters listed here, only n_samples is accepted by both functions; the others are specific to one of them:
| Parameter | make_blobs | make_circles | Description |
|---|---|---|---|
| n_samples | Yes | Yes | Number of data points to generate |
| noise | No | Yes | Standard deviation of Gaussian noise added to the data |
| centers | Yes | No | Number of cluster centers, or their fixed locations |
| factor | No | Yes | Scale factor between inner and outer circle |
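As a rough illustration of how these parameters shape the data, the sketch below passes explicit center coordinates to make_blobs instead of a count and tightens the spread with cluster_std. The coordinates, the cluster_std value, and random_state are hypothetical choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical center coordinates chosen for illustration
centers = [[0, 0], [5, 5], [-5, 5]]

# cluster_std controls the spread around each center; random_state fixes the draw
X, y = make_blobs(n_samples=300, centers=centers,
                  cluster_std=0.5, random_state=1)

# The mean of each generated cluster should lie close to its requested center
cluster_means = np.array([X[y == k].mean(axis=0) for k in range(len(centers))])
print(cluster_means.round(1))
```

Fixing the centers this way can make test plots easier to compare between runs, because the clusters always appear in the same regions of the chart.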
Conclusion
Sklearn's dataset generation functions such as make_blobs and make_circles are convenient tools for creating test datasets. They help with developing algorithms, testing visualizations, and teaching before working with real-world data.
