Python - Create Test DataSets using Sklearn

The scikit-learn (sklearn) Python library provides functions that generate sample datasets, which can then be plotted or fed to an algorithm. These datasets are useful for creating sample graphs and charts, seeing how a plot changes as the values change, and experimenting with settings such as colors and axes before working with actual datasets.

Using make_blobs

The make_blobs function generates isotropic Gaussian blobs for clustering. This is useful for testing clustering algorithms and creating scatter plots with distinct groups of data points.
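Before plotting, it can help to look at what make_blobs returns. The short sketch below assumes a fixed random_state (not used in the article's own example) so that the same points come back on every run. It prints the shape of the feature array, the shape of the label array, and the number of points per cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs

# random_state is an added assumption: it fixes the random draw for repeatability
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1.0,
                  n_features=2, random_state=42)

print(X.shape)         # (200, 2): one row per point, one column per feature
print(y.shape)         # (200,): one integer cluster label per point
print(np.bincount(y))  # [50 50 50 50]: points are split evenly across the centers
```

X holds the coordinates that are plotted, while y holds the cluster index (0 to 3) of each point.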

Example

In the example below, we use the sklearn library along with matplotlib to create a scatter plot with a specific style. We generate a sample of 200 data points around 4 cluster centers and set the color and size of the markers:

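from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from matplotlib import style

# Apply matplotlib's built-in "fast" style
style.use("fast")

# Generate 200 two-dimensional points around 4 randomly placed centers;
# y holds the cluster label of each point (not used in this plot)
X, y = make_blobs(n_samples=200, centers=4,
                  cluster_std=1, n_features=2)

# Plot the first feature against the second, with every point in red
plt.scatter(X[:, 0], X[:, 1], s=60, color='r')
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Sample Blobs Dataset")
plt.show()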

The output of the above code is:

[Figure: scatter plot titled "Sample Blobs Dataset", with axes X and Y, showing 200 red data points arranged in 4 clusters]
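Because every point above is drawn in red, the four groups can only be told apart by position. One possible variation, sketched here, passes the label array y to the c argument of scatter so that each cluster gets its own color. The random_state value and the "viridis" colormap are assumptions chosen for illustration.

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs

# random_state is an added assumption so the figure is repeatable
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1.0,
                  n_features=2, random_state=42)

fig, ax = plt.subplots()
# c=y maps each integer cluster label to a color from the chosen colormap
points = ax.scatter(X[:, 0], X[:, 1], s=60, c=y, cmap="viridis")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Sample Blobs Dataset (colored by cluster)")
```

Calling plt.show() afterwards displays the figure in the same way as the example above.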

Using make_circles

The make_circles function generates a dataset with two concentric circles. This is particularly useful for testing non-linear classification algorithms and creating circular patterns.
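The geometry of the two circles can be checked directly. The sketch below assumes noise=None (no noise) and factor=0.5, values picked only for illustration. With these settings the outer circle has radius 1 and the inner circle has a radius equal to factor.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=None places the points exactly on the two circles
X, y = make_circles(n_samples=100, factor=0.5, noise=None, random_state=0)

radii = np.linalg.norm(X, axis=1)  # distance of each point from the origin
outer = radii[y == 0]              # label 0 marks the outer circle
inner = radii[y == 1]              # label 1 marks the inner circle

print(len(outer), len(inner))      # 50 50
print(np.allclose(outer, 1.0))     # True: the outer circle has radius 1
print(np.allclose(inner, 0.5))     # True: the inner radius equals factor
```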

Example

As in the previous example, we use the make_circles function to generate 100 data points on two circles, add a small amount of noise, and plot them in blue:

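from sklearn.datasets import make_circles
from matplotlib import pyplot as plt
from matplotlib import style

# Apply matplotlib's built-in "fast" style
style.use("fast")

# Generate 100 points on two concentric circles with slight Gaussian noise;
# y marks the circle each point belongs to (not used in this plot)
X, y = make_circles(n_samples=100, noise=0.04)

# Plot the first feature against the second, with every point in blue
plt.scatter(X[:, 0], X[:, 1], s=40, color='b')
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Sample Circles Dataset")
plt.show()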

The output of the above code is:

[Figure: scatter plot titled "Sample Circles Dataset", with axes X and Y, showing 100 blue data points arranged in two concentric circles]
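To illustrate why concentric circles are a common test case for non-linear classifiers, the sketch below compares a linear model with a kernel-based one on such data. The choice of LogisticRegression and SVC, the train/test split, and all parameter values are assumptions made for this illustration, not part of the example above.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two noisy concentric circles; label 0 is the outer circle, label 1 the inner
X, y = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A linear decision boundary cannot separate one circle from the other
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# An RBF kernel can learn a closed, roughly circular boundary
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear model accuracy: {linear_acc:.2f}")
print(f"RBF SVM accuracy:      {rbf_acc:.2f}")
```

The linear model is expected to score close to chance level, while the RBF model should classify nearly all test points correctly.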

Common Parameters

The two functions share some parameters, while others apply to only one of them. The table below shows which function accepts each parameter:

Parameter    make_blobs   make_circles   Description
n_samples    Yes          Yes            Number of data points to generate
noise        No           Yes            Standard deviation of Gaussian noise added to the data
centers      Yes          No             Number of cluster centers, or their fixed locations
factor       No           Yes            Scale factor between the inner and outer circle (default 0.8)
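The following sketch shows these parameters in use. All values are illustrative assumptions: centers is given as explicit coordinates rather than a count, cluster_std (the make_blobs counterpart of noise) is set per cluster, and make_circles is called with a smaller factor and more noise than in the earlier example.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles

# centers can be an array of fixed locations instead of a number of clusters
centers = np.array([[0.0, 0.0], [6.0, 6.0]])
X_blobs, y_blobs = make_blobs(n_samples=1000, centers=centers,
                              cluster_std=[0.5, 1.5], random_state=1)

# The mean of each generated cluster should lie close to its requested center
cluster_means = np.array([X_blobs[y_blobs == k].mean(axis=0) for k in range(2)])
print(cluster_means)

# A smaller factor shrinks the inner circle; noise scatters points around both
X_circles, y_circles = make_circles(n_samples=300, factor=0.3,
                                    noise=0.1, random_state=1)
print(X_circles.shape)  # (300, 2)
```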

Conclusion

Sklearn's dataset generation functions, such as make_blobs and make_circles, are convenient tools for creating test datasets. They are useful for developing algorithms, testing visualizations, and teaching before working with real-world data.

Updated on: 2026-03-15T18:39:10+05:30
