How to generate random regression problems using Python Scikit-learn?

Python Scikit-learn provides the make_regression() function to generate random regression datasets for testing and learning purposes. This tutorial demonstrates how to create both basic regression problems and sparse uncorrelated regression datasets.

Basic Random Regression Problem

The make_regression() function creates a random regression dataset with specified parameters. Here's how to generate a simple regression problem:

# Importing necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Create scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.7)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Random Regression Problem')
plt.show()

The output shows a scatter plot with a linear relationship between the feature and target values:

[Displays a scatter plot with points following a linear trend with added noise]

Key Parameters of make_regression()

Understanding the important parameters helps control the generated dataset:

from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate datasets with different parameters
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Low noise
X1, y1 = make_regression(n_samples=100, n_features=1, noise=5, random_state=42)
axes[0].scatter(X1, y1, alpha=0.7)
axes[0].set_title('Low Noise (noise=5)')

# High noise  
X2, y2 = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
axes[1].scatter(X2, y2, alpha=0.7)
axes[1].set_title('High Noise (noise=20)')

# Multiple informative features
X3, y3 = make_regression(n_samples=100, n_features=2, n_informative=2, random_state=42)
axes[2].scatter(X3[:, 0], y3, alpha=0.7)
axes[2].set_title('Multiple Features')

plt.tight_layout()
plt.show()
[Displays three scatter plots showing the effect of different noise levels and features]
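make_regression() can also return the ground-truth coefficients of the underlying linear model by passing coef=True, which is handy for checking how well an estimator recovers them. A minimal sketch:

```python
from sklearn.datasets import make_regression
import numpy as np

# Request the true coefficients of the generating linear model
X, y, coef = make_regression(
    n_samples=100,
    n_features=5,
    n_informative=3,   # only 3 of the 5 features carry signal
    noise=10,
    coef=True,
    random_state=42,
)

print("True coefficients:", np.round(coef, 2))
# Non-informative features have a true coefficient of exactly 0
print("Informative features:", np.flatnonzero(coef))
```

The two zero entries in coef mark the features that do not influence the target at all.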

Sparse Uncorrelated Regression Problem

The make_sparse_uncorrelated() function creates datasets where only the first four features are informative; any additional features are pure noise:

from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt
import numpy as np

# Generate sparse uncorrelated dataset (features beyond the first four are noise)
X, y = make_sparse_uncorrelated(n_samples=200, n_features=10, random_state=42)

# Compare an informative feature with a non-informative one
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], y, alpha=0.7)
plt.xlabel('Feature 1 (informative)')
plt.ylabel('Target')
plt.title('Feature 1 vs Target')

plt.subplot(1, 2, 2)
plt.scatter(X[:, 4], y, alpha=0.7)
plt.xlabel('Feature 5 (noise)')
plt.ylabel('Target')
plt.title('Feature 5 vs Target')

plt.tight_layout()
plt.show()

# Show feature statistics
print("Feature means:", np.round(np.mean(X, axis=0), 2))
print("Feature correlations with target:")
for i in range(X.shape[1]):
    corr = np.corrcoef(X[:, i], y)[0, 1]
    print(f"Feature {i+1}: {corr:.3f}")
[Displays two scatter plots and prints feature statistics: the first four features show clear correlations with the target, while the correlations of the remaining features are close to zero]
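Under the hood, per the scikit-learn documentation, make_sparse_uncorrelated() draws standard-normal features and builds the target as X1 + 2·X2 − 2·X3 − 1.5·X4 plus unit-variance Gaussian noise, so the sparsity pattern can be checked directly. A quick sanity check of that documented formula:

```python
from sklearn.datasets import make_sparse_uncorrelated
import numpy as np

# Larger sample so the empirical statistics are stable
X, y = make_sparse_uncorrelated(n_samples=2000, n_features=10, random_state=0)

# Reconstruct the noiseless signal from the documented coefficients
signal = X[:, :4] @ np.array([1.0, 2.0, -2.0, -1.5])
residual = y - signal

# The residual should look like unit-variance Gaussian noise
print("Residual std:", residual.std())
print("Signal/target correlation:", np.corrcoef(signal, y)[0, 1])
```

Features 5 through 10 never enter the formula, which is why their correlations with the target hover around zero.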

Comparison of Methods

Function                   | Purpose                     | Key Feature                                  | Best For
make_regression()          | General regression datasets | Customizable noise and features              | Algorithm testing and learning
make_sparse_uncorrelated() | Sparse feature datasets     | Only the first four features are informative | Feature selection testing
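Because only the first four features of make_sparse_uncorrelated() carry signal, it pairs naturally with a sparsity-inducing estimator. A rough sketch using Lasso (the alpha value here is chosen by hand, not tuned):

```python
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import Lasso
import numpy as np

X, y = make_sparse_uncorrelated(n_samples=500, n_features=10, random_state=42)

# L1 regularization drives coefficients of uninformative features toward zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
selected = np.flatnonzero(np.abs(lasso.coef_) > 0.1)
print("Selected features (0-indexed):", selected)
```

The first four coefficients stay large while the rest shrink to (near) zero, which is exactly the behavior a feature-selection benchmark needs.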

Practical Example with Model Training

Here's how to use generated data for actual machine learning:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Model Score: {model.score(X_test, y_test):.3f}")
Mean Squared Error: 98.45
Model Score: 0.999
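The test MSE lands near noise squared (10² = 100) because the noise term sets a floor on how well any model can predict the noisy targets. A quick check of that relationship, with noise levels picked arbitrarily for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def fit_and_score(noise):
    # Generate, split, fit, and return the test MSE for a given noise level
    X, y = make_regression(n_samples=1000, n_features=5, noise=noise, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

for noise in (5, 10, 20):
    print(f"noise={noise}: test MSE ~ {fit_and_score(noise):.1f}")
```

Each doubling of the noise parameter roughly quadruples the test MSE, since MSE tracks the noise variance rather than the noise standard deviation.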

Conclusion

Scikit-learn's regression generators are essential tools for creating synthetic datasets. Use make_regression() for general testing and make_sparse_uncorrelated() when you need datasets with only a few informative features. These functions are invaluable for algorithm development and educational purposes.

Updated on: 2026-03-26T22:11:55+05:30
