How can the scikit-learn library be used to split a dataset into training and testing sets in Python?

Scikit-learn, commonly imported as sklearn, is a powerful Python library for implementing machine learning algorithms. It provides a wide variety of tools for statistical modeling, including classification, regression, clustering, and dimensionality reduction, and is built on top of NumPy and SciPy.

Before training a machine learning model, the dataset must be split into training and testing portions. The training set is used to teach the model patterns in the data, while the test set evaluates how well the model generalizes to unseen data.

What is train_test_split?

The train_test_split function from sklearn.model_selection randomly divides your dataset into training and testing subsets. This ensures unbiased evaluation of model performance.

Syntax

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameters

Parameter      Description                                        Default
test_size      Proportion of the dataset for testing (0.0 to 1.0) None (treated as 0.25 when train_size is also None)
random_state   Seed for reproducible shuffling                    None
shuffle        Whether to shuffle the data before splitting       True
stratify       Array (typically y) used to preserve class proportions in both splits   None
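The default behavior described in the table can be verified with a quick sketch: when neither test_size nor train_size is given, 25% of the samples go to the test set. The array contents here are arbitrary placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 samples with 2 features each, plus matching labels
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# No test_size given: the default reserves 25% (5 of 20) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))  # 15 5
```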

Basic Example

Here's how to split the Iris dataset using train_test_split −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
y = iris_data.target

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training targets shape:", y_train.shape)
print("Testing targets shape:", y_test.shape)
Output

Training features shape: (120, 4)
Testing features shape: (30, 4)
Training targets shape: (120,)
Testing targets shape: (30,)
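The same 80/20 division can also be expressed with the train_size parameter instead of test_size; whichever you specify, the remaining samples automatically form the other subset. A minimal sketch on the same Iris data −

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

# train_size=0.8 is equivalent to test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```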

Stratified Splitting

For classification tasks, use stratified splitting to maintain the class distribution in both subsets −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
iris_data = load_iris()
X, y = iris_data.data, iris_data.target

# Stratified split maintains class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Check class distribution
print("Original class distribution:", np.bincount(y))
print("Training class distribution:", np.bincount(y_train))
print("Testing class distribution:", np.bincount(y_test))
Output

Original class distribution: [50 50 50]
Training class distribution: [35 35 35]
Testing class distribution: [15 15 15]
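The shuffle parameter from the table above also deserves a quick illustration. For ordered data such as time series, shuffling before splitting would leak future samples into the training set, so shuffle=False keeps the test set as the final portion of the sequence. A small sketch with placeholder data −

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten ordered samples, e.g. consecutive time steps
X = np.arange(10).reshape(10, 1)

# shuffle=False preserves order: the test set is the last 30% of rows
X_train, X_test = train_test_split(X, test_size=0.3, shuffle=False)
print(X_train.ravel())  # [0 1 2 3 4 5 6]
print(X_test.ravel())   # [7 8 9]
```

Note that stratify cannot be combined with shuffle=False, since stratification requires shuffling.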

Custom Dataset Example

You can also split custom datasets −

import numpy as np
from sklearn.model_selection import train_test_split

# Create sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 1, 0, 1, 0, 1])

# Split with different test size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=123
)

print("Training data:")
print("Features:", X_train)
print("Labels:", y_train)
print("\nTesting data:")
print("Features:", X_test)
print("Labels:", y_test)
Output

Training data:
Features: [[11 12]
 [ 1  2]
 [ 3  4]
 [ 9 10]]
Labels: [1 0 1 0]

Testing data:
Features: [[7 8]
 [5 6]]
Labels: [1 0]
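As the `*arrays` part of the signature suggests, train_test_split accepts any number of same-length arrays and splits them all with the same row indices. A minimal sketch extending the custom dataset above with a hypothetical per-sample weight array −

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 1, 0, 1, 0, 1])
# Hypothetical per-sample weights, split consistently with X and y
weights = np.array([0.5, 1.0, 0.5, 1.0, 0.5, 1.0])

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, weights, test_size=0.33, random_state=123
)
print(len(X_tr), len(w_tr))  # 4 4
```

Each passed array contributes one train/test pair to the returned tuple, in the order the arrays were given.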

Best Practices

  • Set random_state − Ensures reproducible results across runs
  • Use stratify − For classification with imbalanced classes
  • Common split ratios − 80/20, 70/30, or 60/40 depending on dataset size
  • Larger datasets − Can use smaller test percentages (5-10%)
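Beyond a simple two-way split, model tuning often calls for a separate validation set. Two chained train_test_split calls produce a train/validation/test split; the sketch below carves a roughly 60/20/20 division out of the Iris data, with stratification at both steps −

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

# First call: hold out 20% for the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second call: 25% of the remaining 80% = 20% of the whole dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```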

Conclusion

The train_test_split function is essential for proper model evaluation in machine learning. Use stratified splitting for classification tasks and always set a random_state for reproducible results.

Updated on: 2026-03-25T13:20:03+05:30
