How to split a Dataset into Train sets and Test sets in Python?
In this tutorial, we will learn how to split a dataset into train sets and test sets using Python. This is a fundamental preprocessing step in machine learning that helps build robust models.
Why Split Datasets?
When creating machine learning models, we need to evaluate their performance on unseen data. Common problems include overfitting (model performs well on training data but fails on new data) and underfitting (model performs poorly on both training and new data).
Splitting the dataset gives us:
- Train set: used to fit the model (typically 70-80% of the data)
- Test set: used to evaluate model performance on unseen data (typically 20-30% of the data)
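A quick sketch of why the split matters, using made-up random data: a deep decision tree can memorise random training labels perfectly, but the held-out test set exposes that it has learned nothing generalisable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Purely random labels: there is no real pattern to learn
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # 1.0 (memorised)
print("test accuracy:", model.score(X_test, y_test))     # typically near chance (~0.5)
```

The large gap between train and test accuracy is the signature of overfitting; without a test set, the perfect training score would look like a good model.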
Method 1: Using sklearn's train_test_split
The most popular and recommended approach uses scikit-learn's built-in function:
import numpy as np
from sklearn.model_selection import train_test_split
# Create sample data
X = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
# Split into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
Shape of X_train: (7, 5)
Shape of X_test: (3, 5)
Shape of y_train: (7,)
Shape of y_test: (3,)
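When the classes are imbalanced, a plain random split can leave one split with too few minority samples. The comparison table below mentions stratification; here is a small sketch of the stratify parameter on made-up data with an 80/20 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])  # imbalanced: 8 zeros, 2 ones

# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print("train class counts:", np.bincount(y_train))  # [4 1]
print("test class counts:", np.bincount(y_test))    # [4 1]
```

Each split keeps the original 4:1 ratio, so the minority class is represented in both the train and test sets regardless of the random seed.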
Method 2: Using NumPy Indexing
For simple cases, you can split manually using array slicing:
import numpy as np
# Create sample data
X = np.random.rand(100, 5)
y = np.random.rand(100, 1)
# Split manually (80% train, 20% test)
split_index = int(0.8 * len(X))
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
Shape of X_train: (80, 5)
Shape of X_test: (20, 5)
Shape of y_train: (80, 1)
Shape of y_test: (20, 1)
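Note that slicing does not shuffle: if the rows are ordered (say, by class or by date), the test set will not be representative. A sketch of a shuffled manual split, permuting indices first so X and y stay aligned:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(100, 1)  # ordered data, for illustration
y = np.arange(100)

# Shuffle the row indices, then slice the shuffled order
indices = rng.permutation(len(X))
split_index = int(0.8 * len(X))
train_idx, test_idx = indices[:split_index], indices[split_index:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```

Indexing both arrays with the same permutation guarantees each X row keeps its matching y value, which shuffling X and y separately would break.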
Method 3: Using Pandas Sample
For DataFrames, use the pandas sample() method:
import pandas as pd
import numpy as np
# Create sample DataFrame
data = np.random.randint(10, 25, size=(100, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
# Split using sample (80% train)
train_df = df.sample(frac=0.8, random_state=42)
test_df = df[~df.index.isin(train_df.index)]
print("Original dataset shape:", df.shape)
print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)
Original dataset shape: (100, 3)
Train dataset shape: (80, 3)
Test dataset shape: (20, 3)
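An equivalent, slightly terser way to take the complement rows is df.drop, which works whenever the DataFrame index is unique (as with the default RangeIndex here):

```python
import numpy as np
import pandas as pd

data = np.random.randint(10, 25, size=(100, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)  # everything not sampled into train

print("Train dataset shape:", train_df.shape)  # (80, 3)
print("Test dataset shape:", test_df.shape)    # (20, 3)
```

Because sample() and drop() operate on the same index labels, the two DataFrames are guaranteed to be disjoint and to cover every row exactly once.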
Comparison
| Method | Best For | Advantages | Disadvantages |
|---|---|---|---|
| train_test_split | Most ML projects | Built-in stratification, random state | Requires sklearn |
| NumPy indexing | Simple arrays | No dependencies, fast | No shuffling by default |
| Pandas sample | DataFrames | Maintains DataFrame structure | Requires pandas |
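Many projects also need a separate validation set for tuning hyperparameters. A sketch of a three-way 60/20/20 split using two calls to train_test_split (the fractions here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off the test set (20% of the whole)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder: 0.25 of the remaining 80% = 20% of the whole
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second test_size must be expressed as a fraction of what remains after the first split, which is why 20% of the original becomes 0.25 here.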
Conclusion
Use train_test_split from sklearn for most machine learning projects, as it provides shuffling, stratification, and reproducible random states out of the box. This preprocessing step is crucial for building reliable models that generalize well to unseen data.
