How to Split Data into Training and Testing in Python without Sklearn

Splitting data into training and testing sets is a fundamental step in machine learning. While scikit-learn's train_test_split() is commonly used, understanding how to split data manually helps you grasp the underlying concepts and provides flexibility when external libraries aren't available.

Why Split Data?

Machine learning models learn patterns from training data. To evaluate how well they generalize to new, unseen data, we need a separate testing set. Using the same data for both training and testing gives a misleadingly optimistic evaluation: the model can simply memorize the training examples and still fail on data it has never seen.

The typical split ratios are 80-20 or 70-30, where the larger portion is used for training.
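For example, the split index is just the ratio multiplied by the dataset size, truncated to an integer:

```python
# Sizes produced by the common split ratios on a 100-sample dataset
n_samples = 100

for ratio in (0.8, 0.7):
    n_train = int(ratio * n_samples)   # truncate to a whole number of samples
    n_test = n_samples - n_train
    print(f"{ratio:.0%} training -> {n_train} train / {n_test} test")
```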

Method 1: Using Python's Built-in Functions

The simplest approach uses Python's random module to shuffle the data and basic slicing to split it:

import random

# Create sample data
data = list(range(1, 101))  # Numbers 1 to 100
print(f"Original data length: {len(data)}")

# Shuffle the data randomly
random.shuffle(data)

# Define split ratio (80% training, 20% testing)
split_ratio = 0.8
split_index = int(split_ratio * len(data))

# Split the data
train_data = data[:split_index]
test_data = data[split_index:]

print(f"Training set size: {len(train_data)}")
print(f"Testing set size: {len(test_data)}")
print(f"First 10 training samples: {train_data[:10]}")
print(f"First 10 testing samples: {test_data[:10]}")
Output

Original data length: 100
Training set size: 80
Testing set size: 20
First 10 training samples: [65, 51, 8, 82, 15, 32, 11, 74, 89, 29]
First 10 testing samples: [45, 53, 48, 16, 9, 62, 13, 81, 92, 54]
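Because random.shuffle() reorders the data differently on every run, the samples printed above will vary. If you need a reproducible split, for example to compare two models fairly, seed the generator first. A minimal sketch:

```python
import random

data = list(range(1, 101))

random.seed(42)  # fixed seed: same shuffle, and therefore same split, every run
random.shuffle(data)

split_index = int(0.8 * len(data))
train_data = data[:split_index]
test_data = data[split_index:]

print(f"Training set size: {len(train_data)}")
print(f"Testing set size: {len(test_data)}")
```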

Method 2: Using NumPy Arrays

NumPy provides efficient array operations and random shuffling for numerical data:

import numpy as np

# Create sample data as NumPy array
data = np.arange(1, 101)  # Numbers 1 to 100
print(f"Original data shape: {data.shape}")

# Shuffle the array in-place
np.random.shuffle(data)

# Define split ratio
split_ratio = 0.8
split_index = int(split_ratio * len(data))

# Split the data
train_data = data[:split_index]
test_data = data[split_index:]

print(f"Training set shape: {train_data.shape}")
print(f"Testing set shape: {test_data.shape}")
print(f"Training set: {train_data[:10]}")
print(f"Testing set: {test_data[:10]}")
Output

Original data shape: (100,)
Training set shape: (80,)
Testing set shape: (20,)
Training set: [52 13 87 68 48  4 34  9 74 25]
Testing set: [49 66  7 58 37 98 24  6 55 28]
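Note that np.random.shuffle() mutates the array in place. An alternative, using NumPy's Generator API, is to shuffle an array of indices instead; this leaves the original data untouched and lets you apply the same ordering to several arrays at once:

```python
import numpy as np

data = np.arange(1, 101)

rng = np.random.default_rng(seed=0)   # seeded generator for reproducibility
indices = rng.permutation(len(data))  # shuffled index array; data is untouched

split_index = int(0.8 * len(data))
train_data = data[indices[:split_index]]
test_data = data[indices[split_index:]]

print(train_data.shape, test_data.shape)  # (80,) (20,)
```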

Method 3: Splitting Features and Labels

In real machine learning scenarios, you typically have features (X) and labels (y) that need to be split together:

import random

# Sample dataset with features and labels
features = [[i, i*2, i*3] for i in range(1, 51)]  # 50 samples with 3 features each
labels = [1 if i % 2 == 0 else 0 for i in range(1, 51)]  # Binary labels

print(f"Features shape: {len(features)} x {len(features[0])}")
print(f"Labels length: {len(labels)}")

# Create combined dataset for shuffling
combined = list(zip(features, labels))
random.shuffle(combined)

# Split the combined dataset
split_ratio = 0.8
split_index = int(split_ratio * len(combined))

train_combined = combined[:split_index]
test_combined = combined[split_index:]

# Separate features and labels
X_train, y_train = zip(*train_combined)
X_test, y_test = zip(*test_combined)

# Convert back to lists
X_train, y_train = list(X_train), list(y_train)
X_test, y_test = list(X_test), list(y_test)

print(f"Training features: {len(X_train)}")
print(f"Training labels: {len(y_train)}")
print(f"Testing features: {len(X_test)}")
print(f"Testing labels: {len(y_test)}")
print(f"Sample training data: {X_train[0]} -> {y_train[0]}")
Output

Features shape: 50 x 3
Labels length: 50
Training features: 40
Training labels: 40
Testing features: 10
Testing labels: 10
Sample training data: [23, 46, 69] -> 0
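The steps above can be wrapped into a small reusable helper. The function name and signature below are illustrative (loosely mirroring scikit-learn's interface), not part of any library:

```python
import random

def manual_train_test_split(features, labels, split_ratio=0.8, seed=None):
    """Shuffle features and labels together, then split them by ratio."""
    if seed is not None:
        random.seed(seed)
    combined = list(zip(features, labels))  # pair each sample with its label
    random.shuffle(combined)
    split_index = int(split_ratio * len(combined))
    X_train, y_train = zip(*combined[:split_index])
    X_test, y_test = zip(*combined[split_index:])
    return list(X_train), list(X_test), list(y_train), list(y_test)

features = [[i, i * 2, i * 3] for i in range(1, 51)]
labels = [1 if i % 2 == 0 else 0 for i in range(1, 51)]

X_train, X_test, y_train, y_test = manual_train_test_split(features, labels, seed=42)
print(len(X_train), len(X_test))  # 40 10
```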

Comparison of Methods

Built-in Python: Best for simple data types. Advantages: no dependencies, easy to understand. Disadvantages: less efficient for large datasets.

NumPy: Best for numerical data. Advantages: fast and memory efficient. Disadvantages: requires NumPy installation.

Features + Labels: Best for ML datasets. Advantages: keeps features and labels paired. Disadvantages: slightly more involved implementation.

Conclusion

Manual data splitting helps you understand the fundamentals of train-test separation in machine learning. Use Python's built-in functions for simple cases, NumPy for numerical efficiency, and the combined approach when dealing with feature-label pairs. While scikit-learn's train_test_split() offers more features like stratification, these manual methods provide full control over the splitting process.
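For instance, the stratification that train_test_split(..., stratify=y) provides can also be done by hand: split each class separately with the same ratio, then merge. A rough sketch (the function name is illustrative):

```python
import random
from collections import defaultdict

def stratified_split(features, labels, split_ratio=0.8, seed=0):
    """Split each class with the same ratio so class proportions are preserved."""
    random.seed(seed)
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append((x, y))  # group samples by label
    train, test = [], []
    for samples in by_class.values():
        random.shuffle(samples)
        k = int(split_ratio * len(samples))
        train.extend(samples[:k])
        test.extend(samples[k:])
    random.shuffle(train)  # mix the classes back together
    random.shuffle(test)
    return train, test

features = [[i] for i in range(100)]
labels = [i % 2 for i in range(100)]  # 50 samples per class

train, test = stratified_split(features, labels)
train_positives = sum(y for _, y in train)
print(len(train), len(test), train_positives)  # 80 20 40
```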

Updated on: 2026-03-27T13:56:25+05:30
