How to split a Dataset into Train sets and Test sets in Python?
In this tutorial, we will learn how to split a dataset into train sets and test sets using Python. This is a fundamental preprocessing step in machine learning that helps build robust models.
Why Split Datasets?
When creating machine learning models, we need to evaluate their performance on unseen data. Common problems include overfitting (model performs well on training data but fails on new data) and underfitting (model performs poorly on both training and new data).
Splitting the dataset gives us:
- Train set: used to fit the model (typically 70-80% of the data)
- Test set: used to evaluate model performance on unseen data (typically 20-30% of the data)
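A quick sketch of why the split matters, using made-up random data: a deep decision tree can memorise random training labels perfectly, but the held-out test set exposes that it has learned nothing generalisable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Purely random labels: there is no real pattern to learn
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # 1.0 (memorised)
print("test accuracy:", model.score(X_test, y_test))     # typically near chance (~0.5)
```

The large gap between train and test accuracy is the signature of overfitting; without a test set, the perfect training score would look like a good model.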
Method 1: Using sklearn's train_test_split
The most popular and recommended approach uses scikit-learn's built-in function:
import numpy as np
from sklearn.model_selection import train_test_split
# Create sample data
X = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
# Split into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
Shape of X_train: (7, 5)
Shape of X_test: (3, 5)
Shape of y_train: (7,)
Shape of y_test: (3,)
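When the classes are imbalanced, a plain random split can leave one split with too few minority samples. The comparison table below mentions stratification; here is a small sketch of the stratify parameter on made-up data with an 80/20 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])  # imbalanced: 8 zeros, 2 ones

# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print("train class counts:", np.bincount(y_train))  # [4 1]
print("test class counts:", np.bincount(y_test))    # [4 1]
```

Each split keeps the original 4:1 ratio, so the minority class is represented in both the train and test sets regardless of the random seed.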
Method 2: Using NumPy Indexing
For simple cases, you can split manually using array slicing:
import numpy as np
# Create sample data
X = np.random.rand(100, 5)
y = np.random.rand(100, 1)
# Split manually (80% train, 20% test)
split_index = int(0.8 * len(X))
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
Shape of X_train: (80, 5)
Shape of X_test: (20, 5)
Shape of y_train: (80, 1)
Shape of y_test: (20, 1)
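Note that slicing does not shuffle: if the rows are ordered (say, by class or by date), the test set will not be representative. A sketch of a shuffled manual split, permuting indices first so X and y stay aligned:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(100, 1)  # ordered data, for illustration
y = np.arange(100)

# Shuffle the row indices, then slice the shuffled order
indices = rng.permutation(len(X))
split_index = int(0.8 * len(X))
train_idx, test_idx = indices[:split_index], indices[split_index:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```

Indexing both arrays with the same permutation guarantees each X row keeps its matching y value, which shuffling X and y separately would break.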
Method 3: Using Pandas Sample
For DataFrames, use the pandas sample() method:
import pandas as pd
import numpy as np
# Create sample DataFrame
data = np.random.randint(10, 25, size=(100, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
# Split using sample (80% train)
train_df = df.sample(frac=0.8, random_state=42)
test_df = df[~df.index.isin(train_df.index)]
print("Original dataset shape:", df.shape)
print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)
Original dataset shape: (100, 3)
Train dataset shape: (80, 3)
Test dataset shape: (20, 3)
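An equivalent, slightly terser way to take the complement rows is df.drop, which works whenever the DataFrame index is unique (as with the default RangeIndex here):

```python
import numpy as np
import pandas as pd

data = np.random.randint(10, 25, size=(100, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)  # everything not sampled into train

print("Train dataset shape:", train_df.shape)  # (80, 3)
print("Test dataset shape:", test_df.shape)    # (20, 3)
```

Because sample() and drop() operate on the same index labels, the two DataFrames are guaranteed to be disjoint and to cover every row exactly once.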
Comparison
| Method | Best For | Advantages | Disadvantages |
|---|---|---|---|
| train_test_split | Most ML projects | Built-in stratification, random state | Requires sklearn |
| NumPy indexing | Simple arrays | No dependencies, fast | No shuffling by default |
| Pandas sample | DataFrames | Maintains DataFrame structure | Requires pandas |
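Many projects also need a separate validation set for tuning hyperparameters. A sketch of a three-way 60/20/20 split using two calls to train_test_split (the fractions here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off the test set (20% of the whole)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder: 0.25 of the remaining 80% = 20% of the whole
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second test_size must be expressed as a fraction of what remains after the first split, which is why 20% of the original becomes 0.25 here.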
Conclusion
Use train_test_split from sklearn for most machine learning projects, as it provides shuffling, stratification, and reproducible random states out of the box. This preprocessing step is crucial for building reliable models that generalize well to unseen data.
