How to Split Data into Training and Testing in Python without Sklearn


In machine learning and artificial intelligence, data is the backbone of every model, and how that data is handled shapes the model's overall performance. One indispensable step is splitting the dataset into training and testing sets. While sklearn's train_test_split() is the most commonly used method for this, there are situations where a Python developer may not have sklearn available or may simply want to understand how to achieve the same result manually. This article explains how to split data into training and testing sets without relying on sklearn, using Python's built-in modules and the numpy library.

Why Split Data into Training and Testing Sets?

Before diving into the details, let's address the rationale. Machine learning algorithms need plenty of data to learn from. This data, the training set, helps the model discover patterns and make predictions. However, to evaluate how well the model performs, we need data that the model has never seen before. This unseen data is the testing set.

Using the same data for both training and testing would give an overly optimistic picture of the model's quality: an overfitted model performs impressively on the training data but stumbles on unseen data. Consequently, the data is typically divided in a 70-30 or 80-20 proportion, where the larger portion is used for training and the smaller one for testing.
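
To make the ratio concrete, here is a minimal sketch of how an 80-20 split translates into set sizes, assuming a hypothetical dataset of 1,000 records:

n_samples = 1000   # hypothetical dataset size
split_ratio = 0.8  # 80-20 split

n_train = int(split_ratio * n_samples)  # 800 records for training
n_test = n_samples - n_train            # 200 records for testing

print(n_train, n_test)  # prints: 800 200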

Example 1: Splitting Data Manually in Python

We'll kick off with a simple yet effective way of splitting the data using Python's built-in random module. The sample used here is a list of integers, but the technique works for any sequence type.

Assume we have a dataset named data as follows:

data = list(range(1, 101))  # data is a list of integers from 1 to 100
  • The goal is to split this data into 80% training data and 20% testing data.

  • First, we import the necessary library. The random module offers a variety of functions for generating random numbers, and we will use its shuffle() function to shuffle our data in place.

  • After shuffling the data, we split it into training and testing sets.

The split_index determines the point at which the data is divided. We calculate it as the product of split_ratio and the length of the dataset, converted to an integer.

Finally, we use list slicing to create the training and testing datasets.

The training data consists of the elements from the start of the list up to split_index, and the testing data consists of the elements from split_index to the end of the list.

Example

import random

data = list(range(1, 101))  # data is a list of integers from 1 to 100
random.shuffle(data)        # shuffle the list in place

split_ratio = 0.8  # we are using an 80-20 split here
split_index = int(split_ratio * len(data))

train_data = data[:split_index]  # first 80% of the shuffled list
test_data = data[split_index:]   # remaining 20%

Output

train_data = [65, 51, 8, 82, 15, 32, 11, 74, 89, 29, 50, 
34, 93, 84, 37, 7, 1, 83, 17, 24, 5, 33, 49, 90, 35, 57, 
47, 73, 46, 95, 10, 80, 59, 94, 63, 27, 31, 52, 18, 76, 
91, 71, 20, 68, 70, 87, 26, 64, 99, 42, 61, 69, 79, 12, 
3, 66, 96, 75, 30, 22, 100, 14, 97, 56, 55, 58, 28, 23, 
98, 6, 2, 88, 43, 41, 78, 60, 72, 39]

test_data = [45, 53, 48, 16, 9, 62, 13, 81, 92, 54, 21, 
38, 25, 44, 85, 19, 40, 77, 67, 4]

Since the code shuffles the data randomly, the output will vary each time you run it.
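
If you need the same split on every run (for example, to keep experiments reproducible), you can seed the random number generator before shuffling. This is a minimal sketch of that idea; the seed value 42 is an arbitrary choice:

import random

random.seed(42)  # fixing the seed makes shuffle() produce the same order every run
data = list(range(1, 101))
random.shuffle(data)

split_index = int(0.8 * len(data))
train_data, test_data = data[:split_index], data[split_index:]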

Example 2: Splitting Data Using Numpy

Another way to split data without sklearn is to use the numpy library. Numpy is a powerful library for numerical computation that lets you construct and manipulate arrays efficiently.

Here's how you can split data using numpy:

  • First, import the numpy library and construct a numpy array.

  • Shuffle the array with np.random.shuffle(), then split it at the calculated index.

The split index represents the point at which the data is divided into the training and testing subsets. It is calculated as the product of the chosen split ratio (0.8 in our case, for an 80-20 split) and the total number of data points.

The final step is to create the training and testing datasets using the calculated split index. We use array slicing for this operation, which works just like list slicing.

Example

import numpy as np

data = np.arange(1, 101)  # data is a numpy array of integers from 1 to 100
np.random.shuffle(data)   # shuffle the array in place

split_ratio = 0.8  # we are using an 80-20 split here
split_index = int(split_ratio * len(data))

train_data = data[:split_index]
test_data = data[split_index:]

Output

train_data = [52, 13, 87, 68, 48, 4, 34, 9, 74, 25, 
30, 38, 90, 83, 54, 45, 61, 73, 80, 14, 70, 63, 75, 
81, 97, 60, 96, 8, 43, 20, 79, 46, 50, 76, 18, 84, 
26, 31, 71, 56, 22, 88, 64, 95, 91, 78, 69, 19, 42, 
67, 77, 2, 41, 32, 11, 94, 40, 59, 17, 57, 99, 44, 
5, 93, 62, 23, 3, 33, 47, 92]

test_data = [49, 66, 7, 58, 37, 98, 100, 24, 6, 55, 
28, 16, 85, 65, 51, 35, 12, 10, 86, 29]
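
In practice you often need to split features and labels together so that corresponding rows stay aligned. One way to do this with numpy is to shuffle an array of row indices rather than the data itself. The sketch below illustrates the idea; the arrays X and y are hypothetical stand-ins for your features and labels:

import numpy as np

X = np.arange(200).reshape(100, 2)  # hypothetical feature matrix: 100 rows, 2 columns
y = np.arange(100)                  # hypothetical labels, one per row

indices = np.random.permutation(len(X))  # shuffled row indices
split_index = int(0.8 * len(X))

train_idx, test_idx = indices[:split_index], indices[split_index:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]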

Conclusion

Splitting data into training and testing sets is a crucial step in machine learning and data science projects. While sklearn provides a straightforward method for this task, it is valuable to understand how to achieve it manually. As we've demonstrated, this can be accomplished using Python's built-in modules or the numpy library.

Whether you choose sklearn, Python's built-in modules, or numpy depends on your specific requirements and constraints. Each method has its advantages and disadvantages: the manual methods give you more control over the process, while sklearn's train_test_split() is simpler to use and offers additional features such as stratification.
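
If you find yourself repeating the manual approach, you can wrap it in a small reusable helper. The function below is only an illustrative sketch (the name split_data and its parameters are our own, not a standard API), but it mimics the shape of sklearn's train_test_split() for plain Python sequences:

import random

def split_data(data, split_ratio=0.8, seed=None):
    """Shuffle a copy of the data and split it into training and testing lists."""
    shuffled = list(data)      # copy, so the original sequence is left untouched
    rng = random.Random(seed)  # local generator; seed=None gives a different split each run
    rng.shuffle(shuffled)
    split_index = int(split_ratio * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

train_data, test_data = split_data(range(1, 101), split_ratio=0.8, seed=42)
print(len(train_data), len(test_data))  # prints: 80 20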
