How can TensorFlow be used to split the Iliad dataset into training and test data in Python?

TensorFlow is an open-source machine learning framework from Google. It is used with Python to implement machine learning algorithms, deep learning applications, and much more, for both research and production purposes.

The tensorflow package can be installed with pip using the following command:

pip install tensorflow

We will use the Iliad dataset, which contains text from three English translations of Homer's Iliad, by William Cowper, Edward, Earl of Derby, and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files have been preprocessed to remove document headers, footers, line numbers, and chapter titles.
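Before splitting, each line of text has to be paired with an integer label identifying its translator. The sketch below follows the approach of TensorFlow's "Load text" tutorial; the file names (cowper.txt, derby.txt, butler.txt) come from that tutorial, and tiny stand-in files are created here so the example is self-contained:

```python
import os
import tempfile

import tensorflow as tf

# Stand-in files: in the real dataset these are cowper.txt, derby.txt
# and butler.txt, with one line of translated verse per line.
tmp_dir = tempfile.mkdtemp()
samples = {
    "cowper.txt": ["Achilles sing, O Goddess! Peleus' son;"],
    "derby.txt": ["Of Peleus' son, Achilles, sing, O Muse,"],
    "butler.txt": ["Sing, O goddess, the anger of Achilles"],
}
for name, lines in samples.items():
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("\n".join(lines))

def labeler(line, index):
    # Pair each line of text with the integer id of its translator.
    return line, tf.cast(index, tf.int64)

labeled_datasets = []
for i, name in enumerate(samples):
    lines_ds = tf.data.TextLineDataset(os.path.join(tmp_dir, name))
    labeled_datasets.append(lines_ds.map(lambda line, i=i: labeler(line, i)))

# Concatenate the three labeled datasets into one.
all_labeled_data = labeled_datasets[0]
for ds in labeled_datasets[1:]:
    all_labeled_data = all_labeled_data.concatenate(ds)

for text, label in all_labeled_data.take(1):
    print(text.numpy(), label.numpy())
```

After this step, each element of the dataset is a (line, translator_id) pair, which is exactly the form expected by the splitting pipeline shown below.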

Understanding the Dataset Structure

Before splitting the data, we need to understand that TensorFlow uses tensors as its primary data structure. A tensor is a multidimensional array characterized by three main attributes:

  • Rank: the number of dimensions of the tensor
  • Type: the data type of the tensor's elements (int64, float32, etc.)
  • Shape: the number of elements along each dimension
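All three attributes can be inspected directly on any tensor; a minimal sketch:

```python
import tensorflow as tf

# A rank-2 tensor (a matrix) with two rows and three columns.
t = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.int64)

print("Rank :", t.ndim)    # number of dimensions -> 2
print("Type :", t.dtype)   # element data type    -> <dtype: 'int64'>
print("Shape:", t.shape)   # elements per axis    -> (2, 3)
```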

Splitting Dataset into Training and Validation Data

Here's how to split the Iliad dataset using TensorFlow's data pipeline operations:

import tensorflow as tf

# Define constants for data splitting
VALIDATION_SIZE = 1000
BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Assume all_encoded_data is our preprocessed dataset
# Split data: skip validation_size for training, take validation_size for validation
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

# Apply padding and batching to both datasets
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

# Examine the structure of our batched data
sample_text, sample_labels = next(iter(validation_data))
print("The text batch shape is :", sample_text.shape)
print("The label batch shape is :", sample_labels.shape)
print("A text example is :", sample_text[5])
print("A label example is:", sample_labels[5])
Output:

The text batch shape is : (64, 18)
The label batch shape is : (64,)
A text example is : tf.Tensor(
[ 20 391 2 11 144 787 2 3498 16 49 2 0 0 0
  0 0 0 0], shape=(18,), dtype=int64)
A label example is: tf.Tensor(1, shape=(), dtype=int64)

How the Data Splitting Works

The splitting process involves several key operations:

  • skip(VALIDATION_SIZE): skips the first 1000 samples, so the remainder becomes the training data
  • take(VALIDATION_SIZE): takes the first 1000 samples as the validation data
  • shuffle(BUFFER_SIZE): randomizes the order of the training samples for better training
  • padded_batch(BATCH_SIZE): groups sequences into batches and pads each batch to a uniform length
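The same pipeline can be exercised end to end on a toy dataset of made-up variable-length sequences (a sketch, not the actual Iliad data), which makes the effect of skip(), take(), and padded_batch() easy to see:

```python
import tensorflow as tf

# Six variable-length "encoded lines" standing in for the real data.
sequences = [[1, 2], [3, 4, 5], [6], [7, 8], [9, 10, 11, 12], [13]]

ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int64),
)

VALIDATION_SIZE = 2
train = ds.skip(VALIDATION_SIZE)       # the last four sequences
validation = ds.take(VALIDATION_SIZE)  # the first two sequences

# padded_batch pads every sequence in a batch with zeros
# up to the length of the longest sequence in that batch.
val_batch = next(iter(validation.padded_batch(VALIDATION_SIZE)))
print(val_batch.numpy())
# [[1 2 0]
#  [3 4 5]]
```

Note that padding is applied per batch, so the padded width (here 3) is determined by the longest sequence in each batch, not by the longest sequence in the whole dataset.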

Key Points About Padding and Batching

  • Padding is essential because text sequences have different lengths, but batch processing requires uniform dimensions
  • Batching groups multiple examples together for efficient processing
  • The tf.data.Dataset methods provide an efficient pipeline for data preprocessing
  • Each batch contains pairs of (text_examples, labels) represented as tensors

Conclusion

TensorFlow's tf.data.Dataset API provides efficient methods to split datasets using skip() and take() operations. Combined with padded_batch(), it creates properly formatted training and validation datasets ready for model training.

Updated on: 2026-03-25T15:26:58+05:30
