How can the Iliad dataset be prepared for training using Python?

TensorFlow is an open-source machine learning framework from Google, used with Python to implement algorithms, deep learning applications, and much more. The Iliad dataset contains text from three English translations of Homer's Iliad and can be prepared for training a text classification model.

The tensorflow package, together with tensorflow-text (which provides the tokenizer used below), can be installed with pip:

pip install tensorflow tensorflow-text

We will be using the Iliad dataset, which contains the text of three translations by William Cowper, Edward, Earl of Derby, and Samuel Butler. The model is trained to identify the translator from a single line of text. The text files have been preprocessed to remove document headers, footers, line numbers, and chapter titles.

Understanding Text Tokenization

A tensor is the core data structure in TensorFlow: a multidimensional array (or list) that flows along the edges of TensorFlow's dataflow graph. For text processing, raw strings must first be converted into tokens that the model can understand.

Preparing the Iliad Dataset

The dataset preparation involves tokenizing the text data using TensorFlow Text. Here's how to prepare the dataset for training:

import tensorflow as tf
import tensorflow_text as tf_text

print("Prepare the dataset for training")
tokenizer = tf_text.UnicodeScriptTokenizer()

print("Defining a function named 'tokenize' to tokenize the text data")
def tokenize(text, unused_label):
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)

# all_labeled_data is assumed to be an existing tf.data.Dataset of
# (text, label) pairs, e.g. built from tf.data.TextLineDataset
tokenized_ds = all_labeled_data.map(tokenize)

print("Iterate over the dataset and print a few samples")
for tokens in tokenized_ds.take(6):
    print("Tokens: ", tokens.numpy())

The output of the above code is:

Prepare the dataset for training
Defining a function named 'tokenize' to tokenize the text data
Iterate over the dataset and print a few samples
Tokens: [b'but' b'i' b'have' b'now' b'both' b'tasted' b'food' b',' b'and' b'given']
Tokens: [b'all' b'these' b'shall' b'now' b'be' b'thine' b':' b'but' b'if' b'the' b'gods']
Tokens: [b'their' b'spiry' b'summits' b'waved' b'.' b'there' b',' b'unperceived']
Tokens: [b'"' b'i' b'pray' b'you' b',' b'would' b'you' b'show' b'your' b'love' b',' b'dear' b'friends' b',']
Tokens: [b'entering' b'beneath' b'the' b'clavicle' b'the' b'point']
Tokens: [b'but' b'grief' b',' b'his' b'father' b'lost' b',' b'awaits' b'him' b'now' b',']

How the Tokenization Works

The tokenization process involves several steps:

  • Case Folding: Converts all text to lowercase using tf_text.case_fold_utf8()
  • Unicode Script Tokenization: Splits text into tokens based on Unicode script boundaries
  • Word Separation: Separates words, punctuation, and special characters into individual tokens
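The steps above can be approximated in pure Python. The sketch below is a rough ASCII-only stand-in, not the real UnicodeScriptTokenizer (which splits on Unicode script boundaries and handles far more than Latin text), but it mirrors the two-stage shape of the pipeline: fold case first, then split words and punctuation into separate tokens.

```python
import re

def tokenize_line(text):
    # Step 1: case folding (approximating tf_text.case_fold_utf8)
    lower = text.casefold()
    # Step 2: split runs of word characters and individual punctuation
    # marks into separate tokens (a rough ASCII stand-in for
    # UnicodeScriptTokenizer's script-boundary splitting)
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", lower)

print(tokenize_line("But grief, his father lost,"))
# ['but', 'grief', ',', 'his', 'father', 'lost', ',']
```

Note how punctuation becomes its own token, matching the sample output above, where commas and quotation marks appear as separate entries.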

Key Components

Component               Purpose                           Output
UnicodeScriptTokenizer  Splits text into tokens           Individual words and punctuation
case_fold_utf8          Normalizes text case              Lowercase text
map()                   Applies tokenization to dataset   Tokenized dataset
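A note on the case_fold_utf8 row: Python's built-in str.casefold() is a close stdlib analogue, applying Unicode case folding rather than simple lowercasing. The difference shows up on characters like the German eszett:

```python
# Unicode case folding vs. simple lowercasing in plain Python.
print("Achilles".casefold())   # 'achilles' — same as lower() for ASCII
print("Straße".casefold())     # 'strasse' — folding rewrites ß to ss
print("Straße".lower())        # 'straße'  — lower() leaves ß unchanged
```

This is why the pipeline uses a dedicated case-folding step rather than a plain lowercase conversion: folded text normalizes more aggressively, which keeps token vocabularies consistent across scripts.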

Conclusion

The Iliad dataset preparation involves tokenizing text data using TensorFlow Text's UnicodeScriptTokenizer. This process converts raw text into tokens that machine learning models can process, enabling the classification of text by translator style.

Updated on: 2026-03-25T15:26:00+05:30
