How can TensorFlow be used to create a dataset of raw strings from the Iliad dataset using Python?
TensorFlow is an open-source machine learning framework developed by Google. It is used with Python to implement algorithms, deep learning applications, and more, for both research and production purposes.
The 'tensorflow' package can be installed on Windows using the below command −
pip install tensorflow
A Tensor is a multidimensional array, the basic data structure in TensorFlow. Tensors flow along the edges of a computation graph known as the 'Data Flow Graph'.
We will use the Iliad dataset, which contains text from three English translations, by William Cowper, Edward, Earl of Derby, and Samuel Butler. Given a single line of text, the model identifies the translator. The text files have been preprocessed to remove document headers, footers, line numbers, and chapter titles.
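Before working with in-memory samples, it helps to see how such per-translator text files can be turned into a single labeled dataset. The sketch below uses small local stand-in files (the file names and contents here are illustrative, not the real downloaded corpus) together with tf.data.TextLineDataset, which yields one raw string per line −

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical stand-ins for the three translation files; in practice the
# real files would be downloaded and preprocessed as described above.
tmp_dir = tempfile.mkdtemp()
translations = {
    "cowper.txt": "Achilles sing, O Goddess! Peleus' son\n",
    "derby.txt": "Of Peleus' son, Achilles, sing, O Muse\n",
    "butler.txt": "Sing, O goddess, the anger of Achilles\n",
}
for name, text in translations.items():
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write(text)

# Build one TextLineDataset per file and attach an integer label per translator
labeled_datasets = []
for i, name in enumerate(translations):
    lines = tf.data.TextLineDataset(os.path.join(tmp_dir, name))
    # Bind i via a default argument so each file keeps its own label
    labeled = lines.map(lambda line, i=i: (line, tf.cast(i, tf.int64)))
    labeled_datasets.append(labeled)

# Concatenate the per-file datasets into one labeled dataset of raw strings
all_labeled_data = labeled_datasets[0]
for ds in labeled_datasets[1:]:
    all_labeled_data = all_labeled_data.concatenate(ds)

for text, label in all_labeled_data.take(1):
    print(text.numpy().decode("utf-8"), int(label.numpy()))
```

Labeling each file before concatenation keeps the translator information attached to every line, so the combined dataset can later be shuffled without losing it.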
Dataset Preparation
First, let's create a complete example showing how to prepare and work with the Iliad dataset as raw strings −
import tensorflow as tf
# Sample Iliad text data (representing the actual dataset structure)
sample_texts = [
    "Sing, goddess, the anger of Peleus' son Achilleus",
    "Of Peleus' son, Achilles, sing, O Muse, the vengeance",
    "Achilles' wrath, to Greece the direful spring"
]
labels = ["cowper", "derby", "butler"]  # Translator labels
# Create a labeled dataset from the raw strings
dataset = tf.data.Dataset.from_tensor_slices((sample_texts, labels))
print("Raw string dataset created from Iliad texts")
# Display sample data
for text, label in dataset.take(2):
    print(f"Text: {text.numpy().decode('utf-8')}")
    print(f"Label: {label.numpy().decode('utf-8')}")
    print("-" * 50)
Raw string dataset created from Iliad texts
Text: Sing, goddess, the anger of Peleus' son Achilleus
Label: cowper
--------------------------------------------------
Text: Of Peleus' son, Achilles, sing, O Muse, the vengeance
Label: derby
--------------------------------------------------
Creating Test Dataset with Raw Strings
Now let's create a test dataset for evaluation −
import tensorflow as tf
print("Creating a test dataset that consists of raw strings")
# Simulate the dataset operations (assuming all_labeled_data exists)
# In practice, this would be your preprocessed Illiad dataset
sample_data = tf.data.Dataset.from_tensor_slices([
"Rage, goddess, sing the rage of Achilles",
"Wrath, sing goddess, of Achilles Peleus' son"
])
VALIDATION_SIZE = 1
BATCH_SIZE = 2
# Create test dataset
test_ds = sample_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
print("Test dataset created with batch size:", BATCH_SIZE)
print("Validation size:", VALIDATION_SIZE)
# Display dataset structure
for batch in test_ds.take(1):
    print("Batch shape:", batch.shape)
    print("Sample text:", batch.numpy()[0].decode('utf-8'))
Creating a test dataset that consists of raw strings
Test dataset created with batch size: 2
Validation size: 1
Batch shape: (1,)
Sample text: Rage, goddess, sing the rage of Achilles
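The take() call above reserves the first few examples for testing; the complementary skip() call yields the remaining examples for training. The following minimal sketch, using a toy in-memory dataset in place of the real labeled Iliad data, shows how the two calls partition a dataset −

```python
import tensorflow as tf

# Toy stand-in for the preprocessed, labeled Iliad dataset (10 examples)
all_labeled_data = tf.data.Dataset.from_tensor_slices(
    ([f"line {i}" for i in range(10)], list(range(10)))
)

VALIDATION_SIZE = 3
BATCH_SIZE = 2

# take() keeps the first VALIDATION_SIZE examples for testing;
# skip() passes over them and yields the rest for training
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
train_ds = all_labeled_data.skip(VALIDATION_SIZE).batch(BATCH_SIZE)

print(len(list(test_ds)))   # 2 batches (2 + 1 examples)
print(len(list(train_ds)))  # 4 batches (2 + 2 + 2 + 1 examples)
```

In a real workflow the dataset would be shuffled before splitting, so that each split contains lines from all three translators.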
Dataset Configuration and Evaluation
Here's how the dataset is typically configured and evaluated in a complete TensorFlow workflow −
import tensorflow as tf
# This represents the typical workflow; it assumes a preprocessed dataset
# (all_labeled_data) and a pre-trained model (export_model) already exist
def configure_dataset(dataset):
    return dataset.cache().prefetch(tf.data.AUTOTUNE)
# Create and configure test dataset
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
# Evaluate the model (requires pre-trained export_model)
loss, accuracy = export_model.evaluate(test_ds)
print("The loss is:", loss)
print("The accuracy is: {:2.2%}".format(accuracy))
Key Steps Explained
- Dataset Creation: Raw text strings from the Iliad translations are converted into TensorFlow dataset format
- Batching: Data is grouped into batches of a specified size for efficient processing
- Configuration: The dataset is optimized using caching and prefetching for better performance
- Evaluation: The trained model evaluates the test dataset to measure loss and accuracy
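The configuration step above can be exercised on its own, without a trained model. The sketch below applies the same cache-and-prefetch pattern to a toy batched dataset: cache() keeps elements in memory after the first pass, and prefetch() overlaps data preparation with consumption −

```python
import tensorflow as tf

def configure_dataset(dataset):
    # cache() stores elements after the first iteration;
    # prefetch() lets the pipeline prepare the next batch in the background
    return dataset.cache().prefetch(tf.data.AUTOTUNE)

# Toy batched dataset standing in for the real test split
ds = tf.data.Dataset.from_tensor_slices(["a", "b", "c"]).batch(2)
ds = configure_dataset(ds)

for batch in ds:
    print(batch.numpy())
```

Neither call changes the data itself, only how it is delivered, so the configured dataset yields exactly the same batches as the original.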
Expected Output Structure
Creating a test dataset that consists of raw strings
79/79 [==============================] - 7s 10ms/step - loss: 0.5230 - accuracy: 0.7909
The loss is: 0.5458346605300903
The accuracy is: 78.16%
Conclusion
TensorFlow efficiently handles raw string datasets from the Iliad corpus by converting text into tensor format, batching it for processing, and enabling model evaluation. This approach supports effective training of translator-identification models, reaching approximately 78% accuracy.
