How can TensorFlow be used to create a dataset of raw strings from the Iliad dataset using Python?
TensorFlow is an open-source machine learning framework developed by Google. It is used with Python to implement algorithms, deep learning applications, and more, for both research and production purposes.
The 'tensorflow' package can be installed on Windows using the below command −
pip install tensorflow
A Tensor is a multidimensional array, the basic data structure in TensorFlow. Tensors flow along the edges of a computation graph known as the 'Data Flow Graph'.
We will use the Iliad dataset, which contains text from three English translations, by William Cowper, Edward, Earl of Derby, and Samuel Butler. Given a single line of text, the model identifies the translator. The text files have been preprocessed to remove document headers, footers, line numbers, and chapter titles.
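Before working with in-memory samples, it helps to see how such per-translator text files can be turned into a single labeled dataset. The sketch below uses small local stand-in files (the file names and contents here are illustrative, not the real downloaded corpus) together with tf.data.TextLineDataset, which yields one raw string per line −

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical stand-ins for the three translation files; in practice the
# real files would be downloaded and preprocessed as described above.
tmp_dir = tempfile.mkdtemp()
translations = {
    "cowper.txt": "Achilles sing, O Goddess! Peleus' son\n",
    "derby.txt": "Of Peleus' son, Achilles, sing, O Muse\n",
    "butler.txt": "Sing, O goddess, the anger of Achilles\n",
}
for name, text in translations.items():
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write(text)

# Build one TextLineDataset per file and attach an integer label per translator
labeled_datasets = []
for i, name in enumerate(translations):
    lines = tf.data.TextLineDataset(os.path.join(tmp_dir, name))
    # Bind i via a default argument so each file keeps its own label
    labeled = lines.map(lambda line, i=i: (line, tf.cast(i, tf.int64)))
    labeled_datasets.append(labeled)

# Concatenate the per-file datasets into one labeled dataset of raw strings
all_labeled_data = labeled_datasets[0]
for ds in labeled_datasets[1:]:
    all_labeled_data = all_labeled_data.concatenate(ds)

for text, label in all_labeled_data.take(1):
    print(text.numpy().decode("utf-8"), int(label.numpy()))
```

Labeling each file before concatenation keeps the translator information attached to every line, so the combined dataset can later be shuffled without losing it.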
Dataset Preparation
First, let's create a complete example showing how to prepare and work with the Iliad dataset as raw strings −
import tensorflow as tf
# Sample Iliad text data (representing the actual dataset structure)
sample_texts = [
    "Sing, goddess, the anger of Peleus' son Achilleus",
    "Of Peleus' son, Achilles, sing, O Muse, the vengeance",
    "Achilles' wrath, to Greece the direful spring"
]
labels = ["cowper", "derby", "butler"]  # Translator labels
# Create a labeled dataset from the raw strings
dataset = tf.data.Dataset.from_tensor_slices((sample_texts, labels))
print("Raw string dataset created from Iliad texts")
# Display sample data
for text, label in dataset.take(2):
    print(f"Text: {text.numpy().decode('utf-8')}")
    print(f"Label: {label.numpy().decode('utf-8')}")
    print("-" * 50)
Raw string dataset created from Iliad texts
Text: Sing, goddess, the anger of Peleus' son Achilleus
Label: cowper
--------------------------------------------------
Text: Of Peleus' son, Achilles, sing, O Muse, the vengeance
Label: derby
--------------------------------------------------
Creating Test Dataset with Raw Strings
Now let's create a test dataset for evaluation −
import tensorflow as tf
print("Creating a test dataset that consists of raw strings")
# Simulate the dataset operations (assuming all_labeled_data exists)
# In practice, this would be your preprocessed Illiad dataset
sample_data = tf.data.Dataset.from_tensor_slices([
"Rage, goddess, sing the rage of Achilles",
"Wrath, sing goddess, of Achilles Peleus' son"
])
VALIDATION_SIZE = 1
BATCH_SIZE = 2
# Create test dataset
test_ds = sample_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
print("Test dataset created with batch size:", BATCH_SIZE)
print("Validation size:", VALIDATION_SIZE)
# Display dataset structure
for batch in test_ds.take(1):
    print("Batch shape:", batch.shape)
    print("Sample text:", batch.numpy()[0].decode('utf-8'))
Creating a test dataset that consists of raw strings
Test dataset created with batch size: 2
Validation size: 1
Batch shape: (1,)
Sample text: Rage, goddess, sing the rage of Achilles
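The take() call above reserves the first few examples for testing; the complementary skip() call yields the remaining examples for training. The following minimal sketch, using a toy in-memory dataset in place of the real labeled Iliad data, shows how the two calls partition a dataset −

```python
import tensorflow as tf

# Toy stand-in for the preprocessed, labeled Iliad dataset (10 examples)
all_labeled_data = tf.data.Dataset.from_tensor_slices(
    ([f"line {i}" for i in range(10)], list(range(10)))
)

VALIDATION_SIZE = 3
BATCH_SIZE = 2

# take() keeps the first VALIDATION_SIZE examples for testing;
# skip() passes over them and yields the rest for training
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
train_ds = all_labeled_data.skip(VALIDATION_SIZE).batch(BATCH_SIZE)

print(len(list(test_ds)))   # 2 batches (2 + 1 examples)
print(len(list(train_ds)))  # 4 batches (2 + 2 + 2 + 1 examples)
```

In a real workflow the dataset would be shuffled before splitting, so that each split contains lines from all three translators.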
Dataset Configuration and Evaluation
Here's how the dataset is typically configured and evaluated in a complete TensorFlow workflow −
import tensorflow as tf
# This represents the typical workflow; it assumes a preprocessed dataset
# (all_labeled_data) and a pre-trained model (export_model) already exist
def configure_dataset(dataset):
    return dataset.cache().prefetch(tf.data.AUTOTUNE)
# Create and configure test dataset
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
# Evaluate the model (requires pre-trained export_model)
loss, accuracy = export_model.evaluate(test_ds)
print("The loss is:", loss)
print("The accuracy is: {:2.2%}".format(accuracy))
Key Steps Explained
- Dataset Creation: Raw text strings from the Iliad translations are converted into TensorFlow dataset format
- Batching: Data is grouped into batches of a specified size for efficient processing
- Configuration: The dataset is optimized using caching and prefetching for better performance
- Evaluation: The trained model evaluates the test dataset to measure loss and accuracy
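The configuration step above can be exercised on its own, without a trained model. The sketch below applies the same cache-and-prefetch pattern to a toy batched dataset: cache() keeps elements in memory after the first pass, and prefetch() overlaps data preparation with consumption −

```python
import tensorflow as tf

def configure_dataset(dataset):
    # cache() stores elements after the first iteration;
    # prefetch() lets the pipeline prepare the next batch in the background
    return dataset.cache().prefetch(tf.data.AUTOTUNE)

# Toy batched dataset standing in for the real test split
ds = tf.data.Dataset.from_tensor_slices(["a", "b", "c"]).batch(2)
ds = configure_dataset(ds)

for batch in ds:
    print(batch.numpy())
```

Neither call changes the data itself, only how it is delivered, so the configured dataset yields exactly the same batches as the original.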
Expected Output Structure
Creating a test dataset that consists of raw strings
79/79 [==============================] - 7s 10ms/step - loss: 0.5230 - accuracy: 0.7909
The loss is: 0.5458346605300903
The accuracy is: 78.16%
Conclusion
TensorFlow efficiently handles raw string datasets from the Iliad corpus by converting text into tensor format, batching it for processing, and enabling model evaluation. This approach supports effective training of translator-identification models, reaching approximately 78% accuracy.
