How can TensorFlow be used to download and explore the Illiad dataset using Python?
TensorFlow is an open-source machine learning framework developed by Google. It is used with Python to build and train machine learning models, including deep learning applications, in both research and production.
The 'tensorflow' package can be installed with pip (on Windows, Linux or macOS) using the line below −
pip install tensorflow
A tensor is the core data structure in TensorFlow. Tensors form the edges of a flow diagram known as the 'data flow graph', carrying data between operations. They are multidimensional arrays or lists that can be described by three main attributes −
Rank − The number of dimensions of the tensor
Type − The data type of the tensor's elements
Shape − The size of the tensor along each dimension (for a two-dimensional tensor, the number of rows and columns)
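To make the three attributes concrete, here is a minimal sketch in plain Python that reads them off a nested list. The helper infer_shape is a hypothetical name introduced only for illustration and assumes the list is regularly nested; in TensorFlow itself the corresponding values are exposed as tensor.shape, tensor.dtype and tf.rank(tensor).

```python
def infer_shape(nested):
    """Return the shape of a regularly nested list as a tuple of ints."""
    shape = []
    level = nested
    # Walk down the first element at each nesting level to read off sizes.
    while isinstance(level, list):
        shape.append(len(level))
        if not level:
            break
        level = level[0]
    return tuple(shape)


matrix = [[1, 2, 3], [4, 5, 6]]          # 2 rows, 3 columns
shape = infer_shape(matrix)              # size along each dimension
rank = len(shape)                        # number of dimensions
elem_type = type(matrix[0][0]).__name__  # type of the elements

print(shape, rank, elem_type)  # prints: (2, 3) 2 int
```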
About the Illiad Dataset
We will be using the Illiad dataset, which contains the text of three English translations of Homer's Iliad, by William Cowper, Edward, Earl of Derby, and Samuel Butler. The dataset is meant for training a model to identify the translator when given a single line of text. The text files have been preprocessed by removing document headers, footers, line numbers and chapter titles.
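The translator-identification task described above amounts to pairing every line with a label for its source. The sketch below shows one possible way to build such pairs. The function label_lines and the sample strings are hypothetical placeholders, not real lines from the translations or part of any TensorFlow API.

```python
def label_lines(lines_by_translator):
    """Pair every line with the index of the translator it came from."""
    labeled = []
    for label, lines in enumerate(lines_by_translator):
        for line in lines:
            labeled.append((line, label))
    return labeled


# Placeholder strings, not real lines from the translations.
cowper_lines = ["placeholder line A", "placeholder line B"]
derby_lines = ["placeholder line C"]
butler_lines = ["placeholder line D"]

examples = label_lines([cowper_lines, derby_lines, butler_lines])
print(examples[0])   # ('placeholder line A', 0)
print(examples[-1])  # ('placeholder line D', 2)
```

Here label 0 stands for Cowper, 1 for Derby and 2 for Butler; the ordering is arbitrary as long as it is used consistently.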
Downloading the Dataset
The following code downloads the Illiad dataset files using TensorFlow's utility functions −
import tensorflow as tf
from tensorflow.keras import utils
import pathlib
print("Loading the Illiad dataset")
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']
print("Iterating through the name of the files")
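for name in FILE_NAMES:
    # get_file downloads each file once and returns its local path
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)
# All three files are stored in the same cache directory
parent_dir = pathlib.Path(text_dir).parent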
print("The list of files in the directory")
print(list(parent_dir.iterdir()))
The output of the above code is −
Loading the Illiad dataset
Iterating through the name of the files
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
The list of files in the directory
[PosixPath('/root/.keras/datasets/derby.txt'), PosixPath('/root/.keras/datasets/cowper.txt'), PosixPath('/root/.keras/datasets/butler.txt')]
Exploring the Downloaded Files
Once downloaded, you can explore the content of these text files to understand the data structure −
# Read and display sample content from one file
sample_file = pathlib.Path(text_dir).parent / 'cowper.txt'
with open(sample_file, 'r', encoding='utf-8') as f:
    sample_text = f.read(200)  # Read first 200 characters
print("Sample text from Cowper's translation:")
print(sample_text)
# Check file sizes
for name in FILE_NAMES:
    file_path = pathlib.Path(text_dir).parent / name
    size = file_path.stat().st_size
    print(f"{name}: {size} bytes")
Key Points
The tf.keras.utils.get_file() function downloads files and caches them locally
Files are stored in the ~/.keras/datasets/ directory by default
The dataset contains three translations of Homer's Iliad
Each text file has been preprocessed for machine learning tasks
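The download-and-cache behaviour noted above can be illustrated with a small standard-library sketch. This is not TensorFlow's actual implementation; fetch_cached and its parameters are hypothetical, and the sketch only checks whether the target file already exists before downloading.

```python
import pathlib
import urllib.request


def fetch_cached(fname, origin, cache_dir):
    """Download origin to cache_dir/fname unless that file already exists."""
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / fname
    if not target.exists():
        # Only fetch from the origin on a cache miss.
        urllib.request.urlretrieve(origin, str(target))
    return str(target)
```

A second call with the same file name returns the cached path without fetching again, which is why re-running the download code is fast once the files are present.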
Conclusion
TensorFlow provides convenient utilities to download and explore text datasets like the Illiad collection. The utils.get_file() function handles downloading and caching, making it easy to access preprocessed text data for natural language processing tasks.
