How can Tensorflow be used to split the Illiad dataset into training and test data in Python?


Tensorflow is a machine learning framework that is provided by Google. It is an open-source framework used in conjunction with Python to implement algorithms, deep learning applications and much more. It is used in research and for production purposes.

The ‘tensorflow’ package can be installed on Windows using the below line of code −

pip install tensorflow

Tensor is a data structure used in TensorFlow. It helps connect edges in a flow diagram. This flow diagram is known as the ‘Data flow graph’. Tensors are nothing but a multidimensional array or a list.

They can be identified using three main attributes −

  • Rank − It tells about the dimensionality of the tensor. It can be understood as the order of the tensor or the number of dimensions in the tensor that has been defined.

  • Type − It tells about the data type associated with the elements of the Tensor. It can be a one dimensional, two dimensional or n-dimensional tensor.

  • Shape − It is the number of rows and columns together.

We will be using the Illiad’s dataset, which contains text data of three translation works from William Cowper, Edward (Earl of Derby) and Samuel Butler. The model is trained to identify the translator when a single line of text is given. The text files used have been preprocessing. This includes removing the document header and footer, line numbers and chapter titles.

We are using Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.

Example

Following is the code snippet −

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

sample_text, sample_labels = next(iter(validation_data))
print("The text batch shape is : ", sample_text.shape)
print("The label batch shape is : ", sample_labels.shape)
print("A text example is : ", sample_text[5])
print("A label example is: ", sample_labels[5])

Code credit − https://www.tensorflow.org/tutorials/load_data/text

Output

The text batch shape is : (64, 18)
The label batch shape is : (64,)
A text example is : tf.Tensor(
[ 20 391 2 11 144 787 2 3498 16 49 2 0 0 0
   0 0 0 0], shape=(18,), dtype=int64)
A label example is: tf.Tensor(1, shape=(), dtype=int64)

Explanation

  • The Keras TextVectorization layer is used to group/batch and provide padding to the vectorized data.

  • Padding is needed since examples inside a batch need to be of the same size and shape, but examples in the dataset may not be the same size.

  • Every line of text may have a different number of words.

  • The ‘tf.data.Dataset’ method helps in splitting and pad-batching datasets.

  • The ‘validation_data’ and ‘train_data’ are collections of batch data.

  • Every batch is a pair of (many examples, many labels) represented as arrays.

Updated on: 19-Jan-2021

131 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements