How can TensorFlow be used to convert tokenized words from the Iliad dataset into integers using Python?

TensorFlow is an open-source machine learning framework from Google. It is used with Python to implement algorithms, deep learning applications, and much more, in both research and production.

The tensorflow package can be installed on Windows with the following command:

pip install tensorflow

A Tensor is the core data structure in TensorFlow. Tensors travel along the edges of the dataflow graph that connects operations. They are multidimensional arrays or lists identified by three main attributes:

  • Rank: dimensionality of the tensor (its order, i.e. the number of dimensions)

  • Type: data type of the tensor elements

  • Shape: the size of each dimension (for a matrix, the number of rows and columns)
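These three attributes map directly onto the properties of a NumPy array, which is a handy way to see them before moving to TensorFlow tensors (a NumPy analogy, not TensorFlow itself; it assumes NumPy is installed):

```python
import numpy as np

# A 2-D array: rank 2, integer element type, shape (2, 3)
t = np.array([[1, 2, 3], [4, 5, 6]])

print(t.ndim)   # rank: number of dimensions -> 2
print(t.dtype)  # type: element data type
print(t.shape)  # shape: size of each dimension -> (2, 3)
```

TensorFlow tensors expose the same three attributes (e.g. via `tf.rank`, the `dtype` property, and the `shape` property).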

We will be using the Iliad dataset, which contains text from three translations: by William Cowper, Edward (Earl of Derby), and Samuel Butler. The model is trained to identify the translator from a single line of text. The text files have been preprocessed by removing document headers, footers, line numbers and chapter titles.

Converting Tokens to Integers

The process involves creating a vocabulary table that maps tokenized words to integer values. Here's how to implement this conversion:

import tensorflow as tf
import tensorflow_text as tf_text

# Tokenizer used earlier in the pipeline to split each line into words
tokenizer = tf_text.UnicodeScriptTokenizer()

# `vocab` is the set of unique tokens collected from the dataset earlier
keys = vocab
values = range(2, len(vocab) + 2)  # reserve 0 for padding, 1 for OOV

# Map the tokens to integers
init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

# Standardize, tokenize and vectorize a (text, label) pair
# using the tokenizer and the lookup table
def preprocess_text(text, label):
    standardized = tf_text.case_fold_utf8(text)
    tokenized = tokenizer.tokenize(standardized)
    vectorized = vocab_table.lookup(tokenized)
    return vectorized, label
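The effect of the lookup table can be sketched with a plain Python dictionary (a simplified analogy: the real `StaticVocabularyTable` hashes OOV tokens into buckets, and the toy vocabulary below is purely illustrative):

```python
# Toy vocabulary, as if collected from the Iliad text earlier
vocab = ["achilles", "sing", "wrath"]

# Known tokens get ids 2, 3, 4, ...; 0 is reserved for padding, 1 for OOV
token_to_id = {token: i for i, token in enumerate(vocab, start=2)}

def lookup(token):
    # Unknown tokens fall back to the single OOV id, 1
    return token_to_id.get(token, 1)

print([lookup(t) for t in ["sing", "wrath", "zeus"]])  # [3, 4, 1]
```

"zeus" is not in the toy vocabulary, so it maps to the OOV id 1, while the known tokens receive their assigned ids.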

How It Works

The token-to-integer conversion process follows these steps:

  • Vocabulary Creation: the vocab set is used to create a StaticVocabularyTable

  • Integer Mapping: known tokens are mapped to the integers 2 through vocab_size + 1

  • Special Values: 0 indicates padding and 1 indicates out-of-vocabulary (OOV) tokens

  • Text Processing: the preprocess_text function standardizes, tokenizes and vectorizes input text
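The steps above can be traced end to end with plain Python stand-ins for the TensorFlow ops (lowercasing in place of `case_fold_utf8`, a whitespace split in place of the tokenizer; the vocabulary and function below are illustrative):

```python
vocab = ["sing", "the", "wrath"]
token_to_id = {t: i for i, t in enumerate(vocab, start=2)}  # 0 = pad, 1 = OOV

def preprocess_text(text):
    standardized = text.lower()                         # standardize (case fold)
    tokenized = standardized.split()                    # tokenize
    return [token_to_id.get(t, 1) for t in tokenized]   # vectorize

print(preprocess_text("Sing the WRATH"))  # [2, 3, 4]
```

Each line of text comes out as a sequence of integer ids, ready to feed into an embedding layer.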

Key Components

Component           Purpose            Value Range
Padding token       Sequence padding   0
OOV token           Unknown words      1
Vocabulary tokens   Known words        2 to vocab_size + 1
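The reserved padding id 0 is what lets variable-length lines be batched to a common length; a minimal sketch of that padding step (in a real pipeline, `tf.data.Dataset.padded_batch` does this for you):

```python
def pad_to(seq, length, pad_id=0):
    # Right-pad a token-id sequence with the reserved padding id 0
    return seq + [pad_id] * (length - len(seq))

batch = [[2, 5, 7], [3, 4]]
max_len = max(len(s) for s in batch)
padded = [pad_to(s, max_len) for s in batch]
print(padded)  # [[2, 5, 7], [3, 4, 0]]
```

Because 0 is never assigned to a real token, downstream layers (e.g. an embedding layer with masking) can distinguish padding from content.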

Conclusion

TensorFlow's StaticVocabularyTable efficiently converts tokenized words to integers for neural network processing. This approach reserves special values for padding and unknown words while mapping known vocabulary to unique integer identifiers.

---
Updated on: 2026-03-25T15:26:19+05:30
