How can TensorFlow be used to build a vocabulary from tokenized words for the Iliad dataset using Python?

TensorFlow is an open-source machine learning framework from Google, used with Python to implement deep learning applications and other algorithms. It represents data as multi-dimensional arrays called tensors, which interoperate with NumPy, and performs complex mathematical operations on them efficiently. The framework has strong support for deep neural networks.

We will be using the Iliad dataset, which contains the text of three English translations of Homer's Iliad, by William Cowper, Edward, Earl of Derby, and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files have been preprocessed by removing document headers, footers, line numbers, and chapter titles.

Building Vocabulary from Tokenized Words

Building a vocabulary involves collecting all unique tokens from the tokenized dataset and sorting them by frequency. This creates a mapping between words and numerical indices that machine learning models can consume:

import collections
import tensorflow as tf

# Assume we have a tokenized dataset and configuration function
# For demonstration, let's create a sample tokenized dataset
def create_sample_data():
    sample_texts = [
        [b'the', b'quick', b'brown', b'fox'],
        [b'the', b'brown', b'fox', b'jumps'],
        [b'over', b'the', b'lazy', b'dog'],
        [b'the', b'dog', b'is', b'lazy']
    ]
    return tf.data.Dataset.from_generator(
        lambda: sample_texts, 
        output_signature=tf.TensorSpec(shape=(None,), dtype=tf.string)
    )

def configure_dataset(ds):
    # Cache and prefetch so repeated iteration over the dataset is efficient.
    # (Batching here would make the loop below iterate over batches of lines
    # instead of individual tokens.)
    return ds.cache().prefetch(tf.data.AUTOTUNE)

# Set vocabulary size
VOCAB_SIZE = 10000

# Create sample tokenized dataset
tokenized_ds = create_sample_data()

print("Build a vocabulary using the tokens")
tokenized_ds = configure_dataset(tokenized_ds)
vocab_dict = collections.defaultdict(lambda: 0)

for toks in tokenized_ds.as_numpy_iterator():
    for tok in toks:
        vocab_dict[tok] += 1

print("Sort the vocabulary")
vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)

print("The vocabulary size is:", vocab_size)
print("First six vocabulary entries are:", vocab[:6])

The output of the above code is:

Build a vocabulary using the tokens
Sort the vocabulary
The vocabulary size is: 9
First six vocabulary entries are: [b'the', b'brown', b'fox', b'lazy', b'dog', b'quick']

How It Works

The vocabulary building process follows these steps:

  • Token Collection: All tokens from the tokenized dataset are collected using a defaultdict to count frequencies
  • Frequency Sorting: Tokens are sorted by frequency in descending order using the count values
  • Size Limitation: Only the top VOCAB_SIZE tokens are kept to limit vocabulary size
  • Token Extraction: The final vocabulary contains just the tokens, not their counts
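Once the vocabulary is built, it can be turned into the word-to-index mapping mentioned earlier. Here is a minimal sketch in plain Python; the small `vocab` list stands in for the one produced above, and the `word_to_index` and `encode` names are illustrative choices, not part of the original code:

```python
# Stand-in vocabulary, already sorted by frequency as in the steps above
vocab = [b'the', b'brown', b'fox', b'lazy', b'dog', b'quick']

# Reserve index 0 for padding and 1 for out-of-vocabulary (OOV) tokens,
# a common convention when feeding token ids to a model
word_to_index = {token: idx + 2 for idx, token in enumerate(vocab)}

def encode(tokens):
    """Map each token to its index, using 1 for unknown tokens."""
    return [word_to_index.get(tok, 1) for tok in tokens]

print(encode([b'the', b'lazy', b'unicorn']))  # [2, 5, 1] - unknown word maps to 1
```

Reserving low indices for padding and OOV is optional, but it keeps real words from colliding with the special ids most sequence models expect.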

Key Parameters

Parameter                Purpose                            Impact
VOCAB_SIZE               Limits vocabulary size             Smaller values reduce memory; larger values preserve more words
reverse=True             Sorts by highest frequency first   Most common words appear first in the vocabulary
defaultdict(lambda: 0)   Initializes counts to zero         Automatically handles unseen tokens
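The defaultdict-plus-sort pattern in the listing can also be written with `collections.Counter`, whose `most_common` method combines counting and frequency sorting in one step. A sketch using the same sample tokens (ties are returned in first-encountered order, matching the stable sort used above):

```python
import collections

sample_texts = [
    [b'the', b'quick', b'brown', b'fox'],
    [b'the', b'brown', b'fox', b'jumps'],
    [b'over', b'the', b'lazy', b'dog'],
    [b'the', b'dog', b'is', b'lazy'],
]

# Counter replaces the defaultdict and the manual sort
counts = collections.Counter(tok for line in sample_texts for tok in line)

VOCAB_SIZE = 10000
vocab = [token for token, count in counts.most_common(VOCAB_SIZE)]

print(vocab[:6])  # matches the first six entries shown earlier
```

This produces the same vocabulary as the defaultdict version; which to use is a matter of taste, though `Counter` is the more idiomatic choice for pure counting.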

Conclusion

Building vocabulary from tokenized text involves counting token frequencies and selecting the most common tokens. This vocabulary serves as the foundation for converting text into numerical representations that machine learning models can process effectively.
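To carry the conversion through in TensorFlow itself, the vocabulary can back a `tf.lookup.StaticVocabularyTable`, which maps tokens to integer ids and hashes unseen tokens into out-of-vocabulary buckets. A sketch, again using a small stand-in vocabulary:

```python
import tensorflow as tf

vocab = [b'the', b'brown', b'fox', b'lazy', b'dog', b'quick']

keys = tf.constant(vocab)
values = tf.range(len(vocab), dtype=tf.int64)
init = tf.lookup.KeyValueTensorInitializer(keys, values)

# One extra bucket catches out-of-vocabulary tokens,
# which receive id len(vocab)
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

ids = table.lookup(tf.constant([b'the', b'lazy', b'unicorn']))
print(ids.numpy())  # [0 3 6] - b'unicorn' gets the OOV id 6
```

The resulting table can be applied inside a `tf.data` pipeline with `map`, turning each tokenized line into the integer sequence a model consumes.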

Updated on: 2026-03-25T15:25:57+05:30
