
How can TensorFlow be used to build a vocabulary from tokenized words for the Iliad dataset using Python?
TensorFlow is an open-source machine learning framework developed by Google. It is used with Python to implement machine learning algorithms and deep learning applications, in both research and production. It performs complicated mathematical operations quickly by running optimized computations on multi-dimensional arrays, known as ‘tensors’. The framework supports working with deep neural networks.
A tensor is the data structure used in TensorFlow. Tensors are the edges of a flow diagram known as the ‘data flow graph’: they carry data between the operations that form the nodes of the graph. A tensor is a multidimensional array, described by its rank (number of dimensions), its shape, and its data type.
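To make the idea of a tensor as a multidimensional array concrete, here is a minimal sketch that uses plain Python nested lists as a stand-in for tensors. The helper `shape_of` is hypothetical and written only for this illustration; in TensorFlow itself you would create a tensor with `tf.constant` and read its `.shape` attribute instead.

```python
def shape_of(nested):
    """Return the shape of a regularly nested list as a tuple.

    Hypothetical helper for illustration only: it assumes every
    sub-list at the same depth has the same length.
    """
    shape = []
    current = nested
    while isinstance(current, list):
        shape.append(len(current))
        if not current:
            break
        current = current[0]
    return tuple(shape)


scalar = 7                          # rank 0: a single value
vector = [1, 2, 3]                  # rank 1: a list of values
matrix = [[1, 2, 3], [4, 5, 6]]     # rank 2: a list of lists

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    shape = shape_of(value)
    # The rank is simply the number of dimensions in the shape.
    print(name, "rank:", len(shape), "shape:", shape)
```

The rank is the length of the shape tuple, so a scalar has an empty shape, a vector has one dimension, and a matrix has two.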
We will be using the Iliad dataset, which contains the text of three translations of Homer's Iliad, by William Cowper, Edward (Earl of Derby), and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files have been preprocessed: the document header and footer, line numbers, and chapter titles have been removed.
We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
Example
Following is the code snippet −
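import collections

# tokenized_ds, configure_dataset and VOCAB_SIZE are not defined in this
# snippet; they are assumed to come from earlier steps of the tutorial.
print("Build a vocabulary using the tokens")
tokenized_ds = configure_dataset(tokenized_ds)

# Count how often each token occurs across the whole dataset.
vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
    for tok in toks:
        vocab_dict[tok] += 1

print("Sort the vocabulary")
# Sort by count, most frequent first, and keep only the tokens.
vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("The vocabulary size is : ", vocab_size)
print("First six vocabulary entries are :", vocab[:6])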
Code credit − https://www.tensorflow.org/tutorials/load_data/text
Output
Build a vocabulary using the tokens
Sort the vocabulary
The vocabulary size is : 10000
First six vocabulary entries are : [b',', b'the', b'and', b"'", b'of', b'.']
The vocabulary is capped at the VOCAB_SIZE most frequent tokens (10000 in this run), ordered from most to least frequent.
Explanation
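Each token in the dataset is counted with a defaultdict, the tokens are sorted by frequency in descending order, and the list is truncated to the top VOCAB_SIZE entries.
The vocabulary size and the first six entries are displayed on the console. The entries appear as byte strings (for example b'the') because TensorFlow stores strings as bytes.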
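The same count, sort, and truncate steps can be tried without TensorFlow or the Iliad files. The sketch below is a self-contained illustration, not the tutorial's code: `tokenized_lines` is made-up sample data standing in for `tokenized_ds`, the `build_vocab` helper is hypothetical, and `VOCAB_SIZE` is set to a small value so the truncation is visible. It uses `collections.Counter` in place of the `defaultdict` and `sorted` combination.

```python
import collections

# Made-up tokenized lines standing in for tokenized_ds (an assumption
# for this sketch; the real dataset yields arrays of byte strings).
tokenized_lines = [
    ["sing", ",", "o", "goddess", ",", "the", "anger", "of", "achilles"],
    ["the", "wrath", "of", "achilles", ",", "son", "of", "peleus"],
    ["and", "the", "will", "of", "jove", "was", "fulfilled"],
]

VOCAB_SIZE = 5  # deliberately small cap, chosen only for this example


def build_vocab(lines, vocab_size):
    """Return the vocab_size most frequent tokens, most frequent first."""
    counts = collections.Counter()
    for tokens in lines:
        counts.update(tokens)  # add one count per token occurrence
    # most_common sorts by count in descending order and truncates.
    return [token for token, _ in counts.most_common(vocab_size)]


vocab = build_vocab(tokenized_lines, VOCAB_SIZE)
print("The vocabulary size is :", len(vocab))
print("Vocabulary entries are :", vocab)
```

`Counter.most_common(n)` combines the sort and the slice into one call, which keeps the sketch short; the explicit `sorted(...)` followed by `[:VOCAB_SIZE]` used in the snippet above gives the same ranking by frequency.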
