
Gensim - Creating a Dictionary
The previous chapter, on vectors and models, introduced the idea of a dictionary. This chapter looks at the Dictionary object in more detail.
What is a Dictionary?
Before diving into the concept of a dictionary, let’s understand some simple NLP concepts −
Token − A token means a ‘word’.
Document − A document refers to a sentence or paragraph.
Corpus − A corpus is a collection of documents, represented here as a bag of words (BoW).
For every document, the corpus stores the id of each token together with the number of times that token occurs in the document.
Let’s move on to the concept of a dictionary in Gensim. To work on text documents, Gensim requires the words, i.e. tokens, to be converted to unique ids. For this purpose it provides the Dictionary object, which maps each word to a unique integer id. To build one, we convert the input text to lists of words and pass them to corpora.Dictionary().
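The idea of mapping each token to an integer id can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how Gensim implements Dictionary: the helper name build_token_ids is hypothetical, and ids are handed out in first-seen order, which may differ from the order Gensim uses.

```python
def build_token_ids(tokenized_docs):
    """Hypothetical helper: give every distinct token the next free integer id."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                # The next free id equals the number of tokens seen so far.
                token2id[token] = len(token2id)
    return token2id


# Made-up sample data: two documents that are already split into tokens.
docs = [["deep", "learning", "toolkit"], ["open", "source", "toolkit"]]
token2id = build_token_ids(docs)
print(token2id)
```

The repeated token "toolkit" receives a single id, which is the property that makes the mapping usable as a vocabulary.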
Need for a Dictionary
Why do we need the dictionary object, and where is it used? In Gensim, the dictionary object is used to create a bag of words (BoW) corpus, which in turn serves as the input to topic modelling and other models.
Forms of Text Inputs
We can provide input text to Gensim in three different forms −
As sentences stored in a native Python list (each sentence being a str object in Python 3)
As one single text file (small or large)
As multiple text files
Creating a Dictionary Using Gensim
As discussed, a Gensim dictionary maps every word (token) to a unique integer id. We can create a dictionary from a list of sentences or from one or more text files, each containing multiple lines of text. Let’s start by creating a dictionary from a list of sentences.
From a List of Sentences
In the following example we create a dictionary from a list of sentences. Each sentence must first be converted to a list of words, and a list comprehension is one of the most common ways to do this.
Implementation Example
First, import the required packages −
import gensim
from gensim import corpora
from pprint import pprint
Next, define the list of sentences (documents) from which the dictionary will be created −
doc = [
    "CNTK formerly known as Computational Network Toolkit",
    "is a free easy-to-use open-source commercial-grade toolkit",
    "that enable us to train deep learning algorithms to learn like the human brain."
]
Next, we need to split the sentences into words. This step is called tokenisation.
text_tokens = [[token for token in sentence.split()] for sentence in doc]
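Note that split() separates on whitespace only, so case and punctuation are preserved; 'brain.' appears with its full stop in the mapping printed for this example. If that is not wanted, a small regular-expression tokeniser is one alternative. The helper below is a hypothetical sketch, not part of Gensim.

```python
import re


def tokenize(sentence):
    """Hypothetical helper: lowercase the text and keep runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


sentence = "to learn like the human brain."
whitespace_tokens = sentence.split()   # punctuation stays attached
regex_tokens = tokenize(sentence)      # punctuation is dropped

print(whitespace_tokens[-1])
print(regex_tokens[-1])
```

Whichever tokeniser is chosen, it should be applied consistently, because 'brain.' and 'brain' would otherwise receive two different ids.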
Now we can create the dictionary as follows −
dict_LoS = corpora.Dictionary(text_tokens)
Printing the dictionary shows some information about it, such as the number of unique tokens −
print(dict_LoS)
Output
Dictionary(27 unique tokens: ['CNTK', 'Computational', 'Network', 'Toolkit', 'as']...)
We can also see the word to unique integer mapping as follows −
print(dict_LoS.token2id)
Output
{ 'CNTK': 0, 'Computational': 1, 'Network': 2, 'Toolkit': 3, 'as': 4, 'formerly': 5, 'known': 6, 'a': 7, 'commercial-grade': 8, 'easy-to-use': 9, 'free': 10, 'is': 11, 'open-source': 12, 'toolkit': 13, 'algorithms': 14, 'brain.': 15, 'deep': 16, 'enable': 17, 'human': 18, 'learn': 19, 'learning': 20, 'like': 21, 'that': 22, 'the': 23, 'to': 24, 'train': 25, 'us': 26 }
Complete Implementation Example
import gensim
from gensim import corpora
from pprint import pprint

doc = [
    "CNTK formerly known as Computational Network Toolkit",
    "is a free easy-to-use open-source commercial-grade toolkit",
    "that enable us to train deep learning algorithms to learn like the human brain."
]
# Split every sentence into a list of words (tokenisation).
text_tokens = [[token for token in sentence.split()] for sentence in doc]
# Map every unique token to an integer id.
dict_LoS = corpora.Dictionary(text_tokens)
print(dict_LoS.token2id)
From Single Text File
In the following example we create a dictionary from a single text file. A dictionary can be created from several text files (i.e. a directory of files) in a similar fashion.
For this, we have saved the document used in the previous example in a text file named doc.txt. The file is read line by line and each line is processed with simple_preprocess, so the complete file never has to be loaded into memory at once.
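The line-by-line idea can be sketched without Gensim. In the snippet below, the generator iter_tokenized_lines is a hypothetical helper, lower().split() merely stands in for simple_preprocess, and the file is a throwaway one created in a temporary directory.

```python
import os
import tempfile


def iter_tokenized_lines(path):
    """Yield one list of tokens per line; only the current line is held in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()


# Write a small throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("first line of text\nsecond line\n")

    token_lists = list(iter_tokenized_lines(sample_path))

print(token_lists)
```

Passing such a generator to corpora.Dictionary() lets the dictionary be built while the file is being streamed; list() is used here only to display the result.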
Implementation Example
First, import the required packages −
import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
import os
The following lines create a Gensim dictionary from the single text file named doc.txt −
dict_STF = corpora.Dictionary(
    simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)
Now let’s print some information about the dictionary, such as the number of unique tokens. Because simple_preprocess lowercases the text, splits it on punctuation and drops one-letter tokens, the tokens differ from those of the previous example −
print(dict_STF)
Output
Dictionary(28 unique tokens: ['as', 'cntk', 'computational', 'formerly', 'known']...)
We can also see the word to unique integer mapping as follows −
print(dict_STF.token2id)
Output
{ 'as': 0, 'cntk': 1, 'computational': 2, 'formerly': 3, 'known': 4, 'network': 5, 'toolkit': 6, 'commercial': 7, 'easy': 8, 'free': 9, 'grade': 10, 'is': 11, 'open': 12, 'source': 13, 'to': 14, 'use': 15, 'algorithms': 16, 'brain': 17, 'deep': 18, 'enable': 19, 'human': 20, 'learn': 21, 'learning': 22, 'like': 23, 'that': 24, 'the': 25, 'train': 26, 'us': 27 }
Complete Implementation Example
import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
import os

# Build the dictionary while streaming doc.txt one line at a time.
dict_STF = corpora.Dictionary(
    simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
)
print(dict_STF.token2id)
From Multiple Text Files
Now let’s create a dictionary from multiple files, i.e. more than one text file saved in the same directory. For this example, we have created three text files named first.txt, second.txt and third.txt, which contain the three lines of the text file (doc.txt) used in the previous example. All three files are saved in a directory named ABC.
Implementation Example
To implement this, we need to define a class with a method that iterates through all three text files (first.txt, second.txt and third.txt) in the directory (ABC) and yields the processed lists of word tokens.
Let’s define a class named Read_files with a method named __iter__() as follows −
class Read_files(object):
    def __init__(self, directoryname):
        self.directoryname = directoryname

    def __iter__(self):
        # sorted() makes the file order, and therefore the token ids, reproducible
        for fname in sorted(os.listdir(self.directoryname)):
            for line in open(os.path.join(self.directoryname, fname), encoding='latin'):
                yield simple_preprocess(line)
Next, we need to provide the path of the directory as follows −
path = "ABC"
#provide the path as per your computer system where you saved the directory.
The next steps are similar to those of the previous examples. The following lines create a Gensim dictionary from the directory containing the three text files. The token ids depend on the order in which the files are read; the output shown here assumes the order first.txt, second.txt, third.txt, in which case simple_preprocess yields the same tokens as in the single-file example −
dict_MUL = corpora.Dictionary(Read_files(path))
print(dict_MUL)
Output
Dictionary(28 unique tokens: ['as', 'cntk', 'computational', 'formerly', 'known']...)
Now we can also see the word to unique integer mapping as follows −
print(dict_MUL.token2id)
Output
{ 'as': 0, 'cntk': 1, 'computational': 2, 'formerly': 3, 'known': 4, 'network': 5, 'toolkit': 6, 'commercial': 7, 'easy': 8, 'free': 9, 'grade': 10, 'is': 11, 'open': 12, 'source': 13, 'to': 14, 'use': 15, 'algorithms': 16, 'brain': 17, 'deep': 18, 'enable': 19, 'human': 20, 'learn': 21, 'learning': 22, 'like': 23, 'that': 24, 'the': 25, 'train': 26, 'us': 27 }
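One reason to wrap the file reading in a class with __iter__(), rather than in a one-shot generator, is that every new loop over the object starts a fresh pass over the files. The sketch below demonstrates this with a hypothetical stand-in class, LineTokens, two throwaway files, and lower().split() in place of simple_preprocess.

```python
import os
import tempfile


class LineTokens:
    """Hypothetical stand-in for Read_files: yields one token list per line."""

    def __init__(self, directoryname):
        self.directoryname = directoryname

    def __iter__(self):
        for fname in sorted(os.listdir(self.directoryname)):
            file_path = os.path.join(self.directoryname, fname)
            with open(file_path, encoding="utf-8") as handle:
                for line in handle:
                    yield line.lower().split()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small sample files (made-up content).
    for name, text in [("first.txt", "one two\n"), ("second.txt", "three\n")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    stream = LineTokens(tmp_dir)
    first_pass = list(stream)
    second_pass = list(stream)  # a fresh iteration that reads the files again

print(first_pass == second_pass)
```

Building a dictionary needs only one pass, but the same object can then be reused for later steps that iterate over the documents again.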
Saving and Loading a Gensim Dictionary
Gensim provides a native save() method to write a dictionary to disk and a load() method to read it back.
For example, we can save a dictionary created earlier, such as dict_LoS, as follows −
dict_LoS.save(filename)
# filename is the path where you want to save the dictionary.
Similarly, we can load the saved dictionary by using the load() method −
loaded_dict = corpora.Dictionary.load(filename)
# filename is the path where you saved the dictionary.