Gensim - Creating TF-IDF Matrix



Here, we will learn how to create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix with the help of Gensim.

What is TF-IDF?

The Term Frequency-Inverse Document Frequency (TF-IDF) model is also a bag-of-words model. It differs from the regular corpus in that it down-weights tokens, i.e. words, that appear frequently across documents. During initialisation, the TF-IDF model algorithm expects a training corpus with integer values (such as a Bag-of-Words corpus).

After that, at the time of transformation, it takes a vector representation as input and returns another vector representation. The output vector has the same dimensionality, but the values of features that were rare at training time are increased. It basically converts integer-valued vectors into real-valued vectors.

How Is It Computed?

The TF-IDF model computes the weights with the help of the following two simple steps −

Step 1: Multiply the Local and Global Components

In this first step, the model multiplies a local component such as TF (Term Frequency) with a global component such as IDF (Inverse Document Frequency).

Step 2: Normalise the Result

Once done with the multiplication, in the next step the TF-IDF model normalises the result to unit length.

As a result of these two steps, words occurring frequently across the documents get down-weighted, as the sketch below illustrates.
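To make the two steps concrete, here is a minimal sketch in plain Python, assuming Gensim's default weighting (raw term counts for TF and a base-2 logarithm for IDF). It reproduces, by hand, the weights for the first sentence of the example used later in this chapter −

import math

total_docs = 3                               # documents in the corpus
tf = {'hello': 1, 'how': 1, 'are': 1}        # term counts in one document
df = {'hello': 1, 'how': 2, 'are': 2}        # documents containing each term
# ('you' is omitted: it occurs in every document, so its IDF is zero)

# Step 1: multiply the local (TF) and global (IDF) components
weights = {t: c * math.log2(total_docs / df[t]) for t, c in tf.items()}

# Step 2: normalise the resulting vector to unit length
norm = math.sqrt(sum(w ** 2 for w in weights.values()))
weights = {t: round(w / norm, 2) for t, w in weights.items()}

print(weights)   # {'hello': 0.89, 'how': 0.33, 'are': 0.33}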

How to get TF-IDF Weights?

Here, we will implement an example to see how we can get TF-IDF weights. Basically, in order to get TF-IDF weights, first we need to train the corpus and then apply that corpus within the TF-IDF model.

Train the Corpus

As said above, to get the TF-IDF weights we first need to train our corpus. To begin, we need to import all the necessary packages as follows −

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

Now provide a list containing the sentences. We have three sentences in our list −

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

Next, tokenise the sentences as follows −

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
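For reference, simple_preprocess() lowercases the text, strips punctuation and drops very short tokens, so we can inspect the result −

print(doc_tokenized)

Output

[['hello', 'how', 'are', 'you'], ['how', 'do', 'you', 'do'], ['hey', 'what', 'are', 'you', 'doing', 'yes', 'you', 'what', 'are', 'you', 'doing']]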

Create an object of corpora.Dictionary() as follows −

dictionary = corpora.Dictionary()

Now pass these tokenised sentences to the dictionary.doc2bow() method as follows −

BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

Next, we will get the word ids and their frequencies in our documents −

for doc in BoW_corpus:
   print([[dictionary[id], freq] for id, freq in doc])

Output

[['are', 1], ['hello', 1], ['how', 1], ['you', 1]]
[['how', 1], ['you', 1], ['do', 2]]
[['are', 2], ['you', 3], ['doing', 2], ['hey', 1], ['what', 2], ['yes', 1]]

In this way, we have trained our corpus (a Bag-of-Words corpus).

Next, we need to apply this trained corpus within the TF-IDF model, models.TfidfModel().

First, import the numpy package, along with the models module from Gensim −

import numpy as np
from gensim import models

Now pass our trained corpus (BoW_corpus) to models.TfidfModel() −

tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')
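The smartirs parameter chooses the weighting scheme in SMART notation: in 'ntc', n stands for natural (raw) term frequency, t for IDF-based document weighting, and c for cosine normalisation, i.e. exactly the two steps described earlier. As a variant (shown only for illustration, not needed for this chapter), a logarithmically scaled term frequency can be requested instead −

tfidf_log = models.TfidfModel(BoW_corpus, smartirs='ltc')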

Next, we will get the word ids and their weights in our TF-IDF modelled corpus −

for doc in tfidf[BoW_corpus]:
   print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

Output

[['are', 0.33], ['hello', 0.89], ['how', 0.33]]
[['how', 0.18], ['do', 0.98]]
[['are', 0.23], ['doing', 0.62], ['hey', 0.31], ['what', 0.62], ['yes', 0.31]]

From the above two outputs, we can see the difference between the raw frequencies in the Bag-of-Words corpus and the TF-IDF weights of the words in our documents.
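Gensim's TfidfModel keeps the precomputed IDF values in its idfs attribute, a mapping from token id to IDF weight; we can inspect it to see which tokens the model considers rare −

print({dictionary[id]: np.around(idf, decimals=2) for id, idf in tfidf.idfs.items()})

Here 'you' occurs in all three documents, so its IDF is 0.0 and it is dropped from every TF-IDF vector, while tokens occurring in only one document, such as 'hello', get the highest IDF.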

Complete Implementation Example

import gensim
import pprint
from gensim import corpora, models
from gensim.utils import simple_preprocess
import numpy as np

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

# Tokenise the sentences
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]

# Build the dictionary and the Bag-of-Words corpus
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
for doc in BoW_corpus:
   print([[dictionary[id], freq] for id, freq in doc])

# Train the TF-IDF model and print the resulting weights
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')
for doc in tfidf[BoW_corpus]:
   print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

Difference in Weight of Words

As discussed above, words that occur more frequently across the documents get smaller weights. Let us understand the difference in the weights of the words from the above two outputs. The word 'are' occurs in two documents and has been weighted down. Similarly, the word 'you' appears in all three documents and has been removed altogether, because its IDF is zero.
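The trained model can also transform a document it has never seen, as long as the document is converted to bag-of-words form with the same dictionary. A minimal sketch (the sentence below is a hypothetical new document, not part of the training corpus) −

new_doc = "Hello, how are you doing?"
new_bow = dictionary.doc2bow(simple_preprocess(new_doc))
print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in tfidf[new_bow]])

Here again, 'you' disappears from the output vector because its IDF weight is zero.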
