Finding the Word Analogy from given words using Word2Vec embeddings
In this article, we will build a machine learning program that can find word analogies from given words. For example: "Apple : fruit :: car : vehicle".
In this analogy, "apple" and "car" are the two things being compared, while "fruit" and "vehicle" are the categories they belong to. The analogy states that an apple is a type of fruit, just as a car is a type of vehicle.
While the human brain can easily identify such patterns, teaching a machine the same task requires training on a very large corpus. We will use the Word2Vec model with Google's pre-trained GoogleNews-vectors-negative300 model, which contains embeddings for about 3 million words and phrases. The model file can be downloaded from Kaggle.
Installation
To use Word2Vec, we need to install the gensim library:
pip install gensim
Word Analogy Implementation
We'll use the most_similar() method from the Word2Vec model to solve analogies. The method uses vector arithmetic: king - man + woman = queen.
Example
from gensim.models import KeyedVectors

# Loading the pre-trained Word2Vec model
# (limit=50000 keeps memory usage manageable)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True, limit=50000)

# Word analogy function
def word_analogy(w_a, w_b, w_c):
    try:
        # Performing word analogy calculation: w_a is to w_b as w_c is to ?
        result = model.most_similar(positive=[w_b, w_c], negative=[w_a], topn=1)
        return result[0][0]
    except KeyError:
        return "One or more words not found in the vocabulary."

# Example: king is to queen as man is to ?
w_a = 'king'
w_b = 'queen'
w_c = 'man'
w_d = word_analogy(w_a, w_b, w_c)
print(f"Word analogy: {w_a}:{w_b} :: {w_c}:{w_d}")
Output
Word analogy: king:queen :: man:woman
How It Works
The Word2Vec model represents words as high-dimensional vectors. The most_similar() method performs vector arithmetic:
- positive: Vectors to add (w_b + w_c)
- negative: Vectors to subtract (w_a)
- topn: Number of most similar results to return
The calculation finds: vector(w_b) - vector(w_a) + vector(w_c), which gives us the analogy result.
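The arithmetic behind most_similar() can be illustrated without loading the 3.6GB model. The sketch below uses tiny made-up 3-dimensional vectors (standing in for the real 300-dimensional GoogleNews embeddings) and ranks candidate words by cosine similarity to vector(w_b) - vector(w_a) + vector(w_c); the vocabulary and values are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Toy embeddings standing in for the 300-dimensional GoogleNews vectors;
# the values are invented purely for illustration.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.2, 0.2]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the vectors over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(positive, negative, topn=1):
    # Target vector: sum of positives minus sum of negatives,
    # i.e. vector(w_b) + vector(w_c) - vector(w_a)
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Exclude the query words themselves, then rank by cosine similarity
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target),
                    reverse=True)
    return [(w, cosine(vectors[w], target)) for w in ranked[:topn]]

print(most_similar(positive=["queen", "man"], negative=["king"]))
```

With these toy vectors, "woman" comes out on top, mirroring what the real model computes over its full vocabulary.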
Key Points
- Download the GoogleNews-vectors-negative300.bin file from Kaggle
- Set limit=50000 to reduce memory usage (adjust as needed)
- The model file is large (~3.6GB) due to extensive training data
- Handle KeyError exceptions for words not in the vocabulary
Conclusion
Word2Vec embeddings enable machines to solve word analogies through vector arithmetic. The pre-trained GoogleNews model provides robust word relationships for accurate analogy completion.
