Finding the Word Analogy from given words using Word2Vec embeddings

In this article, we will build a program that completes word analogies from given words. For example: "Apple : fruit :: car : vehicle".

In this analogy, "apple" and "car" are the two things being compared. "Fruit" and "vehicle" are the categories that these items belong to. The analogy states that apple is a type of fruit, just as car is a type of vehicle.

While the human brain can easily identify such patterns, a machine needs a very large amount of training data to perform the same task. We will use the Word2Vec model with Google's pre-trained GoogleNews-vectors-negative300 model, which contains 300-dimensional embeddings for about 3 million words and phrases. The dataset file can be downloaded from Kaggle.

Installation

To use Word2Vec, we need to install the gensim library:

pip install gensim

Word Analogy Implementation

We'll use the most_similar() method from the Word2Vec model to solve analogies. The method relies on vector arithmetic: vector('king') - vector('man') + vector('woman') is approximately vector('queen').

Example

from gensim.models import KeyedVectors

# Loading the pre-trained Word2Vec model
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True, limit=50000)

# Word analogy function
def word_analogy(w_a, w_b, w_c):
    try:
        # Performing word analogy calculation: w_a is to w_b as w_c is to ?
        result = model.most_similar(positive=[w_b, w_c], negative=[w_a], topn=1)
        return result[0][0]
    except KeyError:
        return "One or more words not found in the vocabulary."

# Example: king is to queen as man is to ?
w_a = 'king'
w_b = 'queen' 
w_c = 'man'
w_d = word_analogy(w_a, w_b, w_c)

print(f"Word analogy: {w_a}:{w_b} :: {w_c}:{w_d}")

Output

Word analogy: king:queen :: man:woman

How It Works

The Word2Vec model represents words as high-dimensional vectors. The most_similar() method performs vector arithmetic:

  • positive: Vectors to add (w_b + w_c)
  • negative: Vectors to subtract (w_a)
  • topn: Number of most similar results to return

The calculation finds: vector(w_b) - vector(w_a) + vector(w_c), which gives us the analogy result.
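The same arithmetic can be sketched with plain NumPy. The tiny 2-D "embeddings" below are made-up values purely for illustration (real Word2Vec vectors are 300-dimensional), but the ranking step mirrors what most_similar() does internally:

```python
import numpy as np

# Toy 2-D "embeddings" -- hypothetical values, not real Word2Vec vectors
vectors = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.2]),
    'man':   np.array([0.3, 0.8]),
    'woman': np.array([0.3, 0.2]),
    'apple': np.array([0.1, 0.5]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# king is to queen as man is to ?
target = vectors['queen'] - vectors['king'] + vectors['man']

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: cosine(target, vec) for w, vec in vectors.items()
              if w not in ('king', 'queen', 'man')}
best = max(candidates, key=candidates.get)
print(best)  # -> woman
```

With these toy values the arithmetic works out exactly, since the "gender" and "royalty" directions were constructed to be parallel; with real embeddings the result is only approximate, which is why most_similar() returns a ranked list.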

Key Points

  • Download the GoogleNews-vectors-negative300.bin file from Kaggle
  • Set limit=50000 to reduce memory usage (adjust as needed)
  • The model file is large (~3.6GB) because it stores a 300-dimensional vector for each of its roughly 3 million words and phrases
  • Handle KeyError exceptions for words not in vocabulary
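The KeyError handling mentioned above can also be done proactively by checking vocabulary membership before querying. The sketch below uses a plain set as a stand-in vocabulary so it runs without the 3.6GB model; with a real model the equivalent test is `word in model.key_to_index`. The words and the misspelling are illustrative assumptions:

```python
# Stand-in "vocabulary" for illustration; with a real gensim model you
# would test membership with:  word in model.key_to_index
vocabulary = {'king', 'queen', 'man', 'woman'}

def check_words(*words):
    """Return the list of words missing from the vocabulary."""
    return [w for w in words if w not in vocabulary]

missing = check_words('king', 'quene', 'man')  # note the typo 'quene'
if missing:
    print(f"Not in vocabulary: {missing}")
else:
    print("All words found; safe to call most_similar().")
```

Checking up front lets you report exactly which input word is missing, rather than catching a generic KeyError after the fact.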

Conclusion

Word2Vec embeddings enable machines to solve word analogies through vector arithmetic. The pre-trained GoogleNews model provides robust word relationships for accurate analogy completion.

Updated on: 2026-03-27T14:53:09+05:30
