Finding the Word Analogy from given words using Word2Vec embeddings
In this article, we will build a machine learning program that can find word analogies from given words. For example: "Apple : fruit :: car : vehicle".
In this analogy, "apple" and "car" are the two things being compared, while "fruit" and "vehicle" are the categories they belong to. The analogy states that an apple is a type of fruit, just as a car is a type of vehicle.
While the human brain can easily identify such patterns, teaching a machine the same task requires training on a very large corpus. We will use the Word2Vec model with Google's pre-trained GoogleNews-vectors-negative300 model, which contains embeddings for about 3 million words and phrases. The model file can be downloaded from Kaggle.
Installation
To use Word2Vec, we need to install the gensim library:
pip install gensim
Word Analogy Implementation
We'll use the most_similar() method from the Word2Vec model to solve analogies. The method uses vector arithmetic: king - man + woman = queen.
Example
from gensim.models import KeyedVectors

# Loading the pre-trained Word2Vec model
# (limit=50000 keeps memory usage manageable)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True, limit=50000)

# Word analogy function
def word_analogy(w_a, w_b, w_c):
    try:
        # Performing word analogy calculation: w_a is to w_b as w_c is to ?
        result = model.most_similar(positive=[w_b, w_c], negative=[w_a], topn=1)
        return result[0][0]
    except KeyError:
        return "One or more words not found in the vocabulary."

# Example: king is to queen as man is to ?
w_a = 'king'
w_b = 'queen'
w_c = 'man'
w_d = word_analogy(w_a, w_b, w_c)
print(f"Word analogy: {w_a}:{w_b} :: {w_c}:{w_d}")
Output
Word analogy: king:queen :: man:woman
How It Works
The Word2Vec model represents words as high-dimensional vectors. The most_similar() method performs vector arithmetic:
- positive: Vectors to add (w_b + w_c)
- negative: Vectors to subtract (w_a)
- topn: Number of most similar results to return
The calculation finds: vector(w_b) - vector(w_a) + vector(w_c), which gives us the analogy result.
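The arithmetic behind most_similar() can be illustrated without loading the 3.6GB model. The sketch below uses tiny made-up 3-dimensional vectors (standing in for the real 300-dimensional GoogleNews embeddings) and ranks candidate words by cosine similarity to vector(w_b) - vector(w_a) + vector(w_c); the vocabulary and values are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Toy embeddings standing in for the 300-dimensional GoogleNews vectors;
# the values are invented purely for illustration.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.2, 0.2]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the vectors over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(positive, negative, topn=1):
    # Target vector: sum of positives minus sum of negatives,
    # i.e. vector(w_b) + vector(w_c) - vector(w_a)
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Exclude the query words themselves, then rank by cosine similarity
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target),
                    reverse=True)
    return [(w, cosine(vectors[w], target)) for w in ranked[:topn]]

print(most_similar(positive=["queen", "man"], negative=["king"]))
```

With these toy vectors, "woman" comes out on top, mirroring what the real model computes over its full vocabulary.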
Key Points
- Download the GoogleNews-vectors-negative300.bin file from Kaggle
- Set limit=50000 to reduce memory usage (adjust as needed)
- The model file is large (~3.6GB) due to extensive training data
- Handle KeyError exceptions for words not in the vocabulary
Conclusion
Word2Vec embeddings enable machines to solve word analogies through vector arithmetic. The pre-trained GoogleNews model provides robust word relationships for accurate analogy completion.
