Understanding Word Embeddings in NLP


Word embeddings play a crucial role in Natural Language Processing (NLP) by providing numerical representations of words that capture their semantic and syntactic properties. These distributed representations enable machines to process and understand human language more effectively.

In this article, we will delve into the fundamentals, popular embedding models, practical aspects, evaluation techniques, and advanced topics related to word embeddings in NLP.

Fundamentals of Word Embeddings

Word embeddings are dense, low-dimensional vectors that represent words in a continuous vector space. They aim to capture the meaning and relationships between words based on their context in a given corpus. Instead of representing words as sparse and high-dimensional one-hot vectors, word embeddings encode semantic and syntactic information in a more compact and meaningful way.
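
To make the contrast concrete, here is a small illustrative sketch using NumPy; the toy vocabulary and the embedding size are arbitrary choices made purely for demonstration −

import numpy as np

# A toy vocabulary; real corpora contain tens of thousands of words
vocabulary = ["king", "queen", "apple", "banana", "throne"]
vocab_size = len(vocabulary)

# One-hot representation: a sparse vector as long as the vocabulary,
# with a single 1 at the word's index
one_hot_king = np.zeros(vocab_size)
one_hot_king[vocabulary.index("king")] = 1
print(one_hot_king)           # [1. 0. 0. 0. 0.]

# Dense embedding: a short, continuous-valued vector (random here; in
# practice it is learned from data so that related words end up close)
embedding_dim = 4
embedding_king = np.random.rand(embedding_dim)
print(embedding_king.shape)   # (4,)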

Popular Word Embedding Models

Below are some of the popular word embedding models −

Word2Vec

Word2Vec introduced the concept of distributed word representations and popularized the use of neural networks for generating word embeddings. It offers two architectures: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts a target word given its context, while Skip-gram predicts the context words given a target word. Word2Vec models are trained on large amounts of unlabeled text data.

The model follows two key steps −

  • Training − Word2Vec learns word embeddings by examining the neighboring words in a text corpus, using either the CBOW architecture (predicting the current word from its context) or the Skip-gram architecture (predicting the surrounding words from the current word).

  • Vector Representation − After training, each word is represented as a fixed-length dense vector in a continuous space. The resulting embeddings capture semantic relationships between words, with similar words having closer vector representations.

To use Word2Vec, you can tokenize your text data, feed it to the model, and retrieve the word embeddings for various NLP tasks.

Example

Below is an example program that trains a small Word2Vec model using the Gensim library −

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Text data preparation
text_data = "This is an example sentence. Another sentence follows."

# Tokenization
tokens = word_tokenize(text_data)

# Word2Vec model training
model = Word2Vec([tokens], min_count=1)

# Get word embeddings
word_embeddings = model.wv

# Accessing embeddings
print(word_embeddings['example'])

Output (the exact values vary from run to run, since the model is initialized randomly)

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.00292326 -0.00279472
 -0.00692952 -0.00812826  0.00830918  0.00199049 -0.00932802 -0.00479272
  0.00313674 -0.00471321  0.00528084 -0.00423344  0.0026418  -0.00804569
  0.00620989  0.00481889  0.00078719  0.00301345]

GloVe (Global Vectors for Word Representation)

GloVe is another widely used word embedding model. It leverages both global word co-occurrence statistics and local context windows to capture word meanings. GloVe embeddings are trained on a co-occurrence matrix, which represents the statistical relationships between words in a corpus.

Here are the steps involved in training GloVe embeddings −

  • Create a word-context co-occurrence matrix from a large corpus, where each entry represents the frequency of a word co-occurring with another word in a fixed-size context window.

  • Initialize word vectors randomly.

  • Use the co-occurrence matrix to learn the word vectors by minimizing a weighted least-squares difference between the dot products of word vectors and the logarithms of their co-occurrence counts.

  • Iterate over the matrix and adjust the word vectors until convergence.

  • After training, the learned word vectors encode semantic relationships between words, enabling operations like word similarity and analogy completion.

GloVe provides a computationally efficient approach to generating word embeddings that capture both local and global word relationships.

Example

Training GloVe from scratch requires a dedicated implementation, so the simplified program below only approximates the idea with a count-based pipeline (TF-IDF followed by truncated SVD, in the style of latent semantic analysis) rather than running the actual GloVe algorithm −

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# Text data preparation
text_data = ["This is an example sentence.", "Another sentence follows."]

# Tokenization
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(text_data)

# TF-IDF transformation
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(count_matrix)

# Singular Value Decomposition (SVD)
svd = TruncatedSVD(n_components=2)  # Reduce to 2 dimensions
embeddings = svd.fit_transform(tfidf_matrix)

# Accessing embeddings (one dense row per input sentence in this simplified illustration)
print(embeddings)

Output

[[ 0.75849858  0.65167469]
 [ 0.75849858 -0.65167469]]
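
For real GloVe vectors, the usual approach is to load a pre-trained model rather than train one from scratch. The sketch below assumes the Gensim library and its downloader utility, which fetches the vectors over the network −

import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe vectors (trained on Wikipedia + Gigaword)
glove_vectors = api.load("glove-wiki-gigaword-50")

# Accessing the embedding of a word
print(glove_vectors["example"])

# Nearest neighbours by cosine similarity
print(glove_vectors.most_similar("sentence", topn=3))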

FastText

FastText extends the idea of word embeddings to subword-level representations. It represents words as bags of character n-grams and generates embeddings based on these subword units. FastText embeddings can handle out-of-vocabulary words effectively, as they can be composed of subword embeddings.

Here are the steps involved when working with FastText −

  • Tokenization − Break sentences into individual words or subwords.

  • Training data − Prepare a text corpus for training the FastText model.

  • Model Training − Train the FastText model on the tokenized text corpus.

  • Subword embeddings − Generate embeddings for both words and subwords.

  • Out-of-vocabulary (OOV) words − Handle OOV words by averaging subword embeddings.

  • Word similarity − Measure word similarity using cosine similarity between embeddings.

  • Downstream tasks − Utilize FastText embeddings for various NLP tasks like text classification or sentiment analysis.

FastText's subword modeling helps in capturing information for rare or unseen words and improves performance in morphologically rich languages.

Example

Below is an example program that trains a FastText model using the fasttext library −

import fasttext
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Text data preparation
text_data = "This is an example sentence. Another sentence follows."

# Tokenization
tokens = word_tokenize(text_data)

# Saving tokens to a file
with open('text_data.txt', 'w') as file:
   file.write(' '.join(tokens))

# FastText model training (minCount=1 keeps the rare words in this tiny corpus)
model = fasttext.train_unsupervised('text_data.txt', model='skipgram', minCount=1)

# Get word embeddings
word_embeddings = model.get_word_vector('example')

# Accessing embeddings
print(word_embeddings)

Output

[ 0.03285718 -0.01526352 -0.02881184 -0.00897612  0.0460813  -0.02043175  0.03802227 -0.00231849 -0.04373281 -0.02345613  0.04132561  0.02593898 -0.03548125 -0.02176061 -0.00718064  0.02202878  0.01905638  0.01388955 -0.02727601  0.01051432 -0.02827209 -0.01180033  0.02789263 -0.02217032 -0.00819697 -0.01387899 -0.04028311 -0.01399185  0.00222543 -0.00437792 -0.01352429  0.00902361  0.0341314  -0.04119079  0.03299914 -0.01110766 -0.02954799  0.00932125  0.02062443  0.00341501 -0.03225482 -0.03569973 -0.03264207  0.00164015  0.02864997 -0.01425406  0.00099312 -0.00711453  0.00534453 -0.02709763 -0.03474019  0.01898332 -0.01320734  0.02728367  0.00637779 -0.02667361  0.0090644   0.00815791  0.00375441 -0.01883233 -0.01119692 -0.00259154  0.00825689 -0.00366063 -0.03051898 -0.0018206  0.03409107 -0.01777094 -0.00757413 -0.00613379 -0.03341368  0.02008897 -0.00342503  0.00976928  0.00776702 -0.02941767 -0.02306498  0.03264163  0.01472706  0.01123447 -0.03174553  0.02913557  0.01298951 -0.00645978  0.03404429 -0.00828668 -0.00181118  0.00852771 -0.00237192 -0.00824729 -0.02397284  0.00087284 -0.00495328 -0.01262816  0.01932779  0.00314868  0.02070006 -0.0060814   0.01978939 -0.03188471]

Preparing Text Data for Word Embeddings

Before generating word embeddings, it is essential to pre-process the text data. Some common pre-processing steps, illustrated in the sketch after this list, include −

  • Tokenization − Splitting text into individual words or subword units.

  • Lowercasing − Converting all words to lowercase to treat words with different cases as the same.

  • Removing Punctuation − Eliminating punctuation marks, as they do not carry significant semantic information.

  • Stop Word Removal − Removing common words (e.g., "and," "the," "is") that occur frequently but do not contribute much to the meaning.
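
A minimal pre-processing sketch, assuming NLTK's tokenizer and stop word list (the punkt and stopwords resources are downloaded first), could look like this −

import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "This is an example sentence, and another sentence follows."

# Tokenization and lowercasing
tokens = [token.lower() for token in word_tokenize(text)]

# Removing punctuation
tokens = [token for token in tokens if token not in string.punctuation]

# Stop word removal
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

print(tokens)   # ['example', 'sentence', 'another', 'sentence', 'follows']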

Handling Out-of-Vocabulary Words

Out-of-vocabulary (OOV) words are words that do not appear in the vocabulary of a pre-trained word embedding model. Common ways to handle OOV words (see the sketch after this list) include −

  • OOV Token − Assign a specific token to represent OOV words during training or inference.

  • Subword Embeddings − Utilize models like FastText to generate subword embeddings, which can capture the meaning of unseen words based on their character n-grams.
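
As an illustration of the subword approach, the sketch below assumes the text_data.txt file created in the FastText example above; it retrieves a vector for a word that never occurs in that tiny training file, which FastText composes from the word's character n-grams −

import fasttext

# Train on the same tiny file used earlier (minCount=1 keeps its rare words)
model = fasttext.train_unsupervised('text_data.txt', model='skipgram', minCount=1)

# 'sentences' does not appear in the training data, yet a vector can still
# be built for it from its character n-grams
oov_vector = model.get_word_vector('sentences')
print(oov_vector.shape)

# Words that were actually seen during training
print(model.words)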

Evaluating Word Embeddings

To evaluate the quality and usefulness of word embeddings, various techniques can be employed −

  • Intrinsic Evaluation − Assessing embeddings based on how well their similarity scores agree with human judgments of word similarity or relatedness. Common benchmarks include WordSim-353 and the Google word-analogy dataset released with Word2Vec (see the sketch after this list).

  • Extrinsic Evaluation − Evaluating the embeddings on downstream NLP tasks such as sentiment analysis, text classification, and named entity recognition. Improved performance on these tasks indicates the effectiveness of the embeddings.
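
For intrinsic evaluation, Gensim's KeyedVectors offer convenient helpers. The sketch below assumes the pre-trained GloVe vectors loaded earlier via Gensim's downloader, and uses the copy of WordSim-353 that ships with Gensim's test utilities −

import gensim.downloader as api
from gensim.test.utils import datapath

vectors = api.load("glove-wiki-gigaword-50")

# Pairwise cosine similarity between two words
print(vectors.similarity("king", "queen"))

# Analogy completion: king - man + woman should be close to queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Correlation with human similarity judgments on WordSim-353
print(vectors.evaluate_word_pairs(datapath("wordsim353.tsv")))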

Advanced Topics in Word Embeddings

  • Contextualized Word Representations − Models like ELMo (Embeddings from Language Models), BERT (Bidirectional Encoder Representations from Transformers), and GPT (Generative Pre-trained Transformer) generate contextualized word representations. These models consider the surrounding context of a word to capture nuances and produce more accurate representations (a short BERT-based sketch follows this list).

  • Transfer Learning − Pre-trained word embeddings can be used as a starting point for various NLP tasks. Fine-tuning or transfer learning allows models to leverage the knowledge acquired during pre-training to improve performance on specific downstream tasks.
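
As a brief illustration of contextualized representations, the sketch below assumes the Hugging Face transformers library (with PyTorch) and the pre-trained bert-base-uncased model; unlike the static embeddings above, the vector obtained for a word depends on the sentence it appears in −

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank approved the loan."

# Tokenize and run the sentence through BERT
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
   outputs = model(**inputs)

# One contextual vector per token (including [CLS] and [SEP]), 768 dimensions each
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)   # e.g. torch.Size([1, 8, 768])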

Conclusion

By understanding word embeddings, the models that produce them, the pre-processing they require, and the methods used to evaluate and apply them, you will be equipped to leverage these powerful representations in various NLP tasks.

Word embeddings have revolutionized the field of NLP by enabling machines to understand and process human language more effectively, opening doors to improved language understanding, machine translation, question answering, and many other applications.
