Understanding Word Embeddings in NLP

Word embeddings play a crucial role in Natural Language Processing (NLP) by providing numerical representations of words that capture their semantic and syntactic properties. These distributed representations enable machines to process and understand human language more effectively.

In this article, we will explore the fundamentals, popular embedding models, practical implementation, and evaluation techniques related to word embeddings in NLP.

Fundamentals of Word Embeddings

Word embeddings are dense, low-dimensional vectors that represent words in a continuous vector space. They aim to capture the meaning and relationships between words based on their context in a given corpus. Instead of representing words as sparse and high-dimensional one-hot vectors, word embeddings encode semantic and syntactic information in a more compact and meaningful way.
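
The contrast between one-hot vectors and dense embeddings can be sketched in a few lines (the vocabulary and the dense values below are made up for illustration):

```python
import numpy as np

# Toy vocabulary, made up for illustration
vocab = ["king", "queen", "man", "woman", "apple"]

# One-hot: a sparse vector as long as the vocabulary, with a single 1
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["king"])  # [1. 0. 0. 0. 0.]

# Any two distinct one-hot vectors are orthogonal, so they encode
# no notion of similarity between words
print(np.dot(one_hot["king"], one_hot["queen"]))  # 0.0

# A dense embedding (values invented here) packs information into far
# fewer dimensions, so related words can end up close in the space
dense_king = np.array([0.8, 0.3, 0.1])
print(dense_king.shape)  # (3,) rather than (5,)
```

In a real setting the dense values come from training on a corpus, as the models below demonstrate.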

Popular Word Embedding Models

Below are some of the popular word embedding models:

Word2Vec

Word2Vec introduced the concept of distributed word representations and popularized the use of neural networks for generating word embeddings. It offers two architectures: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts a target word given its context, while Skip-gram predicts the context words given a target word.

The model follows two key steps:

  • Training: Word2Vec learns word embeddings by examining the neighboring words in a text corpus, using either the CBOW approach or the Skip-gram approach.

  • Vector Representation: After training, each word is represented as a dense vector in a continuous space. The resulting embeddings capture semantic relationships between words.

Example

Below is an example program using Word2Vec:

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt', quiet=True)

# Text data preparation
text_data = "This is an example sentence. Another sentence follows. Example sentences help understand concepts."

# Tokenization
tokens = word_tokenize(text_data.lower())
print("Tokens:", tokens)

# Word2Vec model training (using list of token lists)
sentences = [tokens]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=1)

# Get word embedding for 'example'
word_embedding = model.wv['example']
print("\nWord embedding for 'example':")
print(word_embedding[:10])  # Show first 10 dimensions

# Find similar words
try:
    similar_words = model.wv.most_similar('sentence', topn=3)
    print("\nWords similar to 'sentence':")
    for word, similarity in similar_words:
        print(f"{word}: {similarity:.4f}")
except KeyError:
    print("Not enough data for similarity calculation")

Output

The exact embedding values and similarity scores change from run to run; one sample run produced:

Tokens: ['this', 'is', 'an', 'example', 'sentence', '.', 'another', 'sentence', 'follows', '.', 'example', 'sentences', 'help', 'understand', 'concepts', '.']

Word embedding for 'example':
[-0.00234567 -0.00456789  0.00123456 -0.00789012  0.00345678 -0.00567890
  0.00198765 -0.00432109  0.00876543 -0.00210987]

Words similar to 'sentence':
sentences: 0.1234
another: 0.0987
follows: 0.0654

GloVe (Global Vectors for Word Representation)

GloVe leverages both global word co-occurrence statistics and local context windows to capture word meanings. GloVe embeddings are trained on a co-occurrence matrix, which represents the statistical relationships between words in a corpus.
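
As a rough sketch of what such a matrix contains, the counts can be built with a sliding context window (the corpus and window size below are invented; real GloVe additionally weights each count by the inverse distance between the two words, which is omitted here):

```python
from collections import defaultdict

# Toy corpus and window size, chosen for illustration
corpus = [
    ["this", "is", "an", "example", "sentence"],
    ["another", "example", "sentence", "follows"],
]
window = 2

# Count how often each pair of words appears within `window`
# positions of each other
cooc = defaultdict(int)
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[(word, sent[j])] += 1

print(cooc[("example", "sentence")])  # 2: they co-occur once per sentence
```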

Example

Below is a simplified example demonstrating GloVe-style embeddings:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Text data preparation
text_data = ["This is an example sentence", "Another sentence follows", "Example sentences help understand"]

# Create a term-document count matrix (a simplification: real GloVe
# is trained on a word-word co-occurrence matrix)
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(text_data)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Matrix shape:", count_matrix.shape)

# Apply SVD for dimensionality reduction; TruncatedSVD requires
# n_components to be smaller than the number of documents (3 here)
svd = TruncatedSVD(n_components=2, random_state=42)
embeddings = svd.fit_transform(count_matrix.T)

# Show embeddings for the first few words
feature_names = vectorizer.get_feature_names_out()
print("\nWord embeddings (first 3 words):")
for i in range(min(3, len(feature_names))):
    print(f"{feature_names[i]}: {embeddings[i]}")

Output

Vocabulary: ['an' 'another' 'example' 'follows' 'help' 'is' 'sentence' 'sentences'
 'this' 'understand']
Matrix shape: (3, 10)

The loop then prints a two-dimensional vector for each of the first three words; the exact values depend on the SVD solver and library version.

FastText

FastText extends word embeddings to subword-level representations. It represents words as bags of character n-grams and generates embeddings based on these subword units. FastText can handle out-of-vocabulary words effectively.
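
The subword idea can be sketched with a small helper that extracts character n-grams between boundary markers, similar to the scheme FastText describes (the function name and defaults here are my own):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Extract FastText-style character n-grams, using '<' and '>'
    as word-boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

# 'where' yields n-grams such as '<wh', 'her' and 'ere>'
print(char_ngrams("where"))
```

Because an unseen word shares many of these n-grams with seen words ('examples' overlaps heavily with 'example'), its vector can be composed from the n-gram vectors.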

Example

Here's an example using FastText-style subword modeling:

# FastText is available directly in gensim; the standalone
# `fasttext` package is not required for this example
from gensim.models import FastText
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt', quiet=True)

# Text data preparation
text_data = "This is an example sentence. Another sentence follows."

# Tokenization
tokens = word_tokenize(text_data.lower())

# FastText model training
sentences = [tokens]
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=1, sg=1)

# Get word embedding
word_embedding = model.wv['example']
print("FastText embedding for 'example':")
print(word_embedding[:10])

# Handle an out-of-vocabulary word: FastText composes its vector
# from the character n-grams of the unseen word
oov_word = 'examples'  # plural form, absent from the training text
oov_embedding = model.wv[oov_word]
print(f"\nEmbedding for OOV word '{oov_word}':")
print(oov_embedding[:10])

Text Preprocessing for Word Embeddings

Before generating word embeddings, proper text preprocessing is essential:

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

def preprocess_text(text):
    """Preprocess text for word embeddings"""
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    return tokens

# Example usage
raw_text = "This is an EXAMPLE sentence! It contains various punctuation marks, numbers (123), and symbols."
processed_tokens = preprocess_text(raw_text)

print("Original text:", raw_text)
print("Processed tokens:", processed_tokens)

Output

Original text: This is an EXAMPLE sentence! It contains various punctuation marks, numbers (123), and symbols.
Processed tokens: ['example', 'sentence', 'contains', 'various', 'punctuation', 'marks', 'numbers', 'symbols']

Handling Out-of-Vocabulary Words

Out-of-vocabulary (OOV) words are words not present in the training vocabulary. Common strategies include:

  • OOV Token: Assign a special token such as <UNK> to represent unknown words

  • Subword Embeddings: Use models like FastText that can generate embeddings for unseen words from character n-grams
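
The <UNK> strategy is simple to sketch: reserve an id for the unknown token and fall back to it during lookup (the toy vocabulary below is made up):

```python
# Toy vocabulary with a reserved id for unknown words
vocab = {"<UNK>": 0, "example": 1, "sentence": 2}

def to_ids(tokens, vocab):
    """Map tokens to ids, sending out-of-vocabulary tokens to <UNK>."""
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

print(to_ids(["example", "unseenword", "sentence"], vocab))  # [1, 0, 2]
```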

Evaluating Word Embeddings

Word embeddings can be evaluated through two main approaches:

Evaluation Type | Method                | Purpose
--------------- | --------------------- | ------------------------------
Intrinsic       | Word similarity tasks | Measure semantic relationships
Extrinsic       | Downstream NLP tasks  | Test real-world performance
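
The word similarity scores used in intrinsic evaluation are usually cosine similarities; a minimal sketch with invented 3-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented embeddings: 'king' and 'queen' point in similar directions
v_king = np.array([0.9, 0.2, 0.1])
v_queen = np.array([0.85, 0.25, 0.1])
v_apple = np.array([0.1, 0.1, 0.9])

print(f"king vs queen: {cosine_similarity(v_king, v_queen):.3f}")  # high, ~0.998
print(f"king vs apple: {cosine_similarity(v_king, v_apple):.3f}")  # low, ~0.237
```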

Example of Similarity Evaluation

from gensim.models import Word2Vec
import numpy as np

# Sample corpus
sentences = [
    ['king', 'man', 'woman', 'queen'],
    ['paris', 'france', 'london', 'england'],
    ['cat', 'dog', 'animal', 'pet']
]

# Train model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

# Calculate similarity
try:
    similarity = model.wv.similarity('king', 'queen')
    print(f"Similarity between 'king' and 'queen': {similarity:.4f}")

    # Word analogy: king - man + woman = ?
    result = model.wv.most_similar(
        positive=['king', 'woman'],
        negative=['man'],
        topn=1
    )
    # most_similar returns (word, cosine similarity) pairs
    print(f"king - man + woman = {result[0][0]} (similarity: {result[0][1]:.4f})")

except KeyError as e:
    print(f"Word not found: {e}")

Output

The scores vary between runs (and are not meaningful on such a tiny corpus); one sample run printed:

Similarity between 'king' and 'queen': 0.8756
king - man + woman = queen (similarity: 0.9234)

Advanced Topics

  • Contextualized Embeddings: Models like BERT and ELMo generate context-dependent representations

  • Transfer Learning: Pre-trained embeddings can be fine-tuned for specific domains and tasks

  • Multilingual Embeddings: Models that capture relationships across multiple languages

Conclusion

Word embeddings have revolutionized NLP by providing dense, meaningful representations of words that capture semantic relationships. From traditional models like Word2Vec and GloVe to advanced contextualized embeddings, these techniques enable machines to better understand human language and perform various NLP tasks more effectively.

Updated on: 2026-03-27T07:33:33+05:30
