Python | Measure similarity between two sentences using cosine similarity
Finding the semantic similarity between sentences, words, or text is a very common task in modern Natural Language Processing use cases. Cosine similarity is a popular method that measures the cosine of the angle between two non-zero vectors, computed from their dot product and magnitudes.
In this article, let us briefly explore cosine similarity and see its implementation using Python.
What is Cosine Similarity?
Cosine similarity is defined as the cosine of the angle between two vectors in space. A sentence or text can be represented as a vector based on word frequencies. The similarity between two sentences depends upon the cosine of the angle between their vectors: the smaller the angle, the higher the similarity.
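For two vectors a and b, the formula is cos(θ) = (a · b) / (|a| × |b|). A minimal sketch with hand-made vectors (using only the standard library) illustrates the two extremes:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score (approximately) 1.0;
# vectors with no shared component score 0.0
print(cosine([1, 2], [2, 4]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal
```

Word-frequency vectors have only non-negative entries, so in practice the score always falls between 0 and 1.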
Steps to Calculate Cosine Similarity
Tokenize both sentences into individual words
Calculate word frequency for each sentence to create vectors
Compute the dot product of the two vectors
Calculate the magnitude of each vector
Apply the cosine similarity formula
Implementation Using Python
Here's a complete implementation that converts sentences to vectors and calculates their cosine similarity:
import math
import re
from collections import Counter

# Regular expression to extract words
word_pattern = re.compile(r"\w+")

def generate_vectors(sentence):
    """Convert sentence to word frequency vector"""
    words = word_pattern.findall(sentence.lower())
    return Counter(words)

def cosine_similarity(vector_1, vector_2):
    """Calculate cosine similarity between two vectors"""
    # Get intersection of words in both vectors
    common_words = set(vector_1.keys()) & set(vector_2.keys())
    # Calculate dot product (numerator)
    dot_product = sum(vector_1[word] * vector_2[word] for word in common_words)
    # Calculate squared magnitude of each vector
    magnitude_1 = sum(count ** 2 for count in vector_1.values())
    magnitude_2 = sum(count ** 2 for count in vector_2.values())
    # Calculate denominator
    denominator = math.sqrt(magnitude_1) * math.sqrt(magnitude_2)
    if not denominator:
        return 0.0
    return float(dot_product) / denominator

# Example sentences
sentence_1 = "The dog jumped into the well."
sentence_2 = "The well dries up in summer season."

# Generate vectors
vec_1 = generate_vectors(sentence_1)
vec_2 = generate_vectors(sentence_2)
print("Sentence 1 vector:", dict(vec_1))
print("Sentence 2 vector:", dict(vec_2))

# Calculate similarity
similarity = cosine_similarity(vec_1, vec_2)
print("Cosine Similarity:", similarity)
The output shows the word frequency vectors and the calculated similarity:
Sentence 1 vector: {'the': 2, 'dog': 1, 'jumped': 1, 'into': 1, 'well': 1}
Sentence 2 vector: {'the': 1, 'well': 1, 'dries': 1, 'up': 1, 'in': 1, 'summer': 1, 'season': 1}
Cosine Similarity: 0.4008918628686366
Using Scikit-learn
For production use, you can leverage scikit-learn's built-in implementation:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample sentences
sentences = [
    "The dog jumped into the well.",
    "The well dries up in summer season."
]

# Create count vectors
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(sentences)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(count_matrix)
print("Similarity Matrix:")
print(similarity_matrix)
print("Similarity between sentences:", similarity_matrix[0][1])
Similarity Matrix:
[[1.         0.40089186]
 [0.40089186 1.        ]]
Similarity between sentences: 0.4008918628686366
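Raw counts weight frequent words like "the" heavily. A common variant (not covered above, sketched here as an option) swaps CountVectorizer for scikit-learn's TfidfVectorizer, which down-weights terms that appear in many documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The dog jumped into the well.",
    "The well dries up in summer season."
]

# TF-IDF reweights counts: the words shared by both sentences
# ("the", "well") get a lower weight than words unique to one sentence
tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf_matrix)[0][1]
print("TF-IDF cosine similarity:", similarity)  # lower than the raw-count score
```

Because the shared words are down-weighted, the TF-IDF score for these two sentences comes out lower than the count-based score.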
Interpreting Results
| Similarity Score | Interpretation | Example |
|---|---|---|
| 1.0 | Identical | Same sentence |
| 0.7 - 0.9 | High similarity | Similar topics |
| 0.3 - 0.7 | Moderate similarity | Some common words |
| 0.0 - 0.3 | Low similarity | Few/no common words |
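The bands above can be wrapped in a small helper for reporting. Note that the cut-offs mirror this table only; they are a rule-of-thumb convention, not a standard:

```python
def interpret_similarity(score):
    """Map a cosine similarity score to a rough interpretation band."""
    if score >= 1.0:
        return "Identical"
    if score >= 0.7:  # treat everything from 0.7 up to 1.0 as high
        return "High similarity"
    if score >= 0.3:
        return "Moderate similarity"
    return "Low similarity"

print(interpret_similarity(0.40))  # Moderate similarity
```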
Conclusion
Cosine similarity is an effective measure for text similarity based on word frequency vectors. It ranges from 0 (completely different) to 1 (identical), making it ideal for NLP applications like document clustering and recommendation systems.
