Python | Measure similarity between two sentences using cosine similarity
Finding the semantic similarity between sentences, words, or text is a very common task in modern Natural Language Processing use cases. Cosine similarity is a popular method that measures the cosine of the angle between two non-zero vectors, computed from their dot product and magnitudes.
In this article, let us briefly explore cosine similarity and see its implementation using Python.
What is Cosine Similarity?
Cosine similarity is defined as the cosine of the angle between two vectors in space. A sentence or text can be represented as a vector based on word frequencies. The similarity between two sentences depends upon the cosine of the angle between their vectors: the smaller the angle, the higher the similarity.
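For two vectors a and b, the formula is cos(θ) = (a · b) / (|a| × |b|). A minimal sketch with hand-made vectors (using only the standard library) illustrates the two extremes:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score (approximately) 1.0;
# vectors with no shared component score 0.0
print(cosine([1, 2], [2, 4]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal
```

Word-frequency vectors have only non-negative entries, so in practice the score always falls between 0 and 1.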
Steps to Calculate Cosine Similarity
Tokenize both sentences into individual words
Calculate word frequency for each sentence to create vectors
Compute the dot product of the two vectors
Calculate the magnitude of each vector
Apply the cosine similarity formula
Implementation Using Python
Here's a complete implementation that converts sentences to vectors and calculates their cosine similarity:
import math
import re
from collections import Counter

# Regular expression to extract words
word_pattern = re.compile(r"\w+")

def generate_vectors(sentence):
    """Convert sentence to word frequency vector"""
    words = word_pattern.findall(sentence.lower())
    return Counter(words)

def cosine_similarity(vector_1, vector_2):
    """Calculate cosine similarity between two vectors"""
    # Get intersection of words in both vectors
    common_words = set(vector_1.keys()) & set(vector_2.keys())
    # Calculate dot product (numerator)
    dot_product = sum(vector_1[word] * vector_2[word] for word in common_words)
    # Calculate squared magnitude of each vector
    magnitude_1 = sum(count ** 2 for count in vector_1.values())
    magnitude_2 = sum(count ** 2 for count in vector_2.values())
    # Calculate denominator
    denominator = math.sqrt(magnitude_1) * math.sqrt(magnitude_2)
    if not denominator:
        return 0.0
    return float(dot_product) / denominator

# Example sentences
sentence_1 = "The dog jumped into the well."
sentence_2 = "The well dries up in summer season."

# Generate vectors
vec_1 = generate_vectors(sentence_1)
vec_2 = generate_vectors(sentence_2)
print("Sentence 1 vector:", dict(vec_1))
print("Sentence 2 vector:", dict(vec_2))

# Calculate similarity
similarity = cosine_similarity(vec_1, vec_2)
print("Cosine Similarity:", similarity)
The output shows the word frequency vectors and the calculated similarity:
Sentence 1 vector: {'the': 2, 'dog': 1, 'jumped': 1, 'into': 1, 'well': 1}
Sentence 2 vector: {'the': 1, 'well': 1, 'dries': 1, 'up': 1, 'in': 1, 'summer': 1, 'season': 1}
Cosine Similarity: 0.4008918628686366
Using Scikit-learn
For production use, you can leverage scikit-learn's built-in implementation:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample sentences
sentences = [
    "The dog jumped into the well.",
    "The well dries up in summer season."
]

# Create count vectors
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(sentences)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(count_matrix)
print("Similarity Matrix:")
print(similarity_matrix)
print("Similarity between sentences:", similarity_matrix[0][1])
Similarity Matrix:
[[1.         0.40089186]
 [0.40089186 1.        ]]
Similarity between sentences: 0.4008918628686366
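Raw counts weight frequent words like "the" heavily. A common variant (not covered above, sketched here as an option) swaps CountVectorizer for scikit-learn's TfidfVectorizer, which down-weights terms that appear in many documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The dog jumped into the well.",
    "The well dries up in summer season."
]

# TF-IDF reweights counts: the words shared by both sentences
# ("the", "well") get a lower weight than words unique to one sentence
tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf_matrix)[0][1]
print("TF-IDF cosine similarity:", similarity)  # lower than the raw-count score
```

Because the shared words are down-weighted, the TF-IDF score for these two sentences comes out lower than the count-based score.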
Interpreting Results
| Similarity Score | Interpretation | Example |
|---|---|---|
| 1.0 | Identical | Same sentence |
| 0.7 - 0.9 | High similarity | Similar topics |
| 0.3 - 0.7 | Moderate similarity | Some common words |
| 0.0 - 0.3 | Low similarity | Few/no common words |
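The bands above can be wrapped in a small helper for reporting. Note that the cut-offs mirror this table only; they are a rule-of-thumb convention, not a standard:

```python
def interpret_similarity(score):
    """Map a cosine similarity score to a rough interpretation band."""
    if score >= 1.0:
        return "Identical"
    if score >= 0.7:  # treat everything from 0.7 up to 1.0 as high
        return "High similarity"
    if score >= 0.3:
        return "Moderate similarity"
    return "Low similarity"

print(interpret_similarity(0.40))  # Moderate similarity
```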
Conclusion
Cosine similarity is an effective measure for text similarity based on word frequency vectors. It ranges from 0 (completely different) to 1 (identical), making it ideal for NLP applications like document clustering and recommendation systems.
