Python | Measure similarity between two sentences using cosine similarity


Introduction

Finding the semantic similarity between sentences, words, or documents is a common task in modern Natural Language Processing. There are numerous ways to calculate the similarity between texts; one popular method is cosine similarity. It measures the similarity between two non-zero vectors as the cosine of the angle between them, computed from their dot product and their magnitudes.

Through this article let us briefly explore cosine similarity and see its implementation using Python.

Cosine similarity – Finding similarity between two texts

Cosine similarity is defined as the cosine of the angle between two vectors in space. A sentence or text can be represented as a vector, so the cosine similarity between two sentences depends on the angle between their vector representations. The larger the angle, the smaller the cosine and hence the similarity, and vice versa.
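Formally, for two word-count vectors $A$ and $B$, cosine similarity is the dot product divided by the product of the Euclidean norms:

\[
\cos(\theta) \;=\; \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \;=\; \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}
\]

The result lies between 0 and 1 for frequency vectors, since word counts are never negative.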

Steps to Find Cosine Similarity

  • The text of each sentence is first tokenized into words using a regular expression or the NLTK library.

  • The frequency of each word is counted separately for each sentence.

  • The frequencies of the words common to both sentences contribute to the similarity score.

  • As per the cosine similarity formula, the numerator (the dot product of the two vectors) and the denominator (the product of their Euclidean norms) are calculated.

Python Implementation

In this implementation, two reference sentences are converted into word-frequency vectors, and the cosine similarity between the two vectors is then computed by the cosine_similarity function defined below.

# Cosine similarity between two sentences represented as word-count vectors
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

sentence_1 = "The dog jumped into the well."
sentence_2 = "The well dries up in summer season."

def cosine_similarity(vector_1, vector_2):
    # Dot product over the words common to both sentences.
    inter = set(vector_1.keys()) & set(vector_2.keys())
    numer = sum(vector_1[i] * vector_2[i] for i in inter)

    # Euclidean norms (magnitudes) of the two vectors.
    s_1 = sum(vector_1[i] ** 2 for i in vector_1.keys())
    s_2 = sum(vector_2[i] ** 2 for i in vector_2.keys())
    deno = math.sqrt(s_1) * math.sqrt(s_2)

    if not deno:
        return 0.0
    return numer / deno

def generate_vectors(sent):
    # Tokenize into words and count each word's occurrences.
    words = WORD.findall(sent)
    return Counter(words)

vec_1 = generate_vectors(sentence_1)
vec_2 = generate_vectors(sentence_2)

sim = cosine_similarity(vec_1, vec_2)

print("Similarity(cosine):", sim)

Output

Similarity(cosine): 0.30860669992418383
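Note that the tokenizer above is case-sensitive, so "The" and "the" are counted as different words. A common refinement is to lowercase the text before tokenizing. The sketch below shows this variant using the same approach (function names here are illustrative, not part of any library):

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def text_to_vector(text):
    # Lowercase first so "The" and "the" map to the same key.
    return Counter(WORD.findall(text.lower()))

def cosine_similarity(v1, v2):
    # Dot product over shared words, divided by the product of the norms.
    common = set(v1) & set(v2)
    numerator = sum(v1[w] * v2[w] for w in common)
    denominator = (math.sqrt(sum(c * c for c in v1.values()))
                   * math.sqrt(sum(c * c for c in v2.values())))
    return numerator / denominator if denominator else 0.0

v1 = text_to_vector("The dog jumped into the well.")
v2 = text_to_vector("The well dries up in summer season.")
print(cosine_similarity(v1, v2))
```

With lowercasing, "the" occurs twice in the first sentence, so the shared-word overlap (and therefore the score) increases relative to the case-sensitive version.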

Conclusion

Cosine similarity is a popular and handy measure of similarity between two texts, widely used in Natural Language Processing and Machine Learning. Since it depends on the cosine of the angle between the two sentences in vector representation, the orientation of the vectors (or sentences) in space plays the central role: the similarity score is directly linked to the angle between them.

Updated on: 26-Sep-2023
