What are the applications of Similarity Measures?

Similarity measures provide the framework on which many data mining decisions are based. Tasks such as classification and clustering generally assume the existence of some similarity measure, while fields with poor techniques for evaluating similarity often find that searching for information is cumbersome.

Some applications of similarity measures are as follows −

Information Retrieval − The goal of information retrieval (IR) systems is to meet users’ needs. In practical terms, a need is usually expressed as a short textual query entered in the text box of an online search engine. IR systems generally do not answer a query directly; instead, they present a ranked list of records that are judged relevant to that query by some similarity measure.
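The ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the whitespace tokenizer, the function names `tokenize`, `dice` and `rank`, and the toy `records` list are all assumptions made for the example, and the similarity measure is the set-based Dice coefficient covered later in this article.

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def dice(a, b):
    """Set-based Dice coefficient, 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def rank(query, records):
    """Return (score, record) pairs sorted from most to least similar."""
    q = tokenize(query)
    scored = [(dice(q, tokenize(r)), r) for r in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy collection, invented for illustration only.
records = [
    "data mining with similarity measures",
    "cooking pasta at home",
    "similarity measures for clustering",
]

ranked = rank("similarity measures", records)
for score, record in ranked:
    print(f"{score:.2f}  {record}")
```

The system never answers the query itself; it only orders the records by score, so the record sharing no terms with the query ends up at the bottom of the list.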

Because similarity measures have the effect of clustering and classifying information with respect to a query, users will commonly find new interpretations of their information need, which may or may not be useful to them when reformulating their query.

When the query is itself a record from the initial set, similarity measures can be used to cluster and classify records within a collection. In short, similarity measures can impose a rudimentary structure on a previously unstructured set.


Similarity measures used in IR systems can distort one’s perception of the whole data set. For example, if a user types a query into a search engine and does not find a satisfactory answer in the top ten returned web pages, then they will usually try to reformulate the query once or twice.

Classic Similarity Measures

A similarity measure is defined as a mapping from a pair of tuples of size k to a scalar number. By convention, similarity measures map to the range [-1, 1] or [0, 1], where a similarity score of 1 denotes maximum similarity. A similarity measure should also have the property that its value increases as the number of common properties in the two items being compared increases.


The Dice coefficient is a generalization of the harmonic mean of the precision and recall measures. A system with a high harmonic mean should theoretically be closer to an ideal retrieval system, in that it can achieve high precision values at high levels of recall. The harmonic mean E of precision P and recall R is given by

$$E=\frac{2}{\frac{1}{P}+\frac{1}{R}}$$

while the Dice coefficient is denoted by

$$sim(q,d_{j})=D(A,B)=\frac{|A\cap B|}{\alpha|A|+(1-\alpha)|B|}\cong \frac{\sum_{k=1}^{n}w_{kq}w_{kj}}{\alpha \sum_{k=1}^{n}w_{kq}^{2}+(1-\alpha)\sum_{k=1}^{n}w_{kj}^{2}}$$

with α ∈ [0, 1]. To show that the Dice coefficient is a weighted harmonic mean, let α = ½, which gives D(A, B) = 2|A ∩ B| / (|A| + |B|), the harmonic mean of |A ∩ B| / |A| and |A ∩ B| / |B|.
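As a rough numerical check of this claim, the sketch below computes the Dice coefficient in both its set form and its weighted-vector form, and compares the α = ½ case with the harmonic mean of precision and recall. The function names and the example sets are hypothetical, and it assumes that A is the retrieved set and B is the relevant set.

```python
import math


def dice_sets(a, b, alpha=0.5):
    """Set form: |A ∩ B| / (alpha * |A| + (1 - alpha) * |B|)."""
    denom = alpha * len(a) + (1 - alpha) * len(b)
    return len(a & b) / denom if denom else 0.0


def dice_weights(wq, wj, alpha=0.5):
    """Weighted form over two equal-length term-weight vectors."""
    num = sum(x * y for x, y in zip(wq, wj))
    denom = alpha * sum(x * x for x in wq) + (1 - alpha) * sum(y * y for y in wj)
    return num / denom if denom else 0.0


def harmonic_mean(p, r):
    """Harmonic mean of precision p and recall r (0 if either is 0)."""
    return 2 / (1 / p + 1 / r) if p and r else 0.0


# Hypothetical example: A = retrieved items, B = relevant items.
A = {"a", "b", "c", "d"}
B = {"c", "d", "e"}

precision = len(A & B) / len(A)  # fraction of retrieved items that are relevant
recall = len(A & B) / len(B)     # fraction of relevant items that were retrieved

d_half = dice_sets(A, B)         # alpha = 1/2
e = harmonic_mean(precision, recall)
print(math.isclose(d_half, e))

# Binary weight vectors over the vocabulary (a, b, c, d, e) give the same value.
wq = [1, 1, 1, 1, 0]
wj = [0, 0, 1, 1, 1]
d_vec = dice_weights(wq, wj)
print(math.isclose(d_vec, d_half))
```

Moving `alpha` toward 1 puts more weight on |A| in the denominator, and moving it toward 0 puts more weight on |B|.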


The Overlap coefficient measures the degree to which two sets overlap. The Overlap coefficient is computed as

$$sim(q,d_{j})=O(A,B)=\frac{|A\cap B|}{\min(|A|,|B|)}\cong \frac{\sum_{k=1}^{n}w_{kq}w_{kj}}{\min\left(\sum_{k=1}^{n}w_{kq}^{2},\sum_{k=1}^{n}w_{kj}^{2}\right)}$$

The Overlap coefficient is sometimes calculated using the max operator in place of the min.
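A minimal sketch of the Overlap coefficient follows, covering the set form, the weighted-vector form, and the variant that swaps min for max. The function names, the `use_max` flag and the example sets are assumptions made for illustration.

```python
def overlap_sets(a, b, use_max=False):
    """Set form: |A ∩ B| divided by min(|A|, |B|), or by max if use_max is True."""
    pick = max if use_max else min
    denom = pick(len(a), len(b))
    return len(a & b) / denom if denom else 0.0


def overlap_weights(wq, wj):
    """Weighted form: dot product over the smaller sum of squared weights."""
    num = sum(x * y for x, y in zip(wq, wj))
    denom = min(sum(x * x for x in wq), sum(y * y for y in wj))
    return num / denom if denom else 0.0


# Hypothetical example in which A is a subset of B.
A = {"x", "y"}
B = {"x", "y", "z", "w"}

o_min = overlap_sets(A, B)                # smaller set is fully contained in the larger
o_max = overlap_sets(A, B, use_max=True)  # penalizes the difference in set sizes

# Binary weight vectors over the vocabulary (x, y, z, w).
o_vec = overlap_weights([1, 1, 0, 0], [1, 1, 1, 1])
print(o_min, o_max, o_vec)
```

With the min operator, a set that is fully contained in another scores the maximum similarity of 1 regardless of how much larger the other set is; the max variant does not have this property.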

Updated on: 22-Nov-2021

