What is Binary Variables?

Data MiningDatabaseData Structure

A binary variable has only two states such as 0 or 1, where 0 defines that the variable is absent, and 1 defines that it is present. Given the variable smoker defining a patient, for example, 1 denotes that the patient smokes, while 0 denotes that the patient does not. It can be considering binary variables as if they are interval-scaled can lead to misleading clustering outcomes. Hence, methods defines to binary data are essential for calculating dissimilarities.

There is one method involves calculating a dissimilarity matrix from the given binary data. If some binary variables are thought of as having the similar weight, it can have the 2-by-2 contingency table, where q is the number of variables that similar to 1 for both objects i and j, r is the number of variables that same 1 for object i but that are 0 for object j, s is the number of variables that same 0 for object i but is similar as 1 for object j, and t is the number of variables that is similar to 0 for both objects i and j. The total number of variables is p, where p = q+r +s+t.

A binary variable is symmetric if both of its states are uniformly valuable and carry the equal weight; i.e., there is no preference on which results must be coded as 0 or 1. Dissimilarity that depends on symmetric binary variables is known as symmetric binary dissimilarity.

A binary variable is asymmetric if the results of the states are not important, including the positive and negative results of a disease test. By convention, we shall code the essential outcome, which is generally the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).

Given two asymmetric binary variables, the concurrence of two 1s (a positive match) is treated more important than that of two 0s (a negative match). Hence, such binary variables are treated as “monary” (as if having one state).

The dissimilarity based on such variables is known as asymmetric binary dissimilarity, where the several negative matches, t, is treated unimportant and therefore is ignored in the computation, as shown in equation

$$\mathrm{d(i, j)=\:\frac{r+s}{q+r+s}}$$

It can calculate the distance between two binary variables depends on the concept of similarity rather than dissimilarity. For instance, the asymmetric binary similarity between the objects i and j, or sim (i, j), can be calculated as,

$$\mathrm{sim(i, j)=\:\frac{q}{q+r+s}=1-d(i,j)}$$.

The coefficient sim(i, j) is known as the Jaccard coefficient.

Updated on 16-Feb-2022 12:18:00