Hamming distance measures the distance between two binary vectors (bitstrings). Binary strings most often arise when we apply one-hot encoding to the categorical columns of a dataset. In one-hot encoding, the categorical variable is removed and a new binary variable is added for each unique category. For example, if a column had the categories ‘Length’, ‘Width’, and ‘Breadth’, we might one-hot encode each example as a bitstring with one bit per category as follows −
Length = [1, 0, 0]
Width = [0, 1, 0]
Breadth = [0, 0, 1]
The Hamming distance between any two of the categories above can be calculated as the number (or the fraction) of bit positions at which the two binary strings differ. For instance, the Hamming distance between the Length and Width categories is 2/3, or about 0.666, because 2 out of 3 positions are different.
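The fraction-of-differing-bits calculation described above can be sketched in plain Python (no external libraries); the function name hamming_fraction is chosen here for illustration −

```python
# One-hot encoded categories from the example above
length  = [1, 0, 0]
width   = [0, 1, 0]
breadth = [0, 0, 1]

def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    differing = sum(x != y for x, y in zip(a, b))
    return differing / len(a)

# Length vs. Width: 2 of 3 positions differ
print(hamming_fraction(length, width))
```

Running this prints 0.666…, matching the 2/3 worked out above.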
Hamming distance can also be used to measure the similarity between categorical values. For example, suppose we have the two strings −
“Google” and “Goagle”
Both strings are of the same length, so we can calculate the Hamming distance between them by comparing characters position by position. The first and second characters of the two strings match, the third character differs, and all the remaining characters match, hence the Hamming distance between the two strings is 1.
The Hamming distance is only defined for strings of equal length. The larger the Hamming distance between two strings, the more dissimilar they are, and vice versa.
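Before turning to SciPy, the character-by-character comparison just described can be written directly in plain Python; the helper name hamming is an illustrative choice −

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Hamming distance is undefined for strings of different lengths
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming('Google', 'Goagle'))
```

This prints 1, since only the third character differs.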
Let’s see how we can calculate the Hamming distance of two strings using SciPy library −
# Importing the SciPy library
from scipy.spatial import distance

# Defining the strings
A = 'Google'
B = 'Goagle'

# Computing the Hamming distance
# distance.hamming returns the fraction of differing positions,
# so we multiply by the string length to get the count
hamming_distance = distance.hamming(list(A), list(B)) * len(A)
print('Hamming Distance b/w', A, 'and', B, 'is:', hamming_distance)
Hamming Distance b/w Google and Goagle is: 1.0