Distances in Machine Learning (Pt. 2 – String Distances)

Distances are used throughout machine learning to help us (and our algorithms) figure out many things such as how far apart two instances are or how far our prediction is from the ground truth. The first part of this series looked at traditional distances that line up with our every day experience with the word, but the mathematical possibilities are so much broader than this.

Consider individual words, or strings to computer scientists. We have an idea that the string pair is closer to the string part than it is to the string clod, but how can we formalize this?

If we wanted to use our traditional distances we would have to find a way to embed these strings in a representational space and then measure distances between the points representing the pair/part and pair/clod strings. This seems awkward and there is a better way: we can use string distances.

Hamming Distance

A simple way to measure this distance for strings of the same length is to count the number of characters that differ. This is the number of substitutions that are necessary to turn one string into the other.

The Hamming distance between pair and part is 2 while the Hamming distance between pair and clod is 4.

The Hamming distance is useful for determining how many errors may have occurred in transmitting our string, which is useful for error-detecting and error-correcting codes. Consider a situation where we know that at most two characters can be substituted. If we use a code where all the strings are separated by a Hamming distance of 3 or greater, then we will always be able to detect the errors that occur. Alternatively, if we know that at most one character can be substituted then these codes can detect and correct the errors that occur, since only one code string will be a Hamming distance of 1 away from the string we receive.

Levenshtein Distance

The Hamming distance is quite straight-forward, but it is also restricted to the case where our two strings have the same distance. The real-world isn’t always so neat and simple, however. Happily we can extend the Hamming distance to the Levenshtein distance with some additional string operations: now we consider insertions, deletions, and substitutions.

The Levenshtein distance lets us consider string correction problems more generally and is related to problems of finding the longest common subsequence between two strings and DNA sequence alignment in bioinformatics.

Dynamic Time Warping

What if the strings we wanted to measure distances between weren’t strings at all, but were instead more generic sequences? In this case we can use a technique called dynamic time warping, which finds the optimal way of warping two temporal sequences into alignment. This allows us to calculate, for example, how similar two gaits are, even if the people are walking at different speeds.

Illustrating dynamic time wraping by comparing the gait of two skeletons. — Dynamic time warping determines the similarity between two time-based sequences, like these walking patterns. The more similar two objects are, the closer they are. Lars Lau Raket, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

If you’re wondering why we have a similarity function in this list, that’s because similarity is just the inverse of distance! The more similar two objects are the closer they will be; the further apart two objects are the more different they will be.