Approximate String Matching

Here is a collection of algorithms that can be used to compare strings.

String metrics

Levenshtein edit distance

Using the basic operations of insertion, deletion, and substitution, the Levenshtein distance is the minimum number of operations required to transform one string into another.

For example, if we want to transform kitten into sitting, we perform the following operations:

  1. kitten -> sitten (substitution)
  2. sitten -> sittin (substitution)
  3. sittin -> sitting (insertion)

Finding the shortest distance between two strings can be done by using a matrix.

To do that, first create an (m + 1) by (n + 1) matrix, where m is the number of letters in the first word and n is the number of letters in the second word. In this example, we have a 7x8 matrix. Set each value in the first row to its column number and each value in the first column to its row number (0-indexed); these are the distances between each prefix and the empty string.

    s i t t i n g
  0 1 2 3 4 5 6 7
k 1
i 2
t 3
t 4
e 5
n 6

To populate the matrix, essentially what we'll be doing is computing the distances between prefixes of our two strings. For example, let's consider cell (1, 1). The value of this cell is the minimum distance between k and s. That distance is 1, because we need to substitute s for k. Similarly, the value of cell (1, 2) involves comparing the strings k and si and is 2, because we need to substitute a letter and insert a letter. The value of cell (2, 1) involves comparing the strings ki and s. The distance there is 2, because we need to substitute a letter and delete a letter.

To calculate the value of cell (2, 2), we're comparing the substrings ki and si. The distance here is only 1, because only 1 operation (substitution) is needed.

In general, to calculate the value of a cell (i, j), you take the minimum of three values:

  1. value of cell (i - 1, j) + 1
  2. value of cell (i, j - 1) + 1
  3. value of cell (i - 1, j - 1) + (0 or 1)

Value 1 is the previously accrued cost plus the cost of deleting a letter, and value 2 is the previously accrued cost plus the cost of inserting a letter. Value 3 is a little different in that the added cost can be either 0 or 1: if the letters at positions i and j are the same, there is no cost; otherwise it is a substitution with a cost of 1.

In this way we can populate the whole matrix:

    s i t t i n g
  0 1 2 3 4 5 6 7
k 1 1 2 3 4 5 6 7
i 2 2 1 2 3 4 5 6
t 3 3 2 1 2 3 4 5
t 4 4 3 2 1 2 3 4
e 5 5 4 3 2 2 3 4
n 6 6 5 4 3 3 2 3

In the end, the shortest distance is in the bottom right corner of the matrix: 3.

In a computer program you can take shortcuts to avoid storing the whole matrix (only the previous row is needed at any point); the full matrix just makes the calculation easier to follow by hand.
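
For example, here is a minimal sketch of the full-matrix version in Python (the function name is illustrative, not from any particular library):

  def levenshtein(s1, s2):
      """Levenshtein distance using the full (m + 1) x (n + 1) matrix."""
      m, n = len(s1), len(s2)
      d = [[0] * (n + 1) for _ in range(m + 1)]
      # First row and column hold the distances to the empty string.
      for i in range(m + 1):
          d[i][0] = i
      for j in range(n + 1):
          d[0][j] = j
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              cost = 0 if s1[i - 1] == s2[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,         # deletion
                            d[i][j - 1] + 1,         # insertion
                            d[i - 1][j - 1] + cost)  # substitution (or match)
      return d[m][n]

  print(levenshtein("kitten", "sitting"))  # 3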

References:
  • Levenshtein, VI. “Binary Codes with Correction of Deletions, Insertions and Substitution of Symbols.” In Dokl. Akad. Nauk SSSR, 163:845–848, 1965.
  • Sellers, Peter H. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, no. 4 (1980): 359–373.

Sellers edit distance

The algorithm to compute the Sellers edit distance is the same as the algorithm to compute the Levenshtein edit distance, except the cost (or weight) of each operation can be any value. In the Levenshtein edit distance, the cost of each operation is 1.

So, using the above algorithm, we can populate the matrix by taking the minimum of the following equations for each cell:

  1. value of cell (i - 1, j) + deletion cost
  2. value of cell (i, j - 1) + insertion cost
  3. value of cell (i - 1, j - 1) + (0 or substitution cost)
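
A sketch of the same recurrence with the operation costs as parameters (the function name and default costs are illustrative); with all costs equal to 1 it reduces to the Levenshtein distance:

  def sellers(s1, s2, ins_cost=1, del_cost=1, sub_cost=1):
      """Weighted edit distance: same recurrence, arbitrary operation costs."""
      m, n = len(s1), len(s2)
      d = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(1, m + 1):
          d[i][0] = d[i - 1][0] + del_cost
      for j in range(1, n + 1):
          d[0][j] = d[0][j - 1] + ins_cost
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              cost = 0 if s1[i - 1] == s2[j - 1] else sub_cost
              d[i][j] = min(d[i - 1][j] + del_cost,
                            d[i][j - 1] + ins_cost,
                            d[i - 1][j - 1] + cost)
      return d[m][n]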

Hamming distance

The Hamming distance is similar to the Levenshtein distance, except only substitutions are taken into account. The strings being compared must be the same length.
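
A minimal sketch, raising an error when the lengths differ since the distance is undefined in that case:

  def hamming(s1, s2):
      """Count positions at which two equal-length strings differ."""
      if len(s1) != len(s2):
          raise ValueError("strings must be the same length")
      return sum(c1 != c2 for c1, c2 in zip(s1, s2))

  print(hamming("kitten", "sitten"))  # 1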

References:
  • Hamming, Richard W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29, no. 2 (1950): 147–160.

Pair distance

The pair distance is computed from the set of adjacent character pairs that two strings have in common.

For example, if you want to compare "kitten" and "sitting", first you get the set of character pairs for each string:
  • kitten: {KI, IT, TT, TE, EN}
  • sitting: {SI, IT, TT, TI, IN, NG}

Then you find the set of pairs common to both: {IT, TT}.

The pair distance is twice the number of pairs in the common set divided by the sum of the number of character pairs in each original set. (Note that this is really a similarity score: higher means more similar, and identical strings score 1.)

In this case: (2 * 2)/(5 + 6) = 0.363...

The idea is that each pair of adjacent characters carries some information about the original string being compared.
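
A sketch of the calculation above, collecting the pairs into sets as in the worked example (some variants keep duplicate pairs; the function names are illustrative):

  def char_pairs(s):
      """Set of adjacent character pairs, e.g. 'kitten' -> {KI, IT, TT, TE, EN}."""
      s = s.upper()
      return {s[i:i + 2] for i in range(len(s) - 1)}

  def pair_distance(s1, s2):
      p1, p2 = char_pairs(s1), char_pairs(s2)
      return 2 * len(p1 & p2) / (len(p1) + len(p2))

  print(pair_distance("kitten", "sitting"))  # 0.3636...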

Jaro-Winkler score

The Jaro-Winkler score (or distance) takes into account the number of matching characters and the transposition of characters in two strings.

The definition of "matching" is a little different than the other algorithms. A character in string 1 matches a character in string 2 if the character is the same and the respective positions of the characters aren't too far apart. Specifically, the characters can't be more than floor(max(length(s1), length(s2)) / 2) - 1 spaces apart from each other.

Once you have the list of matching characters, remove the non-matching characters from each string and compare the two resulting strings. If they aren't identical, you have transpositions: the number of positions at which the two matched sequences differ, divided by 2, is the number of transpositions.

To get the Jaro score, apply this formula: 1/3 * (m / length(s1) + m / length(s2) + (m - t) / m), where:
  • m is the number of matching characters
  • s1 is the first string
  • s2 is the second string
  • t is the number of transpositions

For example, let's consider "Martha" and "Marhta". All of the letters match, but the matched sequences differ in two positions (th versus ht), which counts as one transposition. So:
  • m = 6
  • length(s1) = 6
  • length(s2) = 6
  • t = 1

The Jaro score is: 1/3 * (6 / 6 + 6 / 6 + 5 / 6) = 0.944

To find the Jaro-Winkler score, we first need the length of the common prefix of the two strings; in this example it is 3 ("Mar"). Then apply the formula: jaro_score + (l * p * (1 - jaro_score)), where:
  • l is the length of the common prefix at the start of the string up to a maximum of 4
  • p is the constant scaling factor (usually 0.1 and not more than 0.25)

In our example, the Jaro-Winkler score is: 0.9444 + (3 * 0.1 * (1 - 0.9444)) ≈ 0.961
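
A sketch of both scores following the description above (the names are illustrative and edge cases such as empty or single-character strings are not handled):

  def jaro(s1, s2):
      """Jaro score as described above."""
      max_dist = max(len(s1), len(s2)) // 2 - 1
      s1_matches = [False] * len(s1)
      s2_matches = [False] * len(s2)
      m = 0
      for i, c1 in enumerate(s1):
          lo = max(0, i - max_dist)
          hi = min(len(s2), i + max_dist + 1)
          for j in range(lo, hi):
              if not s2_matches[j] and s2[j] == c1:
                  s1_matches[i] = s2_matches[j] = True
                  m += 1
                  break
      if m == 0:
          return 0.0
      # Compare the matched characters in order; t is half the mismatch count.
      matched1 = [c for c, ok in zip(s1, s1_matches) if ok]
      matched2 = [c for c, ok in zip(s2, s2_matches) if ok]
      t = sum(c1 != c2 for c1, c2 in zip(matched1, matched2)) / 2
      return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

  def jaro_winkler(s1, s2, p=0.1):
      """Jaro score boosted for a common prefix of up to four characters."""
      j = jaro(s1, s2)
      l = 0
      for c1, c2 in zip(s1[:4], s2[:4]):
          if c1 != c2:
              break
          l += 1
      return j + l * p * (1 - j)

  print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
  print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961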

References:
  • Winkler, William E. “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” (1990).

Phonetic algorithms

Phonetic algorithms are a way to convert words into codes based on how the words sound. Below is a list of a few algorithms to do this.

Soundex

Soundex is a phonetic algorithm used by the US census to file last names together. The Soundex code is always a letter (the first letter of the name) followed by 3 numbers.

To get the Soundex code, first take the first letter of the last name to begin the code. Then, disregarding the letters A, E, I, O, U, H, W, and Y, get the matching code for the first 3 remaining letters by using this table:

Number  Represents the Letters
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

If there aren't 3 letters left to code, just add zeros until you have 3 numbers.

For example, the last name Anderson has the code A-536: A, then N → 5, D → 3, R → 6 (the E is disregarded as a vowel). The remaining letters are not needed because the code already has three digits.

There are a few special rules.
  1. Double letters are considered as a single letter
  2. Adjacent letters with the same code are treated as a single letter
  3. The following name prefixes cause a name to have two Soundex codes (one with the prefix and one without): Van, Con, De, Di, La, and Le
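
A simplified sketch of the coding procedure. It applies rules 1 and 2 by skipping consecutive identical codes, but it does not implement the prefix rule or the census-era treatment of H and W between same-coded letters, and it assumes a non-empty alphabetic name:

  def soundex(name):
      """Simplified Soundex sketch: first letter of the name plus three digits."""
      codes = {}
      for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                             ("4", "L"), ("5", "MN"), ("6", "R")]:
          for letter in letters:
              codes[letter] = digit

      name = "".join(ch for ch in name.upper() if ch.isalpha())
      result = name[0]
      prev = codes.get(name[0], "")
      for ch in name[1:]:
          code = codes.get(ch, "")      # "" for A, E, I, O, U, H, W, Y
          if code and code != prev:     # rules 1 and 2: drop repeated codes
              result += code
          prev = code
          if len(result) == 4:
              break
      return result.ljust(4, "0")       # pad with zeros if fewer than 3 digits

  print(soundex("Anderson"))  # A536 (often written A-536)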

Metaphone

Metaphone is an algorithm created by Lawrence Philips. It works not only with names, but with other words as well. There are three forms of the Metaphone algorithm: Metaphone, Double Metaphone, and Metaphone 3. Each form improves on the previous one.

Double Metaphone is available freely, but Metaphone 3 is a commercial product.

New York State Identification and Intelligence System (NYSIIS)

The NYSIIS code algorithm was developed in 1970. It's similar to the Soundex algorithm, except the resulting code is pronounceable.

Algorithm:
  1. Translate first characters of name: MAC → MCC, KN → N, K → C, PH, PF → FF, SCH → SSS
  2. Translate last characters of name: EE → Y, IE → Y, DT, RT, RD, NT, ND → D
  3. First character of key = first character of name.
  4. Translate remaining characters by following rules, incrementing by one character each time:
    1. EV → AF else A, E, I, O, U → A
    2. Q → G, Z → S, M → N
    3. KN → N else K → C
    4. SCH → SSS, PH → FF
    5. H → If the previous or next letter is not a vowel, replace with the previous letter.
    6. W → If the previous letter is a vowel, replace with A.
    7. Add the current translation to the key only if it differs from the last character already in the key.
  5. If last character is S, remove it.
  6. If last characters are AY, replace with Y.
  7. If last character is A, remove it.
  8. Append the translated key, minus its first character, to the first character saved in step 3.
  9. If the key is longer than 6 characters, truncate it to the first 6 characters. (This is only needed for true NYSIIS; some versions use the full key.)
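
The following is a direct transcription of the steps above into Python; published NYSIIS implementations differ in several details (for example, how H and W are handled at the end of a name), so treat it as a sketch rather than a reference implementation:

  import re

  def nysiis(name):
      """A sketch of the NYSIIS coding steps listed above."""
      s = re.sub(r"[^A-Z]", "", name.upper())
      if not s:
          return ""
      vowels = set("AEIOU")

      # Step 1: translate the start of the name.
      for old, new in (("MAC", "MCC"), ("KN", "N"), ("K", "C"),
                       ("PH", "FF"), ("PF", "FF"), ("SCH", "SSS")):
          if s.startswith(old):
              s = new + s[len(old):]
              break

      # Step 2: translate the end of the name.
      for old, new in (("EE", "Y"), ("IE", "Y"), ("DT", "D"), ("RT", "D"),
                       ("RD", "D"), ("NT", "D"), ("ND", "D")):
          if s.endswith(old):
              s = s[:-len(old)] + new
              break

      # Steps 3 and 8 combined: the key keeps the first character of the name,
      # and the translated remainder is appended to it.
      key = s[0]

      # Step 4: translate the remaining characters, one position at a time.
      i = 1
      while i < len(s):
          if s[i:i + 3] == "SCH":
              cur, step = "SSS", 3
          elif s[i:i + 2] == "EV":
              cur, step = "AF", 2
          elif s[i:i + 2] == "PH":
              cur, step = "FF", 2
          elif s[i:i + 2] == "KN":
              cur, step = "N", 2
          elif s[i] in vowels:
              cur, step = "A", 1
          elif s[i] in "QZM":
              cur, step = {"Q": "G", "Z": "S", "M": "N"}[s[i]], 1
          elif s[i] == "K":
              cur, step = "C", 1
          elif s[i] == "H" and (s[i - 1] not in vowels or
                                (i + 1 < len(s) and s[i + 1] not in vowels)):
              cur, step = s[i - 1], 1     # rule 4.5
          elif s[i] == "W" and s[i - 1] in vowels:
              cur, step = "A", 1          # rule 4.6
          else:
              cur, step = s[i], 1
          for ch in cur:                   # rule 4.7: skip repeated characters
              if ch != key[-1]:
                  key += ch
          i += step

      # Steps 5-7: clean up the ending of the key.
      if key.endswith("S"):
          key = key[:-1]
      if key.endswith("AY"):
          key = key[:-2] + "Y"
      if key.endswith("A"):
          key = key[:-1]

      # Step 9: truncate to six characters for "true" NYSIIS.
      return key[:6]

  print(nysiis("Anderson"))  # ANDARS with this transcription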

References:
  • Taft, Robert L. Name Search Techniques. 1. Bureau of Systems Development, New York State Identification and Intelligence System, 1970.

-- JeremyStephens - 28 Feb 2013