Approximate String Matching
Here is a collection of algorithms that can be used to compare strings.
String metrics
Levenshtein edit distance
Using the basic operations of insertion, deletion, and substitution, the Levenshtein distance is minimum number of operations required to transform one string into another.
For example, if we want to transform
kitten into
sitting, we perform the following operations:
- kitten -> sitten (substitution)
- sitten -> sittin (substitution)
- sittin -> sitting (insertion)
Finding the shortest distance between two strings can be done by using a matrix.
To do that, first create a
(m + 1)
by
(n + 1)
matrix, where
m
is the number of letters in the first word, and
n
is the number of letters in the second word. In this example, we have a
7x8
matrix. Set each value in the first row to
0, and set each value in the first column to the row number (0-indexed).
|
|
s |
i |
t |
t |
i |
n |
g |
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
k |
1 |
|
|
|
|
|
|
|
i |
2 |
|
|
|
|
|
|
|
t |
3 |
|
|
|
|
|
|
|
t |
4 |
|
|
|
|
|
|
|
e |
5 |
|
|
|
|
|
|
|
n |
6 |
|
|
|
|
|
|
|
To populate the matrix, essentially what we'll be doing is computing the distances between substrings of our two strings. For example, let's consider cell
(1, 1)
. The value of this cell is the minimum distance between
k and
s. That distance is 1, because we need to substitute
s for
k. Similarly, the value of cell
(1, 2)
involves comparing the strings
k and
i and is 1, because we need to substitute a letter. The value of cell
(2, 1)
involves comparing the strings
ki and
s. The distance there is 2, because we need to substitute a letter and delete a letter.
To calculate the value of cell
(2, 2)
, we're comparing the substrings
ki and
si. The distance here is only 1, because only 1 operation (substitution) is needed.
In general, to calculate the value of a cell
(i, j)
, you take the minimum of three equations:
- value of cell
(i - 1, j)
+ 1
- value of cell
(i, j - 1)
+ 1
- value of cell
(i - 1, j - 1)
+ (0 or 1)
Equation 1 is the cost of deleting a letter plus the previously accrued cost, and equation 2 is the cost of inserting a letter plus the previously accrued cost. Equation 3 is a little different in that the cost could be either 0 or 1. If the letters are the same, there is no cost.
In this way we can populate the whole matrix:
|
|
s |
i |
t |
t |
i |
n |
g |
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
k |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
i |
2 |
2 |
1 |
2 |
2 |
1 |
2 |
2 |
t |
3 |
3 |
2 |
1 |
2 |
2 |
2 |
3 |
t |
4 |
4 |
3 |
2 |
1 |
2 |
3 |
3 |
e |
5 |
5 |
4 |
3 |
2 |
2 |
3 |
3 |
n |
6 |
6 |
5 |
4 |
3 |
3 |
2 |
3 |
In the end, the shortest distance is in the bottom right corner of the matrix:
3.
You can take shortcuts in a computer program to avoid creating a matrix. It just makes it easier if you're doing it by hand.
References:
- Levenshtein, VI. “Binary Codes with Correction of Deletions, Insertions and Substitution of Symbols.” In Dokl. Akad. Nank. SSSR, 163:845–848, 1965.
- Sellers, Peter H. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, no. 4 (1980): 359–373.
Sellers edit distance
The algorithm to compute the Sellers edit distance is the same as the algorithm to compute the Levenshtein edit distance, except the cost (or weight) of each operation can be any value. In the Levenshtein edit distance, the cost of each operation is 1.
So, using the above algorithm, we can populate the matrix by taking the minimum of the following equations for each cell:
- value of cell
(i - 1, j)
+ deletion cost
- value of cell
(i, j - 1)
+ insertion cost
- value of cell
(i - 1, j - 1)
+ (0 or substitution cost)
Hamming distance
The Hamming distance is similar to the Levenshtein distance, except only substitutions are taken into account. The strings being compared must be the same length.
References:
- Hamming, Richard W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29, no. 2 (1950): 147–160.
Pair distance
The pair distance is computed by determining the common set of subsequent character pairs in two strings.
For example, if you want to compare "kitten" and "sitting", first you get the set of character pairs for each string:
- kitten:
{KI, IT, TT, TE, EN}
- sitting:
{SI, IT, TT, TI, IN, NG}
Then you find the set of pairs in both sets:
{IT, TT}
.
The pair distance is twice the number of pairs in the common set divided by the sum of the number of character pairs in each original set.
In this case:
(2 * 2)/(5 + 6) = 0.363...
The idea is that each pair of subsequent characters contains some information about the original string being compared.
References:
Jaro-Winkler score
The Jaro-Winkler score (or distance) takes into account the number of matching characters and the transposition of characters in two strings.
The definition of "matching" is a little different than the other algorithms. A character in string 1 matches a character in string 2 if the character is the same and the respective positions of the characters aren't too far apart. Specifically, the characters can't be more than
floor(max(length(s1), length(s2)) / 2) - 1
spaces apart from each other.
Once you have a list of matching characters, remove the non-matching ones from each string and compare the two new strings. If the two new strings aren't the same, then you have
transpositions. The number of characters that are out of order (divided by 2) are the number of transpositions.
To get the Jaro score, apply this formula:
1/3 * (m / length(s1) + m / length(s2) + (m - t) / m)
, where:
- m is the number of matching characters
- s1 is the first string
- s2 is the second string
- t is the number of transpositions
For example, let's consider "Martha" and "Marhta". All of the letters match, but there is a transposition: T -> H. So:
- m = 6
- length(s1) = 6
- length(s2) = 6
- t = 1
The Jaro score is:
1/3 * (6 / 6 + 6 / 6 + 5 / 6) = 0.944
To find the Jaro-Winkler score, we first need to find the number of matching characters at the beginning. In this example, we have 3. Then apply the formula:
jaro_score + (l * p * (1 - jaro_score))
, where:
- l is the length of the common prefix at the start of the string up to a maximum of 4
- p is the constant scaling factor (usually 0.1 and not more than 0.25)
In our example, the Jaro-Winkler score is:
0.944 + (3 * 0.1 * (1 - 0.944)) = 0.9608
References:
- Winkler, William E. “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” (1990).
Phonetic algorithms
Phonetic algorithms are a way to convert words into codes based on how the words sound. Below is a list of a few algorithms to do this.
Soundex
Soundex is a phonetic algorithm used by the US census to file last names together. The Soundex code is always a letter (the first letter of the name) followed by 3 numbers.
To get the Soundex code, first take the first letter of the last name to begin the code. Then, disregarding the letters A, E, I, O, U, H, W, and Y, get the matching code for the first 3 remaining letters by using this table:
Number |
Represents the Letters |
1 |
B, F, P, V |
2 |
C, G, J, K, Q, S, X, Z |
3 |
D, T |
4 |
L |
5 |
M, N |
6 |
R |
If there aren't 3 letters left to code, just add zeros until you have 3 numbers.
For example, the last name Anderson would have a code of: A-536. The letters "s" and "n" are ignored.
There are a few special rules.
- Double letters are considered as a single letter
- Adjacent letters with the same code are treated as a single letter
- The following name prefixes cause a name to have two Soundex codes (one with the prefix and one without): Van, Con, De, Di, La, and Le
References:
Metaphone
Metaphone is an algorithm created by Lawrence Philips. It works not only with names, but with other words as well. There are 3 forms of the metaphone algorithm: Metaphone, Double Metaphone, and Metaphone 3. Each subsequent form improves on the previous form.
Double Metaphone is available freely, but Metaphone 3 is a commercial product.
References:
New York State Identification and Intelligence System (NYSIIS)
The NYSIIS code algorithm was developed in 1970. It's similar to the Soundex algorith, except the resulting code is pronounceable.
Algorithm:
- Translate first characters of name: MAC → MCC, KN → N, K → C, PH, PF → FF, SCH → SSS
- Translate last characters of name: EE → Y, IE → Y, DT, RT, RD, NT, ND → D
- First character of key = first character of name.
- Translate remaining characters by following rules, incrementing by one character each time:
- EV → AF else A, E, I, O, U → A
- Q → G, Z → S, M → N
- KN → N else K → C
- SCH → SSS, PH → FF
- H → If previous or next is non-vowel, previous.
- W → If previous is vowel, A.
- Add current to key if current is not same as the last key character.
- If last character is S, remove it.
- If last characters are AY, replace with Y.
- If last character is A, remove it.
- Append translated key to value from step 3 (removed first character)
- If longer than 6 characters, truncate to first 6 characters. (only needed for true NYSIIS, some versions use the full key)
References:
- Taft, Robert L. Name Search Techniques. 1. Bureau of Systems Development, New York State Identification and Intelligence System, 1970.
--
JeremyStephens - 28 Feb 2013