You are here: Vanderbilt Biostatistics Wiki>Main Web>ApproximateStringMatching (revision 3)~~EditAttach~~

Approximate String Matching

Here is a collection of algorithms that can be used to compare strings.

Levenshtein edit distance

Using the basic operations of insertion, deletion, and substitution, the Levenshtein distance is minimum number of operations required to transform one string into another.

For example, if we want to transform kitten into sitting, we perform the following operations:

kitten -> sitten (substitution)
sitten -> sittin (substitution)
sittin -> sitting (insertion)

Finding the shortest distance between two strings can be done by using a matrix.

To do that, first create a (m + 1) by (n + 1) matrix, where m is the number of letters in the first word, and n is the number of letters in the second word. In this example, we have a 7x8 matrix. Set each value in the first row to 0, and set each value in the first column to the row number (0-indexed).

		s	i	t	t	i	n	g
	0	0	0	0	0	0	0	0
k	1
i	2
t	3
t	4
e	5
n	6

To populate the matrix, essentially what we'll be doing is computing the distances between substrings of our two strings. For example, let's consider cell (1, 1). The value of this cell is the minimum distance between k and s. That distance is 1, because we need to substitute s for k. Similarly, the value of cell (1, 2) involves comparing the strings k and i and is 1, because we need to substitute a letter. The value of cell (2, 1) involves comparing the strings ki and s. The distance there is 2, because we need to substitute a letter and delete a letter.

To calculate the value of cell (2, 2), we're comparing the substrings ki and si. The distance here is only 1, because only 1 operation (substitution) is needed.

In general, to calculate the value of a cell (i, j), you take the minimum of three equations:

value of cell (i - 1, j) + 1
value of cell (i, j - 1) + 1
value of cell (i - 1, j - 1) + (0 or 1)

Equation 1 is the cost of deleting a letter plus the previously accrued cost, and equation 2 is the cost of inserting a letter plus the previously accrued cost. Equation 3 is a little different in that the cost could be either 0 or 1. If the letters are the same, there is no cost.

In this way we can populate the whole matrix:

		s	i	t	t	i	n	g
	0	0	0	0	0	0	0	0
k	1	1	1	1	1	1	1	1
i	2	2	1	2	2	1	2	2
t	3	3	2	1	2	2	2	3
t	4	4	3	2	1	2	3	3
e	5	5	4	3	2	2	3	3
n	6	6	5	4	3	3	2	3

In the end, the shortest distance is in the bottom right corner of the matrix: 3.

You can take shortcuts in a computer program to avoid creating a matrix. It just makes it easier if you're doing it by hand.

References:

Levenshtein, VI. “Binary Codes with Correction of Deletions, Insertions and Substitution of Symbols.” In Dokl. Akad. Nank. SSSR, 163:845–848, 1965.
Sellers, Peter H. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, no. 4 (1980): 359–373.

Sellers edit distance

The algorithm to compute the Sellers edit distance is the same as the algorithm to compute the Levenshtein edit distance, except the cost (or weight) of each operation can be any value. In the Levenshtein edit distance, the cost of each operation is 1.

So, using the above algorithm, we can populate the matrix by taking the minimum of the following equations for each cell:

value of cell (i - 1, j) + deletion cost
value of cell (i, j - 1) + insertion cost
value of cell (i - 1, j - 1) + (0 or substitution cost)

Hamming distance

The Hamming distance is similar to the Levenshtein distance, except only substitutions are taken into account. The strings being compared must be the same length.

References:

Hamming, Richard W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29, no. 2 (1950): 147–160.

Pair distance

The pair distance is computed by determining the common set of subsequent character pairs in two strings.

For example, if you want to compare "kitten" and "sitting", first you get the set of character pairs for each string:

kitten: {KI, IT, TT, TE, EN}
sitting: {SI, IT, TT, TI, IN, NG}

Then you find the set of pairs in both sets: {IT, TT}.

The pair distance is twice the number of pairs in the common set divided by the sum of the number of character pairs in each original set.

In this case: (2 * 2)/(5 + 6) = 0.363...

The idea is that each pair of subsequent characters contains some information about the original string being compared.

References:

White, Simon. “How to Strike a Match.” Accessed March 7, 2013. http://www.catalysoft.com/articles/StrikeAMatch.html.

Jaro-Winkler score

The Jaro-Winkler score (or distance) takes into account the number of matching characters and the transposition of characters in two strings.

The definition of "matching" is a little different than the other algorithms. A character in string 1 matches a character in string 2 if the character is the same and the respective positions of the characters aren't too far apart. Specifically, the characters can't be more than floor(max(length(s1), length(s2)) / 2) - 1 spaces apart from each other.

Once you have a list of matching characters, remove the non-matching ones from each string and compare the two new strings. If the two new strings aren't the same, then you have transpositions. The number of characters that are out of order (divided by 2) are the number of transpositions.

To get the Jaro score, apply this formula: 1/3 * (m / length(s1) + m / length(s2) + (m - t) / m), where:

m is the number of matching characters
s1 is the first string
s2 is the second string
t is the number of transpositions

For example, let's consider "Martha" and "Marhta". All of the letters match, but there is a transposition: T -> H. So:

m = 6
length(s1) = 6
length(s2) = 6
t = 1

The Jaro score is: 1/3 * (6 / 6 + 6 / 6 + 5 / 6) = 0.944

To find the Jaro-Winkler score, we first need to find the number of matching characters at the beginning. In this example, we have 3. Then apply the formula: jaro_score + (l * p * (1 - jaro_score)), where:

l is the length of the common prefix at the start of the string up to a maximum of 4
p is the constant scaling factor (usually 0.1 and not more than 0.25)

In our example, the Jaro-Winkler score is: 0.944 + (3 * 0.1 * (1 - 0.944)) = 0.9608

-- JeremyStephens - 28 Feb 2013

Topic revision: r3 - 07 Mar 2013, JeremyStephens

Main

Department Home Page

Biostatistics Graduate Program

Vanderbilt University Medical Center

Biostatistics Webs
- Archive
- Main
- Sandbox
- System

Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback

		s	i	t	t	i	n	g
	0	0	0	0	0	0	0	0
k	1	1	1	1	1	1	1	1
i	2	2	1	2	2	1	2	2
t	3	3	2	1	2	2	2	3
t	4	4	3	2	1	2	3	3
e	5	5	4	3	2	2	3	3
n	6	6	5	4	3	3	2	3

		s	i	t	t	i	n	g
	0	0	0	0	0	0	0	0
k	1	1	1	1	1	1	1	1
i	2	2	1	2	2	1	2	2
t	3	3	2	1	2	2	2	3
t	4	4	3	2	1	2	3	3
e	5	5	4	3	2	2	3	3
n	6	6	5	4	3	3	2	3

		s	i	t	t	i	n	g
	0	0	0	0	0	0	0	0
k	1	1	1	1	1	1	1	1
i	2	2	1	2	2	1	2	2
t	3	3	2	1	2	2	2	3
t	4	4	3	2	1	2	3	3
e	5	5	4	3	2	2	3	3
n	6	6	5	4	3	3	2	3