Approximate String Matching

Here is a collection of algorithms that can be used to compare strings.

Levenshtein edit distance

Using the basic operations of insertion, deletion, and substitution, the Levenshtein distance is minimum number of operations required to transform one string into another.

For example, if we want to transform kitten into sitting, we perform the following operations:

  1. kitten -> sitten (substitution)
  2. sitten -> sittin (substitution)
  3. sittin -> sitting (insertion)

Finding the shortest distance between two strings can be done by using a matrix.

To do that, first create a (m + 1) by (n + 1) matrix, where m is the number of letters in the first word, and n is the number of letters in the second word. In this example, we have a 7x8 matrix. Set each value in the first row to 0, and set each value in the first column to the row number (0-indexed).

    s i t t i n g
  0 0 0 0 0 0 0 0
k 1              
i 2              
t 3              
t 4              
e 5              
n 6              

To populate the matrix, essentially what we'll be doing is computing the distances between substrings of our two strings. For example, let's consider cell (1, 1). The value of this cell is the minimum distance between k and s. That distance is 1, because we need to substitute s for k. Similarly, the value of cell (1, 2) involves comparing the strings k and i and is 1, because we need to substitute a letter. The value of cell (2, 1) involves comparing the strings ki and s. The distance there is 2, because we need to substitute a letter and delete a letter.

To calculate the value of cell (2, 2), we're comparing the substrings ki and si. The distance here is only 1, because only 1 operation (substitution) is needed.

In general, to calculate the value of a cell (i, j), you take the minimum of three equations:

  1. value of cell (i - 1, j) + 1
  2. value of cell (i, j - 1) + 1
  3. value of cell (i - 1, j - 1) + (0 or 1)

Equation 1 is the cost of deleting a letter plus the previously accrued cost, and equation 2 is the cost of inserting a letter plus the previously accrued cost. Equation 3 is a little different in that the cost could be either 0 or 1. If the letters are the same, there is no cost.

In this way we can populate the whole matrix:

    s i t t i n g
  0 0 0 0 0 0 0 0
k 1 1 1 1 1 1 1 1
i 2 2 1 2 2 1 2 2
t 3 3 2 1 2 2 2 3
t 4 4 3 2 1 2 3 3
e 5 5 4 3 2 2 3 3
n 6 6 5 4 3 3 2 3

In the end, the shortest distance is in the bottom right corner of the matrix: 3.

You can take shortcuts in a computer program to avoid creating a matrix. It just makes it easier if you're doing it by hand.

References:
  • Levenshtein, VI. “Binary Codes with Correction of Deletions, Insertions and Substitution of Symbols.” In Dokl. Akad. Nank. SSSR, 163:845–848, 1965.
  • Sellers, Peter H. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, no. 4 (1980): 359–373.

Sellers edit distance

The algorithm to compute the Sellers edit distance is the same as the algorithm to compute the Levenshtein edit distance, except the cost (or weight) of each operation can be any value. In the Levenshtein edit distance, the cost of each operation is 1.

So, using the above algorithm, we can populate the matrix by taking the minimum of the following equations for each cell:

  1. value of cell (i - 1, j) + deletion cost
  2. value of cell (i, j - 1) + insertion cost
  3. value of cell (i - 1, j - 1) + (0 or substitution cost)

-- JeremyStephens - 28 Feb 2013
Edit | Attach | Print version | History: r4 < r3 < r2 < r1 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r2 - 28 Feb 2013, JeremyStephens
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback