You are here: Vanderbilt Biostatistics Wiki>Main Web>ApproximateStringMatching (14 Mar 2013, JeremyStephens)EditAttach

- kitten ->
**s**itten (substitution) - sitten -> sitt
**i**n (substitution) - sittin -> sittin
**g**(insertion)

`(m + 1)`

by `(n + 1)`

matrix, where `m`

is the number of letters in the first word, and `n`

is the number of letters in the second word. In this example, we have a `7x8`

matrix. Set each value in the first row to s | i | t | t | i | n | g | ||
---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |

k | 1 | |||||||

i | 2 | |||||||

t | 3 | |||||||

t | 4 | |||||||

e | 5 | |||||||

n | 6 |

`(1, 1)`

. The value of this cell is the minimum distance between `(1, 2)`

involves comparing the strings `(2, 1)`

involves comparing the strings `(2, 2)`

, we're comparing the substrings `(i, j)`

, you take the minimum of three equations:
- value of cell
`(i - 1, j)`

+ 1 - value of cell
`(i, j - 1)`

+ 1 - value of cell
`(i - 1, j - 1)`

+ (0 or 1)

s | i | t | t | i | n | g | ||
---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |

k | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

i | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 2 |

t | 3 | 3 | 2 | 1 | 2 | 2 | 2 | 3 |

t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 3 |

e | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 3 |

n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |

- Levenshtein, VI. “Binary Codes with Correction of Deletions, Insertions and Substitution of Symbols.” In Dokl. Akad. Nank. SSSR, 163:845–848, 1965.
- Sellers, Peter H. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, no. 4 (1980): 359–373.

- value of cell
`(i - 1, j)`

+ deletion cost - value of cell
`(i, j - 1)`

+ insertion cost - value of cell
`(i - 1, j - 1)`

+ (0 or substitution cost)

- Hamming, Richard W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29, no. 2 (1950): 147–160.

- kitten:
`{KI, IT, TT, TE, EN}`

- sitting:
`{SI, IT, TT, TI, IN, NG}`

`{IT, TT}`

.
The pair distance is twice the number of pairs in the common set divided by the sum of the number of character pairs in each original set.
In this case: `(2 * 2)/(5 + 6) = 0.363...`

The idea is that each pair of subsequent characters contains some information about the original string being compared.
References: - White, Simon. “How to Strike a Match.” Accessed March 7, 2013. http://www.catalysoft.com/articles/StrikeAMatch.html.

`floor(max(length(s1), length(s2)) / 2) - 1`

spaces apart from each other.
Once you have a list of matching characters, remove the non-matching ones from each string and compare the two new strings. If the two new strings aren't the same, then you have `1/3 * (m / length(s1) + m / length(s2) + (m - t) / m)`

, where: - m is the number of matching characters
- s1 is the first string
- s2 is the second string
- t is the number of transpositions

- m = 6
- length(s1) = 6
- length(s2) = 6
- t = 1

`1/3 * (6 / 6 + 6 / 6 + 5 / 6) = 0.944`

To find the Jaro-Winkler score, we first need to find the number of matching characters at the beginning. In this example, we have 3. Then apply the formula: `jaro_score + (l * p * (1 - jaro_score))`

, where: - l is the length of the common prefix at the start of the string up to a maximum of 4
- p is the constant scaling factor (usually 0.1 and not more than 0.25)

`0.944 + (3 * 0.1 * (1 - 0.944)) = 0.9608`

References: - Winkler, William E. “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.” (1990).

Number | Represents the Letters |
---|---|

1 | B, F, P, V |

2 | C, G, J, K, Q, S, X, Z |

3 | D, T |

4 | L |

5 | M, N |

6 | R |

- Double letters are considered as a single letter
- Adjacent letters with the same code are treated as a single letter
- The following name prefixes cause a name to have two Soundex codes (one with the prefix and one without): Van, Con, De, Di, La, and Le

- "The Soundex Indexing System", National Archives, last modified May 30, 2007, http://www.archives.gov/research/census/soundex.html

- "The Double Metaphone Search Algorithm", Lawrence Phillips, last modified June 1, 2000, http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2

- Translate first characters of name: MAC → MCC, KN → N, K → C, PH, PF → FF, SCH → SSS
- Translate last characters of name: EE → Y, IE → Y, DT, RT, RD, NT, ND → D
- First character of key = first character of name.
- Translate remaining characters by following rules, incrementing by one character each time:
- EV → AF else A, E, I, O, U → A
- Q → G, Z → S, M → N
- KN → N else K → C
- SCH → SSS, PH → FF
- H → If previous or next is non-vowel, previous.
- W → If previous is vowel, A.
- Add current to key if current is not same as the last key character.

- If last character is S, remove it.
- If last characters are AY, replace with Y.
- If last character is A, remove it.
- Append translated key to value from step 3 (removed first character)
- If longer than 6 characters, truncate to first 6 characters. (only needed for true NYSIIS, some versions use the full key)

- Taft, Robert L. Name Search Techniques. 1. Bureau of Systems Development, New York State Identification and Intelligence System, 1970.

Edit | Attach | Print version | History: r4 < r3 < r2 < r1 | Backlinks | View wiki text | Edit wiki text | More topic actions

Topic revision: r4 - 14 Mar 2013, JeremyStephens

Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback

Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback