Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.^[1] The algorithm mainly encodes consonants; a vowel is not encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

History

Soundex was developed by Robert C. Russell and Margaret K. Odell and patented in 1918^[2] and 1922.^[3] A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery (CACM and JACM), and especially when it was described in Donald Knuth's magnum opus, The Art of Computer Programming, vol. 3: Sorting and Searching (Addison-Wesley Professional, 1973), pp. 391-392.

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government.^[1] These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

Rules

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit, so, for example, the labial consonants B, F, P, and V are each encoded as the number 1. Vowels can affect the coding but are not coded themselves except as the first letter: two consonants with the same Soundex code are each coded when a vowel separates them. However, if "h" or "w" separates two consonants that have the same Soundex code, the consonant to the right of the "h" or "w" is not coded.

The correct value can be found as follows:

If "h" or "w" separates two consonants with the same Soundex code, do not code the consonant to the right of the "h" or "w".
Replace consonants with digits as follows (but do not change the first letter):
- b, f, p, v => 1
- c, g, j, k, q, s, x, z => 2
- d, t => 3
- l => 4
- m, n => 5
- r => 6
Collapse adjacent identical digits into a single digit of that value.
Remove all non-digits after the first letter.
Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" yields "A261".
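The rules above can be sketched in Python as follows. This is a minimal illustration, not the official NARA implementation: the function name `soundex` and the `CODES` table are names chosen here, the input is assumed to be a name in ASCII letters, and the "h"/"w" rule is implemented by dropping those two letters after the first letter so that the consonants on either side become adjacent. As in the rules as listed, the first letter is kept unchanged and its own code is not compared with the consonant that follows it; some rule sets handle that case differently.

```python
# Digit assigned to each coded consonant, taken from the list in the rules.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a letter-plus-three-digits code for a name (illustrative sketch)."""
    # Assumption: only ASCII letters are considered; anything else is skipped.
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # "h" and "w" do not separate consonants: drop them so that same-coded
    # consonants on either side become adjacent and collapse later.
    rest = [ch for ch in rest if ch not in "hw"]

    # Replace consonants with digits; vowels stay in place as separators.
    encoded = [CODES.get(ch, ch) for ch in rest]

    # Collapse runs of adjacent identical digits into a single digit.
    collapsed = []
    for sym in encoded:
        if not (collapsed and sym.isdigit() and sym == collapsed[-1]):
            collapsed.append(sym)

    # Remove everything that is not a digit (the remaining vowels and "y").
    digits = [sym for sym in collapsed if sym.isdigit()]

    # First letter plus the first three digits, padded with zeroes.
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for example in ("Robert", "Rupert", "Rubin", "Ashcraft"):
        print(example, soundex(example))
```

Run as a script, this prints the code for each of the four example names, which can be compared against the values given above.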

Soundex variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

Daitch-Mokotoff Soundex (D-M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D-M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex",^[4] although the authors discourage the use of these nicknames. The D-M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D-M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990 for the same purpose. Philips developed an improvement to Metaphone in 2000, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

References

[1] "The Soundex Indexing System". National Archives and Records Administration. 2007-05-30. Retrieved 2007-06-07.
[2] US patent 1261167, R. C. Russell, "(title unknown)", issued 1918-04-02
[3] US patent 1435663, R. C. Russell, "(title unknown)", issued 1922-11-14
[4] Mokotoff, Gary (2007-09-08). "Soundexing and Genealogy". Retrieved 2008-01-27.

External links

The Soundex Indexing System (U.S. National Archives and Records Administration)
Text::Soundex Perl module from CPAN
PHP soundex function
SimMetrics an open source (sourceforge) library of similarity metrics including a number of soundex variants
Soundex in JavaScript (wrong: prefixes like "van der" are not excluded, original has two soundex codes for names with prefixes)
Soundex in JavaScript (view page source for code)
Soundex in Ruby
Soundex in Python
Soundex in PostgreSQL
Soundex Tcl package from the tcllib library
Indic Soundex developed by Swatantra Malyalam Group, can search text across different Indic scripts.
Indic Soundex's Source Code Code for above example.
