Happy Birthday, Soundex!
Honoring the 100th anniversary of the first phonetic name matching patent
One hundred years ago, on April 12, 1918, the patent for Soundex, the first phonetic name matching system, was issued. While Rosette name matching is a hybrid of many methods, phonetics plays a key role in laying the groundwork for the powerful and accurate name matching we have today.
The great-grandfather of phonetic name indexing
Soundex is a phonetic algorithm that encodes words – specifically names – based on their pronunciation rather than their spelling. Names are indexed by their Soundex encoding so that they can be searched and matched to one another despite variations in spelling.
Robert C. Russell and Margaret King Odell created Soundex in the early 1900s to simplify the U.S. Bureau of Archives census taking process. Their system follows a few basic principles:
- Vowels have less of an effect on the overall sound of a word. Soundex disregards vowels unless they occur at the beginning of the name.
- The letters H, W, and Y minimally affect the sounds of most words and are also disregarded by Soundex unless they occur at the beginning of the word.
- Similar-sounding consonants that appear next to each other in a word sound much like a single consonant. Soundex encodes such a run of consonants as a single consonant sound.
- All words are reduced to a four-character code that consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants.
Names or words that sound the same but have different spellings will have the same encoding. For example, the names Caitlin and Catelynn both have the Soundex code C345, and the names Smith and Smythe have the encoding S530.
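The principles above can be sketched in a few lines of Python. This is a minimal illustration of the commonly used American Soundex rules rather than the exact procedure from the 1918 patent. The function name `soundex` and the `CODES` table are names chosen for this example. One detail is an assumption taken from the American variant: H and W are skipped without separating consonants that share a code, while vowels and Y do separate them.

```python
# Letter groups that share a Soundex digit (American Soundex grouping).
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key: first letter plus three digits."""
    # Keep only the ASCII letters A-Z; this sketch ignores everything else.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # code of the previous significant letter
    for ch in letters[1:]:
        if ch in "HW":
            continue  # skipped entirely; does not break up a run of like codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept

    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


for name in ("Caitlin", "Catelynn", "Smith", "Smythe"):
    print(name, soundex(name))
```

Run as written, the loop prints C345 for both Caitlin and Catelynn and S530 for both Smith and Smythe, matching the codes quoted above.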
The New Deal and American Soundex
In the 1930s, the Roosevelt administration enacted the New Deal, a series of programs, regulations, and reforms aimed at stimulating the economy in response to the Great Depression. One of the most ambitious New Deal programs was the creation of the Works Progress Administration (WPA), a means of providing federal jobs to the unemployed.
One of the many projects undertaken by the WPA was the creation of the American Soundex to codify the decennial census, assisting the Census Bureau in finding records of individuals who needed age verification. The creation of Social Security, an increase in passport applications, changes in national defense needs, and other factors increased requests for official proof of age from 97,000 to 216,000 between 1936 and 1940. Before this system, however, states did not share a uniform system for registering births.
Individuals can still use census records codified in the American Soundex format to find records of ancestors today.
Modern phonetic indexing
Metaphone and Double Metaphone are two of the most popular methods of phonetic encoding (also known as the Common Key Method) in use today. Metaphone expands on Soundex with a wider set of English pronunciation rules and allows keys of varying length, whereas Soundex uses a fixed-length key.
Double Metaphone returns both a “primary” and a “secondary” code for each name, which lets it account for names with more than one plausible pronunciation. In addition, instead of being tied to the English pronunciation of characters, it attempts to encompass pronunciations of other origins such as Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, and Chinese.
For example, Double Metaphone encodes “Smith” with a primary code of SM0 and a secondary code of XMT, while it encodes “Schmidt” with a primary code of XMT and a secondary code of SMT. That the secondary code of “Smith” matches the primary code of “Schmidt” (XMT) indicates a degree of similarity between the names which Soundex perhaps overstates and which Metaphone misses.
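One way to make use of the two codes is to rank a match by which keys agree. The sketch below does not implement Double Metaphone itself. It takes the (primary, secondary) pairs quoted above as given input, and the function name `match_level` and its three labels are illustrative assumptions rather than part of any particular library.

```python
from typing import Optional, Tuple

Keys = Tuple[str, str]  # (primary, secondary) codes for one name


def match_level(a: Keys, b: Keys) -> Optional[str]:
    """Describe how two Double Metaphone key pairs agree, strongest first."""

    def same(x: str, y: str) -> bool:
        # An empty key never counts as a match.
        return bool(x) and x == y

    if same(a[0], b[0]):
        return "primary-primary"
    if same(a[0], b[1]) or same(a[1], b[0]):
        return "primary-secondary"
    if same(a[1], b[1]):
        return "secondary-secondary"
    return None


# Codes as quoted in the text above, not computed here.
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")
print(match_level(smith, schmidt))
```

With the codes from the text, the two names agree only across primary and secondary keys, which a system could treat as a weaker signal than two identical primary keys.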
Limits of phonetic methods
When similarity is being scored for pairs of names in different scripts, the names must first be converted to Latin characters, which can introduce more errors into the comparison. Particularly in languages such as Japanese, where one character can have more than one correct pronunciation, converting first to the Latin script can introduce fatal mistakes. For example, the common Japanese female name 洋子 can be correctly pronounced Yoko or Hiroko.
Transliteration of names (a mapping of characters or sounds in one script to another) produces many possible variations since sounds in one language have to be approximated. Variations introduced by transliteration increase the complexity of the already difficult task of matching names.
If عبد الرشید is being evaluated against Abdal-Rachid, but the transliteration of عبد الرشید produces Ar-Rashid, will the names come back as a match, as they should?
| Name | Soundex Key | Metaphone Key |
Many ways to match
Consistently matching multilingual names correctly is a complex challenge. In addition to phonetic encoding, many other methods of name matching come into play:
- List Method: This method attempts to list all possible spelling variations of each name component and then looks for matching names from these lists of name variations.
- Edit Distance Method: This approach looks at how many character changes it takes to get from one name to another.
- Statistical Modeling: This technique takes hundreds, if not thousands, of matching name pairs and trains a model to recognize what two “similar names” look like so that the model can take two names and assign a similarity score.
Much like phonetic matching, each of these methods has constraints and failings that offset its strengths. The most effective name matching systems are a hybrid of two or more of these techniques, so that the weaknesses of one method are compensated for by the strengths of another.
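To make the edit distance method concrete, here is a minimal sketch of Levenshtein distance, one common way of counting character changes. The text does not say which edit distance any particular system uses, so the choice of Levenshtein, the function names, and the normalization into a 0-to-1 score are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible

    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0.0 and 1.0 (1.0 means identical)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Smith", "Smythe"))  # 2: substitute i with y, insert e
```

A hybrid system might combine a score like this with a phonetic key comparison, so that a pair that is far apart in spelling but close in sound is not rejected outright.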
To learn more about phonetic encoding and how the other fuzzy name matching methods work, check out our Overview of Fuzzy Name Matching Techniques.
Header image: 1930 census Turpin, Ben Turpin via Wikimedia Commons