Name Matching Algorithms

The basics you need to know about fuzzy name matching

When identification numbers are not available, names are often used as a unique identifier. Yet, misspellings, aliases, nicknames, transliteration and translation errors bring unique challenges in matching names. Each fuzzy name matching algorithm excels at solving one or several of these challenges in their own unique ways to provide better matching.

Learn the basics of fuzzy name matching techniques and find out the one that suits you best.

Request Demo

name matching

A Few Things to Know Before Starting

What is fuzzy name matching?

Fuzzy matching assigns a probability to a match between 0.0 and 1.0 based on linguistic and statistical methods instead of just choosing either 1 (true) or 0 (false). As a result, names Robert and Bob can be a match with high probability even though they’re not identical.

What is exact name matching?

Exact name matching determines whether two names are the same. For instance, Bill and William is not a match in this case because two names are not exactly the same even though Bill is a nickname for William.

What are the current challenges in matching names?

Name matching is hard. Spelling variations, initials, nicknames, titles, and names in different languages and scripts are just some of the most common challenges of name matching we see today.

What does “precision” and “recall” mean?

Precision: The number of correct results over the total number of results retrieved. High precision indicates the measure of quality.*

Recall: The number of correct items you found over the total number of correct items. High recall indicates the measure of quantity.*

The Most Common Challenges of Name Matching

Phonetic Similarity

Jesus ↔ Heyzeus ↔ Haezoos

Missing Spaces & Hyphens

MaryEllen ↔ Mary Ellen ↔ Mary-Ellen

Missing Components

Phillip Charles Carr ↔ Phillip Carr

Split Database Fields

Dick. Van Dyke ↔ Dick Van. Dyke

Spelling Differences

Abdul Rasheed ↔ Abd al-Rashid

Titles & Honorifics

Dr. ↔ Mr. ↔ Ph.D.

Out-of-Order Components

Diaz, Carlos Alfonzo ↔ Carlos Alfonzo Diaz

Multiple Languages

Mao Zedong ↔ Мао Цзэдун ↔ 毛泽东


William ↔ Will ↔ Bill ↔ Billy

Truncated Components

McDonalds ↔ McDonald ↔ McD


J.J. Smith ↔ James Earl Smith

Similar Names

Eagle Pharmaceuticals, Inc. ↔ Eagle Drugs, Co.

Name Matching Algorithms at a Glance

Common Key Method

Assignes names a key or code based on their English pronunciation such that similar sounding names share the same key. A well-known common key method is Soundex.

List Method

Generates a list of all possible spelling variations of each name component and, then, matches names from that list.

Edit Distance Method

Calculates the smallest number of changes — in different ways — it takes to get from one name to other.

Statistical Similarity Method

Develops a statistical algorithm by training thousands of paired names to calculate the similarity score between two names.

Word Embedding Method

Turns each word into a numerical vector based on its semantic meaning and calculates the similarity of two words in a multidimensional space. Commonly used for organization names.

Hybrid Method

Combines some or all of the name matching methods above.

Dive into the World of Name Matching Algorithms

Learn the pros and cons of each algorithm and understand what’s happening behind the scenes.

Statistical Similarity Method

Pros: Matches across languages and scripts; offers greater precision

Cons: Slower performance; high barrier to entry as it requires training data and adjusting features etc.

A statistical approach takes hundreds, if not thousands, of matching name pairs and trains a model to recognize what two “similar names” look like so that the model can take two names and assign a similarity score.

A statistical model that has been trained on thousands of pairs of matching names offers high accuracy and the ability to directly match names written in different languages without first transliterating names to Latin script.

This method has a higher barrier to entry, as collecting the matching names requires significant resources, but the accuracy may be well worth the effort. A downside is the slowness of execution. A system only using the statistical method to sift through millions of names to look for matches may be too slow to be feasible in high-transaction environments.

Common Key Method

Pros: Fast execution, high recall

Cons: Mostly limited to Latin-based languages; transliterating non-Latin names reduces precision

These methods reduce names to a key or code based on their English pronunciation, such that similar sounding names share the same key. A well-known common key method is Soundex, patented in 1918. For example, Cyndi, Canada, Candy, Canty, Chant, Condie share the code C530.

Many methods take a similar approach to Soundex, including Metaphone and Double Metaphone. These methods use phonetic algorithms which turn similar sounding names into the same key, thus identifying similar names. Metaphone expands on Soundex with a wider set of English pronunciation rules and allowing for varying lengths of keys, whereas Soundex uses a fixed-length key.

Double Metaphone further refines the matching by returning both a “primary” and “secondary” code for each name, allowing for greater ambiguity. In addition, instead of being tied to English pronunciation of characters, it attempts to encompass pronunciations of other origins such as Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, and Chinese.

For example, Double Metaphone encodes “Smith” with a primary code of SM0 and a secondary code of XMT, while it tags “Schmidt” with a primary code of XMT and a secondary code of SMT. That the names share a primary and secondary code of XMT indicates a degree of similarity between the names which Soundex perhaps overstates and which Metaphone misses.

Name Soundex Key
Smith S530
Schmidt S530
Name Metaphone Key
Smith SM0
Schmidt SXMTT

While the common key method is fast to execute and has good recall, the precision suffers. Manual inspection of a few names reveals the precision issues. These names share the Soundex key H245: Haugland, Hagelin, Haslam, Heislen, Heslin, Hicklin, Highland, Hoagland.

Metaphone does a better job than Soundex, encoding the above names with different codes except for the very similar pairs Haugland/Hoagland and Heislen/Heslin.

Name Metaphone Key
Haugland HKLNT
Hagelin HJLN
Haslam HSLM
Heislen HSLN
Heslin HSLN
Hicklin HKLN
Highland HFLNT
Hoagland HKLNT

For cases where name similarity is being scored against pairs of names in different scripts—for example Korean hangul vs. English—the name must first be converted to Latin characters, which potentially introduces more errors to the comparison.

Particularly in languages such as Japanese where one character can have more than one correct pronunciations, converting first to the Latin script can introduce fatal mistakes. The common Japanese female name 洋子 can be correctly pronounced Yoko or Hiroko.

Transliteration of names (a mapping of characters or sounds in one script to another) produces many possible variations since sounds in one language have to be approximated. Variations introduced by transliteration increases the complexity of the already difficult task of matching names.

If الرشید عبد is being evaluated against Abdal-Rachid, but the transliteration of الرشید عبد produces Ar-Rashid, will the names come back as a match—as they should?

Name Soundex Key Metaphone Key
Abdal-Rachid A134 ABTLRXT
Ar-Rashid A623 ARRXT

One common key method, the Beider-Morse Phonetic Matching algorithm, does accept Russian in Cyrillic script and Hebrew in Hebrew script, but is otherwise Latin-bound.

List Method

Pros: Easy to maintain

Cons: Computationally intensive (read: expensive hardware needed to run against long lists of names quickly); Cannot handle names the system doesn’t know about; Cannot handle names with missing/added spaces between components; Cannot handle names split between different fields

This method attempts to list all possible spelling variations of each name component and then looks for matching names from these lists of name variations. For example: One system produced 3,024 possible transliterations of this Arabic name “الرشید عبد“ since each separate name component alone has several variations. Here are the first five and last five variations.

1. Abdal-rashid

2. Abdal-rashide

3. Abdal-rasheed

4. Abdal-rashiyd

5. Abdal-rachid

3020. ‘Abd-errshiyd

3021. ‘Abd-errchid

3022. ‘Abd-errchide

3023. ‘Abd-errcheed

3024. ‘abd-errchiyd

Trying to generate every possible name variation has a couple of obvious drawbacks. Name variations which are not in the list will not be found as matches, and perhaps an even greater issue is that of speed and size. Since multi-part names–particularly non-English names–generate an exponentially growing list of variations, searching through these lists takes time. Given a name with just three components and 20 possible variations per name, the number of possibilities is 203 (=8,000), a very large search space for just one name, now multiply it by the number of names on a watch list! There are further challenges with the list method – how do you score matches when one of your 8,000 query variants matches more than one name in the database? It is also difficult to handle other types of variation, like nicknames, initials, and titles, without expanding the search space even more.

A benefit of the list method is that it is simple to maintain. When a user complains about a missed match, it’s easily added to the name database. However, easy maintenance may not be enough to offset the decreased speed. For applications with that require high-throughput over millions of names, such as watchlist screening, anti-money laundering (AML), and know your customer (KYC), this approach is likely to be too slow or require a lot of expensive hardware.

Edit Distance Method

Pros: Easy to implement

Cons: Limited to Latin-based languages; all swaps are weighted evenly, missing linguistic nuances

This approach looks at how many character changes it takes to get from one name to another. “Cindy” and “Cyndi” have an edit distance of 1 since the “i” and “y” are merely transposed, whereas “Catherine” and “Katharine” have an edit distance of 2 as the “C” turns into a “K” and the first “e” becomes an “a.”

Methods which look at the character-by-character distance between two names include the Levenshtein distance, the Jaro–Winkler distance, and the Jaccard similarity coefficient. These approaches look at some combination of two factors (1) the number of similar characters and (2) the number of edit operations it takes to turn one name into the other—the operations being, insert, delete, and transpose.

Although these comparisons are quick, they do not capture linguistic nuance. All edits are given the same weight. Thus changing “c” to “p” is weighted equally as “c” to “k” although in English the latter substitution might more clearly indicate a similar name, as in “Catherine” vs. “Katherine.” Further, a one-to-many character mapping is not possible, as in the case of the Arabic character “sheen” ش‎ which is frequently mapped to “sh” in English.

And, just as with the common key method, a non-Latin script name must first be transliterated to Latin script before the comparison can be executed, as explained in the discussion of “The Weakness of the Common Key Method in Matching Across Scripts”.

Word Embedding Method

Pros: makes semantic matches that a spelling-centric method would miss

Cons: only relevant to organization name matching

Organization names differ from human names in that variations may include synonyms that look and sound entirely different than the target name. In these cases, two names referring to one company are semantically similar but phonetically different. For example, a human can quickly infer that corporation, company, and group are all similar words often found in an organization’s name, but standard name matching techniques like the edit distance method would be unlikely to make the connection. In these cases, word embeddings can make the match.

Word embeddings are numerical vector representations of a word’s semantic meaning. If two words or documents have a similar embedding, they are semantically similar. For example, the embeddings of “woman” and “girl” are close to one another in the vector space, meaning they are semantically similar. Contrastingly, the embeddings of “whale” and “philosophy” are far from one another because they are not semantically related. Applied to organizations, the word embedding method recognizes that Eagle Drugs and Eagle Pharmaceuticals are most likely the same company.

Best Practice: The Hybrid Method

The hybrid name matching method combines two or more of these name matching algorithms to backfill weakness in one algorithm with the strength of another algorithm.

machine learning name matching

Rosette uses the hybrid method combining algorithms that suit your needs best.

machine learning name matching

Taking advantage of the common key method, Rosette quickly winnows the candidate pool down to a smaller, likely set of matches in the first pass.

In the second pass, using a high-precision statistical method, Rosette filters the highest scoring matches to the top so that fine-grained distinctions between different matches have been made.

machine learning name matching
machine learning name matching

Additionally, setting a minimum match threshold further controls the quality and quantity of results returned and allows Rosette provide the best results for you.

Throughout the process, Rosette doesn’t simply combine different algorithms but also handles name phenomena such as missing name components by comparing every available combination and scoring the degree of match for each to give the end-user an appropriate degree of confidence in the match.

Benefits of Rosette Name Matching

  • Matches names of people, locations, and organizations
  • Ranks results by the relevancy based on the confidence score
  • Matches names regardless of how the names are written in fifteen languages
  • Leverages cross-script and cross-lingual matching
  • Takes advantage of semantic similarity algorithms
  • Provides greater accuracy and recall
  • Faster and more reliable than legacy solutions
  • Available to deploy on-premise and on the Cloud API

Rosette understands the linguistic complexities of names across fifteen languages. Contact us today to learn more about the sophistication of Rosette’s name matching algorithm and what difference it could make in your business.

Request a demo

Further Reading

[*] Source: Wikipedia