Making the Most of Intelligence
The Importance of Name Matching in Identity Resolution
Great strides have been made in counterterrorism with government agencies widely sharing and linking their intelligence databases. Yet, the ability to connect the dots from a growing ocean of information from data mining, signal intelligence, and social media monitoring still comes down to matching names.
Matching names is especially difficult when a name may appear in different sources written with different spellings, or even in different languages and scripts. It is possible to add this linguistic expertise to many common search engines and any intelligence or watch list filtering system.
Finding the Dots to Connect
Different government agencies maintain multiple databases to keep track of known or suspected terrorists and their associates. The FBI-maintained Terrorist Screening Database (TSDB) is just one of these. On a daily basis, it receives an average of 1,600 names nominated for addition, 600 for removal, and 4,800 for correction.1 The TSDB covers some 400,000 people whom authorities have a “reasonable suspicion” are tied to terrorism, which amounts to over 1 million entries once all aliases are counted.2 The TSDB, in turn, is used to compile numerous watch lists and screening systems.3
The intent of all this name-gathering is to give the authorities a means to fuse disparate pieces of information and identify persons of interest who deserve greater scrutiny. Unfortunately, this isn’t always successful. In 2012, an urgent alert called for Tamerlan Tsarnaev to be detained “…whether or not the officer believes there is an exact match.” Because the spelling of his name on his passport did not exactly match the spelling in the alert, he was not picked up. One year later, Tsarnaev instigated the Boston Marathon bombings.
Acting on gathered intelligence, from alert through to action, starts with accurately matching name variations across languages and scripts.
The case of the Boston Marathon bombing suspect, Tamerlan Tsarnaev, is one example of how small differences in the spelling of names can have devastating results. Despite a 2011 warning from the Russian government, Tsarnaev was not picked up by authorities when he traveled from Boston to Russia in January 2012, nor when he returned in July 2012, because the urgent alert to detain him spelled his name as “Tsarnayev.” It is suspected that he was “radicalized” and trained during his trip to Russia. The Boston Marathon bombings occurred the following April, in 2013.
In fact, transliteration variants of “Tsarnaev”4 have cropped up in English and Russian news reports and even passenger logs, ranging from the original Cyrillic, Царнаев, to spellings as diverse as Carnaev or Czarnaev.
In 2009, Umar Farouk Abdulmutallab attempted to detonate an explosive device on board a flight from Amsterdam to Detroit on Christmas Day. Weeks earlier, on November 8, 2009, Abdulmutallab’s father had met with U.S. Embassy officials in Nigeria to report concerns about his son becoming radicalized. However, because the father spelled Abdulmutallab’s name differently from the visa record (it differed by one letter), the two pieces of information were not connected.
Some news reports5 criticized U.S. embassy officials for not conducting a more thorough visa search by trying other spellings of Abdulmutallab’s name, but even for a fluent Arabic speaker, coming up with all possible spelling permutations is no small job.
Tragedy and Near Misses
The misspellings in Abdulmutallab’s case and that of Tamerlan Tsarnaev point to the variability that is introduced when transliterating names from their native script to the Latin alphabet.
|Name in Native Arabic||English Transliterations|
Names in languages that don’t use the Latin alphabet are rarely transliterated to a consistent English spelling. These languages may use sounds inexpressible in English, or some sounds may be expressed in English using more than one letter combination (e.g., Czar vs. Tsar).
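To make the point concrete, here is a minimal sketch of how a handful of per-letter choices multiply into several Latin spellings of one Cyrillic name. The `ROMANIZATIONS` table and the `latin_variants` function are assumptions made up for this illustration. They are not an official romanization standard and not how any particular product works.

```python
from itertools import product

# Illustrative (not exhaustive) Latin options for each Cyrillic letter.
# Letters with more than one option are where spelling variants come from.
ROMANIZATIONS = {
    "ц": ["ts", "c", "cz"],
    "а": ["a"],
    "р": ["r"],
    "н": ["n"],
    "е": ["e", "ye"],
    "в": ["v"],
}

def latin_variants(cyrillic_name: str) -> list[str]:
    """Return every Latin spelling reachable from the per-letter options."""
    options = [ROMANIZATIONS[ch] for ch in cyrillic_name.lower()]
    return sorted("".join(parts).capitalize() for parts in product(*options))

variants = latin_variants("Царнаев")
print(variants)
```

With only two ambiguous letters, this toy table already yields six spellings, including the “Tsarnaev,” “Tsarnayev,” “Carnaev,” and “Czarnaev” forms mentioned above.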
The real question is, why should the embassy staff have to think up and search on numerous name variations? Instead, search systems should have fuzzy name matching built in.
In fact, in the mid-2010s, the U.S. Department of Homeland Security’s Customs and Border Protection strengthened its Targeting Analysis Systems Program Office (TASPO) by bringing in Rosette, Basis Technology’s cutting-edge name matching software. Rosette provides fuzzy name matching across 15 languages and their native scripts, including English, Russian, Chinese, Korean, Arabic, Persian, Pashto, and Urdu.
Naïve vs. Linguistic Methodologies
Traditional methods of name matching do not use linguistic knowledge. They apply what might be called the naïve or “brute force” approach, which matches names by addressing them simply as a sequence of letters. If two names have the same letters in the same order (or a simple approximation) then they’re considered a match. Naïve methods attempt to handle name variability (as in the Abdul Rashid example) by maintaining an exhaustive list of possible name options.
The downside to this process is that it requires (1) the continuous collection and storage of name variations, and (2) knowing the name in advance. In addition, it needs enormous computing power to continuously check every name against a massive and ever-growing list. The unacceptably high risk of this approach is that a not-seen-before spelling of a name may not be matched to an existing person on a watch list.
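The weakness of the list-based approach can be sketched in a few lines. The `WATCH_LIST` contents and the `naive_match` function below are hypothetical, chosen only to show that a letter-for-letter lookup finds a spelling someone stored in advance and misses one nobody anticipated.

```python
# Hypothetical watch list entry with the variants someone thought to store.
WATCH_LIST = {
    "tamerlan tsarnaev": {"tamerlan tsarnaev", "tamerlan carnaev"},
}

def naive_match(query: str) -> bool:
    """Letter-for-letter lookup against the stored variants only."""
    normalized = " ".join(query.lower().split())
    return any(normalized in variants for variants in WATCH_LIST.values())

print(naive_match("Tamerlan Tsarnaev"))   # True: this spelling was stored
print(naive_match("Tamerlan Tsarnayev"))  # False: this spelling was never stored
```

Adding “Tsarnayev” to the list fixes this one case, but only after the miss has already happened.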
A knowledge-based approach is significantly better. Rather than try to specify every possible name variation, a computer is “taught” to identify variations based on automated linguistic methods – the same linguistic patterns people use to make name variants. This approach applies phonology (how words sound) to orthography (how words are written).
Take “Charley” versus “Charlie.” A linguistic engine knows that in English, “ley” and “lie” sound the same, and that both forms are reasonable variations of the same name. Other types of knowledge also apply, such as formal vs. informal names, where the “e”-sounding suffix indicates a nickname for the formal name “Charles.” Linguistic algorithms can recognize that the many different potential spellings of Mohammed “sound” the same and, in fact, derive from the name محمد, without the need to maintain an exhaustive list of each variation.
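The idea of comparing names by how they sound rather than by their exact letters can be illustrated with a deliberately crude phonetic key. The `sound_key` rule below (keep the first letter, drop later vowels, collapse doubled letters) is an assumption invented for this sketch. A real linguistic engine uses far richer, language-specific knowledge, and a key this simple will also group names that should stay apart.

```python
import re

VOWELS = "aeiouy"

def sound_key(name: str) -> str:
    """Crude phonetic key: keep the first letter, drop later vowels,
    and collapse repeated letters."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return ""
    # Collapse runs of the same letter ("mm" -> "m").
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in VOWELS)

def sounds_alike(a: str, b: str) -> bool:
    """True when two names reduce to the same crude phonetic key."""
    return sound_key(a) == sound_key(b)

print(sounds_alike("Charley", "Charlie"))    # True
print(sounds_alike("Mohammed", "Muhammad"))  # True
print(sounds_alike("Charles", "Charlie"))    # False
```

No list of spellings is stored anywhere: a new variant such as “Mohamed” reduces to the same key without anyone having seen it before.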
Arabic name transliterations vary widely because different dialects pronounce certain characters differently. This can result in dramatic differences in the transliteration, for example Khadaffi vs. Gadaffi. Because a linguistic name matching system can map a transliterated name back to the original Arabic script, it is still able to accurately match these names to each other.
Similarly, a Mexican national’s paternal surname may be mistakenly entered in a middle name field because the typist is unaware that naming customs place it before the maternal surname in the surname field. Without the cultural awareness of a linguistic system, cases like this can easily generate false negatives and false positives.
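A small sketch shows why the field mix-up matters. The `NameRecord` type, the sample name, and both comparison functions are hypothetical. The field-blind comparison is only one simple way to tolerate misplaced name parts, and on its own it would be too permissive for real screening.

```python
from dataclasses import dataclass

@dataclass
class NameRecord:
    given: str = ""
    middle: str = ""
    surname: str = ""

    def tokens(self) -> frozenset:
        """All name words, ignoring which field they were typed into."""
        words = f"{self.given} {self.middle} {self.surname}".lower().split()
        return frozenset(words)

def fieldwise_equal(a: NameRecord, b: NameRecord) -> bool:
    """Strict comparison: every field must hold the same text."""
    return (a.given.lower(), a.middle.lower(), a.surname.lower()) == (
        b.given.lower(), b.middle.lower(), b.surname.lower())

def tokens_equal(a: NameRecord, b: NameRecord) -> bool:
    """Field-blind comparison: the same words, wherever they were entered."""
    return a.tokens() == b.tokens()

# Paternal surname "García" typed into the middle name field by mistake.
mistyped = NameRecord(given="Juan", middle="García", surname="López")
correct = NameRecord(given="Juan", surname="García López")

print(fieldwise_equal(mistyped, correct))  # False
print(tokens_equal(mistyped, correct))     # True
```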
Fortunately, these types of name variations do not occur at random. They follow established patterns of human practice specific to each language and cultural context, knowledge used in linguistic-based models.
Advantages of the Linguistic Model
The benefits of a knowledge-based approach are clear. Because a computer can apply the knowledge humans use to derive name variations, every system can be a “global name expert.” No actual humans are needed to create spelling variations. Nor does the computer have to make potentially billions of naïve comparisons. Results are more accurate and the systems that produce these results are much more scalable.
The linguistic approach therefore has several key advantages:
- Fewer false positives: Computers employ the same knowledge humans would
- Fewer false negatives: Humans do not have to know all possible variations
- Faster checking: Not all variations are compared explicitly
- Greater scalability: Fewer (or smaller) machines can check more names
The linguistic, knowledge-based approach is also stronger in specific cases.
|Matching Issue||Naïve Approach||Knowledge-Based Approach|
|New, previously unseen name||Weak support. Depends on knowing names in advance to generate variants||Supported by applying linguistic and cultural knowledge of names|
|Cross-lingual name matching: the same name written in different scripts (鈴木一郎 vs. Ichiro Suzuki)||Not supported||Supported. Cross-lingual name matching in many languages and scripts6|
|(Mary Ellen vs. MaryEllen)||Not supported||Supported by applying linguistic and cultural knowledge of names|
|Arabic name variations (Abdul Rasheed vs. Abd Al-Rashid)||Supported by making potentially billions of naïve comparisons. A three-component Arabic name could generate thousands of variants.||Supported by applying linguistic and cultural knowledge of names|
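The arithmetic behind “thousands of variants” and “billions of comparisons” is simple multiplication. The per-component spelling counts below are assumptions picked for illustration, not measured figures. The list size borrows the “over 1 million entries” cited earlier for the TSDB, and the sketch assumes a worst case in which every entry is expanded into that many variants and scanned one by one.

```python
from math import prod

# Hypothetical number of plausible spellings for each of three name components.
spellings_per_component = [12, 15, 10]

# Every combination of component spellings is a distinct full-name variant.
total_variants = prod(spellings_per_component)

# Assumed list size, based on the "over 1 million entries" figure for the TSDB.
watch_list_entries = 1_000_000

# Worst case: one query is compared against every variant of every entry.
comparisons_per_query = total_variants * watch_list_entries

print(total_variants)         # 1800
print(comparisons_per_query)  # 1800000000
```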
Linguistic Name Matching in the Cloud
Rosette’s name matching is the premier knowledge-based solution on the market, actively used within government, financial compliance, identity verification applications used by the sharing economy, and eDiscovery solutions.
Rosette performs an intelligent comparison based on linguistic, orthographic, and phonologic algorithms. It handles spelling variations and errors, non-standard transliteration, and the cultural vagaries of how names are written in each language. In addition, Rosette understands the structures of names in each language, so instead of generating countless variations to look up, it does an intelligent comparison of names within a language or across languages and scripts.
By working on the name in its original script, rather than transliterating the name into English, Rosette takes advantage of all the available contextual information to properly match names to a target list and avoids the errors that transliteration inevitably introduces.
Rosette in Action
Rosette is deployed in numerous installations across the U.S. government, including the search engine for the National Harmony database, a U.S. government-run database that provides bibliographic references to foreign technical and military documents and their translations. Rosette is also integrated into U.S. Customs and Border Protection’s TASPO system to strengthen U.S. borders.
Rosette’s name matching functions are accessible as an API (a cloud API or on-premise deployment) or SDK and can be easily integrated into applications, search engines, or predictive analytics through plugins to Elasticsearch, Apache™ Solr, and RapidMiner.
With its increasing list of users, Basis Technology is constantly improving Rosette to increase its capabilities and performance, in order to meet current and future demands of our customers. Rosette name matching currently supports 15 languages and their native scripts with more under development: Arabic, Chinese (simplified and traditional), English, French, German, Italian, Japanese, Korean, Pashto, Persian, Portuguese, Russian, Spanish, and Urdu.
For a free evaluation of how Rosette can solve your name matching challenge, contact Basis Technology at firstname.lastname@example.org, or read more about Rosette’s name matching.
1 As measured for the 12-month period ending March 2009. Pincus, Walter, “1,600 are suggested daily for FBI’s list,” Washington Post, Nov. 1, 2009.
4 House Homeland Security Committee Report, “The Road to Boston: Counterterrorism Challenges and Lessons from the Marathon Bombings,” March 2014.
5 Robb, Robert, “Christmas Day bomber disturbing revelations,” The Arizona Republic, January 14, 2010.
6 See the full list of languages that Rosette name matching supports.