17 Aug 2020

Rosette 1.17.0 Release: Hebrew Name Translation, French Semantic Similarity, Robust Address Matching


Recent Rosette® Cloud and Enterprise releases (1.17.0, 1.16.1) expand the language coverage of name translation and semantic similarity, and make the address matching capability within Rosette Name Indexer easier to use. We have also improved Arabic-Arabic and Arabic-English name matching, as well as morphological analysis in several languages.

Hebrew name translation

Name translation now supports translating names from Hebrew to English (Latin script), but not yet from Latin script to Hebrew. Rosette includes a panoply of “overrides,” so that well-known foreign names written in Hebrew (such as “ג’ורג’ וושינגטון”) are properly translated (to “George Washington”) and not merely phonetically transliterated (to “G’vrg’ vvshngtvn” or similar).
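As a rough illustration of the override idea (not Rosette's implementation), the sketch below checks a lookup table first and only falls back to letter-by-letter transliteration when no override exists. The names `OVERRIDES`, `NAIVE_MAP`, `naive_transliterate`, and `translate_name` are hypothetical, and the letter map is deliberately crude.

```python
# Hypothetical example name, written as in the text above.
WASHINGTON = "ג’ורג’ וושינגטון"

# Hypothetical override table: well-known names mapped to their established English forms.
OVERRIDES = {WASHINGTON: "George Washington"}

# Deliberately naive letter map (illustrative only; this is not the folk scheme).
NAIVE_MAP = {
    "ג": "g", "ו": "v", "ר": "r", "ש": "sh",
    "י": "", "נ": "n", "ט": "t", "ן": "n",
}

def naive_transliterate(name):
    """Map each Hebrew letter independently; unknown characters pass through unchanged."""
    return "".join(NAIVE_MAP.get(ch, ch) for ch in name)

def translate_name(name):
    """Prefer an override when one exists, otherwise fall back to transliteration."""
    return OVERRIDES.get(name, naive_transliterate(name))

print(translate_name(WASHINGTON))       # uses the override
print(naive_transliterate(WASHINGTON))  # shows what the fallback alone would produce
```

The fallback output is close to unreadable for a borrowed name, which is the case the overrides are meant to cover.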

Name translation defaults to a “folk transliteration” scheme, but also supports the Hebrew transliteration standard ISO 259-2:1994 and the default Hebrew transliterator implemented by ICU, which is based on the UNGEGN (United Nations Group of Experts on Geographical Names) standard. The folk transliteration scheme, created by Basis Technology, most closely resembles how people actually write Hebrew names with Latin characters. It is intended to be more useful than the other, more academic standards, which use diacritics and are less readable. See more about the Basis Technology transliteration scheme in the blog post “Building a More Useful Hebrew Transliteration Scheme.”

Transliteration of רוזלינד פרנקלין under each scheme:

  • Folk (Basis Tech): Ruzlind Prenklin
  • ISO 259-2:1994: Rẇzliynd Prnqliyn
  • ICU: Rẇzĕliynĕd Pĕrĕnĕqĕliyn

French semantic similarity

Semantic similarity now supports French. It can transform French words into numerical representations (vectors), which can be used to compare the meaning of French words with each other, or with words in eight supported languages.

For example, here are similar terms generated for the English word “spy” in English, French, and German:

English

"term": "spy", "similarity": 1.0
"term": "spies", "similarity": 0.66227961
"term": "spying", "similarity": 0.65423775
"term": "spymaster", "similarity": 0.60325158
"term": "cia", "similarity": 0.57148194

French

"term": "espion", "similarity": 0.54824299
"term": "espionne", "similarity": 0.49286559
"term": "espionnes", "similarity": 0.41175416
"term": "secrets", "similarity": 0.39606363
"term": "escroc", "similarity": 0.36654109

German

"term": "Deckname", "similarity": 0.51391315
"term": "GRU", "similarity": 0.50809389
"term": "Spion", "similarity": 0.50051737
"term": "KGB", "similarity": 0.49981388
"term": "Informant", "similarity": 0.48774603

It is also possible to return a series of similar terms from any supported language based on a French word, phrase, or sentence, or to compare the content of French documents.
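To make the vector comparison concrete, here is a minimal sketch that ranks terms by cosine similarity. It assumes cosine similarity as the measure, which this post does not state, and the three-dimensional vectors in `TOY_VECTORS` are invented for illustration; real embeddings have hundreds of dimensions and come from the service.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented toy vectors; not values produced by Rosette.
TOY_VECTORS = {
    "spy": [1.0, 0.0, 0.0],
    "espion": [0.8, 0.6, 0.0],
    "escroc": [0.6, 0.0, 0.8],
    "pomme": [0.0, 0.0, 1.0],
}

def similar_terms(query, vectors, top_n=3):
    """Return the top_n other terms, ordered from most to least similar to the query."""
    q = vectors[query]
    scored = [(term, cosine_similarity(q, v)) for term, v in vectors.items() if term != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

for term, score in similar_terms("spy", TOY_VECTORS):
    print(f'"term": "{term}", "similarity": {score:.4f}')
```

Because all languages share one vector space in this sketch, the same ranking function works whether the candidates are French, German, or English terms.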

Unfielded fuzzy address matching

Address matching within Rosette Name Indexer (SDK) now supports unfielded addresses (whole addresses in one field) and misfielded address components (components in the wrong field).

Example of unfielded address matching:

{
  "address1": "The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom",
  "address2": {
    "number": "100-106",
    "road": "Leonard St",
    "city": "Shoreditch",
    "postcode": "EC2A 4RH"
  }
}

Example of misfielded address matching:

{
  "address1": {
    "houseNumber": "160",
    "road": "Pennsylvania Ave N.W.",
    "city": "Washington",
    "state": "D.C.",
    "postcode": "20500"
  },
  "address2": {
    "houseNumber": "160",
    "road": "Pennsylvania Ave N.W.",
    "city": "D.C.",
    "state": "Washington",
    "postcode": "20500"
  }
}
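One simple way to see why unfielded and misfielded input can still match is to ignore field boundaries and compare the tokens themselves. The sketch below does that with a Jaccard overlap; it is a toy stand-in under that assumption, not the algorithm Rosette Name Indexer uses, and the helpers `tokens` and `token_overlap` are hypothetical.

```python
import re

def tokens(address):
    """Flatten an address (a dict of fields or a single string) into a set of lowercase tokens."""
    text = " ".join(address.values()) if isinstance(address, dict) else address
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap(a, b):
    """Jaccard overlap of the two token sets, ignoring which field each token came from."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

unfielded = "The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom"
fielded = {"number": "100-106", "road": "Leonard St", "city": "Shoreditch", "postcode": "EC2A 4RH"}

correct = {"houseNumber": "160", "road": "Pennsylvania Ave N.W.", "city": "Washington",
           "state": "D.C.", "postcode": "20500"}
misfielded = {"houseNumber": "160", "road": "Pennsylvania Ave N.W.", "city": "D.C.",
              "state": "Washington", "postcode": "20500"}

print(token_overlap(unfielded, fielded))   # partial overlap: the string has extra tokens
print(token_overlap(correct, misfielded))  # swapped fields contain the same tokens
```

A production matcher would also weigh which tokens matter most (a postcode says more than "St"), which this sketch does not attempt.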

We have also increased the number of overrides (lists that explicitly map nicknames, cognates, and variants) to improve matching accuracy. The example below shows how adding an override that equates “UK” and “England” changed the match score for two otherwise identical addresses.

{
  "address1": {
    "house": "Ffrwdgrech Industrial Estate",
    "road": "Ffrwdgrech Rd",
    "city": "Brecon",
    "country": "UK",
    "postcode": "LD3 8LA"
  },
  "address2": {
    "house": "Ffrwdgrech Industrial Estate",
    "road": "Ffrwdgrech Rd",
    "city": "Brecon",
    "country": "England",
    "postcode": "LD3 8LA"
  }
}

Previous score without override: 0.86

New score with “UK=England” override: 0.95
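The effect of such an override can be sketched as a canonicalization step applied before fields are compared. Everything here is hypothetical: the `COUNTRY_OVERRIDES` table, the helper names, and the scoring, which simply counts exactly matching fields. The toy scores it produces are not comparable to the Rosette scores quoted above.

```python
# Hypothetical override table: variants that should be treated as the same value.
COUNTRY_OVERRIDES = {"uk": "united kingdom", "england": "united kingdom"}

def canonical(value, overrides):
    """Lowercase and trim a field value, then replace it with its override if one exists."""
    key = value.strip().lower()
    return overrides.get(key, key)

def field_match_score(a, b, overrides=None):
    """Fraction of fields whose canonical values are equal in both addresses (toy metric)."""
    overrides = overrides or {}
    fields = set(a) | set(b)
    hits = sum(
        1 for f in fields
        if f in a and f in b and canonical(a[f], overrides) == canonical(b[f], overrides)
    )
    return hits / len(fields) if fields else 0.0

address1 = {"house": "Ffrwdgrech Industrial Estate", "road": "Ffrwdgrech Rd",
            "city": "Brecon", "country": "UK", "postcode": "LD3 8LA"}
address2 = dict(address1, country="England")

print(field_match_score(address1, address2))                     # country differs
print(field_match_score(address1, address2, COUNTRY_OVERRIDES))  # country now agrees
```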

Arabic name matching

Arabic-Arabic and Arabic-English name matching have improved through the addition of the following features:

  • A name gender mismatch penalty for Arabic names
  • Support for initials and initialisms in Arabic names
  • Stop words for PERSON and ORGANIZATION names
  • Improved name token alignment
  • A new Arabic-English statistical model
  • Weighting of Arabic name tokens by rarity, so that matching uncommon name components contribute more to the match score than matching common ones
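The rarity weighting in the last item can be illustrated with an inverse-frequency weight, similar in spirit to IDF. This is a sketch under that assumption, not the statistical model Rosette ships; the frequency counts are invented, and romanized tokens are used only to keep the example readable.

```python
import math

# Invented counts of how often each name token appears in a reference list of 1000 names.
TOKEN_COUNTS = {"mohammed": 900, "ali": 500, "zuhayr": 5}
TOTAL_NAMES = 1000

def rarity_weight(token, counts=TOKEN_COUNTS, total=TOTAL_NAMES):
    """Smoothed inverse-frequency weight: rare tokens get larger weights."""
    return math.log((total + 1) / (counts.get(token, 0) + 1))

def weighted_match_score(name_a, name_b):
    """Weight of the shared tokens divided by the weight of all tokens in either name."""
    ta, tb = set(name_a.split()), set(name_b.split())
    total_weight = sum(rarity_weight(t) for t in ta | tb)
    shared_weight = sum(rarity_weight(t) for t in ta & tb)
    return shared_weight / total_weight if total_weight else 0.0

rare_shared = weighted_match_score("mohammed zuhayr", "ali zuhayr")
common_shared = weighted_match_score("mohammed zuhayr", "mohammed ali")
print(rare_shared, common_shared)  # sharing the rare token scores much higher
```

Both pairs share exactly one token, yet the pair that shares the uncommon token scores far higher, which is the behavior the bullet describes.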

Detection of tables and lists in Rosette Base Linguistics

Rosette Base Linguistics now detects zones of structured text (such as tables and lists) within a body of text, and records the offsets of each structured region in the ADM (annotated data model) JSON. This demarcation lets users process tables and lists differently from sentences in downstream analyses.
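A downstream consumer might use those offsets roughly as follows. The sketch assumes the regions have already been pulled out of the ADM as non-overlapping `(start, end)` character offsets; that tuple format and the function `split_by_regions` are hypothetical, since the ADM attribute names are not given here.

```python
def split_by_regions(text, regions):
    """Split text into prose segments and structured segments using (start, end) offsets.

    Assumes the regions do not overlap; end offsets are exclusive.
    """
    prose, structured = [], []
    cursor = 0
    for start, end in sorted(regions):
        if start > cursor:
            prose.append(text[cursor:start])  # text before this structured region
        structured.append(text[start:end])
        cursor = max(cursor, end)
    if cursor < len(text):
        prose.append(text[cursor:])           # trailing prose after the last region
    return prose, structured

text = "Intro.\n- a\n- b\nOutro."
# Hypothetical offsets covering the two-item list in the middle of the text.
regions = [(text.index("- a"), text.index("Outro."))]

prose, structured = split_by_regions(text, regions)
print(prose)
print(structured)
```

The prose segments could then go to sentence-level analysis while the structured segments are handled by table- or list-specific logic.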

Morphological analysis

In release 1.17, we have increased accuracy for: Greek POS tags and lemmatization; Russian verb lemmatization; German noun lemmatization; and Hebrew POS tags and tokenization.

Check out the release notes for all the details and bug fixes in this release. We look forward to your feedback!