04 Sep 2019

Rosette 1.14 Release: Entity linking to Thomson Reuters PermID, Multi-model language identification


The August release of Rosette 1.14 brings new features to entity extraction and linking, as well as language identification.

Roadmap for linking entities to multiple knowledge bases

In addition to linking to entities in Wikidata and DBpedia, Rosette entity extraction will ultimately link to multiple knowledge bases, including Thomson Reuters' open PermID. PermID covers a wide variety of entity types in the business domain, including organizations, instruments, funds, issuers, and people.
The integration of PermID into Rosette entity linking will proceed in three phases:

  • Phase 1 is a beta release available now in Rosette Cloud. Entity linking will still primarily be through Wikidata (QIDs) and DBpedia, but for entities that appear in both knowledge bases, PermID data will be linked to the Wikidata entity and thus supplement it. PermID entities not in Wikidata are not linkable in this first phase. Try the new option by adding {"options": {"includePermID": true}} to your call. Let us know how it works.
  • Phase 2 will enlarge the pool of entities Rosette can link to and is slated for December. Rosette will first try to link to the PermID knowledge base. If an entity is not found, Rosette will try linking to Wikidata.
  • Phase 3 will allow for a fully aggregated knowledge base. Rosette will simultaneously link to Wikidata and PermID, and take full advantage of the QID ↔ PermID mapping and contexts from both knowledge bases.
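
As a rough illustration of the Phase 1 option, the sketch below builds a JSON request body that carries {"options": {"includePermID": true}}. Only the option itself comes from this post. The endpoint URL, the "content" field name, and the helper function are assumptions made for the example, so check the Rosette API documentation before relying on them. Nothing is sent over the network here.

```python
import json

# Assumption: the entity extraction endpoint. Verify against the Rosette docs.
ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"


def build_entities_request(content, include_permid=True):
    """Return a JSON-serializable body for an entity extraction call.

    Hypothetical helper: the "content" field name is an assumption;
    the "includePermID" option is the one named in this post.
    """
    return {
        "content": content,
        "options": {"includePermID": include_permid},
    }


body = build_entities_request("Thomson Reuters opened a new office in Toronto.")
# json.dumps writes the Python value True as the JSON literal true.
print(json.dumps(body, indent=2))
```

The serialized body could then be posted to the endpoint with whichever HTTP client and API key header your account setup calls for.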

Multi-model language identification

With this release we added support for 23 more languages to short string language identification, for a total of 50 languages. With the exception of four Arabic script languages in transliterated form (transliterated Arabic, Pashto, Persian, and Urdu), short string and regular language identification support the same languages.

The accuracy of Rosette’s short string algorithm relies on more than the conventional approach, which builds a statistical profile for each language by counting the n-grams that occur in it.

In the conventional approach, n-grams (trigrams, for example) are used to build a language’s statistical profile from the frequency of each trigram (seen as bytes by the computer) in that language.
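
To make the conventional approach concrete, here is a minimal sketch of a trigram frequency profile. It is a generic illustration, not Rosette's implementation: the function name is made up for the example, and it counts Unicode characters rather than bytes to keep the output readable.

```python
from collections import Counter


def trigram_profile(text):
    """Return the relative frequency of each overlapping character trigram."""
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    counts = Counter(trigrams)
    total = sum(counts.values())
    # An input shorter than three characters yields an empty profile.
    return {gram: n / total for gram, n in counts.items()}


profile = trigram_profile("the theme")
# "the" occurs twice among the seven trigrams in this string.
print(profile["the"])
```

A classifier built on this idea would compare the profile of an unknown string with profiles computed from large samples of each language. Short strings contain few trigrams, which is why this approach alone struggles with them.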

Short string language detection is boosted in Rosette using “script awareness” and “token-level awareness” algorithms.

Script awareness exploits the fact that certain languages have their own alphabet. If the text is in a language like Greek, then we can confidently say “It’s Greek” based on the limited codepoint range for that script.
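
A codepoint check of this kind can be sketched in a few lines. The ranges below are the Unicode "Greek and Coptic" and "Greek Extended" blocks. The function name and the all-letters rule are assumptions for illustration, not a description of how Rosette decides.

```python
# Unicode blocks: Greek and Coptic (which also holds a few Coptic letters)
# and Greek Extended.
GREEK_RANGES = [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]


def is_greek_script(text):
    """Return True if every letter in the text falls in a Greek block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    return all(
        any(low <= ord(ch) <= high for low, high in GREEK_RANGES)
        for ch in letters
    )


print(is_greek_script("Καλημέρα κόσμε"))
print(is_greek_script("Good morning"))
```

The shortcut only works for scripts that are used by essentially one language. Scripts shared by many languages, such as Latin or Cyrillic, need further evidence.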

Word-level (or token-level) awareness exploits the fact that certain combinations of words (tokens) are associated with a given language. It may recognize prefixes and suffixes that are unique to a language.

For example, the Cyrillic model uses single tokens, characters, and token prefixes and suffixes to help identify Russian. The model will know that “хочет” is a Russian word, that “ч” is a Russian letter, and that “ет” appears at the ends of Russian words.

With this expanded support, language identification on multilingual documents that switch language every line or so becomes even more accurate.
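
The token-level evidence described for the Cyrillic model (whole tokens, characters, prefixes, and suffixes) can be pictured as a set of features extracted from each token. The sketch below is a toy illustration only: the feature naming scheme and the two-character affix length are assumptions, and the actual Rosette model is not documented in this post.

```python
def token_features(token, affix_len=2):
    """Return token, character, prefix, and suffix features for one token."""
    features = {f"token={token}"}
    features.update(f"char={ch}" for ch in token)
    # Only emit affixes when the token is longer than the affix itself.
    if len(token) > affix_len:
        features.add(f"prefix={token[:affix_len]}")
        features.add(f"suffix={token[-affix_len:]}")
    return features


features = token_features("хочет")
print(sorted(features))
```

For “хочет” the feature set includes the whole token, the character “ч”, and the suffix “ет”, the three cues mentioned in the text. A trained model would attach per-language weights to such features.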

Consider this multilingual excerpt from Leo Tolstoy’s War and Peace:

И хочет, чтоб я не боялась.
Все-таки я не понял, de quoi vous avez peur.
Non, André, je dis que vous avez tellement, tellement changé…
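
The second line of the excerpt switches from Russian to French partway through. As a simplified sketch of how such a switch might be located, the code below groups whitespace-separated tokens into runs by script, using Unicode character names. It is not Rosette's algorithm, the function names are hypothetical, and script alone does not determine language (Cyrillic is shared by several languages, as is Latin).

```python
import unicodedata


def script_of(token):
    """Classify a token as CYRILLIC, LATIN, or OTHER by its first letter."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                return "CYRILLIC"
            if name.startswith("LATIN"):
                return "LATIN"
            return "OTHER"
    return "OTHER"


def script_runs(line):
    """Group consecutive tokens that share a script into (script, text) runs."""
    runs = []
    for token in line.split():
        script = script_of(token)
        if runs and runs[-1][0] == script:
            runs[-1][1].append(token)
        else:
            runs.append((script, [token]))
    return [(script, " ".join(tokens)) for script, tokens in runs]


for run in script_runs("Все-таки я не понял, de quoi vous avez peur."):
    print(run)
```

Each run could then be passed to a short string language identifier on its own, so that the Russian and French portions are identified separately.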

Read the full release notes or give Rosette a try by signing up for a free 30-day trial of Rosette Cloud at http://developer.rosette.com/signup.