Just the important entities, please

08 Nov 2017

Salience scores and linking confidence scores for extracted entities come to Rosette API

Data scraped from the web is often very noisy and cumbersome to work with. Sorting through it to find the most valuable information is a vital step in converting raw data into actionable insights.

The release of Rosette API 1.8 aims to help users with this data cleansing challenge by delivering entity salience and linking confidence scores for all results, so that users can filter their results down to only the most relevant and accurate pieces of data.

VIPs only, please

Entity salience scores indicate how relevant each extracted entity is to the main subject of a body of text. Scores are returned as 1 for salient and 0 for non-salient.

For example, after trading backup quarterback Jimmy Garoppolo to the San Francisco 49ers, the Patriots signed Brian Hoyer to a 3-year contract to back up starter Tom Brady. Submitting an NFL article about the acquisition to the /entities endpoint identifies the “PERSON” “Brian Hoyer” and the “ORGANIZATION” “Patriots” as salient, whereas other teams that Hoyer has played for over the years, like the Bears, Browns, and Texans, are not salient:

"entities": [
    {
        "type": "PERSON",
        "mention": "Hoyer",
        "normalized": "Brian Hoyer",
        "count": 9,
        "entityId": "Q912370",
        "confidence": 0.99900001,
        "salience": 1
    },
    {
        "type": "ORGANIZATION",
        "mention": "Patriots",
        "normalized": "Patriots",
        "count": 6,
        "entityId": "T2",
        "confidence": 0.99900001,
        "salience": 1
    },
    {
        "type": "ORGANIZATION",
        "mention": "Cardinals",
        "normalized": "Cardinals",
        "count": 1,
        "entityId": "T30",
        "confidence": 0.67881167,
        "salience": 0
    },
    {
        "type": "ORGANIZATION",
        "mention": "Browns",
        "normalized": "Browns",
        "count": 1,
        "entityId": "T31",
        "confidence": 0.99900001,
        "salience": 0
    },
    {
        "type": "ORGANIZATION",
        "mention": "Texans",
        "normalized": "Texans",
        "count": 1,
        "entityId": "T32",
        "confidence": 0.79899311,
        "salience": 0
    },
    {
        "type": "ORGANIZATION",
        "mention": "Bears",
        "normalized": "Bears",
        "count": 1,
        "entityId": "T33",
        "confidence": 0.99900001,
        "salience": 0
    },

Entity salience is particularly useful when analyzing data acquired through simple text scraping, which often picks up website headers, advertisements, and other irrelevant data. Chances are, if you’re analyzing an article from nytimes.com, you don’t particularly care about the “ORGANIZATION” entity “New York Times,” but you are quite interested in “Trump,” “Jerome Powell,” and “Janet Yellen.”
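As a rough illustration of that filtering step, here is a minimal Python sketch that keeps only the entities marked salient. The sample data is trimmed from the response shown earlier; the helper name salient_entities is hypothetical and not part of Rosette API.

```python
# Sample trimmed from the response shown earlier (only the fields used here).
response = {
    "entities": [
        {"type": "PERSON", "normalized": "Brian Hoyer", "count": 9, "salience": 1},
        {"type": "ORGANIZATION", "normalized": "Patriots", "count": 6, "salience": 1},
        {"type": "ORGANIZATION", "normalized": "Cardinals", "count": 1, "salience": 0},
        {"type": "ORGANIZATION", "normalized": "Browns", "count": 1, "salience": 0},
    ]
}


def salient_entities(response):
    """Hypothetical helper: return only entities flagged salient (salience == 1)."""
    return [e for e in response.get("entities", []) if e.get("salience") == 1]


salient_names = [e["normalized"] for e in salient_entities(response)]
print(salient_names)
```

Entities without a salience field (for example, when the score was not requested) are treated as non-salient here; whether that is the right default depends on your pipeline.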

Fewer, better connections

The /entities endpoint extracts the people, organizations, and locations in your text and, if they are publicly known, links them to the appropriate Wikidata QID. Entity linking allows you to distinguish between similarly named people such as George Bush Sr. and George W. Bush.

/entities now also returns a confidence score for each link, telling you how likely it is that the link is correct. Confidence scores range from 0.0 to 1.0, with 0 representing low confidence and 1 representing high confidence that the match is correct.

For example, in the above article, “Hoyer” is returned with “entityId”: “Q912370” and “confidence”: 0.99900001. This tells you that Rosette is 99.9% sure that the entity “Hoyer” is the same person as the Wikidata entry Brian Hoyer (Q912370), American football quarterback.

Because the article uses his full name and discusses the NFL and the Patriots (organizations Hoyer is associated with), it is not surprising that he was linked to the correct QID. However, with a shorter text like a tweet, there’s a higher risk of error. The name “Hoyer” could be incorrectly linked to the Australian rules footballer Craig Hoyer (Q5181062), albeit with a low confidence score.

Depending upon your use case, you could either automatically eliminate all links under a certain threshold, or flag them for human review, removing false positives and improving data quality.
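One way to sketch that triage in Python is shown below. The function name triage_links and both cutoff values (0.7 and 0.9) are illustrative assumptions, not Rosette recommendations; the confidence values are the ones from the response shown earlier.

```python
def triage_links(entities, drop_below=0.7, review_below=0.9):
    """Hypothetical helper: split entities into accepted / needs review / dropped.

    Both thresholds are illustrative; tune them for your own data.
    """
    accepted, needs_review, dropped = [], [], []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None or drop_below <= confidence < review_below:
            # No score (not requested) or a mid-range score: ask a human.
            needs_review.append(entity)
        elif confidence < drop_below:
            dropped.append(entity)
        else:
            accepted.append(entity)
    return accepted, needs_review, dropped


entities = [
    {"mention": "Hoyer", "entityId": "Q912370", "confidence": 0.99900001},
    {"mention": "Cardinals", "entityId": "T30", "confidence": 0.67881167},
    {"mention": "Texans", "entityId": "T32", "confidence": 0.79899311},
]

accepted, needs_review, dropped = triage_links(entities)
print([e["mention"] for e in accepted])      # high-confidence links
print([e["mention"] for e in needs_review])  # flagged for human review
print([e["mention"] for e in dropped])       # below the drop threshold
```

Using two thresholds rather than one lets you combine both strategies: discard the weakest links automatically and send only the borderline ones to a reviewer.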

NOTE: By default, salience and confidence scores are not returned. Turn them on by adding an option to the request: “calculateSalience=true” or “calculateConfidence=true” respectively.
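As a sketch of what enabling both options might look like, the Python snippet below builds a JSON request body. The option names come from the note above; the surrounding body shape (a “content” field plus an “options” object) and the helper name build_entities_request are assumptions for illustration, so check the Rosette API documentation for the exact request format.

```python
import json


def build_entities_request(text, salience=True, confidence=True):
    """Hypothetical helper: build a request body for the /entities endpoint.

    Assumption: options are passed in an "options" object alongside "content".
    """
    options = {}
    if salience:
        options["calculateSalience"] = True
    if confidence:
        options["calculateConfidence"] = True
    return {"content": text, "options": options}


body = build_entities_request(
    "The Patriots signed Brian Hoyer to back up Tom Brady."
)
print(json.dumps(body, indent=2))
```

The snippet only constructs the payload; sending it (endpoint URL, authentication header) is left out because those details are not covered in this post.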

Clean data is better data

Together, entity salience and linking confidence scores tell users more about the entities in their data: how central each one is to the meaning of the text, and how reliably a name has been linked to its real-world identity. With enough context, that makes it possible to tell apart similarly named people (“Hoyer” as in “Craig” or “Brian”?), locations, or organizations.

Put salience and linking scores to use today by signing up for a free API account (no credit card required).