07 Jun 2018

Provide Live Feedback to Your Entity Linking Knowledge Bases


Rosette Entity Linking adds a real-time, human-in-the-loop feature to entity linking databases

While entity extraction provides the foundation of data mining and information extraction systems, extracted entities have only limited value out of context. Understanding not just which entity strings appear in your data but also the real-world entities they refer to is vital for true knowledge management.

Rosette Enterprise has added support for making live updates to entity extraction and linking knowledge bases. Users can now add new entities and new rules to our default Wikipedia library or to their own internal knowledge bases, all while the system is still running. Having a human in the loop maximizes the quality of users’ results, combining the speed and recall of a machine with the knowledge and judgment of a human researcher.

To understand the value of real-time knowledge base updates, let’s start with the basics.

String matching

One method of extracting entities relies on identifying string matches between search queries and databases. For example, take the following sentence:

“Liam Payne, Niall Horan, Louis Tomlinson, Zayn Malik and Harry Styles of the London musical group One Direction arrive at the 42nd annual American Music Awards at Nokia Theatre L.A.”

Entity extraction identifies the following strings and uses statistical modeling to link each one to its corresponding entity in the database:

  • Liam Payne – person: English singer and songwriter Q2756349
  • Niall Horan – person: Irish singer Q775231
  • Louis Tomlinson – person: English pop singer Q10745343
  • Zayn Malik – person: British singer Q3626950
  • Harry Styles – person: One Direction band member, singer, songwriter and actor Q3626966
  • London – location: capital of England and the United Kingdom Q84
  • One Direction – organization: English-Irish boy band Q146027
  • American Music Awards – organization: annual American music awards show Q207601
  • Nokia Theatre L.A. – location: music and theatre venue in downtown Los Angeles, California Q7048038
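
The lookup half of this step can be sketched in a few lines of Python. This is a minimal illustration, not Rosette's implementation: the `GAZETTEER` dictionary and the `exact_match` function are hypothetical names, the gazetteer holds only a handful of the entities listed above, and the statistical linking model is left out entirely.

```python
import re

# Hypothetical miniature gazetteer: surface string -> (type, ID from the list above)
GAZETTEER = {
    "Liam Payne": ("person", "Q2756349"),
    "Zayn Malik": ("person", "Q3626950"),
    "London": ("location", "Q84"),
    "One Direction": ("organization", "Q146027"),
    "American Music Awards": ("organization", "Q207601"),
}

def exact_match(text, gazetteer=GAZETTEER):
    """Return (surface, type, id) for each gazetteer string found in text, in order of appearance."""
    # Try longer names first so a long name wins over a shorter name nested inside it
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.group(0), *gazetteer[m.group(0)]) for m in pattern.finditer(text)]

sentence = ("Liam Payne, Niall Horan, Louis Tomlinson, Zayn Malik and Harry Styles "
            "of the London musical group One Direction arrive at the 42nd annual "
            "American Music Awards at Nokia Theatre L.A.")
matches = exact_match(sentence)
for surface, entity_type, entity_id in matches:
    print(surface, entity_type, entity_id)
```

Names missing from the gazetteer, such as "Niall Horan" in this toy version, are simply not found, which is one reason a plain lookup needs to be backed by a model and by a knowledge base that can grow.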

Limits of string matching

Each of the entities mentioned above is an exact string match to an entity in Wikipedia. However, while “Zayn Malik” is a fairly distinct string, “London” is ambiguous. Consider the following sentence:

“After working in the Klondike, London returned home and began publishing stories.”

A basic string linker would incorrectly connect “London” to the capital city of England rather than the American author, journalist, and social activist, Jack London (Q45765).

However, if the sentence contained London’s full name, string matching would link to the correct identifier.

Naturally, the city of London is mentioned in far more documents and articles than the late author. In most situations, basic string matching would be correct in linking all mentions of “London” to the geographical place…but not always.

The ability to utilize context clues and human-in-the-loop feedback to differentiate between ambiguous entities takes entity extraction and linking from good to great.
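
To make the idea of context clues concrete, here is a toy sketch of context-aware linking for the "London" example. Everything in it is an assumption made for illustration: the `CANDIDATES` table, the prior values, the keyword sets, and the scoring formula are invented, and a production linker would use a trained statistical model rather than keyword overlap.

```python
import re

# Hypothetical candidate table. "prior" stands in for how often each entity
# is the intended one overall; "context" is a hand-picked set of cue words.
CANDIDATES = {
    "London": [
        {"id": "Q84", "prior": 0.95,
         "context": {"england", "capital", "thames", "uk", "city"}},
        {"id": "Q45765", "prior": 0.05,
         "context": {"klondike", "stories", "author", "novel", "publishing"}},
    ]
}

def link(mention, sentence, candidates=CANDIDATES, context_weight=20.0):
    """Pick the candidate whose prior, boosted by context word overlap, is highest."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))

    def score(candidate):
        overlap = len(candidate["context"] & words)
        return candidate["prior"] * (1.0 + context_weight * overlap)

    return max(candidates[mention], key=score)["id"]

author = link("London", "After working in the Klondike, London returned home "
                        "and began publishing stories.")
city = link("London", "She flew to London on Monday.")
print(author, city)
```

With no cue words in the sentence, the more frequently mentioned city wins on its prior alone. The cue words "Klondike", "publishing", and "stories" are enough to tip the second reading toward the author.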

A balance of disambiguation and exact matching

If disambiguation-based linking is so much more accurate, why use gazetteers (lists of known entity strings) at all? As with any complex algorithm, disambiguation is much slower than searching for exact strings because more data must be processed and compared. It also requires more user interaction rather than being an entirely automated system.

When you know your data is ambiguous, the improved precision may be worth the time cost. However, when you know your data is unambiguous (like automobile models), exact string matching is preferable for its speed.
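
One way to get both behaviors in a single pipeline is to pay for disambiguation only when a string actually has more than one candidate. The sketch below assumes a hypothetical `make_linker` helper, a made-up knowledge base `KB` with illustrative IDs, and a stand-in `toy_disambiguate` function; it shows the routing logic only, not how any particular product implements it.

```python
def make_linker(candidates_by_name, disambiguate):
    """Build a linker that runs the slower disambiguation step only for ambiguous names."""
    def link(mention, context):
        candidates = candidates_by_name.get(mention, [])
        if not candidates:
            return None                           # unknown string: nothing to link
        if len(candidates) == 1:
            return candidates[0]                  # unambiguous: plain dictionary lookup
        return disambiguate(candidates, context)  # ambiguous: context-aware step
    return link

# Hypothetical IDs, for illustration only
KB = {
    "Corolla": ["model:corolla"],
    "London": ["place:london", "person:jack-london"],
}

slow_calls = []  # records every time the slow path is taken

def toy_disambiguate(candidates, context):
    """Stand-in for a real model: a single hard-coded context cue."""
    slow_calls.append(context)
    return candidates[1] if "Klondike" in context else candidates[0]

link = make_linker(KB, toy_disambiguate)
car = link("Corolla", "A used Corolla was parked outside.")
calls_after_car = len(slow_calls)
person = link("London", "After working in the Klondike, London returned home.")
print(car, person, len(slow_calls))
```

The automobile model never reaches the disambiguation function, so its cost stays that of a dictionary lookup, while "London" is routed through the slower path.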

The ideal entity extraction and linking solution is customizable to accommodate varying data types, levels of human supervision desired, and accuracy requirements.

Customize your entity linking in real-time

Imagine you are analyzing election and voter trends in Texas. The system identifies interesting documents, and you discover that several mention a city, Paris. Similar to the London example above, an untrained system will link most mentions of the string “Paris” to the capital of France. Knowing that you are working with data from Texas, however, you adjust the weighting model to default to Paris, Texas.

In another document, you discover an economist who has written a number of reports about correlations between income levels and voter turnout in Texas. Because your organization had not come across this person previously, they weren’t in your internal knowledge base. You quickly add this new entity to your knowledge base so that all future mentions of them will link back to their profile, giving you access to an extended web of useful data.
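
The two adjustments in this scenario, reweighting "Paris" and adding a previously unknown person, can be sketched as operations on an in-memory knowledge base that stays available while it is being edited. The `LiveKnowledgeBase` class, its method names, the weights, the IDs, and the name "Jane Example" are all hypothetical; this is not the Rosette API, only an illustration of updating without a restart.

```python
import threading

class LiveKnowledgeBase:
    """Toy knowledge base that can be updated while other threads keep linking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # surface string -> {entity_id: weight}

    def add_entity(self, surface, entity_id, weight=1.0):
        """Register a new entity (or a new sense of an existing string)."""
        with self._lock:
            self._entries.setdefault(surface, {})[entity_id] = weight

    def set_weight(self, surface, entity_id, weight):
        """Change how strongly a string defaults to an existing entity."""
        with self._lock:
            if entity_id not in self._entries.get(surface, {}):
                raise KeyError(f"{entity_id!r} is not a candidate for {surface!r}")
            self._entries[surface][entity_id] = weight

    def link(self, surface):
        """Return the highest-weighted entity for a string, or None if unknown."""
        with self._lock:
            candidates = self._entries.get(surface)
            if not candidates:
                return None
            return max(candidates, key=candidates.get)

kb = LiveKnowledgeBase()
kb.add_entity("Paris", "paris-france", weight=0.9)
kb.add_entity("Paris", "paris-texas", weight=0.1)
before = kb.link("Paris")

# Analyst feedback: this project is about Texas, so change the default
kb.set_weight("Paris", "paris-texas", 0.95)
after = kb.link("Paris")

# Analyst feedback: a newly discovered economist (hypothetical name and ID)
unknown = kb.link("Jane Example")
kb.add_entity("Jane Example", "internal:economist-001")
known = kb.link("Jane Example")
print(before, after, unknown, known)
```

The lock is what lets an analyst's edit and an in-flight lookup share the same structure safely; every lookup made after an edit sees the new weights or the new entity immediately.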

This kind of human-in-the-loop feedback allows the user to provide context and direction that the algorithm may not have access to, while also responding in real-time to new information and pivoting accordingly. The result is a more thorough analysis that combines the knowledge and intuition of a human with the speed and recall of automation.

Try it out

A system that supports both exact-string and disambiguation-based extraction and linking is a complex animal. Many of these systems require downtime to make updates and changes to how statistical models are weighted, wasting valuable analyst time.

Our newest update to Rosette Enterprise’s entity extraction and linking means you have the ability to customize your information extraction system to suit your needs and data, without wasted time retraining and rebooting systems after every change. Request a demo today to try it out.


Image via Wikimedia Commons, by vagueonthehow from Tadcaster, York, England