Entity Linking and Too Many (Tim) Cooks: Inside NLP and Search, Part II
Interested in search technology, or in AI more generally? Over the next four weeks, we’re going to take an in-depth (and interesting!) look at the technology that makes modern search tick. This week, we’re talking all about entity linking.
The linking phase involves two steps. The first is entity linking, which correctly ties each extracted entity to a knowledge base record of a real-world thing. The second is coreference resolution, which links together all the mentions of an entity and the pronouns referring to it within a document.
For the first phase to be successful, the meaning conveyed by “Cook” needs to be understood by the system in much greater detail. Right now, the system has identified Cook as a person. But there are a lot of cooks that are people.
In fact, there are even several “Tim Cooks.” There’s Tim Cook, CEO of Apple. Then there’s Tim Cook, Canadian historian. There are actually quite a few scholars with that name. So how does the system perform this level of disambiguation?
In the entity linking step, the context around the entity is compared to the attributes of the entity in the knowledge base, linking the information if there’s a match.
In our example, the system would compare key information like birthdates, locations, related entities, and even some qualitative descriptions of each candidate to figure out whether the document and the knowledge base record refer to the same “Tim Cook.” If the document mentions “Apple,” “iPhone,” or “Steve Jobs,” that’s a good indication that the mention in the document refers to Tim Cook, the CEO of Apple. In that case, the two will be linked.
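The comparison described above can be sketched as a simple overlap score. Everything below is invented for illustration: the record IDs, the attribute sets, and the scoring rule stand in for the much richer matching a production linker performs.

```python
# Minimal entity-linking sketch with made-up knowledge base records.
# Each candidate record carries context terms drawn from its attributes;
# the mention links to the candidate whose terms best overlap the document.

KB = {
    "tim_cook_apple_ceo":  {"apple", "iphone", "steve jobs", "cupertino", "ceo"},
    "tim_cook_historian":  {"historian", "canada", "world war", "author"},
}

def link_entity(doc_context: set) -> str:
    # Pick the candidate sharing the most context terms with the document.
    return max(KB, key=lambda cand: len(KB[cand] & doc_context))

doc = {"apple", "ceo", "iphone", "earnings"}
print(link_entity(doc))  # tim_cook_apple_ceo
```

A real system would weigh attributes (a birthdate match counts for more than a shared common word) and use learned similarity rather than raw term overlap, but the shape of the decision is the same: score every candidate against the document’s context and link the best match.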
The coreference resolution step will chain a mention of “Apple CEO Tim Cook” in one part of an article to later reference(s) to the same entity, which could show up as “Cook,” “he,” or “him.”
The foundation of the clustering process is semantic similarity, made possible by a technology known as text embeddings. Embeddings translate the relationships between words into a mathematical space, where distance in meaning is captured as distance between values. This method allows a machine system to understand language by transforming it into numbers.
Meaning becomes math.
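Here’s that idea in miniature. The three-dimensional vectors below are hand-made stand-ins (real embeddings have hundreds of learned dimensions), but the measurement is the standard one: cosine similarity, which scores two vectors by the angle between them.

```python
import math

# Toy text embeddings: each word mapped to a hand-made vector.
# Real systems learn these from large corpora.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Words close in meaning sit close in the vector space.
print(cosine(EMB["king"], EMB["queen"]) > cosine(EMB["king"], EMB["apple"]))  # True
```

The numbers are arbitrary, but the principle is the one the article describes: once words are vectors, “how similar in meaning?” becomes an arithmetic question.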
For example, in the two sentences below, a human would quickly realize that “they” refers to “the city council” in the first example, and “the protesters” in the second example.
Ex. 1) The city council refused the protesters a permit because they feared violence.
Ex. 2) The city council refused the protesters a permit because they might incite violence.
The coreference resolution algorithm, powered by deep learning, figures out the antecedent by using semantic similarity to predict the most likely pairing. In the first example: is it more likely that “the city council” feared violence, or that “the protesters” feared violence? In the second: is it more likely that “the city council” might incite violence, or that “the protesters” might?
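That pairing contest can be sketched as follows. The plausibility scores here are invented; in a real resolver they would come from a learned model comparing embeddings of each candidate antecedent with the predicate attached to the pronoun.

```python
# Hedged sketch of antecedent selection as a similarity contest.
# The scores are hand-picked stand-ins for what a trained model would output:
# how plausible each subject-plus-predicate pairing is.

PLAUSIBILITY = {
    ("the city council", "feared violence"):       0.8,
    ("the protesters",   "feared violence"):       0.3,
    ("the city council", "might incite violence"): 0.2,
    ("the protesters",   "might incite violence"): 0.9,
}

def resolve(pronoun_predicate: str, candidates: list) -> str:
    # "they" resolves to whichever candidate pairs most plausibly
    # with the predicate that follows the pronoun.
    return max(candidates, key=lambda c: PLAUSIBILITY[(c, pronoun_predicate)])

cands = ["the city council", "the protesters"]
print(resolve("feared violence", cands))        # the city council
print(resolve("might incite violence", cands))  # the protesters
```

The hard part in practice is producing those plausibility scores, which is exactly where the deep learning comes in; once they exist, picking the antecedent is just taking the maximum.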
Using this process, proper nouns (“Prof. Strickland”) are clustered with their common noun (“the professor”) and pronoun (“she”) references.
Coreference resolution ensures you are gathering every bit of relevant information about an entity, whether it’s facts or sentiment.
So far, we have discussed how a search engine populates a knowledge base with news media information about “Tim Cook” through three processes: ingestion, extraction, and linking. Now, when a user types in “Tim Cook,” the search engine pulls from this knowledge base of entity profiles to answer queries.
Typing “Tim Cook” will likely bring you hundreds of results about the most prominent Tim Cook, the CEO of Apple. Unless you specify otherwise, the system bets you’re searching for the more famous figure.
If you wanted another “Tim Cook,” such as the lesser-known historian, you’d add “historian.” This way the system knows to use all the articles about the historian it has compiled and linked, instead of those about the CEO.
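One plausible way to sketch this query-time behavior: default to the most prominent profile, unless the query mentions a distinguishing attribute. The profiles, prominence numbers, and tags below are all hypothetical.

```python
# Hypothetical query-time disambiguation over pre-built entity profiles:
# fall back to prominence unless a query term matches a profile's tags.

PROFILES = [
    {"id": "tim_cook_ceo",       "prominence": 0.95, "tags": {"apple", "ceo"}},
    {"id": "tim_cook_historian", "prominence": 0.40, "tags": {"historian", "canada"}},
]

def answer(query_terms: set) -> str:
    # Profiles whose tags match a query term take priority;
    # otherwise every profile competes, and prominence decides.
    tagged = [p for p in PROFILES if p["tags"] & query_terms]
    pool = tagged or PROFILES
    return max(pool, key=lambda p: p["prominence"])["id"]

print(answer({"tim", "cook"}))               # tim_cook_ceo  (prominence wins)
print(answer({"tim", "cook", "historian"}))  # tim_cook_historian
```

Adding “historian” to the query is what flips the outcome: it narrows the candidate pool before prominence gets a vote.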
A Note on Knowledge Bases
Many organizations have a knowledge base, but it’s neither as clean nor complete as they might wish. It’s very hard to build a resource as complete as Wikipedia or D-U-N-S without an army of human volunteers to manually write, sort, tag, translate, and disambiguate every article.
AI offers a promising solution to this problem.
Firms like Basis Technology help larger companies perform key tasks like linking, and they also use AI to help those companies build knowledge bases of their own.
The Big Picture
Thus far, we’ve described how search engines use AI tools like entity extraction and linking to create a growing encyclopedia of information and deliver the most relevant news information about a given person, place, or organization.
Each step must be executed with the highest possible accuracy to deliver dependable results to the user. Given the global nature of business, being able to execute in English is just the starting point. Major tech players need to be fluent in all the languages of commerce and consumers.
Homegrown solutions are prohibitively expensive, and even the tech giants have to turn to specialists.
While these tools are built by technology providers, they’re also built on data. That data is the topic of our next section.
That’s it for today. Click here for part 1. Part 3 will be posted soon!
About Basis Technology
Verifying identity, understanding customers, anticipating world events, uncovering crime. For over twenty years, Basis Technology has provided analytics enabling businesses and governments to tackle some of their toughest problems. Rosette text analytics employs a hybrid of classical machine learning and deep neural nets to extract meaningful information from unstructured data. Autopsy, our digital forensics platform, and Cyber Triage, our tool for first responders, serve the needs of law enforcement, national security, and legal technologists with over 5,000 downloads every week. KonaSearch enables natural language queries of every field, object, and file in Salesforce and external sources from a single index. For more information, email firstname.lastname@example.org or visit www.basistech.com.
Gengo, a Lionbridge company, provides training data for machine-learning applications. With a crowdsourcing platform powered by 25,000-plus certified linguistic specialists in 37 languages, Gengo delivers first-rate multilingual services for a range of data creation and annotation use cases. To learn more, visit https://gengo.ai or follow us on Twitter at @GengoIt.
Understand the NLP Driving Search
Want an inside look at the AI technology under the hood of today’s search engines?
From entity linking to semantic similarity, Inside NLP and Search explores the key innovations that make these platforms possible, giving you rare insight into how the technology that’s changing the world actually works.