The Fight Against Intelligence Failures with NLP: Rosette Identity Resolver
In his book “Intelligence and Surprise Attack: Failure and Success from Pearl Harbor to 9/11 and Beyond,” 21-year naval intelligence veteran Dr. Erik Dahl highlights the importance of the intelligence data collector: “Decision makers…don’t want to make a difficult, dangerous decision based on…an assessment, on somebody just connecting the dots. They want to know specifically what the nature of the problem is and what specifically I need to do in order to solve it.”1 Dahl’s research finds that “almost all of those warnings [in cases of intelligence failure] were of this broad, general, non-specific nature, what we call strategic warning….We need to collect the dots much more specifically. We need much more fine-grained intelligence if we’re going to use that intelligence to stop bad things from happening.”2
When domestic threats are successfully thwarted, “…the tools and techniques that are used to stop these would-be terrorists are usually very similar, the same tools and techniques that law enforcement has been using for many years. It’s tips from the public, it’s undercover officers, it’s the use of informants.”3
The continually vexing issue in intelligence is how to find those people and organizations that need to be tracked, but who are not yet on your radar.
How do you find the “unknown unknowns”?
To discover the as-yet-unknown persons and organizations that belong on a watchlist, analysts have to spend a great deal of time combing through large volumes of data and noticing patterns. In 200 reports, for example, they might notice that a new person is referenced dozens of times, elevating that individual to a “known unknown.”
Discovering those unknowns is now possible through a combination of named entity recognition (NER) and new identity resolution technology. The technology tracks profiles of “ghosts” (people or entities not yet in your knowledge base) while it automatically accumulates new information about your existing knowledge base entries (aka identities). When a ghost comes into focus with more data, data collectors can promote it to an authoritative identity in the knowledge base.
In essence, this NLP technology can transform the static knowledge bases of analysts and data collectors into an intelligent system that gets smarter over time.
How does this work? Let’s take a look at the technology underneath.
Introducing Rosette Identity Resolver: A Self-Learning Knowledge Base
Turning the unknown unknowns into unknown entities and then into known identities.
Rosette® Identity Resolver is a self-learning entity knowledge base system that uses named entity recognition to identify mentions of people, places, and organizations (aka entities) in documents. It then looks at the document context of each named entity and intelligently matches it to an identity in the knowledge base (aka entity linking). At the same time, analysts have a user-friendly interface with which to apply their expertise by making corrections and updates to the knowledge base.
What makes Identity Resolver different from other entity linking systems is that it intelligently identifies entities that do not correspond to any knowledge base entry, and adds a ghost identity to the knowledge base.
The NLP Workflow of Identity Resolver
Step 1: Linguistic Analysis
The user starts by uploading a number of documents into the system, which enriches the text and prepares it for named entity recognition. This linguistic analysis includes:
- Identifying the language of the text
- Dividing the text into sentences and tokens (aka words)
- Labeling the part of speech of each token
- Analyzing the morphology of each token.
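To make the first two bullets concrete, here is a minimal sketch of sentence and token splitting. It is a toy illustration using regular expressions, not how Rosette performs linguistic analysis; the function names and the splitting heuristics are assumptions made for this example, and language identification, part-of-speech tagging, and morphology are left out.

```python
import re

# Naive heuristics (assumptions for this sketch): a sentence ends at ., ! or ?
# followed by whitespace; a token is a run of word characters or one
# punctuation mark.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")

def split_sentences(text):
    """Split text on whitespace that follows sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

def tokenize(sentence):
    """Return word tokens and single punctuation marks, in order."""
    return TOKEN.findall(sentence)

def analyze(text):
    """Return one list of tokens per sentence."""
    return [tokenize(s) for s in split_sentences(text)]
```

A production system would need language-specific rules, for example for abbreviations or for languages that do not separate words with spaces.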
Step 2: Named Entity Recognition and Coreference Resolution
In this step, the system performs three tasks:
- Finding all word sequences in a document that are mentions of an entity, such as a person, place, organization, or other significant keyword (including dates, currency, and job titles); this is named entity recognition
- Chaining together the words that seem to refer to the same entity in a document as an “entity reference” (e.g., Barack Obama = Obama = his = he = him); this is in-document coreference resolution
- Grouping all entity references from across separate documents that refer to the same entity as an identity; this is cross-document coreference resolution or entity linking.
Technical details: The entity extraction in Identity Resolver is performed by Rosette Entity Extractor (REX), which uses three methods for NER: a statistical machine learning model that extracts words that are highly likely to represent entities; regular expressions to match entities fitting a pattern (e.g., dates); and entity lists to exact-match entity types that can be enumerated (e.g., nationality). REX also provides entity links to a knowledge base and confidence scores for each entity reference, which Rosette Identity Resolver uses in later entity linking steps.
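Two of the three methods described above, pattern matching and entity lists, are simple enough to sketch in a few lines. The example below is a toy under stated assumptions: the date pattern, the nationality list, and all function names are hypothetical and are not taken from REX, and the statistical model is omitted entirely.

```python
import re

# Hypothetical resources for illustration only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # matches e.g. 2014-06-02
NATIONALITIES = {"French", "Kenyan", "Brazilian"}     # tiny example entity list

def extract_by_pattern(text):
    """Find mentions that fit a regular expression (here: ISO-style dates)."""
    return [(m.start(), m.end(), "DATE", m.group())
            for m in DATE_PATTERN.finditer(text)]

def extract_by_list(text):
    """Find words that exactly match an entry in the entity list."""
    return [(m.start(), m.end(), "NATIONALITY", m.group())
            for m in re.finditer(r"\b\w+\b", text)
            if m.group() in NATIONALITIES]

def extract_entities(text):
    """Merge mentions from both extractors, ordered by position in the text."""
    return sorted(extract_by_pattern(text) + extract_by_list(text))
```

Each mention is a tuple of start offset, end offset, entity type, and matched text, so later steps can locate the mention in the original document.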
Step 3: Entity Linking
The crucial next step is matching these entity references to knowledge base entries, using:
- Sophisticated fuzzy name search to draw up a list of knowledge base candidates that an entity reference might match
- Entity linking that narrows the list of candidates by looking at the “context” of the entity in the document compared to identities (i.e., knowledge base entries).
Technical details: The initial list of knowledge base candidates is drawn up by matching names using the highly sophisticated name matching of Rosette Name Indexer, which considers a wide variety of ways that names vary (such as nicknames, initials, typos, and mis-segmentation) and even the same name written in different languages.
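The idea of candidate generation can be sketched with a generic string-similarity measure. This is not Rosette Name Indexer, which handles far more kinds of name variation; the sketch swaps in Python's standard `difflib.SequenceMatcher`, and the function names, the knowledge base layout (identity ID mapped to a list of aliases), and the 0.6 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidates(mention, knowledge_base, threshold=0.6):
    """Return identity IDs whose best-matching alias clears the threshold.

    knowledge_base maps identity ID -> list of aliases (hypothetical layout).
    Results are ordered from most to least similar.
    """
    scored = []
    for identity_id, aliases in knowledge_base.items():
        best = max((name_similarity(mention, alias) for alias in aliases),
                   default=0.0)
        if best >= threshold:
            scored.append((best, identity_id))
    return [identity_id for _, identity_id in sorted(scored, reverse=True)]
```

With a knowledge base holding "Barack Obama" and "Angela Merkel", the misspelled mention "Barak Obama" would yield only the first identity as a candidate, which the linker then confirms or rejects using context.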
In entity linking, the linker is the component that takes an entity reference as input and produces an identity label as output. The Identity Resolver’s linker works as a random forest classification model. A random forest is a versatile and fast machine learning algorithm that generates many decision trees, each of which selects a candidate. The random forest selects the identity based on votes from the decision trees.
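The voting step can be illustrated with a toy forest. In a real random forest the trees are learned from training data; here three hand-written functions stand in for trees so that the voting logic is visible. All names and the feature keys (`name_sim`, `context_sim`, `extractor_conf`) are hypothetical and chosen for this sketch only.

```python
from collections import Counter

def forest_vote(trees, features_by_candidate):
    """Each tree names one candidate; the candidate with the most votes wins.

    Returns the winning candidate and its share of the votes.
    """
    votes = Counter(tree(features_by_candidate) for tree in trees)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(trees)

# Stand-in "trees": each one picks the candidate that is best on one feature.
def tree_by_name(features):
    return max(features, key=lambda c: features[c]["name_sim"])

def tree_by_context(features):
    return max(features, key=lambda c: features[c]["context_sim"])

def tree_by_extractor(features):
    return max(features, key=lambda c: features[c]["extractor_conf"])
```

The vote share returned alongside the winner can serve as a rough confidence score: a candidate chosen by two of three trees is a weaker match than one chosen unanimously.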
The Rosette random forest model depends on multiple features:
- The confidence assigned by the entity extractor, so that when it is confident in its link, the Identity Resolver is more likely to select its choice.
- The similarity between the entity reference’s name and the identity’s aliases.
- Several semantic features related to the context of the mentions in the document. These features depend on text embeddings, which make it possible to determine mathematically how similar two texts are in meaning. The embedding features are based on other names and non-name words in the document.
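The three kinds of features listed above can be assembled into a single feature vector per candidate. The sketch below is an illustration only: it substitutes a bag-of-words vector for a learned text embedding (a much cruder representation), and the function names and vector layout are assumptions. Cosine similarity is a standard way to compare such vectors.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Stand-in for a text embedding: word counts of the lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_features(extractor_conf, name_sim, mention_context, identity_context):
    """Build [extractor confidence, name similarity, context similarity]."""
    context_sim = cosine(bag_of_words(mention_context),
                         bag_of_words(identity_context))
    return [extractor_conf, name_sim, context_sim]
```

A vector like this, computed for each candidate identity, is the kind of input a classification model such as a random forest would score.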
Step 4: Automatically Updating the Knowledge Base and Ghost Detection
Once the linker selects an identity for the entity reference, the entity reference is added to the knowledge base. The new entity reference increases the information in the knowledge base, thus allowing the new references to be considered as linking candidates in the next round of linking. In this way, the linker gets better and better over time and accelerates data collection.
If none of the existing entity references have a high enough random forest score to be considered a match, Identity Resolver recognizes that this is probably an unknown identity, categorized as a ghost. The system automatically creates a new ID string for the ghost and adds it to the index as a new identity, so that future references can be linked to it. This ability to link new entity references to these ghosts means the system can continue to accrue information about them. If there is a ghost identity that a user has confidence in, the user can promote it from ghost status to an authoritative identity. The promotion is significant, as the linker is more likely to link future entity references to an authoritative identity than to a ghost.
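The ghost logic described above amounts to a threshold check followed by either a link or the creation of a new identity. The following is a minimal sketch, not the product's implementation: the threshold value, the ID format, the dictionary layout of the knowledge base, and the function names are all assumptions for illustration.

```python
import uuid

GHOST_THRESHOLD = 0.5  # hypothetical cut-off; the real value is not given in the article

def link_or_create_ghost(scores, knowledge_base, threshold=GHOST_THRESHOLD):
    """Return the best-scoring identity ID, or create and return a ghost.

    scores maps identity ID -> linker score for one entity reference.
    knowledge_base maps identity ID -> record dict (hypothetical layout).
    """
    if scores:
        best_id = max(scores, key=scores.get)
        if scores[best_id] >= threshold:
            return best_id
    # No candidate is good enough: register a new ghost identity.
    ghost_id = f"ghost-{uuid.uuid4().hex}"
    knowledge_base[ghost_id] = {"status": "ghost", "references": []}
    return ghost_id

def promote(knowledge_base, identity_id):
    """Mark an identity as authoritative (a user decision in the article)."""
    knowledge_base[identity_id]["status"] = "authoritative"
```

Because the ghost is stored like any other identity, later references can be scored against it and linked to it in the same way.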
Technical details: The act of the Identity Resolver adding new references through the linking process is an implementation of a clustering algorithm. That is, an identity is a cluster of entity references. When it is time to link a new entity reference, the system identifies which known reference is most similar to it, and adds the new entity reference to that cluster. This is how a single-link clustering algorithm works.
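The single-link assignment step described above can be sketched as follows. This toy version compares references by word overlap (Jaccard similarity) rather than by the linker's random forest score, and the function names and cluster layout are assumptions for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_single_link(new_ref, clusters, similarity=jaccard):
    """Add new_ref to the cluster holding its single most similar reference.

    clusters maps identity ID -> list of reference strings (hypothetical layout).
    Returns the chosen identity ID and the similarity that decided it.
    """
    best_id, best_score = None, -1.0
    for identity_id, refs in clusters.items():
        for ref in refs:
            score = similarity(new_ref, ref)
            if score > best_score:
                best_id, best_score = identity_id, score
    if best_id is not None:
        clusters[best_id].append(new_ref)
    return best_id, best_score
```

The defining property of single-link clustering is visible in the inner loop: a cluster is judged by its closest member, not by an average over all of its members.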
Although AI can make good automatic updates to the knowledge base, the human user still needs to prune, correct, and curate the knowledge base. Identity Resolver gives users a wide range of manual controls and flexibility to keep the knowledge base and AI on track. Users can:
- Directly edit the knowledge base through Identity Resolver REST endpoints
- Add aliases to, or remove aliases from, an identity, which affects the candidate generation phase
- Change which identity an entity reference is linked to
- Split an entity reference apart, so that some of its mentions belong to a new identity.
By putting power in the hands of the users, Identity Resolver enables the non-technical subject matter expert to correct machine errors, integrate information from external sources, and update the knowledge base as the world changes.
Identity Resolver brings machine speed to human-like data collection to find the very specific information required to successfully thwart threats to society.
1. German, Mike, “Rethinking Intelligence: Interview with Erik Dahl,” June 2, 2014. Dr. Erik Dahl is the author of “Intelligence and Surprise Attack: Failure and Success from Pearl Harbor to 9/11 and Beyond.” Dr. Dahl served 21 years as an intelligence officer in the U.S. Navy. He is an assistant professor at the Naval Postgraduate School, Department of National Security Affairs and the Center for Homeland Defense and Security. http://brennancenter.org/our-work/research-reports/rethinking-intelligence-interview-erik-dahl
2. Ibid.
3. Ibid.