Adapt Rosette’s Entity Extraction to Your Content for Increased Accuracy
November 10, 2014
Entity extraction is becoming a mission-critical tool for finding mentions of people, places, organizations, and products in massive quantities of text. In patent searches, law enforcement, voice-of-the-customer analysis, ad targeting, content recommendation, eDiscovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.
Among named entity recognition systems, those that rely on machine learning to find entities, such as Rosette’s entity extraction function, have the advantage: they can find previously unknown entities. Furthermore, because statistical entity extractors are context sensitive, they can disambiguate between places like Paris and people named Paris.
Why Entity Extraction Needs To Be Flexible
When it comes to entity extraction, not all content is created equal. While most entity extractors are quite accurate out-of-the-box when working on well-formed text such as news articles, the high degree of content variation in blogs, restaurant reviews, financial documents, electronic medical records, legal contracts, and patent filings can limit the algorithms’ accuracy.
Rosette Entity Extraction has an advantage in these cases. Rosette’s statistical model has been tuned to a wide range of content beyond simply published news. For users with particularly quirky data (whether in format, style, or vocabulary) and for those who need every last bit of accuracy, Rosette includes robust field training capabilities with multiple mechanisms for adapting to your data’s idiosyncrasies, maximizing the accuracy of named entity extraction on your data.
Using Field Training to Improve Accuracy
Level 1: Just Add Data
The easiest level of adaptation, called “Unsupervised Field Training,” can be almost completely user driven. Rosette provides access to a state-of-the-art clustering tool chain. You add any quantity of your own data (no annotation needed, just any documents you have lying around that are representative of the data you need to extract from), and Rosette will build you a new model adapted to the idiosyncrasies of your data, dramatically increasing entity extraction accuracy.
This unsupervised process allows Rosette to more accurately locate entities in the genre, style, and vocabulary of your data, based on the idea of word clusters, i.e., “similar words tend to appear in similar contexts.” Thus it might learn that the word “outturn” is used in financial documents the same way “outcome” is used in news articles, or that the words “Waltham”, “Atiak”, “Loveland”, “Svetogorsk”, “Yeisk” and “Descoberto” are all likely names of LOCATIONs, even though none were mentioned in the original “stock” annotated corpus. Consequently, Rosette will better understand the context surrounding unfamiliar words and, as a result, assign them to existing, well-defined clusters.
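To make the “similar words tend to appear in similar contexts” idea concrete, here is a minimal Python sketch. It is not Rosette’s clustering tool chain, whose internals are not described here. It is a generic illustration that counts each word’s neighbors in raw, unannotated text and compares words by cosine similarity. The toy corpus, the function names, and the one-word window are all assumptions made up for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy, invented corpus: raw text only, no entity annotation.
corpus = [
    "she lives in waltham now",
    "he lives in loveland now",
    "the outturn was strong",
    "the outcome was strong",
]

vectors = context_vectors(corpus)
# Words that share neighbors ("in ... now") score high.
same_context = cosine(vectors["waltham"], vectors["loveland"])
# Words with no shared neighbors score zero.
different_context = cosine(vectors["waltham"], vectors["outturn"])
```

In this toy corpus “waltham” and “loveland” share all of their neighbors, so they would land in the same cluster, while “waltham” and “outturn” share none. A real system would use far more data and a proper clustering algorithm, but the underlying signal is the same.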
Level 2: A Little Annotation Goes a Long Way
For even greater accuracy, you can annotate a small quantity of your data and actively teach Rosette the unique contexts for entities that are common to your documents. Just a few hundred annotated documents can produce dramatic improvements in accuracy.
Rosette customers who have conducted field training report a drop in both false positives (increased precision) and false negatives (increased recall) from Rosette and a noticeable improvement in their overall analytics system.
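For readers who want to measure this on their own data, the relationship between error counts and the two metrics can be sketched in a few lines of Python. The function name and the counts below are hypothetical, invented purely for illustration. They are not customer results.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw entity counts."""
    predicted = true_positives + false_positives  # everything the extractor reported
    actual = true_positives + false_negatives     # everything it should have found
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts on the same 120-entity test set, before and after field training.
before = precision_recall(true_positives=80, false_positives=20, false_negatives=40)
after = precision_recall(true_positives=100, false_positives=10, false_negatives=20)
```

Fewer false positives raise precision, and fewer false negatives raise recall, so a model that reduces both kinds of error improves on both measures at once.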
Given that most of our customers welcome guidance in selecting data, building a new model, and evaluating the results, Basis Technology offers professional services to assist with field training. Whether you are just adding raw data to the Rosette model or annotating your own documents, we can help.
Contact us if you have more questions about the highly adaptable Rosette.