Make Your Choice: It’s More Than a Score for Evaluating NLP
25 Feb 2020

Part 3 of Evaluating Natural Language Processing for Named Entity Recognition in Six Steps

Just as standardized test scores alone cannot prove that an applicant will succeed at a college or university, an accuracy score alone cannot tell you which NLP system to choose. Ultimately, the NLP must fulfill your needs, based on the requirements you defined in step 1. These needs may include language coverage and the entity types it can extract, as well as the ability to be customized or taught to perform better on your data.

The truth is, you are very, very lucky if an out-of-the-box NER model solves your problem. Good NER platforms are adaptable and have a mature suite of features to solve common problems (such as text in all capital letters, short strings, and character normalization), but in the end some aspects of your data will always be unique to your business. Therefore, it is vital not only to get the out-of-the-box score but also to learn about the maturity of the technology and its capacity for customization.

Adaptation

Former longtime Massachusetts House Speaker Tip O’Neill was known for saying, “All politics is local.” Lesser known is “All NLP is local,” in that any NLP technology will perform best on data that most closely resembles the data it was trained on. If the NLP in question was trained on news articles, it may perform acceptably on product reviews, but not so well on tweets, electronic medical records, or patent applications.

For many consumers of NLP, the final choice comes down to how easy it is to adapt a model to their needs and how many different paths there are to achieving the accuracy they need. Does the system have tools that let the user, with relative ease, fix errors, retrain the statistical model, adapt a model to a different domain or genre, add new entity types, or always omit or include certain entities?

Some methods are easier:

  • Adding entity types by writing regular expressions for pattern-matched entities
  • Adding entity types by creating an entity list for the new type
  • Retraining a model with unannotated data (that is, unsupervised training)
  • Creating a pattern for matching words that are always entities or never entities
  • Creating an entity list for words that are always entities or never entities.
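The pattern and entity-list methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular NER product implements customization: the PRODUCT_CODE pattern, the DRUG list, the placeholder code AB-0000, and the function name extract_entities are all hypothetical.

```python
import re

# Hypothetical pattern for a new entity type: product codes such as "AB-1234".
PATTERNS = {"PRODUCT_CODE": re.compile(r"\b[A-Z]{2}-\d{4}\b")}

# Hypothetical entity list for a new entity type.
ENTITY_LISTS = {"DRUG": ["aspirin", "ibuprofen"]}

# Hypothetical strings (lowercased) that must never be tagged as entities.
NEVER_ENTITIES = {"ab-0000"}


def extract_entities(text):
    """Return sorted (start, end, type, surface) tuples found by the rules."""
    found = []
    # Pattern-matched entities.
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), etype, m.group()))
    # Entity-list matches, whole words only, ignoring case.
    for etype, terms in ENTITY_LISTS.items():
        for term in terms:
            term_re = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(term_re, text, re.IGNORECASE):
                found.append((m.start(), m.end(), etype, m.group()))
    # Apply the "never an entity" list last so it overrides the other rules.
    return sorted(e for e in found if e[3].lower() not in NEVER_ENTITIES)


for entity in extract_entities("Order AB-1234 shipped with Aspirin."):
    print(entity)
```

Applying the "never" list as a final filter is one simple design choice; a real system would also need a policy for overlapping matches and for conflicts with the statistical model's own output.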

Some methods are harder:

  • Correcting a pattern of errors by writing a custom processor
  • Increasing accuracy in a domain by retraining a model with annotated data
  • Adding entity types by retraining a model with annotated data.

Does the NER system provide a wide range of options for customization or just one or two? Do those options require moderate or significant investment of time and labor? Or, is the adaptation that you need even possible?

Maturity

Maturity comes down to whether the vendor really knows a particular NLP problem space. Does the vendor use a range of techniques for extracting each entity type? Some systems may rely heavily on an external knowledge base to extract and link terms. What about entities that don’t appear in the knowledge base? Can the system handle entities that are misspelled?
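To make the misspelling question concrete, here is a minimal sketch of a knowledge-base lookup that falls back to approximate string matching, using Python's standard difflib module. The tiny KNOWLEDGE_BASE, the lookup function, and the 0.85 cutoff are assumptions chosen for illustration; a mature system would likely combine several techniques rather than rely on this one.

```python
import difflib

# Hypothetical knowledge base mapping lowercased names to entity types.
KNOWLEDGE_BASE = {"jerusalem": "LOCATION", "boston": "LOCATION"}


def lookup(mention, cutoff=0.85):
    """Return (canonical_name, type), or None when nothing is close enough."""
    key = mention.lower()
    # Exact match first.
    if key in KNOWLEDGE_BASE:
        return key, KNOWLEDGE_BASE[key]
    # Otherwise take the single closest name above the similarity cutoff.
    close = difflib.get_close_matches(key, list(KNOWLEDGE_BASE), n=1, cutoff=cutoff)
    if close:
        return close[0], KNOWLEDGE_BASE[close[0]]
    return None


print(lookup("Jeruslem"))  # a misspelling that is close to a known name
print(lookup("Paris"))     # not in the knowledge base at all
```

The second call shows the other gap raised above: a lookup alone returns nothing for an entity the knowledge base has never seen, which is why a range of extraction techniques matters.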

It’s not just F-score: The customization, adaptation factor

Let’s conclude with a couple of ideas. F-score is a convenient metric for comparing NLP systems, but don’t look only at F-score, precision, and recall. Remember to hold your emotions in check when you come across a “howler,” an error that would be obvious to a human. The system could still be performing very well despite not tagging “Jerusalem” as a location. Finding a howler is the perfect moment to ask, “Can this system be adapted with relative ease to fix howlers or perform better on my data?”
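As a reminder of what the numbers measure, here is a minimal sketch of precision, recall, and F-score (the F1 variant) computed over exact-match entity annotations. The scoring convention (an entity counts as correct only if its span and type both match), the sample annotations, and the function name precision_recall_f1 are assumptions for illustration; your evaluation may score partial matches differently.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities using exact matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start, end, type) tuples.
gold = [(0, 9, "LOCATION"), (15, 20, "PERSON"), (30, 38, "ORG"), (45, 52, "LOCATION")]
# The system misses the first location (the "howler") but gets the rest right.
predicted = [(15, 20, "PERSON"), (30, 38, "ORG"), (45, 52, "LOCATION")]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case one glaring miss costs recall but leaves precision untouched, which illustrates why a single howler says little about overall performance.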

Suppose the system you are evaluating comes from a vendor with a good track record as a serious NLP technology provider, one that offers a variety of options and tools for making the system work better on your data. Even if that system scores lower than a competitor out of the box, its maturity and the ease of customizing it to fit your data may ultimately make the nimbler system the real winner.
