Evaluating NLP: Annotating Evaluation Data and Scoring Results
18 Feb 2020

Part 2 of Evaluating Natural Language Processing for Named Entity Recognition in Six Steps

In our previous blog post, we discussed the importance of defining your requirements for your NLP evaluation. In short, can you describe what a “perfectly performing NLP” would output? And with perfect output, have you shown that you would get the business outcomes you are seeking? This second of three parts looks at the heart of the work, which is annotating a gold standard test dataset, and then scoring the results you get from different NLP libraries processing this test dataset.

Annotate the gold standard test dataset

Annotating data begins with drawing up guidelines. These are the rules by which you will judge correct and incorrect answers. It seems obvious what a person, location, or organization is, but there are always ambiguous cases, and your answer will depend on your use case, which depends on the requirements you defined in step 1.

If you’ve never done this before, you might think, “Well, isn’t a ‘person’ just any person mentioned?”

Here are some ambiguous cases to consider:

  1. Should fictitious characters (“Harry Potter”) be tagged as “person”?
  2. When a location appears within an organization’s name, do you tag both the location and the organization, or just the organization (“San Francisco Association of Realtors”)?
  3. Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr. Day”)?
  4. How do you handle compound names?
  5. Do you tag “Twitter” in “You could try reaching out to the Twitterverse”?
  6. Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”?
  7. When do you include “the” in an entity?
  8. How do you differentiate between an entity that’s a company name and a product by the same name? {[ORG]The New York Times} was criticized for an article about the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}.

How to do the annotation

The web-based annotation tool BRAT is a popular, if manual, open-source option. More sophisticated annotation tools that use active learning[1] can speed up the tagging and minimize the number of documents that need to be tagged to achieve a representative corpus. As you tag, check to see if you have enough entity mentions for each type, and then tag more if you don’t.

Once the guidelines are established, annotation can begin. It is important to review the initial tagging to verify that the guidelines are working as expected. The bare minimum is to have a native speaker read through your guidelines and annotate the test corpus. This hand-annotated test corpus is called your “gold standard.”

Extra credit: Set up an inter-annotator agreement. Ask two annotators to tag your test corpus, and then check the tags to make sure they agree. In cases where they don’t agree, have the annotators check for an error on their side. If there’s no error, have a discussion. In some cases, a disagreement might reveal a hole in your guidelines.
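One common way to quantify inter-annotator agreement is Cohen’s kappa, which corrects raw agreement for the agreement you would expect by chance. The sketch below is illustrative, not tied to any particular annotation tool; the label scheme (per-token tags with “O” for untagged tokens) is our own assumption:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    labels_a and labels_b are per-token labels (e.g., "PER", "LOC", "O")
    assigned by each annotator to the same sequence of tokens.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of tokens where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random according
    # to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators disagree on one of four tokens.
kappa = cohens_kappa(["PER", "O", "O", "LOC"], ["PER", "O", "PER", "LOC"])
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance. The tokens where the annotators disagree are exactly the cases worth discussing against the guidelines.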

Get output from vendors

Give vendors an unannotated copy of your test corpus, and ask them to run it through their system. Any serious NLP vendor should be happy to do this for you. You might also ask to see the vendor’s annotation guidelines, and compare them with your annotation guidelines. If there are significant differences, ask if their system can be adapted to your guidelines and needs.

Evaluate the results

Let’s introduce the metrics used for scoring NER, and then the steps for performing the evaluation.

Metrics for evaluating NER: F-score, precision, and recall

Most NLP systems and search engines are evaluated based on their precision and recall. Precision answers “of the answers you found, what percentage were correct?” Recall answers “of all the possible correct answers, what percentage did you find?” The F-score is the harmonic mean of precision and recall, which isn’t quite an average of the two, as it penalizes cases where the precision and recall scores are far apart. This makes intuitive sense: if the system finds 10 answers that are correct (high precision) but misses 1,000 correct answers (low recall), you wouldn’t want the F-score to be misleadingly high.
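As a quick sanity check on that intuition, here is how the three metrics fall out of raw true-positive, false-positive, and false-negative counts (a minimal sketch; the function name is our own):

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 (harmonic mean) from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The scenario from the text: 10 correct answers found with none wrong
# (perfect precision), but 1,000 correct answers missed (very low recall).
p, r, f = precision_recall_f1(10, 0, 1000)
# The harmonic mean stays close to the lower score (~0.02 here),
# whereas a plain arithmetic average would misleadingly report ~0.5.
```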

In some cases using the F-score as your yardstick doesn’t make sense, such as with voice applications (e.g., Amazon’s Alexa), where the goal is high precision with low recall, because the system can only present the user with a handful of options in a reasonable time frame. In other cases, high recall and low precision is the goal. Take the case of redacting text to remove personally identifiable information: redacting too much (low precision) is far better than missing even one item that should have been redacted, so the system should overtag (high recall).

See Appendix A in the PDF version of this blog post series for the details behind calculating precision, recall, and F-score. Vendors should be willing to calculate these scores based on their output, but knowing what goes into these scores is good practice.

Determining “right” and “wrong”

Determining what is correct is easy; the tricky part is that there are many ways to be wrong. We recommend the guidelines followed by Message Understanding Conference 7 (MUC-7).[2] Entities are scored on a token-by-token basis (i.e., word-by-word for English). For ideographic languages such as Chinese and Japanese, character-by-character scoring may be more appropriate, as there are no spaces between words and a single character frequently represents a whole word.

Scoring looks at two things: whether entities extracted were labeled correctly as PER, LOC, etc.; and whether the boundaries of the entity were correct. Thus, an extracted PER, “John,” would only be partially correct if the system missed “Wayne,” as in “John Wayne,” the full entity.
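The “John Wayne” case can be made concrete with a simplified token-level scorer. This is a sketch of the token-by-token partial credit described above, not the full MUC-7 scheme (which tracks type and boundary errors separately); the (token, label) representation with “O” for untagged tokens is our own illustration:

```python
def token_level_counts(gold, predicted):
    """Count token-level true positives, false positives, and false negatives.

    gold and predicted are equal-length lists of (token, label) pairs,
    where "O" marks a token outside any entity.
    """
    tp = fp = fn = 0
    for (_, g), (_, p) in zip(gold, predicted):
        if g == p and g != "O":
            tp += 1                  # correct label on an entity token
        elif p != "O" and g != p:
            fp += 1                  # system tagged something wrong...
            if g != "O":
                fn += 1              # ...and also missed the true label
        elif g != "O" and p == "O":
            fn += 1                  # system missed an entity token
    return tp, fp, fn

# "John Wayne was born in Iowa": the system tags only "John" as PER,
# so it gets partial credit for the two-token entity.
gold = [("John", "PER"), ("Wayne", "PER"), ("was", "O"),
        ("born", "O"), ("in", "O"), ("Iowa", "LOC")]
pred = [("John", "PER"), ("Wayne", "O"), ("was", "O"),
        ("born", "O"), ("in", "O"), ("Iowa", "LOC")]
counts = token_level_counts(gold, pred)  # 2 hits, 0 spurious, 1 missed
```

Feeding these counts into the precision/recall formulas gives perfect precision but reduced recall, which is exactly the partial credit the boundary error deserves.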

See detailed examples of how the outputs are “graded” and scores are calculated in Appendix B in the PDF version of this blog post series.

When evaluating, F-score, precision, and recall are good places to start, but they are not the entire story. Other factors — including the ease or difficulty of adapting the NLP to do better on your task or the maturity of the solution — must be considered.

The third and final blog post in this series will walk you through those considerations. Next week, learn how to choose the right NLP package.


End notes

  1. Basis Technology has an active learning annotator tool that speeds the annotation process by helping to select a diverse set of documents to tag. It works in tandem with a model building process that enables frequent checking to see when “enough” documents have been tagged.
  2. The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction between research teams competing against one another, which resulted in MUC developing standards for evaluation (e.g., the adoption of metrics like precision and recall). MUC-7 was the seventh conference held in 1997. https://en.wikipedia.org/wiki/Message_Understanding_Conference

Read other posts in this blog series: