24 Nov 2020

The Top NLP Mistake Made by Data Scientists

When the conclusions seem “off,” is it bad data science or faulty NLP?

By Dan Maxwell

Download Whitepaper

Executive Summary

In the age of big data, high-quality results from natural language processing (NLP) data extraction are the foundation of analytical models built by data scientists. Betting that NLP models of unknown quality will be “good enough” for your analyses is a gamble because starting with weak or solid NLP output can be the difference between building on quicksand or bedrock. Even the most meticulous and brilliant data science cannot produce good analyses based on erroneous NLP results.

Whether your data science models will choose media channels for an advertising campaign, influence billions in spending, or affect life-and-death decisions, cheaply built NLP is too expensive. Using superior NLP will pay rich dividends in better data science. This whitepaper discusses the importance of selecting NLP that will boost the accuracy and usefulness of your data science conclusions.

The Data Challenge of Data Scientists

Data scientists build analytical models to help others make decisions ranging from resource allocation to when to reopen society after it was shut down by a pandemic. Based on patterns, signals, and relevant data points extracted by NLP, data scientists build descriptive models that tell what is known and with what level of certainty. They are the foundations for predictive models, which provide forecasts for things like the future market demand of a product or the likelihood of a ceasefire in a conflict. From there, prescriptive models can be constructed to weigh different courses of action by combining the predictive models with a decision-maker’s objectives.

The data challenge is finding the relevant data with which to build the first descriptive model. There is simply too much unstructured data out there, and most of it is irrelevant. Data scientists turn to NLP tools to achieve three objectives:

  1. Speed the process of understanding a quantity of data by asking, is it useful and what is in there that I can use?
  2. Provide insights at scale. Instead of accessing tens to hundreds of documents in a week, NLP plows through thousands of documents a day, and finds more useful data than humans could do manually.
  3. Structure data nicely so that machine learning algorithms for a data scientist’s math models can ingest and use it.

The number one mistake of many data scientists is spending more time thinking about their tools than the problem they are trying to solve. Data scientists are either paid for time and labor or for providing a result, so they need to be cost-efficient and ask: Where is my time best spent in providing value for the client? I’d argue that as a data scientist, your time is best spent thinking about the problem you are trying to solve, what models fit that class of problem, and how it all connects to helping your client make better decisions.

When “Good Enough” Can Be Fatal

Some data scientists may wish to enrich their experience and learn how to train NLP models using open source tools. Others feel strapped for funding and would rather not buy software. Some make the compromise of “good enough” without realizing the fundamental impact a weak data extraction solution will have on their results. NLP is an entire discipline in its own right, and to build high-quality extraction models requires a significant investment of time, money, and expertise. (Learn more about what NLP professionals do to build a model.) There are NLP first principles — underlying rules of the tools and techniques for all modeling — that need to be followed in order to produce an accurate model and, eventually, a useful tool.

Foreign languages are a particular challenge. NLP first principles say never translate, always analyze text in the source language. Some data scientists might jury-rig an NLP pipeline by Google Translating text into English, and then using prebuilt open source NLP models for English. These models are readily available and high quality. However, doing so introduces potentially fatal errors from machine translation into the data.

A Case Study in Forecasting

In the IARPA-sponsored Geopolitical Forecasting Challenge 2, our team invested in professional NLP tools — including semantic search, keywords and phrase extraction, and entity extraction — for the heavy lifting of accurate unstructured text processing. We used semantic searching and entity extraction to find the most relevant documents, and then examined keywords and phrases to find less-biased sources. For example, sources expressing both sides of a point were more reliable than sources presenting only one side. Thus, through structured search, we could prioritize documents with terms and phrases of logic that show a balanced thought process, such as “The reason for A is xyz. On the other hand, B could occur by thinking about it (this way).” This type of keyword or keyphrase search is nearly impossible if the original meaning is garbled in translation.

A Case Study of an NLP/Data Science Failure

I was called into a government agency and asked to review their descriptive, predictive, and prescriptive data science models, which looked to be based on some jury-rigged NLP. Although I was not allowed to see the data, I was given the diagnostics on the regression analysis (identifying which variables have an impact on a factor of interest). Normally an error equaling about 20% of the coefficient is considered acceptable, but I found that the mean squared error was 10 times greater than the coefficient (i.e., 50 times greater than the usual acceptable margin of error). Those results were fed into a nonlinear optimization model, but as the range of variation was so small, it looked like a straight line, and it was impossible to tell what kind of curve it was.

The end result was the agency had no way of knowing if the plan in which it was investing significant funds would be effective.


When your range of results is the red line, it’s not possible to know the shape of the curve.

The Path to Better Decisions

Rather than risk building upon unsound data that will skew your analytical models dramatically, it is well worth spending money on professionally built NLP based on solid principles. In particular, NLP analysis should be on the original text, and not a machine translated stand-in. Good, professional NLP will also come with user-friendly tools that will let you adapt prebuilt models to perform even better on your particular data.

When the data models are outputting poor results, data scientists need to ask themselves if the results are due to an error in the data science or erroneous data from weak NLP. Let it not be “garbage in” NLP results producing “garbage out” analyses.

About the Author

Operations Researcher Daniel Maxwell was the leader of team KaDSci, which leaped from 27th place to 2nd place in three months during IARPA’s nine-month-long Geopolitical Forecasting Challenge 2. This improvement came, in large, by adding sophisticated natural language processing tools to triage data. The team used semantic search and entity extraction to surface useful, accurate data from massive, noisy datasets — sometimes containing misleading information — which enabled its human forecasters to predict answers to 305 geopolitical questions covering a staggering range of topics.