Why Data & Data Annotation Make or Break AI: Inside NLP and Search, Part III


Interested in search technology, or in AI more generally? Over the next four weeks, we’re going to take an in-depth (and interesting!) look at the technology that makes modern search tick. Today we’re digging into data and how it’s prepared.

Data: The Building Blocks of AI

Machine-learning algorithms don’t just spring from nothing. Before they can extract or link a single entity, they need to be shown what an entity is: what to label, where to find it, and how to tag it. In short, they need to be trained.

To do this, developers rely on large, human-annotated datasets, formed from thousands of examples of correct input and output for a given task. By running each data point through the algorithm multiple times, it’s possible to create a model that has extrapolated the complex system of rules and relationships behind the entire dataset.
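To make this concrete, here’s a toy sketch in Python (using scikit-learn, with invented labeled examples) of a model learning from human-annotated input-output pairs over multiple passes:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier

    # Hypothetical human-annotated examples: raw text paired with its correct label.
    examples = [
        ("Acme Corp. announced record profits", "corporate"),
        ("The senator voted against the bill", "political"),
        ("Acme shares fell after the merger", "corporate"),
        ("Parliament passed the new budget", "political"),
    ]
    texts, labels = zip(*examples)

    # Convert raw text into numeric features the algorithm can consume.
    X = CountVectorizer().fit_transform(texts)

    # Run each data point through the algorithm multiple times (epochs),
    # letting the model extrapolate the rules linking inputs to labels.
    model = SGDClassifier(random_state=0)
    for epoch in range(10):
        model.partial_fit(X, list(labels), classes=["corporate", "political"])

    print(model.predict(X[:1]))  # e.g. ['corporate']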

The scope of a dataset therefore defines the limits of a solution’s capabilities, while the level of detail it contains helps to determine the accuracy with which the algorithm will do its task. As a result, there is an unbreakable bond between high-quality data and high-performance algorithms, as well as massive demand for data that will give a solution the extra edge.

Of course, there is plenty of open-source, off-the-shelf data available on the internet which many companies pull from to widen their datasets. However, this isn’t much help to those looking to build a top-of-the-range solution.

In NLP, the need to keep pace with the rapid evolution of language can quickly make public data obsolete.

Consider the phrase “North West.” Until a few years ago, it almost certainly referred to the northwest of a particular place. Now, North West is just as likely to refer to Kanye West’s daughter as to any geographical region.

These subtle shifts in meaning happen all the time, in every language and culture on the planet. The slang of the present will be old news in a few months. New words are invented, old ones are re-engineered, and cultural phenomena rise and fall.

Meanwhile, the knowledge gap between data from 15 years ago and the data of today widens into a chasm.

The only way to continue riding the wave of change is to turn to human annotators, who are immersed in the languages and cultures the algorithm needs to learn. As the only reliable source of ground truth for language-based models, human knowledge is the secret weapon behind the best training data—and by extension, the best machine-learning solutions.

In this section, let’s dive deep into the underbelly of the NLP supply chain. We’ll explore how specialized data providers create and process the raw materials needed for machine learning, propping up all the systems and solutions covered earlier in this series.

But first, let’s get a bit technical. To truly grasp this section of the supply chain, it’s essential to understand how data annotation works.

A Note on Data Collection

The raw data that annotators begin with needs to fit a certain profile, which in turn determines how much data needs to be annotated. The ideal training corpus has the following key features:

  • It should be representative, covering the domain vocabulary, format, and genre of the text you intend to put into the Named Entity Recognition (NER) system.
  • It should be balanced, containing instances of each entity type that the system is supposed to extract. For example, a system cannot learn to extract corporate entities if there are not enough mentions of corporate entities in the training data.
  • It should be clean. Feeding a bunch of raw HTML pages into training probably won’t produce good results. If the text is in a language other than English, it’s especially important to normalize characters. For example, “é” can be a single character or the letter “e” plus a combining accent; normalizing every instance ensures that the model won’t differentiate between characters that are essentially the same (see the sketch just after this list). Cleaning is especially critical in languages like Japanese, which has both “full-width” and “half-width” versions of katakana and ASCII characters.
  • It should be big enough. You need a certain amount of data for it to be representative and to contain sufficient mentions of each entity type. This ensures accuracy, and is crucial to the creation of a gold standard that will test your system’s performance.
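To illustrate the normalization point above, here’s a minimal sketch using Python’s standard unicodedata module (one common approach, not necessarily the exact pipeline any given provider uses):

    import unicodedata

    def clean_text(text: str) -> str:
        # NFKC normalization composes "e" + combining accent into a single
        # "é", and folds full-width/half-width variants (common in Japanese
        # text) into one canonical form.
        return unicodedata.normalize("NFKC", text)

    # "é" as one code point vs. the letter "e" plus a combining acute accent
    assert clean_text("caf\u00e9") == clean_text("cafe\u0301")

    print(clean_text("ＡＢＣ"))  # full-width ASCII    -> "ABC"
    print(clean_text("ｶﾀｶﾅ"))  # half-width katakana -> "カタカナ"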

What is Data Annotation?

Annotation is the act of adding vital information to raw data.

At the beginning of the process, this data is without structure or order—and therefore incomprehensible to machines. To a supervised learning algorithm, data without tags is simply noise.

Through annotation, however, this noise can be turned into a focused training manual that has an impact all the way up the supply chain.

To illustrate this, let’s return to our search engine example. In order to build an entity extractor, Basis Technology requires a dataset of text samples annotated for entity extraction.
However, even within entity extraction, there are a multitude of tagging approaches, each of which trains the solution for a slightly different task.

These different methods create different input-output combinations within the data. Since a model extrapolates the rules governing a dataset from these combinations, adding slightly different metadata to the same raw text can produce models optimized for a completely different type of task.
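For example (a hypothetical sentence and label scheme, purely for illustration), the same raw text could be annotated for entity extraction or for document classification, yielding training data for two very different models:

    sentence = "North West traveled to Paris last week."

    # Scheme A: span-level entity tags -> trains an entity extractor.
    ner_annotation = [
        {"text": "North West", "start": 0, "end": 10, "label": "PERSON"},
        {"text": "Paris", "start": 23, "end": 28, "label": "LOCATION"},
    ]

    # Scheme B: a single document-level tag -> trains a topic classifier.
    topic_annotation = {"label": "celebrity_news"}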

Here’s an example of some raw text data that could be used to train the entity extractor:
(Image: raw, unannotated text loaded into the BRAT annotation tool)
There are a number of ways we could proceed here. The most relevant annotation method for our example is Named Entity Recognition (NER).

Through this method, words or phrases are tagged according to meaning. Names should be tagged as Names, while companies are annotated as Companies, and so on. These labels are sourced from a classification system that can stretch to multiple tiers, depending on the level of detail requested by the client.

Done simply, this could look similar to the below example:
(Image: example of text annotated in the BRAT format)
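To give a flavor of what that looks like, here is a hypothetical sample in BRAT’s standoff format, where each line records an entity ID, its type and character offsets in the raw text, and the text of the span itself:

    Kanye West founded GOOD Music in New York.

    T1	Person 0 10	Kanye West
    T2	Organization 19 29	GOOD Music
    T3	Location 33 41	New York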
There are many other ways to label a text, but for the sake of brevity we’ll refrain from making an exhaustive list. Outside of entity extraction, other machine-learning tasks such as sentiment analysis or computer vision also have their own range of unique annotation methods.

Although the above example might seem fairly straightforward, building a clean, focused dataset for AI training isn’t easy. There are a huge number of tasks that have to be considered in order to build effective training data. All of these have the potential to eat through large swathes of valuable time if performed by someone who isn’t a specialist.

A Note on Outsourcing

Not everyone can parse a sentence into dependency trees. In fact, finding capable annotators can be a tremendous headache. Yet in some ways, this is one of the easier parts of the problem.
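For a sense of the skill involved, here’s a minimal dependency-parsing sketch using the open-source spaCy library (one tool among many, with an invented sentence):

    import spacy

    # Requires the small English model: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Skilled annotators label entities in raw text.")

    # Each token points to its syntactic head, forming a dependency tree --
    # the structure a human annotator must be able to produce and verify.
    for token in doc:
        print(f"{token.text:<12} --{token.dep_}--> {token.head.text}")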
Once a group of annotators has been found, there’s a whole host of behind-the-scenes management tasks that need to be undertaken. From testing, onboarding, and ensuring tax compliance to distributing, managing, and assessing the quality of projects, there’s an enormous amount of hidden labor involved in annotating.

It’s a hassle for anyone to build this kind of system out. As a result, tech companies often choose to outsource to companies that specialize in data annotation. By bringing experienced outside players into the process, they free up time and resources to get on with what they do best—building search engines.

That’s it for today. Click here for part 1 & part 2. Part 4 will be posted soon!

About Basis Technology

Verifying identity, understanding customers, anticipating world events, uncovering crime.
For over twenty years, Basis Technology has provided analytics enabling businesses and governments to tackle some of their toughest problems. Rosette text analytics employs a hybrid of classical machine learning and deep neural nets to extract meaningful information from unstructured data. Autopsy, our digital forensics platform, and Cyber Triage, our tool for first responders, serve the needs of law enforcement, national security, and legal technologists with over 5,000 downloads every week. KonaSearch enables natural language queries of every field, object, and file in Salesforce and external sources from a single index. For more information, email info@basistech.com or visit www.basistech.com.

About Gengo

Gengo, a Lionbridge company, provides training data for machine-learning applications. With a crowdsourcing platform powered by 25,000-plus certified linguistic specialists in 37 languages, Gengo delivers first-rate multilingual services for a range of data creation and annotation use cases. To learn more, visit https://gengo.ai or follow us on Twitter at @GengoIt.


Understand the NLP Driving Search

Want an inside look at the AI technology under the hood of today’s search engines?

From entity linking to semantic similarity, Inside NLP and Search explores the key innovations that make these platforms possible, giving you rare insight into how the technology that’s changing the world actually works.

Download Now