How Data Annotation Works: Inside NLP and Search, Part IV
Interested in search technology, or AI generally? Over the next four weeks, we’re going to take an in-depth (and interesting!) look at the technology that makes modern search tick. This week, we’re breaking down, step by step, how data annotation works.
How Entity Annotation Works
It should come as no surprise that an entity extractor requires a large training dataset made up of entity annotation examples.
Since we intend to use this extractor to improve search relevance for some of the world’s leading technology companies, it’s extremely important that the tool is able to understand the intricate semantic and linguistic links between terms in a given text.
It could take a small in-house team months to annotate enough raw data with the necessary level of detail to perform this task.
It is here that big companies often turn to a data provider for help. For a project like this, a good data provider will move through the following stages to ensure maximum ROI for their customers:
Stage 1: Establishing Guidelines
For every project, it’s crucial to have clear guidelines.
At the start of a project, both parties should expect to build out a comprehensive document that will define what constitutes an appropriately tagged entity within the text.
The development of these guidelines centers on two things: the classification system and best practices for annotation.
When we establish the classification system, our main focus is on understanding the exact meaning of all labels in the system, on both a macro and micro level. We need a domain-specific definition of each category and sub-category.
For example, many clients would consider a hotel to be a company, but travel industry customers often request that hotels be tagged as locations. There will also be specific tagging requests based on the text itself. When tagging dates, it’s important to know whether the category should be restricted to explicit phrases like “August 17th,” or whether relative or partial expressions such as “tomorrow” or “the 17th” should be included.
All of these instances need working out in advance to produce high-quality data. The expectation should be that the guideline will continue to evolve as annotators encounter fresh edge cases.
The second half of the guidelines deals with the process of tagging itself. Annotators should know how to deal with a range of potential issues, such as punctuation, nicknames, symbols, spelling errors, and junk data.
Once the guidelines are ready, it’s finally time to start annotating.
Stage 2: Annotation and Quality Controls
The work of annotators sounds rather straightforward. At a high level, they have to read each raw text sample, highlight certain words or phrases, and add an appropriate label using a text annotation tool. However, contributors have to work extremely carefully to avoid common errors that could compromise the dataset.
The job of annotating is more than just a language exercise. It’s also a cultural exercise.
To give just one narrow example, names might only be tagged when they refer to a person entity. If this is the case, “Walt Disney” and “Snow White” should be tagged, but “Mickey Mouse” should not, since he isn’t human.
Beyond this, words have an awkward habit of changing categories, often within the same paragraph. This can often be seen in the names of institutions, such as “Bank of Japan.” On its own, “Japan” should be tagged as a location, but as part of this phrase, it belongs to a single entity that should be tagged as an organization. This issue is even more apparent in the Japanese name, where the word for Japan, 日本, is contained within the compound name of the institution, “日本銀行.”
As a result, annotators have to be alive to changes in language use, subject, and nuance within a piece of text to ensure that every data point has a positive impact.
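One way a post-processing script might enforce the “Bank of Japan” rule is to keep only the longest of any overlapping candidate spans. The sketch below is an illustrative assumption, not a description of how any particular annotation tool works; the helper name keep_longest, the label codes, and the sample sentence are all hypothetical.

```python
def keep_longest(spans):
    """Drop any span that overlaps a longer one.

    spans: list of (start, end, label) character offsets, end exclusive.
    """
    kept = []
    # Visit the longest spans first; ties go to the earliest start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        # Keep the span only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


text = "The Bank of Japan is headquartered in Japan."
candidates = [
    (4, 17, "ORG"),   # "Bank of Japan"
    (12, 17, "LOC"),  # "Japan" inside the institution name
    (38, 43, "LOC"),  # standalone "Japan"
]
resolved = keep_longest(candidates)
for start, end, label in resolved:
    print(label, text[start:end])
```

Here the inner “Japan” is discarded because it sits inside the longer organization span, while the standalone “Japan” at the end of the sentence keeps its location tag.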
There’s another issue when multiple contributors work on a dataset: consistency.
For quality assurance, it’s good practice to ask more than one person to tag the same data. From this, it’s possible to calculate how much cross-annotator agreement there is in the completed dataset. Depending on the subjectivity or objectivity of the task, a low level of agreement can mean one of two things.
For objective tasks with extensive guidelines, a lack of agreement suggests that the annotators need more training. All that effort to build out guidelines pays off here, as it clarifies many of the thornier issues that come up as work progresses.
For subjective tasks, such as certain types of entity linking, disagreement suggests that the source input is of low quality. Since the annotators can’t agree, a more uniform, cleaned dataset needs to be sourced.
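The article does not name a specific agreement metric. One common choice, assumed here purely for illustration, is Cohen’s kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The function name cohen_kappa and the sample labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Two annotators tagging the same six tokens (illustrative data).
annotator_a = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC", "O"]
annotator_b = ["B-ORG", "I-ORG", "B-LOC", "O", "B-LOC", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates strong agreement, while a value near 0 means the annotators agree no more often than chance would predict. Where to set the threshold for retraining annotators or re-sourcing data is a project-level decision.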
Fortunately for us, NER is one of the more objective entity extraction tasks.
Other data annotation tasks, such as text categorization, tend to be far more subjective. In these cases, extensive cross-annotation checks are required to compensate for the varying ways in which people categorize content.
Stage 3: Packaging and Exporting
Once all the relevant entities have been identified and labeled, it’s time to format the data. There are two particular methods of presenting the data that are worth mentioning.
IOB2 format, where IOB stands for inside-outside-beginning, is a method of tagging in which the annotations are attached directly to each word. It looks like the image on the left.
In this scenario, B indicates the beginning of the label, while I denotes that this word also falls inside the entity. Tagging with O means that the word falls outside of any entity category. The key advantage of this method is that it includes the entire text, regardless of whether the word is part of an entity.
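As a minimal sketch of the scheme just described, the following Python turns entity spans into one IOB2 tag per word. The sentence, the label codes (PER, ORG), and the helper name to_iob2 are assumptions made for illustration.

```python
def to_iob2(tokens, spans):
    """Return one IOB2 tag per token.

    spans: list of (start, end, label) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)  # every word starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining words inside the entity
    return tags


tokens = ["Walt", "Disney", "visited", "the", "Bank", "of", "Japan", "."]
spans = [(0, 2, "PER"), (4, 7, "ORG")]
tags = to_iob2(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Every word appears in the output, including those tagged O, which is what allows the full text to be reconstructed from the annotated file.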
Sometimes entity extraction results are provided in JSON format, so that the information is more easily processed by other systems. In JSON, the information is provided in a similar way to IOB2:
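The exact JSON schema depends on the provider and the tool. As a hypothetical sketch, a token-level export might pair each word with its tag, as below. The field names "tokens", "token", and "tag" are assumptions, not a standard.

```python
import json

# Illustrative sentence with its IOB2 tags.
tokens = ["Walt", "Disney", "visited", "the", "Bank", "of", "Japan", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]

# One record per word, mirroring the word-by-word layout of IOB2.
records = [{"token": token, "tag": tag} for token, tag in zip(tokens, tags)]

# ensure_ascii=False keeps non-Latin text such as 日本銀行 readable in the file.
payload = json.dumps({"tokens": records}, ensure_ascii=False, indent=2)
print(payload)
```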
Standoff format differs from IOB2 format in that it only shows the entities that have been tagged. The text and annotations are often split into two separate documents. This makes the files look rather different. See the image on the right.
Here, T indicates the entity number, while the three-letter code explains which type of entity it is: in our case, names, locations, and companies.
The two numbers indicate character numbers that the entity can be found between in the document. Finally, on the far right is the entity as it appears in the data. One benefit of this format is that it gives you a simple list of entities, making your data easy to work with.
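The layout described above can be sketched in a few lines of Python. The tab and space separators, the three-letter codes (PER, ORG, LOC), the sample sentence, and the helper name to_standoff are assumptions made for illustration. The end offset is treated as exclusive, which is also an assumption.

```python
def to_standoff(text, entities):
    """Build standoff lines: entity ID, type, start, end, surface text.

    entities: list of (label, surface) pairs in the order they appear in text.
    """
    lines = []
    cursor = 0
    for number, (label, surface) in enumerate(entities, start=1):
        # Search from the end of the previous entity so repeated words
        # resolve to the correct occurrence.
        start = text.index(surface, cursor)
        end = start + len(surface)
        lines.append(f"T{number}\t{label} {start} {end}\t{surface}")
        cursor = end
    return lines


text = "Walt Disney visited the Bank of Japan in Tokyo."
entities = [("PER", "Walt Disney"), ("ORG", "Bank of Japan"), ("LOC", "Tokyo")]
lines = to_standoff(text, entities)
print("\n".join(lines))
```

Because the annotations only point into the text by character position, the source document itself stays untouched and can be stored in a separate file.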
After we’ve formatted the data, we finally have a dataset that is ready for training a machine-learning model.
Data Is the Foundation
Data annotation can seem like dirty work. To an outsider, it’s not nearly as glamorous as building the great AI engines in the chain above. However, it’s a subject of great importance for data scientists, project managers, and business leaders alike.
High-quality data isn’t a luxury that the big tech companies can do without. It’s a fundamental part of the machine-learning ecosystem. Without data, it’s impossible to build a functioning algorithm—or a great search engine.
But the essential nature of data is only half of the story. It’s also a source of huge excitement among machine-learning specialists.
Thanks to the explosion in specialist data annotation services, software and solution builders have access to an increasingly diverse range of datasets. Whether it’s for highly specific semantic tags or in previously untapped languages, this new data is creating new opportunities for progress in the field of machine learning.
All things considered, the reason behind the demand for data is simple: A well-prepared dataset can have a transformative impact along the entire NLP supply chain.
Even the best software engineers at Google, Yahoo, or Bing don’t operate in a vacuum. To help these talented people build the next revolutionary update to their search engine, these tech companies turn to a network of specialists to build their AI stack.
Each company involved in bringing technology to the desks of those engineers brings decades of experience that would take years to build in-house. By investing in outside players, the tech giants gain access to the building blocks of their next great product: innovative solutions built on richly annotated data and foundational text analytics.
Every component of the supply chain single-mindedly pulls toward one goal. Together, they enable the tech giants to extract meaningful intelligence from their data—and build the services that will define our future.
About Basis Technology
Verifying identity, understanding customers, anticipating world events, uncovering crime. For over twenty years, Basis Technology has provided analytics enabling businesses and governments to tackle some of their toughest problems. Rosette text analytics employs a hybrid of classical machine learning and deep neural nets to extract meaningful information from unstructured data. Autopsy, our digital forensics platform, and Cyber Triage, our tool for first responders, serve the needs of law enforcement, national security, and legal technologists with over 5,000 downloads every week. KonaSearch enables natural language queries of every field, object, and file in Salesforce and external sources from a single index. For more information, email email@example.com or visit www.basistech.com.
Gengo, a Lionbridge company, provides training data for machine-learning applications. With a crowdsourcing platform powered by 25,000-plus certified linguistic specialists in 37 languages, Gengo delivers first-rate multilingual services for a range of data creation and annotation use cases. To learn more, visit https://gengo.ai or follow us on Twitter at @GengoIt.