Natural Language Processing Search Engines


Interested in search technology, or in AI more generally? Over the next four weeks, we’re going to take an in-depth (and interesting!) look at the technology that makes modern search tick. Let’s dive in with a look at the role AI plays in modern search engines.

Introduction to Natural Language Processing and Search Engines

Modern search results are remarkable. Users have instant access to relevant web pages, images, and knowledge cards for any keyphrase they choose. Even for the most obscure or vague search terms, appropriate results are only a click away.

Since these engines are owned by some of the planet’s biggest tech companies, it’s easy to believe that their models are built entirely in-house. It sounds plausible that there’s a superhuman team of developers and data scientists all in a room somewhere, single-handedly revolutionizing the way that the world searches the internet. In fact, the idea is almost romantic.

It’s also completely untrue.

Of course, these companies are hotbeds for AI innovation. But they don’t operate in a vacuum. The natural language processing (NLP) technology that powers these search engines relies on an entire supply chain of software and data providers. Their role is to put all the tools and raw materials in place before anyone ever sits down to work on that crucial update.

Without them, it simply wouldn’t be possible to build the search engines we use today.

So what does this supply chain look like? To really grasp the extent of it, we’re going to need to examine the complex technology that makes these massive search engines tick.

No room for hand-waving here—just an informative look at how this incredible search technology really works.

The AI Behind Your Favorite Search Engine

Imagine you’ve just entered the query “Tim Cook” on your favorite search engine.

What happens next?

The platform returns a multicomponent results page. Its major components are:

  • Relevant news articles on Tim Cook
  • Rank-ordered web pages on Tim Cook
  • A knowledge graph card on Tim Cook

But these things don’t magically appear. Each component of the search result is painstakingly assembled using a laundry list of technologies.

A complete walk-through of each would fill several books—and who has time for that? Instead, let’s look at just one component—news—to answer one question: How do big search engines deliver relevant news for queries like “Tim Cook”?

Behind the Curtain

Content processing is what enables search engines to deliver those oh-so-perfect news results. This processing creates a knowledge base that the system draws on every time a user enters a query. For our purposes, it breaks down into four phases: ingestion, extraction, linking, and searching.
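To make the four phases concrete, here is a toy pipeline in Python. Every function name and heuristic below is an illustrative stand-in, not a description of any real engine:

```python
# A toy sketch of the four phases; real systems are vastly more complex.
def ingest(raw_feed):
    """Phase 1: pull in raw documents and normalize whitespace."""
    return [doc.strip() for doc in raw_feed]

def extract(docs):
    """Phase 2: find entity mentions (a crude stand-in: tag capitalized words)."""
    return [(doc, [w for w in doc.split() if w.istitle()]) for doc in docs]

def link(extracted):
    """Phase 3: resolve each mention to a knowledge-base entry (stubbed)."""
    return {mention: f"kb:{mention}" for _, mentions in extracted for mention in mentions}

def search(kb, query):
    """Phase 4: answer a query from the knowledge base."""
    return kb.get(query)

kb = link(extract(ingest(["  Tim Cook unveiled new products  "])))
print(search(kb, "Cook"))  # kb:Cook
```

Each phase hands its output to the next, which is why a weakness anywhere in the chain degrades the final search results.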


Ingestion

The world’s biggest search engines have to consume an enormous volume of news articles, blog posts, and social media posts to stock their systems with timely material.

To give a sense of the scale, consider one example. Running more than 60 Elasticsearch clusters and more than 2,000 nodes, the e-commerce platform eBay ingests around 18 billion documents every day to serve around 3.5 billion daily search requests.

Now imagine how much Google has to process.

As these feeds pour in, the content they contain has to be cleaned of navigation bars and HTML tags. Then it has to be identified and understood. This leads us to phase two.
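That cleaning step can be sketched with Python’s standard-library HTML parser. This is a simplified illustration; production pipelines use far more sophisticated boilerplate removal:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep body text; drop content inside boilerplate elements."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how deep we are inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = "<nav>Home | News</nav><p>Tim Cook spoke at the event.</p>"
parser = TextExtractor()
parser.feed(page)
print(" ".join(parser.chunks))  # Tim Cook spoke at the event.
```

The navigation bar is discarded while the article text survives, ready for the extraction phase.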


Extraction

In this context, extraction is the process of identifying the key entities in a text, such as people, organizations, places, events, products, and chemicals. An extractor identifies entities according to their type.

This is an important task because the meaning of a given term is almost always ambiguous. In our example, “Cook” could refer to a person, an organization, a place (such as “Cook County”), or an activity. We want the extractor to identify all the mentions that refer to a person.
The extractors used in major search engines employ three approaches to finding different entity types:

  • Exact matching against entity lists (or gazetteers)
  • Pattern matching using regular expressions
  • Machine-learned statistical models (the AI component)

Exact matching is simply listing all the entities of a particular type, such as car models or names of bacteria, and then matching words within a given piece of text against those lists. This method works well when the list of entities is finite and fairly unique in use and meaning. Thus, “Streptococcus acidominimus” is a fine candidate for exact matching, whereas “Cook” is not.
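A gazetteer matcher can be sketched in a few lines of Python; the entity lists here are invented for illustration:

```python
# Minimal gazetteer-based exact matching (entity lists are illustrative).
GAZETTEERS = {
    "BACTERIA": {"Streptococcus acidominimus", "Escherichia coli"},
    "CAR_MODEL": {"Model S", "Corolla"},
}

def exact_match(text, gazetteers):
    """Return (mention, type) pairs for every gazetteer entry found verbatim."""
    hits = []
    for entity_type, entries in gazetteers.items():
        for entry in entries:
            if entry in text:
                hits.append((entry, entity_type))
    return hits

print(exact_match("A culture of Streptococcus acidominimus was isolated.", GAZETTEERS))
```

Note that the matcher finds only verbatim strings: a typo like “Streptococus” would slip straight past it.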

Unsurprisingly, pattern matching is best for finding entity types that fit a pattern, such as credit card numbers, URLs, or email addresses.
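A minimal pattern matcher might look like this in Python; the regular expressions are deliberately simplified and would need hardening for production use:

```python
import re

# Illustrative patterns; real extractors use far more robust expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "URL": re.compile(r"https?://[^\s]+"),
}

def pattern_match(text):
    """Tag every span that matches a known pattern with its entity type."""
    return [(m.group(), etype)
            for etype, pat in PATTERNS.items()
            for m in pat.finditer(text)]

print(pattern_match("Contact press@example.com or see https://example.com/news"))
```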

The strength of these methods is that they are relatively easy to build and don’t require a lot of examples. However, they lack the ability to recognize words that don’t perfectly match their records.

In other words, they don’t do “fuzzy matching.” They’re just mechanical-matching engines.

So, for more smarts, AI is needed.

AI allows us to extract entities from a document using a statistical model. Unlike the other methods, a statistical model can handle the variety found in real-world text, whether it comes from alternate spellings, typos, or other sources.

Modern AI technology is based on machine learning. This is the process of training a machine through thousands of examples of correct input-output combinations for a given task. These combinations come from data that has been prepared by humans through data annotation—a process that we’ll discuss in far more detail later in this paper.

In our example, the data would be newsfeed text with the entities tagged by entity type:

Upon graduation, <PER>Mae C. Jemison</PER> entered <ORG>Cornell University Medical College</ORG> and, during her years there, found time to expand her horizons by studying in <LOC>Cuba</LOC> and <LOC>Kenya</LOC> and working at a <NAT>Cambodian</NAT> <LOC>refugee camp</LOC> in <LOC>Thailand</LOC>.

The system is then fed this annotated data—also known as labeled data—and produces a statistical model that has learned which features of the text are good predictors of each entity type. That is, given this context, there is a probability n that the next word(s) are a particular entity type (for example, a person).
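For illustration, here is one way the inline annotations in the sample above might be parsed into (mention, type) training pairs. The tag format follows the sample; the parser itself is our own sketch:

```python
import re

# Matches inline annotations like <PER>Mae C. Jemison</PER>;
# the backreference (?P=type) requires matching open and close tags.
TAG = re.compile(r"<(?P<type>\w+)>(?P<text>.*?)</(?P=type)>")

def parse_annotations(tagged):
    """Turn inline <TYPE>...</TYPE> annotations into (mention, type) pairs."""
    return [(m.group("text"), m.group("type")) for m in TAG.finditer(tagged)]

sample = "Upon graduation, <PER>Mae C. Jemison</PER> entered <ORG>Cornell University Medical College</ORG>"
print(parse_annotations(sample))
# [('Mae C. Jemison', 'PER'), ('Cornell University Medical College', 'ORG')]
```

Pairs like these, gathered by the thousands, are what the training process consumes.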

This is how the AI component learns to identify which kind of entity the word “cook” refers to in a given newsfeed text.

Depending on the quality of the training data, the accuracy of these models can actually surpass human performance.

A Note on Adjudication

What if there is a conflict between what the three methods label as an entity? For example, “Christian” in the sentence “Christian Dior released its fall collection at Fashion Week in NYC yesterday” might be labeled a religion by exact matching and a person or organization by a machine-learning model. Which label wins?

The right choice depends on the deployment, but a flexible system lets the user configure the adjudication to indicate which method is most trustworthy, so conflicts are resolved appropriately for that deployment. This configurability is an example of what is known as a “human-in-the-loop” system.
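One simple way to implement configurable adjudication is a priority table; the method names and priorities below are assumptions made for illustration:

```python
# Sketch of configurable adjudication: when methods disagree over the same
# span, the vote from the highest-priority (most trusted) method wins.
# This priority table is an illustrative configuration, not a standard.
METHOD_PRIORITY = {"exact_match": 0, "pattern_match": 1, "statistical_model": 2}

def adjudicate(votes, priority=METHOD_PRIORITY):
    """votes: list of (method, label) pairs proposed for one text span."""
    return min(votes, key=lambda v: priority[v[0]])[1]

votes = [("exact_match", "RELIGION"), ("statistical_model", "PERSON")]
print(adjudicate(votes))  # RELIGION (exact matching configured as most trusted)
```

Swapping the priorities in the table changes which method wins, which is exactly the knob a human operator turns in such a system.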

That’s it for today. Part 2 will be posted soon!

About Basis Technology

Verifying identity, understanding customers, anticipating world events, uncovering crime. For over twenty years, Basis Technology has provided analytics enabling businesses and governments to tackle some of their toughest problems. Rosette text analytics employs a hybrid of classical machine learning and deep neural nets to extract meaningful information from unstructured data. Autopsy, our digital forensics platform, and Cyber Triage, our tool for first responders, serve the needs of law enforcement, national security, and legal technologists with over 5,000 downloads every week. KonaSearch enables natural language queries of every field, object, and file in Salesforce and external sources from a single index. For more information, email or visit

About Gengo

Gengo, a Lionbridge company, provides training data for machine-learning applications. With a crowdsourcing platform powered by 25,000-plus certified linguistic specialists in 37 languages, Gengo delivers first-rate multilingual services for a range of data creation and annotation use cases. To learn more, visit or follow us on Twitter at @GengoIt.

Integrating AI

Understand the NLP Driving Search

Want an inside look at the AI technology under the hood of today’s search engines?

From entity linking to semantic similarity, Inside NLP and Search explores the key innovations that make these platforms possible, giving you rare insight into how the technology that’s changing the world actually works.

Download Now