Search and Beyond with Text Analytics

Benson Margulies

Chief Technology Officer, Basis Technology

This talk looks at how some of our language processing components help applications give users a high quality search experience.

This is a partial transcript of the recorded talk.

What is SEARCH?

Search Box

We all know what search is, or at least we think we do. Search is this box. You type some words into it and back come some snippets of documents. You hope that those documents are what you are looking for based on the words you typed. Maybe you are searching the whole internet or maybe you are just searching a corporate document repository. Either way you are often struggling to find the information you are searching for amidst the torrent of other stuff that mentions the same words.

Traditionally, search was an application way over there. It was a web search engine, or it was some page of some website at your company that allowed you to search the corporate document repository. But little by little that picture has become unrealistic. Instead of search being a separate application searching a separate set of information, more and more search is the way that users expect to find what they are looking for in every application that they use.

Every App is a Search App

This graphic illustrates what Google has done. Over the years, Google has rolled out a whole series of applications. If there is one thing all of these applications have in common, it is that they all have search in them somewhere. Users of these applications expect to use search to find what they need.

There are a number of reasons for this use of search, other than Google just pushing it in people’s faces. One is that we are dealing with ever increasing amounts of data.

Google Search Timeline - Source TechCrunch

Data Gets Bigger

The more data there is, the more likely it is that users are going to have to use full-text search to find what they are looking for, as opposed to navigating a taxonomy or navigational structure. Little by little, other theories about how to find text in the world, including categories, taxonomies, and trees, have withered away in favor of typing into a Google-style box.

Forensic Data


One particular area where ever increasing data volumes have changed the shape of applications is digital forensics. Once upon a time, the digital forensics problem, say for a cell phone, involved a tiny amount of data. You could probably just page through it all and learn what you had to know.

Nowadays, take a phone and you might have 32 gigabytes of heaven knows what. Forensics on computers, on disks pulled from laptops or larger systems, involves such massive amounts of data that finding what you are looking for, whether for legal discovery or law enforcement research, requires some sort of search paradigm, or you will never find it.

No One Can Spel

whitehouse tweet

Another reason why search has become ubiquitous is the sad decline in spelling abilities. If nobody ever spells anything right, then it becomes much harder to find things by expecting them to match up with pre-existing categories. If you tried to categorize documents by simply saying let’s keep track of all the documents that talk about “Libya” then you wouldn’t find this tweet.

The spelling problem is not just humorous. The difficulty of dealing with varying spellings has very serious practical implications in the world. The Tsarnaev brothers were in part able to do what they did because no one can agree on how to spell their name, and various systems in the U.S. Government failed to match them up with information about them.

Possible name spellings of Tamerlan Tsarnaev

Devices are Small – UI Problem

Typing on Small Devices is Hard

Another driver of search is the ubiquity of small devices. In various places a person has to select something from a list. The longer the list becomes in comparison to the size of the screen, the less practical it becomes to just scroll down the list. Simple mechanisms for this, like typing the first three letters, do not work very well when you have 200 people with the same first three letters.

It is also not so easy to enter text into these small devices, and that feeds the spelling problem that was discussed earlier. The text people produce on these small devices is even more full of spell-o-s than it would be ordinarily.

Search is Not So Easy

Elasticsearch, Lucene, Apache Solr

The Bad News:

High quality search is not a trivial matter.

The Good News:

A gigantic part of the problem of high quality search is already solved for everyone. Open source platforms provide the basic framework for inverted full-text search engines. There is no reason for anyone to reinvent these things. This saves everyone from having to build this stuff from scratch.


The problem is that it is still not so easy to get results that people recognize as relevant. Here is an example where someone typed in the word “Suite” and got a whole bunch of search results about “Suits.”

The other problem is semantics. When someone has typed into a search box “Tiger Woods,” there is a real question about what they are looking for. It is possible that they are looking for a large striped animal, but more likely they are looking for the famous golf player. A straight full-text search engine cannot tell the difference between the two.


This can have real practical implications in the world. Now mind you, Tiger Woods might not be the first thing you would go shopping for on Etsy, but sure enough there is some legitimate Tiger Woods merchandise on there, hiding amongst things about “tigers” or “woods” or other stuff where it is not really clear what it is doing there.

Screen shot taken on June 1st, 2014 – results may vary

Etsy Screenshot from June 1, 2014

Eventually this is about semantics.  People want to put in a real query of human language and get an answer. That is a very hard problem to solve and I am not going to tell you that you can solve it. However, I’m going to show you how you can apply text analytics to search and get a much better approximation of the solution than what you just saw with Tiger Woods.

What People Are Searching For

Things: People, Locations, Organizations

Ultimately people are looking for things in the world. They want to know about a particular person, or a particular place, or a particular organization. What a search engine does out of the box is look for words, and they are just not the same thing.

Our products at Basis Technology help customers build search applications that give users more relevant results, where relevancy ultimately means: are the documents or results that you pull up really related to the things in the world that the user is trying to search for? That starts with:

If you are just thinking of it as a search on words, are we at least giving people words they recognize as being relevant to the problem at hand?

Can we help people search names of real world items?

Can we find places where documents are mentioning particular things in the world?

That is enabled by our ability to say “this document has a string which seems like it might be talking about something in the world, but is it?”

Without NLP

To understand how Basis Technology can help with this, it helps to understand what is going on in an inverted text search engine like Lucene.

Here is an example derived from German:

A diagram showing Massenfertigung being stemmed using open source technology and Lucene

We will look at what happens and what you can get by simply deploying the open source components available for German. We start with a document containing what Mark Twain would call “an awful German compound.” What happens is that we take that word, put it in lower case, run it through a stemming algorithm, and put it in the index. In this case, what wound up in the index is exactly the same thing we started with except with all the uppercase taken out.

Now, that word is a compound. It is two separate German nouns. If the user searches for the second one of those nouns, what happens? Well we take what the user put in, we lower case it, we stem it, and since the stemming is not so very interesting we just get that second word, and we go looking for it, and we don’t find it, and we are sad.
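The ingestion and query path just described can be sketched in a few lines of Python. This is a toy inverted index, not Lucene's actual API: `naive_stem`, `index_document`, and `search` are hypothetical names, and the stemmer is a stand-in that only lowercases, mirroring the fact that stemming leaves this particular compound unchanged.

```python
from collections import defaultdict

def naive_stem(token):
    """Stand-in stemmer (an assumption): only lowercases the token.

    It mimics the behaviour described in the talk, where stemming leaves
    the compound unchanged apart from removing the uppercase.
    """
    return token.lower()

def index_document(index, doc_id, text):
    """Add every stemmed token of `text` to the inverted index."""
    for token in text.split():
        index[naive_stem(token)].add(doc_id)

def search(index, query):
    """Return ids of documents whose index terms equal the stemmed query."""
    return index.get(naive_stem(query), set())

index = defaultdict(set)
index_document(index, "doc1", "Massenfertigung")

print(sorted(index))                      # ['massenfertigung']
print(search(index, "Massenfertigung"))   # {'doc1'}: the whole compound matches
print(search(index, "Fertigung"))         # set(): the second noun alone does not
```

The index holds exactly one term, so any query that is not the full compound comes back empty.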

Here is another example from before:

A diagram showing suits being stemmed using open source technology and Lucene

In this case the document contained the English word “Suit,” and that goes into our index in a very simple way: as itself. Then the user searches for “Suite,” perhaps while looking for hotel rooms. The stemming process converts “Suite” to “Suit” because the stemming process is just a simple pattern match that in this particular case will strip final “e” letters off. Now all of a sudden we have a hit that we do not want. We went looking for “Suite” and we found “Suit” when we should have found nothing, or a document that actually contains “Suite.”
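A rough sketch of why this happens, assuming a deliberately crude suffix-stripping rule (drop a trailing “s,” then a trailing “e”). Real stemmers apply more rules than this, but the failure is the same kind of pattern match. `crude_stem` is a hypothetical helper, not Lucene's stemmer.

```python
def crude_stem(word):
    """Crude suffix stripper (an assumption made for illustration)."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]      # strip a plural "s"
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]      # strip a final "e"
    return word

indexed_term = crude_stem("Suit")    # term stored at ingestion time
query_term = crude_stem("Suite")     # term produced from the user's query

print(indexed_term, query_term, indexed_term == query_term)  # suit suit True
```

Because both words collapse to the same string, the engine has no way to tell the hotel room from the clothing.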

With NLP via Rosette Base Linguistics

In the German example:

A diagram showing Massenfertigung being decompounded and lemmatized by Rosette Base Linguistics

If we add more significant text analytic processing, things get better. In this case we take our German word, we lowercase it and instead of just trying to stem it, we break it into its bits and pieces. You will notice that it is not just splitting the word into two pieces. If you look at the first of the two pieces of that compound, it has lost the letter “n” off the end, because that is what the noun really looks like if a user was going to search for it. Now the index has both terms, in fact the index has the full compound in case someone wants to look for it (but it was left off the slide because it was getting crowded).

As a result of this process, when the user searches again for the second word of the compound they get a match and they are happier.
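One way to picture the decompounding step is a dictionary-driven splitter. The sketch below is a simplification and makes no claim about how Rosette is implemented; the tiny `LEXICON` (mapping a form as it appears inside a compound to its lemma) and the functions `decompound` and `index_terms` are assumptions made for illustration.

```python
# Hypothetical mini-lexicon: form as it appears inside a compound -> lemma.
LEXICON = {
    "massen": "masse",          # the form with the extra "n" maps to the bare noun
    "fertigung": "fertigung",
}

def decompound(word):
    """Split `word` into two known parts and return their lemmas.

    Returns an empty list when no two-part split is found in the lexicon.
    """
    word = word.lower()
    for cut in range(1, len(word)):
        head, tail = word[:cut], word[cut:]
        if head in LEXICON and tail in LEXICON:
            return [LEXICON[head], LEXICON[tail]]
    return []

def index_terms(word):
    """Terms to index: the full compound plus the lemma of each part."""
    return [word.lower()] + decompound(word)

terms = index_terms("Massenfertigung")
print(terms)                  # ['massenfertigung', 'masse', 'fertigung']
print("fertigung" in terms)   # True: a query for the second noun now matches
```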

In the “Suite” example:

A diagram showing Suits being decompounded and lemmatized by Rosette

The same thing happens during ingestion, because “Suit” doesn’t go through any interesting transformations. However, when we put “Suite” through the process, we take the lemma of “Suite,” which is not “Suit,” because the two words have nothing to do with each other. Now “Suite” fails to match “Suit,” so we do not get the unwanted result, and we are doing much better.
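The difference from stemming can be shown with a dictionary lookup. This is a minimal sketch with a four-entry, hypothetical lemma table (`LEMMAS`); a real lemmatizer relies on a far larger lexicon and on context.

```python
# Hypothetical lemma dictionary, limited to the words in this example.
LEMMAS = {
    "suit": "suit",
    "suits": "suit",
    "suite": "suite",
    "suites": "suite",
}

def lemmatize(word):
    """Look the word up; fall back to the lowercased word when unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(lemmatize("Suits") == lemmatize("Suit"))   # True: inflected forms still match
print(lemmatize("Suite") == lemmatize("Suit"))   # False: unrelated words stay apart
```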

What Goes Wrong: basic morphology

Here are some more examples of where all this comes from, namely cases where a language shows you two forms that superficially appear to be related, but are not.

Here we have some French examples:

Input             Stem    Lemma
été (summer)      été     été (summer)
été (was)         été     être (to be)
bois (I drink)    bois    boire (to drink)
bois (woods)      bois    bois (woods)

These show the collision of the word for “summer” with a form of the verb “to be,” and of the word for “I drink” with the word for “woods.”
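The French rows can be restated as data to make the collision concrete. In this sketch the part-of-speech tags and the names `LEMMA_BY_POS`, `stem`, and `lemma` are illustrative assumptions; the point is only that a stem is computed from the string alone, while a lemma depends on which reading is meant.

```python
# (surface form, part of speech) -> lemma; the tags are illustrative assumptions.
LEMMA_BY_POS = {
    ("été", "NOUN"): "été",      # summer
    ("été", "VERB"): "être",     # was -> to be
    ("bois", "NOUN"): "bois",    # woods
    ("bois", "VERB"): "boire",   # I drink -> to drink
}

def stem(word):
    """Stemming sees only the string, so both readings get the same stem."""
    return word.lower()

def lemma(word, pos):
    """Lemmatization takes the intended reading into account."""
    return LEMMA_BY_POS[(word.lower(), pos)]

readings = list(LEMMA_BY_POS)
distinct_stems = {stem(word) for word, _ in readings}
distinct_lemmas = {lemma(word, pos) for word, pos in readings}

print(len(distinct_stems), len(distinct_lemmas))   # 2 4
```

Four readings collapse into two index terms under stemming, but remain four separate terms under lemmatization.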

Italian examples:

Input                  Stem      Lemma
passato (the past)     passat    passato (the past)
passate (you spend)    passat    passare (to spend)
solo (only)            sol       solo (only)
sole (sun)             sol       sole (sun)

What Goes Wrong: compounds

More German compound words:

Input                                  Stem               Lemma
Küstenfischerei (coastal fisheries)    küstenfischerei    küste, fischerei
Telefonnummer (telephone number)       telefonnumm        Telefon, Nummer
Lederjacke (leather jacket)            lederjack          leder, jacke

This shows you how different the compound nouns are compared to the different pieces that make them up.

What Goes Wrong: segmentation

Where are the words: image of Kanji writing

A whole other problem comes up when you look at Chinese, Japanese, Korean, and Thai. If you look at the text, you will see a suspicious absence of whitespace. They invented paper in Asia and they hate to waste it on something as useless as whitespace. Instead the human brain turns out to be really good at figuring out where the words are (once you know the language). However, computers are really stupid about finding the words. So before it is possible to build an index based on these words, we need to locate the words.

A text analytic process can find the words by using a combination of dictionaries and statistical models.
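The dictionary half of that combination can be sketched with greedy forward maximum matching, a classic baseline for unsegmented text. This is a simplification, not the method used by Rosette: it has no statistical model, and the six-word `DICTIONARY` and the Chinese sample sentence (“I like natural language processing”) are assumptions chosen for illustration.

```python
# Hypothetical mini-dictionary; real systems use large lexicons plus statistics.
DICTIONARY = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def segment(text):
    """Greedy forward maximum matching.

    At each position take the longest dictionary word that fits, falling
    back to a single character when nothing in the dictionary matches.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                pos += length
                break
    return tokens

print(segment("我喜欢自然语言处理"))   # ['我', '喜欢', '自然语言', '处理']
```

Greedy matching picks wrong boundaries when a longer dictionary word overlaps the correct split, which is one reason a statistical model is combined with the dictionary.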

Japanese text that has been tokenized

Solutions Using Rosette


All of this adds up to a suite of tools that we call Rosette Base Linguistics.

Tokenization – Identifying words, particularly in non-space delimited languages such as Chinese.

Lemmatization – Determining the root form of words through dictionary definitions as opposed to stemming.

Decompounding – Separating compound words (common in German and Korean) into their appropriate sub components that can then be indexed independently.

Noun Phrase Extraction – Grouping multiple nouns as a single entity for enhanced clustering and concept extraction.

Sentence Detection – Identifying the start and end of each sentence, even when punctuation is ambiguous, enabling increased relevancy ranking.
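To take one item from the list above, ambiguous punctuation in sentence detection can be illustrated with a period that may or may not end a sentence. The following is a minimal rule-based sketch under stated assumptions: the `ABBREVIATIONS` set is hypothetical and far from complete, and a production sentence detector would not rely on a fixed list like this.

```python
# Hypothetical abbreviation list; a period after these does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s."}

def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

for sentence in split_sentences(
    "Dr. Smith moved to the U.S. in 2010. He works on search."
):
    print(sentence)
```

Splitting naively on every period would break this text into four fragments instead of two sentences.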

Searching for Names

Our Rosette Entity Extractor can be used to extract people, place, and organization names from documents. The specifics of this product will be covered in another presentation. However, once you have names you have to deal with the fact that names are a particular target for variation and misspelling.

If someone’s name is Jesus Alfonso Lopez Diaz, there are many different ways that someone might spell that name while referring to the same guy. If you really want to help people find things based on names, then it is just not good enough to go looking in a search engine with some sort of phrase query. If you look at the following table, you will see a common nickname for Jesus, which is Chuy. You are never going to match Jesus and Chuy by running a simple full-text search engine or even by producing a lemma. This requires special purpose code that knows specifically about phenomena in names.
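To give a feel for what “knowing about phenomena in names” can mean, here is a small sketch that combines a nickname table, initial matching, and fuzzy string similarity from Python's standard `difflib`. It is not the Rosette Name Indexer's algorithm, and its scores are not comparable to that product's scores; `NICKNAMES`, `normalize`, `token_similarity`, and `name_score` are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real name matchers carry much larger resources.
NICKNAMES = {"chuy": "jesus"}

def normalize(token):
    """Lowercase, drop periods after initials, and expand known nicknames."""
    token = token.lower().strip(".")
    return NICKNAMES.get(token, token)

def token_similarity(a, b):
    """Similarity in [0, 1]; an initial matches any name with that first letter."""
    if len(a) == 1 or len(b) == 1:
        return 1.0 if a[0] == b[0] else 0.0
    return SequenceMatcher(None, a, b).ratio()

def name_score(query, candidate):
    """Average, over query tokens, of the best match among candidate tokens."""
    q = [normalize(t) for t in query.split()]
    c = [normalize(t) for t in candidate.split()]
    return sum(max(token_similarity(x, y) for y in c) for x in q) / len(q)

full_name = "Jesus Alfonso Lopez Diaz"
print(round(name_score("Chuy A. Deaz", full_name), 2))     # nickname, initial, typo
print(round(name_score("Maria Gonzalez", full_name), 2))   # unrelated name
```

The variant with a nickname, an initial, and a misspelled surname scores well above the unrelated name, which is something a plain phrase query cannot express.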

Name Index Scores on the name Jesus Alfonso Lopez Diaz, including a 78% match on Chuy A. Deaz

Name problems also come up in a cross-language context. Here is a series of references to the U.S. President Franklin D. Roosevelt. The Rosette Name Indexer can give you a score that relates each one of these forms to the normalized form of a name.

RNI Name Matcher, scoring various, multilingual spellings of Franklin D. Roosevelt to his normalized name.

Better Yet, Resolve Strings to Things

Even once we can find the strings in documents that refer to things—entity mentions, named entities, and the like—that still has not told us who these people are in the real world.  The most obvious example is in the next diagram where the name “President Bush” can refer to two different, distinct individuals. In many important applications, it is this distinction that is the most important. To do this, we have to use document context and other more complex models to perform a task called named entity resolution. Entity resolution maps the string references to the actual things in the world.
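The role of document context can be illustrated with a toy resolver that scores each candidate entity by how many of its typical context words appear near the mention. This is a sketch under stated assumptions, not the more complex models the talk refers to: the `KNOWLEDGE_BASE` clue words and the `resolve` function are hypothetical.

```python
# Hypothetical knowledge base: entity -> words typical of its context.
KNOWLEDGE_BASE = {
    "George H. W. Bush": {"1989", "1991", "quayle", "gulf", "kuwait"},
    "George W. Bush": {"2001", "2003", "cheney", "iraq", "katrina"},
}

def resolve(mention_context):
    """Pick the entity whose clue words overlap most with the surrounding text.

    Returns None when no clue word appears at all.
    """
    words = {w.strip(".,").lower() for w in mention_context.split()}
    scores = {
        entity: len(words & clues) for entity, clues in KNOWLEDGE_BASE.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(resolve("President Bush and Vice President Quayle discussed Kuwait in 1991."))
print(resolve("President Bush and Vice President Cheney spoke after Katrina."))
```

The same string, “President Bush,” resolves to a different person in each sentence purely because of the words around it.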

President George W. Bush and George H.W. Bush being resolved from multiple documents to the correct Wikipedia-based identity card.

This has been only a partial transcript of the recorded talk. 

For the complete talk, please watch the video presentation.
