Optimizing Multilingual Search Using Solr

David Anthony Troiano

Principal Software Engineer

This talk was delivered at Boston Data Con 2014 on September 13, 2014.

David Troiano explains how to optimize Apache Solr for multilingual search. Using the example query “Serie A” (the top Italian football league), David shows how a major search engine returns documents in both Italian and English. Getting the same results in Solr requires a natural language processing (NLP) pipeline that solves the problems specific to each language in use.

Language detection, the first step in the pipeline, is typically part of the indexing process and is not used at query time. This is because language detection is less accurate on short strings, queries often consist of named entities, and end-user applications often already know the query language from upstream data (such as browser settings).
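The split between index-time detection and query-time hints can be sketched in a few lines of Python. This is a toy illustration, not the detector used in the talk: the stopword lists, the function names detect_language and query_language, and the "en" default are all assumptions made for the example. A production system would use a trained statistical detector.

```python
# Tiny, hypothetical stopword lists; real detectors use trained models.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "it": {"il", "la", "di", "e", "che", "per"},
}

def detect_language(text, default="en"):
    """Index time: pick the language whose stopwords occur most often."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no evidence at all, fall back to the default language.
    return best if scores[best] > 0 else default

def query_language(accept_language, default="en"):
    """Query time: trust an upstream hint such as an Accept-Language header."""
    if not accept_language:
        return default
    first = accept_language.split(",")[0].split(";")[0].strip()
    return first.split("-")[0].lower() or default
```

A full Italian sentence gives the detector enough signal, while the two-word query “Serie A” gives it none, which is why the query language here comes from the browser hint instead.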

Tokenization is the next step, and it is particularly important for languages that do not separate words with whitespace. Chinese and Japanese text has no whitespace between words, and Korean uses it for things other than breaking up words. As a group these languages are referred to as CJK, and tokenizing their text is vital to further language processing.
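One simple way to tokenize text without whitespace is to emit overlapping character bigrams, a common fallback when no segmentation dictionary is available. The sketch below is an assumption-laden illustration of that idea, not the tokenizer discussed in the talk, and the function name cjk_bigrams is hypothetical.

```python
def cjk_bigrams(text):
    """Split unsegmented text into overlapping two-character tokens."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or nothing) cannot form a bigram.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

For example, cjk_bigrams("東京都") yields 東京 and 京都. The second bigram happens to spell Kyoto, so a query for Kyoto would match a document about Tokyo; this kind of false match is the reason dictionary-based or statistical segmentation usually gives better precision than bigrams.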

Decompounding is an important step for languages such as Dutch, German, and Danish, as well as Japanese, Chinese, and Korean. A common example is the German word Samstagmorgen, which means “Saturday morning” and is a compound of “samstag” and “morgen”. By putting not just the compound word but also its component words into the index, it is possible to match a query for just one component (such as samstag) to articles containing the compound (samstagmorgen). The net result is a major improvement to recall in these languages.
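A minimal dictionary-based decompounder might look like the following sketch. The four-word LEXICON, the min_len threshold, and the function names are assumptions for illustration; a real decompounder would use a full lexicon and handle linking elements between components.

```python
# Hypothetical lexicon of known component words.
LEXICON = {"samstag", "morgen", "sonntag", "abend"}

def decompound(word, lexicon=LEXICON, min_len=3):
    """Split word into lexicon entries; return [] if no full split exists."""
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            if tail in lexicon:
                return [head, tail]
            # Try to split the remainder further (three or more parts).
            rest = decompound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []

def index_terms(word):
    """Terms to index: the original word plus any components found."""
    return [word.lower()] + decompound(word)
```

With this sketch, index_terms("Samstagmorgen") returns the compound followed by samstag and morgen, so a query for either component can match the document.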

Normalization is an important process in both monolingual and multilingual search. It reduces the variant forms of a word to a canonical representation. The canonical verb speak might appear as speaks, speaking, speaker, or spoke. Collapsing these variations greatly improves recall. The two most common methods are stemming and lemmatization.

Stemming is a rule-based approach that generally means “chop off the end”. A common example of this approach failing is arsenal and arsenic, both of which are stemmed to arsen by one common stemming tool.
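The “chop off the end” idea, and the way it conflates unrelated words, can be shown with a deliberately naive suffix stripper. This is a toy written for this example, not the stemming tool referred to above; the SUFFIXES list and the min_stem threshold are assumptions.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "al", "ic", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Both naive_stem("arsenal") and naive_stem("arsenic") return arsen, so a search for one would match the other. The same rules map speaks and speaking to speak but leave spoke untouched, because no suffix rule can relate an irregular past tense to its base form.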

Lemmatization maps words to their dictionary form via morphological analysis. Generally this improves both precision and recall compared to stemming. Going back to the normalization example, spoke would never stem to speak, but it would lemmatize correctly based on the context of the sentence.
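As a rough sketch of how context drives lemmatization, the lookup below keys each surface form on its part of speech. The LEMMAS table and the tag names are assumptions for the example, and the part-of-speech tag is assumed to come from an upstream tagger that has looked at the whole sentence; a real lemmatizer relies on full morphological analysis rather than a hand-written table.

```python
# Hypothetical lemma table keyed on (surface form, part of speech).
LEMMAS = {
    ("spoke", "VERB"): "speak",
    ("speaks", "VERB"): "speak",
    ("speaking", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}

def lemmatize(token, pos):
    """Return the dictionary form for token given its part of speech."""
    token = token.lower()
    # Unknown forms fall back to the lowercased token itself.
    return LEMMAS.get((token, pos), token)
```

Here lemmatize("spoke", "VERB") returns speak, as in “she spoke”, while lemmatize("spoke", "NOUN") returns spoke, as in “a wheel spoke”, so the two senses are indexed under different terms.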

For the Q&A on the multilingual NLP pipeline, skip to 19:00 in the video.

For indexing strategies, skip to 21:30 in the video, or fill out this form to receive a copy of our upcoming whitepaper on implementing multilingual search in Apache Solr.