Delivering More Accurate Search Results with Lemmatization
May 10, 2013
Many of our commercial and government customers are building extremely powerful and efficient search engines for their own internal data or their customers’ data. Whether they are using open source Solr/Lucene or Elasticsearch, or building their own, these applications are often tasked with natively searching across many languages with very high accuracy.
Of course, in order to do this, especially when dealing with European languages, a search engine must be able to handle the morphological complexities particular to each language.
In these languages, it is common for the forms of words to change based on how they are used. This presents a challenge for search engines, which must match the different forms of the same word in order to serve accurate results. Reducing those forms to a common representation is called normalization. Most search engines and search solutions normalize by “stemming”. Stemming is a crude method of chopping off characters at the end of a word in an attempt to find the root word. You can imagine the problems this technique produces.
Think about the following example: The user searches with the word “celebrities”, as in the plural of celebrity, but the search engine ends up with a stem of “celebr”. That search could return false positives from other words that reduce to the same stem, such as “celebrations”. Not good!
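To make the collision concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list `SUFFIXES` and the function name `crude_stem` are illustrative assumptions, not the rules of any particular search engine; real stemmers use more elaborate rule sets, but the failure mode is the same.

```python
# Toy suffix-stripping stemmer. SUFFIXES is a hypothetical rule list chosen
# for illustration; it is not taken from any real stemmer.
SUFFIXES = ("ations", "ation", "ities", "ity", "ate", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Unrelated words end up sharing one stem, so they become indistinguishable.
for w in ("celebrities", "celebrity", "celebrations", "celebrate"):
    print(w, "->", crude_stem(w))
```

Because every word in the loop is reduced to the same string, an index built on these stems cannot tell a query about celebrities apart from a document about celebrations.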
This problem is compounded when you are dealing with searches across many languages. The only way to truly get accurate results is to perform a more advanced morphological analysis to find the “lemma”, or dictionary form, of the word. This is called “lemmatization”.
So let’s review the previous example: the user searches for “celebrities”, but this time the search engine uses lemmatization. The query is correctly interpreted as “celebrity”, not “celebration”, enabling the search engine to deliver the right results. In fact, studies have shown that lemmatization is significantly more accurate than stemming in many European languages.
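The same idea can be sketched as a dictionary lookup. The `LEMMAS` table below is a tiny hypothetical lexicon, and `lemmatize` and `matches` are names invented for this illustration; a production lemmatizer relies on a full morphological dictionary and language-specific analysis rather than a hand-written table.

```python
# Toy dictionary-based lemmatizer. LEMMAS is a tiny hypothetical lexicon
# used only to illustrate the idea of mapping word forms to lemmas.
LEMMAS = {
    "celebrities": "celebrity",
    "celebrity": "celebrity",
    "celebrations": "celebration",
    "celebration": "celebration",
}


def lemmatize(word: str) -> str:
    """Return the dictionary form, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)


def matches(query: str, document_word: str) -> bool:
    """Two words match when they share a lemma."""
    return lemmatize(query) == lemmatize(document_word)


print(matches("celebrities", "celebrity"))
print(matches("celebrities", "celebrations"))
```

Here the plural and singular forms of “celebrity” share a lemma and match, while “celebrations” maps to a different lemma and is kept out of the results.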
Our linguists and engineers have worked really hard to bring lemmatization to our customers and their search applications. It is a standard feature of the Rosette® Base Linguistics (RBL) component, enabling high-quality search across 40 languages. If you are currently dealing with the fallout from search engine stemming, I encourage you to check out Basis Technology’s white paper on the subject: “Enabling High Quality Search in European Languages”.