Enhancing Solr™ with Text Analytics

Apache Solr is at the heart of many innovative search-based applications. Rosette delivers a comprehensive solution to a wide range of challenges that developers face when they move beyond simple search.

Rosette provides advanced tokenization for Chinese, Japanese, and Korean, plus decompounding for languages such as Korean, German, and Dutch. Rosette also delivers lemmatization, which normalizes a word to its dictionary form based on its meaning, rather than stemming, which only trims word endings. Together, these linguistic processes enhance search recall and precision in 40 languages.

Facet on real-world entities

Rosette entity extraction and linking enable high quality faceted search by extracting people, locations, organizations, products, and 13 other entity types, and linking them to real-world entities. Rosette extracts entities using a combination of statistical models, gazetteers, and regular expressions, and can link those entities to Wikipedia or a custom entity database. It comes pre-trained in 18 languages and can be adapted in the field to domain-specific content for improved accuracy.
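As a rough illustration of how gazetteers and regular expressions can feed facet fields, here is a minimal Python sketch. It is not Rosette's implementation or API: the `GAZETTEER` contents, the `EMAIL` type, and the `extract_facets` function are hypothetical, and the statistical-model component is omitted entirely.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer: known names mapped to an entity type.
GAZETTEER = {
    "Acme Corp": "ORGANIZATION",
    "Tokyo": "LOCATION",
    "Jane Doe": "PERSON",
}

# Hypothetical pattern-based entity type, found with a regular expression.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_facets(text):
    """Return {entity_type: [mentions]} for use as facet field values."""
    facets = defaultdict(list)
    # Gazetteer lookup: whole-word match of each known name.
    for name, entity_type in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            facets[entity_type].append(name)
    # Regular-expression lookup: every email-like string in the text.
    for match in EMAIL_PATTERN.finditer(text):
        facets["EMAIL"].append(match.group())
    return dict(facets)


facets = extract_facets("Jane Doe spoke in Tokyo; write to info@example.org.")
```

Each key in the returned dictionary could be indexed as a multi-valued field so that a search application can facet on it.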

Find names, no matter how they’re written

Names are frequently the most significant terms in a query. Rosette increases search recall by finding more occurrences of a name despite misspellings, nicknames, missing spaces, the same name written in different languages, and other variations. Rosette matches names written in 15 languages, including English, Arabic, Russian, Japanese, Chinese, and Korean.

Anti-fraud & financial compliance

Matching names is a key task in many anti-fraud and financial compliance processes. Ensuring that a company—often with international customers and branches—is not transacting with known terrorists or criminals is not as simple as it seems.

Rosette’s fuzzy name indexing capabilities can help to find matches across a wide variety of name variations—nicknames, typos, homonyms—and across languages.

Rosette also handles cases such as previously unseen names and names with variable segmentation (“MaryEllen” vs. “Mary Ellen”), surpassing systems that fuzzy match using name variants generated from a list of known names.
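To see why comparing names directly can cope with unseen names and segmentation differences where a pre-generated variant list cannot, consider this small Python sketch. It is only a toy built on the standard library's `difflib`, not Rosette's matching algorithm; the `normalize` and `name_similarity` helpers are hypothetical names introduced for illustration.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase the name and drop accents, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining accents, spaces, and punctuation are not alphanumeric.
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def name_similarity(a: str, b: str) -> float:
    """Score two names between 0.0 and 1.0 after normalization."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same_name = name_similarity("MaryEllen", "Mary Ellen")   # segmentation only
typo_name = name_similarity("Mary Ellen", "Mary Elen")   # one letter missing
other_name = name_similarity("Mary Ellen", "John Smith")  # unrelated name
```

Because the score is computed from the two strings themselves, neither name needs to appear in a list of known names beforehand. A production system would also need to handle nicknames and cross-language matches, which character-level similarity alone does not cover.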

Open source intelligence (OSINT)

Rosette can add structure to vast quantities of text coming from social media, news, or blog feeds by extracting new or known entities, including people, places, and organizations. These entities can then be used by applications that visualize patterns and trends and create links between entities and documents.

Rosette can also standardize, translate, and link these names to an authority, mitigating the problem of inconsistent and “messy” real-world data.

“At CareerBuilder, we are building talent management software that unlocks meaning in unstructured human capital data. Our core competencies include search, data classification, matching, and big data analytics. Relying on Rosette for our linguistic analysis (in over a dozen languages) allows us to remain focused on our core competencies and ultimately provide more value to our customers.”

—Trey Grainger
Director of Engineering, Search & Analytics at CareerBuilder