Text analytics fundamentals to prepare your data for analysis. Language-specific tools for tokenization, part-of-speech tagging, lemmatization, decompounding, and Chinese and Japanese readings for your input.

Overview
Search many languages with high accuracy
Every language, including English, presents unique challenges for search applications that must deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving speed and accuracy.
What is base linguistics?
Base linguistics refers to the core morphological building blocks that prepare your text for further analysis. In Chinese and Japanese, which are written without spaces between words, base linguistics answers “what are the words?” In many European languages, it associates related words based on their dictionary form, such as beau/beaux/belle/belles, which all mean “beautiful” in French but vary by gender and number.
The leaders in multilingual search
Intelligent, successful search is about semantics. People want to type in a question and get an answer. To find more relevant results, the engine must associate words by meaning, that is, trace words back to their dictionary form based on the word’s context. For example, without looking at context, a reference to “spoke,” as part of a bicycle wheel, can be easily confused with the past tense of the verb “to speak.” While open source platforms now provide the basic framework for inverted full-text search engines, the challenges of accurate search are compounded as you add more languages to the queries and results. Rosette fills the linguistic need in Elasticsearch, Apache Solr, and applications that need to search across 30+ languages.
Product highlights
- Supports multiple languages
- Sentence tagging
- Tokenization
- Lemmatization
- Part-of-speech tagging
- Decompounding
- Chinese/Japanese readings
- Fast and scalable
- Cloud, on-premise, search plugin deployments
How It Works
Part of speech tagging
As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ.
Our base linguistics tools support the Universal POS tag standard, from which developers can map to Penn Treebank or other POS tag systems.
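A minimal sketch of such a mapping in Python, assuming the tagger emits Universal POS tags. The table is a hand-written, partial illustration; the names `UPOS_TO_PTB` and `ptb_candidates` are hypothetical and not part of Rosette. Because Universal tags are coarser than Penn Treebank tags, each one maps to a list of candidates rather than to a single tag.

```python
# Partial, illustrative mapping from Universal POS tags to the Penn Treebank
# tags they typically cover. Hypothetical helper, not a Rosette API.
UPOS_TO_PTB = {
    "NOUN": ["NN", "NNS"],
    "PROPN": ["NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
    "ADV": ["RB", "RBR", "RBS"],
    "DET": ["DT", "PDT", "WDT"],
    "CCONJ": ["CC"],
}


def ptb_candidates(upos_tag):
    """Return the Penn Treebank tags a Universal tag may correspond to."""
    # Unknown tags yield an empty list so callers can decide how to handle them.
    return UPOS_TO_PTB.get(upos_tag, [])


if __name__ == "__main__":
    for tag in ("NOUN", "VERB", "X"):
        print(tag, "->", ptb_candidates(tag))
```

Choosing a single Penn Treebank tag from the candidates requires additional information, such as number or tense, which this sketch does not model.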
Decompounding
Decompounding breaks compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search recall in languages such as German and Korean.
Example: German
“Samstagmorgen” is a compound word formed with “Samstag” (Saturday) and “morgen” (morning). Decompounding allows for an appropriate match when searching for “Samstag.”
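The idea can be sketched in Python with a toy lexicon and a longest-match-first split. This is an illustration only: the lexicon, the function name `decompound`, and the matching strategy are assumptions made for the example and do not describe how Rosette decompounds words.

```python
# Toy lexicon of known word parts (lowercased). Assumed for illustration.
LEXICON = {"samstag", "morgen"}


def decompound(word, lexicon, min_part=3):
    """Split a word into known lexicon parts, trying longer parts first.

    Returns the lowercased word unchanged when no complete split
    into two or more parts exists.
    """
    lower = word.lower()

    def split(start):
        if start == len(lower):
            return []
        # Try the longest candidate part first, down to min_part characters.
        for end in range(len(lower), start + min_part - 1, -1):
            part = lower[start:end]
            if part in lexicon:
                rest = split(end)
                if rest is not None:
                    return [part] + rest
        return None

    parts = split(0)
    return parts if parts and len(parts) > 1 else [lower]


if __name__ == "__main__":
    parts = decompound("Samstagmorgen", LEXICON)
    # Index the whole compound plus each component so "Samstag" matches.
    print(["samstagmorgen"] + parts)
```

Indexing both the full compound and its components is one way to let a query for “Samstag” match a document containing “Samstagmorgen.”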
Chinese and Japanese readings
Our base linguistics functionality understands the difference between Chinese and Japanese text when they are written in Han script, and accurately returns the pronunciation information. For Chinese text in Hanzi, Rosette returns the pronunciation information in pinyin transcriptions. For Japanese content, Rosette returns furigana transcriptions in Katakana. For example, if you call Rosette with “医療番組,” it will return this reading: “イリョウ”, “バングミ”.
Lemmatization
To associate related words, most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of finding a common root form. This method, called stemming, often results in more recall, but poorer precision, associating unrelated words such as arsenic/arsenal, which share a stem (arsen). Our base linguistics tools find the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search precision and recall, and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing. Linguistic analysis is useful for every language, improving search recall and precision.
Example: English
Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.
Challenge | Query | Stem | Lemma
Two unrelated words may share a stem | animals, animated | anim | animal, animate
Stemming may deliver unintended results | several | sever | several
Irregular verbs and nouns stump the stemmer | spoke | spoke | speak (v.), spoke (n.)
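The contrast in the table can be reproduced with a short Python sketch. Both halves are toys written for this example: the suffix list in `crude_stem` is chosen only to mimic the stems shown above, and the `LEMMAS` dictionary stands in for a real morphological analyzer. Neither is Rosette's implementation.

```python
# Toy suffix rules, checked in order. Chosen only to reproduce the stems
# in the table; real stemmers use larger, more careful rule sets.
SUFFIXES = ["ated", "als", "al", "ed", "s"]

# Hand-written lemma lookup standing in for a morphological analyzer.
LEMMAS = {
    "animals": ["animal"],
    "animated": ["animate"],
    "several": ["several"],
    "spoke": ["speak", "spoke"],  # verb reading, noun reading
}


def crude_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def lemmas(word):
    """Return the candidate dictionary forms, or the word itself if unknown."""
    return LEMMAS.get(word, [word])


if __name__ == "__main__":
    for query in ("animals", "animated", "several", "spoke"):
        print(query, "| stem:", crude_stem(query), "| lemma:", lemmas(query))
```

The lookup returns every candidate lemma for “spoke”; picking the verb or the noun reading requires sentence context, which this sketch leaves out.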
Tokenization
Many search tools index languages written without spaces between words by breaking text into n-grams, overlapping sequences of consecutive characters. N-grams result in a larger index and reduced precision. Our tools, in contrast, accurately identify and separate each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
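A small Python sketch makes the difference concrete. It compares character bigrams with a dictionary-based forward maximum match; the maximum-match segmenter and its two-word lexicon are stand-ins chosen for brevity and are not the statistical model described above.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"医療", "番組"}


def char_ngrams(text, n=2):
    """Return every overlapping run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each position.

    Falls back to a single character when nothing in the lexicon matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    text = "医療番組"
    print("bigrams:", char_ngrams(text))
    print("words:  ", max_match(text, LEXICON))
```

For this input the bigram index holds three terms, including “療番,” which straddles a word boundary and is not a word, while segmentation yields just the two actual words.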
Tech Specs
Availability and platform support
Deployment availability: | |
Plugins: | |
Bindings: |
Supported languages
Arabic | Catalan | Chinese, Simplified | Chinese, Traditional | Czech | Danish
Dutch | English | Estonian | Finnish | French | German
Greek | Hebrew | Hungarian | Indonesian | Italian | Japanese
Korean | Latvian | Malay (standard) | Norwegian | Pashto | Persian
Polish | Portuguese | Romanian | Russian | Serbian | Slovak
Spanish | Swedish | Tagalog | Thai | Turkish | Urdu
Sample output:
{ "tokens": [ "The", "fact", "is", "that", "the", "geese", "just", "went", "back", "to", "get", "a", "rest", "and", "I", "'m", "not", "banking", "on", "their", "return", "soon" ], "lemmas": [ "the", "fact", "be", "that", "the", "goose", "just", "go", "back", "to", "get", "a", "rest", "and", "I", "be", "not", "bank", "on", "they", "return", "soon" ] }
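As a sketch of how a client might consume this output, the Python snippet below parses the sample and lists the tokens whose lemma differs from the surface form. It assumes, as the sample suggests, that `tokens` and `lemmas` are parallel arrays of equal length; the function name `changed_pairs` is hypothetical.

```python
import json

# The sample output shown above, copied verbatim.
SAMPLE = """{ "tokens": [ "The", "fact", "is", "that", "the", "geese", "just",
"went", "back", "to", "get", "a", "rest", "and", "I", "'m", "not", "banking",
"on", "their", "return", "soon" ], "lemmas": [ "the", "fact", "be", "that",
"the", "goose", "just", "go", "back", "to", "get", "a", "rest", "and", "I",
"be", "not", "bank", "on", "they", "return", "soon" ] }"""


def changed_pairs(response_text):
    """Return (token, lemma) pairs where the lemma differs from the token."""
    data = json.loads(response_text)
    tokens, lemmas = data["tokens"], data["lemmas"]
    # Assumption: the two arrays line up position by position.
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must have the same length")
    return [(t, l) for t, l in zip(tokens, lemmas) if t != l]


if __name__ == "__main__":
    for token, lemma in changed_pairs(SAMPLE):
        print(f"{token} -> {lemma}")
```

Pairs such as “geese” and “goose” or “went” and “go” are the irregular forms that suffix stripping alone would not relate.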
Deployment
Rosette Cloud
Sign up today for a free 30-day trial
The SaaS version of Rosette is quick to implement, low maintenance, and ideal for users who wish to pay based on monthly call volume. Numerous bindings are supported through a RESTful API.
Rosette Server Edition
This on-premise private cloud deployment puts all the functionality of Rosette Cloud behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models.
Rosette Java Edition
For on-premise systems that need the low-latency, high-speed integration of an SDK, Rosette Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security.
Rosette Plugins
Just plug in Rosette for instant high-accuracy multilingual search and fuzzy name search for Elasticsearch or Apache Solr.
Quality documentation and support
Our support team responds to customers in less than a business day, and is committed to a satisfactory resolution. Users have access to in-depth documentation describing all the features, with code examples and a searchable knowledge base.
Visit our GitHub for bindings and documentation.
Questions?
Email: info@basistech.com
Phone: +1-617-386-2000
