Base Linguistics


Text analytics fundamentals to prepare your data for analysis. Language-specific tools for tokenization, part-of-speech tagging, lemmatization, decompounding, and Chinese and Japanese readings for your input.

Overview

Search many languages with high accuracy

Every language, including English, presents unique and difficult challenges for search applications that must deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving speed and accuracy.

What is base linguistics?

Base linguistics refers to the core morphological building blocks that prepare your text for further analysis. For Chinese, Japanese, and Korean, base linguistics answers “what are the words?”, because these languages are written without spaces between words. For many European languages, it associates related words based on their dictionary form, such as beau/beaux/belle/belles, which all mean “beautiful” in French but vary by gender and number.

The leaders in multilingual search

Intelligent, successful search is about semantics. People want to type in a question and get an answer. To find more relevant results, the engine must associate words by meaning, that is, trace words back to their dictionary form based on the word’s context. For example, without looking at context, a reference to “spoke,” as part of a bicycle wheel, can be easily confused with the past tense of the verb “to speak.” While open source platforms now provide the basic framework for inverted full-text search engines, the challenges of accurate search are compounded as you add more languages to the queries and results. Rosette fills the linguistic need in Elasticsearch, Apache Solr, and applications that need to search across 30+ languages.

Product highlights

  • 32 supported languages
  • Sentence tagging
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Decompounding
  • Chinese/Japanese readings
  • Fast and scalable
  • Cloud, on-premise, search plugin deployments

Language-specific features

How It Works

Part of speech tagging

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ.

Our base linguistics tools support the Universal POS Tag standard, from which developers can map to Penn Treebank or other POS tag systems.
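As a rough illustration of that mapping step, the sketch below expands a Universal POS tag into the Penn Treebank tags it can correspond to. The table is partial, and the names `UPOS_TO_PENN` and `penn_candidates` are hypothetical; they are not part of Rosette. Because Universal tags are coarser than Penn Treebank tags, the mapping is one-to-many, so selecting a single Penn tag would need additional morphological information.

```python
# Partial, illustrative mapping from Universal POS tags to Penn Treebank tags.
# This is an assumption for demonstration, not Rosette's own table.
UPOS_TO_PENN = {
    "NOUN": ["NN", "NNS"],
    "PROPN": ["NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
    "ADV": ["RB", "RBR", "RBS"],
    "NUM": ["CD"],
    "CCONJ": ["CC"],
    "INTJ": ["UH"],
}


def penn_candidates(upos_tag):
    """Return the Penn Treebank tags listed for a Universal POS tag.

    Tags that are not in the partial table yield an empty list.
    """
    return UPOS_TO_PENN.get(upos_tag.upper(), [])


if __name__ == "__main__":
    print(penn_candidates("NOUN"))
    print(penn_candidates("VERB"))
```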

Decompounding

Decompounding breaks compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search recall in languages such as German and Korean.

Example: German

“Samstagmorgen” is a compound word formed with “Samstag” (Saturday) and “morgen” (morning). Decompounding allows for an appropriate match when searching for “Samstag.”
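The idea can be sketched with a toy dictionary-based splitter. This is a minimal illustration under stated assumptions, not Rosette's method: the lexicon, `decompound`, and `index_terms` are hypothetical, and the sketch ignores linking elements that real German compounds often contain.

```python
def decompound(word, lexicon):
    """Split a word into lexicon entries, trying the longest head first.

    Returns a list of components, or None when no complete split exists.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if rest:
                return [head] + rest
    return None


def index_terms(word, lexicon):
    """Return the surface form plus any components, without duplicates."""
    terms = [word.lower()]
    for part in decompound(word, lexicon) or []:
        if part not in terms:
            terms.append(part)
    return terms


# Toy lexicon, assumed for illustration only.
LEXICON = {"samstag", "morgen"}

if __name__ == "__main__":
    print(index_terms("Samstagmorgen", LEXICON))
```

Indexing the components alongside the full compound is what lets a query for “Samstag” match a document that only contains “Samstagmorgen.”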

Chinese and Japanese readings

Our base linguistics functionality understands the difference between Chinese and Japanese text when they are written in Han script, and accurately returns the pronunciation information. For Chinese text in Hanzi, Rosette returns the pronunciation information in pinyin transcriptions. For Japanese content, Rosette returns furigana transcriptions in Katakana. For example, if you call Rosette with “医療番組,” it will return this reading: “イリョウ”, “バングミ”.

Lemmatization

To associate related words, most search engines use a crude method of chopping off characters at the end of a word in the hope of finding a common root form. This method, called stemming, often improves recall at the cost of precision, because it associates unrelated words such as arsenic/arsenal, which share a stem (arsen). Our base linguistics tools find the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search precision and recall, and slims the search index because the inflected forms are not all indexed separately. Alternative lemmas are also made available to supplement indexing.

Example: English

Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

Challenge                                    Query              Stem   Lemma
Two unrelated words may share a stem         animals, animated  anim   animal, animate
Stemming may deliver unintended results      several            sever  several
Irregular verbs and nouns stump the stemmer  spoke              spoke  speak (v.), spoke (n.)
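The contrast in the table can be reproduced with a small sketch. The suffix list below is a toy chosen only to mimic the stems shown above; it is not a real stemming algorithm, and the `LEMMAS` lookup is a hypothetical stand-in for the vocabulary and context analysis a lemmatizer performs.

```python
# Toy suffix list, assumed for illustration; the longest matching suffix wins.
SUFFIXES = ("ated", "als", "al", "ed", "s")


def crude_stem(word):
    """Strip the longest listed suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def lemma(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


if __name__ == "__main__":
    for word, pos in LEMMAS:
        print(word, pos, crude_stem(word), lemma(word, pos))
```

The stemmer conflates “animals” and “animated” and reduces “several” to “sever,” while the part-of-speech-aware lookup keeps them apart and resolves “spoke” differently as a verb and as a noun.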

Tokenization

Many search tools use n-grams to break text into overlapping consecutive characters to create a search index in languages written without spaces between words. N-grams result in a larger index size and a reduction in precision. Our tools, in contrast, accurately identify and separate each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
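A short sketch makes the difference concrete. It builds overlapping character bigrams for a four-character Japanese string and compares them with a word segmentation that is simply assumed here for illustration; `char_ngrams` is a hypothetical helper, not a Rosette function.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


text = "医療番組"

# N-gram indexing: every overlapping pair of characters becomes a term.
bigrams = char_ngrams(text)

# Word-level indexing: segmentation assumed for this example.
words = ["医療", "番組"]

# Bigrams that straddle a word boundary are not real words,
# yet they still enter the index and can produce false matches.
spurious = [gram for gram in bigrams if gram not in words]

if __name__ == "__main__":
    print(bigrams)
    print(words)
    print(spurious)
```

Here the bigram index holds three terms, one of which crosses the word boundary, while the word-level index holds only the two actual words.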

Tech Specs

Availability and platform support

Deployment availability:
Plugins:
Bindings:

Supported languages

Arabic                English   Hungarian  Persian     Spanish
Catalan               Estonian  Italian    Polish      Swedish
Chinese, Simplified   Finnish   Japanese   Portuguese  Thai
Chinese, Traditional  French    Korean     Romanian    Turkish
Czech                 German    Latvian    Russian     Urdu
Danish                Greek     Norwegian  Serbian
Dutch                 Hebrew    Pashto     Slovak

Sample output:

{
  "tokens": [
    "The",
    "fact",
    "is",
    "that",
    "the",
    "geese",
    "just",
    "went",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "'m",
    "not",
    "banking",
    "on",
    "their",
    "return",
    "soon"
  ],
  "lemmas": [
    "the",
    "fact",
    "be",
    "that",
    "the",
    "goose",
    "just",
    "go",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "be",
    "not",
    "bank",
    "on",
    "they",
    "return",
    "soon"
  ]
}
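One way to consume output of this shape is to pair each token with its lemma by position. The sketch below assumes the two arrays are parallel, as they appear in the sample, and uses a short excerpt of it; `changed_lemmas` is a hypothetical helper.

```python
import json

# Excerpt of the sample output above ("the geese just went back").
sample = """
{
  "tokens": ["the", "geese", "just", "went", "back"],
  "lemmas": ["the", "goose", "just", "go", "back"]
}
"""


def changed_lemmas(doc):
    """Return (token, lemma) pairs where the lemma differs from the token."""
    return [
        (token, lemma)
        for token, lemma in zip(doc["tokens"], doc["lemmas"])
        if token.lower() != lemma.lower()
    ]


doc = json.loads(sample)

if __name__ == "__main__":
    for token, lemma in changed_lemmas(doc):
        print(token, "->", lemma)
```

The pairs that differ are the ones where indexing the lemma changes what a query can match, such as “geese” under “goose.”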

Deployment

Rosette Cloud

Sign up today for a free 30-day trial

The SaaS version of Rosette is quick to implement, low-maintenance, and ideal for users who want to pay based on monthly call volume. It is accessed through a RESTful API, for which numerous bindings are supported.

Rosette Server Edition

This on-premise private cloud deployment puts all the functionality of Rosette Cloud behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models.

Rosette Java Edition

For on-premise systems that need the low-latency, high-speed integration of an SDK, Rosette Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security.

Rosette Plugins

Just plug in Rosette for instant high-accuracy multilingual search and fuzzy name search for Elasticsearch or Apache Solr.

Quality documentation and support

Our support team responds to customers in less than a business day, and is committed to a satisfactory resolution. Users have access to in-depth documentation describing all the features, with code examples and a searchable knowledge base.

Visit our GitHub for bindings and documentation.

Questions?

Email: info@basistech.com

Phone: +1-617-386-2000

Customers Include

Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce enabling users to search every field, file, and object across multiple orgs and other data sources.
