Text analytics fundamentals


Morphological analysis delivers the core linguistic building blocks that prepare your text for further analysis. These processes include lemmatization, noun phrase extraction, part-of-speech (POS) tagging, and features specific to particular languages like decompounding and readings for Han script words.

Features

Lemmatization
Most search engines use a crude method: chopping characters off the end of a word in the hope of finding its root form. This method, called stemming, often increases recall at the cost of precision. Instead, Rosette finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. The Benefits - Indexing the root form increases search relevancy and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.

Noun phrase extraction
Certain nouns, especially proper names, can be very tricky to identify as a single entity. The Benefits - Rosette groups the nouns and their modifiers, which is useful in document clustering and concept extraction.

Part-of-speech tagging
As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ from language to language. POS Tag Standards - Rosette supports the Universal POS Tag standard, from which the developer can map to Penn Treebank or other POS tag systems.
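
As a rough illustration of mapping from Universal POS tags to Penn Treebank tags, the sketch below uses a small hand-written table. The table and the function name `candidate_penn_tags` are assumptions made for this example and are not part of Rosette. Because Universal tags are coarser than Penn Treebank tags, each Universal tag maps to a list of candidates rather than to a single tag.

```python
# Hypothetical, partial mapping from Universal POS tags to the
# Penn Treebank tags they most commonly correspond to in English.
# Universal tags are coarser, so each maps to several candidates.
UPOS_TO_PENN = {
    "NOUN": ["NN", "NNS"],
    "PROPN": ["NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
    "ADV": ["RB", "RBR", "RBS"],
    "ADP": ["IN"],
}


def candidate_penn_tags(upos: str) -> list[str]:
    """Return candidate Penn Treebank tags for a Universal POS tag.

    Unknown tags yield an empty list instead of raising an error.
    """
    return list(UPOS_TO_PENN.get(upos.upper(), []))


noun_candidates = candidate_penn_tags("NOUN")
unknown_candidates = candidate_penn_tags("XYZ")
```

Choosing one Penn Treebank tag from the candidates needs extra information, such as number or tense, that the Universal tag alone does not carry.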

Decompounding
Rosette breaks down compound words into sub-components and delivers each individual element to be indexed. The Benefits - This is especially useful for increasing search relevancy in languages such as German and Korean.

Han readings
For Chinese tokens in Han script, Rosette returns pinyin transcriptions as the Han reading. For Japanese tokens in Han script, Rosette returns Furigana transcriptions rendered in Hiragana as the Han reading.

Lemmas vs. stemming

Most search engines use stemming, chopping off characters at the end of a word, to find its root form. However, stemming often yields more recall but poorer precision, because it associates unrelated words such as arsenic/arsenal, which share a stem (arsen).

Rosette lemmatization associates semantically related words through the common dictionary form of the word (the lemma). Rosette uses vocabulary, context, and advanced morphological analysis to determine whether “spoke” is a noun or a verb. The result is more recall and better precision.
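
The difference can be sketched with a toy example. The suffix list and the tiny lemma dictionary below are assumptions made for illustration. They are not Rosette's algorithm or data, and a real lemmatizer also uses context.

```python
# Toy suffix-stripping stemmer: removes the first matching suffix,
# as long as at least three characters remain.
SUFFIXES = ("ing", "ic", "al", "es", "s")


def naive_stem(word: str) -> str:
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# Hypothetical dictionary from inflected forms to lemmas.
TOY_LEMMAS = {
    "arsenic": "arsenic",
    "arsenal": "arsenal",
    "arsenals": "arsenal",
}


def toy_lemma(word: str) -> str:
    """Look up the dictionary form, falling back to the lowercased word."""
    w = word.lower()
    return TOY_LEMMAS.get(w, w)


# Stemming conflates two unrelated words under the stem "arsen".
stems = (naive_stem("arsenic"), naive_stem("arsenal"))
# Dictionary lookup keeps them apart and still groups "arsenals"
# with "arsenal".
lemmas = (toy_lemma("arsenic"), toy_lemma("arsenal"), toy_lemma("arsenals"))
```

Here both words stem to "arsen", so a search for one would match the other, while the lemma lookup keeps "arsenic" and "arsenal" as separate index terms.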


Select Customers

Pinterest Bing Kobo Adobe Google

Going from FAST to Solr

Shining A Light On Consumer Feedback

Supported Languages & Features

Languages (40)

  • Albanian
  • Arabic
  • Bulgarian
  • Catalan
  • Chinese, Simplified
  • Chinese, Traditional
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Latvian
  • Malay
  • Norwegian
  • Pashto
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Thai
  • Turkish
  • Ukrainian
  • Urdu

Lemmatization

Rosette finds the dictionary form, or lemma, of words in your text based on its usage and context. By indexing the dictionary form of the word, you can reduce index size and improve your application’s speed and efficiency.
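
To show how indexing the dictionary form can shrink an index, here is a minimal sketch of an inverted index built twice over the same documents: once keyed by surface form and once keyed by lemma. The documents, the `TOY_LEMMAS` table, and the `build_index` helper are hypothetical. Real lemmas would come from the analyzer, not from a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run"}


def build_index(docs, normalize):
    """Build an inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return dict(index)


docs = {1: "she runs daily", 2: "he ran home", 3: "running is fun"}

# Index keyed by the surface form of each token.
surface_index = build_index(docs, lambda t: t)
# Index keyed by lemma: "runs", "ran", and "running" collapse to "run".
lemma_index = build_index(docs, lambda t: TOY_LEMMAS.get(t, t))

surface_terms = len(surface_index)
lemma_terms = len(lemma_index)
run_docs = lemma_index["run"]
```

In this small example the lemma index has fewer terms, and a query for "run" reaches all three documents through a single entry.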

Noun Phrase Extraction

Certain nouns, especially proper names, can be very tricky to identify as a single entity. Rosette groups the nouns and their modifiers, which is useful in document clustering and concept extraction.

Part of Speech Tagging

To further improve the indexing and metadata for your content, Rosette tags each word with its language-specific part of speech, depending on the context. For example, “spoke” can be classified as a noun or a verb: it is a noun in “The wheel spoke creaked” and a verb in “She spoke the truth”.

Decompounding

Using multilingual content in your applications is challenging when several words are written together as a single compound, such as the German word “Samstagmorgen”. Rosette breaks the compound word down into its elements, “Samstag” (Saturday) and “Morgen” (morning), so that your application can index each element and improve search results.
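
The idea of decompounding can be sketched with a dictionary-based splitter. This is a simplified illustration under stated assumptions, not Rosette's method: the `split_compound` function and the three-word lexicon are made up for the example, and linking elements between the parts of a compound are not handled.

```python
def split_compound(word: str, lexicon: set[str]) -> list[str]:
    """Split a word into lexicon entries, trying the longest prefix first.

    Returns the lowercased word unsplit if no full segmentation exists.
    """

    def helper(rest: str):
        if not rest:
            return []
        # Try longer prefixes first, and backtrack if the remainder
        # cannot be segmented.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in lexicon:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    w = word.lower()
    parts = helper(w)
    return parts if parts else [w]


# Hypothetical mini-lexicon of German words, all lowercased.
lexicon = {"samstag", "morgen", "tag"}

parts = split_compound("Samstagmorgen", lexicon)
unknown = split_compound("Xylophon", lexicon)
```

With this lexicon, the compound splits into "samstag" and "morgen", and a word with no segmentation is returned whole so that it can still be indexed.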

Han Readings

Rosette understands the difference between Chinese and Japanese text when they are written in Han script, and accurately returns the pronunciation information. For Chinese text in Han, Rosette returns the pronunciation information in Pinyin transcriptions. For Japanese content, Rosette returns Furigana transcriptions in Hiragana. For example, if you call Rosette with “医療番組”, it will return these Han readings: “イリョウ”, “バングミ”.
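
The Japanese readings in the example above are written in Katakana. If an application needs them in Hiragana, one possible post-processing step is to shift each Katakana character by the fixed Unicode offset between the two blocks. This is a sketch under that assumption, not a documented Rosette feature, and the function name `katakana_to_hiragana` is made up for the example.

```python
# Katakana letters U+30A1..U+30F6 sit exactly 0x60 code points above
# their Hiragana counterparts U+3041..U+3096.
KATAKANA_START = 0x30A1
KATAKANA_END = 0x30F6
OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Convert Katakana letters to Hiragana, leaving other characters as-is."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


readings = ["イリョウ", "バングミ"]
hiragana_readings = [katakana_to_hiragana(r) for r in readings]
# hiragana_readings == ["いりょう", "ばんぐみ"]
```

Characters outside that range, such as the long vowel mark “ー”, pass through unchanged.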

Morphological Analysis
Language-specific tools for POS tagging, lemmatization, decompounding, and Han readings for your input.