Text analytics fundamentals | Morphological Analysis

Morphological Analysis

Text analytics fundamentals that prepare your data for analysis: language-specific tools for part-of-speech tagging, lemmatization, decompounding, and Han readings.

Overview

What is morphological analysis?

Morphological analysis delivers the core linguistic building blocks that prepare your text for further analysis and let you search or process text effectively in many languages. Rosette enriches your original text in its native language for best-in-class natural language processing, improving both speed and accuracy.

The leaders in multilingual search

Intelligent, successful search is about semantics. People want to enter a query in natural language and get an answer. A word like ‘spoke’, meaning part of a bicycle wheel, is easily confused with the past tense of the verb ‘to speak’. While open source platforms now provide the basic framework for inverted full-text search engines, the challenges of accurate search are compounded as you add more languages to the queries and results. Rosette provides the tools you need to search across 40 languages.

Product Highlights

  • 40 supported languages
  • Lemmatization
  • Part-of-speech tagging
  • Decompounding
  • Han readings
  • Intuitive cloud API
  • Customizable SDK
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

How It Works

Part of Speech Tagging

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, the set of part-of-speech tags differs from language to language.

Rosette supports the Universal POS tag standard, from which developers can map to Penn Treebank or other POS tag systems.
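As a rough illustration of such a mapping, the sketch below looks up candidate Penn Treebank tags for a universal tag. The table is partial and is an assumption made for this example, not Rosette's own mapping; the tag names follow the commonly used universal inventory (NOUN, VERB, and so on). Universal tags are coarser than Penn Treebank tags, so one universal tag usually corresponds to several candidates.

```python
# Illustrative, partial table (an assumption for this sketch, not Rosette's
# own mapping). Each universal tag lists the Penn Treebank tags it can cover.
UPOS_TO_PTB = {
    "NOUN": ["NN", "NNS"],
    "PROPN": ["NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
    "ADV": ["RB", "RBR", "RBS"],
}


def ptb_candidates(upos_tag):
    """Return the Penn Treebank tags a universal tag may correspond to.

    Unknown tags yield an empty list so callers can decide how to handle them.
    """
    return UPOS_TO_PTB.get(upos_tag, [])


print(ptb_candidates("NOUN"))
```

Choosing a single Penn Treebank tag from the candidates needs extra information, such as the token's inflection, which this lookup does not attempt.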

Decompounding

Rosette breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German
Samstagmorgen is a compound word formed from Samstag (Saturday) and Morgen (morning). Decompounding allows a search for “Samstag” to match it.
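To make the idea concrete, here is a minimal dictionary-based sketch of decompounding. It is not how Rosette works internally; the function name, the tiny lexicon, and the longest-prefix strategy are all assumptions chosen for illustration.

```python
def decompound(word, lexicon, min_len=3):
    """Split `word` into known lexicon entries, preferring the longest prefix.

    Returns the lowercased word unchanged (as a one-element list) when no
    complete split into lexicon entries exists.
    """
    def split(s):
        if s in lexicon:
            return [s]
        # Try split points from the longest candidate prefix to the shortest.
        for i in range(len(s) - min_len, min_len - 1, -1):
            head, tail = s[:i], s[i:]
            if head in lexicon:
                rest = split(tail)
                if rest:
                    return [head] + rest
        return None

    lower = word.lower()
    return split(lower) or [lower]


# Hypothetical toy lexicon; a real system needs a full dictionary.
lexicon = {"samstag", "morgen", "tag"}
parts = decompound("Samstagmorgen", lexicon)
print(parts)
```

Indexing both the original token and the returned parts is what lets a query for “Samstag” match a document containing “Samstagmorgen”.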

Han Readings

Rosette distinguishes Chinese from Japanese text written in Han script and returns the appropriate pronunciation information. For Chinese text, Rosette returns Pinyin transcriptions. For Japanese text, Rosette returns Furigana (kana) transcriptions. For example, if you call Rosette with “医療番組”, it returns these Han readings, written here in Katakana: “イリョウ”, “バングミ”.
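The example readings above are written in Katakana. If an application needs them in Hiragana, the conversion is a fixed code point shift, because the Katakana letters U+30A1 to U+30F6 sit exactly 0x60 above their Hiragana counterparts in Unicode. The helper below is a hypothetical post-processing sketch and is not part of Rosette.

```python
def katakana_to_hiragana(text):
    """Convert Katakana letters to Hiragana; leave every other character as is."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Katakana U+30A1..U+30F6 map to Hiragana U+3041..U+3096.
        if 0x30A1 <= cp <= 0x30F6:
            out.append(chr(cp - 0x60))
        else:
            out.append(ch)
    return "".join(out)


readings = ["イリョウ", "バングミ"]
hiragana = [katakana_to_hiragana(r) for r in readings]
print(hiragana)
```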

Lemmatization

Most search engines use a crude method of chopping characters off the end of a word in the hope of finding the root form. This method, called stemming, often increases recall at the cost of precision, because it associates unrelated words such as arsenic and arsenal, which share a stem (arsen). Instead, Rosette finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search relevancy and slims the search index, because inflected forms no longer need to be indexed separately. Alternative lemmas are also made available to supplement indexing.

Example: English

Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

Challenge                                     Query              Stem   Lemma
Two unrelated words may share a stem.         animals, animated  anim   animal, animate
Stemming may deliver unintended results.      several            sever  several
Irregular verbs and nouns stump the stemmer.  spoke              spoke  speak (v.), spoke (n.)
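The contrast in the table can be reproduced with a toy sketch. The suffix list and the lemma dictionary below are assumptions chosen only to cover the example words; they are not Rosette's algorithm, and a real lemmatizer also uses context to choose between alternative lemmas.

```python
# Toy suffix list (an assumption for this sketch), checked in order.
SUFFIXES = ["ated", "als", "al", "ed", "s"]

# Hypothetical lemma dictionary covering only the example words.
LEMMAS = {
    "animals": ["animal"],
    "animated": ["animate"],
    "several": ["several"],
    "spoke": ["speak", "spoke"],  # verb reading first, then noun reading
}


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmas(word):
    """Look up the dictionary forms of a word, falling back to the word itself."""
    return LEMMAS.get(word, [word])


for query in ["animals", "animated", "several", "spoke"]:
    print(query, crude_stem(query), lemmas(query))
```

The stemmer merges “animals” and “animated” under one stem and turns “several” into “sever”, while the dictionary lookup keeps them apart and offers both readings of “spoke”.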

Tech Specs

Availability and Platform Support

Deployment Availability:
Plugins:
Bindings:

Supported Languages

Albanian              Danish    Hebrew      Norwegian   Slovak
Arabic                Dutch     Hungarian   Pashto      Slovenian
Bulgarian             English   Indonesian  Persian     Spanish
Catalan               Estonian  Italian     Polish      Swedish
Chinese, Simplified   Finnish   Japanese    Portuguese  Thai
Chinese, Traditional  French    Korean      Romanian    Turkish
Croatian              German    Latvian     Russian     Ukrainian
Czech                 Greek     Malay       Serbian     Urdu

Cloud API

Easy to Use API

Ideal for product evaluation, academic research, and smaller, cost-conscious businesses, our fast and powerful API is instantly accessible and free to get started.

Try morphological analysis and the rest of Rosette’s endpoints, free up to 10,000 calls/month!

Quality Documentation and Support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise Ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API, knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "tokens": [
    "The",
    "fact",
    "is",
    "that",
    "the",
    "geese",
    "just",
    "went",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "'m",
    "not",
    "banking",
    "on",
    "their",
    "return",
    "soon"
  ],
  "lemmas": [
    "the",
    "fact",
    "be",
    "that",
    "the",
    "goose",
    "just",
    "go",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "be",
    "not",
    "bank",
    "on",
    "they",
    "return",
    "soon"
  ]
}
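In the sample response above, the "tokens" and "lemmas" arrays line up by position, so the two can be paired by index. The sketch below assumes the response has already been received as JSON text and uses a shortened copy of the sample; the variable names are illustrative.

```python
import json

# Shortened copy of the sample response, assumed to arrive as JSON text.
response_text = """
{
  "tokens": ["The", "geese", "went", "back"],
  "lemmas": ["the", "goose", "go", "back"]
}
"""

response = json.loads(response_text)

# Pair each token with the lemma at the same position.
pairs = list(zip(response["tokens"], response["lemmas"]))

# Keep only tokens whose lemma differs from the lowercased surface form.
changed = [(token, lemma) for token, lemma in pairs if token.lower() != lemma]

for token, lemma in changed:
    print(f"{token} -> {lemma}")
```

Filtering to the changed pairs is a quick way to see where lemmatization matters for indexing, such as “geese” and “went” here.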

On Premise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

Request Product Evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please contact us using the details below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop Us a Line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000
