Japanese Base Linguistics

Normalize Japanese spellings for more efficient and accurate processing

Overview

Katakana spelling variations

Because Japanese Katakana words are often phonetic approximations of foreign words, people may spell the same word in several different ways. Our Japanese base linguistics tools normalize these variations to a single spelling, so that a search for “Venice” finds every occurrence, whether the writer used ベニス, ベネツィア, ヴェネチア, or ヴェネツィア.

Variations on “Bermuda” and “expo” in Japanese
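The idea behind variant normalization can be illustrated with a minimal Python sketch. It is not Rosette's implementation or API: the variant table, the choice of ヴェネツィア as the canonical spelling, and the function names are all assumptions made for this example. It maps each Venice spelling mentioned above to one form, so that a query in any spelling matches text written in any other.

```python
import unicodedata

# Hypothetical variant table: the canonical spelling chosen here is arbitrary.
CANONICAL_VENICE = "ヴェネツィア"
_RAW_VARIANTS = {
    "ベニス": CANONICAL_VENICE,
    "ベネツィア": CANONICAL_VENICE,
    "ヴェネチア": CANONICAL_VENICE,
    "ヴェネツィア": CANONICAL_VENICE,
}
# Store keys in NFKC form so lookups compare like with like.
VARIANTS = {unicodedata.normalize("NFKC", k): v for k, v in _RAW_VARIANTS.items()}


def normalize_katakana(token: str) -> str:
    """Return the canonical spelling for a known variant, else the token itself."""
    # NFKC also folds half-width Katakana into full-width Katakana.
    folded = unicodedata.normalize("NFKC", token)
    return VARIANTS.get(folded, folded)


def matches(query: str, document_tokens: list[str]) -> list[str]:
    """Return the document tokens that normalize to the same form as the query."""
    target = normalize_katakana(query)
    return [t for t in document_tokens if normalize_katakana(t) == target]


hits = matches("ベニス", ["ヴェネチア", "ローマ", "ベネツィア"])
```

In this sketch, `hits` contains both Venice spellings from the token list even though neither is spelled the way the query is. A production system would need far broader coverage than a hand-written table.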

Modern vs. old Kanji

Japanese adopted Chinese ideographs (called Kanji) centuries ago. Although many modern-day Kanji are simplified forms, the older forms of these characters still appear fairly often. Our Japanese base linguistics normalizes older Kanji to their modern forms, so that all variants of a character are processed as the same character.
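As a rough illustration of this kind of normalization, the following Python sketch replaces a handful of older character forms with their modern equivalents using a one-to-one translation table. The table is a tiny hand-picked sample and the function name is hypothetical; it does not reflect how Rosette stores or applies its mappings.

```python
# A small sample of old-form to modern-form Kanji pairs (illustrative only).
OLD_TO_MODERN = str.maketrans({
    "國": "国",
    "學": "学",
    "體": "体",
    "舊": "旧",
    "櫻": "桜",
    "廣": "広",
})


def modernize_kanji(text: str) -> str:
    """Replace each older Kanji in the sample table with its modern form."""
    return text.translate(OLD_TO_MODERN)


old_spelling = "東京大學"
modern_spelling = modernize_kanji(old_spelling)  # "東京大学"
```

Applying the same function to both indexed text and queries means that 東京大學 and 東京大学 are treated as the same string.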

Product Highlights

  • Normalize old Kanji to modern Kanji
  • Normalize Katakana spelling variations
  • Sentence tagging
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Noun decompounding
  • Japanese readings
  • Dictionary to customize tokenization, lemmatization, and readings


Older Kanji variations converted to modern Kanji

How It Works

Hybrid approach for high quality

Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high-quality results.

  • Tokenization: Japanese is written using three scripts with no spaces between words. Tokenizing at word boundaries yields more accurate search results than language-agnostic n-gram techniques.
  • Lemmatization: Lemmatization (finding the dictionary form of a word) means a system can associate related words based on meaning, or index the lemma in place of the many surface forms. A given token may have more than one possible lemma. Rosette chooses the best candidate based on context.
  • Part-of-speech tagging: Rosette selects the most likely POS tag based on sentence context.
  • Noun decompounding: Splitting a compound noun into its component nouns lets a query for one component match the compound. Search engines, in particular, need noun decompounding to increase search recall.
  • Japanese readings: The pronunciation of a Kanji varies with its context. Rosette® provides possible readings for each token, which is useful for text-to-speech or input method editor programs.
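To show why word-boundary tokenization can be more accurate than character n-grams, here is a toy Python comparison. It uses a greedy longest-match lookup against a five-entry lexicon, which is a deliberate simplification: Rosette combines dictionaries, statistical modeling, and rules, and none of the names below come from its API. The example text contains 東京都 (Tokyo Metropolis), whose last two characters happen to spell 京都 (Kyoto).

```python
def char_bigrams(text: str) -> list[str]:
    """Language-agnostic indexing: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical miniature lexicon, just large enough for this example.
LEXICON = {"東京都", "東京", "京都", "に", "住む"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def longest_match_tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in LEXICON or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "東京都に住む"  # "(I) live in Tokyo Metropolis"
bigrams = char_bigrams(sentence)
words = longest_match_tokenize(sentence)

false_hit = "京都" in bigrams  # the bigram index matches a query for Kyoto
word_hit = "京都" in words     # the word index does not
```

With bigram indexing, a search for 京都 matches a sentence that is only about Tokyo. The word-level tokens 東京都, に, and 住む avoid that false match.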

User customizable

Rosette Java edition (an SDK) and Rosette Server edition (on-premise cloud) deployments provide dictionaries that allow the user to modify or correct the behavior of Rosette by adding words, noun decompounds, readings, and lemmas.