Chinese Base Linguistics

Analyze all your Chinese text, whether written in simplified or traditional script

Overview

Hybrid approach for high quality

Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high-quality results.

  • Tokenization: Chinese is written without spaces between words. Tokenizing text into words increases search accuracy more effectively than language-agnostic n-gram techniques.
  • Lemmatization: Lemmatization (finding the dictionary form of a word) means a system can associate related words based on meaning, or index the lemma in place of the many surface forms. A given token may have more than one lemma.
  • Part-of-speech tagging: Rosette® selects the most likely POS tag based on the sentence context.
  • Chinese readings: Rosette provides the pinyin pronunciation (reading) of Chinese tokens — useful for text-to-speech or input method editor programs.
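
To illustrate why word-level tokenization can serve search better than n-grams, the sketch below compares overlapping character bigrams with a greedy dictionary lookup (forward maximum matching). It is a toy illustration only, not Rosette's implementation, which combines dictionaries, statistical modeling, and rules. The names `bigrams`, `max_match`, and `TOY_LEXICON` are hypothetical, and the sample sentence simply joins two words from the conversion table on this page.

```python
def bigrams(text):
    """Language-agnostic baseline: overlapping two-character n-grams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, lexicon):
    """Greedy forward maximum matching against a word list.

    A simplified stand-in for a real tokenizer: at each position it takes
    the longest lexicon entry, falling back to a single character.
    """
    longest = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical three-word lexicon (assumption, for illustration only).
TOY_LEXICON = {"出租汽车", "出发", "头发"}
sample = "出租汽车出发"

word_tokens = max_match(sample, TOY_LEXICON)
ngram_tokens = bigrams(sample)
```

With this lexicon the word tokenizer yields two meaningful words, while the bigram baseline also indexes fragments such as 租汽 and 车出 that span word boundaries and can produce spurious search matches.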

Chinese script conversion

Because of the proliferation of Chinese speakers in many countries and regions, the language has numerous variations. Chinese in China and Taiwan developed independently of each other from the 1950s onward. China created a simplified version of Chinese ideographs (Hanzi), while Taiwan continued using the traditional script. These variations apply in other parts of the Chinese-speaking world as well. For example, Singapore uses simplified Chinese, while Hong Kong uses traditional Chinese.

For applications that work with Chinese, the text must be converted to a single form — whether traditional or simplified — in order to be searched and processed correctly.

Levels of conversion: simplified versus traditional Chinese

Rosette supports all three levels of conversion:

  • Character-level: At this basic level, Rosette Base Linguistics (RBL) performs a fast character-by-character conversion without considering each character's context.
  • Orthographic: A single simplified character may map to one or more traditional characters. At this level, RBL chooses the correct destination character based on word context.
  • Lexemic: Especially for modern objects and concepts, China and Taiwan chose different words to represent a foreign name (“Natalie Portman”) or new word (“computer”). RBL uses dictionary data to perform this most difficult level of script conversion.

Type of script conversion | Simplified Chinese | Traditional Chinese
Character-level | 门 (“door”), pronounced mén | 門 (“door”), pronounced mén
Orthographic* | 出发 (“set off”), pronounced chūfā | 出發 (“set off”), pronounced chūfā
Orthographic* | 头发 (“hair”), pronounced tóufa | 頭髮 (“hair”), pronounced tóufa
Lexemic | 出租汽车 (“taxi”), pronounced chūzūqìchē | 計程車 (“taxi”), pronounced jìchéngchē

*Note that the simplified Chinese words for “set off” and “hair” share the same second character (发). In traditional Chinese, that character corresponds to two different characters (發 and 髮), which share the pronunciation “fa” but have different meanings.
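
The three levels can be sketched with a few lookup tables built from the examples above. This is a minimal illustration under stated assumptions, not RBL's API or data: the table names and functions are hypothetical, the input is assumed to be already tokenized, and the choice of 發 as the default character-level target for 发 is an assumption made for the example.

```python
# Character-level: one fixed target per simplified character (toy data).
# Mapping 发 to 發 by default is an assumption for this illustration.
CHAR_MAP = {"门": "門", "头": "頭", "发": "發"}

# Orthographic: the surrounding word decides which traditional character to use.
WORD_MAP = {"出发": "出發", "头发": "頭髮"}

# Lexemic: the whole word is replaced by the regional equivalent.
LEXEME_MAP = {"出租汽车": "計程車"}


def convert_chars(text):
    """Context-free conversion, one character at a time."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_words(tokens, lexemic=False):
    """Word-aware conversion over pre-tokenized simplified text."""
    out = []
    for token in tokens:
        if lexemic and token in LEXEME_MAP:
            out.append(LEXEME_MAP[token])
        elif token in WORD_MAP:
            out.append(WORD_MAP[token])
        else:
            out.append(convert_chars(token))
    return "".join(out)


char_level = convert_chars("头发")                         # context-free result
orthographic = convert_words(["头发"])                     # word context applied
lexemic_result = convert_words(["出租汽车"], lexemic=True)  # regional word swap
```

In this sketch the context-free pass turns 头发 into 頭發, which uses the wrong second character, while the word-aware pass produces 頭髮. That gap is why the orthographic and lexemic levels depend on tokenization and dictionary data.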

Product Highlights

  • Chinese script conversion between traditional and simplified scripts
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Noun decompounding
  • Chinese pinyin readings
  • Sentence tagging
  • Dictionary to customize tokenization, lemmatization, and readings

User customizable

Rosette Java edition (an SDK) and Rosette Server edition (an on-premises cloud) let users add new entries to the user dictionaries and thus modify or correct Rosette's behavior.