Chinese Base Linguistics


Analyze all your Chinese text, whether written in simplified or traditional script

Overview

Chinese script conversion

Because of the proliferation of Chinese speakers in many countries and regions, the language has numerous variations. Chinese in China and Taiwan developed independently of each other from the 1950s onward. China created a simplified version of Chinese ideographs (Hanzi), while Taiwan continued using the traditional script. These variations apply in other parts of the Chinese-speaking world as well. For example, Singapore uses simplified Chinese, while Hong Kong uses traditional Chinese.

For applications that work with Chinese, the text must be converted to a single form, either traditional or simplified, so that it can be searched and processed correctly.

Levels of conversion: simplified versus traditional Chinese

Rosette supports all three levels of conversion:

  • Codepoint: These are cases where the character is unchanged, but because mainland China and Taiwan use different character encodings, the bytes representing each character have to be correctly interpreted and converted to the universal Unicode encoding.
  • Orthographic: A single simplified character may map to one or more traditional characters. The correct destination character depends on word context.
  • Lexemic: Especially for modern objects and concepts, China and Taiwan chose different words to represent a foreign name (“Natalie Portman”) or new word (“computer”).

Type of script conversion | Simplified Chinese | Traditional Chinese
Codepoint | 大 (“big”) | 大 (“big”)
Orthographic* | 出发 (“set off”), pronounced chū fā | 出發 (“set off”), pronounced chū fā
Orthographic* | 头发 (“hair”), pronounced tóu fa | 頭髮 (“hair”), pronounced tóu fa
Lexemic | 出租汽车 (“taxi”), pronounced chū zū qì chē | 計程車 (“taxi”), pronounced jì chéng chē

*Note that the simplified Chinese words for “set off” and “hair” share the same second character (发). In traditional Chinese, that one character corresponds to two different characters (發 and 髮), which have different meanings even though both are read “fa”.
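The three levels can be illustrated with a small Python sketch. This is not Rosette's implementation or API: the mapping tables `S2T_WORDS` and `S2T_CHARS` and the function `to_traditional` are hypothetical, and the tables contain only the examples from the table above. The sketch assumes a greedy longest-match lookup at the word level, with a per-character fallback.

```python
# Level 1 (codepoint): the same character is stored as different bytes in
# the legacy encodings GB2312 (mainland China) and Big5 (Taiwan).
gb_bytes = "大".encode("gb2312")
big5_bytes = "大".encode("big5")
# Decoding each byte sequence with its own codec yields the same Unicode character.
same_char = gb_bytes.decode("gb2312") == big5_bytes.decode("big5")

# Levels 2 and 3 (orthographic, lexemic): toy word-level table (hypothetical data).
S2T_WORDS = {
    "出发": "出發",      # orthographic: 发 -> 發 in "set off"
    "头发": "頭髮",      # orthographic: 发 -> 髮 in "hair"
    "出租汽车": "計程車",  # lexemic: a different word for "taxi"
}
# Per-character fallback. Note that 发 is ambiguous, so a single default
# cannot be right in every word; this is why word context is needed.
S2T_CHARS = {"头": "頭", "车": "車", "发": "發"}


def to_traditional(text: str) -> str:
    """Convert simplified text using longest word match, then characters."""
    max_len = max(len(word) for word in S2T_WORDS)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so whole words win over characters.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in S2T_WORDS:
                out.append(S2T_WORDS[chunk])
                i += n
                break
        else:
            # No word matched: map the single character, or keep it unchanged.
            out.append(S2T_CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("头发"))  # 頭髮
print(to_traditional("出发"))  # 出發
```

A character-only table would map both words to the same second character, which is why the word-level lookup runs first.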

Product Highlights

  • Chinese script conversion between traditional and simplified scripts
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Noun decompounding
  • Chinese pinyin readings
  • Sentence tagging
  • Dictionary to customize tokenization, lemmatization, and readings

How It Works

Hybrid approach for high quality

Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high-quality results.

  • Tokenization: Chinese is written without spaces between words. Tokenizing text into words increases search accuracy more effectively than language-agnostic n-gram techniques.
  • Lemmatization: Lemmatization (finding the dictionary form of a word) means a system can associate related words based on meaning, or index the lemma in place of the many surface forms. A given token may have more than one lemma.
  • Part-of-speech tagging: Rosette® selects the most likely POS tag based on the sentence context.
  • Chinese readings: Rosette provides the pinyin pronunciation (reading) of Chinese tokens — useful for text-to-speech or input method editor programs.
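
To make the tokenization point concrete, the following Python sketch contrasts a dictionary-based tokenizer with character bigrams. It is only an illustration under simplifying assumptions: it uses forward maximum matching over a tiny hypothetical lexicon (`LEXICON`), whereas the approach described above combines dictionaries with statistical modeling and rules. The function names are invented for this example.

```python
# Tiny hypothetical lexicon; a real system would use a large dictionary.
LEXICON = {"我", "喜欢", "北京", "大学", "北京大学"}


def tokenize(text: str, lexicon: set) -> list:
    """Forward maximum matching: take the longest lexicon word at each position."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


def bigrams(text: str) -> list:
    """Language-agnostic alternative: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "我喜欢北京大学"  # "I like Peking University"
print(tokenize(sentence, LEXICON))  # ['我', '喜欢', '北京大学']
print(bigrams(sentence))  # ['我喜', '喜欢', '欢北', '北京', '京大', '大学']
```

The bigram output contains fragments such as 欢北 that cross word boundaries. Indexing them can produce false matches at search time, which word-level tokens avoid.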

User customizable

Rosette Java edition (an SDK) and Rosette Server edition (an on-premises cloud) let users add new entries to the user dictionaries and thus modify or correct the behavior of Rosette.
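
As a rough illustration of how a user dictionary can change results, the Python sketch below overlays user entries on a base dictionary and reruns a simple analysis. Everything here is hypothetical: `BASE`, `USER`, and `analyze` are invented names, the entries are toy data, and the sketch does not reflect Rosette's actual dictionary format or API.

```python
from collections import ChainMap

# Hypothetical base dictionary: word -> pinyin reading.
BASE = {"技术": "jì shù", "区": "qū", "块": "kuài", "链": "liàn"}
# Hypothetical user dictionary that adds a new word ("blockchain").
USER = {"区块链": "qū kuài liàn"}


def analyze(text: str, entries) -> list:
    """Return (token, reading) pairs using longest-match dictionary lookup."""
    max_len = max(len(word) for word in entries)
    result = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in entries:
                break
        else:
            # No entry matched: fall back to the single character.
            word = text[i]
        # Unknown tokens get a reading of None.
        result.append((word, entries.get(word)))
        i += len(word)
    return result


text = "区块链技术"  # "blockchain technology"
before = analyze(text, BASE)
# ChainMap looks in USER first, so user entries extend or override BASE.
after = analyze(text, ChainMap(USER, BASE))
print(before)
print(after)
```

With only the base entries, the new word is split into three single characters. Once the user entry is present, it is returned as one token with its own reading.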