The Key Ingredient to Searching and Processing All Chinese Text

Rosette converts between simplified and traditional Chinese scripts so that “one search finds them all”

A Brit’s “I’m going to have a kip” is nearly incomprehensible to most Americans. No wonder people say that Great Britain and the U.S. are two countries divided by the same language. But Chinese-speaking countries and regions such as China, Taiwan, Hong Kong, and Singapore truly are divided by the same language, because Chinese is written in two ways.

China and Singapore use the simplified Chinese script, which was introduced in mainland China in the 1950s. Simplified Chinese reduced the number of strokes in many characters, but it also merged some pairs of similar-sounding characters (often with completely unrelated meanings!) into a single character. Taiwan and Hong Kong stuck with the traditional Chinese writing system. While the two scripts have the same origin, they are different enough that a person accustomed to traditional Chinese will have some difficulty reading simplified Chinese, and vice versa.

The simplified character 发, pronounced fa, replaced two characters that are also pronounced fa: 發 (meaning emit) and 髮 (meaning hair).

Character-by-character conversion could give you 出髮 (emitting hair?). Now that’s hair-raising!

To spare users from typing each search twice (once in each script), and to do any comprehensive Chinese text processing, all of the text must be converted to one script and analyzed, and the results must then be displayed in the user’s preferred script. It’s a specialized machine translation problem, and a major headache for any business whose core competency isn’t multilingual text.
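To make the "convert everything to one script" idea concrete, here is a minimal Python sketch. It is not Rosette's API: the tiny T2S table and the function names are hypothetical and exist only for illustration. Both the documents and the query are folded to simplified script before matching, so a query in either script finds documents written in either script.

```python
# Toy traditional-to-simplified character table (hypothetical, far from complete).
T2S = {"發": "发", "髮": "发", "頭": "头", "們": "们"}

def to_simplified(text):
    """Fold text to simplified script one character at a time."""
    return "".join(T2S.get(ch, ch) for ch in text)

def build_index(docs):
    """Store a script-folded copy of every document body."""
    return {doc_id: to_simplified(body) for doc_id, body in docs.items()}

def search(index, query):
    """Fold the query the same way, then do a plain substring match."""
    folded = to_simplified(query)
    return sorted(doc_id for doc_id, body in index.items() if folded in body)

docs = {
    "tw": "我們明天出發",  # traditional script
    "cn": "我们明天出发",  # simplified script
}
index = build_index(docs)

print(search(index, "出發"))  # ['cn', 'tw']
print(search(index, "出发"))  # ['cn', 'tw']
```

Folding toward simplified is the easier direction because it is mostly many-to-one. Showing results in a traditional reader's preferred script requires the harder, one-to-many direction.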

Luckily, Rosette’s Chinese Script Converter function tackles this problem for you! Equipped with dictionaries and linguistic smarts, Rosette can handle the trickiest script conversion use cases. For example:

Translating word-by-word instead of character-by-character

Especially when converting from simplified to traditional Chinese, you’re faced with a one-to-many conversion problem. A simple character-to-character mapping might give you gibberish. With knowledge of words, Rosette chooses the correct conversion from 出发 (set off) to 出發 (set off) and not 出髮 (emitting hair).

When a conversion to traditional Chinese goes wrong, a search for 出發 would never find the affected documents, because they would have been indexed under the non-word 出髮.

  Traditional Chinese    Simplified Chinese    Pronunciation (meaning)
  出發                   出发                  chūfā (set off)
  頭髮                   头发                  tóufǎ (hair)
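The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Rosette's implementation: the two small dictionaries are hypothetical, and the word-level converter uses a simple greedy longest-match lookup in place of real linguistic analysis.

```python
# Hypothetical toy tables; a real converter needs far larger dictionaries.
CHAR_S2T = {"发": "髮", "头": "頭"}          # naive: one fixed candidate per character
WORD_S2T = {"出发": "出發", "头发": "頭髮"}  # word-level entries carry the right choice

def convert_by_char(text):
    """Character-by-character conversion: ignores the surrounding word."""
    return "".join(CHAR_S2T.get(ch, ch) for ch in text)

def convert_by_word(text, max_len=4):
    """Greedy longest-match conversion with a per-character fallback."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_S2T:
                out.append(WORD_S2T[chunk])
                i += n
                break
        else:
            # No dictionary word starts here: convert a single character.
            out.append(CHAR_S2T.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(convert_by_char("出发"))          # 出髮 (the non-word)
print(convert_by_word("出发"))          # 出發
print(convert_by_word("出发前洗头发"))  # 出發前洗頭髮
```

Greedy matching can still pick the wrong boundary in ambiguous text, which is one reason a production converter relies on proper segmentation and larger dictionaries.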

Translation cases where the word is entirely different

Cases like the British “kip” to the American “nap” are especially difficult to handle. As China and Taiwan developed independently, it’s not surprising each country came up with different words for new concepts, especially technological advances.

  Traditional Chinese    Simplified Chinese    Meaning
  電腦 (diànnǎo)         计算机 (jìsuànjī)     computer
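As a rough illustration of why script conversion alone is not enough here, the Python sketch below layers a regional term table on top of a character table. Both tables and both function names are hypothetical, and the plain string replacement stands in for the word segmentation a real system would need.

```python
# Hypothetical toy tables for illustration only.
CHAR_S2T = {"计": "計", "机": "機"}  # script-level character mapping
REGIONAL_S2T = {"计算机": "電腦"}    # term-level mapping from the table above

def script_only(text):
    """Convert the script but keep the original vocabulary."""
    return "".join(CHAR_S2T.get(ch, ch) for ch in text)

def with_regional_terms(text):
    """Swap regional terms first, then convert whatever characters remain."""
    for simplified_term, traditional_term in REGIONAL_S2T.items():
        text = text.replace(simplified_term, traditional_term)
    return script_only(text)

print(script_only("我的计算机"))          # 我的計算機
print(with_regional_terms("我的计算机"))  # 我的電腦
```

The first result is written in traditional characters but still uses the simplified-side word, so a search for 電腦 would miss it. The second result uses the word from the table.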

Rosette Chinese script conversion is available now. Have a Chinese text processing or search problem that could use some help? Contact us at info@basistech.com.
