The| power| of| words|
Tokenization separates text into its atomic elements, such as words in Latin script. It is the basis of text analytics for every language, and it is especially important for scripts that do not put spaces between words, such as Japanese and Chinese. Rosette provides tokenization for 40 languages.
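To see why scripts without spaces need more than a simple splitter, here is a minimal Python sketch. It uses only the standard `re` module and is not Rosette's API; the helper name `naive_tokens` is made up for illustration.

```python
import re

def naive_tokens(text):
    """Split on runs of word characters (hypothetical helper, not a Rosette call)."""
    return re.findall(r"\w+", text)

# Spaces mark the word boundaries in Latin script, so a naive split works.
latin_tokens = naive_tokens("The power of words")
print(latin_tokens)  # ['The', 'power', 'of', 'words']

# Chinese has no spaces, so the whole phrase comes back as a single token.
cjk_tokens = naive_tokens("北京大学生物系")
print(cjk_tokens)  # ['北京大学生物系']
```

The naive splitter finds four words in the English phrase but cannot find any word boundary inside the Chinese one.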
Bigrams vs. statistical modeling
Many search tools handle languages written without spaces between words by indexing bigrams, overlapping two-character sequences. Rosette instead identifies and separates each word through advanced statistical modeling. This approach minimizes index size, enhances search accuracy, and increases relevancy.
Let’s compare the two approaches for indexing 北京大学生物系 (“Beijing University Biology Department”). Bigramming produces six tokens, of which two are non-words and one, 学生 (“student”), is an incorrect word that straddles a word boundary. Rosette segments the phrase into two tokens: 北京大学 (“Beijing University”) and 生物系 (“Biology Department”). A query for the word 学生 (“student”) will correctly miss in a Rosette-tokenized index but incorrectly hit in the bigrammed index.
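The comparison can be reproduced with a small Python sketch. Rosette's statistical model is not shown here; as a stand-in, the sketch uses greedy longest-match segmentation against a tiny hand-written lexicon. The lexicon and the function names are assumptions made for this illustration only.

```python
def bigrams(text):
    """Return every overlapping two-character window of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Hypothetical toy lexicon, for illustration only.
LEXICON = {"北京", "大学", "北京大学", "学生", "生物", "生物系"}

def longest_match(text, lexicon):
    """Greedy forward longest-match segmentation (a stand-in, not Rosette's model)."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

phrase = "北京大学生物系"
bigram_index = bigrams(phrase)
word_index = longest_match(phrase, LEXICON)

print(bigram_index)  # ['北京', '京大', '大学', '学生', '生物', '物系']
print(word_index)    # ['北京大学', '生物系']

# A query for 学生 ("student") hits the bigram index but misses the word index.
print("学生" in bigram_index, "学生" in word_index)  # True False
```

Even though 学生 is in the toy lexicon, word-level segmentation never emits it for this phrase, so the false hit disappears and the index holds two tokens instead of six.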