The Importance of Japanese Readings in Search and More

15 Jul 2015

Japanese is unusual in that a word and its pronunciation are both valid keyword searches. Imagine if you could search in English on “Seezer Salad Resipi” and get recipes for “Caesar salad.” In Japanese you can, because the language is written with Chinese ideographs, called kanji, and two phonetic syllabaries. The two syllabaries (hiragana and katakana) each map to the same set of syllabic sounds, which can represent every Japanese word. The difficulty is that kanji characters represent meaning but vary in pronunciation depending on the word they appear in. For example, 生 is read “sei” in 先生 (sensei), “shou” in 生じる (shoujiru), and “nama” when it stands alone as 生.
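Because hiragana and katakana cover the same sounds, text in one can be folded into the other before comparing readings. The sketch below is a minimal illustration, not how any particular product does it. It relies on the Unicode layout, where the katakana from ァ (U+30A1) to ヶ (U+30F6) sit exactly 0x60 above their hiragana counterparts; the function name is made up for this example.

```python
# Katakana U+30A1..U+30F6 sit exactly 0x60 above the matching hiragana.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Fold katakana to hiragana; every other character passes through."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


print(katakana_to_hiragana("スシ"))  # すし
```

Kanji and the long-vowel mark (ー) are left untouched, so this only normalizes script; it does not produce a reading for kanji.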

Natural language processing solutions like Rosette tokenize Japanese text into words and provide each word’s reading, which has a variety of uses.

Japanese search query expansion

Some words are written equally often in kanji or hiragana; the script used depends on the writer. For example, “sushi” is commonly written in hiragana (すし) or kanji (寿司).

A Japanese writer may also write a word that is normally in kanji as hiragana or katakana for emphasis.

When used judiciously, expanding a query by also searching on its reading can increase recall, although it may come at the expense of some precision as Japanese has many homonyms. “Hashi” is the reading for both “chopsticks” and “bridge.” “Hana” is both “nose” and “flower.”
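To make the recall and precision trade-off concrete, here is a small sketch of reading-based query expansion. The lexicon is a hypothetical, hand-written stand-in for the readings a morphological analyzer would supply, and the function names are invented for this example.

```python
# Hypothetical toy lexicon: surface form -> hiragana reading.
READINGS = {
    "寿司": "すし",
    "箸": "はし",  # chopsticks
    "橋": "はし",  # bridge
    "鼻": "はな",  # nose
    "花": "はな",  # flower
}


def expand_query(term: str) -> list[str]:
    """Return the term plus its reading, if the lexicon knows one."""
    reading = READINGS.get(term)
    if reading is None or reading == term:
        return [term]
    return [term, reading]


def surfaces_for(reading: str) -> list[str]:
    """All lexicon entries that share a reading, in lexicon order."""
    return [surface for surface, r in READINGS.items() if r == reading]


print(expand_query("寿司"))  # ['寿司', 'すし']
print(surfaces_for("はし"))  # ['箸', '橋']
```

Searching on 寿司 now also matches documents that spell it すし, which raises recall. The second call shows the cost: a search on the reading はし cannot tell chopsticks from bridges.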

Japanese speech-to-text

Japanese readings are also used to convert spoken Japanese to regular written Japanese in speech-to-text applications. Similarly, readings are essential to input method editors (IMEs), which are small software tools that sit between a user’s typed input and an application. When the user types the pronunciation, the IME brings up a small window from which to choose the correct Japanese characters.

Reading Names: Call Centers

Names in Japanese are notorious for their irregular pronunciations. Very common names like 田中 (“Tanaka”) are unambiguous, but a recent trend among parents to give children “creative” names pushes the envelope in pronunciation. One example is 美桜 (Mio), where the first character is normally read “utsuku” or “mi” and the second is usually read “sakura”, “ou”, or “you.” There are also very unusual family names, such as 九 (“9”), which is normally pronounced “ku” or “kyuu” but as a name is pronounced “Ichijiku.”

Being able to note the correct name pronunciation is important for call center applications where an operator might need to address a customer.

When it comes to names and readings, Rosette offers two options. First, the name matching capability of Rosette can provide a reading when given a Japanese name (or, given a reading, provide possible kanji equivalents), and a custom reading may also be designated. Second, the base linguistics module of Rosette provides readings of words. It now also lets users specify a reading for a word (within a user-defined dictionary) to handle these special readings, or to correct Rosette when it suggests the wrong reading for a particular word or name.
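The idea of a user-defined dictionary taking priority over default readings can be sketched generically. This is not Rosette's API or dictionary format; both dictionaries and the `reading_for` helper are hypothetical, and the default entries simply stand in for whatever an analyzer would propose.

```python
from collections import ChainMap

# Hypothetical default readings an analyzer might propose.
DEFAULT_READINGS = {"田中": "たなか", "九": "きゅう"}

# Hypothetical user-defined overrides for special name readings.
USER_READINGS = {"九": "いちじく", "美桜": "みお"}

# ChainMap searches the user dictionary first, then the defaults.
readings = ChainMap(USER_READINGS, DEFAULT_READINGS)


def reading_for(word: str):
    """Return the preferred reading, or None if neither source has one."""
    return readings.get(word)


print(reading_for("九"))    # いちじく (user entry wins)
print(reading_for("田中"))  # たなか (falls back to the default)
```

Keeping the overrides in a separate layer means a correction for one customer's name never has to touch the default data.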

Sorting Lists

Readings are also used to sort lists. Names or categories in a directory are frequently sorted by pronunciation in hiragana order. Yahoo! Japan, for example, lists categories in order by reading. Sorting by kanji stroke count (the number of strokes in the first character) is also possible, but less useful.
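As a rough illustration of sorting by reading, the sketch below pairs each name with its hiragana reading and sorts on the reading. It assumes the readings are already known, and it uses plain code point comparison, which only approximates dictionary order for kana; a production system would more likely use a proper Japanese collation.

```python
# (surface form, hiragana reading) pairs; readings assumed to be known.
names = [
    ("田中", "たなか"),
    ("鈴木", "すずき"),
    ("伊藤", "いとう"),
    ("佐藤", "さとう"),
]

# Sort on the reading rather than on the kanji.
by_reading = sorted(names, key=lambda entry: entry[1])

print([surface for surface, _ in by_reading])
# ['伊藤', '佐藤', '鈴木', '田中']
```

Sorting the kanji directly would instead order the names by the code points of the characters, which has no relation to how the names are pronounced.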