Challenges of Southeast Asian Languages — Tagalog, Malay, and Indonesian — for Text Analytics

Tina Lieu

BasisTech, known in the text analytics market as “the language experts,” has added three Southeast Asian languages (Tagalog, Malay, and Indonesian) to the newest release of its Rosette® text analytics product line.

Tagalog, Malay, and Indonesian belong to the Austronesian language family, which includes numerous languages and dialects spoken on islands in Southeast Asia. The English language has borrowed words from Tagalog, including “boondocks,” from the Tagalog word bundók (“mountain”). Malay/Indonesian words adopted by English include “bamboo” from bambu and “tempeh” from tempe.

When I spoke with BasisTech’s Senior Linguistic Data Engineer Zachary Yocum about working with Southeast Asian languages, he said, “These languages have interesting and challenging morphological typologies that entice our inner linguists. It’s a fun, intellectual puzzle to figure out which approach or technology will produce the most accurate text analytics, given the rules and characteristics of each language.

“We needed to be mindful of the distinction between inflectional morphemes and derivational morphemes. For example, in Indonesian, the lemma form of memberi/VERB is beri/VERB (“to give”) since the meng- prefix inflects for an indirect object. But for menyempit/VERB, the correct lemma is menyempit/VERB (“to narrow” or “to tighten”) because the prefix meng- derives the verbal form from the adjective sempit/ADJ (“narrow” or “tight”). We also need to be mindful of the phonological interaction demonstrated by the prefix meng- realized as both mem- before the initial b in beri, and meny- in combination with the elision of the initial s from sempit in menyempit.”
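The phonological interaction described in the quote can be sketched in a few lines of Python. This is only an illustration, not how Rosette works: the function name attach_meng is hypothetical, and only the b and s cases come from the text above. The remaining rows follow commonly described patterns for this prefix and are assumptions that ignore exceptions such as loanwords and one-syllable roots.

```python
# Simplified sketch of how the Indonesian prefix meng- surfaces on a root.
# Only the b -> mem- and s -> meny- (initial s dropped) cases are taken from
# the article; every other rule is an assumption based on commonly described
# patterns and does not cover exceptions.

VOWELS = "aeiou"


def attach_meng(root: str) -> str:
    """Return the root with a surface form of meng- attached (hypothetical helper)."""
    if not root:
        raise ValueError("root must not be empty")
    root = root.lower()
    first = root[0]
    if first in "bfv":
        return "mem" + root        # beri -> memberi
    if first == "p":
        return "mem" + root[1:]    # assumption: initial p is dropped
    if first == "s":
        return "meny" + root[1:]   # sempit -> menyempit (initial s is dropped)
    if first == "t":
        return "men" + root[1:]    # assumption: initial t is dropped
    if first in "cdjz":
        return "men" + root        # assumption
    if first == "k":
        return "meng" + root[1:]   # assumption: initial k is dropped
    if first in "gh" or first in VOWELS:
        return "meng" + root       # assumption
    return "me" + root             # assumption: l, m, n, r, w, y


for root in ("beri", "sempit"):
    print(root, "->", attach_meng(root))
```

Generating the surface form is the easy direction. Going the other way, from memberi back to beri while leaving menyempit alone, also needs lexical knowledge about which uses of the prefix are inflectional and which are derivational, which a rule table like this cannot supply.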

Linguistic Terms

Morphology: The study of the forms of words.
Morpheme: The smallest meaningful unit of a language, one that cannot be divided further, such as in, come, and -ing, which together form “incoming.”
Inflectional morpheme: A morpheme that adds grammatical information to a word, such as marking a word as the direct object of a verb.
Derivational morpheme: A morpheme that creates a new word by changing a word’s meaning or its part of speech. Like other affixes (bound morphemes such as -ing), it cannot stand on its own and must attach to another word.
Circumfix: A paired prefix and suffix that work together to change a word’s meaning or part of speech.
Reduplication: Repetition of a word or part of a word.


In English, we are familiar with prefixes and suffixes that make new words by adding a few characters to the beginning or end of words, such as “pre-” (pre-test, pre-application) or “-able” (preventable, excitable). You may also know about infixes, which are inserted into the middle of words, as in Arabic. A circumfix is a paired prefix and suffix that work together to change a word.

For example, the Indonesian circumfix meng- -kan can be attached to a verb to form the benefactive form[1], or attached to an adjective to create a verb. Furthermore, depending on the phonology (the way a word is pronounced) of the word to which the circumfix is attached, the circumfix’s spelling may change.

The Tagalog circumfix pa- -nin (which also has other forms) creates a causative verb, as shown here:

pa- -nin + tawa (“laugh”) → patawanin (“to make someone laugh”)

In Malay, the circumfix ke- -an attaches to adjectives to form nouns about the state or quality of the adjective (similar to the function of -ness in English).

  • ke- -an + barat (“west”) → kebaratan “westness”
  • ke- -an + timur (“east”) → ketimuran “eastness”
  • ke- -an + besar (“huge”) → kebesaran “hugeness”
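As a minimal sketch, the circumfix examples above amount to wrapping a root with a prefix and a suffix at the same time. The helper name apply_circumfix is hypothetical, and the code assumes the spelling of neither the affixes nor the root changes, which holds for these examples but not in general.

```python
# Minimal sketch: a circumfix wraps a root with a paired prefix and suffix.
# Assumes no spelling changes at the boundaries, which is true for the
# examples in the article but not for every root.

def apply_circumfix(root: str, prefix: str, suffix: str) -> str:
    """Attach a paired prefix and suffix to a root (hypothetical helper)."""
    return f"{prefix}{root}{suffix}"


# (root, prefix, suffix) triples taken from the examples in the article
examples = [
    ("barat", "ke", "an"),   # Malay
    ("timur", "ke", "an"),   # Malay
    ("besar", "ke", "an"),   # Malay
    ("tawa", "pa", "nin"),   # Tagalog
]

for root, prefix, suffix in examples:
    print(f"{prefix}- -{suffix} + {root} -> {apply_circumfix(root, prefix, suffix)}")
```

For analysis, the harder task is the reverse one: recognizing that the two halves belong to a single circumfix instead of treating them as an unrelated prefix and suffix.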


These three Southeast Asian languages also use reduplication (repetition of a word). In many languages, such as Chinese, reduplication is used to show emphasis. In Tagalog, by contrast, reduplication changes the meaning of the word. Sometimes the reduplication is partial, with only part of a word repeated. Malay and Indonesian also use reduplication to mark the grammatical role of a word in the sentence.

Tagalog examples:

  • Reduplication of sabi (“said”) → sabi-sabi (“rumor”)
  • Partial reduplication of balita (“news”) → bali-balita (“news spreading”)

Indonesian examples:

  • Reduplication can form a new word: gula (“sugar”) → gula-gula (“sweets”).
  • It can also make nouns plural: kasur (“bed”) → kasur-kasur (“beds”).
  • Partial reduplication of a verb can change the nuance of the action. In this example, it means that something is being done in a leisurely way: Di toko itu kami hanya melihat-lihat. (“In that shop we were just looking around/browsing.”)
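The hyphenated examples above suggest a simple surface check that a tokenizer could use to flag candidate reduplications. The sketch below is a heuristic under stated assumptions, not a description of any product: the function name classify_reduplication is hypothetical, it only looks at hyphenated tokens, and it would produce false positives on ordinary hyphenated words without a lexicon to confirm the result.

```python
# Heuristic sketch: classify a hyphenated token as full reduplication,
# partial reduplication, or neither, using string comparison only.
# Assumes reduplicated forms are written with a hyphen, as in the
# article's examples; a real analyzer would also consult a lexicon.

def classify_reduplication(token: str) -> str:
    """Return "full", "partial", or "none" for a token (hypothetical helper)."""
    parts = token.lower().split("-")
    if len(parts) != 2 or not all(parts):
        return "none"
    left, right = parts
    if left == right:
        return "full"      # sabi-sabi, gula-gula, kasur-kasur
    if left.endswith(right) or right.startswith(left):
        return "partial"   # melihat-lihat, bali-balita
    return "none"


for token in ("sabi-sabi", "gula-gula", "bali-balita", "melihat-lihat", "toko"):
    print(token, "->", classify_reduplication(token))
```

Note that the check says nothing about meaning: gula-gula (“sweets”) is a new word, while kasur-kasur (“beds”) is a plural, and both come back as "full".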

Where do you hear Tagalog, Malay, and Indonesian?

Many people are surprised at the large number of speakers of these Southeast Asian languages. Both Malay (Standard) and Indonesian are standardized varieties of Malay spoken in Malaysia and Indonesia, respectively. Indonesian is the 11th most spoken language in the world. Malay (Standard), also called Malaysian Malay, is spoken in Malaysia, Brunei (with some minor local variations), and Singapore. The number of speakers of all versions of Malay is estimated at 290 million, with 260 million living in Indonesia.

While most people know that Filipino, a standardized version of Tagalog, is an official language of the Philippines (the other being English), they may be surprised to learn that the U.S. is home to the third-largest concentration of Tagalog speakers outside of that country (after China and India). According to the U.S. Census, in 2019, Tagalog was the fourth-most spoken language (other than English) in the U.S.

Rosette added support for the morphological analysis of these three Southeast Asian languages in its version 1.23 release, along with entity extraction in Tagalog.

[1] The benefactive verb form is one in which the focus of the verb indicates that the subject of the sentence benefits from the verb’s action.