Robust text normalization for Arabic script languages: Arabic, Persian, Pashto, and Urdu
Commercial-strength analysis of text in Arabic script
Arabic base linguistics is the essential first step in analyzing documents written in Arabic script. Designed to plug into mainstream search engines and data mining applications, it performs orthographic and lexical normalization of text in Arabic script. Specific features include:
- Normalizing orthography, including the removal of vowel and nunation signs, unification of hamza forms, and the removal of kashida (tatweel)
- Normalizing irregular “broken” plural forms to the correct singular form
- Normalizing Arabic numerical expressions to their Latin counterparts
- Applying additional normalization rules specific to Arabic, Persian, and Urdu
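The orthographic steps listed above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `normalize`, the choice of target letters for hamza unification, and the restriction to digit mapping (rather than full numerical expressions) are all assumptions made for illustration. Broken-plural normalization is not attempted, since it needs a lexicon.

```python
import re

# Short-vowel and nunation signs (U+064B-U+0652), combining maddah/hamza
# marks (U+0653-U+0655), and superscript alef (U+0670).
MARKS = re.compile("[\u064B-\u0655\u0670]")
TATWEEL = "\u0640"  # kashida, used only to stretch words visually

# One common convention for unifying hamza carriers; the exact target
# forms are an assumption of this sketch.
HAMZA_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # yeh with hamza        -> yeh
})

# Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic (U+06F0-U+06F9,
# used in Persian and Urdu) digits mapped to ASCII digits.
DIGIT_MAP = {base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)}


def normalize(text: str) -> str:
    """Strip vowel signs and tatweel, unify hamza forms, map digits."""
    text = MARKS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(HAMZA_MAP)
    return text.translate(DIGIT_MAP)
```

With this sketch, the fully vocalized بُيُوتُهُمْ is reduced to the unvocalized بيوتهم, so both spellings index to the same string.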
In addition, the text is tokenized (divided into words) for Persian (Farsi and Dari), Pashto, and Urdu.
The analysis of each token for Arabic, Persian and Urdu produces:
- For Arabic, the normalized form of the token, a part-of-speech tag, a stem, a lemma and a Semitic root. Stems and lemmas result from stripping most of the inflectional morphemes, while Semitic roots result from stripping derivational morphemes.
- For Persian, the normalized form of the token, a part-of-speech tag, and a stem. Some analyses also include a lemma.
- For Urdu, the normalized form of the token, a part-of-speech tag, and a stem.
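The per-language outputs above could be modeled as a single record with optional fields. The sketch below is only an illustration: the class `TokenAnalysis`, the table `EXPECTED_FIELDS`, and the helper `missing_fields` are hypothetical names, not part of any documented API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TokenAnalysis:
    """Hypothetical container for one analyzed token."""
    normalized: str
    pos: str                      # part-of-speech tag
    stem: str
    lemma: Optional[str] = None   # Arabic; some Persian analyses
    root: Optional[str] = None    # Semitic root, Arabic only


# Fields the text says every analysis includes, per language.
# Persian lemmas are omitted here because only some analyses carry one.
EXPECTED_FIELDS = {
    "arabic": ("normalized", "pos", "stem", "lemma", "root"),
    "persian": ("normalized", "pos", "stem"),
    "urdu": ("normalized", "pos", "stem"),
}


def missing_fields(language: str, analysis: TokenAnalysis) -> List[str]:
    """Return the expected fields that are absent from an analysis."""
    return [
        name for name in EXPECTED_FIELDS[language]
        if getattr(analysis, name) is None
    ]
```

A consumer can use such a check to reject incomplete Arabic analyses while accepting Urdu analyses that carry no lemma or root.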
Why is Arabic text normalization and analysis necessary?
Arabic script languages attach affixes (similar to prefixes and suffixes in English) to the beginning, middle, and end of words, so simply searching for an exact match will miss many relevant results.
Arabic is challenging for standard automatic analysis techniques that rely on a language’s written form. Affixes and other grammatical elements are used in Arabic to indicate attributes such as verb aspect, object, conjugation, person, number, and gender. Articles (“an,” “the”) and determiners (“his,” “their”) are not separate words as they are in languages like English but are attached to the words to which they refer. For example, “their houses” is written as a single token, بُيُوتُهُمْ.
Written Modern Standard Arabic is also ambiguous because vowels are marked inconsistently or omitted altogether. These ambiguities, combined with a lack of normalization, can decrease the accuracy of Arabic natural language processing. Arabic text therefore requires significant preprocessing before it can be accurately indexed, searched, or otherwise processed.
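A small Python sketch makes the exact-match problem concrete. It uses a toy "light stemming" approach (stripping vowel signs, the definite article, and a few attached pronoun suffixes), which is a deliberately simplified stand-in and not the lexicon-based analysis described here. The function name `light_stem` and the short suffix list are assumptions for illustration only.

```python
import re

VOWEL_SIGNS = re.compile("[\u064B-\u0652]")   # short vowels, nunation, sukun
DEFINITE_ARTICLE = "\u0627\u0644"             # al- ("the")
# A few attached pronoun suffixes (illustrative, far from exhaustive).
PRONOUN_SUFFIXES = (
    "\u0647\u0645",  # -hum ("their")
    "\u0647\u0627",  # -ha  ("her")
    "\u0647",        # -h   ("his")
)


def light_stem(token: str) -> str:
    """Toy stemmer: drop vowel signs, the article, and one pronoun suffix."""
    t = VOWEL_SIGNS.sub("", token)
    if t.startswith(DEFINITE_ARTICLE) and len(t) > 4:
        t = t[len(DEFINITE_ARTICLE):]
    for suffix in PRONOUN_SUFFIXES:
        # Keep at least three letters so short words are not destroyed.
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: -len(suffix)]
    return t


query = "\u0628\u064A\u0648\u062A"  # "houses", unvocalized
token = "\u0628\u064F\u064A\u064F\u0648\u062A\u064F\u0647\u064F\u0645\u0652"  # "their houses"

exact_match = query == token                              # False
stemmed_match = light_stem(query) == light_stem(token)    # True
```

Rule-based stripping of this kind over-stems and under-stems many real words, which is one reason a full morphological analysis is needed for reliable results.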
- Fine-grained, language-specific text normalization for Arabic, Persian (Iranian Persian and Dari), Pashto, and Urdu
- Token-level analysis that provides, depending on the language, a normalized form of the token, a part-of-speech tag, a stem, a lemma, and a Semitic root
- Stopword mechanism to designate words to ignore
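A stopword mechanism can be as simple as a set lookup applied after tokenization. The sketch below assumes a hypothetical, hand-picked stopword set and a hypothetical function name `remove_stopwords`; how the product actually configures stopwords is not described here.

```python
from typing import Iterable, List

# Hypothetical stopword set: a few common Arabic function words.
STOPWORDS = frozenset({
    "\u0641\u064A",        # fi   ("in")
    "\u0645\u0646",        # min  ("from")
    "\u0639\u0644\u0649",  # ala  ("on")
})


def remove_stopwords(
    tokens: Iterable[str], stopwords: frozenset = STOPWORDS
) -> List[str]:
    """Return the tokens that are not designated as stopwords."""
    return [t for t in tokens if t not in stopwords]
```

Filtering works best on normalized tokens, so that vocalized and unvocalized spellings of the same function word are both caught by a single entry.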