Parlez-vous 這種語言 aquí?
Your content could be written in a city in South Asia or a rural town in Eastern Europe. With so many languages intersecting and overlapping, only an expert can distinguish between Indonesian and Standard Malay, or Bulgarian and Serbian. Rosette does this automatically.
What is it used for?
Language identification is a prerequisite for accurate text analytics. It categorizes content and improves search results, especially for multilingual documents. A document is anything with text:
email subject lines
Detection by the numbers
|| languages (26 for short strings)
|| language scripts (7 Latin variants)
|| language/encoding pairs
|| legacy encodings
- Identifies the dominant language of a document
- Identifies the language and language scripts (e.g. Latin, Cyrillic) within the document
- Identifies different language regions within multilingual documents
- Outperforms most language identifiers at detecting the language of tweets and queries (from as few as 1-3 words up to a full sentence)
- Works with languages that have been transliterated or written with more than one alphabet, such as Arabic chat (Arabic in Latin script)
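To make the idea of scripts and regions within a document concrete, here is a minimal sketch using only the Python standard library. It is not Rosette's API or method: the function names (`script_of`, `script_runs`, `dominant_script`) are hypothetical, and the script label is simply the first word of each character's Unicode name, which is a rough approximation. It detects scripts only, so it cannot tell Bulgarian from Serbian (both can be written in Cyrillic); that step needs a trained language model.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label: first word of the Unicode character name (an approximation)."""
    if not ch.isalpha():
        return None  # spaces, digits, and punctuation carry no script of their own
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_runs(text):
    """Split text into (script, substring) regions; neutral characters join the current region."""
    runs = []  # each entry is [script, substring]
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch          # same script, or a neutral character
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script       # leading neutral characters adopt the first script seen
            runs[-1][1] += ch
        else:
            runs.append([script, ch])  # script changed: start a new region
    return [(script, chunk) for script, chunk in runs]


def dominant_script(text):
    """Return the script with the most alphabetic characters, or None if there are none."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    sample = "Parlez-vous 這種語言 aquí?"
    for script, chunk in script_runs(sample):
        print(script, repr(chunk))
    print("dominant:", dominant_script(sample))
```

Run on the headline of this page, the sketch reports three regions (Latin, CJK, Latin) with Latin as the dominant script. Transliterated text such as Arabic chat would be labeled Latin here, which is exactly why script detection alone is not language identification.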