Parlez-vous 這種語言 aquí?


Your content could be written in a large city in South Asia or a rural town in Eastern Europe. With so many languages intersecting and overlapping, it usually takes an expert to distinguish Indonesian from Malay, or Bulgarian from Serbian. Rosette does this automatically.

What is it used for?

 

Language identification is a prerequisite for accurate text analytics. It categorizes content and improves search results, especially for multilingual documents. A document is anything with text:

social media
image captions
news headlines
email subject lines
tweets
metadata
keywords
queries
files
logs
more…

 

Detection by the numbers

55 languages (26 for short strings)
18 language scripts (7 Latin variants)
188 language/encoding pairs
44 legacy encodings
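
To illustrate why language and encoding are counted as pairs, here is a minimal Python sketch (not Rosette's method or API). It encodes one Russian word in a legacy encoding and decodes the bytes under several candidates. The helper name `try_decodings` is hypothetical. A wrong single-byte encoding often decodes without any error and silently yields the wrong letters, so a successful decode alone does not confirm the encoding.

```python
from typing import Dict, List, Optional


def try_decodings(raw: bytes, encodings: List[str]) -> Dict[str, Optional[str]]:
    """Decode raw bytes under each candidate encoding (hypothetical helper).

    Returns the decoded text per encoding, or None where decoding fails.
    """
    results: Dict[str, Optional[str]] = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


text = "Привет"                 # Russian for "hello"
raw = text.encode("koi8_r")     # stored in a legacy Cyrillic encoding

decoded = try_decodings(raw, ["utf-8", "koi8_r", "cp1251"])
for enc, value in decoded.items():
    print(enc, "->", value)
```

Only the `koi8_r` decoding reproduces the original word. The `utf-8` attempt fails outright, while `cp1251` succeeds but produces different Cyrillic letters, which is why a detector has to judge the decoded text itself.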

Features

  • Identifies the dominant language of a document
  • Identifies the language and language scripts (e.g. Latin, Cyrillic) within the document
  • Identifies different language regions within multilingual documents
  • Outperforms most language identifiers at detecting the language of tweets and queries (from as few as 1-3 words up to a full sentence)
  • Works with languages that have been transliterated or written with more than one alphabet, such as Arabic chat (Arabic in Latin script)
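
As a rough illustration of script identification and region splitting (not Rosette's own method or API), the following Python sketch labels each letter by the first word of its Unicode character name and groups the text into script runs. The helper names `script_of`, `script_runs`, and `dominant_script` are hypothetical. Script is only a coarse proxy for language: telling Bulgarian from Serbian, or recognizing Arabic written in Latin script, needs statistical models well beyond this.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Coarse script label taken from the first word of the Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # digits, punctuation, whitespace carry no script signal
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, substring) runs.

    Neutral characters attach to the preceding run; a leading neutral
    run is absorbed by the first run that has a real script.
    """
    runs: List[Tuple[str, str]] = []
    for ch in text:
        s = script_of(ch)
        if runs and (s == "COMMON" or s == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        elif runs and runs[-1][0] == "COMMON":
            runs[-1] = (s, runs[-1][1] + ch)
        else:
            runs.append((s, ch))
    return runs


def dominant_script(text: str) -> str:
    """Most frequent script among the letters of the text."""
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(script_runs("Hello мир"))
print(dominant_script("Hello мир"))
```

A production identifier would then classify the language inside each run, since many languages share a script.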

Select Customers Include:

Bing, EMC, Turner, StumbleUpon, Equivio

Blog: Accurate Language Detection for Queries & Tweets

Going from FAST to Solr

Supported Languages & Features

Languages (55)

  • Albanian
  • Arabic
  • Bengali
  • Bulgarian
  • Catalan
  • Chinese, Simp.
  • Chinese, Trad.
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Italian
  • Japanese
  • Kannada
  • Korean
  • Kurdish
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Norwegian
  • Pashto
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Swedish
  • Tagalog
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese

Short String Languages (26)

  • Arabic
  • Chinese, Simp.
  • Chinese, Trad.
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Italian
  • Japanese
  • Korean
  • Norwegian
  • Pashto
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Thai
  • Turkish
Language Identification
Released

188 language/encoding pairs, covering 55 languages, 44 legacy encodings, and 7 Latin script variants.