Language Identification
Mature

188 language-encoding pairs, covering 55 languages, 44 legacy encodings, and 7 Latin script variants.

Identify Languages In Large Volumes Of Unstructured Text


Language identification from a few words to whole documents

What languages or character encodings make up your text?

Important first step

Automatic language identification is the necessary first step for applications that categorize, search, process, and store text in many languages. Route individual documents into language-specific analysis pipelines to improve the quality of search results.
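As a rough illustration of the routing step described above, the following Python sketch groups documents by detected language so that each group can be handed to its own analysis pipeline. Everything in it is an assumption made for illustration: `route_by_language`, `toy_detect`, and the stopword sets are hypothetical, and the tiny stopword-based detector is only a stand-in for a real language identifier, not this product's SDK or API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stopword sets, used only so the sketch runs on its own.
_EN_HINTS = {"the", "and", "is"}
_FR_HINTS = {"le", "la", "est", "et"}


def toy_detect(text: str) -> Optional[str]:
    """Placeholder detector: returns 'en', 'fr', or None if undecided.

    A real system would call an actual language identifier here.
    """
    words = set(text.lower().split())
    if words & _EN_HINTS:
        return "en"
    if words & _FR_HINTS:
        return "fr"
    return None


def route_by_language(
    docs: Iterable[str],
    detect: Callable[[str], Optional[str]],
    fallback: str = "und",  # ISO 639 code for "undetermined"
) -> Dict[str, List[str]]:
    """Group documents by detected language code.

    Documents the detector cannot classify go to the fallback bucket.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        lang = detect(doc) or fallback
        buckets[lang].append(doc)
    return dict(buckets)


if __name__ == "__main__":
    sample = ["the cat is here", "le chat est ici", "12345"]
    for lang, group in route_by_language(sample, toy_detect).items():
        print(lang, group)
```

Passing the detector in as a parameter keeps the routing logic independent of whichever identifier is actually used, so the placeholder can be swapped out without touching the grouping code.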

Short or long — we have you covered

If your application analyzes tweets, search keywords, or other short text, we offer market-leading language detection accuracy on input ranging from 1-3 words (<20 bytes) up to a full sentence.

Scalable

Deploy anywhere from desktop to cloud with our robust SDK or RESTful cloud API.

Identification features

  • Identifies the primary or dominant language of a document
  • Identifies the language scripts within the document, such as Latin and Cyrillic
  • Identifies different language regions within multilingual documents
  • Works with languages that have been transliterated or written with more than one alphabet, such as Arabic chat (Arabic in Latin script)
  • Accurate with short strings, from 1-3 words (<20 bytes) to a full sentence, enabling full analysis of search queries, tweets, image captions, metadata, news headlines, email subject lines, and more
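To make the idea of identifying scripts within a document more concrete, here is a minimal Python sketch that counts letters per script using only the standard library. It is not how this product works: `script_counts` and `utf8_len` are hypothetical helper names, and taking the first word of each Unicode character name (for example LATIN or CYRILLIC) is only a rough proxy for the official Unicode Script property, which the Python standard library does not expose.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by approximate script.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "ARABIC", "CJK") approximates the script.
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if name:
            counts[name.split()[0]] += 1
    return counts


def utf8_len(text: str) -> int:
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    sample = "Hello мир"
    print(script_counts(sample))
    print(utf8_len(sample))
```

The byte-length helper is included because short-string thresholds such as "<20 bytes" depend on the encoding: a few Cyrillic or Arabic words occupy more bytes in UTF-8 than the same number of Latin letters.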

Select Customers Include:

Bing, EMC, Turner, StumbleUpon, Equivio

Comprehensive Linguistic Coverage

  • 188 Language/Encoding Pairs
  • 55 Supported Languages
  • 7 Latin Script Variants
  • 44 Legacy Encodings

Supported Languages & Features

55 Supported Languages

  • Albanian
  • Arabic
  • Bengali
  • Bulgarian
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Italian
  • Japanese
  • Kannada
  • Korean
  • Kurdish
  • Latvian
  • Lithuanian
  • Macedonian
  • Malay
  • Malayalam
  • Norwegian
  • Pashto
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Swedish
  • Tagalog
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese

Short String Support

  • Arabic
  • Chinese (Traditional)
  • Chinese (Simplified)
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Italian
  • Japanese
  • Korean
  • Norwegian
  • Pashto
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Thai
  • Turkish