Language Identification

Instantly identify and triage many languages within large volumes of text to prepare for further analysis



What is language detection?

Global-minded companies must be able to work with data and content in dozens of languages. While most people can tell English text from Russian, only an expert can distinguish between very similar languages like Indonesian and Standard Malaysian, or Bulgarian and Serbian. Rosette does this automatically, enabling you to apply language-specific analytics to your data for greater accuracy and deeper insights.

Why do I need it?

Language identification is the prerequisite for accurate text analytics. Because translation introduces errors, analysis is generally more accurate when it is performed with language-specific tools on untranslated data. To apply the correct text analytics models to your content, you must first know which language it is written in.

Language identification categorizes content and improves search results, especially for multilingual documents. It applies to anything with text: social media, image captions, news headlines, email subject lines, tweets, metadata, keywords, queries, files, logs, and more.

Rosette leads the pack

Rosette outperforms most language identifiers on the market in detecting the language of tweets and queries, from short strings of as few as 1-3 words up to full sentences. We also cover transliterated text for languages that can be written in more than one alphabet, such as Arabic chat or “Arabizi” (Arabic words written in Latin script). Rosette language ID identifies both the dominant language in a body of text and the boundaries between different languages within a multilingual document.

Product highlights

  • 56 languages
  • 18 language scripts (e.g. Latin, Cyrillic)
  • 364 language/encoding pairs
  • Identifies the dominant language of a document
  • Identifies different language regions within multilingual documents

How It Works

Superior coverage of language, encodings, and scripts

Rosette achieves its accuracy by combining proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection.


The input data may be in any of 364 language-encoding-script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
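As a rough illustration of the profile idea described above, the following Python sketch counts byte quad-grams and keeps the most frequent ones. It is not Rosette's implementation: the function name `build_profile`, the tie-breaking rule, and the use of raw counts are assumptions made for the example. Only the quad-gram size and the 5,000 default come from the text.

```python
from collections import Counter

def build_profile(data: bytes, n: int = 4, max_ngrams: int = 5000) -> list[tuple[bytes, int]]:
    """Return the most frequent byte n-grams in `data` with their counts.

    Hypothetical helper: n=4 and max_ngrams=5000 mirror the quad-gram size
    and the single-byte default mentioned in the text; the rest is assumed.
    """
    # Slide a window of n bytes over the input and count each n-gram.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Most frequent first; ties broken by byte value so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_ngrams]

# The same text yields different byte n-grams in different encodings,
# which is why a profile is tied to a language, an encoding, and a script.
utf8_profile = build_profile("привет, мир".encode("utf-8"))
koi8_profile = build_profile("привет, мир".encode("koi8_r"))
```

Because the n-grams are taken over bytes rather than characters, the UTF-8 and KOI8-R profiles of the same Russian phrase share no entries.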


Rosette returns a confidence score with each language result, ranging from 0 to 1. It is a measurement that you can use as a threshold for filtering out undesired results.
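A minimal sketch of threshold filtering, assuming a response shaped like the sample output shown on this page (a `languageDetections` list of `language` and `confidence` pairs). The helper name `filter_detections` and the 0.3 cutoff are illustrative choices, not recommended values.

```python
import json

# Sample values copied from the example response on this page.
response_text = """
{"languageDetections": [
  {"language": "spa", "confidence": 0.38719602327387076},
  {"language": "eng", "confidence": 0.32699986625091865},
  {"language": "por", "confidence": 0.05569054210624943},
  {"language": "deu", "confidence": 0.030069489878380328},
  {"language": "swe", "confidence": 0.027734757034048835}
]}
"""

def filter_detections(detections: list[dict], threshold: float) -> list[dict]:
    """Keep only detections whose confidence meets the threshold (hypothetical helper)."""
    return [d for d in detections if d["confidence"] >= threshold]

detections = json.loads(response_text)["languageDetections"]
# 0.3 is an arbitrary cutoff chosen for this example.
kept = filter_detections(detections, threshold=0.3)
kept_languages = [d["language"] for d in kept]
```

With this cutoff only the Spanish and English candidates remain. A suitable threshold depends on your data and on how costly a wrong language assignment is downstream.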


When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with each built-in profile by calculating a vector distance between the two. The built-in profiles are then returned in ascending order of distance from the input profile, so the closest match comes first.
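The comparison step can be sketched as follows. The text does not say which vector distance Rosette uses, so this example assumes cosine distance over raw quad-gram counts. The function names and the two tiny "profiles" are made up for illustration and are far smaller than real profiles.

```python
import math
from collections import Counter

def ngram_vector(data: bytes, n: int = 4) -> Counter:
    """Count byte n-grams in `data` (hypothetical helper)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; an assumed stand-in for the unspecified distance measure."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def rank_profiles(input_data: bytes, profiles: dict[str, Counter]) -> list[tuple[str, float]]:
    """Return (profile name, distance) pairs, closest profile first."""
    input_vec = ngram_vector(input_data)
    distances = [(name, cosine_distance(input_vec, prof)) for name, prof in profiles.items()]
    return sorted(distances, key=lambda pair: pair[1])

# Toy profiles built from one sentence each, for illustration only.
toy_profiles = {
    "eng": ngram_vector("the quick brown fox jumps over the lazy dog and the cat".encode("utf-8")),
    "spa": ngram_vector("el rápido zorro marrón salta sobre el perro perezoso y el gato".encode("utf-8")),
}
ranking = rank_profiles("the dog and the fox".encode("utf-8"), toy_profiles)
```

Here the English toy profile ranks first because the input shares several quad-grams with it and none with the Spanish one.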

Short string language detection

For a number of languages, the endpoint provides a different proprietary algorithm for detecting the language of short strings (140 characters or fewer).
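If you want to anticipate client-side which inputs fall under the short-string limit, a check along these lines may help. The function name and the returned labels are hypothetical, and counting Unicode code points with `len` is an assumption about how "characters" are measured.

```python
SHORT_STRING_LIMIT = 140  # limit stated in the text

def detection_mode(text: str) -> str:
    """Label the input as 'short-string' or 'standard' (hypothetical labels).

    len() counts Unicode code points, which is an assumption about how
    the 140-character limit is measured.
    """
    return "short-string" if len(text) <= SHORT_STRING_LIMIT else "standard"

mode = detection_mode("hola mundo")
```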

Tech Specs

Availability and platform support

Deployment availability:

Supported languages

Includes short string language identification, except for languages marked with *

Albanian                   German                     Latvian                    Slovak
Arabic                     Greek                      Lithuanian                 Slovenian
*Arabic (transliterated)   Gujarati                   Macedonian                 Somali
Bengali                    Hebrew                     Malay                      Spanish
Bulgarian                  Hindi                      Malayalam                  Swedish
Catalan                    Hungarian                  Norwegian                  Tagalog
Chinese, Simplified        Icelandic                  Pashto                     Tamil
Chinese, Traditional       Indonesian                 *Pashto (transliterated)   Telugu
Croatian                   Italian                    Persian                    Thai
Czech                      Japanese                   *Persian (transliterated)  Turkish
Danish                     Kannada                    Polish                     Ukrainian
Dutch                      Korean                     Portuguese                 Urdu
English                    Korean (North Dialect)     Romanian                   *Urdu (transliterated)
Estonian                   Korean (South Dialect)     Russian                    Uzbek
Finnish                    Kurdish                    Serbian                    Uzbek (transliterated)
French                     Kurdish (transliterated)   Serbian (transliterated)   Vietnamese

Try the Demo

Rosette Cloud

Easy to use

Built for the most demanding text analytics applications and engineered to deliver high accuracy without sacrificing speed, Rosette Cloud is instantly accessible and offers a variety of plans to suit both startups and enterprises.

The language ID endpoint identifies the dominant language within a document. To parse documents with multiple languages and identify language section boundaries, ask us about on-premise deployments.

Try language identification and the rest of Rosette’s endpoints: sign up today for a 30-day free trial!

Get an API Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API, knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "languageDetections": [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865},
    {"language": "por", "confidence": 0.05569054210624943},
    {"language": "deu", "confidence": 0.030069489878380328},
    {"language": "swe", "confidence": 0.027734757034048835}
  ]
}

Rosette Enterprise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

On-premise language identification can both identify the dominant language of an entire document and break down the language regions within multilingual content.

Request product evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop us a line



Select Customers Include


Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce, enabling users to search every field, file, and object across multiple orgs and other data sources.

View on AppExchange
