Language Identifier

Instantly identify the language of whole documents or multiple language regions within each document

Overview

What is language detection?

Language identification is the first step in any text analysis or natural language processing pipeline. If the language of a document is misidentified, all subsequent language-specific models will produce inaccurate results. Errors at this stage can snowball and invalidate the whole analysis, as when an English language analyzer is applied to French text. It is vital to identify the language of each document and to establish whether any sections are in another language. Depending on the country and culture, documents very commonly contain multiple language regions.

Whether you are building a search engine or analyzing product reviews, Rosette® Language Identifier answers not just “Is it Korean or French?” but also “Is it North Korean or South Korean?”

What do I need from language detection?

Detecting the language of a monolingual document is largely a solved problem thanks to statistical language profiles, but other cases are harder:

  • Multilingual documents — What about a critical two-line email written in French with a 12-line legal footer in English? That email might fool many language identifiers into tagging the email as English, but Rosette detects language regions within each document, correctly tagging the language of each section to unlock the information inside.
  • Short texts — From a few words to a sentence, short texts appear everywhere, from social media, image captions, news headlines, email subject lines, and tweets to metadata, keywords, queries, files, and logs. Using statistical profiles only, short text language detection is challenging. Rosette uses a special short string language detection algorithm to overcome this issue with script and word awareness.

Language identification categorizes content and improves search results, especially for multilingual documents. It applies to anything with text: social media, image captions, news headlines, email subject lines, tweets, metadata, keywords, queries, files, logs, and more.

Basis Technology leads the pack

Our language identifier software outperforms the competition in detecting the language of:

  • Short texts — Such as the language of tweets and queries (from one to three words to a full sentence)
  • Multilingual documents — Rosette recognizes the dominant language in a body of text, as well as smaller sections in different languages, and tags the various language regions
  • Transliterated texts — At times, non-Latin languages (such as Arabic) may appear in Arabic script or Latin characters, and Rosette recognizes both
  • Dialect detection — Rosette distinguishes between the dialects of North Korea and South Korea.

Product highlights

  • Detects 56 languages
  • Supports 18 language scripts (e.g., Latin and Cyrillic)
  • Identifies 364 language/encoding pairs
  • Reports the dominant language of a document
  • Detects different language regions within multilingual documents
  • Delivers high accuracy based on as few as one to three words
  • Cloud, on-premise, and search plugin deployments

How It Works

Superior coverage of language, encodings, and scripts

Our language identifier achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection.

Input

The input data may be in any of 364 language-encoding-script combinations, covering 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., sequences of four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
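As a rough illustration of the kind of profile described above, the sketch below counts byte quad-grams in a sample and keeps only the most frequent ones. The function name build_profile, the use of relative frequencies as weights, and the sample text are assumptions made for this example; this is not Rosette's actual implementation or profile format.

```python
from collections import Counter


def build_profile(data: bytes, n: int = 4, max_ngrams: int = 5000) -> dict:
    """Return the most frequent byte n-grams in `data` with relative frequencies.

    Hypothetical helper for illustration only. The defaults mirror the
    quad-gram size and the 5,000 n-gram default for single-byte encodings
    mentioned in the text; the weighting scheme is an assumption.
    """
    # Slide a window of n bytes across the input and count each n-gram.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # input shorter than n bytes: nothing to profile
    # Keep only the top `max_ngrams` entries, normalized by the total count.
    return {gram: count / total for gram, count in counts.most_common(max_ngrams)}


# Example: profile a short French sample encoded in a single-byte encoding.
sample = "Bonjour tout le monde".encode("latin-1")
sample_profile = build_profile(sample)
print(len(sample_profile), "distinct quad-grams")
```

Because the profile is built from bytes rather than characters, the same text in a different encoding yields a different profile, which is consistent with profiles being specific to a language, encoding, and script.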

Technology

When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with every built-in profile by calculating a vector distance between the two. The built-in language profiles are returned in ascending order of distance, starting with the most likely language (i.e., the built-in profile with the shortest distance from the input text’s profile).
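The comparison step can be sketched as follows. The text does not say which vector distance is used, so cosine distance is an assumption here, and the two toy "built-in" profiles are built from single sentences purely for illustration. The names profile, cosine_distance, and rank_languages are hypothetical.

```python
import math
from collections import Counter


def profile(data: bytes, n: int = 4) -> Counter:
    """Count byte n-grams (quad-grams by default) in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 0.0 means identical direction, 1.0 means no overlap.

    Assumption: the source only says "a vector distance measure".
    """
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return 1.0 - dot / (norm_a * norm_b)


def rank_languages(text: bytes, builtin: dict) -> list:
    """Return (distance, language) pairs in ascending order of distance."""
    input_profile = profile(text)
    return sorted(
        (cosine_distance(input_profile, lang_profile), lang)
        for lang, lang_profile in builtin.items()
    )


# Toy stand-ins for built-in profiles; real profiles come from large corpora.
builtin = {
    "eng": profile(b"the quick brown fox jumps over the lazy dog and the cat"),
    "fra": profile(b"le renard brun saute par-dessus le chien paresseux et le chat"),
}
ranked = rank_languages(b"the dog and the fox", builtin)
print(ranked[0][1])  # language code of the closest profile
```

Sorting by distance puts the best candidate first, matching the ordering described above.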

Confidence

Our language identifier returns a confidence score ranging from 0 to 1 with each language result. You can compare this score against a threshold of your choosing to flag results that are “too close to be sure.”
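A minimal sketch of such a check is shown below. It reads a response shaped like the sample output later on this page (a languageDetections list with language and confidence fields). The function name flag_uncertain and the cutoff values 0.5 and 0.1 are assumptions chosen for illustration, not recommended settings.

```python
import json


def flag_uncertain(response: dict, min_confidence: float = 0.5,
                   min_margin: float = 0.1) -> tuple:
    """Return (best_language, is_uncertain) for a detection response.

    Hypothetical helper: a result is flagged when the top score is below
    `min_confidence` or when it leads the runner-up by less than `min_margin`.
    """
    detections = sorted(response.get("languageDetections", []),
                        key=lambda d: d["confidence"], reverse=True)
    if not detections:
        return None, True  # nothing detected: treat as uncertain
    best = detections[0]
    runner_up = detections[1]["confidence"] if len(detections) > 1 else 0.0
    uncertain = (best["confidence"] < min_confidence
                 or best["confidence"] - runner_up < min_margin)
    return best["language"], uncertain


# The first two entries of the sample output on this page.
raw = """
{
  "languageDetections": [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865}
  ]
}
"""
language, uncertain = flag_uncertain(json.loads(raw))
print(language, "needs review" if uncertain else "accepted")
```

With these illustrative cutoffs, the sample result is flagged, because the Spanish and English scores are both low and close together.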

Language Boundary Locator


Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans. Rosette Language Identifier marks the start and end of each language region within multilingual documents, even if all the languages are written in the same script, such as English, French, German, or Italian. It also detects the boundaries of each writing system, such as Latin, Cyrillic, Japanese Kana, or Chinese Hanzi.
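The writing-system part of this idea can be approximated with a short sketch. It uses the first word of each character's Unicode name as a crude script label, which is an assumption made for this example and not how Rosette works. It does not attempt the harder task of separating languages that share a script. The names script_of and script_regions are hypothetical.

```python
import unicodedata


def script_of(ch: str):
    """Crude script label taken from the Unicode character name.

    Returns None for characters that are not letters (spaces, digits,
    punctuation) or that have no Unicode name.
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]  # e.g. "LATIN", "CYRILLIC"
    except ValueError:
        return None


def script_regions(text: str) -> list:
    """Return (start, end, script) spans, with `end` exclusive."""
    regions = []
    start, current = None, None
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters stay attached to the current region
        if script != current:
            if current is not None:
                regions.append((start, i, current))
            start, current = i, script
    if current is not None:
        regions.append((start, len(text), current))
    return regions


print(script_regions("Hello мир"))
```

Each span carries start and end offsets, similar in spirit to the start and end markers described above.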

Short String Language Detection

For a number of languages, the language identifier uses additional proprietary algorithms for detecting the language of short strings (140 characters or fewer).

Tech Specs

Availability and platform support

Deployment availability:
Plugins:
Bindings:

Supported languages

Includes short string language identification, except for languages marked with *

Albanian                    German                      Latvian                     Slovak
Arabic                      Greek                       Lithuanian                  Slovenian
*Arabic (transliterated)    Gujarati                    Macedonian                  Somali
Bengali                     Hebrew                      Malay                       Spanish
Bulgarian                   Hindi                       Malayalam                   Swedish
Catalan                     Hungarian                   Norwegian                   Tagalog
Chinese, Simplified         Icelandic                   Pashto                      Tamil
Chinese, Traditional        Indonesian                  *Pashto (transliterated)    Telugu
Croatian                    Italian                     Persian                     Thai
Czech                       Japanese                    *Persian (transliterated)   Turkish
Danish                      Kannada                     Polish                      Ukrainian
Dutch                       Korean                      Portuguese                  Urdu
English                     Korean (North Dialect)      Romanian                    *Urdu (transliterated)
Estonian                    Korean (South Dialect)      Russian                     Uzbek
Finnish                     Kurdish                     Serbian                     Uzbek (transliterated)
French                      Kurdish (transliterated)    Serbian (transliterated)    Vietnamese

Sample output:
{
  "languageDetections": [
    {
      "language": "spa",
      "confidence": 0.38719602327387076
    },
    {
      "language": "eng",
      "confidence": 0.32699986625091865
    },
    {
      "language": "por",
      "confidence": 0.05569054210624943
    },
    {
      "language": "deu",
      "confidence": 0.030069489878380328
    },
    {
      "language": "swe",
      "confidence": 0.027734757034048835
    }
  ]
}

Deployment

Rosette Cloud

Sign up today for a free 30-day trial

The SaaS version of Rosette is quick to implement, low-maintenance, and ideal for users who wish to pay based on monthly call volume. Numerous bindings are supported through a RESTful API.

Rosette Server Edition

This on-premise private cloud deployment puts all the functionality of Rosette Cloud behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models.

Rosette Java Edition

For on-premise systems that need the low-latency, high-speed integration of an SDK, Rosette Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security.

Rosette Plugins

Just plug in Rosette for instant high-accuracy multilingual search and fuzzy name search for Elasticsearch or Apache Solr.

Quality documentation and support

Our support team responds to customers in less than a business day, and is committed to a satisfactory resolution. Users have access to in-depth documentation describing all the features, with code examples and a searchable knowledge base.

Visit our GitHub for bindings and documentation.

Questions?

Email: info@basistech.com

Phone: +1-617-386-2000

Select Customers Include

Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce enabling users to search every field, file, and object across multiple orgs and other data sources.
