Parlez-vous 這種語言 aquí? | Language Identification

Language Identification

Instantly identify and triage many languages within large volumes of text to prepare for further analysis

Overview

What is language detection?

Global-minded companies must be able to work with data and content in dozens of languages. While most people can tell English text from Russian, only an expert can distinguish between very similar languages like Indonesian and Standard Malaysian, or Bulgarian and Serbian. Rosette does this automatically, enabling you to apply language-specific analytics to your data for greater accuracy and deeper insights.

Why do I need it?

Language identification is the prerequisite for accurate text analytics. Because translation introduces the risk of errors, analysis is more accurate when performed with language-specific tools on untranslated data. To apply the correct text analytics models to your content, you must first know which language it is written in.

Language identification categorizes content and improves search results, especially for multilingual documents. It applies to anything with text: social media, image captions, news headlines, email subject lines, tweets, metadata, keywords, queries, files, logs, and more.

Rosette leads the pack

Rosette outperforms most language identifiers on the market in detecting the language of tweets and queries, from short strings of as few as 1-3 words to full sentences. We also cover transliterated text in languages that can be written with more than one alphabet, such as Arabic chat or “Arabizi” (Arabic words in Latin script). Rosette language ID identifies both the dominant language in a body of text and the boundaries of different languages within a multilingual document.

Product Highlights

  • 55 languages
  • 18 language scripts (e.g. Latin, Cyrillic)
  • 188 language/encoding pairs
  • Identifies the dominant language of a document
  • Identifies different language regions within multilingual documents

How It Works

Superior coverage of language, encodings, and scripts

Rosette achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection.

Input

The input data may be in any of 364 language-encoding-script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
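As a rough illustration of the profile idea described above, the following Python sketch counts four-byte n-grams in a piece of text and keeps the most frequent ones. It is not Rosette's implementation: the function name `build_profile`, the UTF-8 encoding, and the choice to store relative frequencies are all assumptions made for this example.

```python
from collections import Counter


def build_profile(text, encoding="utf-8", n=4, top_k=5000):
    """Return the top_k most frequent byte n-grams with relative frequencies.

    Hypothetical helper: the encoding, n, and top_k defaults are
    illustrative assumptions, not Rosette's actual settings.
    """
    data = text.encode(encoding)
    # Slide a window of n bytes across the encoded text.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    top = counts.most_common(top_k)
    total = sum(count for _, count in top)
    if total == 0:
        return {}  # text shorter than n bytes yields an empty profile
    return {gram: count / total for gram, count in top}


profile = build_profile("abcabcabc")
for gram, weight in profile.items():
    print(gram, round(weight, 3))
```

A real profile would be built from a large corpus per language, encoding, and script rather than from a single string.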

Confidence

Rosette returns a confidence score with each language result, ranging from 0 to 1. It is a measurement that you can use as a threshold for filtering out undesired results.
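A minimal Python sketch of using the confidence score as a threshold might look like the following. The helper name `filter_by_confidence`, the threshold of 0.3, and the sample values are illustrative assumptions; the `language` and `confidence` field names follow the sample response shown elsewhere on this page.

```python
def filter_by_confidence(detections, threshold=0.3):
    """Keep only results whose confidence is at or above the threshold.

    Hypothetical helper; the 0.3 default is an arbitrary example value.
    """
    return [d for d in detections if d["confidence"] >= threshold]


# Illustrative values only.
detections = [
    {"language": "spa", "confidence": 0.387},
    {"language": "eng", "confidence": 0.327},
    {"language": "por", "confidence": 0.056},
]

kept = filter_by_confidence(detections, threshold=0.3)
print([d["language"] for d in kept])
```

The right threshold depends on your data; short or mixed-language input tends to produce lower scores, so it is worth tuning against a labeled sample.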

Technology

When input text is submitted for detection, an n-gram profile is built from that data in the same way. The input profile is then compared with every built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of distance from the input, so the closest match comes first.
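The comparison step can be sketched in Python as follows. The page does not say which vector distance Rosette uses, so this example substitutes cosine distance as an assumption; the helper names and the two tiny "built-in" profiles are likewise hypothetical and far smaller than real profiles would be.

```python
import math
from collections import Counter


def ngram_profile(text, n=4):
    """Relative frequencies of byte n-grams in UTF-8 encoded text."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity; an assumed stand-in for the real measure."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # nothing to compare, treat as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def rank_profiles(text, built_in):
    """Return (name, distance) pairs, closest profile first."""
    input_profile = ngram_profile(text)
    scored = [(name, cosine_distance(input_profile, prof))
              for name, prof in built_in.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy profiles built from single sentences, for illustration only.
built_in = {
    "eng": ngram_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "spa": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso y luego el gato"),
}

ranking = rank_profiles("the dog and the fox", built_in)
for name, distance in ranking:
    print(name, round(distance, 3))
```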

Short string language detection

For a number of languages, the endpoint provides a different proprietary algorithm for detecting the language of short strings (140 characters or fewer).
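Conceptually, this amounts to routing input by length. The Python sketch below shows one way such routing could look. The two detector functions are placeholders that return fixed labels, and the names `choose_detector` and `SHORT_STRING_LIMIT` are hypothetical. How Rosette actually selects the algorithm is not described here.

```python
SHORT_STRING_LIMIT = 140  # characters, per the limit stated above


def short_string_detector(text):
    """Placeholder for a short-string algorithm (hypothetical)."""
    return "short-string"


def general_detector(text):
    """Placeholder for the n-gram profile algorithm (hypothetical)."""
    return "n-gram"


def choose_detector(text):
    """Pick a detector based on the input length in characters."""
    if len(text) <= SHORT_STRING_LIMIT:
        return short_string_detector
    return general_detector


print(choose_detector("¿Dónde está la biblioteca?")("¿Dónde está la biblioteca?"))
print(choose_detector("x" * 141)("x" * 141))
```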

Tech Specs

Supported Languages

Albanian; Arabic; Arabic (transliterated); Bengali
Bulgarian; Catalan; Chinese, Simplified; Chinese, Traditional
Croatian; Czech; Danish; Dutch
English; Estonian; Finnish; French
German; Greek; Gujarati; Hebrew
Hindi; Hungarian; Icelandic; Indonesian
Italian; Japanese; Kannada; Korean
Kurdish; Kurdish (transliterated); Latvian; Lithuanian
Macedonian; Malay; Malayalam; Norwegian
Pashto; Pashto (transliterated); Persian; Persian (transliterated)
Polish; Portuguese; Romanian; Russian
Serbian; Serbian (transliterated); Slovak; Slovenian
Somali; Spanish; Swedish; Tagalog
Tamil; Telugu; Thai; Turkish
Ukrainian; Urdu; Urdu (transliterated); Uzbek
Uzbek (transliterated); Vietnamese

Short String Languages

Arabic; Chinese, Simplified; Chinese, Traditional; Czech
Danish; Dutch; English; Finnish
French; German; Greek; Hebrew
Hungarian; Italian; Japanese; Korean
Norwegian; Pashto; Persian; Portuguese
Romanian; Russian; Spanish; Swedish
Thai; Turkish

Cloud API

Easy to Use API

Ideal for product evaluation, academic research, and smaller, cost-conscious businesses, our fast and powerful API is instantly accessible and free to get started. The language ID endpoint identifies the dominant language within a document. To parse documents with multiple languages and identify language section boundaries, ask us about on-premise deployments.

Try language identification and the rest of Rosette’s endpoints, free up to 10,000 calls/month!

Get an API Key

Quality Documentation and Support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise Ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API, knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "languageDetections": [
    {
      "language": "spa",
      "confidence": 0.38719602327387076
    },
    {
      "language": "eng",
      "confidence": 0.32699986625091865
    },
    {
      "language": "por",
      "confidence": 0.05569054210624943
    },
    {
      "language": "deu",
      "confidence": 0.030069489878380328
    },
    {
      "language": "swe",
      "confidence": 0.027734757034048835
    }
  ]
}
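A response of this shape can be consumed with a few lines of Python. The sketch below parses the JSON above, picks the highest-confidence language, and reports the gap to the runner-up. The helper name `summarize` is hypothetical, and treating a small gap as a sign of mixed-language input is an assumption of this example rather than documented behavior.

```python
import json

# The sample response shown above, embedded as a string.
response_text = """
{
  "languageDetections": [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865},
    {"language": "por", "confidence": 0.05569054210624943},
    {"language": "deu", "confidence": 0.030069489878380328},
    {"language": "swe", "confidence": 0.027734757034048835}
  ]
}
"""


def summarize(body):
    """Return (top language, confidence margin over the runner-up)."""
    detections = sorted(
        json.loads(body)["languageDetections"],
        key=lambda d: d["confidence"],
        reverse=True,
    )
    if not detections:
        return None, 0.0
    top = detections[0]
    runner_up = detections[1]["confidence"] if len(detections) > 1 else 0.0
    return top["language"], top["confidence"] - runner_up


language, margin = summarize(response_text)
print(language, round(margin, 3))
```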

On Premise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

On-premise language identification can identify the dominant language of an entire document and break down the language regions within multilingual content.

Request Product Evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop Us a Line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

Learn More

Going from FAST to Solr

Read More
Road to Japan: How to Yelp Like a Native

Read More

No coding required

RapidMiner is the industry’s #1 predictive analytics platform. The client platform, RapidMiner Studio, empowers organizations to easily prep data, create models and operationalize predictive analytics within any business process.

Try RapidMiner