The power of words | Tokenization

Tokenization

Access the power of words regardless of language or spacing

Overview

What is tokenization?

Tokenization separates text into its most fundamental elements: words. While tokenization is necessary as the basis for text analytics in any language, it is especially important for scripts that do not use spaces between words, like Japanese and Chinese.
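To see why spaces alone are not enough, here is a minimal Python sketch. It uses only the built-in `str.split`, and the sample strings are illustrative; it is not how Rosette tokenizes text.

```python
# Whitespace splitting works when the script separates words with spaces...
english = "Beijing University Biology Department"
english_tokens = english.split()

# ...but text written without spaces comes back as one unbroken "token".
chinese = "北京大学生物系"
chinese_tokens = chinese.split()

print(english_tokens)   # four separate words
print(chinese_tokens)   # a single seven-character string
```

A search index built from the second result could only match the entire phrase, which is why such scripts need a dedicated tokenizer.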

Bigrams vs. statistical modeling

Many search tools use bigrams (overlapping two-character sequences) to index languages written without spaces between words. Rosette instead identifies and separates each word through advanced statistical modeling. This approach minimizes index size, enhances search accuracy, and increases relevancy.

An example

Let’s compare the two approaches for indexing 北京大学生物系 (“Beijing University Biology Department”). Bigramming produces six tokens (北京, 京大, 大学, 学生, 生物, 物系), which include two non-words and an incorrect word, 学生 (“student”). Rosette segments the phrase into two tokens: 北京大学 (“Beijing University”) and 生物系 (“Biology Department”). A query for the word 学生 (“student”) will correctly miss in a Rosette-tokenized index but incorrectly hit in the bigrammed index.
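The comparison above can be reproduced with a short Python sketch. The helper names `char_bigrams` and `hits` are hypothetical, and the two-token segmentation is hard-coded from the text rather than computed, since this sketch does not implement statistical segmentation.

```python
def char_bigrams(text):
    """Return every overlapping two-character window of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hits(index_tokens, query):
    """Exact-match lookup: does the query appear as a token in the index?"""
    return query in set(index_tokens)


phrase = "北京大学生物系"

# Bigram index: six overlapping two-character tokens.
bigram_tokens = char_bigrams(phrase)

# Word-level index: segmentation quoted from the text above (not computed here).
segmented_tokens = ["北京大学", "生物系"]

query = "学生"  # "student", which is not a word in the original phrase
print(len(bigram_tokens), len(segmented_tokens))
print(hits(bigram_tokens, query))     # the bigram index matches the query
print(hits(segmented_tokens, query))  # the word-level index does not
```

The token counts also illustrate the index-size point: the bigram index stores three times as many tokens for the same phrase.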

Product Highlights

  • 40 supported languages
  • Intuitive cloud API
  • Customizable SDK
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

Tech Specs

Availability and Platform Support

Deployment Availability:
Plugins:
Bindings:

Supported Languages

Albanian               Danish    Hebrew      Norwegian   Slovak
Arabic                 Dutch     Hungarian   Pashto      Slovenian
Bulgarian              English   Indonesian  Persian     Spanish
Catalan                Estonian  Italian     Polish      Swedish
Chinese (Simplified)   Finnish   Japanese    Portuguese  Thai
Chinese (Traditional)  French    Korean      Romanian    Turkish
Croatian               German    Latvian     Russian     Ukrainian
Czech                  Greek     Malay       Serbian     Urdu

Try the Demo

Cloud API

Easy to Use API

Ideal for product evaluation, academic research, and smaller, cost-conscious businesses, our fast and powerful API is instantly accessible and free to get started.

Try tokenization and the rest of Rosette’s endpoints, free up to 10,000 calls/month!

Get an API Key

Quality Documentation and Support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints, alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise Ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API, knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "tokens": [
    "北京大学",
    "生物系",
    "主任",
    "办公室",
    "内部",
    "会议"
  ]
}
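As a rough illustration of how a client might consume a response shaped like the sample above, the following Python sketch parses the JSON and builds a simple token-to-position lookup. The response body is copied from the sample; the helper `build_positions` is hypothetical and not part of any Rosette binding.

```python
import json

# Response body copied from the sample above.
response_body = """
{
  "tokens": ["北京大学", "生物系", "主任", "办公室", "内部", "会议"]
}
"""

tokens = json.loads(response_body)["tokens"]


def build_positions(token_list):
    """Hypothetical helper: map each token to the positions where it occurs."""
    positions = {}
    for i, token in enumerate(token_list):
        positions.setdefault(token, []).append(i)
    return positions


index = build_positions(tokens)
print(len(tokens))          # 6
print(index.get("生物系"))   # [1]
print(index.get("学生"))     # None, since "student" is not one of the tokens
```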

On Premise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

Request Product Evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop Us a Line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

Select Customers

Learn More

Going from FAST to Solr

Blog

Connecting to Asia with Rosette API

No coding required

RapidMiner is the industry’s #1 predictive analytics platform. The client platform, RapidMiner Studio, empowers organizations to easily prep data, create models and operationalize predictive analytics within any business process.

Try RapidMiner