Semantic Similarity


Empowering cross-lingual semantic search and related-term generation. Our business-ready text vectors can represent text of any length and map across languages.

Overview

What is semantic similarity?

In short, semantic similarity is our implementation of a technology called text embedding. One of the most useful new technologies for natural language processing, text embedding transforms words into numerical representations (vectors) that approximate the conceptual distance between word meanings.

Semantic similarity is useful for cross-language search, duplicate document detection, and related-term generation.

Cross-language semantic search

A keyword search returns what you said, not necessarily what you meant. Keyword searches don’t understand the meaning of the keywords used, and they choke on words that have multiple meanings or synonyms.

One word, many meanings (polysemy)

“fast” can mean:

  • “quick”
  • “abstain from food”
  • “firmly fixed—attached”

Many words, one meaning (synonymy)

  • clothes = apparel
  • jewel = gem
  • hot dog = sandwich (to some people)

Semantic search is the opposite: the meaning is searched rather than the literal word. Thus a search for “clothes” will return articles that mention “apparel,” and a search for “gem” will return articles that mention “jewels.” Whether a search for “hot dog” can or should return articles about “sandwiches” is still a matter of opinion.

What is really exciting is searching semantically across languages. By mapping text embeddings between languages, we can now find “croissant sausage” by searching for “pigs in a blanket” or even “Сосиска в тесте” (Russian for “sausage in dough”).

Duplicate document detection

In numerous situations, such as eDiscovery, detecting near-duplicate documents can save weeks or more of expensive human labor. Rosette offers a business-ready workflow: first identifying the language(s) in a set of documents, so you know up front which specialists are needed, and then detecting duplicate documents that can be eliminated from the queue of documents for human lawyers to review.

Text embeddings can detect plagiarism even when a sentence or two has been moved or modified, which would foil many plagiarism checkers.

If you are interested in taking a second measurement of document similarity, Rosette Base Linguistics generates word lemmas that can be used in a bag-of-words model.
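As a minimal sketch of that second measurement, a bag-of-words comparison reduces each document to lemma counts and scores their overlap with cosine similarity. The lemma lists below are hard-coded stand-ins for lemmatizer output, not actual Rosette results:

```python
from collections import Counter
from math import sqrt

def cosine_bow(lemmas_a, lemmas_b):
    """Cosine similarity between two bag-of-words (lemma-count) vectors."""
    a, b = Counter(lemmas_a), Counter(lemmas_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hard-coded stand-ins for lemmatizer output on two near-duplicate sentences.
doc1 = ["the", "cat", "chase", "the", "mouse"]
doc2 = ["the", "cat", "chase", "a", "rat"]
score = cosine_bow(doc1, doc2)  # high lemma overlap, but not identical
```

Because lemmas collapse inflected forms (“chased,” “chasing” → “chase”), this measure is more forgiving of surface rewording than raw token matching.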

Related term generation

In some applications, turning all of the available documents into vectors and comparing those vectors is not practical. For these situations, we offer similar term generation. From any of our supported languages, you can generate as many terms in the other languages as you need. These terms can be stored in the search index, used for query expansion, or shown to the end user as keyword suggestions.
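As a sketch of how such a call might look over HTTP (the endpoint URL and the `X-RosetteAPI-Key` header name below are assumptions; consult the current API documentation for the authoritative values):

```python
import json
from urllib import request

# Assumed endpoint URL and API-key header name; check the API docs
# for the authoritative values.
API_URL = "https://api.rosette.com/rest/v1/semantics/similar"

def build_similar_terms_request(term, result_languages, api_key):
    """Assemble an HTTP request for the similar-terms endpoint."""
    payload = {"content": term, "options": {"resultLanguages": result_languages}}
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-RosetteAPI-Key": api_key, "Content-Type": "application/json"},
    )

# Mirrors the sample request shown later in this document;
# send it with urllib.request.urlopen(req).
req = build_similar_terms_request("spy", ["spa", "deu", "jpn"], "your-api-key")
```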

Product highlights

  • 9 supported languages
  • Cloud and secure on-premise deployments
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

How It Works

Conceptual distance

From one perspective, a cat is very different from a pig; after all, a cat is a pet and a pig typically is not. From another view, however, they are both animals, and much more similar to each other than to the moon. So we would expect the vectors for “cat” and “pig” to be closer together than the vectors for “cat” and “moon.” This principle of relative conceptual distance is at the crux of calculating semantic similarity.
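A minimal sketch of that idea, using invented three-dimensional vectors (real embeddings, like the response shown under Tech Specs, have hundreds of dimensions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-dimensional vectors, for illustration only.
cat = [0.9, 0.8, 0.1]
pig = [0.8, 0.9, 0.2]
moon = [0.1, 0.1, 0.9]

cat_pig = cosine(cat, pig)    # high: both animals
cat_moon = cosine(cat, moon)  # low: conceptually distant
```

With real embeddings the same comparison works unchanged; only the vector length differs.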

Converting words to vectors

Text embeddings are the mathematical representations of words as vectors. They are created by analyzing a body of text and representing each word, phrase, or entire document as a vector in a high-dimensional space (similar to a multi-dimensional graph). Once text has been mapped as vectors, it can be added, subtracted, multiplied, or otherwise transformed to mathematically express or compare the relationships between different words, phrases, and documents.
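For instance, the classic analogy “king − man + woman ≈ queen” is just element-wise arithmetic on word vectors. The two-dimensional vectors here are invented purely for illustration:

```python
def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Element-wise vector subtraction."""
    return [a - b for a, b in zip(u, v)]

# Invented 2-dimensional vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
king, queen = [0.9, 0.8], [0.9, -0.8]
man, woman = [0.1, 0.8], [0.1, -0.8]

result = add(sub(king, man), woman)  # king - man + woman: lands near queen
```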

Word embedding vs. text embedding

Word embedding, the cutting edge of today’s natural language processing and deep learning technology, is the mapping to vectors of individual words. Text embedding takes the process a step further by creating vectors for phrases, paragraphs, and documents as well. Word embeddings show that “king” is similar to “queen” but not to “avalanche,” while text embeddings can show that The Book of John is similar to The Book of Luke, but not to Harry Potter.

Tech Specs

Availability and platform support

Deployment availability:
Bindings:

Supported languages

  • Arabic
  • Chinese, Simplified & Traditional
  • English
  • German
  • Japanese
  • Korean (North dialect)
  • Korean (South dialect)
  • Russian
  • Spanish

Rosette Cloud

Easy to use

Built for the most demanding text analytics applications and engineered to deliver high accuracy without sacrificing speed, Rosette Cloud is instantly accessible and offers a variety of plans to suit both startups and enterprises.

Try text embedding and the rest of Rosette’s endpoints: sign up today for a 30-day free trial!

Get a Rosette Cloud Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various endpoints alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on the cloud knowing that scalable, customizable, enterprise deployments are available if you need them.

Semantic Vectors /semantics/vector

{"content": "Cambridge, Massachusetts"}
 
{
  "documentEmbedding": [
    0.0220256,
    0.03633998,
    0.05246549,
    -0.03751056,
    0.0347335,
    0.02479751,
    -0.03860506,
    0.00603574,
    -0.04244069,
    0.00521813,
    0.01740657,
    -0.08501768,
    -0.01918706,
    -0.05974227,
    0.00762913,
    -0.00020686,
    -0.04639495,
    0.00458408,
    0.01220596,
    0.06160719,
    -0.03988802,
    -0.03095652,
    -0.01182547,
    0.04861571,
    0.02967435,
    -0.04560868,
    -0.16111824,
    -0.06562275,
    0.00208866,
    0.01622739,
    -0.09196278,
    0.13520485,
    0.03665138,
    -0.01748736,
    0.05908763,
    0.07113674,
    0.04435388,
    -0.04436791,
    -0.0018729,
    -0.03612895,
    0.00324841,
    0.0218222,
    0.00414962,
    0.02750619,
    -0.00466647,
    -0.03516347,
    0.00061686,
    0.03071387,
    0.060716,
    -0.05394382,
    -0.03460756,
    -0.0916905,
    -0.04351116,
    0.03095916,
    0.07264832,
    0.00440244,
    -0.06487004,
    -0.0124327,
    -0.02594845,
    0.06403252,
    0.05990276,
    0.08421157,
    0.00113943,
    -0.05188083,
    0.01336752,
    0.05737128,
    0.0868928,
    -0.02797472,
    0.02951868,
    -0.06528687,
    -0.02593506,
    -0.1377904,
    0.05021935,
    -0.00331138,
    0.00345429,
    -0.0806604,
    -0.02997256,
    0.04178474,
    -0.16860084,
    -0.00202994,
    0.04082655,
    0.04052638,
    -0.02616019,
    -0.07079905,
    0.04114204,
    -0.05405192,
    -0.02079529,
    0.03362259,
    0.12866253,
    0.04686183,
    0.03205459,
    0.01844979,
    0.10577367,
    -0.04331236,
    0.03550498,
    0.03498939,
    -0.05236725,
    0.05650697,
    -0.03229797,
    -0.05911481,
    0.08041807,
    -0.01093418,
    -0.04541076,
    0.00499057,
    0.03379054,
    0.01985912,
    0.05434353,
    -0.06876269,
    -0.02142489,
    -0.04368682,
    -0.02340091,
    0.04271708,
    -0.03868493,
    0.03260612,
    -0.00310602,
    -0.08135383,
    0.03890613,
    0.05206529,
    0.01902638,
    -0.03261049,
    -0.01225097,
    -0.04929554,
    0.06811376,
    -0.10045446,
    -0.03772711,
    0.06436889,
    0.0335337,
    0.03110947,
    -0.01010367,
    -0.03986244,
    0.01340914,
    -0.06304926,
    0.05365673,
    -0.07044137,
    0.06421522,
    0.0632241,
    -0.04348637,
    0.13118945,
    -0.02082631,
    0.07590587,
    -0.04813327,
    -0.02577493,
    0.05642929,
    0.00033935,
    -0.01024516,
    0.06391647,
    0.03264675,
    -0.02187326,
    0.04832495,
    0.02241259,
    0.05681982,
    -0.04124964,
    0.08708096,
    0.06066873,
    -0.03356391,
    -0.03327714,
    -0.03449181,
    -0.02047219,
    0.06597982,
    0.08629483,
    0.03777988,
    0.01191289,
    0.10955901,
    -0.05159367,
    0.00001431,
    -0.00435081,
    -0.07139333,
    -0.10915583,
    -0.06582265,
    -0.02754464,
    0.04510804,
    0.09508634,
    -0.02923319,
    0.03627863,
    0.02647047,
    0.06838391,
    0.07216309,
    -0.00809051,
    0.07248835,
    0.0123264,
    -0.09173338,
    -0.02095788,
    0.02871792,
    -0.03392723,
    0.05959549,
    -0.10397915,
    -0.03820326,
    -0.05222115,
    -0.02296818,
    -0.06410559,
    0.02745123,
    0.02334865,
    -0.02446206,
    -0.12417631,
    -0.01871051,
    0.02439541,
    -0.02481432,
    -0.03880155,
    0.04188481,
    0.02300973,
    0.10600527,
    0.02696968,
    0.02788247,
    0.05024018,
    0.05907565,
    0.02856795,
    -0.00740766,
    0.02289764,
    -0.0643627,
    -0.00749485,
    -0.03111451,
    0.06580845,
    0.02102997,
    -0.10717536,
    0.16490568,
    0.03047366,
    -0.02454999,
    0.07184675,
    -0.02504459,
    -0.11541119,
    0.03915355,
    -0.03187835,
    -0.05494586,
    -0.15862629,
    -0.02779816,
    0.00724561,
    0.00901807,
    -0.01519001,
    0.04528573,
    -0.05221211,
    0.01260346,
    -0.01652065,
    0.01324382,
    -0.01688977,
    0.01070876,
    -0.03916383,
    -0.03296183,
    -0.06774635,
    -0.05388693,
    -0.01320887,
    0.07467077,
    0.06863626,
    -0.06439278,
    0.06113409,
    -0.00122581,
    -0.0411741,
    0.11657882,
    -0.01979883,
    -0.01714609,
    -0.00621283,
    0.05906631,
    0.00404663,
    0.02791196,
    -0.11955266,
    -0.0623432,
    -0.12302965,
    0.04749805,
    -0.05722075,
    0.08342554,
    -0.0616898,
    0.0171079,
    0.1030134,
    0.00575187,
    -0.01223959,
    -0.01106031,
    0.02733183,
    -0.05465746,
    -0.00639093,
    0.10582153,
    0.05119603,
    -0.16957831,
    0.0605646,
    0.05737981,
    0.12555394,
    -0.00963913,
    -0.15966235,
    0.06239227,
    -0.01519997,
    -0.00653814,
    -0.01759958,
    -0.00281965,
    -0.07387377,
    0.01542045,
    -0.01574635,
    0.09960862,
    0.06726488,
    0.01381977,
    0.03104461,
    0.05140565,
    -0.08996302,
    0.06713541,
    -0.10765704,
    -0.00975681,
    0.15130819,
    0.0128835,
    -0.00251494,
    -0.02743187,
    0.00955417,
    -0.10639542,
    0.04656886
  ]
}

Similar Terms /semantics/similar

{"content": "spy", "options": {"resultLanguages": ["spa", "deu", "jpn"]}}

{
  "similarTerms": {
    "spa": [
      {
        "term": "espía",
        "similarity": 0.61295485
      },
      {
        "term": "cia",
        "similarity": 0.46201307
      },
      {
        "term": "desertor",
        "similarity": 0.42849663
      },
      {
        "term": "cómplice",
        "similarity": 0.36646274
      },
      {
        "term": "subrepticiamente",
        "similarity": 0.36629659
      },
      {
        "term": "asesino",
        "similarity": 0.36264464
      },
      {
        "term": "misterioso",
        "similarity": 0.35466132
      },
      {
        "term": "fugitivo",
        "similarity": 0.35033143
      },
      {
        "term": "informante",
        "similarity": 0.34707013
      },
      {
        "term": "mercenario",
        "similarity": 0.34658083
      }
    ],
    "jpn": [
      {
        "term": "スパイ",
        "similarity": 0.5544399
      },
      {
        "term": "諜報",
        "similarity": 0.46903181
      },
      {
        "term": "MI6",
        "similarity": 0.46344957
      },
      {
        "term": "殺し屋",
        "similarity": 0.41098994
      },
      {
        "term": "正体",
        "similarity": 0.40109193
      },
      {
        "term": "プレデター",
        "similarity": 0.39433435
      },
      {
        "term": "レンズマン",
        "similarity": 0.3918637
      },
      {
        "term": "S.H.I.E.L.D.",
        "similarity": 0.38338536
      },
      {
        "term": "サーシャ",
        "similarity": 0.37628397
      },
      {
        "term": "黒幕",
        "similarity": 0.37256041
      }
    ],
    "deu": [
      {
        "term": "Deckname",
        "similarity": 0.51391315
      },
      {
        "term": "GRU",
        "similarity": 0.50809389
      },
      {
        "term": "Spion",
        "similarity": 0.50051737
      },
      {
        "term": "KGB",
        "similarity": 0.49981388
      },
      {
        "term": "Informant",
        "similarity": 0.48774603
      },
      {
        "term": "Geheimagent",
        "similarity": 0.48700801
      },
      {
        "term": "Geheimdienst",
        "similarity": 0.48512384
      },
      {
        "term": "Spionin",
        "similarity": 0.47224587
      },
      {
        "term": "MI6",
        "similarity": 0.46969846
      },
      {
        "term": "Decknamen",
        "similarity": 0.44730526
      }
    ]
  }
}

Rosette Enterprise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise deployments to be hosted on your internal servers.

Request product evaluation

If your organization requires an enterprise solution, we’re happy to work with you to meet your business’s unique needs. For a free evaluation of Rosette Enterprise, please complete the form below and our Customer Engineering team will provide you with an evaluation package.

Drop us a line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

Select Rosette Customers

No Coding Required

RapidMiner

RapidMiner is the industry’s #1 predictive analytics platform. The client platform, RapidMiner Studio, empowers organizations to easily prep data, create models and operationalize predictive analytics within any business process.

Try RapidMiner