Semantic Similarity


Measure text similarity and identify relevant keywords in and across nine languages

Overview

What is semantic similarity?

Semantic similarity is our implementation of text embedding, a natural language processing technique that transforms words into numerical representations (vectors) whose distances approximate the conceptual distance between word meanings.

Semantic similarity is useful for cross-language search, duplicate document detection, and related-term generation.

Cross-language semantic search

A keyword search returns what you typed, not necessarily what you meant. These searches don’t understand the meaning of the keyword(s) used: they choke on words that have multiple meanings, and they miss words with the same meaning.

One word, many meanings (polysemy)

“Fast” can mean:

  • quick
  • abstain from food
  • firmly fixed, attached

Many words, one meaning (synonymy)

  • clothes = apparel
  • jewel = gem
  • hot dog = sandwich (to some people)

Semantic search is the opposite, as it looks for the meaning rather than the literal words. Thus a search for “clothes” will return articles that mention “apparel,” and a search for “gem” will return articles that mention “jewels.” Whether a search for “hot dog” can or should return articles about “sandwiches” is still a matter of opinion.

What is really exciting is searching semantically across languages. By mapping text embeddings between languages, we can now find “croissant sausage” by searching for “pigs in a blanket” or even “Сосиска в тесте” (Russian for “sausage in dough”).

Duplicate document detection

In numerous situations, such as eDiscovery, detecting near-duplicate documents can save weeks or more of expensive human labor. Rosette® offers a business-ready workflow: it identifies the language(s) in a set of documents, so you know up front which specialists are needed, and then detects duplicate documents that can be eliminated from the queue of documents awaiting review by human lawyers.

Text embeddings can detect plagiarism even when a sentence has been moved or modified, something that would foil many plagiarism checkers. Document similarity can be further strengthened by adding Rosette Base Linguistics, which generates word lemmas for use in a bag-of-words model.
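To make the bag-of-words idea concrete, here is a minimal sketch of near-duplicate detection over lemma bags using Jaccard similarity. The lemma lists are made up for illustration; in practice they would come from a lemmatizer such as Rosette Base Linguistics.

```python
def jaccard_similarity(doc_a, doc_b):
    """Jaccard similarity between two bags (sets) of lemmas: |A ∩ B| / |A ∪ B|."""
    a, b = set(doc_a), set(doc_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate sentences reduced to lemmas; word order is irrelevant,
# so a reordered or lightly edited copy still scores high.
original  = ["lawyer", "review", "the", "document", "queue"]
shuffled  = ["the", "document", "queue", "be", "review", "by", "lawyer"]
unrelated = ["cat", "sit", "on", "the", "mat"]

print(jaccard_similarity(original, shuffled))   # high overlap
print(jaccard_similarity(original, unrelated))  # low overlap
```

Because the comparison is over sets of lemmas rather than raw word sequences, moving or inflecting words does not change the score.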

Related term generation

Some applications need to compare the similarity of two words or generate similar terms. From any of our supported languages, you can:

  • Generate a list of similar terms to store in the search index to expand queries or offer keyword suggestions
  • Input the word “spy” and get “espía” in Spanish or “スパイ” in Japanese.

Similar terms are also helpful in matching company and product names.
Ambiguity clouds how people refer to companies and products: they may accidentally swap words with similar meanings. Semantic similarity detects those near matches because, for example:

  • The company name “Eagle Drugs” is more similar to “Eagle Pharmaceuticals” than it is to “Eagle Landscaping”
  • The product name “Hershey Chocolate Bar” is more similar to “Hershey Candy Bar” than it is to “Hershey Protein Bar.”
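One common way to compare multiword names is to average their per-word vectors and measure cosine similarity. The tiny 3-dimensional vectors below are invented for illustration; a real system would use embeddings from a trained model.

```python
import math

# Hand-made toy vectors: "drugs" and "pharmaceuticals" are deliberately
# placed near each other, "landscaping" far from both.
TOY_VECTORS = {
    "eagle":           [0.9, 0.1, 0.1],
    "drugs":           [0.1, 0.9, 0.1],
    "pharmaceuticals": [0.1, 0.8, 0.2],
    "landscaping":     [0.1, 0.1, 0.9],
}

def name_vector(name):
    """Average the word vectors of a multiword name."""
    words = [TOY_VECTORS[w] for w in name.lower().split()]
    return [sum(dim) / len(words) for dim in zip(*words)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

drugs       = name_vector("Eagle Drugs")
pharma      = name_vector("Eagle Pharmaceuticals")
landscaping = name_vector("Eagle Landscaping")

print(cosine(drugs, pharma) > cosine(drugs, landscaping))  # True
```

Even though all three names share the word “Eagle,” the averaged vectors separate the pharmacy names from the landscaper.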

Product highlights

  • Nine supported languages
  • Cloud and secure on-premise deployments
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

How It Works

Conceptual distance

From one perspective, a cat is very different from a pig. After all, a cat is a pet and a pig typically is not. However, in another view, they are both animals, and much more similar to each other than to the moon. Thus vectors for cat and pig are closer together than vectors of cat and moon. This principle of relative conceptual distance is at the crux of calculating semantic similarity.
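Relative conceptual distance can be sketched with cosine similarity over toy vectors. The 2-dimensional vectors below are chosen by hand purely to illustrate the idea; real embeddings have hundreds of dimensions.

```python
import math

# Hand-picked 2-D "concept" vectors: cat and pig point in roughly the
# same direction (animals), the moon points elsewhere.
vectors = {
    "cat":  [0.9, 0.2],
    "pig":  [0.8, 0.3],
    "moon": [0.1, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Cat and pig score much higher with each other than either does with the moon.
print(cosine_similarity(vectors["cat"], vectors["pig"]))
print(cosine_similarity(vectors["cat"], vectors["moon"]))
```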

Converting words to vectors

Text embeddings are the mathematical representations of words as vectors. They are created by analyzing a body of text and representing each word, phrase, or entire document as a vector in a high-dimensional space (similar to a multidimensional graph). Once text has been mapped as vectors, it can be added, subtracted, multiplied, or otherwise transformed to mathematically express or compare the relationships between different words, phrases, and documents.
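The arithmetic mentioned above can be sketched with the classic analogy demonstration: with real embeddings, king − man + woman lands near queen. The 2-D vectors here are invented and constructed so the identity holds exactly, purely to show the mechanics.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Toy vectors constructed so that king - man + woman == queen exactly;
# with real embeddings the result would only land *near* queen.
king  = [0.9, 0.8]
man   = [0.5, 0.1]
woman = [0.4, 0.1]
queen = [0.8, 0.8]

result = add(sub(king, man), woman)
print(result)  # [0.8, 0.8], i.e. the queen vector
```

Subtracting man from king isolates a rough “royalty minus maleness” offset; adding woman moves it to the feminine counterpart.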

Word embedding versus text embedding

Word embedding, the cutting edge of today’s natural language processing and deep learning technology, is the mapping of individual words to vectors. Text embedding takes the process a step further by creating vectors for phrases, paragraphs, and documents as well. Word embedding shows that “king” is similar to “queen” but not to “avalanche,” while text embedding can show that the Book of John is more similar to the Book of Luke than to Harry Potter and the Goblet of Fire.

Tech Specs

Availability and platform support

Deployment availability:
Bindings:

Supported languages

  • Arabic
  • Chinese (Simplified & Traditional)
  • English
  • French
  • German
  • Japanese
  • Korean (North and South Korean dialects)
  • Russian
  • Spanish
Semantic Vectors: /semantics/vector

Request:
{"content": "Cambridge, Massachusetts"}

Response:
{
  "documentEmbedding": [
    0.0220256,
    0.03633998,
    0.05246549,
    -0.03751056,
    0.0347335,
    0.02479751,
    -0.03860506,
    0.00603574,
    -0.04244069,
    0.00521813,
    0.01740657,
    -0.08501768,
    -0.01918706,
    -0.05974227,
    0.00762913,
    -0.00020686,
    -0.04639495,
    0.00458408,
    0.01220596,
    0.06160719,
    -0.03988802,
    -0.03095652,
    -0.01182547,
    0.04861571,
    0.02967435,
    -0.04560868,
    -0.16111824,
    -0.06562275,
    0.00208866,
    0.01622739,
    -0.09196278,
    0.13520485,
    0.03665138,
    -0.01748736,
    0.05908763,
    0.07113674,
    0.04435388,
    -0.04436791,
    -0.0018729,
    -0.03612895,
    0.00324841,
    0.0218222,
    0.00414962,
    0.02750619,
    -0.00466647,
    -0.03516347,
    0.00061686,
    0.03071387,
    0.060716,
    -0.05394382,
    -0.03460756,
    -0.0916905,
    -0.04351116,
    0.03095916,
    0.07264832,
    0.00440244,
    -0.06487004,
    -0.0124327,
    -0.02594845,
    0.06403252,
    0.05990276,
    0.08421157,
    0.00113943,
    -0.05188083,
    0.01336752,
    0.05737128,
    0.0868928,
    -0.02797472,
    0.02951868,
    -0.06528687,
    -0.02593506,
    -0.1377904,
    0.05021935,
    -0.00331138,
    0.00345429,
    -0.0806604,
    -0.02997256,
    0.04178474,
    -0.16860084,
    -0.00202994,
    0.04082655,
    0.04052638,
    -0.02616019,
    -0.07079905,
    0.04114204,
    -0.05405192,
    -0.02079529,
    0.03362259,
    0.12866253,
    0.04686183,
    0.03205459,
    0.01844979,
    0.10577367,
    -0.04331236,
    0.03550498,
    0.03498939,
    -0.05236725,
    0.05650697,
    -0.03229797,
    -0.05911481,
    0.08041807,
    -0.01093418,
    -0.04541076,
    0.00499057,
    0.03379054,
    0.01985912,
    0.05434353,
    -0.06876269,
    -0.02142489,
    -0.04368682,
    -0.02340091,
    0.04271708,
    -0.03868493,
    0.03260612,
    -0.00310602,
    -0.08135383,
    0.03890613,
    0.05206529,
    0.01902638,
    -0.03261049,
    -0.01225097,
    -0.04929554,
    0.06811376,
    -0.10045446,
    -0.03772711,
    0.06436889,
    0.0335337,
    0.03110947,
    -0.01010367,
    -0.03986244,
    0.01340914,
    -0.06304926,
    0.05365673,
    -0.07044137,
    0.06421522,
    0.0632241,
    -0.04348637,
    0.13118945,
    -0.02082631,
    0.07590587,
    -0.04813327,
    -0.02577493,
    0.05642929,
    0.00033935,
    -0.01024516,
    0.06391647,
    0.03264675,
    -0.02187326,
    0.04832495,
    0.02241259,
    0.05681982,
    -0.04124964,
    0.08708096,
    0.06066873,
    -0.03356391,
    -0.03327714,
    -0.03449181,
    -0.02047219,
    0.06597982,
    0.08629483,
    0.03777988,
    0.01191289,
    0.10955901,
    -0.05159367,
    0.00001431,
    -0.00435081,
    -0.07139333,
    -0.10915583,
    -0.06582265,
    -0.02754464,
    0.04510804,
    0.09508634,
    -0.02923319,
    0.03627863,
    0.02647047,
    0.06838391,
    0.07216309,
    -0.00809051,
    0.07248835,
    0.0123264,
    -0.09173338,
    -0.02095788,
    0.02871792,
    -0.03392723,
    0.05959549,
    -0.10397915,
    -0.03820326,
    -0.05222115,
    -0.02296818,
    -0.06410559,
    0.02745123,
    0.02334865,
    -0.02446206,
    -0.12417631,
    -0.01871051,
    0.02439541,
    -0.02481432,
    -0.03880155,
    0.04188481,
    0.02300973,
    0.10600527,
    0.02696968,
    0.02788247,
    0.05024018,
    0.05907565,
    0.02856795,
    -0.00740766,
    0.02289764,
    -0.0643627,
    -0.00749485,
    -0.03111451,
    0.06580845,
    0.02102997,
    -0.10717536,
    0.16490568,
    0.03047366,
    -0.02454999,
    0.07184675,
    -0.02504459,
    -0.11541119,
    0.03915355,
    -0.03187835,
    -0.05494586,
    -0.15862629,
    -0.02779816,
    0.00724561,
    0.00901807,
    -0.01519001,
    0.04528573,
    -0.05221211,
    0.01260346,
    -0.01652065,
    0.01324382,
    -0.01688977,
    0.01070876,
    -0.03916383,
    -0.03296183,
    -0.06774635,
    -0.05388693,
    -0.01320887,
    0.07467077,
    0.06863626,
    -0.06439278,
    0.06113409,
    -0.00122581,
    -0.0411741,
    0.11657882,
    -0.01979883,
    -0.01714609,
    -0.00621283,
    0.05906631,
    0.00404663,
    0.02791196,
    -0.11955266,
    -0.0623432,
    -0.12302965,
    0.04749805,
    -0.05722075,
    0.08342554,
    -0.0616898,
    0.0171079,
    0.1030134,
    0.00575187,
    -0.01223959,
    -0.01106031,
    0.02733183,
    -0.05465746,
    -0.00639093,
    0.10582153,
    0.05119603,
    -0.16957831,
    0.0605646,
    0.05737981,
    0.12555394,
    -0.00963913,
    -0.15966235,
    0.06239227,
    -0.01519997,
    -0.00653814,
    -0.01759958,
    -0.00281965,
    -0.07387377,
    0.01542045,
    -0.01574635,
    0.09960862,
    0.06726488,
    0.01381977,
    0.03104461,
    0.05140565,
    -0.08996302,
    0.06713541,
    -0.10765704,
    -0.00975681,
    0.15130819,
    0.0128835,
    -0.00251494,
    -0.02743187,
    0.00955417,
    -0.10639542,
    0.04656886
  ]
}
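For illustration, here is a sketch of assembling the request for the /semantics/vector endpoint, based on the request body shown above. The base URL and any authentication header are placeholders; the exact values depend on your deployment and the Rosette API documentation.

```python
import json

# Placeholder base URL; substitute your actual Rosette endpoint.
API_ROOT = "https://api.example.com/rest/v1"

def build_vector_request(text):
    """Return (url, headers, body) for a semantic-vector call.

    Only constructs the request; sending it (e.g. with the requests
    library) and authentication are left out of this sketch.
    """
    url = API_ROOT + "/semantics/vector"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"content": text})
    return url, headers, body

url, headers, body = build_vector_request("Cambridge, Massachusetts")
print(url)
print(body)
```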
Similar Terms: /semantics/similar

Request:
{"content": "spy", "options": {"resultLanguages": ["spa", "deu", "jpn"]}}

Response:
{
  "similarTerms": {
    "spa": [
      {
        "term": "espía",
        "similarity": 0.61295485
      },
      {
        "term": "cia",
        "similarity": 0.46201307
      },
      {
        "term": "desertor",
        "similarity": 0.42849663
      },
      {
        "term": "cómplice",
        "similarity": 0.36646274
      },
      {
        "term": "subrepticiamente",
        "similarity": 0.36629659
      },
      {
        "term": "asesino",
        "similarity": 0.36264464
      },
      {
        "term": "misterioso",
        "similarity": 0.35466132
      },
      {
        "term": "fugitivo",
        "similarity": 0.35033143
      },
      {
        "term": "informante",
        "similarity": 0.34707013
      },
      {
        "term": "mercenario",
        "similarity": 0.34658083
      }
    ],
    "jpn": [
      {
        "term": "スパイ",
        "similarity": 0.5544399
      },
      {
        "term": "諜報",
        "similarity": 0.46903181
      },
      {
        "term": "MI6",
        "similarity": 0.46344957
      },
      {
        "term": "殺し屋",
        "similarity": 0.41098994
      },
      {
        "term": "正体",
        "similarity": 0.40109193
      },
      {
        "term": "プレデター",
        "similarity": 0.39433435
      },
      {
        "term": "レンズマン",
        "similarity": 0.3918637
      },
      {
        "term": "S.H.I.E.L.D.",
        "similarity": 0.38338536
      },
      {
        "term": "サーシャ",
        "similarity": 0.37628397
      },
      {
        "term": "黒幕",
        "similarity": 0.37256041
      }
    ],
    "deu": [
      {
        "term": "Deckname",
        "similarity": 0.51391315
      },
      {
        "term": "GRU",
        "similarity": 0.50809389
      },
      {
        "term": "Spion",
        "similarity": 0.50051737
      },
      {
        "term": "KGB",
        "similarity": 0.49981388
      },
      {
        "term": "Informant",
        "similarity": 0.48774603
      },
      {
        "term": "Geheimagent",
        "similarity": 0.48700801
      },
      {
        "term": "Geheimdienst",
        "similarity": 0.48512384
      },
      {
        "term": "Spionin",
        "similarity": 0.47224587
      },
      {
        "term": "MI6",
        "similarity": 0.46969846
      },
      {
        "term": "Decknamen",
        "similarity": 0.44730526
      }
    ]
  }
}

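Client code typically only needs the best match per language. Here is a small sketch of extracting the highest-similarity term for each language from a response shaped like the one above (abbreviated to two languages for brevity).

```python
# Abbreviated copy of the /semantics/similar response shown above.
response = {
    "similarTerms": {
        "spa": [
            {"term": "espía", "similarity": 0.61295485},
            {"term": "cia", "similarity": 0.46201307},
        ],
        "jpn": [
            {"term": "スパイ", "similarity": 0.5544399},
        ],
    }
}

def top_terms(resp):
    """Map each language code to its highest-similarity term."""
    return {
        lang: max(terms, key=lambda t: t["similarity"])["term"]
        for lang, terms in resp["similarTerms"].items()
    }

print(top_terms(response))  # {'spa': 'espía', 'jpn': 'スパイ'}
```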
Deployment

Rosette Cloud

Sign up today for a free 30-day trial

The SaaS version of Rosette is rapidly implemented and low maintenance, ideal for users who wish to pay based on monthly call volume. Numerous language bindings for its RESTful API are supported.

Rosette Server Edition

This on-premise private cloud deployment puts all the functionality of Rosette Cloud behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models.

Rosette Java Edition

For on-premise systems that need the low-latency, high-speed integration of an SDK, Rosette Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security.

Rosette Plugins

Just plug in Rosette for instant high-accuracy multilingual search and fuzzy name search for Elasticsearch or Apache Solr.

Quality documentation and support

Our support team responds to customers in less than a business day, and is committed to a satisfactory resolution. Users have access to in-depth documentation describing all the features, with code examples and a searchable knowledge base.

Visit our GitHub for bindings and documentation.

Request Custom Demo

Complete this form and our customer team will reach out to schedule a demo based on your use case.

Questions?

Email: info@basistech.com

Phone: +1-617-386-2000

Select Rosette Customers


Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce enabling users to search every field, file, and object across multiple orgs and other data sources.
