Measure text similarity and identify relevant keywords in and across nine languages

Overview
What is semantic similarity?
Semantic similarity is our implementation of text embedding, a recent natural language processing innovation that transforms words into numerical representations (vectors) which approximate the conceptual distance between word meanings.
Semantic similarity is useful for cross-language search, duplicate document detection, and related-term generation.
Cross-language semantic search
A keyword search returns what you said, not necessarily what you meant. These searches don’t understand the meaning of the keywords used, and they stumble over words that have multiple meanings or synonyms.
One word, many meanings (polysemy)
“Fast” can mean:
- quick
- abstain from food
- firmly fixed or attached
Many words, one meaning (synonymy)
- clothes = apparel
- jewel = gem
- hot dog = sandwich (to some people)
Semantic search, by contrast, looks for the meaning rather than the literal words. Thus a search for “clothes” will return articles that mention “apparel,” and a search for “gem” will return articles that mention “jewels.” Whether a search for “hot dog” can or should return articles about “sandwiches” is still a matter of opinion.
What is really exciting is searching semantically across languages. By mapping text embeddings between languages, we can now find “croissant sausage” by searching for “pigs in a blanket” or even “Сосиска в тесте” (Russian for “sausage in dough”).
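The mechanics can be sketched in a few lines. Cosine similarity between embedding vectors is a standard way to score semantic closeness; the tiny hand-made "embeddings" below are stand-ins for what a real multilingual model would produce, so the specific numbers are invented for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm

# Toy hand-made vectors standing in for a real multilingual embedding model.
toy_embeddings = {
    "pigs in a blanket": [0.90, 0.10, 0.00],
    "croissant sausage": [0.85, 0.15, 0.05],
    "Сосиска в тесте":   [0.88, 0.12, 0.02],  # the Russian near-synonym
    "moon landing":      [0.00, 0.20, 0.95],  # an unrelated phrase
}

def semantic_search(query, corpus):
    """Rank corpus phrases by similarity to the query, best match first."""
    q = toy_embeddings[query]
    return sorted(corpus, key=lambda p: cosine(q, toy_embeddings[p]), reverse=True)

results = semantic_search("pigs in a blanket",
                          ["croissant sausage", "moon landing", "Сосиска в тесте"])
# Both cross-language near-synonyms outrank the unrelated phrase.
print(results)
```

Because the ranking depends only on vector geometry, the query and the matched documents never need to share a single literal word, or even a language.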
Duplicate document detection
In numerous situations, such as eDiscovery, detecting near-duplicate documents can save weeks or more of expensive human labor. Rosette® offers a business-ready workflow: it identifies the language(s) present in a document set, so you know upfront which specialists are needed, and it detects duplicate documents that can be eliminated from the queue before human lawyers review them.
Text embeddings can detect plagiarism even when a sentence has been moved or modified, something that would foil many plagiarism checkers. Document similarity can be strengthened further by adding Rosette Base Linguistics, which generates word lemmas for use in a bag-of-words model.
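A minimal sketch of the bag-of-words side of this workflow: score document pairs by cosine similarity over word counts and flag pairs above a threshold. The crude lowercasing here merely stands in for real lemmatization (which Rosette Base Linguistics would supply per language), and the threshold value is an assumption.

```python
from collections import Counter
from math import sqrt

def bow_vector(text):
    """Rough stand-in for lemmatization: lowercase word counts.
    (Rosette Base Linguistics would supply real per-language lemmas.)"""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicates(docs, threshold=0.8):
    """Return index pairs of documents whose similarity meets the threshold."""
    vecs = [bow_vector(d) for d in docs]
    return [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
            if cosine(vecs[i], vecs[j]) >= threshold]

docs = [
    "The contract was signed by both parties on Friday",
    "On Friday the contract was signed by both parties",  # reordered near-duplicate
    "Quarterly earnings exceeded analyst expectations",
]
print(near_duplicates(docs))  # → [(0, 1)]
```

Note how the reordered sentence still scores as a duplicate: a bag-of-words model ignores word order, which is exactly why moved sentences don't hide a copy.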
Related term generation
Some applications need to compare the similarity of two words or generate similar terms. From any of our supported languages, you can:
- Generate a list of similar terms to store in the search index to expand queries or offer keyword suggestions
- Input the word “spy” and get “espía” in Spanish or “スパイ” in Japanese.
Similar terms are also helpful in matching company and product names.
Ambiguity clouds how people refer to company and product names; they may inadvertently swap words with similar meanings. Semantic similarity detects those near matches. For example:
- The company name “Eagle Drugs” is more similar to “Eagle Pharmaceuticals” than it is to “Eagle Landscaping”
- The product name “Hershey Chocolate Bar” is more similar to “Hershey Candy Bar” than it is to “Hershey Protein Bar.”
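The name-matching idea can be illustrated by composing a name vector from word vectors and comparing the results. The tiny vectors and the simple averaging scheme below are invented for the sketch; a real embedding model supplies the word vectors.

```python
from math import sqrt

# Toy word vectors: "drugs" and "pharmaceuticals" point in nearly the same
# direction, while "landscaping" is far from both. Values are invented.
word_vecs = {
    "eagle":           [1.0, 0.00, 0.00],
    "drugs":           [0.0, 0.90, 0.10],
    "pharmaceuticals": [0.0, 0.85, 0.15],
    "landscaping":     [0.0, 0.10, 0.95],
}

def name_vector(name):
    """Average the word vectors of a name (a simple composition scheme)."""
    vecs = [word_vecs[w] for w in name.lower().split()]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

pharma    = cosine(name_vector("Eagle Drugs"), name_vector("Eagle Pharmaceuticals"))
landscape = cosine(name_vector("Eagle Drugs"), name_vector("Eagle Landscaping"))
print(pharma > landscape)  # → True
```

Even though all three names share the literal word "Eagle," the semantic component of the remaining words is what separates the near match from the false one.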
Product highlights
- Supports multiple languages
- Cloud and secure on-premise deployments
- Fast and scalable
- Industrial-strength support
- Constantly stress-tested and improved
How It Works
Conceptual distance
From one perspective, a cat is very different from a pig. After all, a cat is a pet and a pig typically is not. However, in another view, they are both animals, and much more similar to each other than either is to the moon. Thus the vectors for “cat” and “pig” are closer together than the vectors for “cat” and “moon.” This principle of relative conceptual distance is at the crux of calculating semantic similarity.
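The cat/pig/moon example above can be made concrete with cosine similarity. The three-dimensional vectors here are hand-made for illustration; a real model would learn hundreds of dimensions from text.

```python
from math import sqrt

# Hand-made dimensions: (animal-ness, pet-ness, celestial-ness).
cat  = [0.9, 0.8, 0.0]
pig  = [0.9, 0.2, 0.0]
moon = [0.0, 0.0, 1.0]

def cosine(a, b):
    """Cosine similarity: higher means conceptually closer."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# "cat" sits much closer to "pig" than to "moon" in this space.
print(cosine(cat, pig), cosine(cat, moon))
```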
Converting words to vectors
Text embeddings are the mathematical representations of words as vectors. They are created by analyzing a body of text and representing each word, phrase, or entire document as a vector in a high-dimensional space (similar to a multidimensional graph). Once text has been mapped as vectors, it can be added, subtracted, multiplied, or otherwise transformed to mathematically express or compare the relationships between different words, phrases, and documents.
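The classic demonstration of such vector arithmetic is the analogy "king - man + woman ≈ queen." A sketch with toy two-dimensional vectors (royalty-ness, masculinity), invented purely for illustration:

```python
# Toy 2-dimensional vectors: (royalty-ness, masculinity). Invented values.
vec = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# king - man + woman: subtract "maleness", add "femaleness".
result = add(sub(vec["king"], vec["man"]), vec["woman"])

# The nearest stored vector (by squared Euclidean distance) is "queen".
nearest = min(vec, key=lambda w: sum((x - y) ** 2 for x, y in zip(vec[w], result)))
print(nearest)  # → queen
```

In a trained embedding space the same arithmetic holds only approximately, which is why real systems look up the nearest neighbor of the result rather than expecting an exact hit.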
Word embedding versus text embedding
Word embedding, the cutting edge of today’s natural language processing and deep learning technology, is the mapping of individual words to vectors. Text embedding takes the process a step further by creating vectors for phrases, paragraphs, and documents as well. Word embedding shows that “king” is similar to “queen” but not to “avalanche,” while text embedding can show that the Book of John is more similar to the Book of Luke than to Harry Potter and the Goblet of Fire.
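One simple way to extend word embeddings to text embeddings is mean pooling: average the word vectors of a passage to get a single vector for the whole text. Real systems use more sophisticated compositions (and Rosette's exact method is not described here); the tiny vectors below are invented for the sketch.

```python
# Mean pooling: average the word vectors of a passage into one text vector.
def text_embedding(words, word_vecs):
    """Average the known word vectors of a token list, component-wise."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Toy 2-dimensional word vectors, invented for illustration.
word_vecs = {
    "king":  [1.0, 1.0],
    "rules": [0.5, 0.0],
    "the":   [0.0, 0.0],
    "land":  [0.2, -0.1],
}

doc_vec = text_embedding("the king rules the land".split(), word_vecs)
print(doc_vec)
```

Because whole documents become single vectors, comparing two books reduces to the same cosine comparison used for individual words.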

Tech Specs
Availability and platform support
Deployment availability:
Bindings:
Supported languages
Arabic | Chinese, Simplified & Traditional | English | French | German | Japanese | Korean (North & South Korean dialects) | Russian | Spanish | Tagalog
Semantic Vectors /semantics/vector
{"content": "Cambridge, Massachusetts"} { "documentEmbedding": [ 0.0220256, 0.03633998, 0.05246549, -0.03751056, 0.0347335, 0.02479751, -0.03860506, 0.00603574, -0.04244069, 0.00521813, 0.01740657, -0.08501768, -0.01918706, -0.05974227, 0.00762913, -0.00020686, -0.04639495, 0.00458408, 0.01220596, 0.06160719, -0.03988802, -0.03095652, -0.01182547, 0.04861571, 0.02967435, -0.04560868, -0.16111824, -0.06562275, 0.00208866, 0.01622739, -0.09196278, 0.13520485, 0.03665138, -0.01748736, 0.05908763, 0.07113674, 0.04435388, -0.04436791, -0.0018729, -0.03612895, 0.00324841, 0.0218222, 0.00414962, 0.02750619, -0.00466647, -0.03516347, 0.00061686, 0.03071387, 0.060716, -0.05394382, -0.03460756, -0.0916905, -0.04351116, 0.03095916, 0.07264832, 0.00440244, -0.06487004, -0.0124327, -0.02594845, 0.06403252, 0.05990276, 0.08421157, 0.00113943, -0.05188083, 0.01336752, 0.05737128, 0.0868928, -0.02797472, 0.02951868, -0.06528687, -0.02593506, -0.1377904, 0.05021935, -0.00331138, 0.00345429, -0.0806604, -0.02997256, 0.04178474, -0.16860084, -0.00202994, 0.04082655, 0.04052638, -0.02616019, -0.07079905, 0.04114204, -0.05405192, -0.02079529, 0.03362259, 0.12866253, 0.04686183, 0.03205459, 0.01844979, 0.10577367, -0.04331236, 0.03550498, 0.03498939, -0.05236725, 0.05650697, -0.03229797, -0.05911481, 0.08041807, -0.01093418, -0.04541076, 0.00499057, 0.03379054, 0.01985912, 0.05434353, -0.06876269, -0.02142489, -0.04368682, -0.02340091, 0.04271708, -0.03868493, 0.03260612, -0.00310602, -0.08135383, 0.03890613, 0.05206529, 0.01902638, -0.03261049, -0.01225097, -0.04929554, 0.06811376, -0.10045446, -0.03772711, 0.06436889, 0.0335337, 0.03110947, -0.01010367, -0.03986244, 0.01340914, -0.06304926, 0.05365673, -0.07044137, 0.06421522, 0.0632241, -0.04348637, 0.13118945, -0.02082631, 0.07590587, -0.04813327, -0.02577493, 0.05642929, 0.00033935, -0.01024516, 0.06391647, 0.03264675, -0.02187326, 0.04832495, 0.02241259, 0.05681982, -0.04124964, 0.08708096, 0.06066873, -0.03356391, -0.03327714, 
-0.03449181, -0.02047219, 0.06597982, 0.08629483, 0.03777988, 0.01191289, 0.10955901, -0.05159367, 0.00001431, -0.00435081, -0.07139333, -0.10915583, -0.06582265, -0.02754464, 0.04510804, 0.09508634, -0.02923319, 0.03627863, 0.02647047, 0.06838391, 0.07216309, -0.00809051, 0.07248835, 0.0123264, -0.09173338, -0.02095788, 0.02871792, -0.03392723, 0.05959549, -0.10397915, -0.03820326, -0.05222115, -0.02296818, -0.06410559, 0.02745123, 0.02334865, -0.02446206, -0.12417631, -0.01871051, 0.02439541, -0.02481432, -0.03880155, 0.04188481, 0.02300973, 0.10600527, 0.02696968, 0.02788247, 0.05024018, 0.05907565, 0.02856795, -0.00740766, 0.02289764, -0.0643627, -0.00749485, -0.03111451, 0.06580845, 0.02102997, -0.10717536, 0.16490568, 0.03047366, -0.02454999, 0.07184675, -0.02504459, -0.11541119, 0.03915355, -0.03187835, -0.05494586, -0.15862629, -0.02779816, 0.00724561, 0.00901807, -0.01519001, 0.04528573, -0.05221211, 0.01260346, -0.01652065, 0.01324382, -0.01688977, 0.01070876, -0.03916383, -0.03296183, -0.06774635, -0.05388693, -0.01320887, 0.07467077, 0.06863626, -0.06439278, 0.06113409, -0.00122581, -0.0411741, 0.11657882, -0.01979883, -0.01714609, -0.00621283, 0.05906631, 0.00404663, 0.02791196, -0.11955266, -0.0623432, -0.12302965, 0.04749805, -0.05722075, 0.08342554, -0.0616898, 0.0171079, 0.1030134, 0.00575187, -0.01223959, -0.01106031, 0.02733183, -0.05465746, -0.00639093, 0.10582153, 0.05119603, -0.16957831, 0.0605646, 0.05737981, 0.12555394, -0.00963913, -0.15966235, 0.06239227, -0.01519997, -0.00653814, -0.01759958, -0.00281965, -0.07387377, 0.01542045, -0.01574635, 0.09960862, 0.06726488, 0.01381977, 0.03104461, 0.05140565, -0.08996302, 0.06713541, -0.10765704, -0.00975681, 0.15130819, 0.0128835, -0.00251494, -0.02743187, 0.00955417, -0.10639542, 0.04656886 ] }
Similar Terms /semantics/similar
{"content": "spy", "options": {"resultLanguages": ["spa", "deu", "jpn"]}} { "similarTerms": { "spa": [ { "term": "espía", "similarity": 0.61295485 }, { "term": "cia", "similarity": 0.46201307 }, { "term": "desertor", "similarity": 0.42849663 }, { "term": "cómplice", "similarity": 0.36646274 }, { "term": "subrepticiamente", "similarity": 0.36629659 }, { "term": "asesino", "similarity": 0.36264464 }, { "term": "misterioso", "similarity": 0.35466132 }, { "term": "fugitivo", "similarity": 0.35033143 }, { "term": "informante", "similarity": 0.34707013 }, { "term": "mercenario", "similarity": 0.34658083 } ], "jpn": [ { "term": "スパイ", "similarity": 0.5544399 }, { "term": "諜報", "similarity": 0.46903181 }, { "term": "MI6", "similarity": 0.46344957 }, { "term": "殺し屋", "similarity": 0.41098994 }, { "term": "正体", "similarity": 0.40109193 }, { "term": "プレデター", "similarity": 0.39433435 }, { "term": "レンズマン", "similarity": 0.3918637 }, { "term": "S.H.I.E.L.D.", "similarity": 0.38338536 }, { "term": "サーシャ", "similarity": 0.37628397 }, { "term": "黒幕", "similarity": 0.37256041 } ], "deu": [ { "term": "Deckname", "similarity": 0.51391315 }, { "term": "GRU", "similarity": 0.50809389 }, { "term": "Spion", "similarity": 0.50051737 }, { "term": "KGB", "similarity": 0.49981388 }, { "term": "Informant", "similarity": 0.48774603 }, { "term": "Geheimagent", "similarity": 0.48700801 }, { "term": "Geheimdienst", "similarity": 0.48512384 }, { "term": "Spionin", "similarity": 0.47224587 }, { "term": "MI6", "similarity": 0.46969846 }, { "term": "Decknamen", "similarity": 0.44730526 } ] }
Deployment
Rosette Cloud
Sign up today for a free 30-day trial
The SaaS version of Rosette is rapidly implemented, low maintenance, and ideal for users who wish to pay based on monthly call volume. Numerous language bindings are supported through a RESTful API.
Rosette Server Edition
This on-premise private cloud deployment puts all the functionality of Rosette Cloud behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models.
Rosette Java Edition
For on-premise systems that need the low-latency, high-speed integration of an SDK, Rosette Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security.
Rosette Plugins
Just plug in Rosette for instant high-accuracy multilingual search and fuzzy name search in Elasticsearch or Apache Solr.
Quality documentation and support
Our support team responds to customers in less than a business day and is committed to reaching a satisfactory resolution. Users have access to in-depth documentation covering every feature, complete with code examples, as well as a searchable knowledge base.
Visit our GitHub for bindings and documentation.
Questions?
Email: info@basistech.com
Phone: +1-617-386-2000
Select Rosette Customers
Deep Search for Salesforce
AI-driven Search Application for Salesforce
KonaSearch is a best-in-class search application for Salesforce that enables users to search every field, file, and object across multiple orgs and other data sources.
