Deep Learning Powers Cross-Lingual Semantic Similarity Calculation

Text Embeddings Now Available in the Rosette API
The Rosette API team is excited to announce the addition of a new function to Rosette’s suite of capabilities: text embedding. This endpoint returns a single vector of floating-point numbers for your input, also known as an embedding of your text in a semantic vector space.
Text embeddings can be used for a variety of text analysis tasks, including judging the semantic similarity of two or more texts across languages. Once you know the embeddings of two documents, phrases, or words, you can evaluate how similar they are in meaning or content.
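To make the comparison step concrete, here is a minimal sketch of one common way to compare two embeddings: cosine similarity, which is also the measure the sample repo's cosine_similarity.py file is named after. This is not the repo's actual code. The function below and the three-dimensional toy vectors are assumptions made up for illustration; real embeddings returned by the endpoint have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (not real API output).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # points in the same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a (dot product is 0)

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: unrelated directions
```

Scores near 1 suggest the two inputs sit close together in the vector space, while scores near 0 suggest they are unrelated.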
What is semantic similarity?
While word and text embedding is still an emerging capability in natural language processing, one of the most popular uses for text embeddings so far is similarity calculation. Our engine uses machine learning to recognize similarities in context, content, and associations. For example, “king” is closely associated with “man,” while “queen” is closely associated with “woman.”
For the end user, text embeddings could power a number of different applications. Businesses engaging in e-discovery might use text embedding for deduplication of documents. In consumer services, review websites like Yelp or TripAdvisor could use text embeddings to aggregate related phrases such as “the bathrooms were spotless” or “the restroom was very clean.”
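As a rough sketch of how an application might aggregate related phrases, the snippet below greedily groups texts whose embeddings are close to each other. Everything here is an assumption for illustration: the group_similar helper, the 0.9 threshold, and the three-dimensional toy vectors are hypothetical and are not real output from the endpoint. A production system would likely need a tuned threshold and a more robust clustering method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_similar(embeddings, threshold=0.9):
    """Greedy grouping: each text joins the first group whose first member
    is at least `threshold` similar to it, otherwise it starts a new group."""
    groups = []
    for text, vec in embeddings.items():
        for group in groups:
            representative = embeddings[group[0]]
            if cosine_similarity(vec, representative) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough.
            groups.append([text])
    return groups


# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "the bathrooms were spotless": [0.9, 0.1, 0.0],
    "the restroom was very clean": [0.8, 0.2, 0.1],
    "the food arrived cold": [0.0, 0.2, 0.9],
}

for group in group_similar(toy_embeddings):
    print(group)
```

With these made-up vectors, the two cleanliness phrases land in one group and the food complaint in another.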
Find Related Words and Documents in Five Languages
The Rosette API text embedding endpoint supports five languages: English, German, Spanish, Japanese, and Chinese. It also supports cross-lingual comparisons, which allow you to calculate the similarity of words or documents written in different languages. As a test, consider evaluating the similarity of comparable or related words in different languages, such as “amor” and “love,” or “die Braut” and “le mariage.”
Try it Out
Once you’ve signed up for your Rosette API account (no commitment, no credit card required), you can try out a basic application for text embeddings using some sample Python code we created. Remember, the Rosette API comes with a 30-day free trial. If you need more calls, check out our paid plans.
First, head to the Rosette API GitHub community and clone the text-embeddings-sample repo to your machine.
You should see two files (plus a README.md):
- cosine_similarity.py
- test_embeddings.py
Make sure you’ve installed the latest version of our Python client binding (1.3.2) via
$ pip install rosette-api --upgrade
Then edit cosine_similarity.py in your favorite text editor to replace “[your key here]” with your Rosette API key.
Save, and head back to the text-embeddings-sample directory in your command line to run test_embeddings.py. It should look something like this:
$ python test_embeddings.py
Sample Results
We can see some interesting measures of semantic similarity between “Paris”, “France”, “London”, and “England”. Notice that “Paris” is closer in meaning to “France” than “London” is to “France.” Similarly, “England” is more semantically similar to “London” than to “France”.
Once you’ve gotten the hang of it, replace the sample input words in test_embeddings.py with your own words or longer input text to calculate the similarity between them.
You can find more details about text embedding and language coverage in the documentation. If you discover any cool results or use cases, let us know! Email support@rosette.com and we’ll feature your results on our blog or in our GitHub community.