Rosette Cloud 1.9: more languages, higher accuracy, and deep neural nets
Rosette Cloud 1.9 is out, delivering a new language for name matching, translation, and deduplication: Thai. We’ve also added a new deep neural network model for sentiment analysis, entity extraction offsets, salience scores for topic extraction, and more.
Learn more below, or jump to the release notes.
The /name-similarity, /name-translation, and /name-deduplication endpoints now all support Thai names written in Thai script. Use them to calculate the similarity of two Thai names (or of a Thai name and an English name), translate a name from Thai to English, or deduplicate a list of multilingual names that includes Thai.
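As a rough illustration of comparing a Thai name with an English one, here is a minimal sketch that only builds a request body and sends nothing over the network. The field names (`name1`, `name2`, `text`, `language`), the language codes, and the helper `build_name_similarity_request` are assumptions made for this example; check the documentation for the actual request schema.

```python
import json


def build_name_similarity_request(name1, name2, lang1=None, lang2=None):
    """Build a JSON-serializable body for comparing two names.

    Assumption: the field names used here are illustrative and may not
    match the real /name-similarity request schema.
    """
    def name_obj(text, lang):
        obj = {"text": text}
        if lang is not None:
            # Optional language hint; omitted when not supplied.
            obj["language"] = lang
        return obj

    return {"name1": name_obj(name1, lang1), "name2": name_obj(name2, lang2)}


# A Thai name in Thai script compared with an English spelling.
body = build_name_similarity_request("สมชาย", "Somchai", lang1="tha", lang2="eng")

# ensure_ascii=False keeps the Thai characters readable in the payload.
payload = json.dumps(body, ensure_ascii=False)
```

The same body shape would work for two names in Thai script by passing Thai text for both arguments.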
An alternative, experimental model has been added to the /sentiment endpoint. The new model uses a deep neural network (DNN) and may produce different results from the default support vector machine (SVM) model.
By default, /sentiment will continue to use the SVM model, so your results will not change unless you switch to the alternative model. For instructions on switching to the DNN model, check the documentation. If you test the DNN model, we’d love to hear your observations and feedback.
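The opt-in behavior can be sketched as a request builder that leaves the default untouched unless the caller asks for the alternative model. The option name `modelType`, its value `dnn`, and the helper `build_sentiment_request` are assumptions for illustration only; the documentation has the actual switch.

```python
def build_sentiment_request(content, use_dnn=False):
    """Build a request body for sentiment analysis.

    Assumption: "options" / "modelType" / "dnn" are hypothetical names
    standing in for whatever the documentation specifies.
    """
    body = {"content": content}
    if use_dnn:
        # Only add the option when explicitly requested, so existing
        # callers keep getting the default SVM model.
        body["options"] = {"modelType": "dnn"}
    return body


default_request = build_sentiment_request("The new release works well.")
dnn_request = build_sentiment_request("The new release works well.", use_dnn=True)
```

Sending both requests for the same text is one simple way to compare the two models side by side.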
Entity Extraction and Linking
Entity mention offsets are now returned by default. Offsets can be used to pinpoint the exact surface form mentions of each extracted entity in the document text.
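To show how offsets map back to the text, the sketch below slices each mention out of the original document. The response shape (`mentionOffsets`, `startOffset`, `endOffset`) and the end-exclusive convention are assumptions for this example, and the sample entities are made up rather than real API output.

```python
text = "Rosette detected Boston and Tokyo in this sentence."

# Hypothetical entities; field names and offsets are illustrative only.
# Offsets are assumed to be character indices with an exclusive end.
entities = [
    {"mention": "Boston", "mentionOffsets": [{"startOffset": 17, "endOffset": 23}]},
    {"mention": "Tokyo", "mentionOffsets": [{"startOffset": 28, "endOffset": 33}]},
]


def mention_spans(document, entity):
    """Return the surface form found at each offset pair of an entity."""
    return [
        document[o["startOffset"]:o["endOffset"]]
        for o in entity["mentionOffsets"]
    ]


spans = {e["mention"]: mention_spans(text, e) for e in entities}
```

If the real offsets count something other than Python string indices (for example UTF-16 code units), the slicing step would need to convert them first.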
In addition, confidence scores are now returned by default for entities extracted by the statistical processor and for entity links. We also implemented confidence thresholds for Spanish, Chinese, and Japanese entity links, greatly improving the precision of the results. Finally, we updated our Korean entity extraction model, leading to a 10% improvement in overall accuracy across entity types.
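One common use of per-entity confidence is client-side filtering. The following is a minimal sketch, assuming each entity is a dict with an optional `confidence` field; that field name, the sample data, and the helper `filter_by_confidence` are hypothetical.

```python
def filter_by_confidence(entities, threshold, keep_unscored=True):
    """Keep entities whose confidence meets the threshold.

    Entities without a score (for example, ones not produced by the
    statistical processor) are kept or dropped according to keep_unscored.
    """
    kept = []
    for entity in entities:
        score = entity.get("confidence")
        if score is None:
            if keep_unscored:
                kept.append(entity)
        elif score >= threshold:
            kept.append(entity)
    return kept


# Made-up sample data, not real API output.
sample_entities = [
    {"mention": "Seoul", "confidence": 0.92},
    {"mention": "Kim", "confidence": 0.31},
    {"mention": "support@example.com"},  # no score attached
]

confident = filter_by_confidence(sample_entities, threshold=0.5)
strict = filter_by_confidence(sample_entities, threshold=0.5, keep_unscored=False)
```

Keeping unscored entities by default is a design choice for this sketch: a missing score is treated as "not scored", not as "low confidence".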
The /language endpoint can now detect when a document contains more than one language. With Rosette’s new multilingual mode enabled, you will receive both a list of candidate languages for the document as a whole (ideal for monolingual documents) and a list of the language regions within the document, each with its corresponding language code. For instructions on enabling multilingual mode, see the documentation.
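To make the two-part result concrete, here is a sketch that pairs each region with its best language candidate. The response keys (`languageDetections`, `regionalDetections`, `region`, `languages`) and the sample values are assumptions for illustration, not a guaranteed schema.

```python
# Made-up response in an assumed shape: whole-document candidates plus
# per-region candidates.
sample_response = {
    "languageDetections": [
        {"language": "eng", "confidence": 0.6},
        {"language": "fra", "confidence": 0.4},
    ],
    "regionalDetections": [
        {
            "region": "The weather is nice today.",
            "languages": [{"language": "eng", "confidence": 0.9}],
        },
        {
            "region": "Il fait beau aujourd'hui.",
            "languages": [{"language": "fra", "confidence": 0.9}],
        },
    ],
}


def region_languages(response):
    """Return (region text, best language code) for each detected region."""
    pairs = []
    for region in response.get("regionalDetections", []):
        candidates = region.get("languages", [])
        if candidates:
            best = max(candidates, key=lambda c: c["confidence"])
            pairs.append((region["region"], best["language"]))
    return pairs


regions = region_languages(sample_response)
```

A response without the regional list yields an empty result here, which matches the case where multilingual mode is not enabled.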
In addition, confidence scores for language identification results have been rescaled to more accurately represent the likelihood that a candidate language is correct. This changes only the confidence scores: the languages returned by the /language endpoint and the relative ranking of a given set of language candidates stay the same. For more information, get in touch!
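Why can scores change while the ranking stays put? Any strictly increasing rescaling has that property. The sketch below uses squaring purely as a stand-in; it is not the rescaling actually applied, and the candidate values are made up.

```python
def rescale(score):
    """Illustrative strictly increasing mapping on [0, 1].

    Assumption: this is NOT the real rescaling, only an example of a
    transformation that preserves order.
    """
    return score ** 2


def ranking(candidates):
    """Language codes ordered from most to least confident."""
    ordered = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    return [c["language"] for c in ordered]


before = [
    {"language": "spa", "confidence": 0.70},
    {"language": "por", "confidence": 0.55},
    {"language": "ita", "confidence": 0.20},
]
after = [
    {"language": c["language"], "confidence": rescale(c["confidence"])}
    for c in before
]
```

Code that relies only on the order of candidates is unaffected by such a change, while code that compares scores against a fixed cutoff may need its cutoff revisited.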
Salience scores are now returned for both concepts and keyphrases. Salience measures how relevant a given concept or keyphrase is to the text as a whole. Scores range from 0.0 to 1.0.
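Salience makes it easy to keep only the most relevant topics. Here is a minimal sketch, assuming each concept or keyphrase is a dict with `phrase` and `salience` fields; the field names, the helper `top_by_salience`, and the sample data are hypothetical.

```python
def top_by_salience(items, min_salience=0.0, limit=None):
    """Return items at or above min_salience, most salient first."""
    if not 0.0 <= min_salience <= 1.0:
        raise ValueError("min_salience must be between 0.0 and 1.0")
    kept = [item for item in items if item["salience"] >= min_salience]
    kept.sort(key=lambda item: item["salience"], reverse=True)
    return kept if limit is None else kept[:limit]


# Made-up keyphrases, not real API output.
keyphrases = [
    {"phrase": "release notes", "salience": 0.35},
    {"phrase": "neural network", "salience": 0.82},
    {"phrase": "Thai names", "salience": 0.64},
]

relevant = top_by_salience(keyphrases, min_salience=0.5)
best = top_by_salience(keyphrases, limit=1)
```

The same helper works for concepts and keyphrases alike, since both carry a score on the same scale.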
In addition, we’ve improved concept extraction on short texts, so you should see more relevant concepts for short inputs.