Notes from the Lab: Fueling New Research into Machine Learning with Wikidata

Basis Technology R&D team pioneers new technique and open-sources WikiSem500, a dataset for multilingual word embedding evaluation

The most time-consuming and expensive aspect of machine learning research is data preparation (aggregation and cleaning), and every data scientist has been frustrated by it. However, the importance of good testing data makes it hard to cut corners: without it, it's nearly impossible to tell how well an algorithm is doing its job. What if there were an artificial intelligence tool smart enough to do its own data preparation, taking the hassle off of researchers' plates?

That's exactly what we've done. The Basis Technology R&D team is pleased to announce a new technique for generating evaluation data for word embedding projects, which we used to produce the WikiSem500 dataset of 500 cluster groups in five languages. We are making this dataset freely available on GitHub to the open data science community. Unlike most currently available datasets designed for this function, which are created manually, ours was generated automatically using a language-agnostic technique and Wikidata. This first collection includes English, Spanish, German, Japanese, and Chinese cluster groups, but datasets may be generated in any Wikidata-supported language.

“Unlike most currently available datasets designed for this function, which are created manually, ours was generated automatically using a language-agnostic technique and Wikidata.”

Datasets like WikiSem500 are vitally important because they allow us to better test new word embedding models. Each test case includes a cluster of semantically similar words and several "outliers." For example, one case might contain Mordor, Shire, Thule, Arnor, and Rohan. Which is the outlier? (For non-"Lord of the Rings" enthusiasts, the answer is Thule.) A good word embedding model should be able to accurately identify the outliers and score their dissimilarity. WikiSem500 is the only fully automated multilingual outlier detection dataset currently available, and it's much larger than previous manually created options (which are also English-only). It makes it possible to develop better word embedding models with both greater speed and accuracy!
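To make the task concrete, here is a minimal sketch of how outlier detection with word embeddings typically works: score each word by its average cosine similarity to the rest of the set (a "compactness" score) and flag the lowest-scoring word as the outlier. The toy vectors and the find_outlier helper below are our own illustration, not code from the dataset or the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_outlier(words, embeddings):
    """Return the word with the lowest average similarity to the others.

    `embeddings` maps each word to its vector; in practice these vectors
    would come from the model under evaluation.
    """
    scores = {}
    for w in words:
        others = [embeddings[u] for u in words if u != w]
        scores[w] = np.mean([cosine_similarity(embeddings[w], v) for v in others])
    # The least "compact" word is the predicted outlier.
    return min(scores, key=scores.get)

# Toy 2-D vectors purely for illustration; real models use hundreds of dimensions.
embeddings = {
    "Mordor": np.array([0.90, 0.10]),
    "Shire":  np.array([0.80, 0.20]),
    "Arnor":  np.array([0.85, 0.15]),
    "Rohan":  np.array([0.95, 0.05]),
    "Thule":  np.array([0.10, 0.90]),  # not a Tolkien place, hence the outlier
}

print(find_outlier(list(embeddings), embeddings))  # -> "Thule"
```

A model is then judged by how often the word it flags matches the dataset's labeled outlier across all test cases.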

Figure 1 from "Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations": Partial example of a Wikidata cluster. Solid arrows represent "instance of" relationships, and dashed arrows represent "subclass of" relationships.
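For readers curious how clusters like the one in Figure 1 can be pulled from Wikidata, here is a hedged sketch using Wikidata's public SPARQL endpoint. The properties P31 ("instance of") and P279 ("subclass of") and the item Q515 ("city") are real Wikidata identifiers, but this particular query is our illustration of the general idea, not the exact pipeline from the paper.

```python
import requests

# Wikidata's public SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Fetch labels for a few entities that are instances (P31) of a given class.
# Q515 is the Wikidata item for "city"; swapping the class ID or the label
# language yields clusters for any Wikidata-supported language.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q515 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikisem500-demo/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"])
```

Because the query is driven entirely by Wikidata's language-independent entity graph, the same class structure produces sibling clusters in English, Spanish, German, Japanese, Chinese, or any other supported language just by changing the label language.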

Want to try it out? Download the WikiSem500 dataset from GitHub and test it against the /text-embedding endpoint of our Rosette API (free, no-commitment signup). Are you a data science guru? Read the full technical paper, "Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations," which is currently under review for the International Conference on Learning Representations (ICLR) 2017. Watch for updates here on our blog and on our Twitter feed.
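As a starting point, here is a minimal sketch of calling the /text-embedding endpoint from Python. We are assuming the endpoint lives under api.rosette.com/rest/v1/, accepts a JSON body with a content field, and authenticates with an X-RosetteAPI-Key header, following the pattern of other Rosette API endpoints; check the API documentation for the exact contract.

```python
import requests

# Assumed base URL and auth scheme; confirm against the Rosette API docs.
API_URL = "https://api.rosette.com/rest/v1/text-embedding"
API_KEY = "your-rosette-api-key"  # placeholder; use your own key from signup

response = requests.post(
    API_URL,
    headers={
        "X-RosetteAPI-Key": API_KEY,
        "Content-Type": "application/json",
    },
    json={"content": "Mordor"},
)
response.raise_for_status()

# The response is assumed to carry the embedding vector for the input text,
# which you can then feed into an outlier-detection evaluation like the one above.
print(response.json())
```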