Minds Converge: A Machine Learning Meeting in Toulon

Basis Technology R&D presents at the International Conference on Learning Representations in France

The International Conference on Learning Representations (ICLR) is an annual gathering of leading machine learning experts working in both industry and academia. This year’s conference was held from April 24-26 in Toulon, France.

Toulon, France

ICLR focuses on a broad range of subjects, with a particular emphasis on deep learning and representation learning, which is a close cousin of reinforcement learning. All of these areas are deeply tied to our text analytics and natural language processing work at Basis Technology. Pau Rodríguez López, a PhD student at the Computer Vision Center (CVC) in the Autonomous University of Barcelona (UAB), created a wonderful interactive visualization of the accepted papers’ abstracts, which gives a concise look at what deep learning researchers are particularly interested in right now.

Basis Technology releases WikiSem500 for ICLR

Congratulations to the entire R&D team here at Basis Technology! Philip Blair, a member of the team, travelled to Toulon last month to present the team’s work on the WikiSem500 paper at ICLR.

In November, the Basis Technology R&D team submitted a paper to ICLR for consideration, presenting a new technique for generating evaluation data for word embedding projects. The project included the “WikiSem500” dataset, which contains 500 cluster groups in five languages. Of the 491 papers submitted, 15 (3%) were asked to give an oral presentation, 183 (37.3%) were invited to present their research as a poster in the conference track, 48 (9.8%), including ours, were invited to present their research as a poster in the workshop track, and 245 (49.9%) were rejected.

For those unfamiliar with word or text embeddings, they are a natural language processing technique that assigns vectors to words, phrases and documents. These vectors can then be compared to determine semantic similarity. Word embeddings are used under the hood of numerous text analytics tools.
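To make the idea of comparing vectors concrete, here is a minimal sketch of cosine similarity, one common way to score how close two embeddings are. The three-dimensional vectors below are invented for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# With these toy numbers, "cat" scores closer to "kitten" than to "car".
print(sim_cat_kitten > sim_cat_car)
```

A score near 1 means the two vectors point in nearly the same direction, which is what an embedding model aims for when two words are used in similar ways.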

Any software engineer will tell you that the most time-consuming and expensive aspect of machine learning research is data preparation. However, good testing data is vital to determining how well an algorithm is performing. Basis Technology’s WikiSem500 project seeks to automate the evaluation process of word embeddings, saving researchers in the field large amounts of time and resources. Dive deeper into the WikiSem500 project on our blog.

Here’s Philip’s report on the conference and other NLP presenters.

Philip’s Conference Report: GANs, Optimization, and Q&A Systems

Generative Adversarial Networks (GANs) received a significant amount of attention at ICLR. For the uninitiated, GANs are systems for generating media which pit two deep learning networks against each other: one which does the generation and another which tries to distinguish the generated items from items belonging to some ground truth dataset. While a large amount of the buzz involved the applications of GAN techniques to new problems in creative ways, there was also a bit of interest in the theoretical grounding for how these networks are trained.

In line with this, a number of the papers presented discussed topics in optimization. In addition to some fascinating work on the theory behind the ability of deep networks to learn and generalize, many papers discussed novel reparameterization tricks and ways of improving optimization times.

Another increasingly popular topic is question answering systems. Closely related to machine comprehension, these systems power the functionality of products such as Apple’s Siri and Amazon Alexa. The hope of researchers is that deep learning can enable these systems to retrieve information in more flexible and meaningful ways without the need for hand-written rules.

Figure 1: Diagram of Doc2VecC (Source: Chen (2017))

A paper by Chen investigates reducing these costs by taking ideas from both methods. Basically, in addition to the local context around a word being used to train word embeddings, the model (called Doc2VecC) also uses a representation of the global context (found by sampling words from across the document and averaging their word vectors). This simple tweak makes training a model no more expensive than a standard Word2Vec model (which is relatively inexpensive), and at runtime, a simple bag-of-words approach with these vectors actually outperforms Paragraph Vectors.
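The two averaging steps described above can be sketched in a few lines. This is only an illustration of the idea as summarized here, not Chen's implementation: the function names, the toy vectors, and the fixed sample size are all assumptions, and the training objective that actually learns the word vectors is omitted.

```python
import random

def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def global_context(doc_tokens, word_vectors, sample_size, rng):
    """Training-time sketch: sample words from across the document
    and average their vectors to represent the global context."""
    sample_size = min(sample_size, len(doc_tokens))
    sampled = rng.sample(doc_tokens, sample_size)
    return average_vectors([word_vectors[t] for t in sampled])

def document_embedding(doc_tokens, word_vectors):
    """Runtime sketch: a plain bag-of-words average over the whole document."""
    return average_vectors([word_vectors[t] for t in doc_tokens])

# Hypothetical 2-dimensional word vectors, invented for illustration.
toy_vectors = {"the": [0.0, 1.0], "cat": [1.0, 0.0], "sat": [1.0, 1.0]}
doc = ["the", "cat", "sat"]

rng = random.Random(0)  # seeded so the sampling is repeatable
context = global_context(doc, toy_vectors, sample_size=2, rng=rng)
embedding = document_embedding(doc, toy_vectors)
print(context, embedding)
```

Because both steps are simple averages, neither adds much cost on top of ordinary word vector training, which is consistent with the efficiency claim above.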

Figure 2: Sentences which have been attended to by Lin et al.’s self-attention mechanism (Source: Lin et al. (2017))

Lin et al. describe an interesting spin on the idea of sentence embeddings. For one, instead of using a vector to represent a sentence, their model describes sentences using a two-dimensional matrix. Additionally, they describe a “self-attention” mechanism, which allows their model to focus on specific parts of sentences when creating embeddings for them.
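As a rough sketch of how attention can turn a sentence into a matrix rather than a single vector, the code below computes several attention distributions over the tokens and uses each one to take a weighted sum of the per-token vectors. The weight matrices W1 and W2, the per-token vectors in H, and all of the numbers are hypothetical; in a real model the weights are learned and the token vectors come from an encoder. Treat this as an illustration of the general mechanism rather than the authors' code.

```python
import math

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attentive_embedding(H, W1, W2):
    """H: n tokens x d. W1: d_a x d. W2: r x d_a.
    Returns attention weights A (r x n) and the sentence matrix M (r x d)."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W1, transpose(H))]
    scores = matmul(W2, hidden)
    A = [softmax(row) for row in scores]  # each row is a distribution over tokens
    M = matmul(A, H)                      # each row is one weighted view of the sentence
    return A, M

# Hypothetical inputs: 3 tokens with 2-dimensional vectors, 2 attention rows.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.5, -0.5], [0.3, 0.8]]
W2 = [[1.0, 0.0], [0.0, 1.0]]

A, M = self_attentive_embedding(H, W1, W2)
print(len(M), len(M[0]))
```

Each row of A says which tokens that row is focusing on, which is the kind of information that can be visualized as highlighted words in a sentence.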

Word Embeddings

Figure 3: Diagram of word and character-level gating mechanism (Source: Yang et al. (2017))

One problem with word embeddings is that they can struggle with things like sub-word morphology (e.g. “cat” vs “cats”) and out-of-vocabulary words, while character-level embeddings can deal with these just fine. Conversely, word vectors are significantly better at capturing the semantics of words than their character-level counterparts. Approaches have been proposed in the past which combine word embeddings with character-level ones to better cope with these challenges. The issue with these methods is that, in practice, how much weight a word’s word-level or character-level embedding should receive depends on which word is being considered. Yang et al. discuss a framework which addresses this issue. The basic idea is that their model contains a “gating” mechanism that can dynamically choose between the two representations, giving the authors state-of-the-art performance on reading comprehension tasks.
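A gate of this kind can be sketched as a per-dimension weight between 0 and 1 that blends the two representations. In the sketch below, the gate is computed from a small feature vector describing the token; the feature choice, the weights, and the toy embeddings are all assumptions made for illustration, and the details of the authors' model may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_values(features, weights, bias):
    """One gate value in (0, 1) per embedding dimension,
    computed from a token-level feature vector."""
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, bias)
    ]

def gated_mix(word_vec, char_vec, gate):
    """Blend per dimension: a gate near 1 favors the character-level
    vector, a gate near 0 favors the word-level vector."""
    return [g * c + (1.0 - g) * w for g, c, w in zip(gate, char_vec, word_vec)]

# Hypothetical inputs, invented for illustration.
word_vec = [0.2, 0.7]
char_vec = [0.9, 0.1]
features = [1.0, 0.0]               # e.g. a made-up "rare word" indicator
weights = [[2.0, -1.0], [1.5, 0.5]]  # would be learned in a real model
bias = [0.0, 0.0]

gate = gate_values(features, weights, bias)
mixed = gated_mix(word_vec, char_vec, gate)
print(gate, mixed)
```

With a mechanism like this, a rare or out-of-vocabulary token can lean on its characters, while a common word can rely on its word-level vector.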

Multilingual Text Embeddings

Smith et al. demonstrate work on improving the quality of multilingual text embeddings. The motivation for these types of embeddings is that they enable one to reason about the semantics of words in different languages with the same meaning; for example, the closest word in Japanese to the English word “cat” within Rosette API’s embedding space is “猫”, which is indeed the Japanese word for “cat”. The authors describe a robust framework for creating these vector spaces and demonstrate state-of-the-art results on translating word pairs (such as the one mentioned above) and sentence pairs.

All in all, it is an exciting time to be engaged in advances in machine learning and natural language processing. It was fascinating to see what researchers have come up with, and we can’t wait to see what we find at next year’s conference! —Philip Blair