By Paul Flamburis

In the world of natural language processing, the strength of a model is limited by the quantity and quality of the data that goes into developing that model. Collecting enough of this data, whether it’s for training or evaluation, can be an overwhelming undertaking for human beings. But what if we could generate a portion of this data automatically? During Hackathon 2022, two teams worked on developing tools to do exactly that.

Team 6 focused on the training data side of things, aiming to improve Rosette’s ability to generalize, or adapt to different domains, with less manual annotation. Strong generalization increases the likelihood of consistent performance beyond the training and evaluation data. To accomplish this goal, the team envisioned a two-model solution.

Photograph of Team 6 members Philip Blair, Jena Chen, Aishwarya “Ash” Baliga, and Amit Seker

From left to right: Philip Blair, Jena Chen, Aishwarya “Ash” Baliga, Amit Seker

They theorized that a heavily pre-trained state-of-the-art named entity recognition (NER) model could be fed domain-specific unannotated data to generate a massive “silver” dataset. It is called silver because, while its labels are intended to be correct, they are not produced or verified by a human being. This silver dataset could then be used to train a more lightweight deep learning model (one with fewer parameters, and therefore faster). The expected outcome was that Rosette would be able to generalize better by leveraging information from the first pre-trained model, work faster because of the lighter weight of the second model, and require less data collection and annotation by humans overall.
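To make the two-model idea concrete, here is a minimal sketch of the data flow, not Team 6’s actual code. Everything in it is an assumption made for illustration: the names `build_silver_dataset`, `toy_teacher`, and `train_toy_student` are hypothetical, the “teacher” is a simple word list standing in for a large pre-trained NER model, and the “student” merely memorizes tags where the real project trained a lightweight deep learning model.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]
SilverExample = Tuple[List[str], List[str]]  # (tokens, machine-generated tags)


def build_silver_dataset(teacher: Tagger,
                         unannotated: List[List[str]]) -> List[SilverExample]:
    """Label unannotated sentences with the teacher; no human verifies the tags."""
    silver = []
    for tokens in unannotated:
        tags = teacher(tokens)
        if len(tags) != len(tokens):
            raise ValueError("teacher must return exactly one tag per token")
        silver.append((tokens, tags))
    return silver


# Hypothetical stand-in for a heavily pre-trained NER model.
KNOWN_PEOPLE = {"Ada", "Lovelace"}


def toy_teacher(tokens: List[str]) -> List[str]:
    return ["PER" if token in KNOWN_PEOPLE else "O" for token in tokens]


def train_toy_student(silver: List[SilverExample]) -> Tagger:
    """Hypothetical stand-in for training the lightweight model.

    It only remembers the most frequent silver tag for each token.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens, tags in silver:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    lookup = {token: c.most_common(1)[0][0] for token, c in counts.items()}

    def student(tokens: List[str]) -> List[str]:
        # Tokens never seen in the silver data default to "O" (not an entity).
        return [lookup.get(token, "O") for token in tokens]

    return student


silver = build_silver_dataset(toy_teacher, [["Ada", "Lovelace", "wrote", "programs"]])
student = train_toy_student(silver)
```

The point of the sketch is the shape of the pipeline: the student never sees human annotations, so any mistake the teacher makes is passed straight into the student’s training data.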

After creating and testing their lightweight deep learning model, Team 6 measured its precision, recall, and F1 scores on unseen mentions (entities the model was not trained on) and compared them to those of Rosette. The results were mixed: the deep learning model showed a 6.2% decrease in precision, an 11.7% increase in recall, and a 1.9% increase in F1. While this is more positive than negative, it’s not quite enough to establish the new model as the superior option.
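For readers less familiar with these metrics, the sketch below shows one common way to compute precision, recall, and F1 over entity mentions, restricted to mentions whose text did not appear in training. It is an illustration under stated assumptions, not the team’s evaluation code: the `Mention` type, the helper names, the exact-match criterion, and the sample data are all hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Mention(NamedTuple):
    doc_id: str
    start: int
    end: int
    entity_type: str
    text: str


def unseen_only(mentions: Set[Mention], training_texts: Set[str]) -> Set[Mention]:
    """Keep only mentions whose (lowercased) text never appeared in training."""
    return {m for m in mentions if m.text.lower() not in training_texts}


def precision_recall_f1(gold: Set[Mention],
                        predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if span and type both agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: "Boston" was seen in training, the rest were not.
training_texts = {"boston"}
gold = {
    Mention("d1", 0, 6, "LOCATION", "Boston"),
    Mention("d1", 10, 19, "ORGANIZATION", "Acme Corp"),
    Mention("d1", 25, 37, "PERSON", "Ada Lovelace"),
    Mention("d1", 40, 51, "LOCATION", "Springfield"),
}
predicted = {
    Mention("d1", 0, 6, "LOCATION", "Boston"),
    Mention("d1", 10, 19, "ORGANIZATION", "Acme Corp"),
    Mention("d1", 25, 37, "PERSON", "Ada Lovelace"),
    Mention("d1", 60, 66, "PERSON", "Monday"),  # a false positive
}

precision, recall, f1 = precision_recall_f1(
    unseen_only(gold, training_texts),
    unseen_only(predicted, training_texts),
)
```

Precision drops when the model predicts entities that are not there, and recall drops when it misses real ones. A model can therefore trade one for the other, which is the pattern in the results above.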

One possible cause of the suboptimal results is noise (incorrect or irrelevant data) in the silver dataset. It’s also possible that the precision of deep learning models just isn’t high enough yet. Further tinkering may one day result in a model that proves their hypothesis correct. The important thing is that Team 6’s experiment makes one rethink the state of model training today, and their outside-the-box thinking makes one wonder what other unexpected ways we might utilize language models in the future.

Team 5, who named themselves the Transformers, tackled a very different problem with a similar approach. Typically, Rosette customers must generate their own gold data for evaluating name matching in Rosette. This is because our customers are far more familiar with their own data, but in practice, it involves creating many hundreds of name pairs with varying degrees of similarity to names in the index. It can be tedious, so what if they didn’t have to do it? The idea was simple: create a tool that allows Rosette to automatically generate gold data for customers for any given index.

Photograph of Team 5 members Peter De Bie, Karin Lin, and Chris Mack, joined by Mike Harris on-screen via video chat.

From left to right: Mike Harris (on screen), Peter De Bie, Karin Lin, Chris Mack

The project involved iterating on versions of this tool with different models and data sources. A character-level seq2seq model was tried first, but it was only capable of generating one transformation per name, and only a small percentage of those transformations were helpful as gold data. (Only name variants with a match score between 0.8 and 0.99 were considered “good.”) The team had much more success using a transformer model, hence their team name. This pre-trained Text-to-Text Transfer Transformer (T5) model was able to produce multiple variants per name, giving it an immediate advantage over the seq2seq model. Not only that, but 76.2% of those were good variants. Training this model on existing Rosette training data (for single-token names) and open-source data from JRC-Names boosted that percentage to a whopping 96.7%.
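The “good variant” criterion is easy to express in code. The sketch below is only an illustration: it uses Python’s standard `difflib.SequenceMatcher` as a stand-in scorer, which is not how Rosette computes match scores, and the function names are hypothetical. It also assumes that both ends of the 0.8 to 0.99 range are inclusive, which the write-up does not specify.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

# Thresholds quoted in the article; inclusive bounds are an assumption.
GOOD_MIN = 0.8
GOOD_MAX = 0.99

Scorer = Callable[[str, str], float]


def stand_in_score(name: str, variant: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT Rosette's name-matching score."""
    return SequenceMatcher(None, name.lower(), variant.lower()).ratio()


def good_variants(name: str, candidates: Iterable[str],
                  score: Scorer = stand_in_score) -> List[Tuple[str, float]]:
    """Keep candidates that are similar to the name but not near-identical."""
    scored = [(candidate, score(name, candidate)) for candidate in candidates]
    return [(c, s) for c, s in scored if GOOD_MIN <= s <= GOOD_MAX]


def good_rate(name: str, candidates: Iterable[str],
              score: Scorer = stand_in_score) -> float:
    """Fraction of generated candidates that qualify as good variants."""
    candidates = list(candidates)
    if not candidates:
        return 0.0
    return len(good_variants(name, candidates, score)) / len(candidates)


candidates = ["Katherine", "Catherine", "Bob"]
kept = good_variants("Katherine", candidates)
rate = good_rate("Katherine", candidates)
```

With this stand-in scorer, only “Catherine” is kept: the identical string scores above the upper bound and “Bob” scores far below the lower one. The upper bound matters because a variant that matches almost perfectly tells an evaluator very little, while one that is similar but distinct actually exercises the matcher.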

While it’s important to note that this proof of concept was specific to person names (as opposed to location and organization names) in Latin script, it’s easy to see how this project could scale by adding support for all Rosette-supported languages and entity types. Most importantly, both of these Hackathon projects point in the direction of a future with less time spent generating data and more time spent analyzing text. With all the time we’ll save, who knows what we’ll think up next?

Time-saving Tools for Generating Training and Evaluation Data