Improved visual annotation data model makes it easier to compare entity extractions

Paul Flamburis
BasisTech Hackathon 2022

It’s safe to say that every piece of human-facing software needs to be intelligible to humans. As far as technology has come, humans and computers still speak very different languages (or, at least, prefer different languages). Just as programmers have to learn languages like Python and JavaScript, software users often need an interpreter to help elucidate raw data. Team 1’s Hackathon 2022 project, an improved visual annotation data model for the entity extraction capabilities of Rosette, is a prime example of this.

Rosette stores all data pertaining to its analysis of a document (tokens, entities, parts of speech, etc.) in JSON format. We call these files annotation data models. They are great for standardizing the data in a way that can be communicated to other software, but they can be difficult for a human to parse. A visual annotation data model is a way of displaying this annotation data for the user. For example, Rosette uses color-coded tags to represent the various entity type labels. (You can see this for yourself at demo.rosette.com.) The relationship between annotation data and the visual annotation data model is analogous to the relationship between a reel of film and a movie. While the reel contains all of the movie’s visual information and is compatible with many projectors, only the projected movie can be comfortably interpreted by the viewer. What good is data if you can’t make sense of it?
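To make the reel-versus-movie idea concrete, here is a minimal sketch of turning annotation data into something readable. The JSON layout below (a `text` field plus `entities` with `start`, `end`, and `type`) is a simplified assumption for illustration, not Rosette's actual annotation data model schema, and the bracketed tags stand in for the color-coded tags a real interface would draw.

```python
import json

# Hypothetical, simplified annotation data. The field names are assumptions
# made for this example and do not reflect Rosette's real schema.
ADM_JSON = """
{
  "text": "Oren Ronen leads Team 1 at BasisTech.",
  "entities": [
    {"start": 0, "end": 10, "type": "PERSON"},
    {"start": 27, "end": 36, "type": "ORGANIZATION"}
  ]
}
"""


def render_tagged(adm: dict) -> str:
    """Return the text with each entity wrapped in a [mention|TYPE] tag."""
    text = adm["text"]
    parts = []
    cursor = 0
    # Walk the entities left to right, copying the untagged text between them.
    for ent in sorted(adm["entities"], key=lambda e: e["start"]):
        mention = text[ent["start"]:ent["end"]]
        parts.append(text[cursor:ent["start"]])
        parts.append(f"[{mention}|{ent['type']}]")
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts)


rendered = render_tagged(json.loads(ADM_JSON))
print(rendered)
```

The raw JSON and the rendered line carry the same information; only the second is comfortable to read at a glance.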

BasisTech Hackathon Team 1 members developing visual annotation data model
From left to right: Katsuya Tomioka, David Goldfarb, Jason Alonso, Oren Ronen, Isao Tanner

Team 1 was specifically concerned with making it easier to differentiate between multiple sets of extracted entities at a glance. Rosette is highly configurable, and even the smallest changes can significantly impact the way it extracts entities. When deciding how to configure Rosette for a given project, it can be extremely helpful to compare the outputs of different configurations. With the raw annotation data models, however, there is often too much noise to discern the information you actually want to see. For example, you might just want to see which entities are extracted when entity linking is enabled.

The project was based on a tool created earlier by team leader Oren Ronen, which can visually compare two uploaded annotation data model files side-by-side. For the Hackathon, Team 1 expanded on Oren’s annotation data model differentiation tool and increased its interactivity. Direct integration with Rosette was key to making this work. Once configured with the location of Rosette, the new tool can run it in the background. As a result, the user can simply upload text to the tool (instead of annotation data model files) and sit back while Rosette extracts entities under multiple configurations and the tool displays the results.
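The workflow described above, where one piece of text is run through several configurations, might be sketched roughly as follows. Everything here is an assumption for illustration: `run_configurations`, `fake_extract`, and the `linking` option are hypothetical names, and `fake_extract` is a stand-in that returns canned data rather than calling Rosette.

```python
from typing import Callable, Dict, List

# An extractor takes the text and a dict of options and returns entities.
# In a real tool this would wrap a call to the configured Rosette instance.
Extractor = Callable[[str, dict], List[dict]]


def run_configurations(
    text: str, configs: Dict[str, dict], extract: Extractor
) -> Dict[str, List[dict]]:
    """Run the same text through every named configuration."""
    return {name: extract(text, options) for name, options in configs.items()}


def fake_extract(text: str, options: dict) -> List[dict]:
    """Stand-in extractor with canned output (not a real service call)."""
    entity = {"mention": "BasisTech", "type": "ORGANIZATION"}
    if options.get("linking"):  # hypothetical option name
        entity["entity_id"] = "EXAMPLE-ID"  # placeholder, not a real ID
    return [entity]


results = run_configurations(
    "Oren Ronen leads Team 1 at BasisTech.",
    {"default": {}, "with_linking": {"linking": True}},
    fake_extract,
)
for name, entities in results.items():
    print(name, entities)
```

Passing the extractor in as a parameter keeps the comparison logic independent of how the service is actually reached.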

The importance of the visualization itself made the rest of the project essentially a front-end endeavor. With front-end developers in the minority on the team, the odds were stacked against them, but at the end of the day members with front-end experience, like David Goldfarb, delivered a clean and easy-to-comprehend interface. The user can compare annotations side-by-side to see the big-picture differences, or go into more detail by hovering over any entity to view all the extracted properties. They can also choose to display only the differences between the annotation data model files. Isao Tanner named the tool Panopticon, after an optical instrument that combines a telescope with a microscope, because of its ability to provide visual insight into annotation data on both a micro and macro scale.
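The option to display only the differences can be thought of as a set comparison between two lists of extracted entities. The sketch below is not Panopticon's implementation; the `Entity` tuple and `diff_entities` function are hypothetical, and the sample entities are made up for the example.

```python
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    """Hypothetical minimal entity: character offsets plus a type label."""
    start: int
    end: int
    type: str


def diff_entities(left: List[Entity], right: List[Entity]) -> Dict[str, List[Entity]]:
    """Split two extraction results into shared and one-sided entities."""
    left_set, right_set = set(left), set(right)
    return {
        "only_left": sorted(left_set - right_set),
        "only_right": sorted(right_set - left_set),
        "shared": sorted(left_set & right_set),
    }


# Made-up results from two configurations run on the same text.
config_a = [Entity(0, 10, "PERSON"), Entity(27, 36, "ORGANIZATION")]
config_b = [
    Entity(0, 10, "PERSON"),
    Entity(17, 23, "ORGANIZATION"),
    Entity(27, 36, "ORGANIZATION"),
]

diff = diff_entities(config_a, config_b)
print(diff["only_right"])
```

A differences-only view would then draw just the `only_left` and `only_right` entries and hide the shared ones.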

There’s still tons of room for Panopticon to grow in the future. While this iteration is only designed to interact with entity extraction, it’s not a stretch to imagine Panopticon being used for the rest of Rosette as well. And with employees already using tools like Panopticon in their day-to-day for comparing annotation data, who knows what other improvements might be on the horizon? But most importantly, the Panopticon project serves as an excellent reminder that the human side of software development is just as vital as the technical.