Extracting Entities in RapidMiner Studio™ with Rosette

It’s never been easier to access state-of-the-art text analytics, code-free. Check out our Rosette Text Toolkit extension for RapidMiner, a popular open source predictive analytics platform, and plug the power and accuracy of Rosette text analytics directly into your RapidMiner workflows.

Get up and running with Rosette for RapidMiner Studio with this quick start guide, which covers the installation and setup process. We also demonstrate how to get started extracting and linking entities with Rosette.

Installing RapidMiner and Rosette

If you aren’t already running RapidMiner Studio, download the application from RapidMiner’s website. Then download the Rosette Text Toolkit extension through the RapidMiner Marketplace and sign up for a Rosette API key.

Open RapidMiner Studio, navigate to the Extensions menu and select Marketplace.

A new window will open. Search for “rosette” and select Rosette Text Toolkit from the list of results. Click the Install 1 Packages button at the bottom of the window and follow the click-through instructions to complete the installation.

Once the extension has finished installing, the Rosette operators will be visible in the Extensions folder of the Operators panel.

Getting a Rosette API Key

In order to activate the Rosette Text Toolkit for RapidMiner Studio, you’ll need an API key and a Rosette developer account. Head over to developer.rosette.com and complete the signup process.

You can create an account linked to either your email or your GitHub account. No credit card is required — our default plan gives you 10,000 calls a day for free! If you’re interested in upping your call quota, check out our paid plans.

Once you have completed the signup process and verified your account, click on the API Key tab on the top left of the menu bar to display your key.

Setting up your Rosette API Connection

Back in RapidMiner Studio, input your Rosette API key to start using any of Rosette’s operators. We’ll be looking at the entity extraction operator in the next section, so we’ll use it to set up the Rosette API connection now.

First, locate Extract Entities in the Rosette Text Toolkit folder in the Operators panel and drag it to the Process panel.

You can see the various settings for the Extract Entities operator in the Parameters panel to the right of the Process panel. The first parameter is Connection. Click the Rosette icon to the right of the box.

The Manage Connections window will open. Click the Add Connection button on the bottom left and select Rosette Connection from the Connection type dropdown list. Name your new connection and click the Create button.

Select your new Rosette API connection from the list on the left and enter your Rosette API key in the API KEY box. Use the Test button at the bottom of the window to verify that your connection is working. If you run into any trouble, confirm that you have copied your API key correctly. When you are satisfied that everything is running smoothly, click the Save all changes button to return to the Parameters panel.

Select your new connection from the Connection dropdown list.
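
For readers curious about what the connection stands for outside the GUI, here is a minimal Python sketch of the kind of HTTP request an entity extraction call might boil down to. The endpoint URL, the `X-RosetteAPI-Key` header name, and the `content` field are assumptions based on the public Rosette REST API and may differ for your account or deployment. The helper name `build_entities_request` is hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the Rosette API documentation for your account.
ROSETTE_ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"


def build_entities_request(api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the text to analyze."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        ROSETTE_ENTITIES_URL,
        data=body,
        headers={
            "X-RosetteAPI-Key": api_key,  # assumed header name for the API key
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# "YOUR_API_KEY" is a placeholder; substitute the key from your developer account.
request = build_entities_request(
    "YOUR_API_KEY",
    "Bill Murray will appear in new Ghostbusters film.",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Inside RapidMiner Studio none of this is necessary, since the Rosette connection handles it for you.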

Extracting Entities

Now that you’ve installed the Rosette for RapidMiner extension and set up your API key and connection, you’re almost ready to start analyzing. Last step: download RapidMiner’s Text Processing extension from the RapidMiner Marketplace. It is a helpful set of operators that allow you to load, filter, and analyze text from a variety of different sources.

With that installed, head to RapidMiner Studio, where we’ll use three operators to create a simple entity extraction workflow, or process: Create Document and Documents to Data from Text Processing, and Extract Entities from Rosette. Drag these operators into the Process panel and connect them together, maintaining the order listed above. You can find the operators using the Operators Search Bar.

Select the Create Document operator. In the Parameters panel, check the add label box. Under label type, select text and enter ‘my_text’ for label value. Click the Edit Text button at the top of the panel and copy the text below into the popup window.

“Bill Murray will appear in new Ghostbusters film: Dr. Peter Venkman was spotted filming a cameo in Boston this… http://dlvr.it/BnsFfS.”

Hit the Apply Changes button to save your work.

Now select the Documents to Data operator. In the Parameters panel, enter ‘my_text’ in the text attribute field.

Execute the process using the blue “play” button. The results show five extracted entities. As you can see, Rosette correctly extracted both the names and the location included in the text.

Let’s make our input text a little longer. Add the sentence below to the parameter text and rerun the process.

“Another original Ghostbuster, Dan Akroyd, is also confirmed to have a cameo in the film.”

From the results we can see that Rosette extracts Dan Akroyd’s name as expected. However, eagle-eyed readers may have noticed that “Akroyd” is misspelled. (It should be “Aykroyd.”) This is not uncommon. Name misspellings appear frequently, everywhere from personal blogs to the New York Times online. If you are trying to track a particular entity across a large collection of documents, you want to make sure that you are identifying all possible spellings of that entity’s name. Rosette automatically extracts and links entities with spelling variations and other textual anomalies, unifying them into a single entry.

To demonstrate this functionality, let’s enable Link Entities in the Extract Entities Parameters panel.

Then, we’ll add a third line to the parameter text that includes the correct spelling of Dan Aykroyd’s name, like the one below:

“Actually, the correct spelling is Aykroyd.”

When we run the process again, a new QID column appears in the results. Notice that “Dan Akroyd” and “Aykroyd” have the same QID value — Rosette has correctly identified them as the same entity.
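
If you export the results for further processing, the shared QID is what lets you collapse spelling variants into one entry. The sketch below illustrates the idea in Python. The row layout, the `mention` and `qid` field names, and the helper `group_mentions_by_qid` are hypothetical. The QID strings are placeholders, not real Wikidata identifiers or actual Rosette output.

```python
from collections import defaultdict

# Hypothetical rows mirroring the results table. The QID values are
# placeholders, not real Wikidata identifiers.
rows = [
    {"mention": "Dan Akroyd", "qid": "Q_PLACEHOLDER_A"},
    {"mention": "Aykroyd", "qid": "Q_PLACEHOLDER_A"},
    {"mention": "Bill Murray", "qid": "Q_PLACEHOLDER_B"},
]


def group_mentions_by_qid(rows):
    """Collect every surface form that was linked to the same QID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["qid"]].append(row["mention"])
    return dict(groups)


grouped = group_mentions_by_qid(rows)
for qid, mentions in grouped.items():
    print(qid, mentions)
```

Because grouping relies only on the QID, the misspelled and correctly spelled mentions end up in the same list without any string comparison.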

QID values are drawn from Wikidata, so if an entity has a Wikidata entry, Rosette should be able to link and resolve it.
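
Since QIDs are Wikidata identifiers, a QID can be turned into a link to the corresponding Wikidata page. A minimal Python sketch follows. The helper name `wikidata_url` is hypothetical, and the example uses Q42 (Douglas Adams on Wikidata) rather than a value from the results above.

```python
import re

WIKIDATA_ENTITY_PAGE = "https://www.wikidata.org/wiki/{qid}"


def wikidata_url(qid: str) -> str:
    """Return the Wikidata page URL for a QID such as 'Q42'."""
    # A QID is the letter Q followed by a positive integer.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return WIKIDATA_ENTITY_PAGE.format(qid=qid)


print(wikidata_url("Q42"))  # https://www.wikidata.org/wiki/Q42
```

Any value that does not look like a QID raises an error, which is a convenient way to spot entities that could not be linked.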

QIDs are very useful for machine-reading purposes, but for humans they can be difficult to keep track of. Let’s turn on the Include Entity Name parameter, which will allow us to see the entity names in addition to their QIDs.

Try it Yourself

Now that you’ve got the Rosette Text Toolkit up and running with RapidMiner Studio, you are well equipped to handle a host of text analytics tasks. Incorporate results like the ones above into your pre-existing data processes, and check out our other operators, including Categorization, Sentiment Analysis, Morphological Analysis, Tokenization, Sentence Tagging, Name Translation, and Name Matching.

While you’re at it, keep us posted! We love to hear what our users are working on, and would be thrilled to share your Rosette for RapidMiner story on our blog.

ghostbusters_waterloo
Ethan Doyle White at English Wikipedia, Wikimedia Commons