30 Oct 2017

Deduplicate Names in RapidMiner with Rosette

Announcing a new Rosette Cloud endpoint and RapidMiner operator for data cleansing

Recognizing and reconciling duplicate records is a common headache of database management, especially when the differences are subtle and likely to be missed by most deduping systems. If your records include duplicates that differ by misspellings, nicknames, or initials, you may be missing connections and keeping your agents and team members from the information they need.

Rosette Cloud has launched a new /name-deduplication endpoint, which uses our industry-leading fuzzy name matching to connect database records that contain moderate, or “fuzzy,” variations. Unlike deduplicators that can only pick out exact matches, Rosette lets the user find and reconcile similar records for cleaner databases. To make this functionality more easily accessible, we simultaneously released a “Deduplicate Names” operator for RapidMiner Studio, which uses the new endpoint under the hood.

Deduplication at work

The Rosette Deduplicate Names operator identifies candidate duplicates in a list of names by assigning “group IDs” to groups of matching names. The operator can process lists of up to 1,000 English names and assigns group IDs based on a user-specified match threshold. The threshold sets the minimum similarity score required for two names to be considered duplicates. To set it, click on the operator and enter a value between 0 and 1 in the “Threshold” field. We recommend starting with a threshold of 0.8 and experimenting with higher or lower values depending on your use case and results.
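To make the role of the threshold concrete, here is a minimal Python sketch of threshold-based grouping. It is not Rosette's matching algorithm: the similarity function is a stand-in built on the standard library's difflib, and the names similarity and assign_group_ids are hypothetical helpers invented for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]. NOT Rosette's name matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_group_ids(names, threshold=0.8):
    """Return one group ID per name; names scoring >= threshold share an ID."""
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Link every pair of names whose score meets the threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel the roots as consecutive IDs starting at 1.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(names))]
```

Raising the threshold in this sketch splits groups apart, and lowering it merges more names together, which is the trade-off to explore when tuning the operator's “Threshold” field.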

Given a list of names as input, the output is a list of cluster IDs (integers), one for each name, in no particular order. The output can then be sorted by cluster ID to group possible duplicate names together. For example:

| Input (names) | Output (cluster ID) |
| --- | --- |
| John Smith | 1 |
| Cyndi McBoysen | 2 |
| Dmitri Shostakovich | 4 |
| Jim Hockenberry | 3 |
| Takeshi Suzuki | 5 |
| Jon Smythe | 1 |
| James Hawkenbury | 3 |
| Cindy MacBoysen | 2 |
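If you export the operator's output, the sorting step can be sketched in a few lines of Python. The names and cluster IDs below are copied from the example above; the helper names group_by_cluster and likely_duplicates are hypothetical and not part of Rosette or RapidMiner.

```python
from collections import defaultdict

# Names and cluster IDs copied from the example above.
names = [
    "John Smith", "Cyndi McBoysen", "Dmitri Shostakovich", "Jim Hockenberry",
    "Takeshi Suzuki", "Jon Smythe", "James Hawkenbury", "Cindy MacBoysen",
]
cluster_ids = [1, 2, 4, 3, 5, 1, 3, 2]


def group_by_cluster(names, cluster_ids):
    """Map each cluster ID to the names that received it, sorted by ID."""
    groups = defaultdict(list)
    for name, cluster_id in zip(names, cluster_ids):
        groups[cluster_id].append(name)
    return dict(sorted(groups.items()))


def likely_duplicates(groups):
    """Keep only the clusters that contain more than one name."""
    return {cid: members for cid, members in groups.items() if len(members) > 1}


for cluster_id, members in likely_duplicates(group_by_cluster(names, cluster_ids)).items():
    print(cluster_id, members)
```

Clusters with a single member, such as Dmitri Shostakovich and Takeshi Suzuki here, have no candidate duplicates and are filtered out.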

Note: If your data is split across multiple fields (e.g., First Name, Middle Name, Last Name), merge the fields into one before running the data through the Rosette Deduplicate Names operator. RapidMiner’s Generate Concatenation operator can be used to merge multiple fields into one.
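If you prepare your data outside RapidMiner, the same merge can be sketched in Python. The field names first_name, middle_name, and last_name and the helper merge_name_fields are assumptions made for illustration; substitute the column names your records actually use.

```python
def merge_name_fields(record, fields=("first_name", "middle_name", "last_name")):
    """Join the non-empty name parts of one record with single spaces."""
    # Missing, None, or blank fields are skipped so no stray spaces appear.
    parts = (str(record.get(field) or "").strip() for field in fields)
    return " ".join(part for part in parts if part)


record = {"first_name": "Cyndi", "middle_name": "", "last_name": "McBoysen"}
print(merge_name_fields(record))  # prints "Cyndi McBoysen"
```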

Further refine your results with additional fields

When you submit a name-deduplication request in RapidMiner, you need only input a list of names. However, if the entity type is known, you can also set it to person (default), location, or organization to improve accuracy.

The Rosette Cloud /name-deduplication endpoint also supports language and script fields, beyond those offered in RapidMiner, to further improve your results. All possible input fields are listed below:

| Field | Description | Required |
| --- | --- | --- |
| name | Name to match | yes |
| language | Three-letter language code for the name (language codes) | no (but strongly recommended if the source language is known) |
| entityType | PERSON (default), LOCATION, or ORGANIZATION | no |
| script | Four-letter code for the script in which the name is written (script codes) | no |
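Before sending records to the endpoint, it can help to check them against the field rules listed above. The following Python sketch does only that: check_name_record is a hypothetical helper, it makes no network call, and it checks the shape of each code (three or four letters) rather than whether the code is actually registered. The exact request format is not shown here; consult the Rosette API documentation for it.

```python
ENTITY_TYPES = {"PERSON", "LOCATION", "ORGANIZATION"}


def is_alpha_code(value, length):
    """True if value is a string of exactly `length` letters."""
    return isinstance(value, str) and len(value) == length and value.isalpha()


def check_name_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is required")
    if "language" in record and not is_alpha_code(record["language"], 3):
        problems.append("language must be a three-letter code")
    # entityType is optional; PERSON is the documented default.
    if record.get("entityType", "PERSON") not in ENTITY_TYPES:
        problems.append("entityType must be PERSON, LOCATION, or ORGANIZATION")
    if "script" in record and not is_alpha_code(record["script"], 4):
        problems.append("script must be a four-letter code")
    return problems


print(check_name_record({"name": "John Smith", "language": "eng", "script": "Latn"}))
```

A record with only a name passes, since every other field is optional.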

Try it yourself

Ready to start deduplicating the names in your data? First, sign up for a 30-day free trial of Rosette Cloud, then head over to RapidMiner.

If you need to process large volumes of records or would prefer not to send your data to the cloud, talk to our sales team about custom solutions and on-premises deployments.