Entity Extraction


Adds structure to your unstructured, multilingual text by automatically identifying people, organizations, and locations

Overview

Things, not strings

Entities are the key actors in your free-form text data: the organizations, people, locations, products, and dates. Rosette uncovers these entities, delivering structure, clarity, and insight to your data with adaptability, easy deployment, and consistent accuracy and performance across a broad array of languages and text genres.

Rosette uses a synthesis of machine learning techniques, including perceptrons, support vector machines, word embeddings, and deep neural networks, to balance performance and accuracy.

Real world applications

Entity extraction is the foundation for applications in eDiscovery, social media analysis, financial compliance, and government intelligence. Rosette allows you to:

  • Resolve a person’s identity for government security and fraud detection
  • Track customer sentiment around products and companies
  • Analyze research for patent law, legal discovery, and compliance
  • Exploit valuable information from open source intelligence
  • Provide targeted search for content publishers and recommendation engines

Customizable to your unique needs

Rosette entity extraction is highly adaptable. In addition to supervised training, our on-premise field training kits enable you to create personalized entity extraction models for your use case by simply adding a quantity of your own data, without any annotation.

This ability makes it possible to train Rosette on a specific type of content, such as news articles, blogs, restaurant reviews, financial documents, medical records, legal contracts, and patent filings, or short strings of text like tweets. Users can also create new entity types beyond our prebuilt list, such as disease and drug names for a medical extractor or job titles and skills for resume evaluation.

Product highlights

  • 21 supported languages
  • 29 entity types and 450+ sub-types detected
  • Coreference resolution chains together mentions of the same entity
  • Link entities to knowledge bases using document context
  • Model fusion: a hybrid of techniques to accurately extract each type of entity
  • Deep learning models in select languages for greater accuracy
  • Accepts case-insensitive English input (all upper or lower case)
  • Confidence scores for each result
  • Cloud or enterprise deployments
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

How It Works

Hybrid approach balances performance and accuracy

For each entity type extracted, we choose the approach that will produce the best results. Rosette combines advanced statistical modeling and neural networks, complemented by pattern matching and entity lists. This hybrid system has the flexibility to detect entities missed by more simplistic solutions, improving precision and recall.

Machine learning

Statistical modeling finds entities based on their context, not strict matching of strings or patterns. For that reason, only high-quality training data will yield superior results. Rosette’s models are trained on a carefully curated corpus of millions of news articles, social media posts, and blog posts. Our in-house team thoroughly annotates the data, and native speakers cross-check the tags for consistency.

These machine-learned models do not rely on external knowledge bases to find entities, thus producing fewer false positives than systems that exclusively rely on external sources to extract entities. Rosette will detect novel entities that don’t appear in any database.

Gazetteers and entity lists

Unlike a home-brewed or academic extractor, our custom entity lists, or gazetteers, are regularly updated and stress-tested for enterprise-level speed and performance. With customers across industry and government, Rosette Entity Extractor can support gazetteers of several million entries with high performance.

Available to on-premise customers, gazetteers can be added when users know specific words or phrases that they expect to discover in their data. For example, a clothing manufacturer may add a list of basic colors they’d like to extract from tweets.
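As a rough illustration of the idea (not Rosette's implementation or configuration format), a gazetteer lookup can be sketched as a case-insensitive, whole-word search for listed phrases. The COLOR entity type, the COLORS list, and the find_gazetteer_matches helper are all hypothetical names chosen for this example.

```python
import re

# Hypothetical gazetteer: basic colors a clothing manufacturer might track.
COLORS = ["red", "navy blue", "blue", "green"]


def find_gazetteer_matches(text, entries, entity_type):
    """Return (type, mention, start, end) for each whole-word gazetteer hit."""
    # Try longer entries first so "navy blue" wins over "blue".
    ordered = sorted(entries, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [
        (entity_type, m.group(0), m.start(), m.end())
        for m in pattern.finditer(text)
    ]


tweet = "Loving my new Navy Blue jacket, but the red one runs small."
matches = find_gazetteer_matches(tweet, COLORS, "COLOR")
for match in matches:
    print(match)
```

A single compiled alternation keeps this sketch short; a list with millions of entries would need a more scalable structure, such as a trie or a hash lookup over token n-grams.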

In addition, Rosette is pretrained on entity databases, such as Wikipedia and DBpedia, which allows it to extract and link entity mentions simultaneously. This technique is especially powerful for processing very short strings, such as tweets and captions.

Pattern matching

Rules expressed as regular expressions find entities that follow a pattern, such as dates, times, and email addresses. Many standard string patterns are prebuilt into our entity extractor, and on-premise customers can easily customize their extraction workflow by editing or adding rules based on their specific needs.
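A minimal sketch of rule-based extraction, assuming three simplified patterns written for this example. They are not Rosette's prebuilt rules, and the PATTERNS table and extract_by_pattern helper are hypothetical.

```python
import re

# Illustrative patterns only; production rules would be stricter and broader.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
}


def extract_by_pattern(text):
    """Return (type, mention, start) tuples sorted by position in the text."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.group(0), m.start()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Email info@basistech.com before 2016-12-01 at 09:30."
hits = extract_by_pattern(sample)
for hit in hits:
    print(hit)
```

Adding a rule in this sketch means adding one entry to the pattern table, which mirrors the idea of customizing extraction by editing or adding rules.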

Customization in the field

For use cases where every additional point of accuracy is critical, or for domain-specific entity types, we offer customization tools and services for on-premise deployments. Within Rosette, you can add new entity types or boost your results by:

  • Adding/editing entity lists
  • Adding/editing regular expressions for pattern matching
  • Retraining statistical models
    • Unsupervised training (using unannotated data) for greater accuracy on your data and domain
    • Supervised training (using annotated data) for greater accuracy or adding new entity types

Coreference Resolution

Within a single document, this process links together mentions that refer to the same real-world entity. There are three types of coreference resolution.

Type | Example
Named entity | Katherine Johnson’s calculations of orbital mechanics were critical to the success of NASA missions to the moon. Johnson calculated trajectories, launch windows, and emergency return paths.
Pronominal | Katherine Johnson’s calculations of orbital mechanics were critical to the success of NASA missions to the moon. She calculated trajectories, launch windows, and emergency return paths.
Nominal | Katherine Johnson’s calculations of orbital mechanics were critical to the success of NASA missions to the moon. The mathematician calculated trajectories, launch windows, and emergency return paths.
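To make the named entity case concrete, here is a deliberately naive sketch that chains a shorter name to an earlier mention containing all of its words. It is a toy heuristic, not how Rosette resolves coreference, and it does not attempt the pronominal or nominal cases, which require context. The chain_named_mentions helper is hypothetical.

```python
def chain_named_mentions(mentions):
    """Group name mentions whose words all appear in an earlier chain's head.

    Naive heuristic for illustration; real coreference resolution also
    relies on document context.
    """
    chains = []  # each chain is a list of mentions; chain[0] is the head
    for mention in mentions:
        tokens = set(mention.lower().split())
        for chain in chains:
            head_tokens = set(chain[0].lower().split())
            if tokens <= head_tokens:
                chain.append(mention)
                break
        else:
            # No existing chain matched, so this mention starts a new one.
            chains.append([mention])
    return chains


mentions = ["Katherine Johnson", "NASA", "Johnson"]
chains = chain_named_mentions(mentions)
print(chains)
```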

Tech Specs

Availability and platform support

Deployment availability:
Plugins:
Bindings:

Supported languages*

Arabic | German | Korean | Spanish
Chinese, Simplified | Hebrew | Malay | Urdu
Chinese, Traditional | Hungarian | Pashto | Vietnamese
Dutch | Indonesian | Persian
English | Italian | Portuguese
French | Japanese | Russian

Entity types**

Person | Nationality | ID Number | Time
Location | Religion | Phone | Lat/Long
Organization | Money | E-Mail | Anatomy
Product | Credit Card | Activity | Language
Title | URL | Food | Substance
Disease | Event | Species
Measure | MISC | Distance
Transport | Number | Date

*Rosette also supports case-insensitive input for English (i.e., all uppercase or all lowercase text).

**In addition to the entity types above, Rosette recognizes over 450 sub-entity types and links to a Wikidata QID and DBpedia parse tree when they are available.

As an example:

“Ibuprofen” will be tagged as “SUBSTANCE”, linked to the Wikidata ID Q186969, and assigned the DBpedia tree “ChemicalSubstance/Drug”.

Try the Demo

1) Select a sample or paste in your own text

2) Click “Analyze”

3) Select the “Entities” tab on the right side of the screen

4) Click on an entity for additional detail such as sentiment and knowledge base links

Rosette Cloud

Easy to use

Built for the most demanding text analytics applications and engineered to deliver high accuracy without sacrificing speed, Rosette Cloud is instantly accessible and offers a variety of plans to suit both startups and enterprises. Our entity extraction endpoint is prebuilt to recognize and extract 700+ entity types with coverage across 21 languages.

To try entity extraction and the rest of Rosette Cloud’s endpoints, sign up today for a 30-day free trial!

Get a Rosette Cloud Key
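Once you have a key, a request to the entity extraction endpoint might be assembled along these lines. This is a sketch using only the Python standard library. The endpoint URL, the API key header name, and the "content" field in the request body are assumptions that should be checked against the current Rosette Cloud documentation, and build_entities_request is a hypothetical helper.

```python
import json
import urllib.request

# Assumed values; confirm both against the current Rosette Cloud documentation.
ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"
API_KEY_HEADER = "X-RosetteAPI-Key"


def build_entities_request(text, api_key):
    """Build (but do not send) a POST request carrying the text to analyze."""
    # The "content" field name is also an assumption.
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        ENTITIES_URL,
        data=body,
        headers={
            API_KEY_HEADER: api_key,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_entities_request(
    "Bridget Fitzpatrick joined the Securities and Exchange Commission.",
    "YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the sketch stops short of that so it runs without a key or network access.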

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various Rosette Cloud endpoints alongside examples in the binding of your choice.

Visit our GitHub for bindings and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on Rosette Cloud knowing that scalable, customizable, enterprise deployments are available if you need them.

{
  "entities": [
    {
      "type": "ORGANIZATION",
      "mention": "Securities and Exchange Commission",
      "normalized": "Securities and Exchange Commission",
      "count": 3,
      "mentionOffsets": [
        {
          "startOffset": 4,
          "endOffset": 38
        },
        {
          "startOffset": 166,
          "endOffset": 169
        },
        {
          "startOffset": 536,
          "endOffset": 539
        }
      ],
      "entityId": "Q953944",
      "confidence": 0.67070782,
      "linkingConfidence": 0.27190905,
      "dbpediaType": "Agent/Organisation/GovernmentAgency"
    },
    {
      "type": "PERSON",
      "mention": "Bridget Fitzpatrick",
      "normalized": "Bridget Fitzpatrick",
      "count": 2,
      "mentionOffsets": [
        {
          "startOffset": 99,
          "endOffset": 118
        },
        {
          "startOffset": 287,
          "endOffset": 298
        }
      ],
      "entityId": "T1",
      "confidence": 0.92063326
    },
    {
      "type": "PERSON",
      "mention": "David Gottesman",
      "normalized": "David Gottesman",
      "count": 2,
      "mentionOffsets": [
        {
          "startOffset": 174,
          "endOffset": 189
        },
        {
          "startOffset": 307,
          "endOffset": 316
        }
      ],
      "entityId": "Q5234268",
      "confidence": 0.92488831,
      "linkingConfidence": 0.47211223,
      "dbpediaType": "Agent/Person"
    },
    {
      "type": "TITLE",
      "mention": "Chief Litigation Counsel",
      "normalized": "Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 134,
          "endOffset": 158
        }
      ],
      "entityId": "T2",
      "confidence": 0.3306601
    },
    {
      "type": "TITLE",
      "mention": "Deputy Chief Litigation Counsel",
      "normalized": "Deputy Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 229,
          "endOffset": 260
        }
      ],
      "entityId": "T5",
      "confidence": 0.81287289
    },
    {
      "type": "TEMPORAL:DATE",
      "mention": "December 2016",
      "normalized": "December 2016",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 268,
          "endOffset": 281
        }
      ],
      "entityId": "T6"
    },
    {
      "type": "TITLE",
      "mention": "Ms.",
      "normalized": "Ms.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 283,
          "endOffset": 286
        }
      ],
      "entityId": "T7",
      "confidence": 0.76600134
    },
    {
      "type": "TITLE",
      "mention": "Mr.",
      "normalized": "Mr.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 303,
          "endOffset": 306
        }
      ],
      "entityId": "T9",
      "confidence": 0.72353458
    },
    {
      "type": "TITLE",
      "mention": "Co-Acting Chief Litigation Counsel",
      "normalized": "Co-Acting Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 332,
          "endOffset": 366
        }
      ],
      "entityId": "T11",
      "confidence": 0.03582656
    },
    {
      "type": "LOCATION",
      "mention": "Washington D.C.",
      "normalized": "Washington D.C.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 460,
          "endOffset": 475
        }
      ],
      "entityId": "Q61",
      "linkingConfidence": 0.66086622,
      "dbpediaType": "Place/PopulatedPlace/Settlement"
    }
  ]
}
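A response shaped like the sample above can be post-processed with a few lines of code. The sketch below filters entities by confidence score and collects the knowledge base IDs of linked entities. It assumes, based only on the sample, that linked entities carry a "linkingConfidence" field and that some entities omit "confidence". The helper names are hypothetical.

```python
import json

# Trimmed copy of the sample response above (three of the ten entities).
SAMPLE = """
{"entities": [
  {"type": "ORGANIZATION", "mention": "Securities and Exchange Commission",
   "count": 3, "entityId": "Q953944", "confidence": 0.67070782,
   "linkingConfidence": 0.27190905},
  {"type": "PERSON", "mention": "Bridget Fitzpatrick",
   "count": 2, "entityId": "T1", "confidence": 0.92063326},
  {"type": "TITLE", "mention": "Co-Acting Chief Litigation Counsel",
   "count": 1, "entityId": "T11", "confidence": 0.03582656}
]}
"""


def confident_mentions(response, threshold):
    """Mentions whose extraction confidence meets the threshold.

    Entities without a "confidence" field are treated as 0.0.
    """
    return [
        e["mention"]
        for e in response["entities"]
        if e.get("confidence", 0.0) >= threshold
    ]


def linked_ids(response):
    """Map mention -> entityId for entities that carry a linkingConfidence."""
    return {
        e["mention"]: e["entityId"]
        for e in response["entities"]
        if "linkingConfidence" in e
    }


response = json.loads(SAMPLE)
print(confident_mentions(response, 0.5))
print(linked_ids(response))
```

The right threshold depends on whether the application favors precision or recall; 0.5 here is only a placeholder.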

Rosette Enterprise

Customize and scale your entity extraction on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise deployments to be hosted on your internal servers. Our field training kits enable you to run unsupervised training on your own data to create personalized entity extraction models for your use case, or create custom entity types.

Request a free product evaluation

If your organization requires an enterprise solution, we’re happy to work with you to meet your unique needs. For a free evaluation of Rosette Enterprise, please complete the form below and our Customer Engineering team will provide you with an evaluation package.

Drop us a line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

Select Customers Include

Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce, enabling users to search every field, file, and object across multiple orgs and other data sources.
