Introducing: Rosettepedia

A text analytics recipe for entity extraction enhancement

The Rosette Cloud team is always hard at work devising ways for our users to get more value from their unstructured text data. Last month we published a recipe on our community GitHub that combined multiple Rosette endpoints to produce document summaries. This month, we’re thrilled to announce “Rosettepedia,” a new recipe that gives users instant access to a wealth of additional information about the entities in their text data.

Rosette’s entity extraction endpoint recognizes and extracts 18 different entity types within your text, but what if Rosette extracts an entity you’re not familiar with yet? Or an entity you recognize but don’t know very much about? The Rosettepedia recipe allows you to enhance your entity extraction results with information from Wikipedia Infoboxes and Wikidata drawn from the MediaWiki API.

How it works

The Rosettepedia script calls Rosette Cloud’s entity extraction and entity linking capabilities, connects to publicly available Wikidata entries, and automatically returns any relevant information along with the extracted entities, enriching your results while saving you time and effort.

Each entity in Wikidata has an identifier—a “QID”—that uniquely identifies it. The /entities endpoint of Rosette Cloud can resolve or link mentions of entities by assigning the appropriate identifier. For example, Washington (Q1223) refers to the state in the United States, Washington (Q61) refers to the city in the District of Columbia, and Washington (Q23) refers to the first president of the United States. This recipe demonstrates how to look up an entity by its QID, provided by Rosette Cloud, and access additional information provided by the Wikidata knowledge base.
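To make the QID lookup concrete, here is a minimal Python sketch that builds a request to Wikidata's public wbgetentities API action and reads a label out of a decoded response. It is an illustration only, not the code inside rosettepedia.py: the function names build_entity_url and label_for are hypothetical, and the sample response is hand-written to mirror the shape of a real reply rather than fetched live.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_entity_url(qid, language="en"):
    """Build a wbgetentities request URL for a single QID (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "props": "labels|descriptions",
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def label_for(response, qid, language="en"):
    """Return the label for `qid` from a decoded response, or None if absent."""
    entity = response.get("entities", {}).get(qid, {})
    return entity.get("labels", {}).get(language, {}).get("value")


# Hand-written sample shaped like a wbgetentities reply (assumption: not live data).
sample = {
    "entities": {
        "Q1741": {"labels": {"en": {"language": "en", "value": "Vienna"}}}
    }
}

print(build_entity_url("Q1741"))
print(label_for(sample, "Q1741"))
```

In practice you would fetch the URL with an HTTP client of your choice and pass the decoded JSON to label_for; the network call is left out here so the sketch runs offline.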

At the time of writing, Rosette Cloud’s entity linking functionality supports four languages: Chinese, English, Japanese and Spanish.
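Because only four languages are supported, a caller may want to check the language before attempting a lookup. The sketch below assumes three-letter codes of the kind used in this post's examples (eng, jpn) and maps them to Wikipedia language editions; the codes spa and zho, the mapping itself, and the helper name wikipedia_base_url are assumptions for illustration, not taken from the script.

```python
# Assumed mapping from three-letter language codes to Wikipedia editions.
# Only "eng" and "jpn" appear in this post's examples; "spa" and "zho" are guesses.
WIKI_LANGUAGE = {"eng": "en", "jpn": "ja", "spa": "es", "zho": "zh"}


def wikipedia_base_url(language_code):
    """Return the article base URL for a supported language (hypothetical helper)."""
    try:
        edition = WIKI_LANGUAGE[language_code]
    except KeyError:
        raise ValueError(
            f"entity linking is not supported for {language_code!r}"
        ) from None
    return f"https://{edition}.wikipedia.org/wiki/"


print(wikipedia_base_url("jpn"))
```

Failing early with a clear error is one design choice; a script could equally fall back to English or skip the augmentation step for unsupported languages.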

Rosettepedia in action

The simplest way to use the script is to pipe in a string:

$ echo "OPEC will meet in Vienna this week." | ./rosettepedia.py -w eng > opec.json
Extracting entities via Rosette API ...
Done!
Augmenting entities via MediaWiki API ...
fetching "en" Infobox/Wikidata for entity: Q7795 (OPEC) ...
fetching "en" Infobox/Wikidata for entity: Q1741 (Vienna) ...
Done!

The script returns the following results:

[
 {
   "type": "ORGANIZATION",
   "mention": "OPEC",
   "normalized": "OPEC",
   "count": 1,
   "entityId": "Q7795",
   "wikipedia": {
     "infobox": {
       "name": "Organization of the Petroleum Exporting Countries",
       "image_flag": "Flag of OPEC.svg",
       "image_map": "OPEC.svg",
       "org_type": "International cartel",
       "membership_type": "Membership",
       "admin_center_type": "Headquarters",
       "admin_center": "Vienna, Austria",
       "languages_type": "Official language",
       "languages": "English",
       "leader_title1": "Secretary General",
       "leader_name1": "Mohammed Barkindo",
       "established": "Baghdad, Iraq",
       "established_event1": "Statute",
       "established_date1": "September 1960",
       "established_event2": "In effect",
       "established_date2": "January 1961",
       "currency": "(US$ /bbl)"
     },
     "wikidata": {
       "website": "http://www.opec.org",
       "image": "OPEC-building-01.jpg",
       "instance": "international organization",
       "category": "Category:OPEC"
     },
     "title": "OPEC",
     "url": "https://en.wikipedia.org/wiki/OPEC"
   }
 },
 {
   "type": "LOCATION",
   "mention": "Vienna",
   "normalized": "Vienna",
   "count": 1,
   "entityId": "Q1741",
   "wikipedia": {
     "infobox": {
       "name": "Vienna",
       "native_name": "Wien",
       "settlement_type": "Capital city",
       "image_flag": "Flag of Wien.svg",
       "image_seal": "Vienna seal 1926.svg",
       "image_shield": "Wien 3 Wappen.svg",
       "shield_size": "80px",
       "image_map": "Wien in Austria.svg",
       "map_caption": "Location of Vienna in Austria",
       "subdivision_type": "Country",
       "subdivision_name": "Austria",
       "leader_party": "SPÖ",
       "leader_title": "Mayor and Governor",
       "leader_name": "Michael Häupl",
       "leader_title1": "Vice-Mayors and Vice-Governors",
       "area_magnitude": "2 chaiz",
       "area_total_km2": "414.65",
       "area_land_km2": "395.26",
       "area_water_km2": "19.39",
       "elevation_m": "151 (Lobau) – 542 (Hermannskogel)",
       "elevation_ft": "495–1778",
       "population_total": "1,867,960",
       "population_as_of": "1. January 2017",
       "population_density_km2": "4326.1",
       "population_metro": "2,600,000",
       "population_blank2_title": "Ethnicity",
       "population_blank2": "61.2% Austrian38.8% Other",
       "population_demonym": "Viennese, Wiener",
       "population_note": "Statistik Austria, VCÖ – Mobilität mit Zukunft",
       "postal_code_type": "Postal code",
       "postal_code": "1010–1423, 1600, 1601, 1810, 1901",
       "website": "www.wien.gv.at",
       "footnotes": "frameless|x30px",
       "blank1_name": "- GDP total (2014)http://ec.europa.eu/eurostat/documents/2995521/7192292/1-26022016-AP-EN.pdf/602b34e8-abba-439e-b555-4c3cb1dbbe6e",
       "blank1_info": "€82 billion/ US$110 billion",
       "blank2_name": "- GDP per capita(2014)http://ec.europa.eu/eurostat/documents/2995521/7192292/1-26022016-AP-EN.pdf/602b34e8-abba-439e-b555-4c3cb1dbbe6e",
       "blank2_info": "€47,300/ US$63,000XE.com average GBP/ USD ex. rate in 2014",
       "timezone": "CET",
       "utc_offset": "+1",
       "timezone_DST": "CEST",
       "utc_offset_DST": "+2",
       "blank_name": "Vehicle registration",
       "blank_info": "W"
     },
     "wikidata": {
       "image": "Collage von Wien.jpg",
       "coordinates": {
         "latitude": 48.20833,
         "longitude": 16.373064,
         "altitude": null,
         "precision": 1e-06,
         "globe": "http://www.wikidata.org/entity/Q2"
       },
       "website": "https://www.wien.gv.at/",
       "instance": [
         "city",
         "capital",
         "city with millions of inhabitants",
         "federal capital",
         "municipality of Austria",
         "place with town rights and privileges",
         "statuatory city of Austria",
         "state of Austria",
         "district of Austria",
         "metropolis",
         "tourist destination"
       ],
       "country": [
         "Austria",
         "First Republic of Austria",
         "Austria-Hungary",
         "Republic of German-Austria",
         "Austrian Empire",
         "Federal State of Austria",
         "Nazi Germany",
         "Habsburg Empire",
         "Archduchy of Austria",
         "Duchy of Austria",
         "March of Austria",
         "Duchy of Bavaria",
         "Allied-occupied Austria"
       ],
       "category": "Category:Vienna"
     },
     "title": "Vienna",
     "url": "https://en.wikipedia.org/wiki/Vienna"
   }
 }
]

As you can see, the Rosettepedia script returns detailed results for OPEC and Vienna, augmenting the attributes that Rosette Cloud normally returns (the entity type, count, and QID) with an additional “wikipedia” attribute that contains the infobox data and Wikidata properties.
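If you want to work with these results programmatically, a short Python sketch along the following lines could pull out a few fields per entity. It assumes the JSON structure shown in the output above; the helper names as_list and summarize are hypothetical, and the inline sample is a trimmed copy of that output rather than a fresh API response. Note that a Wikidata field such as "instance" appears as a single string for OPEC and as a list for Vienna, so the sketch normalizes it to a list.

```python
import json


def as_list(value):
    """Normalize a field that may be missing, a single value, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize(entities):
    """Collect mention, type, QID, website and instance list for each entity."""
    rows = []
    for entity in entities:
        wikidata = (entity.get("wikipedia") or {}).get("wikidata") or {}
        rows.append({
            "mention": entity["mention"],
            "type": entity["type"],
            "qid": entity["entityId"],
            "website": wikidata.get("website"),
            "instance": as_list(wikidata.get("instance")),
        })
    return rows


# Trimmed copy of the sample output above (most fields omitted for brevity).
sample = json.loads("""
[
  {"type": "ORGANIZATION", "mention": "OPEC", "entityId": "Q7795",
   "wikipedia": {"wikidata": {"website": "http://www.opec.org",
                              "instance": "international organization"}}},
  {"type": "LOCATION", "mention": "Vienna", "entityId": "Q1741",
   "wikipedia": {"wikidata": {"website": "https://www.wien.gv.at/",
                              "instance": ["city", "capital"]}}}
]
""")

for row in summarize(sample):
    print(row["mention"], row["qid"], row["website"], row["instance"])
```

To run this against real output, replace the inline sample with the contents of a file such as opec.json, loaded with json.load.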

Another way to use the script is to have Rosette Cloud extract content from a web page by supplying a URL and using the -u/--content-uri option:

$ ./rosettepedia.py -u -i 'https://ja.wikipedia.org/wiki/アメリカスカップ' -w jpn > アメリカスカップ.json
Extracting entities via Rosette API ...
...
Done!
$ jq '.entities[]|select(.entityId == "Q29")' アメリカスカップ.json
{
  "type": "LOCATION",
  "mention": "Español",
  "normalized": "Español",
  "count": 1,
  "entityId": "Q29",
  "wikipedia": {
    "infobox": {},
    "wikidata": {
      "coordinates": {
        "latitude": 40,
        "longitude": -3,
        "altitude": null,
        "precision": 1,
        "globe": "http://www.wikidata.org/entity/Q2"
      },
      "image": "Relief Map of Spain.png",
      "continent": [
        "ヨーロッパ",
        "アフリカ"
      ],
      "instance": [
        "主権国家",
        "国",
        "欧州連合加盟国",
        "国際連合加盟国",
        "欧州評議会加盟国"
      ],
      "category": "Category:スペイン",
      "country": "スペイン"
    },
    "title": "スペイン",
    "url": "https://ja.wikipedia.org/wiki/スペイン"
  }
}

Given the additional information provided by the Wikipedia extended attributes, you can filter down to only those entities that satisfy certain properties. For instance, you can query for only those entities that have geo-coordinates:

$ jq '.entities[]|select(.wikipedia.wikidata|has("coordinates"))' アメリカスカップ.json
...
{
  "type": "LOCATION",
  "mention": "JPN",
  "normalized": "JPN",
  "count": 1,
  "entityId": "Q17",
  "wikipedia": {
    "infobox": {},
    "wikidata": {
      "coordinates": {
        "latitude": 35,
        "longitude": 136,
        "altitude": null,
        "precision": 1,
        "globe": "http://www.wikidata.org/entity/Q2"
      },
      "instance": [
        "主権国家",
        "国",
        "島国",
        "国際連合加盟国"
      ],
      "continent": "アジア",
      "category": "Category:日本",
      "country": "日本"
    },
    "title": "日本",
    "url": "https://ja.wikipedia.org/wiki/日本"
  }
}
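The same filter can be expressed in Python if you prefer to post-process the results in a script rather than with jq. This is a sketch under the assumption that each entity follows the structure shown in this post; the function name with_coordinates is hypothetical, and the sample entities are trimmed copies of entries from the output above.

```python
def with_coordinates(entities):
    """Keep only entities whose Wikidata block includes a "coordinates" key."""
    selected = []
    for entity in entities:
        wikidata = (entity.get("wikipedia") or {}).get("wikidata") or {}
        if "coordinates" in wikidata:
            selected.append(entity)
    return selected


# Trimmed entries from this post: Japan (Q17) has coordinates, OPEC (Q7795) does not.
sample = [
    {"mention": "JPN", "entityId": "Q17",
     "wikipedia": {"wikidata": {"coordinates": {"latitude": 35, "longitude": 136}}}},
    {"mention": "OPEC", "entityId": "Q7795",
     "wikipedia": {"wikidata": {"website": "http://www.opec.org"}}},
]

for entity in with_coordinates(sample):
    coords = entity["wikipedia"]["wikidata"]["coordinates"]
    print(entity["entityId"], coords["latitude"], coords["longitude"])
```

Unlike the jq expression, this version tolerates entities that lack a "wikipedia" or "wikidata" block by treating them as having no coordinates.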

Try it yourself

With access to Rosettepedia, you’re now able to extract information from your text data instead of just entities. Speed up research projects and enhance intelligence analysts’ reports with public data. Have your own knowledge base of customer information or persons of interest? Talk to our customer engineering team about on-premises customization opportunities.

Ready to get started? First, sign up for a free API key (no credit card required) for a 30-day free trial. Next, visit our community GitHub for step-by-step instructions on installing and running the script.

Thought of another way to combine Rosette Cloud endpoints for a new use case? Let us know and we’ll feature you on our blog!