Enhancing Apache Solr with Entity Extraction
October 22, 2012
In a July blog post, we presented the many virtues and uses of entity extraction. A particularly common application of entity extraction is to enhance search applications. Having metadata—such as person names, locations, organization names, and so forth—aids navigation by allowing faceting of search results, and offers more precise search options with the ability to query or boost entities directly.
In this post, we’ll show you how to integrate entity extraction into Apache Solr, one of the most popular open source search engines. If you are also interested in improving keyword search results in general, you may want to look at an earlier blog post where we covered integration of advanced linguistic analysis (tokenization, lemmatization, decompounding, etc.) into Solr.
Storing Entities in the Index
To make entities easily available to the search application for both search and document-based retrieval purposes, we need to store them in the Solr index. We suggest doing this by creating multi-valued fields in the Solr document, named <fieldname>_<entitytype>. For example, DESCRIPTION_LOCATION would store locations found in descriptions of hotels on a travel website. In your schema.xml file, define a field for each entity type you want to store, either explicitly or via dynamic fields <http://wiki.apache.org/solr/SchemaXml#Dynamic_fields>:
<dynamicField name="DESCRIPTION_*" type="string" indexed="true" stored="true" multiValued="true"/>
This method would accommodate fields DESCRIPTION_LOCATION, DESCRIPTION_ORGANIZATION, DESCRIPTION_PHONENUMBER, etc.—the exact set depends on the entity extractor. If you store the description in multiple languages, this may become more complex and include the language, e.g. DESCRIPTION_ENG_LOCATION. Searching for and displaying these new fields in the search application is then a matter of updating your query to retrieve them, and your UI to show them. Filling them in the first place is where we will focus our attention.
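The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the function names and the (entity type, text) tuple shape are assumptions made for this example, not the output format of any particular entity extractor, and dropping repeated mentions is a design choice you may not want.

```python
from collections import defaultdict


def entity_field_name(source_field, entity_type, language=None):
    """Build a field name such as DESCRIPTION_LOCATION or DESCRIPTION_ENG_LOCATION."""
    parts = [source_field]
    if language:
        parts.append(language)
    parts.append(entity_type)
    return "_".join(part.upper() for part in parts)


def entities_to_fields(source_field, entities, language=None):
    """Group (entity_type, text) pairs into multi-valued fields.

    `entities` stands in for whatever your entity extractor returns; the
    tuple shape is an assumption made for this sketch.
    """
    fields = defaultdict(list)
    for entity_type, text in entities:
        name = entity_field_name(source_field, entity_type, language)
        if text not in fields[name]:  # keep one value per distinct entity
            fields[name].append(text)
    return dict(fields)


if __name__ == "__main__":
    found = [("LOCATION", "Paris"), ("ORGANIZATION", "Hilton"), ("LOCATION", "Paris")]
    print(entities_to_fields("DESCRIPTION", found))
```

Each key in the returned dictionary matches the DESCRIPTION_* dynamic field pattern, and each list maps onto the values of one multi-valued field.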
Extracting Entities from a Solr Field
Unlike simple indexing, in this case we are performing analysis on one field (DESCRIPTION) and storing the results in new fields (DESCRIPTION_LOCATION). Because we are adding new fields to a document, we cannot use a Solr Analyzer; currently Analyzers do not have access to the Solr Document. Given this restriction, our options are:
(1) Use an UpdateRequestProcessor within Solr.
(2) Perform entity extraction as a precursor to the Solr indexing process.
As you read the details of these approaches, the factors to keep in mind are:
(1) Performance: How much data is going through the entity extractor? What will happen to your indexing speed? Index-time scalability?
(2) Breadth of usage: Do you intend to use the entities anywhere other than the search application itself? For example, would you like to also store the entities in a database?
(3) The usual: maintenance, level of effort, etc.
Entity Extraction with an UpdateRequestProcessor
Using an UpdateRequestProcessor involves changing the Solr configuration (solrconfig.xml) and implementing an UpdateRequestProcessor class which uses the entity extractor. The configuration in solrconfig.xml will look something like this, where EntityExtractorUpdateProcessorFactory extends UpdateRequestProcessorFactory <http://wiki.apache.org/solr/UpdateRequestProcessor>.
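<requestHandler name="/update">
  <lst name="defaults">
    <str name="update.chain">rex</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="rex">
  <!-- the package name below is a placeholder; use the package of your own factory class -->
  <processor class="com.example.EntityExtractorUpdateProcessorFactory">
    <str name="fields">DESCRIPTION</str>
    <!-- other parameters, as necessary -->
  </processor>
  <!-- other processors, as necessary; a chain normally ends with solr.RunUpdateProcessorFactory -->
</updateRequestProcessorChain>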
Within your implementation of the UpdateRequestProcessor, you have many options for distributing the work. A simple approach is to multi-thread the entire indexing process by feeding data into Solr from more than one thread, so that entity extraction runs on several documents at once. The indexing phase itself would still be single-threaded, but it is, in general, much faster than entity extraction. Alternatively, arrange to call out to entity extraction via a web service. The UpdateRequestProcessor has full control of the Solr document at this point, so you can decide how best to distribute the processing.
Plugging this step into Solr allows you to keep the implementation neat and compact. On the other hand, this approach precludes you from easily using the entities outside of the Solr index.
An alternative to implementing your own UpdateRequestProcessor is using the Rosette linguistics platform, which offers an entity extraction UpdateRequestProcessor ready to be plugged into Solr. It comes with source, so you still have the freedom to tweak it to suit your application.
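As a rough illustration of feeding data into Solr from more than one thread, here is a minimal client-side sketch in Python. The send_batch callable is hypothetical: it stands in for whatever code posts one batch of documents to your update handler and returns the number of documents sent. The worker count and batch size are arbitrary starting points, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def feed_in_parallel(docs, send_batch, workers=4, batch_size=100):
    """Send batches of documents concurrently and return how many were sent.

    `send_batch` is a hypothetical callable that posts one batch to Solr and
    returns the number of documents in it; it is injected so that the
    transport (HTTP client, client library, ...) stays up to you.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every batch and re-raises any worker exception
        counts = list(pool.map(send_batch, batches(docs, batch_size)))
    return sum(counts)


if __name__ == "__main__":
    documents = [{"id": str(i), "DESCRIPTION": "..."} for i in range(250)]
    # A stand-in sender that only counts documents instead of contacting Solr
    print(feed_in_parallel(documents, send_batch=len, workers=3))
```

Because each update request passes through the processor chain on the server, several requests in flight mean several documents going through entity extraction at the same time.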
Entity Extraction Outside of Solr
If you plan to use the entities outside of Solr, it is generally better to do entity extraction before creating the Solr document, so you can add the extracted entities to the document yourself. In this case, the Solr document would contain a pre-populated DESCRIPTION_LOCATION field, and all you need from Solr is to index its contents. With respect to performance, in this scenario you have full control over where entity extraction is executed, much like when you implement your own UpdateRequestProcessor. You also have more freedom, because you are running entity extraction in advance of Solr indexing. This lets you decide how much hardware to devote to entity extraction and to Solr indexing separately. In some cases, this method can save you some optimization effort, but it often has a higher configuration and setup overhead. Many of our customers have gone in this direction, but what is best for you depends on the shape and size of the rest of the application.
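To make the pre-populated document concrete, the following Python sketch enriches a document before it reaches Solr and serializes it as a Solr XML add message, in which a multi-valued field is written as repeated field elements. The extract_entities function is a hypothetical stand-in with a hard-coded lookup; a real pipeline would call your entity extractor there, and the same extracted values could also be written to a database or another system.

```python
import xml.etree.ElementTree as ET


def extract_entities(text):
    """Stand-in for a real entity extractor (hypothetical, for illustration)."""
    known = {"Paris": "LOCATION", "Hilton": "ORGANIZATION"}
    return [(etype, name) for name, etype in known.items() if name in text]


def enrich(doc, source_field="DESCRIPTION"):
    """Return a copy of `doc` with <source_field>_<ENTITYTYPE> fields added."""
    enriched = dict(doc)
    for etype, name in extract_entities(doc[source_field]):
        enriched.setdefault(f"{source_field}_{etype}", []).append(name)
    return enriched


def build_add_message(docs):
    """Serialize documents as a Solr XML <add> message.

    List values become repeated <field> elements, which is how multi-valued
    fields are expressed in this format.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    hotel = {"id": "1", "DESCRIPTION": "A Hilton hotel in the centre of Paris"}
    print(build_add_message([enrich(hotel)]))
```

The resulting message can be posted to the update handler as usual; Solr only has to index the fields it is given.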
The Rosette Entity Extractor Solution
One entity extractor that integrates easily into Solr is our Rosette® Entity Extractor, which can be called before indexing time through an API or web services, or used within an UpdateRequestProcessor via a plugin that integrates Rosette into a Solr installation without programming.
For more information about how Rosette works with Solr, see the presentation, “Integrating Advanced Text Analytics into Solr,” we gave at Lucene Revolution 2011.
Are you using entity extraction alongside Solr? Do you have ideas about how you might use it? Let us know in the comments below!