An Elegant and Efficient Way to Fuzzy Search Names in Elasticsearch

Introducing the Rosette name matching Plug-In for Elasticsearch

Brian Sawyer, chief developer of the plug-in, presents this plug-in at the Boston Elasticsearch meetup.
Slides are available here.

Elasticsearch developers who want to fuzzy search names across multiple fields and cover the spectrum of name variations (sometimes two or more in a single name) know how much of a bear it can be. Until now, no solution has been completely satisfactory, comprehensive, or clean, but that’s all about to change.

The Rosette name matching plug-in for Elasticsearch solves the fuzzy name matching issue.

Fuzzy Problem in Elasticsearch

Currently, Elasticsearch can be configured to provide some fuzziness by mixing its built-in edit-distance matching (see note 1) and phonetic analysis with more generic analyzers and filters. However, this approach requires a complex query against multiple fields, and recall is determined entirely by phonetic similarity (Soundex/Metaphone) and Lucene edit distance. As for precision, it is difficult to guarantee that the best results will rank at the top, and Lucene document scores are notoriously unreliable as a basis for a threshold. No other types of variation (e.g., swapped name order) are taken into account when calculating similarity.

Best practice when using Elasticsearch “out of the box” is a multi_field type with a separate field for each type of variation:

"mappings": { ... "type": "multi_field", "fields": {
    "pty_surename": { "type": "string", "analyzer": "simple" },
    "metaphone": { "type": "string", "analyzer": "metaphone" },
    "porter": { "type": "string", "analyzer": "porter" } ...

This approach has trouble handling more than one type of spelling variation in a name.
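For context, here is a minimal Python sketch of the kind of query body such a mapping calls for: one clause per sub-field, plus an edit-distance clause on the plain field. The sub-field paths (pty_surename.metaphone, pty_surename.porter), the helper name build_fuzzy_name_query, and the choice of a bool/should query are assumptions made for illustration; the mapping above is elided, so the real paths may differ.

```python
import json


def build_fuzzy_name_query(name, base_field="pty_surename"):
    """Build a bool/should query with one clause per variation field.

    The field paths are assumed from the elided mapping, not confirmed.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Edit-distance matching on the plain field
                    {"match": {base_field: {"query": name, "fuzziness": "AUTO"}}},
                    # Phonetic matching through the metaphone analyzer
                    {"match": {base_field + ".metaphone": name}},
                    # Stemmed matching through the porter analyzer
                    {"match": {base_field + ".porter": name}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


body = build_fuzzy_name_query("Lopez")
print(json.dumps(body, indent=2))
```

Each clause contributes its own Lucene score, which is part of why the combined score is hard to threshold.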

“Jesus Alfonso Lopez Diaz” v. “LobEzDiaS, Chuy”

The above example has a name variation involving reordered name components, a missing initial, two spelling differences, a nickname for the first name and a missing space.
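To see why edit distance alone cannot bridge this pair, the following Python sketch computes a plain Levenshtein distance between the two names after lowercasing, stripping punctuation, and sorting the tokens. The normalize helper is a hypothetical pre-processing step added for illustration, not something Elasticsearch or Rosette is claimed to do. Elasticsearch's fuzziness setting allows at most 2 edits per term, so a distance well above that means the pair is out of reach for a fuzzy match.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def normalize(name):
    """Lowercase, drop punctuation, and sort tokens to neutralize word order."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


indexed = normalize("Jesus Alfonso Lopez Diaz")
query = normalize("LobEzDiaS, Chuy")
distance = levenshtein(indexed, query)
print(indexed, "|", query, "|", distance)
```

Even with word order neutralized, the normalized strings differ in length by ten characters, so the distance cannot be lower than ten.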

And for developers who want a name field type that also:

  • Contributes a score specific to name matching phenomena.
  • Is part of queries using many field types.
  • Has multiple fields per document.
  • Has multiple values per field.

…the Rosette Name Indexer has your back.

How Rosette Works Inside Elasticsearch

The Rosette plugin contains a custom mapper which does all the work behind the scenes:

PUT /test/test/_mapping
{
    "test" : {
        "properties" : {
            "name" : { "type" : "rni_name" },
            "aka" : { "type" : "rni_name" }
        }
    }
}

At Index Time:
The Rosette plugin indexes keys for different phenomena (types of name variations) in separate sub-fields.
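The article does not say which keys Rosette actually generates. Purely to illustrate the idea of storing one key per phenomenon in its own sub-field, the Python sketch below derives three hypothetical keys from a name. The key names (sorted_tokens, no_spaces, initials) and the name_keys helper are invented for this example and are not the plug-in's real sub-fields.

```python
def name_keys(name):
    """Return one illustrative key per variation type (all hypothetical)."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in name.lower())
    tokens = cleaned.split()
    return {
        # Order-insensitive: "Lopez Diaz, Jesus" and "Jesus Lopez Diaz" share a key
        "sorted_tokens": " ".join(sorted(tokens)),
        # Space-insensitive: "Lopez Diaz" and "LopezDiaz" share a key
        "no_spaces": "".join(tokens),
        # Initials only, sorted so that component order does not matter
        "initials": "".join(sorted(t[0] for t in tokens)),
    }


# Each key would be indexed in its own sub-field of the name field.
sub_fields = {"name." + key: value
              for key, value in name_keys("Jesus Lopez Diaz").items()}
print(sub_fields)
```

A query name run through the same function yields analogous keys, so a match on any one sub-field is enough to surface a candidate.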

[Figure: RNI plugin for Elastic, index time]

At Query Time:
The Rosette plugin generates analogous keys for a custom Lucene query that finds good candidates for re-ranking:

POST /test/test/_search
{
  "query" : {
    "match" : {
      "name" : "LobEzDiaS, Chuy"
    }
  },
  "rescore" : {
    "window_size" : 200,
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "name_score" : {
            "field" : "name",
            "query_name" : "LobEzDiaS, Chuy"
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}

The ‘name_score’ function matches the query name against the indexed name in every candidate document and returns the similarity score.

[Figure: RNI plugin for Elastic, implementation]

Several rescoring parameters can be tweaked to give the system more speed or more accuracy. The fewer documents selected for re-ranking, the faster the query will be, with the trade-off that potentially good matches further down the list might never be rescored and so never move up.

  • window_size – specifies how many documents from the base query should be rescored
  • minScoreToCheck – (added by Basis Technology) sets the score threshold that the top document must meet to be rescored
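The window_size trade-off can be sketched with a simplified model in Python. This is not Elasticsearch's actual rescoring code: the rescore_top_window helper is hypothetical, and difflib's SequenceMatcher ratio stands in for the plug-in's name_score function, whose algorithm the article does not describe.

```python
from difflib import SequenceMatcher


def toy_name_score(query_name, indexed_name):
    """Stand-in similarity in [0, 1]; NOT the Rosette name_score algorithm."""
    return SequenceMatcher(None, query_name.lower(), indexed_name.lower()).ratio()


def rescore_top_window(hits, window_size, query_name):
    """Rescore only the first window_size hits; leave the rest untouched.

    hits is a list of (indexed_name, base_score), sorted by base_score descending.
    Rescored entries carry the toy score, the rest keep their base score.
    """
    window, rest = hits[:window_size], hits[window_size:]
    rescored = sorted(
        ((name, toy_name_score(query_name, name)) for name, _ in window),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return rescored + rest


hits = [("Lopes Dias", 3.0), ("Lopez Dean", 2.0), ("Lopez Diaz", 1.0)]
narrow = rescore_top_window(hits, window_size=2, query_name="Lopez Diaz")
wide = rescore_top_window(hits, window_size=3, query_name="Lopez Diaz")
print(narrow)
print(wide)
```

With a window of 2, the exact match sits outside the window and stays in third place; widening the window to 3 lets it be rescored and move to the top, at the cost of one more similarity computation.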

Within the entire query, the developer can decide how much weight to give to the name match vs. the overall query match.

  • rescore_query – Calls the name_score function to get a score, and then combines rescore_queries to query across multiple fields
  • query_weight – Controls how much weight is given to the main query, and allows the user to include queries on other non-name fields
  • rescore_query_weight – Specifies the weight to give to the rescored query
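How the two weights interact can be sketched in Python. The combined_score function assumes Elasticsearch's default rescore score_mode ("total"), in which the weighted original score and the weighted rescore score are added. The build_name_rescore helper is a hypothetical convenience that mirrors the rescore request shown earlier; name_score is the function the article attributes to the plug-in.

```python
def combined_score(original_score, rescore_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """Final score under the default 'total' score_mode (assumed here)."""
    return query_weight * original_score + rescore_query_weight * rescore_score


def build_name_rescore(field, query_name, window_size=200,
                       query_weight=0.0, rescore_query_weight=1.0):
    """Build the 'rescore' section of a search body (hypothetical helper)."""
    return {
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "function_score": {
                    "name_score": {"field": field, "query_name": query_name}
                }
            },
            "query_weight": query_weight,
            "rescore_query_weight": rescore_query_weight,
        },
    }


rescore = build_name_rescore("name", "LobEzDiaS, Chuy")
# With the weights used in the article, only the name score survives.
name_only = combined_score(7.3, 0.92, query_weight=0.0, rescore_query_weight=1.0)
print(rescore)
print(name_only)
```

Setting query_weight to 0.0, as the example request does, discards the Lucene score entirely; a non-zero value blends in matches on other, non-name fields.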

In a nutshell

The Rosette name matching plug-in adds:

Custom field type mapping:

  • Splits a single field into multiple fields covering different phenomena
  • Supports multiple name fields in a document
  • Intercepts the query to inject a custom Lucene query

Custom rescore function:

  • Rescores documents with an algorithm specific to name matching
  • Limits expensive calculations to only top candidate documents
  • Is highly configurable

The Rosette name matching plug-in was built and tested with Elasticsearch 2.3.4.


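1 Edit distance refers to the number of changes it takes to get from one spelling to another. Thus “ax” to “axe” has an edit distance of 1 (add “e”). Going from “axs” to “axe” takes 2 edits if only insertions and deletions are counted (remove “s”, add “e”), but 1 under the Levenshtein distance that Lucene’s fuzzy matching uses, where replacing one character with another counts as a single edit.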