Fuzzy Name Search and Name Matching Presentations in San Francisco

14 Apr 2015

Names connect data points and are frequently the most important piece of information in a document. But unlike common nouns and verbs, they defy standardization, making them an elusive search target.

In just two days, though, you can go from “neophyte” to “well-informed” in the realm of fuzzy name searching and matching.

David Murgatroyd

Basis Technology’s VP of Engineering, David Murgatroyd (@dmurga), talks about best practices for implementing fuzzy name search in Apache Solr™ on Wednesday, April 22, at the Apache Lucene/Solr Meetup in San Francisco. Then on Friday, April 24, at Text By the Bay in San Francisco, David speaks about name matching in the sharing economy, using Airbnb as a specific case. Trust is the cornerstone of that economy, in which individuals transact with individuals for rides, places to stay, and more.

We hope to see you there!

Simple Fuzzy Name Matching in Solr

6 pm, Wednesday, April 22

Apache Lucene/Solr Meetup in San Francisco

Lucidworks (340 Brannan Street, 4th floor, San Francisco, CA)

We all know normalization is crucial to delivering high-quality search results. We don’t want uninteresting variations between the query and the document to lead to missed hits (e.g., “celebrity” vs. “celebrities”). Normalization of dictionary words is well understood, but what if your application focuses on names? Whether you’re tackling patent examination, sports records, e-commerce, watchlist screening, or many other topics, names are often the key. Can your users find “Abdul Jabbar, Karim” if they search for “Kareem Abdal Jabar” or “كريم عبد الجبار”? Solr application architects have attempted to address this through custom integration of nickname lists, edit distance, case normalization, phonetic encoding, and n-grams (see example #1 or example #2), but doing so requires significant effort and may not address all desired variations. A simpler approach is to use a Solr field type for names that handles these linguistic nuances behind the scenes. We’ll talk about how we built this sort of field type via a Solr plug-in for the Rosette Name Indexer. We’ll also discuss examples of use cases this has enabled, how it can be tuned if necessary, and how it connects to the broader trend of entity-centric search.
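To make the hand-rolled approach concrete, here is a minimal Python sketch of three of the techniques listed above: case normalization, edit distance, and character n-grams. It is only an illustration under stated assumptions. The function names are hypothetical, and it does not represent how the Rosette Name Indexer or its Solr plug-in works.

```python
import unicodedata


def normalize(name: str) -> str:
    """Case-fold, strip diacritics and punctuation, and sort the tokens."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    kept = "".join(
        ch if ch.isalnum() else " "
        for ch in decomposed
        if not unicodedata.combining(ch)
    )
    # Sorting tokens makes "Abdul Jabbar, Karim" and "Karim Abdul Jabbar" equal.
    return " ".join(sorted(kept.split()))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to the range 0.0 (unrelated) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the character bigram sets of two strings."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


if __name__ == "__main__":
    indexed = normalize("Abdul Jabbar, Karim")
    for query in ("Kareem Abdal Jabar", "كريم عبد الجبار"):
        q = normalize(query)
        print(query, edit_similarity(indexed, q), bigram_jaccard(indexed, q))
```

On the Latin-script pair from the example above, both scores come out fairly high. The Arabic-script query, however, shares almost no characters with the Latin name, so this kind of surface comparison scores it near zero. That is one of the variations a custom integration built from these pieces tends to miss.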

Identity Resolution in the Sharing Economy

Friday, April 24

Text By the Bay, San Francisco, CA

A growing sharing economy demands new, cost-effective ways of establishing and checking identity so that services and participants can accurately assess risks and make good choices.

For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. The scan is checked against templates that examine the layout and other government indicators of authenticity to help confirm that the document appears to be valid. Crucially, verification also involves checking an applicant’s entered name, often in Latin script, against the name on the scanned document, which may be in another script or language and subject to potentially egregious OCR errors.
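As a rough illustration of the OCR part of that check, the Python sketch below folds a few common character confusions before comparing the entered name with the scanned one. The confusion table, the function names, and the 0.85 threshold are all assumptions made for this example. It does not describe Airbnb’s actual process, and it does not handle names in different scripts.

```python
from difflib import SequenceMatcher

# Hypothetical table of OCR confusions (illustrative, far from exhaustive).
OCR_CONFUSIONS = {"0": "o", "1": "l", "5": "s", "rn": "m", "vv": "w"}


def fold_ocr(text: str) -> str:
    """Case-fold the text and map confusable character runs to one form."""
    folded = text.casefold()
    for confused, canonical in OCR_CONFUSIONS.items():
        folded = folded.replace(confused, canonical)
    return folded


def names_agree(entered: str, scanned: str, threshold: float = 0.85) -> bool:
    """Return True when the folded names are similar enough (assumed cutoff)."""
    # Both sides are folded the same way, so "rn" -> "m" cannot cause a mismatch.
    ratio = SequenceMatcher(None, fold_ocr(entered), fold_ocr(scanned)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(names_agree("Tom Miller", "T0rn Mi1ler"))
    print(names_agree("Tom Miller", "Jane Smith"))
```

Folding both names through the same table keeps the comparison symmetric. A production system would more likely need script-aware matching than a fixed substitution list.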

More generally, connecting the public and private traces that people, organizations, and things such as vehicles leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.

In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and walk through additional examples.