Special delivery: A BasisTech address parser
Have you ever had to enter an address online and been annoyed that you had to manually fill out multiple separate fields? Why can’t the computer figure it out? Actually, it can. An address parser, also known as an address fielder, is a type of software that uses an address string to automatically populate address fields with the appropriate information. In the world of natural language processing, address parsers are a vital component of address matching. There are many on the market today, but at the BasisTech Hackathon 2022, Team 4 asked the question: “Why not do it ourselves?” The benefits would be many: all support and documentation would be in-house, it would be fully customizable, and improvements could be made on our own schedule.
To get a sense of the scope of this challenge, ask yourself: how many types of information are in an address? Take our address here at BasisTech: 1060 Broadway, Somerville, MA 02144-2078. It has a building number, street name, city, state, and ZIP code. But an address doesn’t always look the same. Sometimes the state will be spelled out as Massachusetts instead of abbreviated as MA, and the ZIP code might not include the +4 extension. Another important consideration is how address structure varies across languages. This matters especially for BasisTech, where we aim to match addresses well beyond the U.S. For example, the Moscow address Москва, Краснопресненская набережная, 2 puts the city first, followed by the street name, followed by the house number. Despite these variations, it’s generally easy to teach humans which segment corresponds to which type of information. Software, however, has to be taught that mapping explicitly, and that’s where address parsing comes into play.
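To make the variation concrete, here is a minimal Python sketch of the two example addresses written as labeled token sequences. The tokenization and the label names (house_number, street, city, state, postcode) are assumptions made for illustration, not the labels RAPTuRE actually uses.

```python
# Each address is a list of (token, label) pairs.
# Label names are hypothetical, chosen only for this illustration.
us_address = [
    ("1060", "house_number"),
    ("Broadway", "street"),
    ("Somerville", "city"),
    ("MA", "state"),
    ("02144-2078", "postcode"),
]

ru_address = [
    ("Москва", "city"),
    ("Краснопресненская", "street"),
    ("набережная", "street"),
    ("2", "house_number"),
]


def field_order(labeled_tokens):
    """Return the labels in order of first appearance, without repeats."""
    order = []
    for _, label in labeled_tokens:
        if label not in order:
            order.append(label)
    return order


us_order = field_order(us_address)
ru_order = field_order(ru_address)
```

Both sequences draw on the same kinds of labels, but the order differs: the U.S. address starts with the house number, while the Moscow address ends with it.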
Team 4, the BasisTech Postal Service, set themselves the goal of creating a tool that would use the context around each address field (such as house number or street name) to help determine that field’s type. To accomplish this, the team decided to use a linear chain conditional random field (CRF) model. This is a machine learning model for sequence labeling: when assigning a label to a sample, it considers adjacent samples as part of the decision-making process. When applied to address parsing, this is like telling the model that a state field is more likely to be adjacent to a city field than to a house number field. Team 4’s model, playfully dubbed RAPTuRE (Rosette Address Parser That usually Runs Exquisitely), had a window size of five. In CRF terms, that means that for each field, the model considered that field, two fields to the left, and two fields to the right.
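A window of five can be sketched in a few lines of Python. This is not the team’s code: the function names, the feature names, and the choice to treat each whitespace-separated token as one unit are all assumptions made for illustration. The sketch only builds the per-token feature dictionaries that a CRF library would typically take as input; it does not train a model.

```python
def token_features(tokens, i, window=5):
    """Build features for tokens[i] from a window centered on it.

    With window=5, this looks at the token itself, two to the left,
    and two to the right. Feature names are hypothetical.
    """
    half = window // 2
    features = {}
    for offset in range(-half, half + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            features[f"{offset}:lower"] = tok.lower()
            features[f"{offset}:is_digit"] = tok.isdigit()
        else:
            # Position falls off the edge of the address.
            features[f"{offset}:pad"] = True
    return features


def sequence_features(tokens, window=5):
    """Return one feature dict per token in the address."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


tokens = ["1060", "Broadway", "Somerville", "MA", "02144-2078"]
feats = sequence_features(tokens)
```

For the middle token, Somerville, the features reach from 1060 on the left to 02144-2078 on the right, so the model sees the whole address when labeling it. For the first token, the two left positions are marked as padding.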
At the end of the hackathon, BasisTech Postal Service had consistent performance data indicating that the proof of concept was a success. Particularly impressive was RAPTuRE’s efficiency, measured in tags per second. That’s likely because a CRF model is typically less computationally expensive than, say, a neural network. The team also got promising results from incorporating word vectors into the model, and it worked with both English and Russian addresses. On presentation day, the BasisTech Postal Service (clad in matching custom hats) walked away with the Best in Show award.
You can rest assured that’s not the end of RAPTuRE’s story. Like the other Hackathon projects, it was built over only two and a half days, and the team has more ideas for where it could go. The obvious place to start is with training data. The success of any model depends on the quality and quantity of the data it is trained on. BasisTech Postal Service got a good start using publicly available annotated training data, but with more, better, and cleaner training data, the address parser will improve. The team is also interested in the possibility of using a different model. In particular, RAPTuRE and BasisTech’s own Rosette Entity Extractor (REX) could be a match made in heaven. Since BasisTech created REX, our engineers already know it inside and out. We also know how powerful it is (very!), not to mention the fact that REX already does context-sensitive labeling.
Don’t be surprised if you hear more about this address parser in the future; RAPTuRE is a truly useful tool that BasisTech is considering adopting in an official capacity. You can always expect fun at the BasisTech Hackathon, but projects like this show that work and play aren’t always mutually exclusive. Congratulations to the BasisTech Postal Service, and thanks for reminding us that it pays to think outside the (mail)box!