string-comparison fuzzy-logic fuzzy-comparison fuzzy

How to check if two unstructured street address strings are the same?


I need to compare two unstructured addresses and be able to identify if they are the same (or similar enough).

Scenario

  • Address is supplied by the end user in plain text.
  • There is nothing to help the user write in a more identifiable way (no autocomplete, nothing; just an empty textbox).
  • "#102 Nice-Looking Street, Gotham City, NY" should match with "Nice Loking St., Gotham City, New York, apt 102".
  • Using a third-party service is not an option.
  • Search is not a problem. I already have the two strings. What I need is to check whether they represent the same address, despite their differences in structure.

What I have found

I know we can use some fuzzy logic for this kind of comparison, with some tolerance for misspellings, but...

  • Some equivalent keywords (for instance, "Street" vs. "St.", "#102" vs. "apt 102", or "NY" vs. "New York") are not supposed to lower the match score.
  • Some words can appear in a different order (like the apartment number in the example above).
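
To make those two points concrete, here is a naive baseline sketched in Python with only the standard library. It expands a few known abbreviations, sorts the tokens so that word order stops mattering, and then fuzzy-compares the result. The SYNONYMS table and the function names are made up for illustration, and the table is nowhere near complete ("St" can also mean "Saint", for example).

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny synonym table (an assumption for this sketch).
SYNONYMS = {"st": "street", "#": "apt", "ny": "new york"}


def normalize(address):
    """Lowercase, tokenize, expand known abbreviations, and sort the tokens."""
    # Keep runs of letters/digits, plus "#" so that "#102" becomes "#", "102".
    tokens = re.findall(r"[a-z0-9]+|#", address.lower())
    expanded = []
    for token in tokens:
        # A synonym may expand to several words ("ny" -> "new", "york").
        expanded.extend(SYNONYMS.get(token, token).split())
    # Sorting makes the comparison insensitive to word order.
    return sorted(expanded)


def similarity(a, b):
    """Return a 0..1 score; typos cost a little, reordering costs nothing."""
    return SequenceMatcher(None, " ".join(normalize(a)), " ".join(normalize(b))).ratio()


a = "#102 Nice-Looking Street, Gotham City, NY"
b = "Nice Loking St., Gotham City, New York, apt 102"
c = "55 Elm Avenue, Metropolis, IL"

print(similarity(a, b))  # close to 1.0
print(similarity(a, c))  # noticeably lower
```

This ignores which part of the address a token belongs to, so a matching number in the wrong place (street number vs. apartment number) would still count as a match.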

I do not want to reinvent the wheel. This seems like a common problem in many contexts, and I think there is an algorithm (with some slight modifications, maybe) that might fit this scenario.

Thanks in advance


Solution

  • I've helped build some open source tools to do this.

    Basically, the approach is to try to split an address into its constituent parts and then intelligently compare those parts.

    Both parts of the problem are hard.

    The first part is often called address parsing. Here's what we use: https://github.com/datamade/usaddress

    The second part has many, many names, but let's call it fuzzy matching. Here's the library we made for that: https://github.com/datamade/dedupe

    We also provided some facilities for using them together: http://dedupe.readthedocs.io/en/latest/Variable-definition.html#address-type
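
As a rough illustration of the split-then-compare idea, here is a minimal Python sketch using only the standard library. It does not use the usaddress or dedupe APIs: toy_parse is a hypothetical, very crude stand-in for a real address parser, and the ABBREVIATIONS table and fixed WEIGHTS are assumptions chosen only for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical helpers: a real parser and real abbreviation data would replace these.
UNIT_RE = re.compile(r"(?:#|\bapt\.?|\bunit)\s*(\d+)", re.IGNORECASE)
ABBREVIATIONS = {"st": "street", "ny": "new york"}
WEIGHTS = {"unit": 0.2, "street": 0.4, "city": 0.2, "state": 0.2}  # assumed weights


def clean(text):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def toy_parse(address):
    """Crude parser: pull out the unit number, then split the rest on commas."""
    match = UNIT_RE.search(address)
    unit = match.group(1) if match else ""
    rest = UNIT_RE.sub("", address)
    parts = [clean(part) for part in rest.split(",") if part.strip()]
    parts += [""] * (3 - len(parts))  # pad so unpacking below always works
    street, city, state = parts[:3]
    return {"unit": unit, "street": street, "city": city, "state": state}


def field_similarity(a, b):
    """Fuzzy-compare one component; two missing values count as a match."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def address_similarity(a, b):
    """Weighted average of per-component similarities, in the range 0..1."""
    pa, pb = toy_parse(a), toy_parse(b)
    return sum(w * field_similarity(pa[f], pb[f]) for f, w in WEIGHTS.items())


a = "#102 Nice-Looking Street, Gotham City, NY"
b = "Nice Loking St., Gotham City, New York, apt 102"
print(toy_parse(a))
print(toy_parse(b))
print(address_similarity(a, b))  # close to 1.0
```

Comparing component by component means the apartment number is only ever compared with the other apartment number, regardless of where it appears in the string. The toy parser assumes a "street, city, state" comma order, which real user input will often break; that is the hard part a trained parser is meant to handle.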