Search code examples
databasestringpostgresqllevenshtein-distancejsonb

Similarity on many strings in database


What could be the best way to check if two objects with many properties are similar?

Lets say I have an object - address, which has 10 fields, like: location1, location2, location3, location4, ..., postalCode, owner, habitants..

They are all stored in postgres data base as jsonb types.

When new object comes in I need to check is there any similar address.

What are the most common techniques used in this kind of cases?

One idea is to concatenate all properties and check levenshtein distance.

I am not tied to any specific technology right now, requirements are that these objects can be a lot and they must be stored somewhere.


Solution

  • JSON and JSONB types imply data with elements tagged with different meaning. This generally means that these different elements can't be usefully treated all the same way, and this further implies that a one-size-fits-all approach probably isn't going to get good results.

    As you mentioned, the Levenshtein distance is a possible approach, but most of the time it'd have to be weighted in some way that's customized to your particular data, and even that probably won't be enough for most real data sets.

    For example, consider something like a basic address. Matching street numbers by themselves are meaningless. Ditto for matching street names. Really all the elements are dependent, and it's only if one starts with matching countries and works down through state/province etc. that "similarity" has true meaning. Simple weights aren't going to be able to capture this type of relationship.

    The solution is to use a stored procedure that determines similarity between rows in a given table. While PL/pgSQL can be used for this (and will work really well for simple tables) when things get complicated it's probably worthwhile digging into something like PL/Python. How efficient these stored procedures will be will vary wildly with how they are written, of course, but with a little care they can perform quite well even when used within large databases.

    For example (and there's not enough information in your question to make something that will straight out work here, so please treat this as something somewhat better than pseudocode but not as thoroughly tested PL/Python):

    CREATE OR REPLACE FUNCTION compare_json_addresses(addr1 JSON, addr2 JSON)
    RETURNS INTEGER AS
    $$
    BEGIN
      import simplejson as json
      a1, a2 = json.loads(addr1), json.loads(addr2)
      similarity = 0
      for unit in ('country', 'state', 'town', 'street', 'num'):
          if a1[unit] != a2[unit]:
              break
          else:
              similarity += 1
      return similarity
    END;
    $$
    LANGUAGE plpythonu STRICT IMMUTABLE;
    

    Obviously you'll have to modify this to also take into account the various additional location fields you're using, and figure out how you want them to relate.