Search code examples
javasqlpostgresqlcluster-analysis

How to cluster rows in a postgresql table that match an input value or match a value from any of the other matching rows?


I have a table that looks like this in my postgresql database

enter image description here

How can I bring back a cluster of contacts where each contact in the cluster shares either the contact_id_a or contact_id_b value (or both) with another contact in the cluster?

In the example in the screenshot image above, rows 1-6 would be in the same cluster and row 8 would belong to no cluster.

How can this be achieved using either a SQL query or a SQL query in combination with Java code?

For context, this table lists all potential duplicate contacts in a list of contacts. We want to present to the list owner all of the contacts that are potential duplicates so that the user can manually manage these duplicates.

Here is my starting code:

DuplicateCandidate firstDuplicate = db.sql("select * from duplicates where list_id = "+list_id+ " and ignore_duplicate is not true").first(DuplicateCandidate);
        String sql = "select * from duplicates where list_id = "+list_id+ "and ignore_duplicate is not true "
                + "and (contact_id_a = ? or contact_id_b = ? or contact_id_a = ? or contact_id_b = ?";
        List<DuplicateCandidate> groupOfDuplicates  = db.sql(sql, firstDuplicate.contact_id_a,firstDuplicate.contact_id_a, firstDuplicate.contact_id_b, firstDuplicate.contact_id_b).results(DuplicateCandidate.class);

This will bring back the first row and any other rows containing 16247096 or 16247097, but not other essential rows matching the contact_ids from the second query's results.

Cheers.


Solution

  • Clustering like this is an iterative process with an unknown number of steps. I have never found a solution that can be done within a recursive query.

    I have not worked on CRM in over six years, but the following function is similar to how we used to generate match groups. Doing this row-by-row did not perform well enough for our workload, and accomplishing this via host language using e.g. Java HashMap() and HashSet() and inverted indexing creates very messy code.

    Assuming this schema:

    \d contact_info 
                     Table "public.contact_info"
          Column      |  Type   | Collation | Nullable | Default 
    ------------------+---------+-----------+----------+---------
     contact_id_a     | bigint  |           |          | 
     contact_id_b     | bigint  |           |          | 
     ignore_duplicate | boolean |           |          | false
     list_id          | integer |           |          | 496
    
    select * from contact_info ;
     contact_id_a | contact_id_b | ignore_duplicate | list_id 
    --------------+--------------+------------------+---------
         16247096 |     16247097 | f                |     496
         16247096 |     16247098 | f                |     496
         16247096 |     16247099 | f                |     496
         16247097 |     16247098 | f                |     496
         16247097 |     16247099 | f                |     496
         16247098 |     16247099 | f                |     496
         16247094 |     16247095 | f                |     496
    (7 rows)
    
    

    This function creates two temp tables to hold intermediate clusters and then returns the result once there is no more clustering possible.

    create or replace function cluster_contact() 
      returns table (clust_id bigint, contact_id bigint) 
      language plpgsql as $$
    declare 
      last_count bigint := 1;
      this_count bigint := 0;
    begin
      create temp table contact_match (clust_id bigint, contact_id bigint) on commit drop;
      create index cm_1 on contact_match (contact_id, clust_id);
      create index cm_2 on contact_match using hash (clust_id);
      create temp table contact_hold (clust_id bigint, contact_id bigint) on commit drop;
    
      with dedup as (
        select distinct least(ci.contact_id_a) as clust_id,
               greatest(ci.contact_id_b) as contact_id
          from contact_info ci
         where not ci.ignore_duplicate
      )
      insert into contact_match
        select d.clust_id, d.clust_id from dedup d
        union
        select d.clust_id, d.contact_id from dedup d;
    
      while last_count > this_count loop
    
        if this_count = 0 then 
          select count(distinct cm.clust_id) into last_count from contact_match cm;
        else 
          last_count := this_count;
        end if;
    
        with new_cid as (
          select cm.contact_id as clust_id_old,
                 min(cm.clust_id) as clust_id_new
            from contact_match cm
           group by cm.contact_id
        )
        update contact_match
           set clust_id = nc.clust_id_new
          from new_cid nc
         where contact_match.clust_id = nc.clust_id_old;
    
        truncate table contact_hold;
        insert into contact_hold 
          select distinct * from contact_match;
     
        truncate table contact_match;
        insert into contact_match
          select * from contact_hold;
    
        select count(distinct cm.clust_id) into this_count from contact_match cm;
    
      end loop;
    
      return query select * from contact_match order by clust_id, contact_id;
    end $$;
    

    One of the biggest mental blocks I have seen developers face is neglecting to include the relationship of a contact_id to itself. This leads to disjoint handling and a mental model needlessly complicated by a left-side and a right-side.

    select * from cluster_contact();
     clust_id | contact_id 
    ----------+------------
     16247094 |   16247094
     16247094 |   16247095
     16247096 |   16247096
     16247096 |   16247097
     16247096 |   16247098
     16247096 |   16247099
    (6 rows)
    

    Please comment if you need clarification on any of the steps in this solution or if it does not work for you.

    Also, know that Levenshtein is available in fuzzystrmatch, and it works well.

    If you would rather have sequential clust_id starting at 1, change your return query in the function to this:

      return query 
        select dense_rank() over (order by cm.clust_id) as clust_id, 
               cm.contact_id 
          from contact_match cm 
         order by clust_id, contact_id;
    

    It will yield:

    select * from cluster_contact();
     clust_id | contact_id 
    ----------+------------
            1 |   16247094
            1 |   16247095
            2 |   16247096
            2 |   16247097
            2 |   16247098
            2 |   16247099
    (6 rows)