I have a table that looks like this in my postgresql database
How can I bring back a cluster of contacts where each contact in the cluster shares either the contact_id_a or contact_id_b value (or both) with another contact in the cluster?
In the example in the screenshot image above, rows 1-6 would be in the same cluster and row 8 would belong to no cluster.
How can this be achieved using either a SQL query or a SQL query in combination with Java code?
For context, this table lists all potential duplicate contacts in a list of contacts. We want to present to the list owner all of the contacts that are potential duplicates so that the user can manually manage these duplicates.
Here is my starting code:
DuplicateCandidate firstDuplicate = db.sql("select * from duplicates where list_id = "+list_id+ " and ignore_duplicate is not true").first(DuplicateCandidate);
String sql = "select * from duplicates where list_id = "+list_id+ "and ignore_duplicate is not true "
+ "and (contact_id_a = ? or contact_id_b = ? or contact_id_a = ? or contact_id_b = ?";
List<DuplicateCandidate> groupOfDuplicates = db.sql(sql, firstDuplicate.contact_id_a,firstDuplicate.contact_id_a, firstDuplicate.contact_id_b, firstDuplicate.contact_id_b).results(DuplicateCandidate.class);
This will bring back the first row and any other rows containing 16247096 or 16247097, but not other essential rows matching the contact_ids from the second query's results.
Cheers.
Clustering like this is an iterative process with an unknown number of steps. I have never found a solution that can be done within a recursive query.
I have not worked on CRM in over six years, but the following function is similar to how we used to generate match groups. Doing this row-by-row did not perform well enough for our workload, and accomplishing this via host language using e.g. Java HashMap()
and HashSet()
and inverted indexing creates very messy code.
Assuming this schema:
\d contact_info
Table "public.contact_info"
Column | Type | Collation | Nullable | Default
------------------+---------+-----------+----------+---------
contact_id_a | bigint | | |
contact_id_b | bigint | | |
ignore_duplicate | boolean | | | false
list_id | integer | | | 496
select * from contact_info ;
contact_id_a | contact_id_b | ignore_duplicate | list_id
--------------+--------------+------------------+---------
16247096 | 16247097 | f | 496
16247096 | 16247098 | f | 496
16247096 | 16247099 | f | 496
16247097 | 16247098 | f | 496
16247097 | 16247099 | f | 496
16247098 | 16247099 | f | 496
16247094 | 16247095 | f | 496
(7 rows)
This function creates two temp tables to hold intermediate clusters and then returns the result once there is no more clustering possible.
create or replace function cluster_contact()
returns table (clust_id bigint, contact_id bigint)
language plpgsql as $$
declare
last_count bigint := 1;
this_count bigint := 0;
begin
create temp table contact_match (clust_id bigint, contact_id bigint) on commit drop;
create index cm_1 on contact_match (contact_id, clust_id);
create index cm_2 on contact_match using hash (clust_id);
create temp table contact_hold (clust_id bigint, contact_id bigint) on commit drop;
with dedup as (
select distinct least(ci.contact_id_a) as clust_id,
greatest(ci.contact_id_b) as contact_id
from contact_info ci
where not ci.ignore_duplicate
)
insert into contact_match
select d.clust_id, d.clust_id from dedup d
union
select d.clust_id, d.contact_id from dedup d;
while last_count > this_count loop
if this_count = 0 then
select count(distinct cm.clust_id) into last_count from contact_match cm;
else
last_count := this_count;
end if;
with new_cid as (
select cm.contact_id as clust_id_old,
min(cm.clust_id) as clust_id_new
from contact_match cm
group by cm.contact_id
)
update contact_match
set clust_id = nc.clust_id_new
from new_cid nc
where contact_match.clust_id = nc.clust_id_old;
truncate table contact_hold;
insert into contact_hold
select distinct * from contact_match;
truncate table contact_match;
insert into contact_match
select * from contact_hold;
select count(distinct cm.clust_id) into this_count from contact_match cm;
end loop;
return query select * from contact_match order by clust_id, contact_id;
end $$;
One of the biggest mental blocks I have seen developers face is neglecting to include the relationship of a contact_id
to itself. This leads to disjoint handling and a mental model needlessly complicated by a left-side and a right-side.
select * from cluster_contact();
clust_id | contact_id
----------+------------
16247094 | 16247094
16247094 | 16247095
16247096 | 16247096
16247096 | 16247097
16247096 | 16247098
16247096 | 16247099
(6 rows)
Please comment if you need clarification on any of the steps in this solution or if it does not work for you.
Also, know that Levenshtein is available in fuzzystrmatch
, and it works well.
If you would rather have sequential clust_id
starting at 1
, change your return query
in the function to this:
return query
select dense_rank() over (order by cm.clust_id) as clust_id,
cm.contact_id
from contact_match cm
order by clust_id, contact_id;
It will yield:
select * from cluster_contact();
clust_id | contact_id
----------+------------
1 | 16247094
1 | 16247095
2 | 16247096
2 | 16247097
2 | 16247098
2 | 16247099
(6 rows)