I have two tables, already populated with data as follows:
table1 data are citizens' data, with fields:
uniqid (a unique ID per person, like a social security number)
first_name
last_name
birthdate
address
table2 with fields:
first_name
last_name
birthdate
address
gender
healthinsurance_type
...
table1 data and table2 data come from different and separate agencies. Names in those tables could be typed differently, and so on. table1 is authoritative for names and ID. table2 is what I need to work on, but it has no ID column (citizenid).
Now I need each row in table2 to get a citizenid, associated from table1, so that in the end I get table2 with an additional ID column, correctly associated with each person.
Something like: search table1 for a person (a row in table2) where some conditions match, and if a record exists in table1, put the ID of that record into a column in table2. But do it fast and for all rows.
The row count of table1 is around 2 million; the row count of table2 is around 900,000.
I am assuming you are the only one writing to these tables, so no concurrency conflicts.
Add the ID column to table2; it can be NULL for now:
ALTER TABLE table2 ADD COLUMN citizenid int; -- or whatever the type is
Consider an additional flag on table1 to take rows "off the market" cheaply on the other side, too:
ALTER TABLE table1 ADD COLUMN hasmatch boolean;
This assumes there are no duplicates on either side; else you need to do more.
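To verify that assumption, you can check for duplicates on the matching columns first. A minimal sketch, using the same column names as above (run the equivalent query against table2, too):
SELECT first_name, last_name, birthdate, address, count(*)
FROM   table1
GROUP  BY first_name, last_name, birthdate, address
HAVING count(*) > 1;  -- should return no rows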
Update all rows in table2 with a perfect, complete match, and flag the matching rows in table1 right away, too. With a data-modifying CTE:
WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL   -- always include this condition
   AND    t2.citizenid IS NULL  -- always include this condition
   AND    t2.first_name = t1.first_name
   AND    t2.last_name  = t1.last_name
   AND    t2.birthdate  = t1.birthdate
   AND    t2.address    = t1.address
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;  -- t1.uniqid, not t2.citizenid: t2 is out of scope here
As soon as a row has its citizenid, it's "off the market" on both sides.
Check how many rows are left, then start softening the predicates in baby steps, keeping false positives at bay by always trying the more likely match first. Think of a systematic strategy before you start this cyclic process, and analyze the remaining rows for systematic typos or similar clues.
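A sketch of one such softened round, under the same assumptions as above: count the remainder, then retry with case-insensitive names while keeping the stronger predicates exact. Softer predicates can match multiple candidates, so inspect the results before softening further:
SELECT count(*) FROM table2 WHERE citizenid IS NULL;  -- rows still unmatched

WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t2.citizenid IS NULL
   AND    lower(t2.first_name) = lower(t1.first_name)  -- softened
   AND    lower(t2.last_name)  = lower(t1.last_name)   -- softened
   AND    t2.birthdate = t1.birthdate                  -- still exact
   AND    t2.address   = t1.address                    -- still exact
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;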
Possible options for fuzzy matching of character type columns are:
lower(t2.address) = lower(t1.address)
t2.address ILIKE (t1.address || '%')      -- here % is a wildcard for the LIKE pattern
t1.address ILIKE (t2.address || '%')
lower(left(t2.address, 20)) = lower(left(t1.address, 20))
t2.address % t1.address                   -- here % is the similarity operator
levenshtein(t2.address, t1.address) <= 3  -- *very* expensive, no index support
Etc.
The similarity operator % is provided by the additional module pg_trgm, which also provides trigram indexes to support LIKE and similarity matches. You will need indexes for fuzzy matching or your queries may take a long time.
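Both modules typically ship with standard Postgres distributions; install them once per database if needed (fuzzystrmatch provides levenshtein()):
CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- % operator, trigram indexes
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- levenshtein()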
Consider partial indexes to remove rows from the index as soon as a match is found. Like:
CREATE INDEX t1_adr_gin_trgm_idx ON table1 USING gin (address gin_trgm_ops)
WHERE hasmatch IS NULL;   -- no table alias in an index predicate
CREATE INDEX t2_adr_gin_trgm_idx ON table2 USING gin (address gin_trgm_ops)
WHERE citizenid IS NULL;
Etc.
You can fine-tune the similarity threshold for the % operator with:
SELECT set_limit(0.8);
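set_limit() affects the current session, so run it at the start of each fuzzy round. A sketch of one round that combines exact and similarity predicates, under the same assumptions as the rounds above (start strict and lower the threshold only gradually):
WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t2.citizenid IS NULL
   AND    t2.last_name = t1.last_name  -- keep exact predicates where possible
   AND    t2.birthdate = t1.birthdate
   AND    t2.address % t1.address      -- trigram similarity, can use the partial index above
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;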
A small percentage is going to remain unresolved. You can spend an increasing amount of time to assign them manually until you decide to discard the rest.
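For the manual phase, a ranked candidate report helps. A sketch using similarity() from pg_trgm; the LATERAL subquery lists the closest few candidates per unresolved row, and narrowing by last_name is just one possible heuristic:
SELECT t2.first_name, t2.last_name, t2.birthdate, t2.address
     , c.uniqid, c.address AS candidate_address, c.sim
FROM   table2 t2
CROSS  JOIN LATERAL (
   SELECT t1.uniqid, t1.address
        , similarity(t2.address, t1.address) AS sim
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t1.last_name = t2.last_name
   ORDER  BY sim DESC
   LIMIT  3
   ) c
WHERE  t2.citizenid IS NULL
ORDER  BY t2.last_name, t2.first_name, c.sim DESC;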
Optional: when the process is finished, every row has a citizenid, which can now be set to NOT NULL; new rows must have a citizenid right away.
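A minimal sketch of that final step (the optional foreign key assumes uniqid is table1's primary key or at least has a unique constraint):
ALTER TABLE table2 ALTER COLUMN citizenid SET NOT NULL;

ALTER TABLE table2 ADD CONSTRAINT table2_citizenid_fkey
   FOREIGN KEY (citizenid) REFERENCES table1 (uniqid);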