I have two tables, already populated with data as follows:
table1 data are citizens' data, with fields:
uniqid (a unique ID per person, like a social security number)
first_name
last_name
birthdate
address
table2 with fields:
first_name
last_name
birthdate
address
gender
healthinsurance_type
...
table1 data and table2 data come from different and separate agencies. Names in those tables could be typed differently, and so on. table1 is authoritative for names and ID. table2 is what I need to work on, but it has no ID column (citizenid).
Now I need each row in table2 to get a citizenid, associated from table1, so that in the end I get table2 with an additional ID column, correctly associated with each person.
Something like: search table1 for a person (a row in table2) where some conditions match, and if a record exists in table1, put the ID of that record into a column in table2. But do it fast and for all rows.
The row count of table1 is around 2 million; the row count of table2 is around 900,000.
I am assuming you are the only one writing to these tables, so no concurrency conflicts.
Add the ID column to table2; it can be NULL for now:
ALTER TABLE table2 ADD COLUMN citizenid int; -- or whatever the type is
Consider an additional flag on table1 to take rows "off the market" cheaply on the other side, too:
ALTER TABLE table1 ADD COLUMN hasmatch boolean;
This assumes there are no duplicates on either side; else you need to do more.
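To verify that assumption, you can check for duplicates on the matching columns first. A minimal sketch, using the same column names as above (run the equivalent query against table2, too):
SELECT first_name, last_name, birthdate, address, count(*)
FROM   table1
GROUP  BY first_name, last_name, birthdate, address
HAVING count(*) > 1;  -- should return no rows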
Update all rows in table2 with a perfect, complete match, and flag the matching rows in table1 right away, too. With a data-modifying CTE:
WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL   -- always include this condition
   AND    t2.citizenid IS NULL  -- always include this condition
   AND    t2.first_name = t1.first_name
   AND    t2.last_name  = t1.last_name
   AND    t2.birthdate  = t1.birthdate
   AND    t2.address    = t1.address
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;  -- t1.uniqid, not t2.citizenid: t2 is out of scope here
As soon as a row has its citizenid, it's "off the market" on both sides.
Check how many rows are left, then start softening the predicates in baby steps, keeping false positives at bay by always trying the more likely match first. Think of a systematic strategy before you start this cyclic process, and analyze the remaining rows for systematic typos or similar clues.
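A sketch of one such softened round, under the same assumptions as above: count the remainder, then retry with case-insensitive names while keeping the stronger predicates exact. Softer predicates can match multiple candidates, so inspect the results before softening further:
SELECT count(*) FROM table2 WHERE citizenid IS NULL;  -- rows still unmatched

WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t2.citizenid IS NULL
   AND    lower(t2.first_name) = lower(t1.first_name)  -- softened
   AND    lower(t2.last_name)  = lower(t1.last_name)   -- softened
   AND    t2.birthdate = t1.birthdate                  -- still exact
   AND    t2.address   = t1.address                    -- still exact
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;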
Possible options for fuzzy matching of character type columns are:
lower(t2.address) = lower(t1.address)
t2.address ILIKE (t1.address || '%')      -- here % is a wildcard for the LIKE pattern
t1.address ILIKE (t2.address || '%')
lower(left(t2.address, 20)) = lower(left(t1.address, 20))
t2.address % t1.address                   -- here % is the similarity operator
levenshtein(t2.address, t1.address) <= 3  -- *very* expensive, no index support
Etc.
The similarity operator % is provided by the additional module pg_trgm, which also provides trigram indexes to support LIKE and similarity matches. You will need indexes for fuzzy matching or your queries may take a long time.
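Both modules typically ship with standard Postgres distributions; install them once per database if needed (fuzzystrmatch provides levenshtein()):
CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- % operator, trigram indexes
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- levenshtein()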
Consider partial indexes to remove rows from the index as soon as a match is found. Like:
CREATE INDEX t1_adr_gin_trgm_idx ON table1 USING gin (address gin_trgm_ops)
WHERE hasmatch IS NULL;   -- no table alias in an index predicate
CREATE INDEX t2_adr_gin_trgm_idx ON table2 USING gin (address gin_trgm_ops)
WHERE citizenid IS NULL;
Etc.
You can fine-tune the similarity threshold for the % operator with:
SELECT set_limit(0.8);
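set_limit() affects the current session, so run it at the start of each fuzzy round. A sketch of one round that combines exact and similarity predicates, under the same assumptions as the rounds above (start strict and lower the threshold only gradually):
WITH u2 AS (
   UPDATE table2 t2
   SET    citizenid = t1.uniqid
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t2.citizenid IS NULL
   AND    t2.last_name = t1.last_name  -- keep exact predicates where possible
   AND    t2.birthdate = t1.birthdate
   AND    t2.address % t1.address      -- trigram similarity, can use the partial index above
   RETURNING t2.citizenid
   )
UPDATE table1 t1
SET    hasmatch = TRUE
FROM   u2
WHERE  t1.uniqid = u2.citizenid;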
A small percentage is going to remain unresolved. You can spend an increasing amount of time to assign them manually until you decide to discard the rest.
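For the manual phase, a ranked candidate report helps. A sketch using similarity() from pg_trgm; the LATERAL subquery lists the closest few candidates per unresolved row, and narrowing by last_name is just one possible heuristic:
SELECT t2.first_name, t2.last_name, t2.birthdate, t2.address
     , c.uniqid, c.address AS candidate_address, c.sim
FROM   table2 t2
CROSS  JOIN LATERAL (
   SELECT t1.uniqid, t1.address
        , similarity(t2.address, t1.address) AS sim
   FROM   table1 t1
   WHERE  t1.hasmatch IS NULL
   AND    t1.last_name = t2.last_name
   ORDER  BY sim DESC
   LIMIT  3
   ) c
WHERE  t2.citizenid IS NULL
ORDER  BY t2.last_name, t2.first_name, c.sim DESC;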
Optional: when the process is finished, every row has a citizenid, which can now be set to NOT NULL; new rows must have a citizenid right away.
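A minimal sketch of that final step (the optional foreign key assumes uniqid is table1's primary key or at least has a unique constraint):
ALTER TABLE table2 ALTER COLUMN citizenid SET NOT NULL;

ALTER TABLE table2 ADD CONSTRAINT table2_citizenid_fkey
   FOREIGN KEY (citizenid) REFERENCES table1 (uniqid);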