I am trying to update address fields in mytable with data from othertable. If I match on postal codes and search for city names from othertable in mytable, it works reasonably fast. But as I don't have postal codes in all cases, I also want to look for names only in a 2nd query. This takes hours (>12h). Any ideas what I can do to speed up the query? Please note that indexing did not help. Index scans in (2) weren't faster.
Code for matching on postal code + name (1)
update mytable t1 set
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng from (
select * from othertable) t
where t.postal_code = t1.postal_code and t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
and admin1 is null;
Code for matching on name only (2)
update mytable t1 set
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng from (
select * from othertable) t
where t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
and admin1 is null;
Query plan 1:
"Update on mytable t1 (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" -> Merge Join (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" Merge Cond: (((t1.postal_code)::text = (othertable.postal_code)::text) AND (t1.country = othertable.country))"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.postal_code, t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.postal_code, othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
Query plan 2:
"Update on mytable t1 (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" -> Merge Join (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" Merge Cond: (t1.country = othertable.country)"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
In the second query, you are joining (more or less) on city name, but the othertable
has several entries per city name, so you are updating mytable
several times per record, with unpredictable value (which lat-long or other admin2/3 will be the last one to be updated?)
If othertable
has entries without postal code, use them by adding an extra condition AND othertable.posalcode is null
Else, you will want to get a subset of othertable
that returns one row per admin1
+ country
value. You would replace select * from othertable
by the following query. Of course you might want to adjust it to get another lat/long/admin2-3 than the 1st one..
SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
FROM othertable
GROUP BY admin1,country
Worst, the second query overwrite what was updated in the 1st query, so you must ignore these records by adding and mytable.postalcode is null
The entire query could be
UPDATE mytable t1
SET
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng
FROM (
SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
FROM othertable
GROUP BY admin1,country) t
WHERE t1.country = t.country
AND upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
AND admin1 is null
AND mytable.postal_code is null;