I've been tinkering and reading for a while, but I can't find any optimization that works here. I've indexed the ids used in the joins, I've run a manual VACUUM, and I've also tried clustering on an index so that the planner wouldn't decide a full table scan is cheaper just because the matching rows are scattered (though I don't know much about query planning).
I am trying to get join results for a single id (for debugging purposes). I found that queries for some single ids take ~2 minutes, while most (99%?) return in under 1 second. Here are two EXPLAIN ANALYZE outputs (I changed some names with sed for confidentiality):
main=> explain analyze SELECT e.customer_id, l.*
FROM abc.encounter e
JOIN abc.log l
ON e.encounter_id = l.encounter_id
AND e.customer_id = '1655563';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=2751.69..2566740.95 rows=13262 width=75) (actual time=122038.725..226694.004 rows=249 loops=1)
Hash Cond: (l.encounter_id = e.encounter_id)
-> Seq Scan on log l (cost=0.00..2190730.92 rows=99500192 width=66) (actual time=0.005..120825.675 rows=99500192 loops=1)
-> Hash (cost=2742.81..2742.81 rows=710 width=18) (actual time=0.309..0.309 rows=89 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on encounter e (cost=17.93..2742.81 rows=710 width=18) (actual time=0.037..0.197 rows=89 loops=1)
Recheck Cond: (customer_id = '1655563'::text)
Heap Blocks: exact=46
-> Bitmap Index Scan on idx_abc_encounter_customer_id (cost=0.00..17.76 rows=710 width=0) (actual time=0.025..0.025 rows=89 loops=1)
Index Cond: (customer_id = '1655563'::text)
Planning time: 0.358 ms
Execution time: 226694.311 ms
(12 rows)
main=> explain analyze SELECT e.customer_id, l.*
FROM abc.encounter e
JOIN abc.log l
ON e.encounter_id = l.encounter_id
AND e.customer_id = '121652491';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=36.67..53168.06 rows=168 width=75) (actual time=0.090..0.422 rows=11 loops=1)
-> Index Scan using idx_abc_encounter_customer_id on encounter e (cost=0.43..40.53 rows=9 width=18) (actual time=0.017..0.047 rows=17 loops=1)
Index Cond: (customer_id = '121652491'::text)
-> Bitmap Heap Scan on log l (cost=36.24..5888.00 rows=1506 width=66) (actual time=0.016..0.017 rows=1 loops=17)
Recheck Cond: (encounter_id = e.encounter_id)
Heap Blocks: exact=6
-> Bitmap Index Scan on idx_abc_log_encounter_id (cost=0.00..35.86 rows=1506 width=0) (actual time=0.013..0.013 rows=1 loops=17)
Index Cond: (encounter_id = e.encounter_id)
Planning time: 0.361 ms
Execution time: 0.478 ms
(10 rows)
I'll also add that, for a long-running query, even if only 250 rows are returned after 2 minutes, adding "LIMIT 100" can make it return instantly. I checked whether the speed was related to the amount of data returned by the query, and I don't see any obvious trend. I can't help feeling that Postgres is badly mistaken (by 100x?) about which of its methods will be faster. What are my options here?
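For reference, the fast variant is just the slow query from the first plan above with a limit tacked on:
SELECT e.customer_id, l.*
FROM abc.encounter e
JOIN abc.log l
  ON e.encounter_id = l.encounter_id
  AND e.customer_id = '1655563'
LIMIT 100;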
PostgreSQL's row count estimate for encounter is off by a factor of almost 10 (710 estimated rows vs. 89 actual). My first attempt would be to improve that.
For that you can change the statistics target for the column:
ALTER TABLE abc.encounter ALTER customer_id SET STATISTICS 1000;
A subsequent ANALYZE will then collect better statistics for that column. If 1000 is not enough, try 10000. With a better row count estimate you have a better chance of getting the best plan.
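After changing the target, a manual ANALYZE plus a quick look at pg_stats will show whether the new statistics look more reasonable (this query is just one way to inspect them, using the names from your plans):
ANALYZE abc.encounter;

-- optional: inspect the statistics collected for the filter column
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'abc'
  AND tablename = 'encounter'
  AND attname = 'customer_id';
Then re-run the EXPLAIN ANALYZE and compare the estimated rows for encounter with the actual 89.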
If the cost of the repeated index scans for the nested loop join is still overestimated in comparison with a sequential scan, you can lower the parameter random_page_cost from its default value of 4 to something closer to seq_page_cost (default 1). That will bias PostgreSQL in favor of the nested loop join.
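You can try the effect in a single session before changing anything globally; a rough sketch (1.1 is only an example value, suitable for fast storage):
SET random_page_cost = 1.1;

EXPLAIN ANALYZE SELECT e.customer_id, l.*
FROM abc.encounter e
JOIN abc.log l
  ON e.encounter_id = l.encounter_id
  AND e.customer_id = '1655563';

-- if the nested loop plan is chosen and fast, persist the setting,
-- for example for the whole database:
ALTER DATABASE main SET random_page_cost = 1.1;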