postgresql caching indexing postgresql-performance

Why do seq/index scans take so long when a query is run again after a while? How can I make it fast?

Problem:

I have a query that joins three tables. Whenever I run this query after a pause (say 24 hours), it takes a long time to execute. Every run after that is really fast (~70x faster). Why does the first run take so long, and how can I fix it?

Table conditions:

The tables are property_2, property_attribute_2, and property_address_2. Each is a partition of a bigger table (property, property_attribute, and property_address, respectively). Rows in property_attribute_2 and property_address_2 reference property_2 through the column property_id. These columns (property_2.id, property_attribute_2.property_id, and property_address_2.property_id) are all indexed.

The query is:

select * from public.property_2 a
inner join public.property_attribute_2 b on a.id = b.property_id
left join public.property_address_2 c on a.id = c.property_id;

The query plan when I run it after a while is:

Hash Right Join  (cost=670010.33..983391.75 rows=2477776 width=185) (actual time=804159.499..1065892.338 rows=2477924 loops=1)
  Hash Cond: (c.property_id = a.id)
  ->  Seq Scan on property_address_2 c  (cost=0.00..131660.48 rows=4257948 width=72) (actual time=289.781..247906.955 rows=4257973 loops=1)
  ->  Hash  (cost=595483.13..595483.13 rows=2477776 width=117) (actual time=803833.183..803833.185 rows=2477921 loops=1)
        Buckets: 32768  Batches: 128  Memory Usage: 3165kB
        ->  Hash Join  (cost=94193.96..595483.13 rows=2477776 width=117) (actual time=98061.326..802753.642 rows=2477921 loops=1)
              Hash Cond: (a.id = b.property_id)
              ->  Seq Scan on property_2 a  (cost=0.00..265463.84 rows=6176884 width=105) (actual time=1349.284..696922.438 rows=4272433 loops=1)
              ->  Hash  (cost=48702.76..48702.76 rows=2477776 width=20) (actual time=95497.307..95497.308 rows=2477921 loops=1)
                    Buckets: 65536  Batches: 64  Memory Usage: 2624kB
                    ->  Seq Scan on property_attribute_2 b  (cost=0.00..48702.76 rows=2477776 width=20) (actual time=464.476..94126.890 rows=2477921 loops=1)
Planning time: 4.034 ms
Execution time: 1065995.827 ms

And the query plan after the first run is:

Hash Right Join  (cost=670010.33..983391.75 rows=2477776 width=185) (actual time=8828.873..13764.283 rows=2477924 loops=1)
  Hash Cond: (c.property_id = a.id)
  ->  Seq Scan on property_address_2 c  (cost=0.00..131660.48 rows=4257948 width=72) (actual time=0.050..1411.877 rows=4257973 loops=1)
  ->  Hash  (cost=595483.13..595483.13 rows=2477776 width=117) (actual time=8826.620..8826.623 rows=2477921 loops=1)
        Buckets: 32768  Batches: 128  Memory Usage: 3165kB
        ->  Hash Join  (cost=94193.96..595483.13 rows=2477776 width=117) (actual time=1356.224..7925.850 rows=2477921 loops=1)
              Hash Cond: (a.id = b.property_id)
              ->  Seq Scan on property_2 a  (cost=0.00..265463.84 rows=6176884 width=105) (actual time=0.034..2652.013 rows=4272433 loops=1)
              ->  Hash  (cost=48702.76..48702.76 rows=2477776 width=20) (actual time=1354.828..1354.829 rows=2477921 loops=1)
                    Buckets: 65536  Batches: 64  Memory Usage: 2624kB
                    ->  Seq Scan on property_attribute_2 b  (cost=0.00..48702.76 rows=2477776 width=20) (actual time=0.023..630.081 rows=2477921 loops=1)
Planning time: 1.181 ms
Execution time: 13872.977 ms

Also worth noting: I have a couple of other Postgres databases on this machine, and different jobs use different tables in these databases on a regular basis.

Solution

If a cold cache is the problem, as seems to be the case, you can warm it up before running the query. Postgres ships with the additional module pg_prewarm, which provides a function to load relation data into either the operating system cache or the PostgreSQL buffer cache.

Instructions on how to set it up are here:

PostgreSQL: Force data into memory

Then you run something like:

SELECT pg_prewarm('public.property_2', 'prefetch');
SELECT pg_prewarm('public.property_attribute_2', 'prefetch');
SELECT pg_prewarm('public.property_address_2', 'prefetch');
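For reference, a minimal sketch of the setup and of the other common mode. It assumes the contrib modules are installed on the server and that you may create extensions in this database. The 'prefetch' mode used above only asks the operating system to read the blocks into its own cache; the default mode 'buffer' loads them into PostgreSQL shared buffers instead, which only makes sense if the table fits there.

```sql
-- One-time setup per database (assumes the contrib package is installed).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Compare the table sizes with the configured buffer cache
-- to decide which mode is realistic.
SELECT pg_size_pretty(pg_relation_size('public.property_2'))           AS property_2,
       pg_size_pretty(pg_relation_size('public.property_attribute_2')) AS property_attribute_2,
       pg_size_pretty(pg_relation_size('public.property_address_2'))   AS property_address_2,
       current_setting('shared_buffers')                               AS shared_buffers;

-- Default mode 'buffer': read the table into shared buffers.
-- The return value is the number of blocks that were prewarmed.
SELECT pg_prewarm('public.property_attribute_2');
```

If the three tables together are much larger than shared_buffers, 'prefetch' or 'read' (operating system cache) is probably the more useful choice.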

Of course, if you always run the same SELECT query without filter predicates, you might as well just run the same query to populate the cache, without using the fancy module. Possibly scheduled with a cron job?
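To check whether a cold cache really is the cause, and whether a warm-up has taken effect, you can ask EXPLAIN for buffer statistics. This is only a sketch of the check, using the query from the question unchanged:

```sql
-- BUFFERS adds per-node counters to the plan:
--   "shared hit" = blocks found in PostgreSQL shared buffers
--   "read"       = blocks requested from the operating system
--                  (served from the OS cache or from disk)
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   public.property_2 a
JOIN   public.property_attribute_2 b ON a.id = b.property_id
LEFT   JOIN public.property_address_2 c ON a.id = c.property_id;
```

A slow first run dominated by "read" counts on the Seq Scan nodes, followed by a fast run with the same counts, would point to the operating system cache doing the work on the second run. Enabling track_io_timing (which needs sufficient privileges) additionally reports the time spent on those reads.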

You wrote: "... are all indexed."

As you can see in the EXPLAIN output, your indexes go unused. You fetch all rows without a filter predicate, so indexes typically won't help. And you do it with SELECT *, i.e. you get all columns from all joined tables, so index-only scans are out, too. You might improve performance by listing only the columns you actually need in the SELECT list.
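A sketch of what an explicit column list could look like. Only id and property_id are known from the question, so those are the only columns used here; the output aliases and the commented-out column are hypothetical placeholders for whatever you actually need.

```sql
SELECT a.id,
       b.property_id AS attribute_property_id,  -- alias chosen for illustration
       c.property_id AS address_property_id     -- NULL when no address row matches
       -- , a.some_column                       -- hypothetical: add only what you need
FROM   public.property_2 a
JOIN   public.property_attribute_2 b ON a.id = b.property_id
LEFT   JOIN public.property_address_2 c ON a.id = c.property_id;
```

The sequential scans still have to read every block of each table, so this does not cure the cold-cache problem. It makes the rows passed through the hash joins narrower, which may reduce the number of batches and the amount of data sent to the client.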

Obviously, more RAM (and proper configuration for PostgreSQL buffer cache) can help, too.
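To see how the cache is configured and what currently occupies it, something along these lines can help. It is a sketch that assumes the additional module pg_buffercache is available and that your role is allowed to read it (superuser or a member of pg_monitor in recent versions).

```sql
-- Current memory-related settings.
SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('shared_buffers', 'effective_cache_size', 'work_mem');

-- Which relations of the current database occupy the most shared buffers?
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint) AS cached
FROM   pg_buffercache b
JOIN   pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE  b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE  datname = current_database()))
GROUP  BY c.relname
ORDER  BY buffers DESC
LIMIT  10;
```

Run it shortly after the fast execution and again a day later to see whether other jobs have pushed the three tables out. As a side observation, the plans above report Batches: 128 and Batches: 64 with only about 3 MB of memory per hash, so a larger work_mem for this session might also help, independently of the cache.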

Or you might be able to reduce RAM requirements with VACUUM (FULL) or, possibly, with an optimized table definition with proper column types and order. Not just for the tables at hand, also for other tables competing for the same resources (thereby evicting "your" blocks from the cache). See:

Calculating and saving space in PostgreSQL
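Before rewriting a table, it may be worth checking whether there is anything to gain. The plan estimates 6176884 rows for property_2 but finds 4272433, which may hint at outdated statistics or bloat. A sketch of the check and of the rewrite, shown for one table only:

```sql
-- Dead tuples and the last (auto)vacuum / analyze times per table.
SELECT relname, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname IN ('property_2', 'property_attribute_2', 'property_address_2');

-- On-disk size including indexes; run before and after to compare.
SELECT pg_size_pretty(pg_total_relation_size('public.property_2'));

-- Rewrites the table and its indexes without dead space and refreshes statistics.
-- Takes an ACCESS EXCLUSIVE lock for the whole duration and temporarily
-- needs disk space for a second copy of the table.
VACUUM (FULL, ANALYZE) public.property_2;
```

Because of the exclusive lock, schedule this for a window in which no job needs the table.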