Tags: postgresql, heroku-postgres, postgresql-10, postgres-14

Postgres query planner won't use less expensive composite index


We've run into a strange issue with Postgres, initially unearthed as we prepared to upgrade from 10.21 to 11+. What we'd initially thought was a performance regression in newer versions of Postgres appears, in fact, to be a regression that shows up once more accurate stats are generated for our database. As it turns out, both last_analyze and last_autoanalyze are NULL for the table in question. The upgrade corrects that, but makes a certain subset of queries much slower.

Background

We're currently running PG 10.21 in production. The table we're interacting with here, events_test, includes the two columns we're testing on: account (varchar(40)) and timestamp (timestamptz). There are single-column indexes on account and on timestamp, plus a composite index on (account, timestamp).
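
Roughly, the relevant schema looks like this (other columns omitted; the single-column account index name below is illustrative, while the other index names are the ones that appear in the plans further down):

CREATE TABLE events_test (
    account     varchar(40),
    "timestamp" timestamptz
    -- other event columns omitted
);

CREATE INDEX events_test_account_idx ON events_test (account);  -- illustrative name
CREATE INDEX events_test_timestamp_idx ON events_test ("timestamp");
CREATE INDEX events_test_account_timestamp_idx ON events_test (account, "timestamp");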

Our test query is

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

We're testing it against a couple of different account values — some we expect to return a large quantity of data (tens of thousands of rows, for account 1234) and some a small quantity (tens of rows, for account 5678).

On our current production DB, this runs quickly enough for our purposes, around 80-100ms for the large response (account 1234) and ~1ms for the small response (account 5678). In both cases, the query planner opts to start with an index scan on the composite index events_test_account_timestamp_idx.

Out of date stats

For the account returning a large number of rows, the query planner vastly underestimates the number of rows to be returned — it estimates ~500 rows, but gets around ~60,000. This is a result of the table stats being out of date on our production DB.

Manually running ANALYZE events_test;, or upgrading to PG11, PG12 or PG14 (the upgrade appears to trigger a reanalysis shortly afterwards), updates the table stats and brings the row estimate much closer to the true value — the new estimate is on the order of ~20,000 rows for the account with more data returned.
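
For anyone checking the same thing, the stats freshness and the planner's current notion of the table size can be inspected with standard catalog views:

-- when stats were last gathered (both NULL on our production DB)
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'events_test';

-- the row and page counts the planner currently believes
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'events_test';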

Query planner

Unfortunately, this more-accurate estimate seems to result in the query planner choosing a much less efficient plan. For the account with little data, the row estimate doesn't change and the planner continues to use the composite index. For the account with a larger amount of returned data, however, the planner now opts for an index scan on events_test_timestamp_idx followed by a filter on the account field, which pushes the query's execution time up to 400-500ms.

The query that expects a small number of rows to be returned continues to use the composite index on all versions of Postgres.

What we've tried

We've dug around on SO looking for similar questions about Postgres bypassing a composite index, but nothing we've found is of use. Also:

Analyze, reindex, stats target

We've tried updating the default_statistics_target from 100 to 1000 (and have also tried it briefly at 10 and 10000), followed by reanalyzing our table and reindexing, and the slower query plan has not changed.
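
That sequence was along these lines (a session-level SET is shown; the exact mechanism for changing the setting varies, and a per-column SET STATISTICS is an alternative):

-- raise the target so the next ANALYZE gathers more detailed stats
SET default_statistics_target = 1000;
-- per-column alternative:
-- ALTER TABLE events_test ALTER COLUMN account SET STATISTICS 1000;

ANALYZE events_test;
REINDEX TABLE events_test;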

Postgres versions

This issue was initially revealed as we validated against newer versions of Postgres. We've tested after upgrading to PG11, PG12 and PG14, and the slower query plan shows up in all cases. We can also trigger it on PG10 by reanalyzing to update our stats, so this doesn't appear to be a change in planner behaviour introduced by a Postgres update.

Force composite index

We've verified that we can indeed force Postgres to use the composite index by (temporarily) dropping the timestamp index the planner is now preferring with this test command:

BEGIN;

DROP INDEX events_test_timestamp_idx;

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM events_test 
WHERE account = '1234' AND TIMESTAMP >= (NOW() AT TIME ZONE 'EST'  - interval '1 month');

ROLLBACK;

This causes Postgres to revert to the composite index events_test_account_timestamp_idx; however, it now chooses a bitmap index scan on that index. This continues to run in under 100ms, and is comparable in performance to the previous straight index scan on the composite index.

Query EXPLAINs

Here is what the query looks like when it's running more efficiently, in production, on PG10:

Index Scan using events_test_account_timestamp_idx on events_test  (cost=0.11..1109.11 rows=579 width=206) (actual time=0.062..116.439 rows=71700 loops=1)
  Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= (timezone('EST'::text, now()) - '1 mon'::interval)))
  Buffers: shared hit=68619 read=1249
  I/O Timings: read=23.426
Planning time: 0.662 ms
Execution time: 119.339 ms

And here is what it looks like once it has updated statistics, on PG14 (but the output is similar on all versions):

Index Scan using events_test_timestamp_idx on events_test  (cost=0.11..31867.79 rows=18994 width=204) (actual time=0.164..311.960 rows=55583 loops=1)
  Index Cond: ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval))
  Filter: ((account)::text = '1234'::text)
  Rows Removed by Filter: 462327
  Buffers: shared hit=492867
Planning:
  Buffers: shared hit=144 read=30
  I/O Timings: read=6.160
Planning Time: 7.021 ms
Execution Time: 314.676 ms

And finally, if we force it to use the composite index by temporarily dropping the timestamp index, on PG14 (but again, similar on all versions):

Bitmap Heap Scan on events_test  (cost=187.05..35867.12 rows=18992 width=204) (actual time=11.373..38.937 rows=55582 loops=1)
  Recheck Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
  Heap Blocks: exact=13586
  Buffers: shared hit=13803
  ->  Bitmap Index Scan on events_test_account_timestamp_idx  (cost=0.00..186.10 rows=18992 width=0) (actual time=9.376..9.376 rows=55582 loops=1)
        Index Cond: (((account)::text = '1234'::text) AND ("timestamp" >= ((now() AT TIME ZONE 'EST'::text) - '1 mon'::interval)))
        Buffers: shared hit=217
Planning:
  Buffers: shared hit=5
Planning Time: 0.316 ms
Execution Time: 41.446 ms

Edit: It seems like the core issue here is that, with an accurate estimate, Postgres expects an index scan plus filter on the single-column timestamp index to be cheaper (cost 31867.79) than using the composite index (35867.12), when in fact the composite is nearly 10x faster in the case above. That's interesting, given that the last two EXPLAINs above use the same set of statistics to come up with these predictions.


Solution

  • I am pretty sure I know what is going on here, but I don't know what to do about it.

    The timestamp column is highly correlated with the physical order of rows in the table (that is, pg_stats.correlation is nearly 1 for this tablename and attname, which I see you confirmed in a comment; a query to check this is sketched at the end of this answer). The planner therefore assumes the range scan on that segment of the index on (timestamp) will end up reading a small chunk of the table, and doing so more or less sequentially, and so it gives a big break on the IO cost for use of that index. However, it does not give this break for the index on (account, timestamp). The planner only looks at the correlation stats for the first column of the index, and since account is not correlated with row order, it assumes it is going to have to hop all over the entire table doing random reads. This is wrong, because once you fix the specific account and set the boundaries on the timestamp column, that chunk of the index is still highly correlated with the row order, so you are still reading the (small) relevant chunk of the table and doing it in sequential order, but the planner never looks far enough to realize this.

    So the "wrong" index gets an unfair credit for an IO convenience which is true, but which is also true (but unknown) for the "right" index. But the plan with the wrong index needs to do a lot of filtering to remove the rows with wrong account values. This consumes CPU time, and the planner knows this, but the amount of CPU time burned is thought to be small compared to the amount of IO it incorrectly thinks will be saved.

    One thing you could do about this is to set autovacuum_enabled = off for this table. But for this to take effect (that is, to get back to the never-analyzed state your old server was in), you would have to recreate the table and then populate it after the change is made; I suspect the equivalent of this is what someone somehow managed to do on your old server. Another thing you could consider is partitioning. If you only had to consider the most recent partition (or two partitions), the correlation effect would be much smaller, as the planner would already know it is going to look at a much smaller table regardless of which index is used.

    Another thing you could do, although quite grotty, is to manually monkey around with the statistics table. In my hands this did the job:

    update pg_statistic set stanumbers2='{0}' where starelid=25929 and staattnum=2;
    

    But of course in your system that relid would be different from mine, and the attnum probably as well (the lookup sketched below shows one way to find the right values), and I'm not sure stanumbers2 would be the right field to set on all systems. And this would just get overwritten again the next time analyze was done. And of course it might interfere with other plans the planner is trying to come up with, causing it to make choices there that are far worse than the benefit you might get here.
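
    For convenience, here is a sketch of how to look at the relevant catalog entries, using the table and column names from the question (standard views; adjust the names for your own schema):

    -- physical-order correlation for the two columns; a value near 1 for
    -- "timestamp" is what earns the single-column index its IO break
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'events_test'
      AND attname IN ('account', 'timestamp');

    -- find the starelid / staattnum pair for events_test."timestamp",
    -- for use in the pg_statistic update above
    SELECT c.oid AS starelid, a.attnum AS staattnum
    FROM pg_class c
    JOIN pg_attribute a ON a.attrelid = c.oid
    WHERE c.relname = 'events_test'
      AND a.attname = 'timestamp';

    And if you do go the recreate-the-table route, the storage parameter can be set at creation time, e.g. CREATE TABLE events_test (...) WITH (autovacuum_enabled = off).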