Tags: sql, postgresql, query-performance

Sequential scan over index scan used on truncated DATE field


I use PostgreSQL and I have a table called table. This table contains a column called created_at (data type timestamptz) which is indexed with a BTREE index.

I would like to count the number of rows grouped by created_at::date over a period of time (filtered by a from_date and an end_date).

I run the following query (the result is as expected):

SELECT ("table"."created_at" AT TIME ZONE 'UTC')::date AS "date",
       COUNT("table"."id") AS "count"
FROM "table"
WHERE ("table"."created_at" >= '2018-08-05T00:00:00+00:00'::timestamptz AND "table"."created_at" <= '2020-09-05T00:00:00+00:00'::timestamptz)
GROUP BY ("table"."created_at" AT TIME ZONE 'UTC')::date
ORDER BY "date" ASC

This query takes a long time to run (the table holds over 2 million rows), and when looking at the query plan, I noticed a heavy Seq Scan:

GroupAggregate  (cost=538741.06..605206.42 rows=2954016 width=12) (actual time=3866.460..5077.054 rows=559 loops=1)
  Group Key: ((timezone('UTC'::text, created_at))::date)
  ->  Sort  (cost=538741.06..546126.10 rows=2954016 width=8) (actual time=3866.414..4413.922 rows=2954016 loops=1)
        Sort Key: ((timezone('UTC'::text, created_at))::date)
        Sort Method: external merge  Disk: 52056kB
        ->  Seq Scan on table  (cost=0.00..140489.32 rows=2954016 width=8) (actual time=0.070..2194.108 rows=2954016 loops=1)
              Filter: ((created_at >= '2018-08-05 00:00:00+00'::timestamp with time zone) AND (created_at <= '2020-09-05 00:00:00+00'::timestamp with time zone))
Planning time: 1.018 ms
Execution time: 5094.280 ms

I would like to understand the following:

  1. What should be improved in the query (if anything)?
  2. What should be improved in the table structure (if anything)?
  3. Is the index type used (BTREE) appropriate for such queries?

Solution

  • It seems your WHERE condition doesn't actually filter out any rows, so all rows of the table are processed. In that case, a Seq Scan is the most efficient way to retrieve the data. If you make the time range small enough that only a small percentage of the table's rows are retrieved, the optimizer should use the index.
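    For example, with a much narrower range (the dates below are just an illustration), the planner should switch to an Index Scan on created_at:

    ```sql
    -- Hypothetical narrow range: only a few days of data are selected,
    -- so the BTREE index on created_at becomes cheaper than a Seq Scan.
    EXPLAIN (ANALYZE)
    SELECT ("created_at" AT TIME ZONE 'UTC')::date AS "date",
           COUNT("id") AS "count"
    FROM "table"
    WHERE "created_at" >= '2020-09-01T00:00:00+00:00'::timestamptz
      AND "created_at" <= '2020-09-05T00:00:00+00:00'::timestamptz
    GROUP BY 1
    ORDER BY 1;
    ```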

    The Seq Scan takes only about half of the query time; the other half is spent in the GROUP BY (or rather the sorting for it). Note the "external merge Disk: 52056kB" in the plan: the sort spills to disk. If you increase work_mem, the sorting/grouping should at least get faster, and with enough work_mem the sort is likely to be replaced by a hash aggregate.
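    A quick way to try this for the current session only (the 128MB value is just a guess that comfortably exceeds the 52MB spilled to disk in the plan above):

    ```sql
    -- Session-local setting; does not affect other connections.
    SET work_mem = '128MB';

    -- Re-run EXPLAIN (ANALYZE) on the query and check whether the
    -- "external merge Disk: ..." sort is gone (in-memory sort or HashAggregate).
    ```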

    Assuming id is defined as not null, using count(*) instead of count(id) would also make the query faster. For one, because the null check is no longer needed inside the count function. But more importantly, because Postgres is then likely to use an Index Only Scan, as only the created_at column is needed, which is available directly from the index. If the plan doesn't switch to an Index Only Scan, you might want to run vacuum analyze the_table; to update the visibility map.
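    The reworked query would then look like this (same result as the original, assuming id is never null):

    ```sql
    -- count(*) counts rows directly, so the id column is no longer read;
    -- only created_at is needed, which the BTREE index can supply.
    SELECT ("created_at" AT TIME ZONE 'UTC')::date AS "date",
           COUNT(*) AS "count"
    FROM "table"
    WHERE "created_at" >= '2018-08-05T00:00:00+00:00'::timestamptz
      AND "created_at" <= '2020-09-05T00:00:00+00:00'::timestamptz
    GROUP BY 1
    ORDER BY 1;

    -- If the planner still picks a plain (Seq or Index) Scan, refresh the
    -- visibility map so an Index Only Scan becomes viable:
    VACUUM ANALYZE "table";
    ```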