sql, postgresql, aggregate, postgresql-performance

PostgreSQL aggregate query is very slow


I have a table that contains a timestamp column (date_created) and a varchar(20) column (source). I insert a couple of thousand entries into this table every hour, and I would like to show an aggregate of this data. My query looks like this:

EXPLAIN (analyze, buffers) SELECT
    count(*) AS count
FROM frontend_car c
WHERE date_created at time zone 'cet' > now() at time zone 'cet' - interval '1 week'
GROUP BY source, date_trunc('hour', c.date_created at time zone 'CET')
ORDER BY source ASC, date_trunc('hour', c.date_created at time zone 'CET') DESC

I have already created an index like so:

create index source_date_created on
frontend_car(
    (date_created AT TIME ZONE 'CET') DESC,
    source ASC,
    date_trunc('hour', date_created at time zone 'CET') DESC
);

And the output of EXPLAIN (analyze, buffers) is:

    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=142888.08..142889.32 rows=495 width=16) (actual time=10242.141..10242.188 rows=494 loops=1)
   Sort Key: source, (date_trunc('hour'::text, timezone('CET'::text, date_created)))
   Sort Method: quicksort  Memory: 63kB
   Buffers: shared hit=27575 read=28482
   ->  HashAggregate  (cost=142858.50..142865.93 rows=495 width=16) (actual time=10236.393..10236.516 rows=494 loops=1)
         Group Key: source, date_trunc('hour'::text, timezone('CET'::text, date_created))
         Buffers: shared hit=27575 read=28482
         ->  Bitmap Heap Scan on frontend_car c  (cost=7654.61..141002.20 rows=247507 width=16) (actual time=427.894..10122.438 rows=249056 loops=1)
               Recheck Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
               Rows Removed by Index Recheck: 141143
               Heap Blocks: exact=27878 lossy=26713
               Buffers: shared hit=27575 read=28482
               ->  Bitmap Index Scan on frontend_car_source_date_created  (cost=0.00..7592.74 rows=247507 width=0) (actual time=420.415..420.415 rows=249056 loops=1)
                     Index Cond: (timezone('cet'::text, date_created) > (timezone('cet'::text, now()) - '7 days'::interval))
                     Buffers: shared hit=3 read=1463
 Planning time: 2.430 ms
 Execution time: 10242.379 ms
(17 rows)

Clearly this is too slow, and in my mind it should be possible to compute it using only indexes. If I aggregate by either time or source alone, the query is reasonably fast, but together it is somehow slow.

This is on a rather small VPS with only 512 MB of RAM, and the database presently contains about 700k rows.

From what I read, it seems that the majority of the time is spent on the recheck, which would mean that the index did not fit in memory?


Solution

  • It sounds like what you really need is a separate aggregate (summary) table that gets records inserted or updated via a trigger on your detail table. The summary table would have your source column, a date/time column holding only the date and hour (minutes truncated), and a running count.

    As records are inserted into the detail table, the trigger updates this summary table, and your query can then run directly against it. Since the data is already pre-aggregated by source, date and hour, the query only needs to apply the WHERE clause and order the result by source.

    I'm not fluent in PostgreSQL, but I'm sure it has its own flavor of insert triggers. So, if you have thousands of entries each hour and, say, 50 sources, the entire result set in this summary table would only be 24 (hours) × 50 (sources) = 1,200 records per day, versus 50k, 60k, 70k+ detail rows per day. If you then need the exact details for a given date/hour, you can drill into the detail table on an as-needed basis. How many "sources" you are dealing with is unclear, though.

    I would strongly consider this as a solution to your needs.
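
    A minimal sketch of this trigger-maintained summary table in PostgreSQL might look like the following. The table, function and trigger names (frontend_car_hourly, frontend_car_hourly_upsert) are illustrative, not from the original post, and the ON CONFLICT upsert assumes PostgreSQL 9.5 or later:

    ```sql
    -- Hypothetical summary table: one row per (source, hour) bucket.
    CREATE TABLE frontend_car_hourly (
        source varchar(20) NOT NULL,
        hour   timestamp   NOT NULL,  -- date_created truncated to the hour, in CET
        cnt    bigint      NOT NULL DEFAULT 0,
        PRIMARY KEY (source, hour)
    );

    CREATE FUNCTION frontend_car_hourly_upsert() RETURNS trigger AS $$
    BEGIN
        -- Bump the running count for this source/hour bucket,
        -- creating the bucket row the first time it is seen.
        INSERT INTO frontend_car_hourly (source, hour, cnt)
        VALUES (NEW.source, date_trunc('hour', NEW.date_created AT TIME ZONE 'CET'), 1)
        ON CONFLICT (source, hour)
        DO UPDATE SET cnt = frontend_car_hourly.cnt + 1;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER frontend_car_hourly_trg
        AFTER INSERT ON frontend_car
        FOR EACH ROW EXECUTE PROCEDURE frontend_car_hourly_upsert();

    -- The original aggregate then becomes a cheap scan over at most a few
    -- thousand pre-aggregated rows:
    SELECT source, hour, cnt
    FROM frontend_car_hourly
    WHERE hour > now() AT TIME ZONE 'CET' - interval '1 week'
    ORDER BY source ASC, hour DESC;
    ```

    A small primary-key index on (source, hour) is all this table needs, and the per-insert overhead is a single upsert, which should be negligible at a few thousand inserts per hour.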