Tags: sql, postgresql, postgresql-performance, generate-series, set-returning-functions

Optimize query on big table executing generate_series()


The following query takes more than 7 minutes in PostgreSQL 11.1:

SELECT 
    '2019-01-19' as date, 
    '2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date, 
    ROUND(AVG(price)) as price,
    area_id
FROM 
    table_example
GROUP BY 
    days_to_date, area_id;

table_example has around 15 million rows.
Are there any ways to optimize it? I have already added the following indexes:

CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);

This is the output from EXPLAIN (ANALYZE, BUFFERS):

GroupAggregate  (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
  Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
  Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
  ->  Sort  (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
        Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
        Sort Method: external merge  Disk: 9187584kB
        Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
        ->  Result  (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
              Buffers: local read=118167 dirtied=118167 written=117143
              ->  ProjectSet  (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
                    Buffers: local read=118167 dirtied=118167 written=117143
                    ->  Seq Scan on table_example  (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
                          Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
  Functions: 18
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms

This is what table_example looks like:

column name        data type
house_pk           integer
date_in            date
first_observed     date
last_observed      date
price              numeric
area_id            integer

There are 60 distinct area_ids.

The query is being run on a multi-core machine (24 cores) with 128 GB of memory. It is possible that the settings are not optimal, however.


Solution

  • While processing the whole table, indexes are typically useless (with the possible exception of an index-only scan if table rows are much wider than the index).
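
    For reference, if you do want to try the index-only-scan route, a covering index for this query could look like the sketch below (column names taken from your table; whether it pays off depends on how much narrower the index is than the table rows):

    -- a sketch only; worthwhile only if the index is much narrower than the table rows
    CREATE INDEX ON table_example (area_id, first_observed, last_observed) INCLUDE (price);  -- INCLUDE needs PostgreSQL 11+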

    And while processing the whole table, I don't see much room for performance optimization of the query itself. One minor thing:

    SELECT d.the_date
         , generate_series(d.the_date - last_observed
                         , d.the_date - first_observed) AS days_to_date
         , round(avg(price)) AS price
         , area_id
    FROM   table_example
         , (SELECT date '2019-01-19') AS d(the_date)
    GROUP  BY d.the_date, days_to_date, area_id;

    Assuming first_observed and last_observed are date NOT NULL and always earlier than date '2019-01-19'. Otherwise, you need to cast or do a bit more.

    This way, you have only two subtractions and then generate_series() works with integers (fastest).
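
    A quick illustration (sample values made up): subtracting two dates yields an integer number of days, so generate_series() runs in its plain integer form rather than the timestamp/interval form visible in your plan:

    SELECT date '2019-01-19' - date '2019-01-15';   -- 4  (an integer)
    SELECT generate_series(2, 4) AS days_to_date;   -- rows 2, 3, 4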

    The added mini-subquery is just for convenience, to provide the date only once. In a prepared statement or function, you can use a parameter and don't need this:

         , (SELECT date '2019-01-19') AS d(the_date)
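
    For example, a prepared statement along these lines (the statement name is just illustrative):

    PREPARE daily_prices(date) AS
    SELECT $1 AS the_date
         , generate_series($1 - last_observed, $1 - first_observed) AS days_to_date
         , round(avg(price)) AS price
         , area_id
    FROM   table_example
    GROUP  BY the_date, days_to_date, area_id;

    EXECUTE daily_prices('2019-01-19');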
    

    Other than that, if EXPLAIN (ANALYZE, BUFFERS) mentions "Disk" (as in Sort Method: external merge  Disk: 9187584kB in your plan above), then a temporarily higher setting for work_mem should help.
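
    For example (the value is only a guess; size it to the reported spill and your 128 GB of RAM, and note that in-memory sorts often need more space than the on-disk figure suggests):

    SET work_mem = '16GB';              -- session-wide; value is a guess
    -- ... run the query ...
    RESET work_mem;
    -- or, inside a transaction, scope it with: SET LOCAL work_mem = '16GB';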

    If you can't afford more RAM and the aggregate and/or sort steps still spill to disk, it might help to divide and conquer with a LATERAL join, like this:

    SELECT d.the_date, f.*, a.area_id
    FROM   area a
         , (SELECT date '2019-01-19') AS d(the_date)
         , LATERAL (
       SELECT generate_series(d.the_date - last_observed
                            , d.the_date - first_observed) AS days_to_date
            , round(avg(price)) AS price
       FROM   table_example
       WHERE  area_id = a.area_id
       GROUP  BY 1
       ) f;
    

    Assuming a table area, obviously.
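
    If there is no such area table, a derived list of the 60 distinct area_ids can stand in for it; a sketch:

    SELECT d.the_date, f.*, a.area_id
    FROM  (SELECT DISTINCT area_id FROM table_example) a   -- stand-in for the area table
         , (SELECT date '2019-01-19') AS d(the_date)
         , LATERAL (
       SELECT generate_series(d.the_date - last_observed
                            , d.the_date - first_observed) AS days_to_date
            , round(avg(price)) AS price
       FROM   table_example t
       WHERE  t.area_id = a.area_id
       GROUP  BY 1
       ) f;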