Tags: sql, postgresql, postgresql-performance, generate-series, set-returning-functions

Optimize query on big table executing generate_series()


The following query takes more than 7 minutes in PostgreSQL 11.1:

SELECT 
    '2019-01-19' as date, 
    '2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date, 
    ROUND(AVG(price)) as price,
    area_id
FROM 
    table_example
GROUP BY 
    days_to_date, area_id;

table_example has around 15 million rows.
Are there any ways to optimize it? I have already added the following indexes:

CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);

This is the output from EXPLAIN (ANALYZE, BUFFERS):

GroupAggregate  (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
  Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
  Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
  ->  Sort  (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
        Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
        Sort Method: external merge  Disk: 9187584kB
        Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
        ->  Result  (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
              Buffers: local read=118167 dirtied=118167 written=117143
              ->  ProjectSet  (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
                    Buffers: local read=118167 dirtied=118167 written=117143
                    ->  Seq Scan on table_example  (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
                          Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
  Functions: 18
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms

This is what table_example looks like:

column name        data type
house_pk           integer
date_in            date
first_observed     date
last_observed      date
price              numeric
area_id            integer

There are 60 distinct area_ids.

The query is being run on a multi-core machine (24 cores) with 128 GB of memory. It is possible that the settings are not optimal, however.


Solution

  • While processing the whole table, indexes are typically useless (with the possible exception of an index-only scan if table rows are much wider than the index).
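
    For reference, if you do want to try the index-only-scan route, a covering index for this query could look like the sketch below (column names taken from your table; whether it pays off depends on how much narrower the index is than the table rows):

    -- a sketch only; worthwhile only if the index is much narrower than the table rows
    CREATE INDEX ON table_example (area_id, first_observed, last_observed) INCLUDE (price);  -- INCLUDE needs PostgreSQL 11+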

    And while processing the whole table, I don't see much room for performance optimization of the query itself. One minor thing:

    SELECT d.the_date
         , generate_series(d.the_date - last_observed
                         , d.the_date - first_observed) AS days_to_date
         , round(avg(price)) AS price
         , area_id
    FROM   table_example
         , (SELECT date '2019-01-19') AS d(the_date)
    GROUP  BY d.the_date, days_to_date, area_id;

    Assuming first_observed and last_observed are date NOT NULL and always earlier than date '2019-01-19'. Otherwise, you need to cast or do a bit more.

    This way, you have only two subtractions and then generate_series() works with integers (fastest).
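
    A quick illustration (sample values made up): subtracting two dates yields an integer number of days, so generate_series() runs in its plain integer form rather than the timestamp/interval form visible in your plan:

    SELECT date '2019-01-19' - date '2019-01-15';   -- 4  (an integer)
    SELECT generate_series(2, 4) AS days_to_date;   -- rows 2, 3, 4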

    The added mini-subquery is just for convenience, to provide the date only once. In a prepared statement or function, you can use a parameter and don't need this:

         , (SELECT date '2019-01-19') AS d(the_date)
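
    For example, a prepared statement along these lines (the statement name is just illustrative):

    PREPARE daily_prices(date) AS
    SELECT $1 AS the_date
         , generate_series($1 - last_observed, $1 - first_observed) AS days_to_date
         , round(avg(price)) AS price
         , area_id
    FROM   table_example
    GROUP  BY the_date, days_to_date, area_id;

    EXECUTE daily_prices('2019-01-19');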
    

    Other than that, if EXPLAIN (ANALYZE, BUFFERS) mentions "Disk" (as in Sort Method: external merge  Disk: 9187584kB in your plan above), then a temporarily higher setting for work_mem should help.
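
    For example (the value is only a guess; size it to the reported spill and your 128 GB of RAM, and note that in-memory sorts often need more space than the on-disk figure suggests):

    SET work_mem = '16GB';              -- session-wide; value is a guess
    -- ... run the query ...
    RESET work_mem;
    -- or, inside a transaction, scope it with: SET LOCAL work_mem = '16GB';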

    If you can't afford more RAM and the aggregate and/or sort steps still spill to disk, it might help to divide and conquer with a LATERAL join, like this:

    SELECT d.the_date, f.*, a.area_id
    FROM   area a
         , (SELECT date '2019-01-19') AS d(the_date)
         , LATERAL (
       SELECT generate_series(d.the_date - last_observed
                            , d.the_date - first_observed) AS days_to_date
            , round(avg(price)) AS price
       FROM   table_example
       WHERE  area_id = a.area_id
       GROUP  BY 1
       ) f;
    

    Assuming a table area, obviously.
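
    If there is no such area table, a derived list of the 60 distinct area_ids can stand in for it; a sketch:

    SELECT d.the_date, f.*, a.area_id
    FROM  (SELECT DISTINCT area_id FROM table_example) a   -- stand-in for the area table
         , (SELECT date '2019-01-19') AS d(the_date)
         , LATERAL (
       SELECT generate_series(d.the_date - last_observed
                            , d.the_date - first_observed) AS days_to_date
            , round(avg(price)) AS price
       FROM   table_example t
       WHERE  t.area_id = a.area_id
       GROUP  BY 1
       ) f;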