The following query takes more than 7 minutes in PostgreSQL 11.1:
SELECT
    '2019-01-19' as date,
    '2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date,
    ROUND(AVG(price)) as price,
    area_id
FROM
    table_example
GROUP BY
    days_to_date, area_id;
table_example has around 15 million rows.
Are there any ways to optimize it? I have already added the following indexes:
CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);
This is the output from EXPLAIN (ANALYZE, BUFFERS):
GroupAggregate (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
-> Sort (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Sort Method: external merge Disk: 9187584kB
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
-> Result (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
-> ProjectSet (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
-> Seq Scan on table_example (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
Functions: 18
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms
This is what table_example looks like:
column name      data type
-----------      ---------
house_pk         integer
date_in          date
first_observed   date
last_observed    date
price            numeric
area_id          integer
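A minimal DDL sketch matching the columns above (reconstructed from the list; the real constraints and defaults may differ):

CREATE TABLE table_example (
    house_pk       integer
  , date_in        date
  , first_observed date NOT NULL  -- assumed NOT NULL
  , last_observed  date NOT NULL  -- assumed NOT NULL
  , price          numeric
  , area_id        integer
);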
There are 60 distinct area_ids.
The query is being run on a multi-core machine (24 cores) with 128 GB of memory. It is possible, however, that the settings are not optimal.
While processing the whole table, indexes are typically useless (with the possible exception of an index-only scan if table rows are much wider than the index).
And while processing the whole table, I don't see much room for performance optimization of the query itself. One minor thing:
SELECT d.the_date
     , generate_series(d.the_date - last_observed
                     , d.the_date - first_observed) AS days_to_date
     , round(avg(price)) AS price
     , area_id
FROM   table_example
     , (SELECT date '2019-01-19') AS d(the_date)
GROUP  BY d.the_date, days_to_date, area_id;
Assuming first_observed and last_observed are date NOT NULL and always < date '2019-01-19'. Else you need to cast / do more.
This way, you have only two subtractions and then generate_series() works with integers (fastest).
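To see why: subtracting one date from another yields an integer number of days, so the integer variant of generate_series() kicks in (the literal dates are just illustrative):

SELECT date '2019-01-19' - date '2019-01-15';  -- integer: 4

SELECT generate_series(date '2019-01-19' - date '2019-01-17'
                     , date '2019-01-19' - date '2019-01-15');  -- 2, 3, 4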
The added mini-subquery is just for convenience, to only provide the date once. In a prepared statement or function, you can use a parameter and don't need this:
, (SELECT date '2019-01-19') AS d(the_date)
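For example, as a prepared statement (a sketch; the statement name daily_avg is made up):

PREPARE daily_avg(date) AS
SELECT $1 AS the_date
     , generate_series($1 - last_observed
                     , $1 - first_observed) AS days_to_date
     , round(avg(price)) AS price
     , area_id
FROM   table_example
GROUP  BY days_to_date, area_id;

EXECUTE daily_avg(date '2019-01-19');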
Other than that, if EXPLAIN (ANALYZE, BUFFERS) mentions "Disk" (example: Sort Method: external merge Disk: 3240kB), then a (temporarily) higher setting for work_mem should help. Your plan shows a spill of roughly 9 GB (Sort Method: external merge Disk: 9187584kB), so the current setting is far too small for this sort.
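For a one-off run, a session-level setting is enough (a sketch; the value is an assumption, sized to cover the ~9 GB spill above):

SET work_mem = '10GB';  -- session only; assumed value, large enough for the spill above
-- ... run the query ...
RESET work_mem;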
If you can't afford more RAM, and the aggregate and/or sort steps still spill to disk, it might help to divide and conquer with a query like the following, using a LATERAL join:
SELECT d.the_date, f.*, a.area_id
FROM   area a
     , (SELECT date '2019-01-19') AS d(the_date)
     , LATERAL (
   SELECT generate_series(d.the_date - last_observed
                        , d.the_date - first_observed) AS days_to_date
        , round(avg(price)) AS price
   FROM   table_example
   WHERE  area_id = a.area_id
   GROUP  BY 1
   ) f;
Assuming a table area, obviously.
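If no such table exists yet, a minimal stand-in can be derived from table_example (a sketch; with only 60 distinct values it stays tiny):

CREATE TABLE area AS
SELECT DISTINCT area_id
FROM   table_example;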