So I have a query that, in the worst-case scenario, takes 10-12 minutes to run. If I remove the time check from the WHERE clause it goes down to 20-30 seconds, so I was wondering how to optimize it.
I tried adding an index on the timestamp cast to time, but it didn't really help. The register_status (rs) table has over 70 million rows, register_date around 280k, and charge_point (cp) fewer than 1k.
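For context, the index I tried was something along these lines (the name is arbitrary; this also assumes rd.date is timestamp without time zone, since the cast from timestamptz to time isn't immutable and couldn't be indexed directly):
-- expression index on the time-of-day part of the timestamp
CREATE INDEX register_date_time_idx ON register_date ((date::time));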
The idea of the query is to get all the results per charge point, grouped by status, over a range of dates and within a time-of-day range. The worst case is the first date in the database as the start and the user selecting the whole day as the time range. The query is the following:
explain analyze SELECT
COUNT(rs.status) filter (where rs.status = 'Occ') as total_occ,
COUNT(rs.status) filter (where rs.status = 'Part') as total_part,
COUNT(rs.status) filter (where rs.status = 'OOS') as total_oos,
COUNT(rs.status) filter (where rs.status = 'OOC') as total_ooc,
cp.id as charge_point_id,
cp.address,
cp.type as charge_point_type,
cp.latitude,
cp.longitude
FROM register_date rd
inner join register_status rs on rs.fk_register_date = rd.id
inner join charge_point cp on cp.id = rs.fk_charge_point
WHERE
rd.date::date >= '2016-11-01' and rd.date::date <= '2019-08-01'
AND
rd.date::time >= time '00:00' AND rd.date::time <= time '23:59'
group by cp.id
And the EXPLAIN ANALYZE result is the following; I can see a lot of disk usage in the sort...
"Finalize GroupAggregate (cost=34412.78..34536.10 rows=780 width=124) (actual time=689440.380..699740.172 rows=813 loops=1)"
" Group Key: cp.id"
" -> Gather Merge (cost=34412.78..34519.27 rows=722 width=124) (actual time=689421.445..699736.996 rows=1579 loops=1)"
" Workers Planned: 1"
" Workers Launched: 1"
" -> Partial GroupAggregate (cost=33412.77..33438.04 rows=722 width=124) (actual time=649515.576..659674.461 rows=790 loops=2)"
" Group Key: cp.id"
" -> Sort (cost=33412.77..33414.57 rows=722 width=96) (actual time=649496.720..654001.697 rows=24509314 loops=2)"
" Sort Key: cp.id"
" Sort Method: external merge Disk: 2649104kB"
" Worker 0: Sort Method: external merge Disk: 2652840kB"
" -> Nested Loop (cost=0.56..33378.49 rows=722 width=96) (actual time=1.343..504948.423 rows=24509314 loops=2)"
" -> Parallel Seq Scan on register_date rd (cost=0.00..6443.69 rows=4 width=4) (actual time=0.021..294.724 rows=139760 loops=2)"
" Filter: (((date)::date >= '2016-11-01'::date) AND ((date)::date <= '2019-08-01'::date) AND ((date)::time without time zone >= '00:00:00'::time without time zone) AND ((date)::time without time zone <= '23:59:00'::time without time zone))"
" -> Nested Loop (cost=0.56..6725.90 rows=780 width=100) (actual time=0.077..3.574 rows=175 loops=279519)"
" -> Seq Scan on charge_point cp (cost=0.00..21.80 rows=780 width=92) (actual time=0.002..0.077 rows=813 loops=279519)"
" -> Index Only Scan using register_status_fk_charge_point_fk_register_date_status_key on register_status rs (cost=0.56..8.58 rows=1 width=12) (actual time=0.004..0.004 rows=0 loops=227248947)"
" Index Cond: ((fk_charge_point = cp.id) AND (fk_register_date = rd.id))"
" Heap Fetches: 49018627"
"Planning Time: 0.506 ms"
"Execution Time: 700065.010 ms"
This might be faster with a lateral join:
SELECT cp.*, agg.*
FROM charge_point cp CROSS JOIN LATERAL
     (SELECT COUNT(*) filter (where rs.status = 'Occ') as total_occ,
             COUNT(*) filter (where rs.status = 'Part') as total_part,
             COUNT(*) filter (where rs.status = 'OOS') as total_oos,
             COUNT(*) filter (where rs.status = 'OOC') as total_ooc
      FROM register_date rd JOIN
           register_status rs
           ON rs.fk_register_date = rd.id
      WHERE cp.id = rs.fk_charge_point AND
            rd.date >= date '2016-11-01' AND
            rd.date < date '2019-08-01' + interval '1 day'
     ) agg;
Given the schema, an index on register_status(fk_charge_point, fk_register_date, status) (the plan shows one already exists as a unique key) and one on register_date(id, date) are recommended.
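A minimal sketch of that DDL, with placeholder index names (skip the register_status one if the existing unique key already covers those columns):
CREATE INDEX IF NOT EXISTS register_status_cp_rd_status_idx
    ON register_status (fk_charge_point, fk_register_date, status);
CREATE INDEX IF NOT EXISTS register_date_id_date_idx
    ON register_date (id, date);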
Note that I changed the date comparisons so they are more index-friendly (no casts on the column). I see no reason to filter by time when the selected range covers the whole day, so I removed those conditions.
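If a narrower time-of-day window does need to be supported, the predicate can simply go back inside the lateral subquery; the 08:00/18:00 values below are only illustrative placeholders for whatever the user selects:
WHERE cp.id = rs.fk_charge_point AND
      rd.date >= date '2016-11-01' AND
      rd.date < date '2019-08-01' + interval '1 day' AND
      rd.date::time >= time '08:00' AND  -- hypothetical user-selected window
      rd.date::time < time '18:00'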