Search code examples
postgresqlquery-optimizationtimescaledb

PostgreSQL query performance critically drops with specific query parameters


I have an issue with one of my databases running PostgreSQL 14 on an Ubuntu VM. This particular database contains a timeseries schema which hosts a meteo hypertable (timescaledb extension ON). The database's aim is to collect data from meteo stations in Switzerland, such as temperature, humidity, solar radiation, etc.

Global performances are great most of the time, however I can't figure out why a few specific cases cause significant drops in performance. At first, I thought it was due to the amount of data in the table, but that doesn't seem to be the case (at least, there's no clear evidence of such issue). I am no postgreSQL specialist, therefore I seek your guidance here.

Here's a sample query that returns data in a few milliseconds (and a row count for context):

select min(time) from timeseries.meteo where metadata_id='1089';
         min
---------------------
 2019-08-12 09:50:00
(1 row)

select count(*) from timeseries.meteo where metadata_id='1089';
 count
--------
 230765
(1 row)

And here's the troublesome query I identified after many hours of looking for signs of performance issue. This particular query takes about 90'000ms to execute (so ~1000x the time of the previous one), as well as the row count for context:

select min(time) from timeseries.meteo where metadata_id='1019';
         min
---------------------
 2023-04-04 07:50:00
(1 row)

select count(*) from timeseries.meteo where metadata_id='1019';
 count
-------
 42668
(1 row)

I uploaded the result of an EXPLAIN (ANALYZE, BUFFERS) on those queries:

My question is the following: how can such a "basic" query with a small amount of data take so much time, and what may I try to perform to improve performances on that specific case? Is my database broken or something?

I appreciate your help on solving this (mysterious) case.

PS: I checked the RAM and CPU usage and those do not seem to be much of an issue as the server is basically underutilized.


Solution

  • The real query found in depesz is select min(time) from timeseries.meteo where metadata_id='1019';

    This is just a non-obvious presentation of the classic ORDER BY...LIMIT problem. The problem id has a deficiency of rows with low value of time, so the plan needs to walk over a very long sequence of rows in time order before finding one with a qualifying metadata_id and then getting to stop early.

    You can solve this with a multicolumn index on (metadata_id, time). Then it can jump right to the needed data.