Tags: sql, postgresql, window-functions, postgresql-performance, conditional-aggregation

How to improve SQL query performance containing partially common subqueries


I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:

event_count     sys_timestamp

100             167877672772
110             167877672769
121             167877672987
111             167877673877
...             ...

Both fields are defined as numeric.

With the help of answers from Stack Overflow I was able to create a query that counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:

SELECT t1.*,
    (SELECT COUNT(*) FROM tableA t2
     WHERE t2.sys_timestamp > t1.sys_timestamp AND
           t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
           t2.event_count >= t1.event_count + 10) AS positive,
    (SELECT COUNT(*) FROM tableA t2
     WHERE t2.sys_timestamp > t1.sys_timestamp AND
           t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
           t2.event_count <= t1.event_count - 10) AS negative
FROM tableA AS t1;

The query works as expected: in this particular example it returns, for each row, a count of the positive and negative excesses (threshold +/- 10) within the defined time window (+1000 [milliseconds]).

However, I will have to run such queries on tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a very long time to complete. Furthermore, while the time frame is always the same within a given query (although the window size can change from query to query), in some instances I will have to add maybe 10 further conditions similar to the positive / negative excesses to the same query.

Thus, I am looking for ways to improve the above query, primarily to achieve better performance given the size of the envisaged dataset, and secondarily with additional conditions in mind.

My concrete questions:

  1. How can I reuse the common portion of the subquery, so that it is not executed twice (or several times)? In other words, how can I reuse the following within the query?

     (SELECT COUNT(*) FROM tableA t2 
      WHERE t2.sys_timestamp >  t1.sys_timestamp
      AND   t2.sys_timestamp <= t1.sys_timestamp + 1000)
    
  2. Is there a performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and then using one of PostgreSQL's window functions? (Unfortunately I don't have much experience with these at all.)

  3. Are there clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase performance for large datasets?

  4. For these types of queries, would it perhaps even be faster to run them outside the database, using something like Java, Scala, Python, etc.?


Solution

  • How can I reuse the common portion of the subquery ...?

    Use conditional aggregates in a single LATERAL subquery:

    SELECT t1.*, t2.positive, t2.negative
    FROM   tableA t1
    CROSS  JOIN LATERAL (
       SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
            , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
       FROM   tableA t2 
       WHERE  t2.sys_timestamp >  t1.sys_timestamp
       AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
       ) t2;
    

    It can be a CROSS JOIN because the subquery always returns a row.

    Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame.

    event_count should probably be integer or bigint.

    sys_timestamp should probably be timestamp or timestamptz.
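
    For illustration, a minimal sketch of such a type change, assuming sys_timestamp stores Unix epoch milliseconds (the "+ 1000 [milliseconds]" in the question suggests as much); adjust the conversion if the unit is different:

    ALTER TABLE tableA
      ALTER COLUMN event_count   TYPE bigint,              -- whole numbers only
      ALTER COLUMN sys_timestamp TYPE timestamptz
        USING to_timestamp(sys_timestamp / 1000.0);        -- epoch ms -> timestamptz

    -- The window predicate then becomes an interval comparison, e.g.:
    -- t2.sys_timestamp <= t1.sys_timestamp + interval '1 second'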

    An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
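
    A sketch of the two options (index names are illustrative):

    CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);

    -- or, to also cover event_count and enable index-only scans:
    CREATE INDEX tablea_ts_count_idx ON tableA (sys_timestamp, event_count);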

    Depending on exact data distribution (most importantly how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc.
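
    To illustrate the idea only, here is a minimal sketch of such a server-side approach, assuming the stylized tableA(event_count, sys_timestamp) above and a fixed window of 1000; the function name event_excess_counts is made up. It walks the table once in timestamp order and keeps only the rows of the current look-ahead window in a local array, so the counting happens in memory instead of via repeated index scans:

    CREATE OR REPLACE FUNCTION event_excess_counts()   -- hypothetical name
      RETURNS TABLE (sys_timestamp numeric, event_count numeric,
                     positive bigint, negative bigint)
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       win tableA[] := '{}';   -- rows still inside some open look-ahead window
       r   tableA;             -- row being read from the table
       w   tableA;             -- row whose window is complete and gets finalized
    BEGIN
       FOR r IN
          SELECT * FROM tableA t ORDER BY t.sys_timestamp
       LOOP
          -- rows whose window ends before the new row can be finalized now
          WHILE cardinality(win) > 0
                AND (win[1]).sys_timestamp + 1000 < r.sys_timestamp
          LOOP
             w   := win[1];
             win := win[2:];
             sys_timestamp := w.sys_timestamp;
             event_count   := w.event_count;
             SELECT count(*) FILTER (WHERE x.event_count >= w.event_count + 10)
                  , count(*) FILTER (WHERE x.event_count <= w.event_count - 10)
             INTO   positive, negative
             FROM   unnest(win) x
             WHERE  x.sys_timestamp >  w.sys_timestamp
             AND    x.sys_timestamp <= w.sys_timestamp + 1000;
             RETURN NEXT;
          END LOOP;
          win := win || r;
       END LOOP;

       -- flush rows whose windows reach past the end of the table
       WHILE cardinality(win) > 0 LOOP
          w   := win[1];
          win := win[2:];
          sys_timestamp := w.sys_timestamp;
          event_count   := w.event_count;
          SELECT count(*) FILTER (WHERE x.event_count >= w.event_count + 10)
               , count(*) FILTER (WHERE x.event_count <= w.event_count - 10)
          INTO   positive, negative
          FROM   unnest(win) x
          WHERE  x.sys_timestamp >  w.sys_timestamp
          AND    x.sys_timestamp <= w.sys_timestamp + 1000;
          RETURN NEXT;
       END LOOP;
    END
    $func$;

    Call it with: SELECT * FROM event_excess_counts();  It returns one row per source row, ordered by sys_timestamp. Whether this beats the set-based LATERAL query depends on the data, so benchmark both on a realistic sample before committing to either.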