Search code examples
sqlpostgresqlgrafana

SQL Query to average multiple data and compare with one data point to check offset


I am new to SQL so please bear with me. In one of my tables I have a series of inverter voltage values, written every minute. I am displaying the data on GRAFANA and would like to find anomalous points, if any. A really simple solution for me now is to get the average of all voltages at a specific time, and see whether 1 value is off by "a lot".

Some background: Using postresql, and the columns "timestamp", "value", "name" name is the identifier and right now the inverters are in the format 'INVX_INVERTER_SG_Branch VoltageY' where Y = {1,3,5,7,9,11,13,15,17,19,21,23} and X = {01-18}.

The first temporary query works as expected, it produces a table of timestamps and average values of all combinations at that time. In the second SELECT query I am getting bullshit values. There is no other data with name containing 'SG_Branch Voltage' on the DB Table. Our friendly chatgpt is spitting garbage.

Ideally the result will show me timestamps and values for anomalous points. I shall then plot those points on the graph. This will simply be an indicator of the point being not as expected.

WITH average AS (
    SELECT
        "timestamp",
        AVG(value) AS avg_value
    FROM
        st_values
    WHERE
        "timestamp" BETWEEN NOW() - INTERVAL '1 day' AND NOW() AND
        "name" LIKE 'INV%_INVERTER_SG_Branch Voltage%'
    GROUP BY "timestamp"
)
SELECT
    v."timestamp",
    v.value
FROM
    st_values v
JOIN
    average av ON av."timestamp" = v."timestamp"
WHERE
    v."timestamp" BETWEEN NOW() - INTERVAL '1 day' AND NOW() AND
    v."name" LIKE 'INV%_INVERTER_SG_Branch Voltage%' AND
    ABS(v.value - av.avg_value) / av.avg_value > 0.3 
ORDER BY
    v."timestamp";

The code above is (semi?) not working. I am not sure how to debug this. This yields points that are truly anomalous (i.e. lie on one of my lines) but some that are (fake?) not lying on any lines.

Edit : A screenshot of the results. As you can see some red dots are on the purple (anomalous) line, however some are above the line where no values should be

Sample data


Solution

  • There's nothing in your query that could spawn bogus values. It might be worth it to revisit your WHERE conditions:

    1. If you only want these sources:

      X = {01-18} and Y = {1,3,5,7,9,11,13,15,17,19,21,23}.
      

      You can express specifically that, instead of resorting to the catch-all % in the regular expression:

      "name" SIMILAR TO 
      'INV(0[1-9]|1[0-8])\_INVERTER\_SG\_Branch Voltage(1|3|5|7|9|11|13|15|17|19|21|23)' 
      ESCAPE '\'
      
      "name" ~ 
      'INV(0[1-9]|1[0-8])_INVERTER_SG_Branch Voltage(1|3|5|7|9|11|13|15|17|19|21|23)' 
      

      Which works just as well for both SQL standard regular expression (SIMILAR TO) and POSIX syntax (~), except the escaping part: _ underscore is a single-character wildcard in LIKE and SIMILAR TO. \ backslash is their default escape character, so the ESCAPE '\' is just for verbosity.

      I wouldn't suspect you overlooked a variant of the text that your wildcard caught against your intention, but this could technically be the case.

    2. You could also make sure the unexpected values you're getting aren't there because you're feeding the data into the table having a different TimeZone setting than when you reading it here. In such case this

      "timestamp" BETWEEN NOW() - INTERVAL '1 day' AND NOW()`
      

      could be getting you less than the last 24h of data, if the source is behind, or a 24h slice between 4h and 28h ago, if the source is 4h ahead. If your "timestamp" column is actually timestamptz, this shouldn't be a problem.

    3. You didn't share what "the lines" are and how exactly you plot them, but that could also be the source of error.

    Unfortunately, without a reproducible example that's mostly just guesswork.


    Window functions can get you the same result, significantly faster and with less code: DB<>Fiddle demo

    SELECT "timestamp", value FROM (
        SELECT "timestamp",
               value,
               ABS(value - avg(value)over w1) / avg(value)over w1 as deviation
        FROM st_values
        WHERE "timestamp" BETWEEN NOW() - INTERVAL '1 day' AND NOW()
          AND "name" LIKE 'INV%_INVERTER_SG_Branch Voltage%'
        WINDOW w1 AS (PARTITION BY "timestamp") ) AS subq
    WHERE .3 < deviation
    ORDER BY "timestamp"
    

    Since you're dealing with time series data, you can remove the default 10% empty space in your index to speed things up. You can also squeeze everything of interest into the index and get index-only scans:

    create index tnv_idx on st_values ("timestamp","name")
      include("value")
      with(fillfactor=100);
    

    If you have queries that still need to run sequential scans, those could benefit from a cluster:

    cluster st_values using tnv_idx;