Tags: google-bigquery, calculated-columns, median

Do calculations in BigQuery within a row, plus a median - is this possible?


My problem is that I get a raw set of sensor data that needs some processing before I can use it. Loading the data to the client and doing the processing there is pretty slow, so I am looking for a possibility to offload this logic to BigQuery.

Imagine I have some constants for a set of sensors. They can change over time, but I know their current values when I want to run the query:

A: 1, B: 2, C: 3, D: 2, E: 1, F: 2

Sensors are connected, and I know which sensors are connected to each other. This matters for the calculations below.

A: BC
D: EF

This is a table with measurements per timestamp per sensor. Imagine thousands of rows.

TS    A  |  B  |  C  |  D  |  E  |  F  
01    10 |  20 |  20 |  10 |  15 | 10
02    11 |  10 |  20 |  20 |  10 | 10
03    12 |  20 |  10 |  10 |  12 | 11
04    13 |  10 |  10 |  20 |  15 | 15
05    11 |  20 |  10 |  15 |  14 | 14
06    10 |  20 |  10 |  10 |  15 | 12

I want to query ts 01 to ts 06 (in reality it can again be thousands of rows). I don't want it to return this raw data, but instead have it do some calculations:

First, for each row, I need to subtract the constants, so row 01 would look like:

01    9 |  18 |  17 |  8 |  14 | 8

Then, B and C need to have A subtracted, and E and F need to have D subtracted:

01    9 |   9 |   8 |  8 |   6 | 0
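
These two subtraction steps can be sketched client-side in Python (the helper names and data structures here are my own, not from the question) to confirm the numbers for row 01:

```python
# Per-sensor constants and the connection groups (A is subtracted from B and C;
# D is subtracted from E and F) -- structures are illustrative, not from the question.
constants = {"A": 1, "B": 2, "C": 3, "D": 2, "E": 1, "F": 2}
groups = {"A": ["B", "C"], "D": ["E", "F"]}

def normalize(row):
    """Step 1: subtract each sensor's constant.
    Step 2: subtract the lead sensor's adjusted value from its connected sensors."""
    adjusted = {s: row[s] - constants[s] for s in row}
    for lead, followers in groups.items():
        for s in followers:
            adjusted[s] -= adjusted[lead]
    return adjusted

row_01 = {"A": 10, "B": 20, "C": 20, "D": 10, "E": 15, "F": 10}
print(normalize(row_01))  # {'A': 9, 'B': 9, 'C': 8, 'D': 8, 'E': 6, 'F': 0}
```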

Last step: once I have all rows, I want to return rows where each sensor has the median value of the preceding X rows for that sensor. So given

    TS    A  |  B  | 
    01    10 |  1  | 
    02    11 |  2  |  
    03    12 |  2  |  
    04    13 |  1  |  
    05    11 |  2  | 
    06    10 |  3  |  
    07    10 |  4  | 
    08    11 |  2  |  
    09    12 |  2  |  
    10    13 |  10 |  
    11    11 |  20 | 
    12    10 |  20 |  

this returns (for X = 4)

    TS    A  |  B  | 
   //first 3 needed for median for 4th value
    04    11.5 |  etc  |   //median 10, 11, 12, 13
    05    11.5 |  etc  |   //median 11, 12, 13, 11
    06    11.5 |  etc  |   //median 12, 13, 11, 10
    07    etc |  etc | 
    08    etc |  etc |  
    09    etc |  etc |  
    10    etc |  etc |  
    11    etc |  etc | 
    12    etc |  etc |  
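
The rolling median described above can be sketched in Python (my own helper, following the question's convention where the window of X rows includes the current row, as the worked medians show) to confirm the 11.5 values for sensor A:

```python
from statistics import median

def rolling_median(values, window):
    """Median of the last `window` values up to and including each position;
    None until enough history has accumulated."""
    return [
        median(values[i - window + 1:i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]

a = [10, 11, 12, 13, 11, 10, 10, 11, 12, 13, 11, 10]  # sensor A, ts 01..12
print(rolling_median(a, 4)[3:6])  # [11.5, 11.5, 11.5] -> ts 04, 05, 06
```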

Getting the data to my server and doing the calculation there is very slow. I am really wondering whether BigQuery can handle these amounts of data, so that I can quickly get a calculated result set with the settings of my choice!

I can do this in Node.js... but in BigQuery SQL I am lost.


Solution

  • Below is for BigQuery Standard SQL

If you were looking for AVG values, this would be as "simple" as below:

    #standardSQL
    WITH constants AS (
      SELECT 1 val_a, 2 val_b, 3 val_c, 2 val_d, 1 val_e, 2 val_f
    ), temp AS (
      SELECT ts,
        a - val_a AS a, 
        b - val_b - a + val_a AS b,
        c - val_c - a + val_a AS c,
        d - val_d AS d,
        e - val_e - d + val_d  AS e,
        f - val_f - d + val_d AS f
      FROM `project.dataset.measurements`, constants
    )
    SELECT ts, 
      AVG(a) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) a,  
      AVG(b) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) b,  
      AVG(c) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) c,  
      AVG(d) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) d,  
      AVG(e) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) e,  
      AVG(f) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) f  
    FROM temp
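
Note that the `ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING` frame covers the four rows before the current one, excluding the current row itself, and is empty (hence NULL) on the first row. A hedged Python sketch of that frame, with my own helper name:

```python
def windowed_avg(values, preceding=4):
    """AVG(x) OVER (ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING):
    average of up to `preceding` rows strictly before each row; None (NULL)
    when the frame is empty."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding):i]
        out.append(sum(frame) / len(frame) if frame else None)
    return out

# adjusted sensor-A values from the sample data, ts 01..06
print(windowed_avg([9, 10, 11, 12, 10, 9]))
# [None, 9.0, 9.5, 10.0, 10.5, 10.75]
```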
    

For MEDIAN you need to add a little extra - as in the example below:

    #standardSQL
    WITH constants AS (
      SELECT 1 val_a, 2 val_b, 3 val_c, 2 val_d, 1 val_e, 2 val_f
    ), temp AS (
      SELECT ts,
        a - val_a AS a, 
        b - val_b - a + val_a AS b,
        c - val_c - a + val_a AS c,
        d - val_d AS d,
        e - val_e - d + val_d  AS e,
        f - val_f - d + val_d AS f
      FROM `project.dataset.measurements`, constants
    )
    SELECT ts,
      (SELECT PERCENTILE_CONT(a, 0.5) OVER() FROM UNNEST(a) a LIMIT 1) a,
      (SELECT PERCENTILE_CONT(b, 0.5) OVER() FROM UNNEST(b) b LIMIT 1) b,
      (SELECT PERCENTILE_CONT(c, 0.5) OVER() FROM UNNEST(c) c LIMIT 1) c,
      (SELECT PERCENTILE_CONT(d, 0.5) OVER() FROM UNNEST(d) d LIMIT 1) d,
      (SELECT PERCENTILE_CONT(e, 0.5) OVER() FROM UNNEST(e) e LIMIT 1) e,
      (SELECT PERCENTILE_CONT(f, 0.5) OVER() FROM UNNEST(f) f LIMIT 1) f
    FROM (
      SELECT ts, 
        ARRAY_AGG(a) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) a,  
        ARRAY_AGG(b) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) b,  
        ARRAY_AGG(c) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) c,  
        ARRAY_AGG(d) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) d,  
        ARRAY_AGG(e) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) e,  
        ARRAY_AGG(f) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) f  
      FROM temp
    )
    

You can test and play with the above using the sample data from your question, as in the example below:

    #standardSQL
    WITH `project.dataset.measurements` AS (
      SELECT 01 ts, 10 a, 20 b, 20 c, 10 d, 15 e, 10 f UNION ALL
      SELECT 02, 11, 10, 20, 20, 10, 10 UNION ALL
      SELECT 03, 12, 20, 10, 10, 12, 11 UNION ALL
      SELECT 04, 13, 10, 10, 20, 15, 15 UNION ALL
      SELECT 05, 11, 20, 10, 15, 14, 14 UNION ALL
      SELECT 06, 10, 20, 10, 10, 15, 12 
    ), constants AS (
      SELECT 1 val_a, 2 val_b, 3 val_c, 2 val_d, 1 val_e, 2 val_f
    ), temp AS (
      SELECT ts,
        a - val_a AS a, 
        b - val_b - a + val_a AS b,
        c - val_c - a + val_a AS c,
        d - val_d AS d,
        e - val_e - d + val_d  AS e,
        f - val_f - d + val_d AS f
      FROM `project.dataset.measurements`, constants
    )
    SELECT ts,
      (SELECT PERCENTILE_CONT(a, 0.5) OVER() FROM UNNEST(a) a LIMIT 1) a,
      (SELECT PERCENTILE_CONT(b, 0.5) OVER() FROM UNNEST(b) b LIMIT 1) b,
      (SELECT PERCENTILE_CONT(c, 0.5) OVER() FROM UNNEST(c) c LIMIT 1) c,
      (SELECT PERCENTILE_CONT(d, 0.5) OVER() FROM UNNEST(d) d LIMIT 1) d,
      (SELECT PERCENTILE_CONT(e, 0.5) OVER() FROM UNNEST(e) e LIMIT 1) e,
      (SELECT PERCENTILE_CONT(f, 0.5) OVER() FROM UNNEST(f) f LIMIT 1) f
    FROM (
      SELECT ts, 
        ARRAY_AGG(a) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) a,  
        ARRAY_AGG(b) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) b,  
        ARRAY_AGG(c) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) c,  
        ARRAY_AGG(d) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) d,  
        ARRAY_AGG(e) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) e,  
        ARRAY_AGG(f) OVER(ORDER BY ts ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) f  
      FROM temp
    )
    -- ORDER BY ts   
    

with the result:

    Row ts  a       b       c       d       e       f    
    1   1   null    null    null    null    null    null     
    2   2   9.0     9.0     8.0     8.0     6.0     0.0  
    3   3   9.5     3.5     7.5     13.0    -1.5    -5.0     
    4   4   10.0    7.0     7.0     8.0     3.0     0.0  
    5   5   10.5    2.5     1.5     13.0    -0.5    -2.5     
    6   6   10.5    2.5     -3.5    15.5    -2.0    -3.0
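
To sanity-check this output, here is a hedged end-to-end Python replication of the same pipeline (all names are mine): subtract the constants, subtract the lead sensor within each group, then take the median over the four preceding rows, exclusive of the current row, as the SQL frame does.

```python
from statistics import median

constants = {"a": 1, "b": 2, "c": 3, "d": 2, "e": 1, "f": 2}
groups = {"a": ("b", "c"), "d": ("e", "f")}
cols = ["a", "b", "c", "d", "e", "f"]
rows = [  # ts, a, b, c, d, e, f -- the sample measurements
    (1, 10, 20, 20, 10, 15, 10),
    (2, 11, 10, 20, 20, 10, 10),
    (3, 12, 20, 10, 10, 12, 11),
    (4, 13, 10, 10, 20, 15, 15),
    (5, 11, 20, 10, 15, 14, 14),
    (6, 10, 20, 10, 10, 15, 12),
]

# temp: subtract constants, then subtract the lead sensor's adjusted value
temp = []
for row in rows:
    vals = dict(zip(["ts"] + cols, row))
    adj = {s: vals[s] - constants[s] for s in cols}
    for lead, followers in groups.items():
        for s in followers:
            adj[s] -= adj[lead]
    temp.append(adj)

# median over ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING (current row excluded)
def med(i, col):
    frame = [t[col] for t in temp[max(0, i - 4):i]]
    return median(frame) if frame else None

print([med(2, c) for c in cols])  # ts 3 row: [9.5, 3.5, 7.5, 13.0, -1.5, -5.0]
```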