Tags: mysql, arrays, json, query-optimization, snowflake-cloud-data-platform

Aggregating row values in MySQL or Snowflake


I would like to calculate the std dev, min, and max of the mer_data array into 3 other fields called std_dev, min_mer and max_mer, grouped by mac and timestamp. This needs to be done without flattening the data, as each mer_data row consists of 4000 float values, and multiplying that by 700k rows gives a very high-dimensional table. The mer_data field is currently stored as varchar(30000); maybe JSON format would help, I'm not sure. Input:

[screenshot of the input table]

Output:

[screenshot of the expected output table]

This can be done in Snowflake or MySQL. Also, the query needs to be optimized so that it does not take much computation time.
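
Since the screenshots are not reproduced here, a rough sketch of the layout described above (the table name and column types are my assumptions, inferred from the question text):

    -- assumed input table; "mer_readings" is a placeholder name
    CREATE TABLE mer_readings (
        mac       VARCHAR,
        timestamp TIMESTAMP,
        mer_data  VARCHAR(30000)  -- ~4000 comma-separated float values per row
    );
    -- desired output: one row per (mac, timestamp) with std_dev, min_mer and max_mer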


Solution

  • While you don't want to split the data up, you will need to if you want to do it in pure SQL. Snowflake has no problems with such aggregations.

    WITH fake_data(mac, mer_data) AS (
        SELECT * FROM VALUES
            ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43'),
            ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43')
    )
    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER
    FROM fake_data f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1
    ORDER BY 1;
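
    For context, SPLIT_TO_TABLE is a table function that splits the string and returns one row per token, with SEQ, INDEX, and VALUE columns; the VALUE column (a string) is what gets cast to float above. A minimal illustration:

    SELECT d.index, d.value
    FROM TABLE(split_to_table('43,44.25,44.5', ',')) d;
    -- three rows: INDEX 1..3 with VALUE '43', '44.25', '44.5'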
    

    I would, however, discourage grouping on strings, so I would break it apart like so:

    WITH fake_data(mac, mer_data, timestamp) AS (
        SELECT * FROM VALUES
            ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43', '01-01-22'),
            ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43', '02-01-22')
    ), boost_data AS (
        SELECT seq8() as seq, *
        FROM fake_data
    ), math_step AS (
      SELECT f.seq,
          avg(d.value::float) as avg_dev,
          stddev(d.value::float) as std_dev,
          MIN(d.value::float) as MIN_MER,
          Max(d.value::float) as Max_MER
      FROM boost_data f, table(split_to_table(f.mer_data,',')) d
      GROUP BY 1
    )
    SELECT b.mac,
        m.avg_dev,
        m.std_dev,
        m.MIN_MER,
        m.Max_MER, 
        b.timestamp
    FROM boost_data b
    JOIN math_step m
        ON b.seq = m.seq
    ORDER BY 1;
    
    MAC  AVG_DEV       STD_DEV       MIN_MER  MAX_MER  TIMESTAMP
    abc  43.5625       0.7529703087  42.75    44.5     01-01-22
    def  34.611111111  3.226141056   32.75    43       02-01-22

    Performance testing:

    So, using this SQL to make 70K rows of 4000 values each:

    create table fake_data_tab AS
    WITH cte_a AS (
       SELECT SEQ8() as s
       FROM TABLE(GENERATOR(ROWCOUNT =>70000))
    ), cte_b AS (
       SELECT a.s, uniform(20::float, 50::float, random()) as v
       FROM TABLE(GENERATOR(ROWCOUNT =>4000))
       CROSS JOIN cte_a a
    )
    SELECT s::text as mac
        ,LISTAGG(v,',') AS mer_data
        ,dateadd(day,s,'2020-01-01')::date as timestamp
    FROM cte_b
    GROUP BY 1,3;
    

    takes 79 seconds on an X-Small warehouse.

    Now, with that in place, we can test the two solutions:

    The second set of code (group by numbers, with a join):

    WITH boost_data AS (
        SELECT seq8() as seq, *
        FROM fake_data_tab
    ), math_step AS (
      SELECT f.seq,
          avg(d.value::float) as avg_dev,
          stddev(d.value::float) as std_dev,
          MIN(d.value::float) as MIN_MER,
          Max(d.value::float) as Max_MER
      FROM boost_data f, table(split_to_table(f.mer_data,',')) d
      GROUP BY 1
    )
    SELECT b.mac,
        m.avg_dev,
        m.std_dev,
        m.MIN_MER,
        m.Max_MER, 
        b.timestamp
    FROM boost_data b
    JOIN math_step m
        ON b.seq = m.seq
    ORDER BY 1;
    

    takes 1m47s

    The original group-by on strings/dates:

    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER,
        f.timestamp
    FROM fake_data_tab f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1,6
    ORDER BY 1;
    

    takes 1m46s

    Hmm, so leaving the "mac" as a number made the code very fast (~3s); dealing with strings either way increased the data processed from ~150MB for numbers to ~1.5GB for strings.
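
    For completeness, a sketch of the "mac left as a number" variant referred to above: rebuild the test table without the ::text cast and group on the numeric column (the table name fake_data_tab_num is just a placeholder). This appears to be the variant behind the ~3s / ~150MB figures:

    create or replace table fake_data_tab_num AS
    WITH cte_a AS (
       SELECT SEQ8() as s
       FROM TABLE(GENERATOR(ROWCOUNT =>70000))
    ), cte_b AS (
       SELECT a.s, uniform(20::float, 50::float, random()) as v
       FROM TABLE(GENERATOR(ROWCOUNT =>4000))
       CROSS JOIN cte_a a
    )
    SELECT s as mac  -- kept numeric, not cast to text
        ,LISTAGG(v,',') AS mer_data
        ,dateadd(day,s,'2020-01-01')::date as timestamp
    FROM cte_b
    GROUP BY 1,3;

    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER,
        f.timestamp
    FROM fake_data_tab_num f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1,6
    ORDER BY 1;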