Tags: mysql, arrays, json, query-optimization, snowflake-cloud-data-platform

Aggregating row values in MySQL or Snowflake


I would like to calculate the std dev, min, and max of the mer_data array into 3 other fields called std_dev, min_mer and max_mer, grouped by mac and timestamp. This needs to be done without flattening the data, as each mer_data row consists of 4000 float values, and multiplying that by 700k rows gives a very high-dimensional table. The mer_data field is currently stored as varchar(30000); maybe JSON format would help, I'm not sure. Input:

[screenshot of the input table]

Output:

[screenshot of the expected output table]

This can be done in Snowflake or MySQL. Also, the query needs to be optimized so that it does not take much computation time.
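
Since the screenshots are not reproduced here, a rough sketch of the layout described above (the table name and column types are my assumptions, inferred from the question text):

    -- assumed input table; "mer_readings" is a placeholder name
    CREATE TABLE mer_readings (
        mac       VARCHAR,
        timestamp TIMESTAMP,
        mer_data  VARCHAR(30000)  -- ~4000 comma-separated float values per row
    );
    -- desired output: one row per (mac, timestamp) with std_dev, min_mer and max_mer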


Solution

  • While you don't want to split the data up, you will need to if you want to do it in pure SQL. Snowflake has no problems with such aggregations.

    WITH fake_data(mac, mer_data) AS (
        SELECT * FROM VALUES
            ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43'),
            ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43')
    )
    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER
    FROM fake_data f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1
    ORDER BY 1;
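
    For context, SPLIT_TO_TABLE is a table function that splits the string and returns one row per token, with SEQ, INDEX, and VALUE columns; the VALUE column (a string) is what gets cast to float above. A minimal illustration:

    SELECT d.index, d.value
    FROM TABLE(split_to_table('43,44.25,44.5', ',')) d;
    -- three rows: INDEX 1..3 with VALUE '43', '44.25', '44.5'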
    

    I would, however, discourage grouping on strings, so I would break it apart like so:

    WITH fake_data(mac, mer_data, timestamp) AS (
        SELECT * FROM VALUES
            ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43', '01-01-22'),
            ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43', '02-01-22')
    ), boost_data AS (
        SELECT seq8() as seq, *
        FROM fake_data
    ), math_step AS (
      SELECT f.seq,
          avg(d.value::float) as avg_dev,
          stddev(d.value::float) as std_dev,
          MIN(d.value::float) as MIN_MER,
          Max(d.value::float) as Max_MER
      FROM boost_data f, table(split_to_table(f.mer_data,',')) d
      GROUP BY 1
    )
    SELECT b.mac,
        m.avg_dev,
        m.std_dev,
        m.MIN_MER,
        m.Max_MER, 
        b.timestamp
    FROM boost_data b
    JOIN math_step m
        ON b.seq = m.seq
    ORDER BY 1;
    
    MAC  AVG_DEV       STD_DEV       MIN_MER  MAX_MER  TIMESTAMP
    abc  43.5625       0.7529703087  42.75    44.5     01-01-22
    def  34.611111111  3.226141056   32.75    43       02-01-22

    Performance testing:

    So, using this SQL to make 70K rows of 4000 values each:

    create table fake_data_tab AS
    WITH cte_a AS (
       SELECT SEQ8() as s
       FROM TABLE(GENERATOR(ROWCOUNT =>70000))
    ), cte_b AS (
       SELECT a.s, uniform(20::float, 50::float, random()) as v
       FROM TABLE(GENERATOR(ROWCOUNT =>4000))
       CROSS JOIN cte_a a
    )
    SELECT s::text as mac
        ,LISTAGG(v,',') AS mer_data
        ,dateadd(day,s,'2020-01-01')::date as timestamp
    FROM cte_b
    GROUP BY 1,3;
    

    takes 79 seconds on an X-Small warehouse.

    Now, with that in place, we can test the two solutions:

    The second set of code (group by numbers, with a join):

    WITH boost_data AS (
        SELECT seq8() as seq, *
        FROM fake_data_tab
    ), math_step AS (
      SELECT f.seq,
          avg(d.value::float) as avg_dev,
          stddev(d.value::float) as std_dev,
          MIN(d.value::float) as MIN_MER,
          Max(d.value::float) as Max_MER
      FROM boost_data f, table(split_to_table(f.mer_data,',')) d
      GROUP BY 1
    )
    SELECT b.mac,
        m.avg_dev,
        m.std_dev,
        m.MIN_MER,
        m.Max_MER, 
        b.timestamp
    FROM boost_data b
    JOIN math_step m
        ON b.seq = m.seq
    ORDER BY 1;
    

    takes 1m47s

    The original group-by on strings/dates:

    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER,
        f.timestamp
    FROM fake_data_tab f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1,6
    ORDER BY 1;
    

    takes 1m46s

    Hmm, so leaving the "mac" as a number made the code very fast (~3s); dealing with strings either way increased the data processed from ~150MB for numbers to ~1.5GB for strings.
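
    For completeness, a sketch of the "mac left as a number" variant referred to above: rebuild the test table without the ::text cast and group on the numeric column (the table name fake_data_tab_num is just a placeholder). This appears to be the variant behind the ~3s / ~150MB figures:

    create or replace table fake_data_tab_num AS
    WITH cte_a AS (
       SELECT SEQ8() as s
       FROM TABLE(GENERATOR(ROWCOUNT =>70000))
    ), cte_b AS (
       SELECT a.s, uniform(20::float, 50::float, random()) as v
       FROM TABLE(GENERATOR(ROWCOUNT =>4000))
       CROSS JOIN cte_a a
    )
    SELECT s as mac  -- kept numeric, not cast to text
        ,LISTAGG(v,',') AS mer_data
        ,dateadd(day,s,'2020-01-01')::date as timestamp
    FROM cte_b
    GROUP BY 1,3;

    SELECT f.mac,
        avg(d.value::float) as avg_dev,
        stddev(d.value::float) as std_dev,
        MIN(d.value::float) as MIN_MER,
        Max(d.value::float) as Max_MER,
        f.timestamp
    FROM fake_data_tab_num f, table(split_to_table(f.mer_data,',')) d
    GROUP BY 1,6
    ORDER BY 1;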