Tags: mysql, query-optimization, snowflake-cloud-data-platform

Get statistical measures of a varchar field in Snowflake


I have a field called MER_DATA in a Snowflake table, with values like the one shown below:

[43,44.25,44.5,42.75,44,44.25,42.75,43,42.5,42.5,36.75,42.25,42.75,43.25,43.25,43.25,42.75,43.5,42,43,43.75,43.75,43.25,41.75,43.25,42.5,43.25,42.75,43.25,43.5,43.25,43.25,43.75,...]

Each row contains approximately 4k numbers (this varies from row to row), and the field's data type is varchar(30000). The table has around 700k rows.
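
For context, the table shape is roughly as follows (the mac and timestamp columns appear in the query below; the types shown for them are illustrative, not the actual DDL):

create table cmr (
    mac varchar(17),         -- device identifier; length is an assumption
    `timestamp` timestamp,   -- backticked as in the query below
    mer_data varchar(30000)  -- comma-separated list of ~4k numbers
);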

Now I want to calculate the standard deviation of each row using the numbers present in the list shown above.

I have tried doing this in MySQL using the following query:

select mac, `timestamp`, std(res), min(res), max(res)
from (
    select cmr.mac, cmr.`timestamp`, r.res
    from cmr,
         json_table(cmr.mer_data, '$[*]' columns (res float path '$')) r
) t
group by mac, `timestamp`;

which gives me the right result but takes a lot of time for 700k rows.

I want to do the same in Snowflake. Is there an optimal way to do this?

The query also needs to run within 10 minutes in Snowflake; the MySQL query can take up to an hour.


Solution

  • Without the table definition and example source data it's difficult to produce a complete solution for your problem, but here is an example of how to do this with the STRTOK_SPLIT_TO_TABLE table function, which splits your varchar number lists into rows so we can then re-aggregate the values to get the standard deviation per row. A quick illustration of the function's output follows.
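
    To see what STRTOK_SPLIT_TO_TABLE produces, here is a one-row sketch (the literal input is illustrative): each token becomes its own row, with SEQ identifying the source row and VALUE holding the token.

    Select * from table(STRTOK_SPLIT_TO_TABLE('43,44.25,44.5', ','));

    -- SEQ | INDEX | VALUE
    --   1 |     1 | 43
    --   1 |     2 | 44.25
    --   1 |     3 | 44.5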

    First generate some test data at the right scale:

    -- 2.8B generated numbers spread across 700k row buckets
    -- averages ~4k numbers per row, matching the question's scale
    Create or Replace Table cmr (mer_data varchar) as
    With gen as (
    select
       uniform(1, 700000, random()) row_num,
       normal(50, 1, random(0))::decimal(4,2) num
    from table(generator(rowcount => 2800000000)) v
    )
    Select listagg(num, ',') listNums from gen group by row_num
    ;
    

    Check that we have 700k rows and a varying count of numbers per row.

    Select 
      count(*) row_count, 
      -- commas per row + 1 = numbers per row
      min(REGEXP_COUNT(mer_data, ',')) + 1 min_num_count, 
      max(REGEXP_COUNT(mer_data, ',')) + 1 max_num_count 
    from cmr;
    

    Split the varchar number lists to rows with STRTOK_SPLIT_TO_TABLE and group by the generated SEQ column to calculate the stddev of the VALUE.

    Select 
      seq row_num, 
      -- cast the tokens to FLOAT so the aggregates compare
      -- numerically rather than lexicographically on the varchar
      stddev(value::float) stdListNums, 
      min(value::float) minNum, max(value::float) maxNum, 
      count(value) countListNums
    from cmr, table(STRTOK_SPLIT_TO_TABLE(mer_data, ',')) 
    Group By 1
    ;
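
    Against your real table, which evidently also carries mac and timestamp columns (per your MySQL query; this is an assumption about the Snowflake table's shape), the same pattern can group by those keys directly instead of SEQ:

    Select 
      mac, 
      "timestamp",  -- quoted lowercase name assumed, mirroring the MySQL backticks
      stddev(value::float) stdMerData, 
      min(value::float) minNum, 
      max(value::float) maxNum
    from cmr, table(STRTOK_SPLIT_TO_TABLE(mer_data, ',')) 
    Group By 1, 2
    ;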
    

    For my data the query takes just over 3 minutes on an XSMALL Virtual Warehouse, and a little over 30 seconds on a LARGE Virtual Warehouse.
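
    If you want to reproduce that comparison, resizing the warehouse is a one-line change (the warehouse name my_wh is a placeholder):

    Alter Warehouse my_wh Set WAREHOUSE_SIZE = 'LARGE';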

    You can read about the STRTOK_SPLIT_TO_TABLE function in the Snowflake documentation: https://docs.snowflake.com/en/sql-reference/functions/strtok_split_to_table