Search code examples
snowflake-cloud-data-platform

Aggregations by Array object - Get distinct number of customer ids for each descriptor


I have data similar to

with data(custid, descriptors) as (
    select 1, ['Corporate', 'fun times', 'but not really']
    union all 
    select 2, ['lame times', 'Corporate', 'boring']
    union all 
    select 3, ['boring', 'Corporate', 'fun times', 'but not really']
)
    
select 
* 
from data

with > 30k rows and an unknown number of unique descriptors across all of the arrays. I'd like to count how many distinct customerid values have a descriptor array that contains a given string. For any specific string, I could use

select 
count(distinct custid)
from data
where array_contains('Corporate'::variant, descriptors)

but I'd like to get a count of the number of customerid values that have every array value, across all 30k+ rows, rather than one-at-a-timing-it.

Ultimately, I'd like to have a table like

descriptor n_custids
Corporate 3
fun times 2
but not really 2
lame times 1
boring 2

for each possible unique string in every array but I'm not sure how to proceed to programmatically get all of the array members, then do a count(distinct custid)...where array_contains() for each of them. I've been reading the docs on RESULTSETs and cursors and FOR loops, but I'm relatively new to Snowflake and finding that group of documentation not entirely helpful. I know I could use array_distinct(array_agg()) to combine all of the arrays and get just the unique values but after this, I'm at a loss. I suspect there's a simple way but whatever that is, I'm missing it.

Thanks for your help!


Solution

  • You are close. It would help if you flattened the array object before you perform group by operation. Documentation for flatten function https://docs.snowflake.com/en/sql-reference/functions/flatten

    with data(custid, descriptors) as (
        select 1, ['Corporate', 'fun times', 'but not really']
        union all
        select 2, ['lame times', 'Corporate', 'boring']
        union all
        select 3, ['boring', 'Corporate', 'fun times', 'but not really']
    )
    
    select
        value::string as descriptor,
        array_unique_agg(custid) distinct_values,
        count(distinct custid) count_distinct
    from data,
    lateral flatten( input => data.descriptors)
    group by 1;
    
    +--------------+-----------------+--------------+
    |DESCRIPTOR    |DISTINCT_VALUES  |COUNT_DISTINCT|
    +--------------+-----------------+--------------+
    |Corporate     |[                |3             |
    |              |1,               |              |
    |              |2,               |              |
    |              |3                |              |
    |              |]                |              |
    |fun times     |[                |2             |
    |              |1,               |              |
    |              |3                |              |
    |              |]                |              |
    |but not really|[                |2             |
    |              |1,               |              |
    |              |3                |              |
    |              |]                |              |
    |lame times    |[                |1             |
    |              |2                |              |
    |              |]                |              |
    |boring        |[                |2             |
    |              |2,               |              |
    |              |3                |              |
    |              |]                |              |
    +--------------+-----------------+--------------+