I have data similar to
with data(custid, descriptors) as (
select 1, ['Corporate', 'fun times', 'but not really']
union all
select 2, ['lame times', 'Corporate', 'boring']
union all
select 3, ['boring', 'Corporate', 'fun times', 'but not really']
)
select
*
from data
with > 30k rows and an unknown number of unique descriptors across all of the arrays. I'd like to count how many distinct customerid
values have a descriptor array that contains a given string. For any specific string, I could use
select
count(distinct custid)
from data
where array_contains('Corporate'::variant, descriptors)
but I'd like to get a count of the number of customerid
values that have every array value, across all 30k+ rows, rather than one-at-a-timing-it.
Ultimately, I'd like to have a table like
descriptor | n_custids |
---|---|
Corporate | 3 |
fun times | 2 |
but not really | 2 |
lame times | 1 |
boring | 2 |
for each possible unique string in every array but I'm not sure how to proceed to programmatically get all of the array members, then do a count(distinct custid)...where array_contains()
for each of them. I've been reading the docs on RESULTSET
s and cursors and FOR loops, but I'm relatively new to Snowflake and finding that group of documentation not entirely helpful. I know I could use array_distinct(array_agg())
to combine all of the arrays and get just the unique values but after this, I'm at a loss. I suspect there's a simple way but whatever that is, I'm missing it.
Thanks for your help!
You are close. It would help if you flattened the array object before you perform group by operation. Documentation for flatten function https://docs.snowflake.com/en/sql-reference/functions/flatten
with data(custid, descriptors) as (
select 1, ['Corporate', 'fun times', 'but not really']
union all
select 2, ['lame times', 'Corporate', 'boring']
union all
select 3, ['boring', 'Corporate', 'fun times', 'but not really']
)
select
value::string as descriptor,
array_unique_agg(custid) distinct_values,
count(distinct custid) count_distinct
from data,
lateral flatten( input => data.descriptors)
group by 1;
+--------------+-----------------+--------------+
|DESCRIPTOR |DISTINCT_VALUES |COUNT_DISTINCT|
+--------------+-----------------+--------------+
|Corporate |[ |3 |
| |1, | |
| |2, | |
| |3 | |
| |] | |
|fun times |[ |2 |
| |1, | |
| |3 | |
| |] | |
|but not really|[ |2 |
| |1, | |
| |3 | |
| |] | |
|lame times |[ |1 |
| |2 | |
| |] | |
|boring |[ |2 |
| |2, | |
| |3 | |
| |] | |
+--------------+-----------------+--------------+