Search code examples
sqlgoogle-bigqueryoverlapping

Find Overlapping within categories in SQL Big Query


I'm trying to do something similar to this but in BigQuery. I have several users that may have 1 or more categories. I must find the overlapping within categories. Something like this:

enter image description here

What I want as result is something like this:

enter image description here

That is, for example, only one user has only category D (and no other), two users have categories 10 and 30, and so on.

The main problem is that I have a lot of categories (over 40). Previously I had done something like:

SELECT sum(cat1), sum(cat2), sum(cat3)
FROM  table
where cat1 = 0 and cat2 = 1 and cat3 = 0

That way worked, but is too manual and impossible to do it right now because I have lots of categories. Want to use BigQuery if it is possible.


Solution

  • The main problem is that I have a lot of categories (over 40).

    Consider below (BigQuery) approach - works for any reasonable amount of categories

    execute immediate (
    select '''
      select * from (
        select distinct t1.usr, 
          t1.categories category, t2.categories category2
        from `your_table` t1 left join `your_table` t2 
        on t1.usr = t2.usr and t1.categories != t2.categories
        union all
        select usr, any_value(categories) category, any_value(categories) category2
        from `your_table`
        group by usr
        having count(1) = 1
      )
      pivot (count(usr) cat for category2 in (''' || list || '''))
      order by category
    '''
    from (
      select string_agg("'" || categories || "'" order by categories) list 
      from (select distinct categories from `your_table`)
      )
    )     
    

    if applied to sample data in your question - output is

    enter image description here