Search code examples
sqlcountaveragesnowflake-cloud-data-platformwindow-functions

Can I use SQL to Sum() and/or count() for all elements in the group except for the returned category?


I have a table of data. Column 1 is a list of categories, and column 2 is a boolean. I have N number of categories, with N number of rows per category.

I would like to return a table with the data grouped by category, and summary of the number of rows for each category, and the sum of the boolean column (number of rows with value = 1).

I would also like to return a summary of: (Sum(BooleanField)/Count(BooleanField))/(Sum(BooleanField)/Count(BooleanField)), where the numerator does not include rows with the category (Category_name) that my Group By function returns for, and the denominator is all-inclusive (all categories).

So far, I have the code

SELECT(Category_name),
COUNT(BooleanField),
SUM(BooleanField),
SUM(BooleanField)/COUNT(BooleanField) -- this is % True for each category

-- some logic that takes the % true for all categories except the category 
-- that we are grouping by later / by the % true overall (all observations) 

FROM Data.Source
GROUP BY Category_Name

This code so far is just exploratory.

The "magic number" column explains what I am looking for next, with the other columns representing what is being returned by my code so far: https://docs.google.com/spreadsheets/d/17oienILCeATmH-kNzBZqz0s0Bj9ptjKZ9HfcQJCvAdA/edit#gid=0

Thanks for any help.

Sample Data:

Category BooleanField
Cat1    0
cat1    1
cat2    1
cat2    1
cat2    1
Cat2    0
Cat2    0
Cat2    1
Cat2    1
Cat2    1
Cat3    0
Cat3    0
Cat3    0
Cat3    1
Cat4    1
Cat4    0
Cat4    0
Cat4    0
Cat4    0
Cat4    1

Desired Result

Category    Percent True    Sum Count   Magic Number
 Cat1       50.00%            1   2      1.0000
 Cat2       60.00%            6   8      0.6667
 Cat3       25.00%            1   4      1.1250
 Cat4       33.33%            2   6      1.1429

The magic number column is the trouble I'm finding. I need to find this magic number column so that I can determine which categories are driving down the overall % true the most. Such that removing the most influentially negative category would increase the overall %T the most.


Solution

  • You can use window functions. I think the logic you want is:

    select category, 
        avg(cast(booleanField as int)) as percent_true,
        sum(cast(booleanField as int)) as total,
        count(*) cnt,
        (sum(sum(cast(booleanField as int))) over() - sum(cast(booleanField as int))) 
            / (sum(count(*)) over() - count(*))
            / avg(cast(booleanField as int))
            as magic_number
    from mytable
    group by category
    order by category