I have a table of data. Column 1 is a list of categories, and column 2 is a boolean. I have N number of categories, with N number of rows per category.
I would like to return a table with the data grouped by category, and summary of the number of rows for each category, and the sum of the boolean column (number of rows with value = 1).
I would also like to return a summary of: (Sum(BooleanField)/Count(BooleanField))/(Sum(BooleanField)/Count(BooleanField)), where the numerator does not include rows with the category (Category_name) that my Group By function returns for, and the denominator is all-inclusive (all categories).
So far, I have the code
SELECT(Category_name),
COUNT(BooleanField),
SUM(BooleanField),
SUM(BooleanField)/COUNT(BooleanField) -- this is % True for each category
-- some logic that takes the % true for all categories except the category
-- that we are grouping by later / by the % true overall (all observations)
FROM Data.Source
GROUP BY Category_Name
This code so far is just exploratory.
The "magic number" column explains what I am looking for next, with the other columns representing what is being returned by my code so far: https://docs.google.com/spreadsheets/d/17oienILCeATmH-kNzBZqz0s0Bj9ptjKZ9HfcQJCvAdA/edit#gid=0
Thanks for any help.
Sample Data:
Category BooleanField
Cat1 0
cat1 1
cat2 1
cat2 1
cat2 1
Cat2 0
Cat2 0
Cat2 1
Cat2 1
Cat2 1
Cat3 0
Cat3 0
Cat3 0
Cat3 1
Cat4 1
Cat4 0
Cat4 0
Cat4 0
Cat4 0
Cat4 1
Desired Result
Category Percent True Sum Count Magic Number
Cat1 50.00% 1 2 1.0000
Cat2 60.00% 6 8 0.6667
Cat3 25.00% 1 4 1.1250
Cat4 33.33% 2 6 1.1429
The magic number column is the trouble I'm finding. I need to find this magic number column so that I can determine which categories are driving down the overall % true the most. Such that removing the most influentially negative category would increase the overall %T the most.
You can use window functions. I think the logic you want is:
select category,
avg(cast(booleanField as int)) as percent_true,
sum(cast(booleanField as int)) as total,
count(*) cnt,
(sum(sum(cast(booleanField as int))) over() - sum(cast(booleanField as int)))
/ (sum(count(*)) over() - count(*))
/ avg(cast(booleanField as int))
as magic_number
from mytable
group by category
order by category