How do I get a column that shows a count of total distinct values side by side with a "group by" count of distinct values.
For example, in the query below I'm getting counts of distinct patients in each value set. I would also like to get the total count of distinct patients as well. How do I do this? The query below works but does not get me the counts for the total number of patients. This happens to be in Databricks SQL/Apache Spark but I imagine has a generalized solution across most sql database implementations.
select
value_set_name,
oid value_set_oid,
count(distinct rx.patient_id) as count
-- ??? count of total distinct patient_id's ???
from
rx
join value_set vs on rx.code = vs.code
group by 1,2
order by 1
A subquery is possibly the simplest method:
select value_set_name,
oid as value_set_oid,
count(distinct rx.patient_id) as count,
(select count(distinct rx2.patient_id) from rx rx2) as num_total_distinct
-- ??? count of total distinct patient_id's ???
from rx join
value_set vs
on rx.code = vs.code
group by 1,2
order by 1;
Because you are using count(distinct)
, you cannot simply sum the total count(distinct)
s from all rows -- patients in more than on group would be counted multiple times.
Your query is actually pretty simple. It might be faster to do two levels of aggregation:
select value_set_name, value_set_oid, count(*) as num_patients,
sum(count(*)) over () as num_total_patients
from (select value_set_name, oid as value_set_oid, rx_patient_id, count(*) as cnt
from rx join
value_set vs
on rx.code = vs.code
group by 1, 2, 3
) rv
group by 1, 2
order by 1;