Search code examples
sqlcountapache-spark-sqldistinct

Query to get total distinct values side by side with group by values


How do I get a column that shows a count of total distinct values side by side with a "group by" count of distinct values.

For example, in the query below I'm getting counts of distinct patients in each value set. I would also like to get the total count of distinct patients as well. How do I do this? The query below works but does not get me the counts for the total number of patients. This happens to be in Databricks SQL/Apache Spark but I imagine has a generalized solution across most sql database implementations.

select
  value_set_name,
  oid value_set_oid,
  count(distinct rx.patient_id) as count
  -- ??? count of total distinct patient_id's ???
from
  rx
  join value_set vs on rx.code = vs.code
group by 1,2
order by 1

Solution

  • A subquery is possibly the simplest method:

    select value_set_name,
           oid as value_set_oid,
           count(distinct rx.patient_id) as count,
           (select count(distinct rx2.patient_id) from rx rx2) as num_total_distinct
      -- ??? count of total distinct patient_id's ???
    from rx join
         value_set vs
         on rx.code = vs.code
    group by 1,2
    order by 1;
    

    Because you are using count(distinct), you cannot simply sum the total count(distinct)s from all rows -- patients in more than on group would be counted multiple times.

    Your query is actually pretty simple. It might be faster to do two levels of aggregation:

    select value_set_name, value_set_oid, count(*) as num_patients,
           sum(count(*)) over () as num_total_patients
    from (select value_set_name, oid as value_set_oid, rx_patient_id, count(*) as cnt
          from rx join
               value_set vs
               on rx.code = vs.code
          group by 1, 2, 3
         ) rv
    group by 1, 2
    order by 1;