Search code examples
sqldatabricksdata-wranglinghavingdatabricks-sql

SQL data wrangling help using the Having statement


The below code (Databricks SQL) produces the table following it. I am trying to adjust this code so that the output only includes zip5 records that have only 1 (or less) of each facility_type associated with it. Facility_type has 4 possible values (Hospital, ASC, Other, and null). In the table below, I want zip5 (10003) to be output, as it has only 1 of each of it's associated facility_types. Zip5 10016 would not be output, as it has 2 Hospital values. Zip5 10021 has 3 values with a Hospital facility_type, so it would also not be output. Zip5 10025 and 10029 would both be output.

I've tried using different having statements, but they all have allowed some unwanted zip5's to sneak into the output. For example, the following statement allows zip5 10016 into the output.

How can I achieve what I need to here? Is the having statement not the way to go?

HAVING (COUNT(DISTINCT ok.facility_type) <= 1 and count(distinct a.org_id) <=1)
SELECT a.zip5, a.org_id, ok.facility_type 
FROM sales_table a 
LEFT JOIN (SELECT ok.org_id, 
                  CASE WHEN cot.COT_DESC IN ('Outpatient') THEN 'ASC'
                       WHEN cot.cot_desc IN ('Hospital') THEN cot.cot_desc 
                       ELSE 'Other' 
                       END AS facility_type
           FROM ref_table1 ok
           LEFT JOIN ref_table2 cot ON ok.ID = cot.ID) ok ON a.org_id = ok.org_id 
GROUP BY a.zip5, a.org_id, ok.facility_type 
zip5 org_id facility_type
10003 845307 Other
10003 001564 Hospital
10003 006054 null
10016 932258 Hospital
10016 005484 Hospital
10016 null null
10016 584790 ASC
10021 005491 Hospital
10021 005154 Hospital
10021 002166 Hospital
10021 null null
10025 001565 Hospital
10029 005425 Other
10029 005483 Hospital

Solution

    1. cte calculates the count of each facility_type per zip5
    2. then filters out zip5s where any facility_type count is greater than 1
    3. next, selects records where zip5 is not in the filtered list, making sure that only zip5s with 1 or less of each facility_type are included (10003, 10025, and 10029 are included while 10016 and 10021 excluded)

    -- Step 1: Calculate the counts of each facility_type per zip5
    WITH facility_counts AS (
      SELECT 
        a.zip5, 
        ok.facility_type, 
        COUNT(a.org_id) AS facility_count
      FROM sales_table a 
      LEFT JOIN (
        SELECT 
          ok.org_id, 
          CASE 
            WHEN cot.COT_DESC IN ('Outpatient') THEN 'ASC'
            WHEN cot.cot_desc IN ('Hospital') THEN cot.cot_desc 
            ELSE 'Other' 
          END AS facility_type
        FROM ref_table1 ok
        LEFT JOIN ref_table2 cot ON ok.ID = cot.ID
      ) ok ON a.org_id = ok.org_id 
      GROUP BY a.zip5, ok.facility_type
    )
    
    -- Step 2: Filter zip5s where any facility_type count is greater than 1
    , filtered_zip5s AS (
      SELECT zip5
      FROM facility_counts
      WHERE facility_count > 1
      GROUP BY zip5
    )
    
    -- Step 3: Select zip5s that do not appear in the filtered_zip5s
    SELECT a.zip5, a.org_id, ok.facility_type 
    FROM sales_table a 
    LEFT JOIN (
      SELECT 
        ok.org_id, 
        CASE 
          WHEN cot.COT_DESC IN ('Outpatient') THEN 'ASC'
          WHEN cot.cot_desc IN ('Hospital') THEN cot.cot_desc 
          ELSE 'Other' 
        END AS facility_type
      FROM ref_table1 ok
      LEFT JOIN ref_table2 cot ON ok.ID = cot.ID
    ) ok ON a.org_id = ok.org_id 
    WHERE a.zip5 NOT IN (SELECT zip5 FROM filtered_zip5s)
    GROUP BY a.zip5, a.org_id, ok.facility_type
    ORDER BY a.zip5, a.org_id;