In my Snowflake SQL query I have an output like below, all other fields contain the same data when grouped by Column A:
Column A | Column B | Column C | Column D | Tag |
---|---|---|---|---|
A | NULL | AAAA | NULL | 1 |
A | BBBB | AAAA | CCCC | |
B | DDDD | NULL | EEEE | 3 |
B | FFFF | NULL | GGGG | 4 |
B | HHHH | NULL | IIII | 2 |
B | HHHH | JJJJ | IIII |
The query uses both SELECT DISTINCT and GROUP BY clauses to remove any duplication where it exists. I'm aware that these are not true 'duplicates' but I'd like to remove the rows i've tagged as 1 and 2 as they have NULLS, so my output should look like this:
Column A | Column B | Column C | Column D |
---|---|---|---|
A | BBBB | AAAA | CCCC |
B | DDDD | NULL | EEEE |
B | FFFF | NULL | GGGG |
B | HHHH | JJJJ | IIII |
I've already tried using ROW_NUMBER(), PARTITION BY, RANK() etc to eliminate the rows i don't require, but as rows tagged 3 and 4 legitimately contain NULLS it also removes them. There are no other fields (for example datetime) that i could use to order by and remove based on them.
A slimmed down example of my current query is:
SELECT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
(
WITH CTE1 AS
(
SELECT DISTINCT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
SOURCE_TABLE WHERE SOURCE = '1'
),
CTE2 AS
(
SELECT DISTINCT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN G,
COLUMN H
FROM
SOURCE_TABLE WHERE SOURCE = '2'
),
CTE3 AS
(
SELECT DISTINCT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
SOURCE_TABLE WHERE SOURCE = '3'
),
BASE_CTE AS
(
SELECT DISTINCT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
MAX(COLUMN E) AS COLUMN E,
LISTAGG(COLUMN F, ', ') AS COLUMN F,
FROM
(
SELECT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
CTE1
UNION
(
SELECT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
CTE2
)
UNION
(
SELECT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
CTE3
)
) GROUP BY ALL
)
SELECT * FROM BASE_CTE
);
You can use the following approach:
CASE WHEN
to mark rows with 0 when left side of join is a duplicate of the right side; and with 0 otherwise.MAX(dedup) OVER(...)
), and filter by it (QUALIFY ...
)DISTINCT
rows to avoid duplications.Two notes:
dedup
field.WITH data AS (
SELECT 'A' AS col_a, NULL AS col_b, 'AAAA' AS col_c, NULL AS col_d
UNION ALL
SELECT 'A' AS col_a, 'BBBB' AS col_b, 'AAAA' AS col_c, 'CCCC' AS col_d
UNION ALL
SELECT 'B' AS col_a, 'DDDD' AS col_b, NULL AS col_c, 'EEEE' AS col_d
UNION ALL
SELECT 'B' AS col_a, 'FFFF' AS col_b, NULL AS col_c, 'GGGG' AS col_d
UNION ALL
SELECT 'B' AS col_a, 'HHHH' AS col_b, NULL AS col_c, 'IIII' AS col_d
UNION ALL
SELECT 'B' AS col_a, 'HHHH' AS col_b, 'JJJJ' AS col_c, 'IIII' AS col_d
UNION ALL
SELECT 'C' AS col_a, 'AAAA' AS col_b, 'BBBB' AS col_c, 'CCCC' AS col_d
UNION ALL
SELECT 'D' AS col_a, NULL AS col_b, NULL AS col_c, NULL AS col_d
UNION ALL
SELECT 'E' AS col_a, NULL AS col_b, NULL AS col_c, NULL AS col_d
UNION ALL
SELECT 'E' AS col_a, 'AAAA' AS col_b, 'BBBB' AS col_c, 'CCCC' AS col_d
),
processed AS (
SELECT
d1.col_a,
d1.col_b,
d1.col_c,
d1.col_d,
CASE
WHEN (
(d1.col_b IS NULL OR d1.col_b = d2.col_b)
AND (d1.col_c IS NULL OR d1.col_c = d2.col_c)
AND (d1.col_d IS NULL OR d1.col_d = d2.col_d)
)
THEN 1
ELSE 0
END AS dedup
FROM data d1
LEFT JOIN data d2 ON d1.col_a = d2.col_a AND hash(d1.col_b, d1.col_c, d1.col_d) != hash(d2.col_b, d2.col_c, d2.col_d)
QUALIFY MAX(dedup) OVER(PARTITION BY d1.col_a, d1.col_b, d1.col_c, d1.col_d) = 0
)
SELECT DISTINCT
col_a,
col_b,
col_c,
col_d
FROM processed
ORDER BY 1,2,3,4