Search code examples
sqlhivehiveql

How to get distinct count over multiple columns in Hive SQL?


I have a table that looks like this. And I want to get the distinct count horizontally across the three columns ignoring nulls.

ID Column1 Column 2 Column 3
1 A B C
2 A A B
3 A A

The desired output I'm looking for is:

ID Column1 Column 2 Column 3 unique_count
1 A B C 3
2 A A B 2
3 A A 1

Solution

  • One possible option would be

    WITH sample AS (
      SELECT 'A' Column1, 'B' Column2, 'C' Column3 UNION ALL
      SELECT 'A', 'A', 'B' UNION ALL
      SELECT 'A', 'A', NULL UNION ALL
      SELECT '', 'A', NULL
    )
    SELECT Column1, Column2, Column3, COUNT(DISTINCT NULLIF(TRIM(c), '')) unique_count
      FROM (SELECT *, ROW_NUMBER() OVER () rn FROM sample) t LATERAL VIEW EXPLODE(ARRAY(Column1, Column2, Column3)) tf AS c
     GROUP BY Column1, Column2, Column3, rn;
    
    output
    +---------+---------+---------+--------------+
    | column1 | column2 | column3 | unique_count |
    +---------+---------+---------+--------------+
    |         | A       | NULL    |            1 |
    | A       | A       | NULL    |            1 |
    | A       | A       | B       |            2 |
    | A       | B       | C       |            3 |
    +---------+---------+---------+--------------+