Search code examples
sqlgroup-byhivecountwindow-functions

calculate percentage of null values based on group by


I'm trying to modify this solution (the first answer) that counts null values based on group by, the only difference is I want percentage of them (e.g. 30% of the column1 for year 2016 is null), not count (e.g. 6521 of th column1 for year 2016 is null). My query:

WITH t1nulltest AS
( select date_column
,SUM(IF(c1 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c1null
,SUM(IF(c2 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c2null
,SUM(IF(c3 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c3null
,SUM(IF(c4 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c4null
,SUM(IF(c5 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c5null
,row_number() OVER (PARTITION BY date_column) as rowno
from t1) 
select 
  date_column, c1null, c2null,c3null,c4null,c5null from t1nulltest
  where rowno =1;

The only difference from the original solution is that I add /count(*), but this doesn't work, and I wonder why. The original query works. My query gives the error:

Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: line 2:68 Expression not in GROUP BY key 'date_column'

Solution

  • I suspect that the whole logic can be simplified into a simple aggregation query.

    select
        date_column,
        1.0 * (count(*) - count(c1)) / count(*) c1_null_ratio,
        1.0 * (count(*) - count(c2)) / count(*) c2_null_ratio,
        1.0 * (count(*) - count(c3)) / count(*) c3_null_ratio,
        1.0 * (count(*) - count(c4)) / count(*) c4_null_ratio,
        1.0 * (count(*) - count(c5)) / count(*) c5_null_ratio
    from t1
    group by date_column
    

    This can also be phrased with avg():

    select
        date_column,
        avg(case when c1 is null then 1 else 0 end) c1_null_ratio,
        avg(case when c2 is null then 1 else 0 end) c2_null_ratio,
        avg(case when c3 is null then 1 else 0 end) c3_null_ratio,
        avg(case when c4 is null then 1 else 0 end) c4_null_ratio,
        avg(case when c5 is null then 1 else 0 end) c5_null_ratio,
    from t1
    group by date_column