I'm trying to modify this solution (the first answer) that counts null values based on group by, the only difference is I want percentage of them (e.g. 30% of the column1 for year 2016 is null), not count (e.g. 6521 of th column1 for year 2016 is null). My query:
WITH t1nulltest AS
( select date_column
,SUM(IF(c1 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c1null
,SUM(IF(c2 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c2null
,SUM(IF(c3 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c3null
,SUM(IF(c4 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c4null
,SUM(IF(c5 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c5null
,row_number() OVER (PARTITION BY date_column) as rowno
from t1)
select
date_column, c1null, c2null,c3null,c4null,c5null from t1nulltest
where rowno =1;
The only difference from the original solution is that I add /count(*)
, but this doesn't work, and I wonder why. The original query works. My query gives the error:
Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: line 2:68 Expression not in GROUP BY key 'date_column'
I suspect that the whole logic can be simplified into a simple aggregation query.
select
date_column,
1.0 * (count(*) - count(c1)) / count(*) c1_null_ratio,
1.0 * (count(*) - count(c2)) / count(*) c2_null_ratio,
1.0 * (count(*) - count(c3)) / count(*) c3_null_ratio,
1.0 * (count(*) - count(c4)) / count(*) c4_null_ratio,
1.0 * (count(*) - count(c5)) / count(*) c5_null_ratio
from t1
group by date_column
This can also be phrased with avg()
:
select
date_column,
avg(case when c1 is null then 1 else 0 end) c1_null_ratio,
avg(case when c2 is null then 1 else 0 end) c2_null_ratio,
avg(case when c3 is null then 1 else 0 end) c3_null_ratio,
avg(case when c4 is null then 1 else 0 end) c4_null_ratio,
avg(case when c5 is null then 1 else 0 end) c5_null_ratio,
from t1
group by date_column