I have a pretty complicated query I am trying to convert over to use with Hive.
Specifically, I am running it as a Hive "step" in an AWS EMR cluster.
I have tried to clean up the query a bit for the post and just leave essence of the thing.
The full error message is:
FAILED: SemanticException [Error 10128]: Line XX:XX Not yet supported place for UDAF 'COUNT'
The line number is pointing to the COUNT
at the bottom of the select statement:
INSERT INTO db.new_table (
new_column1,
new_column2,
new_column3,
... ,
new_column20
)
SELECT MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS
new_col1,
TBL1.col2,
TBL1.col3,
TBL1.col3 AS new_column3,
TBL1.col4,
CASE
WHEN TBL1.col5 = …
ELSE “some value”
END AS new_column5,
TBL1.col6,
TBL1.col7,
TBL1.col8,
CASE
WHEN TBL1.col9 = …
ELSE "some value"
END AS new_column9,
CASE
WHEN TBL1.col10 = …
ELSE "value"
END AS new_column10,
TBL1.col11,
"value" AS new_column12,
TBL2.col1,
TBL2.col2,
from_unixtime(…) AS new_column13,
CAST(…) AS new_column14,
CAST(…) AS new_column15,
CAST(…) AS new_column16,
COUNT(DISTINCT TBL1.col17) AS new_column17
FROM db.table1 TBL1
LEFT JOIN
db.table2 TBL2
ON TBL1.col311 = TBL2.col311
WHERE TBL1.col14 BETWEEN "low" AND "high"
AND TBL1.col44 = "Y"
AND TBL1.col55 = "N"
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20;
If I have left out too much, please let me know.
Thanks for your help!
Updates
It turns out, I did in fact leave out way too much info. Sorry for those who have already tried to help...
I made the updates above.
Removing the 20th group by column, eg:
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced: Expression not in GROUP BY key '' ''
LATEST
Removing the 20th group by column and adding the first one, eg:
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced:
Line XX:XX Invalid table alias or column reference 'new_column5':(possible column
names are: TBL1.col1, TBL1.col2, (looks like all columns of TBL1),
TBL2.col1, TBL2.col2, TBL2.col311)
Line # is referring the line with the SELECT statement. Just those three columns from TBL2 are listed in the error output.
The error seems to be pointing to COALESCE(new_column5)
. Note that I have a CASE
statement within the TBL 1 select which I am running with AS new_column5
.
You are addressing calculated column name new_column5
at the same subquery level where it is being calculated. This is not possible in Hive. Replace it with calculation itself or use upper level subquery.
This:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(CASE WHEN TBL1.col5 = … ELSE “some value” END," ")||"_"||...) AS new_col1,
Instead of this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS
new_col1,