Running the following Hive query returns special characters:
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color
It returns the following results:
amt color
4 �
3 �
Is this a known Hive bug?
Explain plan:
Map 5 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 4
File Output Operator [FS_331359]
compressed:false
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Select Operator [SEL_331358]
| outputColumnNames:["_col0","_col1"]
| Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
|<-Reducer 3 [SIMPLE_EDGE]
Reduce Output Operator [RS_331357]
key expressions:_col1 (type: int)
sort order:+
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
value expressions:_col0 (type: string)
Select Operator [SEL_331351]
outputColumnNames:["_col0","_col1"]
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator [GBY_331350]
| keys:KEY._col0 (type: int), KEY._col1 (type: string)
| outputColumnNames:["_col0","_col1"]
| Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
|<-Union 2 [SIMPLE_EDGE]
|<-Map 1 [CONTAINS]
| Reduce Output Operator [RS_331349]
| key expressions:_col0 (type: int), _col1 (type: string)
| Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
| sort order:++
| Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator [GBY_331348]
| keys:_col0 (type: int), _col1 (type: string)
| outputColumnNames:["_col0","_col1"]
| Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
| Select Operator [SEL_331342]
| outputColumnNames:["_col0","_col1"]
| Statistics:Num rows: 1 Data size: 91 Basic stats: COMPLETE Column stats: COMPLETE
| TableScan [TS_331341]
| alias:_dummy_table
| Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
|<-Map 5 [CONTAINS]
Reduce Output Operator [RS_331349]
key expressions:_col0 (type: int), _col1 (type: string)
Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
sort order:++
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator [GBY_331348]
keys:_col0 (type: int), _col1 (type: string)
outputColumnNames:["_col0","_col1"]
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator [SEL_331344]
outputColumnNames:["_col0","_col1"]
Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
TableScan [TS_331343]
alias:_dummy_table
Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
Can disabling or enabling a configuration parameter help me here?
If I reverse the order of the columns in the outermost select, the query returns the expected results. I would have expected the result to be:
color amt
Blue 6203
Red 7716
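For reference, here is a sketch of the reordered variant mentioned above. It assumes that "reversing the order of columns" means swapping only the two items in the outermost select list; everything else is the original query, unchanged.

```sql
-- Same query as above; only the outermost select list is swapped
-- (color first, then amt2). Assumption: this is the reordering meant above.
SELECT t6.color color, t6.amt amt2
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color;
```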
I tried the same query on Hive 2.3 with both MR and Tez, and the results are the same as yours. I turned off all the query optimizations, stats gathering and rcp, but the result remains the same. The problem is that Hive performs order by on a single reducer, and since you have two sequential order by's, Hive merges them into a single reduce stage (this is easy to see if you look at the extended or formatted query plans). To be more precise, Hive uses _col0, _col1, etc. as column aliases. In the t5 subquery your key is _col0, but in t6 it's _col1, which is why in the select operator you see
expressions:: "_col1 (type: string), _col0 (type: int)"
and in reduce output operator
key expressions:: "_col1 (type: int)"
So Hive somehow switched the key type when the select columns were swapped. If the type order is the same in t5 and t6, then there is no problem:
key expressions:: "_col0 (type: string)"
As for how to avoid this, I don't really know, since performing sequential order by's in a single reducer is not the result of any extra optimization.
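To check which column ends up as the sort key on your own cluster, you can prefix the query with EXPLAIN FORMATTED (or EXPLAIN EXTENDED for the more verbose text plan), as mentioned above. This is only a sketch using the query from the question; operator ids and the exact output layout will differ between Hive versions and execution engines.

```sql
-- Prints the query plan instead of running the query.
-- Swap FORMATTED for EXTENDED to get the verbose text plan.
EXPLAIN FORMATTED
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color;
```

In the output, look at the key expressions entry of the last Reduce Output Operator and compare its type with the type of the column you meant to sort on.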