I am facing a problem where exporting results from HiveServer2 to ORC files shows default column names (e.g. _col0, _col1, _col2) instead of the original ones created in Hive. We are using pretty much default components from HDP-2.6.3.0.
I am also wondering if the below issue is related:
https://issues.apache.org/jira/browse/HIVE-4243
Below are the steps we are taking:
Connecting:
export SPARK_HOME=/usr/hdp/current/spark2-client
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181,HOST3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Creating test table and inserting sample values:
create table test(str string);
insert into test values ('1');
insert into test values ('2');
insert into test values ('3');
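As a quick sanity check (my addition, not one of the original steps; output abbreviated), the column name is registered on the Hive side:
describe test;
-- should list the column as: str  string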
Running test query:
select * from test;
+-----------+--+
| test.str  |
+-----------+--+
| 1         |
| 2         |
| 3         |
+-----------+--+
Exporting as ORC:
insert overwrite directory 'hdfs://HOST1:8020/tmp/test' stored as orc select * from test;
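Optionally, listing the export directory first shows the file name that the next step copies (000000_0 is what Hive typically names the first output file):
hdfs dfs -ls hdfs://HOST1:8020/tmp/test
# expect one or more part files such as 000000_0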
Getting the results:
hdfs dfs -get /tmp/test/000000_0 test.orc
Checking the results:
java -jar orc-tools-1.4.1-uber.jar data test.orc
Processing data file test.orc [length: 228]
{"_col0":"1"}
{"_col0":"2"}
{"_col0":"3"}
java -jar orc-tools-1.4.1-uber.jar meta test.orc
Processing data file test.orc [length: 228]
Structure for test.orc
File Version: 0.12 with HIVE_13083
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:string>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2

File Statistics:
  Column 0: count: 2 hasNull: false
  Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2

Stripes:
  Stripe: offset: 3 data: 11 rows: 2 tail: 60 index: 39
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 28
    Stream: column 1 section DATA start: 42 length 5
    Stream: column 1 section LENGTH start: 47 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 228 bytes
Padding length: 0 bytes
Padding ratio: 0%
Looking at the results, I can see _col0 as the column name, while I was expecting the original str.
Any ideas on what I am missing?
Update
I noticed that the connection from beeline was going to Hive 1.x, and not 2.x as intended. I changed the connection to the HiveServer2 Interactive URL:
Connected to: Apache Hive (version 2.1.0.2.6.3.0-235)
Driver: Hive JDBC (version 1.21.2.2.6.3.0-235)
Transaction isolation: TRANSACTION_REPEATABLE_READ
I tried again with the same sample, and it even prints out the schema correctly:
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.str, type:string, comment:null)], properties:null)
But still no luck getting it into the ORC file.
You need to enable Hive LLAP (Interactive SQL) in Ambari, then change the connection string you are using. For example, my connection became jdbc:hive2://.../;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Note the additional "-hive2" at the end of the URL. Here is a tutorial video from Hortonworks.
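Applied to the connection command from the question, only the namespace changes (the hosts stay the same placeholders; ZooKeeper discovery then resolves the Interactive HiveServer2 instance):
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181,HOST3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2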
After connecting to the updated Hive endpoint, I ran
create table t_orc(customer string, age int) stored as orc;
insert into t_orc values('bob', 12),('kate', 15);
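If your warehouse path differs from the default one assumed in the copy below, the table's actual location can be read from its description (the Location: field is the relevant part):
describe formatted t_orc;
-- look for the Location: row, e.g. hdfs://.../apps/hive/warehouse/t_orc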
Then
~$ hdfs dfs -copyToLocal /apps/hive/warehouse/t_orc/000000_0 ~/tmp/orc/hive2.orc
~$ orc-metadata tmp/orc/hive2.orc
{ "name": "tmp/orc/hive2.orc",
"type": "struct<customer:string,age:int>",
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "HIVE-13083",
"compression": "zlib", "compression block": 262144,
"file length": 305,
"content": 139, "stripe stats": 46, "footer": 96, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 136,
"index": 67, "data": 23, "footer": 46
}
]
}
where orc-metadata is a tool distributed with the Apache ORC repository on GitHub.
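If you don't have orc-metadata available, it can be built from that repository with CMake (a rough sketch; option names and output paths may vary by ORC version), or you can get the same schema information from the Java orc-tools jar used earlier in the question:
git clone https://github.com/apache/orc.git
cd orc && mkdir build && cd build
cmake .. -DBUILD_JAVA=OFF   # C++ tools only; orc-metadata ends up under the build tree
make
# alternative: the Java tool prints the same Type/schema line
java -jar orc-tools-1.4.1-uber.jar meta hive2.orc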