Trying to load a well-formatted ~80 GB text file (CHR 01 / '\001' delimited) into Hive through beeline:
beeline -u "<connector:server:port>;principal=<principal_name>" \
-e "LOAD DATA INPATH '/path/to/hdfs/dir' OVERWRITE INTO TABLE database.tableName;"
The table was created with the appropriate schema and data types (hundreds of columns), using the following row format and storage clauses:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
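If it helps, the column types and the field delimiter that Hive actually recorded can be double-checked with the standard commands (nothing specific to this table beyond its name):

-- Shows the column types plus the SerDe / field.delim properties Hive stored:
DESCRIBE FORMATTED database.tableName;
-- Dumps the exact DDL Hive registered for the table:
SHOW CREATE TABLE database.tableName;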
When the data is loaded, all of the columns appear to contain the correct information: the row count is the same order of magnitude as the input (tens of millions of records), and sampled column values match the expected values. However, the first column (which happens to be the SKEY) is severely duplicated, as if each value were carried down to the records below its first occurrence.
SKEY      ValA  ValB  ValC  ValD
Record1   1     2     3     Apple
Record2   7     12    22    Baseball
Record3   9     28    10    Tennis
Record4   6     2     3     Jeans
...
RecordN   8     4     12    Wishbone
...
Becomes:
SKEY      ValA  ValB  ValC  ValD
Record1   1     2     3     Apple
Record1   7     12    22    Baseball
Record1   9     28    10    Tennis
...
Record4   6     2     3     Jeans
Record4   8     4     12    Wishbone
...
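One way to quantify the symptom from the Hive side is to compare the total row count with the number of distinct first-column values (a sketch; it assumes the first column is literally named skey in database.tableName):

-- If the first column had loaded correctly, distinct_skeys should be close to total_rows;
-- with the duplication above it collapses to a much smaller number.
SELECT COUNT(*)             AS total_rows,
       COUNT(DISTINCT skey) AS distinct_skeys
FROM   database.tableName;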
Has anyone run into this issue, or have an idea about the root cause? I believe I could get better results with another format (e.g. Avro), but switching formats without understanding the problem is a little unsatisfying.
Is there a maximum size limit when loading a text file into Hive?
What is the data type of the column key?
---- Updated after looking at boethius's comments ----
I would advise you to use STRING, BIGINT, or DECIMAL for your primary key. With FLOAT you lose precision. For example, if you have the two SKEYs 8611317762 and 8611317761, I suspect that as FLOAT both are rounded to the same value, roughly 8.6113178 x 10^9 (a FLOAT carries only about 7 significant decimal digits). That is why DISTINCT is returning a wrong answer.
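The collision can be demonstrated directly in Hive with the two example keys above; this is just a sketch (it assumes a Hive version that allows SELECT without a FROM clause) and does not touch your table:

-- Both literals round to the same single-precision value, so the comparison returns true:
SELECT CAST(8611317762 AS FLOAT)  AS skey_a,
       CAST(8611317761 AS FLOAT)  AS skey_b,
       CAST(8611317762 AS FLOAT) = CAST(8611317761 AS FLOAT) AS same_value;

-- With BIGINT (or DECIMAL) the two keys stay distinct and the comparison returns false:
SELECT CAST(8611317762 AS BIGINT) = CAST(8611317761 AS BIGINT) AS same_value;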