hive · hdfs · beeline

Hive Load Data Inpath overwrite on text format file causing duplicate SKEY column values


Trying to load a well-formatted ~80GB text file (delimited by CHR 01, i.e. '\001') into Hive through beeline:

beeline -u "<connector:server:port>;principal=<principal_name>"  \
-e "LOAD DATA INPATH '/path/to/hdfs/dir' OVERWRITE INTO TABLE database.tableName;"

The table was created with the appropriate schema and datatypes (hundreds of columns), with the following row format parameters:

ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
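
For context, the fragment above sits inside a full CREATE TABLE statement along the lines of the sketch below; the column names are hypothetical stand-ins for the actual hundreds of columns:

CREATE TABLE database.tableName (
  skey STRING,   -- key column; see the answer below regarding its datatype
  vala INT,
  valb INT,
  valc INT,
  vald STRING
  -- ...hundreds more columns...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;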

When the data is loaded, all columns appear to contain the correct information: the row count is the right order of magnitude (tens of millions of records), and sampled column values match the expected values. However, the first column (coincidentally, the SKEY) is severely duplicated, as if each value were applied to the records below its first occurrence.

SKEY     ValA ValB ValC ValD
Record1  1    2    3    Apple
Record2  7    12   22   Baseball
Record3  9    28   10   Tennis
Record4  6    2    3    Jeans
...
RecordN  8    4    12   Wishbone
...

Becomes:

SKEY     ValA ValB ValC ValD
Record1  1    2    3    Apple
Record1  7    12   22   Baseball
Record1  9    28   10   Tennis
...
Record4  6    2    3    Jeans
Record4  8    4    12   Wishbone
...
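
One hypothetical way to quantify the damage directly in Hive, assuming the key column is named skey, is a duplicate count per key:

SELECT skey, COUNT(*) AS occurrences
FROM database.tableName
GROUP BY skey
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 10;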

Does anyone have experience overcoming this issue, or an idea about the root cause? I believe I could get better results with another format (e.g. Avro), but that feels a little unsatisfactory.

Is there a maximum limit on textfile import to Hive?


Solution

  • What is the data type of the key column?

    -- Updated after looking at boethius's comments --

    I would advise you to use STRING, BIGINT, or DECIMAL for your primary key. With FLOAT you lose precision: a 32-bit float carries only about seven significant decimal digits. For example, if you have the two skeys 8611317762 and 8611317761, I suspect that as FLOAT they are both rounded to the same value, roughly 8.611317762 × 10^9, and that is why DISTINCT is returning a wrong answer.
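
    A minimal way to test this theory, using the two example keys above (the FROM-less SELECT requires Hive 0.13+):

    SELECT CAST(8611317762 AS FLOAT)  = CAST(8611317761 AS FLOAT)  AS float_collision,   -- true: both round to the same 32-bit value
           CAST(8611317762 AS BIGINT) = CAST(8611317761 AS BIGINT) AS bigint_collision;  -- false: BIGINT keeps all 10 digits

    If float_collision comes back true, switching the key column to BIGINT (or DECIMAL/STRING) and reloading should make the keys distinct again.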