I was loading some text data into Apache Hive containing int
columns. It was storing null values at unexpected places. So, I ran some tests:
create table testdata (c1 INT, c2 FLOAT) row format delimited fields terminated by ',' stored as textfile;
load data local inpath "testdata.csv" overwrite into table testdata;
select * from testdata;
testdata.csv contains this data:
1,1.0
1, 1.0
1 ,1.0
1 , 1.0
As you can see, dataset contains some extra whitespace around numbers. But this is causing hive to store null values in integer columns, while float is being parsed correctly.
Select query output:
1 1.0
NULL 1.0
NULL 1.0
NULL 1.0
Why this is happening so, and how to correctly handle these cases?
You can not do it in one step. First load the data as string in stg table and then load into final table from stg table by removing space.
create table testdata (c1 string, c2 string) row format delimited fields terminated by ',' stored as textfile;
create table stgtestdata as select * from testdata;
load data local inpath "testdata.csv" overwrite into table stgtestdata;
Insert overwrite testdata
select
Cast(trim(c1) as int) as c1,
Cast(trim(c2) as float) as c2
from stgtestdata;