I have a .xls file with thousands of rows with the following structure:
id | number | date | description
1232 | 41515 | 3/9/16 | amazing
I'm trying to load it with Pig, skipping the header row and leaving out the date column (so keeping just id, number and description; I haven't found out how to do that yet), using the following script:
REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar
data = LOAD '/user/maria_dev/file.xls' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
as (Id:chararray,case_number:chararray,date:chararray,block:chararray,iucr:chararray);
data_sample = LIMIT data 10;
DUMP data_sample;
But the dump gives me garbled output, with lines such as:
( � � � � � �,,,,)
Thanks for your help
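CSVExcelStorage cannot load .xls files directly. Despite its name, it parses delimited text (CSV following Excel's conventions), whereas .xls is a binary format, which is why the dump shows garbage characters. Save the .xls file as a .csv file first, then load that with CSVExcelStorage.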
Also note that your file has 4 fields but your schema declares 5.
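-- desc is a reserved word in Pig (as in ORDER ... BY ... DESC), so the last field is named description
data = LOAD '/user/maria_dev/file.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (Id:chararray,case_number:chararray,date:chararray,description:chararray);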
data_sample = LIMIT data 10;
DUMP data_sample;
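As for leaving out the date column: one option is to load all four fields and then keep only the ones you want with FOREACH ... GENERATE. Here is a minimal sketch, assuming the file has already been exported to /user/maria_dev/file.csv as described above. The names trimmed, trimmed_sample and description are my own choices, and I have not run this against your data.

```pig
REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;

-- Load all four columns from the CSV export (assumed path), skipping the header row.
data = LOAD '/user/maria_dev/file.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (id:chararray, case_number:chararray, date:chararray, description:chararray);

-- Project only the wanted columns; date is dropped simply by not listing it.
trimmed = FOREACH data GENERATE id, case_number, description;

trimmed_sample = LIMIT trimmed 10;
DUMP trimmed_sample;
```

With the sample row from the question, each dumped tuple should then have three fields instead of four. The loader still reads the date column. It is just not carried past the FOREACH.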