
How can I ignore " (double quotes) while loading a file in Pig?


I have the following data in a file:

"a","b","1","2"
"a","b","4","3"
"a","b","3","1"

I am reading this file using the command below:

File1 = LOAD '/path' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:int, f4:int);

But with this, the data in fields 3 and 4 is ignored.

How can I read this file correctly, or is there any way to make Pig skip the '"' characters?

Additional information: I am using Apache Pig version 0.10.0.


Solution

  • You can use the REPLACE function (it won't be done in one pass, though):

    file1 = load 'your.csv' using PigStorage(',');
    -- REPLACE returns a chararray, so cast explicitly to get int fields
    data = foreach file1 generate (chararray)$0 as f1, (chararray)$1 as f2, (int)REPLACE((chararray)$2, '\\"', '') as f3, (int)REPLACE((chararray)$3, '\\"', '') as f4;
    

    You can also use a regex with REGEX_EXTRACT:

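    file1 = load 'your.csv' using PigStorage(',');
    -- REGEX_EXTRACT also returns a chararray, so cast the captured digits to int
    data = foreach file1 generate $0, $1, (int)REGEX_EXTRACT((chararray)$2, '([0-9]+)', 1) as f3, (int)REGEX_EXTRACT((chararray)$3, '([0-9]+)', 1) as f4;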
    

    Of course, you can strip the " from f1 and f2 the same way.
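
    For completeness, here is a minimal sketch that strips the quotes from all four fields, reusing the '/path' location and field names from the question. It assumes every column is first loaded as chararray (so the quotes survive as plain text) and that fields 3 and 4 hold only digits once the quotes are removed; the alias `raw` is just an illustrative name.

    ```pig
    -- Load every column as chararray so the quoted values are kept as text.
    raw = LOAD '/path' USING PigStorage(',')
          AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);

    -- Remove the double quotes from each field, then cast the numeric ones.
    -- Assumption: f3 and f4 contain only digits after the quotes are stripped.
    File1 = FOREACH raw GENERATE
              REPLACE(f1, '"', '') AS f1,
              REPLACE(f2, '"', '') AS f2,
              (int)REPLACE(f3, '"', '') AS f3,
              (int)REPLACE(f4, '"', '') AS f4;

    DUMP File1;
    ```

    With the sample data from the question, this should give tuples with unquoted strings and integer values in the last two columns. Values that cannot be cast to int should come out as null rather than failing the job.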