How to use -tagFile option with CSVExcelStorage in Pig


I need the file name attached to each row, so I used:

data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray);

But in data.csv some columns contain a comma (,) in their content as well, so to handle the embedded commas I used:

data = LOAD 'data.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (filename:chararray);

But I couldn't find any way to pass the -tagFile option to CSVExcelStorage. How can I use CSVExcelStorage and the -tagFile option at the same time?

Thanks


Solution

  • I found a way to do both: get the file name on each row, and remove the delimiter when it appears inside column content.

    data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
    
    /* Remove any comma (,) that appears inside a quoted column value */
    replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;
    
    /* Remove the double quotes (") that CSV puts around a column when its value contains a comma */
    replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') AS record;
    

    Once the data is loaded with the embedded commas removed, I am free to perform any operation on it. A detailed use case is available on my blog.
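
To see what the two REPLACE calls above do to a single line, here is a minimal sketch in Python rather than Pig. It assumes that `record` holds the whole raw CSV line and that Python's `re` module treats this pattern the same way as the Java regex engine behind Pig's REPLACE. The function name `clean_record` and the sample line are made up for illustration.

```python
import re

# Same pattern as in the Pig script, written as a Python raw string
# (so the quote needs no backslash escaping). The negative lookahead
# rejects commas followed by an even number of double quotes, so only
# commas that sit inside a quoted field are matched.
COMMA_INSIDE_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')


def clean_record(record: str) -> str:
    """Drop commas inside quoted fields, then drop the quotes themselves."""
    without_embedded_commas = COMMA_INSIDE_QUOTES.sub('', record)
    return without_embedded_commas.replace('"', '')


# Hypothetical sample line: the second field contains a comma.
sample = '1,"Smith, John",NY'
print(clean_record(sample))  # 1,Smith John,NY
```

Note that the embedded comma is deleted, not preserved, so this only suits cases where losing that character is acceptable.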