Search code examples
apache-pig

replace comma(,) only if its inside quotes("") in Pig


I have data like this:

1,234,"john, lee", john@xyz.com

I want to remove , inside "" with space using pig script. So that my data will look like:

1,234,john lee, john@xyz.com

I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes. I am stuck on this. Any help is highly appreciated. Thanks


Solution

  • I got the perfect way to do this. A very generic solution is as below:

    data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
    
    /*replace comma(,) if it appears in column content*/
    replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');
    
    /*replace the quotes("") which is present around the column if it have comma(,) as its a csv file feature*/
    replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($4,'"','') as record;
    

    Detailed use case is available at my blog