Search code examples
apache-pigbigdata

Remove single quotes from data using Pig


This is what my data looks like

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

I want to remove (, ) and ' from this data using Pig Script. I want my data to look like this-

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON

I am stuck on this from quite long time. Please help. Thanks in advance.


Solution

  • Can you try REPLACE function with the below regex?

    Explanation:
    In Regex there are few characters have special meanings \ ^ $ . , | ? * + ( ) [ {. These special characters are called as "metacharacters". If you want to use any of these characters as part of your regex, then you need to escape them with a single backslash. In our case Pig uses Java based regex engine so all the specials characters needs be escaped with double backslash (Java uses \\ double backslash to differentiate the special characters ).

    To remove '(' ')' and '(single quote) characters from your input.
    1. Just Replace () with double backslash \\(\\).
    2. '(single quote) is special character in Pig(default string literal), so this also required double backslash to remove the special meaning but double backslash doesn't convince pig parser(you will get error for double backslash) that is the reason i used three backslash for single quote \\\' to remove the special meaning.
    3. [] is character class, this will match only one out of several characters. Simply place the characters inside the square bracket that you want to match ie. in our case its [()'].
    4. + symbol is for matching one or more characters.

    input

    (10, 'ACCOUNTING', 'NEW YORK')
    (20, 'RESEARCH', 'DALLAS')
    (30, 'SALES', 'CHICAGO')
    (40, 'OPERATIONS', 'BOSTON')
    

    PigScript1:

    A = LOAD 'input' AS (line:chararray);
    B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
    STORE B INTO 'output';
    

    Pigscript2:

    A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
    B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
    STORE B into 'output1' USING PigStorage(',');
    

    Output: will be stored in output/part-m-00000 file

    10, ACCOUNTING, NEW YORK
    20, RESEARCH, DALLAS
    30, SALES, CHICAGO
    40, OPERATIONS, BOSTON