This is what my data looks like
(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')
I want to remove (
, )
and '
from this data using Pig Script. I want my data to look like this-
10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON
I am stuck on this from quite long time. Please help. Thanks in advance.
Can you try REPLACE
function with the below regex?
Explanation:
In Regex there are few characters have special meanings \ ^ $ . , | ? * + ( ) [ {
. These special characters are called as "metacharacters
".
If you want to use any of these characters as part of your regex, then you need to escape them with a single backslash
. In our case Pig uses Java based regex engine
so all the specials characters needs be escaped with double backslash
(Java uses \\ double backslash to differentiate the special characters ).
To remove '(' ')' and '(single quote)
characters from your input.
1. Just Replace ()
with double backslash \\(\\)
.
2. '(single quote)
is special character in Pig(default string literal)
, so this also required double backslash to remove the special meaning but double backslash doesn't convince pig parser
(you will get error for double backslash
) that is the reason i used three backslash
for single quote \\\'
to remove the special meaning.
3. [] is character class
, this will match only one out of several characters. Simply place the characters inside the square bracket that you want to match ie. in our case its [()']
.
4. + symbol
is for matching one or more characters.
input
(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
STORE B INTO 'output';
Pigscript2:
A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
STORE B into 'output1' USING PigStorage(',');
Output: will be stored in output/part-m-00000 file
10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON