I receive data in the form
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a form where I just have a bag of tuples of an id field followed by a tuple containing a list of all my other fields merged together.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...)) (id2,(attribute2b,attribute2c...))
Currently I fetch it like
my_data = load '$input' USING PigStorage('|') as
(id:chararray, attribute1:chararray, attribute2:chararray)...
then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.
Load each line as a single string, then use the built-in STRSPLIT UDF to get the desired result. This relies on there being no tabs in your list of attributes (the default loader splits fields on tabs), and assumes that | and , are not to be treated any differently when separating the attributes. I also modified your input a little to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))
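Note that empty attributes show up as empty strings in the tuples above, whereas the desired output in the question has none. If you want to drop them, one option is to normalize the delimiters before the final split. The following is an untested sketch, not a verified solution: it assumes REPLACE applies a Java regular expression (as the built-in does), and it passes an explicit limit of 0 to STRSPLIT so that trailing empty strings are discarded. The relation name `cleaned` is just an illustrative choice.

```pig
my_data = LOAD '$input' AS (str:chararray);
-- Split off the id at the first '|'; the remainder stays as one string.
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
-- Collapse every run of ',' and '|' into a single ',', then drop a leading ','.
cleaned = FOREACH split1 GENERATE id, REPLACE(REPLACE(attr, '[,|]+', ','), '^,', '') AS attr;
-- Limit 0 (assumed to follow Java's String.split) discards trailing empty strings.
split2 = FOREACH cleaned GENERATE id, STRSPLIT(attr, ',', 0) AS attributes;
DUMP split2;
```

With the sample input.txt this should leave only the non-empty attributes for each id. A line with no attributes at all is not handled specially here, so check that case against your real data.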