Apache Pig: Merging list of attributes into a single tuple


I receive data in the form

id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..

I'm trying to merge it into a bag of tuples, where each tuple consists of an id field followed by a tuple holding all of my other fields merged together:

(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...))
(id2,(attribute2b,attribute2c...))

Currently I load it like this:

my_data = load '$input' USING PigStorage('|') as 
(id:chararray, attribute1:chararray, attribute2:chararray)...

Then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to Pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.


Solution

  • Load each line as a single string, then use the built-in STRSPLIT UDF to get the desired result. This relies on there being no tabs in your list of attributes (the default loader splits fields on tabs), and assumes that | and , are to be treated identically as separators between attributes. I also modified your input a little to show more edge cases.

    input.txt:

    id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
    id2||attribute2b,attribute2c,|attribute4a|,attribute5a
    

    test.pig:

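    -- Read each line as one chararray (the default loader only splits on tabs).
    my_data = LOAD '$input' AS (str:chararray);
    -- Split on the first | only (limit 2): the id, then the rest of the line.
    split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
    -- Split the rest on either , or | to get a tuple of attributes.
    split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
    DUMP split2;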
    

    Output of pig -x local -p input=input.txt test.pig:

    (id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
    (id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))
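
    The output above keeps an empty field wherever two separators are adjacent, whereas the form asked for in the question has none. If those empty fields are unwanted, one possible variant (a sketch, not run against real data) strips any leading separators with the built-in REPLACE and then splits on runs of separators rather than single ones. The relation name split2_clean is made up for illustration; split1 is the relation defined in test.pig.

    ```pig
    -- Assumption: empty attributes carry no meaning and can be dropped.
    -- REPLACE removes separators at the start of attr, so the split does not
    -- begin with an empty field; '[,|]+' treats a run such as |,| as one separator.
    split2_clean = FOREACH split1 GENERATE
        id,
        STRSPLIT(REPLACE(attr, '^[,|]+', ''), '[,|]+') AS attributes;
    DUMP split2_clean;
    ```

    Note that dropping empty fields also discards positional information: once they are gone, you can no longer tell which of the original |-separated columns an attribute came from.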