
How to use FLATTEN for one level in Pig?


Problem

I have an inner bag comprised of nested tuples that are unnecessary for my expected schema. I'd like to remove one of the tuple layers so that I'm left with just a simple inner bag. I'm using Pig 0.14.

Example

A sample of my input data.

((1,100,0),(2))
((1,100,1),(3,500,60))

My desired output.

(100,{(2),(3,500,60)})

My current state after some minor manipulation (see below), which prompted the question above.

(100,{((2)),((3,500,60))})

Attempts

I feel like my complication is that I'm attempting to group on an item inside a tuple. I did a simple group statement, which appears to leave the grouped elements in the tuple (I'm fairly new to Pig).

a = LOAD 'data' as (key:tuple(), data:tuple());
b = GROUP a BY key.$1;
c = FOREACH b GENERATE group as vid, a.data as data;

Dumping c provides the undesired output above. The multi-part key (a,b,c) needs to be stripped such that a is removed, b is used as a group, and c can either be removed or not, but only after it is used to create the inner bag.

Attempting to FLATTEN ungroups the elements. I can then FLATTEN again and re-group, but this seems a little ridiculous. Is there a better way than this?

d = FOREACH c GENERATE vid, FLATTEN(data) as data;
e = FOREACH d GENERATE vid, FLATTEN(data);
f = GROUP e BY $0;

This still doesn't really provide what I want, since it keeps the key around:

(100,{(100,2),(100,3,500,60)})

What am I missing?


Solution

  • You can try the following. It produces the desired output, although it may not be the most efficient approach; others may be able to post a better answer.

    Input :

     (1,100,0)|(2)
     (1,100,1)|(3,500,60)
    

    Pig Script :

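     -- Load each line as two tuples separated by '|'
     records = LOAD '/home/user/bags.txt' USING PigStorage('|') AS (key:tuple(), value:tuple());

     -- Keep only the second key field and flatten the value tuple into top-level fields
     records_each = FOREACH records GENERATE key.$1 AS grouping_key, FLATTEN(value);

     records_grp = GROUP records_each BY $0;

     -- Within each group, drop the grouping key ($0) and keep the remaining fields
     records_nested_each = FOREACH records_grp {
         inner_each = FOREACH records_each GENERATE $1..;
         GENERATE group, inner_each;
     };

     DUMP records_nested_each;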
    

    Output :

      (100,{(2),(3,500,60)})
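
For comparison, here is a rough sketch of where the extra tuple layer in the question's attempt seems to come from. It reuses the question's relation names and its 'data' input path, and it assumes the bag produced by GROUP is referenced by the name of the grouped relation (a). Projecting a.data out of that bag keeps one row tuple per record, and each row still holds the whole data tuple, which would explain the ((2)) nesting. DESCRIBE is used only to inspect the schema; no particular output is claimed here.

```pig
-- Sketch only: relation names and the 'data' path are taken from the question.
a = LOAD 'data' AS (key:tuple(), data:tuple());
b = GROUP a BY key.$1;

-- The bag inside b is named after the grouped relation, so it is referenced as a.
-- a.data keeps one row tuple per record, and each row wraps the data tuple.
c = FOREACH b GENERATE group AS vid, a.data AS data;

-- Print the schema of c to see the bag-of-tuples-of-tuples nesting.
DESCRIBE c;
```

Flattening the value tuple before the GROUP, as the script above does, turns its fields into top-level columns, so the grouped bag holds plain rows and only the grouping key remains to be dropped.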