I have an inner bag comprised of nested tuples that are unnecessary for my expected schema. I'd like to remove one of the tuple layers so that I'm left with just a simple inner bag. I'm using Pig 0.14.
A sample of my input data.
((1,100,0),(2))
((1,100,1),(3,500,60))
My desired output.
(100,{(2),(3,500,60)})
My current state after some minor manipulation (see below), which prompted the question above.
(100,{((2)),((3,500,60))})
I feel like my complication is that I'm attempting to group on an item inside a tuple. I did a simple group statement, which appears to leave the grouped elements in the tuple (I'm fairly new to Pig).
a = LOAD 'data' as (key:tuple(), data:tuple());
b = GROUP a BY key.$1;
c = FOREACH b GENERATE group as vid, a.data as data;
Dumping c provides the undesired output above. The multi-part key (a,b,c) needs to be stripped such that a is removed, b is used as a group, and c can either be removed or not, but only after it is used to create the inner bag.
Attempting to FLATTEN ungroups the elements. I can then FLATTEN again and re-group, but this seems a little ridiculous. Is there a better way than this?
d = FOREACH c GENERATE vid, FLATTEN(data) as data;
e = FOREACH d GENERATE vid, FLATTEN(data);
f = GROUP e BY $0;
This still doesn't really provide what I want, since it keeps the key around:
(100,{(100,2),(100,3,500,60)})
What am I missing?
You can try this. It helps somewhat, but it is not an efficient solution. Let's wait for others to post better answers.
Input :
(1,100,0)|(2)
(1,100,1)|(3,500,60)
Pig Script :
records = LOAD '/home/user/bags.txt' USING PigStorage('|') AS(key:tuple(),value:tuple());
records_each = FOREACH records GENERATE key.$1 as grouping_key, flatten(value);
records_grp = GROUP records_each BY $0;
records_nested_each = FOREACH records_grp
{
    inner_each = FOREACH records_each GENERATE $1..;
    GENERATE group, inner_each;
};
dump records_nested_each;
Output :
(100,{(2),(3,500,60)})
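The same nested FOREACH can also be applied to the question's own relations. A minimal sketch, assuming d, e and f are defined exactly as in the question, so that f holds rows shaped like (100,{(100,2),(100,3,500,60)}); the alias names g and vals are hypothetical and not part of the original scripts:

```pig
-- Sketch only: continues from the question's f = GROUP e BY $0;
-- Inside f, the grouped bag carries the name of the grouped relation, e.
g = FOREACH f {
    -- Keep every field from position 1 onward, dropping the repeated key in $0.
    vals = FOREACH e GENERATE $1..;
    GENERATE group AS vid, vals AS data;
};
DUMP g;
```

Positional references ($0, $1..) are used here because the tuples are loaded without a declared inner schema, the same reason the answer's script groups BY $0. If this behaves like the script above, each inner tuple loses its leading 100 and the result has the shape requested in the question.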