I've been stuck on this question for a while. I have a data file that looks like this:
(1,N,N,5,High,H,House,d)
(1,N,N,6,High,H,House,a)
(2,N,N,10,Low,H,House,t)
(2,N,N,11,Medium,H,House,e)
I want my output in the format below. Can I achieve this using Pig?
{1,(N,N),{(5,High),(H,House),d},{(6,High),(H,House),a}}
{2,(N,N),{(10,Low),(H,House),t},{(11,Medium),(H,House),e}}
I tried to group it by the first column:
datafile = LOAD '/user/zbc/xyz.txt' USING PigStorage() AS (id:int,
flag1:chararray, flag2:chararray, typcode:chararray, typ_name:chararray,
groupcode:chararray, groupname:chararray, date:chararray);
collected = FOREACH datafile GENERATE TOBAG(id, TOTUPLE(flag1, flag2),
TOBAG(TOTUPLE(typcode, typ_name), TOTUPLE(groupcode, groupname), date));
I don't see how to proceed from here, that is, how to group by "one field and one tuple".
You were heading in the right direction, but you are building the bags yourself instead of letting Pig build them when it groups. After loading the data, simplify your second step so that it only creates the tuple you need, the combination of both flags:
collected = FOREACH datafile Generate id, TOTUPLE(flag1, flag2), $3..;
The $3.. tells Pig to include every field from the fourth one onwards (positions start at $0), so you don't have to repeat the whole list of fields. Now you will have this:
(1,(N,N),5,High,H,House,d)
(1,(N,N),6,High,H,House,a)
(2,(N,N),10,Low,H,House,t)
(2,(N,N),11,Medium,H,House,e)
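As a side note, the range syntax also accepts an end position, or both ends. A small sketch, assuming the datafile relation loaded in the question and a Pig version that supports project-range expressions (0.9 or later, as far as I know); the alias names on the left are only illustrative:

```pig
-- $3..   : the fourth field through the last one (the form used above)
-- ..$2   : the first field through the third one
-- $1..$2 : only the two flag fields
tail_fields = foreach datafile generate $3..;
head_fields = foreach datafile generate ..$2;
flag_fields = foreach datafile generate $1..$2;
```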
Now you can use the group by operator to group by any combination of fields you want, which in this case is the id and the flags tuple:
desired_output = group collected by (id, $1);
After this, you get the data grouped as you wanted:
((1,(N,N)),{(1,(N,N),6,High,H,House,a),(1,(N,N),5,High,H,House,d)})
((2,(N,N)),{(2,(N,N),11,Medium,H,House,e),(2,(N,N),10,Low,H,House,t)})
EDIT
If you don't want the fields you grouped by to appear in the final bag, you can take them out using a nested foreach:
filtered_output = foreach desired_output {
    AUX = foreach collected generate $2..;
    generate group, AUX;
}
Output:
((1,(N,N)),{(6,High,H,House,a),(5,High,H,House,d)})
((2,(N,N)),{(11,Medium,H,House,e),(10,Low,H,House,t)})
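If you also want the inner pairs wrapped in tuples, closer to the format in the question, the same nested foreach can probably build them with TOTUPLE. This is only a sketch and I have not run it against your data. It refers to fields by position and assumes collected has the layout shown earlier ($2 and $3 are the type code and name, $4 and $5 the group code and name, $6 the last field). The aliases nested_output and PAIRS are made up for illustration:

```pig
-- Sketch: rebuild every row of the grouped bag as
-- ((typcode, typname), (groupcode, groupname), date).
-- Positions assume collected = (id, (flag1, flag2), typcode, typname, groupcode, groupname, date).
nested_output = foreach desired_output {
    PAIRS = foreach collected generate TOTUPLE($2, $3), TOTUPLE($4, $5), $6;
    generate group, PAIRS;
}
dump nested_output;
```

Pig still prints each result as a tuple holding the group key and one bag, so the braces will not match the notation in the question exactly, and the order of rows inside the bag is not guaranteed.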