I'm new to Pig and I'm trying to figure out how to get the minimum rank within a group. I want to go from the following dataset:
ID clickcounter
A A1
A A2
A A3
B B1
B B2
C C1
D D1
E E1
E E2
E E3
E E4
... to the following dataset:
ID clickcounter Rank minRank_of_ID
A A1 1 1
A A2 2 1
A A3 3 1
B B1 4 4
B B2 5 4
C C1 6 6
D D1 7 7
E E1 8 8
E E2 9 8
E E3 10 8
E E4 11 8
I tried the following code and it works, but I'm wondering if there is a better solution?
A = LOAD 'datapath' using PigStorage() as (ID:chararray, clickcount:chararray);
B = rank A;
C = group B by ID;
D = foreach C generate group, flatten($1.clickcount), MIN($1.rank_A);
E = rank D;
Dump D;
You are on the right track. In your code, everything up to D is correct. You'll get your expected output with just a couple of changes:
D = FOREACH C GENERATE group, FLATTEN(B.($0, clickcount)), MIN(B.$0) ;
-- D should now be your expected output!
Since the output of C looks like:
(A, {(1, A, A1), (2, A, A2), (3, A, A3)})
(B, {(4, B, B1), (5, B, B2)})
etc.
Your FLATTEN is going to need both the rank given in B and the clickcount field. The RANK in E is not going to do what you expect, because after the GROUP the data is no longer guaranteed to be in the same order it was in the file.
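Putting the pieces together, the whole script could look roughly like the sketch below. This is untested against any particular Pig version. The aliases row_rank and minRank_of_ID, the column reordering in E, and the final ORDER BY are my own additions for illustration. It relies on RANK without a BY clause naming its new field rank_A, as in your script.

```pig
-- Load the raw data (path and schema taken from the question).
A = LOAD 'datapath' USING PigStorage() AS (ID:chararray, clickcount:chararray);

-- RANK without BY prepends a sequential row number in a field named rank_A.
B = RANK A;

-- One bag of (rank_A, ID, clickcount) tuples per ID.
C = GROUP B BY ID;

-- Keep each row's own rank next to the minimum rank of its group.
-- row_rank and minRank_of_ID are illustrative names, not from the question.
D = FOREACH C GENERATE
        group AS ID,
        FLATTEN(B.(rank_A, clickcount)) AS (row_rank:long, clickcount:chararray),
        MIN(B.rank_A) AS minRank_of_ID;

-- Reorder the columns to match the layout asked for in the question.
E = FOREACH D GENERATE ID, clickcount, row_rank, minRank_of_ID;

-- Assumption: sort by the rank assigned before the GROUP instead of
-- calling RANK a second time, since grouping may have shuffled the rows.
F = ORDER E BY row_rank;

DUMP F;
```

Sorting by the rank that was assigned before the GROUP avoids the second RANK entirely, so the numbering still reflects the order of the rows in the file.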