Apache Pig: Most frequent value in bag


My data look like this:

SWE "{(Figure Skating),(Tennis),(Tennis)}"
GER "{(Figure Skating),(Figure Skating)}"

And I want to produce this:

SWE Tennis
GER "Figure Skating"

Relation name: x
Field #1 name: NOC
Field #2 name: sports

The obvious idea is to count each sport and keep the one with the maximum count, but I don't even know how to iterate over the sports field. How is this done in practice?


Solution

  • I would recommend using the DataFu CountEach UDF to count the instances of each sport in the bag. You can then find the highest count in each bag. One way to do this is to order each 'sports' bag by count in descending order, then take the first tuple from each bag using the FirstTupleFromBag UDF.

    I've used CountEach in flatten mode as it means we won't have the sport names 'nested' in the result, but you can define the UDF without 'flatten' if you prefer.

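    -- The DataFu jar must be REGISTERed before these DEFINEs (path depends on your installation).
    DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
    DEFINE FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
    
    -- Replace each bag of sports with a bag of (sport_name, sport_count) tuples.
    sports_counted = FOREACH x GENERATE
        NOC,
        CountEachFlatten(sports) AS sports:{(sport_name, sport_count)};
    
    -- Sort each bag by count, highest first, and keep only the first tuple.
    max_sports = FOREACH sports_counted {
        ordered_sports = ORDER sports BY sport_count DESC;
        GENERATE
            NOC,
            FirstTupleFromBag(ordered_sports, null);
    }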
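
    If DataFu is not available, the same count-then-take-the-maximum idea can be sketched with built-in operators only. This is a minimal sketch, not a tested script: it assumes x is already loaded with NOC as a chararray and sports as a bag of single-field tuples, and the alias names (flat, counts, by_noc, top_sport) are made up for illustration. When two sports tie for the highest count, which one is kept is arbitrary.

    ```pig
    -- Assumed schema: x: (NOC:chararray, sports:{(sport:chararray)})

    -- One row per (NOC, sport) occurrence; this is how the bag gets "iterated".
    flat = FOREACH x GENERATE NOC, FLATTEN(sports) AS sport;

    -- Count how often each sport occurs per NOC.
    grouped = GROUP flat BY (NOC, sport);
    counts = FOREACH grouped GENERATE
        FLATTEN(group) AS (NOC, sport),
        COUNT(flat) AS sport_count;

    -- Per NOC, sort by count (highest first) and keep the top row.
    by_noc = GROUP counts BY NOC;
    top_sport = FOREACH by_noc {
        ordered = ORDER counts BY sport_count DESC;
        top = LIMIT ordered 1;
        GENERATE group AS NOC, FLATTEN(top.sport) AS sport;
    }
    ```

    This version emits only the NOC and the sport name. The DataFu version above should instead return the whole first tuple, i.e. the sport name together with its count, so project the name out of that tuple if only the name is wanted.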