Hadoop Pig GROUP by id, get owner_id?


In Hadoop I have many records that look like this: (item_id, owner_id, counter). There can be duplicates, but a given item_id ALWAYS has the same owner_id!

I want to get the SUM of the counter for each item_id so I have the following script:

alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;

The problem is that in the FOREACH, if I take known_items.owner_id, I get a bag holding the owner_id of every record grouped under that item_id, not a single value. What would be the most efficient way to get the first one of the owners?


Solution

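  • The simplest solution is to include the owner_id as part of the group key. It gives you the right answer if your assumption that each item_id has the same owner_id is correct, and it will let you know if it is not: an item_id with more than one owner_id shows up as more than one row in the output.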

    alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
    known_items = FILTER alldata BY owner_id > 0L;
    group_by_item = GROUP known_items BY (item_id, owner_id);
    data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
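
To actually see whether the one-owner-per-item assumption holds, one option is to regroup the summed rows by item_id alone and keep the ids that appear more than once. This is a minimal sketch, not part of the original answer: the relation names rows_per_item, conflicts and conflicting_items are hypothetical, and the explicit AS (item_id, owner_id) on the flattened group is an addition so the fields can be referenced by short name.

```pig
-- Sketch only: relation names introduced after `data` are hypothetical.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure, as in the question
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group) AS (item_id, owner_id), SUM(known_items.counter) AS items_count;

-- Regroup the summed rows by item_id alone.
rows_per_item = GROUP data BY item_id;
-- Keep only item_ids that produced more than one (item_id, owner_id) row.
conflicts = FILTER rows_per_item BY COUNT(data) > 1L;
-- List each offending item_id together with the bag of owner_ids seen for it.
conflicting_items = FOREACH conflicts GENERATE group AS item_id, data.owner_id AS owner_ids;
DUMP conflicting_items;
```

If the assumption holds, conflicting_items should come back empty; any row it does contain names an item_id that has more than one owner_id.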