In Hadoop I have many records that look like this:
(item_id,owner_id,counter)
There could be duplicates, but a given item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id, so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, if I reference known_items.owner_id, I get a bag containing the owner_id of every grouped record rather than a single value. What would be the most efficient way to get the first one of the owners?
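The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group. If an item_id ever has more than one owner_id, it will show up as more than one row in the output.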
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
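If you want to check the assumption explicitly rather than eyeball the output, one option is to count how many rows each item_id has in data, since every row there is one distinct (item_id, owner_id) pair. This is only a minimal sketch that assumes the data relation from the script above; the names pairs, by_item, owner_counts, and conflicts are made up for illustration.

```pig
-- Sketch only: relies on the `data` relation produced by the script above.
-- FLATTEN(group) puts item_id in position 0 and owner_id in position 1,
-- so positional references avoid depending on the group:: prefixed names.
pairs = FOREACH data GENERATE $0 AS item_id, $1 AS owner_id;

-- One row per (item_id, owner_id) pair, so the row count per item_id
-- is the number of distinct owners seen for that item.
by_item = GROUP pairs BY item_id;
owner_counts = FOREACH by_item GENERATE group AS item_id, COUNT(pairs) AS num_owners;

-- Any item_id listed here violates the "one owner per item" assumption.
conflicts = FILTER owner_counts BY num_owners > 1L;
DUMP conflicts;
```

If conflicts comes back empty, every item_id had a single owner_id and the grouped sums in data are the per-item totals you wanted.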