I am using PigLatin. And I want to remove the duplicates from the bags and want to retain the last element of the particular key.
Input:
User1 7 LA
User1 8 NYC
User1 9 NYC
User2 3 NYC
User2 4 DC
Output:
User1 9 NYC
User2 4 DC
Here the first filed is a key. And I want the last record of that particular key to be retained in the output.
I know how to retain the first element. It is as below. But not able to retain the last element.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
Can anybody help me on this? Thanks in advance!
@Anil : If you order by one of the fields in descending order. You will be able to get the last record. In the below code, have ordered by second field of input (field name : no in script)
Input :
User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC
Pig snippet :
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
required_user_details = FOREACH user_details_grp_user {
user_details_sorted_by_no = ORDER user_details BY no DESC;
top_record = LIMIT user_details_sorted_by_no 1;
GENERATE FLATTEN(top_record);
}
Output : DUMP required_user_details
(User1,9,NYC )
(User2,4,DC)