hadoop apache-pig duplicates datastage

Removing duplicates using PigLatin and retaining the last element


I am using Pig Latin and want to remove duplicates from the bags, retaining only the last element for each key.

Input:
User1  7 LA 
User1  8 NYC 
User1  9 NYC 
User2  3 NYC
User2  4 DC 


Output:
User1  9 NYC 
User2  4 DC 

Here the first field is the key, and I want the last record for each key to be retained in the output.

I know how to retain the first element, as shown below, but I am not able to retain the last one.

inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

Can anybody help me with this? Thanks in advance!


Solution

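  • @Anil: If you order each group by one of the fields in descending order and take the first record, you get the last record for that key. In the code below, I have ordered by the second field of the input (named no in the script). This assumes the record you want is the one with the highest no for its key.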

    Input :

    User1,7,LA 
    User1,8,NYC 
    User1,9,NYC 
    User2,3,NYC
    User2,4,DC
    

    Pig snippet :

    user_details = LOAD 'user_details.csv'  USING  PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
    
    user_details_grp_user = GROUP user_details BY user_name;
    
    required_user_details = FOREACH user_details_grp_user {
        user_details_sorted_by_no = ORDER user_details BY no DESC;
        top_record = LIMIT user_details_sorted_by_no 1;
        GENERATE FLATTEN(top_record);
    };
    

    Output : DUMP required_user_details

    (User1,9,NYC )
    (User2,4,DC)
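
    As a possible alternative, Pig's built-in TOP function can pick the record with the largest value in a given column without a nested ORDER and LIMIT. The sketch below is untested here and makes the same assumption as above: the "last" record for a key is the one with the highest no. The relation names top_per_user and user_details_grp_user are just illustrative, and the column index passed to TOP is 0-based, so 1 refers to no.

    ```pig
    -- Load the same comma-separated input as in the snippet above.
    user_details = LOAD 'user_details.csv' USING PigStorage(',')
                   AS (user_name:chararray, no:long, city:chararray);

    -- Collect all records for each user into one bag.
    user_details_grp_user = GROUP user_details BY user_name;

    -- TOP(n, column_index, bag) returns the n tuples with the largest value
    -- in the given column; here, the single tuple with the highest `no`.
    top_per_user = FOREACH user_details_grp_user
                   GENERATE FLATTEN(TOP(1, 1, user_details));

    DUMP top_per_user;
    ```

    If "last" instead means the last line in the input file rather than the highest no, neither approach is reliable on its own, because Pig does not guarantee the order of tuples inside a grouped bag; in that case the input needs an explicit sequence or timestamp field to order by.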